KingNish committed · Commit e6af450 · verified · 1 Parent(s): 29a6852

Upload 110 files

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. .gitattributes +3 -0
  2. .gitignore +10 -0
  3. EVAL.md +78 -0
  4. LICENSE +201 -0
  5. TRAIN.md +133 -0
  6. app.py +505 -0
  7. assets/arch.png +3 -0
  8. assets/emerging_curves.png +3 -0
  9. assets/teaser.webp +3 -0
  10. data/__init__.py +2 -0
  11. data/configs/example.yaml +45 -0
  12. data/data_utils.py +177 -0
  13. data/dataset_base.py +620 -0
  14. data/dataset_info.py +39 -0
  15. data/distributed_iterable_dataset.py +58 -0
  16. data/interleave_datasets/__init__.py +5 -0
  17. data/interleave_datasets/edit_dataset.py +72 -0
  18. data/interleave_datasets/interleave_t2i_dataset.py +212 -0
  19. data/parquet_utils.py +90 -0
  20. data/t2i_dataset.py +128 -0
  21. data/transforms.py +287 -0
  22. data/video_utils.py +165 -0
  23. data/vlm_dataset.py +195 -0
  24. eval/__init__.py +2 -0
  25. eval/gen/gen_images_mp.py +238 -0
  26. eval/gen/gen_images_mp_wise.py +365 -0
  27. eval/gen/geneval/evaluation/download_models.sh +20 -0
  28. eval/gen/geneval/evaluation/evaluate_images.py +304 -0
  29. eval/gen/geneval/evaluation/evaluate_images_mp.py +332 -0
  30. eval/gen/geneval/evaluation/object_names.txt +80 -0
  31. eval/gen/geneval/evaluation/summary_scores.py +64 -0
  32. eval/gen/geneval/prompts/create_prompts.py +194 -0
  33. eval/gen/geneval/prompts/evaluation_metadata.jsonl +553 -0
  34. eval/gen/geneval/prompts/evaluation_metadata_long.jsonl +0 -0
  35. eval/gen/geneval/prompts/generation_prompts.txt +553 -0
  36. eval/gen/geneval/prompts/object_names.txt +80 -0
  37. eval/gen/wise/cal_score.py +162 -0
  38. eval/gen/wise/final_data.json +0 -0
  39. eval/gen/wise/gpt_eval_mp.py +268 -0
  40. eval/vlm/__init__.py +2 -0
  41. eval/vlm/eval/mathvista/calculate_score.py +271 -0
  42. eval/vlm/eval/mathvista/evaluate_mathvista.py +210 -0
  43. eval/vlm/eval/mathvista/extract_answer.py +160 -0
  44. eval/vlm/eval/mathvista/extract_answer_mp.py +161 -0
  45. eval/vlm/eval/mathvista/prompts/ext_ans.py +51 -0
  46. eval/vlm/eval/mathvista/utilities.py +229 -0
  47. eval/vlm/eval/mmbench/evaluate_mmbench.py +283 -0
  48. eval/vlm/eval/mme/Your_Results/OCR.txt +40 -0
  49. eval/vlm/eval/mme/Your_Results/artwork.txt +400 -0
  50. eval/vlm/eval/mme/Your_Results/celebrity.txt +340 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/arch.png filter=lfs diff=lfs merge=lfs -text
+ assets/emerging_curves.png filter=lfs diff=lfs merge=lfs -text
+ assets/teaser.webp filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,10 @@
+ wandb
+ __pycache__
+ .vscode
+ notebooks
+ results
+ *.ipynb_checkpoints
+ eval_results
+ tests
+ .DS_Store
+ gradio.sh
EVAL.md ADDED
@@ -0,0 +1,78 @@
+ # VLM
+ We follow [InternVL2](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html) to evaluate performance on MME, MMBench, MMMU, MMVet, MathVista, and MMVP.
+
+ ## Data preparation
+ Please follow [InternVL2](https://internvl.readthedocs.io/en/latest/get_started/eval_data_preparation.html) to prepare the corresponding data, then link it under `vlm`.
+
+ The final directory structure is:
+ ```shell
+ data
+ ├── MathVista
+ ├── mmbench
+ ├── mme
+ ├── MMMU
+ ├── mm-vet
+ └── MMVP
+ ```
+
+ ## Evaluation
+
+ Run `scripts/eval/run_eval_vlm.sh` directly to evaluate the different benchmarks. The output will be saved in `$output_path`.
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
+ - Increase `GPUS` if you want the run to finish faster.
+ - For MMBench, please use the official [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission).
+ - For MMVet, please use the official [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator).
+ - For MathVista, please set `$openai_api_key` in `scripts/eval/run_eval_vlm.sh` and `your_api_url` in `eval/vlm/eval/mathvista/utilities.py`. The default GPT version is `gpt-4o-2024-11-20`.
+ - For MMMU, we use CoT in the report, which improves accuracy by about 2%. Open-ended answers are judged with GPT-4o, as sketched below.
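+
+ A hedged sketch of the GPT-4o-as-judge call (the real prompt and answer parsing live in the eval scripts; this only shows the API shape, and the `judge` helper is illustrative):
+ ```python
+ # Illustrative only: minimal GPT-4o yes/no judging of an open-ended answer.
+ from openai import OpenAI  # pip install openai
+
+ client = OpenAI()  # reads OPENAI_API_KEY from the environment
+
+ def judge(question: str, reference: str, prediction: str) -> bool:
+     msg = (
+         "Decide whether the prediction answers the question the same way the "
+         "reference does. Reply with exactly 'yes' or 'no'.\n\n"
+         f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
+     )
+     resp = client.chat.completions.create(
+         model="gpt-4o-2024-11-20",
+         messages=[{"role": "user", "content": msg}],
+         temperature=0.0,
+     )
+     return resp.choices[0].message.content.strip().lower().startswith("yes")
+ ```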
+
+
+ # GenEval
+ We modify the code in [GenEval](https://github.com/djghosh13/geneval/tree/main) for faster evaluation.
+
+ ## Setup
+ Install the following dependencies:
+ ```shell
+ pip install open-clip-torch
+ pip install clip-benchmark
+ pip install --upgrade setuptools
+
+ sudo pip install -U openmim
+ sudo mim install mmengine mmcv-full==1.7.2
+
+ git clone https://github.com/open-mmlab/mmdetection.git
+ cd mmdetection; git checkout 2.x
+ pip install -v -e .
+ ```
+
+ Download the detector:
+ ```shell
+ cd ./eval/gen/geneval
+ mkdir model
+
+ bash ./evaluation/download_models.sh ./model
+ ```
+
+ ## Evaluation
+ Run `scripts/eval/run_geneval.sh` directly to evaluate GenEval. The output will be saved in `$output_path`.
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
+ - Set `metadata_file` to `./eval/gen/geneval/prompts/evaluation_metadata.jsonl` to use the original GenEval prompts.
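+
+ After the detector runs, the per-image results can be summarized per tag. A minimal sketch (assuming a `results.jsonl` with boolean `correct` and string `tag` fields; `evaluation/summary_scores.py` is the authoritative version):
+ ```python
+ # Aggregate GenEval per-image results into per-tag and overall accuracy.
+ import json
+ from collections import defaultdict
+
+ totals, correct = defaultdict(int), defaultdict(int)
+ with open("results.jsonl") as f:
+     for line in f:
+         rec = json.loads(line)
+         totals[rec["tag"]] += 1
+         correct[rec["tag"]] += bool(rec["correct"])
+
+ for tag in sorted(totals):
+     print(f"{tag}: {correct[tag] / totals[tag]:.4f}")
+ print(f"overall: {sum(correct.values()) / sum(totals.values()):.4f}")
+ ```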
+
+
+ # WISE
+ We modify the code in [WISE](https://github.com/PKU-YuanGroup/WISE/tree/main) for faster evaluation.
+
+
+ ## Evaluation
+ Run `scripts/eval/run_wise.sh` directly to evaluate WISE. The output will be saved in `$output_path`.
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
+ - Set `$openai_api_key` in `scripts/eval/run_wise.sh` and `your_api_url` in `eval/gen/wise/gpt_eval_mp.py`. The default GPT version is `gpt-4o-2024-11-20`.
+ - Use `think` to enable thinking mode.
+
+
+
+ # GEdit-Bench
+ Please follow [GEdit-Bench](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/EVAL.md) for evaluation.
+
+
+ # IntelligentBench
+ TBD
LICENSE ADDED
@@ -0,0 +1,201 @@
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright [yyyy] [name of copyright owner]
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
TRAIN.md ADDED
@@ -0,0 +1,133 @@
+ # Data preparation
+
+ We provide data examples for **T2I**, **Editing**, and **VLM** tasks. The T2I dataset is generated using [FLUX.1‑dev](https://huggingface.co/black-forest-labs/FLUX.1-dev); the editing examples are randomly sampled from [SEED‑Data‑Edit‑Part3](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3); and the VLM set is sourced from [LLaVA‑OneVision‑Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data).
+
+ We offer examples in both raw-image-folder and parquet-shard formats. For other data formats, you can use our dataset code as a template and extend it as needed.
+
+
+ 1. **Download the sample dataset**
+
+    ```bash
+    wget -O bagel_example.zip \
+      https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/bagel_example.zip
+    unzip bagel_example.zip -d /data
+    ```
+ 2. **Expected hierarchy**
+
+    ```text
+    bagel_example
+    ├── t2i/                     # text-to-image (parquet)
+    ├── editing/                 # image editing (parquet)
+    │   ├── seedxedit_multi/
+    │   └── parquet_info/
+    └── vlm/
+        ├── images/              # JPEG / PNG frames
+        └── llava_ov_si.jsonl    # vision‑language SFT conversations
+    ```
+ 3. Edit every `your_data_path` placeholder in **`data/dataset_info.py`**.
+ 4. *(Optional)* Extend `DATASET_INFO` with your own parquet shards or JSONL files to mix in extra data (a sketch follows this list).
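+
+ As a rough illustration only (the field names below are hypothetical; mirror the real entries in `data/dataset_info.py` rather than this sketch):
+ ```python
+ # Hypothetical DATASET_INFO entry -- copy the structure of the existing
+ # entries in data/dataset_info.py; these key names are illustrative.
+ DATASET_INFO = {
+     "t2i_pretrain": {
+         "my_t2i_shards": {                          # a dataset_names entry in the YAML
+             "data_dir": "/data/bagel_example/t2i",  # your_data_path placeholder
+             "num_files": 10,                        # parquet shards on disk
+             "num_total_samples": 1000,
+         },
+     },
+ }
+ ```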
+
+ ---
+
+ # Training
+
+ The baseline full‑feature recipe looks like this (replace the environment variables with real paths or values):
+
+ ```shell
+ torchrun \
+   --nnodes=$num_nodes \
+   --node_rank=$node_rank \
+   --nproc_per_node=8 \
+   --master_addr=$master_addr \
+   --master_port=$master_port \
+   train/pretrain_unified_navit.py \
+   --dataset_config_file ./data/configs/example.yaml \
+   --llm_path $llm_path \
+   --vae_path $vae_path \
+   --vit_path $vit_path \
+   --use_flex True \
+   --resume_from $resume_from \
+   --results_dir $output_path \
+   --checkpoint_dir $ckpt_path \
+   --max_latent_size 64   # 32 for low-resolution pre-training
+ ```
+
+ - **When fine-tuning BAGEL, please set `max_latent_size=64` to ensure the correct pretrained weights are loaded.**
+ - The sum of `num_used_data` should be larger than NUM_GPUS x NUM_WORKERS.
+ - For T2I-only fine-tuning, set `visual_und=False`; for VLM-only fine-tuning, set `visual_gen=False`.
+
+ You are encouraged to adjust any of these hyperparameters to fit your GPU budget and the scale of your dataset. If you encounter any issues, please open an issue for assistance. 🎉
+
+
+ ## Model config
+
+
+ | Argument | Default | Description |
+ | ---------------------------- | ------------------------------------------- | --------------------------------------------------------------- |
+ | `llm_path` | `hf/Qwen2.5-0.5B-Instruct` | Language‑model backbone (HuggingFace repo or local folder). |
+ | `vae_path` | `flux/vae/ae.safetensors` | Pre‑trained VAE checkpoint for latent diffusion. |
+ | `vit_path` | `hf/siglip-so400m-14-980-flash-attn2-navit` | SigLIP ViT used for image understanding. |
+ | `max_latent_size` | `32` | Maximum latent grid side; defines the highest generable resolution. |
+ | `latent_patch_size` | `2` | VAE pixels represented by one latent patch. |
+ | `vit_max_num_patch_per_side` | `70` | Max ViT patches per image side after resizing. |
+ | `text_cond_dropout_prob` | `0.1` | Probability of dropping text conditioning during training. |
+ | `vae_cond_dropout_prob` | `0.3` | Dropout on VAE latent inputs. |
+ | `vit_cond_dropout_prob` | `0.3` | Dropout on visual features. |
+
+ *(See `ModelArguments` for many more options.)*
+
+
+ ## Data config
+
+
+ | Argument | Default | Description |
+ | --------------------------- | --------------------------- | --------------------------------------------------------- |
+ | `dataset_config_file` | `data/configs/example.yaml` | YAML that groups datasets and assigns sampling weights. |
+ | `num_workers` | `4` | Background workers per rank for the PyTorch `DataLoader`. |
+ | `prefetch_factor` | `2` | Batches pre‑fetched by each worker. |
+ | `max_num_tokens_per_sample` | `16384` | Skip raw samples longer than this. |
+ | `max_num_tokens` | `36864` | Hard cap for a packed batch (prevents OOM). |
+ | `max_buffer_size` | `50` | Overflow buffer length for oversized samples. |
+ | `data_seed` | `42` | Seed for reproducible shuffling and sampling. |
+
+
+ ## Training config
+
+ | Argument | Default | Description |
+ | -------------------------------------- | ---------------------- | ------------------------------------------------------ |
+ | `total_steps` | `500_000` | Optimiser steps to run. |
+ | `lr` | `1e-4` | Peak learning rate after warm‑up. |
+ | `lr_scheduler` | `constant` | Learning‑rate schedule (`constant` or `cosine`). |
+ | `warmup_steps` | `2000` | Linear warm‑up duration. |
+ | `ema` | `0.9999` | Exponential moving‑average decay for model weights. |
+ | `max_grad_norm` | `1.0` | Gradient‑clipping threshold. |
+ | `save_every` | `2000` | Checkpoint frequency (steps). |
+ | `visual_gen / visual_und` | `True` | Enable image generation / understanding branches. |
+ | `freeze_llm / freeze_vit / freeze_vae` | `False / False / True` | Freeze selected modules to save VRAM or for ablations. |
+ | `use_flex` | `True` (in example) | Enable FLEX packing for higher GPU utilisation. |
+ | `sharding_strategy` | `HYBRID_SHARD` | FSDP sharding mode. |
+ | `num_shard` | `8` | Parameter shards per rank in HYBRID mode. |
+
+ **Distributed‑launch environment variables**
+
+ | Var | Meaning |
+ | ----------------------------- | --------------------------------- |
+ | `num_nodes` / `node_rank` | Multi‑node orchestration indices. |
+ | `nproc_per_node` | Number of GPUs per node. |
+ | `master_addr` / `master_port` | NCCL rendezvous endpoint. |
+
+
+ ## Logging config
+
+
+ | Argument | Default | Description |
+ | ---------------- | --------------------- | ---------------------------------------------------- |
+ | `results_dir` | `results` | Root directory for logs and metrics. |
+ | `checkpoint_dir` | `results/checkpoints` | Checkpoints are saved here. |
+ | `log_every` | `10` | Steps between console / W&B logs. |
+ | `wandb_project` | `bagel` | Weights & Biases project name. |
+ | `wandb_name` | `run` | Run name inside the project. |
+ | `wandb_offline` | `False` | Switch to offline mode (logs locally, sync later). |
+ | `wandb_resume` | `allow` | Resumption policy if an existing run ID is detected. |
+
+ > **Tip** Export `WANDB_API_KEY` before launching if you want online dashboards.
app.py ADDED
@@ -0,0 +1,505 @@
+ import gradio as gr
+ import numpy as np
+ import os
+ import torch
+ import random
+
+ from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights
+ from PIL import Image
+
+ from data.data_utils import add_special_tokens, pil_img2rgb
+ from data.transforms import ImageTransform
+ from inferencer import InterleaveInferencer
+ from modeling.autoencoder import load_ae
+ from modeling.bagel.qwen2_navit import NaiveCache
+ from modeling.bagel import (
+     BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM,
+     SiglipVisionConfig, SiglipVisionModel
+ )
+ from modeling.qwen2 import Qwen2Tokenizer
+
+
+ # Model Initialization
+ model_path = "/path/to/BAGEL-7B-MoT/weights"  # Download from https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT
+
+ llm_config = Qwen2Config.from_json_file(os.path.join(model_path, "llm_config.json"))
+ llm_config.qk_norm = True
+ llm_config.tie_word_embeddings = False
+ llm_config.layer_module = "Qwen2MoTDecoderLayer"
+
+ vit_config = SiglipVisionConfig.from_json_file(os.path.join(model_path, "vit_config.json"))
+ vit_config.rope = False
+ vit_config.num_hidden_layers -= 1
+
+ vae_model, vae_config = load_ae(local_path=os.path.join(model_path, "ae.safetensors"))
+
+ config = BagelConfig(
+     visual_gen=True,
+     visual_und=True,
+     llm_config=llm_config,
+     vit_config=vit_config,
+     vae_config=vae_config,
+     vit_max_num_patch_per_side=70,
+     connector_act='gelu_pytorch_tanh',
+     latent_patch_size=2,
+     max_latent_size=64,
+ )
+
+ with init_empty_weights():
+     language_model = Qwen2ForCausalLM(llm_config)
+     vit_model = SiglipVisionModel(vit_config)
+     model = Bagel(language_model, vit_model, config)
+     model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)
+
+ tokenizer = Qwen2Tokenizer.from_pretrained(model_path)
+ tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
+
+ vae_transform = ImageTransform(1024, 512, 16)
+ vit_transform = ImageTransform(980, 224, 14)
+
+ # Model Loading and Multi-GPU Inference Preparation
+ device_map = infer_auto_device_map(
+     model,
+     max_memory={i: "80GiB" for i in range(torch.cuda.device_count())},
+     no_split_module_classes=["Bagel", "Qwen2MoTDecoderLayer"],
+ )
+
+ same_device_modules = [
+     'language_model.model.embed_tokens',
+     'time_embedder',
+     'latent_pos_embed',
+     'vae2llm',
+     'llm2vae',
+     'connector',
+     'vit_pos_embed'
+ ]
+
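+ # Pin every module listed above to the same device as
+ # language_model.model.embed_tokens, so the inferred device map
+ # cannot split these tightly coupled modules across GPUs.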
+ if torch.cuda.device_count() == 1:
+     first_device = device_map.get(same_device_modules[0], "cuda:0")
+     for k in same_device_modules:
+         if k in device_map:
+             device_map[k] = first_device
+         else:
+             device_map[k] = "cuda:0"
+ else:
+     first_device = device_map.get(same_device_modules[0])
+     for k in same_device_modules:
+         if k in device_map:
+             device_map[k] = first_device
+
+ model = load_checkpoint_and_dispatch(
+     model,
+     checkpoint=os.path.join(model_path, "ema.safetensors"),
+     device_map=device_map,
+     offload_buffers=True,
+     dtype=torch.bfloat16,
+     force_hooks=True,
+ ).eval()
+
+
+ # Inferencer Preparation
+ inferencer = InterleaveInferencer(
+     model=model,
+     vae_model=vae_model,
+     tokenizer=tokenizer,
+     vae_transform=vae_transform,
+     vit_transform=vit_transform,
+     new_token_ids=new_token_ids,
+ )
+
+ def set_seed(seed):
+     """Set random seeds for reproducibility"""
+     if seed > 0:
+         random.seed(seed)
+         np.random.seed(seed)
+         torch.manual_seed(seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed(seed)
+             torch.cuda.manual_seed_all(seed)
+             torch.backends.cudnn.deterministic = True
+             torch.backends.cudnn.benchmark = False
+     return seed
+
+ # Text to Image function with thinking option and hyperparameters
+ def text_to_image(prompt, show_thinking=False, cfg_text_scale=4.0, cfg_interval=0.4,
+                   timestep_shift=3.0, num_timesteps=50,
+                   cfg_renorm_min=1.0, cfg_renorm_type="global",
+                   max_think_token_n=1024, do_sample=False, text_temperature=0.3,
+                   seed=0, image_ratio="1:1"):
+     # Set seed for reproducibility
+     set_seed(seed)
+
+     if image_ratio == "1:1":
+         image_shapes = (1024, 1024)
+     elif image_ratio == "4:3":
+         image_shapes = (768, 1024)
+     elif image_ratio == "3:4":
+         image_shapes = (1024, 768)
+     elif image_ratio == "16:9":
+         image_shapes = (576, 1024)
+     elif image_ratio == "9:16":
+         image_shapes = (1024, 576)
+
+     # Set hyperparameters
+     inference_hyper = dict(
+         max_think_token_n=max_think_token_n if show_thinking else 1024,
+         do_sample=do_sample if show_thinking else False,
+         text_temperature=text_temperature if show_thinking else 0.3,
+         cfg_text_scale=cfg_text_scale,
+         cfg_interval=[cfg_interval, 1.0],  # End fixed at 1.0
+         timestep_shift=timestep_shift,
+         num_timesteps=num_timesteps,
+         cfg_renorm_min=cfg_renorm_min,
+         cfg_renorm_type=cfg_renorm_type,
+         image_shapes=image_shapes,
+     )
+
+     # Call inferencer with or without think parameter based on user choice
+     result = inferencer(text=prompt, think=show_thinking, **inference_hyper)
+     return result["image"], result.get("text", None)
+
+
+ # Image Understanding function with thinking option and hyperparameters
+ def image_understanding(image: Image.Image, prompt: str, show_thinking=False,
+                         do_sample=False, text_temperature=0.3, max_new_tokens=512):
+     if image is None:
+         return "Please upload an image."
+
+     if isinstance(image, np.ndarray):
+         image = Image.fromarray(image)
+
+     image = pil_img2rgb(image)
+
+     # Set hyperparameters
+     inference_hyper = dict(
+         do_sample=do_sample,
+         text_temperature=text_temperature,
+         max_think_token_n=max_new_tokens,  # Set max_length
+     )
+
+     # Use show_thinking parameter to control thinking process
+     result = inferencer(image=image, text=prompt, think=show_thinking,
+                         understanding_output=True, **inference_hyper)
+     return result["text"]
+
+
+ # Image Editing function with thinking option and hyperparameters
+ def edit_image(image: Image.Image, prompt: str, show_thinking=False, cfg_text_scale=4.0,
+                cfg_img_scale=2.0, cfg_interval=0.0,
+                timestep_shift=3.0, num_timesteps=50, cfg_renorm_min=1.0,
+                cfg_renorm_type="text_channel", max_think_token_n=1024,
+                do_sample=False, text_temperature=0.3, seed=0):
+     # Set seed for reproducibility
+     set_seed(seed)
+
+     if image is None:
+         return "Please upload an image.", ""
+
+     if isinstance(image, np.ndarray):
+         image = Image.fromarray(image)
+
+     image = pil_img2rgb(image)
+
+     # Set hyperparameters
+     inference_hyper = dict(
+         max_think_token_n=max_think_token_n if show_thinking else 1024,
+         do_sample=do_sample if show_thinking else False,
+         text_temperature=text_temperature if show_thinking else 0.3,
+         cfg_text_scale=cfg_text_scale,
+         cfg_img_scale=cfg_img_scale,
+         cfg_interval=[cfg_interval, 1.0],  # End fixed at 1.0
+         timestep_shift=timestep_shift,
+         num_timesteps=num_timesteps,
+         cfg_renorm_min=cfg_renorm_min,
+         cfg_renorm_type=cfg_renorm_type,
+     )
+
+     # Include thinking parameter based on user choice
+     result = inferencer(image=image, text=prompt, think=show_thinking, **inference_hyper)
+     return result["image"], result.get("text", "")
+
+
+ # Helper function to load example images
+ def load_example_image(image_path):
+     try:
+         return Image.open(image_path)
+     except Exception as e:
+         print(f"Error loading example image: {e}")
+         return None
+
+
+ # Gradio UI
+ with gr.Blocks() as demo:
+     gr.Markdown("""
+ <div>
+   <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="380"/>
+ </div>
+ """)
+
+     with gr.Tab("📝 Text to Image"):
+         txt_input = gr.Textbox(
+             label="Prompt",
+             value="A female cosplayer portraying an ethereal fairy or elf, wearing a flowing dress made of delicate fabrics in soft, mystical colors like emerald green and silver. She has pointed ears, a gentle, enchanting expression, and her outfit is adorned with sparkling jewels and intricate patterns. The background is a magical forest with glowing plants, mystical creatures, and a serene atmosphere."
+         )
+
+         with gr.Row():
+             show_thinking = gr.Checkbox(label="Thinking", value=False)
+
+         # Add hyperparameter controls in an accordion
+         with gr.Accordion("Inference Hyperparameters", open=False):
+             # Lay the controls out two per row
+             with gr.Group():
+                 with gr.Row():
+                     seed = gr.Slider(minimum=0, maximum=1000000, value=0, step=1,
+                                      label="Seed", info="0 for random seed, positive for reproducible results")
+                     image_ratio = gr.Dropdown(choices=["1:1", "4:3", "3:4", "16:9", "9:16"],
+                                               value="1:1", label="Image Ratio",
+                                               info="The longer side is fixed to 1024")
+
+                 with gr.Row():
+                     cfg_text_scale = gr.Slider(minimum=1.0, maximum=8.0, value=4.0, step=0.1, interactive=True,
+                                                label="CFG Text Scale", info="Controls how strongly the model follows the text prompt (4.0-8.0)")
+                     cfg_interval = gr.Slider(minimum=0.0, maximum=1.0, value=0.4, step=0.1,
+                                              label="CFG Interval", info="Start of CFG application interval (end is fixed at 1.0)")
+
+                 with gr.Row():
+                     cfg_renorm_type = gr.Dropdown(choices=["global", "local", "text_channel"],
+                                                   value="global", label="CFG Renorm Type",
+                                                   info="If the generated image is blurry, use 'global'")
+                     cfg_renorm_min = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
+                                                label="CFG Renorm Min", info="1.0 disables CFG-Renorm")
+
+                 with gr.Row():
+                     num_timesteps = gr.Slider(minimum=10, maximum=100, value=50, step=5, interactive=True,
+                                               label="Timesteps", info="Total denoising steps")
+                     timestep_shift = gr.Slider(minimum=1.0, maximum=5.0, value=3.0, step=0.5, interactive=True,
+                                                label="Timestep Shift", info="Higher values for layout, lower for details")
+
+             # Thinking parameters in a single row
+             thinking_params = gr.Group(visible=False)
+             with thinking_params:
+                 with gr.Row():
+                     do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
+                     max_think_token_n = gr.Slider(minimum=64, maximum=4006, value=1024, step=64, interactive=True,
+                                                   label="Max Think Tokens", info="Maximum number of tokens for thinking")
+                     text_temperature = gr.Slider(minimum=0.1, maximum=1.0, value=0.3, step=0.1, interactive=True,
+                                                  label="Temperature", info="Controls randomness in text generation")
+
+         thinking_output = gr.Textbox(label="Thinking Process", visible=False)
+         img_output = gr.Image(label="Generated Image")
+         gen_btn = gr.Button("Generate")
+
+         # Dynamically show/hide thinking process box and parameters
+         def update_thinking_visibility(show):
+             return gr.update(visible=show), gr.update(visible=show)
+
+         show_thinking.change(
+             fn=update_thinking_visibility,
+             inputs=[show_thinking],
+             outputs=[thinking_output, thinking_params]
+         )
+
+         # Process function based on thinking option and hyperparameters
+         def process_text_to_image(prompt, show_thinking, cfg_text_scale,
+                                   cfg_interval, timestep_shift,
+                                   num_timesteps, cfg_renorm_min, cfg_renorm_type,
+                                   max_think_token_n, do_sample, text_temperature, seed, image_ratio):
+             image, thinking = text_to_image(
+                 prompt, show_thinking, cfg_text_scale, cfg_interval,
+                 timestep_shift, num_timesteps,
+                 cfg_renorm_min, cfg_renorm_type,
+                 max_think_token_n, do_sample, text_temperature, seed, image_ratio
+             )
+             return image, thinking if thinking else ""
+
+         gen_btn.click(
+             fn=process_text_to_image,
+             inputs=[
+                 txt_input, show_thinking, cfg_text_scale,
+                 cfg_interval, timestep_shift,
+                 num_timesteps, cfg_renorm_min, cfg_renorm_type,
+                 max_think_token_n, do_sample, text_temperature, seed, image_ratio
+             ],
+             outputs=[img_output, thinking_output]
+         )
+
+     with gr.Tab("🖌️ Image Edit"):
+         with gr.Row():
+             with gr.Column(scale=1):
+                 edit_image_input = gr.Image(label="Input Image", value=load_example_image('test_images/women.jpg'))
+                 edit_prompt = gr.Textbox(
+                     label="Prompt",
+                     value="She boards a modern subway, quietly reading a folded newspaper, wearing the same clothes."
+                 )
+
+             with gr.Column(scale=1):
+                 edit_image_output = gr.Image(label="Result")
+                 edit_thinking_output = gr.Textbox(label="Thinking Process", visible=False)
+
+         with gr.Row():
+             edit_show_thinking = gr.Checkbox(label="Thinking", value=False)
+
+         # Add hyperparameter controls in an accordion
+         with gr.Accordion("Inference Hyperparameters", open=False):
+             with gr.Group():
+                 with gr.Row():
+                     edit_seed = gr.Slider(minimum=0, maximum=1000000, value=0, step=1, interactive=True,
+                                           label="Seed", info="0 for random seed, positive for reproducible results")
+                     edit_cfg_text_scale = gr.Slider(minimum=1.0, maximum=8.0, value=4.0, step=0.1, interactive=True,
+                                                     label="CFG Text Scale", info="Controls how strongly the model follows the text prompt")
+
+                 with gr.Row():
+                     edit_cfg_img_scale = gr.Slider(minimum=1.0, maximum=4.0, value=2.0, step=0.1, interactive=True,
+                                                    label="CFG Image Scale", info="Controls how much the model preserves input image details")
+                     edit_cfg_interval = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
+                                                   label="CFG Interval", info="Start of CFG application interval (end is fixed at 1.0)")
+
+                 with gr.Row():
+                     edit_cfg_renorm_type = gr.Dropdown(choices=["global", "local", "text_channel"],
+                                                        value="text_channel", label="CFG Renorm Type",
+                                                        info="If the generated image is blurry, use 'global'")
+                     edit_cfg_renorm_min = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
+                                                     label="CFG Renorm Min", info="1.0 disables CFG-Renorm")
+
+                 with gr.Row():
+                     edit_num_timesteps = gr.Slider(minimum=10, maximum=100, value=50, step=5, interactive=True,
+                                                    label="Timesteps", info="Total denoising steps")
+                     edit_timestep_shift = gr.Slider(minimum=1.0, maximum=10.0, value=3.0, step=0.5, interactive=True,
+                                                     label="Timestep Shift", info="Higher values for layout, lower for details")
+
+
+             # Thinking parameters in a single row
+             edit_thinking_params = gr.Group(visible=False)
+             with edit_thinking_params:
+                 with gr.Row():
+                     edit_do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
+                     edit_max_think_token_n = gr.Slider(minimum=64, maximum=4006, value=1024, step=64, interactive=True,
+                                                        label="Max Think Tokens", info="Maximum number of tokens for thinking")
+                     edit_text_temperature = gr.Slider(minimum=0.1, maximum=1.0, value=0.3, step=0.1, interactive=True,
+                                                       label="Temperature", info="Controls randomness in text generation")
+
+         edit_btn = gr.Button("Submit")
+
+         # Dynamically show/hide thinking process box for editing
+         def update_edit_thinking_visibility(show):
+             return gr.update(visible=show), gr.update(visible=show)
+
+         edit_show_thinking.change(
+             fn=update_edit_thinking_visibility,
+             inputs=[edit_show_thinking],
+             outputs=[edit_thinking_output, edit_thinking_params]
+         )
+
+         # Process editing with thinking option and hyperparameters
+         def process_edit_image(image, prompt, show_thinking, cfg_text_scale,
+                                cfg_img_scale, cfg_interval,
+                                timestep_shift, num_timesteps, cfg_renorm_min,
+                                cfg_renorm_type, max_think_token_n, do_sample,
+                                text_temperature, seed):
+             edited_image, thinking = edit_image(
+                 image, prompt, show_thinking, cfg_text_scale, cfg_img_scale,
+                 cfg_interval, timestep_shift,
+                 num_timesteps, cfg_renorm_min, cfg_renorm_type,
+                 max_think_token_n, do_sample, text_temperature, seed
+             )
+
+             return edited_image, thinking if thinking else ""
+
+         edit_btn.click(
+             fn=process_edit_image,
+             inputs=[
+                 edit_image_input, edit_prompt, edit_show_thinking,
+                 edit_cfg_text_scale, edit_cfg_img_scale, edit_cfg_interval,
+                 edit_timestep_shift, edit_num_timesteps,
+                 edit_cfg_renorm_min, edit_cfg_renorm_type,
+                 edit_max_think_token_n, edit_do_sample, edit_text_temperature, edit_seed
+             ],
+             outputs=[edit_image_output, edit_thinking_output]
+         )
+
+     with gr.Tab("🖼️ Image Understanding"):
+         with gr.Row():
+             with gr.Column(scale=1):
+                 img_input = gr.Image(label="Input Image", value=load_example_image('test_images/meme.jpg'))
+                 understand_prompt = gr.Textbox(
+                     label="Prompt",
+                     value="Can someone explain what's funny about this meme?"
+                 )
+
+             with gr.Column(scale=1):
+                 txt_output = gr.Textbox(label="Result", lines=20)
+
+         with gr.Row():
+             understand_show_thinking = gr.Checkbox(label="Thinking", value=False)
+
+         # Add hyperparameter controls in an accordion
+         with gr.Accordion("Inference Hyperparameters", open=False):
+             with gr.Row():
+                 understand_do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
+                 understand_text_temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.3, step=0.05, interactive=True,
+                                                         label="Temperature", info="Controls randomness in text generation (0=deterministic, 1=creative)")
+                 understand_max_new_tokens = gr.Slider(minimum=64, maximum=4096, value=512, step=64, interactive=True,
+                                                       label="Max New Tokens", info="Maximum length of generated text, including potential thinking")
+
+         img_understand_btn = gr.Button("Submit")
+
+         # Process understanding with thinking option and hyperparameters
+         def process_understanding(image, prompt, show_thinking, do_sample,
+                                   text_temperature, max_new_tokens):
+             result = image_understanding(
+                 image, prompt, show_thinking, do_sample,
+                 text_temperature, max_new_tokens
+             )
+             return result
+
+         img_understand_btn.click(
+             fn=process_understanding,
+             inputs=[
+                 img_input, understand_prompt, understand_show_thinking,
+                 understand_do_sample, understand_text_temperature, understand_max_new_tokens
+             ],
+             outputs=txt_output
+         )
+
+     gr.Markdown("""
+ <div style="display: flex; justify-content: flex-start; flex-wrap: wrap; gap: 10px;">
+   <a href="https://bagel-ai.org/">
+     <img src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" alt="BAGEL Website"/>
+   </a>
+   <a href="https://arxiv.org/abs/2505.14683">
+     <img src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" alt="BAGEL Paper on arXiv"/>
+   </a>
+   <a href="https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT">
+     <img src="https://img.shields.io/badge/BAGEL-Hugging%20Face-orange?logo=huggingface&logoColor=yellow" alt="BAGEL on Hugging Face"/>
+   </a>
+   <a href="https://demo.bagel-ai.org/">
+     <img src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=blue" alt="BAGEL Demo"/>
+   </a>
+   <a href="https://discord.gg/Z836xxzy">
+     <img src="https://img.shields.io/badge/BAGEL-Discord-5865F2?logo=discord&logoColor=purple" alt="BAGEL Discord"/>
+   </a>
+   <a href="mailto:[email protected]">
+     <img src="https://img.shields.io/badge/BAGEL-Email-D14836?logo=gmail&logoColor=red" alt="BAGEL Email"/>
+   </a>
+ </div>
+ """)
+
+ demo.launch(share=True)
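
For reference, the same `InterleaveInferencer` can be driven without the Gradio UI. A minimal sketch that reuses the objects built in `app.py` and passes hyperparameters the way `text_to_image` does (the prompt and values are illustrative):

```python
# Headless text-to-image with the inferencer constructed in app.py.
set_seed(42)

result = inferencer(
    text="A bowl of ramen on a wooden table, soft morning light.",
    think=False,                 # True additionally returns a "text" field with the thinking trace
    cfg_text_scale=4.0,
    cfg_interval=[0.4, 1.0],     # CFG applied on the [0.4, 1.0] interval
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="global",
    image_shapes=(1024, 1024),
)
result["image"].save("ramen.png")
```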
assets/arch.png ADDED

Git LFS Details

  • SHA256: 28affbbfede911a75884bae4e8e1d5b897b8b450fa4c7d9b68818d05492b0967
  • Pointer size: 131 Bytes
  • Size of remote file: 168 kB
assets/emerging_curves.png ADDED

Git LFS Details

  • SHA256: 0c1ddd355742cddb52045ee59098305cc5de8174cb09afa019bb9afefd868733
  • Pointer size: 131 Bytes
  • Size of remote file: 373 kB
assets/teaser.webp ADDED

Git LFS Details

  • SHA256: d679e69a1fbdb7f9abceb59d9bc3d29ab65b7e871ba48b59aec0a7f35defa558
  • Pointer size: 132 Bytes
  • Size of remote file: 1.1 MB
data/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
data/configs/example.yaml ADDED
@@ -0,0 +1,45 @@
+ t2i_pretrain:
+   dataset_names:
+     - t2i
+   image_transform_args:
+     image_stride: 16
+     max_image_size: 1024
+     min_image_size: 512
+   is_mandatory: true
+   num_used_data: # The sum should be larger than NUM_GPUS x NUM_WORKERS
+     - 10
+   weight: 1
+
+ unified_edit:
+   dataset_names:
+     - seedxedit_multi
+   image_transform_args:
+     image_stride: 16
+     max_image_size: 1024
+     min_image_size: 512
+   vit_image_transform_args:
+     image_stride: 14
+     max_image_size: 518
+     min_image_size: 224
+   is_mandatory: false
+   num_used_data:
+     - 10
+   weight: 1
+
+ vlm_sft:
+   dataset_names:
+     - llava_ov
+   image_transform_args:
+     image_stride: 14
+     max_image_size: 980
+     min_image_size: 378
+     max_pixels: 2_007_040
+   frame_sampler_args:
+     max_num_frames: 12
+     min_num_frames: 8
+   is_mandatory: true
+   shuffle_lines: True
+   shuffle_seed: 0
+   num_used_data:
+     - 1000
+   weight: 1
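
A small sanity check for the comment above (each group's `num_used_data` must sum past NUM_GPUS x NUM_WORKERS); the 8-GPU / 4-worker figures mirror the TRAIN.md defaults but are assumptions about your setup:

```python
# Check each dataset group's num_used_data sum against the packing requirement.
import yaml  # pip install pyyaml

NUM_GPUS, NUM_WORKERS = 8, 4  # assumed; match your launch configuration

with open("data/configs/example.yaml") as f:
    groups = yaml.safe_load(f)

for name, cfg in groups.items():
    total = sum(cfg["num_used_data"])
    status = "ok" if total > NUM_GPUS * NUM_WORKERS else "too small"
    print(f"{name}: sum(num_used_data) = {total} -> {status}")
```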
data/data_utils.py ADDED
@@ -0,0 +1,177 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+
+ import math
+ import random
+ from PIL import Image
+
+ import torch
+ from torch.nn.attention.flex_attention import or_masks, and_masks
+
+
+ def create_sparse_mask(document_lens, split_lens, attn_modes, device):
+     def causal_mask(b, h, q_idx, kv_idx):
+         return q_idx >= kv_idx
+
+     def full_and_noise_mask(b, h, q_idx, kv_idx):
+         return (full_and_noise_seq_id[q_idx] == full_and_noise_seq_id[kv_idx]) & (full_and_noise_seq_id[q_idx] >= 0)
+
+     def remove_noise_mask(b, h, q_idx, kv_idx):
+         return (~((noise_seq_id[kv_idx] >= 0) & (noise_seq_id[q_idx] != noise_seq_id[kv_idx])))
+
+     def sample_mask(b, h, q_idx, kv_idx):
+         return document_id[q_idx] == document_id[kv_idx]
+
+     full_and_noise_tmp = []
+     noise_tmp = []
+
+     for i, (length, model) in enumerate(zip(split_lens, attn_modes)):
+         value = i if model in ['full', 'noise'] else -1
+         full_and_noise_tmp.extend([value] * length)
+         value_noise = i if model == 'noise' else -1
+         noise_tmp.extend([value_noise] * length)
+
+     full_and_noise_seq_id = torch.Tensor(full_and_noise_tmp).to(device)
+     noise_seq_id = torch.Tensor(noise_tmp).to(device)
+
+     document_id = torch.cat([torch.full((l,), i) for i, l in enumerate(document_lens, start=1)]).to(device)
+
+     return and_masks(or_masks(causal_mask, full_and_noise_mask), remove_noise_mask, sample_mask)
+
+
+ def patchify(image, patch_size):
+     p = patch_size
+     c, h, w = image.shape
+     assert h % p == 0 and w % p == 0
+     image = image.reshape(c, h // p, p, w // p, p)
+     image = torch.einsum("chpwq->hwpqc", image)
+     image = image.reshape(-1, p**2 * c)
+     return image
+
+
+ def get_flattened_position_ids_extrapolate(img_h, img_w, patch_size, max_num_patches_per_side):
+     num_patches_h, num_patches_w = img_h // patch_size, img_w // patch_size
+     coords_h = torch.arange(0, num_patches_h)
+     coords_w = torch.arange(0, num_patches_w)
+     pos_ids = (coords_h[:, None] * max_num_patches_per_side + coords_w).flatten()
+     return pos_ids
+
+
+ def get_flattened_position_ids_interpolate(img_h, img_w, patch_size, max_num_patches_per_side):
+     num_patches_h, num_patches_w = img_h // patch_size, img_w // patch_size
+     boundaries = torch.arange(1 / max_num_patches_per_side, 1.0, 1 / max_num_patches_per_side)
+     fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / num_patches_h)
+     fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / num_patches_w)
+     bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
+     bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
+     pos_ids = (bucket_coords_h[:, None] * max_num_patches_per_side + bucket_coords_w).flatten()
+     return pos_ids
+
+
+ def prepare_attention_mask_per_sample(split_lens, attn_modes, device="cpu"):
+     """
+     split_lens: a list of ints, each the length of one split within the sample;
+         a sample contains multiple splits with different attention modes.
+     attn_modes: the attention mode ('causal', 'full', or 'noise') of each split.
+     """
+     sample_len = sum(split_lens)
+     attention_mask = torch.zeros((sample_len, sample_len), dtype=torch.bool, device=device)
+
+     csum = 0
+     for s, attn_mode in zip(split_lens, attn_modes):
+         assert attn_mode in ['causal', 'full', 'noise']
+         if attn_mode == "causal":
+             attention_mask[csum:csum + s, csum:csum + s] = torch.ones((s, s), device=device).tril()
+             attention_mask[csum:csum + s, :csum] = 1
+         else:
+             attention_mask[csum:csum + s, csum:csum + s] = torch.ones((s, s))
+             attention_mask[csum:csum + s, :csum] = 1
+         csum += s
+
+     csum = 0
+     for s, attn_mode in zip(split_lens, attn_modes):
+         if attn_mode == "noise":
+             attention_mask[:, csum : csum + s] = torch.zeros((sample_len, s))
+             attention_mask[csum : csum + s, csum : csum + s] = torch.ones((s, s))
+         csum += s
+
+     attention_mask = torch.zeros_like(attention_mask, dtype=torch.float).masked_fill_(
+         ~attention_mask, float("-inf")
+     )
+
+     return attention_mask
+
+
+ def split_integer_exp_decay(S, ng_sample_decay=1.0):
+     if ng_sample_decay == 1.0:
+         N = random.randint(1, S)
+     else:
+         base = (1 - ng_sample_decay) / (1 - math.pow(ng_sample_decay, S))
+         p = [base * math.pow(ng_sample_decay, i) for i in range(S)]
+         N = random.choices(list(range(1, S + 1)), p, k=1)[0]
+     cumsum = [0] + sorted(random.sample(range(1, S), N - 1)) + [S]
+     result = [cumsum[i+1] - cumsum[i] for i in range(len(cumsum) - 1)]
+     return result, cumsum
+
+
+ def pil_img2rgb(image):
+     if image.mode == "RGBA" or image.info.get("transparency", None) is not None:
+         image = image.convert("RGBA")
+         white = Image.new(mode="RGB", size=image.size, color=(255, 255, 255))
+         white.paste(image, mask=image.split()[3])
+         image = white
+     else:
+         image = image.convert("RGB")
+
+     return image
+
+
+ def add_special_tokens(tokenizer):
+     all_special_tokens = []
+     for k, v in tokenizer.special_tokens_map.items():
+         if isinstance(v, str):
+             all_special_tokens.append(v)
+         elif isinstance(v, list):
+             all_special_tokens += v
+
+     new_tokens = []
+
+     if '<|im_start|>' not in all_special_tokens:
+         new_tokens.append('<|im_start|>')
+
+     if '<|im_end|>' not in all_special_tokens:
+         new_tokens.append('<|im_end|>')
+
+     if '<|vision_start|>' not in all_special_tokens:
+         new_tokens.append('<|vision_start|>')
+
+     if '<|vision_end|>' not in all_special_tokens:
+         new_tokens.append('<|vision_end|>')
+
+     num_new_tokens = tokenizer.add_tokens(new_tokens)
+     bos_token_id = tokenizer.convert_tokens_to_ids('<|im_start|>')
+     eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')
+     start_of_image = tokenizer.convert_tokens_to_ids('<|vision_start|>')
+     end_of_image = tokenizer.convert_tokens_to_ids('<|vision_end|>')
+
+     new_token_ids = dict(
+         bos_token_id=bos_token_id,
+         eos_token_id=eos_token_id,
+         start_of_image=start_of_image,
+         end_of_image=end_of_image,
+     )
+
+     return tokenizer, new_token_ids, num_new_tokens
+
+
+ def len2weight(x, loss_reduction='square'):
+     if x == 0:
+         return x
+     if loss_reduction == 'token':
+         return 1
+     if loss_reduction == 'sample':
+         return 1 / x
+     if loss_reduction == 'square':
+         return 1 / (x ** 0.5)
+     raise NotImplementedError(loss_reduction)
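
Two of the helpers above are easy to exercise in isolation; a quick usage sketch (shapes chosen purely for illustration):

```python
import torch
from data.data_utils import patchify, prepare_attention_mask_per_sample

# patchify: (C, H, W) -> (num_patches, p*p*C)
img = torch.randn(3, 32, 32)
patches = patchify(img, patch_size=2)
print(patches.shape)  # torch.Size([256, 12]): 16x16 patches of 2*2*3 values

# One packed sample: a causal text split, a full-attention split, and a
# noise split that only attends to itself.
mask = prepare_attention_mask_per_sample([4, 3, 2], ['causal', 'full', 'noise'])
print(mask.shape)  # torch.Size([9, 9]); 0.0 where attention is allowed, -inf elsewhere
```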
data/dataset_base.py ADDED
@@ -0,0 +1,620 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+
+ import random
+ import json
+
+ import numpy as np
+ import torch
+
+ from .data_utils import (
+     get_flattened_position_ids_interpolate,
+     get_flattened_position_ids_extrapolate,
+     len2weight,
+     patchify,
+     prepare_attention_mask_per_sample,
+ )
+ from .dataset_info import DATASET_INFO, DATASET_REGISTRY
+ from .transforms import ImageTransform
+ from .video_utils import FrameSampler
+
+
+ class DataConfig:
+     def __init__(
+         self,
+         grouped_datasets,
+         text_cond_dropout_prob=0.1,
+         vit_cond_dropout_prob=0.4,
+         vae_cond_dropout_prob=0.1,
+         vae_image_downsample=16,
+         max_latent_size=32,
+         vit_patch_size=14,
+         max_num_patch_per_side=70,
+     ):
+         self.grouped_datasets = grouped_datasets
+         self.text_cond_dropout_prob = text_cond_dropout_prob
+         self.vit_cond_dropout_prob = vit_cond_dropout_prob
+         self.vit_patch_size = vit_patch_size
+         self.max_num_patch_per_side = max_num_patch_per_side
+         self.vae_cond_dropout_prob = vae_cond_dropout_prob
+         self.vae_image_downsample = vae_image_downsample
+         self.max_latent_size = max_latent_size
+
+
+ class PackedDataset(torch.utils.data.IterableDataset):
+     def __init__(
+         self,
+         data_config,
+         tokenizer,
+         special_tokens,
+         local_rank,
+         world_size,
+         num_workers,
+         expected_num_tokens=32768,
+         max_num_tokens_per_sample=16384,
+         max_num_tokens=36864,
+         prefer_buffer_before=16384,
+         max_buffer_size=50,
+         interpolate_pos=False,
+         use_flex=False,
+         data_status=None,
+     ):
+         super().__init__()
+         self.expected_num_tokens = expected_num_tokens
+         self.max_num_tokens_per_sample = max_num_tokens_per_sample
+         self.prefer_buffer_before = prefer_buffer_before
+         self.max_num_tokens = max_num_tokens
+         self.max_buffer_size = max_buffer_size
+         self.tokenizer = tokenizer
+         self.local_rank = local_rank
+         self.world_size = world_size
+         self.num_workers = num_workers
+         self.use_flex = use_flex
+         for k, v in special_tokens.items():
+             setattr(self, k, v)
+
+         grouped_datasets, is_mandatory, grouped_weights = self.build_datasets(
+             data_config.grouped_datasets, data_status
+         )
+         self.grouped_datasets = grouped_datasets
+         self.dataset_iters = [iter(dataset) for dataset in grouped_datasets]
+         self.is_mandatory = is_mandatory
+         self.grouped_weights = grouped_weights
+         self.data_config = data_config
+         self.interpolate_pos = interpolate_pos
+         if self.interpolate_pos:
+             self.get_flattened_position_ids = get_flattened_position_ids_interpolate
+         else:
+             self.get_flattened_position_ids = get_flattened_position_ids_extrapolate
+
+     def build_datasets(self, datasets_metainfo, data_status):
+         datasets = []
+         is_mandatory = []
+         grouped_weights = []
+         for grouped_dataset_name, dataset_args in datasets_metainfo.items():
+             is_mandatory.append(dataset_args.pop('is_mandatory', False))
+             grouped_weights.append(dataset_args.pop('weight', 0.0))
+
+             if 'frame_sampler_args' in dataset_args.keys():
+                 frame_sampler = FrameSampler(**dataset_args.pop('frame_sampler_args'))
+                 dataset_args['frame_sampler'] = frame_sampler
+             if 'image_transform_args' in dataset_args.keys():
+                 transform = ImageTransform(**dataset_args.pop('image_transform_args'))
+                 dataset_args['transform'] = transform
+             if 'vit_image_transform_args' in dataset_args.keys():
+                 vit_transform = ImageTransform(**dataset_args.pop('vit_image_transform_args'))
+                 dataset_args['vit_transform'] = vit_transform
+
+             assert 'dataset_names' in dataset_args.keys()
+             dataset_names = dataset_args.pop('dataset_names')
+             dataset_args['data_dir_list'] = []
+             for item in dataset_names:
+                 if self.local_rank == 0:
+                     print(f'Preparing Dataset {grouped_dataset_name}/{item}')
+                 meta_info = DATASET_INFO[grouped_dataset_name][item]
+                 dataset_args['data_dir_list'].append(meta_info['data_dir'])
+
+                 if "parquet_info_path" in meta_info.keys():
+                     if 'parquet_info' not in dataset_args.keys():
+                         dataset_args['parquet_info'] = {}
+                     with open(meta_info['parquet_info_path'], 'r') as f:
+                         parquet_info = json.load(f)
+                     dataset_args['parquet_info'].update(parquet_info)
+
+                 if 'json_dir' in meta_info.keys():
+                     # parquet/tar with json
+                     if 'json_dir_list' not in dataset_args.keys():
+                         dataset_args['json_dir_list'] = [meta_info['json_dir']]
+                     else:
+                         dataset_args['json_dir_list'].append(meta_info['json_dir'])
+
+                 if 'jsonl_path' in meta_info.keys():
+                     # jsonl with jpeg
+                     if 'jsonl_path_list' not in dataset_args.keys():
+                         dataset_args['jsonl_path_list'] = [meta_info['jsonl_path']]
+                     else:
+                         dataset_args['jsonl_path_list'].append(meta_info['jsonl_path'])
+
+             resume_data_status = dataset_args.pop('resume_data_status', True)
+             if data_status is not None and grouped_dataset_name in data_status.keys() and resume_data_status:
+                 data_status_per_group = data_status[grouped_dataset_name]
+             else:
+                 data_status_per_group = None
+             dataset = DATASET_REGISTRY[grouped_dataset_name](
+                 dataset_name=grouped_dataset_name,
+                 tokenizer=self.tokenizer,
+                 local_rank=self.local_rank,
+                 world_size=self.world_size,
+                 num_workers=self.num_workers,
+                 data_status=data_status_per_group,
+                 **dataset_args
+             )
+             datasets.append(dataset)
+
+         return datasets, is_mandatory, grouped_weights
+
+     def set_epoch(self, seed):
+         for dataset in self.grouped_datasets:
+             dataset.set_epoch(seed)
+
+     def set_sequence_status(self):
+         sequence_status = dict(
+             curr=0,
+             sample_lens=list(),
+             packed_position_ids=list(),
+             nested_attention_masks=list(),
+             split_lens=list(),
+             attn_modes=list(),
+             packed_text_ids=list(),
+             packed_text_indexes=list(),
+             packed_label_ids=list(),
+             ce_loss_indexes=list(),
+             ce_loss_weights=list(),
+             vae_image_tensors=list(),
+             packed_latent_position_ids=list(),
+             vae_latent_shapes=list(),
+             packed_vae_token_indexes=list(),
+             packed_timesteps=list(),
+             mse_loss_indexes=list(),
+             packed_vit_tokens=list(),
+             vit_token_seqlens=list(),
+             packed_vit_position_ids=list(),
+             packed_vit_token_indexes=list(),
+         )
+         return sequence_status
+
+     def to_tensor(self, sequence_status):
+         data = dict(
+             sequence_length=sum(sequence_status['sample_lens']),
+             sample_lens=sequence_status['sample_lens'],
+             packed_text_ids=torch.tensor(sequence_status['packed_text_ids']),
+             packed_text_indexes=torch.tensor(sequence_status['packed_text_indexes']),
+             packed_position_ids=torch.tensor(sequence_status['packed_position_ids']),
+         )
+         if not self.use_flex:
+             data['nested_attention_masks'] = sequence_status['nested_attention_masks']
+         else:
+             sequence_len = data['sequence_length']
+             pad_len = self.max_num_tokens - sequence_len
+             data['split_lens'] = sequence_status['split_lens'] + [pad_len]
+             data['attn_modes'] = sequence_status['attn_modes'] + ['causal']
+             data['sample_lens'] += [pad_len]
+
+         # if the model has a convnet vae (e.g., as visual tokenizer)
+         if len(sequence_status['vae_image_tensors']) > 0:
+             image_tensors = sequence_status.pop('vae_image_tensors')
+             image_sizes = [item.shape for item in image_tensors]
+             max_image_size = [max(item) for item in list(zip(*image_sizes))]
+             padded_images = torch.zeros(size=(len(image_tensors), *max_image_size))
+             for i, image_tensor in enumerate(image_tensors):
+                 padded_images[i, :, :image_tensor.shape[1], :image_tensor.shape[2]] = image_tensor
+
+             data['padded_images'] = padded_images
+             data['patchified_vae_latent_shapes'] = sequence_status['vae_latent_shapes']
+             data['packed_latent_position_ids'] = torch.cat(sequence_status['packed_latent_position_ids'], dim=0)
+             data['packed_vae_token_indexes'] = torch.tensor(sequence_status['packed_vae_token_indexes'])
+
+         # if the model has a vit (e.g., as visual tokenizer)
+         if len(sequence_status['packed_vit_tokens']) > 0:
+             data['packed_vit_tokens'] = torch.cat(sequence_status['packed_vit_tokens'], dim=0)
+             data['packed_vit_position_ids'] = torch.cat(sequence_status['packed_vit_position_ids'], dim=0)
+             data['packed_vit_token_indexes'] = torch.tensor(sequence_status['packed_vit_token_indexes'])
+             data['vit_token_seqlens'] = torch.tensor(sequence_status['vit_token_seqlens'])
+
+         # if the model is required to perform visual generation
+         if len(sequence_status['packed_timesteps']) > 0:
+             data['packed_timesteps'] = torch.tensor(sequence_status['packed_timesteps'])
+             data['mse_loss_indexes'] = torch.tensor(sequence_status['mse_loss_indexes'])
+
+         # if the model is required to perform text generation
+         if len(sequence_status['packed_label_ids']) > 0:
+             data['packed_label_ids'] = torch.tensor(sequence_status['packed_label_ids'])
+             data['ce_loss_indexes'] = torch.tensor(sequence_status['ce_loss_indexes'])
+             data['ce_loss_weights'] = torch.tensor(sequence_status['ce_loss_weights'])
+
+         return data
+
+     def __iter__(self):
+         total_weights = sum(self.grouped_weights)
+         assert total_weights > 0.0
+         group_cumprobs = [sum(self.grouped_weights[:i + 1]) / total_weights
+                           for i in range(len(self.grouped_weights))]
+         sequence_status = self.set_sequence_status()
+         batch_data_indexes = []
+
+         buffer = []
+         while True:
+             # Ensure at least one sample from each group
+             if sequence_status['curr'] == 0:
+                 for group_index, group_iter in enumerate(self.dataset_iters):
+                     if self.is_mandatory[group_index]:
+                         while True:
+                             sample = next(group_iter)
+                             # if a sample is too long, skip it
+                             num_tokens = sample['num_tokens'] + 2 * len(sample['sequence_plan'])
+                             if num_tokens < self.max_num_tokens_per_sample:
+                                 sequence_status = self.pack_sequence(sample, sequence_status)
+                                 batch_data_indexes.append(sample['data_indexes'])
+                                 break
+                             else:
+                                 print(f"skip a sample with length {num_tokens}")
+                                 continue
+
+             if sequence_status['curr'] < self.prefer_buffer_before and len(buffer) > 0:
+                 sample = buffer.pop(0)
+                 sample_from_buffer = True
+             else:
+                 # sample normally across all groups
+                 n = random.random()
+                 group_index = 0
+                 for i, cumprob in enumerate(group_cumprobs):
+                     if n < cumprob:
+                         group_index = i
+                         break
+                 sample = next(self.dataset_iters[group_index])
+                 sample_from_buffer = False
+
+             # if a sample is too long, skip it
+             num_tokens = sample['num_tokens'] + 2 * len(sample['sequence_plan'])
+             if num_tokens > self.max_num_tokens_per_sample:
+                 print(f"skip a sample with length {num_tokens}")
+                 continue
+
+             if sequence_status['curr'] + num_tokens > self.max_num_tokens:
+                 if len(buffer) < self.max_buffer_size and not sample_from_buffer:
+                     buffer.append(sample)
+                 else:
+                     print(f"Yielding data with length {sum(sequence_status['sample_lens'])}")
+                     data = self.to_tensor(sequence_status)
+                     data['batch_data_indexes'] = batch_data_indexes
+                     yield data
+                     sequence_status = self.set_sequence_status()
+                     batch_data_indexes = []
+                 continue
+
+             sequence_status = self.pack_sequence(sample, sequence_status)
+             batch_data_indexes.append(sample['data_indexes'])
+
+             if sequence_status['curr'] >= self.expected_num_tokens:
+                 data = self.to_tensor(sequence_status)
+                 data['batch_data_indexes'] = batch_data_indexes
+                 yield data
+                 sequence_status = self.set_sequence_status()
+                 batch_data_indexes = []
+
+     def pack_sequence(self, sample, sequence_status):
+         image_tensor_list = sample['image_tensor_list']
+         text_ids_list = sample['text_ids_list']
+         sequence_plan = sample['sequence_plan']
+
+         split_lens, attn_modes = list(), list()
+         curr = sequence_status['curr']
+         curr_rope_id = 0
+         sample_lens = 0
+
+         for item in sequence_plan:
+             split_start = item.get('split_start', True)
+             if split_start:
+                 curr_split_len = 0
+
+             if item['type'] == 'text':
+                 text_ids = text_ids_list.pop(0)
+                 if item['enable_cfg'] == 1 and random.random() < self.data_config.text_cond_dropout_prob:
+                     continue
+
+                 shifted_text_ids = [self.bos_token_id] + text_ids
+                 sequence_status['packed_text_ids'].extend(shifted_text_ids)
+                 sequence_status['packed_text_indexes'].extend(range(curr, curr + len(shifted_text_ids)))
+                 if item['loss'] == 1:
+                     sequence_status['ce_loss_indexes'].extend(range(curr, curr + len(shifted_text_ids)))
+                     sequence_status['ce_loss_weights'].extend(
+                         [len2weight(len(shifted_text_ids))] * len(shifted_text_ids)
+                     )
+                     sequence_status['packed_label_ids'].extend(text_ids + [self.eos_token_id])
+                 curr += len(shifted_text_ids)
+                 curr_split_len += len(shifted_text_ids)
+
+                 # add a <|im_end|> token
+                 sequence_status['packed_text_ids'].append(self.eos_token_id)
+                 sequence_status['packed_text_indexes'].append(curr)
+                 if item['special_token_loss'] == 1:  # <|im_end|> may have loss
+                     sequence_status['ce_loss_indexes'].append(curr)
+                     sequence_status['ce_loss_weights'].append(1.0)
+                     sequence_status['packed_label_ids'].append(item['special_token_label'])
+                 curr += 1
+                 curr_split_len += 1
+
+                 # update sequence status
+                 attn_modes.append("causal")
+                 sequence_status['packed_position_ids'].extend(range(curr_rope_id, curr_rope_id + curr_split_len))
+                 curr_rope_id += curr_split_len
+
+             elif item['type'] == 'vit_image':
+                 image_tensor = image_tensor_list.pop(0)
+                 if item['enable_cfg'] == 1 and random.random() < self.data_config.vit_cond_dropout_prob:
+                     curr_rope_id += 1
+                     continue
+
+                 # add a <|startofimage|> token
+                 sequence_status['packed_text_ids'].append(self.start_of_image)
+                 sequence_status['packed_text_indexes'].append(curr)
+                 curr += 1
+                 curr_split_len += 1
+
+                 # preprocess image
+                 vit_tokens = patchify(image_tensor, self.data_config.vit_patch_size)
+                 num_img_tokens = vit_tokens.shape[0]
+                 sequence_status['packed_vit_token_indexes'].extend(range(curr, curr + num_img_tokens))
+                 curr += num_img_tokens
+                 curr_split_len += num_img_tokens
+
+                 sequence_status['packed_vit_tokens'].append(vit_tokens)
+                 sequence_status['vit_token_seqlens'].append(num_img_tokens)
+                 sequence_status['packed_vit_position_ids'].append(
+                     self.get_flattened_position_ids(
+                         image_tensor.size(1), image_tensor.size(2),
+                         self.data_config.vit_patch_size,
+                         max_num_patches_per_side=self.data_config.max_num_patch_per_side
+                     )
+                 )
+
+                 # add a <|endofimage|> token
+                 sequence_status['packed_text_ids'].append(self.end_of_image)
+                 sequence_status['packed_text_indexes'].append(curr)
+                 if item['special_token_loss'] == 1:  # <|endofimage|> may have loss
+                     sequence_status['ce_loss_indexes'].append(curr)
+                     sequence_status['ce_loss_weights'].append(1.0)
+                     sequence_status['packed_label_ids'].append(item['special_token_label'])
+                 curr += 1
+                 curr_split_len += 1
+
+                 # update sequence status
+                 attn_modes.append("full")
+                 sequence_status['packed_position_ids'].extend([curr_rope_id] * curr_split_len)
+                 curr_rope_id += 1
+
+             elif item['type'] == 'vae_image':
+                 image_tensor = image_tensor_list.pop(0)
+                 if item['enable_cfg'] == 1 and random.random() < self.data_config.vae_cond_dropout_prob:
+                     # FIXME fix vae dropout in video2video setting.
+                     curr_rope_id += 1
+                     continue
+
+                 # add a <|startofimage|> token
+                 sequence_status['packed_text_ids'].append(self.start_of_image)
+                 sequence_status['packed_text_indexes'].append(curr)
+                 curr += 1
+                 curr_split_len += 1
+
+                 # preprocess image
+                 sequence_status['vae_image_tensors'].append(image_tensor)
+                 sequence_status['packed_latent_position_ids'].append(
+                     self.get_flattened_position_ids(
+                         image_tensor.size(1), image_tensor.size(2),
+                         self.data_config.vae_image_downsample,
+                         max_num_patches_per_side=self.data_config.max_latent_size
+                     )
+                 )
+                 H, W = image_tensor.shape[1:]
+                 h = H // self.data_config.vae_image_downsample
+                 w = W // self.data_config.vae_image_downsample
+                 sequence_status['vae_latent_shapes'].append((h, w))
+
+                 num_img_tokens = w * h
+                 sequence_status['packed_vae_token_indexes'].extend(range(curr, curr + num_img_tokens))
+                 if item['loss'] == 1:
+                     sequence_status['mse_loss_indexes'].extend(range(curr, curr + num_img_tokens))
+                     if split_start:
+                         timestep = np.random.randn()
+                 else:
+                     timestep = float('-inf')
+
+                 sequence_status['packed_timesteps'].extend([timestep] * num_img_tokens)
+                 curr += num_img_tokens
+                 curr_split_len += num_img_tokens
+
+                 # add a <|endofimage|> token
+                 sequence_status['packed_text_ids'].append(self.end_of_image)
+                 sequence_status['packed_text_indexes'].append(curr)
+                 # <|endofimage|> may have loss
+                 if item['special_token_loss'] == 1:
+                     sequence_status['ce_loss_indexes'].append(curr)
+                     sequence_status['ce_loss_weights'].append(1.0)
+                     sequence_status['packed_label_ids'].append(item['special_token_label'])
+                 curr += 1
+                 curr_split_len += 1
+
+                 # update sequence status
+                 if split_start:
+                     if item['loss'] == 1 and 'frame_delta' not in item.keys():
+                         attn_modes.append("noise")
+                     else:
+                         attn_modes.append("full")
+                 sequence_status['packed_position_ids'].extend([curr_rope_id] * (num_img_tokens + 2))
+                 if 'frame_delta' in item.keys():
+                     curr_rope_id += item['frame_delta']
+                 elif item['loss'] == 0:
+                     curr_rope_id += 1
+
+             if item.get('split_end', True):
+                 split_lens.append(curr_split_len)
+                 sample_lens += curr_split_len
+
+         sequence_status['curr'] = curr
+         sequence_status['sample_lens'].append(sample_lens)
+         # prepare attention mask
+         if not self.use_flex:
+             sequence_status['nested_attention_masks'].append(
+                 prepare_attention_mask_per_sample(split_lens, attn_modes)
+             )
+         else:
+             sequence_status['split_lens'].extend(split_lens)
+             sequence_status['attn_modes'].extend(attn_modes)
+
+         return sequence_status
+
+
+ class SimpleCustomBatch:
+     def __init__(self, batch):
+         data = batch[0]
+         self.batch_data_indexes = data['batch_data_indexes']
+         self.sequence_length = data["sequence_length"]
+         self.sample_lens = data["sample_lens"]
+         self.packed_text_ids = data["packed_text_ids"]
+         self.packed_text_indexes = data["packed_text_indexes"]
+         self.packed_position_ids = data["packed_position_ids"]
+
+         self.use_flex = "nested_attention_masks" not in data.keys()
+
+         if self.use_flex:
+             self.split_lens = data["split_lens"]
+             self.attn_modes = data["attn_modes"]
+         else:
+             self.nested_attention_masks = data["nested_attention_masks"]
+
+         if "padded_images" in data.keys():
+             self.padded_images = data["padded_images"]
+             self.patchified_vae_latent_shapes = data["patchified_vae_latent_shapes"]
+             self.packed_latent_position_ids = data["packed_latent_position_ids"]
+             self.packed_vae_token_indexes = data["packed_vae_token_indexes"]
+
+         if "packed_vit_tokens" in data.keys():
+             self.packed_vit_tokens = data["packed_vit_tokens"]
+             self.packed_vit_position_ids = data["packed_vit_position_ids"]
+             self.packed_vit_token_indexes = data["packed_vit_token_indexes"]
+             self.vit_token_seqlens = data["vit_token_seqlens"]
+
+         if "packed_timesteps" in data.keys():
+             self.packed_timesteps = data["packed_timesteps"]
+             self.mse_loss_indexes = data["mse_loss_indexes"]
+
+         if "packed_label_ids" in data.keys():
+             self.packed_label_ids = data["packed_label_ids"]
+             self.ce_loss_indexes = data["ce_loss_indexes"]
+             self.ce_loss_weights = data["ce_loss_weights"]
+
+     def pin_memory(self):
+         self.packed_text_ids = self.packed_text_ids.pin_memory()
+         self.packed_text_indexes = self.packed_text_indexes.pin_memory()
+         self.packed_position_ids = self.packed_position_ids.pin_memory()
+
+         if not self.use_flex:
+             self.nested_attention_masks = [item.pin_memory() for item in self.nested_attention_masks]
+
+         if hasattr(self, 'padded_images'):
+             self.padded_images = self.padded_images.pin_memory()
+             self.packed_vae_token_indexes = self.packed_vae_token_indexes.pin_memory()
+             self.packed_latent_position_ids = self.packed_latent_position_ids.pin_memory()
+
+         if hasattr(self, 'packed_timesteps'):
+             self.packed_timesteps = self.packed_timesteps.pin_memory()
+             self.mse_loss_indexes = self.mse_loss_indexes.pin_memory()
+
+         if hasattr(self, 'packed_vit_tokens'):
+             self.packed_vit_tokens = self.packed_vit_tokens.pin_memory()
+             self.packed_vit_position_ids = self.packed_vit_position_ids.pin_memory()
+             self.packed_vit_token_indexes = self.packed_vit_token_indexes.pin_memory()
+             self.vit_token_seqlens = self.vit_token_seqlens.pin_memory()
+
+         if hasattr(self, 'packed_label_ids'):
+             self.packed_label_ids = self.packed_label_ids.pin_memory()
+             self.ce_loss_indexes = self.ce_loss_indexes.pin_memory()
+             self.ce_loss_weights = self.ce_loss_weights.pin_memory()
+
+         return self
+
+     def cuda(self, device):
+         self.packed_text_ids = self.packed_text_ids.to(device)
+         self.packed_text_indexes = self.packed_text_indexes.to(device)
+         self.packed_position_ids = self.packed_position_ids.to(device)
+
+         if not self.use_flex:
+             self.nested_attention_masks = [item.to(device) for item in self.nested_attention_masks]
+
+         if hasattr(self, 'padded_images'):
+             self.padded_images = self.padded_images.to(device)
+             self.packed_vae_token_indexes = self.packed_vae_token_indexes.to(device)
+             self.packed_latent_position_ids = self.packed_latent_position_ids.to(device)
+
+         if hasattr(self, 'packed_timesteps'):
+             self.packed_timesteps = self.packed_timesteps.to(device)
+             self.mse_loss_indexes = self.mse_loss_indexes.to(device)
+
+         if hasattr(self, 'packed_vit_tokens'):
+             self.packed_vit_tokens = self.packed_vit_tokens.to(device)
+             self.packed_vit_position_ids = self.packed_vit_position_ids.to(device)
+             self.packed_vit_token_indexes = self.packed_vit_token_indexes.to(device)
+             self.vit_token_seqlens = self.vit_token_seqlens.to(device)
+
+         if hasattr(self, 'packed_label_ids'):
+             self.packed_label_ids = self.packed_label_ids.to(device)
+             self.ce_loss_indexes = self.ce_loss_indexes.to(device)
+             self.ce_loss_weights = self.ce_loss_weights.to(device)
+
+         return self
+
+     def to_dict(self):
+         data = dict(
+             sequence_length=self.sequence_length,
+             sample_lens=self.sample_lens,
+             packed_text_ids=self.packed_text_ids,
+             packed_text_indexes=self.packed_text_indexes,
+             packed_position_ids=self.packed_position_ids,
+             batch_data_indexes=self.batch_data_indexes,
+         )
+
+         if not self.use_flex:
+             data['nested_attention_masks'] = self.nested_attention_masks
+         else:
+             data['split_lens'] = self.split_lens
+             data['attn_modes'] = self.attn_modes
+
+         if hasattr(self, 'padded_images'):
+             data['padded_images'] = self.padded_images
+             data['patchified_vae_latent_shapes'] = self.patchified_vae_latent_shapes
+             data['packed_latent_position_ids'] = self.packed_latent_position_ids
+             data['packed_vae_token_indexes'] = self.packed_vae_token_indexes
+
+         if hasattr(self, 'packed_vit_tokens'):
+             data['packed_vit_tokens'] = self.packed_vit_tokens
+             data['packed_vit_position_ids'] = self.packed_vit_position_ids
+             data['packed_vit_token_indexes'] = self.packed_vit_token_indexes
+             data['vit_token_seqlens'] = self.vit_token_seqlens
+
+         if hasattr(self, 'packed_timesteps'):
+             data['packed_timesteps'] = self.packed_timesteps
+             data['mse_loss_indexes'] = self.mse_loss_indexes
+
+         if hasattr(self, 'packed_label_ids'):
+             data['packed_label_ids'] = self.packed_label_ids
+             data['ce_loss_indexes'] = self.ce_loss_indexes
+             data['ce_loss_weights'] = self.ce_loss_weights
+
+         return data
+
+
+ def collate_wrapper():
+     def collate_fn(batch):
+         return SimpleCustomBatch(batch)
+     return collate_fn
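
Note: a hedged wiring sketch for the packing pipeline above. `grouped_datasets_cfg`, `tokenizer`, and `new_token_ids` are assumed to come from the user's YAML config and from `add_special_tokens(tokenizer)` in `data/data_utils.py`; nothing below is prescribed by the repo itself.

    import torch
    from data.dataset_base import DataConfig, PackedDataset, collate_wrapper

    dataset = PackedDataset(
        DataConfig(grouped_datasets=grouped_datasets_cfg),  # assumed config dict
        tokenizer=tokenizer,
        special_tokens=new_token_ids,
        local_rank=0, world_size=1, num_workers=1,
    )
    dataset.set_epoch(seed=0)

    # PackedDataset already yields fully packed token sequences, so the
    # DataLoader runs with batch_size=1 and the collate wrapper just unwraps.
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=1, num_workers=1,
        collate_fn=collate_wrapper(), pin_memory=True,
    )
    batch = next(iter(loader))       # SimpleCustomBatch
    batch = batch.cuda('cuda:0')     # or batch.to_dict() for plain tensors

Packing to a token budget (rather than batching a fixed number of samples) is what keeps mixed text/image sequences close to `expected_num_tokens` per step.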
data/dataset_info.py ADDED
@@ -0,0 +1,39 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ from .interleave_datasets import UnifiedEditIterableDataset
+ from .t2i_dataset import T2IIterableDataset
+ from .vlm_dataset import SftJSONLIterableDataset
+
+
+ DATASET_REGISTRY = {
+     't2i_pretrain': T2IIterableDataset,
+     'vlm_sft': SftJSONLIterableDataset,
+     'unified_edit': UnifiedEditIterableDataset,
+ }
+
+
+ DATASET_INFO = {
+     't2i_pretrain': {
+         't2i': {
+             'data_dir': 'your_data_path/bagel_example/t2i',  # path of the parquet files
+             'num_files': 10,  # number of data units to be sharded across all ranks and workers
+             'num_total_samples': 1000,  # number of total samples in the dataset
+         },
+     },
+     'unified_edit': {
+         'seedxedit_multi': {
+             'data_dir': 'your_data_path/bagel_example/editing/seedxedit_multi',
+             'num_files': 10,
+             'num_total_samples': 1000,
+             "parquet_info_path": 'your_data_path/bagel_example/editing/parquet_info/seedxedit_multi_nas.json',  # information of the parquet files
+         },
+     },
+     'vlm_sft': {
+         'llava_ov': {
+             'data_dir': 'your_data_path/bagel_example/vlm/images',
+             'jsonl_path': 'your_data_path/bagel_example/vlm/llava_ov_si.jsonl',
+             'num_total_samples': 1000,
+         },
+     },
+ }
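
Note: plugging in your own data source only requires a `DATASET_INFO` entry (plus a `DATASET_REGISTRY` entry if it is a new task type). A hypothetical example; the name, path, and counts are placeholders:

    DATASET_INFO['t2i_pretrain']['my_t2i'] = {
        'data_dir': 'your_data_path/my_t2i_parquets',
        'num_files': 4,               # data units sharded across ranks and workers
        'num_total_samples': 200000,
    }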
data/distributed_iterable_dataset.py ADDED
@@ -0,0 +1,58 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import random
+ import torch
+
+
+ class DistributedIterableDataset(torch.utils.data.IterableDataset):
+     def __init__(self, dataset_name, local_rank=0, world_size=1, num_workers=8):
+         self.dataset_name = dataset_name
+         self.local_rank = local_rank
+         self.world_size = world_size
+         self.num_workers = num_workers
+         self.rng = random.Random()
+         self.data_paths = None
+
+     def get_data_paths(self, *args, **kwargs):
+         raise NotImplementedError
+
+     def set_epoch(self, seed=42):
+         if self.data_paths is None:
+             return
+
+         if isinstance(self.data_paths[0], tuple):
+             data_paths = sorted(self.data_paths, key=lambda x: (x[0], x[1]))
+         elif isinstance(self.data_paths[0], str):
+             data_paths = sorted(self.data_paths)
+         else:
+             raise ValueError(f"Unknown data_paths type: {type(self.data_paths[0])}")
+
+         self.rng.seed(seed)
+         self.rng.shuffle(data_paths)
+
+         num_files_per_rank = len(data_paths) // self.world_size
+         local_start = self.local_rank * num_files_per_rank
+         local_end = (self.local_rank + 1) * num_files_per_rank
+         self.num_files_per_rank = num_files_per_rank
+         self.data_paths_per_rank = data_paths[local_start:local_end]
+
+     def get_data_paths_per_worker(self):
+         if self.data_paths is None:
+             return None
+
+         info = torch.utils.data.get_worker_info()
+         if info is None:
+             # Single worker: use all files assigned to the rank
+             return self.data_paths_per_rank, 0
+
+         worker_id = info.id
+         num_files_per_worker = self.num_files_per_rank // info.num_workers
+         start = num_files_per_worker * worker_id
+         end = num_files_per_worker * (worker_id + 1)
+         data_paths_per_worker = self.data_paths_per_rank[start:end]
+
+         return data_paths_per_worker[::-1], worker_id
+
+     def __iter__(self):
+         raise NotImplementedError
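
Note: `set_epoch` and `get_data_paths_per_worker` shard files in two stages, rank first and then dataloader worker. An illustrative walk-through of the floor-division arithmetic, which silently drops remainder files:

    files = [f'shard_{i:02d}.parquet' for i in range(10)]
    world_size, num_workers = 4, 2

    per_rank = len(files) // world_size          # 2 -> shards 8 and 9 are dropped
    rank0 = files[0:per_rank]                    # files for rank 0
    per_worker = per_rank // num_workers         # 1
    worker1 = rank0[per_worker:2 * per_worker]   # the one file for worker 1 of rank 0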
data/interleave_datasets/__init__.py ADDED
@@ -0,0 +1,5 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ from .edit_dataset import UnifiedEditIterableDataset
+
data/interleave_datasets/edit_dataset.py ADDED
@@ -0,0 +1,72 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import io
+ import random
+ from PIL import Image, ImageFile, PngImagePlugin
+
+ from .interleave_t2i_dataset import InterleavedBaseIterableDataset, ParquetStandardIterableDataset
+ from ..data_utils import pil_img2rgb
+
+
+ Image.MAX_IMAGE_PIXELS = 200000000
+ ImageFile.LOAD_TRUNCATED_IMAGES = True
+ MaximumDecompressedSize = 1024
+ MegaByte = 2 ** 20
+ PngImagePlugin.MAX_TEXT_CHUNK = MaximumDecompressedSize * MegaByte
+
+
+ class UnifiedEditIterableDataset(InterleavedBaseIterableDataset, ParquetStandardIterableDataset):
+
+     def parse_row(self, row):
+         image_num = len(row["image_list"])
+         # randomly choose start and end; returns [0, 1] when there are only two images
+         start_idx = random.choice(range(image_num - 1))
+         max_end = min(start_idx + 3, image_num)
+         end_idx = random.choice(range(start_idx + 1, max_end))
+
+         data = self._init_data()
+         data = self._add_image(
+             data,
+             pil_img2rgb(Image.open(io.BytesIO(row["image_list"][start_idx]))),
+             need_loss=False,
+             need_vae=True,
+             need_vit=True,
+         )
+
+         if end_idx - start_idx > 1 and random.random() < 0.5:  # concatenate multiple instructions
+             if end_idx == image_num - 1:
+                 end_idx -= 1
+
+             instruction = ""
+             for idx in range(start_idx + 1, end_idx + 1):
+                 instruction += random.choice(row["instruction_list"][idx-1]) + ". "
+             data = self._add_text(data, instruction.rstrip(), need_loss=False)
+             data = self._add_image(
+                 data,
+                 pil_img2rgb(Image.open(io.BytesIO(row["image_list"][end_idx]))),
+                 need_loss=True,
+                 need_vae=False,
+                 need_vit=False,
+             )
+         else:
+             for idx in range(start_idx + 1, end_idx + 1):
+                 instruction = random.choice(row["instruction_list"][idx-1])
+                 data = self._add_text(data, instruction, need_loss=False)
+                 if idx != end_idx:
+                     data = self._add_image(
+                         data,
+                         pil_img2rgb(Image.open(io.BytesIO(row["image_list"][idx]))),
+                         need_loss=True,
+                         need_vae=True,
+                         need_vit=True,
+                     )
+                 else:
+                     data = self._add_image(
+                         data,
+                         pil_img2rgb(Image.open(io.BytesIO(row["image_list"][idx]))),
+                         need_loss=True,
+                         need_vae=False,
+                         need_vit=False,
+                     )
+         return data
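
Note: a pure-Python sanity sketch of the window sampling at the top of `parse_row`: the sampled edit chain always spans one or two steps, so at most three images from the row are involved.

    import random

    image_num = 5
    start_idx = random.choice(range(image_num - 1))
    end_idx = random.choice(range(start_idx + 1, min(start_idx + 3, image_num)))
    assert 1 <= end_idx - start_idx <= 2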
data/interleave_datasets/interleave_t2i_dataset.py ADDED
@@ -0,0 +1,212 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import pyarrow.parquet as pq
+
+ from ..distributed_iterable_dataset import DistributedIterableDataset
+ from ..parquet_utils import get_parquet_data_paths, init_arrow_pf_fs
+
+
+ class InterleavedBaseIterableDataset(DistributedIterableDataset):
+
+     def _init_data(self):
+         data = {
+             'sequence_plan': [],
+             'text_ids_list': [],
+             'image_tensor_list': [],
+             'num_tokens': 0,
+         }
+         return data
+
+     def _add_text(self, data, text, need_loss, enable_cfg=True):
+         text_ids = self.tokenizer.encode(text)
+         data['num_tokens'] += len(text_ids)
+         data['text_ids_list'].append(text_ids)
+         data['sequence_plan'].append(
+             {
+                 'type': 'text',
+                 'enable_cfg': int(enable_cfg),
+                 'loss': int(need_loss),
+                 'special_token_loss': 0,
+                 'special_token_label': None,
+             }
+         )
+         return data
+
+     def _add_image(self, data, image, need_loss, need_vae, need_vit, enable_cfg=True):
+         assert need_loss or need_vae or need_vit
+
+         if need_loss:
+             data['sequence_plan'].append(
+                 {
+                     'type': 'vae_image',
+                     'enable_cfg': 0,
+                     'loss': 1,
+                     'special_token_loss': 0,
+                     'special_token_label': None,
+                 }
+             )
+
+             image_tensor = self.transform(image)
+             height, width = image_tensor.shape[1:]
+             data['num_tokens'] += width * height // self.transform.stride ** 2
+             data['image_tensor_list'].append(image_tensor)
+
+         if need_vae:
+             data['sequence_plan'].append(
+                 {
+                     'type': 'vae_image',
+                     'enable_cfg': int(enable_cfg),
+                     'loss': 0,
+                     'special_token_loss': 0,
+                     'special_token_label': None,
+                 }
+             )
+
+             image_tensor = self.transform(image)
+             height, width = image_tensor.shape[1:]
+             data['num_tokens'] += width * height // self.transform.stride ** 2
+             data['image_tensor_list'].append(image_tensor.clone())
+
+         if need_vit:
+             data['sequence_plan'].append(
+                 {
+                     'type': 'vit_image',
+                     'enable_cfg': int(enable_cfg),
+                     'loss': 0,
+                     'special_token_loss': 0,
+                     'special_token_label': None,
+                 },
+             )
+             vit_image_tensor = self.vit_transform(image)
+             height, width = vit_image_tensor.shape[1:]
+             data['num_tokens'] += width * height // self.vit_transform.stride ** 2
+             data['image_tensor_list'].append(vit_image_tensor)
+
+         return data
+
+     def _add_video(self, data, frames, frame_indexes, need_loss, need_vae, enable_cfg=True):
+         assert int(need_loss) + int(need_vae) == 1
+
+         if need_loss:
+             for idx, (image, frame_idx) in enumerate(zip(frames, frame_indexes)):
+                 current_sequence_plan = {
+                     'type': 'vae_image',
+                     'enable_cfg': 0,
+                     'loss': 1,
+                     'special_token_loss': 0,
+                     'special_token_label': None,
+                     'split_start': idx == 0,
+                     'split_end': idx == len(frames) - 1,
+                 }
+                 if idx < len(frame_indexes) - 1:
+                     current_sequence_plan['frame_delta'] = frame_indexes[idx + 1] - frame_idx
+                 data['sequence_plan'].append(current_sequence_plan)
+                 image_tensor = self.transform(image)
+                 height, width = image_tensor.shape[1:]
+                 data['image_tensor_list'].append(image_tensor)
+                 data['num_tokens'] += width * height // self.transform.stride ** 2
+
+         elif need_vae:
+             for idx, (image, frame_idx) in enumerate(zip(frames, frame_indexes)):
+                 current_sequence_plan = {
+                     'type': 'vae_image',
+                     'enable_cfg': int(enable_cfg),
+                     'loss': 0,
+                     'special_token_loss': 0,
+                     'special_token_label': None,
+                     'split_start': idx == 0,
+                     'split_end': idx == len(frames) - 1,
+                 }
+                 if idx < len(frame_indexes) - 1:
+                     current_sequence_plan['frame_delta'] = frame_indexes[idx + 1] - frame_idx
+                 data['sequence_plan'].append(current_sequence_plan)
+                 image_tensor = self.transform(image)
+                 height, width = image_tensor.shape[1:]
+                 data['image_tensor_list'].append(image_tensor)
+                 data['num_tokens'] += width * height // self.transform.stride ** 2
+
+         return data
+
+
+ class ParquetStandardIterableDataset(DistributedIterableDataset):
+
+     def __init__(
+         self, dataset_name, transform, tokenizer, vit_transform,
+         data_dir_list, num_used_data, parquet_info,
+         local_rank=0, world_size=1, num_workers=8, data_status=None,
+     ):
+         """
+         data_dir_list: list of data directories containing parquet files
+         num_used_data: list of the number of sampled data paths for each data directory
+         vit_transform: input transform for the ViT model.
+         """
+         super().__init__(dataset_name, local_rank, world_size, num_workers)
+         self.transform = transform
+         self.vit_transform = vit_transform
+         self.tokenizer = tokenizer
+         self.data_status = data_status
+         self.data_paths = self.get_data_paths(data_dir_list, num_used_data, parquet_info)
+         self.set_epoch()
+
+     def get_data_paths(self, data_dir_list, num_used_data, parquet_info):
+         row_groups = []
+         for data_dir, num_data_path in zip(data_dir_list, num_used_data):
+             data_paths = get_parquet_data_paths([data_dir], [num_data_path])
+             for data_path in data_paths:
+                 if data_path in parquet_info.keys():
+                     num_row_groups = parquet_info[data_path]['num_row_groups']
+                     for rg_idx in range(num_row_groups):
+                         row_groups.append((data_path, rg_idx))
+         return row_groups
+
+     def parse_row(self, row):
+         raise NotImplementedError
+
+     def __iter__(self):
+         file_paths_per_worker, worker_id = self.get_data_paths_per_worker()
+         if self.data_status is not None:
+             global_row_group_start_id = self.data_status[worker_id][0]
+             row_start_id = self.data_status[worker_id][1] + 1
+         else:
+             global_row_group_start_id = 0
+             row_start_id = 0
+
+         print(
+             f"rank-{self.local_rank} worker-{worker_id} dataset-{self.dataset_name}: "
+             f"resuming data at global_rg#{global_row_group_start_id}, row#{row_start_id}"
+         )
+
+         while True:
+             file_paths_per_worker_ = file_paths_per_worker[global_row_group_start_id:]
+             for global_row_group_idx, (parquet_file_path, row_group_id) in enumerate(
+                 file_paths_per_worker_, start=global_row_group_start_id
+             ):
+                 fs = init_arrow_pf_fs(parquet_file_path)
+                 with fs.open_input_file(parquet_file_path) as f:
+                     try:
+                         fr = pq.ParquetFile(f)
+                         df = fr.read_row_group(row_group_id).to_pandas()
+                         df = df.iloc[row_start_id:]
+                     except Exception as e:
+                         print(f'Error {e} in rg#{row_group_id}, {parquet_file_path}')
+                         continue
+
+                     for row_idx, row in df.iterrows():
+                         try:
+                             data = self.parse_row(row)
+                             if len(data) == 0:
+                                 continue
+                             data['data_indexes'] = {
+                                 "data_indexes": [global_row_group_idx, row_idx],
+                                 "worker_id": worker_id,
+                                 "dataset_name": self.dataset_name,
+                             }
+                         except Exception as e:
+                             print(f'Error {e} in rg#{row_group_id}, {parquet_file_path}')
+                             continue
+                         yield data
+
+                 row_start_id = 0
+             global_row_group_start_id = 0
+             print(f"{self.dataset_name} repeat in rank-{self.local_rank} worker-{worker_id}")
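
Note: new interleaved datasets are expected to subclass these two bases and implement only `parse_row`. A hedged sketch with an assumed row schema of `{'caption': str, 'image': bytes}`; the class name and schema below are illustrative, not shipped with the repo.

    import io
    from PIL import Image

    from data.data_utils import pil_img2rgb
    from data.interleave_datasets.interleave_t2i_dataset import (
        InterleavedBaseIterableDataset, ParquetStandardIterableDataset,
    )


    class CaptionThenImageDataset(InterleavedBaseIterableDataset, ParquetStandardIterableDataset):
        def parse_row(self, row):
            # caption conditions the generation; the image carries the MSE loss
            data = self._init_data()
            data = self._add_text(data, row['caption'], need_loss=False)
            data = self._add_image(
                data,
                pil_img2rgb(Image.open(io.BytesIO(row['image']))),
                need_loss=True, need_vae=False, need_vit=False,
            )
            return data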
data/parquet_utils.py ADDED
@@ -0,0 +1,90 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+
+ import os
+ import xml.etree.ElementTree as ET
+ import subprocess
+ import logging
+
+ import pyarrow.fs as pf
+ import torch.distributed as dist
+
+ logger = logging.getLogger(__name__)
+
+
+ def get_parquet_data_paths(data_dir_list, num_sampled_data_paths, rank=0, world_size=1):
+     num_data_dirs = len(data_dir_list)
+     if world_size > 1:
+         chunk_size = (num_data_dirs + world_size - 1) // world_size
+         start_idx = rank * chunk_size
+         end_idx = min(start_idx + chunk_size, num_data_dirs)
+         local_data_dir_list = data_dir_list[start_idx:end_idx]
+         local_num_sampled_data_paths = num_sampled_data_paths[start_idx:end_idx]
+     else:
+         local_data_dir_list = data_dir_list
+         local_num_sampled_data_paths = num_sampled_data_paths
+
+     local_data_paths = []
+     for data_dir, num_data_path in zip(local_data_dir_list, local_num_sampled_data_paths):
+         if data_dir.startswith("hdfs://"):
+             files = hdfs_ls_cmd(data_dir)
+             data_paths_per_dir = [
+                 file for file in files if file.endswith(".parquet")
+             ]
+         else:
+             files = os.listdir(data_dir)
+             data_paths_per_dir = [
+                 os.path.join(data_dir, name)
+                 for name in files
+                 if name.endswith(".parquet")
+             ]
+         repeat = num_data_path // len(data_paths_per_dir)
+         data_paths_per_dir = data_paths_per_dir * (repeat + 1)
+         local_data_paths.extend(data_paths_per_dir[:num_data_path])
+
+     if world_size > 1:
+         gather_list = [None] * world_size
+         dist.all_gather_object(gather_list, local_data_paths)
+
+         combined_chunks = []
+         for chunk_list in gather_list:
+             if chunk_list is not None:
+                 combined_chunks.extend(chunk_list)
+     else:
+         combined_chunks = local_data_paths
+
+     return combined_chunks
+
+
+ # NOTE: customize this function for your cluster
+ def get_hdfs_host():
+     return "hdfs://xxx"
+
+
+ # NOTE: customize this function for your cluster
+ def get_hdfs_block_size():
+     return 134217728
+
+
+ # NOTE: customize this function for your cluster
+ def get_hdfs_extra_conf():
+     return None
+
+
+ def init_arrow_pf_fs(parquet_file_path):
+     if parquet_file_path.startswith("hdfs://"):
+         fs = pf.HadoopFileSystem(
+             host=get_hdfs_host(),
+             port=0,
+             buffer_size=get_hdfs_block_size(),
+             extra_conf=get_hdfs_extra_conf(),
+         )
+     else:
+         fs = pf.LocalFileSystem()
+     return fs
+
+
+ def hdfs_ls_cmd(dir):
+     result = subprocess.run(["hdfs", "dfs", "-ls", dir], capture_output=True, text=True).stdout
+     return ['hdfs://' + i.split('hdfs://')[-1].strip() for i in result.split('\n') if 'hdfs://' in i]
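
Note: when `num_sampled_data_paths` asks for more paths than a directory holds, `get_parquet_data_paths` oversamples by repetition and then truncates. Illustrative arithmetic:

    paths = ['a.parquet', 'b.parquet', 'c.parquet']
    num_data_path = 7
    repeat = num_data_path // len(paths)               # 2
    sampled = (paths * (repeat + 1))[:num_data_path]   # a b c a b c a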
data/t2i_dataset.py ADDED
@@ -0,0 +1,128 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import io
+ import json
+ import pyarrow.parquet as pq
+ import random
+ from PIL import Image
+
+ from .data_utils import pil_img2rgb
+ from .distributed_iterable_dataset import DistributedIterableDataset
+ from .parquet_utils import get_parquet_data_paths, init_arrow_pf_fs
+
+ Image.MAX_IMAGE_PIXELS = 20_000_000
+
+
+ class T2IIterableDataset(DistributedIterableDataset):
+     def __init__(
+         self, dataset_name, transform, tokenizer, data_dir_list, num_used_data,
+         local_rank=0, world_size=1, num_workers=8, data_status=None,
+     ):
+         """
+         data_dir_list: list of data directories containing parquet files
+         num_used_data: list of the number of sampled data paths for each data directory
+         """
+         super().__init__(dataset_name, local_rank, world_size, num_workers)
+         self.transform = transform
+         self.tokenizer = tokenizer
+         self.data_status = data_status
+         self.data_paths = self.get_data_paths(data_dir_list, num_used_data)
+         self.set_epoch()
+
+     def get_data_paths(self, data_dir_list, num_used_data):
+         return get_parquet_data_paths(data_dir_list, num_used_data)
+
+     def __iter__(self):
+         data_paths_per_worker, worker_id = self.get_data_paths_per_worker()
+         if self.data_status is not None:
+             parquet_start_id = self.data_status[worker_id][0]
+             row_group_start_id = self.data_status[worker_id][1]
+             row_start_id = self.data_status[worker_id][2] + 1
+         else:
+             parquet_start_id = 0
+             row_group_start_id = 0
+             row_start_id = 0
+         transform_stride = self.transform.stride
+
+         print(
+             f"rank-{self.local_rank} worker-{worker_id} dataset-{self.dataset_name}: "
+             f"resuming data at parquet#{parquet_start_id}, rg#{row_group_start_id}, row#{row_start_id}"
+         )
+
+         while True:
+             data_paths_per_worker_ = data_paths_per_worker[parquet_start_id:]
+             for parquet_idx, parquet_file_path in enumerate(data_paths_per_worker_, start=parquet_start_id):
+                 fs = init_arrow_pf_fs(parquet_file_path)
+                 with fs.open_input_file(parquet_file_path) as f:
+                     fr = pq.ParquetFile(f)
+                     row_group_ids = list(range(fr.num_row_groups))
+                     row_group_ids_ = row_group_ids[row_group_start_id:]
+
+                     for row_group_id in row_group_ids_:
+                         df = fr.read_row_group(row_group_id).to_pandas()
+                         df = df.iloc[row_start_id:]
+
+                         for row_idx, row in df.iterrows():
+                             num_tokens = 0
+                             try:
+                                 image_byte = row['image']
+                                 image = pil_img2rgb(Image.open(io.BytesIO(image_byte)))
+                             except Exception as e:
+                                 print(f'Error: {e} in rg#{row_group_id}, {parquet_file_path}')
+                                 continue
+                             image_tensor = self.transform(image)
+                             height, width = image_tensor.shape[1:]
+                             num_tokens += width * height // transform_stride ** 2
+
+                             try:
+                                 caption_dict = row['captions']
+                                 caption_dict = json.loads(caption_dict)
+                             except Exception as e:
+                                 print(f'Error: {e} in rg#{row_group_id}, {parquet_file_path}')
+                                 continue
+
+                             caps_token = [self.tokenizer.encode(v) for _, v in caption_dict.items()]
+                             if len(caps_token) == 0:
+                                 print(f'no caption in rg#{row_group_id}, {parquet_file_path}')
+                                 caption_token = self.tokenizer.encode(' ')
+                             else:
+                                 caption_token = random.choice(caps_token)
+
+                             sequence_plan, text_ids_list = [], []
+                             text_ids = caption_token
+                             num_tokens += len(caption_token)
+                             text_ids_list.append(text_ids)
+                             sequence_plan.append({
+                                 'type': 'text',
+                                 'enable_cfg': 1,
+                                 'loss': 0,
+                                 'special_token_loss': 0,
+                                 'special_token_label': None,
+                             })
+
+                             sequence_plan.append({
+                                 'type': 'vae_image',
+                                 'enable_cfg': 0,
+                                 'loss': 1,
+                                 'special_token_loss': 0,
+                                 'special_token_label': None,
+                             })
+
+                             sample = dict(
+                                 image_tensor_list=[image_tensor],
+                                 text_ids_list=text_ids_list,
+                                 num_tokens=num_tokens,
+                                 sequence_plan=sequence_plan,
+                                 data_indexes={
+                                     "data_indexes": [parquet_idx, row_group_id, row_idx],
+                                     "worker_id": worker_id,
+                                     "dataset_name": self.dataset_name,
+                                 }
+                             )
+                             yield sample
+
+                         row_start_id = 0
+                     row_group_start_id = 0
+             parquet_start_id = 0
+             print(f"{self.dataset_name} repeat in rank-{self.local_rank} worker-{worker_id}")
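
Note: token accounting for a t2i sample is the caption length plus the VAE latent tokens derived from the image shape. For example, a 512x512 image with transform stride 16:

    height, width, stride = 512, 512, 16
    print(width * height // stride ** 2)   # 1024 VAE tokens

This `num_tokens` is what `PackedDataset` later uses for length filtering and for packing samples up to its token budget.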
data/transforms.py ADDED
@@ -0,0 +1,287 @@
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ import random
5
+ from PIL import Image
6
+
7
+ import cv2
8
+ import numpy as np
9
+ import torch
10
+ from torchvision import transforms
11
+ from torchvision.transforms import functional as F
12
+ from torchvision.transforms import InterpolationMode
13
+
14
+
15
+ class MaxLongEdgeMinShortEdgeResize(torch.nn.Module):
16
+ """Resize the input image so that its longest side and shortest side are within a specified range,
17
+ ensuring that both sides are divisible by a specified stride.
18
+
19
+ Args:
20
+ max_size (int): Maximum size for the longest edge of the image.
21
+ min_size (int): Minimum size for the shortest edge of the image.
22
+ stride (int): Value by which the height and width of the image must be divisible.
23
+ max_pixels (int): Maximum pixels for the full image.
24
+ interpolation (InterpolationMode): Desired interpolation enum defined by
25
+ :class:`torchvision.transforms.InterpolationMode`. Default is ``InterpolationMode.BILINEAR``.
26
+ If input is Tensor, only ``InterpolationMode.NEAREST``, ``InterpolationMode.NEAREST_EXACT``,
27
+ ``InterpolationMode.BILINEAR``, and ``InterpolationMode.BICUBIC`` are supported.
28
+ The corresponding Pillow integer constants, e.g., ``PIL.Image.BILINEAR`` are also accepted.
29
+ antialias (bool, optional): Whether to apply antialiasing (default is True).
30
+ """
31
+
32
+ def __init__(
33
+ self,
34
+ max_size: int,
35
+ min_size: int,
36
+ stride: int,
37
+ max_pixels: int,
38
+ interpolation=InterpolationMode.BICUBIC,
39
+ antialias=True
40
+ ):
41
+ super().__init__()
42
+ self.max_size = max_size
43
+ self.min_size = min_size
44
+ self.stride = stride
45
+ self.max_pixels = max_pixels
46
+ self.interpolation = interpolation
47
+ self.antialias = antialias
48
+
49
+ def _make_divisible(self, value, stride):
50
+ """Ensure the value is divisible by the stride."""
51
+ return max(stride, int(round(value / stride) * stride))
52
+
53
+ def _apply_scale(self, width, height, scale):
54
+ new_width = round(width * scale)
55
+ new_height = round(height * scale)
56
+ new_width = self._make_divisible(new_width, self.stride)
57
+ new_height = self._make_divisible(new_height, self.stride)
58
+ return new_width, new_height
59
+
60
+ def forward(self, img, img_num=1):
61
+ """
62
+ Args:
63
+ img (PIL Image): Image to be resized.
64
+ img_num (int): Number of images, used to change max_tokens.
65
+ Returns:
66
+ PIL Image or Tensor: Rescaled image with divisible dimensions.
67
+ """
68
+ if isinstance(img, torch.Tensor):
69
+ height, width = img.shape[-2:]
70
+ else:
71
+ width, height = img.size
72
+
73
+ scale = min(self.max_size / max(width, height), 1.0)
74
+ scale = max(scale, self.min_size / min(width, height))
75
+ new_width, new_height = self._apply_scale(width, height, scale)
76
+
77
+ # Ensure the number of pixels does not exceed max_pixels
78
+ if new_width * new_height > self.max_pixels / img_num:
79
+ scale = self.max_pixels / img_num / (new_width * new_height)
80
+ new_width, new_height = self._apply_scale(new_width, new_height, scale)
81
+
82
+ # Ensure longest edge does not exceed max_size
83
+ if max(new_width, new_height) > self.max_size:
84
+ scale = self.max_size / max(new_width, new_height)
85
+ new_width, new_height = self._apply_scale(new_width, new_height, scale)
86
+
87
+ return F.resize(img, (new_height, new_width), self.interpolation, antialias=self.antialias)
88
+
89
+
90
+ class ImageTransform:
91
+ def __init__(
92
+ self,
93
+ max_image_size,
94
+ min_image_size,
95
+ image_stride,
96
+ max_pixels=14*14*9*1024,
97
+ image_mean=[0.5, 0.5, 0.5],
98
+ image_std=[0.5, 0.5, 0.5]
99
+ ):
100
+ self.stride = image_stride
101
+
102
+ self.resize_transform = MaxLongEdgeMinShortEdgeResize(
103
+ max_size=max_image_size,
104
+ min_size=min_image_size,
105
+ stride=image_stride,
106
+ max_pixels=max_pixels,
107
+ )
108
+ self.to_tensor_transform = transforms.ToTensor()
109
+ self.normalize_transform = transforms.Normalize(mean=image_mean, std=image_std, inplace=True)
110
+
111
+ def __call__(self, img, img_num=1):
112
+ img = self.resize_transform(img, img_num=img_num)
113
+ img = self.to_tensor_transform(img)
114
+ img = self.normalize_transform(img)
115
+ return img
116
+
117
+
118
+ def decolorization(image):
119
+ gray_image = image.convert('L')
120
+ return Image.merge(image.mode, [gray_image] * 3) if image.mode in ('RGB', 'L') else gray_image
121
+
122
+
123
+ def downscale(image, scale_factor):
124
+ new_width = int(round(image.width * scale_factor))
125
+ new_height = int(round(image.height * scale_factor))
126
+ new_width = max(1, new_width)
127
+ new_height = max(1, new_height)
128
+ return image.resize((new_width, new_height), resample=Image.BICUBIC)
129
+
130
+
131
+ def crop(image, crop_factors):
132
+ target_h, target_w = crop_factors
133
+ img_w, img_h = image.size
134
+
135
+ if target_h > img_h or target_w > img_w:
136
+ raise ValueError("Crop size exceeds image dimensions")
137
+
138
+ x = random.randint(0, img_w - target_w)
139
+ y = random.randint(0, img_h - target_h)
140
+
141
+ return image.crop((x, y, x + target_w, y + target_h)), [[x, y], [x + target_w, y + target_h]]
142
+
143
+
144
+ def motion_blur_opencv(image, kernel_size=15, angle=0):
145
+ # 线性核
146
+ kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
147
+ kernel[kernel_size // 2, :] = np.ones(kernel_size, dtype=np.float32)
148
+
149
+ # 旋转核
150
+ center = (kernel_size / 2 - 0.5, kernel_size / 2 - 0.5)
151
+ M = cv2.getRotationMatrix2D(center, angle, 1)
152
+ rotated_kernel = cv2.warpAffine(kernel, M, (kernel_size, kernel_size))
153
+
154
+ # 归一化核
155
+     rotated_kernel /= rotated_kernel.sum() if rotated_kernel.sum() != 0 else 1
+
+     img = np.array(image)
+     if img.ndim == 2:
+         blurred = cv2.filter2D(img, -1, rotated_kernel, borderType=cv2.BORDER_REFLECT)
+     else:
+         # For color images, convolve each channel independently
+         blurred = np.zeros_like(img)
+         for c in range(img.shape[2]):
+             blurred[..., c] = cv2.filter2D(img[..., c], -1, rotated_kernel, borderType=cv2.BORDER_REFLECT)
+
+     return Image.fromarray(blurred.astype(np.uint8))
+
+
+ def shuffle_patch(image, num_splits, gap_size=2):
+     """Split the image into patches (sizes need not divide evenly), shuffle them, and reassemble with gaps between patches."""
+     h_splits, w_splits = num_splits
+     img_w, img_h = image.size
+
+     base_patch_h = img_h // h_splits
+     patch_heights = [base_patch_h] * (h_splits - 1)
+     patch_heights.append(img_h - sum(patch_heights))
+
+     base_patch_w = img_w // w_splits
+     patch_widths = [base_patch_w] * (w_splits - 1)
+     patch_widths.append(img_w - sum(patch_widths))
+
+     patches = []
+     current_y = 0
+     for i in range(h_splits):
+         current_x = 0
+         patch_h = patch_heights[i]
+         for j in range(w_splits):
+             patch_w = patch_widths[j]
+             patch = image.crop((current_x, current_y, current_x + patch_w, current_y + patch_h))
+             patches.append(patch)
+             current_x += patch_w
+         current_y += patch_h
+
+     random.shuffle(patches)
+
+     total_width = sum(patch_widths) + (w_splits - 1) * gap_size
+     total_height = sum(patch_heights) + (h_splits - 1) * gap_size
+     new_image = Image.new(image.mode, (total_width, total_height), color=(255, 255, 255))
+
+     current_y = 0  # starting Y coordinate of the current row
+     patch_idx = 0  # index of the patch currently being placed
+     for i in range(h_splits):
+         current_x = 0  # starting X coordinate of the current column
+         patch_h = patch_heights[i]  # height of patches in the current row
+         for j in range(w_splits):
+             # Take the next shuffled patch
+             patch = patches[patch_idx]
+             patch_w = patch_widths[j]  # width of patches in the current column
+             # Paste the patch with its top-left corner at (current_x, current_y)
+             new_image.paste(patch, (current_x, current_y))
+             # Advance X: the next patch starts after the current patch width plus the gap
+             current_x += patch_w + gap_size
+             patch_idx += 1
+         # Advance Y: the next row starts after the current row height plus the gap
+         current_y += patch_h + gap_size
+
+     return new_image
+
+
+ def inpainting(image, num_splits, blank_ratio=0.3, blank_color=(255, 255, 255)):
+     """
+     Split the image into patches and blank out a random subset, for inpainting tasks.
+
+     Args:
+         image: PIL.Image, input image (RGB mode)
+         num_splits: (h_splits, w_splits), number of splits along the vertical and horizontal axes
+         blank_ratio: float, fraction of patches to blank out (0~1)
+         blank_color: tuple, RGB color of the blanked regions (e.g., white (255, 255, 255))
+
+     Returns:
+         PIL.Image, the reassembled image
+     """
+     h_splits, w_splits = num_splits
+     img_w, img_h = image.size
+
+     base_patch_h = img_h // h_splits
+     patch_heights = [base_patch_h] * (h_splits - 1)
+     patch_heights.append(img_h - sum(patch_heights))
+
+     base_patch_w = img_w // w_splits
+     patch_widths = [base_patch_w] * (w_splits - 1)
+     patch_widths.append(img_w - sum(patch_widths))
+
+     patches = []
+     current_y = 0
+     for i in range(h_splits):
+         current_x = 0
+         patch_h = patch_heights[i]
+         for j in range(w_splits):
+             patch_w = patch_widths[j]
+             patch = image.crop((current_x, current_y, current_x + patch_w, current_y + patch_h))
+             patches.append(patch)
+             current_x += patch_w
+         current_y += patch_h
+
+     total_patches = h_splits * w_splits
+     num_blank = int(total_patches * blank_ratio)
+     num_blank = max(0, min(num_blank, total_patches))
+     blank_indices = random.sample(range(total_patches), num_blank)
+
+     processed_patches = []
+     for idx, patch in enumerate(patches):
+         if idx in blank_indices:
+             blank_patch = Image.new("RGB", patch.size, color=blank_color)
+             processed_patches.append(blank_patch)
+         else:
+             processed_patches.append(patch)
+
+     # Create the result image (same size as the original)
+     result_image = Image.new("RGB", (img_w, img_h))
+     current_y = 0
+     patch_idx = 0
+     for i in range(h_splits):
+         current_x = 0
+         patch_h = patch_heights[i]
+         for j in range(w_splits):
+             # Take the processed patch
+             patch = processed_patches[patch_idx]
+             patch_w = patch_widths[j]
+             # Paste it back at its original position
+             result_image.paste(patch, (current_x, current_y))
+             current_x += patch_w
+             patch_idx += 1
+         current_y += patch_h
+
+     return result_image
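A minimal usage sketch for the two augmentations above (standalone, assuming only Pillow; the solid-color test image is made up for illustration):

    from PIL import Image

    img = Image.new("RGB", (250, 250), color=(30, 120, 200))  # 250 is deliberately not divisible by 3
    shuffled = shuffle_patch(img, num_splits=(3, 3), gap_size=2)   # 3x3 grid, shuffled, 2 px gaps
    masked = inpainting(img, num_splits=(4, 4), blank_ratio=0.25)  # blank ~25% of the 16 patches
    print(shuffled.size, masked.size)  # (254, 254) (250, 250): gaps enlarge the shuffled canvas only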
data/video_utils.py ADDED
@@ -0,0 +1,165 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+
+ import io
+ import os
+ import random
+ import re
+
+ import numpy as np
+ import decord
+ from PIL import Image
+
+
+ def get_frame_indices(num_frames, vlen, sample='rand', fix_start=None, input_fps=1, max_num_frames=-1):
+     if sample in ['rand', 'middle']:  # uniform sampling
+         acc_samples = min(num_frames, vlen)
+         # split the video into `acc_samples` intervals, and sample from each interval.
+         intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
+         ranges = []
+         for idx, interv in enumerate(intervals[:-1]):
+             ranges.append((interv, intervals[idx + 1] - 1))
+         if sample == 'rand':
+             try:
+                 frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
+             except IndexError:  # an interval was empty; fall back to random sampling
+                 frame_indices = np.random.permutation(vlen)[:acc_samples]
+                 frame_indices.sort()
+                 frame_indices = list(frame_indices)
+         elif fix_start is not None:
+             frame_indices = [x[0] + fix_start for x in ranges]
+         elif sample == 'middle':
+             frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
+         else:
+             raise NotImplementedError
+
+         if len(frame_indices) < num_frames:  # padded with last frame
+             padded_frame_indices = [frame_indices[-1]] * num_frames
+             padded_frame_indices[:len(frame_indices)] = frame_indices
+             frame_indices = padded_frame_indices
+     elif 'fps' in sample:  # fps0.5, sequentially sample frames at 0.5 fps
+         output_fps = float(sample[3:])
+         duration = float(vlen) / input_fps
+         delta = 1 / output_fps  # gap between frames, this is also the clip length each frame represents
+         frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
+         frame_indices = np.around(frame_seconds * input_fps).astype(int)
+         frame_indices = [e for e in frame_indices if e < vlen]
+         if max_num_frames > 0 and len(frame_indices) > max_num_frames:
+             frame_indices = frame_indices[:max_num_frames]
+     else:
+         raise ValueError
+     return frame_indices
+
+
+ def read_frames_decord(video_path, num_frames, sample='rand', fix_start=None, clip=None, min_num_frames=4):
+     video_reader = decord.VideoReader(video_path, num_threads=1)
+     vlen = len(video_reader)
+     fps = video_reader.get_avg_fps()
+     duration = vlen / float(fps)
+     if clip:
+         start, end = clip
+         duration = end - start
+         vlen = int(duration * fps)
+         start_index = int(start * fps)
+
+     t_num_frames = np.random.randint(min_num_frames, num_frames + 1)
+
+     frame_indices = get_frame_indices(
+         t_num_frames, vlen, sample=sample, fix_start=fix_start,
+         input_fps=fps
+     )
+     if clip:
+         frame_indices = [f + start_index for f in frame_indices]
+     frames = video_reader.get_batch(frame_indices).asnumpy()  # (T, H, W, C), np.uint8
+     frames = [Image.fromarray(frames[i]) for i in range(frames.shape[0])]
+     return frames
+
+
+ def extract_frame_number(filename):
+     # Extract the numeric part from the filename using regular expressions
+     match = re.search(r'_(\d+).jpg$', filename)
+     return int(match.group(1)) if match else -1
+
+
+ def sort_frames(frame_paths):
+     # Extract filenames from each path and sort by their numeric part
+     return sorted(frame_paths, key=lambda x: extract_frame_number(os.path.basename(x)))
+
+
+ def read_frames_folder(video_path, num_frames, sample='rand', fix_start=None, min_num_frames=4):
+     image_list = sort_frames(list(os.listdir(video_path)))
+     frames = []
+     for image in image_list:
+         fp = os.path.join(video_path, image)
+         frame = Image.open(fp).convert('RGB')
+         frames.append(frame)
+     vlen = len(frames)
+
+     t_num_frames = np.random.randint(min_num_frames, num_frames + 1)
+
+     if vlen > t_num_frames:
+         frame_indices = get_frame_indices(
+             t_num_frames, vlen, sample=sample, fix_start=fix_start
+         )
+         frames = [frames[i] for i in frame_indices]
+     return frames
+
+
+ class FrameSampler:
+     def __init__(self, max_num_frames=-1, min_num_frames=8, sample='rand'):
+         self.max_num_frames = max_num_frames
+         self.min_num_frames = min_num_frames
+         self.sample = sample
+
+     def __call__(self, file_name):
+         fn = read_frames_folder if file_name.endswith('/') else read_frames_decord
+         frames = fn(file_name, num_frames=self.max_num_frames, min_num_frames=self.min_num_frames, sample=self.sample)
+         return frames
+
+
+ def decode_video_byte(video_bytes):
+     video_stream = io.BytesIO(video_bytes)
+     vr = decord.VideoReader(video_stream)
+     return vr
+
+
+ def sample_mp4_frames(mp4_p, n_frames=None, fps=None, return_frame_indices=False, random_sample=False):
+     if isinstance(mp4_p, str):
+         vr = decord.VideoReader(mp4_p, num_threads=1)
+     elif isinstance(mp4_p, decord.video_reader.VideoReader):
+         vr = mp4_p
+     else:
+         raise TypeError("mp4_p must be a file path or a decord.VideoReader")
+     video_fps = vr.get_avg_fps()  # frame rate of the video
+     video_duration = len(vr) / video_fps
+     if n_frames is not None:
+         if random_sample:
+             frame_indices = sorted(random.sample(range(len(vr)), n_frames))
+         else:
+             frame_indices = np.linspace(0, len(vr)-1, n_frames, dtype=int).tolist()
+     else:
+         frame_indices = [int(i) for i in np.arange(0, len(vr)-1, video_fps/fps)]
+     frames = vr.get_batch(frame_indices).asnumpy()  # convert to a numpy array
+     frames = [Image.fromarray(frame).convert("RGB") for frame in frames]
+     if not return_frame_indices:
+         return frames, video_duration
+     else:
+         return frames, video_duration, frame_indices
+
+
+ def sample_mp4_frames_by_indices(mp4_p, frame_indices: list):
+     if isinstance(mp4_p, str):
+         vr = decord.VideoReader(mp4_p, num_threads=1)
+     elif isinstance(mp4_p, decord.video_reader.VideoReader):
+         vr = mp4_p
+     else:
+         raise TypeError("mp4_p must be a file path or a decord.VideoReader")
+     # sample the frames in frame_indices
+     frames = vr.get_batch(frame_indices).asnumpy()  # convert to a numpy array
+     frames = [Image.fromarray(frame).convert("RGB") for frame in frames]
+     return frames
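A quick standalone check of the index sampler above (pure NumPy, no video file needed); the numbers are illustrative:

    # 'middle' sampling is deterministic: the midpoint of 4 equal intervals over 100 frames
    print(get_frame_indices(num_frames=4, vlen=100, sample='middle'))  # [12, 37, 62, 87]
    # 'fpsX' sampling: one frame every 2 s of a 4 s clip recorded at 25 fps
    print(get_frame_indices(num_frames=-1, vlen=100, sample='fps0.5', input_fps=25))  # [25, 75]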
data/vlm_dataset.py ADDED
@@ -0,0 +1,195 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import json
+ import os
+ import traceback
+ from PIL import Image, ImageFile, PngImagePlugin
+
+ from .data_utils import pil_img2rgb
+ from .distributed_iterable_dataset import DistributedIterableDataset
+
+
+ Image.MAX_IMAGE_PIXELS = 200000000
+ ImageFile.LOAD_TRUNCATED_IMAGES = True
+ MaximumDecompressedSize = 1024
+ MegaByte = 2 ** 20
+ PngImagePlugin.MAX_TEXT_CHUNK = MaximumDecompressedSize * MegaByte
+
+
+ class SftJSONLIterableDataset(DistributedIterableDataset):
+     def __init__(
+         self, dataset_name, transform, tokenizer, frame_sampler,
+         jsonl_path_list, data_dir_list, num_used_data,
+         local_rank=0, world_size=1, num_workers=8, data_status=None,
+         shuffle_lines=False, shuffle_seed=0,
+     ):
+         """
+         jsonl_path_list: list of jsonl file paths
+         data_dir_list: list of image directories containing the images of each jsonl file
+         num_used_data: list of number of sampled data points for each jsonl
+         """
+         super().__init__(dataset_name, local_rank, world_size, num_workers)
+         self.transform = transform
+         self.tokenizer = tokenizer
+         self.frame_sampler = frame_sampler
+         self.data_status = data_status
+         self.data_paths = self.get_data_paths(
+             jsonl_path_list,
+             data_dir_list,
+             num_used_data,
+             shuffle_lines,
+             shuffle_seed,
+         )
+         self.set_epoch()
+
+     def get_data_paths(
+         self,
+         jsonl_path_list,
+         data_dir_list,
+         num_used_data,
+         shuffle_lines,
+         shuffle_seed,
+     ):
+         data_paths = []
+         for jsonl_path, image_dir, num_data_point in zip(
+             jsonl_path_list, data_dir_list, num_used_data
+         ):
+             with open(jsonl_path, 'r') as f:
+                 raw_data = f.readlines()
+             if shuffle_lines:
+                 self.rng.seed(shuffle_seed)
+                 self.rng.shuffle(raw_data)
+             raw_data = raw_data[:num_data_point]
+             data_paths.extend([(json_data, image_dir) for json_data in raw_data])
+         return data_paths
+
+     def change_format(self, data, num_images):
+         elements = []
+         for conversation in data['conversations']:
+             if conversation['from'] == 'human':
+                 if '<image>' not in conversation['value']:
+                     elements.append({
+                         'type': 'text',
+                         'has_loss': 0,
+                         'text': conversation['value'],
+                     })
+                 else:
+                     text_list = conversation['value'].split('<image>')
+                     for idx, text in enumerate(text_list):
+                         if text.strip() != '':
+                             elements.append({
+                                 'type': 'text',
+                                 'has_loss': 0,
+                                 'text': text.strip(),
+                             })
+                         if (idx != len(text_list) - 1) and (idx < num_images):
+                             elements.append({'type': 'image'})
+             elif conversation['from'] == 'gpt':
+                 elements.append({
+                     'type': 'text',
+                     'has_loss': 1,
+                     'text': conversation['value'],
+                 })
+         return elements
+
+     def __iter__(self):
+         data_paths_per_worker, worker_id = self.get_data_paths_per_worker()
+         if self.data_status is not None:
+             row_start_id = self.data_status[worker_id] + 1
+         else:
+             row_start_id = 0
+         transform_stride = self.transform.stride
+
+         print(
+             f"rank-{self.local_rank} worker-{worker_id} dataset-{self.dataset_name}: "
+             f"resuming data at row#{row_start_id}"
+         )
+
+         while True:
+             data_paths_per_worker_ = data_paths_per_worker[row_start_id:]
+             for row_idx, (data, image_dir) in enumerate(data_paths_per_worker_, start=row_start_id):
+                 num_tokens = 0
+                 image_tensor_list = []
+                 text_ids_list = []
+                 sequence_plan = []
+
+                 try:
+                     data_item = json.loads(data)
+                     raw_images = None
+                     if 'image' in data_item:
+                         if isinstance(data_item['image'], list):
+                             raw_images = [
+                                 pil_img2rgb(Image.open(os.path.join(image_dir, image)))
+                                 for image in data_item['image']
+                             ]
+                         else:
+                             raw_images = [
+                                 pil_img2rgb(Image.open(os.path.join(image_dir, data_item['image'])))
+                             ]
+                     elif 'video' in data_item:
+                         raw_images = self.frame_sampler(os.path.join(image_dir, data_item['video']))
+                         special_tokens = '<image>' * len(raw_images)
+                         for item in data_item['conversations']:
+                             if '<video>' in item['value']:
+                                 item['value'] = item['value'].replace('<video>', special_tokens)
+                                 break
+                         else:
+                             raise ValueError("Cannot find <video> in the conversation!")
+                 except Exception:
+                     traceback.print_exc()
+                     continue
+
+                 if raw_images:
+                     for raw_image in raw_images:
+                         image_tensor = self.transform(raw_image, img_num=len(raw_images))
+                         image_tensor_list.append(image_tensor)
+                         height, width = image_tensor.shape[1:]
+                         num_tokens += width * height // transform_stride ** 2
+
+                 elements = self.change_format(data_item, len(image_tensor_list))
+
+                 for item in elements:
+                     if item['type'] == 'text':
+                         text_data = item['text']
+                         text_ids = self.tokenizer.encode(text_data)
+                         if len(text_ids) > 0:
+                             text_ids_list.append(text_ids)
+                             num_tokens += len(text_ids)
+                             current_plan = {
+                                 'type': 'text',
+                                 'enable_cfg': 0,
+                                 'loss': item['has_loss'],
+                                 'special_token_loss': 0,
+                                 'special_token_label': None,
+                             }
+                             sequence_plan.append(current_plan)
+                     elif item['type'] == 'image':
+                         current_plan = {
+                             'type': 'vit_image',
+                             'enable_cfg': 0,
+                             'loss': 0,
+                             'special_token_loss': 0,
+                             'special_token_label': None,
+                         }
+                         sequence_plan.append(current_plan)
+
+                 has_loss = [item['loss'] for item in sequence_plan]
+                 if sum(has_loss) == 0:
+                     print('No loss defined, skipped.')
+                     continue
+
+                 yield dict(
+                     image_tensor_list=image_tensor_list,
+                     text_ids_list=text_ids_list,
+                     sequence_plan=sequence_plan,
+                     num_tokens=num_tokens,
+                     data_indexes={
+                         "data_indexes": row_idx,
+                         "worker_id": worker_id,
+                         "dataset_name": self.dataset_name,
+                     }
+                 )
+
+             row_start_id = 0
+             print(f"{self.dataset_name} repeat in rank-{self.local_rank} worker-{worker_id}")
eval/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
eval/gen/gen_images_mp.py ADDED
@@ -0,0 +1,238 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import os
+ import json
+ import argparse
+ from safetensors.torch import load_file
+
+ import torch
+ import torch.distributed as dist
+ from data.data_utils import add_special_tokens
+ from modeling.bagel import (
+     BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel
+ )
+ from modeling.qwen2 import Qwen2Tokenizer
+ from modeling.autoencoder import load_ae
+
+ from PIL import Image
+ from modeling.bagel.qwen2_navit import NaiveCache
+
+
+ def move_generation_input_to_device(generation_input, device):
+     # Utility to move all tensors in generation_input to device
+     for k, v in generation_input.items():
+         if isinstance(v, torch.Tensor):
+             generation_input[k] = v.to(device)
+     return generation_input
+
+
+ def setup_distributed():
+     dist.init_process_group(backend="nccl")
+     torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
+
+
+ def generate_image(prompt, num_timesteps=50, cfg_scale=10.0, cfg_interval=[0, 1.0], cfg_renorm_min=0., timestep_shift=1.0, num_images=4, resolution=512, device=None):  # device is passed in explicitly
+     past_key_values = NaiveCache(gen_model.config.llm_config.num_hidden_layers)
+     newlens = [0] * num_images
+     new_rope = [0] * num_images
+
+     generation_input, newlens, new_rope = gen_model.prepare_prompts(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         prompts=[prompt] * num_images,
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+
+     with torch.no_grad():
+         with torch.amp.autocast("cuda", enabled=True, dtype=torch.float16):
+             past_key_values = gen_model.forward_cache_update_text(past_key_values, **generation_input)
+
+     generation_input = gen_model.prepare_vae_latent(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         image_sizes=[(resolution, resolution)] * num_images,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+
+     cfg_past_key_values = NaiveCache(gen_model.config.llm_config.num_hidden_layers)
+     cfg_newlens = [0] * num_images
+     cfg_new_rope = [0] * num_images
+
+     generation_input_cfg = gen_model.prepare_vae_latent_cfg(
+         curr_kvlens=cfg_newlens,
+         curr_rope=cfg_new_rope,
+         image_sizes=[(resolution, resolution)] * num_images,
+     )
+     generation_input_cfg = move_generation_input_to_device(generation_input_cfg, device)
+
+     with torch.no_grad():
+         with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+             unpacked_latent = gen_model.generate_image(
+                 past_key_values=past_key_values,
+                 num_timesteps=num_timesteps,
+                 cfg_text_scale=cfg_scale,
+                 cfg_interval=cfg_interval,
+                 cfg_renorm_min=cfg_renorm_min,
+                 timestep_shift=timestep_shift,
+                 cfg_text_past_key_values=cfg_past_key_values,
+                 cfg_text_packed_position_ids=generation_input_cfg["cfg_packed_position_ids"],
+                 cfg_text_key_values_lens=generation_input_cfg["cfg_key_values_lens"],
+                 cfg_text_packed_query_indexes=generation_input_cfg["cfg_packed_query_indexes"],
+                 cfg_text_packed_key_value_indexes=generation_input_cfg["cfg_packed_key_value_indexes"],
+                 **generation_input,
+             )
+
+     image_list = []
+     for latent in unpacked_latent:
+         latent = latent.reshape(1, resolution//16, resolution//16, 2, 2, 16)
+         latent = torch.einsum("nhwpqc->nchpwq", latent)
+         latent = latent.reshape(1, 16, resolution//8, resolution//8)
+         image = vae_model.decode(latent.to(device))
+         tmpimage = ((image * 0.5 + 0.5).clamp(0, 1)[0].permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
+         tmpimage = Image.fromarray(tmpimage)
+         image_list.append(tmpimage)
+
+     return image_list
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Generate images using Bagel model.")
+     parser.add_argument("--output_dir", type=str, required=True, help="Directory to save the generated images.")
+     parser.add_argument("--metadata_file", type=str, required=True, help="JSONL file containing lines of metadata for each prompt.")
+     parser.add_argument("--num_images", type=int, default=4)
+     parser.add_argument("--batch_size", type=int, default=4)
+     parser.add_argument("--cfg_scale", type=float, default=4)
+     parser.add_argument("--resolution", type=int, default=1024)
+     parser.add_argument("--max_latent_size", type=int, default=64)
+     parser.add_argument('--model-path', type=str, default='hf/BAGEL-7B-MoT/')
+     args = parser.parse_args()
+
+     seed = 42
+     if seed is not None:
+         import random
+         import numpy as np
+         random.seed(seed)
+         np.random.seed(seed)
+         torch.manual_seed(seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed(seed)
+             torch.cuda.manual_seed_all(seed)
+         torch.backends.cudnn.deterministic = True
+         torch.backends.cudnn.benchmark = False
+
+     setup_distributed()
+     rank = dist.get_rank()
+     world_size = dist.get_world_size()
+     device = f"cuda:{rank}"
+
+     output_dir = args.output_dir
+     os.makedirs(output_dir, exist_ok=True)
+     if rank == 0:
+         print(f"Output images are saved in {output_dir}")
+
+     llm_config = Qwen2Config.from_json_file(os.path.join(args.model_path, "llm_config.json"))
+     llm_config.qk_norm = True
+     llm_config.tie_word_embeddings = False
+     llm_config.layer_module = "Qwen2MoTDecoderLayer"
+
+     vit_config = SiglipVisionConfig.from_json_file(os.path.join(args.model_path, "vit_config.json"))
+     vit_config.rope = False
+     vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1
+
+     vae_model, vae_config = load_ae(local_path=os.path.join(args.model_path, "ae.safetensors"))
+
+     config = BagelConfig(
+         visual_gen=True,
+         visual_und=True,
+         llm_config=llm_config,
+         vit_config=vit_config,
+         vae_config=vae_config,
+         vit_max_num_patch_per_side=70,
+         connector_act='gelu_pytorch_tanh',
+         latent_patch_size=2,
+         max_latent_size=args.max_latent_size,
+     )
+     language_model = Qwen2ForCausalLM(llm_config)
+     vit_model = SiglipVisionModel(vit_config)
+     model = Bagel(language_model, vit_model, config)
+     model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config)
+
+     tokenizer = Qwen2Tokenizer.from_pretrained(args.model_path)
+     tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
+
+     model_state_dict_path = os.path.join(args.model_path, "ema.safetensors")
+     model_state_dict = load_file(model_state_dict_path, device="cpu")
+     msg = model.load_state_dict(model_state_dict, strict=False)
+     if rank == 0:
+         print(msg)
+     del model_state_dict
+
+     model = model.to(device).eval()
+     vae_model = vae_model.to(device).eval()
+     gen_model = model
+
+     cfg_scale = args.cfg_scale
+     cfg_interval = [0, 1.0]
+     timestep_shift = 3.0
+     num_timesteps = 50
+     cfg_renorm_min = 0.0
+
+     with open(args.metadata_file, "r", encoding="utf-8") as fp:
+         metadatas = [json.loads(line) for line in fp]
+     total_metadatas = len(metadatas)
+
+     prompts_per_gpu = (total_metadatas + world_size - 1) // world_size
+     start = rank * prompts_per_gpu
+     end = min(start + prompts_per_gpu, total_metadatas)
+     print(f"GPU {rank}: Processing {end - start} prompts (indices {start} to {end - 1})")
+
+     for idx in range(start, end):
+         metadata = metadatas[idx]
+         outpath = os.path.join(output_dir, f"{idx:0>5}")
+         os.makedirs(outpath, exist_ok=True)
+         prompt = metadata['prompt']
+         print(f"GPU {rank} processing prompt {idx - start + 1}/{end - start}: '{prompt}'")
+
+         sample_path = os.path.join(outpath, "samples")
+         os.makedirs(sample_path, exist_ok=True)
+
+         # Skip prompts whose samples already exist (use img_idx so the outer loop variable is not shadowed)
+         flag = True
+         for img_idx in range(args.num_images):
+             if not os.path.exists(os.path.join(sample_path, f"{img_idx:05}.png")):
+                 flag = False
+                 break
+         if flag:
+             print(f"GPU {rank} skipping generation for prompt: {prompt}")
+             continue
+
+         with open(os.path.join(outpath, "metadata.jsonl"), "w", encoding="utf-8") as fp:
+             json.dump(metadata, fp)
+
+         image_list = []
+
+         for i in range(args.num_images // args.batch_size):
+             tmp_image_list = generate_image(
+                 prompt=prompt,
+                 cfg_scale=cfg_scale,
+                 cfg_interval=cfg_interval,
+                 cfg_renorm_min=cfg_renorm_min,
+                 timestep_shift=timestep_shift,
+                 num_timesteps=num_timesteps,
+                 num_images=args.batch_size,
+                 resolution=args.resolution,
+                 device=device,
+             )
+             image_list.extend(tmp_image_list)
+
+         sample_count = 0
+         for sample in image_list:
+             sample = sample.crop(sample.getbbox())
+             sample.save(os.path.join(sample_path, f"{sample_count:05}.png"))
+             sample_count += 1
+
+     print(f"GPU {rank} has completed all tasks")
+     dist.barrier()
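The final reshape in generate_image unpacks one token per 2x2 latent patch into a 16-channel VAE latent; a shape-only sketch with a random tensor (no model needed):

    import torch

    resolution = 1024
    latent = torch.randn((resolution // 16) ** 2, 2 * 2 * 16)  # 4096 tokens, 64 dims each
    latent = latent.reshape(1, resolution // 16, resolution // 16, 2, 2, 16)
    latent = torch.einsum("nhwpqc->nchpwq", latent)
    latent = latent.reshape(1, 16, resolution // 8, resolution // 8)
    print(latent.shape)  # torch.Size([1, 16, 128, 128]), decoded by the VAE to 1024x1024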
eval/gen/gen_images_mp_wise.py ADDED
@@ -0,0 +1,365 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import os
+ import json
+ import argparse
+ from safetensors.torch import load_file
+
+ import torch
+ import torch.distributed as dist
+ from data.data_utils import add_special_tokens
+ from modeling.bagel import (
+     BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel
+ )
+ from modeling.qwen2 import Qwen2Tokenizer
+ from modeling.autoencoder import load_ae
+
+ import copy
+ from PIL import Image
+ from modeling.bagel.qwen2_navit import NaiveCache
+
+
+ def setup_distributed():
+     dist.init_process_group(backend="nccl")
+     torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
+
+
+ SYSTEM_PROMPT = '''You should first think about the planning process in the mind and then generate the image.
+ The planning process is enclosed within <think> </think> tags, i.e. <think> planning process here </think> image here'''
+
+
+ def move_generation_input_to_device(generation_input, device):
+     # Utility to move all tensors in generation_input to device
+     for k, v in generation_input.items():
+         if isinstance(v, torch.Tensor):
+             generation_input[k] = v.to(device)
+     return generation_input
+
+
+ def generate_image_with_think(
+     prompt, num_timesteps=50, cfg_scale=4.0, cfg_interval=[0, 1.0], cfg_renorm_min=0., timestep_shift=4.0, resolution=1024,
+     max_length=2048, simple_think=False, device=None
+ ):
+     h, w = resolution, resolution
+
+     past_key_values = NaiveCache(model.config.llm_config.num_hidden_layers)
+     newlens = [0]
+     new_rope = [0]
+
+     # system prompt
+     generation_input, newlens, new_rope = model.prepare_prompts(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         prompts=[SYSTEM_PROMPT],
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         past_key_values = model.forward_cache_update_text(past_key_values, **generation_input)
+
+     ########## cfg
+     generation_input_cfg = model.prepare_vae_latent_cfg(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         image_sizes=[(h, w)],
+     )
+     generation_input_cfg = move_generation_input_to_device(generation_input_cfg, device)
+     ########## cfg
+
+     generation_input, newlens, new_rope = model.prepare_prompts(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         prompts=[prompt],
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         past_key_values = model.forward_cache_update_text(past_key_values, **generation_input)
+
+     ########## think
+     tmp_past_key_values = copy.deepcopy(past_key_values)
+     tmp_newlens = copy.deepcopy(newlens)
+     tmp_new_rope = copy.deepcopy(new_rope)
+     tmp_generation_input, tmp_newlens, tmp_new_rope = model.prepare_prompts(
+         curr_kvlens=tmp_newlens,
+         curr_rope=tmp_new_rope,
+         prompts=[prompt],
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     tmp_generation_input = move_generation_input_to_device(tmp_generation_input, device)
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         tmp_past_key_values = model.forward_cache_update_text(tmp_past_key_values, **tmp_generation_input)
+
+     tmp_generation_input = model.prepare_start_tokens(tmp_newlens, tmp_new_rope, new_token_ids)
+     tmp_generation_input = move_generation_input_to_device(tmp_generation_input, device)
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         unpacked_latent = model.generate_text(
+             past_key_values=tmp_past_key_values,
+             max_length=max_length,
+             do_sample=True,
+             temperature=0.3,
+             end_token_id=new_token_ids['eos_token_id'],
+             **tmp_generation_input,
+         )
+     output = tokenizer.decode(unpacked_latent[:, 0])
+     think_output = output.split('<|im_end|>')[0].split('<|im_start|>')[1]
+
+     print("="*30, "original think", "="*30)
+     print(think_output)
+     if simple_think:
+         think_output_list = think_output.split("</think>")
+         if len(think_output_list) > 1 and think_output_list[1] != "":
+             think_output = think_output_list[1].strip()
+         print("="*30, "processed think", "="*30)
+         print(think_output)
+     ########## think
+
+     generation_input, newlens, new_rope = model.prepare_prompts(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         prompts=[think_output],
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         past_key_values = model.forward_cache_update_text(past_key_values, **generation_input)
+
+     generation_input = model.prepare_vae_latent(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         image_sizes=[(h, w)],
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+
+     ########## generate image
+     with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+         unpacked_latent = model.generate_image(
+             past_key_values=past_key_values,
+             num_timesteps=num_timesteps,
+             cfg_text_scale=cfg_scale,
+             cfg_interval=cfg_interval,
+             timestep_shift=timestep_shift,
+             cfg_renorm_min=cfg_renorm_min,
+             cfg_renorm_type="global",
+             cfg_text_past_key_values=None,
+             cfg_text_packed_position_ids=generation_input_cfg["cfg_packed_position_ids"],
+             cfg_text_key_values_lens=generation_input_cfg["cfg_key_values_lens"],
+             cfg_text_packed_query_indexes=generation_input_cfg["cfg_packed_query_indexes"],
+             cfg_text_packed_key_value_indexes=generation_input_cfg["cfg_packed_key_value_indexes"],
+             **generation_input,
+         )
+
+     latent0 = unpacked_latent[0]
+     latent0 = latent0.reshape(1, h//16, w//16, 2, 2, 16)
+     latent0 = torch.einsum("nhwpqc->nchpwq", latent0)
+     latent0 = latent0.reshape(1, 16, h//8, w//8)
+     image = vae_model.decode(latent0.to(device))
+     tmpimage = ((image * 0.5 + 0.5).clamp(0, 1)[0].permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
+     tmpimage = Image.fromarray(tmpimage)
+
+     return tmpimage, think_output
+
+
+ def generate_image(prompt, num_timesteps=50, cfg_scale=4.0, cfg_interval=[0, 1.0], cfg_renorm_min=0., timestep_shift=1.0, resolution=1024, device=None):
+     past_key_values = NaiveCache(gen_model.config.llm_config.num_hidden_layers)
+     newlens = [0]
+     new_rope = [0]
+
+     generation_input, newlens, new_rope = gen_model.prepare_prompts(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         prompts=[prompt],
+         tokenizer=tokenizer,
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+
+     with torch.no_grad():
+         with torch.amp.autocast("cuda", enabled=True, dtype=torch.float16):
+             past_key_values = gen_model.forward_cache_update_text(past_key_values, **generation_input)
+
+     generation_input = gen_model.prepare_vae_latent(
+         curr_kvlens=newlens,
+         curr_rope=new_rope,
+         image_sizes=[(resolution, resolution)],
+         new_token_ids=new_token_ids,
+     )
+     generation_input = move_generation_input_to_device(generation_input, device)
+
+     cfg_past_key_values = NaiveCache(gen_model.config.llm_config.num_hidden_layers)
+     cfg_newlens = [0]
+     cfg_new_rope = [0]
+
+     generation_input_cfg = gen_model.prepare_vae_latent_cfg(
+         curr_kvlens=cfg_newlens,
+         curr_rope=cfg_new_rope,
+         image_sizes=[(resolution, resolution)],
+     )
+     generation_input_cfg = move_generation_input_to_device(generation_input_cfg, device)
+     with torch.no_grad():
+         with torch.amp.autocast("cuda", enabled=True, dtype=torch.bfloat16):
+             unpacked_latent = gen_model.generate_image(
+                 past_key_values=past_key_values,
+                 num_timesteps=num_timesteps,
+                 cfg_text_scale=cfg_scale,
+                 cfg_interval=cfg_interval,
+                 cfg_renorm_min=cfg_renorm_min,
+                 timestep_shift=timestep_shift,
+                 cfg_text_past_key_values=cfg_past_key_values,
+                 cfg_text_packed_position_ids=generation_input_cfg["cfg_packed_position_ids"],
+                 cfg_text_key_values_lens=generation_input_cfg["cfg_key_values_lens"],
+                 cfg_text_packed_query_indexes=generation_input_cfg["cfg_packed_query_indexes"],
+                 cfg_text_packed_key_value_indexes=generation_input_cfg["cfg_packed_key_value_indexes"],
+                 **generation_input,
+             )
+
+     latent = unpacked_latent[0]
+     latent = latent.reshape(1, resolution//16, resolution//16, 2, 2, 16)
+     latent = torch.einsum("nhwpqc->nchpwq", latent)
+     latent = latent.reshape(1, 16, resolution//8, resolution//8)
+     image = vae_model.decode(latent.to(device))
+     tmpimage = ((image * 0.5 + 0.5).clamp(0, 1)[0].permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
+     tmpimage = Image.fromarray(tmpimage)
+
+     return tmpimage
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Generate images using Bagel model.")
+     parser.add_argument("--output_dir", type=str, required=True, help="Directory to save the generated images.")
+     parser.add_argument("--metadata_file", type=str, required=True, help="JSON file containing a list of metadata entries, one per prompt.")
+     parser.add_argument("--cfg_scale", type=float, default=4)
+     parser.add_argument("--resolution", type=int, default=1024)
+     parser.add_argument("--max_latent_size", type=int, default=64)
+     parser.add_argument("--think", action="store_true")
+     parser.add_argument('--model-path', type=str, default='hf/BAGEL-7B-MoT/')
+     args = parser.parse_args()
+
+     seed = 42
+     if seed is not None:
+         import random
+         import numpy as np
+         random.seed(seed)
+         np.random.seed(seed)
+         torch.manual_seed(seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed(seed)
+             torch.cuda.manual_seed_all(seed)
+         torch.backends.cudnn.deterministic = True
+         torch.backends.cudnn.benchmark = False
+
+     setup_distributed()
+     rank = dist.get_rank()
+     world_size = dist.get_world_size()
+     device = f"cuda:{rank}"
+
+     output_dir = args.output_dir
+     os.makedirs(output_dir, exist_ok=True)
+     if rank == 0:
+         print(f"Output images are saved in {output_dir}")
+
+     llm_config = Qwen2Config.from_json_file(os.path.join(args.model_path, "llm_config.json"))
+     llm_config.qk_norm = True
+     llm_config.tie_word_embeddings = False
+     llm_config.layer_module = "Qwen2MoTDecoderLayer"
+
+     vit_config = SiglipVisionConfig.from_json_file(os.path.join(args.model_path, "vit_config.json"))
+     vit_config.rope = False
+     vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1
+
+     vae_model, vae_config = load_ae(local_path=os.path.join(args.model_path, "ae.safetensors"))
+
+     config = BagelConfig(
+         visual_gen=True,
+         visual_und=True,
+         llm_config=llm_config,
+         vit_config=vit_config,
+         vae_config=vae_config,
+         vit_max_num_patch_per_side=70,
+         connector_act='gelu_pytorch_tanh',
+         latent_patch_size=2,
+         max_latent_size=args.max_latent_size,
+     )
+     language_model = Qwen2ForCausalLM(llm_config)
+     vit_model = SiglipVisionModel(vit_config)
+     model = Bagel(language_model, vit_model, config)
+     model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config)
+
+     tokenizer = Qwen2Tokenizer.from_pretrained(args.model_path)
+     tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
+
+     model_state_dict_path = os.path.join(args.model_path, "ema.safetensors")
+     model_state_dict = load_file(model_state_dict_path, device="cpu")
+     msg = model.load_state_dict(model_state_dict, strict=False)
+     if rank == 0:
+         print(msg)
+
+     del model_state_dict
+     model = model.to(device).eval()
+     vae_model = vae_model.to(device).eval()
+     gen_model = model
+
+     cfg_scale = args.cfg_scale
+     cfg_interval = [0.4, 1.0]
+     timestep_shift = 3.0
+     num_timesteps = 50
+     cfg_renorm_min = 0.0
+
+     with open(args.metadata_file, "r") as f:
+         metadatas = json.load(f)
+     total_metadatas = len(metadatas)
+
+     prompts_per_gpu = (total_metadatas + world_size - 1) // world_size
+     start = rank * prompts_per_gpu
+     end = min(start + prompts_per_gpu, total_metadatas)
+     print(f"GPU {rank}: Processing {end - start} prompts (indices {start} to {end - 1})")
+
+     for idx in range(start, end):
+         metadata = metadatas[idx]
+         prompt = metadata['Prompt']
+         prompt_id = metadata['prompt_id']
+         outpath = os.path.join(output_dir, f"{prompt_id}.png")
+         print(f"GPU {rank} processing prompt {idx - start + 1}/{end - start}: '{prompt}'")
+
+         if os.path.exists(outpath):
+             print(f"GPU {rank} skipping generation for prompt: {prompt}")
+             continue
+
+         if args.think:
+             tmpimage, think_output = generate_image_with_think(
+                 prompt=prompt,
+                 cfg_scale=cfg_scale,
+                 cfg_interval=cfg_interval,
+                 cfg_renorm_min=cfg_renorm_min,
+                 timestep_shift=timestep_shift,
+                 num_timesteps=num_timesteps,
+                 resolution=args.resolution,
+                 max_length=2048,
+                 simple_think=False,
+                 device=device,
+             )
+             with open(outpath.replace(".png", ".txt"), "w") as f:
+                 f.write(think_output)
+         else:
+             tmpimage = generate_image(
+                 prompt=prompt,
+                 cfg_scale=cfg_scale,
+                 cfg_interval=cfg_interval,
+                 cfg_renorm_min=cfg_renorm_min,
+                 timestep_shift=timestep_shift,
+                 num_timesteps=num_timesteps,
+                 resolution=args.resolution,
+                 device=device,
+             )
+
+         tmpimage = tmpimage.crop(tmpimage.getbbox())
+         tmpimage.save(outpath)
+
+     print(f"GPU {rank} has completed all tasks")
+     dist.barrier()
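A standalone check of the think-output extraction used above (string handling only; the sample text is made up):

    raw = "<|im_start|><think> plan the layout </think>A cozy reading nook.<|im_end|>"
    think_output = raw.split('<|im_end|>')[0].split('<|im_start|>')[1]
    print(think_output)  # <think> plan the layout </think>A cozy reading nook.
    # with simple_think=True, only the text after </think> is kept:
    print(think_output.split("</think>")[1].strip())  # A cozy reading nook.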
eval/gen/geneval/evaluation/download_models.sh ADDED
@@ -0,0 +1,20 @@
+ #!/bin/bash
+ # Copyright (c) 2023 Dhruba Ghosh
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/djghosh13/geneval/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ # Download the Mask2Former object detection weights into the directory given as $1
+
+ if [ ! -z "$1" ]
+ then
+     mkdir -p "$1"
+     wget https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth -O "$1/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth"
+ fi
eval/gen/geneval/evaluation/evaluate_images.py ADDED
@@ -0,0 +1,304 @@
+ # Copyright (c) 2023 Dhruba Ghosh
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/djghosh13/geneval/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ """
+ Evaluate generated images using Mask2Former (or another object detector model)
+ """
+
+ import argparse
+ import json
+ import os
+ import re
+ import sys
+ import time
+ from tqdm import tqdm
+
+ import warnings
+ warnings.filterwarnings("ignore")
+
+ import numpy as np
+ import pandas as pd
+ from PIL import Image, ImageOps
+ import torch
+ import mmdet
+ from mmdet.apis import inference_detector, init_detector
+
+ import open_clip
+ from clip_benchmark.metrics import zeroshot_classification as zsc
+ zsc.tqdm = lambda it, *args, **kwargs: it
+
+ # Get directory path
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("imagedir", type=str)
+     parser.add_argument("--outfile", type=str, default="results.jsonl")
+     parser.add_argument("--model-config", type=str, default=None)
+     parser.add_argument("--model-path", type=str, default="./")
+     # Other arguments
+     parser.add_argument("--options", nargs="*", type=str, default=[])
+     args = parser.parse_args()
+     args.options = dict(opt.split("=", 1) for opt in args.options)
+     if args.model_config is None:
+         args.model_config = os.path.join(
+             os.path.dirname(mmdet.__file__),
+             "../configs/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py"
+         )
+     return args
+
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ assert DEVICE == "cuda"
+
+ def timed(fn):
+     def wrapper(*args, **kwargs):
+         startt = time.time()
+         result = fn(*args, **kwargs)
+         endt = time.time()
+         print(f'Function {fn.__name__!r} executed in {endt - startt:.3f}s', file=sys.stderr)
+         return result
+     return wrapper
+
+ # Load models
+
+ @timed
+ def load_models(args):
+     CONFIG_PATH = args.model_config
+     OBJECT_DETECTOR = args.options.get('model', "mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco")
+     CKPT_PATH = os.path.join(args.model_path, f"{OBJECT_DETECTOR}.pth")
+     object_detector = init_detector(CONFIG_PATH, CKPT_PATH, device=DEVICE)
+
+     clip_arch = args.options.get('clip_model', "ViT-L-14")
+     clip_model, _, transform = open_clip.create_model_and_transforms(clip_arch, pretrained="openai", device=DEVICE)
+     tokenizer = open_clip.get_tokenizer(clip_arch)
+
+     with open(os.path.join(os.path.dirname(__file__), "object_names.txt")) as cls_file:
+         classnames = [line.strip() for line in cls_file]
+
+     return object_detector, (clip_model, transform, tokenizer), classnames
+
+
+ COLORS = ["red", "orange", "yellow", "green", "blue", "purple", "pink", "brown", "black", "white"]
+ COLOR_CLASSIFIERS = {}
+
+ # Evaluation parts
+
+ class ImageCrops(torch.utils.data.Dataset):
+     def __init__(self, image: Image.Image, objects):
+         self._image = image.convert("RGB")
+         bgcolor = args.options.get('bgcolor', "#999")
+         if bgcolor == "original":
+             self._blank = self._image.copy()
+         else:
+             self._blank = Image.new("RGB", image.size, color=bgcolor)
+         self._objects = objects
+
+     def __len__(self):
+         return len(self._objects)
+
+     def __getitem__(self, index):
+         box, mask = self._objects[index]
+         if mask is not None:
+             assert tuple(self._image.size[::-1]) == tuple(mask.shape), (index, self._image.size[::-1], mask.shape)
+             image = Image.composite(self._image, self._blank, Image.fromarray(mask))
+         else:
+             image = self._image
+         if args.options.get('crop', '1') == '1':
+             image = image.crop(box[:4])
+         # if args.save:
+         #     base_count = len(os.listdir(args.save))
+         #     image.save(os.path.join(args.save, f"cropped_{base_count:05}.png"))
+         return (transform(image), 0)
+
+
+ def color_classification(image, bboxes, classname):
+     if classname not in COLOR_CLASSIFIERS:
+         COLOR_CLASSIFIERS[classname] = zsc.zero_shot_classifier(
+             clip_model, tokenizer, COLORS,
+             [
+                 f"a photo of a {{c}} {classname}",
+                 f"a photo of a {{c}}-colored {classname}",
+                 f"a photo of a {{c}} object"
+             ],
+             DEVICE
+         )
+     clf = COLOR_CLASSIFIERS[classname]
+     dataloader = torch.utils.data.DataLoader(
+         ImageCrops(image, bboxes),
+         batch_size=16, num_workers=4
+     )
+     with torch.no_grad():
+         pred, _ = zsc.run_classification(clip_model, clf, dataloader, DEVICE)
+         return [COLORS[index.item()] for index in pred.argmax(1)]
+
+
+ def compute_iou(box_a, box_b):
+     area_fn = lambda box: max(box[2] - box[0] + 1, 0) * max(box[3] - box[1] + 1, 0)
+     i_area = area_fn([
+         max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
+         min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
+     ])
+     u_area = area_fn(box_a) + area_fn(box_b) - i_area
+     return i_area / u_area if u_area else 0
+
+
+ def relative_position(obj_a, obj_b):
+     """Give position of A relative to B, factoring in object dimensions"""
+     boxes = np.array([obj_a[0], obj_b[0]])[:, :4].reshape(2, 2, 2)
+     center_a, center_b = boxes.mean(axis=-2)
+     dim_a, dim_b = np.abs(np.diff(boxes, axis=-2))[..., 0, :]
+     offset = center_a - center_b
+     # Ignore offsets smaller than a fraction of the combined object dimensions
+     revised_offset = np.maximum(np.abs(offset) - POSITION_THRESHOLD * (dim_a + dim_b), 0) * np.sign(offset)
+     if np.all(np.abs(revised_offset) < 1e-3):
+         return set()
+     #
+     dx, dy = revised_offset / np.linalg.norm(offset)
+     relations = set()
+     if dx < -0.5: relations.add("left of")
+     if dx > 0.5: relations.add("right of")
+     if dy < -0.5: relations.add("above")
+     if dy > 0.5: relations.add("below")
+     return relations
+
+
+ def evaluate(image, objects, metadata):
+     """
+     Evaluate given image using detected objects on the global metadata specifications.
+     Assumptions:
+     * Metadata combines 'include' clauses with AND, and 'exclude' clauses with OR
+     * All clauses are independent, i.e., duplicating a clause has no effect on the correctness
+     * CHANGED: Color and position will only be evaluated on the most confidently predicted objects;
+       therefore, objects are expected to appear in sorted order
+     """
+     correct = True
+     reason = []
+     matched_groups = []
+     # Check for expected objects
+     for req in metadata.get('include', []):
+         classname = req['class']
+         matched = True
+         found_objects = objects.get(classname, [])[:req['count']]
+         if len(found_objects) < req['count']:
+             correct = matched = False
+             reason.append(f"expected {classname}>={req['count']}, found {len(found_objects)}")
+         else:
+             if 'color' in req:
+                 # Color check
+                 colors = color_classification(image, found_objects, classname)
+                 if colors.count(req['color']) < req['count']:
+                     correct = matched = False
+                     reason.append(
+                         f"expected {req['color']} {classname}>={req['count']}, found " +
+                         f"{colors.count(req['color'])} {req['color']}; and " +
+                         ", ".join(f"{colors.count(c)} {c}" for c in COLORS if c in colors)
+                     )
+             if 'position' in req and matched:
+                 # Relative position check
+                 expected_rel, target_group = req['position']
+                 if matched_groups[target_group] is None:
+                     correct = matched = False
+                     reason.append(f"no target for {classname} to be {expected_rel}")
+                 else:
+                     for obj in found_objects:
+                         for target_obj in matched_groups[target_group]:
+                             true_rels = relative_position(obj, target_obj)
+                             if expected_rel not in true_rels:
+                                 correct = matched = False
+                                 reason.append(
+                                     f"expected {classname} {expected_rel} target, found " +
+                                     f"{' and '.join(true_rels)} target"
+                                 )
+                                 break
+                         if not matched:
+                             break
+         if matched:
+             matched_groups.append(found_objects)
+         else:
+             matched_groups.append(None)
+     # Check for non-expected objects
+     for req in metadata.get('exclude', []):
+         classname = req['class']
+         if len(objects.get(classname, [])) >= req['count']:
+             correct = False
+             reason.append(f"expected {classname}<{req['count']}, found {len(objects[classname])}")
+     return correct, "\n".join(reason)
+
+
+ def evaluate_image(filepath, metadata):
+     result = inference_detector(object_detector, filepath)
+     bbox = result[0] if isinstance(result, tuple) else result
+     segm = result[1] if isinstance(result, tuple) and len(result) > 1 else None
+     image = ImageOps.exif_transpose(Image.open(filepath))
+     detected = {}
+     # Determine bounding boxes to keep
+     confidence_threshold = THRESHOLD if metadata['tag'] != "counting" else COUNTING_THRESHOLD
+     for index, classname in enumerate(classnames):
+         ordering = np.argsort(bbox[index][:, 4])[::-1]
+         ordering = ordering[bbox[index][ordering, 4] > confidence_threshold]  # Threshold
+         ordering = ordering[:MAX_OBJECTS].tolist()  # Limit number of detected objects per class
+         detected[classname] = []
+         while ordering:
+             max_obj = ordering.pop(0)
+             detected[classname].append((bbox[index][max_obj], None if segm is None else segm[index][max_obj]))
+             ordering = [
+                 obj for obj in ordering
+                 if NMS_THRESHOLD == 1 or compute_iou(bbox[index][max_obj], bbox[index][obj]) < NMS_THRESHOLD
+             ]
+         if not detected[classname]:
+             del detected[classname]
+     # Evaluate
+     is_correct, reason = evaluate(image, detected, metadata)
+     return {
+         'filename': filepath,
+         'tag': metadata['tag'],
+         'prompt': metadata['prompt'],
+         'correct': is_correct,
+         'reason': reason,
+         'metadata': json.dumps(metadata),
+         'details': json.dumps({
+             key: [box.tolist() for box, _ in value]
+             for key, value in detected.items()
+         })
+     }
+
+
+ def main(args):
+     full_results = []
+     for subfolder in tqdm(os.listdir(args.imagedir)):
+         folderpath = os.path.join(args.imagedir, subfolder)
+         if not os.path.isdir(folderpath) or not subfolder.isdigit():
+             continue
+         with open(os.path.join(folderpath, "metadata.jsonl")) as fp:
+             metadata = json.load(fp)
+         # Evaluate each image
+         for imagename in os.listdir(os.path.join(folderpath, "samples")):
+             imagepath = os.path.join(folderpath, "samples", imagename)
+             if not os.path.isfile(imagepath) or not re.match(r"\d+\.png", imagename):
+                 continue
+             result = evaluate_image(imagepath, metadata)
+             full_results.append(result)
+     # Save results
+     if os.path.dirname(args.outfile):
+         os.makedirs(os.path.dirname(args.outfile), exist_ok=True)
+     with open(args.outfile, "w") as fp:
+         pd.DataFrame(full_results).to_json(fp, orient="records", lines=True)
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+     object_detector, (clip_model, transform, tokenizer), classnames = load_models(args)
+     THRESHOLD = float(args.options.get('threshold', 0.3))
+     COUNTING_THRESHOLD = float(args.options.get('counting_threshold', 0.9))
+     MAX_OBJECTS = int(args.options.get('max_objects', 16))
+     NMS_THRESHOLD = float(args.options.get('max_overlap', 1.0))
+     POSITION_THRESHOLD = float(args.options.get('position_threshold', 0.1))
+
+     main(args)
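A standalone sanity check of the geometry helpers above (NumPy only; the boxes are made up, in [x0, y0, x1, y1, score] format, and POSITION_THRESHOLD = 0.1 as in the defaults):

    import numpy as np

    box_a = [0, 0, 10, 10, 0.9]
    box_b = [5, 0, 15, 10, 0.8]
    print(compute_iou(box_a, box_b))  # 0.375: a 6x11 overlap (inclusive pixel coords) over the union

    # relative_position takes (box, mask) tuples
    obj_a = (np.array([0, 0, 10, 10, 0.9]), None)
    obj_b = (np.array([30, 0, 40, 10, 0.8]), None)
    print(relative_position(obj_a, obj_b))  # {'left of'}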
eval/gen/geneval/evaluation/evaluate_images_mp.py ADDED
@@ -0,0 +1,332 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) 2023 Dhruba Ghosh
2
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
3
+ # SPDX-License-Identifier: MIT
4
+ #
5
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
6
+ #
7
+ # Original file was released under MIT, with the full license text
8
+ # available at https://github.com/djghosh13/geneval/blob/main/LICENSE.
9
+ #
10
+ # This modified file is released under the same license.
11
+
12
+ import argparse
13
+ import json
14
+ import os
15
+ import re
16
+ import sys
17
+ import time
18
+ from tqdm import tqdm
19
+
20
+ import warnings
21
+ warnings.filterwarnings("ignore")
22
+
23
+ import numpy as np
24
+ import pandas as pd
25
+ from PIL import Image, ImageOps
26
+ import torch
27
+ import torch.distributed as dist
28
+ import mmdet
29
+ from mmdet.apis import inference_detector, init_detector
30
+
31
+ import open_clip
32
+ from clip_benchmark.metrics import zeroshot_classification as zsc
33
+ zsc.tqdm = lambda it, *args, **kwargs: it
34
+
35
+
36
+ def setup_distributed():
37
+ """初始化分布式环境"""
38
+ dist.init_process_group(backend="nccl")
39
+ torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
40
+
41
+
42
+ # Get directory path
43
+
44
+ def parse_args():
45
+ parser = argparse.ArgumentParser()
46
+ parser.add_argument("imagedir", type=str)
47
+ parser.add_argument("--outfile", type=str, default="results.jsonl")
48
+ parser.add_argument("--model-config", type=str, default=None)
49
+ parser.add_argument("--model-path", type=str, default="./")
50
+ # Other arguments
51
+ parser.add_argument("--options", nargs="*", type=str, default=[])
52
+ args = parser.parse_args()
53
+ args.options = dict(opt.split("=", 1) for opt in args.options)
54
+ if args.model_config is None:
55
+ args.model_config = os.path.join(
56
+ os.path.dirname(mmdet.__file__),
57
+ "../configs/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py"
58
+ )
59
+ return args
60
+
61
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
62
+ assert DEVICE == "cuda"
63
+
64
+ def timed(fn):
65
+ def wrapper(*args, **kwargs):
66
+ startt = time.time()
67
+ result = fn(*args, **kwargs)
68
+ endt = time.time()
69
+ print(f'Function {fn.__name__!r} executed in {endt - startt:.3f}s', file=sys.stderr)
70
+ return result
71
+ return wrapper
72
+
73
+ # Load models
74
+
75
+ @timed
76
+ def load_models(args):
77
+ CONFIG_PATH = args.model_config
78
+ OBJECT_DETECTOR = args.options.get('model', "mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco")
79
+ CKPT_PATH = os.path.join(args.model_path, f"{OBJECT_DETECTOR}.pth")
80
+ object_detector = init_detector(CONFIG_PATH, CKPT_PATH, device=DEVICE)
81
+
82
+ clip_arch = args.options.get('clip_model', "ViT-L-14")
83
+ clip_model, _, transform = open_clip.create_model_and_transforms(clip_arch, pretrained="openai", device=DEVICE)
84
+ tokenizer = open_clip.get_tokenizer(clip_arch)
85
+
86
+ with open(os.path.join(os.path.dirname(__file__), "object_names.txt")) as cls_file:
87
+ classnames = [line.strip() for line in cls_file]
88
+
89
+ return object_detector, (clip_model, transform, tokenizer), classnames
90
+
91
+
92
+ COLORS = ["red", "orange", "yellow", "green", "blue", "purple", "pink", "brown", "black", "white"]
93
+ COLOR_CLASSIFIERS = {}
94
+
95
+ # Evaluation parts
96
+
97
+ class ImageCrops(torch.utils.data.Dataset):
98
+ def __init__(self, image: Image.Image, objects):
99
+ self._image = image.convert("RGB")
100
+ bgcolor = args.options.get('bgcolor', "#999")
101
+ if bgcolor == "original":
102
+ self._blank = self._image.copy()
103
+ else:
104
+ self._blank = Image.new("RGB", image.size, color=bgcolor)
105
+ self._objects = objects
106
+
107
+ def __len__(self):
108
+ return len(self._objects)
109
+
110
+ def __getitem__(self, index):
111
+ box, mask = self._objects[index]
112
+ if mask is not None:
113
+ assert tuple(self._image.size[::-1]) == tuple(mask.shape), (index, self._image.size[::-1], mask.shape)
114
+ image = Image.composite(self._image, self._blank, Image.fromarray(mask))
115
+ else:
116
+ image = self._image
117
+ if args.options.get('crop', '1') == '1':
118
+ image = image.crop(box[:4])
119
+ # if args.save:
120
+ # base_count = len(os.listdir(args.save))
121
+ # image.save(os.path.join(args.save, f"cropped_{base_count:05}.png"))
122
+ return (transform(image), 0)
123
+
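+ # Note: ImageCrops yields (image_tensor, dummy_label) pairs; each detected object
+ # is composited onto the background color via its mask and then cropped, so the
+ # CLIP color classifier sees only that object.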
124
+
125
+ def color_classification(image, bboxes, classname):
126
+ if classname not in COLOR_CLASSIFIERS:
127
+ COLOR_CLASSIFIERS[classname] = zsc.zero_shot_classifier(
128
+ clip_model, tokenizer, COLORS,
129
+ [
130
+ f"a photo of a {{c}} {classname}",
131
+ f"a photo of a {{c}}-colored {classname}",
132
+ f"a photo of a {{c}} object"
133
+ ],
134
+ DEVICE
135
+ )
136
+ clf = COLOR_CLASSIFIERS[classname]
137
+ dataloader = torch.utils.data.DataLoader(
138
+ ImageCrops(image, bboxes),
139
+ batch_size=16, num_workers=4
140
+ )
141
+ with torch.no_grad():
142
+ pred, _ = zsc.run_classification(clip_model, clf, dataloader, DEVICE)
143
+ return [COLORS[index.item()] for index in pred.argmax(1)]
144
+
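+ # The zero-shot color classifier for each classname is built once from the three
+ # prompt templates above and cached in COLOR_CLASSIFIERS for reuse across images.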
145
+
146
+ def compute_iou(box_a, box_b):
147
+ area_fn = lambda box: max(box[2] - box[0] + 1, 0) * max(box[3] - box[1] + 1, 0)
148
+ i_area = area_fn([
149
+ max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
150
+ min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
151
+ ])
152
+ u_area = area_fn(box_a) + area_fn(box_b) - i_area
153
+ return i_area / u_area if u_area else 0
154
+
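+ # Worked example (illustrative; boxes use inclusive pixel coordinates, hence the +1):
+ #   box_a = [0, 0, 10, 10] -> area 121;  box_b = [5, 5, 15, 15] -> area 121
+ #   intersection box = [5, 5, 10, 10] -> area 36;  union = 121 + 121 - 36 = 206
+ #   compute_iou(box_a, box_b) == 36 / 206 ≈ 0.175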
155
+
156
+ def relative_position(obj_a, obj_b):
157
+ """Give position of A relative to B, factoring in object dimensions"""
158
+ boxes = np.array([obj_a[0], obj_b[0]])[:, :4].reshape(2, 2, 2)
159
+ center_a, center_b = boxes.mean(axis=-2)
160
+ dim_a, dim_b = np.abs(np.diff(boxes, axis=-2))[..., 0, :]
161
+ offset = center_a - center_b
162
+ #
163
+ revised_offset = np.maximum(np.abs(offset) - POSITION_THRESHOLD * (dim_a + dim_b), 0) * np.sign(offset)
164
+ if np.all(np.abs(revised_offset) < 1e-3):
165
+ return set()
166
+ #
167
+ dx, dy = revised_offset / np.linalg.norm(offset)
168
+ relations = set()
169
+ if dx < -0.5: relations.add("left of")
170
+ if dx > 0.5: relations.add("right of")
171
+ if dy < -0.5: relations.add("above")
172
+ if dy > 0.5: relations.add("below")
173
+ return relations
174
+
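+ # Worked example (illustrative; assumes POSITION_THRESHOLD = 0.1 and 5-element
+ # detection boxes [x1, y1, x2, y2, score]):
+ #   obj_a = ([0, 0, 10, 10, 0.9], None);  obj_b = ([20, 0, 30, 10, 0.9], None)
+ #   centers differ by (-20, 0), dims sum to (20, 20) -> revised offset (-18, 0)
+ #   dx = -18 / 20 = -0.9 < -0.5, so relative_position(obj_a, obj_b) == {"left of"}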
175
+
176
+ def evaluate(image, objects, metadata):
177
+ """
178
+ Evaluate given image using detected objects on the global metadata specifications.
179
+ Assumptions:
180
+ * Metadata combines 'include' clauses with AND, and 'exclude' clauses with OR
181
+ * All clauses are independent, i.e., duplicating a clause has no effect on the correctness
182
+ * CHANGED: Color and position will only be evaluated on the most confidently predicted objects;
183
+ therefore, objects are expected to appear in sorted order
184
+ """
185
+ correct = True
186
+ reason = []
187
+ matched_groups = []
188
+ # Check for expected objects
189
+ for req in metadata.get('include', []):
190
+ classname = req['class']
191
+ matched = True
192
+ found_objects = objects.get(classname, [])[:req['count']]
193
+ if len(found_objects) < req['count']:
194
+ correct = matched = False
195
+ reason.append(f"expected {classname}>={req['count']}, found {len(found_objects)}")
196
+ else:
197
+ if 'color' in req:
198
+ # Color check
199
+ colors = color_classification(image, found_objects, classname)
200
+ if colors.count(req['color']) < req['count']:
201
+ correct = matched = False
202
+ reason.append(
203
+ f"expected {req['color']} {classname}>={req['count']}, found " +
204
+ f"{colors.count(req['color'])} {req['color']}; and " +
205
+ ", ".join(f"{colors.count(c)} {c}" for c in COLORS if c in colors)
206
+ )
207
+ if 'position' in req and matched:
208
+ # Relative position check
209
+ expected_rel, target_group = req['position']
210
+ if matched_groups[target_group] is None:
211
+ correct = matched = False
212
+ reason.append(f"no target for {classname} to be {expected_rel}")
213
+ else:
214
+ for obj in found_objects:
215
+ for target_obj in matched_groups[target_group]:
216
+ true_rels = relative_position(obj, target_obj)
217
+ if expected_rel not in true_rels:
218
+ correct = matched = False
219
+ reason.append(
220
+ f"expected {classname} {expected_rel} target, found " +
221
+ f"{' and '.join(true_rels)} target"
222
+ )
223
+ break
224
+ if not matched:
225
+ break
226
+ if matched:
227
+ matched_groups.append(found_objects)
228
+ else:
229
+ matched_groups.append(None)
230
+ # Check for non-expected objects
231
+ for req in metadata.get('exclude', []):
232
+ classname = req['class']
233
+ if len(objects.get(classname, [])) >= req['count']:
234
+ correct = False
235
+ reason.append(f"expected {classname}<{req['count']}, found {len(objects[classname])}")
236
+ return correct, "\n".join(reason)
237
+
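+ # Illustrative metadata record (same schema as evaluation_metadata.jsonl below):
+ #   {"tag": "color_attr",
+ #    "include": [{"class": "car", "count": 1, "color": "red"},
+ #                {"class": "dog", "count": 1, "color": "blue"}],
+ #    "prompt": "a photo of a red car and a blue dog"}
+ # An image passes only if every 'include' clause holds and no 'exclude' clause fires.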
238
+
239
+ def evaluate_image(filepath, metadata):
240
+ result = inference_detector(object_detector, filepath)
241
+ bbox = result[0] if isinstance(result, tuple) else result
242
+ segm = result[1] if isinstance(result, tuple) and len(result) > 1 else None
243
+ image = ImageOps.exif_transpose(Image.open(filepath))
244
+ detected = {}
245
+ # Determine bounding boxes to keep
246
+ confidence_threshold = THRESHOLD if metadata['tag'] != "counting" else COUNTING_THRESHOLD
247
+ for index, classname in enumerate(classnames):
248
+ ordering = np.argsort(bbox[index][:, 4])[::-1]
249
+ ordering = ordering[bbox[index][ordering, 4] > confidence_threshold] # Threshold
250
+ ordering = ordering[:MAX_OBJECTS].tolist() # Limit number of detected objects per class
251
+ detected[classname] = []
252
+ while ordering:
253
+ max_obj = ordering.pop(0)
254
+ detected[classname].append((bbox[index][max_obj], None if segm is None else segm[index][max_obj]))
255
+ ordering = [
256
+ obj for obj in ordering
257
+ if NMS_THRESHOLD == 1 or compute_iou(bbox[index][max_obj], bbox[index][obj]) < NMS_THRESHOLD
258
+ ]
259
+ if not detected[classname]:
260
+ del detected[classname]
261
+ # Evaluate
262
+ is_correct, reason = evaluate(image, detected, metadata)
263
+ return {
264
+ 'filename': filepath,
265
+ 'tag': metadata['tag'],
266
+ 'prompt': metadata['prompt'],
267
+ 'correct': is_correct,
268
+ 'reason': reason,
269
+ 'metadata': json.dumps(metadata),
270
+ 'details': json.dumps({
271
+ key: [box.tolist() for box, _ in value]
272
+ for key, value in detected.items()
273
+ })
274
+ }
275
+
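+ # Note on the greedy suppression loop above: per class, boxes are visited in
+ # descending confidence; a box is dropped if its IoU with an already-kept box
+ # reaches NMS_THRESHOLD. With the default max_overlap=1.0 the IoU check is
+ # skipped entirely, so only the confidence threshold and MAX_OBJECTS cap apply.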
276
+
277
+ if __name__ == "__main__":
278
+ args = parse_args()
279
+ THRESHOLD = float(args.options.get('threshold', 0.3))
280
+ COUNTING_THRESHOLD = float(args.options.get('counting_threshold', 0.9))
281
+ MAX_OBJECTS = int(args.options.get('max_objects', 16))
282
+ NMS_THRESHOLD = float(args.options.get('max_overlap', 1.0))
283
+ POSITION_THRESHOLD = float(args.options.get('position_threshold', 0.1))
284
+
285
+ # Initialize distributed environment
286
+ setup_distributed()
287
+ rank = dist.get_rank()
288
+ world_size = dist.get_world_size()
289
+ device = f"cuda:{rank}"
290
+
291
+ # Load models
292
+ if rank == 0:
293
+ print(f"[Rank 0] Loading model...")
294
+ object_detector, (clip_model, transform, tokenizer), classnames = load_models(args)
295
+
296
+ full_results = []
297
+ subfolders = sorted(f for f in os.listdir(args.imagedir) if os.path.isdir(os.path.join(args.imagedir, f)) and f.isdigit())  # sort so every rank sees the same order
298
+ total_subfolders = len(subfolders)
299
+ # Divide subfolders to process by GPU
300
+ subfolders_per_gpu = (total_subfolders + world_size - 1) // world_size
301
+ start = rank * subfolders_per_gpu
302
+ end = min(start + subfolders_per_gpu, total_subfolders)
303
+ print(f"GPU {rank}: Processing {end - start} subfolders (index {start} to {end - 1})")
304
+
305
+ for subfolder in tqdm(subfolders[start:end]):
306
+ folderpath = os.path.join(args.imagedir, subfolder)
307
+ with open(os.path.join(folderpath, "metadata.jsonl")) as fp:
308
+ metadata = json.load(fp)
309
+ # Evaluate each image
310
+ for imagename in os.listdir(os.path.join(folderpath, "samples")):
311
+ imagepath = os.path.join(folderpath, "samples", imagename)
312
+ if not os.path.isfile(imagepath) or not re.match(r"\d+\.png", imagename):
313
+ continue
314
+ result = evaluate_image(imagepath, metadata)
315
+ full_results.append(result)
316
+
317
+ # Synchronize results from all GPUs
318
+ all_results = [None] * world_size
319
+ dist.all_gather_object(all_results, full_results)
320
+ if rank == 0:
321
+ # Merge results from all GPUs
322
+ final_results = []
323
+ for results in all_results:
324
+ final_results.extend(results)
325
+ # Save results
326
+ if os.path.dirname(args.outfile):
327
+ os.makedirs(os.path.dirname(args.outfile), exist_ok=True)
328
+ with open(args.outfile, "w") as fp:
329
+ pd.DataFrame(final_results).to_json(fp, orient="records", lines=True)
330
+ print("All GPUs have completed their tasks and the final results have been saved.")
331
+ else:
332
+ print(f"GPU {rank} has completed all tasks")
eval/gen/geneval/evaluation/object_names.txt ADDED
@@ -0,0 +1,80 @@
1
+ person
2
+ bicycle
3
+ car
4
+ motorcycle
5
+ airplane
6
+ bus
7
+ train
8
+ truck
9
+ boat
10
+ traffic light
11
+ fire hydrant
12
+ stop sign
13
+ parking meter
14
+ bench
15
+ bird
16
+ cat
17
+ dog
18
+ horse
19
+ sheep
20
+ cow
21
+ elephant
22
+ bear
23
+ zebra
24
+ giraffe
25
+ backpack
26
+ umbrella
27
+ handbag
28
+ tie
29
+ suitcase
30
+ frisbee
31
+ skis
32
+ snowboard
33
+ sports ball
34
+ kite
35
+ baseball bat
36
+ baseball glove
37
+ skateboard
38
+ surfboard
39
+ tennis racket
40
+ bottle
41
+ wine glass
42
+ cup
43
+ fork
44
+ knife
45
+ spoon
46
+ bowl
47
+ banana
48
+ apple
49
+ sandwich
50
+ orange
51
+ broccoli
52
+ carrot
53
+ hot dog
54
+ pizza
55
+ donut
56
+ cake
57
+ chair
58
+ couch
59
+ potted plant
60
+ bed
61
+ dining table
62
+ toilet
63
+ tv
64
+ laptop
65
+ computer mouse
66
+ tv remote
67
+ computer keyboard
68
+ cell phone
69
+ microwave
70
+ oven
71
+ toaster
72
+ sink
73
+ refrigerator
74
+ book
75
+ clock
76
+ vase
77
+ scissors
78
+ teddy bear
79
+ hair drier
80
+ toothbrush
eval/gen/geneval/evaluation/summary_scores.py ADDED
@@ -0,0 +1,64 @@
1
+ # Copyright (c) 2023 Dhruba Ghosh
2
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
3
+ # SPDX-License-Identifier: MIT
4
+ #
5
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
6
+ #
7
+ # Original file was released under MIT, with the full license text
8
+ # available at https://github.com/djghosh13/geneval/blob/main/LICENSE.
9
+ #
10
+ # This modified file is released under the same license.
11
+
12
+ import argparse
13
+ import os
14
+
15
+ import numpy as np
16
+ import pandas as pd
17
+
18
+
19
+ parser = argparse.ArgumentParser()
20
+ parser.add_argument("filename", type=str)
21
+ args = parser.parse_args()
22
+
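+ # Illustrative usage (an assumption): point this at the per-image results file
+ # written by evaluate_images_mp.py, e.g.
+ #   python summary_scores.py results.jsonl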
23
+ # Load classnames
24
+
25
+ with open(os.path.join(os.path.dirname(__file__), "object_names.txt")) as cls_file:
26
+ classnames = [line.strip() for line in cls_file]
27
+ cls_to_idx = {"_".join(cls.split()):idx for idx, cls in enumerate(classnames)}
28
+
29
+ # Load results
30
+
31
+ df = pd.read_json(args.filename, orient="records", lines=True)
32
+
33
+ # Measure overall success
34
+
35
+ print("Summary")
36
+ print("=======")
37
+ print(f"Total images: {len(df)}")
38
+ print(f"Total prompts: {len(df.groupby('metadata'))}")
39
+ print(f"% correct images: {df['correct'].mean():.2%}")
40
+ print(f"% correct prompts: {df.groupby('metadata')['correct'].any().mean():.2%}")
41
+ print()
42
+
43
+ # By group
44
+
45
+ task_scores = []
46
+
47
+ print("Task breakdown")
48
+ print("==============")
49
+ for tag, task_df in df.groupby('tag', sort=False):
50
+ task_scores.append(task_df['correct'].mean())
51
+ print(f"{tag:<16} = {task_df['correct'].mean():.2%} ({task_df['correct'].sum()} / {len(task_df)})")
52
+ print()
53
+
54
+ print(f"Overall score (avg. over tasks): {np.mean(task_scores):.5f}")
55
+
56
+
57
+ print("\n\n==============")
58
+ output_info = "SO TO CT CL POS ATTR ALL\n"
59
+ for score in task_scores:
60
+ output_info += f"{score:.2f} "
61
+ output_info += f"{np.mean(task_scores):.2f}" + "\n"
62
+ print(output_info)
63
+ with open(os.path.join(os.path.dirname(args.filename), "geneval_results.txt"), "w") as f:
64
+ f.write(output_info)
eval/gen/geneval/prompts/create_prompts.py ADDED
@@ -0,0 +1,194 @@
1
+ # Copyright (c) 2023 Dhruba Ghosh
2
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
3
+ # SPDX-License-Identifier: MIT
4
+ #
5
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
6
+ #
7
+ # Original file was released under MIT, with the full license text
8
+ # available at https://github.com/djghosh13/geneval/blob/main/LICENSE.
9
+ #
10
+ # This modified file is released under the same license.
11
+
12
+ """
13
+ Generate prompts for evaluation
14
+ """
15
+
16
+ import argparse
17
+ import json
18
+ import os
19
+ import yaml
20
+
21
+ import numpy as np
22
+
23
+ # Load classnames
24
+
25
+ with open("object_names.txt") as cls_file:
26
+ classnames = [line.strip() for line in cls_file]
27
+
28
+ # Proper a vs an
29
+
30
+ def with_article(name: str):
31
+ if name[0] in "aeiou":
32
+ return f"an {name}"
33
+ return f"a {name}"
34
+
35
+ # Proper plural
36
+
37
+ def make_plural(name: str):
38
+ if name[-1] in "s":
39
+ return f"{name}es"
40
+ return f"{name}s"
41
+
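+ # Illustrative outputs: with_article("apple") -> "an apple", with_article("dog")
+ # -> "a dog"; make_plural("bus") -> "buses", but make_plural("toothbrush") ->
+ # "toothbrushs"; the generated prompts below keep this naive spelling.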
42
+ # Generates single object samples
43
+
44
+ def generate_single_object_sample(rng: np.random.Generator, size: int = None):
45
+ TAG = "single_object"
46
+ if size > len(classnames):
47
+ size = len(classnames)
48
+ print(f"Not enough distinct classes, generating only {size} samples")
49
+ return_scalar = size is None
50
+ size = size or 1
51
+ idxs = rng.choice(len(classnames), size=size, replace=False)
52
+ samples = [dict(
53
+ tag=TAG,
54
+ include=[
55
+ {"class": classnames[idx], "count": 1}
56
+ ],
57
+ prompt=f"a photo of {with_article(classnames[idx])}"
58
+ ) for idx in idxs]
59
+ if return_scalar:
60
+ return samples[0]
61
+ return samples
62
+
63
+ # Generate two object samples
64
+
65
+ def generate_two_object_sample(rng: np.random.Generator):
66
+ TAG = "two_object"
67
+ idx_a, idx_b = rng.choice(len(classnames), size=2, replace=False)
68
+ return dict(
69
+ tag=TAG,
70
+ include=[
71
+ {"class": classnames[idx_a], "count": 1},
72
+ {"class": classnames[idx_b], "count": 1}
73
+ ],
74
+ prompt=f"a photo of {with_article(classnames[idx_a])} and {with_article(classnames[idx_b])}"
75
+ )
76
+
77
+ # Generate counting samples
78
+
79
+ numbers = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
80
+
81
+ def generate_counting_sample(rng: np.random.Generator, max_count=4):
82
+ TAG = "counting"
83
+ idx = rng.choice(len(classnames))
84
+ num = int(rng.integers(2, max_count, endpoint=True))
85
+ return dict(
86
+ tag=TAG,
87
+ include=[
88
+ {"class": classnames[idx], "count": num}
89
+ ],
90
+ exclude=[
91
+ {"class": classnames[idx], "count": num + 1}
92
+ ],
93
+ prompt=f"a photo of {numbers[num]} {make_plural(classnames[idx])}"
94
+ )
95
+
96
+ # Generate color samples
97
+
98
+ colors = ["red", "orange", "yellow", "green", "blue", "purple", "pink", "brown", "black", "white"]
99
+
100
+ def generate_color_sample(rng: np.random.Generator):
101
+ TAG = "colors"
102
+ idx = rng.choice(len(classnames) - 1) + 1
103
+ idx = (idx + classnames.index("person")) % len(classnames) # No "[COLOR] person" prompts
104
+ color = colors[rng.choice(len(colors))]
105
+ return dict(
106
+ tag=TAG,
107
+ include=[
108
+ {"class": classnames[idx], "count": 1, "color": color}
109
+ ],
110
+ prompt=f"a photo of {with_article(color)} {classnames[idx]}"
111
+ )
112
+
113
+ # Generate position samples
114
+
115
+ positions = ["left of", "right of", "above", "below"]
116
+
117
+ def generate_position_sample(rng: np.random.Generator):
118
+ TAG = "position"
119
+ idx_a, idx_b = rng.choice(len(classnames), size=2, replace=False)
120
+ position = positions[rng.choice(len(positions))]
121
+ return dict(
122
+ tag=TAG,
123
+ include=[
124
+ {"class": classnames[idx_b], "count": 1},
125
+ {"class": classnames[idx_a], "count": 1, "position": (position, 0)}
126
+ ],
127
+ prompt=f"a photo of {with_article(classnames[idx_a])} {position} {with_article(classnames[idx_b])}"
128
+ )
129
+
130
+ # Generate color attribution samples
131
+
132
+ def generate_color_attribution_sample(rng: np.random.Generator):
133
+ TAG = "color_attr"
134
+ idxs = rng.choice(len(classnames) - 1, size=2, replace=False) + 1
135
+ idx_a, idx_b = (idxs + classnames.index("person")) % len(classnames) # No "[COLOR] person" prompts
136
+ cidx_a, cidx_b = rng.choice(len(colors), size=2, replace=False)
137
+ return dict(
138
+ tag=TAG,
139
+ include=[
140
+ {"class": classnames[idx_a], "count": 1, "color": colors[cidx_a]},
141
+ {"class": classnames[idx_b], "count": 1, "color": colors[cidx_b]}
142
+ ],
143
+ prompt=f"a photo of {with_article(colors[cidx_a])} {classnames[idx_a]} and {with_article(colors[cidx_b])} {classnames[idx_b]}"
144
+ )
145
+
146
+
147
+ # Generate evaluation suite
148
+
149
+ def generate_suite(rng: np.random.Generator, n: int = 100, output_path: str = ""):
150
+ samples = []
151
+ # Generate single object samples for all COCO classnames
152
+ samples.extend(generate_single_object_sample(rng, size=len(classnames)))
153
+ # Generate two object samples (~100)
154
+ for _ in range(n):
155
+ samples.append(generate_two_object_sample(rng))
156
+ # Generate counting samples
157
+ for _ in range(n):
158
+ samples.append(generate_counting_sample(rng, max_count=4))
159
+ # Generate color samples
160
+ for _ in range(n):
161
+ samples.append(generate_color_sample(rng))
162
+ # Generate position samples
163
+ for _ in range(n):
164
+ samples.append(generate_position_sample(rng))
165
+ # Generate color attribution samples
166
+ for _ in range(n):
167
+ samples.append(generate_color_attribution_sample(rng))
168
+ # De-duplicate
169
+ unique_samples, used_samples = [], set()
170
+ for sample in samples:
171
+ sample_text = yaml.safe_dump(sample)
172
+ if sample_text not in used_samples:
173
+ unique_samples.append(sample)
174
+ used_samples.add(sample_text)
175
+
176
+ # Write to files
177
+ os.makedirs(output_path, exist_ok=True)
178
+ with open(os.path.join(output_path, "generation_prompts.txt"), "w") as fp:
179
+ for sample in unique_samples:
180
+ print(sample['prompt'], file=fp)
181
+ with open(os.path.join(output_path, "evaluation_metadata.jsonl"), "w") as fp:
182
+ for sample in unique_samples:
183
+ print(json.dumps(sample), file=fp)
184
+
185
+
186
+ if __name__ == "__main__":
187
+ parser = argparse.ArgumentParser()
188
+ parser.add_argument("--seed", type=int, default=43, help="generation seed (default: 43)")
189
+ parser.add_argument("--num-prompts", "-n", type=int, default=100, help="number of prompts per task (default: 100)")
190
+ parser.add_argument("--output-path", "-o", type=str, default="prompts", help="output folder for prompts and metadata (default: 'prompts/')")
191
+ args = parser.parse_args()
192
+ rng = np.random.default_rng(args.seed)
193
+ generate_suite(rng, args.num_prompts, args.output_path)
194
+
eval/gen/geneval/prompts/evaluation_metadata.jsonl ADDED
@@ -0,0 +1,553 @@
1
+ {"tag": "single_object", "include": [{"class": "bench", "count": 1}], "prompt": "a photo of a bench"}
2
+ {"tag": "single_object", "include": [{"class": "cow", "count": 1}], "prompt": "a photo of a cow"}
3
+ {"tag": "single_object", "include": [{"class": "bicycle", "count": 1}], "prompt": "a photo of a bicycle"}
4
+ {"tag": "single_object", "include": [{"class": "clock", "count": 1}], "prompt": "a photo of a clock"}
5
+ {"tag": "single_object", "include": [{"class": "carrot", "count": 1}], "prompt": "a photo of a carrot"}
6
+ {"tag": "single_object", "include": [{"class": "suitcase", "count": 1}], "prompt": "a photo of a suitcase"}
7
+ {"tag": "single_object", "include": [{"class": "fork", "count": 1}], "prompt": "a photo of a fork"}
8
+ {"tag": "single_object", "include": [{"class": "surfboard", "count": 1}], "prompt": "a photo of a surfboard"}
9
+ {"tag": "single_object", "include": [{"class": "refrigerator", "count": 1}], "prompt": "a photo of a refrigerator"}
10
+ {"tag": "single_object", "include": [{"class": "cup", "count": 1}], "prompt": "a photo of a cup"}
11
+ {"tag": "single_object", "include": [{"class": "microwave", "count": 1}], "prompt": "a photo of a microwave"}
12
+ {"tag": "single_object", "include": [{"class": "potted plant", "count": 1}], "prompt": "a photo of a potted plant"}
13
+ {"tag": "single_object", "include": [{"class": "snowboard", "count": 1}], "prompt": "a photo of a snowboard"}
14
+ {"tag": "single_object", "include": [{"class": "zebra", "count": 1}], "prompt": "a photo of a zebra"}
15
+ {"tag": "single_object", "include": [{"class": "parking meter", "count": 1}], "prompt": "a photo of a parking meter"}
16
+ {"tag": "single_object", "include": [{"class": "spoon", "count": 1}], "prompt": "a photo of a spoon"}
17
+ {"tag": "single_object", "include": [{"class": "skateboard", "count": 1}], "prompt": "a photo of a skateboard"}
18
+ {"tag": "single_object", "include": [{"class": "car", "count": 1}], "prompt": "a photo of a car"}
19
+ {"tag": "single_object", "include": [{"class": "motorcycle", "count": 1}], "prompt": "a photo of a motorcycle"}
20
+ {"tag": "single_object", "include": [{"class": "traffic light", "count": 1}], "prompt": "a photo of a traffic light"}
21
+ {"tag": "single_object", "include": [{"class": "book", "count": 1}], "prompt": "a photo of a book"}
22
+ {"tag": "single_object", "include": [{"class": "couch", "count": 1}], "prompt": "a photo of a couch"}
23
+ {"tag": "single_object", "include": [{"class": "backpack", "count": 1}], "prompt": "a photo of a backpack"}
24
+ {"tag": "single_object", "include": [{"class": "computer keyboard", "count": 1}], "prompt": "a photo of a computer keyboard"}
25
+ {"tag": "single_object", "include": [{"class": "toaster", "count": 1}], "prompt": "a photo of a toaster"}
26
+ {"tag": "single_object", "include": [{"class": "bird", "count": 1}], "prompt": "a photo of a bird"}
27
+ {"tag": "single_object", "include": [{"class": "bowl", "count": 1}], "prompt": "a photo of a bowl"}
28
+ {"tag": "single_object", "include": [{"class": "dog", "count": 1}], "prompt": "a photo of a dog"}
29
+ {"tag": "single_object", "include": [{"class": "tie", "count": 1}], "prompt": "a photo of a tie"}
30
+ {"tag": "single_object", "include": [{"class": "laptop", "count": 1}], "prompt": "a photo of a laptop"}
31
+ {"tag": "single_object", "include": [{"class": "computer mouse", "count": 1}], "prompt": "a photo of a computer mouse"}
32
+ {"tag": "single_object", "include": [{"class": "sandwich", "count": 1}], "prompt": "a photo of a sandwich"}
33
+ {"tag": "single_object", "include": [{"class": "baseball bat", "count": 1}], "prompt": "a photo of a baseball bat"}
34
+ {"tag": "single_object", "include": [{"class": "train", "count": 1}], "prompt": "a photo of a train"}
35
+ {"tag": "single_object", "include": [{"class": "cell phone", "count": 1}], "prompt": "a photo of a cell phone"}
36
+ {"tag": "single_object", "include": [{"class": "chair", "count": 1}], "prompt": "a photo of a chair"}
37
+ {"tag": "single_object", "include": [{"class": "tv", "count": 1}], "prompt": "a photo of a tv"}
38
+ {"tag": "single_object", "include": [{"class": "broccoli", "count": 1}], "prompt": "a photo of a broccoli"}
39
+ {"tag": "single_object", "include": [{"class": "bed", "count": 1}], "prompt": "a photo of a bed"}
40
+ {"tag": "single_object", "include": [{"class": "skis", "count": 1}], "prompt": "a photo of a skis"}
41
+ {"tag": "single_object", "include": [{"class": "handbag", "count": 1}], "prompt": "a photo of a handbag"}
42
+ {"tag": "single_object", "include": [{"class": "pizza", "count": 1}], "prompt": "a photo of a pizza"}
43
+ {"tag": "single_object", "include": [{"class": "frisbee", "count": 1}], "prompt": "a photo of a frisbee"}
44
+ {"tag": "single_object", "include": [{"class": "scissors", "count": 1}], "prompt": "a photo of a scissors"}
45
+ {"tag": "single_object", "include": [{"class": "bottle", "count": 1}], "prompt": "a photo of a bottle"}
46
+ {"tag": "single_object", "include": [{"class": "elephant", "count": 1}], "prompt": "a photo of an elephant"}
47
+ {"tag": "single_object", "include": [{"class": "toilet", "count": 1}], "prompt": "a photo of a toilet"}
48
+ {"tag": "single_object", "include": [{"class": "oven", "count": 1}], "prompt": "a photo of an oven"}
49
+ {"tag": "single_object", "include": [{"class": "orange", "count": 1}], "prompt": "a photo of an orange"}
50
+ {"tag": "single_object", "include": [{"class": "person", "count": 1}], "prompt": "a photo of a person"}
51
+ {"tag": "single_object", "include": [{"class": "teddy bear", "count": 1}], "prompt": "a photo of a teddy bear"}
52
+ {"tag": "single_object", "include": [{"class": "vase", "count": 1}], "prompt": "a photo of a vase"}
53
+ {"tag": "single_object", "include": [{"class": "banana", "count": 1}], "prompt": "a photo of a banana"}
54
+ {"tag": "single_object", "include": [{"class": "toothbrush", "count": 1}], "prompt": "a photo of a toothbrush"}
55
+ {"tag": "single_object", "include": [{"class": "tv remote", "count": 1}], "prompt": "a photo of a tv remote"}
56
+ {"tag": "single_object", "include": [{"class": "dining table", "count": 1}], "prompt": "a photo of a dining table"}
57
+ {"tag": "single_object", "include": [{"class": "stop sign", "count": 1}], "prompt": "a photo of a stop sign"}
58
+ {"tag": "single_object", "include": [{"class": "sheep", "count": 1}], "prompt": "a photo of a sheep"}
59
+ {"tag": "single_object", "include": [{"class": "fire hydrant", "count": 1}], "prompt": "a photo of a fire hydrant"}
60
+ {"tag": "single_object", "include": [{"class": "airplane", "count": 1}], "prompt": "a photo of an airplane"}
61
+ {"tag": "single_object", "include": [{"class": "giraffe", "count": 1}], "prompt": "a photo of a giraffe"}
62
+ {"tag": "single_object", "include": [{"class": "horse", "count": 1}], "prompt": "a photo of a horse"}
63
+ {"tag": "single_object", "include": [{"class": "cat", "count": 1}], "prompt": "a photo of a cat"}
64
+ {"tag": "single_object", "include": [{"class": "donut", "count": 1}], "prompt": "a photo of a donut"}
65
+ {"tag": "single_object", "include": [{"class": "boat", "count": 1}], "prompt": "a photo of a boat"}
66
+ {"tag": "single_object", "include": [{"class": "baseball glove", "count": 1}], "prompt": "a photo of a baseball glove"}
67
+ {"tag": "single_object", "include": [{"class": "hair drier", "count": 1}], "prompt": "a photo of a hair drier"}
68
+ {"tag": "single_object", "include": [{"class": "sink", "count": 1}], "prompt": "a photo of a sink"}
69
+ {"tag": "single_object", "include": [{"class": "cake", "count": 1}], "prompt": "a photo of a cake"}
70
+ {"tag": "single_object", "include": [{"class": "wine glass", "count": 1}], "prompt": "a photo of a wine glass"}
71
+ {"tag": "single_object", "include": [{"class": "apple", "count": 1}], "prompt": "a photo of an apple"}
72
+ {"tag": "single_object", "include": [{"class": "bus", "count": 1}], "prompt": "a photo of a bus"}
73
+ {"tag": "single_object", "include": [{"class": "tennis racket", "count": 1}], "prompt": "a photo of a tennis racket"}
74
+ {"tag": "single_object", "include": [{"class": "knife", "count": 1}], "prompt": "a photo of a knife"}
75
+ {"tag": "single_object", "include": [{"class": "hot dog", "count": 1}], "prompt": "a photo of a hot dog"}
76
+ {"tag": "single_object", "include": [{"class": "truck", "count": 1}], "prompt": "a photo of a truck"}
77
+ {"tag": "single_object", "include": [{"class": "umbrella", "count": 1}], "prompt": "a photo of an umbrella"}
78
+ {"tag": "single_object", "include": [{"class": "sports ball", "count": 1}], "prompt": "a photo of a sports ball"}
79
+ {"tag": "single_object", "include": [{"class": "bear", "count": 1}], "prompt": "a photo of a bear"}
80
+ {"tag": "single_object", "include": [{"class": "kite", "count": 1}], "prompt": "a photo of a kite"}
81
+ {"tag": "two_object", "include": [{"class": "bench", "count": 1}, {"class": "sports ball", "count": 1}], "prompt": "a photo of a bench and a sports ball"}
82
+ {"tag": "two_object", "include": [{"class": "toothbrush", "count": 1}, {"class": "snowboard", "count": 1}], "prompt": "a photo of a toothbrush and a snowboard"}
83
+ {"tag": "two_object", "include": [{"class": "toaster", "count": 1}, {"class": "oven", "count": 1}], "prompt": "a photo of a toaster and an oven"}
84
+ {"tag": "two_object", "include": [{"class": "broccoli", "count": 1}, {"class": "vase", "count": 1}], "prompt": "a photo of a broccoli and a vase"}
85
+ {"tag": "two_object", "include": [{"class": "tennis racket", "count": 1}, {"class": "wine glass", "count": 1}], "prompt": "a photo of a tennis racket and a wine glass"}
86
+ {"tag": "two_object", "include": [{"class": "fork", "count": 1}, {"class": "knife", "count": 1}], "prompt": "a photo of a fork and a knife"}
87
+ {"tag": "two_object", "include": [{"class": "hair drier", "count": 1}, {"class": "cake", "count": 1}], "prompt": "a photo of a hair drier and a cake"}
88
+ {"tag": "two_object", "include": [{"class": "horse", "count": 1}, {"class": "giraffe", "count": 1}], "prompt": "a photo of a horse and a giraffe"}
89
+ {"tag": "two_object", "include": [{"class": "horse", "count": 1}, {"class": "computer keyboard", "count": 1}], "prompt": "a photo of a horse and a computer keyboard"}
90
+ {"tag": "two_object", "include": [{"class": "toothbrush", "count": 1}, {"class": "carrot", "count": 1}], "prompt": "a photo of a toothbrush and a carrot"}
91
+ {"tag": "two_object", "include": [{"class": "cake", "count": 1}, {"class": "zebra", "count": 1}], "prompt": "a photo of a cake and a zebra"}
92
+ {"tag": "two_object", "include": [{"class": "hair drier", "count": 1}, {"class": "bear", "count": 1}], "prompt": "a photo of a hair drier and a bear"}
93
+ {"tag": "two_object", "include": [{"class": "knife", "count": 1}, {"class": "zebra", "count": 1}], "prompt": "a photo of a knife and a zebra"}
94
+ {"tag": "two_object", "include": [{"class": "couch", "count": 1}, {"class": "wine glass", "count": 1}], "prompt": "a photo of a couch and a wine glass"}
95
+ {"tag": "two_object", "include": [{"class": "frisbee", "count": 1}, {"class": "vase", "count": 1}], "prompt": "a photo of a frisbee and a vase"}
96
+ {"tag": "two_object", "include": [{"class": "book", "count": 1}, {"class": "laptop", "count": 1}], "prompt": "a photo of a book and a laptop"}
97
+ {"tag": "two_object", "include": [{"class": "dining table", "count": 1}, {"class": "bear", "count": 1}], "prompt": "a photo of a dining table and a bear"}
98
+ {"tag": "two_object", "include": [{"class": "frisbee", "count": 1}, {"class": "couch", "count": 1}], "prompt": "a photo of a frisbee and a couch"}
99
+ {"tag": "two_object", "include": [{"class": "couch", "count": 1}, {"class": "horse", "count": 1}], "prompt": "a photo of a couch and a horse"}
100
+ {"tag": "two_object", "include": [{"class": "toilet", "count": 1}, {"class": "computer mouse", "count": 1}], "prompt": "a photo of a toilet and a computer mouse"}
101
+ {"tag": "two_object", "include": [{"class": "bottle", "count": 1}, {"class": "refrigerator", "count": 1}], "prompt": "a photo of a bottle and a refrigerator"}
102
+ {"tag": "two_object", "include": [{"class": "potted plant", "count": 1}, {"class": "backpack", "count": 1}], "prompt": "a photo of a potted plant and a backpack"}
103
+ {"tag": "two_object", "include": [{"class": "skateboard", "count": 1}, {"class": "cake", "count": 1}], "prompt": "a photo of a skateboard and a cake"}
104
+ {"tag": "two_object", "include": [{"class": "broccoli", "count": 1}, {"class": "parking meter", "count": 1}], "prompt": "a photo of a broccoli and a parking meter"}
105
+ {"tag": "two_object", "include": [{"class": "zebra", "count": 1}, {"class": "bed", "count": 1}], "prompt": "a photo of a zebra and a bed"}
106
+ {"tag": "two_object", "include": [{"class": "oven", "count": 1}, {"class": "bed", "count": 1}], "prompt": "a photo of an oven and a bed"}
107
+ {"tag": "two_object", "include": [{"class": "baseball bat", "count": 1}, {"class": "fork", "count": 1}], "prompt": "a photo of a baseball bat and a fork"}
108
+ {"tag": "two_object", "include": [{"class": "vase", "count": 1}, {"class": "spoon", "count": 1}], "prompt": "a photo of a vase and a spoon"}
109
+ {"tag": "two_object", "include": [{"class": "skateboard", "count": 1}, {"class": "sink", "count": 1}], "prompt": "a photo of a skateboard and a sink"}
110
+ {"tag": "two_object", "include": [{"class": "pizza", "count": 1}, {"class": "bench", "count": 1}], "prompt": "a photo of a pizza and a bench"}
111
+ {"tag": "two_object", "include": [{"class": "bowl", "count": 1}, {"class": "pizza", "count": 1}], "prompt": "a photo of a bowl and a pizza"}
112
+ {"tag": "two_object", "include": [{"class": "tennis racket", "count": 1}, {"class": "bird", "count": 1}], "prompt": "a photo of a tennis racket and a bird"}
113
+ {"tag": "two_object", "include": [{"class": "wine glass", "count": 1}, {"class": "bear", "count": 1}], "prompt": "a photo of a wine glass and a bear"}
114
+ {"tag": "two_object", "include": [{"class": "fork", "count": 1}, {"class": "book", "count": 1}], "prompt": "a photo of a fork and a book"}
115
+ {"tag": "two_object", "include": [{"class": "scissors", "count": 1}, {"class": "bowl", "count": 1}], "prompt": "a photo of a scissors and a bowl"}
116
+ {"tag": "two_object", "include": [{"class": "laptop", "count": 1}, {"class": "carrot", "count": 1}], "prompt": "a photo of a laptop and a carrot"}
117
+ {"tag": "two_object", "include": [{"class": "stop sign", "count": 1}, {"class": "bottle", "count": 1}], "prompt": "a photo of a stop sign and a bottle"}
118
+ {"tag": "two_object", "include": [{"class": "microwave", "count": 1}, {"class": "truck", "count": 1}], "prompt": "a photo of a microwave and a truck"}
119
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "bear", "count": 1}], "prompt": "a photo of a person and a bear"}
120
+ {"tag": "two_object", "include": [{"class": "frisbee", "count": 1}, {"class": "cell phone", "count": 1}], "prompt": "a photo of a frisbee and a cell phone"}
121
+ {"tag": "two_object", "include": [{"class": "parking meter", "count": 1}, {"class": "teddy bear", "count": 1}], "prompt": "a photo of a parking meter and a teddy bear"}
122
+ {"tag": "two_object", "include": [{"class": "tennis racket", "count": 1}, {"class": "bicycle", "count": 1}], "prompt": "a photo of a tennis racket and a bicycle"}
123
+ {"tag": "two_object", "include": [{"class": "stop sign", "count": 1}, {"class": "motorcycle", "count": 1}], "prompt": "a photo of a stop sign and a motorcycle"}
124
+ {"tag": "two_object", "include": [{"class": "fire hydrant", "count": 1}, {"class": "tennis racket", "count": 1}], "prompt": "a photo of a fire hydrant and a tennis racket"}
125
+ {"tag": "two_object", "include": [{"class": "scissors", "count": 1}, {"class": "sandwich", "count": 1}], "prompt": "a photo of a scissors and a sandwich"}
126
+ {"tag": "two_object", "include": [{"class": "pizza", "count": 1}, {"class": "book", "count": 1}], "prompt": "a photo of a pizza and a book"}
127
+ {"tag": "two_object", "include": [{"class": "giraffe", "count": 1}, {"class": "computer mouse", "count": 1}], "prompt": "a photo of a giraffe and a computer mouse"}
128
+ {"tag": "two_object", "include": [{"class": "stop sign", "count": 1}, {"class": "toaster", "count": 1}], "prompt": "a photo of a stop sign and a toaster"}
129
+ {"tag": "two_object", "include": [{"class": "computer mouse", "count": 1}, {"class": "zebra", "count": 1}], "prompt": "a photo of a computer mouse and a zebra"}
130
+ {"tag": "two_object", "include": [{"class": "chair", "count": 1}, {"class": "bench", "count": 1}], "prompt": "a photo of a chair and a bench"}
131
+ {"tag": "two_object", "include": [{"class": "tv", "count": 1}, {"class": "carrot", "count": 1}], "prompt": "a photo of a tv and a carrot"}
132
+ {"tag": "two_object", "include": [{"class": "surfboard", "count": 1}, {"class": "suitcase", "count": 1}], "prompt": "a photo of a surfboard and a suitcase"}
133
+ {"tag": "two_object", "include": [{"class": "computer keyboard", "count": 1}, {"class": "laptop", "count": 1}], "prompt": "a photo of a computer keyboard and a laptop"}
134
+ {"tag": "two_object", "include": [{"class": "computer keyboard", "count": 1}, {"class": "microwave", "count": 1}], "prompt": "a photo of a computer keyboard and a microwave"}
135
+ {"tag": "two_object", "include": [{"class": "scissors", "count": 1}, {"class": "bird", "count": 1}], "prompt": "a photo of a scissors and a bird"}
136
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "snowboard", "count": 1}], "prompt": "a photo of a person and a snowboard"}
137
+ {"tag": "two_object", "include": [{"class": "cow", "count": 1}, {"class": "horse", "count": 1}], "prompt": "a photo of a cow and a horse"}
138
+ {"tag": "two_object", "include": [{"class": "handbag", "count": 1}, {"class": "refrigerator", "count": 1}], "prompt": "a photo of a handbag and a refrigerator"}
139
+ {"tag": "two_object", "include": [{"class": "chair", "count": 1}, {"class": "laptop", "count": 1}], "prompt": "a photo of a chair and a laptop"}
140
+ {"tag": "two_object", "include": [{"class": "toothbrush", "count": 1}, {"class": "bench", "count": 1}], "prompt": "a photo of a toothbrush and a bench"}
141
+ {"tag": "two_object", "include": [{"class": "book", "count": 1}, {"class": "baseball bat", "count": 1}], "prompt": "a photo of a book and a baseball bat"}
142
+ {"tag": "two_object", "include": [{"class": "horse", "count": 1}, {"class": "train", "count": 1}], "prompt": "a photo of a horse and a train"}
143
+ {"tag": "two_object", "include": [{"class": "bench", "count": 1}, {"class": "vase", "count": 1}], "prompt": "a photo of a bench and a vase"}
144
+ {"tag": "two_object", "include": [{"class": "traffic light", "count": 1}, {"class": "backpack", "count": 1}], "prompt": "a photo of a traffic light and a backpack"}
145
+ {"tag": "two_object", "include": [{"class": "sports ball", "count": 1}, {"class": "cow", "count": 1}], "prompt": "a photo of a sports ball and a cow"}
146
+ {"tag": "two_object", "include": [{"class": "computer mouse", "count": 1}, {"class": "spoon", "count": 1}], "prompt": "a photo of a computer mouse and a spoon"}
147
+ {"tag": "two_object", "include": [{"class": "tv", "count": 1}, {"class": "bicycle", "count": 1}], "prompt": "a photo of a tv and a bicycle"}
148
+ {"tag": "two_object", "include": [{"class": "bench", "count": 1}, {"class": "snowboard", "count": 1}], "prompt": "a photo of a bench and a snowboard"}
149
+ {"tag": "two_object", "include": [{"class": "toothbrush", "count": 1}, {"class": "toilet", "count": 1}], "prompt": "a photo of a toothbrush and a toilet"}
150
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "apple", "count": 1}], "prompt": "a photo of a person and an apple"}
151
+ {"tag": "two_object", "include": [{"class": "sink", "count": 1}, {"class": "sports ball", "count": 1}], "prompt": "a photo of a sink and a sports ball"}
152
+ {"tag": "two_object", "include": [{"class": "stop sign", "count": 1}, {"class": "dog", "count": 1}], "prompt": "a photo of a stop sign and a dog"}
153
+ {"tag": "two_object", "include": [{"class": "knife", "count": 1}, {"class": "stop sign", "count": 1}], "prompt": "a photo of a knife and a stop sign"}
154
+ {"tag": "two_object", "include": [{"class": "wine glass", "count": 1}, {"class": "handbag", "count": 1}], "prompt": "a photo of a wine glass and a handbag"}
155
+ {"tag": "two_object", "include": [{"class": "bowl", "count": 1}, {"class": "skis", "count": 1}], "prompt": "a photo of a bowl and a skis"}
156
+ {"tag": "two_object", "include": [{"class": "frisbee", "count": 1}, {"class": "apple", "count": 1}], "prompt": "a photo of a frisbee and an apple"}
157
+ {"tag": "two_object", "include": [{"class": "computer keyboard", "count": 1}, {"class": "cell phone", "count": 1}], "prompt": "a photo of a computer keyboard and a cell phone"}
158
+ {"tag": "two_object", "include": [{"class": "stop sign", "count": 1}, {"class": "fork", "count": 1}], "prompt": "a photo of a stop sign and a fork"}
159
+ {"tag": "two_object", "include": [{"class": "potted plant", "count": 1}, {"class": "boat", "count": 1}], "prompt": "a photo of a potted plant and a boat"}
160
+ {"tag": "two_object", "include": [{"class": "tv", "count": 1}, {"class": "cell phone", "count": 1}], "prompt": "a photo of a tv and a cell phone"}
161
+ {"tag": "two_object", "include": [{"class": "tie", "count": 1}, {"class": "broccoli", "count": 1}], "prompt": "a photo of a tie and a broccoli"}
162
+ {"tag": "two_object", "include": [{"class": "potted plant", "count": 1}, {"class": "donut", "count": 1}], "prompt": "a photo of a potted plant and a donut"}
163
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "sink", "count": 1}], "prompt": "a photo of a person and a sink"}
164
+ {"tag": "two_object", "include": [{"class": "couch", "count": 1}, {"class": "snowboard", "count": 1}], "prompt": "a photo of a couch and a snowboard"}
165
+ {"tag": "two_object", "include": [{"class": "fork", "count": 1}, {"class": "baseball glove", "count": 1}], "prompt": "a photo of a fork and a baseball glove"}
166
+ {"tag": "two_object", "include": [{"class": "apple", "count": 1}, {"class": "toothbrush", "count": 1}], "prompt": "a photo of an apple and a toothbrush"}
167
+ {"tag": "two_object", "include": [{"class": "bus", "count": 1}, {"class": "baseball glove", "count": 1}], "prompt": "a photo of a bus and a baseball glove"}
168
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "stop sign", "count": 1}], "prompt": "a photo of a person and a stop sign"}
169
+ {"tag": "two_object", "include": [{"class": "carrot", "count": 1}, {"class": "couch", "count": 1}], "prompt": "a photo of a carrot and a couch"}
170
+ {"tag": "two_object", "include": [{"class": "baseball bat", "count": 1}, {"class": "bear", "count": 1}], "prompt": "a photo of a baseball bat and a bear"}
171
+ {"tag": "two_object", "include": [{"class": "fire hydrant", "count": 1}, {"class": "train", "count": 1}], "prompt": "a photo of a fire hydrant and a train"}
172
+ {"tag": "two_object", "include": [{"class": "baseball glove", "count": 1}, {"class": "carrot", "count": 1}], "prompt": "a photo of a baseball glove and a carrot"}
173
+ {"tag": "two_object", "include": [{"class": "microwave", "count": 1}, {"class": "bench", "count": 1}], "prompt": "a photo of a microwave and a bench"}
174
+ {"tag": "two_object", "include": [{"class": "cake", "count": 1}, {"class": "stop sign", "count": 1}], "prompt": "a photo of a cake and a stop sign"}
175
+ {"tag": "two_object", "include": [{"class": "car", "count": 1}, {"class": "computer mouse", "count": 1}], "prompt": "a photo of a car and a computer mouse"}
176
+ {"tag": "two_object", "include": [{"class": "suitcase", "count": 1}, {"class": "dining table", "count": 1}], "prompt": "a photo of a suitcase and a dining table"}
177
+ {"tag": "two_object", "include": [{"class": "person", "count": 1}, {"class": "traffic light", "count": 1}], "prompt": "a photo of a person and a traffic light"}
178
+ {"tag": "two_object", "include": [{"class": "cell phone", "count": 1}, {"class": "horse", "count": 1}], "prompt": "a photo of a cell phone and a horse"}
179
+ {"tag": "two_object", "include": [{"class": "baseball bat", "count": 1}, {"class": "giraffe", "count": 1}], "prompt": "a photo of a baseball bat and a giraffe"}
180
+ {"tag": "counting", "include": [{"class": "clock", "count": 2}], "exclude": [{"class": "clock", "count": 3}], "prompt": "a photo of two clocks"}
181
+ {"tag": "counting", "include": [{"class": "backpack", "count": 2}], "exclude": [{"class": "backpack", "count": 3}], "prompt": "a photo of two backpacks"}
182
+ {"tag": "counting", "include": [{"class": "handbag", "count": 4}], "exclude": [{"class": "handbag", "count": 5}], "prompt": "a photo of four handbags"}
183
+ {"tag": "counting", "include": [{"class": "frisbee", "count": 2}], "exclude": [{"class": "frisbee", "count": 3}], "prompt": "a photo of two frisbees"}
184
+ {"tag": "counting", "include": [{"class": "sports ball", "count": 3}], "exclude": [{"class": "sports ball", "count": 4}], "prompt": "a photo of three sports balls"}
185
+ {"tag": "counting", "include": [{"class": "bear", "count": 2}], "exclude": [{"class": "bear", "count": 3}], "prompt": "a photo of two bears"}
186
+ {"tag": "counting", "include": [{"class": "tie", "count": 2}], "exclude": [{"class": "tie", "count": 3}], "prompt": "a photo of two ties"}
187
+ {"tag": "counting", "include": [{"class": "sink", "count": 4}], "exclude": [{"class": "sink", "count": 5}], "prompt": "a photo of four sinks"}
188
+ {"tag": "counting", "include": [{"class": "toothbrush", "count": 2}], "exclude": [{"class": "toothbrush", "count": 3}], "prompt": "a photo of two toothbrushs"}
189
+ {"tag": "counting", "include": [{"class": "person", "count": 3}], "exclude": [{"class": "person", "count": 4}], "prompt": "a photo of three persons"}
190
+ {"tag": "counting", "include": [{"class": "tennis racket", "count": 3}], "exclude": [{"class": "tennis racket", "count": 4}], "prompt": "a photo of three tennis rackets"}
191
+ {"tag": "counting", "include": [{"class": "bowl", "count": 4}], "exclude": [{"class": "bowl", "count": 5}], "prompt": "a photo of four bowls"}
192
+ {"tag": "counting", "include": [{"class": "vase", "count": 4}], "exclude": [{"class": "vase", "count": 5}], "prompt": "a photo of four vases"}
193
+ {"tag": "counting", "include": [{"class": "cup", "count": 3}], "exclude": [{"class": "cup", "count": 4}], "prompt": "a photo of three cups"}
194
+ {"tag": "counting", "include": [{"class": "computer keyboard", "count": 4}], "exclude": [{"class": "computer keyboard", "count": 5}], "prompt": "a photo of four computer keyboards"}
195
+ {"tag": "counting", "include": [{"class": "sink", "count": 3}], "exclude": [{"class": "sink", "count": 4}], "prompt": "a photo of three sinks"}
196
+ {"tag": "counting", "include": [{"class": "oven", "count": 2}], "exclude": [{"class": "oven", "count": 3}], "prompt": "a photo of two ovens"}
197
+ {"tag": "counting", "include": [{"class": "toilet", "count": 2}], "exclude": [{"class": "toilet", "count": 3}], "prompt": "a photo of two toilets"}
198
+ {"tag": "counting", "include": [{"class": "bicycle", "count": 2}], "exclude": [{"class": "bicycle", "count": 3}], "prompt": "a photo of two bicycles"}
199
+ {"tag": "counting", "include": [{"class": "train", "count": 2}], "exclude": [{"class": "train", "count": 3}], "prompt": "a photo of two trains"}
200
+ {"tag": "counting", "include": [{"class": "orange", "count": 3}], "exclude": [{"class": "orange", "count": 4}], "prompt": "a photo of three oranges"}
201
+ {"tag": "counting", "include": [{"class": "bus", "count": 3}], "exclude": [{"class": "bus", "count": 4}], "prompt": "a photo of three buses"}
202
+ {"tag": "counting", "include": [{"class": "handbag", "count": 3}], "exclude": [{"class": "handbag", "count": 4}], "prompt": "a photo of three handbags"}
203
+ {"tag": "counting", "include": [{"class": "snowboard", "count": 3}], "exclude": [{"class": "snowboard", "count": 4}], "prompt": "a photo of three snowboards"}
204
+ {"tag": "counting", "include": [{"class": "snowboard", "count": 2}], "exclude": [{"class": "snowboard", "count": 3}], "prompt": "a photo of two snowboards"}
205
+ {"tag": "counting", "include": [{"class": "dog", "count": 4}], "exclude": [{"class": "dog", "count": 5}], "prompt": "a photo of four dogs"}
206
+ {"tag": "counting", "include": [{"class": "apple", "count": 3}], "exclude": [{"class": "apple", "count": 4}], "prompt": "a photo of three apples"}
207
+ {"tag": "counting", "include": [{"class": "sheep", "count": 2}], "exclude": [{"class": "sheep", "count": 3}], "prompt": "a photo of two sheeps"}
208
+ {"tag": "counting", "include": [{"class": "hot dog", "count": 3}], "exclude": [{"class": "hot dog", "count": 4}], "prompt": "a photo of three hot dogs"}
209
+ {"tag": "counting", "include": [{"class": "zebra", "count": 3}], "exclude": [{"class": "zebra", "count": 4}], "prompt": "a photo of three zebras"}
210
+ {"tag": "counting", "include": [{"class": "kite", "count": 3}], "exclude": [{"class": "kite", "count": 4}], "prompt": "a photo of three kites"}
211
+ {"tag": "counting", "include": [{"class": "apple", "count": 4}], "exclude": [{"class": "apple", "count": 5}], "prompt": "a photo of four apples"}
212
+ {"tag": "counting", "include": [{"class": "cell phone", "count": 3}], "exclude": [{"class": "cell phone", "count": 4}], "prompt": "a photo of three cell phones"}
213
+ {"tag": "counting", "include": [{"class": "baseball glove", "count": 4}], "exclude": [{"class": "baseball glove", "count": 5}], "prompt": "a photo of four baseball gloves"}
214
+ {"tag": "counting", "include": [{"class": "computer keyboard", "count": 3}], "exclude": [{"class": "computer keyboard", "count": 4}], "prompt": "a photo of three computer keyboards"}
215
+ {"tag": "counting", "include": [{"class": "bed", "count": 2}], "exclude": [{"class": "bed", "count": 3}], "prompt": "a photo of two beds"}
216
+ {"tag": "counting", "include": [{"class": "tv remote", "count": 2}], "exclude": [{"class": "tv remote", "count": 3}], "prompt": "a photo of two tv remotes"}
217
+ {"tag": "counting", "include": [{"class": "fire hydrant", "count": 3}], "exclude": [{"class": "fire hydrant", "count": 4}], "prompt": "a photo of three fire hydrants"}
218
+ {"tag": "counting", "include": [{"class": "book", "count": 3}], "exclude": [{"class": "book", "count": 4}], "prompt": "a photo of three books"}
219
+ {"tag": "counting", "include": [{"class": "giraffe", "count": 4}], "exclude": [{"class": "giraffe", "count": 5}], "prompt": "a photo of four giraffes"}
220
+ {"tag": "counting", "include": [{"class": "vase", "count": 2}], "exclude": [{"class": "vase", "count": 3}], "prompt": "a photo of two vases"}
221
+ {"tag": "counting", "include": [{"class": "donut", "count": 4}], "exclude": [{"class": "donut", "count": 5}], "prompt": "a photo of four donuts"}
222
+ {"tag": "counting", "include": [{"class": "chair", "count": 4}], "exclude": [{"class": "chair", "count": 5}], "prompt": "a photo of four chairs"}
223
+ {"tag": "counting", "include": [{"class": "baseball bat", "count": 3}], "exclude": [{"class": "baseball bat", "count": 4}], "prompt": "a photo of three baseball bats"}
224
+ {"tag": "counting", "include": [{"class": "stop sign", "count": 4}], "exclude": [{"class": "stop sign", "count": 5}], "prompt": "a photo of four stop signs"}
225
+ {"tag": "counting", "include": [{"class": "pizza", "count": 2}], "exclude": [{"class": "pizza", "count": 3}], "prompt": "a photo of two pizzas"}
226
+ {"tag": "counting", "include": [{"class": "refrigerator", "count": 3}], "exclude": [{"class": "refrigerator", "count": 4}], "prompt": "a photo of three refrigerators"}
227
+ {"tag": "counting", "include": [{"class": "fire hydrant", "count": 2}], "exclude": [{"class": "fire hydrant", "count": 3}], "prompt": "a photo of two fire hydrants"}
228
+ {"tag": "counting", "include": [{"class": "giraffe", "count": 3}], "exclude": [{"class": "giraffe", "count": 4}], "prompt": "a photo of three giraffes"}
229
+ {"tag": "counting", "include": [{"class": "tv", "count": 4}], "exclude": [{"class": "tv", "count": 5}], "prompt": "a photo of four tvs"}
230
+ {"tag": "counting", "include": [{"class": "wine glass", "count": 3}], "exclude": [{"class": "wine glass", "count": 4}], "prompt": "a photo of three wine glasses"}
231
+ {"tag": "counting", "include": [{"class": "broccoli", "count": 4}], "exclude": [{"class": "broccoli", "count": 5}], "prompt": "a photo of four broccolis"}
232
+ {"tag": "counting", "include": [{"class": "truck", "count": 3}], "exclude": [{"class": "truck", "count": 4}], "prompt": "a photo of three trucks"}
233
+ {"tag": "counting", "include": [{"class": "truck", "count": 2}], "exclude": [{"class": "truck", "count": 3}], "prompt": "a photo of two trucks"}
234
+ {"tag": "counting", "include": [{"class": "carrot", "count": 2}], "exclude": [{"class": "carrot", "count": 3}], "prompt": "a photo of two carrots"}
235
+ {"tag": "counting", "include": [{"class": "sandwich", "count": 2}], "exclude": [{"class": "sandwich", "count": 3}], "prompt": "a photo of two sandwichs"}
236
+ {"tag": "counting", "include": [{"class": "traffic light", "count": 4}], "exclude": [{"class": "traffic light", "count": 5}], "prompt": "a photo of four traffic lights"}
237
+ {"tag": "counting", "include": [{"class": "clock", "count": 4}], "exclude": [{"class": "clock", "count": 5}], "prompt": "a photo of four clocks"}
238
+ {"tag": "counting", "include": [{"class": "car", "count": 2}], "exclude": [{"class": "car", "count": 3}], "prompt": "a photo of two cars"}
239
+ {"tag": "counting", "include": [{"class": "banana", "count": 2}], "exclude": [{"class": "banana", "count": 3}], "prompt": "a photo of two bananas"}
240
+ {"tag": "counting", "include": [{"class": "wine glass", "count": 2}], "exclude": [{"class": "wine glass", "count": 3}], "prompt": "a photo of two wine glasses"}
241
+ {"tag": "counting", "include": [{"class": "pizza", "count": 3}], "exclude": [{"class": "pizza", "count": 4}], "prompt": "a photo of three pizzas"}
242
+ {"tag": "counting", "include": [{"class": "knife", "count": 4}], "exclude": [{"class": "knife", "count": 5}], "prompt": "a photo of four knifes"}
243
+ {"tag": "counting", "include": [{"class": "suitcase", "count": 3}], "exclude": [{"class": "suitcase", "count": 4}], "prompt": "a photo of three suitcases"}
244
+ {"tag": "counting", "include": [{"class": "zebra", "count": 4}], "exclude": [{"class": "zebra", "count": 5}], "prompt": "a photo of four zebras"}
245
+ {"tag": "counting", "include": [{"class": "teddy bear", "count": 2}], "exclude": [{"class": "teddy bear", "count": 3}], "prompt": "a photo of two teddy bears"}
246
+ {"tag": "counting", "include": [{"class": "skateboard", "count": 4}], "exclude": [{"class": "skateboard", "count": 5}], "prompt": "a photo of four skateboards"}
247
+ {"tag": "counting", "include": [{"class": "hot dog", "count": 4}], "exclude": [{"class": "hot dog", "count": 5}], "prompt": "a photo of four hot dogs"}
248
+ {"tag": "counting", "include": [{"class": "bird", "count": 3}], "exclude": [{"class": "bird", "count": 4}], "prompt": "a photo of three birds"}
249
+ {"tag": "counting", "include": [{"class": "boat", "count": 4}], "exclude": [{"class": "boat", "count": 5}], "prompt": "a photo of four boats"}
250
+ {"tag": "counting", "include": [{"class": "microwave", "count": 4}], "exclude": [{"class": "microwave", "count": 5}], "prompt": "a photo of four microwaves"}
251
+ {"tag": "counting", "include": [{"class": "hair drier", "count": 2}], "exclude": [{"class": "hair drier", "count": 3}], "prompt": "a photo of two hair driers"}
252
+ {"tag": "counting", "include": [{"class": "laptop", "count": 3}], "exclude": [{"class": "laptop", "count": 4}], "prompt": "a photo of three laptops"}
253
+ {"tag": "counting", "include": [{"class": "cow", "count": 3}], "exclude": [{"class": "cow", "count": 4}], "prompt": "a photo of three cows"}
254
+ {"tag": "counting", "include": [{"class": "parking meter", "count": 2}], "exclude": [{"class": "parking meter", "count": 3}], "prompt": "a photo of two parking meters"}
255
+ {"tag": "counting", "include": [{"class": "bench", "count": 4}], "exclude": [{"class": "bench", "count": 5}], "prompt": "a photo of four benchs"}
256
+ {"tag": "counting", "include": [{"class": "bench", "count": 3}], "exclude": [{"class": "bench", "count": 4}], "prompt": "a photo of three benchs"}
257
+ {"tag": "counting", "include": [{"class": "frisbee", "count": 4}], "exclude": [{"class": "frisbee", "count": 5}], "prompt": "a photo of four frisbees"}
258
+ {"tag": "counting", "include": [{"class": "book", "count": 4}], "exclude": [{"class": "book", "count": 5}], "prompt": "a photo of four books"}
259
+ {"tag": "counting", "include": [{"class": "bus", "count": 4}], "exclude": [{"class": "bus", "count": 5}], "prompt": "a photo of four buses"}
260
+ {"tag": "colors", "include": [{"class": "fire hydrant", "count": 1, "color": "blue"}], "prompt": "a photo of a blue fire hydrant"}
261
+ {"tag": "colors", "include": [{"class": "car", "count": 1, "color": "pink"}], "prompt": "a photo of a pink car"}
262
+ {"tag": "colors", "include": [{"class": "cup", "count": 1, "color": "purple"}], "prompt": "a photo of a purple cup"}
263
+ {"tag": "colors", "include": [{"class": "cow", "count": 1, "color": "blue"}], "prompt": "a photo of a blue cow"}
264
+ {"tag": "colors", "include": [{"class": "boat", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow boat"}
265
+ {"tag": "colors", "include": [{"class": "umbrella", "count": 1, "color": "blue"}], "prompt": "a photo of a blue umbrella"}
266
+ {"tag": "colors", "include": [{"class": "elephant", "count": 1, "color": "blue"}], "prompt": "a photo of a blue elephant"}
267
+ {"tag": "colors", "include": [{"class": "elephant", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow elephant"}
268
+ {"tag": "colors", "include": [{"class": "bicycle", "count": 1, "color": "red"}], "prompt": "a photo of a red bicycle"}
269
+ {"tag": "colors", "include": [{"class": "suitcase", "count": 1, "color": "purple"}], "prompt": "a photo of a purple suitcase"}
270
+ {"tag": "colors", "include": [{"class": "hair drier", "count": 1, "color": "purple"}], "prompt": "a photo of a purple hair drier"}
271
+ {"tag": "colors", "include": [{"class": "sandwich", "count": 1, "color": "white"}], "prompt": "a photo of a white sandwich"}
272
+ {"tag": "colors", "include": [{"class": "elephant", "count": 1, "color": "purple"}], "prompt": "a photo of a purple elephant"}
273
+ {"tag": "colors", "include": [{"class": "microwave", "count": 1, "color": "green"}], "prompt": "a photo of a green microwave"}
274
+ {"tag": "colors", "include": [{"class": "zebra", "count": 1, "color": "red"}], "prompt": "a photo of a red zebra"}
275
+ {"tag": "colors", "include": [{"class": "apple", "count": 1, "color": "red"}], "prompt": "a photo of a red apple"}
276
+ {"tag": "colors", "include": [{"class": "tv remote", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow tv remote"}
277
+ {"tag": "colors", "include": [{"class": "toilet", "count": 1, "color": "blue"}], "prompt": "a photo of a blue toilet"}
278
+ {"tag": "colors", "include": [{"class": "orange", "count": 1, "color": "orange"}], "prompt": "a photo of an orange orange"}
279
+ {"tag": "colors", "include": [{"class": "donut", "count": 1, "color": "black"}], "prompt": "a photo of a black donut"}
280
+ {"tag": "colors", "include": [{"class": "vase", "count": 1, "color": "red"}], "prompt": "a photo of a red vase"}
281
+ {"tag": "colors", "include": [{"class": "pizza", "count": 1, "color": "purple"}], "prompt": "a photo of a purple pizza"}
282
+ {"tag": "colors", "include": [{"class": "skateboard", "count": 1, "color": "pink"}], "prompt": "a photo of a pink skateboard"}
283
+ {"tag": "colors", "include": [{"class": "skateboard", "count": 1, "color": "green"}], "prompt": "a photo of a green skateboard"}
284
+ {"tag": "colors", "include": [{"class": "bear", "count": 1, "color": "purple"}], "prompt": "a photo of a purple bear"}
285
+ {"tag": "colors", "include": [{"class": "chair", "count": 1, "color": "brown"}], "prompt": "a photo of a brown chair"}
286
+ {"tag": "colors", "include": [{"class": "computer keyboard", "count": 1, "color": "brown"}], "prompt": "a photo of a brown computer keyboard"}
287
+ {"tag": "colors", "include": [{"class": "cow", "count": 1, "color": "orange"}], "prompt": "a photo of an orange cow"}
288
+ {"tag": "colors", "include": [{"class": "skis", "count": 1, "color": "brown"}], "prompt": "a photo of a brown skis"}
289
+ {"tag": "colors", "include": [{"class": "kite", "count": 1, "color": "white"}], "prompt": "a photo of a white kite"}
290
+ {"tag": "colors", "include": [{"class": "dog", "count": 1, "color": "red"}], "prompt": "a photo of a red dog"}
291
+ {"tag": "colors", "include": [{"class": "couch", "count": 1, "color": "green"}], "prompt": "a photo of a green couch"}
292
+ {"tag": "colors", "include": [{"class": "airplane", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow airplane"}
293
+ {"tag": "colors", "include": [{"class": "tv", "count": 1, "color": "orange"}], "prompt": "a photo of an orange tv"}
294
+ {"tag": "colors", "include": [{"class": "scissors", "count": 1, "color": "white"}], "prompt": "a photo of a white scissors"}
295
+ {"tag": "colors", "include": [{"class": "cell phone", "count": 1, "color": "pink"}], "prompt": "a photo of a pink cell phone"}
296
+ {"tag": "colors", "include": [{"class": "surfboard", "count": 1, "color": "green"}], "prompt": "a photo of a green surfboard"}
297
+ {"tag": "colors", "include": [{"class": "fire hydrant", "count": 1, "color": "white"}], "prompt": "a photo of a white fire hydrant"}
298
+ {"tag": "colors", "include": [{"class": "bicycle", "count": 1, "color": "black"}], "prompt": "a photo of a black bicycle"}
299
+ {"tag": "colors", "include": [{"class": "carrot", "count": 1, "color": "purple"}], "prompt": "a photo of a purple carrot"}
300
+ {"tag": "colors", "include": [{"class": "dining table", "count": 1, "color": "black"}], "prompt": "a photo of a black dining table"}
301
+ {"tag": "colors", "include": [{"class": "potted plant", "count": 1, "color": "purple"}], "prompt": "a photo of a purple potted plant"}
302
+ {"tag": "colors", "include": [{"class": "backpack", "count": 1, "color": "purple"}], "prompt": "a photo of a purple backpack"}
303
+ {"tag": "colors", "include": [{"class": "train", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow train"}
304
+ {"tag": "colors", "include": [{"class": "potted plant", "count": 1, "color": "pink"}], "prompt": "a photo of a pink potted plant"}
305
+ {"tag": "colors", "include": [{"class": "giraffe", "count": 1, "color": "red"}], "prompt": "a photo of a red giraffe"}
306
+ {"tag": "colors", "include": [{"class": "bear", "count": 1, "color": "brown"}], "prompt": "a photo of a brown bear"}
307
+ {"tag": "colors", "include": [{"class": "train", "count": 1, "color": "black"}], "prompt": "a photo of a black train"}
308
+ {"tag": "colors", "include": [{"class": "laptop", "count": 1, "color": "orange"}], "prompt": "a photo of an orange laptop"}
309
+ {"tag": "colors", "include": [{"class": "hot dog", "count": 1, "color": "green"}], "prompt": "a photo of a green hot dog"}
310
+ {"tag": "colors", "include": [{"class": "parking meter", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow parking meter"}
311
+ {"tag": "colors", "include": [{"class": "potted plant", "count": 1, "color": "red"}], "prompt": "a photo of a red potted plant"}
312
+ {"tag": "colors", "include": [{"class": "traffic light", "count": 1, "color": "green"}], "prompt": "a photo of a green traffic light"}
313
+ {"tag": "colors", "include": [{"class": "tv", "count": 1, "color": "blue"}], "prompt": "a photo of a blue tv"}
314
+ {"tag": "colors", "include": [{"class": "refrigerator", "count": 1, "color": "brown"}], "prompt": "a photo of a brown refrigerator"}
315
+ {"tag": "colors", "include": [{"class": "tv remote", "count": 1, "color": "black"}], "prompt": "a photo of a black tv remote"}
316
+ {"tag": "colors", "include": [{"class": "scissors", "count": 1, "color": "purple"}], "prompt": "a photo of a purple scissors"}
317
+ {"tag": "colors", "include": [{"class": "orange", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow orange"}
318
+ {"tag": "colors", "include": [{"class": "toaster", "count": 1, "color": "brown"}], "prompt": "a photo of a brown toaster"}
319
+ {"tag": "colors", "include": [{"class": "parking meter", "count": 1, "color": "red"}], "prompt": "a photo of a red parking meter"}
320
+ {"tag": "colors", "include": [{"class": "orange", "count": 1, "color": "brown"}], "prompt": "a photo of a brown orange"}
321
+ {"tag": "colors", "include": [{"class": "clock", "count": 1, "color": "green"}], "prompt": "a photo of a green clock"}
322
+ {"tag": "colors", "include": [{"class": "sheep", "count": 1, "color": "white"}], "prompt": "a photo of a white sheep"}
323
+ {"tag": "colors", "include": [{"class": "oven", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow oven"}
324
+ {"tag": "colors", "include": [{"class": "vase", "count": 1, "color": "green"}], "prompt": "a photo of a green vase"}
325
+ {"tag": "colors", "include": [{"class": "teddy bear", "count": 1, "color": "black"}], "prompt": "a photo of a black teddy bear"}
326
+ {"tag": "colors", "include": [{"class": "carrot", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow carrot"}
327
+ {"tag": "colors", "include": [{"class": "hot dog", "count": 1, "color": "black"}], "prompt": "a photo of a black hot dog"}
328
+ {"tag": "colors", "include": [{"class": "scissors", "count": 1, "color": "red"}], "prompt": "a photo of a red scissors"}
329
+ {"tag": "colors", "include": [{"class": "teddy bear", "count": 1, "color": "white"}], "prompt": "a photo of a white teddy bear"}
330
+ {"tag": "colors", "include": [{"class": "skis", "count": 1, "color": "black"}], "prompt": "a photo of a black skis"}
331
+ {"tag": "colors", "include": [{"class": "dining table", "count": 1, "color": "blue"}], "prompt": "a photo of a blue dining table"}
332
+ {"tag": "colors", "include": [{"class": "refrigerator", "count": 1, "color": "black"}], "prompt": "a photo of a black refrigerator"}
333
+ {"tag": "colors", "include": [{"class": "dog", "count": 1, "color": "white"}], "prompt": "a photo of a white dog"}
334
+ {"tag": "colors", "include": [{"class": "scissors", "count": 1, "color": "orange"}], "prompt": "a photo of an orange scissors"}
335
+ {"tag": "colors", "include": [{"class": "cell phone", "count": 1, "color": "red"}], "prompt": "a photo of a red cell phone"}
336
+ {"tag": "colors", "include": [{"class": "orange", "count": 1, "color": "white"}], "prompt": "a photo of a white orange"}
337
+ {"tag": "colors", "include": [{"class": "clock", "count": 1, "color": "blue"}], "prompt": "a photo of a blue clock"}
338
+ {"tag": "colors", "include": [{"class": "carrot", "count": 1, "color": "blue"}], "prompt": "a photo of a blue carrot"}
339
+ {"tag": "colors", "include": [{"class": "motorcycle", "count": 1, "color": "green"}], "prompt": "a photo of a green motorcycle"}
340
+ {"tag": "colors", "include": [{"class": "stop sign", "count": 1, "color": "pink"}], "prompt": "a photo of a pink stop sign"}
341
+ {"tag": "colors", "include": [{"class": "vase", "count": 1, "color": "black"}], "prompt": "a photo of a black vase"}
342
+ {"tag": "colors", "include": [{"class": "backpack", "count": 1, "color": "black"}], "prompt": "a photo of a black backpack"}
343
+ {"tag": "colors", "include": [{"class": "car", "count": 1, "color": "red"}], "prompt": "a photo of a red car"}
344
+ {"tag": "colors", "include": [{"class": "computer mouse", "count": 1, "color": "green"}], "prompt": "a photo of a green computer mouse"}
345
+ {"tag": "colors", "include": [{"class": "backpack", "count": 1, "color": "red"}], "prompt": "a photo of a red backpack"}
346
+ {"tag": "colors", "include": [{"class": "bus", "count": 1, "color": "green"}], "prompt": "a photo of a green bus"}
347
+ {"tag": "colors", "include": [{"class": "toaster", "count": 1, "color": "orange"}], "prompt": "a photo of an orange toaster"}
348
+ {"tag": "colors", "include": [{"class": "fork", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow fork"}
349
+ {"tag": "colors", "include": [{"class": "parking meter", "count": 1, "color": "pink"}], "prompt": "a photo of a pink parking meter"}
350
+ {"tag": "colors", "include": [{"class": "book", "count": 1, "color": "blue"}], "prompt": "a photo of a blue book"}
351
+ {"tag": "colors", "include": [{"class": "broccoli", "count": 1, "color": "yellow"}], "prompt": "a photo of a yellow broccoli"}
352
+ {"tag": "colors", "include": [{"class": "computer mouse", "count": 1, "color": "orange"}], "prompt": "a photo of an orange computer mouse"}
353
+ {"tag": "colors", "include": [{"class": "cake", "count": 1, "color": "red"}], "prompt": "a photo of a red cake"}
354
+ {"tag": "position", "include": [{"class": "teddy bear", "count": 1}, {"class": "dog", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a dog right of a teddy bear"}
355
+ {"tag": "position", "include": [{"class": "kite", "count": 1}, {"class": "wine glass", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a wine glass above a kite"}
356
+ {"tag": "position", "include": [{"class": "cup", "count": 1}, {"class": "couch", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a couch below a cup"}
357
+ {"tag": "position", "include": [{"class": "cow", "count": 1}, {"class": "laptop", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a laptop left of a cow"}
358
+ {"tag": "position", "include": [{"class": "hair drier", "count": 1}, {"class": "fork", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a fork above a hair drier"}
359
+ {"tag": "position", "include": [{"class": "baseball bat", "count": 1}, {"class": "tie", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a tie right of a baseball bat"}
360
+ {"tag": "position", "include": [{"class": "fork", "count": 1}, {"class": "stop sign", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a stop sign above a fork"}
361
+ {"tag": "position", "include": [{"class": "skateboard", "count": 1}, {"class": "bird", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a bird below a skateboard"}
362
+ {"tag": "position", "include": [{"class": "tv", "count": 1}, {"class": "apple", "count": 1, "position": ["above", 0]}], "prompt": "a photo of an apple above a tv"}
363
+ {"tag": "position", "include": [{"class": "potted plant", "count": 1}, {"class": "train", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a train above a potted plant"}
364
+ {"tag": "position", "include": [{"class": "refrigerator", "count": 1}, {"class": "truck", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a truck left of a refrigerator"}
365
+ {"tag": "position", "include": [{"class": "cow", "count": 1}, {"class": "tv remote", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a tv remote below a cow"}
366
+ {"tag": "position", "include": [{"class": "train", "count": 1}, {"class": "bottle", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a bottle right of a train"}
367
+ {"tag": "position", "include": [{"class": "cow", "count": 1}, {"class": "dog", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a dog above a cow"}
368
+ {"tag": "position", "include": [{"class": "person", "count": 1}, {"class": "skateboard", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a skateboard above a person"}
369
+ {"tag": "position", "include": [{"class": "umbrella", "count": 1}, {"class": "baseball glove", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a baseball glove below an umbrella"}
370
+ {"tag": "position", "include": [{"class": "oven", "count": 1}, {"class": "dining table", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a dining table right of an oven"}
371
+ {"tag": "position", "include": [{"class": "suitcase", "count": 1}, {"class": "hot dog", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a hot dog left of a suitcase"}
372
+ {"tag": "position", "include": [{"class": "toothbrush", "count": 1}, {"class": "bus", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a bus below a toothbrush"}
373
+ {"tag": "position", "include": [{"class": "sandwich", "count": 1}, {"class": "backpack", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a backpack right of a sandwich"}
374
+ {"tag": "position", "include": [{"class": "baseball bat", "count": 1}, {"class": "cake", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a cake below a baseball bat"}
375
+ {"tag": "position", "include": [{"class": "tie", "count": 1}, {"class": "dog", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a dog right of a tie"}
376
+ {"tag": "position", "include": [{"class": "boat", "count": 1}, {"class": "suitcase", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a suitcase right of a boat"}
377
+ {"tag": "position", "include": [{"class": "clock", "count": 1}, {"class": "bear", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a bear above a clock"}
378
+ {"tag": "position", "include": [{"class": "umbrella", "count": 1}, {"class": "tv remote", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a tv remote left of an umbrella"}
379
+ {"tag": "position", "include": [{"class": "umbrella", "count": 1}, {"class": "sports ball", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a sports ball left of an umbrella"}
380
+ {"tag": "position", "include": [{"class": "dining table", "count": 1}, {"class": "train", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a train right of a dining table"}
381
+ {"tag": "position", "include": [{"class": "elephant", "count": 1}, {"class": "hair drier", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a hair drier below an elephant"}
382
+ {"tag": "position", "include": [{"class": "spoon", "count": 1}, {"class": "tennis racket", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a tennis racket right of a spoon"}
383
+ {"tag": "position", "include": [{"class": "hot dog", "count": 1}, {"class": "wine glass", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a wine glass right of a hot dog"}
384
+ {"tag": "position", "include": [{"class": "bench", "count": 1}, {"class": "computer mouse", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a computer mouse left of a bench"}
385
+ {"tag": "position", "include": [{"class": "orange", "count": 1}, {"class": "carrot", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a carrot left of an orange"}
386
+ {"tag": "position", "include": [{"class": "toothbrush", "count": 1}, {"class": "kite", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a kite above a toothbrush"}
387
+ {"tag": "position", "include": [{"class": "traffic light", "count": 1}, {"class": "toaster", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a toaster below a traffic light"}
388
+ {"tag": "position", "include": [{"class": "baseball glove", "count": 1}, {"class": "cat", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a cat below a baseball glove"}
389
+ {"tag": "position", "include": [{"class": "zebra", "count": 1}, {"class": "skis", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a skis right of a zebra"}
390
+ {"tag": "position", "include": [{"class": "chair", "count": 1}, {"class": "stop sign", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a stop sign above a chair"}
391
+ {"tag": "position", "include": [{"class": "parking meter", "count": 1}, {"class": "stop sign", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a stop sign above a parking meter"}
392
+ {"tag": "position", "include": [{"class": "skateboard", "count": 1}, {"class": "hot dog", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a hot dog right of a skateboard"}
393
+ {"tag": "position", "include": [{"class": "computer keyboard", "count": 1}, {"class": "pizza", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a pizza below a computer keyboard"}
394
+ {"tag": "position", "include": [{"class": "toilet", "count": 1}, {"class": "hair drier", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a hair drier left of a toilet"}
395
+ {"tag": "position", "include": [{"class": "stop sign", "count": 1}, {"class": "cow", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a cow left of a stop sign"}
396
+ {"tag": "position", "include": [{"class": "skis", "count": 1}, {"class": "suitcase", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a suitcase above a skis"}
397
+ {"tag": "position", "include": [{"class": "laptop", "count": 1}, {"class": "book", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a book above a laptop"}
398
+ {"tag": "position", "include": [{"class": "pizza", "count": 1}, {"class": "toothbrush", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a toothbrush below a pizza"}
399
+ {"tag": "position", "include": [{"class": "kite", "count": 1}, {"class": "toilet", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a toilet left of a kite"}
400
+ {"tag": "position", "include": [{"class": "sink", "count": 1}, {"class": "tie", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a tie above a sink"}
401
+ {"tag": "position", "include": [{"class": "couch", "count": 1}, {"class": "bird", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a bird left of a couch"}
402
+ {"tag": "position", "include": [{"class": "sports ball", "count": 1}, {"class": "bed", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a bed right of a sports ball"}
403
+ {"tag": "position", "include": [{"class": "surfboard", "count": 1}, {"class": "elephant", "count": 1, "position": ["below", 0]}], "prompt": "a photo of an elephant below a surfboard"}
404
+ {"tag": "position", "include": [{"class": "motorcycle", "count": 1}, {"class": "frisbee", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a frisbee right of a motorcycle"}
405
+ {"tag": "position", "include": [{"class": "fire hydrant", "count": 1}, {"class": "vase", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a vase above a fire hydrant"}
406
+ {"tag": "position", "include": [{"class": "elephant", "count": 1}, {"class": "zebra", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a zebra left of an elephant"}
407
+ {"tag": "position", "include": [{"class": "bear", "count": 1}, {"class": "bench", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a bench left of a bear"}
408
+ {"tag": "position", "include": [{"class": "bench", "count": 1}, {"class": "donut", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a donut right of a bench"}
409
+ {"tag": "position", "include": [{"class": "horse", "count": 1}, {"class": "frisbee", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a frisbee below a horse"}
410
+ {"tag": "position", "include": [{"class": "snowboard", "count": 1}, {"class": "computer keyboard", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a computer keyboard above a snowboard"}
411
+ {"tag": "position", "include": [{"class": "cow", "count": 1}, {"class": "tv", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a tv below a cow"}
412
+ {"tag": "position", "include": [{"class": "horse", "count": 1}, {"class": "elephant", "count": 1, "position": ["below", 0]}], "prompt": "a photo of an elephant below a horse"}
413
+ {"tag": "position", "include": [{"class": "banana", "count": 1}, {"class": "suitcase", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a suitcase left of a banana"}
414
+ {"tag": "position", "include": [{"class": "airplane", "count": 1}, {"class": "train", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a train below an airplane"}
415
+ {"tag": "position", "include": [{"class": "backpack", "count": 1}, {"class": "cat", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a cat below a backpack"}
416
+ {"tag": "position", "include": [{"class": "cake", "count": 1}, {"class": "backpack", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a backpack below a cake"}
417
+ {"tag": "position", "include": [{"class": "knife", "count": 1}, {"class": "sandwich", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a sandwich below a knife"}
418
+ {"tag": "position", "include": [{"class": "parking meter", "count": 1}, {"class": "bicycle", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a bicycle above a parking meter"}
419
+ {"tag": "position", "include": [{"class": "suitcase", "count": 1}, {"class": "knife", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a knife right of a suitcase"}
420
+ {"tag": "position", "include": [{"class": "knife", "count": 1}, {"class": "hot dog", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a hot dog above a knife"}
421
+ {"tag": "position", "include": [{"class": "parking meter", "count": 1}, {"class": "zebra", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a zebra right of a parking meter"}
422
+ {"tag": "position", "include": [{"class": "zebra", "count": 1}, {"class": "chair", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a chair left of a zebra"}
423
+ {"tag": "position", "include": [{"class": "airplane", "count": 1}, {"class": "cow", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a cow below an airplane"}
424
+ {"tag": "position", "include": [{"class": "umbrella", "count": 1}, {"class": "cup", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a cup left of an umbrella"}
425
+ {"tag": "position", "include": [{"class": "computer keyboard", "count": 1}, {"class": "zebra", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a zebra below a computer keyboard"}
426
+ {"tag": "position", "include": [{"class": "broccoli", "count": 1}, {"class": "zebra", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a zebra below a broccoli"}
427
+ {"tag": "position", "include": [{"class": "sports ball", "count": 1}, {"class": "laptop", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a laptop below a sports ball"}
428
+ {"tag": "position", "include": [{"class": "baseball bat", "count": 1}, {"class": "truck", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a truck left of a baseball bat"}
429
+ {"tag": "position", "include": [{"class": "baseball bat", "count": 1}, {"class": "refrigerator", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a refrigerator above a baseball bat"}
430
+ {"tag": "position", "include": [{"class": "baseball bat", "count": 1}, {"class": "tv", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a tv above a baseball bat"}
431
+ {"tag": "position", "include": [{"class": "bear", "count": 1}, {"class": "baseball glove", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a baseball glove right of a bear"}
432
+ {"tag": "position", "include": [{"class": "scissors", "count": 1}, {"class": "refrigerator", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a refrigerator below a scissors"}
433
+ {"tag": "position", "include": [{"class": "suitcase", "count": 1}, {"class": "dining table", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a dining table above a suitcase"}
434
+ {"tag": "position", "include": [{"class": "broccoli", "count": 1}, {"class": "parking meter", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a parking meter above a broccoli"}
435
+ {"tag": "position", "include": [{"class": "truck", "count": 1}, {"class": "frisbee", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a frisbee above a truck"}
436
+ {"tag": "position", "include": [{"class": "banana", "count": 1}, {"class": "pizza", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a pizza right of a banana"}
437
+ {"tag": "position", "include": [{"class": "boat", "count": 1}, {"class": "bus", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a bus above a boat"}
438
+ {"tag": "position", "include": [{"class": "tennis racket", "count": 1}, {"class": "cell phone", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a cell phone left of a tennis racket"}
439
+ {"tag": "position", "include": [{"class": "broccoli", "count": 1}, {"class": "horse", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a horse right of a broccoli"}
440
+ {"tag": "position", "include": [{"class": "bottle", "count": 1}, {"class": "broccoli", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a broccoli above a bottle"}
441
+ {"tag": "position", "include": [{"class": "horse", "count": 1}, {"class": "vase", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a vase right of a horse"}
442
+ {"tag": "position", "include": [{"class": "spoon", "count": 1}, {"class": "bear", "count": 1, "position": ["above", 0]}], "prompt": "a photo of a bear above a spoon"}
443
+ {"tag": "position", "include": [{"class": "bed", "count": 1}, {"class": "zebra", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a zebra right of a bed"}
444
+ {"tag": "position", "include": [{"class": "laptop", "count": 1}, {"class": "cow", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a cow right of a laptop"}
445
+ {"tag": "position", "include": [{"class": "frisbee", "count": 1}, {"class": "bed", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a bed right of a frisbee"}
446
+ {"tag": "position", "include": [{"class": "motorcycle", "count": 1}, {"class": "tie", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a tie right of a motorcycle"}
447
+ {"tag": "position", "include": [{"class": "tv", "count": 1}, {"class": "laptop", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a laptop right of a tv"}
448
+ {"tag": "position", "include": [{"class": "chair", "count": 1}, {"class": "cell phone", "count": 1, "position": ["right of", 0]}], "prompt": "a photo of a cell phone right of a chair"}
449
+ {"tag": "position", "include": [{"class": "potted plant", "count": 1}, {"class": "couch", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a couch below a potted plant"}
450
+ {"tag": "position", "include": [{"class": "tv", "count": 1}, {"class": "clock", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a clock below a tv"}
451
+ {"tag": "position", "include": [{"class": "vase", "count": 1}, {"class": "couch", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a couch below a vase"}
452
+ {"tag": "position", "include": [{"class": "cat", "count": 1}, {"class": "donut", "count": 1, "position": ["below", 0]}], "prompt": "a photo of a donut below a cat"}
453
+ {"tag": "position", "include": [{"class": "toaster", "count": 1}, {"class": "couch", "count": 1, "position": ["left of", 0]}], "prompt": "a photo of a couch left of a toaster"}
454
+ {"tag": "color_attr", "include": [{"class": "wine glass", "count": 1, "color": "purple"}, {"class": "apple", "count": 1, "color": "black"}], "prompt": "a photo of a purple wine glass and a black apple"}
455
+ {"tag": "color_attr", "include": [{"class": "bus", "count": 1, "color": "green"}, {"class": "microwave", "count": 1, "color": "purple"}], "prompt": "a photo of a green bus and a purple microwave"}
456
+ {"tag": "color_attr", "include": [{"class": "skis", "count": 1, "color": "green"}, {"class": "airplane", "count": 1, "color": "brown"}], "prompt": "a photo of a green skis and a brown airplane"}
457
+ {"tag": "color_attr", "include": [{"class": "computer keyboard", "count": 1, "color": "yellow"}, {"class": "sink", "count": 1, "color": "black"}], "prompt": "a photo of a yellow computer keyboard and a black sink"}
458
+ {"tag": "color_attr", "include": [{"class": "oven", "count": 1, "color": "pink"}, {"class": "motorcycle", "count": 1, "color": "green"}], "prompt": "a photo of a pink oven and a green motorcycle"}
459
+ {"tag": "color_attr", "include": [{"class": "parking meter", "count": 1, "color": "purple"}, {"class": "laptop", "count": 1, "color": "red"}], "prompt": "a photo of a purple parking meter and a red laptop"}
460
+ {"tag": "color_attr", "include": [{"class": "skateboard", "count": 1, "color": "yellow"}, {"class": "computer mouse", "count": 1, "color": "orange"}], "prompt": "a photo of a yellow skateboard and an orange computer mouse"}
461
+ {"tag": "color_attr", "include": [{"class": "skis", "count": 1, "color": "red"}, {"class": "tie", "count": 1, "color": "brown"}], "prompt": "a photo of a red skis and a brown tie"}
462
+ {"tag": "color_attr", "include": [{"class": "skateboard", "count": 1, "color": "pink"}, {"class": "train", "count": 1, "color": "black"}], "prompt": "a photo of a pink skateboard and a black train"}
463
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "white"}, {"class": "bed", "count": 1, "color": "purple"}], "prompt": "a photo of a white handbag and a purple bed"}
464
+ {"tag": "color_attr", "include": [{"class": "elephant", "count": 1, "color": "purple"}, {"class": "sports ball", "count": 1, "color": "brown"}], "prompt": "a photo of a purple elephant and a brown sports ball"}
465
+ {"tag": "color_attr", "include": [{"class": "dog", "count": 1, "color": "purple"}, {"class": "dining table", "count": 1, "color": "black"}], "prompt": "a photo of a purple dog and a black dining table"}
466
+ {"tag": "color_attr", "include": [{"class": "dining table", "count": 1, "color": "white"}, {"class": "car", "count": 1, "color": "red"}], "prompt": "a photo of a white dining table and a red car"}
467
+ {"tag": "color_attr", "include": [{"class": "cell phone", "count": 1, "color": "blue"}, {"class": "apple", "count": 1, "color": "green"}], "prompt": "a photo of a blue cell phone and a green apple"}
468
+ {"tag": "color_attr", "include": [{"class": "car", "count": 1, "color": "red"}, {"class": "potted plant", "count": 1, "color": "orange"}], "prompt": "a photo of a red car and an orange potted plant"}
469
+ {"tag": "color_attr", "include": [{"class": "carrot", "count": 1, "color": "brown"}, {"class": "potted plant", "count": 1, "color": "white"}], "prompt": "a photo of a brown carrot and a white potted plant"}
470
+ {"tag": "color_attr", "include": [{"class": "kite", "count": 1, "color": "black"}, {"class": "bear", "count": 1, "color": "green"}], "prompt": "a photo of a black kite and a green bear"}
471
+ {"tag": "color_attr", "include": [{"class": "laptop", "count": 1, "color": "blue"}, {"class": "bear", "count": 1, "color": "brown"}], "prompt": "a photo of a blue laptop and a brown bear"}
472
+ {"tag": "color_attr", "include": [{"class": "teddy bear", "count": 1, "color": "green"}, {"class": "kite", "count": 1, "color": "brown"}], "prompt": "a photo of a green teddy bear and a brown kite"}
473
+ {"tag": "color_attr", "include": [{"class": "stop sign", "count": 1, "color": "yellow"}, {"class": "potted plant", "count": 1, "color": "blue"}], "prompt": "a photo of a yellow stop sign and a blue potted plant"}
474
+ {"tag": "color_attr", "include": [{"class": "snowboard", "count": 1, "color": "orange"}, {"class": "cat", "count": 1, "color": "green"}], "prompt": "a photo of an orange snowboard and a green cat"}
475
+ {"tag": "color_attr", "include": [{"class": "truck", "count": 1, "color": "orange"}, {"class": "sink", "count": 1, "color": "pink"}], "prompt": "a photo of an orange truck and a pink sink"}
476
+ {"tag": "color_attr", "include": [{"class": "hot dog", "count": 1, "color": "brown"}, {"class": "pizza", "count": 1, "color": "purple"}], "prompt": "a photo of a brown hot dog and a purple pizza"}
477
+ {"tag": "color_attr", "include": [{"class": "couch", "count": 1, "color": "green"}, {"class": "umbrella", "count": 1, "color": "orange"}], "prompt": "a photo of a green couch and an orange umbrella"}
478
+ {"tag": "color_attr", "include": [{"class": "bed", "count": 1, "color": "brown"}, {"class": "cell phone", "count": 1, "color": "pink"}], "prompt": "a photo of a brown bed and a pink cell phone"}
479
+ {"tag": "color_attr", "include": [{"class": "broccoli", "count": 1, "color": "black"}, {"class": "cake", "count": 1, "color": "yellow"}], "prompt": "a photo of a black broccoli and a yellow cake"}
480
+ {"tag": "color_attr", "include": [{"class": "train", "count": 1, "color": "red"}, {"class": "bear", "count": 1, "color": "purple"}], "prompt": "a photo of a red train and a purple bear"}
481
+ {"tag": "color_attr", "include": [{"class": "tennis racket", "count": 1, "color": "purple"}, {"class": "sink", "count": 1, "color": "black"}], "prompt": "a photo of a purple tennis racket and a black sink"}
482
+ {"tag": "color_attr", "include": [{"class": "vase", "count": 1, "color": "blue"}, {"class": "banana", "count": 1, "color": "black"}], "prompt": "a photo of a blue vase and a black banana"}
483
+ {"tag": "color_attr", "include": [{"class": "clock", "count": 1, "color": "blue"}, {"class": "cup", "count": 1, "color": "white"}], "prompt": "a photo of a blue clock and a white cup"}
484
+ {"tag": "color_attr", "include": [{"class": "umbrella", "count": 1, "color": "red"}, {"class": "couch", "count": 1, "color": "blue"}], "prompt": "a photo of a red umbrella and a blue couch"}
485
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "white"}, {"class": "giraffe", "count": 1, "color": "red"}], "prompt": "a photo of a white handbag and a red giraffe"}
486
+ {"tag": "color_attr", "include": [{"class": "tv remote", "count": 1, "color": "pink"}, {"class": "airplane", "count": 1, "color": "blue"}], "prompt": "a photo of a pink tv remote and a blue airplane"}
487
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "pink"}, {"class": "scissors", "count": 1, "color": "black"}], "prompt": "a photo of a pink handbag and a black scissors"}
488
+ {"tag": "color_attr", "include": [{"class": "car", "count": 1, "color": "brown"}, {"class": "hair drier", "count": 1, "color": "pink"}], "prompt": "a photo of a brown car and a pink hair drier"}
489
+ {"tag": "color_attr", "include": [{"class": "bus", "count": 1, "color": "black"}, {"class": "cell phone", "count": 1, "color": "brown"}], "prompt": "a photo of a black bus and a brown cell phone"}
490
+ {"tag": "color_attr", "include": [{"class": "sheep", "count": 1, "color": "purple"}, {"class": "banana", "count": 1, "color": "pink"}], "prompt": "a photo of a purple sheep and a pink banana"}
491
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "blue"}, {"class": "cell phone", "count": 1, "color": "white"}], "prompt": "a photo of a blue handbag and a white cell phone"}
492
+ {"tag": "color_attr", "include": [{"class": "pizza", "count": 1, "color": "white"}, {"class": "umbrella", "count": 1, "color": "green"}], "prompt": "a photo of a white pizza and a green umbrella"}
493
+ {"tag": "color_attr", "include": [{"class": "tie", "count": 1, "color": "white"}, {"class": "skateboard", "count": 1, "color": "purple"}], "prompt": "a photo of a white tie and a purple skateboard"}
494
+ {"tag": "color_attr", "include": [{"class": "sports ball", "count": 1, "color": "yellow"}, {"class": "boat", "count": 1, "color": "green"}], "prompt": "a photo of a yellow sports ball and a green boat"}
495
+ {"tag": "color_attr", "include": [{"class": "wine glass", "count": 1, "color": "white"}, {"class": "giraffe", "count": 1, "color": "brown"}], "prompt": "a photo of a white wine glass and a brown giraffe"}
496
+ {"tag": "color_attr", "include": [{"class": "bowl", "count": 1, "color": "yellow"}, {"class": "baseball glove", "count": 1, "color": "white"}], "prompt": "a photo of a yellow bowl and a white baseball glove"}
497
+ {"tag": "color_attr", "include": [{"class": "microwave", "count": 1, "color": "orange"}, {"class": "spoon", "count": 1, "color": "black"}], "prompt": "a photo of an orange microwave and a black spoon"}
498
+ {"tag": "color_attr", "include": [{"class": "skateboard", "count": 1, "color": "orange"}, {"class": "bowl", "count": 1, "color": "pink"}], "prompt": "a photo of an orange skateboard and a pink bowl"}
499
+ {"tag": "color_attr", "include": [{"class": "toilet", "count": 1, "color": "blue"}, {"class": "suitcase", "count": 1, "color": "white"}], "prompt": "a photo of a blue toilet and a white suitcase"}
500
+ {"tag": "color_attr", "include": [{"class": "boat", "count": 1, "color": "white"}, {"class": "hot dog", "count": 1, "color": "orange"}], "prompt": "a photo of a white boat and an orange hot dog"}
501
+ {"tag": "color_attr", "include": [{"class": "dining table", "count": 1, "color": "yellow"}, {"class": "dog", "count": 1, "color": "pink"}], "prompt": "a photo of a yellow dining table and a pink dog"}
502
+ {"tag": "color_attr", "include": [{"class": "cake", "count": 1, "color": "red"}, {"class": "chair", "count": 1, "color": "purple"}], "prompt": "a photo of a red cake and a purple chair"}
503
+ {"tag": "color_attr", "include": [{"class": "tie", "count": 1, "color": "blue"}, {"class": "dining table", "count": 1, "color": "pink"}], "prompt": "a photo of a blue tie and a pink dining table"}
504
+ {"tag": "color_attr", "include": [{"class": "cow", "count": 1, "color": "blue"}, {"class": "computer keyboard", "count": 1, "color": "black"}], "prompt": "a photo of a blue cow and a black computer keyboard"}
505
+ {"tag": "color_attr", "include": [{"class": "pizza", "count": 1, "color": "yellow"}, {"class": "oven", "count": 1, "color": "green"}], "prompt": "a photo of a yellow pizza and a green oven"}
506
+ {"tag": "color_attr", "include": [{"class": "laptop", "count": 1, "color": "red"}, {"class": "car", "count": 1, "color": "brown"}], "prompt": "a photo of a red laptop and a brown car"}
507
+ {"tag": "color_attr", "include": [{"class": "computer keyboard", "count": 1, "color": "purple"}, {"class": "scissors", "count": 1, "color": "blue"}], "prompt": "a photo of a purple computer keyboard and a blue scissors"}
508
+ {"tag": "color_attr", "include": [{"class": "surfboard", "count": 1, "color": "green"}, {"class": "oven", "count": 1, "color": "orange"}], "prompt": "a photo of a green surfboard and an orange oven"}
509
+ {"tag": "color_attr", "include": [{"class": "parking meter", "count": 1, "color": "yellow"}, {"class": "refrigerator", "count": 1, "color": "pink"}], "prompt": "a photo of a yellow parking meter and a pink refrigerator"}
510
+ {"tag": "color_attr", "include": [{"class": "computer mouse", "count": 1, "color": "brown"}, {"class": "bottle", "count": 1, "color": "purple"}], "prompt": "a photo of a brown computer mouse and a purple bottle"}
511
+ {"tag": "color_attr", "include": [{"class": "umbrella", "count": 1, "color": "red"}, {"class": "cow", "count": 1, "color": "green"}], "prompt": "a photo of a red umbrella and a green cow"}
512
+ {"tag": "color_attr", "include": [{"class": "giraffe", "count": 1, "color": "red"}, {"class": "cell phone", "count": 1, "color": "black"}], "prompt": "a photo of a red giraffe and a black cell phone"}
513
+ {"tag": "color_attr", "include": [{"class": "oven", "count": 1, "color": "brown"}, {"class": "train", "count": 1, "color": "purple"}], "prompt": "a photo of a brown oven and a purple train"}
514
+ {"tag": "color_attr", "include": [{"class": "baseball bat", "count": 1, "color": "blue"}, {"class": "book", "count": 1, "color": "pink"}], "prompt": "a photo of a blue baseball bat and a pink book"}
515
+ {"tag": "color_attr", "include": [{"class": "cup", "count": 1, "color": "green"}, {"class": "bowl", "count": 1, "color": "yellow"}], "prompt": "a photo of a green cup and a yellow bowl"}
516
+ {"tag": "color_attr", "include": [{"class": "suitcase", "count": 1, "color": "yellow"}, {"class": "bus", "count": 1, "color": "brown"}], "prompt": "a photo of a yellow suitcase and a brown bus"}
517
+ {"tag": "color_attr", "include": [{"class": "motorcycle", "count": 1, "color": "orange"}, {"class": "donut", "count": 1, "color": "pink"}], "prompt": "a photo of an orange motorcycle and a pink donut"}
518
+ {"tag": "color_attr", "include": [{"class": "giraffe", "count": 1, "color": "orange"}, {"class": "baseball glove", "count": 1, "color": "white"}], "prompt": "a photo of an orange giraffe and a white baseball glove"}
519
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "orange"}, {"class": "carrot", "count": 1, "color": "green"}], "prompt": "a photo of an orange handbag and a green carrot"}
520
+ {"tag": "color_attr", "include": [{"class": "bottle", "count": 1, "color": "black"}, {"class": "refrigerator", "count": 1, "color": "white"}], "prompt": "a photo of a black bottle and a white refrigerator"}
521
+ {"tag": "color_attr", "include": [{"class": "dog", "count": 1, "color": "white"}, {"class": "potted plant", "count": 1, "color": "blue"}], "prompt": "a photo of a white dog and a blue potted plant"}
522
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "orange"}, {"class": "car", "count": 1, "color": "red"}], "prompt": "a photo of an orange handbag and a red car"}
523
+ {"tag": "color_attr", "include": [{"class": "stop sign", "count": 1, "color": "red"}, {"class": "book", "count": 1, "color": "blue"}], "prompt": "a photo of a red stop sign and a blue book"}
524
+ {"tag": "color_attr", "include": [{"class": "car", "count": 1, "color": "yellow"}, {"class": "toothbrush", "count": 1, "color": "orange"}], "prompt": "a photo of a yellow car and an orange toothbrush"}
525
+ {"tag": "color_attr", "include": [{"class": "potted plant", "count": 1, "color": "black"}, {"class": "toilet", "count": 1, "color": "yellow"}], "prompt": "a photo of a black potted plant and a yellow toilet"}
526
+ {"tag": "color_attr", "include": [{"class": "dining table", "count": 1, "color": "brown"}, {"class": "suitcase", "count": 1, "color": "white"}], "prompt": "a photo of a brown dining table and a white suitcase"}
527
+ {"tag": "color_attr", "include": [{"class": "donut", "count": 1, "color": "orange"}, {"class": "stop sign", "count": 1, "color": "yellow"}], "prompt": "a photo of an orange donut and a yellow stop sign"}
528
+ {"tag": "color_attr", "include": [{"class": "suitcase", "count": 1, "color": "green"}, {"class": "boat", "count": 1, "color": "blue"}], "prompt": "a photo of a green suitcase and a blue boat"}
529
+ {"tag": "color_attr", "include": [{"class": "tennis racket", "count": 1, "color": "orange"}, {"class": "sports ball", "count": 1, "color": "yellow"}], "prompt": "a photo of an orange tennis racket and a yellow sports ball"}
530
+ {"tag": "color_attr", "include": [{"class": "computer keyboard", "count": 1, "color": "purple"}, {"class": "chair", "count": 1, "color": "red"}], "prompt": "a photo of a purple computer keyboard and a red chair"}
531
+ {"tag": "color_attr", "include": [{"class": "suitcase", "count": 1, "color": "purple"}, {"class": "pizza", "count": 1, "color": "orange"}], "prompt": "a photo of a purple suitcase and an orange pizza"}
532
+ {"tag": "color_attr", "include": [{"class": "bottle", "count": 1, "color": "white"}, {"class": "sheep", "count": 1, "color": "blue"}], "prompt": "a photo of a white bottle and a blue sheep"}
533
+ {"tag": "color_attr", "include": [{"class": "backpack", "count": 1, "color": "purple"}, {"class": "umbrella", "count": 1, "color": "white"}], "prompt": "a photo of a purple backpack and a white umbrella"}
534
+ {"tag": "color_attr", "include": [{"class": "potted plant", "count": 1, "color": "orange"}, {"class": "spoon", "count": 1, "color": "black"}], "prompt": "a photo of an orange potted plant and a black spoon"}
535
+ {"tag": "color_attr", "include": [{"class": "tennis racket", "count": 1, "color": "green"}, {"class": "dog", "count": 1, "color": "black"}], "prompt": "a photo of a green tennis racket and a black dog"}
536
+ {"tag": "color_attr", "include": [{"class": "handbag", "count": 1, "color": "yellow"}, {"class": "refrigerator", "count": 1, "color": "blue"}], "prompt": "a photo of a yellow handbag and a blue refrigerator"}
537
+ {"tag": "color_attr", "include": [{"class": "broccoli", "count": 1, "color": "pink"}, {"class": "sink", "count": 1, "color": "red"}], "prompt": "a photo of a pink broccoli and a red sink"}
538
+ {"tag": "color_attr", "include": [{"class": "bowl", "count": 1, "color": "red"}, {"class": "sink", "count": 1, "color": "pink"}], "prompt": "a photo of a red bowl and a pink sink"}
539
+ {"tag": "color_attr", "include": [{"class": "toilet", "count": 1, "color": "white"}, {"class": "apple", "count": 1, "color": "red"}], "prompt": "a photo of a white toilet and a red apple"}
540
+ {"tag": "color_attr", "include": [{"class": "dining table", "count": 1, "color": "pink"}, {"class": "sandwich", "count": 1, "color": "black"}], "prompt": "a photo of a pink dining table and a black sandwich"}
541
+ {"tag": "color_attr", "include": [{"class": "car", "count": 1, "color": "black"}, {"class": "parking meter", "count": 1, "color": "green"}], "prompt": "a photo of a black car and a green parking meter"}
542
+ {"tag": "color_attr", "include": [{"class": "bird", "count": 1, "color": "yellow"}, {"class": "motorcycle", "count": 1, "color": "black"}], "prompt": "a photo of a yellow bird and a black motorcycle"}
543
+ {"tag": "color_attr", "include": [{"class": "giraffe", "count": 1, "color": "brown"}, {"class": "stop sign", "count": 1, "color": "white"}], "prompt": "a photo of a brown giraffe and a white stop sign"}
544
+ {"tag": "color_attr", "include": [{"class": "banana", "count": 1, "color": "white"}, {"class": "elephant", "count": 1, "color": "black"}], "prompt": "a photo of a white banana and a black elephant"}
545
+ {"tag": "color_attr", "include": [{"class": "cow", "count": 1, "color": "orange"}, {"class": "sandwich", "count": 1, "color": "purple"}], "prompt": "a photo of an orange cow and a purple sandwich"}
546
+ {"tag": "color_attr", "include": [{"class": "clock", "count": 1, "color": "red"}, {"class": "cell phone", "count": 1, "color": "black"}], "prompt": "a photo of a red clock and a black cell phone"}
547
+ {"tag": "color_attr", "include": [{"class": "knife", "count": 1, "color": "brown"}, {"class": "donut", "count": 1, "color": "blue"}], "prompt": "a photo of a brown knife and a blue donut"}
548
+ {"tag": "color_attr", "include": [{"class": "cup", "count": 1, "color": "red"}, {"class": "handbag", "count": 1, "color": "pink"}], "prompt": "a photo of a red cup and a pink handbag"}
549
+ {"tag": "color_attr", "include": [{"class": "bicycle", "count": 1, "color": "yellow"}, {"class": "motorcycle", "count": 1, "color": "red"}], "prompt": "a photo of a yellow bicycle and a red motorcycle"}
550
+ {"tag": "color_attr", "include": [{"class": "orange", "count": 1, "color": "red"}, {"class": "broccoli", "count": 1, "color": "purple"}], "prompt": "a photo of a red orange and a purple broccoli"}
551
+ {"tag": "color_attr", "include": [{"class": "traffic light", "count": 1, "color": "orange"}, {"class": "toilet", "count": 1, "color": "white"}], "prompt": "a photo of an orange traffic light and a white toilet"}
552
+ {"tag": "color_attr", "include": [{"class": "cup", "count": 1, "color": "green"}, {"class": "pizza", "count": 1, "color": "red"}], "prompt": "a photo of a green cup and a red pizza"}
553
+ {"tag": "color_attr", "include": [{"class": "pizza", "count": 1, "color": "blue"}, {"class": "baseball glove", "count": 1, "color": "yellow"}], "prompt": "a photo of a blue pizza and a yellow baseball glove"}
eval/gen/geneval/prompts/evaluation_metadata_long.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
eval/gen/geneval/prompts/generation_prompts.txt ADDED
@@ -0,0 +1,553 @@
+ a photo of a bench
+ a photo of a cow
+ a photo of a bicycle
+ a photo of a clock
+ a photo of a carrot
+ a photo of a suitcase
+ a photo of a fork
+ a photo of a surfboard
+ a photo of a refrigerator
+ a photo of a cup
+ a photo of a microwave
+ a photo of a potted plant
+ a photo of a snowboard
+ a photo of a zebra
+ a photo of a parking meter
+ a photo of a spoon
+ a photo of a skateboard
+ a photo of a car
+ a photo of a motorcycle
+ a photo of a traffic light
+ a photo of a book
+ a photo of a couch
+ a photo of a backpack
+ a photo of a computer keyboard
+ a photo of a toaster
+ a photo of a bird
+ a photo of a bowl
+ a photo of a dog
+ a photo of a tie
+ a photo of a laptop
+ a photo of a computer mouse
+ a photo of a sandwich
+ a photo of a baseball bat
+ a photo of a train
+ a photo of a cell phone
+ a photo of a chair
+ a photo of a tv
+ a photo of a broccoli
+ a photo of a bed
+ a photo of a skis
+ a photo of a handbag
+ a photo of a pizza
+ a photo of a frisbee
+ a photo of a scissors
+ a photo of a bottle
+ a photo of an elephant
+ a photo of a toilet
+ a photo of an oven
+ a photo of an orange
+ a photo of a person
+ a photo of a teddy bear
+ a photo of a vase
+ a photo of a banana
+ a photo of a toothbrush
+ a photo of a tv remote
+ a photo of a dining table
+ a photo of a stop sign
+ a photo of a sheep
+ a photo of a fire hydrant
+ a photo of an airplane
+ a photo of a giraffe
+ a photo of a horse
+ a photo of a cat
+ a photo of a donut
+ a photo of a boat
+ a photo of a baseball glove
+ a photo of a hair drier
+ a photo of a sink
+ a photo of a cake
+ a photo of a wine glass
+ a photo of an apple
+ a photo of a bus
+ a photo of a tennis racket
+ a photo of a knife
+ a photo of a hot dog
+ a photo of a truck
+ a photo of an umbrella
+ a photo of a sports ball
+ a photo of a bear
+ a photo of a kite
+ a photo of a bench and a sports ball
+ a photo of a toothbrush and a snowboard
+ a photo of a toaster and an oven
+ a photo of a broccoli and a vase
+ a photo of a tennis racket and a wine glass
+ a photo of a fork and a knife
+ a photo of a hair drier and a cake
+ a photo of a horse and a giraffe
+ a photo of a horse and a computer keyboard
+ a photo of a toothbrush and a carrot
+ a photo of a cake and a zebra
+ a photo of a hair drier and a bear
+ a photo of a knife and a zebra
+ a photo of a couch and a wine glass
+ a photo of a frisbee and a vase
+ a photo of a book and a laptop
+ a photo of a dining table and a bear
+ a photo of a frisbee and a couch
+ a photo of a couch and a horse
+ a photo of a toilet and a computer mouse
+ a photo of a bottle and a refrigerator
+ a photo of a potted plant and a backpack
+ a photo of a skateboard and a cake
+ a photo of a broccoli and a parking meter
+ a photo of a zebra and a bed
+ a photo of an oven and a bed
+ a photo of a baseball bat and a fork
+ a photo of a vase and a spoon
+ a photo of a skateboard and a sink
+ a photo of a pizza and a bench
+ a photo of a bowl and a pizza
+ a photo of a tennis racket and a bird
+ a photo of a wine glass and a bear
+ a photo of a fork and a book
+ a photo of a scissors and a bowl
+ a photo of a laptop and a carrot
+ a photo of a stop sign and a bottle
+ a photo of a microwave and a truck
+ a photo of a person and a bear
+ a photo of a frisbee and a cell phone
+ a photo of a parking meter and a teddy bear
+ a photo of a tennis racket and a bicycle
+ a photo of a stop sign and a motorcycle
+ a photo of a fire hydrant and a tennis racket
+ a photo of a scissors and a sandwich
+ a photo of a pizza and a book
+ a photo of a giraffe and a computer mouse
+ a photo of a stop sign and a toaster
+ a photo of a computer mouse and a zebra
+ a photo of a chair and a bench
+ a photo of a tv and a carrot
+ a photo of a surfboard and a suitcase
+ a photo of a computer keyboard and a laptop
+ a photo of a computer keyboard and a microwave
+ a photo of a scissors and a bird
+ a photo of a person and a snowboard
+ a photo of a cow and a horse
+ a photo of a handbag and a refrigerator
+ a photo of a chair and a laptop
+ a photo of a toothbrush and a bench
+ a photo of a book and a baseball bat
+ a photo of a horse and a train
+ a photo of a bench and a vase
+ a photo of a traffic light and a backpack
+ a photo of a sports ball and a cow
+ a photo of a computer mouse and a spoon
+ a photo of a tv and a bicycle
+ a photo of a bench and a snowboard
+ a photo of a toothbrush and a toilet
+ a photo of a person and an apple
+ a photo of a sink and a sports ball
+ a photo of a stop sign and a dog
+ a photo of a knife and a stop sign
+ a photo of a wine glass and a handbag
+ a photo of a bowl and a skis
+ a photo of a frisbee and an apple
+ a photo of a computer keyboard and a cell phone
+ a photo of a stop sign and a fork
+ a photo of a potted plant and a boat
+ a photo of a tv and a cell phone
+ a photo of a tie and a broccoli
+ a photo of a potted plant and a donut
+ a photo of a person and a sink
+ a photo of a couch and a snowboard
+ a photo of a fork and a baseball glove
+ a photo of an apple and a toothbrush
+ a photo of a bus and a baseball glove
+ a photo of a person and a stop sign
+ a photo of a carrot and a couch
+ a photo of a baseball bat and a bear
+ a photo of a fire hydrant and a train
+ a photo of a baseball glove and a carrot
+ a photo of a microwave and a bench
+ a photo of a cake and a stop sign
+ a photo of a car and a computer mouse
+ a photo of a suitcase and a dining table
+ a photo of a person and a traffic light
+ a photo of a cell phone and a horse
+ a photo of a baseball bat and a giraffe
+ a photo of two clocks
+ a photo of two backpacks
+ a photo of four handbags
+ a photo of two frisbees
+ a photo of three sports balls
+ a photo of two bears
+ a photo of two ties
+ a photo of four sinks
+ a photo of two toothbrushs
+ a photo of three persons
+ a photo of three tennis rackets
+ a photo of four bowls
+ a photo of four vases
+ a photo of three cups
+ a photo of four computer keyboards
+ a photo of three sinks
+ a photo of two ovens
+ a photo of two toilets
+ a photo of two bicycles
+ a photo of two trains
+ a photo of three oranges
+ a photo of three buses
+ a photo of three handbags
+ a photo of three snowboards
+ a photo of two snowboards
+ a photo of four dogs
+ a photo of three apples
+ a photo of two sheeps
+ a photo of three hot dogs
+ a photo of three zebras
+ a photo of three kites
+ a photo of four apples
+ a photo of three cell phones
+ a photo of four baseball gloves
+ a photo of three computer keyboards
+ a photo of two beds
+ a photo of two tv remotes
+ a photo of three fire hydrants
+ a photo of three books
+ a photo of four giraffes
+ a photo of two vases
+ a photo of four donuts
+ a photo of four chairs
+ a photo of three baseball bats
+ a photo of four stop signs
+ a photo of two pizzas
+ a photo of three refrigerators
+ a photo of two fire hydrants
+ a photo of three giraffes
+ a photo of four tvs
+ a photo of three wine glasses
+ a photo of four broccolis
+ a photo of three trucks
+ a photo of two trucks
+ a photo of two carrots
+ a photo of two sandwichs
+ a photo of four traffic lights
+ a photo of four clocks
+ a photo of two cars
+ a photo of two bananas
+ a photo of two wine glasses
+ a photo of three pizzas
+ a photo of four knifes
+ a photo of three suitcases
+ a photo of four zebras
+ a photo of two teddy bears
+ a photo of four skateboards
+ a photo of four hot dogs
+ a photo of three birds
+ a photo of four boats
+ a photo of four microwaves
+ a photo of two hair driers
+ a photo of three laptops
+ a photo of three cows
+ a photo of two parking meters
+ a photo of four benchs
+ a photo of three benchs
+ a photo of four frisbees
+ a photo of four books
+ a photo of four buses
+ a photo of a blue fire hydrant
+ a photo of a pink car
+ a photo of a purple cup
+ a photo of a blue cow
+ a photo of a yellow boat
+ a photo of a blue umbrella
+ a photo of a blue elephant
+ a photo of a yellow elephant
+ a photo of a red bicycle
+ a photo of a purple suitcase
+ a photo of a purple hair drier
+ a photo of a white sandwich
+ a photo of a purple elephant
+ a photo of a green microwave
+ a photo of a red zebra
+ a photo of a red apple
+ a photo of a yellow tv remote
+ a photo of a blue toilet
+ a photo of an orange orange
+ a photo of a black donut
+ a photo of a red vase
+ a photo of a purple pizza
+ a photo of a pink skateboard
+ a photo of a green skateboard
+ a photo of a purple bear
+ a photo of a brown chair
+ a photo of a brown computer keyboard
+ a photo of an orange cow
+ a photo of a brown skis
+ a photo of a white kite
+ a photo of a red dog
+ a photo of a green couch
+ a photo of a yellow airplane
+ a photo of an orange tv
+ a photo of a white scissors
+ a photo of a pink cell phone
+ a photo of a green surfboard
+ a photo of a white fire hydrant
+ a photo of a black bicycle
+ a photo of a purple carrot
+ a photo of a black dining table
+ a photo of a purple potted plant
+ a photo of a purple backpack
+ a photo of a yellow train
+ a photo of a pink potted plant
+ a photo of a red giraffe
+ a photo of a brown bear
+ a photo of a black train
+ a photo of an orange laptop
+ a photo of a green hot dog
+ a photo of a yellow parking meter
+ a photo of a red potted plant
+ a photo of a green traffic light
+ a photo of a blue tv
+ a photo of a brown refrigerator
+ a photo of a black tv remote
+ a photo of a purple scissors
+ a photo of a yellow orange
+ a photo of a brown toaster
+ a photo of a red parking meter
+ a photo of a brown orange
+ a photo of a green clock
+ a photo of a white sheep
+ a photo of a yellow oven
+ a photo of a green vase
+ a photo of a black teddy bear
+ a photo of a yellow carrot
+ a photo of a black hot dog
+ a photo of a red scissors
+ a photo of a white teddy bear
+ a photo of a black skis
+ a photo of a blue dining table
+ a photo of a black refrigerator
+ a photo of a white dog
+ a photo of an orange scissors
+ a photo of a red cell phone
+ a photo of a white orange
+ a photo of a blue clock
+ a photo of a blue carrot
+ a photo of a green motorcycle
+ a photo of a pink stop sign
+ a photo of a black vase
+ a photo of a black backpack
+ a photo of a red car
+ a photo of a green computer mouse
+ a photo of a red backpack
+ a photo of a green bus
+ a photo of an orange toaster
+ a photo of a yellow fork
+ a photo of a pink parking meter
+ a photo of a blue book
+ a photo of a yellow broccoli
+ a photo of an orange computer mouse
+ a photo of a red cake
+ a photo of a dog right of a teddy bear
+ a photo of a wine glass above a kite
+ a photo of a couch below a cup
+ a photo of a laptop left of a cow
+ a photo of a fork above a hair drier
+ a photo of a tie right of a baseball bat
+ a photo of a stop sign above a fork
+ a photo of a bird below a skateboard
+ a photo of an apple above a tv
+ a photo of a train above a potted plant
+ a photo of a truck left of a refrigerator
+ a photo of a tv remote below a cow
+ a photo of a bottle right of a train
+ a photo of a dog above a cow
+ a photo of a skateboard above a person
+ a photo of a baseball glove below an umbrella
+ a photo of a dining table right of an oven
+ a photo of a hot dog left of a suitcase
+ a photo of a bus below a toothbrush
+ a photo of a backpack right of a sandwich
+ a photo of a cake below a baseball bat
+ a photo of a dog right of a tie
+ a photo of a suitcase right of a boat
+ a photo of a bear above a clock
+ a photo of a tv remote left of an umbrella
+ a photo of a sports ball left of an umbrella
+ a photo of a train right of a dining table
+ a photo of a hair drier below an elephant
+ a photo of a tennis racket right of a spoon
+ a photo of a wine glass right of a hot dog
+ a photo of a computer mouse left of a bench
+ a photo of a carrot left of an orange
+ a photo of a kite above a toothbrush
+ a photo of a toaster below a traffic light
+ a photo of a cat below a baseball glove
+ a photo of a skis right of a zebra
+ a photo of a stop sign above a chair
+ a photo of a stop sign above a parking meter
+ a photo of a hot dog right of a skateboard
+ a photo of a pizza below a computer keyboard
+ a photo of a hair drier left of a toilet
+ a photo of a cow left of a stop sign
+ a photo of a suitcase above a skis
+ a photo of a book above a laptop
+ a photo of a toothbrush below a pizza
+ a photo of a toilet left of a kite
+ a photo of a tie above a sink
+ a photo of a bird left of a couch
+ a photo of a bed right of a sports ball
+ a photo of an elephant below a surfboard
+ a photo of a frisbee right of a motorcycle
+ a photo of a vase above a fire hydrant
+ a photo of a zebra left of an elephant
+ a photo of a bench left of a bear
+ a photo of a donut right of a bench
+ a photo of a frisbee below a horse
+ a photo of a computer keyboard above a snowboard
+ a photo of a tv below a cow
+ a photo of an elephant below a horse
+ a photo of a suitcase left of a banana
+ a photo of a train below an airplane
+ a photo of a cat below a backpack
+ a photo of a backpack below a cake
+ a photo of a sandwich below a knife
+ a photo of a bicycle above a parking meter
+ a photo of a knife right of a suitcase
+ a photo of a hot dog above a knife
+ a photo of a zebra right of a parking meter
+ a photo of a chair left of a zebra
+ a photo of a cow below an airplane
+ a photo of a cup left of an umbrella
+ a photo of a zebra below a computer keyboard
+ a photo of a zebra below a broccoli
+ a photo of a laptop below a sports ball
+ a photo of a truck left of a baseball bat
+ a photo of a refrigerator above a baseball bat
+ a photo of a tv above a baseball bat
+ a photo of a baseball glove right of a bear
+ a photo of a refrigerator below a scissors
+ a photo of a dining table above a suitcase
+ a photo of a parking meter above a broccoli
+ a photo of a frisbee above a truck
+ a photo of a pizza right of a banana
+ a photo of a bus above a boat
+ a photo of a cell phone left of a tennis racket
+ a photo of a horse right of a broccoli
+ a photo of a broccoli above a bottle
+ a photo of a vase right of a horse
+ a photo of a bear above a spoon
+ a photo of a zebra right of a bed
+ a photo of a cow right of a laptop
+ a photo of a bed right of a frisbee
+ a photo of a tie right of a motorcycle
+ a photo of a laptop right of a tv
+ a photo of a cell phone right of a chair
+ a photo of a couch below a potted plant
+ a photo of a clock below a tv
+ a photo of a couch below a vase
+ a photo of a donut below a cat
+ a photo of a couch left of a toaster
+ a photo of a purple wine glass and a black apple
+ a photo of a green bus and a purple microwave
+ a photo of a green skis and a brown airplane
+ a photo of a yellow computer keyboard and a black sink
+ a photo of a pink oven and a green motorcycle
+ a photo of a purple parking meter and a red laptop
+ a photo of a yellow skateboard and an orange computer mouse
+ a photo of a red skis and a brown tie
+ a photo of a pink skateboard and a black train
+ a photo of a white handbag and a purple bed
+ a photo of a purple elephant and a brown sports ball
+ a photo of a purple dog and a black dining table
+ a photo of a white dining table and a red car
+ a photo of a blue cell phone and a green apple
+ a photo of a red car and an orange potted plant
+ a photo of a brown carrot and a white potted plant
+ a photo of a black kite and a green bear
+ a photo of a blue laptop and a brown bear
+ a photo of a green teddy bear and a brown kite
+ a photo of a yellow stop sign and a blue potted plant
+ a photo of an orange snowboard and a green cat
+ a photo of an orange truck and a pink sink
+ a photo of a brown hot dog and a purple pizza
+ a photo of a green couch and an orange umbrella
+ a photo of a brown bed and a pink cell phone
+ a photo of a black broccoli and a yellow cake
+ a photo of a red train and a purple bear
+ a photo of a purple tennis racket and a black sink
+ a photo of a blue vase and a black banana
+ a photo of a blue clock and a white cup
+ a photo of a red umbrella and a blue couch
+ a photo of a white handbag and a red giraffe
+ a photo of a pink tv remote and a blue airplane
+ a photo of a pink handbag and a black scissors
+ a photo of a brown car and a pink hair drier
+ a photo of a black bus and a brown cell phone
+ a photo of a purple sheep and a pink banana
+ a photo of a blue handbag and a white cell phone
+ a photo of a white pizza and a green umbrella
+ a photo of a white tie and a purple skateboard
+ a photo of a yellow sports ball and a green boat
+ a photo of a white wine glass and a brown giraffe
+ a photo of a yellow bowl and a white baseball glove
+ a photo of an orange microwave and a black spoon
+ a photo of an orange skateboard and a pink bowl
+ a photo of a blue toilet and a white suitcase
+ a photo of a white boat and an orange hot dog
+ a photo of a yellow dining table and a pink dog
+ a photo of a red cake and a purple chair
+ a photo of a blue tie and a pink dining table
+ a photo of a blue cow and a black computer keyboard
+ a photo of a yellow pizza and a green oven
+ a photo of a red laptop and a brown car
+ a photo of a purple computer keyboard and a blue scissors
+ a photo of a green surfboard and an orange oven
+ a photo of a yellow parking meter and a pink refrigerator
+ a photo of a brown computer mouse and a purple bottle
+ a photo of a red umbrella and a green cow
+ a photo of a red giraffe and a black cell phone
+ a photo of a brown oven and a purple train
+ a photo of a blue baseball bat and a pink book
+ a photo of a green cup and a yellow bowl
+ a photo of a yellow suitcase and a brown bus
+ a photo of an orange motorcycle and a pink donut
+ a photo of an orange giraffe and a white baseball glove
+ a photo of an orange handbag and a green carrot
+ a photo of a black bottle and a white refrigerator
+ a photo of a white dog and a blue potted plant
+ a photo of an orange handbag and a red car
+ a photo of a red stop sign and a blue book
+ a photo of a yellow car and an orange toothbrush
+ a photo of a black potted plant and a yellow toilet
+ a photo of a brown dining table and a white suitcase
+ a photo of an orange donut and a yellow stop sign
+ a photo of a green suitcase and a blue boat
+ a photo of an orange tennis racket and a yellow sports ball
+ a photo of a purple computer keyboard and a red chair
+ a photo of a purple suitcase and an orange pizza
+ a photo of a white bottle and a blue sheep
+ a photo of a purple backpack and a white umbrella
+ a photo of an orange potted plant and a black spoon
+ a photo of a green tennis racket and a black dog
+ a photo of a yellow handbag and a blue refrigerator
+ a photo of a pink broccoli and a red sink
+ a photo of a red bowl and a pink sink
+ a photo of a white toilet and a red apple
+ a photo of a pink dining table and a black sandwich
+ a photo of a black car and a green parking meter
+ a photo of a yellow bird and a black motorcycle
+ a photo of a brown giraffe and a white stop sign
+ a photo of a white banana and a black elephant
+ a photo of an orange cow and a purple sandwich
+ a photo of a red clock and a black cell phone
+ a photo of a brown knife and a blue donut
+ a photo of a red cup and a pink handbag
+ a photo of a yellow bicycle and a red motorcycle
+ a photo of a red orange and a purple broccoli
+ a photo of an orange traffic light and a white toilet
+ a photo of a green cup and a red pizza
+ a photo of a blue pizza and a yellow baseball glove
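
These 553 prompts pair line-for-line with the `evaluation_metadata.jsonl` shipped in the same folder. Below is a minimal sketch of consuming the pair; the file paths, the `tag` field, and the per-prompt image-folder layout are assumptions about this repo rather than code taken from its scripts.

```python
# Minimal sketch: iterate GenEval prompts alongside their metadata records.
# Assumes both files (paths illustrative) have one entry per prompt in the
# same order -- each contains 553 lines in this upload.
import json

with open("eval/gen/geneval/prompts/generation_prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("eval/gen/geneval/prompts/evaluation_metadata.jsonl") as f:
    metadata = [json.loads(line) for line in f]

assert len(prompts) == len(metadata) == 553
for idx, (prompt, meta) in enumerate(zip(prompts, metadata)):
    # A generation script would sample images for `prompt` here and save
    # them under a folder keyed by `idx`, so the evaluator can match each
    # image batch back to its metadata record.
    print(idx, prompt, meta.get("tag"))
```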
eval/gen/geneval/prompts/object_names.txt ADDED
@@ -0,0 +1,80 @@
+ person
+ bicycle
+ car
+ motorcycle
+ airplane
+ bus
+ train
+ truck
+ boat
+ traffic light
+ fire hydrant
+ stop sign
+ parking meter
+ bench
+ bird
+ cat
+ dog
+ horse
+ sheep
+ cow
+ elephant
+ bear
+ zebra
+ giraffe
+ backpack
+ umbrella
+ handbag
+ tie
+ suitcase
+ frisbee
+ skis
+ snowboard
+ sports ball
+ kite
+ baseball bat
+ baseball glove
+ skateboard
+ surfboard
+ tennis racket
+ bottle
+ wine glass
+ cup
+ fork
+ knife
+ spoon
+ bowl
+ banana
+ apple
+ sandwich
+ orange
+ broccoli
+ carrot
+ hot dog
+ pizza
+ donut
+ cake
+ chair
+ couch
+ potted plant
+ bed
+ dining table
+ toilet
+ tv
+ laptop
+ computer mouse
+ tv remote
+ computer keyboard
+ cell phone
+ microwave
+ oven
+ toaster
+ sink
+ refrigerator
+ book
+ clock
+ vase
+ scissors
+ teddy bear
+ hair drier
+ toothbrush
eval/gen/wise/cal_score.py ADDED
@@ -0,0 +1,162 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import json
+ import os
+ import argparse
+ from collections import defaultdict
+
+
+ def calculate_wiscore(consistency, realism, aesthetic_quality):
+     return 0.7 * consistency + 0.2 * realism + 0.1 * aesthetic_quality
+
+
+ def cal_culture(file_path):
+     all_scores = []
+     total_objects = 0
+     has_9_9 = False
+
+     with open(file_path, 'r') as file:
+         for line in file:
+             total_objects += 1
+             data = json.loads(line)
+             if 9.9 in [data['consistency'], data['realism'], data['aesthetic_quality']]:
+                 has_9_9 = True
+             wiscore = calculate_wiscore(data['consistency'], data['realism'], data['aesthetic_quality'])
+             all_scores.append(wiscore)
+
+     if has_9_9 or total_objects < 400:
+         print(f"Skipping file {file_path}: Contains 9.9 or has fewer than 400 objects.")
+         return None
+
+     total_score = sum(all_scores)
+     avg_score = total_score / (len(all_scores) * 2) if len(all_scores) > 0 else 0
+
+     score = {
+         'total': total_score,
+         'average': avg_score
+     }
+
+     print(f" Cultural - Total: {score['total']:.2f}, Average: {score['average']:.2f}")
+
+     return avg_score
+
+
+ def cal_space_time(file_path):
+     categories = defaultdict(list)
+     total_objects = 0
+     has_9_9 = False
+
+     with open(file_path, 'r') as file:
+         for line in file:
+             total_objects += 1
+             data = json.loads(line)
+             if 9.9 in [data['consistency'], data['realism'], data['aesthetic_quality']]:
+                 has_9_9 = True
+             subcategory = data['Subcategory']
+             wiscore = calculate_wiscore(data['consistency'], data['realism'], data['aesthetic_quality'])
+             if subcategory in ['Longitudinal time', 'Horizontal time']:
+                 categories['Time'].append(wiscore)
+             else:
+                 categories['Space'].append(wiscore)
+
+     if has_9_9 or total_objects < 300:
+         print(f"Skipping file {file_path}: Contains 9.9 or has fewer than 300 objects.")
+         return None
+
+     total_scores = {category: sum(scores) for category, scores in categories.items()}
+     avg_scores = {category: sum(scores) / (len(scores) * 2) if len(scores) > 0 else 0 for category, scores in categories.items()}
+
+     scores = {
+         'total': total_scores,
+         'average': avg_scores
+     }
+
+     print(f" Time - Total: {scores['total'].get('Time', 0):.2f}, Average: {scores['average'].get('Time', 0):.2f}")
+     print(f" Space - Total: {scores['total'].get('Space', 0):.2f}, Average: {scores['average'].get('Space', 0):.2f}")
+
+     return avg_scores
+
+
+ def cal_science(file_path):
+     categories = defaultdict(list)
+     total_objects = 0
+     has_9_9 = False
+
+     with open(file_path, 'r') as file:
+         for line in file:
+             total_objects += 1
+             data = json.loads(line)
+             if 9.9 in [data['consistency'], data['realism'], data['aesthetic_quality']]:
+                 has_9_9 = True
+
+             prompt_id = data.get('prompt_id', 0)
+             if 701 <= prompt_id <= 800:
+                 category = 'Biology'
+             elif 801 <= prompt_id <= 900:
+                 category = 'Physics'
+             elif 901 <= prompt_id <= 1000:
+                 category = 'Chemistry'
+             else:
+                 category = "?"
+
+             wiscore = calculate_wiscore(data['consistency'], data['realism'], data['aesthetic_quality'])
+             categories[category].append(wiscore)
+
+     if has_9_9 or total_objects < 300:
+         print(f"Skipping file {file_path}: Contains 9.9 or has fewer than 300 objects.")
+         return None
+
+     total_scores = {category: sum(scores) for category, scores in categories.items()}
+     avg_scores = {category: sum(scores) / (len(scores) * 2) if len(scores) > 0 else 0 for category, scores in categories.items()}
+
+     scores = {
+         'total': total_scores,
+         'average': avg_scores
+     }
+
+     for category in ['Biology', 'Physics', 'Chemistry']:
+         print(f" {category} - Total: {scores['total'].get(category, 0):.2f}, Average: {scores['average'].get(category, 0):.2f}")
+
+     return avg_scores
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description='Image Quality Assessment Tool')
+     parser.add_argument('--output_dir', required=True,
+                         help='Path to the output directory')
+     args = parser.parse_args()
+
+     avg_score = dict()
+
+     score = cal_culture(
+         os.path.join(args.output_dir, "cultural_common_sense_scores.jsonl")
+     )
+     if score is not None:
+         avg_score['Cultural'] = score
+
+     scores = cal_space_time(
+         os.path.join(args.output_dir, "spatio-temporal_reasoning_scores.jsonl")
+     )
+     if scores is not None:
+         avg_score.update(scores)
+
+     scores = cal_science(
+         os.path.join(args.output_dir, "natural_science_scores.jsonl")
+     )
+     if scores is not None:
+         avg_score.update(scores)
+
+     avg_all = sum(avg_score.values()) / len(avg_score)
+
+     avg_score['Overall'] = avg_all
+     keys = ""
+     values = ""
+     for k, v in avg_score.items():
+         keys += f"{k} "
+         values += f"{v:.2f} "
+     print(keys)
+     print(values)
+
+     results_path = os.path.join(args.output_dir, "results.txt")
+     print(f"write results to file {results_path}")
+     with open(results_path, 'w') as writer:
+         writer.write(keys + "\n")
+         writer.write(values + "\n")
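
For reference, a minimal sketch of the WiScore arithmetic implemented above, on a hand-made record (the values are illustrative): each GPT sub-score is on a 0-2 scale, so dividing the summed WiScore by 2·N maps the average into [0, 1].

```python
# Minimal sketch of the WiScore normalization used in cal_score.py above.
import json

record = json.loads(
    '{"prompt_id": 1, "Subcategory": "Longitudinal time", '
    '"consistency": 2, "realism": 1, "aesthetic_quality": 1}'
)
wiscore = (0.7 * record["consistency"]
           + 0.2 * record["realism"]
           + 0.1 * record["aesthetic_quality"])  # = 1.7 of a possible 2.0
normalized = wiscore / 2                         # = 0.85
print(f"WiScore: {wiscore:.2f}, normalized: {normalized:.2f}")
```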
eval/gen/wise/final_data.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/gen/wise/gpt_eval_mp.py ADDED
@@ -0,0 +1,268 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
+
+ import json
+ import os
+ import base64
+ import re
+ import argparse
+ import openai
+ from pathlib import Path
+ from typing import Dict, Any
+ import concurrent.futures
+
+ openai.api_key = os.getenv('OPENAI_API_KEY')
+ print(openai.api_key)
+
+
+ def parse_arguments():
+     parser = argparse.ArgumentParser(description='Image Quality Assessment Tool')
+
+     parser.add_argument('--json_path', required=True,
+                         help='Path to the prompts JSON file')
+     parser.add_argument('--image_dir', required=True,
+                         help='Path to the image directory')
+     parser.add_argument('--output_dir', required=True,
+                         help='Path to the output directory')
+
+     return parser.parse_args()
+
+
+ def get_config(args):
+     filename = args.json_path.split("/")[-1].split(".")[0]
+     return {
+         "json_path": args.json_path,
+         "image_dir": args.image_dir,
+         "output_dir": args.output_dir,
+         "result_files": {
+             "full": f"{filename}_full.jsonl",
+             "scores": f"{filename}_scores.jsonl",
+         }
+     }
+
+
+ def extract_scores(evaluation_text: str) -> Dict[str, float]:
+     score_pattern = r"\*{0,2}(Consistency|Realism|Aesthetic Quality)\*{0,2}\s*[::]?\s*(\d)"
+     matches = re.findall(score_pattern, evaluation_text, re.IGNORECASE)
+
+     scores = {
+         "consistency": 9.9,
+         "realism": 9.9,
+         "aesthetic_quality": 9.9
+     }
+
+     for key, value in matches:
+         key = key.lower().replace(" ", "_")
+         if key in scores:
+             scores[key] = float(value)
+
+     return scores
+
+
+ def encode_image(image_path: str) -> str:
+     with open(image_path, "rb") as image_file:
+         return base64.b64encode(image_file.read()).decode('utf-8')
+
+
+ def load_prompts(json_path: str) -> Dict[int, Dict[str, Any]]:
+     with open(json_path, 'r') as f:
+         data = json.load(f)
+     return {item["prompt_id"]: item for item in data}
+
+
+ def build_evaluation_messages(prompt_data: Dict, image_base64: str) -> list:
+     return [
+         {
+             "role": "system",
+             "content": [
+                 {
+                     "type": "text",
+                     "text": "You are a professional text-to-image quality audit expert. Please evaluate the image quality strictly according to the protocol."
+                 }
+             ]
+         },
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "text",
+                     "text": f"""Please evaluate strictly and return ONLY the three scores as requested.
+
+ # Text-to-Image Quality Evaluation Protocol
+
+ ## System Instruction
+ You are an AI quality auditor for text-to-image generation. Apply these rules with ABSOLUTE RUTHLESSNESS. Only images meeting the HIGHEST standards should receive top scores.
+
+ **Input Parameters**
+ - PROMPT: [User's original prompt]
+ - EXPLANATION: [Further explanation of the original prompt]
+ ---
+
+ ## Scoring Criteria
+
+ **Consistency (0-2):** How accurately and completely the image reflects the PROMPT.
+ * **0 (Rejected):** Fails to capture key elements of the prompt, or contradicts the prompt.
+ * **1 (Conditional):** Partially captures the prompt. Some elements are present, but not all, or not accurately. Noticeable deviations from the prompt's intent.
+ * **2 (Exemplary):** Perfectly and completely aligns with the PROMPT. Every single element and nuance of the prompt is flawlessly represented in the image. The image is an ideal, unambiguous visual realization of the given prompt.
+
+ **Realism (0-2):** How realistically the image is rendered.
+ * **0 (Rejected):** Physically implausible and clearly artificial. Breaks fundamental laws of physics or visual realism.
+ * **1 (Conditional):** Contains minor inconsistencies or unrealistic elements. While somewhat believable, noticeable flaws detract from realism.
+ * **2 (Exemplary):** Achieves photorealistic quality, indistinguishable from a real photograph. Flawless adherence to physical laws, accurate material representation, and coherent spatial relationships. No visual cues betraying AI generation.
+
+ **Aesthetic Quality (0-2):** The overall artistic appeal and visual quality of the image.
+ * **0 (Rejected):** Poor aesthetic composition, visually unappealing, and lacks artistic merit.
+ * **1 (Conditional):** Demonstrates basic visual appeal, acceptable composition, and color harmony, but lacks distinction or artistic flair.
+ * **2 (Exemplary):** Possesses exceptional aesthetic quality, comparable to a masterpiece. Strikingly beautiful, with perfect composition, a harmonious color palette, and a captivating artistic style. Demonstrates a high degree of artistic vision and execution.
+
+ ---
+
+ ## Output Format
+
+ **Do not include any other text, explanations, or labels.** You must return only three lines of text, each containing a metric and the corresponding score, for example:
+
+ **Example Output:**
+ Consistency: 2
+ Realism: 1
+ Aesthetic Quality: 0
+
+ ---
+
+ **IMPORTANT Enforcement:**
+
+ Be EXTREMELY strict in your evaluation. A score of '2' should be exceedingly rare and reserved only for images that truly excel and meet the highest possible standards in each metric. If there is any doubt, downgrade the score.
+
+ For **Consistency**, a score of '2' requires complete and flawless adherence to every aspect of the prompt, leaving no room for misinterpretation or omission.
+
+ For **Realism**, a score of '2' means the image is virtually indistinguishable from a real photograph in terms of detail, lighting, physics, and material properties.
+
+ For **Aesthetic Quality**, a score of '2' demands exceptional artistic merit, not just pleasant visuals.
+
+ ---
+ Here are the Prompt and EXPLANATION for this evaluation:
+ PROMPT: "{prompt_data['Prompt']}"
+ EXPLANATION: "{prompt_data['Explanation']}"
+ Please strictly adhere to the scoring criteria and follow the template format when providing your results."""
+                 },
+                 {
+                     "type": "image_url",
+                     "image_url": {
+                         "url": f"data:image/png;base64,{image_base64}"
+                     }
+                 }
+             ]
+         }
+     ]
+
+
+ def evaluate_image(prompt_data: Dict, image_path: str, config: Dict) -> Dict[str, Any]:
+     try:
+         base64_image = encode_image(image_path)
+         messages = build_evaluation_messages(prompt_data, base64_image)
+
+         response = openai_client.chat.completions.create(
+             model=model,
+             messages=messages,
+             temperature=0.0,
+             max_tokens=2000,
+             n=1,
+         )
+         response = response.to_dict()
+
+         evaluation_text = response['choices'][0]['message']['content'].strip()
+         scores = extract_scores(evaluation_text)
+
+         return {
+             "evaluation": evaluation_text,
+             **scores
+         }
+     except Exception as e:
+         return {
+             "evaluation": f"Evaluation failed: {str(e)}",
+             "consistency": 9.9,
+             "realism": 9.9,
+             "aesthetic_quality": 9.9
+         }
+
+
+ def save_results(data, filename, config):
+     path = os.path.join(config["output_dir"], filename)
+
+     assert filename.endswith('.jsonl')
+     with open(path, 'a', encoding='utf-8') as f:
+         json_line = json.dumps(data, ensure_ascii=False)
+         f.write(json_line + '\n')
+
+
+ def process_prompt(prompt_id, prompt_data, config):
+     image_path = os.path.join(config["image_dir"], f"{prompt_id}.png")
+
+     if not os.path.exists(image_path):
+         print(f"Warning: Image not found {image_path}")
+         return None
+
+     print(f"Evaluating prompt_id: {prompt_id}...")
+     evaluation_result = evaluate_image(prompt_data, image_path, config)
+
+     full_record = {
+         "prompt_id": prompt_id,
+         "prompt": prompt_data["Prompt"],
+         "key": prompt_data["Explanation"],
+         "image_path": image_path,
+         "evaluation": evaluation_result["evaluation"]
+     }
+
+     score_record = {
+         "prompt_id": prompt_id,
+         "Subcategory": prompt_data["Subcategory"],
+         "consistency": evaluation_result["consistency"],
+         "realism": evaluation_result["realism"],
+         "aesthetic_quality": evaluation_result["aesthetic_quality"]
+     }
+
+     return full_record, score_record
+
+
+ if __name__ == "__main__":
+     api_key = openai.api_key
+     base_url = "your_api_url"
+     api_version = "2024-03-01-preview"
+     model = "gpt-4o-2024-11-20"
+
+     openai_client = openai.AzureOpenAI(
+         azure_endpoint=base_url,
+         api_version=api_version,
+         api_key=api_key,
+     )
+
+     args = parse_arguments()
+     config = get_config(args)
+     Path(config["output_dir"]).mkdir(parents=True, exist_ok=True)
+
+     prompts = load_prompts(config["json_path"])
+
+     processed_ids = set()
+     if os.path.exists(os.path.join(config["output_dir"], config["result_files"]["full"])):
+         with open(os.path.join(config["output_dir"], config["result_files"]["full"]), 'r', encoding='utf-8') as f:
+             for line in f:
+                 data = json.loads(line)
+                 processed_ids.add(data["prompt_id"])
+     left_prompts = {k: v for k, v in prompts.items() if k not in processed_ids}
+     print(f"Process {len(left_prompts)} prompts...")
+
+     MAX_THREADS = 30
+
+     with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
+         futures = [executor.submit(process_prompt, prompt_id, prompt_data, config)
+                    for prompt_id, prompt_data in left_prompts.items()]
+         for future in concurrent.futures.as_completed(futures):
+             try:
+                 result = future.result()
+                 if result:
+                     full_record, score_record = result
+                     print(full_record)
+                     save_results(full_record, config["result_files"]["full"], config)
+                     save_results(score_record, config["result_files"]["scores"], config)
+
+             except Exception as e:
+                 print(f"An error occurred: {e}")
eval/vlm/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: Apache-2.0
eval/vlm/eval/mathvista/calculate_score.py ADDED
@@ -0,0 +1,271 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ import argparse
+
+ import pandas as pd
+ # !pip install python-Levenshtein
+ from Levenshtein import distance
+ from utilities import *
+
+
+ def get_most_similar(prediction, choices):
+     """
+     Use the Levenshtein distance (edit distance) to determine which of the choices is most similar to the given prediction
+     """
+     distances = [distance(prediction, choice) for choice in choices]
+     ind = distances.index(min(distances))
+     return choices[ind]
+     # return min(choices, key=lambda choice: distance(prediction, choice))
+
+
+ def normalize_extracted_answer(extraction, choices, question_type, answer_type, precision):
+     """
+     Normalize the extracted answer to match the answer type
+     """
+     if question_type == 'multi_choice':
+         # make sure the extraction is a string
+         if isinstance(extraction, str):
+             extraction = extraction.strip()
+         else:
+             try:
+                 extraction = str(extraction)
+             except Exception:
+                 extraction = ''
+
+         # extract "A" from "(A) text"
+         letter = re.findall(r'\(([a-zA-Z])\)', extraction)
+         if len(letter) > 0:
+             extraction = letter[0].upper()
+
+         options = [chr(ord('A') + i) for i in range(len(choices))]
+
+         if extraction in options:
+             # convert option letter to text, e.g. "A" -> "text"
+             ind = options.index(extraction)
+             extraction = choices[ind]
+         else:
+             # select the most similar option
+             extraction = get_most_similar(extraction, choices)
+         assert extraction in choices
+
+     elif answer_type == 'integer':
+         try:
+             extraction = str(int(float(extraction)))
+         except Exception:
+             extraction = None
+
+     elif answer_type == 'float':
+         try:
+             extraction = str(round(float(extraction), int(precision)))
+         except Exception:
+             extraction = None
+
+     elif answer_type == 'list':
+         try:
+             extraction = str(extraction)
+         except Exception:
+             extraction = None
+
+     return extraction
+
+
+ def safe_equal(prediction, answer):
+     """
+     Check if the prediction is equal to the answer, even if they are of different types
+     """
+     try:
+         if prediction == answer:
+             return True
+         return False
+     except Exception as e:
+         print(e)
+         return False
+
+
+ def get_acc_with_condition(res_pd, key, value):
+     if key == 'skills':
+         total_pd = res_pd[res_pd[key].apply(lambda x: value in x)]
+     else:
+         total_pd = res_pd[res_pd[key] == value]
+
+     correct_pd = total_pd[total_pd['true_false'] == True]  # noqa: E712
+     acc = '{:.2f}'.format(len(correct_pd) / len(total_pd) * 100)
+     return len(correct_pd), len(total_pd), acc
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--output_dir', type=str, default='./results')
+     parser.add_argument('--output_file', type=str, default='output.json')
+     parser.add_argument('--score_file', type=str, default='scores.json')
+     parser.add_argument('--gt_file', type=str, default='./eval/vlm/data/MathVista/annot_testmini.json', help='ground truth file')
+     parser.add_argument('--number', type=int, default=-1, help='number of problems to run')
+     parser.add_argument('--rerun', action='store_true', help='rerun the evaluation')
+     parser.add_argument('--caculate_gain', action='store_true', help='calculate the score gains over random guess')
+     parser.add_argument('--random_file', type=str, default='score_random_guess.json')
+     args = parser.parse_args()
+
+     # args
+     output_file = os.path.join(args.output_dir, args.output_file)
+
+     # read json
+     print(f'Reading {output_file}...')
+     results = read_json(output_file)
+
+     # read ground truth
+     print(f'Reading {args.gt_file}...')
+     gts = read_json(args.gt_file)
+
+     # full pids
+     full_pids = list(results.keys())
+     if args.number > 0:
+         full_pids = full_pids[:min(args.number, len(full_pids))]
+     print('Number of testing problems:', len(full_pids))
+
+     ## [1] Evaluate if the prediction is true or false
+     print('\nEvaluating the predictions...')
+     update_json_flag = False
+     for pid in full_pids:
+         problem = results[pid]
+
+         if args.rerun:
+             if 'prediction' in problem:
+                 del problem['prediction']
+             if 'true_false' in problem:
+                 del problem['true_false']
+
+         choices = problem['choices']
+         question_type = problem['question_type']
+         answer_type = problem['answer_type']
+         precision = problem['precision']
+         extraction = problem['extraction']
+
+         if 'answer' in problem:
+             answer = problem['answer']
+         else:
+             if pid in gts:
+                 answer = gts[pid]['answer']
+             else:
+                 answer = ''
+             problem['answer'] = answer
+
+         # normalize the extracted answer to match the answer type
+         prediction = normalize_extracted_answer(extraction, choices, question_type, answer_type, precision)
+
+         # verify the prediction is true or false
+         true_false = safe_equal(prediction, answer)
+
+         # update the problem
+         if 'true_false' not in problem:
+             update_json_flag = True
+         elif true_false != problem['true_false']:
+             update_json_flag = True
+
+         if 'prediction' not in problem:
+             update_json_flag = True
+         elif prediction != problem['prediction']:
+             update_json_flag = True
+
+         problem['prediction'] = prediction
+         problem['true_false'] = true_false
+
+     # save the updated json
+     if update_json_flag:
+         print('\n!!!Some problems are updated.!!!')
+         print(f'\nSaving {output_file}...')
+         save_json(results, output_file)
+
+     ## [2] Calculate the average accuracy
+     total = len(full_pids)
+     correct = 0
+     for pid in full_pids:
+         if results[pid]['true_false']:
+             correct += 1
+     accuracy = str(round(correct / total * 100, 2))
+     print(f'\nCorrect: {correct}, Total: {total}, Accuracy: {accuracy}%')
+
+     scores = {'average': {'accuracy': accuracy, 'correct': correct, 'total': total}}
+
+     ## [3] Calculate the fine-grained accuracy scores
+
+     # merge the 'metadata' attribute into the data
+     for pid in results:
+         results[pid].update(results[pid].pop('metadata'))
+
+     # convert the data to a pandas DataFrame
+     df = pd.DataFrame(results).T
+
+     print('Number of test problems:', len(df))
+     # assert len(df) == 1000 # Important!!!
+
+     # assign the target keys for evaluation
+     target_keys = ['question_type', 'answer_type', 'language', 'source', 'category', 'task', 'context', 'grade',
+                    'skills']
+
+     for key in target_keys:
+         print(f'\nType: [{key}]')
+         # get the unique values of the key
+         if key == 'skills':
+             # the value is a list
+             values = []
+             for i in range(len(df)):
+                 values += df[key][i]
+             values = list(set(values))
+         else:
+             values = df[key].unique()
+
+         # calculate the accuracy for each value
+         scores[key] = {}
+         for value in values:
+             correct, total, acc = get_acc_with_condition(df, key, value)
+             if total > 0:
+                 print(f'[{value}]: {acc}% ({correct}/{total})')
+                 scores[key][value] = {'accuracy': acc, 'correct': correct, 'total': total}
+
+         # sort the scores by accuracy
+         scores[key] = dict(sorted(scores[key].items(), key=lambda item: float(item[1]['accuracy']), reverse=True))
+
+     # save the scores
+     scores_file = os.path.join(args.output_dir, args.score_file)
+     print(f'\nSaving {scores_file}...')
+     save_json(scores, scores_file)
+     print('\nDone!')
+
+     # [4] Calculate the score gains over random guess
+     if args.caculate_gain:
+         random_file = os.path.join(args.output_dir, args.random_file)
+         random_scores = json.load(open(random_file))
+
+         print('\nCalculating the score gains...')
+         for key in scores:
+             if key == 'average':
+                 gain = round(float(scores[key]['accuracy']) - float(random_scores[key]['accuracy']), 2)
+                 scores[key]['acc_gain'] = gain
+             else:
+                 for sub_key in scores[key]:
+                     gain = round(
+                         float(scores[key][sub_key]['accuracy']) - float(random_scores[key][sub_key]['accuracy']), 2)
+                     scores[key][sub_key]['acc_gain'] = str(gain)
+
+         # save the score gains
+         print(f'\nSaving {scores_file}...')
+         save_json(scores, scores_file)
+         print('\nDone!')
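
A minimal sketch of the multi-choice normalization path in `normalize_extracted_answer()` above: an option letter like "(B)" is mapped back to the choice text, with the Levenshtein edit-distance fallback when the model's reply is not a clean letter. The choices and the extraction string are illustrative.

```python
# Minimal sketch of the multi-choice branch from calculate_score.py.
import re
from Levenshtein import distance  # pip install python-Levenshtein

choices = ["triangle", "square", "circle"]
extraction = "(B) square"

# Pull the option letter out of "(B) ..." if one is present.
letters = re.findall(r'\(([a-zA-Z])\)', extraction)
if letters:
    extraction = letters[0].upper()

options = [chr(ord('A') + i) for i in range(len(choices))]
if extraction in options:
    prediction = choices[options.index(extraction)]
else:
    # Fall back to the closest choice by edit distance.
    prediction = min(choices, key=lambda c: distance(extraction, c))
print(prediction)  # square
```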
eval/vlm/eval/mathvista/evaluate_mathvista.py ADDED
@@ -0,0 +1,210 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ import argparse
+ import itertools
+ import json
+ import os
+ import random
+
+ import torch
+ from datasets import concatenate_datasets, load_dataset
+ from eval.vlm.utils import load_model_and_tokenizer, build_transform, process_conversation
+ from tqdm import tqdm
+
+ ds_collections = {
+     'MathVista_testmini': {
+         'root': 'AI4Math/MathVista',
+         'max_new_tokens': 4096,
+         'min_new_tokens': 1,
+         'split': 'testmini'
+     },
+     'MathVista_test': {
+         'root': 'AI4Math/MathVista',
+         'max_new_tokens': 4096,
+         'min_new_tokens': 1,
+         'split': 'test'
+     },
+ }
+
+
+ COT_INSTRUCTION = (
+     'Your task is to answer the question below. '
+     "Give step by step reasoning before you answer, and when you're ready to answer, "
+     "please use the format \"Final answer: ..\""
+     '\n\n'
+     'Question:'
+     '\n\n'
+     '{question}'
+ )
+
+
+ def collate_fn(batches):
+     images = [_['images'] for _ in batches]
+     data_items = [_['data_item'] for _ in batches]
+     return images, data_items
+
+
+ class MathVistaDataset(torch.utils.data.Dataset):
+
+     def __init__(self, root, split):
+         dataset = load_dataset(root, cache_dir=os.path.join(os.getcwd(), 'eval/vlm/data/MathVista/'))
+         self.data = dataset[split]
+
+     def __len__(self):
+         return len(self.data)
+
+     def __getitem__(self, idx):
+         data_item = self.data[idx]
+         image = data_item['decoded_image']
+         del data_item['decoded_image']
+
+         images = [image.convert('RGB') if image.mode != 'RGB' else image]
+
+         return {
+             'images': images,
+             'data_item': data_item,
+         }
+
+
+ class InferenceSampler(torch.utils.data.sampler.Sampler):
+
+     def __init__(self, size):
+         self._size = int(size)
+         assert size > 0
+         self._rank = torch.distributed.get_rank()
+         self._world_size = torch.distributed.get_world_size()
+         self._local_indices = self._get_local_indices(size, self._world_size, self._rank)
+
+     @staticmethod
+     def _get_local_indices(total_size, world_size, rank):
+         shard_size = total_size // world_size
+         left = total_size % world_size
+         shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
+
+         begin = sum(shard_sizes[:rank])
+         end = min(sum(shard_sizes[:rank + 1]), total_size)
+         return range(begin, end)
+
+     def __iter__(self):
+         yield from self._local_indices
+
+     def __len__(self):
+         return len(self._local_indices)
+
+
+ def evaluate_chat_model():
+     random.seed(args.seed)
+
+     for ds_name in args.datasets:
+         dataset = MathVistaDataset(
+             root=ds_collections[ds_name]['root'],
+             split=ds_collections[ds_name]['split'],
+         )
+         dataloader = torch.utils.data.DataLoader(
+             dataset=dataset,
+             sampler=InferenceSampler(len(dataset)),
+             batch_size=args.batch_size,
+             num_workers=args.num_workers,
+             pin_memory=True,
+             drop_last=False,
+             collate_fn=collate_fn,
+         )
+
+         outputs = []
+         for images, data_items in tqdm(dataloader):
+             if args.cot:
+                 question = COT_INSTRUCTION.format(question=data_items[0]['query'])
+             else:
+                 question = data_items[0]['query']
+
+             images = images[0]
+             images, conversation = process_conversation(images, question)
+
+             pred = model.chat(
+                 tokenizer,
+                 new_token_ids,
+                 image_transform,
+                 images=images,
+                 prompt=conversation,
+                 max_length=ds_collections[ds_name]['max_new_tokens'] if not args.cot else 4096,  # TODO: how to use ds_collections[ds_name]['min_new_tokens']
+             )
+
+             data_item = data_items[0]
+             data_item['response'] = pred
+             outputs.append(data_item)
+
+         torch.distributed.barrier()
+
+         world_size = torch.distributed.get_world_size()
+         merged_outputs = [None for _ in range(world_size)]
+         torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs))
+
+         merged_outputs = [json.loads(_) for _ in merged_outputs]
+         merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)]
+
+         if torch.distributed.get_rank() == 0:
+             temp = {}
+             for data_item in merged_outputs:
+                 pid = data_item['pid']
+                 temp[pid] = data_item
+
+             print(f'Evaluating {ds_name} ...')
+             results_file = 'results.json'
+             output_path = os.path.join(args.out_dir, results_file)
+             json.dump(temp, open(output_path, 'w'), indent=4)
+             print('Results saved to {}'.format(output_path))
+
+             # The same extraction script handles both CoT and direct answers.
+             cmd = f'python eval/vlm/eval/mathvista/extract_answer_mp.py --output_file {results_file} --output_dir {args.out_dir}'
+             print(cmd)
+             os.system(cmd)
+
+             cmd = f'python eval/vlm/eval/mathvista/calculate_score.py --output_file {results_file} --output_dir {args.out_dir} --score_file score.json'
+             print(cmd)
+             os.system(cmd)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--datasets', type=str, default='MathVista_testmini')
+     parser.add_argument('--batch-size', type=int, default=1)
+     parser.add_argument('--num-workers', type=int, default=1)
+     parser.add_argument('--out-dir', type=str, default='results')
+     parser.add_argument('--seed', type=int, default=0)
+     parser.add_argument('--cot', action='store_true')
+     parser.add_argument('--model-path', type=str, default='hf/BAGEL-7B-MoT/')
+     args = parser.parse_args()
+
+     if not os.path.exists(args.out_dir):
+         os.makedirs(args.out_dir, exist_ok=True)
+
+     args.datasets = args.datasets.split(',')
+     print('datasets:', args.datasets)
+     assert args.batch_size == 1, 'Only batch size 1 is supported'
+
+     torch.distributed.init_process_group(
+         backend='nccl',
+         world_size=int(os.getenv('WORLD_SIZE', '1')),
+         rank=int(os.getenv('RANK', '0')),
+     )
+
+     torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
+
+     model, tokenizer, new_token_ids = load_model_and_tokenizer(args)
+     image_transform = build_transform()
+
+     total_params = sum(p.numel() for p in model.parameters()) / 1e9
+     print(f'[test] total_params: {total_params}B')
+
+     evaluate_chat_model()
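
The `InferenceSampler` above shards a dataset contiguously across ranks, giving one extra example to each of the first `size % world_size` ranks. A minimal sketch of the same arithmetic, runnable without `torch.distributed` (the world size and dataset size are illustrative):

```python
# Minimal sketch of InferenceSampler._get_local_indices from above.
def local_indices(total_size, world_size, rank):
    shard_size = total_size // world_size
    left = total_size % world_size
    shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
    begin = sum(shard_sizes[:rank])
    end = min(sum(shard_sizes[:rank + 1]), total_size)
    return range(begin, end)

shards = [local_indices(10, 3, r) for r in range(3)]
print([list(s) for s in shards])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]] -- every index covered exactly once
```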
eval/vlm/eval/mathvista/extract_answer.py ADDED
@@ -0,0 +1,160 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ import argparse
+
+ from tqdm import tqdm
+ from utilities import *  # also re-exports os, re, openai and the json helpers
+
+ openai.api_key = os.getenv('OPENAI_API_KEY')
+ print('OPENAI_API_KEY set:', bool(openai.api_key))  # avoid echoing the key itself
+
+ # load demo prompt
+ from prompts.ext_ans import demo_prompt
+
+
+ def verify_extraction(extraction):
+     # check for None before calling .strip()
+     if extraction is None:
+         return False
+     return extraction.strip() != ''
+
+
+ def create_test_prompt(demo_prompt, query, response):
+     demo_prompt = demo_prompt.strip()
+     test_prompt = f'{query}\n\n{response}'
+     full_prompt = f'{demo_prompt}\n\n{test_prompt}\n\nExtracted answer: '
+     return full_prompt
+
+
+ def _extract_answer(text):
+     match = re.search(r'(Final answer:|Answer:)\s*(.*)', text, re.IGNORECASE)
+     if match:
+         return match.group(2).strip()
+     return text
+
+
+ def extract_answer(response, problem, quick_extract=False):
+     question_type = problem['question_type']
+     answer_type = problem['answer_type']
+     choices = problem['choices']
+     query = problem['query']
+
+     if response == '':
+         return ''
+
+     if question_type == 'multi_choice' and response in choices:
+         return response
+
+     if answer_type == 'integer':
+         try:
+             return str(int(response))
+         except (ValueError, TypeError):
+             pass
+
+     if answer_type == 'float':
+         try:
+             return str(float(response))
+         except (ValueError, TypeError):
+             pass
+
+     # quick extraction: e.g. 'Final answer: 14' -> '14'
+     if quick_extract:
+         print('Quickly extracting answer...')
+         try:
+             return _extract_answer(response)
+         except Exception:
+             pass
+
+     # general extraction via the LLM judge
+     try:
+         full_prompt = create_test_prompt(demo_prompt, query, response)
+         # pass the engine chosen via --llm_engine (as extract_answer_mp.py does)
+         extraction = get_chat_response(full_prompt, openai.api_key, patience=5, model=args.llm_engine)
+         return extraction
+     except Exception as e:
+         print(e)
+         print(f"Error in extracting answer for problem {problem.get('pid', '')}")
+
+     return ''
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     # input
+     parser.add_argument('--output_dir', type=str, default='./results')
+     parser.add_argument('--output_file', type=str, default='mathvista_answer.json')
+     parser.add_argument('--response_label', type=str, default='response', help='response label for the input file')
+     # model
+     parser.add_argument('--llm_engine', type=str, default='gpt-4-0613', help='llm engine',
+                         choices=['gpt-3.5-turbo', 'gpt-3.5', 'gpt-4', 'gpt-4-0314', 'gpt-4-0613'])
+     parser.add_argument('--number', type=int, default=-1, help='number of problems to run')
+     parser.add_argument('--quick_extract', action='store_true', help='use rules to extract answer for some problems')
+     parser.add_argument('--rerun', action='store_true', help='rerun the answer extraction')
+     # output
+     parser.add_argument('--save_every', type=int, default=10, help='save every n problems')
+     parser.add_argument('--output_label', type=str, default='', help='label for the output file')
+     args = parser.parse_args()
+
+     # args
+     label = args.response_label
+     result_file = os.path.join(args.output_dir, args.output_file)
+
+     if args.output_label != '':
+         output_file = result_file.replace('.json', f'_{args.output_label}.json')
+     else:
+         output_file = result_file
+
+     # read results
+     print(f'Reading {result_file}...')
+     results = read_json(result_file)
+
+     # full pids
+     full_pids = list(results.keys())
+     if args.number > 0:
+         full_pids = full_pids[:min(args.number, len(full_pids))]
+     print('Number of testing problems:', len(full_pids))
+
+     # test pids: skip problems that already have a valid extraction unless --rerun is set
+     if args.rerun:
+         test_pids = full_pids
+     else:
+         test_pids = []
+         for pid in full_pids:
+             if 'extraction' not in results[pid] or not verify_extraction(results[pid]['extraction']):
+                 test_pids.append(pid)
+
+     test_num = len(test_pids)
+     print('Number of problems to run:', test_num)
+
+     # tqdm, enumerate results
+     for i, pid in enumerate(tqdm(test_pids)):
+         problem = results[pid]
+
+         assert label in problem
+         response = problem[label]
+
+         extraction = extract_answer(response, problem, args.quick_extract)
+         results[pid]['extraction'] = extraction
+
+         if i % args.save_every == 0 or i == test_num - 1:
+             print(f'Saving results to {output_file}...')
+             save_json(results, output_file)
+             print('Results saved.')
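For reference, a hypothetical `results` entry that this script can process; the keys are the ones `extract_answer` reads, and all values below are invented for illustration:

```python
# Hypothetical problem record (values invented for illustration).
problem = {
    'question_type': 'multi_choice',
    'answer_type': 'text',
    'choices': ['A', 'B', 'C', 'D'],
    'query': 'What fraction of the shape is blue? (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5',
    'response': 'Final answer: B',
}

# With --quick_extract, the regex path fires and no API call is made.
print(extract_answer(problem['response'], problem, quick_extract=True))  # -> 'B'
```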
eval/vlm/eval/mathvista/extract_answer_mp.py ADDED
@@ -0,0 +1,161 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+
+ import argparse
+ import os
+ import re
+ import json
+ import openai
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from tqdm import tqdm
+ from utilities import *
+ from prompts.ext_ans import demo_prompt
+
+ openai.api_key = os.getenv('OPENAI_API_KEY')
+ print('OPENAI_API_KEY set:', bool(openai.api_key))  # avoid echoing the key itself
+
+ def verify_extraction(extraction):
+     # check for None before calling .strip()
+     if extraction is None:
+         return False
+     return extraction.strip() != ''
+
+ def create_test_prompt(demo_prompt, query, response):
+     demo_prompt = demo_prompt.strip()
+     test_prompt = f'{query}\n\n{response}'
+     full_prompt = f'{demo_prompt}\n\n{test_prompt}\n\nExtracted answer: '
+     return full_prompt
+
+ def _extract_answer(text):
+     match = re.search(r'(Final answer:|Answer:)\s*(.*)', text, re.IGNORECASE)
+     if match:
+         return match.group(2).strip()
+     return text
+
+ def extract_answer(response, problem, quick_extract=False):
+     question_type = problem['question_type']
+     answer_type = problem['answer_type']
+     choices = problem['choices']
+     query = problem['query']
+
+     if response == '':
+         return ''
+
+     if question_type == 'multi_choice' and response in choices:
+         return response
+
+     if answer_type == 'integer':
+         try:
+             return str(int(response))
+         except (ValueError, TypeError):
+             pass
+
+     if answer_type == 'float':
+         try:
+             return str(float(response))
+         except (ValueError, TypeError):
+             pass
+
+     # quick extraction: e.g. 'Final answer: 14' -> '14'
+     if quick_extract:
+         print('Quickly extracting answer...')
+         try:
+             return _extract_answer(response)
+         except Exception:
+             pass
+
+     # general extraction via the LLM judge
+     try:
+         full_prompt = create_test_prompt(demo_prompt, query, response)
+         extraction = get_chat_response(full_prompt, openai.api_key, patience=5, model=args.llm_engine)
+         return extraction
+     except Exception as e:
+         print(e)
+
+     return ''
+
+ def process_problem(pid, results, label, args):
+     problem = results[pid]
+     response = problem[label]
+     extraction = extract_answer(response, problem, args.quick_extract)
+     return pid, extraction
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     # input
+     parser.add_argument('--output_dir', type=str, default='./results')
+     parser.add_argument('--output_file', type=str, default='mathvista_answer.json')
+     parser.add_argument('--response_label', type=str, default='response', help='response label for the input file')
+     # model
+     parser.add_argument('--llm_engine', type=str, default='gpt-4o-2024-11-20', help='llm engine',
+                         choices=['gpt-3.5-turbo', 'gpt-3.5', 'gpt-4', 'gpt-4-0314', 'gpt-4-0613',
+                                  'gpt-4o-2024-08-06', 'gpt-4o-2024-11-20'])
+     parser.add_argument('--number', type=int, default=-1, help='number of problems to run')
+     parser.add_argument('--quick_extract', action='store_true', help='use rules to extract answer for some problems')
+     parser.add_argument('--rerun', action='store_true', help='rerun the answer extraction')
+     # output
+     parser.add_argument('--save_every', type=int, default=100, help='save every n problems')
+     parser.add_argument('--output_label', type=str, default='', help='label for the output file')
+     parser.add_argument('--max_workers', type=int, default=40, help='max workers for ThreadPoolExecutor')
+     args = parser.parse_args()
+
+     label = args.response_label
+     result_file = os.path.join(args.output_dir, args.output_file)
+
+     if args.output_label != '':
+         output_file = result_file.replace('.json', f'_{args.output_label}.json')
+     else:
+         output_file = result_file
+
+     print(f'Reading {result_file}...')
+     results = read_json(result_file)
+
+     full_pids = list(results.keys())
+     if args.number > 0:
+         full_pids = full_pids[:min(args.number, len(full_pids))]
+     print('Number of total problems:', len(full_pids))
+
+     # skip problems that already have a valid extraction unless --rerun is set
+     if args.rerun:
+         test_pids = full_pids
+     else:
+         test_pids = []
+         for pid in full_pids:
+             if 'extraction' not in results[pid] or not verify_extraction(results[pid]['extraction']):
+                 test_pids.append(pid)
+
+     test_num = len(test_pids)
+     print('Number of problems to run:', test_num)
+
+     with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
+         future_to_pid = {}
+         for pid in test_pids:
+             future = executor.submit(process_problem, pid, results, label, args)
+             future_to_pid[future] = pid
+
+         completed_count = 0
+         for future in tqdm(as_completed(future_to_pid), total=test_num):
+             pid = future_to_pid[future]
+             try:
+                 pid_result, extraction = future.result()
+                 results[pid_result]['extraction'] = extraction
+             except Exception as e:
+                 print(f'Error processing pid={pid}: {e}')
+
+             completed_count += 1
+             if (completed_count % args.save_every == 0) or (completed_count == test_num):
+                 print(f'Saving results to {output_file}... [{completed_count}/{test_num}]')
+                 save_json(results, output_file)
+                 print('Results saved.')
+
+     print('All done!')
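The threading pattern above is worth noting: extraction calls run concurrently, but the shared `results` dict is only mutated in the main thread as futures complete, so no lock is needed. A generic sketch of the same submit / `as_completed` / checkpoint shape (the names here are illustrative, not from the script):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(key):
    return key, key * 2  # stand-in for the per-problem extraction call

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(work, k): k for k in range(10)}
    for done, future in enumerate(as_completed(futures), start=1):
        key, value = future.result()   # re-raises worker exceptions here
        results[key] = value           # main-thread-only mutation
        if done % 5 == 0:              # periodic save, like --save_every
            print(f'checkpoint after {done} items')
```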
eval/vlm/eval/mathvista/prompts/ext_ans.py ADDED
@@ -0,0 +1,51 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ # pids = 852, 104, 824, 506, 540
+
+ demo_prompt = """
+ Please read the following example. Then extract the answer from the model response and type it at the end of the prompt.
+
+ Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.
+ Question: Which number is missing?
+
+ Model response: The number missing in the sequence is 14.
+
+ Extracted answer: 14
+
+ Hint: Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.
+ Question: What is the fraction of females facing the camera?
+
+ Model response: The fraction of females facing the camera is 0.6, which means that six out of ten females in the group are facing the camera.
+
+ Extracted answer: 0.6
+
+ Hint: Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end.
+ Question: How much money does Luca need to buy a sour apple candy and a butterscotch candy? (Unit: $)
+
+ Model response: Luca needs $1.45 to buy a sour apple candy and a butterscotch candy.
+
+ Extracted answer: 1.45
+
+ Hint: Please answer the question requiring a Python list as an answer and provide the final list, e.g., [1, 2, 3], [1.2, 1.3, 1.4], at the end.
+ Question: Between which two years does the line graph saw its maximum peak?
+
+ Model response: The line graph saw its maximum peak between 2007 and 2008.
+
+ Extracted answer: [2007, 2008]
+
+ Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.
+ Question: What fraction of the shape is blue?\nChoices:\n(A) 3/11\n(B) 8/11\n(C) 6/11\n(D) 3/5
+
+ Model response: The correct answer is (B) 8/11.
+
+ Extracted answer: B
+ """
eval/vlm/eval/mathvista/utilities.py ADDED
@@ -0,0 +1,229 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ import json
+ import os
+ import pickle
+ import re
+ import time
+
+ import cv2
+ import openai
+ from word2number import w2n
+
+
+ def create_dir(output_dir):
+     if not os.path.exists(output_dir):
+         os.makedirs(output_dir)
+
+
+ def read_csv(file):
+     data = []
+     with open(file, 'r') as f:
+         for line in f:
+             data.append(line.strip())
+     return data
+
+
+ def read_pandas_csv(csv_path):
+     # read a pandas csv sheet
+     import pandas as pd
+     df = pd.read_csv(csv_path)
+     return df
+
+
+ def read_json(path):
+     with open(path, 'r', encoding='utf-8') as f:
+         return json.load(f)
+
+
+ def read_jsonl(file):
+     with open(file, 'r') as f:
+         data = [json.loads(line) for line in f]
+     return data
+
+
+ def read_pickle(path):
+     with open(path, 'rb') as f:
+         return pickle.load(f)
+
+
+ def save_json(data, path):
+     with open(path, 'w') as f:
+         json.dump(data, f, indent=4)
+
+
+ def save_array_img(path, image):
+     cv2.imwrite(path, image)
+
+
+ def contains_digit(text):
+     # check if text contains a digit
+     if any(char.isdigit() for char in text):
+         return True
+     return False
+
+
+ def contains_number_word(text):
+     # check if text contains a number word
+     ignore_words = ['a', 'an', 'point']
+     words = re.findall(r'\b\w+\b', text)  # This regex pattern matches any word in the text
+     for word in words:
+         if word in ignore_words:
+             continue
+         try:
+             w2n.word_to_num(word)
+             return True  # If the word can be converted to a number, return True
+         except ValueError:
+             continue  # If the word can't be converted to a number, continue with the next word
+
+     # check if text contains a digit
+     if any(char.isdigit() for char in text):
+         return True
+
+     return False  # If none of the words could be converted to a number, return False
+
+
+ def contains_quantity_word(text, special_keep_words=[]):
+     # check if text contains a quantity word
+     quantity_words = ['most', 'least', 'fewest',
+                       'more', 'less', 'fewer',
+                       'largest', 'smallest', 'greatest',
+                       'larger', 'smaller', 'greater',
+                       'highest', 'lowest', 'higher', 'lower',
+                       'increase', 'decrease',
+                       'minimum', 'maximum', 'max', 'min',
+                       'mean', 'average', 'median',
+                       'total', 'sum', 'add', 'subtract',
+                       'difference', 'quotient', 'gap',
+                       'half', 'double', 'twice', 'triple',
+                       'square', 'cube', 'root',
+                       'approximate', 'approximation',
+                       'triangle', 'rectangle', 'circle', 'square', 'cube', 'sphere', 'cylinder', 'cone', 'pyramid',
+                       'multiply', 'divide',
+                       'percentage', 'percent', 'ratio', 'proportion', 'fraction', 'rate',
+                       ]
+
+     quantity_words += special_keep_words  # dataset specific words
+
+     words = re.findall(r'\b\w+\b', text)  # This regex pattern matches any word in the text
+     if any(word in quantity_words for word in words):
+         return True
+
+     return False  # If none of the words is a quantity word, return False
+
+
+ def is_bool_word(text):
+     if text in ['Yes', 'No', 'True', 'False',
+                 'yes', 'no', 'true', 'false',
+                 'YES', 'NO', 'TRUE', 'FALSE']:
+         return True
+     return False
+
+
+ def is_digit_string(text):
+     # remove ".0000"
+     text = text.strip()
+     text = re.sub(r'\.0+$', '', text)
+     try:
+         int(text)
+         return True
+     except ValueError:
+         return False
+
+
+ def is_float_string(text):
+     # text is a float string if it contains a "." and can be converted to a float
+     if '.' in text:
+         try:
+             float(text)
+             return True
+         except ValueError:
+             return False
+     return False
+
+
+ def copy_image(image_path, output_image_path):
+     from shutil import copyfile
+     copyfile(image_path, output_image_path)
+
+
+ def copy_dir(src_dir, dst_dir):
+     from shutil import copytree
+
+     # copy the source directory to the target directory
+     copytree(src_dir, dst_dir)
+
+
+ import PIL.Image as Image
+
+
+ def get_image_size(img_path):
+     img = Image.open(img_path)
+     width, height = img.size
+     return width, height
+
+
+ def get_chat_response(
+     prompt="", api_key="",
+     base_url="your_api_url",
+     api_version="2024-03-01-preview", model="gpt-4-0613",
+     temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0
+ ):
+     # base_url must point at an Azure OpenAI endpoint
+     openai_client = openai.AzureOpenAI(
+         azure_endpoint=base_url,
+         api_version=api_version,
+         api_key=api_key,
+     )
+
+     messages = [
+         {'role': 'user', 'content': prompt},
+     ]
+     while patience > 0:
+         patience -= 1
+         try:
+             response = openai_client.chat.completions.create(
+                 model=model,
+                 messages=messages,
+                 temperature=temperature,
+                 max_tokens=max_tokens,
+                 n=n,
+             )
+             response = response.to_dict()
+             if n == 1:
+                 prediction = response['choices'][0]['message']['content'].strip()
+                 if prediction != '' and prediction is not None:
+                     return prediction
+             else:
+                 prediction = [choice['message']['content'].strip() for choice in response['choices']]
+                 if prediction[0] != '' and prediction[0] is not None:
+                     return prediction
+
+         except Exception as e:
+             if 'Rate limit' not in str(e):
+                 print(e)
+
+             if 'Please reduce the length of the messages' in str(e):
+                 print('!!Reduce prompt size')
+                 # reduce input prompt and keep the tail
+                 new_size = int(len(prompt) * 0.9)
+                 new_start = len(prompt) - new_size
+                 prompt = prompt[new_start:]
+                 messages = [
+                     {'role': 'user', 'content': prompt},
+                 ]
+
+             if sleep_time > 0:
+                 time.sleep(sleep_time)
+     return ''
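A hypothetical call to `get_chat_response`, assuming an Azure OpenAI deployment is available; `your_api_url` above is a placeholder that must be replaced, and `AZURE_OPENAI_ENDPOINT` below is an assumed environment variable, not one these scripts define:

```python
import os

answer = get_chat_response(
    prompt='Extract the final answer: ...',
    api_key=os.getenv('OPENAI_API_KEY'),
    base_url=os.getenv('AZURE_OPENAI_ENDPOINT'),  # assumed env var for the endpoint
    model='gpt-4-0613',
    patience=5,    # give up after 5 failed attempts instead of retrying forever
    sleep_time=1,  # back off one second between retries
)
```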
eval/vlm/eval/mmbench/evaluate_mmbench.py ADDED
@@ -0,0 +1,283 @@
+ # Copyright (c) 2023 OpenGVLab
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
+ # SPDX-License-Identifier: MIT
+ #
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
+ #
+ # Original file was released under MIT, with the full license text
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
+ #
+ # This modified file is released under the same license.
+
+ import argparse
+ import base64
+ import itertools
+ import json
+ import os
+ import random
+ from io import BytesIO
+
+ import pandas as pd
+ import torch
+ from eval.vlm.utils import load_model_and_tokenizer, build_transform, process_conversation
+ from PIL import Image
+ from tqdm import tqdm
+
+ ds_collections = {
+     'mmbench_dev_20230712': {
+         'root': 'eval/vlm/data/mmbench/mmbench_dev_20230712.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'dev',
+         'language': 'en'
+     },
+     'mmbench_dev_cn_20231003': {
+         'root': 'eval/vlm/data/mmbench/mmbench_dev_cn_20231003.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'dev',
+         'language': 'cn'
+     },
+     'mmbench_dev_en_20231003': {
+         'root': 'eval/vlm/data/mmbench/mmbench_dev_en_20231003.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'dev',
+         'language': 'en'
+     },
+     'mmbench_test_cn_20231003': {
+         'root': 'eval/vlm/data/mmbench/mmbench_test_cn_20231003.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'test',
+         'language': 'cn'
+     },
+     'mmbench_test_en_20231003': {
+         'root': 'eval/vlm/data/mmbench/mmbench_test_en_20231003.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'test',
+         'language': 'en'
+     },
+     'ccbench_dev_cn': {
+         'root': 'eval/vlm/data/mmbench/CCBench_legacy.tsv',
+         'max_new_tokens': 100,
+         'min_new_tokens': 1,
+         'type': 'dev',
+         'language': 'cn'
+     }
+ }
+
+
+ def collate_fn(batches):
+     questions = [_['question'] for _ in batches]
+     images = [_['images'] for _ in batches]
+     conversation = [_['conversation'] for _ in batches]
+     answers = [_['answer'] for _ in batches]
+     indexes = [_['index'] for _ in batches]
+     options = [_['option'] for _ in batches]
+     return questions, images, conversation, answers, indexes, options
+
+
+ class MMBenchDataset(torch.utils.data.Dataset):
+
+     def __init__(self, root, prompt, language):
+         self.df = pd.read_csv(root, sep='\t')
+         self.prompt = prompt
+         self.language = language
+
+     def __len__(self):
+         return len(self.df)
+
+     def __getitem__(self, idx):
+         index = self.df.iloc[idx]['index']
+         image = self.df.iloc[idx]['image']
+         question = self.df.iloc[idx]['question']
+         answer = self.df.iloc[idx]['answer'] if 'answer' in self.df.columns else None
+         # category = self.df.iloc[idx]['category']
+         # l2_category = self.df.iloc[idx]['l2-category']
+
+         image = Image.open(BytesIO(base64.b64decode(image))).convert('RGB')
+         images = [image]
+
+         option_candidate = ['A', 'B', 'C', 'D', 'E']
+         options = {
+             cand: self.load_from_df(idx, cand)
+             for cand in option_candidate
+             if self.load_from_df(idx, cand) is not None
+         }
+
+         hint = self.load_from_df(idx, 'hint')
+         if hint is not None:
+             question = hint + '\n' + question
+         for key, item in options.items():
+             question += f'\n{key}. {item}'
+         if self.language == 'cn':
+             question = question + '\n' + self.prompt['cn']
+         else:
+             question = question + '\n' + self.prompt['en']
+
+         images, conversation = process_conversation(images, question)
+
+         return {
+             'question': question,
+             'images': images,
+             'conversation': conversation,
+             'answer': answer,
+             'index': index,
+             'option': options
+         }
+
+     def load_from_df(self, idx, key):
+         if key in self.df.iloc[idx] and not pd.isna(self.df.iloc[idx][key]):
+             return self.df.iloc[idx][key]
+         else:
+             return None
+
+
+ class InferenceSampler(torch.utils.data.sampler.Sampler):
+
+     def __init__(self, size):
+         self._size = int(size)
+         assert size > 0
+         self._rank = torch.distributed.get_rank()
+         self._world_size = torch.distributed.get_world_size()
+         self._local_indices = self._get_local_indices(size, self._world_size, self._rank)
+
+     @staticmethod
+     def _get_local_indices(total_size, world_size, rank):
+         shard_size = total_size // world_size
+         left = total_size % world_size
+         # the first `left` ranks get one extra sample each
+         shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
+
+         begin = sum(shard_sizes[:rank])
+         end = min(sum(shard_sizes[:rank + 1]), total_size)
+         return range(begin, end)
+
+     def __iter__(self):
+         yield from self._local_indices
+
+     def __len__(self):
+         return len(self._local_indices)
+
+
+ def post_process(pred, option):
+     pred = pred.strip()
+     option_candidate = list(option.keys())
+     if len(pred) == 1:
+         return pred
+     if len(pred) == 0:
+         pred = 'C'  # fall back to a default option when the model returns nothing
+     elif pred[0] in option_candidate:
+         return pred[0]
+     else:
+         # match the option text instead of the option letter
+         for k, v in option.items():
+             if v in pred:
+                 return k
+
+     return pred
+
+
+ def evaluate_chat_model():
+     random.seed(args.seed)
+
+     for ds_name in args.datasets:
+         dataset = MMBenchDataset(
+             root=ds_collections[ds_name]['root'],
+             prompt=prompt,
+             language=ds_collections[ds_name]['language'],
+         )
+         dataloader = torch.utils.data.DataLoader(
+             dataset=dataset,
+             sampler=InferenceSampler(len(dataset)),
+             batch_size=args.batch_size,
+             num_workers=args.num_workers,
+             pin_memory=True,
+             drop_last=False,
+             collate_fn=collate_fn,
+         )
+
+         outputs = []
+         for _, (questions, images, conversation, answers, indexes, options) in tqdm(enumerate(dataloader)):
+             pred = model.chat(
+                 tokenizer,
+                 new_token_ids,
+                 image_transform,
+                 images=images[0],  # batch=1
+                 prompt=conversation[0],  # batch=1
+                 max_length=ds_collections[ds_name]['max_new_tokens'],  # TODO: how to use ds_collections[ds_name]['min_new_tokens']
+             )
+             preds = [post_process(pred, options[0])]
+
+             for question, pred, answer, index in zip(questions, preds, answers, indexes):
+                 outputs.append({
+                     'question': question,
+                     'answer': pred,
+                     'gt_answers': answer,
+                     'index': int(index)
+                 })
+
+         torch.distributed.barrier()
+
+         world_size = torch.distributed.get_world_size()
+         merged_outputs = [None for _ in range(world_size)]
+         torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs))
+
+         merged_outputs = [json.loads(_) for _ in merged_outputs]
+         merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)]
+
+         if torch.distributed.get_rank() == 0:
+             print(f'Evaluating {ds_name} ...')
+             results_file = 'results.xlsx'
+             output_path = os.path.join(args.out_dir, results_file)
+             df = pd.read_table(ds_collections[ds_name]['root'])
+             cur_df = df.copy()
+             if 'mmbench' in ds_name:
+                 cur_df = cur_df.drop(columns=['hint', 'category', 'source', 'image', 'comment', 'l2-category'])
+                 cur_df.insert(6, 'prediction', None)
+             else:
+                 cur_df = cur_df.drop(columns=['category', 'image'])
+                 cur_df.insert(8, 'prediction', None)
+             for item in merged_outputs:
+                 cur_df.loc[df['index'] == item['index'], 'prediction'] = item['answer']
+
+             cur_df.to_excel(output_path, index=False, engine='openpyxl')
+             print('Results saved to {}'.format(output_path))
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--datasets', type=str, default='mmbench_dev_20230712')
+     parser.add_argument('--batch-size', type=int, default=1)
+     parser.add_argument('--num-workers', type=int, default=1)
+     parser.add_argument('--out-dir', type=str, default='results')
+     parser.add_argument('--seed', type=int, default=0)
+     parser.add_argument('--model-path', type=str, default='hf/BAGEL-7B-MoT/')
+     args = parser.parse_args()
+
+     if not os.path.exists(args.out_dir):
+         os.makedirs(args.out_dir, exist_ok=True)
+
+     args.datasets = args.datasets.split(',')
+     print('datasets:', args.datasets)
+     assert args.batch_size == 1, 'Only batch size 1 is supported'
+
+     torch.distributed.init_process_group(
+         backend='nccl',
+         world_size=int(os.getenv('WORLD_SIZE', '1')),
+         rank=int(os.getenv('RANK', '0')),
+     )
+
+     torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
+
+     model, tokenizer, new_token_ids = load_model_and_tokenizer(args)
+     image_transform = build_transform()
+
+     total_params = sum(p.numel() for p in model.parameters()) / 1e9
+     print(f'[test] total_params: {total_params}B')
+
+     prompt = {
+         'en': "Answer with the option's letter from the given choices directly.",
+         'cn': '请直接回答选项字母。'
+     }
+     evaluate_chat_model()
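`InferenceSampler._get_local_indices` splits the dataset into contiguous, near-equal shards, giving the remainder to the lowest ranks. A small worked example:

```python
# 10 items across 4 ranks -> shard sizes [3, 3, 2, 2]
shards = [InferenceSampler._get_local_indices(10, 4, r) for r in range(4)]
print([list(s) for s in shards])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```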
eval/vlm/eval/mme/Your_Results/OCR.txt ADDED
@@ -0,0 +1,40 @@
+ 0001.jpg Is the word in the logo "angie's"? Please answer yes or no. Yes
+ 0001.jpg Is the word in the logo "angle's"? Please answer yes or no. No
+ 0002.jpg Is the word in the logo "c'est cheese"? Please answer yes or no. Yes
+ 0002.jpg Is the word in the logo "crest cheese"? Please answer yes or no. No
+ 0003.jpg Is the word in the logo "beavertails pastry"? Please answer yes or no. Yes
+ 0003.jpg Is the word in the logo "beavertalls pastry"? Please answer yes or no. No
+ 0004.jpg Is the word in the logo "old market sundries"? Please answer yes or no. Yes
+ 0004.jpg Is the word in the logo "old market hundreds"? Please answer yes or no. No
+ 0005.jpg Is the word in the logo "kress"? Please answer yes or no. Yes
+ 0005.jpg Is the word in the logo "dress"? Please answer yes or no. No
+ 0006.jpg Is the word in the logo "the beatles story liver pool"? Please answer yes or no. Yes
+ 0006.jpg Is the word in the logo "the beats story liver pool"? Please answer yes or no. No
+ 0007.jpg Is the phone number in the picture "0131 555 6363"? Please answer yes or no. Yes
+ 0007.jpg Is the phone number in the picture "0137 556 6363"? Please answer yes or no. No
+ 0008.jpg Is the word in the logo "phil's market"? Please answer yes or no. Yes
+ 0008.jpg Is the word in the logo "phll's market"? Please answer yes or no. No
+ 0009.jpg Is the word in the logo "fenders diner"? Please answer yes or no. Yes
+ 0009.jpg Is the word in the logo "finders diner"? Please answer yes or no. No
+ 0010.jpg Is the word in the logo "high time coffee shop"? Please answer yes or no. Yes
+ 0010.jpg Is the word in the logo "high tite cofeee shop"? Please answer yes or no. No
+ 0011.jpg Is the word in the logo "ihop restaurant"? Please answer yes or no. Yes
+ 0011.jpg Is the word in the logo "lhop restaurant"? Please answer yes or no. No
+ 0012.jpg Is the word in the logo "casa grecque restaurants"? Please answer yes or no. Yes
+ 0012.jpg Is the word in the logo "case grecque restaurants"? Please answer yes or no. No
+ 0013.jpg Is the word in the picture "seabreeze motel"? Please answer yes or no. Yes
+ 0013.jpg Is the word in the picture "seebreeze model"? Please answer yes or no. No
+ 0014.jpg Is the word in the logo "penarth pier built 1894"? Please answer yes or no. Yes
+ 0014.jpg Is the word in the logo "penarth pies buid 1894"? Please answer yes or no. No
+ 0015.jpg Is the text in the picture "hollywood"? Please answer yes or no. Yes
+ 0015.jpg Is the text in the picture "holly word"? Please answer yes or no. No
+ 0016.jpg Is the word in the logo "shop rite"? Please answer yes or no. Yes
+ 0016.jpg Is the word in the logo "stop rite"? Please answer yes or no. No
+ 0017.jpg Is the word in the logo "hardco industrial construction"? Please answer yes or no. Yes
+ 0017.jpg Is the word in the logo "hardto industal construction"? Please answer yes or no. No
+ 0018.jpg Is the word in the logo "oldsmobile service"? Please answer yes or no. Yes
+ 0018.jpg Is the word in the logo "old mobile service"? Please answer yes or no. No
+ 0019.jpg Is the word in the logo "exchange hotel"? Please answer yes or no. Yes
+ 0019.jpg Is the word in the logo "excharge hotel"? Please answer yes or no. No
+ 0020.jpg Is the word in the logo "cold drinks"? Please answer yes or no. Yes
+ 0020.jpg Is the word in the logo "cold rinks"? Please answer yes or no. No
eval/vlm/eval/mme/Your_Results/artwork.txt ADDED
@@ -0,0 +1,400 @@
+ 10002.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 10002.jpg Does this artwork exist in the form of glassware? Please answer yes or no. No
+ 10049.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 10049.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. No
+ 10256.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 10256.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. No
+ 10358.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 10358.jpg Does this artwork exist in the form of glassware? Please answer yes or no. No
+ 10543.jpg Is this artwork displayed in fogg art museum, harvard university, cambridge? Please answer yes or no. Yes
+ 10543.jpg Is this artwork displayed in museo civico, pistoia? Please answer yes or no. No
+ 10581.jpg Does this artwork belong to the type of portrait? Please answer yes or no. Yes
+ 10581.jpg Does this artwork belong to the type of genre? Please answer yes or no. No
+ 1060.jpg Is this artwork created by antoniazzo romano? Please answer yes or no. Yes
+ 1060.jpg Is this artwork created by gentile da fabriano? Please answer yes or no. No
+ 10881.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 10881.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
+ 10970.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 10970.jpg Does this artwork belong to the type of study? Please answer yes or no. No
+ 11276.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. Yes
+ 11276.jpg Does this artwork exist in the form of graphics? Please answer yes or no. No
+ 11331.jpg Is this artwork created by donatello? Please answer yes or no. Yes
+ 11331.jpg Is this artwork created by zichy, mihály? Please answer yes or no. No
+ 11488.jpg Does this artwork belong to the type of mythological? Please answer yes or no. Yes
+ 11488.jpg Does this artwork belong to the type of historical? Please answer yes or no. No
+ 11724.jpg Is this artwork created by duccio di buoninsegna? Please answer yes or no. Yes
+ 11724.jpg Is this artwork created by giani, felice? Please answer yes or no. No
+ 11726.jpg Is this artwork titled temptation on the mountain (detail)? Please answer yes or no. Yes
+ 11726.jpg Is this artwork titled in the forest of fontainebleau? Please answer yes or no. No
+ 12133.jpg Is this artwork titled hand study with bible? Please answer yes or no. Yes
+ 12133.jpg Is this artwork titled self-portrait aged 78? Please answer yes or no. No
+ 12439.jpg Is this artwork created by dürer, albrecht? Please answer yes or no. Yes
+ 12439.jpg Is this artwork created by koekkoek, barend cornelis? Please answer yes or no. No
+ 12561.jpg Is this artwork created by eberlein, gustav heinrich? Please answer yes or no. Yes
+ 12561.jpg Is this artwork created by gillemans, jan pauwel the younger? Please answer yes or no. No
+ 12652.jpg Is this artwork displayed in stedelijk museum de lakenhal, leiden? Please answer yes or no. Yes
+ 12652.jpg Is this artwork displayed in palazzo ducale, mantua? Please answer yes or no. No
+ 12736.jpg Is this artwork displayed in cannon hall museum, barnsley? Please answer yes or no. Yes
+ 12736.jpg Is this artwork displayed in protestant parish church, gelnhausen? Please answer yes or no. No
+ 12902.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 12902.jpg Is this artwork displayed in musée national gustave-moreau, paris? Please answer yes or no. No
+ 12908.jpg Is this artwork titled ruth and boaz? Please answer yes or no. Yes
+ 12908.jpg Is this artwork titled view of dresden from the right bank of the elbe with the augustus bridge? Please answer yes or no. No
+ 13091.jpg Is this artwork titled italianate landscape with figures by classical ruins? Please answer yes or no. Yes
+ 13091.jpg Is this artwork titled two boys singing? Please answer yes or no. No
+ 13174.jpg Is this artwork titled nobility? Please answer yes or no. Yes
+ 13174.jpg Is this artwork titled aretino in the studio of tintoretto? Please answer yes or no. No
+ 13239.jpg Is this artwork titled doge ziani receiving the benediction of pope alexander iii? Please answer yes or no. Yes
+ 13239.jpg Is this artwork titled the adoration of the shepherds? Please answer yes or no. No
+ 13288.jpg Does this artwork exist in the form of architecture? Please answer yes or no. Yes
+ 13288.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
+ 13696.jpg Is this artwork displayed in pinacoteca nazionale, siena? Please answer yes or no. Yes
+ 13696.jpg Is this artwork displayed in british embassy, paris? Please answer yes or no. No
+ 13760.jpg Is this artwork titled noli me tangere? Please answer yes or no. Yes
+ 13760.jpg Is this artwork titled profile study of a bearded man? Please answer yes or no. No
+ 13821.jpg Is this artwork created by frangipane, niccolò? Please answer yes or no. Yes
+ 13821.jpg Is this artwork created by drevet, pierre? Please answer yes or no. No
+ 13901.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 13901.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
+ 14283.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 14283.jpg Does this artwork exist in the form of mosaic? Please answer yes or no. No
+ 14499.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 14499.jpg Does this artwork belong to the type of mythological? Please answer yes or no. No
+ 14777.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 14777.jpg Does this artwork belong to the type of historical? Please answer yes or no. No
+ 15028.jpg Does this artwork belong to the type of portrait? Please answer yes or no. Yes
+ 15028.jpg Does this artwork belong to the type of study? Please answer yes or no. No
+ 15232.jpg Is this artwork created by giordano, luca? Please answer yes or no. Yes
+ 15232.jpg Is this artwork created by heyerdahl, hans olaf? Please answer yes or no. No
+ 15246.jpg Is this artwork displayed in palazzo medici riccardi, florence? Please answer yes or no. Yes
+ 15246.jpg Is this artwork displayed in abbey church of sainte-foy, conques (aveyron)? Please answer yes or no. No
+ 15311.jpg Is this artwork created by giorgione? Please answer yes or no. Yes
+ 15311.jpg Is this artwork created by marilhat, prosper? Please answer yes or no. No
+ 15989.jpg Is this artwork displayed in pinacoteca, vatican? Please answer yes or no. Yes
+ 15989.jpg Is this artwork displayed in cathedral museum, zamora? Please answer yes or no. No
+ 16006.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 16006.jpg Is this artwork displayed in cathedral of san geminiano, modena? Please answer yes or no. No
+ 16249.jpg Does this artwork belong to the type of landscape? Please answer yes or no. Yes
+ 16249.jpg Does this artwork belong to the type of religious? Please answer yes or no. No
+ 16538.jpg Is this artwork created by gogh, vincent van? Please answer yes or no. Yes
+ 16538.jpg Is this artwork created by altdorfer, albrecht? Please answer yes or no. No
+ 16835.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 16835.jpg Does this artwork exist in the form of illumination? Please answer yes or no. No
+ 16911.jpg Is this artwork created by gossart, jan? Please answer yes or no. Yes
+ 16911.jpg Is this artwork created by stanzione, massimo? Please answer yes or no. No
+ 17311.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 17311.jpg Does this artwork belong to the type of interior? Please answer yes or no. No
+ 17317.jpg Is this artwork created by gozzoli, benozzo? Please answer yes or no. Yes
+ 17317.jpg Is this artwork created by coriolano, cristoforo? Please answer yes or no. No
+ 17535.jpg Is this artwork created by grebber, pieter de? Please answer yes or no. Yes
+ 17535.jpg Is this artwork created by massys, quentin? Please answer yes or no. No
+ 17823.jpg Is this artwork created by greuze, jean-baptiste? Please answer yes or no. Yes
+ 17823.jpg Is this artwork created by landseer, sir edwin henry? Please answer yes or no. No
+ 17838.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 17838.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
+ 17998.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 17998.jpg Does this artwork belong to the type of genre? Please answer yes or no. No
+ 18566.jpg Is this artwork created by hamen, juan van der? Please answer yes or no. Yes
+ 18566.jpg Is this artwork created by starnina, gherardo di jacopo? Please answer yes or no. No
+ 18604.jpg Is this artwork created by hardouin-mansart, jules? Please answer yes or no. Yes
+ 18604.jpg Is this artwork created by kerseboom, friedrich? Please answer yes or no. No
+ 18722.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 18722.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. No
+ 1873.jpg Does this artwork exist in the form of architecture? Please answer yes or no. Yes
+ 1873.jpg Does this artwork exist in the form of painting? Please answer yes or no. No
+ 18902.jpg Is this artwork created by herrera, francisco de, the elder? Please answer yes or no. Yes
+ 18902.jpg Is this artwork created by ingres, jean-auguste-dominique? Please answer yes or no. No
+ 18926.jpg Is this artwork created by herring, john frederick the younger? Please answer yes or no. Yes
+ 18926.jpg Is this artwork created by cozens, john robert? Please answer yes or no. No
+ 19087.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 19087.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
+ 19154.jpg Is this artwork titled portrait of the merchant georg gisze (detail)? Please answer yes or no. Yes
+ 19154.jpg Is this artwork titled pair of table candlesticks? Please answer yes or no. No
+ 19417.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 19417.jpg Does this artwork exist in the form of mosaic? Please answer yes or no. No
+ 19452.jpg Is this artwork titled the artist and his model? Please answer yes or no. Yes
+ 19452.jpg Is this artwork titled the lovesick maiden (detail)? Please answer yes or no. No
+ 19839.jpg Is this artwork created by janneck, franz christoph? Please answer yes or no. Yes
+ 19839.jpg Is this artwork created by goupil, jules-adolphe? Please answer yes or no. No
+ 19863.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 19863.jpg Does this artwork belong to the type of mythological? Please answer yes or no. No
+ 19993.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 19993.jpg Is this artwork displayed in cathedral of st paul, liège? Please answer yes or no. No
+ 20176.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 20176.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
+ 20437.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 20437.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
+ 20442.jpg Is this artwork created by kucharski, aleksander? Please answer yes or no. Yes
+ 20442.jpg Is this artwork created by pourbus, frans the elder? Please answer yes or no. No
+ 20455.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 20455.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
+ 20483.jpg Is this artwork titled allegory of the regency? Please answer yes or no. Yes
+ 20483.jpg Is this artwork titled breton woman bathing? Please answer yes or no. No
+ 20490.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 20490.jpg Does this artwork exist in the form of illumination? Please answer yes or no. No
+ 20551.jpg Is this artwork created by lagrenée, jean-jacques? Please answer yes or no. Yes
+ 20551.jpg Is this artwork created by scultori, diana? Please answer yes or no. No
+ 20651.jpg Is this artwork titled a highland landscape? Please answer yes or no. Yes
+ 20651.jpg Is this artwork titled a dog and a cat fighting in a kitchen interior? Please answer yes or no. No
+ 20724.jpg Does this artwork belong to the type of portrait? Please answer yes or no. Yes
+ 20724.jpg Does this artwork belong to the type of landscape? Please answer yes or no. No
+ 21048.jpg Is this artwork created by lemoyne, jean-baptiste ii? Please answer yes or no. Yes
+ 21048.jpg Is this artwork created by kneller, sir godfrey? Please answer yes or no. No
+ 21097.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 21097.jpg Does this artwork belong to the type of genre? Please answer yes or no. No
+ 21244.jpg Does this artwork belong to the type of study? Please answer yes or no. Yes
+ 21244.jpg Does this artwork belong to the type of portrait? Please answer yes or no. No
+ 21469.jpg Does this artwork belong to the type of genre? Please answer yes or no. Yes
+ 21469.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 21580.jpg Is this artwork created by linard, jacques? Please answer yes or no. Yes
+ 21580.jpg Is this artwork created by bonino da campione? Please answer yes or no. No
+ 21712.jpg Is this artwork titled st john the evangelist resuscitating drusiana? Please answer yes or no. Yes
+ 21712.jpg Is this artwork titled la finette? Please answer yes or no. No
+ 22329.jpg Is this artwork titled marriage of the virgin? Please answer yes or no. Yes
+ 22329.jpg Is this artwork titled landscape with river and figures (detail)? Please answer yes or no. No
+ 22366.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 22366.jpg Does this artwork exist in the form of glassware? Please answer yes or no. No
+ 22667.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 22667.jpg Is this artwork displayed in san francesco d'assisi, pavia? Please answer yes or no. No
+ 22760.jpg Is this artwork titled madonna and child (detail)? Please answer yes or no. Yes
+ 22760.jpg Is this artwork titled view of the south and east walls? Please answer yes or no. No
+ 22842.jpg Is this artwork titled ukrainian peasant girl? Please answer yes or no. Yes
+ 22842.jpg Is this artwork titled virtue crowning merit? Please answer yes or no. No
+ 23229.jpg Is this artwork displayed in national gallery, london? Please answer yes or no. Yes
+ 23229.jpg Is this artwork displayed in notre-dame-la-riche, tours? Please answer yes or no. No
+ 23427.jpg Is this artwork displayed in the hermitage, st. petersburg? Please answer yes or no. Yes
+ 23427.jpg Is this artwork displayed in national gallery of victoria, melbourne? Please answer yes or no. No
+ 23465.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 23465.jpg Is this artwork displayed in cistertian church, zirc? Please answer yes or no. No
+ 23824.jpg Is this artwork titled christ walking on the water? Please answer yes or no. Yes
+ 23824.jpg Is this artwork titled mademoiselle romaine lacaux? Please answer yes or no. No
+ 24122.jpg Is this artwork displayed in museo correr, venice? Please answer yes or no. Yes
+ 24122.jpg Is this artwork displayed in church of brou, bourg-en-bresse? Please answer yes or no. No
+ 24260.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 24260.jpg Does this artwork exist in the form of illumination? Please answer yes or no. No
+ 24291.jpg Is this artwork titled virgin and child with sts catherine, cecilia, barbara, and ursula? Please answer yes or no. Yes
+ 24291.jpg Is this artwork titled sorrow? Please answer yes or no. No
+ 24723.jpg Is this artwork titled tomb of henry the lion and his wife matilda? Please answer yes or no. Yes
+ 24723.jpg Is this artwork titled god the father? Please answer yes or no. No
+ 2490.jpg Does this artwork belong to the type of landscape? Please answer yes or no. Yes
+ 2490.jpg Does this artwork belong to the type of mythological? Please answer yes or no. No
+ 2507.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
+ 2507.jpg Is this artwork displayed in st. vitus's cathedral, prague? Please answer yes or no. No
+ 25312.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 25312.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
+ 25476.jpg Is this artwork created by michelangelo buonarroti? Please answer yes or no. Yes
+ 25476.jpg Is this artwork created by beuckelaer, joachim? Please answer yes or no. No
+ 25492.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. Yes
+ 25492.jpg Does this artwork exist in the form of illumination? Please answer yes or no. No
+ 25513.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 25513.jpg Does this artwork belong to the type of landscape? Please answer yes or no. No
+ 26521.jpg Does this artwork exist in the form of illumination? Please answer yes or no. Yes
+ 26521.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
+ 26973.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 26973.jpg Does this artwork belong to the type of mythological? Please answer yes or no. No
+ 27021.jpg Is this artwork created by miniaturist, german? Please answer yes or no. Yes
+ 27021.jpg Is this artwork created by trinquesse, louis-rolland? Please answer yes or no. No
+ 27662.jpg Does this artwork belong to the type of still-life? Please answer yes or no. Yes
+ 27662.jpg Does this artwork belong to the type of mythological? Please answer yes or no. No
+ 27936.jpg Does this artwork belong to the type of portrait? Please answer yes or no. Yes
+ 27936.jpg Does this artwork belong to the type of interior? Please answer yes or no. No
+ 28039.jpg Is this artwork displayed in cappella palatina, palermo? Please answer yes or no. Yes
+ 28039.jpg Is this artwork displayed in musée des beaux-arts, chambéry? Please answer yes or no. No
+ 28345.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 28345.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
+ 28400.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 28400.jpg Does this artwork belong to the type of portrait? Please answer yes or no. No
+ 28698.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 28698.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 28758.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 28758.jpg Does this artwork exist in the form of graphics? Please answer yes or no. No
+ 28974.jpg Is this artwork titled prayer before the meal? Please answer yes or no. Yes
+ 28974.jpg Is this artwork titled rest in the mountains? Please answer yes or no. No
+ 29266.jpg Is this artwork created by palma vecchio? Please answer yes or no. Yes
+ 29266.jpg Is this artwork created by maris, jacobus hendricus? Please answer yes or no. No
+ 30443.jpg Is this artwork titled the crucifixion with sts jerome and christopher? Please answer yes or no. Yes
+ 30443.jpg Is this artwork titled tomb of michelangelo (detail)? Please answer yes or no. No
+ 3085.jpg Is this artwork created by bartsius, willem? Please answer yes or no. Yes
+ 3085.jpg Is this artwork created by oehme, ernst ferdinand? Please answer yes or no. No
+ 30875.jpg Is this artwork created by pomarancio? Please answer yes or no. Yes
+ 30875.jpg Is this artwork created by steen, jan? Please answer yes or no. No
+ 3114.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 3114.jpg Does this artwork belong to the type of study? Please answer yes or no. No
+ 31808.jpg Is this artwork created by raffaello sanzio? Please answer yes or no. Yes
+ 31808.jpg Is this artwork created by simon von taisten? Please answer yes or no. No
+ 32147.jpg Is this artwork titled lucretia? Please answer yes or no. Yes
+ 32147.jpg Is this artwork titled rinaldo abandoning armida (detail)? Please answer yes or no. No
+ 3241.jpg Is this artwork titled holy family? Please answer yes or no. Yes
+ 3241.jpg Is this artwork titled friedrich iii, the wise, and johann i, the constant, electors of saxony? Please answer yes or no. No
+ 33017.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 33017.jpg Does this artwork exist in the form of glassware? Please answer yes or no. No
+ 33069.jpg Does this artwork belong to the type of historical? Please answer yes or no. Yes
+ 33069.jpg Does this artwork belong to the type of interior? Please answer yes or no. No
+ 33173.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 33173.jpg Does this artwork exist in the form of graphics? Please answer yes or no. No
+ 33753.jpg Is this artwork titled vanitas? Please answer yes or no. Yes
+ 33753.jpg Is this artwork titled legend of st francis: 18. apparition at arles (detail)? Please answer yes or no. No
+ 33854.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 33854.jpg Does this artwork belong to the type of study? Please answer yes or no. No
+ 339.jpg Is this artwork displayed in staatliche museen, berlin? Please answer yes or no. Yes
+ 339.jpg Is this artwork displayed in national museum of religious carvings, valladolid? Please answer yes or no. No
+ 33933.jpg Is this artwork titled madonna and child? Please answer yes or no. Yes
+ 33933.jpg Is this artwork titled the bacino di san marco? Please answer yes or no. No
+ 3404.jpg Is this artwork displayed in szépmûvészeti múzeum, budapest? Please answer yes or no. Yes
+ 3404.jpg Is this artwork displayed in s. eustorgio, milan? Please answer yes or no. No
+ 34109.jpg Is this artwork displayed in national gallery of art, washington? Please answer yes or no. Yes
+ 34109.jpg Is this artwork displayed in abbey church of sainte-foy, conques? Please answer yes or no. No
+ 34363.jpg Is this artwork displayed in museo del prado, madrid? Please answer yes or no. Yes
+ 34363.jpg Is this artwork displayed in state tretyakov gallery, moscow? Please answer yes or no. No
+ 34539.jpg Is this artwork titled the victory of eucharistic truth over heresy? Please answer yes or no. Yes
+ 34539.jpg Is this artwork titled a sunday afternoon on the ile de la grande jatte? Please answer yes or no. No
+ 34627.jpg Does this artwork belong to the type of landscape? Please answer yes or no. Yes
+ 34627.jpg Does this artwork belong to the type of genre? Please answer yes or no. No
+ 34638.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 34638.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
+ 34669.jpg Does this artwork belong to the type of mythological? Please answer yes or no. Yes
+ 34669.jpg Does this artwork belong to the type of historical? Please answer yes or no. No
+ 35345.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 35345.jpg Does this artwork belong to the type of landscape? Please answer yes or no. No
+ 35439.jpg Is this artwork titled madonna and child with a host of musical angels? Please answer yes or no. Yes
+ 35439.jpg Is this artwork titled garden in fontenay? Please answer yes or no. No
+ 35460.jpg Is this artwork created by schinkel, karl friedrich? Please answer yes or no. Yes
+ 35460.jpg Is this artwork created by giolfino, bartolomeo? Please answer yes or no. No
+ 35486.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 35486.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
+ 35513.jpg Is this artwork created by schongauer, martin? Please answer yes or no. Yes
+ 35513.jpg Is this artwork created by cassioli, amos? Please answer yes or no. No
+ 3552.jpg Is this artwork titled madonna degli alberetti? Please answer yes or no. Yes
+ 3552.jpg Is this artwork titled peter gillis? Please answer yes or no. No
+ 35658.jpg Is this artwork created by sebastiano del piombo? Please answer yes or no. Yes
+ 35658.jpg Is this artwork created by jacobsz., dirck? Please answer yes or no. No
+ 35736.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 35736.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 35861.jpg Does this artwork belong to the type of interior? Please answer yes or no. Yes
+ 35861.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 36805.jpg Is this artwork titled weir? Please answer yes or no. Yes
+ 36805.jpg Is this artwork titled view of the window wall? Please answer yes or no. No
+ 36966.jpg Does this artwork belong to the type of portrait? Please answer yes or no. Yes
+ 36966.jpg Does this artwork belong to the type of religious? Please answer yes or no. No
+ 37010.jpg Is this artwork titled madonna and child with the young st john? Please answer yes or no. Yes
+ 37010.jpg Is this artwork titled sketch for attila? Please answer yes or no. No
+ 37077.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
+ 37077.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 37439.jpg Is this artwork titled the message? Please answer yes or no. Yes
+ 37439.jpg Is this artwork titled the descent from the cross? Please answer yes or no. No
+ 37819.jpg Is this artwork created by tiepolo, giovanni battista? Please answer yes or no. Yes
+ 37819.jpg Is this artwork created by kerricx, willem ignatius? Please answer yes or no. No
+ 37866.jpg Does this artwork belong to the type of mythological? Please answer yes or no. Yes
+ 37866.jpg Does this artwork belong to the type of still-life? Please answer yes or no. No
+ 381.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
+ 381.jpg Does this artwork exist in the form of architecture? Please answer yes or no. No
+ 38178.jpg Is this artwork created by tintoretto? Please answer yes or no. Yes
292
+ 38178.jpg Is this artwork created by morel, jean-baptiste? Please answer yes or no. No
293
+ 38536.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
294
+ 38536.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
295
+ 38546.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
296
+ 38546.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
297
+ 38694.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
298
+ 38694.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
299
+ 38740.jpg Is this artwork displayed in musée toulouse-lautrec, albi? Please answer yes or no. Yes
300
+ 38740.jpg Is this artwork displayed in kupferstichkabinett, gotha? Please answer yes or no. No
301
+ 38881.jpg Does this artwork belong to the type of genre? Please answer yes or no. Yes
302
+ 38881.jpg Does this artwork belong to the type of religious? Please answer yes or no. No
303
+ 38993.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
304
+ 38993.jpg Does this artwork exist in the form of illumination? Please answer yes or no. No
305
+ 39026.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
306
+ 39026.jpg Does this artwork belong to the type of historical? Please answer yes or no. No
307
+ 39124.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
308
+ 39124.jpg Does this artwork exist in the form of graphics? Please answer yes or no. No
309
+ 39188.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
310
+ 39188.jpg Does this artwork exist in the form of architecture? Please answer yes or no. No
311
+ 39482.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
312
+ 39482.jpg Does this artwork exist in the form of metalwork? Please answer yes or no. No
313
+ 39556.jpg Is this artwork created by unknown master, dutch? Please answer yes or no. Yes
314
+ 39556.jpg Is this artwork created by cuyp, benjamin gerritsz.? Please answer yes or no. No
315
+ 41036.jpg Is this artwork displayed in kunsthistorisches museum, vienna? Please answer yes or no. Yes
316
+ 41036.jpg Is this artwork displayed in national museum of art, minsk? Please answer yes or no. No
317
+ 41371.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
318
+ 41371.jpg Does this artwork exist in the form of architecture? Please answer yes or no. No
319
+ 41484.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
320
+ 41484.jpg Does this artwork belong to the type of historical? Please answer yes or no. No
321
+ 41594.jpg Is this artwork created by veronese, paolo? Please answer yes or no. Yes
322
+ 41594.jpg Is this artwork created by jeaurat, etienne? Please answer yes or no. No
323
+ 416.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. Yes
324
+ 416.jpg Does this artwork exist in the form of others? Please answer yes or no. No
325
+ 41653.jpg Is this artwork titled view of the sala del collegio? Please answer yes or no. Yes
326
+ 41653.jpg Is this artwork titled reine lefebvre and margot? Please answer yes or no. No
327
+ 41944.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
328
+ 41944.jpg Does this artwork exist in the form of mosaic? Please answer yes or no. No
329
+ 42152.jpg Is this artwork titled the pieterskerk in leiden? Please answer yes or no. Yes
330
+ 42152.jpg Is this artwork titled portrait of cardinal reginald pole? Please answer yes or no. No
331
+ 42288.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
332
+ 42288.jpg Does this artwork exist in the form of stained-glass? Please answer yes or no. No
333
+ 42303.jpg Is this artwork displayed in art museum, cincinnati? Please answer yes or no. Yes
334
+ 42303.jpg Is this artwork displayed in banca del monte di bologna e ravenna, bologna? Please answer yes or no. No
335
+ 42401.jpg Is this artwork created by waldmüller, fedinand georg? Please answer yes or no. Yes
336
+ 42401.jpg Is this artwork created by seeman, enoch? Please answer yes or no. No
337
+ 42447.jpg Is this artwork displayed in musée du louvre, paris? Please answer yes or no. Yes
338
+ 42447.jpg Is this artwork displayed in santa catarina, pisa? Please answer yes or no. No
339
+ 42585.jpg Is this artwork created by werff, pieter van der? Please answer yes or no. Yes
340
+ 42585.jpg Is this artwork created by domenichini, apollonio? Please answer yes or no. No
341
+ 42706.jpg Is this artwork displayed in musée du louvre, paris? Please answer yes or no. Yes
342
+ 42706.jpg Is this artwork displayed in galleria nazionale d'arte moderna e contemporanea, rome? Please answer yes or no. No
343
+ 42796.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
344
+ 42796.jpg Is this artwork displayed in museo di san salvi, florence? Please answer yes or no. No
345
+ 42857.jpg Does this artwork belong to the type of landscape? Please answer yes or no. Yes
346
+ 42857.jpg Does this artwork belong to the type of study? Please answer yes or no. No
347
+ 42905.jpg Is this artwork created by wit, jacob de? Please answer yes or no. Yes
348
+ 42905.jpg Is this artwork created by vittone, bernardo antonio? Please answer yes or no. No
349
+ 42941.jpg Is this artwork created by witte, emanuel de? Please answer yes or no. Yes
350
+ 42941.jpg Is this artwork created by bicci di neri? Please answer yes or no. No
351
+ 42956.jpg Is this artwork titled view of rome with the tiberand castel sant'angelo? Please answer yes or no. Yes
352
+ 42956.jpg Is this artwork titled st bonaventure enters the franciscan order? Please answer yes or no. No
353
+ 42987.jpg Is this artwork created by witz, konrad? Please answer yes or no. Yes
354
+ 42987.jpg Is this artwork created by christus, petrus? Please answer yes or no. No
355
+ 43142.jpg Does this artwork belong to the type of mythological? Please answer yes or no. Yes
356
+ 43142.jpg Does this artwork belong to the type of interior? Please answer yes or no. No
357
+ 43175.jpg Is this artwork displayed in private collection? Please answer yes or no. Yes
358
+ 43175.jpg Is this artwork displayed in smith college museum of art, northampton? Please answer yes or no. No
359
+ 43349.jpg Is this artwork created by zuccarelli, francesco? Please answer yes or no. Yes
360
+ 43349.jpg Is this artwork created by baccanelli, giovanni antonio di giulio? Please answer yes or no. No
361
+ 43445.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
362
+ 43445.jpg Does this artwork belong to the type of interior? Please answer yes or no. No
363
+ 4836.jpg Is this artwork displayed in villa cornaro, piombino dese? Please answer yes or no. Yes
364
+ 4836.jpg Is this artwork displayed in palais saint-vaast, arras? Please answer yes or no. No
365
+ 5227.jpg Is this artwork created by botticelli, sandro? Please answer yes or no. Yes
366
+ 5227.jpg Is this artwork created by vigri, caterina? Please answer yes or no. No
367
+ 526.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
368
+ 526.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
369
+ 5906.jpg Is this artwork created by bronzino, agnolo? Please answer yes or no. Yes
370
+ 5906.jpg Is this artwork created by pellegrino da san daniele? Please answer yes or no. No
371
+ 6168.jpg Does this artwork exist in the form of graphics? Please answer yes or no. Yes
372
+ 6168.jpg Does this artwork exist in the form of tapestry? Please answer yes or no. No
373
+ 6297.jpg Is this artwork titled peasants making merry outside a tavern 'the swan'? Please answer yes or no. Yes
374
+ 6297.jpg Is this artwork titled allegory of quietude? Please answer yes or no. No
375
+ 6478.jpg Does this artwork belong to the type of religious? Please answer yes or no. Yes
376
+ 6478.jpg Does this artwork belong to the type of genre? Please answer yes or no. No
377
+ 6969.jpg Is this artwork titled letizia ramolino bonaparte? Please answer yes or no. Yes
378
+ 6969.jpg Is this artwork titled job and his daughters? Please answer yes or no. No
379
+ 701.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
380
+ 701.jpg Does this artwork exist in the form of others? Please answer yes or no. No
381
+ 7702.jpg Is this artwork titled reine lefebvre and margot? Please answer yes or no. Yes
382
+ 7702.jpg Is this artwork titled fire in the oil depot at san marcuola? Please answer yes or no. No
383
+ 8101.jpg Is this artwork displayed in museu de arte, são paulo? Please answer yes or no. Yes
384
+ 8101.jpg Is this artwork displayed in national széchényi library, budapest? Please answer yes or no. No
385
+ 815.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
386
+ 815.jpg Does this artwork exist in the form of furniture? Please answer yes or no. No
387
+ 8797.jpg Is this artwork created by coecke van aelst, pieter? Please answer yes or no. Yes
388
+ 8797.jpg Is this artwork created by abaquesne, masséot? Please answer yes or no. No
389
+ 8885.jpg Is this artwork displayed in art museum, saint louis? Please answer yes or no. Yes
390
+ 8885.jpg Is this artwork displayed in museo civico d'arte antica, turin? Please answer yes or no. No
391
+ 9153.jpg Is this artwork displayed in galleria nazionale, parma? Please answer yes or no. Yes
392
+ 9153.jpg Is this artwork displayed in hospital de san bernardo, seville? Please answer yes or no. No
393
+ 9395.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
394
+ 9395.jpg Does this artwork exist in the form of stained-glass? Please answer yes or no. No
395
+ 9405.jpg Is this artwork created by courbet, gustave? Please answer yes or no. Yes
396
+ 9405.jpg Is this artwork created by milani, aureliano? Please answer yes or no. No
397
+ 9599.jpg Does this artwork exist in the form of painting? Please answer yes or no. Yes
398
+ 9599.jpg Does this artwork exist in the form of ceramics? Please answer yes or no. No
399
+ 995.jpg Does this artwork exist in the form of sculpture? Please answer yes or no. Yes
400
+ 995.jpg Does this artwork exist in the form of painting? Please answer yes or no. No
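> Note (not part of the commit): each MME task file above pairs every image with one yes-question and one no-question. A minimal parsing sketch, assuming the official MME layout of tab-separated fields `image_name`, `question`, `ground_truth` (the filename `artwork.txt` is used only as an example):

```python
# Illustrative sketch: read one MME "Your_Results" template file.
# Assumes tab-separated fields: image_name, question, ground_truth.
from pathlib import Path

def load_mme_template(path):
    """Yield (image, question, answer) triples from one MME task file."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank padding lines in the template
        image, question, answer = line.split("\t")[:3]
        yield image, question, answer.strip()

# Example: group the paired yes/no questions that MME asks per image.
# pairs = {}
# for image, question, answer in load_mme_template("artwork.txt"):
#     pairs.setdefault(image, []).append((question, answer))
```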
eval/vlm/eval/mme/Your_Results/celebrity.txt ADDED
@@ -0,0 +1,340 @@
1
+ tt0032138_shot_0395_img_0.jpg Is the actor inside the red bounding box named Frank Morgan? Please answer yes or no. Yes
2
+ tt0032138_shot_0395_img_0.jpg Is the actor inside the red bounding box named Eric Schniewind? Please answer yes or no. No
3
+ tt0035423_shot_0464_img_0.jpg Is the actor inside the red bounding box called Hugh Jackman? Please answer yes or no. Yes
4
+ tt0035423_shot_0464_img_0.jpg Is the actor inside the red bounding box called Lizzie Hopley? Please answer yes or no. No
5
+ tt0038650_shot_0737_img_1.jpg Is the person inside the red bounding box called James Stewart? Please answer yes or no. Yes
6
+ tt0038650_shot_0737_img_1.jpg Is the person inside the red bounding box called Phil Selway? Please answer yes or no. No
7
+ tt0047396_shot_0333_img_0.jpg Is the actor inside the red bounding box named James Stewart? Please answer yes or no. Yes
8
+ tt0047396_shot_0333_img_0.jpg Is the actor inside the red bounding box named Ron Blair? Please answer yes or no. No
9
+ tt0048545_shot_0124_img_0.jpg Is the actor inside the red bounding box called Natalie Wood? Please answer yes or no. Yes
10
+ tt0048545_shot_0124_img_0.jpg Is the actor inside the red bounding box called Rebecca Jackson Mendoza? Please answer yes or no. No
11
+ tt0049470_shot_0279_img_0.jpg Is the person inside the red bounding box called James Stewart? Please answer yes or no. Yes
12
+ tt0049470_shot_0279_img_0.jpg Is the person inside the red bounding box called Matt Pashkow? Please answer yes or no. No
13
+ tt0049730_shot_0273_img_0.jpg Is the person inside the red bounding box called Vera Miles? Please answer yes or no. Yes
14
+ tt0049730_shot_0273_img_0.jpg Is the person inside the red bounding box called Addie Yungmee? Please answer yes or no. No
15
+ tt0052357_shot_0511_img_0.jpg Is the actor inside the red bounding box called Kim Novak? Please answer yes or no. Yes
16
+ tt0052357_shot_0511_img_0.jpg Is the actor inside the red bounding box called Abigail Van Alyn? Please answer yes or no. No
17
+ tt0053221_shot_0197_img_0.jpg Is the actor inside the red bounding box named John Wayne? Please answer yes or no. Yes
18
+ tt0053221_shot_0197_img_0.jpg Is the actor inside the red bounding box named Claude-Oliver Rudolph? Please answer yes or no. No
19
+ tt0054167_shot_0122_img_0.jpg Is the person inside the red bounding box called Anna Massey? Please answer yes or no. Yes
20
+ tt0054167_shot_0122_img_0.jpg Is the person inside the red bounding box called Eddie Tagoe? Please answer yes or no. No
21
+ tt0056869_shot_0320_img_0.jpg Is the person inside the red bounding box called Tippi Hedren? Please answer yes or no. Yes
22
+ tt0056869_shot_0320_img_0.jpg Is the person inside the red bounding box called Denise Mack? Please answer yes or no. No
23
+ tt0056923_shot_0835_img_0.jpg Is the actor inside the red bounding box called Audrey Hepburn? Please answer yes or no. Yes
24
+ tt0056923_shot_0835_img_0.jpg Is the actor inside the red bounding box called Chris April? Please answer yes or no. No
25
+ tt0057115_shot_0686_img_0.jpg Is the person inside the red bounding box named James Garner? Please answer yes or no. Yes
26
+ tt0057115_shot_0686_img_0.jpg Is the person inside the red bounding box named Chutimon Chuengcharoensukying? Please answer yes or no. No
27
+ tt0058331_shot_0353_img_0.jpg Is the actor inside the red bounding box named Julie Andrews? Please answer yes or no. Yes
28
+ tt0058331_shot_0353_img_0.jpg Is the actor inside the red bounding box named Ed Geldart? Please answer yes or no. No
29
+ tt0058461_shot_0901_img_0.jpg Is the actor inside the red bounding box called Gian Maria Volontè? Please answer yes or no. Yes
30
+ tt0058461_shot_0901_img_0.jpg Is the actor inside the red bounding box called Jennifer Connelly? Please answer yes or no. No
31
+ tt0061418_shot_0148_img_0.jpg Is the actor inside the red bounding box named Faye Dunaway? Please answer yes or no. Yes
32
+ tt0061418_shot_0148_img_0.jpg Is the actor inside the red bounding box named Warona Seane? Please answer yes or no. No
33
+ tt0061722_shot_0259_img_0.jpg Is the actor inside the red bounding box called Dustin Hoffman? Please answer yes or no. Yes
34
+ tt0061722_shot_0259_img_0.jpg Is the actor inside the red bounding box called Christopher Olsen? Please answer yes or no. No
35
+ tt0062622_shot_0291_img_0.jpg Is the actor inside the red bounding box named Keir Dullea? Please answer yes or no. Yes
36
+ tt0062622_shot_0291_img_0.jpg Is the actor inside the red bounding box named Frank Albanese? Please answer yes or no. No
37
+ tt0063442_shot_0702_img_0.jpg Is the actor inside the red bounding box called Linda Harrison? Please answer yes or no. Yes
38
+ tt0063442_shot_0702_img_0.jpg Is the actor inside the red bounding box called Michael McKean? Please answer yes or no. No
39
+ tt0064115_shot_0367_img_0.jpg Is the actor inside the red bounding box named Robert Redford? Please answer yes or no. Yes
40
+ tt0064115_shot_0367_img_0.jpg Is the actor inside the red bounding box named Cooper Murray? Please answer yes or no. No
41
+ tt0064665_shot_0300_img_0.jpg Is the actor inside the red bounding box called Jon Voight? Please answer yes or no. Yes
42
+ tt0064665_shot_0300_img_0.jpg Is the actor inside the red bounding box called Harvey Meyer? Please answer yes or no. No
43
+ tt0065214_shot_0366_img_0.jpg Is the person inside the red bounding box called Robert Ryan? Please answer yes or no. Yes
44
+ tt0065214_shot_0366_img_0.jpg Is the person inside the red bounding box called Victor Verhaeghe? Please answer yes or no. No
45
+ tt0065724_shot_0320_img_1.jpg Is the person inside the red bounding box named Karen Black? Please answer yes or no. Yes
46
+ tt0065724_shot_0320_img_1.jpg Is the person inside the red bounding box named Nick Discenza? Please answer yes or no. No
47
+ tt0066026_shot_0085_img_0.jpg Is the person inside the red bounding box called Donald Sutherland? Please answer yes or no. Yes
48
+ tt0066026_shot_0085_img_0.jpg Is the person inside the red bounding box called Michael Wollet? Please answer yes or no. No
49
+ tt0066921_shot_0631_img_0.jpg Is the actor inside the red bounding box called Malcolm McDowell? Please answer yes or no. Yes
50
+ tt0066921_shot_0631_img_0.jpg Is the actor inside the red bounding box called Darling Légitimus? Please answer yes or no. No
51
+ tt0067116_shot_0122_img_0.jpg Is the actor inside the red bounding box called Gene Hackman? Please answer yes or no. Yes
52
+ tt0067116_shot_0122_img_0.jpg Is the actor inside the red bounding box called Russell G. Jones? Please answer yes or no. No
53
+ tt0068646_shot_0166_img_0.jpg Is the actor inside the red bounding box called Marlon Brando? Please answer yes or no. Yes
54
+ tt0068646_shot_0166_img_0.jpg Is the actor inside the red bounding box called Voltaire Sterling? Please answer yes or no. No
55
+ tt0069762_shot_0723_img_0.jpg Is the person inside the red bounding box named Sissy Spacek? Please answer yes or no. Yes
56
+ tt0069762_shot_0723_img_0.jpg Is the person inside the red bounding box named Monica Giordano? Please answer yes or no. No
57
+ tt0070047_shot_0255_img_0.jpg Is the actor inside the red bounding box called Ellen Burstyn? Please answer yes or no. Yes
58
+ tt0070047_shot_0255_img_0.jpg Is the actor inside the red bounding box called Shawnee Smith? Please answer yes or no. No
59
+ tt0070379_shot_0569_img_0.jpg Is the actor inside the red bounding box named Richard Romanus? Please answer yes or no. Yes
60
+ tt0070379_shot_0569_img_0.jpg Is the actor inside the red bounding box named Valerie Colgan? Please answer yes or no. No
61
+ tt0070511_shot_0639_img_0.jpg Is the person inside the red bounding box called Dustin Hoffman? Please answer yes or no. Yes
62
+ tt0070511_shot_0639_img_0.jpg Is the person inside the red bounding box called Fernando Lueches? Please answer yes or no. No
63
+ tt0070735_shot_0818_img_0.jpg Is the person inside the red bounding box named Robert Redford? Please answer yes or no. Yes
64
+ tt0070735_shot_0818_img_0.jpg Is the person inside the red bounding box named Ellin Dennis? Please answer yes or no. No
65
+ tt0070849_shot_0021_img_1.jpg Is the person inside the red bounding box named Maria Schneider? Please answer yes or no. Yes
66
+ tt0070849_shot_0021_img_1.jpg Is the person inside the red bounding box named Mary Kellogg? Please answer yes or no. No
67
+ tt0071315_shot_0153_img_0.jpg Is the actor inside the red bounding box named Faye Dunaway? Please answer yes or no. Yes
68
+ tt0071315_shot_0153_img_0.jpg Is the actor inside the red bounding box named Kelly Hitman? Please answer yes or no. No
69
+ tt0071562_shot_0684_img_0.jpg Is the actor inside the red bounding box named Al Pacino? Please answer yes or no. Yes
70
+ tt0071562_shot_0684_img_0.jpg Is the actor inside the red bounding box named Debie Jarczewski? Please answer yes or no. No
71
+ tt0072684_shot_0512_img_1.jpg Is the person inside the red bounding box named Marisa Berenson? Please answer yes or no. Yes
72
+ tt0072684_shot_0512_img_1.jpg Is the person inside the red bounding box named Graham Bohea? Please answer yes or no. No
73
+ tt0073195_shot_0280_img_0.jpg Is the actor inside the red bounding box named Roy Scheider? Please answer yes or no. Yes
74
+ tt0073195_shot_0280_img_0.jpg Is the actor inside the red bounding box named Abdul Qadir Farookh? Please answer yes or no. No
75
+ tt0073629_shot_0700_img_0.jpg Is the person inside the red bounding box named Barry Bostwick? Please answer yes or no. Yes
76
+ tt0073629_shot_0700_img_0.jpg Is the person inside the red bounding box named Johnny Galecki? Please answer yes or no. No
77
+ tt0074119_shot_0814_img_0.jpg Is the actor inside the red bounding box called Robert Redford? Please answer yes or no. Yes
78
+ tt0074119_shot_0814_img_0.jpg Is the actor inside the red bounding box called Delroy Lindo? Please answer yes or no. No
79
+ tt0074285_shot_0535_img_1.jpg Is the person inside the red bounding box named William Katt? Please answer yes or no. Yes
80
+ tt0074285_shot_0535_img_1.jpg Is the person inside the red bounding box named Stephen Rider? Please answer yes or no. No
81
+ tt0075148_shot_0618_img_0.jpg Is the actor inside the red bounding box called Sylvester Stallone? Please answer yes or no. Yes
82
+ tt0075148_shot_0618_img_0.jpg Is the actor inside the red bounding box called Eric Hatch? Please answer yes or no. No
83
+ tt0075686_shot_0373_img_0.jpg Is the actor inside the red bounding box called Woody Allen? Please answer yes or no. Yes
84
+ tt0075686_shot_0373_img_0.jpg Is the actor inside the red bounding box called Penny Wallace? Please answer yes or no. No
85
+ tt0076729_shot_0451_img_0.jpg Is the actor inside the red bounding box called Sally Field? Please answer yes or no. Yes
86
+ tt0076729_shot_0451_img_0.jpg Is the actor inside the red bounding box called Giorgio Libassi? Please answer yes or no. No
87
+ tt0076759_shot_0930_img_0.jpg Is the actor inside the red bounding box called Harrison Ford? Please answer yes or no. Yes
88
+ tt0076759_shot_0930_img_0.jpg Is the actor inside the red bounding box called Ryoko Sadoshima? Please answer yes or no. No
89
+ tt0077402_shot_1220_img_0.jpg Is the person inside the red bounding box named Scott H. Reiniger? Please answer yes or no. Yes
90
+ tt0077402_shot_1220_img_0.jpg Is the person inside the red bounding box named Chris Delaney? Please answer yes or no. No
91
+ tt0077405_shot_0150_img_0.jpg Is the actor inside the red bounding box named Sam Shepard? Please answer yes or no. Yes
92
+ tt0077405_shot_0150_img_0.jpg Is the actor inside the red bounding box named Bijou Phillips? Please answer yes or no. No
93
+ tt0077416_shot_1442_img_0.jpg Is the person inside the red bounding box named Robert De Niro? Please answer yes or no. Yes
94
+ tt0077416_shot_1442_img_0.jpg Is the person inside the red bounding box named Stu Smith? Please answer yes or no. No
95
+ tt0077651_shot_0133_img_0.jpg Is the person inside the red bounding box called Jamie Lee Curtis? Please answer yes or no. Yes
96
+ tt0077651_shot_0133_img_0.jpg Is the person inside the red bounding box called Paris Arrowsmith? Please answer yes or no. No
97
+ tt0078788_shot_1434_img_0.jpg Is the person inside the red bounding box called Martin Sheen? Please answer yes or no. Yes
98
+ tt0078788_shot_1434_img_0.jpg Is the person inside the red bounding box called Le Capriccio Français? Please answer yes or no. No
99
+ tt0078841_shot_0692_img_0.jpg Is the actor inside the red bounding box named Shirley MacLaine? Please answer yes or no. Yes
100
+ tt0078841_shot_0692_img_0.jpg Is the actor inside the red bounding box named Tomas Choy? Please answer yes or no. No
101
+ tt0079417_shot_0735_img_0.jpg Is the actor inside the red bounding box called Meryl Streep? Please answer yes or no. Yes
102
+ tt0079417_shot_0735_img_0.jpg Is the actor inside the red bounding box called Ross Lacy? Please answer yes or no. No
103
+ tt0079470_shot_0798_img_0.jpg Is the person inside the red bounding box named Eric Idle? Please answer yes or no. Yes
104
+ tt0079470_shot_0798_img_0.jpg Is the person inside the red bounding box named Quincy Taylor? Please answer yes or no. No
105
+ tt0079945_shot_1411_img_0.jpg Is the person inside the red bounding box named Persis Khambatta? Please answer yes or no. Yes
106
+ tt0079945_shot_1411_img_0.jpg Is the person inside the red bounding box named Alison Waddell? Please answer yes or no. No
107
+ tt0080339_shot_0711_img_0.jpg Is the actor inside the red bounding box named Robert Hays? Please answer yes or no. Yes
108
+ tt0080339_shot_0711_img_0.jpg Is the actor inside the red bounding box named Grace Sullivan? Please answer yes or no. No
109
+ tt0080684_shot_1574_img_2.jpg Is the actor inside the red bounding box called Mark Hamill? Please answer yes or no. Yes
110
+ tt0080684_shot_1574_img_2.jpg Is the actor inside the red bounding box called Rodion Salnikov? Please answer yes or no. No
111
+ tt0081505_shot_0449_img_0.jpg Is the actor inside the red bounding box called Shelley Duvall? Please answer yes or no. Yes
112
+ tt0081505_shot_0449_img_0.jpg Is the actor inside the red bounding box called Antony Carrick? Please answer yes or no. No
113
+ tt0082089_shot_0046_img_0.jpg Is the actor inside the red bounding box named Kathleen Turner? Please answer yes or no. Yes
114
+ tt0082089_shot_0046_img_0.jpg Is the actor inside the red bounding box named Aaron Henderson? Please answer yes or no. No
115
+ tt0082198_shot_1353_img_0.jpg Is the person inside the red bounding box named Arnold Schwarzenegger? Please answer yes or no. Yes
116
+ tt0082198_shot_1353_img_0.jpg Is the person inside the red bounding box named Tim Herlihy? Please answer yes or no. No
117
+ tt0082971_shot_0831_img_0.jpg Is the actor inside the red bounding box called Harrison Ford? Please answer yes or no. Yes
118
+ tt0082971_shot_0831_img_0.jpg Is the actor inside the red bounding box called Richard Angarola? Please answer yes or no. No
119
+ tt0083658_shot_0963_img_0.jpg Is the person inside the red bounding box called Rutger Hauer? Please answer yes or no. Yes
120
+ tt0083658_shot_0963_img_0.jpg Is the person inside the red bounding box called Stéphane Julien? Please answer yes or no. No
121
+ tt0083866_shot_0364_img_0.jpg Is the actor inside the red bounding box called Robert MacNaughton? Please answer yes or no. Yes
122
+ tt0083866_shot_0364_img_0.jpg Is the actor inside the red bounding box called Seam Turay? Please answer yes or no. No
123
+ tt0083907_shot_0633_img_0.jpg Is the actor inside the red bounding box named Bruce Campbell? Please answer yes or no. Yes
124
+ tt0083907_shot_0633_img_0.jpg Is the actor inside the red bounding box named Kaden Leos? Please answer yes or no. No
125
+ tt0083929_shot_0405_img_0.jpg Is the actor inside the red bounding box named Jennifer Jason Leigh? Please answer yes or no. Yes
126
+ tt0083929_shot_0405_img_0.jpg Is the actor inside the red bounding box named Eric D. Sandgren? Please answer yes or no. No
127
+ tt0084726_shot_0283_img_0.jpg Is the actor inside the red bounding box named Leonard Nimoy? Please answer yes or no. Yes
128
+ tt0084726_shot_0283_img_0.jpg Is the actor inside the red bounding box named John Cusack? Please answer yes or no. No
129
+ tt0086190_shot_0815_img_0.jpg Is the actor inside the red bounding box named Carrie Fisher? Please answer yes or no. Yes
130
+ tt0086190_shot_0815_img_0.jpg Is the actor inside the red bounding box named Ernie Adams? Please answer yes or no. No
131
+ tt0086250_shot_1079_img_0.jpg Is the actor inside the red bounding box called Steven Bauer? Please answer yes or no. Yes
132
+ tt0086250_shot_1079_img_0.jpg Is the actor inside the red bounding box called Bill Nunn? Please answer yes or no. No
133
+ tt0086856_shot_0929_img_0.jpg Is the actor inside the red bounding box called Peter Weller? Please answer yes or no. Yes
134
+ tt0086856_shot_0929_img_0.jpg Is the actor inside the red bounding box called Tracee Cocco? Please answer yes or no. No
135
+ tt0086879_shot_0158_img_0.jpg Is the person inside the red bounding box called Elizabeth Berridge? Please answer yes or no. Yes
136
+ tt0086879_shot_0158_img_0.jpg Is the person inside the red bounding box called Ralph Ineson? Please answer yes or no. No
137
+ tt0087332_shot_0798_img_0.jpg Is the person inside the red bounding box called Bill Murray? Please answer yes or no. Yes
138
+ tt0087332_shot_0798_img_0.jpg Is the person inside the red bounding box called Jiao Xu? Please answer yes or no. No
139
+ tt0087469_shot_0049_img_2.jpg Is the person inside the red bounding box named Harrison Ford? Please answer yes or no. Yes
140
+ tt0087469_shot_0049_img_2.jpg Is the person inside the red bounding box named Paulo Benedeti? Please answer yes or no. No
141
+ tt0088847_shot_0109_img_0.jpg Is the actor inside the red bounding box named Anthony Michael Hall? Please answer yes or no. Yes
142
+ tt0088847_shot_0109_img_0.jpg Is the actor inside the red bounding box named Luis Javier? Please answer yes or no. No
143
+ tt0088944_shot_0634_img_0.jpg Is the actor inside the red bounding box named Arnold Schwarzenegger? Please answer yes or no. Yes
144
+ tt0088944_shot_0634_img_0.jpg Is the actor inside the red bounding box named Shaine Jones? Please answer yes or no. No
145
+ tt0088993_shot_0569_img_0.jpg Is the actor inside the red bounding box called George A. Romero? Please answer yes or no. Yes
146
+ tt0088993_shot_0569_img_0.jpg Is the actor inside the red bounding box called James Eckhouse? Please answer yes or no. No
147
+ tt0089218_shot_0327_img_0.jpg Is the person inside the red bounding box named Sean Astin? Please answer yes or no. Yes
148
+ tt0089218_shot_0327_img_0.jpg Is the person inside the red bounding box named Dan Hunter? Please answer yes or no. No
149
+ tt0089881_shot_0034_img_0.jpg Is the actor inside the red bounding box called Tatsuya Nakadai? Please answer yes or no. Yes
150
+ tt0089881_shot_0034_img_0.jpg Is the actor inside the red bounding box called Nancy Vee? Please answer yes or no. No
151
+ tt0090022_shot_0464_img_0.jpg Is the actor inside the red bounding box called Scott Glenn? Please answer yes or no. Yes
152
+ tt0090022_shot_0464_img_0.jpg Is the actor inside the red bounding box called Robert Ryan? Please answer yes or no. No
153
+ tt0090605_shot_0344_img_0.jpg Is the person inside the red bounding box called Sigourney Weaver? Please answer yes or no. Yes
154
+ tt0090605_shot_0344_img_0.jpg Is the person inside the red bounding box called Lia Beldam? Please answer yes or no. No
155
+ tt0090756_shot_0135_img_0.jpg Is the person inside the red bounding box named Laura Dern? Please answer yes or no. Yes
156
+ tt0090756_shot_0135_img_0.jpg Is the person inside the red bounding box named Keith Frost? Please answer yes or no. No
157
+ tt0091042_shot_0098_img_0.jpg Is the person inside the red bounding box called Matthew Broderick? Please answer yes or no. Yes
158
+ tt0091042_shot_0098_img_0.jpg Is the person inside the red bounding box called Mina E. Mina? Please answer yes or no. No
159
+ tt0091738_shot_0073_img_1.jpg Is the actor inside the red bounding box called Kathleen Turner? Please answer yes or no. Yes
160
+ tt0091738_shot_0073_img_1.jpg Is the actor inside the red bounding box called Pat Kiernan? Please answer yes or no. No
161
+ tt0091867_shot_0422_img_2.jpg Is the person inside the red bounding box named Simon Callow? Please answer yes or no. Yes
162
+ tt0091867_shot_0422_img_2.jpg Is the person inside the red bounding box named Rusty Goffe? Please answer yes or no. No
163
+ tt0092099_shot_0455_img_1.jpg Is the person inside the red bounding box called Tom Cruise? Please answer yes or no. Yes
164
+ tt0092099_shot_0455_img_1.jpg Is the person inside the red bounding box called Carol Krolick? Please answer yes or no. No
165
+ tt0092699_shot_0208_img_0.jpg Is the actor inside the red bounding box called William Hurt? Please answer yes or no. Yes
166
+ tt0092699_shot_0208_img_0.jpg Is the actor inside the red bounding box called Hildur Ruriks? Please answer yes or no. No
167
+ tt0093565_shot_0409_img_0.jpg Is the actor inside the red bounding box named Cher? Please answer yes or no. Yes
168
+ tt0093565_shot_0409_img_0.jpg Is the actor inside the red bounding box named Mark Brady? Please answer yes or no. No
169
+ tt0093748_shot_0346_img_0.jpg Is the actor inside the red bounding box called John Candy? Please answer yes or no. Yes
170
+ tt0093748_shot_0346_img_0.jpg Is the actor inside the red bounding box called Sarah Heller? Please answer yes or no. No
171
+ tt0093773_shot_0212_img_0.jpg Is the person inside the red bounding box named Jesse Ventura? Please answer yes or no. Yes
172
+ tt0093773_shot_0212_img_0.jpg Is the person inside the red bounding box named Akio Mitamura? Please answer yes or no. No
173
+ tt0093779_shot_1047_img_0.jpg Is the person inside the red bounding box named Peter Falk? Please answer yes or no. Yes
174
+ tt0093779_shot_1047_img_0.jpg Is the person inside the red bounding box named Lisa Ann Walter? Please answer yes or no. No
175
+ tt0094226_shot_0237_img_2.jpg Is the actor inside the red bounding box called Kevin Costner? Please answer yes or no. Yes
176
+ tt0094226_shot_0237_img_2.jpg Is the actor inside the red bounding box called Colin Hill? Please answer yes or no. No
177
+ tt0094737_shot_0567_img_0.jpg Is the person inside the red bounding box called Tom Hanks? Please answer yes or no. Yes
178
+ tt0094737_shot_0567_img_0.jpg Is the person inside the red bounding box called Chris McHallem? Please answer yes or no. No
179
+ tt0095016_shot_1170_img_0.jpg Is the actor inside the red bounding box called Paul Gleason? Please answer yes or no. Yes
180
+ tt0095016_shot_1170_img_0.jpg Is the actor inside the red bounding box called Carl Palmer? Please answer yes or no. No
181
+ tt0095250_shot_0509_img_0.jpg Is the actor inside the red bounding box named Jean Reno? Please answer yes or no. Yes
182
+ tt0095250_shot_0509_img_0.jpg Is the actor inside the red bounding box named Ralph Meyering Jr.? Please answer yes or no. No
183
+ tt0095765_shot_0008_img_0.jpg Is the actor inside the red bounding box called Antonella Attili? Please answer yes or no. Yes
184
+ tt0095765_shot_0008_img_0.jpg Is the actor inside the red bounding box called Amber Estrada? Please answer yes or no. No
185
+ tt0095953_shot_0412_img_0.jpg Is the person inside the red bounding box named Tom Cruise? Please answer yes or no. Yes
186
+ tt0095953_shot_0412_img_0.jpg Is the person inside the red bounding box named Lara Mulcahy? Please answer yes or no. No
187
+ tt0096320_shot_0085_img_0.jpg Is the actor inside the red bounding box called Arnold Schwarzenegger? Please answer yes or no. Yes
188
+ tt0096320_shot_0085_img_0.jpg Is the actor inside the red bounding box called Dan Duran? Please answer yes or no. No
189
+ tt0096754_shot_0570_img_1.jpg Is the person inside the red bounding box named Todd Graff? Please answer yes or no. Yes
190
+ tt0096754_shot_0570_img_1.jpg Is the person inside the red bounding box named Guy Carleton? Please answer yes or no. No
191
+ tt0096874_shot_0647_img_0.jpg Is the actor inside the red bounding box named Michael J. Fox? Please answer yes or no. Yes
192
+ tt0096874_shot_0647_img_0.jpg Is the actor inside the red bounding box named Momoko Komatsu? Please answer yes or no. No
193
+ tt0096895_shot_0819_img_1.jpg Is the person inside the red bounding box called Michael Keaton? Please answer yes or no. Yes
194
+ tt0096895_shot_0819_img_1.jpg Is the person inside the red bounding box called Ben Foster? Please answer yes or no. No
195
+ tt0097216_shot_0381_img_0.jpg Is the actor inside the red bounding box named Danny Aiello? Please answer yes or no. Yes
196
+ tt0097216_shot_0381_img_0.jpg Is the actor inside the red bounding box named Taissa Farmiga? Please answer yes or no. No
197
+ tt0097428_shot_0106_img_0.jpg Is the actor inside the red bounding box named Bill Murray? Please answer yes or no. Yes
198
+ tt0097428_shot_0106_img_0.jpg Is the actor inside the red bounding box named Michael Fawcett? Please answer yes or no. No
199
+ tt0097576_shot_1010_img_2.jpg Is the actor inside the red bounding box named Harrison Ford? Please answer yes or no. Yes
200
+ tt0097576_shot_1010_img_2.jpg Is the actor inside the red bounding box named M. Emmet Walsh? Please answer yes or no. No
201
+ tt0098635_shot_0556_img_0.jpg Is the actor inside the red bounding box named Meg Ryan? Please answer yes or no. Yes
202
+ tt0098635_shot_0556_img_0.jpg Is the actor inside the red bounding box named Tom Branch? Please answer yes or no. No
203
+ tt0098724_shot_0474_img_0.jpg Is the person inside the red bounding box named Andie MacDowell? Please answer yes or no. Yes
204
+ tt0098724_shot_0474_img_0.jpg Is the person inside the red bounding box named Linda Taylor? Please answer yes or no. No
205
+ tt0099423_shot_1010_img_0.jpg Is the person inside the red bounding box called Bruce Willis? Please answer yes or no. Yes
206
+ tt0099423_shot_1010_img_0.jpg Is the person inside the red bounding box called Trevor Eve? Please answer yes or no. No
207
+ tt0099487_shot_0123_img_0.jpg Is the actor inside the red bounding box named Johnny Depp? Please answer yes or no. Yes
208
+ tt0099487_shot_0123_img_0.jpg Is the actor inside the red bounding box named Farrah Forke? Please answer yes or no. No
209
+ tt0099674_shot_1356_img_0.jpg Is the person inside the red bounding box named Al Pacino? Please answer yes or no. Yes
210
+ tt0099674_shot_1356_img_0.jpg Is the person inside the red bounding box named Nick Porrazzo? Please answer yes or no. No
211
+ tt0099685_shot_1132_img_0.jpg Is the actor inside the red bounding box called Ray Liotta? Please answer yes or no. Yes
212
+ tt0099685_shot_1132_img_0.jpg Is the actor inside the red bounding box called Chick Allan? Please answer yes or no. No
213
+ tt0099810_shot_0285_img_0.jpg Is the person inside the red bounding box called Alec Baldwin? Please answer yes or no. Yes
214
+ tt0099810_shot_0285_img_0.jpg Is the person inside the red bounding box called Jennifer Anglin? Please answer yes or no. No
215
+ tt0100157_shot_0365_img_0.jpg Is the actor inside the red bounding box named James Caan? Please answer yes or no. Yes
216
+ tt0100157_shot_0365_img_0.jpg Is the actor inside the red bounding box named Bryan Johnson? Please answer yes or no. No
217
+ tt0100403_shot_0517_img_0.jpg Is the person inside the red bounding box called Gary Busey? Please answer yes or no. Yes
218
+ tt0100403_shot_0517_img_0.jpg Is the person inside the red bounding box called Alfred Tiaki Hotu? Please answer yes or no. No
219
+ tt0100405_shot_0786_img_0.jpg Is the actor inside the red bounding box named Jason Alexander? Please answer yes or no. Yes
220
+ tt0100405_shot_0786_img_0.jpg Is the actor inside the red bounding box named Alexandra Bastedo? Please answer yes or no. No
221
+ tt0101410_shot_0105_img_0.jpg Is the person inside the red bounding box named John Turturro? Please answer yes or no. Yes
222
+ tt0101410_shot_0105_img_0.jpg Is the person inside the red bounding box named David Gore? Please answer yes or no. No
223
+ tt0102492_shot_0086_img_0.jpg Is the actor inside the red bounding box called Jamie Lee Curtis? Please answer yes or no. Yes
224
+ tt0102492_shot_0086_img_0.jpg Is the actor inside the red bounding box called Heidi Fischer? Please answer yes or no. No
225
+ tt0103064_shot_1206_img_0.jpg Is the actor inside the red bounding box named Arnold Schwarzenegger? Please answer yes or no. Yes
226
+ tt0103064_shot_1206_img_0.jpg Is the actor inside the red bounding box named Gigi Lee? Please answer yes or no. No
227
+ tt0103064_shot_2602_img_1.jpg Is the person inside the red bounding box named Arnold Schwarzenegger? Please answer yes or no. Yes
228
+ tt0103064_shot_2602_img_1.jpg Is the person inside the red bounding box named Candice Azzara? Please answer yes or no. No
229
+ tt0103776_shot_0719_img_0.jpg Is the person inside the red bounding box called Michael Keaton? Please answer yes or no. Yes
230
+ tt0103776_shot_0719_img_0.jpg Is the person inside the red bounding box called Nicholas Rice? Please answer yes or no. No
231
+ tt0104036_shot_0336_img_1.jpg Is the person inside the red bounding box named Stephen Rea? Please answer yes or no. Yes
232
+ tt0104036_shot_0336_img_1.jpg Is the person inside the red bounding box named Mimi Lizio? Please answer yes or no. No
233
+ tt0104257_shot_0477_img_0.jpg Is the person inside the red bounding box named Jack Nicholson? Please answer yes or no. Yes
234
+ tt0104257_shot_0477_img_0.jpg Is the person inside the red bounding box named Emma Julia Jacobs? Please answer yes or no. No
235
+ tt0104348_shot_0340_img_0.jpg Is the person inside the red bounding box called Ed Harris? Please answer yes or no. Yes
236
+ tt0104348_shot_0340_img_0.jpg Is the person inside the red bounding box called Carla Lizzette Mejia? Please answer yes or no. No
237
+ tt0105236_shot_0193_img_0.jpg Is the actor inside the red bounding box named Harvey Keitel? Please answer yes or no. Yes
238
+ tt0105236_shot_0193_img_0.jpg Is the actor inside the red bounding box named Terence Yin? Please answer yes or no. No
239
+ tt0105665_shot_0351_img_0.jpg Is the actor inside the red bounding box named Kyle MacLachlan? Please answer yes or no. Yes
240
+ tt0105665_shot_0351_img_0.jpg Is the actor inside the red bounding box named Julia Hsu? Please answer yes or no. No
241
+ tt0105695_shot_1436_img_1.jpg Is the person inside the red bounding box called Jaimz Woolvett? Please answer yes or no. Yes
242
+ tt0105695_shot_1436_img_1.jpg Is the person inside the red bounding box called Hermione Baddeley? Please answer yes or no. No
243
+ tt0106977_shot_1604_img_0.jpg Is the person inside the red bounding box named Tommy Lee Jones? Please answer yes or no. Yes
244
+ tt0106977_shot_1604_img_0.jpg Is the person inside the red bounding box named Honey Chhaya? Please answer yes or no. No
245
+ tt0107614_shot_0116_img_0.jpg Is the person inside the red bounding box called Sally Field? Please answer yes or no. Yes
246
+ tt0107614_shot_0116_img_0.jpg Is the person inside the red bounding box called Arthur Senzy? Please answer yes or no. No
247
+ tt0108399_shot_0778_img_0.jpg Is the actor inside the red bounding box called Christopher Walken? Please answer yes or no. Yes
248
+ tt0108399_shot_0778_img_0.jpg Is the actor inside the red bounding box called Fiona Sit? Please answer yes or no. No
249
+ tt0109831_shot_0298_img_0.jpg Is the person inside the red bounding box called Hugh Grant? Please answer yes or no. Yes
250
+ tt0109831_shot_0298_img_0.jpg Is the person inside the red bounding box called Renée Zellweger? Please answer yes or no. No
251
+ tt0111280_shot_0258_img_0.jpg Is the actor inside the red bounding box named Gates McFadden? Please answer yes or no. Yes
252
+ tt0111280_shot_0258_img_0.jpg Is the actor inside the red bounding box named Michael Angarano? Please answer yes or no. No
253
+ tt0111280_shot_1479_img_2.jpg Is the actor inside the red bounding box called William Shatner? Please answer yes or no. Yes
254
+ tt0111280_shot_1479_img_2.jpg Is the actor inside the red bounding box called Richard Rohrbough? Please answer yes or no. No
255
+ tt0112384_shot_0878_img_0.jpg Is the person inside the red bounding box called Kathleen Quinlan? Please answer yes or no. Yes
256
+ tt0112384_shot_0878_img_0.jpg Is the person inside the red bounding box called Veronica Diaz Carranza? Please answer yes or no. No
257
+ tt0112641_shot_0412_img_1.jpg Is the actor inside the red bounding box called Robert De Niro? Please answer yes or no. Yes
258
+ tt0112641_shot_0412_img_1.jpg Is the actor inside the red bounding box called Pierre Malherbe? Please answer yes or no. No
259
+ tt0112740_shot_1056_img_0.jpg Is the person inside the red bounding box named Denzel Washington? Please answer yes or no. Yes
260
+ tt0112740_shot_1056_img_0.jpg Is the person inside the red bounding box named Bill Pullman? Please answer yes or no. No
261
+ tt0113101_shot_0547_img_0.jpg Is the person inside the red bounding box named Tim Roth? Please answer yes or no. Yes
262
+ tt0113101_shot_0547_img_0.jpg Is the person inside the red bounding box named Honey Chhaya? Please answer yes or no. No
263
+ tt0114369_shot_1138_img_0.jpg Is the person inside the red bounding box named Brad Pitt? Please answer yes or no. Yes
264
+ tt0114369_shot_1138_img_0.jpg Is the person inside the red bounding box named Benjamin Nitze? Please answer yes or no. No
265
+ tt0114388_shot_0162_img_0.jpg Is the actor inside the red bounding box called Emma Thompson? Please answer yes or no. Yes
266
+ tt0114388_shot_0162_img_0.jpg Is the actor inside the red bounding box called Francis P. Hughes? Please answer yes or no. No
267
+ tt0114388_shot_1207_img_1.jpg Is the person inside the red bounding box called Hugh Grant? Please answer yes or no. Yes
268
+ tt0114388_shot_1207_img_1.jpg Is the person inside the red bounding box called Zach Hopkins? Please answer yes or no. No
269
+ tt0115798_shot_0844_img_1.jpg Is the person inside the red bounding box named Jim Carrey? Please answer yes or no. Yes
270
+ tt0115798_shot_0844_img_1.jpg Is the person inside the red bounding box named Renee Herlocker? Please answer yes or no. No
271
+ tt0116367_shot_0755_img_0.jpg Is the actor inside the red bounding box named George Clooney? Please answer yes or no. Yes
272
+ tt0116367_shot_0755_img_0.jpg Is the actor inside the red bounding box named Ben Crowley? Please answer yes or no. No
273
+ tt0116629_shot_1570_img_2.jpg Is the person inside the red bounding box called Will Smith? Please answer yes or no. Yes
274
+ tt0116629_shot_1570_img_2.jpg Is the person inside the red bounding box called E. Katherine Kerr? Please answer yes or no. No
275
+ tt0116695_shot_0343_img_0.jpg Is the person inside the red bounding box named Tom Cruise? Please answer yes or no. Yes
276
+ tt0116695_shot_0343_img_0.jpg Is the person inside the red bounding box named Billy Dee? Please answer yes or no. No
277
+ tt0117060_shot_0412_img_0.jpg Is the actor inside the red bounding box called Tom Cruise? Please answer yes or no. Yes
278
+ tt0117060_shot_0412_img_0.jpg Is the actor inside the red bounding box called Carrie Lazar? Please answer yes or no. No
279
+ tt0117060_shot_1401_img_0.jpg Is the actor inside the red bounding box called Jean Reno? Please answer yes or no. Yes
280
+ tt0117060_shot_1401_img_0.jpg Is the actor inside the red bounding box called Jill Teed? Please answer yes or no. No
281
+ tt0117381_shot_0798_img_1.jpg Is the person inside the red bounding box called Edward Norton? Please answer yes or no. Yes
282
+ tt0117381_shot_0798_img_1.jpg Is the person inside the red bounding box called Michael Tezla? Please answer yes or no. No
283
+ tt0117500_shot_2467_img_0.jpg Is the actor inside the red bounding box called Ed Harris? Please answer yes or no. Yes
284
+ tt0117500_shot_2467_img_0.jpg Is the actor inside the red bounding box called Paul J.Q. Lee? Please answer yes or no. No
285
+ tt0117509_shot_0041_img_0.jpg Is the actor inside the red bounding box named Paul Rudd? Please answer yes or no. Yes
286
+ tt0117509_shot_0041_img_0.jpg Is the actor inside the red bounding box named Max Martini? Please answer yes or no. No
287
+ tt0117571_shot_0475_img_0.jpg Is the person inside the red bounding box named Neve Campbell? Please answer yes or no. Yes
288
+ tt0117571_shot_0475_img_0.jpg Is the person inside the red bounding box named Frank Hoyt Taylor? Please answer yes or no. No
289
+ tt0117731_shot_0300_img_0.jpg Is the actor inside the red bounding box called Patrick Stewart? Please answer yes or no. Yes
290
+ tt0117731_shot_0300_img_0.jpg Is the actor inside the red bounding box called Debra Montague? Please answer yes or no. No
291
+ tt0117731_shot_1067_img_0.jpg Is the actor inside the red bounding box called Patrick Stewart? Please answer yes or no. Yes
292
+ tt0117731_shot_1067_img_0.jpg Is the actor inside the red bounding box called Jenny Wilson? Please answer yes or no. No
293
+ tt0118548_shot_1296_img_0.jpg Is the actor inside the red bounding box called Clint Eastwood? Please answer yes or no. Yes
294
+ tt0118548_shot_1296_img_0.jpg Is the actor inside the red bounding box called Kate Winslet? Please answer yes or no. No
295
+ tt0118571_shot_0627_img_0.jpg Is the actor inside the red bounding box called Glenn Close? Please answer yes or no. Yes
296
+ tt0118571_shot_0627_img_0.jpg Is the actor inside the red bounding box called Arlene Farber? Please answer yes or no. No
297
+ tt0118636_shot_0007_img_1.jpg Is the person inside the red bounding box called Brad Renfro? Please answer yes or no. Yes
298
+ tt0118636_shot_0007_img_1.jpg Is the person inside the red bounding box called Sandra Park? Please answer yes or no. No
299
+ tt0118636_shot_0344_img_0.jpg Is the actor inside the red bounding box called Brad Renfro? Please answer yes or no. Yes
300
+ tt0118636_shot_0344_img_0.jpg Is the actor inside the red bounding box called Karen Strassman? Please answer yes or no. No
301
+ tt0118655_shot_0279_img_0.jpg Is the person inside the red bounding box called Robert Wagner? Please answer yes or no. Yes
302
+ tt0118655_shot_0279_img_0.jpg Is the person inside the red bounding box called Arthur Birnbaum? Please answer yes or no. No
303
+ tt0118655_shot_1152_img_2.jpg Is the actor inside the red bounding box called Seth Green? Please answer yes or no. Yes
304
+ tt0118655_shot_1152_img_2.jpg Is the actor inside the red bounding box called Sue Doucette? Please answer yes or no. No
305
+ tt0118689_shot_0706_img_0.jpg Is the actor inside the red bounding box called Rowan Atkinson? Please answer yes or no. Yes
306
+ tt0118689_shot_0706_img_0.jpg Is the actor inside the red bounding box called Hugo Perez? Please answer yes or no. No
307
+ tt0118689_shot_0969_img_2.jpg Is the actor inside the red bounding box called Rowan Atkinson? Please answer yes or no. Yes
308
+ tt0118689_shot_0969_img_2.jpg Is the actor inside the red bounding box called Jack Shields? Please answer yes or no. No
309
+ tt0118715_shot_0079_img_0.jpg Is the actor inside the red bounding box called Jeff Bridges? Please answer yes or no. Yes
310
+ tt0118715_shot_0079_img_0.jpg Is the actor inside the red bounding box called Scott Adkins? Please answer yes or no. No
311
+ tt0118749_shot_0795_img_0.jpg Is the person inside the red bounding box called John C. Reilly? Please answer yes or no. Yes
312
+ tt0118749_shot_0795_img_0.jpg Is the person inside the red bounding box called Chris Lowell? Please answer yes or no. No
313
+ tt0118883_shot_0691_img_1.jpg Is the actor inside the red bounding box called Julia Roberts? Please answer yes or no. Yes
314
+ tt0118883_shot_0691_img_1.jpg Is the actor inside the red bounding box called Roger Bart? Please answer yes or no. No
315
+ tt0118971_shot_0679_img_0.jpg Is the actor inside the red bounding box called Charlize Theron? Please answer yes or no. Yes
316
+ tt0118971_shot_0679_img_0.jpg Is the actor inside the red bounding box called Young-min Kim? Please answer yes or no. No
317
+ tt0119008_shot_0979_img_0.jpg Is the actor inside the red bounding box named Al Pacino? Please answer yes or no. Yes
318
+ tt0119008_shot_0979_img_0.jpg Is the actor inside the red bounding box named Neil Tweddle? Please answer yes or no. No
319
+ tt0119094_shot_0446_img_2.jpg Is the actor inside the red bounding box called Nicolas Cage? Please answer yes or no. Yes
320
+ tt0119094_shot_0446_img_2.jpg Is the actor inside the red bounding box called Juan Gabriel Pareja? Please answer yes or no. No
321
+ tt0119116_shot_0721_img_0.jpg Is the actor inside the red bounding box called Bruce Willis? Please answer yes or no. Yes
322
+ tt0119116_shot_0721_img_0.jpg Is the actor inside the red bounding box called Troye Sivan? Please answer yes or no. No
323
+ tt0119174_shot_0439_img_0.jpg Is the actor inside the red bounding box named Michael Douglas? Please answer yes or no. Yes
324
+ tt0119174_shot_0439_img_0.jpg Is the actor inside the red bounding box named Carola McGuinness? Please answer yes or no. No
325
+ tt0119314_shot_0572_img_0.jpg Is the actor inside the red bounding box called Scarlett Johansson? Please answer yes or no. Yes
326
+ tt0119314_shot_0572_img_0.jpg Is the actor inside the red bounding box called Daisy Beaumont? Please answer yes or no. No
327
+ tt0119528_shot_0171_img_0.jpg Is the person inside the red bounding box called Jim Carrey? Please answer yes or no. Yes
328
+ tt0119528_shot_0171_img_0.jpg Is the person inside the red bounding box called Eliot Paton? Please answer yes or no. No
329
+ tt0119528_shot_0761_img_1.jpg Is the actor inside the red bounding box named Jim Carrey? Please answer yes or no. Yes
330
+ tt0119528_shot_0761_img_1.jpg Is the actor inside the red bounding box named Jari Kinnunen? Please answer yes or no. No
331
+ tt0119643_shot_0330_img_0.jpg Is the actor inside the red bounding box named Brad Pitt? Please answer yes or no. Yes
332
+ tt0119643_shot_0330_img_0.jpg Is the actor inside the red bounding box named Anthony Hopkins? Please answer yes or no. No
333
+ tt0119738_shot_0201_img_0.jpg Is the person inside the red bounding box named Christopher Masterson? Please answer yes or no. Yes
334
+ tt0119738_shot_0201_img_0.jpg Is the person inside the red bounding box named Edwin Craig? Please answer yes or no. No
335
+ tt0119822_shot_0878_img_0.jpg Is the person inside the red bounding box named Greg Kinnear? Please answer yes or no. Yes
336
+ tt0119822_shot_0878_img_0.jpg Is the person inside the red bounding box named Aleksandr Dubina? Please answer yes or no. No
337
+ tt0120338_shot_0444_img_2.jpg Is the actor inside the red bounding box named Kate Winslet? Please answer yes or no. Yes
338
+ tt0120338_shot_0444_img_2.jpg Is the actor inside the red bounding box named Donald Gibb? Please answer yes or no. No
339
+ tt0120338_shot_1130_img_2.jpg Is the person inside the red bounding box called Leonardo DiCaprio? Please answer yes or no. Yes
340
+ tt0120338_shot_1130_img_2.jpg Is the person inside the red bounding box called Anne Betancourt? Please answer yes or no. No
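> Note (not part of the commit): MME conventionally reports two numbers per task, per-question accuracy (acc) and per-image accuracy (acc+), where an image counts toward acc+ only if both of its paired questions are answered correctly. A hedged scoring sketch under that convention; the `(image, ground_truth, prediction)` records are hypothetical inputs, e.g. produced by joining a template file with model outputs:

```python
# Illustrative sketch: MME-style scoring from (image, gt, pred) records.
# acc  = fraction of questions answered correctly
# acc+ = fraction of images whose *both* questions are answered correctly
from collections import defaultdict

def mme_scores(records):
    """records: iterable of (image, ground_truth, prediction) strings."""
    per_image = defaultdict(list)
    for image, gt, pred in records:
        per_image[image].append(gt.strip().lower() == pred.strip().lower())
    flat = [ok for oks in per_image.values() for ok in oks]
    acc = sum(flat) / len(flat)
    acc_plus = sum(all(oks) for oks in per_image.values()) / len(per_image)
    return 100 * acc, 100 * acc_plus  # MME sums acc and acc+ per task
```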