Update README.md
README.md CHANGED
@@ -221,9 +221,64 @@ You can cite us in the following way:
## Training Data

SmolVLM2 was trained on 3.3M samples drawn from ten different datasets: [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [MAmmoTH](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [Video-STaR](https://huggingface.co/datasets/orrzohar/Video-STaR), [Vript](https://huggingface.co/datasets/Mutonix/Vript), [VISTA-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).

The following tables give a general overview of the samples across modalities and the sources of those samples.

<!--
<center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
</center>

### Details
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description"> -->
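The ingestion and mixing pipeline itself is not part of this repository. As a minimal sketch, assuming the Hugging Face `datasets` library, the sources listed above can be pulled from the Hub for inspection; the subset and split names below are illustrative, and some sources (e.g. FineVideo) are gated and require accepting their terms first.

```python
# Minimal sketch for inspecting two of the source datasets listed above.
# This is NOT the SmolVLM2 preprocessing/mixing pipeline, only Hub access.
from datasets import get_dataset_config_names, load_dataset

# LLaVA-OneVision-Data is organized into many named subsets (configs); list
# them and stream a few examples instead of downloading everything.
configs = get_dataset_config_names("lmms-lab/LLaVA-OneVision-Data")
onevision = load_dataset(
    "lmms-lab/LLaVA-OneVision-Data", configs[0], split="train", streaming=True
)
print(next(iter(onevision)).keys())

# FineVideo is gated: accept its terms on the Hub and authenticate
# (e.g. `huggingface-cli login`) before streaming it.
finevideo = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
print(next(iter(finevideo)).keys())
```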
## Data Split per modality

| Data Type    | Percentage |
|--------------|------------|
| Image        | 34.4%      |
| Text         | 20.2%      |
| Video        | 33.0%      |
| Multi-image  | 12.3%      |
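These shares sum to roughly 99.9% because of rounding. As a back-of-the-envelope sketch, they can be turned into approximate sample counts using the 3.3M total stated above (actual counts may differ slightly):

```python
# Approximate per-modality sample counts implied by the split above,
# assuming the 3.3M-sample total; the percentages are rounded, so treat
# the resulting numbers as estimates only.
TOTAL_SAMPLES = 3_300_000

modality_split = {"Image": 34.4, "Text": 20.2, "Video": 33.0, "Multi-image": 12.3}

for modality, pct in modality_split.items():
    print(f"{modality:12s} ~{round(TOTAL_SAMPLES * pct / 100):,} samples")

print(f"Total share: {sum(modality_split.values()):.1f}%")  # ~99.9% due to rounding
```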
## Granular dataset slices per modality

### Text Datasets

| Dataset                                    | Percentage |
|--------------------------------------------|------------|
| llava-onevision/magpie_pro_ft3_80b_mt      | 6.8%       |
| llava-onevision/magpie_pro_ft3_80b_tt      | 6.8%       |
| llava-onevision/magpie_pro_qwen2_72b_tt    | 5.8%       |
| llava-onevision/mathqa                     | 0.9%       |

### Multi-image Datasets

| Dataset                                    | Percentage |
|--------------------------------------------|------------|
| m4-instruct-data/m4_instruct_multiimage    | 10.4%      |
| mammoth/multiimage-cap6                    | 1.9%       |

### Image Datasets

| Dataset                                    | Percentage |
|--------------------------------------------|------------|
| llava-onevision/other                      | 17.4%      |
| llava-onevision/vision_flan                | 3.9%       |
| llava-onevision/mavis_math_metagen         | 2.6%       |
| llava-onevision/mavis_math_rule_geo        | 2.5%       |
| llava-onevision/sharegpt4o                 | 1.7%       |
| llava-onevision/sharegpt4v_coco            | 1.5%       |
| llava-onevision/image_textualization       | 1.3%       |
| llava-onevision/sharegpt4v_llava           | 0.9%       |
| llava-onevision/mapqa                      | 0.9%       |
| llava-onevision/qa                         | 0.8%       |
| llava-onevision/textocr                    | 0.8%       |

### Video Datasets

| Dataset                                    | Percentage |
|--------------------------------------------|------------|
| llava-video-178k/1-2m                      | 7.3%       |
| llava-video-178k/2-3m                      | 7.0%       |
| other-video/combined                       | 5.7%       |
| llava-video-178k/hound                     | 4.4%       |
| llava-video-178k/0-30s                     | 2.4%       |
| video-star/starb                           | 2.2%       |
| vista-400k/combined                        | 2.2%       |
| vript/long                                 | 1.0%       |
| ShareGPT4Video/all                         | 0.8%       |
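Each percentage above is a share of the full 3.3M-sample mixture. Purely as an illustration (this is not the released training configuration), the slices of one modality can be renormalized into sampling weights for a data loader, e.g. for the video slices:

```python
# Illustrative only: renormalize the video slice shares into sampling weights
# within the video modality. This is not the actual SmolVLM2 training config.
video_slices = {
    "llava-video-178k/1-2m": 7.3,
    "llava-video-178k/2-3m": 7.0,
    "other-video/combined": 5.7,
    "llava-video-178k/hound": 4.4,
    "llava-video-178k/0-30s": 2.4,
    "video-star/starb": 2.2,
    "vista-400k/combined": 2.2,
    "vript/long": 1.0,
    "ShareGPT4Video/all": 0.8,
}

total = sum(video_slices.values())  # share of the whole mixture held by video (~33%)
weights = {name: pct / total for name, pct in video_slices.items()}

for name, w in weights.items():
    print(f"{name:24s} weight within video data: {w:.3f}")
```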