mfarre (HF staff) committed
Commit 83e8cba · verified · 1 Parent(s): f83c1bb

Update README.md

Files changed (1):
  1. README.md +57 -2
README.md CHANGED
@@ -221,9 +221,64 @@ You can cite us in the following way:
 ## Training Data
 SmolVLM2 was trained on 3.3M samples originating from ten datasets: [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [MAmmoTH-VL](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [Video-STaR](https://huggingface.co/datasets/orrzohar/Video-STaR), [Vript](https://huggingface.co/datasets/Mutonix/Vript), [VISTA-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
-In the following plots we give a general overview of the samples across modalities and the source of those samples.
+The following tables give a general overview of the samples across modalities and the sources of those samples.
-
+<!--
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
 ### Details
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description"> -->
+
+## Data Split per Modality
+
+| Data type    | Percentage |
+|--------------|------------|
+| Image        | 34.4%      |
+| Text         | 20.2%      |
+| Video        | 33.0%      |
+| Multi-image  | 12.3%      |
+
+
+## Granular Dataset Slices per Modality
+
+### Text Datasets
+| Dataset                                  | Percentage |
+|------------------------------------------|------------|
+| llava-onevision/magpie_pro_ft3_80b_mt    | 6.8%       |
+| llava-onevision/magpie_pro_ft3_80b_tt    | 6.8%       |
+| llava-onevision/magpie_pro_qwen2_72b_tt  | 5.8%       |
+| llava-onevision/mathqa                   | 0.9%       |
+
+### Multi-image Datasets
+| Dataset                                  | Percentage |
+|------------------------------------------|------------|
+| m4-instruct-data/m4_instruct_multiimage  | 10.4%      |
+| mammoth/multiimage-cap6                  | 1.9%       |
+
+### Image Datasets
+| Dataset                                  | Percentage |
+|------------------------------------------|------------|
+| llava-onevision/other                    | 17.4%      |
+| llava-onevision/vision_flan              | 3.9%       |
+| llava-onevision/mavis_math_metagen       | 2.6%       |
+| llava-onevision/mavis_math_rule_geo      | 2.5%       |
+| llava-onevision/sharegpt4o               | 1.7%       |
+| llava-onevision/sharegpt4v_coco          | 1.5%       |
+| llava-onevision/image_textualization     | 1.3%       |
+| llava-onevision/sharegpt4v_llava         | 0.9%       |
+| llava-onevision/mapqa                    | 0.9%       |
+| llava-onevision/qa                       | 0.8%       |
+| llava-onevision/textocr                  | 0.8%       |
+
+### Video Datasets
+| Dataset                                  | Percentage |
+|------------------------------------------|------------|
+| llava-video-178k/1-2m                    | 7.3%       |
+| llava-video-178k/2-3m                    | 7.0%       |
+| other-video/combined                     | 5.7%       |
+| llava-video-178k/hound                   | 4.4%       |
+| llava-video-178k/0-30s                   | 2.4%       |
+| video-star/starb                         | 2.2%       |
+| vista-400k/combined                      | 2.2%       |
+| vript/long                               | 1.0%       |
+| ShareGPT4Video/all                       | 0.8%       |
+
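With the overview plots commented out, the tables above become the canonical description of the training mixture. Below is a minimal sketch (Python, using the Hugging Face `datasets` library) of how the modality percentages could be turned into absolute sample counts and a probability-weighted interleaved mixture. It is an illustration, not the actual SmolVLM2 data pipeline: the `Dataset.from_dict` streams are offline stand-ins for the Hub datasets linked above, and the renormalization step is an assumption needed because the rounded shares sum to 99.9%.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library is installed.
# This is NOT the SmolVLM2 training pipeline; it only makes the modality
# table above concrete.
from datasets import Dataset, interleave_datasets

TOTAL_SAMPLES = 3_300_000  # "3.3M samples" stated above
split = {"image": 0.344, "text": 0.202, "video": 0.330, "multi-image": 0.123}

# Rounded shares sum to 0.999, so renormalize (assumption: true shares sum to 1).
norm = sum(split.values())
weights = {name: share / norm for name, share in split.items()}

for name, w in weights.items():
    print(f"{name:>12}: weight={w:.4f}  ~{round(TOTAL_SAMPLES * w):,} samples")

# Offline stand-ins for the per-modality sources; real use would stream the
# Hub datasets linked in the paragraph above instead.
streams = [Dataset.from_dict({"modality": [name] * 1000}) for name in weights]

# Draw from each stream at its modality share; "all_exhausted" oversamples
# smaller sources so every example is eventually seen.
mixture = interleave_datasets(
    streams,
    probabilities=list(weights.values()),
    seed=0,
    stopping_strategy="all_exhausted",
)
print(mixture[0])  # e.g. {'modality': 'image'}
```

Probability-weighted interleaving is one common way to realize a fixed modality mix; whether SmolVLM2 mixed online like this or pre-mixed the 3.3M samples offline is not stated in this README.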