thenlper tastelikefeet committed
Commit 06bd79d · verified · 1 Parent(s): d42eca5

Support fine-tuning (#7)

- Support fine-tuning (d94acaf1a57be2532adcb39c31836b80a21c043b)


Co-authored-by: tastelikefeet <[email protected]>

Files changed (1)
  1. README.md +42 -0
README.md CHANGED
@@ -3762,6 +3762,48 @@ The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) English t
 
  **More detailed experimental results can be found in the [paper](http://arxiv.org/abs/2412.16855)**.
 
+ ## Community support
+
+ ### Fine-tuning
+
+ GME models can be fine-tuned with SWIFT:
+
+ ```shell
+ pip install ms-swift -U
+ ```
+
+ ```shell
+ # Set MAX_PIXELS to reduce memory usage
+ # see: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
+ nproc_per_node=8
+ MAX_PIXELS=1003520 \
+ USE_HF=1 \
+ NPROC_PER_NODE=$nproc_per_node \
+ swift sft \
+     --model Alibaba-NLP/gme-Qwen2-VL-7B-Instruct \
+     --train_type lora \
+     --dataset 'HuggingFaceM4/TextCaps:emb' \
+     --torch_dtype bfloat16 \
+     --num_train_epochs 1 \
+     --per_device_train_batch_size 2 \
+     --per_device_eval_batch_size 2 \
+     --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \
+     --eval_steps 100 \
+     --save_steps 100 \
+     --eval_strategy steps \
+     --save_total_limit 5 \
+     --logging_steps 5 \
+     --output_dir output \
+     --lazy_tokenize true \
+     --warmup_ratio 0.05 \
+     --learning_rate 5e-6 \
+     --deepspeed zero3 \
+     --dataloader_num_workers 4 \
+     --task_type embedding \
+     --loss_type infonce \
+     --dataloader_drop_last true
+ ```
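The `--loss_type infonce` flag selects an InfoNCE-style contrastive objective for embedding training. As a rough illustration only (this is not ms-swift's actual implementation, and the temperature value and normalization details are assumptions), InfoNCE with in-batch negatives can be sketched as:

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the document at index i is the
    positive for query i; every other document in the batch serves as a
    negative. The temperature value is an illustrative assumption."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                  # (B, B) similarity logits
    labels = torch.arange(q.size(0), device=q.device)  # positives on diagonal
    return F.cross_entropy(logits, labels)
```

In training, `query_emb` and `doc_emb` would be the pooled embeddings of the paired query/document (text or image) inputs in each batch.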
+
  ## Limitations
 
  - **Single Image Input**: In `Qwen2-VL`, an image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to obtain good training efficiency.
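The visual-token cap interacts with image resolution: in Qwen2-VL, each visual token corresponds to roughly a 28×28-pixel area (14×14 ViT patches merged 2×2), so a pixel cap like the `MAX_PIXELS` setting in the fine-tuning command bounds the per-image token count. A back-of-the-envelope sketch, assuming the 28-pixel patch edge of Qwen2-VL's defaults (actual counts also depend on the resize grid the processor picks):

```python
# Approximate visual-token budget implied by a pixel cap in Qwen2-VL.
# Assumption: each visual token covers a 28x28-pixel area (14x14 ViT
# patches merged 2x2), per Qwen2-VL defaults.
PATCH_EDGE = 28  # pixels per visual token, per side (assumed default)

def max_visual_tokens(max_pixels: int) -> int:
    """Upper bound on visual tokens for an image capped at max_pixels."""
    return max_pixels // (PATCH_EDGE * PATCH_EDGE)

print(max_visual_tokens(1003520))  # MAX_PIXELS from the command above -> 1280
```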