Support fine-tuning (#7)
- Support fine-tuning (d94acaf1a57be2532adcb39c31836b80a21c043b)
Co-authored-by: tastelikefeet <[email protected]>
README.md CHANGED
@@ -3762,6 +3762,48 @@ The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) English t

**More detailed experimental results can be found in the [paper](http://arxiv.org/abs/2412.16855)**.

## Community support

### Fine-tuning

GME models can be fine-tuned with [SWIFT](https://github.com/modelscope/ms-swift):

```shell
pip install ms-swift -U
```
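
The training command below uses `--deepspeed zero3`; if DeepSpeed is not already present in your environment (our assumption, it may not be installed by `ms-swift` itself), install it as well:

```shell
pip install deepspeed
```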

```shell
# Set MAX_PIXELS to reduce memory usage.
# See: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
nproc_per_node=8
MAX_PIXELS=1003520 \
USE_HF=1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model Alibaba-NLP/gme-Qwen2-VL-7B-Instruct \
    --train_type lora \
    --dataset 'HuggingFaceM4/TextCaps:emb' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --eval_strategy steps \
    --save_total_limit 5 \
    --logging_steps 5 \
    --output_dir output \
    --lazy_tokenize true \
    --warmup_ratio 0.05 \
    --learning_rate 5e-6 \
    --deepspeed zero3 \
    --dataloader_num_workers 4 \
    --task_type embedding \
    --loss_type infonce \
    --dataloader_drop_last true
```
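
For reference, `--gradient_accumulation_steps` is derived as `64 / nproc_per_node`, so the effective batch size stays fixed at 2 per device × 8 processes × 8 accumulation steps = 128 for any process count that divides 64. A quick arithmetic check using the values from the command above:

```shell
# effective batch = per_device_train_batch_size * NPROC_PER_NODE
#                 * gradient_accumulation_steps
nproc_per_node=8
echo $(( 2 * nproc_per_node * (64 / nproc_per_node) ))  # prints 128
```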
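
After training, the LoRA adapter is saved under `output/`. As a follow-up sketch (not part of this change; we assume an ms-swift version whose `swift export` supports `--adapters` and `--merge_lora`, and the checkpoint path is a placeholder, so check `swift export --help` for your installed version), the adapter can be merged back into the base weights for deployment:

```shell
# Merge the trained LoRA adapter into the base model weights.
# Replace the placeholder path with your actual checkpoint directory.
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --merge_lora true
```
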
## Limitations

- **Single Image Input**: In `Qwen2-VL`, an image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to obtain good training efficiency.
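
To relate this cap to the `MAX_PIXELS` setting in the training command above: a rough sketch, assuming Qwen2-VL's 14×14 ViT patches with 2×2 spatial merging, i.e. one visual token per 28×28-pixel block (this patch geometry comes from the Qwen2-VL design, not from this README):

```shell
# pixel budget that fits the 1024-visual-token cap, at 28x28 px per token
echo $(( 1024 * 28 * 28 ))       # prints 802816
# tokens implied by MAX_PIXELS=1003520 from the training command
echo $(( 1003520 / (28 * 28) ))  # prints 1280
```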