m-elio committed on Commit a6721e8 · verified · 1 Parent(s): 1afc914

Create README.md

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
---
language:
- de
- en
- es
- fr
- it
base_model:
- TIGER-Lab/VLM2Vec-LoRA
---
# Model Card for xVLM2Vec_image_loss

## Model description

**xVLM2Vec_image_loss** is a *Large Vision-Language Model (LVLM)* aligned over **TIGER-Lab/VLM2Vec-LoRA**.
The model was trained to improve performance on multilingual retrieval tasks; specifically, it was trained on a [machine-translated parallel corpus](https://huggingface.co/datasets/swap-uniba/xMMEB-train).
It can perform several multimodal retrieval tasks (e.g. Text-to-Image, Image-to-Text, VQA, Visual Grounding and Classification).

It was trained with a different loss than [swap-uniba/xVLM2Vec](https://huggingface.co/swap-uniba/xVLM2Vec); however, no significant performance differences were found between the two models.

More details on the training procedure (e.g. hyperparameters and dataset construction) can be found in the [paper].

- **Developed by:** Elio Musacchio, Lucia Siciliani, Pierpaolo Basile
- **Model type:** Phi-3.5-vision-instruct
- **Language(s) (NLP):** English, French, German, Italian and Spanish
- **License:** [Apache 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)
- **Finetuned from model:** [TIGER-Lab/VLM2Vec-LoRA](https://huggingface.co/TIGER-Lab/VLM2Vec-LoRA)

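The machine-translated parallel corpus linked above is hosted on the Hugging Face Hub. Below is a minimal sketch for inspecting it, assuming the standard `datasets` library; the available configs, splits and field names are not documented here, so check them before relying on this snippet.

```python
from datasets import load_dataset

# Load the machine-translated parallel corpus used for training
# (repository name taken from the link above; configs/splits may differ)
dataset = load_dataset("swap-uniba/xMMEB-train")

print(dataset)  # show the available splits and their sizes
first_split = next(iter(dataset))
print(dataset[first_split][0])  # inspect the fields of one training example
```
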
## How to Get Started with the Model

Below is an example of how to use the model. To make it easier to run, we recommend pulling from GitHub the version of the VLM2Vec source code that we used for both training and inference:

```bash
git clone https://github.com/swap-uniba/xVLM2Vec
mv xVLM2Vec/src/mmeb_src .
rm -r xVLM2Vec
```

Now you should be able to run the following:

```python
from mmeb_src.model import MMEBModel
from mmeb_src.arguments import ModelArguments

from PIL import Image
from transformers import AutoProcessor

import torch
import requests

# Load the xVLM2Vec_image_loss weights on top of the Phi-3.5-vision-instruct backbone
model_args = ModelArguments(
    model_name='microsoft/Phi-3.5-vision-instruct',
    checkpoint_path="m-elio/xVLM2Vec_image_loss",
    pooling='last',
    normalize=True,
    lora=False,
)

processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    num_crops=4,
)

model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

with torch.no_grad():
    # Query: an image plus an Italian instruction
    # ("Find a caption that describes the everyday image")
    inputs = processor(
        "<|image_1|>\nTrova una didascalia che descriva l'immagine di tutti i giorni",
        [Image.open(requests.get("http://images.cocodataset.org/train2017/000000514915.jpg", stream=True).raw)]
    )
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    qry_output = model(qry=inputs)["qry_reps"]

    # Candidate captions: "A dog lying on the floor" / "A cat lying on the floor"
    strings = ['Un cane steso sul pavimento', 'Un gatto steso sul pavimento']
    inputs = processor(strings)
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    tgt_output = model(tgt=inputs)["tgt_reps"]
    cos_sim = model.compute_similarity(qry_output, tgt_output).squeeze()

    # Print the similarity between the query and each candidate caption
    for string_, sim_ in zip(strings, cos_sim):
        print(string_, '=', sim_)
```

This is an Image-to-Text use case: the model retrieves an Italian caption for the given image, and the candidate with the highest similarity score is the retrieved caption.

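The other retrieval directions listed in the model description use the same API. Below is a minimal, untested sketch of Text-to-Image retrieval that continues from the setup above; the Italian instructions and the reuse of the COCO image are illustrative choices, not the exact prompts used during training.

```python
# Sketch of the reverse direction (Text-to-Image); continues from the setup above
with torch.no_grad():
    # Text-only query
    # ("Find an everyday image that matches the given caption: a dog lying on the floor")
    inputs = processor("Trova un'immagine di tutti i giorni che corrisponda alla didascalia data: un cane steso sul pavimento")
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    qry_output = model(qry=inputs)["qry_reps"]

    # A candidate image as the target ("Represent the given image")
    image = Image.open(requests.get("http://images.cocodataset.org/train2017/000000514915.jpg", stream=True).raw)
    inputs = processor("<|image_1|>\nRappresenta l'immagine data", [image])
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    tgt_output = model(tgt=inputs)["tgt_reps"]

    # Higher similarity means the image is a better match for the text query
    print(model.compute_similarity(qry_output, tgt_output).squeeze())
```
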
## Citation

If you use this model in your research, please cite the following:

```bibtex
TBD
```