|
--- |
|
license: mit |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- llava

- multimodal

- image-text-to-text

- fine-tuned

- vision
|
|
|
--- |
|
|
|
# Model Details |
|
This model is a fine-tuned version of LLaVA-v1.5-7B, a large multimodal model, adapted to a custom Historical Paintings Dataset. The fine-tuning process used PEFT (Parameter-Efficient Fine-Tuning) with LoRA and DeepSpeed to reduce the number of trainable parameters and make efficient use of GPU resources.
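To give a feel for why LoRA shrinks the trainable-parameter count, here is a minimal, illustrative sketch. The rank and layer size below are hypothetical examples, not the actual hyperparameters used to fine-tune this model.

```python
# Illustrative only: LoRA replaces a full d_out x d_in weight update with
# two low-rank factors B (d_out x r) and A (r x d_in), so only
# r * (d_in + d_out) parameters are trained per adapted layer.

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted linear layer."""
    return rank * (d_in + d_out)

d = 4096            # hypothetical hidden size of a 7B-class transformer layer
full = d * d        # parameters in one full projection update
lora = lora_trainable_params(d, d, rank=16)

print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params ({100 * lora / full:.2f}% of full)")
```

With these illustrative numbers, the LoRA update trains well under 1% of the parameters of a full update on that layer.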
|
|
|
## Dataset |
|
|
|
The dataset used for fine-tuning is a collection of famous historical paintings by artists such as Leonardo da Vinci and Hans von Aachen.

The dataset consists of 3k image-text pairs. A sample of the text data is given below; each instance contains an image ID as well as an image path, both of which LLaVA requires.
|
|
|
|
|
```json
{
    "id": "data_0001",
    "image": "images/dataset/1.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "What is this image?"
        },
        {
            "from": "gpt",
            "value": "The Procuring Scene by Hans von Aachen is a captivating masterpiece that showcases the artists exceptional talent in depicting the nuances of human behavior and social dynamics. With remarkable attention to detail von Aachen portrays a scene of seduction and illicit liaisons subtly hinting at the undercurrents of desire and power play that permeated the elite circles of his time. Through his deft brushstrokes and skillful "
        }
    ]
}
```
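A small sanity check for records in this format can be sketched as follows. The field names (`id`, `image`, `conversations`, `from`, `value`) come from the sample above; the assumption that turns strictly alternate human/gpt is an inference about LLaVA-style data, not something guaranteed by this card.

```python
# Minimal structural check for a dataset record in the format shown above.

def validate_record(record: dict) -> bool:
    """Return True if the record matches the sample layout."""
    if not all(key in record for key in ("id", "image", "conversations")):
        return False
    turns = record["conversations"]
    if not turns:
        return False
    # Assumption: turns alternate human/gpt, starting with "human".
    expected = ["human", "gpt"]
    return all(
        turn.get("from") == expected[i % 2] and "value" in turn
        for i, turn in enumerate(turns)
    )

record = {
    "id": "data_0001",
    "image": "images/dataset/1.jpg",
    "conversations": [
        {"from": "human", "value": "What is this image?"},
        {"from": "gpt", "value": "The Procuring Scene by Hans von Aachen ..."},
    ],
}
print(validate_record(record))  # True
```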
|
|
|
## How to use
|
|
|
**Note** - Do not run this model through the Transformers 'Use this model' button on Hugging Face; instead, follow the step-by-step instructions below for inference.
|
|
|
The folder `llava-v1.5-7b-task-lora` contains the LoRA weights, and the folder `llava-ftmodel` contains the merged model weights and configuration.
|
- To use the model, first clone the LLaVA repository:
|
```bash |
|
git clone https://github.com/haotian-liu/LLaVA.git |
|
cd LLaVA |
|
``` |
|
- Place the folder `llava-ftmodel` (from this repo) in the `LLaVA` directory.
|
- Make sure the `transformers` version is 4.37.2 (`pip install transformers==4.37.2`).
|
- Place `test.jpg` from this repo in the `LLaVA` directory (to use it as a test image).
|
- Now run the following command: |
|
```bash |
|
python -m llava.serve.cli --model-path 'llava-ftmodel' --image-file 'test.jpg' |
|
``` |
|
The model will prompt for human input. Type 'Describe this image' or 'What is depicted in this figure?' and press Enter!
|
ENJOY! |
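Under the hood, the CLI wraps your question in a conversation template before passing it to the model. The sketch below assumes the LLaVA v1.5 "v1" template; check `llava/conversation.py` in the cloned repository for the authoritative definition.

```python
# Hypothetical sketch of the prompt the LLaVA v1.5 CLI builds internally,
# assuming the "v1" conversation template.

def build_prompt(question: str) -> str:
    system = (
        "A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions."
    )
    # The <image> token marks where the encoded image features are spliced in.
    return f"{system} USER: <image>\n{question} ASSISTANT:"

print(build_prompt("Describe this image"))
```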
|
|
|
## Key training metrics

- `train/global_step`: 940

- `train/train_samples_per_second`: 7.443

- `_step`: 940

- `train/loss`: 0.1388

- `train/epoch`: 5
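These metrics imply a rough training duration. The arithmetic below is a back-of-envelope estimate that assumes the full ~3k-instance dataset was seen in each of the 5 epochs; the actual wall-clock time is not reported in this card.

```python
# Back-of-envelope training time implied by the logged throughput,
# assuming ~3,000 samples per epoch over 5 epochs.

samples = 3_000
epochs = 5
samples_per_second = 7.443

total_seconds = samples * epochs / samples_per_second
print(f"~{total_seconds / 60:.0f} minutes of training")
```

With these assumptions, the run works out to roughly half an hour of training on the A40.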
|
|
|
## Intended Use |
|
The fine-tuned LLaVA model is designed for tasks related to historical paintings, such as image captioning, visual question answering, and multimodal understanding. It can be used by researchers, historians, and enthusiasts interested in exploring and analyzing historical artworks.
|
|
|
## Fine Tuning Procedure |
|
The model was fine-tuned on an NVIDIA A40 GPU with 48 GB of VRAM. The training process leveraged the efficiency of PEFT LoRA and DeepSpeed to optimize the use of GPU resources and minimize the number of trainable parameters. Once the new LoRA weights were trained, they were merged into the original model weights. After fine-tuning, the model achieved a final loss value of 0.13.
|
|
|
## Performance |
|
The fine-tuned LLaVA model has demonstrated improved performance on tasks related to historical paintings compared to the original LLaVA-v1.5-7B model. However, exact performance metrics and benchmarks are not provided in this model card.
|
|
|
### Limitations and Biases |
|
As with any language model, the fine-tuned LLaVA model may exhibit biases present in the training data, which could include historical, cultural, or societal biases. Additionally, the model's performance may be limited by the quality and diversity of the Historical Paintings Dataset used for fine-tuning.
|
|
|
### Ethical Considerations |
|
Users of this model should be aware of potential ethical implications, such as the use of historical artworks without proper attribution or consent. It is essential to respect intellectual property rights and ensure that any generated content or analyses are used responsibly and respectfully.
|
|
|
|