|
--- |
|
license: mit |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- llava

- multimodal

- image-text-to-text

- fine-tuned

- vision
|
|
|
--- |
|
|
|
# Model Details |
|
This model is a fine-tuned version of LLaVA-v1.5-7B, a large multimodal model, adapted to a custom Historical Paintings Dataset. The fine-tuning process used PEFT (Parameter-Efficient Fine-Tuning) with LoRA and DeepSpeed to reduce the number of trainable parameters and make efficient use of GPU resources.
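To give a feel for why LoRA shrinks the trainable-parameter count, here is a minimal, illustrative sketch. The rank and layer size below are hypothetical examples, not the actual hyperparameters used to fine-tune this model.

```python
# Illustrative only: LoRA replaces a full d_out x d_in weight update with
# two low-rank factors B (d_out x r) and A (r x d_in), so only
# r * (d_in + d_out) parameters are trained per adapted layer.

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted linear layer."""
    return rank * (d_in + d_out)

d = 4096            # hypothetical hidden size of a 7B-class transformer layer
full = d * d        # parameters in one full projection update
lora = lora_trainable_params(d, d, rank=16)

print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params ({100 * lora / full:.2f}% of full)")
```

With these illustrative numbers, the LoRA update trains well under 1% of the parameters of a full update on that layer.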
|
|
|
## Dataset |
|
|
|
The dataset used for fine-tuning is a collection of famous historical paintings by artists such as Leonardo da Vinci and Hans von Aachen.

The dataset consists of 3k image-text pairs. A sample of the text data is given below; each instance contains an image ID as well as an image path, both of which LLaVA requires.
|
|
|
|
|
```json
{
    "id": "data_0001",
    "image": "images/dataset/1.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "What is this image?"
        },
        {
            "from": "gpt",
            "value": "The Procuring Scene by Hans von Aachen is a captivating masterpiece that showcases the artists exceptional talent in depicting the nuances of human behavior and social dynamics. With remarkable attention to detail von Aachen portrays a scene of seduction and illicit liaisons subtly hinting at the undercurrents of desire and power play that permeated the elite circles of his time. Through his deft brushstrokes and skillful "
        }
    ]
}
```
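A small sanity check for records in this format can be sketched as follows. The field names (`id`, `image`, `conversations`, `from`, `value`) come from the sample above; the assumption that turns strictly alternate human/gpt is an inference about LLaVA-style data, not something guaranteed by this card.

```python
# Minimal structural check for a dataset record in the format shown above.

def validate_record(record: dict) -> bool:
    """Return True if the record matches the sample layout."""
    if not all(key in record for key in ("id", "image", "conversations")):
        return False
    turns = record["conversations"]
    if not turns:
        return False
    # Assumption: turns alternate human/gpt, starting with "human".
    expected = ["human", "gpt"]
    return all(
        turn.get("from") == expected[i % 2] and "value" in turn
        for i, turn in enumerate(turns)
    )

record = {
    "id": "data_0001",
    "image": "images/dataset/1.jpg",
    "conversations": [
        {"from": "human", "value": "What is this image?"},
        {"from": "gpt", "value": "The Procuring Scene by Hans von Aachen ..."},
    ],
}
print(validate_record(record))  # True
```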
|
|
|
## How to use
|
|
|
**Note** - Do not run this model through the Transformers 'Use this model' button on Hugging Face; instead, follow the step-by-step instructions below for inference.
|
|
|
The folder `llava-v1.5-7b-task-lora` contains the LoRA weights, and the folder `llava-ftmodel` contains the merged model weights and configuration.
|
- To use the model, first clone the LLaVA repository:
|
```bash |
|
git clone https://github.com/haotian-liu/LLaVA.git |
|
cd LLaVA |
|
``` |
|
- Place the folder `llava-ftmodel` (from this repo) in the `LLaVA` directory.
|
- Make sure the `transformers` version is 4.37.2 (`pip install transformers==4.37.2`).
|
- Place `test.jpg` from this repo in the `LLaVA` directory (to use it as a test image).
|
- Now run the following command: |
|
```bash |
|
python -m llava.serve.cli --model-path 'llava-ftmodel' --image-file 'test.jpg' |
|
``` |
|
The model will prompt for human input. Type 'Describe this image' or 'What is depicted in this figure?' and press Enter!
|
ENJOY! |
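Under the hood, the CLI wraps your question in a conversation template before passing it to the model. The sketch below assumes the LLaVA v1.5 "v1" template; check `llava/conversation.py` in the cloned repository for the authoritative definition.

```python
# Hypothetical sketch of the prompt the LLaVA v1.5 CLI builds internally,
# assuming the "v1" conversation template.

def build_prompt(question: str) -> str:
    system = (
        "A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions."
    )
    # The <image> token marks where the encoded image features are spliced in.
    return f"{system} USER: <image>\n{question} ASSISTANT:"

print(build_prompt("Describe this image"))
```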
|
|
|
## Key training metrics

- `train/global_step`: 940

- `train/train_samples_per_second`: 7.443

- `_step`: 940

- `train/loss`: 0.1388

- `train/epoch`: 5
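These metrics imply a rough training duration. The arithmetic below is a back-of-envelope estimate that assumes the full ~3k-instance dataset was seen in each of the 5 epochs; the actual wall-clock time is not reported in this card.

```python
# Back-of-envelope training time implied by the logged throughput,
# assuming ~3,000 samples per epoch over 5 epochs.

samples = 3_000
epochs = 5
samples_per_second = 7.443

total_seconds = samples * epochs / samples_per_second
print(f"~{total_seconds / 60:.0f} minutes of training")
```

With these assumptions, the run works out to roughly half an hour of training on the A40.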
|
|
|
## Intended Use |
|
The fine-tuned LLaVA model is designed for tasks related to historical paintings, such as image captioning, visual question answering, and multimodal understanding. It can be used by researchers, historians, and enthusiasts interested in exploring and analyzing historical artworks.
|
|
|
## Fine Tuning Procedure |
|
The model was fine-tuned on an NVIDIA A40 GPU with 48 GB of VRAM. The training process leveraged the efficiency of PEFT LoRA and DeepSpeed to optimize the use of GPU resources and minimize the number of trainable parameters. Once the new LoRA weights were trained, they were merged into the original model weights. After fine-tuning, the model achieved a final loss value of 0.13.
|
|
|
## Performance |
|
The fine-tuned LLaVA model has demonstrated improved performance on tasks related to historical paintings compared to the original LLaVA-v1.5-7B model. However, exact performance metrics and benchmarks are not provided in this model card.
|
|
|
### Limitations and Biases |
|
As with any language model, the fine-tuned LLaVA model may exhibit biases present in the training data, which could include historical, cultural, or societal biases. Additionally, the model's performance may be limited by the quality and diversity of the Historical Paintings Dataset used for fine-tuning.
|
|
|
### Ethical Considerations |
|
Users of this model should be aware of potential ethical implications, such as the use of historical artworks without proper attribution or consent. It is essential to respect intellectual property rights and ensure that any generated content or analyses are used responsibly and respectfully.
|
|
|
|