---
license: apache-2.0
datasets:
- remyxai/vqasynth_spacellava
tags:
- remyx
- vqasynth
- spatial-reasoning
- multimodal
- vision-language-model
- vlm
- llava
- robotics
- embodied-ai
- quantitative-spatial-reasoning
- distance-estimation
base_model:
- liuhaotian/llava-v1.5-13b
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
new_version: remyxai/SpaceQwen2.5-VL-3B-Instruct
---
# SpaceLLaVA
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/ZGZ0aNfZLxtdHaXN8F2ki.gif)
- **Model Type:** Multimodal, Vision-Language Model
- **Architecture:** `llava-v1.5-13b`
- **Model Size:** 13.4B parameters (FP16)
- **Finetuned from:** liuhaotian/llava-v1.5-13b
- **Finetune Strategy:** LoRA (Low-Rank Adaptation)
- **License:** Apache-2.0
# Model Overview
**SpaceLLaVA** is a vision-language model adapted from LLaVA-1.5 (13B) and fine-tuned with LoRA to improve spatial reasoning.
It was trained on a synthetic VQA [dataset](https://huggingface.co/datasets/remyxai/vqasynth_spacellava) generated following the methods described in [SpatialVLM](https://spatial-vlm.github.io/).
By distilling 3D scene understanding from the [VQASynth](https://github.com/remyxai/VQASynth/tree/main) pipeline, SpaceLLaVA demonstrates both qualitative and quantitative spatial reasoning.
## Running SpaceLLaVA
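For a quick start with 🤗 Transformers, here is a minimal inference sketch. It assumes the weights are published under the repo id `remyxai/SpaceLLaVA` and load with `LlavaForConditionalGeneration`; adjust the repo id and prompt template if your checkpoint differs.

```python
# Minimal transformers inference sketch; the repo id and prompt format are assumptions.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "remyxai/SpaceLLaVA"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = (
    "USER: <image>\nWhat is the distance between the man in the red hat "
    "and the pallet of boxes? ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```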
### GGUF
Use this notebook to query spatial relationships between objects in a scene with llama-cpp-python.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WPE7Br5A5ERSij8BL1M22EoEMLVkD8EP?usp=sharing)
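To run the GGUF export locally instead of in Colab, a llama-cpp-python sketch looks like the following; the GGUF and mmproj filenames are placeholders for whichever quantization you download.

```python
# Sketch for querying a LLaVA-1.5-style GGUF with llama-cpp-python.
# The model and projector filenames below are placeholders, not release artifacts.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="spacellava-q4_0.gguf",  # placeholder filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embeddings in the context window
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"}},
                {"type": "text",
                 "text": "What is the distance between the man in the red hat and the pallet of boxes?"},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```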
### Docker
`docker build -f Dockerfile -t spacellava-server:latest .`
`docker run -it --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size 12G spacellava-server:latest`
`python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" --prompt "What is the distance between the man in the red hat and the pallet of boxes?"`
# Dataset & Training
- **Dataset:** [SpaceLLaVA](https://huggingface.co/datasets/remyxai/vqasynth_spacellava)
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/)
- ~28,000 synthetic samples built from templated VQA pairs over reconstructed 3D scenes
- Fields: image (RGB), question (text), answer (text)
- Spatial relation types include: “distances”, “size”, “left of”, “above”, “closer to”, “inside”
Scripts for LoRA SFT are available in the [trl examples](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py); a condensed sketch of that recipe follows below.
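The sketch follows the pattern of trl's `sft_vlm.py` rather than reproducing the exact training configuration: the transformers-format base checkpoint, the LoRA hyperparameters, and the dataset column names (`messages`, `images`) are assumptions, and `SFTTrainer` argument names vary across trl versions.

```python
# Rough LoRA SFT sketch after trl's sft_vlm.py example; base checkpoint, LoRA settings,
# and dataset columns ("messages", "images") are assumptions, not the released config.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import SFTConfig, SFTTrainer

base_id = "llava-hf/llava-1.5-13b-hf"  # transformers-format LLaVA-1.5 13B
processor = AutoProcessor.from_pretrained(base_id)
model = AutoModelForVision2Seq.from_pretrained(base_id, torch_dtype=torch.float16)

train_ds = load_dataset("remyxai/vqasynth_spacellava", split="train")

def collate_fn(examples):
    # Render each conversation with the chat template and batch the paired images.
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding in the loss
    batch["labels"] = labels
    return batch

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="spacellava-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        remove_unused_columns=False,
        dataset_kwargs={"skip_prepare_dataset": True},  # collate_fn does the preprocessing
    ),
    train_dataset=train_ds,
    data_collator=collate_fn,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()
```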
Check out the [SpaceVLMs collection](https://huggingface.co/collections/remyxai/spacevlms-66a3dbb924756d98e7aec678)
# Model Evaluation (Coming Soon)
**TODO:** VLMEvalKit evaluation on the QSpatial benchmark, VSR, etc.
Try it on Discord: http://discord.gg/b2yGuCNpuC
![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/Rsu5VpDgdZh9jemw97w8T.png)
# ⚠️ Limitations & Ethical Considerations
- Performance may degrade in cluttered scenes or under unusual camera perspectives.
- The model was fine-tuned on synthetic spatial-reasoning annotations generated over an internet image dataset.
- Multimodal biases inherent to the base model (LLaVA) may persist.
- Not intended for use in safety-critical or legal decision-making.
> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
## License and Citation
Licensed under Apache-2.0.
```
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@inproceedings{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  booktitle = {NeurIPS},
  year      = {2023},
}
```