---
license: apache-2.0
datasets:
- remyxai/vqasynth_spacellava
tags:
- remyx
- vqasynth
- spatial-reasoning
- multimodal
- vision-language-model
- vlm
- llava
- robotics
- embodied-ai
- quantitative-spatial-reasoning
- distance-estimation
base_model:
- liuhaotian/llava-v1.5-13b
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
new_version: remyxai/SpaceQwen2.5-VL-3B-Instruct
---
# SpaceLLaVA
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/ZGZ0aNfZLxtdHaXN8F2ki.gif)
- **Model Type:** Multimodal, Vision-Language Model
- **Architecture:** `llava-v1.5-13b`
- **Model Size:** 13.4B parameters (FP16)
- **Finetuned from:** liuhaotian/llava-v1.5-13b
- **Finetune Strategy:** LoRA (Low-Rank Adaptation)
- **License:** Apache-2.0
# Model Overview
**SpaceLLaVA** is a vision-language model adapted from LLaVA-1.5 (13B) and fine-tuned with LoRA to improve spatial reasoning.
It was trained on a synthetic VQA [dataset](https://huggingface.co/datasets/remyxai/vqasynth_spacellava) generated following the methods described in [SpatialVLM](https://spatial-vlm.github.io/).
By distilling 3D scene understanding from the [VQASynth](https://github.com/remyxai/VQASynth/tree/main) pipeline, SpaceLLaVA demonstrates both qualitative and quantitative spatial reasoning.
## Running SpaceLLaVA
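For a quick start with 🤗 Transformers, here is a minimal inference sketch. It assumes the weights are published under the repo id `remyxai/SpaceLLaVA` and load with `LlavaForConditionalGeneration`; adjust the repo id and prompt template if your checkpoint differs.

```python
# Minimal transformers inference sketch; the repo id and prompt format are assumptions.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "remyxai/SpaceLLaVA"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = (
    "USER: <image>\nWhat is the distance between the man in the red hat "
    "and the pallet of boxes? ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```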
### GGUF
Use this notebook to query spatial relationships between objects in a scene with llama-cpp-python.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WPE7Br5A5ERSij8BL1M22EoEMLVkD8EP?usp=sharing)
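To run the GGUF export locally instead of in Colab, a llama-cpp-python sketch looks like the following; the GGUF and mmproj filenames are placeholders for whichever quantization you download.

```python
# Sketch for querying a LLaVA-1.5-style GGUF with llama-cpp-python.
# The model and projector filenames below are placeholders, not release artifacts.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="spacellava-q4_0.gguf",  # placeholder filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embeddings in the context window
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"}},
                {"type": "text",
                 "text": "What is the distance between the man in the red hat and the pallet of boxes?"},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```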
### Docker
`docker build -f Dockerfile -t spacellava-server:latest .`
`docker run -it --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size 12G spacellava-server:latest`
`python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" --prompt "What is the distance between the man in the red hat and the pallet of boxes?"`
# Dataset & Training
- **Dataset:** [SpaceLLaVA](https://huggingface.co/datasets/remyxai/vqasynth_spacellava)
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/)
- ~28,000 synthetic samples built from templated VQA pairs over reconstructed 3D scenes
- Fields: image (RGB), question (text), answer (text)
- Spatial relation types include: “distances”, “size”, “left of”, “above”, “closer to”, “inside”
Scripts for LoRA SFT are available in the [trl examples](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py); a condensed sketch of that recipe follows below.
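The sketch follows the pattern of trl's `sft_vlm.py` rather than reproducing the exact training configuration: the transformers-format base checkpoint, the LoRA hyperparameters, and the dataset column names (`messages`, `images`) are assumptions, and `SFTTrainer` argument names vary across trl versions.

```python
# Rough LoRA SFT sketch after trl's sft_vlm.py example; base checkpoint, LoRA settings,
# and dataset columns ("messages", "images") are assumptions, not the released config.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import SFTConfig, SFTTrainer

base_id = "llava-hf/llava-1.5-13b-hf"  # transformers-format LLaVA-1.5 13B
processor = AutoProcessor.from_pretrained(base_id)
model = AutoModelForVision2Seq.from_pretrained(base_id, torch_dtype=torch.float16)

train_ds = load_dataset("remyxai/vqasynth_spacellava", split="train")

def collate_fn(examples):
    # Render each conversation with the chat template and batch the paired images.
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding in the loss
    batch["labels"] = labels
    return batch

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="spacellava-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        remove_unused_columns=False,
        dataset_kwargs={"skip_prepare_dataset": True},  # collate_fn does the preprocessing
    ),
    train_dataset=train_ds,
    data_collator=collate_fn,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()
```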
Check out the [SpaceVLMs collection](https://huggingface.co/collections/remyxai/spacevlms-66a3dbb924756d98e7aec678)
# Model Evaluation (Coming Soon)
**TODO:** VLMEvalKit evaluation on the QSpatial benchmark, VSR, etc.
Try it on Discord: http://discord.gg/b2yGuCNpuC
![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/Rsu5VpDgdZh9jemw97w8T.png)
# ⚠️ Limitations & Ethical Considerations
- Performance may degrade in cluttered scenes or under unusual camera perspectives.
- The model was fine-tuned on synthetic spatial-reasoning annotations generated over an internet image dataset.
- Multimodal biases inherent to the base model (LLaVA) may persist.
- Not intended for use in safety-critical or legal decision-making.
> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
## License and Citation
Licensed under Apache-2.0.
```
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@inproceedings{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  booktitle = {NeurIPS},
  year      = {2023},
}
```