---
license: apache-2.0
datasets:
- remyxai/vqasynth_spacellava
tags:
- remyx
- vqasynth
- spatial-reasoning
- multimodal
- vision-language-model
- vlm
- llava
- robotics
- embodied-ai
- quantitative-spatial-reasoning
- distance-estimation
base_model:
- liuhaotian/llava-v1.5-13b
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
new_version: remyxai/SpaceQwen2.5-VL-3B-Instruct
---

# SpaceLLaVA

 |
|
|
|
- **Model Type:** Multimodal, Vision-Language Model
- **Architecture:** `llava-v1.5-13b`
- **Model Size:** 13.4B parameters (FP16)
- **Finetuned from:** liuhaotian/llava-v1.5-13b
- **Finetune Strategy:** LoRA (Low-Rank Adaptation)
- **License:** Apache-2.0

# Model Overview

**SpaceLLaVA** is a vision-language model adapted from LLaVA-1.5 (13B) and fine-tuned with LoRA to improve spatial reasoning.
It was trained on a synthetic VQA [dataset](https://huggingface.co/datasets/remyxai/vqasynth_spacellava) inspired by the methods described in [SpatialVLM](https://spatial-vlm.github.io/).
SpaceLLaVA demonstrates strong qualitative and quantitative spatial reasoning after distilling 3D scene understanding from the [VQASynth](https://github.com/remyxai/VQASynth/tree/main) pipelines.

## Running SpaceLLaVA

### GGUF

Use this notebook to query spatial relationships between objects in a scene with llama-cpp-python.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WPE7Br5A5ERSij8BL1M22EoEMLVkD8EP?usp=sharing)
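
For local inference outside the notebook, a minimal llama-cpp-python sketch is shown below; the `model_path` and `clip_model_path` filenames are placeholders, so substitute the actual GGUF and projector files from the release you download.

```python
# Minimal llama-cpp-python sketch for a LLaVA-1.5-style GGUF checkpoint.
# NOTE: model_path and clip_model_path are placeholder filenames, not confirmed artifacts.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")  # vision projector weights
llm = Llama(
    model_path="spacellava-13b-q4_0.gguf",  # quantized language model weights
    chat_handler=chat_handler,
    n_ctx=2048,       # leave room for image tokens plus the answer
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"}},
                {"type": "text",
                 "text": "What is the distance between the man in the red hat and the pallet of boxes?"},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```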
### Docker

`docker build -f Dockerfile -t spacellava-server:latest .`

`docker run -it --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size 12G spacellava-server:latest`

`python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" --prompt "What is the distance between the man in the red hat and the pallet of boxes?"`

# Dataset & Training

- **Dataset:** [SpaceLLaVA](https://huggingface.co/datasets/remyxai/vqasynth_spacellava)
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/)

- ~28,000 synthetic samples created using templated VQA pairs with a 3D scene reconstruction pipeline
- Format: image (RGB), question (text), answer (text)
- Spatial relation types include: “distances”, “size”, “left of”, “above”, “closer to”, “inside”
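
As a quick sanity check, the data can be pulled straight from the Hub with the `datasets` library; the sketch below makes no assumption about split or column names beyond the repository id given above, so inspect the printed schema before writing training code.

```python
# Peek at the VQASynth spatial-reasoning data; the actual splits and column names
# may differ from the image/question/answer summary above, so inspect them first.
from datasets import load_dataset

ds = load_dataset("remyxai/vqasynth_spacellava")
print(ds)                      # available splits, column names, and row counts
first_split = next(iter(ds.values()))
print(first_split[0])          # one raw sample: image plus its spatial VQA text
```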
Scripts for LoRA SFT are available in [trl](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py).
Check out the [SpaceVLMs collection](https://huggingface.co/collections/remyxai/spacevlms-66a3dbb924756d98e7aec678).
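
For reference, a LoRA setup along the lines of that script might look like the sketch below. The base checkpoint here is the transformers-format LLaVA-1.5 mirror (`llava-hf/llava-1.5-13b-hf`), and the rank, alpha, and target modules are illustrative assumptions rather than the exact SpaceLLaVA recipe.

```python
# Illustrative LoRA wrapping for LLaVA-1.5 SFT with peft; hyperparameters and
# target modules are assumptions, not the exact configuration used for SpaceLLaVA.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"  # transformers-format mirror of llava-v1.5-13b
processor = AutoProcessor.from_pretrained(model_id)  # tokenizes image/text pairs during training
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# From here, the linked trl sft_vlm.py example handles collating image/text pairs
# and running SFTTrainer over a dataset like the one loaded above.
```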
# Model Evaluation (Coming Soon)

**TODO:** VLMEvalKit evaluation on the QSpatial benchmark, VSR, etc.

Try it on Discord: http://discord.gg/b2yGuCNpuC

 |
|
|
|
# ⚠️ Limitations & Ethical Considerations

- Performance may degrade in cluttered environments or under unusual camera perspectives.
- This model was fine-tuned on synthetic reasoning data generated over an internet image dataset.
- Multimodal biases inherent to the base model (LLaVA) may persist.
- Not intended for use in safety-critical or legal decision-making.

> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.

## License and Citation

Licensed under Apache-2.0.

```bibtex
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@misc{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher = {NeurIPS},
  year      = {2023},
}
```