---
license: apache-2.0
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: image-to-text
tags:
- text-generation-inference
---

We are creating a spatially aware vision-language (VL) model. This model is trained on COCO dataset images enriched with extra information about the spatial relationships between the entities in each image. It is a sequence-to-sequence model for visual question answering. The architecture is BLIP (BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).
Requirements:
- 4GB GPU RAM
- CUDA-enabled Docker
The way to download and run this:

```python
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

# Welcome to the VOXReality Horizon Europe Project
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Specify the path to the directory where the model was saved
model_path = "voxeality/rgb-language_vqa"

# Load the model
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16)

question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
image_path = 'path/to/file'
image = Image.open(image_path).convert("RGB")

# Load the processor used during training for consistent preprocessing
processor = BlipProcessor.from_pretrained(model_path)

# Prepare inputs
encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)

# Generate and decode the answer
out = model.generate(**encoding, max_new_tokens=200)
generated_text = processor.decode(out[0], skip_special_tokens=True)
print(generated_text)
```

Below you'll find the necessary instructions for running our provided code. The instructions cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open-type) model. The model is trained to produce a spatial answer to any question about the spatial relationships between objects in the image. The output of this dialogue takes the following form: Q: Where is "Object1"? A: To the "left/right etc." of "Object2".

## 1. Requirements
---
1. CUDA-compatible GPU.
   1. We recommend at least 4GB of GPU memory.
   2. The code was tested on NVIDIA proprietary drivers 515 and 525.
2. For Linux (tested on Ubuntu 20.04).
   1. Make sure Docker is installed on your system.
   2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
3. For Windows (tested on Windows 10 and 11).
   1. Make sure Docker is installed on your system.
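For reference, here is a minimal sketch of asking several spatial questions in a loop. It reuses the `model`, `processor`, `image`, and `device` objects defined in the snippet above; the specific question wordings are only illustrative examples of the supported format.

```python
# Example spatial questions (illustrative; any "where is <object>" or
# "what is to the left/right/above/below/in front/behind <object>" question works)
questions = [
    "where is the cup?",
    "what is to the left of the chair?",
    "what is behind the person?",
]

for question in questions:
    # Preprocess the image and question, then generate and decode the answer
    encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**encoding, max_new_tokens=200)
    answer = processor.decode(out[0], skip_special_tokens=True)
    print(f"Q: {question} A: {answer}")
```

Each answer is expected to name a spatial relation and a second object, e.g. "to the left of the table".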