---
license: apache-2.0
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: image-to-text
tags:
- text-generation-inference
---

We are creating a spatially aware vision-language (VL) model. This model is trained on COCO dataset images enriched with extra information about the spatial relationships between the entities in each image. It is a sequence-to-sequence model for visual question answering. The architecture is BLIP (BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).
Requirements:
- 4GB GPU RAM
- CUDA-enabled Docker
The way to download and run this:

```python
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

# Welcome to the VOXReality Horizon Europe Project
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Specify the path to the directory where the model was saved
model_path = "voxeality/rgb-language_vqa"

# Load the model
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16)

question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
image_path = 'path/to/file'
image = Image.open(image_path).convert("RGB")

# Load the processor used during training for consistent preprocessing
processor = BlipProcessor.from_pretrained(model_path)

# Prepare inputs
encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)

# Generate and decode the answer
out = model.generate(**encoding, max_new_tokens=200)
generated_text = processor.decode(out[0], skip_special_tokens=True)
print(generated_text)
```

Below you'll find the necessary instructions for running our provided code. The instructions cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open-type) model. The model is trained to produce a spatial answer to any question about the spatial relationships between objects in the image. The output of this dialogue takes the following form: Q: Where is "Object1"? A: To the "left/right etc." of "Object2".

## 1. Requirements
---
1. CUDA-compatible GPU.
   1. We recommend at least 4GB of GPU memory.
   2. The code was tested on NVIDIA proprietary drivers 515 and 525.
2. For Linux (tested on Ubuntu 20.04).
   1. Make sure Docker is installed on your system.
   2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
3. For Windows (tested on Windows 10 and 11).
   1. Make sure Docker is installed on your system.
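For reference, here is a minimal sketch of asking several spatial questions in a loop. It reuses the `model`, `processor`, `image`, and `device` objects defined in the snippet above; the specific question wordings are only illustrative examples of the supported format.

```python
# Example spatial questions (illustrative; any "where is <object>" or
# "what is to the left/right/above/below/in front/behind <object>" question works)
questions = [
    "where is the cup?",
    "what is to the left of the chair?",
    "what is behind the person?",
]

for question in questions:
    # Preprocess the image and question, then generate and decode the answer
    encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**encoding, max_new_tokens=200)
    answer = processor.decode(out[0], skip_special_tokens=True)
    print(f"Q: {question} A: {answer}")
```

Each answer is expected to name a spatial relation and a second object, e.g. "to the left of the table".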