How much VRAM is required? I have an 8 GB RTX 3060 and it seems insufficient
Hey all
I am running this model according to the instructions, but it keeps saying I am running out of VRAM despite the 8 GB on my RTX 3060.
That seems abnormal, as the model is roughly 6 GB.
Can anyone help with this? It looks like I don't have enough VRAM to run the AWQ 7B model.
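For reference, here is a quick way to see how much VRAM is actually free before loading anything (just a diagnostic sketch, assuming a single GPU at index 0):
import torch

# Diagnostic sketch: report free vs. total VRAM on the first CUDA device.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free VRAM:  {free_bytes / 1024**3:.2f} GiB")
print(f"total VRAM: {total_bytes / 1024**3:.2f} GiB")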
Have you tried with the 3B model?
This is the 3B model repository here. I've tried on Colab with a T4 GPU and 16 GB of RAM. It fails too...
OutOfMemoryError: CUDA out of memory. Tried to allocate 12.20 GiB. GPU 0 has a total capacity of 14.74 GiB of which 6.24 GiB is free. Process 2446 has 8.49 GiB memory in use. Of the allocated memory 8.28 GiB is allocated by PyTorch, and 96.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
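The message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. If you want to try that in a notebook, a minimal sketch is to set it before torch initializes CUDA (it only helps with fragmentation, not with a model that simply doesn't fit):
import os

# Must be set before the first CUDA allocation; expandable_segments can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch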
There is something wrong in the snippet, and I'm trying to find out what.
Try reducing the image size a bit.
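If I read the Qwen2.5-VL model card correctly, you can also cap the image token budget on the processor instead of resizing the image yourself; a sketch (the pixel values here are just an illustration):
from transformers import AutoProcessor

# Assumption: min_pixels / max_pixels are accepted by the Qwen2.5-VL processor, as shown in the model card.
min_pixels = 256 * 28 * 28    # lower bound on image tokens
max_pixels = 1024 * 28 * 28   # upper bound, keeps the vision tokens (and VRAM) in check
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)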
OK... the problem with the given example snippet is that the image is HUGE.
This worked on Colab:
First, !pip install accelerate qwen_vl_utils
Then:
from PIL import Image
import requests

# Download the demo image, halve its resolution with reduce(2), and make sure it is RGB.
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw).reduce(2).convert('RGB')
image.save('demo.jpeg')
This saves the image at a smaller size, and I make sure the image is RGB.
Then:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.float16,  # I use float16, not bfloat16, but maybe it works with the default
    device_map="auto",
)
# optional: torch.compile can speed up inference; you can skip this line
model = torch.compile(model)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
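If you want to see how much of the VRAM is taken by the weights alone, you can print the model's memory footprint right after from_pretrained (a sketch; do it before torch.compile if you use it):
# Sketch: report how much memory the loaded weights occupy, in GiB.
print(f"model weights: {model.get_memory_footprint() / 1024**3:.2f} GiB")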
And finally:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file://demo.jpeg",  # <== the reduced image
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
This gives:
The image depicts a serene beach scene with a person and a dog sitting on the sand. The person is wearing a plaid shirt and black pants, and they appear to be smiling or laughing. The dog, which looks like a Labrador Retriever, is also sitting on the sand and is wearing a harness. The dog is extending its paw towards the person, possibly in a gesture of greeting or playfulness. The background shows the ocean with gentle waves lapping at the shore, and the sky is clear with a soft light suggesting either early morning or late afternoon. The overall atmosphere of the image is peaceful and joyful.
Try reducing the image size a bit.
Are we synchronized? :-) I just gave the same answer :-p