ValueError: not enough values to unpack (expected 5, got 4) when using image+text with TIGER-Lab/VLM2Vec-Full

#8
by NancyWangWXY

Hi TIGER Lab team and community 👋

I'm currently trying to run the TIGER-Lab/VLM2Vec-Full model both on Google Colab and in my local VS Code environment. I'm strictly following the example code provided on the Hugging Face model card as well as the GitHub repository instructions.

Everything works fine up to the point where I attempt to run inference on an image+text pair using:

from PIL import Image

# processor and model loaded as in the model card example
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

At this point, I consistently get the following error:
ValueError: not enough values to unpack (expected 5, got 4)

Digging into the source code, I found that the error originates in phi3_v/image_embedding_phi3_v.py, at this line:
num_images, num_crops, c, h, w = pixel_values.shape

Apparently, pixel_values is only 4D at that point (e.g., [1, 3, 336, 336]), whereas the model expects 5D input: [batch_size, num_crops, 3, 336, 336].

🧪 What I've Tried
I manually printed the shape of pixel_values returned by the processor: it is sometimes 5D, e.g. [1, 5, 3, 336, 336], but other times only 4D, depending on how the processor is called.
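Concretely (reusing the inputs dict from the snippet above):

print(inputs['pixel_values'].shape)
# torch.Size([1, 5, 3, 336, 336]) in some runs, torch.Size([1, 3, 336, 336]) in others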

I attempted to use .unsqueeze(1) to force it into 5D, but that doesn't consistently fix the issue and may be masking the real problem.
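The workaround looked roughly like this (just a sketch of what I tried; treating the missing dimension as num_crops=1 is my own guess, not something from the repo):

pv = inputs['pixel_values']
if pv.dim() == 4:
    # pad [num_images, C, H, W] to [num_images, 1, C, H, W] so the 5-way unpack no longer fails
    inputs['pixel_values'] = pv.unsqueeze(1)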

I noticed that the example code references a load_processor() function from src.utils, but this function is not present in the current codebase.

❓ My Questions
What is the correct way to use the processor to ensure 5D pixel_values are always returned when needed?

Is there a specific load_processor() function you recommend that handles this more reliably?

Should AutoProcessor.from_pretrained('TIGER-Lab/VLM2Vec-Full') be sufficient? Or does it lack the custom logic needed for phi3_v's multi-crop setup?

Are we supposed to configure num_crops=16 manually somewhere in the processor or preprocessor pipeline?
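For context, this is roughly how I'm loading the processor at the moment (a sketch of my current attempt; trust_remote_code=True and num_crops=16 are my own guesses, borrowed from the Phi-3.5-vision setup, not values the VLM2Vec model card specifies):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    'TIGER-Lab/VLM2Vec-Full',
    trust_remote_code=True,  # assumed necessary for the custom phi3_v processing code
    num_crops=16,            # assumed to match the multi-crop setup; please correct me
)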

I'm really excited about this model; it looks incredibly promising for multimodal tasks, and I would love to get it running end-to-end. Any guidance you could provide would be very much appreciated 🙏

Thanks in advance!