Error during inference with image and text.

#12
by aarbelle - opened

Running into the following error when trying inference with Image+Text

File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393

It doesn't happen for all images, just some.

Same error with:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="mps", 
    trust_remote_code=True, 
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

Error:

File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933

@aarbelle @EricB Did you find a solution to this problem yet? I am facing the same error...

nguyenbh changed discussion status to closed

@nguyenbh, I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.

output_imgs[-1].shape[1] comes from a concatenation: torch.cat([sub_img, self.glb_GN, glb_img], dim=1), where:

  • sub_img.shape[1] comes from concatenating (a) a tensor whose length is the product of useful_height and useful_width (16 * 12 = 192), and (b) temp_sub_GN (16), giving 208
  • self.glb_GN.shape[1] is 1
  • glb_img.shape[1] is 16 * 16 + 16 = 272

In my case (a small 448x448 image), these sum to 208 + 1 + 272 = 481. This looks theoretically correct to me.
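
The arithmetic above can be checked with a small sketch. The function name and its parameters are hypothetical, and it assumes that temp_sub_GN contributes one token per feature row (16 here) and that the global image is a 16 x 16 grid plus 16 separator tokens; none of this is copied from modeling_phi4mm.py itself.

```python
def expected_concat_len(useful_height: int, useful_width: int, glb_side: int = 16) -> int:
    """Length along dim=1 of torch.cat([sub_img, glb_GN, glb_img], dim=1),
    following the breakdown in this comment (an assumption, not the model code)."""
    # Sub-image features plus the temp_sub_GN separators (assumed: one per row).
    sub_len = useful_height * useful_width + useful_height
    # The single glb_GN separator token.
    glb_gn_len = 1
    # Global image features plus its separators (assumed: one per row).
    glb_len = glb_side * glb_side + glb_side
    return sub_len + glb_gn_len + glb_len


print(expected_concat_len(useful_height=16, useful_width=12))  # 481
```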

With the image attention mask on (default), temp_len is defined as:

temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction

In my case, the numbers for these are: 320 + 17 + 16 = 353.
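
Plugging the reported numbers into that expression makes the gap explicit. This is only a sketch: temp_len_from_mask is a hypothetical helper, mask_sum stands in for the image_attention_mask[...].sum() term, and glb_rows stands in for base_feat_height // base_feat_height_reduction (16 in my case).

```python
def temp_len_from_mask(mask_sum: int, useful_height: int, glb_rows: int) -> int:
    """Mirror of the temp_len expression quoted above, with plain integers."""
    return mask_sum + (useful_height + 1) + glb_rows


temp_len = temp_len_from_mask(mask_sum=320, useful_height=16, glb_rows=16)
concat_len = 481  # output_imgs[-1].shape[1] as computed earlier in this comment

# The assertion in forward() requires these two to be equal; here they differ by 128.
print(temp_len, concat_len, concat_len - temp_len)  # 353 481 128
```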

The whole code for Phi4MMImageEmbedding is rather impenetrable, but I don't see how those are supposed to be equal. The culprit seems to be the logic around temp_len calculation.

Even when I simply comment out the assertion statement, I get an error right below when I send a large image:

  File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
    image_hidden_states = self.image_embed(
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
    new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]

Interestingly, neither of the two values compared in the assertion, i.e. temp_len (1332) and output_imgs[-1].shape[1] (900), matches the expected size (1792) here!
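
To keep the three numbers straight, here is a rough sketch of the two checks that fail for this image. describe_mismatch is a hypothetical helper, not part of the model code; it assumes that 1792 is the number of positions selected by the index_put call, which is what the "indexing result of shape [1792, 3072]" part of the error suggests.

```python
def describe_mismatch(temp_len: int, concat_len: int, num_positions: int) -> list[str]:
    """Report which of the two length checks would fail for the given sizes."""
    problems = []
    # Check 1: the assertion on line 399 compares temp_len with the concat length.
    if temp_len != concat_len:
        problems.append(
            f"assertion fails: temp_len={temp_len} != output_imgs[-1].shape[1]={concat_len}"
        )
    # Check 2: index_put needs one feature row per selected position.
    if concat_len != num_positions:
        problems.append(
            f"index_put fails: {concat_len} feature rows for {num_positions} positions"
        )
    return problems


for line in describe_mismatch(temp_len=1332, concat_len=900, num_positions=1792):
    print(line)
```

With the values from the traceback above, both checks fail, so removing the assertion alone cannot be enough for this image.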

nguyenbh changed discussion status to open

@nguyenbh have you been able to reproduce this?

Hi, we tested your code on our side and could not reproduce the issue you reported. I'm not sure whether the issue is caused by mps; we use cuda and do not get this error.

When running on my MacBook with device = "cpu", I still get:

AssertionError: temp_len: 1536, output_imgs[-1].shape[1]: 1792

But if I comment out the assertion, things actually run fine.

It looks like perhaps the code is a bit too CUDA-specific.
