Error during inference with image and text.
Running into the following error when trying inference with image + text:

```
/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393
```
It doesn't happen for all images, just some.
Same error with:
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="mps",
    trust_remote_code=True,
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
Error:

```
File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933
```
@nguyenbh, I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.
`output_imgs[-1].shape[1]` comes from a concatenation, `torch.cat([sub_img, self.glb_GN, glb_img], dim=1)`, where:

- `sub_img.shape[1]` comes from a concatenation of something the size of (a) the `useful_height` * `useful_width` product (16 * 12 = 192) and (b) `temp_sub_GN` (16)
- `self.glb_GN.shape[1]` is 1
- `glb_img.shape[1]` is 16 * 16 + 16 = 272

Summing up (in my case, with a small 448x448 image) to: 208 + 1 + 272 = 481. This looks theoretically correct to me.
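To make that arithmetic concrete, here is the sum as a standalone sketch. The names mirror `modeling_phi4mm.py`, but the values are just the ones I observed for the 448x448 image, not a reimplementation of the embedding code:

```python
# Shape arithmetic for output_imgs[-1].shape[1], 448x448 image case.
useful_height, useful_width = 16, 12
sub_img_len = useful_height * useful_width + 16   # grid (192) + temp_sub_GN (16) = 208
glb_GN_len = 1                                    # self.glb_GN.shape[1]
glb_img_len = 16 * 16 + 16                        # glb_img.shape[1] = 272

print(sub_img_len + glb_GN_len + glb_img_len)     # 481
```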
With the image attention mask on (the default), `temp_len` is defined as:

```python
temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction
```

In my case, these three terms come to 320 + 17 + 16 = 353.
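Spelling that out with my observed values (a minimal sketch; `base_feat_term` is my shorthand for the `base_feat_height // base_feat_height_reduction` term, which came out to 16 in my run):

```python
# temp_len arithmetic with the values observed for the same 448x448 image.
mask_sum = 320        # int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item())
useful_height = 16
base_feat_term = 16   # base_feat_height // base_feat_height_reduction in my run

print(mask_sum + (useful_height + 1) + base_feat_term)   # 353, nowhere near the 481 above
```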
The whole code for `Phi4MMImageEmbedding` is rather impenetrable, but I don't see how those two quantities are supposed to be equal. The culprit seems to be the logic around the `temp_len` calculation.
Even when I simply comment out the assertion statement, I get another error just below it when I send a large image:
```
File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
    image_hidden_states = self.image_embed(
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
    new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]
```
Interestingly, neither of the variables from the earlier assertion, i.e. `temp_len` (1332) and `output_imgs[-1].shape[1]` (900), matches the expected shape (1792) here!
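For anyone curious why the `index_put` call blows up, here is a toy sketch of the failure mode with the shapes from my traceback (not the model code; `hidden_states` and `positions` are stand-ins): a boolean mask selecting 1792 positions cannot accept a value tensor with only 900 rows.

```python
import torch

# Toy reproduction of the index_put shape mismatch. The sequence reserves
# 1792 image-token positions, but only 900 feature rows were produced.
hidden_states = torch.zeros(1, 2048, 3072)
positions = torch.zeros(1, 2048, dtype=torch.bool)
positions[0, :1792] = True                    # 1792 selected slots

image_features = torch.randn(900, 3072)       # only 900 rows of features

# RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be
# broadcast to indexing result of shape [1792, 3072]
hidden_states.index_put((positions,), image_features)
```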
Hi, we tested your code on our side and we cannot reproduce the issue you reported. I'm not sure whether the issue is caused by MPS, since we use CUDA and do not get this error.
When running on my MacBook using `device = "cpu"`, I still get:

```
AssertionError: temp_len: 1536, output_imgs[-1].shape[1]: 1792
```
But if I comment out the assertion, things actually run fine.
It looks like perhaps the code is a bit too CUDA-specific.
I believe I fixed it, see PR: https://huggingface.co/microsoft/Phi-4-multimodal-instruct/discussions/45