Detected version 0.0.0. Error: FlashAttention2

#43
by the1Domo - opened

❌ Error: FlashAttention2 has been toggled on, but it cannot be used due to the following error: you need flash_attn package version to be greater or equal than 2.1.0. Detected version 0.0.0. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
Traceback (most recent call last):
  File "/home/vivian/test/phi4-minimal-test.py", line 78, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_path, **load_kwargs)
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4105, in from_pretrained
    config = cls._autoset_attn_implementation(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1525, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1664, in _check_and_enable_flash_attn_2
    raise ImportError(
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: you need flash_attn package version to be greater or equal than 2.1.0. Detected version 0.0.0. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

I am getting this error every single time I try to use this model. Has anyone else encountered this issue, and do you know how to fix it? I've tried installing several different versions of FlashAttention2.
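
For what it's worth, "Detected version 0.0.0" usually means the installed flash_attn package only reports a placeholder version (for example after a failed or partial source build), so transformers' version check rejects it even though the package looks installed. A quick diagnostic sketch, not specific to this model, to see what the environment actually reports:

import importlib.metadata

# What version does the package metadata claim? This is what transformers checks.
try:
    print("flash_attn metadata version:", importlib.metadata.version("flash_attn"))
except importlib.metadata.PackageNotFoundError:
    print("flash_attn is not installed")

# Can the module actually be imported, and what does it say about itself?
try:
    import flash_attn
    print("flash_attn imported, __version__ =", getattr(flash_attn, "__version__", "unknown"))
except ImportError as exc:
    print("flash_attn failed to import:", exc)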

Use Colab with an A100 GPU; it will work.

I want to run it on my own network. I have a Tesla P40, and I'm just trying to get it to run.

Microsoft org

@the1Domo We have not tested on the P40, but you can change the attention implementation to "eager" to see if inference works.

Example code in https://huggingface.co/microsoft/Phi-4-multimodal-instruct#loading-the-model-locally

from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"  # model from the linked card

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    # *** if you do not use Ampere or later GPUs, change attention to "eager" ***
    _attn_implementation='flash_attention_2',
).cuda()
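
A minimal sketch of that fallback, assuming the model path from the linked card: it picks the attention implementation from the GPU's compute capability (FlashAttention-2 needs Ampere, i.e. compute capability 8.x, or newer), so a Pascal card such as the Tesla P40 loads with "eager" and the flash_attn version check is never triggered:

import torch
from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"  # assumed: model from the linked card

# FlashAttention-2 requires an Ampere-or-newer GPU (compute capability 8.x or higher);
# older cards such as the Tesla P40 (Pascal, 6.1) fall back to "eager" attention.
major, _minor = torch.cuda.get_device_capability(0)
attn_impl = "flash_attention_2" if major >= 8 else "eager"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation=attn_impl,
).cuda()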