Detected version 0.0.0. Error: FlashAttention2

#43
by the1Domo - opened

❌ Error: FlashAttention2 has been toggled on, but it cannot be used due to the following error: you need flash_attn package version to be greater or equal than 2.1.0. Detected version 0.0.0. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
Traceback (most recent call last):
  File "/home/vivian/test/phi4-minimal-test.py", line 78, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_path, **load_kwargs)
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4105, in from_pretrained
    config = cls._autoset_attn_implementation(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1525, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/home/vivian/.pyenv/versions/3.9.21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1664, in _check_and_enable_flash_attn_2
    raise ImportError(
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: you need flash_attn package version to be greater or equal than 2.1.0. Detected version 0.0.0. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

I am getting this error every single time I try to use this model. Has anyone else encountered this issue, and do you know how to fix it? I've tried installing several different versions of FlashAttention2.
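
For what it's worth, "Detected version 0.0.0" usually means the installed flash_attn package only reports a placeholder version (for example after a failed or partial source build), so transformers' version check rejects it even though the package looks installed. A quick diagnostic sketch, not specific to this model, to see what the environment actually reports:

import importlib.metadata

# What version does the package metadata claim? This is what transformers checks.
try:
    print("flash_attn metadata version:", importlib.metadata.version("flash_attn"))
except importlib.metadata.PackageNotFoundError:
    print("flash_attn is not installed")

# Can the module actually be imported, and what does it say about itself?
try:
    import flash_attn
    print("flash_attn imported, __version__ =", getattr(flash_attn, "__version__", "unknown"))
except ImportError as exc:
    print("flash_attn failed to import:", exc)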

Use Colab with an A100 GPU; it will work.

I want to run it on my own network. I have a Tesla P40, and I'm just trying to get it to run.

Microsoft org

@the1Domo We have not tested on the P40, but you can change the attention implementation to "eager" to see if inference works.

Example code in https://huggingface.co/microsoft/Phi-4-multimodal-instruct#loading-the-model-locally

from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"  # model from the linked card

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    # *** if you do not use Ampere or later GPUs, change attention to "eager" ***
    _attn_implementation='flash_attention_2',
).cuda()
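
A minimal sketch of that fallback, assuming the model path from the linked card: it picks the attention implementation from the GPU's compute capability (FlashAttention-2 needs Ampere, i.e. compute capability 8.x, or newer), so a Pascal card such as the Tesla P40 loads with "eager" and the flash_attn version check is never triggered:

import torch
from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"  # assumed: model from the linked card

# FlashAttention-2 requires an Ampere-or-newer GPU (compute capability 8.x or higher);
# older cards such as the Tesla P40 (Pascal, 6.1) fall back to "eager" attention.
major, _minor = torch.cuda.get_device_capability(0)
attn_impl = "flash_attention_2" if major >= 8 else "eager"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation=attn_impl,
).cuda()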