Different number of attention heads makes rotary_ndims vs rope scaling factors wrong?

#1
by bartowski - opened

In configuration_phi3.py, it has:

```
rotary_ndims = int(self.hidden_size // self.num_attention_heads * self.partial_rotary_factor)
```

so rotary_ndims would be 3072 // 24 * 1.0 = 128

Then rope_scaling_short_factor is a list of length 48

It then raises an error if

```
len(rope_scaling_short_factor) != rotary_ndims // 2
```

and since 48 != 64, this is an error (and I get a similar one in llama.cpp)

The question is: is the number of heads incorrect? In both Phi-3 mini and Phi-3.5 mini, num_attention_heads is 32, which would give a rotary_ndims of 96; divided by 2, that gives the 48 we expect.
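
For clarity, here is the arithmetic above as a minimal Python sketch (values taken from the config; partial_rotary_factor is shown as the 1.0 that the computation above effectively uses):

```python
# Values from the Phi-4-mini config; partial_rotary_factor of 1.0 is what
# the computation above effectively uses.
hidden_size = 3072
num_attention_heads = 24
partial_rotary_factor = 1.0  # assumed in the arithmetic above

rotary_ndims = int(hidden_size // num_attention_heads * partial_rotary_factor)
print(rotary_ndims, rotary_ndims // 2)  # 128 64 -- but short_factor has 48 entries
```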

Any idea what's incorrect?

Thanks for your interest!

In the config, the rotary factor is 0.75.
Could you share how you are loading the config?

https://huggingface.co/microsoft/Phi-4-mini-instruct/blob/4b00ec8714b0cb224e4fb33380cbf0919f177f3e/config.json#L31
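
With the 0.75 factor applied, the numbers line up. A minimal sketch (same values as above, now with the factor from the linked config):

```python
# Same arithmetic as above, this time honoring partial_rotary_factor = 0.75
# from the linked Phi-4-mini-instruct config.
hidden_size = 3072
num_attention_heads = 24
partial_rotary_factor = 0.75

rotary_ndims = int(hidden_size // num_attention_heads * partial_rotary_factor)
print(rotary_ndims, rotary_ndims // 2)  # 96 48 -- matches the 48-entry short_factor
```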

I was attempting to quantize this to an 8-bit EXL2 quant this morning, which also failed for what I assume are similar reasons. Looks like it's missing the check for partial_rotary_factor. Very cool to see 128K context on Phi-4. Will work to get the associated infrastructure in place.

Maybe there is the same issue with SGLang?

When I run the following command:

```
python3 -m sglang.launch_server --model-path microsoft/Phi-4-mini-instruct --host 0.0.0.0 --port 30000 --dp 4 --enable-p2p-check --mem-fraction-static 0.95
```

I get this error:

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/phi3/configuration_phi3.py", line 159, in __init__
self._rope_scaling_validation()
File "/usr/local/lib/python3.10/dist-packages/transformers/models/phi3/configuration_phi3.py", line 208, in _rope_scaling_validation
raise ValueError(
ValueError: `rope_scaling`'s short_factor field must have length 64, got 48```

Yes, it seems likely that most of these tools are ignoring the 0.75 scaling. Thanks for pointing that out, @ykim362! Will investigate.


Same issue with vLLM, even with version 0.7.2 in OpenAI server mode.

Hi @leflak ,

Thanks for your interest!
We have already integrated it into vLLM, and it will be available from v0.7.3.
https://github.com/vllm-project/vllm/pull/12718

Thanks.

Getting the same error when GRPO training with Unsloth: `ValueError: rope_scaling's short_factor field must have length 64, got 48`

Same error when doing SFT with Hugging Face TRL:

ValueError: `rope_scaling`'s short_factor field must have length 64, got 48

This error is raised because the length of your rope_scaling dictionary’s short_factor list doesn’t match what the model configuration expects. In the validation method, the code calculates:

```
rotary_ndims = int(self.hidden_size // self.num_attention_heads * self.partial_rotary_factor)
```

Then it requires that the length of rope_scaling["short_factor"] be exactly rotary_ndims // 2. In your case, the error message indicates that it expected a length of 64, meaning rotary_ndims = 128 and 128 / 2 = 64. But your provided list has only 48 elements.

To resolve this issue, you have two options:

Update the rope_scaling dictionary:
Modify your rope_scaling["short_factor"] (and similarly the long_factor, if applicable) so that its length is 64, matching the computed expectation.

Adjust model parameters:
If the list of 48 elements is what you intend to use, then you’ll need to adjust your model’s configuration (for example, by changing hidden_size, num_attention_heads, or partial_rotary_factor) so that the computed value of rotary_ndims // 2 equals 48.

Review your model configuration settings and ensure that the dimensions in rope_scaling align with the derived value from your model parameters.
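
As a quick sanity check, here is a minimal sketch that compares the expected length against the actual list, reading values straight from a config.json (the file path is illustrative):

```python
import json

# Illustrative path -- point this at the model's actual config.json.
with open("config.json") as f:
    cfg = json.load(f)

factor = cfg.get("partial_rotary_factor", 1.0)
rotary_ndims = int(cfg["hidden_size"] // cfg["num_attention_heads"] * factor)
expected = rotary_ndims // 2
actual = len(cfg["rope_scaling"]["short_factor"])

# A mismatch here usually means the loading code is ignoring partial_rotary_factor.
print(f"expected {expected}, got {actual}")
```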

Is there a fix from Microsoft for that?

Microsoft org

Hi @legolasyiu .
Thanks for your interest!
Yes, support for the new model has already been added to the latest HF Transformers (v4.49.0) and vLLM (v0.7.3).

VLLM: https://github.com/vllm-project/vllm/pull/12718
HF: https://github.com/huggingface/transformers/pull/35947
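
For anyone still hitting the error, a minimal sketch to confirm the installed versions carry the fix (vLLM is optional):

```python
# Check that the installed libraries are new enough to include the fix.
import transformers
print("transformers", transformers.__version__)  # should be >= 4.49.0

try:
    import vllm
    print("vllm", vllm.__version__)  # should be >= 0.7.3
except ImportError:
    print("vllm not installed")
```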


Hi @ykim362, thanks for your reply! (I suggest updating the model card, which currently says it requires vllm>=0.7.2.)


Thanks. I am so glad you guys are fixing it.

Microsoft org

Thanks, @leflak .
Will update the model card to vllm v0.7.3.

nguyenbh changed discussion status to closed