Really good work
Hello. Just wanted to comment that this model works flawlessly on my 4090.
Just out of curiosity: did you do anything special during the quantization of the model? For some reason, casperhansen's version doesn't run nearly as well or as accurately as this one.
Thanks!
Hi!
Happy to hear that! I don't do anything special. I just use the code snippet from casper's README. Maybe it's the versions of the Python libs, the NVIDIA/CUDA drivers, or the GPU I use for quantization (currently an NVIDIA L40).
We could compare my setup with that of @casperhansen - I think he might also be interested.
For the software part:
accelerate 1.2.1
autoawq 0.2.7.post3
huggingface-hub 0.27.0
safetensors 0.4.5
tokenizers 0.21.0
torch 2.5.1
transformers 4.47.1
triton 3.1.0
Running on Ubuntu 22.04 LTS with:
nvidia-driver-570-open - 570.86.15-0ubuntu1
cuda-tools-12-6 - 12.6.3-1
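If you want to pin the same environment, something like this should reproduce the Python side (package names as on PyPI; the system packages come via apt as listed above):
pip install accelerate==1.2.1 autoawq==0.2.7.post3 huggingface-hub==0.27.0 safetensors==0.4.5 tokenizers==0.21.0 torch==2.5.1 transformers==4.47.1 triton==3.1.0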
quantize is called with quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
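For completeness, the snippet is essentially the basic example from the AutoAWQ README (the model and output paths here are just placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-24B-Instruct-2501"
quant_path = "Mistral-Small-24B-Instruct-2501-AWQ"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load the unquantized model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize with the config above and save the result
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)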
Best regards, cos
@divmgl
Can you please share the steps for how you run the model? When I try vLLM with:
vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --quantization awq
I get a runtime error.
@olegivaniv What error message do you get? Do you limit the MAX_MODEL_LEN?
For vLLM on my RTX 4090 I use:
--kv-cache-dtype fp8
--max-model-len 8192
to reduce the memory footprint. If you don't limit the MAX_MODEL_LEN, you will probably get:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (11296). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Memory usage is then:
INFO 02-02 09:20:34 worker.py:266] Memory profiling takes 2.86 seconds
INFO 02-02 09:20:34 worker.py:266] the current vLLM instance can use total_gpu_memory (23.53GiB) x gpu_memory_utilization (0.90) = 21.17GiB
INFO 02-02 09:20:34 worker.py:266] model weights take 13.30GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.73GiB; the rest of the memory reserved for KV Cache is 6.07GiB.
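Putting those flags together, the full invocation should look roughly like this (model name as in this repo; --quantization awq kept explicit as in your command):
vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --quantization awq --kv-cache-dtype fp8 --max-model-len 8192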
@stelterlab
That's very helpful, thank you! Seems like the issue was with using --config_format mistral --load_format mistral. Btw, are you using tool calling? None of my tools are getting picked up by the model, even with --tool-call-parser mistral --enable-auto-tool-choice.
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
That's the missing piece! Thanks
Hi @stelterlab,
I guess the above command would only work on mistralai/Mistral-Small-24B-Instruct-2501 because it has the tokenizer exported as tekken.json (which vLLM looks for). I don't think the command will work for your uploaded model, because you have not yet uploaded that tokenizer file.
I opened #3 to fix this.
Thanks in advance!
EDIT: PR was merged, and this should solve your issue with vLLM (it did for me).
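If you want to verify it yourself, a quick sketch with huggingface_hub (repo id as above) shows whether tekken.json is now in the repo:

from huggingface_hub import list_repo_files

# List the files of the AWQ repo and check for the Mistral tokenizer file
files = list_repo_files("stelterlab/Mistral-Small-24B-Instruct-2501-AWQ")
print("tekken.json" in files)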
I ran into this issue: raise ValueError("Only fast tokenizers are supported") when trying to do tool calling, using vllm/vllm-openai v0.8.4.
As I have never used tool calling before, I tried the example from the official documentation:
https://docs.mistral.ai/capabilities/function_calling/
https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/function_calling/function_calling.ipynb
I ran it locally with a few modifications, changing the default server URL to my local instance:
api_key = "dummy-api-key"
model = "stelterlab/Mistral-Small-24B-Instruct-2501-AWQ"
server_url = "http://127.0.0.1:8000"
client = Mistral(
server_url=server_url,
api_key=api_key
)
And I had to change tool_choice from "any" to "auto". The original call:

response = client.chat.complete(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="any",
)

became:

response = client.chat.complete(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
That did work, except for Step 4:
ERROR 04-20 01:35:54 [serving_chat.py:200] pydantic_core._pydantic_core.ValidationError: 1 validation error for ChatCompletionRequest
ERROR 04-20 01:35:54 [serving_chat.py:200] messages.1.assistant.tool_calls.0.index
ERROR 04-20 01:35:54 [serving_chat.py:200] Extra inputs are not permitted [type=extra_forbidden, input_value=0, input_type=int]
ERROR 04-20 01:35:54 [serving_chat.py:200] For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden
Without deeper knowledge, I would recommend opening a discussion in the vLLM repo; it seems to be a vLLM problem.
How do you use vLLM? As a container? I used the standard docker image vllm/vllm-openai:v0.8.4 for my testing.
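For reference, the container is started roughly like the standard run command from the vLLM docs (the cache mount and port are just my local setup; add the tool-calling flags from above as needed):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.8.4 \
  --model stelterlab/Mistral-Small-24B-Instruct-2501-AWQ \
  --max-model-len 8192 --kv-cache-dtype fp8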
@stelterlab
Thank you for posting your reply. I figured out what I was doing wrong: I had set tool_choice to "required". Apparently that value is for non-Mistral models; for Mistral, only "auto" and "any" are valid. "any" doesn't work with vLLM, though, so I will open an issue on vLLM for that.