Really good work
Hello. Just wanted to comment that this model works flawlessly on my 4090.
Just out of curiosity: did you do anything special during the quantization of the model? For some reason, casperhansen's version doesn't run nearly as well or as accurately as this one.
Thanks!
Hi!
Happy to hear that! I don't do anything special. I just use the code snippet from casper's README. Maybe it's the versions of the Python libs, the NVIDIA/CUDA drivers, or the GPU I use for quantization (currently an NVIDIA L40).
We could compare my setup with that of @casperhansen - I think he might also be interested.
For the software part:
accelerate 1.2.1
autoawq 0.2.7.post3
huggingface-hub 0.27.0
safetensors 0.4.5
tokenizers 0.21.0
torch 2.5.1
transformers 4.47.1
triton 3.1.0
Running on Ubuntu 22.04 LTS with:
nvidia-driver-570-open - 570.86.15-0ubuntu1
cuda-tools-12-6 - 12.6.3-1
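If you want to pin the same environment, something like this should reproduce the Python side (package names as on PyPI; the system packages come via apt as listed above):
pip install accelerate==1.2.1 autoawq==0.2.7.post3 huggingface-hub==0.27.0 safetensors==0.4.5 tokenizers==0.21.0 torch==2.5.1 transformers==4.47.1 triton==3.1.0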
quantize is called with quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
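For completeness, the snippet is essentially the basic example from the AutoAWQ README (the model and output paths here are just placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-24B-Instruct-2501"
quant_path = "Mistral-Small-24B-Instruct-2501-AWQ"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load the unquantized model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize with the config above and save the result
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)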
Best regards, cos
@divmgl
Can you please share the steps for how you run the model? When I try vLLM with:
vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --quantization awq
I get a runtime error.
@olegivaniv What error message do you get? Do you limit the MAX_MODEL_LEN?
For vLLM on my RTX 4090 I use:
--kv-cache-dtype fp8
--max-model-len 8192
to reduce the memory footprint. If you don't limit the MAX_MODEL_LEN, you will probably get:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (11296). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Memory usage is then:
INFO 02-02 09:20:34 worker.py:266] Memory profiling takes 2.86 seconds
INFO 02-02 09:20:34 worker.py:266] the current vLLM instance can use total_gpu_memory (23.53GiB) x gpu_memory_utilization (0.90) = 21.17GiB
INFO 02-02 09:20:34 worker.py:266] model weights take 13.30GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.73GiB; the rest of the memory reserved for KV Cache is 6.07GiB.
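Putting those flags together, the full invocation should look roughly like this (model name as in this repo; --quantization awq kept explicit as in your command):
vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --quantization awq --kv-cache-dtype fp8 --max-model-len 8192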
@stelterlab
That's very helpful, thank you! Seems like the issue was with using --config_format mistral --load_format mistral. Btw, are you using tool calling? None of my tools are getting picked up by the model, even with --tool-call-parser mistral --enable-auto-tool-choice.
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
That's the missing piece! Thanks
Hi @stelterlab,
I guess the above command would only work on mistralai/Mistral-Small-24B-Instruct-2501 because it has the tokenizer exported as tekken.json (which vLLM looks for). I don't think the command will work for your uploaded model, because you have not yet uploaded that tokenizer file.
I opened #3 to fix this.
Thanks in advance!
EDIT: PR was merged, and this should solve your issue with vLLM (it did for me).
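If you want to verify it yourself, a quick sketch with huggingface_hub (repo id as above) shows whether tekken.json is now in the repo:

from huggingface_hub import list_repo_files

# List the files of the AWQ repo and check for the Mistral tokenizer file
files = list_repo_files("stelterlab/Mistral-Small-24B-Instruct-2501-AWQ")
print("tekken.json" in files)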
I ran into this issue: raise ValueError("Only fast tokenizers are supported") when trying to do tool calling, using vllm/vllm-openai v0.8.4.
As I have never used tool calling before, I tried the example from the official documentation:
https://docs.mistral.ai/capabilities/function_calling/
https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/function_calling/function_calling.ipynb
I ran it locally with a few modifications, changing the default server URL to my local instance:
api_key = "dummy-api-key"
model = "stelterlab/Mistral-Small-24B-Instruct-2501-AWQ"
server_url = "http://127.0.0.1:8000"
client = Mistral(
server_url=server_url,
api_key=api_key
)
And I had to change tool_choice from "any" to "auto". The original call:

response = client.chat.complete(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="any",
)

became:

response = client.chat.complete(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
That did work, except for Step 4:
ERROR 04-20 01:35:54 [serving_chat.py:200] pydantic_core._pydantic_core.ValidationError: 1 validation error for ChatCompletionRequest
ERROR 04-20 01:35:54 [serving_chat.py:200] messages.1.assistant.tool_calls.0.index
ERROR 04-20 01:35:54 [serving_chat.py:200] Extra inputs are not permitted [type=extra_forbidden, input_value=0, input_type=int]
ERROR 04-20 01:35:54 [serving_chat.py:200] For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden
Without deeper knowledge, I would recommend opening a discussion in the vLLM repo; it seems to be a vLLM problem.
How do you use vLLM? As a container? I used the standard docker image vllm/vllm-openai:v0.8.4 for my testing.
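For reference, the container is started roughly like the standard run command from the vLLM docs (the cache mount and port are just my local setup; add the tool-calling flags from above as needed):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.8.4 \
  --model stelterlab/Mistral-Small-24B-Instruct-2501-AWQ \
  --max-model-len 8192 --kv-cache-dtype fp8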
@stelterlab
Thank you for posting your reply. I figured out what I was doing wrong: I had set tool_choice to "required". Apparently that value is for non-Mistral models; for Mistral, only "auto" and "any" are valid. "any" doesn't work with vLLM, though, so I will open an issue on vLLM for that.