Add tool calling template for HF format

#63
by Frrosta - opened

Using this template, one can serve the model in vLLM in the HF format and also use tool calling. For this to work, one first needs to save the Jinja template from here to its own file (for example by loading this JSON in Python and then dumping the content of the "chat_template" key to a new file, as sketched below) and then serve the model with the command:
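
A minimal sketch of the extraction step, assuming the template lives under the "chat_template" key of the repo's tokenizer_config.json (the exact file name is an assumption; the discussion only says "this json"):

```python
# Sketch: dump the chat template from the tokenizer config to its own file,
# so it can be passed to vLLM via --chat-template.
import json

# Assumption: the JSON file from this repo that holds the "chat_template" key.
with open("tokenizer_config.json") as f:
    config = json.load(f)

# Write the raw Jinja template to a standalone file.
with open("tool_call_template.jinja", "w") as f:
    f.write(config["chat_template"])
```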

vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --chat-template <path-to-jinja-template> --tool-call-parser mistral --enable-auto-tool-choice

When calling the server, one needs to set the sampling parameter skip_special_tokens to False (see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#id5) so that vLLM's mistral tool parser can correctly parse the tool calls.
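
For illustration, a minimal client-side sketch of such a request with the openai Python client; vLLM accepts extra sampling parameters like skip_special_tokens through extra_body, and the get_weather tool here is purely hypothetical:

```python
# Sketch: call the vLLM OpenAI-compatible server with tool calling enabled
# and skip_special_tokens=False passed as an extra sampling parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # vLLM forwards unknown fields in extra_body as sampling parameters.
    extra_body={"skip_special_tokens": False},
)
print(response.choices[0].message.tool_calls)
```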

I was only able to test this using the unsloth BnB quantized version of the model, as my GPU is too small, but I presume this should work here as well.

I tried setting skip_special_tokens to False but got the following error on vLLM:
skip_special_tokens=False is not supported for Mistral tokenizers.

If you use the mistral tokenizer, tool calling should work out of the box, as suggested in the example command in the model card:

vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --limit_mm_per_prompt 'image=10' --tensor-parallel-size 2

This chat template plus the suggested setting only applies when the model is loaded in the Hugging Face format with the default tokenizer. I also tried loading the mistral tokenizer with the Hugging Face model, but I ran into some issues there (I don't recall precisely what, though).

Worked for me using Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic. Thank you!
