What parameters should I use with vLLM?
Do I need to call apply_chat_template? I'm using the GGUF from https://hf-mirror.com/lmstudio-community/QwQ-32B-GGUF, and it doesn't seem possible to extract a tokenizer from it.
from vllm import LLM, SamplingParams

prompt_final = [{"role": "user", "content": "xxx"}]
tensor_parallel_size = 1
pipeline_parallel_size = 1
ckpt_path = "./QwQ-32B-Q4_K_M.gguf"
sampling_params = SamplingParams(temperature=0.6, max_tokens=1000)
batch_prompts = [prompt_final]
llm = LLM(model=ckpt_path,
          tensor_parallel_size=tensor_parallel_size,
          pipeline_parallel_size=pipeline_parallel_size,
          distributed_executor_backend="mp")
# llm.chat() applies the tokenizer's chat template itself
preds = llm.chat(batch_prompts, sampling_params)
for output in preds:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}\n")
Did you manage to get it deployed with vLLM?
The code above deploys fine with vLLM. The main problem now is that QwQ's CoT is very long compared to R1.
For some questions, R1-Distill-Qwen-32B finishes in around 500 tokens (CoT + response), while QwQ can spend over 1000 tokens on the CoT and still not be done thinking, which hurts performance significantly.
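One rough way to quantify this is to count tokens on each side of the closing think tag. A sketch, assuming the generated text follows QwQ's "thinking ... </think> answer" layout and that generated_text comes from the loop above:

from transformers import AutoTokenizer

# Assumption: tokenizer loaded from the base Qwen/QwQ-32B repo.
tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

def cot_stats(generated_text: str):
    # If </think> is missing, the CoT was cut off by max_tokens.
    if "</think>" in generated_text:
        cot, answer = generated_text.split("</think>", 1)
    else:
        cot, answer = generated_text, ""
    return len(tok.encode(cot)), len(tok.encode(answer))

cot_tokens, answer_tokens = cot_stats(generated_text)
print(f"CoT tokens: {cot_tokens}, answer tokens: {answer_tokens}")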
vllm serve ~/.cache/models--Qwen--QwQ-32B/snapshots/f28e641280ed3228b25df45b02ce6526b472cbea/ --tokenizer ~/Downloads/QwQ-32B/ --host 0.0.0.0 --port 21434 --tensor-parallel-size 4 --max-model-len 34576 --served-model-name qwq-32b
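Once that server is up, it can be queried through vLLM's OpenAI-compatible endpoint. A minimal sketch with the host/port and served model name taken from the command above (the api_key is a dummy value):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:21434/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwq-32b",   # must match --served-model-name
    messages=[{"role": "user", "content": "xxx"}],
    temperature=0.6,
    max_tokens=1000,
)
print(resp.choices[0].message.content)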
The official ModelScope community WeChat account has already published instructions for serving with vLLM and SGLang; take a look:
vllm serve /ModelPath/QwQ-32B --port 8000 --reasoning-parser deepseek_r1 --max_model_len 4096 --enable-auto-tool-choice --tool-call-parser hermes
python -m sglang.launch_server --model-path /ModelPath/QwQ-32B --port 3001 --host 0.0.0.0 --tool-call-parser qwen25
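With --reasoning-parser deepseek_r1 enabled, the OpenAI-compatible response separates the CoT from the final answer. A sketch against the vLLM command above (port 8000, dummy api_key; attribute access hedged in case the field is absent):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/ModelPath/QwQ-32B",
    messages=[{"role": "user", "content": "xxx"}],
)
msg = resp.choices[0].message
# The reasoning parser puts the CoT in reasoning_content and the
# final answer in content.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)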
The vLLM documentation mentions that structured output and tool calling are currently incompatible when reasoning parsing is used, and this matches what I observe in practice.