yangapku committed
Commit 9bfd3ac · verified · 1 Parent(s): 9886c1a

Update README.md

Files changed (1)
  1. README.md +3 -26
README.md CHANGED
@@ -95,7 +95,7 @@ print("thinking content:", thinking_content)
  print("content:", content)
  ```

- For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
+ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
  - SGLang:
  ```shell
  python -m sglang.launch_server --model-path Qwen/Qwen3-1.7B-FP8 --reasoning-parser qwen3
@@ -105,39 +105,16 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create
  vllm serve Qwen/Qwen3-1.7B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
  ```

- For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM have also supported Qwen3.
+ For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.

  ## Note on FP8

  For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.

- You can use the Qwen3-1.7B-FP8 model with several inference frameworks, including `transformers`, `vllm`, and `sglang`, as the original bfloat16 model.
+ You can use the Qwen3-1.7B-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, as the original bfloat16 model.
  However, please pay attention to the following known issues:
  - `transformers`:
    - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
- - vLLM:
-   - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-   ```python
-   # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-   ...
-   shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-   shard_size = self._get_shard_size_mapping(loaded_shard_id)
-
-   # add the following code
-   if isinstance(param, BlockQuantScaleParameter):
-       weight_block_size = self.quant_method.quant_config.weight_block_size
-       block_n, _ = weight_block_size[0], weight_block_size[1]
-       shard_offset = (shard_offset + block_n - 1) // block_n
-       shard_size = (shard_size + block_n - 1) // block_n
-   # end of the modification
-
-   param.load_qkv_weight(loaded_weight=loaded_weight,
-                         num_heads=self.num_kv_head_replicas,
-                         shard_id=loaded_shard_id,
-                         shard_offset=shard_offset,
-                         shard_size=shard_size)
-   ...
-   ```

  ## Switching Between Thinking and Non-Thinking Mode
 
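Once either server is running, it exposes an OpenAI-compatible endpoint. Below is a minimal sketch of querying it, assuming the `vllm serve` command above with its default port 8000 and no API key configured; adjust `base_url` (and the port) for SGLang or a custom `--port`.

```python
# Query the OpenAI-compatible endpoint started by `vllm serve` above.
# Assumptions: server on localhost:8000, no API key required.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```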
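The `quantization_config` field mentioned in the FP8 note can also be inspected programmatically. A minimal sketch, assuming `transformers` is installed and the checkpoint is reachable; the keys shown in the trailing comment illustrate a fine-grained `fp8` layout with block size 128 and are an assumption, not copied verbatim from `config.json`.

```python
# Inspect the fp8 quantization settings shipped in config.json.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-1.7B-FP8")
print(config.quantization_config)
# Illustrative shape of the output (keys assumed, not quoted from config.json):
# {"quant_method": "fp8", "fmt": "e4m3", "activation_scheme": "dynamic", "weight_block_size": [128, 128]}
```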
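If the `CUDA_LAUNCH_BLOCKING=1` workaround for multi-GPU `transformers` inference is needed, the variable must be set before the CUDA context is created. A minimal sketch, assuming a multi-GPU host and `device_map="auto"` sharding; exporting the variable in the shell before launching the script works equally well.

```python
# Set the workaround before torch initializes CUDA (i.e., before loading the model).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # shard across all visible GPUs
)
```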