yangapku committed on
Commit 93980c7 · verified · 1 Parent(s): 57c8978

Update README.md

Files changed (1)
  1. README.md +6 -29
README.md CHANGED
@@ -94,7 +94,7 @@ print("thinking content:", thinking_content)
 print("content:", content)
 ```
 
-For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
+For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
 - SGLang:
 ```shell
 python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-FP8 --reasoning-parser qwen3
@@ -104,39 +104,16 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create
 vllm serve Qwen/Qwen3-235B-A22B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
 ```
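Once either server above is running, the endpoint speaks the OpenAI-compatible chat completions API. A minimal client sketch, assuming vLLM's default local port 8000 (SGLang defaults to 30000) and a placeholder API key, since neither server requires a real key unless one is configured at launch:

```python
# Hedged sketch: assumes the server launched above listens on vLLM's default
# port 8000; adjust base_url for SGLang (default 30000) or a custom --port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)
```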
 
-For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM have also supported Qwen3.
+For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.
 
 ## Note on FP8
 
 For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
 
-You can use the Qwen3-235B-A22B-FP8 model with several inference frameworks, including `transformers`, `vllm`, and `sglang`, as the original bfloat16 model.
+You can use the Qwen3-235B-A22B-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, as the original bfloat16 model.
 However, please pay attention to the following known issues:
 - `transformers`:
   - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-  - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-```python
-# these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-...
-shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-shard_size = self._get_shard_size_mapping(loaded_shard_id)
-
-# add the following code
-if isinstance(param, BlockQuantScaleParameter):
-    weight_block_size = self.quant_method.quant_config.weight_block_size
-    block_n, _ = weight_block_size[0], weight_block_size[1]
-    shard_offset = (shard_offset + block_n - 1) // block_n
-    shard_size = (shard_size + block_n - 1) // block_n
-# end of the modification
-
-param.load_qkv_weight(loaded_weight=loaded_weight,
-                      num_heads=self.num_kv_head_replicas,
-                      shard_id=loaded_shard_id,
-                      shard_offset=shard_offset,
-                      shard_size=shard_size)
-...
-```
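The quantization details that the FP8 note points to can be read straight from the checkpoint, and the `transformers` workaround above is a single environment variable. A minimal sketch, assuming the `quantization_config` field is present in this checkpoint's `config.json` (its exact keys may vary by revision):

```python
import os

# Workaround from the known issue above: with the fine-grained fp8 method and
# more than one device, set this before any CUDA work is launched.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoConfig

# The fp8 settings live in the `quantization_config` field of config.json;
# printing it is expected to show the fine-grained scheme and the block size
# of 128 mentioned above (field names may differ between revisions).
config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-FP8")
print(config.quantization_config)
```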
 
 ## Switching Between Thinking and Non-Thinking Mode
 
@@ -310,7 +287,7 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers
 {
     ...,
     "rope_scaling": {
-        "type": "yarn",
+        "rope_type": "yarn",
         "factor": 4.0,
         "original_max_position_embeddings": 32768
     }
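The `rope_scaling` change above can also be applied at load time with `transformers` instead of editing `config.json` on disk. A hedged sketch; recent `transformers` releases read the `rope_type` key, while older ones expect `type`, so match the key to your installed version:

```python
# Hedged sketch: apply the YaRN settings shown above at load time rather than
# by modifying config.json.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-FP8")
config.rope_scaling = {
    "rope_type": "yarn",  # use "type" instead on older transformers releases
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-235B-A22B-FP8",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```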
@@ -322,12 +299,12 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers
 
 For `vllm`, you can use
 ```shell
-vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
+vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
 ```
 
 For `sglang`, you can use
 ```shell
-python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
+python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
 ```
 
 For `llama-server` from `llama.cpp`, you can use
 