pooja-ganesh committed on
Commit a92fd6a · verified · 1 Parent(s): fb367ec

Update README.md

Files changed (1)
  1. README.md +51 -74
README.md CHANGED
@@ -1,74 +1,51 @@
- ---
- language:
- - en
- pipeline_tag: text-generation
- base_model:
- - google/gemma-2-2b
- license: gemma
- ---
-
- # gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx
- ## Introduction
- This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
- ## Quantization Strategy
- - ***Quantized Layers***: All linear layers
- - ***Weight***: uint4 asymmetric per-group; group_size=32 for lm_head and group_size=128 for all other layers.
- ## Quick Start
- 1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
- 2. Run the quantization script in the example folder using the following command line:
- ```sh
- export MODEL_DIR=[local model checkpoint folder]  # or google/gemma-2-2b
- # single GPU
- python quantize_quark.py --model_dir $MODEL_DIR \
- --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
- --quant_scheme w_uint4_per_group_asym \
- --num_calib_data 128 \
- --quant_algo awq \
- --dataset pileval_for_awq_benchmark \
- --model_export hf_format \
- --group_size 128 \
- --group_size_per_layer lm_head 32 \
- --data_type float32 \
- --exclude_layers
- # cpu
- python quantize_quark.py --model_dir $MODEL_DIR \
- --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
- --quant_scheme w_uint4_per_group_asym \
- --num_calib_data 128 \
- --quant_algo awq \
- --dataset pileval_for_awq_benchmark \
- --model_export hf_format \
- --group_size 128 \
- --group_size_per_layer lm_head 32 \
- --data_type float32 \
- --exclude_layers \
- --device cpu
- ```
- ## Deployment
- Quark has its own export format, quark_safetensors, which is compatible with AutoAWQ exports.
- ## Evaluation
- Quark currently uses perplexity (PPL) as the evaluation metric for accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py (a rough sketch follows the scores table below).
- The quantization evaluation results are conducted in pseudo-quantization mode, which may differ slightly from the actual quantized inference accuracy. These results are provided for reference only.
- #### Evaluation scores
-
- | Benchmark | google/gemma-2-2b (float16) | amd/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx (this model) |
- |---|---|---|
- | Perplexity-wikitext2 | 64.41 | 71.43 (evaluated on CPU) |
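For reference, wikitext2 perplexity of the kind reported above is usually computed along the following lines. This is a minimal sketch with fixed 2048-token windows, not the exact algorithm in quantize_quark.py; the model id and window size are placeholders.

```python
# Minimal wikitext2 perplexity sketch (fixed 2048-token windows).
# The exact procedure lives in quantize_quark.py and may differ.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # or a local quantized checkpoint folder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
window, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - window, window):
        chunk = ids[:, i : i + window]
        # Passing labels=chunk makes the model return the shifted LM loss.
        nlls.append(model(chunk, labels=chunk).loss)
print("PPL:", torch.exp(torch.stack(nlls).mean()).item())
```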
-
- #### License
- Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
 
+ ---
+ language:
+ - en
+ pipeline_tag: text-generation
+ base_model:
+ - google/gemma-2-2b
+ license: gemma
+ ---
+
+ # gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx
+ ## Introduction
+ This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
+ ## Quantization Strategy
+ - ***Quantized Layers***: All linear layers
+ - ***Weight***: uint4 asymmetric per-group; group_size=32 for lm_head and group_size=128 for all other layers (see the sketch below).
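The strategy above can be read as: every group of 128 consecutive weights (32 for lm_head) shares one scale and one zero point, chosen so the group's min and max map onto the uint4 range 0..15. A minimal illustrative sketch of that arithmetic (not Quark's actual code; the helper name is made up):

```python
import torch

def quantize_per_group_asym_uint4(w: torch.Tensor, group_size: int = 128):
    """Illustrative asymmetric per-group uint4 quantization of a 2-D weight."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0           # uint4 range is 0..15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero_point, 0, 15)
    dequant = (q - zero_point) * scale                       # what inference effectively sees
    return q.to(torch.uint8), scale, zero_point, dequant.reshape(w.shape)

# e.g. a linear layer's weight, with in_features divisible by group_size
q, scale, zp, w_hat = quantize_per_group_asym_uint4(torch.randn(256, 1024), group_size=128)
```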
+ ## Quick Start
+ 1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
+ 2. Run the quantization script in the example folder using the following command line:
+ ```sh
+ export MODEL_DIR=[local model checkpoint folder]  # or google/gemma-2-2b
+ # single GPU
+ python quantize_quark.py --model_dir $MODEL_DIR \
+ --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
+ --quant_scheme w_uint4_per_group_asym \
+ --num_calib_data 128 \
+ --quant_algo awq \
+ --dataset pileval_for_awq_benchmark \
+ --model_export hf_format \
+ --group_size 128 \
+ --group_size_per_layer lm_head 32 \
+ --data_type float32 \
+ --exclude_layers
+ # cpu
+ python quantize_quark.py --model_dir $MODEL_DIR \
+ --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
+ --quant_scheme w_uint4_per_group_asym \
+ --num_calib_data 128 \
+ --quant_algo awq \
+ --dataset pileval_for_awq_benchmark \
+ --model_export hf_format \
+ --group_size 128 \
+ --group_size_per_layer lm_head 32 \
+ --data_type float32 \
+ --exclude_layers \
+ --device cpu
+ ```
+ ## Deployment
+ Quark has its own export format, quark_safetensors, which is compatible with AutoAWQ exports.
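As a rough illustration of that compatibility, an AutoAWQ-style load of the quark_safetensors folder produced by the Quick Start command might look like the sketch below; the local path is a placeholder, not a file in this repo.

```python
# Minimal sketch, assuming an AutoAWQ-compatible quark_safetensors export on disk.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "output_dir/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoAWQForCausalLM.from_quantized(quant_path)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```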
+
+ #### License
+ Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.