pooja-ganesh committed on
Commit a92fd6a · verified · 1 Parent(s): fb367ec

Update README.md

Files changed (1)
  1. README.md +51 -74
README.md CHANGED
@@ -1,74 +1,51 @@
- ---
- language:
- - en
- pipeline_tag: text-generation
- base_model:
- - google/gemma-2-2b
- license: gemma
- ---
-
- # gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx
- ## Introduction
- This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
- ## Quantization Strategy
- - ***Quantized Layers***: All linear layers
- - ***Weight***: uint4 asymmetric per-group; group_size=32 for lm_head and group_size=128 for all other layers.
- ## Quick Start
- 1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
- 2. Run the quantization script in the example folder using the following command line:
- ```sh
- export MODEL_DIR=[local model checkpoint folder]  # or google/gemma-2-2b
- # single GPU
- python quantize_quark.py --model_dir $MODEL_DIR \
- --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
- --quant_scheme w_uint4_per_group_asym \
- --num_calib_data 128 \
- --quant_algo awq \
- --dataset pileval_for_awq_benchmark \
- --model_export hf_format \
- --group_size 128 \
- --group_size_per_layer lm_head 32 \
- --data_type float32 \
- --exclude_layers
- # cpu
- python quantize_quark.py --model_dir $MODEL_DIR \
- --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
- --quant_scheme w_uint4_per_group_asym \
- --num_calib_data 128 \
- --quant_algo awq \
- --dataset pileval_for_awq_benchmark \
- --model_export hf_format \
- --group_size 128 \
- --group_size_per_layer lm_head 32 \
- --data_type float32 \
- --exclude_layers \
- --device cpu
- ```
- ## Deployment
- Quark has its own export format, quark_safetensors, which is compatible with AutoAWQ exports.
- ## Evaluation
- Quark currently uses perplexity (PPL) as the evaluation metric for accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py (a rough sketch follows the scores table below).
- The quantization evaluation results are conducted in pseudo-quantization mode, which may differ slightly from the actual quantized inference accuracy. These results are provided for reference only.
- #### Evaluation scores
-
- | Benchmark | google/gemma-2-2b (float16) | amd/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx (this model) |
- |---|---|---|
- | Perplexity-wikitext2 | 64.41 | 71.43 (evaluated on CPU) |
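For reference, wikitext2 perplexity of the kind reported above is usually computed along the following lines. This is a minimal sketch with fixed 2048-token windows, not the exact algorithm in quantize_quark.py; the model id and window size are placeholders.

```python
# Minimal wikitext2 perplexity sketch (fixed 2048-token windows).
# The exact procedure lives in quantize_quark.py and may differ.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # or a local quantized checkpoint folder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
window, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - window, window):
        chunk = ids[:, i : i + window]
        # Passing labels=chunk makes the model return the shifted LM loss.
        nlls.append(model(chunk, labels=chunk).loss)
print("PPL:", torch.exp(torch.stack(nlls).mean()).item())
```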
-
- #### License
- Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
 
+ ---
+ language:
+ - en
+ pipeline_tag: text-generation
+ base_model:
+ - google/gemma-2-2b
+ license: gemma
+ ---
+
+ # gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx
+ ## Introduction
+ This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
+ ## Quantization Strategy
+ - ***Quantized Layers***: All linear layers
+ - ***Weight***: uint4 asymmetric per-group; group_size=32 for lm_head and group_size=128 for all other layers (see the sketch below).
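The strategy above can be read as: every group of 128 consecutive weights (32 for lm_head) shares one scale and one zero point, chosen so the group's min and max map onto the uint4 range 0..15. A minimal illustrative sketch of that arithmetic (not Quark's actual code; the helper name is made up):

```python
import torch

def quantize_per_group_asym_uint4(w: torch.Tensor, group_size: int = 128):
    """Illustrative asymmetric per-group uint4 quantization of a 2-D weight."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0           # uint4 range is 0..15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero_point, 0, 15)
    dequant = (q - zero_point) * scale                       # what inference effectively sees
    return q.to(torch.uint8), scale, zero_point, dequant.reshape(w.shape)

# e.g. a linear layer's weight, with in_features divisible by group_size
q, scale, zp, w_hat = quantize_per_group_asym_uint4(torch.randn(256, 1024), group_size=128)
```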
+ ## Quick Start
+ 1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
+ 2. Run the quantization script in the example folder using the following command line:
+ ```sh
+ export MODEL_DIR=[local model checkpoint folder]  # or google/gemma-2-2b
+ # single GPU
+ python quantize_quark.py --model_dir $MODEL_DIR \
+ --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
+ --quant_scheme w_uint4_per_group_asym \
+ --num_calib_data 128 \
+ --quant_algo awq \
+ --dataset pileval_for_awq_benchmark \
+ --model_export hf_format \
+ --group_size 128 \
+ --group_size_per_layer lm_head 32 \
+ --data_type float32 \
+ --exclude_layers
+ # cpu
+ python quantize_quark.py --model_dir $MODEL_DIR \
+ --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
+ --quant_scheme w_uint4_per_group_asym \
+ --num_calib_data 128 \
+ --quant_algo awq \
+ --dataset pileval_for_awq_benchmark \
+ --model_export hf_format \
+ --group_size 128 \
+ --group_size_per_layer lm_head 32 \
+ --data_type float32 \
+ --exclude_layers \
+ --device cpu
+ ```
+ ## Deployment
+ Quark has its own export format, quark_safetensors, which is compatible with AutoAWQ exports.
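As a rough illustration of that compatibility, an AutoAWQ-style load of the quark_safetensors folder produced by the Quick Start command might look like the sketch below; the local path is a placeholder, not a file in this repo.

```python
# Minimal sketch, assuming an AutoAWQ-compatible quark_safetensors export on disk.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "output_dir/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoAWQForCausalLM.from_quantized(quant_path)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```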
+
+ #### License
+ Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.