Abstract
YAQA, an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence, reduces KL divergence to the original model and improves downstream performance for post-training quantization of large language models.
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
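To make the contrast between the layer-local objective and a Kronecker-factored Hessian concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a hypothetical factorization H ≈ H_out ⊗ H_in for a single linear layer with weight error delta = W_quantized - W_original, and all names (`H_out`, `H_in`, `delta`, the helper functions) and toy values are illustrative. Under that assumption the quadratic proxy loss vec(delta)^T H vec(delta) equals tr(delta^T H_out delta H_in), and the usual activation-error objective corresponds to the special case H_out = I.

```python
# Illustrative sketch only: H_out, H_in, delta and all helpers are
# hypothetical names, not taken from the paper or its code.

def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def kron(a, b):
    """Kronecker product of square matrices a (m x m) and b (n x n)."""
    m, n = len(a), len(b)
    return [[a[i][k] * b[j][l] for k in range(m) for l in range(n)]
            for i in range(m) for j in range(n)]

def proxy_loss_factored(delta, H_out, H_in):
    """tr(delta^T H_out delta H_in), computed without forming the full Hessian."""
    return trace(matmul(matmul(transpose(delta), H_out), matmul(delta, H_in)))

def proxy_loss_dense(delta, H_out, H_in):
    """vec(delta)^T (H_out kron H_in) vec(delta), using row-major vec."""
    v = [x for row in delta for x in row]
    H = kron(H_out, H_in)
    return sum(v[p] * H[p][q] * v[q]
               for p in range(len(v)) for q in range(len(v)))

# Toy 2 x 3 layer (2 outputs, 3 inputs) with made-up symmetric factors.
delta = [[0.1, -0.2, 0.0], [0.05, 0.0, -0.1]]
H_out = [[2.0, 0.5], [0.5, 1.0]]          # output-side factor (assumed)
H_in = [[1.0, 0.2, 0.0],
        [0.2, 1.5, 0.3],
        [0.0, 0.3, 0.8]]                  # input-side factor (assumed)
identity = [[1.0, 0.0], [0.0, 1.0]]

full = proxy_loss_factored(delta, H_out, H_in)      # uses both factors
local = proxy_loss_factored(delta, identity, H_in)  # layer-local special case
dense = proxy_loss_dense(delta, H_out, H_in)        # explicit Kronecker form

print(full, local, dense)
```

In this sketch the factored form stores m^2 + n^2 entries for an m x n layer instead of the (mn)^2 entries of a dense Hessian, which is what makes a Kronecker-factored approximation plausible at LLM scale. How the factors are actually estimated, and how the rounding algorithm uses them, is described in the paper and is not reproduced here.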
Community
A new quantization algorithm that directly minimizes the end-to-end KL divergence to the original model by using a better Hessian estimate than the one used in GPTQ and similar methods. This method achieves state-of-the-art downstream performance and reduces the KL divergence by about a third relative to GPTQ derivatives. It even outperforms Google's QAT on Gemma 3 in terms of KL divergence to the original model.