arxiv:2505.22988

Model-Preserving Adaptive Rounding

Published on May 29

Submitted by at676 on May 30


Authors: Albert Tseng, Zhaofeng Sun

Abstract

YAQA, an adaptive rounding algorithm using Kronecker-factored approximations of the full model's Hessian, reduces KL divergence and improves performance for post-training quantization of large language models.

AI-generated summary

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
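To make the Kronecker-factored idea concrete, here is a minimal NumPy sketch of a quadratic proxy loss of the form vec(ΔW)^T (H_out ⊗ H_in) vec(ΔW), evaluated without ever forming the full Hessian. The names `kron_proxy_loss`, `h_out`, and `h_in`, the random stand-in factors, and the toy rounding grid are assumptions made for illustration; this is not the paper's sketching or rounding procedure. The sketch also shows that the usual layer-local activation error is the special case where the output-side factor is the identity.

```python
import numpy as np


def kron_proxy_loss(delta_w, h_out, h_in):
    """Evaluate vec(dW)^T kron(h_out, h_in) vec(dW) without building the Kronecker product.

    delta_w has shape (m, n); h_out is (m, m); h_in is (n, n).
    Uses the row-major identity kron(A, B) @ vec(X) = vec(A @ X @ B.T).
    """
    return float(np.sum(delta_w * (h_out @ delta_w @ h_in.T)))


def random_psd(dim, rng):
    """Random symmetric positive semidefinite matrix (a stand-in for a Hessian factor)."""
    mat = rng.standard_normal((dim, dim))
    return mat @ mat.T / dim


rng = np.random.default_rng(0)
m, n = 4, 6                               # toy output and input dimensions
w = rng.standard_normal((m, n))           # original weights
w_q = np.round(w * 4) / 4                 # toy round-to-nearest on a 0.25 grid
delta = w_q - w                           # quantization error

# Hypothetical Kronecker factors; in practice these would be estimated from data.
h_out = random_psd(m, rng)
h_in = random_psd(n, rng)

fast = kron_proxy_loss(delta, h_out, h_in)
dense = float(delta.flatten() @ np.kron(h_out, h_in) @ delta.flatten())

# Special case: identity output factor and input factor X^T X give the
# layer-local activation error ||X dW^T||_F^2 for a layer computing y = x W^T.
x = rng.standard_normal((32, n))          # toy calibration activations
local = kron_proxy_loss(delta, np.eye(m), x.T @ x)
activation_error = float(np.linalg.norm(x @ delta.T) ** 2)

print(fast, dense, local, activation_error)
```

The factored form costs O(m²n + mn²) operations, whereas the dense Hessian has (mn)² entries, which is why a factorization of this kind is attractive for large layers.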


Community

at676 (paper author and submitter), 4 days ago

A new quantization algorithm that directly minimizes the end-to-end KL divergence to the original model by using a better Hessian estimate than the one typically used in GPTQ and similar methods. The method achieves state-of-the-art downstream performance and reduces the KL divergence relative to GPTQ derivatives by about a third. It even outperforms Google's QAT on Gemma 3 in terms of KL divergence to the original model.
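For readers who want to see what "KL to the original model" measures, the following is a minimal sketch that computes the mean per-token KL divergence from a reference model's logits to a quantized model's logits. The function names and the random toy logits are assumptions for illustration only; no real model outputs or results from the paper are used.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(ref_logits, quant_logits):
    """Mean over tokens of KL(p_ref || p_quant), with logits of shape (tokens, vocab)."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    per_token = (np.exp(ref_logp) * (ref_logp - quant_logp)).sum(axis=-1)
    return float(per_token.mean())


rng = np.random.default_rng(0)
ref_logits = rng.standard_normal((8, 50))                          # toy reference outputs
quant_logits = ref_logits + 0.1 * rng.standard_normal((8, 50))     # toy perturbed outputs

kl_same = mean_kl(ref_logits, ref_logits)
kl_perturbed = mean_kl(ref_logits, quant_logits)
print(kl_same, kl_perturbed)
```

A KL of zero means the two models assign identical next-token distributions, so a lower value indicates a quantized model that behaves more like the original.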

