arXiv:2501.12956

GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

Published on Jan 22, 2025
Abstract

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate that GANQ narrows the perplexity gap to the FP16 baseline relative to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
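To make the two ideas in the abstract concrete, below is a minimal, hypothetical PyTorch sketch: per-row non-uniform quantization against a small codebook, and a lookup-table-style mpGEMM that gathers FP16 code values by their stored low-bit indices. The function names (`nonuniform_quantize`, `lut_gemm`) and the Lloyd-style codebook fitting are illustrative assumptions only; they are not the paper's GPU-adaptive, training-free optimization algorithm or its fused kernel.

```python
import torch

def nonuniform_quantize(W: torch.Tensor, bits: int = 3):
    """Illustrative per-row non-uniform quantization.

    Each row of W gets its own codebook of 2^bits FP values, initialized
    from the row's quantiles (so code levels follow the non-uniform weight
    distribution) and refined with a few Lloyd iterations. GANQ's actual
    optimizer is different; this is only a stand-in for the concept.
    """
    K = 2 ** bits
    codebooks, indices = [], []
    for row in W:
        q = torch.linspace(0.0, 1.0, K)
        centers = torch.quantile(row, q)          # quantile-based init
        for _ in range(10):                       # Lloyd-style refinement
            idx = torch.argmin((row[:, None] - centers[None, :]).abs(), dim=1)
            for k in range(K):
                sel = row[idx == k]
                if sel.numel() > 0:
                    centers[k] = sel.mean()
        codebooks.append(centers)
        indices.append(idx)
    # codebooks: (out, K) FP values; indices: (out, in) low-bit codes
    return torch.stack(codebooks), torch.stack(indices)

def lut_gemm(x: torch.Tensor, codebooks: torch.Tensor, indices: torch.Tensor):
    """Lookup-table mpGEMM, simulated at tensor level.

    Gathers each weight's FP16 code value from the row codebook, then
    multiplies. A real LUT-based kernel fuses the gather into the GEMM so
    the dense dequantized W is never materialized in memory.
    """
    W_hat = torch.gather(codebooks, 1, indices)   # (out, in) reconstructed
    return x @ W_hat.T                            # (batch, out)

# Usage sketch:
# W = torch.randn(4096, 4096)
# cb, idx = nonuniform_quantize(W, bits=3)
# y = lut_gemm(torch.randn(8, 4096), cb, idx)
```

The key point this sketch illustrates is why LUT-based mpGEMM suits non-uniform codes: the matmul only ever touches 2^bits FP values per row, so the low-bit indices can stream straight from memory without a full dequantization pass, which is where the reported inference speedup comes from.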
