KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Abstract
KVzip, a query-agnostic KV cache eviction method for transformer-based LLMs, reduces KV cache size and decoding latency while maintaining performance across various tasks and models.
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, the KV cache expands, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method that enables effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of each KV pair by using the underlying LLM to reconstruct the original context from the cached pairs, then evicts pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss on question-answering, retrieval, reasoning, and code-comprehension tasks. Evaluations cover models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths of up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer performance degradation even at a 90% cache budget ratio under multi-query scenarios.
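The core idea, a reconstruction-based importance score followed by eviction of low-scoring KV pairs, can be illustrated with a minimal sketch. The snippet below uses toy PyTorch tensors rather than a real LLM; the function names, the max-over-reconstruction-queries scoring rule, and the 30% keep ratio are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of query-agnostic KV eviction via context reconstruction.
# Toy single-head tensors stand in for an LLM's cached keys/values and for the
# queries produced while the model re-generates (reconstructs) its own context.
import torch

def reconstruction_attention_scores(q_recon, k_cache, head_dim):
    """Importance of each cached KV pair, measured as the maximum attention it
    receives from any query of the reconstruction pass (query-agnostic signal).

    q_recon: (num_recon_queries, head_dim) queries from the reconstruction pass
    k_cache: (cache_len, head_dim) cached keys for one attention head
    returns: (cache_len,) importance score per cached KV pair
    """
    logits = q_recon @ k_cache.T / head_dim**0.5   # (Q, cache_len)
    attn = torch.softmax(logits, dim=-1)           # each row sums to 1
    return attn.max(dim=0).values                  # keep pairs some query needs

def evict(k_cache, v_cache, scores, keep_ratio=0.3):
    """Keep only the top `keep_ratio` fraction of KV pairs by importance."""
    keep = max(1, int(keep_ratio * k_cache.shape[0]))
    idx = torch.topk(scores, keep).indices.sort().values  # preserve positional order
    return k_cache[idx], v_cache[idx], idx

if __name__ == "__main__":
    torch.manual_seed(0)
    cache_len, head_dim = 1024, 64
    k = torch.randn(cache_len, head_dim)
    v = torch.randn(cache_len, head_dim)
    q_recon = torch.randn(256, head_dim)   # stand-in for reconstruction-pass queries

    scores = reconstruction_attention_scores(q_recon, k, head_dim)
    k_small, v_small, kept = evict(k, v, scores, keep_ratio=0.3)  # ~3.3x smaller cache
    print(f"kept {k_small.shape[0]} / {cache_len} KV pairs")
```

In an actual deployment, the reconstruction queries would come from prompting the model to repeat its own context, and scores would presumably be aggregated across heads and layers before applying a cache budget; the compressed cache is then reused unchanged for any subsequent query.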
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query (2025)
- TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization (2025)
- KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments (2025)
- KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference (2025)
- Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference (2025)
- CAOTE: KV Caching through Attention Output Error based Token Eviction (2025)
- LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important (2025)