	---
	base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
	base_model_relation: quantized
	tags:
	- dfloat11
	- df11
	- lossless compression
	- 70% size, 100% accuracy
	---

	## DFloat11 Compressed Model: `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`

	This is a losslessly compressed version of [`deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B), stored in our custom DFloat11 format. The compressed model produces outputs that are bit-for-bit identical to those of the original BFloat16 model while using approximately 30% less GPU memory.
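To put the approximately 30% figure in concrete terms, the weight footprint can be estimated with simple arithmetic. The sketch below assumes a round 32 billion parameters (read off the "32B" in the model name, not an exact count), 2 bytes per BFloat16 weight, and the roughly 70% size ratio quoted on this card. It counts weights only, so the KV cache and activations come on top.

```python
# Back-of-the-envelope weight memory estimate (illustrative assumptions, not measurements).
NUM_PARAMS = 32e9          # assumption: round figure from the "32B" in the model name
BF16_BYTES_PER_PARAM = 2   # BFloat16 stores each weight in 16 bits
DF11_SIZE_RATIO = 0.70     # "70% size" as stated on this card (approximate)


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_gb(NUM_PARAMS, BF16_BYTES_PER_PARAM)
df11_gb = bf16_gb * DF11_SIZE_RATIO

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"DF11 weights: ~{df11_gb:.0f} GB (saves ~{bf16_gb - df11_gb:.0f} GB)")
```

Actual usage depends on the exact parameter count, the achieved compression ratio, and runtime buffers, so treat the result as a rough planning number.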

	### 🔍 How It Works

	DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.
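The idea of entropy-coding the exponent field can be illustrated with a small CPU-only sketch. It is not the DFloat11 format or its GPU kernel: the weights are synthetic Gaussian samples rather than real model weights, the helper names `bf16_exponent` and `huffman_code_lengths` are hypothetical, and the sketch assumes the 1 sign bit and 7 mantissa bits are stored uncompressed. It only shows why a skewed exponent distribution lets a Huffman code use fewer than 8 bits per exponent on average.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(value: float) -> int:
    """Return the 8-bit exponent field of `value` after truncation to BFloat16."""
    bits32 = struct.unpack("<I", struct.pack("<f", value))[0]
    bits16 = bits32 >> 16        # BF16 keeps the top 16 bits of float32 (truncation)
    return (bits16 >> 7) & 0xFF  # layout: 1 sign | 8 exponent | 7 mantissa


def huffman_code_lengths(freqs: Counter) -> dict:
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries are (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Synthetic stand-in for model weights (assumption: zero-mean Gaussian, std 0.02).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]

freqs = Counter(bf16_exponent(w) for w in weights)
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())

avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / total
entropy = -sum((c / total) * math.log2(c / total) for c in freqs.values())
bits_per_weight = 1 + 7 + avg_exp_bits  # sign + mantissa + coded exponent

print(f"distinct exponents: {len(freqs)}")
print(f"exponent entropy:   {entropy:.2f} bits")
print(f"Huffman average:    {avg_exp_bits:.2f} bits (vs. 8 uncompressed)")
print(f"bits per weight:    {bits_per_weight:.2f} (vs. 16 for BF16)")
```

Because Huffman coding is reversible, decoding the exponent and recombining it with the stored sign and mantissa recovers every original 16-bit value exactly, which is what makes the scheme lossless.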

	Key benefits:

	* No CPU decompression or host-device data transfer -- all operations are handled entirely on the GPU.
	* Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
	* DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
	* At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
	* The compression is fully lossless, guaranteeing that the model’s outputs are bit-for-bit identical to those of the original model.
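The batch-size behaviour described in the list above follows from treating decompression as a fixed cost added to each forward pass. The toy model below makes that arithmetic explicit. Every timing in it is a hypothetical placeholder rather than a benchmark result, and the function name `df11_slowdown` is introduced here for illustration only; the fixed cost is chosen so that batch size 1 comes out 2× slower, matching the figure quoted above.

```python
# Toy latency model: DF11 step time = BF16 step time + a fixed decompression cost.
# All timings below are hypothetical placeholders, not benchmark results.


def df11_slowdown(bf16_step_ms: float, decompress_ms: float) -> float:
    """Ratio of DF11 step time to BF16 step time under the fixed-overhead model."""
    return (bf16_step_ms + decompress_ms) / bf16_step_ms


# Hypothetical BF16 step times per batch size (ms); replace with your own measurements.
bf16_step_ms_by_batch = {1: 30.0, 8: 36.0, 32: 60.0, 128: 150.0}

# Fixed cost chosen so that batch size 1 is exactly 2x slower than BF16.
decompress_ms = bf16_step_ms_by_batch[1]

slowdowns = {
    batch: df11_slowdown(step_ms, decompress_ms)
    for batch, step_ms in bf16_step_ms_by_batch.items()
}

for batch, ratio in slowdowns.items():
    print(f"batch={batch:>3}: DF11 step takes {ratio:.2f}x the BF16 step time")
```

As the BF16 step time grows with batch size, the same fixed cost becomes a smaller fraction of the total, so the ratio moves toward 1.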

	### 🔧 How to Use

	1. Install the DFloat11 pip package (the CUDA kernel is installed automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

	```bash
	pip install "dfloat11[cuda12]"
	# or, if you have CUDA 11 (the quotes keep shells such as zsh from expanding the brackets):
	# pip install "dfloat11[cuda11]"
	```

	2. To use the DFloat11 model, run the following example code in Python:

	```python
	import torch
	from dfloat11 import DFloat11Model
	from transformers import AutoTokenizer

	model_id = "DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11"

	model = DFloat11Model.from_pretrained(model_id, device_map="auto")

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	tokenizer.pad_token = tokenizer.eos_token

	prompt = "Question: What is a binary tree and its applications? Answer:"
	inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

	# Sampling is enabled, so the generated text varies between runs.
	with torch.no_grad():
	    output = model.generate(
	        **inputs,
	        max_new_tokens=256,
	        do_sample=True,
	    )

	print(tokenizer.batch_decode(output, skip_special_tokens=True))
	```

	### 📄 Learn More

	* Paper: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
	* GitHub: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
	* HuggingFace: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)