---
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

## DFloat11 Compressed Model: `Qwen/Qwen3-4B`

This is a **losslessly compressed** version of [`Qwen/Qwen3-4B`](https://huggingface.co/Qwen/Qwen3-4B) in our custom **DFloat11** format. Its outputs are **bit-for-bit identical** to those of the original BFloat16 model, while GPU memory consumption is reduced by approximately **30%**.

### 🔍 How It Works

DFloat11 compresses model weights using **Huffman coding** of the BFloat16 exponent bits, combined with **hardware-aware algorithmic designs** that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are **decompressed just before matrix multiplications**, then **immediately discarded after use** to minimize the memory footprint.

Key benefits:

* **No CPU decompression or host-device data transfer**: all operations are handled entirely on the GPU.
* **Decompression overhead is constant** per forward pass and **independent of batch size**, making DFloat11 increasingly efficient at larger batch sizes.
* DFloat11 is **much faster than CPU-offloading approaches**, enabling practical deployment in memory-constrained environments.
* At **batch size = 1**, inference is approximately **2× slower** than with the original BF16 model, but the gap **narrows significantly** at larger batch sizes.
* The compression is **fully lossless**, guaranteeing that the model's outputs are **bit-for-bit identical** to those of the original model.
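
As a rough illustration of why the exponent bits compress so well, the sketch below Huffman-codes the 8-bit exponent field of a tensor of BFloat16 values. It is illustrative only: random values stand in for real model weights, and a plain Python Huffman coder stands in for DFloat11's GPU kernel.

```python
import heapq
from collections import Counter

import torch

# Random values stand in for real model weights (illustrative only).
weights = torch.randn(1_000_000, dtype=torch.bfloat16)

# Reinterpret each BFloat16 as 16 raw bits and slice out the exponent:
# bit 15 is the sign, bits 14..7 the exponent, bits 6..0 the mantissa.
bits = weights.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = ((bits >> 7) & 0xFF).tolist()

# Build Huffman code lengths for the exponent symbols.
counts = Counter(exponents)
depth = {sym: 0 for sym in counts}
heap = [(c, i, [sym]) for i, (sym, c) in enumerate(counts.items())]
heapq.heapify(heap)
tie = len(heap)
while len(heap) > 1:
    c1, _, syms1 = heapq.heappop(heap)
    c2, _, syms2 = heapq.heappop(heap)
    for sym in syms1 + syms2:
        depth[sym] += 1  # each merge adds one bit to every contained symbol
    heapq.heappush(heap, (c1 + c2, tie, syms1 + syms2))
    tie += 1

n = len(exponents)
avg_bits = sum(counts[s] * depth[s] for s in counts) / n
# Sign (1 bit) and mantissa (7 bits) are stored verbatim; only the
# 8 exponent bits are entropy-coded.
print(f"average exponent code length: {avg_bits:.2f} bits (vs. 8 raw)")
print(f"estimated size: {(1 + 7 + avg_bits) / 16:.0%} of BFloat16")
```

Because the exponent distribution is heavily skewed, the average code length lands well below 8 bits; real LLM weights show a similar skew, which is what yields the roughly 70% size reported in the paper.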

### 🔧 How to Use

1. Install the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

```bash
pip install dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install dfloat11[cuda11]
```
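
You can optionally verify that PyTorch detects a CUDA-compatible GPU before loading the model; this is a plain PyTorch check, not part of the DFloat11 API:

```python
import torch

# DFloat11 decompression runs entirely on the GPU, so a CUDA device is required.
assert torch.cuda.is_available(), "No CUDA-compatible GPU detected"
print(torch.cuda.get_device_name(0))
```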

2. To use the DFloat11 model, run the following example code in Python:

```python
import torch
from dfloat11 import DFloat11Model
from transformers import AutoTokenizer

model_id = "DFloat11/Qwen3-4B-DF11"

# Load the losslessly compressed model; weights stay compressed in GPU
# memory and are decompressed on the fly during inference.
model = DFloat11Model.from_pretrained(model_id, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

prompt = "Question: What is a binary tree and its applications? Answer:"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
    )

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
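
To check the memory savings on your own hardware, you can read PyTorch's standard allocator statistics after loading the model (not a DFloat11-specific API; with `device_map="auto"` this reports the default CUDA device only):

```python
# Peak GPU memory allocated on the current CUDA device since startup.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory allocated: {peak_gb:.2f} GB")
```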

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)