# DFloat11 Compressed Model: google/gemma-3-12b-it
This is a losslessly compressed version of google/gemma-3-12b-it
using our custom DFloat11 format. The model retains 100% output fidelity: its outputs are bit-for-bit identical to those of the original BF16 model, while GPU memory consumption is reduced by approximately 30%.
## How It Works
DFloat11 compresses model weights using Huffman coding of exponent bits, paired with hardware-aware algorithmic designs for fast on-GPU decompression. During inference, the model weights remain compressed in GPU memory and are decompressed on the fly by a custom CUDA kernel just before each matrix multiplication.
This means:
- Only weights are compressed; the architecture, tokenizer, and activations remain unchanged.
- Decompression is constant-time per forward pass and does not scale with batch size, so its relative overhead shrinks and throughput improves at larger batch sizes.
- DFloat11 is much faster than CPU-offloading, enabling more efficient inference under resource constraints.
- Inference at batch size 1 is ~2× slower than the original BF16 model on GPU, but the gap narrows significantly at larger batch sizes.
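For intuition about where the savings come from, here is a minimal Python sketch (not the DFloat11 kernel or its actual code) that Huffman-codes the exponent bits of a BF16 weight tensor to estimate the achievable bits per weight. The tensor and variable names are illustrative assumptions, not part of this model.

```python
# Minimal sketch: estimate lossless compression from Huffman-coding BF16 exponents.
# This is an illustration only; it is NOT the DFloat11 implementation.
import heapq
from collections import Counter

import torch

# Hypothetical stand-in for one weight matrix.
weights = torch.randn(1024, 1024, dtype=torch.bfloat16)

# View the BF16 payload as raw 16-bit patterns: 1 sign | 8 exponent | 7 mantissa bits.
bits = weights.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = ((bits >> 7) & 0xFF).flatten().tolist()

# Compute Huffman code lengths over the observed exponent values.
freq = Counter(exponents)
heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
heapq.heapify(heap)
code_len = {sym: 0 for sym in freq}
tiebreak = len(freq)
while len(heap) > 1:
    c1, _, syms1 = heapq.heappop(heap)
    c2, _, syms2 = heapq.heappop(heap)
    for sym in syms1 + syms2:
        code_len[sym] += 1  # each merge adds one bit to these symbols' codes
    heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
    tiebreak += 1

# Average bits per weight = Huffman-coded exponent + 1 sign bit + 7 mantissa bits.
total = sum(freq.values())
avg_exp_bits = sum(freq[s] * code_len[s] for s in freq) / total
print(f"average bits per weight: {avg_exp_bits + 8:.2f} (vs. 16 for BF16)")
```

Because the exponents of typical model weights concentrate on a small set of values, the entropy-coded width lands around 11 bits per weight, which is consistent with the roughly 30% memory reduction noted above.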
## How to Use
Install the DFloat11 pip package (this installs the CUDA kernel automatically; choose the extra that matches your CUDA version):
```bash
pip install dfloat11[cuda12]  # or: pip install dfloat11[cuda11] for CUDA 11
```
To use the DFloat11 model, run the following example code in Python:
```python
import torch
from dfloat11 import DFloat11Model
from transformers import AutoTokenizer

model_id = "DFloat11/gemma-3-12b-it-DF11"

# Load the DFloat11-compressed weights directly onto the GPU.
model = DFloat11Model.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

prompt = "What is a binary tree?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

# Generation works exactly as with a regular transformers model.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
    )

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
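To check the memory savings on your own hardware, you can read the peak allocated GPU memory after generation. This uses only standard PyTorch APIs, nothing DFloat11-specific:

```python
# Peak GPU memory allocated during generation (standard PyTorch API).
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gib:.2f} GiB")
```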
## Learn More