# DFloat11 Compressed Model: google/gemma-3-12b-it
This is a losslessly compressed version of google/gemma-3-12b-it
using our custom DFloat11 format. The model retains 100% output fidelity: its outputs are bit-for-bit identical to those of the original BF16 model, while GPU memory consumption is reduced by approximately 30%.
## How It Works
DFloat11 compresses model weights using Huffman coding of exponent bits, paired with hardware-aware algorithmic designs for fast on-GPU decompression. During inference, the model weights remain compressed in GPU memory and are decompressed on the fly by a custom CUDA kernel just before each matrix multiplication.
This means:
- Only weights are compressed; the architecture, tokenizer, and activations remain unchanged.
- Decompression is constant-time per forward pass and does not scale with batch size, so its relative overhead shrinks and throughput improves at larger batch sizes.
- DFloat11 is much faster than CPU-offloading, enabling more efficient inference under resource constraints.
- Inference at batch size 1 is ~2× slower than the original BF16 model on GPU, but the gap narrows significantly at larger batch sizes.
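For intuition about where the savings come from, here is a minimal Python sketch (not the DFloat11 kernel or its actual code) that Huffman-codes the exponent bits of a BF16 weight tensor to estimate the achievable bits per weight. The tensor and variable names are illustrative assumptions, not part of this model.

```python
# Minimal sketch: estimate lossless compression from Huffman-coding BF16 exponents.
# This is an illustration only; it is NOT the DFloat11 implementation.
import heapq
from collections import Counter

import torch

# Hypothetical stand-in for one weight matrix.
weights = torch.randn(1024, 1024, dtype=torch.bfloat16)

# View the BF16 payload as raw 16-bit patterns: 1 sign | 8 exponent | 7 mantissa bits.
bits = weights.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = ((bits >> 7) & 0xFF).flatten().tolist()

# Compute Huffman code lengths over the observed exponent values.
freq = Counter(exponents)
heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
heapq.heapify(heap)
code_len = {sym: 0 for sym in freq}
tiebreak = len(freq)
while len(heap) > 1:
    c1, _, syms1 = heapq.heappop(heap)
    c2, _, syms2 = heapq.heappop(heap)
    for sym in syms1 + syms2:
        code_len[sym] += 1  # each merge adds one bit to these symbols' codes
    heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
    tiebreak += 1

# Average bits per weight = Huffman-coded exponent + 1 sign bit + 7 mantissa bits.
total = sum(freq.values())
avg_exp_bits = sum(freq[s] * code_len[s] for s in freq) / total
print(f"average bits per weight: {avg_exp_bits + 8:.2f} (vs. 16 for BF16)")
```

Because the exponents of typical model weights concentrate on a small set of values, the entropy-coded width lands around 11 bits per weight, which is consistent with the roughly 30% memory reduction noted above.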
## How to Use
Install the DFloat11 pip package (this installs the CUDA kernel automatically; choose the extra that matches your CUDA version):
```bash
pip install dfloat11[cuda12]  # or: pip install dfloat11[cuda11] for CUDA 11
```
To use the DFloat11 model, run the following example code in Python:
```python
import torch
from dfloat11 import DFloat11Model
from transformers import AutoTokenizer

model_id = "DFloat11/gemma-3-12b-it-DF11"

# Load the DFloat11-compressed weights directly onto the GPU.
model = DFloat11Model.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

prompt = "What is a binary tree?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

# Generation works exactly as with a regular transformers model.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
    )

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
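To check the memory savings on your own hardware, you can read the peak allocated GPU memory after generation. This uses only standard PyTorch APIs, nothing DFloat11-specific:

```python
# Peak GPU memory allocated during generation (standard PyTorch API).
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gib:.2f} GiB")
```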
## Learn More