DFloat11 Compressed Model: google/gemma-3-12b-it

This is a losslessly compressed version of google/gemma-3-12b-it using our custom DFloat11 format. The model retains 100% output fidelity: its outputs are bit-for-bit identical to the original BF16 model, while reducing GPU memory consumption by approximately 30%.
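
For a rough sense of where the ~30% figure comes from, here is a back-of-envelope sketch. The 11-bits-per-weight figure below is an assumption suggested by the format's name, not a measurement from this checkpoint:

    # Back-of-envelope memory estimate (assumes ~11 bits/weight, as the DFloat11
    # name suggests; actual savings depend on the checkpoint's exponent statistics).
    params = 12e9                       # gemma-3-12b-it parameter count, roughly
    bf16_gb = params * 16 / 8 / 1e9     # ~24 GB of weights in BF16
    df11_gb = params * 11 / 8 / 1e9     # ~16.5 GB of weights in DFloat11
    print(f"{bf16_gb:.1f} GB -> {df11_gb:.1f} GB ({1 - df11_gb / bf16_gb:.0%} smaller)")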

🔍 How It Works

DFloat11 compresses model weights using Huffman coding of exponent bits, paired with hardware-aware algorithmic designs for fast on-GPU decompression. During inference, the model weights remain compressed in GPU memory and are decompressed on-the-fly by a custom CUDA kernel for matrix multiplications.

This means:

  • Only weights are compressed; the architecture, tokenizer, and activations remain unchanged.
  • Decompression adds a fixed cost per forward pass that does not grow with the batch size, so its relative overhead shrinks as the batch gets larger.
  • DFloat11 is much faster than CPU offloading, enabling more efficient inference under tight GPU-memory budgets.
  • At batch size 1, inference is roughly 2× slower than the original BF16 model on GPU, but the gap narrows significantly at larger batch sizes.
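
To make the mechanism above concrete, the toy Python sketch below Huffman-codes the 8 exponent bits of a BF16 tensor and estimates the resulting bits per weight. It is only an illustration of the compression idea, not the DFloat11 on-GPU format or its CUDA decompression kernel:

    import heapq
    from collections import Counter

    import torch

    def huffman_code_lengths(freqs):
        # Standard Huffman construction; returns the code length of each symbol.
        heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in freqs}
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, syms1 = heapq.heappop(heap)
            f2, _, syms2 = heapq.heappop(heap)
            for s in syms1 + syms2:   # every symbol in the merged subtrees moves one level deeper
                lengths[s] += 1
            heapq.heappush(heap, (f1 + f2, tiebreak, syms1 + syms2))
            tiebreak += 1
        return lengths

    # Stand-in tensor; real LLM weights show similarly skewed exponent distributions,
    # which is what the ~30% saving above relies on.
    weights = torch.randn(1_000_000).to(torch.bfloat16)
    raw = weights.view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = ((raw >> 7) & 0xFF).tolist()   # the 8 exponent bits of each BF16 value

    freqs = Counter(exponents)
    lengths = huffman_code_lengths(freqs)
    avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(exponents)

    # Sign (1 bit) and mantissa (7 bits) stay uncompressed; only exponents are entropy-coded.
    print(f"average exponent code length: {avg_exp_bits:.2f} bits")
    print(f"effective bits per weight:    {8 + avg_exp_bits:.2f} (vs. 16 for BF16)")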

🔧 How to Use

  1. Install the DFloat11 pip package (this installs the CUDA kernel automatically; choose the extra that matches your CUDA version):

    pip install dfloat11[cuda12]  # or pip install dfloat11[cuda11]
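    # Not sure which extra to pick? Check the CUDA version of your PyTorch build:
    #   python -c "import torch; print(torch.version.cuda)"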
    
  2. To use the DFloat11 model, run the following Python example:

    import torch
    from dfloat11 import DFloat11Model
    from transformers import AutoTokenizer

    model_id = "DFloat11/gemma-3-12b-it-DF11"

    # Load the DFloat11 checkpoint; the weights stay compressed in GPU memory and
    # are decompressed on the fly inside matrix multiplications.
    model = DFloat11Model.from_pretrained(model_id, device_map="auto")

    # The tokenizer is the standard Gemma 3 tokenizer, unchanged by the compression.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    prompt = "What is a binary tree?"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

    # Generate up to 256 new tokens with sampling enabled.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
        )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))
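
As an optional sanity check of the memory savings, you can inspect the GPU footprint after loading. This snippet is not part of the official example, and the exact numbers will depend on your GPUs, device map, and allocator state:

    import torch

    # A 12B-parameter BF16 checkpoint needs roughly 24 GB for weights alone;
    # the DFloat11 version should come in noticeably lower (roughly 30% less).
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GiB allocated")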
    

📄 Learn More
