gemma-3-27b-it-GPTQ-4b-128g

Model Overview

This model was obtained by quantizing the weights of gemma-3-27b-it to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within language_model transformers blocks are quantized. Vision model and multimodal projection are kept in original precision. Weights are quantized using a symmetric per-group scheme, with group size 128. The GPTQ algorithm is applied for quantization.

Model checkpoint is saved in compressed_tensors format.

Usage

  • To use the model in transformers update the package to stable release of Gemma3:

    pip install git+https://github.com/huggingface/[email protected]

  • To use the model in vLLM update the package to version after this PR.

And example of inference via transformers is provided below:

# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene, 
# focusing on a cluster of pink cosmos flowers and a busy bumblebee. 
# It has a slightly soft, natural feel, likely captured in daylight.
Downloads last month
708
Safetensors
Model size
5.23B params
Tensor type
I64
I32
BF16
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Model tree for ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g

Quantized
(38)
this model