Are any of these the QAT releases of Gemma 3

#15
by Downtown-Case - opened

The Gemma 3 paper mentions targeting the GGUF format for llama.cpp, but... I can't find where they were released?

Are any of these Google's QAT GGUFs, or are they just regular quantizations made from the 16-bit release?

No, these are just made from the 16-bit release. I'm not even sure QAT would work nicely with llama.cpp's formats, since they use some clever rounding methods 🤔

@bartowski There is some discussion of it in other threads.

Someone already made a script to convert the int4 flax version: https://huggingface.co/gaunernst/gemma-3-1b-it-int4-awq/blob/main/convert_flax.py

It should export to 32-bit Hugging Face weights if you skip the AWQ step, and the author mentioned they are considering hacking out a Q4_K version: https://github.com/turboderp-org/exllamav2/issues/751#issuecomment-2727966003

And the section from the paper:

Along with the raw checkpoints, we also provide quantized versions of our models in different standard formats. These versions are obtained by finetuning each model for a small number of steps, typically 5,000, using Quantization Aware Training (QAT) (Jacob et al., 2018). We use probabilities from the non-quantized checkpoint as targets, and adapt the data to match the pretraining and post-training distributions. Based on the most popular open source quantization inference engines (e.g. llama.cpp), we focus on three weight representations: per-channel int4, per-block int4, and switched fp8. In Table 3, we report the memory filled by raw and quantized models for each weight representation with and without a KV-cache for a sequence of 32k tokens.
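For anyone curious what "per-block int4" means in practice, here's a minimal NumPy sketch of the quantize-dequantize ("fake quant") step a QAT forward pass would typically see. The block size and symmetric rounding here are illustrative assumptions, not Google's actual recipe:

```python
import numpy as np

def fake_quant_int4_per_block(w, block_size=32):
    """Quantize-dequantize weights to int4, one scale per block.

    During QAT the forward pass uses these rounded weights while
    gradients flow to the underlying float values (straight-through
    estimator). Illustrative sketch only.
    """
    flat = w.reshape(-1, block_size)
    # Symmetric int4 range [-8, 7]; scale from the max magnitude per block.
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -8, 7)
    return (q * scale).reshape(w.shape)

w = np.random.randn(4, 64).astype(np.float32)
wq = fake_quant_int4_per_block(w)
# Round-trip error is bounded by half a quantization step per block.
err = np.abs(w - wq).max()
```

The point is that the model is finetuned while seeing exactly this rounding, so the weights learn to sit near representable grid points before export.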

It's possible they confused that with gemma.cpp.

Yeah, the 6-bit values used in Q4_K are one of my concerns; QAT typically targets one specific bit width, not a variable mix. But I'm interested in following that work and seeing if anything can come from it!

It's also entirely possible that Q4_0 would make for a strong llama.cpp quant level with QAT
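For reference, Q4_0 is the simplest llama.cpp case: each block of 32 weights stores one fp16 scale plus 4 bits per weight, so a QAT run would only need to target a single uniform grid per block. A quick back-of-envelope for the effective bit rate:

```python
# Effective bits per weight for llama.cpp's Q4_0 layout:
# one block = 32 weights, stored as one fp16 scale (2 bytes)
# plus 32 packed 4-bit values (16 bytes).
block_weights = 32
block_bytes = 2 + block_weights // 2  # 18 bytes per block
bits_per_weight = block_bytes * 8 / block_weights  # 4.5 bits/weight
```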

Also I realized I'm in the wrong repo lol
