Why is the size bigger than regular Q4_0 quants?

#1 opened by lefromage

This quant is 16GB: /gemma-3-27b-it-q4_0.gguf

The same Q4_0 quant made with llama.cpp is smaller and works better:
bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB

token_embd.weight is in fp16 with this model, but Q6_K in the other quant you linked. That tensor alone is about 1.4B params, so in f16 it takes about 2.8GB vs roughly 1.15GB with Q6_K.
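
A quick back-of-the-envelope check (assuming roughly 1.4e9 embedding parameters, 16 bits per weight for f16, and about 6.5625 bits per weight for Q6_K):

# rough tensor sizes: 16 bits/weight for f16 vs ~6.5625 bits/weight for Q6_K
awk 'BEGIN { p = 1.4e9; printf "f16 : %.2f GB\nq6_K: %.2f GB\n", p*16/8/1e9, p*6.5625/8/1e9 }'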

The same Q4_0 quant made with llama.cpp is smaller and works better:
bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB

You mean the normal imatrix quant works better than this one, produced with quantization-aware training? On what tasks is the bartowski quant better?

For those who want it, I have uploaded a smaller version of this model with a quantized token embedding table. It doesn't seem to significantly hurt performance.
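
In case it's useful, here's a minimal sketch of how a file like that can be produced with llama.cpp's llama-quantize (the filenames are placeholders, and this is not necessarily how the smaller upload was made); requantizing an already-quantized GGUF needs --allow-requantize:

# requantize, keeping Q4_0 weights but storing the token embedding tensor as Q6_K
./build/bin/llama-quantize --allow-requantize --token-embedding-type q6_K \
    gemma-3-27b-it-qat-q4_0.gguf gemma-3-27b-it-qat-q4_0-q6emb.gguf q4_0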

I'd love to see a comparison between these QAT 4-bit quants and some normal llama.cpp imatrix GGUFs at roughly the same bpw. The easiest and fastest way would probably be to compare perplexity on the same text with the same seed across a couple of 4-bit GGUFs, with the bf16 as a baseline, e.g.:

wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
gunzip wiki.test.raw.gz

./build/bin/llama-perplexity \
    --model /mnt/models/gemma-3-27b-it-qat-q4_0.gguf \
    -ngl 99 \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --seed 1337 \
    --threads 8

...

Final estimate: PPL = ?.???? +/- 0.0????
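
For a direct comparison, the same measurement can be run over several files in a loop (the filenames below are just placeholders for whichever GGUFs you want to compare):

# same text, context size and seed for every model, so the PPL values are comparable
for m in gemma-3-27b-it-qat-q4_0.gguf gemma-3-27b-it-Q4_0.gguf gemma-3-27b-it-bf16.gguf; do
    echo "== $m =="
    ./build/bin/llama-perplexity --model "$m" -ngl 99 --ctx-size 512 \
        --ubatch-size 512 -f wiki.test.raw --seed 1337 --threads 8
done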

I did a write-up with some results on the new QAT model: https://github.com/ikawrakow/ik_llama.cpp/discussions/334
