Why is the size bigger than regular Q4_0 quants?
This quant is 16GB: /gemma-3-27b-it-q4_0.gguf
The same model quantized with llama.cpp is smaller and works better:
bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB
token_embd.weight is stored in f16 in this model, but in Q6_K in the other quant you linked. That tensor alone is about 1.4B parameters, so at 16 bits per weight it takes roughly 2.8GB in f16, vs about 1.15GB in Q6_K (~6.56 bits per weight).
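A quick back-of-the-envelope check of those numbers (assuming ~1.4B embedding parameters, 16 bits per weight for f16 and ~6.56 bits per weight for Q6_K):

# rough tensor size in GB: params (in billions) * bits per weight / 8
echo "1.4 * 16 / 8" | bc -l      # f16  -> ~2.8 GB
echo "1.4 * 6.5625 / 8" | bc -l  # Q6_K -> ~1.15 GB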
> The same model quantized with llama.cpp is smaller and works better:
> bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB
You mean the normal imatrix quant works better than this one, which was produced with quantization-aware training? On what tasks is the bartowski quant better?
For those who want it, I have uploaded a smaller version of this model with a quantized token embeddings table. It doesn't seem to significantly hurt performance.
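One way to produce something like this with llama.cpp's llama-quantize (a rough sketch; the paths and output name are placeholders, and this is not necessarily exactly how the uploaded file was made):

# requantize the QAT GGUF, forcing the token embeddings tensor to Q6_K
# (--allow-requantize is needed because the source tensors are already quantized)
./build/bin/llama-quantize \
    --allow-requantize \
    --token-embedding-type q6_K \
    /mnt/models/gemma-3-27b-it-qat-q4_0.gguf \
    /mnt/models/gemma-3-27b-it-qat-q4_0-emb-q6k.gguf \
    Q4_0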
I'd love to see a comparison between these QAT 4-bit quants and some normal llama.cpp imatrix GGUFs with roughly the same bpw. The easiest and fastest way would probably be to compare perplexity on the same text with the same seed across a couple of 4-bit GGUFs, with the bf16 as a baseline, e.g.:
# download and unpack the test text
wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# compute perplexity over the file with a fixed seed
./build/bin/llama-perplexity \
--model /mnt/models/gemma-3-27b-it-qat-q4_0.gguf \
-ngl 99 \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
--seed 1337 \
--threads 8
...
Final estimate: PPL = ?.???? +/- 0.0????
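For completeness, the same measurement could be repeated for each model under test, e.g. with a small loop (the file names below are placeholders, not specific uploaded quants):

# run the identical perplexity measurement for every GGUF being compared
for m in gemma-3-27b-it-qat-q4_0.gguf gemma-3-27b-it-imatrix-Q4_0.gguf gemma-3-27b-it-bf16.gguf; do
    echo "=== $m ==="
    ./build/bin/llama-perplexity --model /mnt/models/$m -ngl 99 --ctx-size 512 \
        --ubatch-size 512 -f wiki.test.raw --seed 1337 --threads 8
done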
I did a write-up with some results on the new QAT model: https://github.com/ikawrakow/ik_llama.cpp/discussions/334