8GB GPU can run this,10t/s
https://huggingface.co/mradermacher/QwQ-32B-i1-GGUF
llama-server.exe -m QwQ-32B.i1-IQ2_XXS.gguf -ngl 60 -fa -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 30 -c 2048 -n -1 --host 0.0.0.0 --port 8080 --reasoning-format deepseek
speed is about 10t/s
I guess this is Nvidia using Cuda, right? Certainly not Vulkan, because I have 8GB AMD GPU using Vulkan and I'm getting about 2 t/s at best with Q2_K. I see you're running imatrix version, that's a whole story in itself for me personally they never work well - they never utilize GPU at all and they are slower (probably due to lack of GPU offloading to begin with).
I guess this is Nvidia using Cuda, right? Certainly not Vulkan, because I have 8GB AMD GPU using Vulkan and I'm getting about 2 t/s at best with Q2_K. I see you're running imatrix version, that's a whole story in itself for me personally they never work well - they never utilize GPU at all and they are slower (probably due to lack of GPU offloading to begin with).
Notice! The model he used is QwQ-32B.i1-IQ2_XXS.gguf
! Only 12.9G!
As a contrast, in an 8 GB Nvidia GPU:
speed is 8-10 tps running QwQ-32B.i1-IQ2_XXS.gguf
speed is 3-5 tps running QwQ 32B