Crashes after several responses
For some reason it crashes after one or two responses. For example, I ask it to write code, it gives an answer, then I ask it to improve the code, and then it crashes. This happens both in LM Studio and in llama.cpp. Could it be because I'm on AMD + Vulkan? No problems with other models, though...
Update: trying 0/49 layers on the GPU now, same problem.
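In other words, running fully on the CPU, something like this (assuming the llama-cli binary; model filename as I named it below, other flags omitted):

```
llama-cli -m 32B-UD-Q3_K_XL.gguf -ngl 0
```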
Try using --batch-size 365
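Something like this, assuming you launch llama.cpp directly (model filename is just a placeholder):

```
llama-cli -m model.gguf --batch-size 365
```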
Recent problem with GLM-32b (have to use -b 8 -ub 8), now this... What's going on...
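For clarity, the GLM-32b workaround I mean is shrinking both the logical and physical batch sizes at launch (model filename is a placeholder):

```
llama-cli -m GLM-32B.gguf -b 8 -ub 8
```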
Did you update to the latest version? Does it still crash on the smaller non-MoE models?
Yes, the recent version, 32B-UD-Q3_K_XL.gguf. I know there was a problem with the template, but this is probably a different bug. It only crashes on this model. @sidran's suggestion of --batch-size 365 does seem to help, though.
@urtuuuu
I was just guessing and wrote this just in case it might help.
https://github.com/ggml-org/llama.cpp/issues/13164
@sidran btw, I wonder how you get only 10.7 t/s? I don't even have a graphics card. It's a mini PC with a Ryzen 7735HS and integrated graphics, which lets me use my 32 GB of RAM as VRAM; I have it set to 8 GB. I offload all 49/49 layers to the GPU in Vulkan llama.cpp and the speed is 24 t/s at the beginning. I haven't tried how much context is possible; I just set it to 10000.
Oh, actually I'm using Q3_K_XL... yours is probably Q8 or something :)
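Roughly, my launch line is this (the filename is a stand-in for my Q3_K_XL GGUF; -ngl sets offloaded layers, -c the context length):

```
llama-cli -m model-UD-Q3_K_XL.gguf -ngl 49 -c 10000
```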
@urtuuuu
I really don't know, mate, but I suspect you have made an error somewhere, since you made a few here as well. First you said you use "32B-UD-Q3_K_XL.gguf" (no way you are running a dense model this fast), then you said you allocate 8 GB and fit the whole 30B(?) into that much (impossible), then you say 49/49 layers, but 30B has only 48. I can't be sure, but I suspect you are mixing something up, as my numbers seem quite good for my hardware. I am using Qwen3-30B-A3B-UD-Q4_K_XL.gguf with context 12288 and get slightly over 12 t/s at the very start, with a hundred or so tokens. I run dense models like QwQ 32B at a terrible 1.7 t/s, which is to be expected, and the same goes for Qwen3 32B (dense). All with the same 12288 context length.
I don't know what your architecture really is, but there's too much here that's confusing. I know Macs have unified memory and run LLMs on par with the best GPUs, but I don't think your Ryzen has that kind of memory.
Maybe I am missing something, but I really don't know what.