Crashes after several responses
For some reason it crashes after one or two responses. For example, I ask it to write code, it gives an answer, then I ask it to improve the code, and then it crashes. This happens both in LM Studio and in llama.cpp. Could it be because I'm on AMD + Vulkan? No problems with other models, though...
Update: trying 0/49 layers on the GPU now, same problem.
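In other words, running fully on the CPU, something like this (assuming the llama-cli binary; model filename as I named it below, other flags omitted):

```
llama-cli -m 32B-UD-Q3_K_XL.gguf -ngl 0
```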
Try using --batch-size 365
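Something like this, assuming you launch llama.cpp directly (model filename is just a placeholder):

```
llama-cli -m model.gguf --batch-size 365
```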
Recent problem with GLM-32b (have to use -b 8 -ub 8), now this... What's going on...
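For clarity, the GLM-32b workaround I mean is shrinking both the logical and physical batch sizes at launch (model filename is a placeholder):

```
llama-cli -m GLM-32B.gguf -b 8 -ub 8
```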
Did you update to the latest version? Does it still crash on the smaller non-MoE models?
Yes, the recent version, 32B-UD-Q3_K_XL.gguf. I know there was a problem with the template, but this is probably a different bug. It only crashes on this model. @sidran's suggestion of --batch-size 365 does seem to help, though.
@urtuuuu
I was just guessing and wrote this just in case it might help.
https://github.com/ggml-org/llama.cpp/issues/13164
@sidran btw, I wonder how you get only 10.7 t/s? I don't even have a graphics card. It's a mini PC with a Ryzen 7735HS and integrated graphics, which lets me use my 32 GB of RAM as VRAM; I have it set to 8 GB. I offload all 49/49 layers to the GPU in Vulkan llama.cpp and the speed is 24 t/s at the beginning. I haven't tried how much context is possible; I just set it to 10000.
Oh, actually I'm using Q3_K_XL... yours is probably Q8 or something :)
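Roughly, my launch line is this (the filename is a stand-in for my Q3_K_XL GGUF; -ngl sets offloaded layers, -c the context length):

```
llama-cli -m model-UD-Q3_K_XL.gguf -ngl 49 -c 10000
```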
@urtuuuu
I really don't know, mate, but I suspect you have made an error somewhere, since you made a few here as well. First you said you use "32B-UD-Q3_K_XL.gguf" (no way you are running a dense model this fast), then you said you allocate 8 GB and fit the whole 30B(?) into that much (impossible), then you say 49/49 layers, but 30B has only 48. I can't be sure, but I suspect you are mixing something up, as my numbers seem quite good for my hardware. I am using Qwen3-30B-A3B-UD-Q4_K_XL.gguf with context 12288 and get slightly over 12 t/s at the very start, with a hundred or so tokens. I run dense models like QwQ 32B at a terrible 1.7 t/s, which is to be expected, and the same goes for Qwen3 32B (dense). All with the same 12288 context length.
I don't know what your architecture really is, but there's too much here that's confusing. I know Macs have unified memory and run LLMs on par with the best GPUs, but I don't think your Ryzen has that kind of memory.
Maybe I am missing something, but I really don't know what.