Thank you, but some issues
Appreciate these new quants which now seem to support MLA.
Heads up for others: I experienced severe degradation at long context with the UD-IQ2_M and UD-Q2_K_XL quants. Without dropping the K cache type to Q8, I was getting random Chinese and gibberish in response to a 7K-token prompt on my setup. This was not the case with the previous quants made prior to the MLA commit to llama.cpp. I suspect something is awry with the changes made in that commit.
Dropping to a Q8 K cache gave non-gibberish, decent results, if subpar compared to the old quants. Short-context prompts were fine with either Q8 or FP16 cache.
I get the same issue, I'm using a mix of CUDA + CPU though (128GB VRAM + 192GB RAM)
I get gibberish on any prompt larger than about 2048 tokens, or with existing context in general (like resuming a chat).
The old version doesn't suffer from this, but its cache uses way more VRAM.
Yeah, I’m using a mix of CUDA and CPU too (120GB VRAM + 256GB RAM with a Threadripper Pro 5965 CPU).
Have you tried -ctk q8_0? That got it working on longer contexts for me, but I'm still validating whether there's a quality loss. I'm also finding that my CPU temps are up from 70C with the old combo to 85-90C now. I think it's the MLA commit causing these issues, but I'm not 100% sure, as it seems Unsloth are also using a new methodology to make these dynamic quants, so maybe it's an interaction between the two.
I tried with -ctk q8_0 but got the same issue sadly :(
I have a Ryzen 7 7800X3D, 5090+4090x2+A6000, tested on Fedora 42.
Interesting. It doesn't make much sense to me that the q8_0 K cache fixed the long-context issues; I was just randomly running through different things to try to resolve it. I usually run with an FP16 KV cache. Without the Q8 K cache I was always getting a couple of lines of mixed Chinese/English/Russian and then an EOS. Something is definitely wrong there. Might it be an issue with the context shifting that was part of the MLA commit?
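For reference, here's a minimal sketch of the kind of invocation I mean; the model path, context size, and layer count are just placeholders for your own setup:

```bash
# Minimal sketch of the q8_0 K-cache workaround (paths/sizes are placeholders).
# -ctk q8_0 quantizes the K cache to Q8_0 instead of the default FP16.
./llama-server \
  -m /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
  -c 16384 \
  -ngl 20 \
  -ctk q8_0
```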
Oh my, I'll investigate ASAP, sorry about the issues! It might be due to llama.cpp's new MLA implementation.
Did y'all make a new imatrix with the latest llama-cpp including the MLA patches or is this an older imatrix dat?
Just curious, as yours says 720 entries, which is what I've seen using the ik_llama.cpp fork (it has supported MLA longer)... but I've heard making a new imatrix might show 721 entries and throw an error complaining about attn_k_b_weight being the wrong size, as in this github PR here.
Might be a clue, or a red herring, not sure, just sharing what I've seen! Thanks!
Btw, are you guys using CPU offloading? We think that might be the issue, because full GPU offloading works fine.
And which bits are you guys using?
We will be reuploading them nonetheless
Same issue here with CPU offloading, Threadripper with 128GB DDR5 + 5xRTX3090's.
I'm using ~25 layers on GPU and the rest on CPU in my case. Q2_K_XL.
Yes, similar here, I’m offloading 20 layers to my GPUs, the rest is CPU. I’ve tried the UD-IQ2_M and the UD-Q2_K_XL. I’ve used the old UD-Q2_K_XL in the past with no issues.
I'm seeing the nonsense output with UD-IQ3_XXS, even with a context size as low as 1000. It's not just the UD Q2 versions that show this issue.
@MB7977's workaround of using -ctk q8_0 does work for me (thanks for that), even with context over 10000. I have 32 GB VRAM and 128 GB main memory (which means I'm also partly running the model directly from my SSD). I'm offloading 4 layers to the GPU.
Wait, could this be related? https://github.com/ggml-org/llama.cpp/pull/13113
I'll test CPU offloading and report back!
No luck with the latest commit for me, unfortunately. I suspect it’s a bug in the original MLA commit.
Sorry I should have clarified. I don't think there's any issue with your quants. It's something wrong with llama.cpp's MLA implementation when using CUDA+CPU.
I had the same problem a few weeks ago when I built the MLA PR of llama.cpp and tried various R1 MLA quants (eg. regular Q3_K). I probably should have raised it on github but was too busy / didn't realize CUDA with CPU offload was an edge case.
> It's something wrong with llama.cpp's MLA implementation when using CUDA+CPU.
Thanks for the report, I still haven't tried the recently merged MLA features in mainline llama.cpp yet. For any intrepid users, the ik_llama.cpp fork has had it working for a while now, and I have an ik_llama.cpp-exclusive quant, ubergarm/DeepSeek-V3-0324-GGUF, that people have been asking me to compare with these quants.
I hope to test MLA on mainline llama.cpp more thoroughly once I get access to a big RAM rig again soon, especially given the imatrix issues bartowski mentioned above.
Cheers!
Sadly I get a similar issue with IQ2_K_R4 https://github.com/ikawrakow/ik_llama.cpp/issues/305
Not exactly the same, as I get just "DDDDD" there, while on mainline llama.cpp I get gibberish (symbols and random letters).
> We will be reuploading them nonetheless
It looks like all files were replaced indeed. What was fixed?
@shimmyshimmer Thanks a lot! I just downloaded the new DeepSeek-V3-0324-UD-IQ1_S and gave it a quick test. The problem persists with CUDA + CPU.
Model output example:
Message 1: "Hi"
Model response: "Hello! How can I help you today? 😊"
Message 2:
Model response: "55235A?@0!3'&!EC,."
Fairly low context:
```
prompt eval time = 10193.24 ms / 571 tokens ( 17.85 ms per token, 56.02 tokens per second)
eval time = 2250.50 ms / 19 tokens ( 118.45 ms per token, 8.44 tokens per second)
total time = 12443.74 ms / 590 tokens
```
I really don't think there's anything wrong with your quants; it's a llama.cpp + (CUDA without offloading all layers) + MLA issue. I've had the same thing happen with other quants when building the MLA PR before it was merged.
I also tried compiling llama.cpp with Vulkan instead of CUDA and tested it with the same settings; I couldn't reproduce the problem with either of your uploads.
(Performance is unusable on Vulkan though, with 2 t/s prompt processing / 3-4 t/s generation and higher GPU power usage.)
I added a request here for investigation: https://github.com/ggml-org/llama.cpp/pull/12801#issuecomment-2835070735
I wonder if we could do a 'git revert' of the MLA commits and see how it behaves with these new versions, mostly to keep the -ot parameter.
I also get the same issue, so as @gghfez says, I don't think it is an issue with your quants, but an issue when offloading and using CUDA with MLA. I think a way to disable MLA would be great, since for now it seems to be forced on.
Thank you for that. I was going to open an issue, but it'll carry more weight coming from you guys. I wonder, if it can't be fixed, whether including an -mla flag might be an option going forward; that seemed to be part of the original PR. I'd love for MLA to work, but for the moment my main priority is still being able to use DeepSeek quants with the latest builds of llama.cpp, especially with R2 not far away.
Thank you again for being so helpful and engaged.
Have you guys tried the latest llama.cpp version? Apparently it fixes it? https://github.com/ggml-org/llama.cpp/pull/13137
No luck with the latest commit for me, unfortunately.
Oh rip. 😔 https://github.com/ggml-org/llama.cpp/pull/12801#issuecomment-2835304458
I got it working!
```
prompt eval time = 116176.21 ms / 8150 tokens ( 14.25 ms per token, 70.15 tokens per second) 👈 that's a lot faster than what I had before!
eval time = 114519.95 ms / 875 tokens ( 130.88 ms per token, 7.64 tokens per second) 👈 This is above 10 at lower contexts
total time = 230696.16 ms / 9025 tokens
```
It's stable at 8k context. No more garbage outputs, and it's got its "personality" back at low contexts (feels like DeepSeek again). Prompt processing is also faster than just using -ngl <whatever I can fit>.
Whatever the CPU-offload issue is, it doesn't seem to affect the expert tensors. So I ended up with this:
```
-ngl 99 -v --override-tensor 'blk\.(2[5-9]|[3-5][0-9]|60)\..*_exps\.=CPU' --override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1' --override-tensor 'blk\.([5-9])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[0-4])\..*_exps\.=CUDA0' --override-tensor 'blk\.(1[5-9])\..*_exps\.=CUDA4' --override-tensor 'blk\.(2[0-4])\..*_exps\.=CUDA3'
```
The important thing is -ngl 99, to ensure all the non-expert layers are on CUDA devices; then put the experts on CPU.
To put them all on CPU, set -ngl 99 and then add this flag:
-ot "\d+.ffn_.*_exps.=CPU"
But that's too slow for me, as I don't have the DDR5 capacity and end up memory-mapped to SSD.
So I started spreading experts across my CUDA devices to get system memory usage below 120GB, like this:
- Replaced -ot "\d+.ffn_.*_exps.=CPU" with this to only offload experts 25-60 to CPU:
  --override-tensor 'blk\.(2[5-9]|[3-5][0-9]|60)\..*_exps\.=CPU'
  Change it to match however many experts you can't fit onto CUDA ^
- Individually assign experts to each CUDA device. For example, this puts experts 1-4 on CUDA1:
  --override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1'
- Do the same for all CUDA devices (prepare for trial and error / CUDA OOM to get it right)
```
--override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1' \
--override-tensor 'blk\.([5-9])\..*_exps\.=CUDA2' \
--override-tensor 'blk\.(1[0-4])\..*_exps\.=CUDA0' \
--override-tensor 'blk\.(1[5-9])\..*_exps\.=CUDA4' \
--override-tensor 'blk\.(2[0-4])\..*_exps\.=CUDA3'
```
And be sure to verify that every expert is accounted for between the CPU and CUDA devices ^, or any you miss will end up on CUDA0 and OOM.
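To make that trial-and-error a bit less error-prone, here's a small bash sketch that builds the --override-tensor arguments from a table of layer-range regexes. The device names, ranges, and model path below just mirror my split above; they're examples to adapt, not anything canonical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build per-device --override-tensor flags from a table of
# layer-range regexes. Adjust the ranges to whatever fits each GPU's VRAM.
declare -A EXPERT_SPLIT=(
  [CUDA1]='[1-4]'                # expert tensors of layers 1-4
  [CUDA2]='[5-9]'                # layers 5-9
  [CUDA0]='1[0-4]'               # layers 10-14
  [CUDA4]='1[5-9]'               # layers 15-19
  [CUDA3]='2[0-4]'               # layers 20-24
  [CPU]='2[5-9]|[3-5][0-9]|60'   # everything else stays on CPU
)

OT_ARGS=()
for dev in "${!EXPERT_SPLIT[@]}"; do
  OT_ARGS+=(--override-tensor "blk\.(${EXPERT_SPLIT[$dev]})\..*_exps\.=${dev}")
done

# Model path and context size are placeholders.
./llama-server \
  -m /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
  -c 8192 -ngl 99 -v "${OT_ARGS[@]}"
```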
I also added the -v flag to see where each tensor was being assigned (it's super verbose).
Here's "Hello"; performance is better at lower context:
```
prompt eval time = 977.57 ms / 9 tokens ( 108.62 ms per token, 9.21 tokens per second) 👈 that just looks slow because there are less than 70 tokens in the prompt
eval time = 874.05 ms / 12 tokens ( 72.84 ms per token, 13.73 tokens per second)
total time = 1851.63 ms / 21 tokens
```
That's great @gghfez !
Is there a way to see what the exp tensors are? How many active parameters are there, about 40B? How many experts do we have? Sorry for so many questions.
I have 192GB RAM + 128GB VRAM. I've searched a bit about -ot but haven't understood how to use it. I have a 4090 + 4090 + 5090 + A6000 (devices ordered like that).
In theory we would want to have all active params + some experts on GPU, and the rest of the experts on CPU? Then without active params on CPU the issue shouldn't happen? Is there a way to see how much RAM/VRAM each expert uses?
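Maybe something like the gguf-dump script that ships with the gguf Python package could list them? Just a guess on my side, e.g. (the model path is a placeholder):

```bash
# Sketch: list the MoE expert tensors (their names contain "_exps") in the first shard.
pip install gguf
gguf-dump /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf | grep '_exps'
```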
Okay, after tinkering for like 5 hours, I can confirm @gghfez's finding.
If you load all the active parameters onto GPU, then put some experts on CUDA and the rest on CPU, it works fine. My split is pretty rough, but I made it work for now.
My speeds are pretty bad though, but that's probably because I'm using X16/X4/X4/X4 instead of something like X8/X8/X8/X8.
```
prompt eval time = 432377.39 ms / 3070 tokens ( 140.84 ms per token, 7.10 tokens per second)
eval time = 44220.34 ms / 307 tokens ( 144.04 ms per token, 6.94 tokens per second)
```
And no gibberish.
So I guess there is a bug when MLA runs with the active params split across CPU + CUDA.
EDIT: After tinkering a bit I got better speeds:
```
prompt eval time = 146999.55 ms / 3070 tokens ( 47.88 ms per token, 20.88 tokens per second)
eval time = 34334.69 ms / 257 tokens ( 133.60 ms per token, 7.49 tokens per second)
```
Using this command:
```
./llama-server -m '/home/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.(2[5-9]|[3-6][0-9])\..*_exps\.=CPU' --override-tensor 'blk\.([1-6])\..*_exps\.=CUDA0' --override-tensor 'blk\.([7-9]|1[0])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[1-5])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[6-9]|2[0-4])\..*_exps\.=CUDA3'
```
My CUDA0 is saturated at 26 GB/s during prompt processing. Make sure that's happening on your X16 GPU. There's some llama.cpp flag to control this IIRC, but I got lucky this time and it's happening by default now.
Also note that MLA is slower at longer contexts; the performance is more like when I quantized the K cache on the older builds (8 t/s at 4k context instead of closer to 10).
The llama.cpp devs know about this trade-off vs. needing 4MB of VRAM per token of context on the old builds.
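For a rough sense of scale: if that 4MB/token figure holds, a 16K context would need about 4 MB × 16,384 ≈ 64 GB of KV cache on the old non-MLA builds, so the slower-but-much-smaller MLA cache can still be the better deal.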
Oh and regarding
> Wonder if we could do a 'git revert' of the MLA commits and see how it behaves with these new versions, mostly to keep the -ot parameter.
That won't work; the new quants require MLA. I remember reading that somewhere on GitHub.
I also just tested my old llama.cpp build where I merged PR 11397 to use -ot (this is how I run the normal DS quants), and it can't load these ones.