Model seems to have factual issues beyond ~6500 tokens

#3
by Reithan - opened

Been using this model for over a week now and testing with a variety of prompt & story types. One persistent problem I keep seeing with it, that I haven't seen with other 8192-context models I tested, is that the model seems to get very confused about facts or details that sit more than ~6500 tokens back in the context.

The model is very smart and accurate as long as the submitted context is less than 6500 tokens, or if you ask it about details that sit closer than roughly that limit back in the context.

Once you start asking it to use or repeat details from further back than that (6500-8192 tokens), the model tends to get very confused. It mixes up characters and pronouns, adds hallucinated details, misses essential details that are in the context, etc. It's a pretty stark difference, too. Details up to around 6500 tokens back are used appropriately and smartly with very little mix-up, but past that point it consistently makes these sorts of mistakes.

I've been using the IQ4_XS quant, so I'm not sure if this happens on every quant version, but it's super consistent with the IQ4_XS quant.

Just tested a few inputs in Q4_K_M and it seemed to have the same issue, but testing right now with Q8_0 seems to not have this issue. (Unfortunately, Q8_0 is really slow for me locally. :( )

(specifically, these are the quants from mradermacher)

Doing some testing now with the Q4_K_M quant GGUF from SteelQuant, and it seems to be much more coherent at 8192 than mradermacher's. It makes minor mistakes or omissions at 8192 sometimes, but is generally factually coherent, whereas mradermacher's Q4_K_M almost never gets facts right from descriptions that sit ~8192 tokens deep.

It seems like something about the quants below Q8_0 from mradermacher has some sort of context-length-related issue?

BTW: my method of testing this was that I had an ongoing story I was using the model for, and once it exceeded 8192 tokens by quite a bit, I started asking it questions about details regarding similar characters or related events from the beginning of the context (inserted by lorebook). If I set the context limit to 8192, it consistently made the same factual transpositions and errors (swapping characterA for characterB, swapping details of eventA for eventB) over ~6 retries. Same errors each time.

Setting the max context to anything lower than 6500 removed the errors, and almost all answers (6+ tries each) were factually accurate without transpositions.
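For what it's worth, the kind of harness I have in mind to make this repeatable looks roughly like the sketch below - the generate() helper, character names, and error check are placeholders for my actual ST/kobold setup, not a real implementation:

```python
import collections

# Placeholder for however you call your backend (ST, kobold API, etc.):
# send the same frozen prompt with a given max context, return the reply text.
def generate(prompt: str, max_context: int) -> str:
    raise NotImplementedError

PROMPT = "..."   # frozen story + lorebook + a question about charA
RETRIES = 6      # regens per context size

def looks_transposed(reply: str) -> bool:
    # Placeholder check: did the model describe charB when asked about charA?
    return "charB" in reply and "charA" not in reply

errors = collections.Counter()
for max_ctx in (6144, 6400, 6800, 7168, 7680, 8192):
    for _ in range(RETRIES):
        errors[max_ctx] += looks_transposed(generate(PROMPT, max_ctx))

for max_ctx in sorted(errors):
    print(f"max_ctx={max_ctx}: {errors[max_ctx]}/{RETRIES} transposed replies")
```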

Interesting, and it's very possible that the issue lies in the way the moerges (MoE merges) are created. Although it could also stem from the quantization mradermacher is using. I'm not well-versed in the quantization process, though; I mainly use the auto-quant space on HF.

I'll ping them and maybe they can shed some more light on this
@mradermacher

At the moment, there is really only one way to make quants, so there is no difference in the method between quants made by me and others, other than the imatrix training set (i.e. static quants should be identical at least in their contents, and imatrix quants are practically always better) and the fact that I usually calculate imatrix data from the source model and not the quantized model. Q4_K_M is generally pretty close to the original, and without any kind of dependable evidence, this is most likely just random variation. Models do get worse with higher context and heavier quantisation, and it's very common for any model to hallucinate and give confused answers, only to give good-quality answers when retrying.

Yep, I'm aware of RNG doing what RNG does.

I started looking into this because, over the course of a week or so, I had consistent factual-confusion issues with this model: as soon as my context got to around the 8k limit, it would start making factual errors about the World Info (using SillyTavern).

However, after swapping from the IQ4_XS version to Steel's Q4_K_M, I've been using the model for several hours now, and it's honestly a night-and-day difference. When testing the Q4_K_M from mradermacher, I still saw similar issues. That comparison, if anything, is the one most subject to RNG bias, since I only did a limited number of tests, so it's possible I just got incredibly unlucky testing mradermacher's Q4_K_M version and that it would also 'fix' the issue. But going from mradermacher's IQ4_XS to Steel's Q4_K_M is such a stark difference, it's like flipping a switch. The model was struggling to remember key details and transposing characters' descriptions and actions with IQ4_XS, and with the Q4_K_M, it just ISN'T anymore.

Additionally: I didn't go as far as finding the actual 'limit' on the IQ4_XS model, but it's also a very, very stark and noticeable difference. Like, I'd set context to 8192, ask a series of 2-3 questions, and get the same set of errors in almost every response after regenerating the response 6+ times.

Changing the context to 6400, and suddenly there were zero or close to zero errors in every response through 6+ regens. (I didn't see the same effect at all if I set it to 6800, 7000, 7500, etc. It had to be below 6500.)

I'm aware 6 is a very small sample size, but that was just my initial testing. The 'test' before this was about a week of using the model and consistently noticing these errors at larger contexts for anything at the 'top' of the World Info (so probably 100s or 1000s of samples), versus using the Q4_K_M all day today, where that issue has basically vanished (dozens of samples?).

Well, you could check the gguf metadata to see if there is any difference. Then the question is which quants you used (e.g. are you comparing an imatrix quant and a static one?), whether you used the same settings, whether you used a language other than English, and so on. And yes, I get reports monthly from people who claim to see big differences between two quants - the most common case is people swearing the f16 is so much better than the q8_0, only to find out they never tested the f16 and compared the q8_0 with itself. You really need something more than subjective handwaving to get me excited :-)

I've compared Steelskull's q4_k_m and my static q4_k_m, and the metadata is identical other than some extra keys I add. Barring some big undetected numerical bug in llama.cpp, they are almost certainly identical in performance, and any difference would be a figment of imagination.
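(If anyone wants to check the metadata themselves, here's a rough sketch using the gguf Python package from llama.cpp - the file paths are placeholders, and it only diffs key names, not values or tensor data:)

```python
# Rough sketch: diff the metadata key names of two GGUF files.
# Assumes the `gguf` Python package from llama.cpp (pip install gguf);
# the file paths below are placeholders.
from gguf import GGUFReader

def metadata_keys(path: str) -> set[str]:
    return set(GGUFReader(path).fields.keys())

a = metadata_keys("quant_a.Q4_K_M.gguf")
b = metadata_keys("quant_b.Q4_K_M.gguf")

print("only in A:", sorted(a - b))
print("only in B:", sorted(b - a))
print("shared keys:", len(a & b))
# Comparing the values (and the tensor data) takes more work; identical
# key sets alone don't prove the quants are identical.
```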

Yeah, as noted, the testing of your Q4_K_M vs Steel's was probably the smallest/weakest test I did. I tested the IQ4_XS for around a week just through use, and the fact-mixing is what made me start looking into fixes. I noted that lowering the context fixed it, and the fact that going from 6400 to 7000 made such a big difference surprised me.

I only tested the other quants... I dunno why, just had a thought to do it. I'm still testing Steel's Q4_K_M model right now and still haven't seen the issues from the IQ4_XS come back. If the two Q4_K_M quants are identical, then it's probably just some issue between the IQ vs K quants??

I don't have deep enough knowledge to know why or how that might happen, but I'm pretty convinced by my testing. What data can I provide so you can try to replicate it? I can give you my sampler settings and stuff if that helps. I'd rather not supply the world info and character info atm, though.

Notes: I'm running koboldCpp locally with ST as my UI.
My samplers are Temp 0.75, TFS 0.85, Smoothing Factor 0.25, RepPenalty 1.15 @ 2048 length, 0.05 presence penalty. 8192 token context with 1024 output. I'm using Llama3 prompt format.
I'm running koboldCpp with Flash Attention, QKV Cache, 12 GPU layers, 24 threads, CuBLAS. Also tried with QKV off and Context shift on, didn't seem to make a difference.
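For completeness, this is roughly how those settings map onto a single koboldCpp generate call - the endpoint and parameter names are my best understanding of the local API and worth double-checking against the docs kobold serves:

```python
# Rough sketch of one generation request, assuming koboldCpp's default local
# API on port 5001; parameter names may need adjusting for your version.
import requests

payload = {
    "prompt": "...",             # full Llama3-formatted context from ST
    "max_context_length": 8192,  # drop to 6400 to reproduce the 'fix'
    "max_length": 1024,
    "temperature": 0.75,
    "tfs": 0.85,
    "smoothing_factor": 0.25,
    "rep_pen": 1.15,
    "rep_pen_range": 2048,
    "presence_penalty": 0.05,
}

r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=600)
print(r.json()["results"][0]["text"])
```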

Ok, noticed the responses degrading again just now. I've been running this model without a restart since the first post here (~6 hr ago). The basic test I'm running is: I have 2 similar characters in the world info with different names, and in a summary there are some changes to their appearance. I ask the model for various descriptive details about charA, and the issue I see is that sometimes it describes charB instead, or if I ask it to describe both characters, it'll swap their descriptions. I've tried updating the format or wording of the WI entries, but the issue persists; I tried rewording the question - the issue persists.

If I drop the context from 8192 to 6400, the issue immediately stops. I just tried this several times in a row for both sizes. Is it possible this is some sort of caching issue with koboldCpp? I had no issues for hours, and now it won't stop doing it again. I guess it's still possible it's just RNG, but it seems too streaky for me to just dismiss. Going to try restarting kobold and see if the issue goes away again.

If you're able to reproduce any of this, drop another comment; I'm just going to investigate further on my own. If I can get to any really solid result that I think is reproducible over a timescale longer than a few minutes, I'll try to get enough data up here that someone else can verify it.

Reithan changed discussion status to closed

Hey @mradermacher, I appreciate the replies and your experience on the subject.

IQ4_XS is lower quality than Q4_K_M, and imatrix quants are higher quality than the corresponding non-imatrix quants. If you get this behaviour only with the imatrix quant and it is real, then most likely what you are seeing is the real model performance. Performance always degrades near the maximum context length, and when you reduce context, you force kobold to reprocess the whole context and regenerate the response (and use a shorter context).

As for what you would have to show, well, everything you do is subjective, and you use different methods for the two quants. You'd have to find an objective method to compare them. Again, people regularly and with high confidence conclude that some quant is much better than some other quant, even when it's the same quant and they just don't know it. LLMs are inherently random, and from what you describe, your testing works like this: run until you dislike the result, then regenerate or try with another quant. That's guaranteed to give you this impression even if you compared a quant to itself.

It's very unlikely that the effect you describe really exists - there is no known mechanism that would actually damage context length in the way you describe, and more importantly, there are practically no configurable knobs, as I have explained - I practically couldn't break it even if I wanted to. The only possibility would be a bug in llama.cpp that somehow modified the model so as to reduce context length through numerical errors.

However, we are all human, and if you have the feeling that you like Steelskull's quant better, then go for it. I also provide both static and imatrix quants, because some people think that static quants "feel" better (and, well, my imatrix training set emphasizes English over other languages).

Echoing @Steelskull, definitely thank you very much for the replies, @mradermacher - you clearly have a ton of experience on the subject.
I think you're 100% correct in your take.
The only small correction I have is:

from what you describe, your testing works like this: run until you dislike the result, then regenerate or try with another quant. That's guaranteed to give you this impression even if you compared a quant to itself.

That's not exactly correct. I started by running until I disliked the result. But after that I didn't just regenerate the full result or anything.
I would take the exact context/settings/inputs that were consistently giving me a result I didn't like, and start sending them to different models/quants to see if any of them performed better with the same inputs.

I think you're still right that this is very subject to RNG and to my own flawed/biased/subjective interpretation of the results. It's also quite likely that my prompt is just not well formed and the model's doing the best it can.

What you describe sounds, to me, exactly like what I wrote. Hmm. The key point I was making is that you take the existing context and give it to another model - even if the new model were very bad, you are likely to get a better result, because the average result from another LLM might still be better than a bad result from a better LLM.

There might still be an issue with the quants, but at least for the static quants, I'll only believe that when it's proven, especially given the metadata is the same, leaving only the tensor data. The imatrix quants might show different behaviour, but even an imatrix computed from random tokens gives objectively better results (e.g. when testing with KL divergence) than static quants in all tests I have seen. The only remaining possibility here would be a bug introduced between when I built my llama.cpp and when Steelskull built theirs (or vice versa), causing tensor data to be miscalculated.
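(For context, the KL-divergence check is conceptually just the following, computed per token position against a reference model such as the f16 or Q8_0 and averaged over a test set; lower means closer to the reference. Variable names are only illustrative:)

```python
import numpy as np

def token_kl(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """KL(ref || quant) for one token position, from raw logits over the vocab."""
    ref = np.exp(ref_logits - ref_logits.max())
    ref /= ref.sum()
    quant = np.exp(quant_logits - quant_logits.max())
    quant /= quant.sum()
    return float(np.sum(ref * (np.log(ref + 1e-12) - np.log(quant + 1e-12))))

# Averaged over many positions of a held-out text, this gives one number per
# quant; a genuinely broken quant would be expected to stand out with a much
# higher mean KL than its sibling.
```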

I think I see what you mean. I basically fell into a single bad case/context with the initial model, so testing others - even if they are overall worse/same/better - can, based on RNG, give a better result, even though the original model would produce better results in the average use case. You're absolutely right.

Thanks again for the responses.
