Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.

jukofyork pinned discussion

What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot in its training data.

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on OpenRouter and I was blown away.

What model comes anywhere close that is usable on a 24GB VRAM machine with 32GB of RAM, in your experience?

There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens
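
For anyone squinting at those numbers, the ms/token and tokens/second figures are just reciprocals of each other, and it's the ~220 ms per generated token that makes long outputs painful. Quick sanity check in plain Python using the figures from the log above:

```python
# Sanity-check the llama.cpp timing log above: latency and throughput are reciprocals.
prompt_ms_per_tok = 14026.61 / 918     # ~15.3 ms/token for prompt processing
eval_ms_per_tok   = 398806.12 / 1807   # ~220.7 ms/token while generating

print(1000 / prompt_ms_per_tok)            # ~65.4 tokens/second (prompt eval)
print(1000 / eval_ms_per_tok)              # ~4.5 tokens/second (generation)
print(1807 * eval_ms_per_tok / 1000 / 60)  # ~6.6 minutes for the 1807-token reply
```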

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. It just kind of agrees with itself and ends up writing how it usually would:

“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol

Ahhh, that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

I get the same thing with it fixating on typos or lazy slang lol

I get the same thing with it fixating on typos or lazy slang lol

https://old.reddit.com/r/LocalLLaMA/comments/1kdrx3b/i_am_probably_late_to_the_party/mqdd8sm/

The first time I set up QwQ I was in the middle of watching For All Mankind and I tested it with “Say hello Bob”.

This caused it to nearly have an existential crisis. 2k tokens of “did the user make a typo and forget a comma? Am I bob? Wait a minute, who am I? What if the user and I are both named bob.”

This rings true!

I asked it this:

Create a single HTML file containing CSS and JavaScript to generate an animated weather card. The card should visually represent the following weather conditions with distinct animations:
Wind: (e.g., moving clouds, swaying trees, or wind lines)
Rain: (e.g., falling raindrops, puddles forming)
Sun: (e.g., shining rays, bright background)
Snow: (e.g., falling snowflakes, snow accumulating)
Show all the weather card side by side. The card should have a dark background. Provide all the HTML, CSS, and JavaScript code within this single file. The JavaScript should include a way to switch between the different weather conditions (e.g., a function or a set of buttons) to demonstrate the animations for each.

https://old.reddit.com/r/LocalLLaMA/comments/1jiqi81/new_deepseek_v3_vs_r1_first_is_v3/mjh4r77/

and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"

:D

[attached image: image.png]
Look at that... Please tell me I've accidentally put the temperature way down with Nemotron. PLEASE. 90 fucking % for a name starting with K. What the actual fuck?
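
For anyone wondering why temperature is the first suspect: the sampler divides the logits by the temperature before the softmax, so a low value sharpens the distribution and can easily push one token to ~90%. A minimal sketch with made-up logits (illustrative only, nothing to do with Nemotron's actual numbers):

```python
import math

def softmax_with_temperature(logits, temp):
    """Divide logits by temp, then softmax. Lower temp -> sharper distribution."""
    scaled = [x / temp for x in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for five candidate name tokens.
logits = [2.0, 1.5, 1.2, 0.8, 0.3]

print(softmax_with_temperature(logits, 1.0))  # top token ~39% -- fairly spread out
print(softmax_with_temperature(logits, 0.3))  # top token ~78% -- one name dominates
```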

I don't really like these new models. Hours downloading and converting just to find out they are shit. Nemotron... I think you've seen enough with that picture. Arenamaxxed and benchmaxxed af. Big Qwen being dumb did not get fully solved by using the unquantized version. It can follow the instruction not to use certain words, but it still has terrible knowledge. Why run Qwen when you can run DeepSeek?

Nemotron

Oh that model, yeah I gave up on it pretty fast. Too slow for work, too cliche for writing. Did you see the Gemma3-style safety training?

https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset/blob/main/SFT/safety/safety.jsonl

I don't really like these new models. Hours downloading and converting just to find out they are shit.

Yeah I love DeepSeek but this 8 t/s MoE trend is annoying. Looks like they might have screwed it up?

https://xcancel.com/kalomaze/status/1918238263330148487

Expert 38 activated 0.00% of the time. Why are they bloating these things for nothing?
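
For anyone not following what that stat means: MoE routers pick the top-k experts for every token, so "activated 0.00% of the time" means that expert never made it into any token's top-k over the sample. A rough sketch of how you'd measure it from router outputs (toy data here; for a real model you'd hook each MoE layer's gate logits, and the 128-expert / 8-active numbers are just illustrative):

```python
import numpy as np

def expert_activation_rates(router_logits: np.ndarray, top_k: int = 8) -> np.ndarray:
    """Fraction of tokens for which each expert lands in the router's top-k."""
    n_tokens, n_experts = router_logits.shape
    top_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]  # chosen experts per token
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    return counts / n_tokens

# Toy example: 10k tokens routed over 128 experts (real data would come from
# the model's gate/router outputs, captured with forward hooks).
rng = np.random.default_rng(0)
rates = expert_activation_rates(rng.normal(size=(10_000, 128)))
print(rates.min(), rates.mean(), rates.max())  # a dead expert would show up as ~0.0
```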

and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"

I got a similar 4k-token time waste by saying 'gh pr' where it was trying to decide if I really meant "github" and "pull request"

Mixed feelings about R1T. While it is thinking enough for RP/writing, it is not thinking hard enough to chew through complex problems, which R1 can solve. Not an R1 replacement, just a sidegrade for speed.

Mixed feelings about R1T. While it is thinking enough for RP/writing, it is not thinking hard enough to chew through complex problems, which R1 can solve. Not an R1 replacement, just a sidegrade for speed.

Yeah, I found it was too terse compared to R1 too.

and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"

I got a similar 4k-token time waste by saying 'gh pr' where it was trying to decide if I really meant "github" and "pull request"

I'm testing out the changes suggested here (for qwq, but seems to work well for qwen3 too):

https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/qwq-32b-how-to-run-effectively#tutorial-how-to-run-qwq-32b-in-llama.cpp

and so far it doesn't seem quite as terrible.

Mixed feelings about R1T. While it is thinking enough for RP/writing, it is not thinking hard enough to chew through complex problems, which R1 can solve. Not an R1 replacement, just a sidegrade for speed.

I wasn't able to test it as I can't run it in FP8 and only found a BF16 GGUF. I would have been tempted to run it for a faster R1, but R1 is much faster after PR 13306.

I'm testing out the changes suggested here (for qwq, but seems to work well for qwen3 too)

Guess I'll give it another try. I was using min_p=0.05 (in general I prefer to use min_p)
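
For reference, these samplers can all be set per request against a local llama.cpp server. A minimal sketch, assuming a server is already running on the default port: the values are roughly what that guide recommends for QwQ (temp 0.6, top_p 0.95, top_k 40, min_p 0; double-check the linked page) and the prompt is just a placeholder. Worth noting that min_p drops any token whose probability is below min_p times the top token's probability, so 0.05 prunes a lot harder than 0.

```python
import requests

# Sampler settings roughly matching the QwQ recommendations linked above
# (temperature 0.6, top_p 0.95, top_k 40, min_p 0) -- verify against the guide.
payload = {
    "prompt": "Write the opening paragraph of a short story.",  # placeholder prompt
    "n_predict": 512,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.0,   # min_p filters tokens with prob < min_p * p(top token)
}

# Assumes llama-server is already running locally on its default port (8080).
resp = requests.post("http://127.0.0.1:8080/completion", json=payload)
print(resp.json()["content"])
```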
