Qwen3 is great, but could be better.

#18
by phil111 - opened

Firstly, the Qwen3 family is notably better than other similarly sized models when it comes to coding, math, reasoning, general STEM knowledge, and several other tasks.

However, it has some very notable weaknesses, such as poem writing. Most notably, it has virtually no popular knowledge of things like music, movies, and the rest of pop culture.

Yes, for simple pop culture question answering, traditional RAG and web searching work great, but they help no more with tasks like writing original poems, jokes, and apt metaphors in stories than they do with solving math or coding problems. If anything, the former are more complex and nuanced. Yet the entire AI industry is becoming obsessed with coding and STEM, which isn't surprising since it's mostly comprised of coders, and the same goes for the early adopters of open source models.

Anyways, since popular knowledge is random, extensive, and hard to accurately store in LLM weights (such as the song lists of albums and the casts of shows linked to their respective actors), why not fuse the weights with a small database containing only core facts about humanity's most popular information, then populate the thinking tag with the relevant information?

I tested this by manually populating the thinking tags with core information about a show (Corner Gas) after the model got the entire cast wrong, and, as with humans, this core knowledge dredged up related knowledge: the model subsequently said the gas station and the diner were owned by two different characters rather than the same character.

So if it works manually, why couldn't an LLM automatically populate the thinking tags with core information from a fused database (e.g. main cast and plot) whenever the user's prompt references a show, movie, singer, game character, sporting event, or other pop culture subject? That would not only jog its memory but also keep it from hallucinating said core facts, including the main characters of movies seen by countless millions of people.
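
A minimal sketch of what this could look like, assuming a tiny fused fact table and a pre-filled thinking block (the `CORE_FACTS` entries, the substring matching, and the `<think>` injection format below are all illustrative assumptions, not an existing Qwen API):

```python
# Hypothetical sketch: look up core facts about a recognized pop culture
# entity and seed them into the model's thinking block before generation.

# A tiny stand-in for the proposed fused database of core facts.
CORE_FACTS = {
    "corner gas": (
        "Corner Gas: Canadian sitcom (2004-2009) set in Dog River, Saskatchewan. "
        "Brent Leroy (Brent Butt) owns the gas station; Lacey Burrows "
        "(Gabrielle Miller) owns The Ruby diner; with Hank (Fred Ewanuick), "
        "Oscar (Eric Peterson), Emma (Janet Wright), Davis (Lorne Cardinal), "
        "and Karen (Tara Spencer-Nairn)."
    ),
}

def build_prompt(user_prompt: str) -> str:
    """Prepend any matched core facts inside the thinking block so the
    model reasons from grounded facts instead of hallucinated ones."""
    facts = [v for k, v in CORE_FACTS.items() if k in user_prompt.lower()]
    if not facts:
        return user_prompt  # no known entity referenced; generate as usual
    seed = "Relevant core facts:\n" + "\n".join(facts)
    # The exact injection point depends on the chat template; here we just
    # pre-fill the start of the assistant's <think> block.
    return f"{user_prompt}\n<think>\n{seed}\n"

print(build_prompt("Who owns the diner in Corner Gas, and who owns the gas station?"))
```

The lookup is cheap and deterministic, so the facts land in the model's working memory before it starts reasoning, exactly like the manual test above.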

Also, since it would be a simple database of facts stripped of sentence structure and other ambiguity, the knowledge should prove beneficial across all supported languages.

Anyways, Qwen3 is great, but it's way out of balance and unusable as a general purpose AI model for the general non-coder population. Ending training with more creative and pop culture tokens like poetry and humor will almost certainly lower its coding, math, and STEM test scores a bit, but I strongly feel it's necessary. Plus, pulling the core facts about the subject referenced by the user's prompt out of an included database and putting them into the working memory of the LLM should drastically reduce the hallucination rate, which frankly is out of control. When it comes to numerous popular domains of knowledge, Qwen3 30 & 32b are only on par with 1b models.

Man, I'm not being critical or anything, but I've seen your posts many times, and they're all about a new model knowing little about pop culture...

Yeah, he keeps running the same set of his own test questions on new models. It's not about failing this specific test, though. It's about how these new models share a similar shift in behavior, resulting in these same specific test questions failing, which comes with disadvantages like those already mentioned above.

I know it gets repetitive, but when you think you have something important to say, there isn't a better option than to repeat it again and again... He's already given various examples before, so he's not wrong; you just have to decide for yourself whether you think that information is necessary for making a good language model. I personally agree.

@CHNtentes I totally get it. My repetitiveness is probably annoying me more than anyone else. But like nplguy said, what choice do I have? The entire OS AI industry has shifted towards maximizing coding and math performance, including Meta with Llama 4.

And sure, this comes with advantages. Qwen3 30b-3b did better on my coding, STEM, reasoning, and math questions than any OS model I've tested, including Gemma 3 27b, Mistral Small, and Llama 3.1 70b. Plus, the instruction following is the best yet, and story writing is OK, though worse than Mistral Small's and Gemma 3 27b's.

However, it performs astonishingly badly at a large number of things the general population cares most about. For example, it's called pop culture because it's POPULAR. Plus, common use cases like poem writing are abysmal. It can write OK poems that it regurgitates, but when prompted for original poems, they lack depth and eloquence, don't even begin to respect meter, and about half the lines break the rhyming scheme.

The primary reason I feel compelled to make these points over and over is that I want what's best for the open source AI community, including broad adoption by the general population, and these lopsided models with >10x the STEM performance relative to their pop culture and creative performance won't allow that to happen. They need more balanced training. Plus, the hallucinations about very popular things (e.g. the most popular movies, shows, songs, games, celebrities...) need to be reduced by a couple orders of magnitude, and not with simple RAG and web access, which only really help with simple question answering; hence the need for a dense database containing only core facts (e.g. main cast and plot) to keep LLMs from constantly falling off the rails and vomiting hallucinations about very popular things.

"culture" is for the weak. Both humans and AI have work to get done.

I totally agree that you have the right to evaluate models by your own approach, but it seems the reality now is that most model providers and users emphasize STEM benchmark scores, especially coding ability. Those are easier to compare and advertise.

I think I found another weak point... OK, so I gave it a not-too-long story, maybe 4000 tokens, and asked it for a rating. It answered something like 9.5/10... great story... blah blah blah, and pointed at what could be improved, this and that... So I told it, "Can you write this story better?" It answered, "Certainly! Here's a revised version of your story..." Then it told me what was changed, but when I looked at it, it was exactly the same story, lol. I couldn't find any changes.
Even funnier, I gave this "new" story to ChatGPT 4o after also asking it for a rating of the "original" one. I asked ChatGPT 4o whether the "new" version of the story was better. It said YES, definitely, and started to list what was "improved". But those "improved" parts were already there...
I don't even know what to call this. Hallucinations?

@urtuuuu Yep, I've noticed this same issue across most LLMs. For example, one of my prompts provides a limerick that doesn't rhyme and asks the model to make it rhyme while preserving its meaning, yet most LLMs just repeat it back word for word, or make a couple of minor changes, and then claim success. Even the dumbest humans who have ever lived would never do such a thing.

Llama 3.1 8 & 70b did a good job, as did Qwen2 72b, but Qwen2.5 did poorly. Gemma 2/3 only made minor mistakes, while Mistral Small 2402 did poorly but at least tried, and its successor Mistral Small 2501 just repeated it back word for word. So there's a pattern of regression on such tasks as models were being overfit on coding and math.

This is because none of these models are "thinking"; they haven't gained a single generalized IQ point from being trained on mountains of coding, math, and logic tokens. They need to be trained on a diverse set of instructions.

I agree, it fundamentally lacks world knowledge, at least for European and American cultures. https://x.com/cedric_chee/status/1917218568263053448

I've also made a post about this on the HF page for the 32B. Qwen 3 is really good otherwise, but its factual knowledge is disproportionately bad.

Perhaps they trained it on too much Chinese knowledge? Or perhaps 100 languages is just way too many, and they should instead focus on the most important languages.

@Dampfinchen Your conclusion in the SimpleQA field, "~2x the dataset size but deepfried world knowledge", sums it up pretty well. A notable boost in code, math, and overall STEM abilities is nice, but it's no longer a general purpose AI model when >95 out of 100 responses about humanity's most popular and widely known information are hallucinations.

On my absurdly easy broad knowledge test, which GPT4o scores 100/100 on, Qwen3 30b-3b scores below 2b models, and far below its own predecessor Qwen2 7b, despite scoring higher than Llama 3.1 70b on STEM.

42.3 Qwen3 30b-3b
45.7 Gemma 2 2b
54.1 Qwen2 7b
62.1 Llama 3.2 3b
69.7 Llama 3.1 8b
82.9 Mistral Small 2409 22b
88.5 Llama 3.1 70b

The paranoid part of my brain is beginning to think this is deliberate sabotage of western open source AI models.

That is, boost English test scores like the MMLU & GPQA as high as possible by selectively training on the small subset of popular English knowledge covered by them, causing the narrow-minded herd of coding AI first adopters on X to praise it as superior, putting pressure on western LLM makers like Llama and Mistral to also grossly overfit those tests in order to appear competitive, and in the end leaving all western open source AI models one-sided piles of useless crap that will never succeed with the general western population. I don't actually believe this. But regardless, this is the path we're heading down.

This is a 3B model. Use RAG. That'll always be best, regardless of what the model knows already.

Also, this is a dupe of #6.

This is a Qwen 3 issue, not a parameter size issue. Even Gemma 3 4B does a better job in my testing.

Also, RAG is complicated, and I haven't found a good solution for this kind of task yet. Plus, many UIs don't support it, or barely do, especially on the mobile side.

@urroxyz Firstly, this IS NOT a 3b model. It's a 30b model. All MoEs, like Mixtral 8x7b, have the information density of their total parameters. For example, Mixtral scores 86.7/100 on the aforementioned knowledge test despite its low active parameter count. Additionally, Qwen3 32b dense scores the same as Qwen3 30b-3b.

The entire Qwen3 family is the most ignorant set of LLMs I've ever tested at a given parameter count. This goes way beyond just overfitting the tests and the handful of tasks coding first adopters care about.

Qwen3 can recall very esoteric STEM knowledge with near-perfect accuracy, yet hallucinates about nearly everything that isn't academic, aside from the top 10 shows, movies, games.... For example, it can list the main cast of Friends, a show watched by billions. Yet it still hallucinates like crazy if you ask about less prominent characters, such as the main characters' parents (e.g. Ross & Monica's parents), cameos by A-list celebrities, relationships between the characters, and so on. Even 1b models can generally do a better job at this.

Then use them, or create your own.

Some (OK, many) of us don't need or want a model that gives a crap about TV shows. It's about work, and we want it to excel at that. The continued bashing over decisions made on direction is ludicrous. They chose to go the STEM route. There is NOTHING wrong with that. No single model is going to meet all use cases unless it underperforms in all of them; THEN I'd agree that it's useless. For value, targeting is the way to go. If you want one that excels at your silly-ass porn roleplay instead of hard science or coding, go train your own.

Expert models can match or exceed dense models on certain narrow tasks by allocating the most relevant experts, sure, but they still struggle to coordinate across experts when confronted with multi-faceted, creative queries.

“Pop culture” is not exactly its own concept to a mixed AI, but rather a combination of many.

If you're going to rely on generalized LLMs for data that already exists in fine quality within their very datasets, it seems to me that you're using AI wrongly.

Why should a company waste model parameters on information that can be provided by the user? AI literally exists not to recite existing knowledge but to put it together and provide new, reasonable output.

Lots of culture is active and ongoing, too, and cannot be updated within a stateless model, so there's minimal point.

People use AI for different things.

The fact of the matter is that this is a general purpose model, not Qwen 3 Coder. It's not just a little worse at factual knowledge than its competitors; it's far worse, and IMO, for a model as popular as Qwen 3, that is something that needs to be improved drastically in the next release. Of course Qwen has a focus on STEM, and that's perfectly fine, but other models do too and are not this bad at factual knowledge. I think something is wrong here, and Kalomaze noticed that some experts are not used at all:

https://x.com/kalomaze/status/1918238263330148487

Perhaps it has something to do with that. Perhaps those experts are responsible for world knowledge, and something went wrong with the routing to them. Qwen 3 is a really smart model and fantastic at coding, reasoning, and all that stuff, so it doesn't make sense that it's worse than the smallest Gemma 3 models at factual knowledge.
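
For what it's worth, expert utilization is straightforward to measure from the router alone. Below is a toy sketch of standard top-k MoE routing that counts how often each expert is selected; the shapes loosely follow Qwen3-30B-A3B's reported 128 experts with 8 active per token, but the random router weights and inputs are purely illustrative, not the actual model:

```python
# Toy sketch: detect "dead" experts by counting how often a top-k router
# selects each one over a batch of tokens. Shapes loosely follow
# Qwen3-30B-A3B (128 experts, 8 active); weights and inputs are random.
import torch

num_experts, top_k, d_model = 128, 8, 64
router = torch.nn.Linear(d_model, num_experts, bias=False)

hidden = torch.randn(10_000, d_model)                # stand-in for real hidden states
chosen = router(hidden).topk(top_k, dim=-1).indices  # experts picked per token

counts = torch.bincount(chosen.flatten(), minlength=num_experts)
print(f"never-selected experts: {(counts == 0).sum().item()}/{num_experts}")
print(f"least/most selected: {counts.min().item()} / {counts.max().item()} tokens")
```

Running the same count against the real router weights over a representative token sample would make the kind of skew Kalomaze reported immediately visible.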

Either that, or the training data doesn't include enough American/European facts, though given they trained it on 36T tokens, I doubt that. Or perhaps 100 languages are just too many.

@ZiggyS You're only proving my point. For example, you're creating a red herring by providing only one dismissive example ("silly-ass porn roleplay"). I've never once tested roleplaying, let alone porn roleplaying. That's not what any of this is about. It's about creative tasks like poem writing, humor..., and doing them with adherence to popular topics, which requires broad popular knowledge.

Alibaba is promoting Qwen3 as a general purpose AI model to compete with the likes of Llama, Mistral, and Gemma, and didn't add "coder" or "math" to the name like they've done in the past when making a model grossly overfit to a specific set of tasks. They really need to rename Qwen3 to something like Qwen3-STEM.

I of course fully understand your desire to have the best possible coder and problem solver at a given model size. In fact, my formal education is in STEM, and I have little interest in creative tasks like poetry. But your nonsensical use of a red herring to dismiss the inclusion of the creative tasks like poetry and the broad popular knowledge that the general population cares about more than STEM is frankly idiotic. It also proves my point that the first-adopter community is filled with narrow-minded autistic coding nerds who are ill-suited to the task of reviewing AI models.

For it to be effective, it has to be targeted. That is the way of things. No one wants a huge 'do it all, but not well' sort of system. We all want 'good' results, without extra bloat for things that are not part of the use case. Don't like STEM targeting? Great, it's not your use case, so find another model that fits your needs. Or again, make your own; no one is stopping you, and there is room for everyone to do their own thing (either from scratch if you have the resources, or by adding to an existing one you like).

My issue here is all the unwarranted bashing of them for choosing to go in "direction A" with their models. No one is forcing anyone to use their freely given-away work that cost them millions to create. It's sort of like someone offering a free chicken sandwich because they have a chicken farm, and hearing "well, I prefer beef, so they suck." Entitlement attitudes suck, and the hate towards them is totally unwarranted.

And while I agree they did not append 'STEM' or whatever to this model's name to help lazy people identify it, their direction is totally transparent in their documentation.

@ZiggyS The vast majority of the general population clearly does want a balanced AI model. This is why GPT4o and Gemini 2.5 are by far the two most widely used models; they both have very broad abilities and knowledge. You're projecting your own desires onto the general population.

Additionally, the Chinese models do have broad Chinese abilities and knowledge (e.g. high Chinese SimpleQA scores). So they're only drastically overfitting the western test sets (e.g. the English MMLU), boosting those scores as high as possible with the fewest possible non-Chinese training tokens and the least compute.

And I don't want to come across like I dislike or hate Alibaba. I genuinely want them to succeed. This is why I acknowledged things like this being the most powerful model I've ever tested at this speed, and that it outperformed Llama 3.1 70b at STEM. And the improvements they claimed this model makes over their previous releases (coding, math, instruction following, story writing...) are genuine and accurate.

And as far as being ungrateful goes: I don't personally use any of Qwen's models. I just don't want them to drastically overfit STEM, then have narrow-minded coding first adopters praise them for it, putting pressure on other model makers to also overfit in order to appear competitive, leading to a descent into broad incompetence (better at STEM, worse at everything else).
