Brainstorming

#6
by Downtown-Case - opened

I have thoughts:

  • If not already done, could you reconfigure the router to 'route away' from the pruned experts? For instance, if a token would normally be routed to expert 72, route it to some other expert instead?

  • Instead of removing experts... merge them. For instance, if experts 84, 85, and 86 are infrequently used, merge them together and route everything that would normally go to one of them to the merged expert (see the sketch after this list). A merging technique that works well without a base model, like the Karcher mean, is probably ideal here: https://github.com/arcee-ai/mergekit/pull/546

  • I'm unsure of the optimal way to do this; perhaps experts could be grouped for merging by "similarity", based on their activation scores for your tested tokens?
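
Purely as an illustration of the merge-and-remap idea above: a minimal sketch assuming each expert is a same-architecture PyTorch FFN module and that per-expert activation counts are available from a profiling run. The naive activation-weighted average is just a stand-in for the Karcher mean from the mergekit PR, and the names (`merge_experts`, `counts`, `remap`) are invented for the example.

```python
import torch

def merge_experts(experts, counts, group):
    """Merge a group of rarely used experts into a single expert.

    experts: list of same-architecture FFN expert modules
    counts:  per-expert activation counts from a profiling run
    group:   indices of the experts to merge, e.g. [84, 85, 86]
    """
    # Weight each expert by how often it actually fired.
    w = torch.tensor([float(counts[i]) for i in group])
    w = w / w.sum()

    # Naive activation-weighted average of parameters; the Karcher mean
    # from the mergekit PR would be a drop-in replacement here.
    states = [experts[i].state_dict() for i in group]
    merged_state = {}
    for name in states[0]:
        stacked = torch.stack([s[name].float() for s in states])
        shape = (-1,) + (1,) * (stacked.dim() - 1)
        merged_state[name] = (w.view(shape) * stacked).sum(dim=0)

    merged = experts[group[0]]          # reuse the first module as the container
    merged.load_state_dict(merged_state)

    # Router remap: every expert id in the group now points at the merged one.
    remap = {i: group[0] for i in group}
    return merged, remap
```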

And even if you don't do merging, maybe the rerouting could be configured more optimally this way (rough sketch below).
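
A rough sketch of what 'routing away' could look like, assuming a standard top-k softmax gate like the one Qwen3's MoE blocks use; `route_away` and `pruned_ids` are invented names, and in practice this masking would sit inside the model's gating forward pass.

```python
import torch

def route_away(router_logits, pruned_ids, top_k=8):
    """Select top-k experts per token while never picking pruned ones.

    router_logits: [num_tokens, num_experts] raw gate scores
    pruned_ids:    indices of experts that were removed
    """
    masked = router_logits.clone()
    masked[:, list(pruned_ids)] = float("-inf")   # pruned experts can never win top-k

    top_vals, top_idx = masked.topk(top_k, dim=-1)
    top_probs = torch.softmax(top_vals, dim=-1)   # probability mass redistributed over survivors
    return top_idx, top_probs

# Example: 4 tokens, 128 experts, with experts 72, 84, 85 pruned.
logits = torch.randn(4, 128)
idx, probs = route_away(logits, pruned_ids={72, 84, 85}, top_k=8)
```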

I'd love to have the Qwen team's view on this. I find it somewhat hard to believe they haven't evaluated their experts to ensure they aren't redundant or useless. I'm also wondering whether the low occurrence probability some experts display isn't simply because they were trained on data/use-cases that aren't represented in kalomaze's evaluation dataset. Qwen emphasises that Qwen3 supports "119 languages and dialects", so I'm really wondering if some experts aren't dedicated to those specifically.

Maybe I've missed the article, but some transparency from the Qwen team about the purpose of each expert would help us understand the likely outcome of removing them (or merging them, as you've suggested). At the moment their specialisations could still be reverse-engineered by logging and analysing routing patterns, but that's hard to do if we don't have the right data to trigger each one of them consistently in the first place. I'd also love the option of a flexible model for which I can dynamically decide to prune the experts I'm unlikely to use for my use case.

I had a brainstorm thought too about how this kind of expert activation probability could be used to optimize quantization. Not directly related to pruning, but what if we quantized the model expert-wise based on activation %: more bits for the commonly used experts, and fewer bits (as low as 1-2 bits?) for those that are very rarely used, instead of uniformly applying the same precision across all the experts?
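
As a toy illustration of that idea, here's one way a per-expert bit-width plan could be derived from measured activation frequencies. The thresholds are placeholder guesses that would need tuning against perplexity on a calibration set, and `assign_expert_bits` is a made-up helper, not part of llama.cpp or any existing quantization tool.

```python
def assign_expert_bits(activation_freq,
                       thresholds=((0.05, 8), (0.01, 4), (0.001, 3)),
                       floor_bits=2):
    """Map each expert's activation frequency to a quantization bit-width.

    activation_freq: dict {expert_id: fraction of tokens routed to it}
    thresholds:      (min_frequency, bits) pairs, checked in order
    floor_bits:      bits for the very rarely used experts
    """
    plan = {}
    for expert_id, freq in activation_freq.items():
        for min_freq, bits in thresholds:
            if freq >= min_freq:
                plan[expert_id] = bits
                break
        else:
            plan[expert_id] = floor_bits
    return plan


# Example: expert 7 is hot, expert 93 almost never fires.
print(assign_expert_bits({7: 0.12, 42: 0.02, 93: 0.0004}))
# -> {7: 8, 42: 4, 93: 2}
```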

ChatGPT has entered the chat - https://chatgpt.com/share/681db805-d550-800f-aca1-413283f9f9d2

I like the LoRA idea, so that experts can just be plug and play depending on the user's use case.

I tried to implement this here; it's a bit janky as a concept, but it seems to work:
https://huggingface.co/RDson/Qwen3-30B-A3B-By-Expert-Quantization-GGUF
