---
license: apache-2.0
tags:
- moe
- llm
- efficient-inference
pipeline_tag: text-generation
---

# TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice

## Model Description

TC-MoE is a Mixture-of-Experts (MoE) architecture that expands the expert space of a conventional MoE model: each original expert is paired with the ternary set {-1, 0, 1}, yielding parameter-sharing variants among which the router chooses. Compared to standard Top-K routing, TC-MoE achieves:

- **9% reduction** in the number of activated experts
- **1.1% average performance gain** on language understanding benchmarks
- A flexible efficiency-effectiveness trade-off via a reward mechanism

Key innovations:

- 🎯 **Ternary Expert Expansion**: creates parameter-sharing expert variants (-1, 0, +1) with negligible computational overhead (see the routing sketch at the end of this card)
- ⚖️ **Adaptive Load Balancing**: a load-balance loss tailored to the expanded expert space
- 🎮 **Reward-Driven Routing**: dynamic control of the expert activation ratio

## Model Overview

- **Architecture**: decoder-only transformer based on LLaMA
- **Pretraining Data**: RedPajama (100B tokens)
- **Model Sizes**: 681M and 2.3B parameters

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("stiger1000/TC-MoE")
tokenizer = AutoTokenizer.from_pretrained("stiger1000/TC-MoE")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

## Training Details

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95)
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: 4M tokens
- **Loss Components** (combined as in the sketch at the end of this card):
  - Language Modeling Loss
  - Load Balance Loss (α₁=0.01)
  - Reward Loss (α₂=0.0)

## Citation

```bibtex
@inproceedings{yan2025tcmoe,
  title={TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice},
  author={Yan, Shen and Bin, Xingyan and Zhang, Sijun and Wang, Yisen and Lin, Zhouchen},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```

📚 **Repository**: [GitHub](https://github.com/stiger1000/TC-MoE)
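To make the ternary expansion concrete, here is a minimal, self-contained sketch of routing over the expanded expert space. It is illustrative only, not the official implementation: the class name `TernaryChoiceMoE`, the expert layer shapes, and the softmax-over-3N-slots routing are assumptions for exposition; consult the repository for the actual code. The point it demonstrates is that a slot carrying the multiplier 0 skips its expert entirely, which is how activated compute can drop below plain Top-K routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryChoiceMoE(nn.Module):
    """Illustrative sketch (NOT the official TC-MoE implementation).

    Each of the N base experts is expanded into three parameter-sharing
    variants that scale its output by -1, 0, or +1. The router scores all
    3N expanded slots; a token routed to a 0-variant pays no expert compute.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # One logit per expanded slot: (expert i, scale s) for s in {-1, 0, +1}.
        self.router = nn.Linear(d_model, num_experts * 3)
        self.register_buffer("scales", torch.tensor([-1.0, 0.0, 1.0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)            # (tokens, 3N)
        top_g, top_idx = gates.topk(self.top_k, dim=-1)
        top_g = top_g / top_g.sum(dim=-1, keepdim=True)      # renormalize gates

        expert_idx = top_idx // 3                            # which base expert
        scale = self.scales[top_idx % 3]                     # ternary multiplier

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                # Skip the 0-variants: they contribute nothing and cost nothing.
                mask = (expert_idx[:, k] == e) & (scale[:, k] != 0)
                if mask.any():
                    y = self.experts[e](x[mask])
                    out[mask] += (top_g[mask, k] * scale[mask, k]).unsqueeze(-1) * y
        return out
```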
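The three loss components listed under Training Details combine as a weighted sum. A one-line sketch with the coefficients reported above (the function and argument names are illustrative):

```python
def tc_moe_total_loss(lm_loss, balance_loss, reward_loss, alpha1=0.01, alpha2=0.0):
    # alpha2 = 0.0 disables the reward term; adjusting it moves the model
    # along the efficiency-effectiveness trade-off described above.
    return lm_loss + alpha1 * balance_loss + alpha2 * reward_loss
```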