erax committed (verified)
Commit 31d0f30 · 1 Parent(s): 1c479d3

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -55,7 +55,7 @@ We're excited to share the code and rather raw model – refined with insights f
 
  * **Dimension Safety Checks:** All transfers include dimension checks to handle potential mismatches:
 
- * **Experience the Future:** Reload and rigorously test the newly architected model, unlocking its potential. Use standard 🤗 Transformers without any PR required. This model has **50% more capacity, up to 13.2B params**, and uses **much less memory for long context, up to 256k on 8xH100**, yet only approximately **9B** params are activated during training and inference, which is close to the original LLaMA 3.1 8B. Training is expected to be a bit slower (10 - 15%) due to the overhead of both MLA and MoE.
+ * **Experience the Future:** Reload and rigorously test the newly architected model, unlocking its potential. Use standard 🤗 Transformers without any PR required. This model has **50% more capacity, up to 13.2B params**, and uses **much less memory for long context, up to 256k on 8xH100**, yet only approximately **9B** params are activated during full training (or **only 5.2B** if freezing the original layers) and inference, which is close to the original LLaMA 3.1 8B. Training is expected to be a bit slower (10 - 15%) due to the overhead of both MLA and MoE.
 
  * **Unlock New Frontiers:** Leverage our continual pretraining code, powered by FSDP (or DDP with BitsAndBytes 8-bit optimization), to push the boundaries of model performance. Our code freezes all original layers and continually pretrains only the new ones. You will need to continually pretrain the new model on a 25G - 40G multilingual, multi-domain corpus, followed by roughly 100k finetuning examples (or distilling from DeepSeek R1) and some serious GRPO, to unlock the full power of this new model and retain most of LLaMA-3.1 8B's world knowledge.
 
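
For readers of the **Dimension Safety Checks** bullet above, here is a minimal sketch of what such a check can look like when copying LLaMA-3.1 weights into the re-architected model. The helper name `safe_copy` and the commented usage loop are illustrative assumptions, not the repository's actual transfer code or state-dict keys.

```python
import torch

def safe_copy(dst: torch.Tensor, src: torch.Tensor, name: str = "") -> None:
    """Copy src into dst only when their shapes match; otherwise report the mismatch."""
    if dst.shape != src.shape:
        raise ValueError(
            f"Dimension mismatch for {name!r}: destination {tuple(dst.shape)} "
            f"vs. source {tuple(src.shape)} -- skipping unsafe transfer."
        )
    with torch.no_grad():
        dst.copy_(src)

# Hypothetical usage: transfer only the weights whose shapes line up.
# `old_state` and `new_model` stand in for the LLaMA-3.1 8B checkpoint and the
# re-architected model; the key names below are placeholders.
# for key, tensor in old_state.items():
#     if key in new_model.state_dict():
#         safe_copy(new_model.state_dict()[key], tensor, name=key)
```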
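The **Experience the Future** bullet says the model loads with standard 🤗 Transformers, no PR required. A hedged loading sketch follows; the repo id is a placeholder, and `trust_remote_code=True` is included only as an assumption in case the checkpoint ships custom modeling code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "erax-ai/<your-model-id>"  # placeholder, replace with the actual Hub repo

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # only needed if the repo bundles custom modeling code
)

inputs = tokenizer("Hello from the MLA + MoE LLaMA!", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```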
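The **Unlock New Frontiers** bullet describes freezing all original layers and continually pretraining only the newly added parameters. The sketch below assumes the new modules can be identified by substrings in their parameter names (`mla`, `expert`, `router`, `gate`); these markers are guesses, not the repository's actual naming.

```python
def freeze_original_layers(model, new_param_markers=("mla", "expert", "router", "gate")):
    """Freeze every parameter except those belonging to the newly added modules.

    `new_param_markers` is an assumed naming convention for the MLA projections
    and MoE experts/routers; adjust it to match the real state-dict keys.
    """
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if any(marker in name.lower() for marker in new_param_markers):
            param.requires_grad = True
            trainable += param.numel()
        else:
            param.requires_grad = False
            frozen += param.numel()
    print(f"trainable params: {trainable / 1e9:.2f}B, frozen params: {frozen / 1e9:.2f}B")
    return model
```

If the markers match the real layout, the printed counts should land near the split described above: roughly 5.2B trainable new parameters on top of about 8B frozen LLaMA-3.1 parameters.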