---
library_name: deepseek-moe
tags:
- mixture-of-experts
- transformers
- pytorch
- moe
- efficient-transformer
pipeline_tag: text-generation
language: en
license: apache-2.0
---
# DeepSeek MoE Implementation
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
*Note: This repository contains a modular implementation of the DeepSeek MoE architecture, not trained model weights.*
A clean, efficient implementation of DeepSeek's Mixture of Experts (MoE) architecture in PyTorch. This repository provides a simplified version of the architecture described in the DeepSeek paper, focusing on the core innovations that make their MoE approach unique.
This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the 'Related Implementations' section for the complete series.
<p align="center">
<img src="./assets/moe_architecture.png" alt="DeepSeek MoE Architecture" width="600"/>
</p>
## Overview
Mixture of Experts (MoE) architectures enable dramatic scaling of model parameters while maintaining computational efficiency by activating only a subset of parameters for any given input. DeepSeek's approach introduces several key innovations to the MoE architecture that improve performance and efficiency.
Key features of this implementation:
- **Hybrid Expert Structure**: Combines shared experts (processing all tokens) with routed experts (processing specific tokens)
- **Efficient Top-K Routing**: Token-to-expert affinity calculation based on dot product similarity
- **Multi-Level Load Balancing**: Cascading auxiliary losses at expert, device, and communication levels
- **Device-Limited Routing**: Bounds communication costs in distributed training scenarios
- **Token Dropping Strategy**: Optimizes computation by dropping tokens with low affinity scores
## Quick Start
```python
import torch
from moe import MixtureOfExperts
# Create input tensor
batch_size = 8
seq_length = 16
d_model = 512
inputs = torch.randn(batch_size, seq_length, d_model)
# Create MoE layer
moe = MixtureOfExperts(
    d_model=512,   # Input dimension
    d_expert=1024, # Expert hidden dimension
    K=2,           # Top-K experts per token
    N_s=2,         # Number of shared experts
    N_r=8,         # Number of routed experts
    alpha1=0.01,   # Expert balance factor
    alpha2=0.01,   # Device balance factor
    alpha3=0.01,   # Communication balance factor
    D=4,           # Number of devices
    M=3            # Device limit for routing
)
# Forward pass
outputs, expert_loss, device_loss, commu_loss = moe(inputs)
```
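The layer returns the transformed hidden states together with three auxiliary balance losses. In a training loop these would typically be added to the task objective; the `alpha` factors passed to the constructor suggest each term is already weighted inside the layer. A minimal, hypothetical sketch with a placeholder objective:

```python
# Hypothetical training step: fold the auxiliary balance losses into the objective.
# The task loss here is a placeholder for illustration only.
task_loss = outputs.pow(2).mean()
total_loss = task_loss + expert_loss + device_loss + commu_loss
total_loss.backward()
```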
## Architecture Details
For a detailed explanation of the architecture, see [architecture.md](insights/architecture.md).
### DeepSeek MoE Key Innovations
The DeepSeek MoE architecture introduces several elegant design choices:
1. **Hybrid Expert Structure**: Using both shared experts and routed experts with residual connections maintains global information flow while allowing for specialization.
2. **Token-Expert Affinity**: Calculating token-to-expert similarity through dot product with expert centroids, similar to attention mechanisms (see the routing sketch after this list).
3. **Multi-Level Balancing**: Cascading auxiliary losses that enforce balance at expert, device, and communication levels, creating a holistic approach to load distribution.
4. **Device-Limited Routing**: Constraining each token to experts on at most M devices to bound communication costs.
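As an illustration of the affinity calculation in point 2, token-to-expert scores can be computed as dot products between token representations and learned expert centroids, followed by top-K selection. The sketch below is a standalone approximation using the Quick Start dimensions; names such as `centroids` are assumptions, not the repository's API:

```python
import torch
import torch.nn.functional as F

# Standalone sketch of dot-product token-to-expert affinity with top-K routing.
tokens = torch.randn(8, 16, 512)       # (batch, seq, d_model)
centroids = torch.randn(8, 512)        # one learned centroid per routed expert (N_r = 8)

affinity = tokens @ centroids.t()      # (batch, seq, N_r) dot-product similarity
scores = F.softmax(affinity, dim=-1)   # normalize into routing probabilities
topk_scores, topk_idx = scores.topk(k=2, dim=-1)  # keep the K = 2 best experts per token
```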
## Implementation Details
The implementation consists of two main classes:
### 1. Expert
A feed-forward network with two linear transformations and a ReLU activation in between.
```
Expert(x) = max(0, xW1 + b1)W2 + b2
```
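A minimal `Expert` module matching this formula could look like the sketch below (dimensions borrowed from the Quick Start; this is an illustration rather than the repository's exact class):

```python
import torch.nn as nn

class Expert(nn.Module):
    """Two-layer feed-forward network: max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model=512, d_expert=1024):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_expert)  # x W1 + b1
        self.w2 = nn.Linear(d_expert, d_model)  # (...) W2 + b2
        self.act = nn.ReLU()                    # max(0, .)

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))
```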
### 2. MixtureOfExperts
The main MoE implementation that:
- Combines shared and routed experts
- Calculates token-to-expert affinities
- Applies top-K routing
- Calculates auxiliary balance losses (an expert-level sketch follows the formula below)
```
MoE(x) = x + Σ_{i=1..N_s} Expert^s_i(x) + Σ_{i=1..N_r} g_i(x) · Expert^r_i(x)
```
where g_i(x) is the gate value for routed expert i (zero unless the expert is among the token's top-K).
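To make the balance losses concrete, the sketch below computes an expert-level balance term in the spirit of the paper's formulation (fraction of token-expert assignments per expert times its mean routing probability); the device- and communication-level losses follow the same pattern at coarser granularity. The function name and shapes are assumptions for illustration:

```python
import torch

def expert_balance_loss(scores, topk_idx, alpha1=0.01):
    """Illustrative expert-level balance loss.
    scores:   (num_tokens, N_r) softmax routing probabilities
    topk_idx: (num_tokens, K)   indices of the selected experts per token"""
    num_tokens, n_experts = scores.shape
    k = topk_idx.shape[-1]
    # f_i: scaled fraction of token-expert assignments routed to expert i
    counts = torch.zeros(n_experts).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(num_tokens * k)
    )
    f = counts * n_experts / (k * num_tokens)
    # P_i: mean routing probability mass assigned to expert i
    p = scores.mean(dim=0)
    return alpha1 * (f * p).sum()
```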
## Testing
Unit tests are provided to verify the correct functioning of:
- Expert computations
- MoE routing mechanisms
- Load balancing losses
- Residual connections
Run the tests with:
```bash
python -m src.tests.test_moe
```
## Related Implementations
This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:
1. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)** (This Repository): Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.
2. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)**: Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing transformer architecture with explanations of key components.
Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.
## Contributing
Contributions are welcome! Feel free to:
- Report bugs and issues
- Submit pull requests for improvements
- Add additional test cases
- Provide documentation clarifications
Please ensure all tests pass before submitting pull requests.
## Citation
If you use this implementation in your research, please cite:
```bibtex
@misc{deepseek-moe-2025,
  author       = {Jen Wei},
  title        = {DeepSeek MoE Implementation},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/bird-of-paradise/deepseek-moe}}
}
```
## License
This project is licensed under the Apache License 2.0.
## Acknowledgements
This implementation is inspired by the DeepSeek paper and other open-source MoE implementations:
- [DeepSeek](https://github.com/deepseek-ai)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)
- [GShard](https://arxiv.org/abs/2006.16668)