---
library_name: deepseek-moe
tags:
- mixture-of-experts
- transformers
- pytorch
- moe
- efficient-transformer
pipeline_tag: text-generation
language: en
license: apache-2.0
---
# DeepSeek MoE Implementation
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
*Note: This repository contains a modular implementation of the DeepSeek MoE architecture, not trained model weights.*
A clean, efficient implementation of DeepSeek's Mixture of Experts (MoE) architecture in PyTorch. This repository provides a simplified version of the architecture described in the DeepSeek paper, focusing on the core innovations that make their MoE approach unique.
This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the 'Related Implementations' section for the complete series.
<p align="center">
<img src="./assets/moe_architecture.png" alt="DeepSeek MoE Architecture" width="600"/>
</p>
## Overview
Mixture of Experts (MoE) architectures enable dramatic scaling of model parameters while maintaining computational efficiency by activating only a subset of parameters for any given input. DeepSeek's approach introduces several key innovations to the MoE architecture that improve performance and efficiency.
Key features of this implementation:
- **Hybrid Expert Structure**: Combines shared experts (processing all tokens) with routed experts (processing specific tokens)
- **Efficient Top-K Routing**: Token-to-expert affinity calculation based on dot product similarity
- **Multi-Level Load Balancing**: Cascading auxiliary losses at expert, device, and communication levels
- **Device-Limited Routing**: Bounds communication costs in distributed training scenarios
- **Token Dropping Strategy**: Optimizes computation by dropping tokens with low affinity scores
## Quick Start
```python
import torch
from moe import MixtureOfExperts
# Create input tensor
batch_size = 8
seq_length = 16
d_model = 512
inputs = torch.randn(batch_size, seq_length, d_model)
# Create MoE layer
moe = MixtureOfExperts(
    d_model=512,   # Input dimension
    d_expert=1024, # Expert hidden dimension
    K=2,           # Top-K experts per token
    N_s=2,         # Number of shared experts
    N_r=8,         # Number of routed experts
    alpha1=0.01,   # Expert balance factor
    alpha2=0.01,   # Device balance factor
    alpha3=0.01,   # Communication balance factor
    D=4,           # Number of devices
    M=3            # Device limit for routing
)
# Forward pass
outputs, expert_loss, device_loss, commu_loss = moe(inputs)
```
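The layer returns the transformed hidden states together with three auxiliary balance losses. In a training loop these would typically be added to the task objective; the `alpha` factors passed to the constructor suggest each term is already weighted inside the layer. A minimal, hypothetical sketch with a placeholder objective:

```python
# Hypothetical training step: fold the auxiliary balance losses into the objective.
# The task loss here is a placeholder for illustration only.
task_loss = outputs.pow(2).mean()
total_loss = task_loss + expert_loss + device_loss + commu_loss
total_loss.backward()
```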
## Architecture Details
For a detailed explanation of the architecture, see [architecture.md](insights/architecture.md).
### DeepSeek MoE Key Innovations
The DeepSeek MoE architecture introduces several elegant design choices:
1. **Hybrid Expert Structure**: Using both shared experts and routed experts with residual connections maintains global information flow while allowing for specialization.
2. **Token-Expert Affinity**: Calculating token-to-expert similarity through dot product with expert centroids, similar to attention mechanisms (see the routing sketch after this list).
3. **Multi-Level Balancing**: Cascading auxiliary losses that enforce balance at expert, device, and communication levels, creating a holistic approach to load distribution.
4. **Device-Limited Routing**: Constraining each token to experts on at most M devices to bound communication costs.
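As an illustration of the affinity calculation in point 2, token-to-expert scores can be computed as dot products between token representations and learned expert centroids, followed by top-K selection. The sketch below is a standalone approximation using the Quick Start dimensions; names such as `centroids` are assumptions, not the repository's API:

```python
import torch
import torch.nn.functional as F

# Standalone sketch of dot-product token-to-expert affinity with top-K routing.
tokens = torch.randn(8, 16, 512)       # (batch, seq, d_model)
centroids = torch.randn(8, 512)        # one learned centroid per routed expert (N_r = 8)

affinity = tokens @ centroids.t()      # (batch, seq, N_r) dot-product similarity
scores = F.softmax(affinity, dim=-1)   # normalize into routing probabilities
topk_scores, topk_idx = scores.topk(k=2, dim=-1)  # keep the K = 2 best experts per token
```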
## Implementation Details
The implementation consists of two main classes:
### 1. Expert
A feed-forward network with two linear transformations and a ReLU activation in between.
```
Expert(x) = max(0, xW1 + b1)W2 + b2
```
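A minimal `Expert` module matching this formula could look like the sketch below (dimensions borrowed from the Quick Start; this is an illustration rather than the repository's exact class):

```python
import torch.nn as nn

class Expert(nn.Module):
    """Two-layer feed-forward network: max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model=512, d_expert=1024):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_expert)  # x W1 + b1
        self.w2 = nn.Linear(d_expert, d_model)  # (...) W2 + b2
        self.act = nn.ReLU()                    # max(0, .)

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))
```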
### 2. MixtureOfExperts
The main MoE implementation that:
- Combines shared and routed experts
- Calculates token-to-expert affinities
- Applies top-K routing
- Calculates auxiliary balance losses (an expert-level sketch follows the formula below)
```
MoE(x) = x + Σ_{i=1..N_s} Expert^s_i(x) + Σ_{i=1..N_r} g_i(x) · Expert^r_i(x)
```
where g_i(x) is the gate value for routed expert i (zero unless the expert is among the token's top-K).
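To make the balance losses concrete, the sketch below computes an expert-level balance term in the spirit of the paper's formulation (fraction of token-expert assignments per expert times its mean routing probability); the device- and communication-level losses follow the same pattern at coarser granularity. The function name and shapes are assumptions for illustration:

```python
import torch

def expert_balance_loss(scores, topk_idx, alpha1=0.01):
    """Illustrative expert-level balance loss.
    scores:   (num_tokens, N_r) softmax routing probabilities
    topk_idx: (num_tokens, K)   indices of the selected experts per token"""
    num_tokens, n_experts = scores.shape
    k = topk_idx.shape[-1]
    # f_i: scaled fraction of token-expert assignments routed to expert i
    counts = torch.zeros(n_experts).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(num_tokens * k)
    )
    f = counts * n_experts / (k * num_tokens)
    # P_i: mean routing probability mass assigned to expert i
    p = scores.mean(dim=0)
    return alpha1 * (f * p).sum()
```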
## Testing
Unit tests are provided to verify the correct functioning of:
- Expert computations
- MoE routing mechanisms
- Load balancing losses
- Residual connections
Run the tests with:
```bash
python -m src.tests.test_moe
```
## Related Implementations
This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:
1. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)** (This Repository): Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.
2. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)**: Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing transformer architecture with explanations of key components.
Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.
## Contributing
Contributions are welcome! Feel free to:
- Report bugs and issues
- Submit pull requests for improvements
- Add additional test cases
- Provide documentation clarifications
Please ensure all tests pass before submitting pull requests.
## Citation
If you use this implementation in your research, please cite:
```bibtex
@misc{deepseek-moe-2025,
  author       = {Jen Wei},
  title        = {DeepSeek MoE Implementation},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/bird-of-paradise/deepseek-moe}}
}
```
## License
This project is licensed under the Apache License 2.0.
## Acknowledgements
This implementation is inspired by the DeepSeek paper and other open-source MoE implementations:
- [DeepSeek](https://github.com/deepseek-ai)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)
- [GShard](https://arxiv.org/abs/2006.16668)