---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
---
# DeepSeek-R1-Distill-Llama-8B-ENK-Aligned

## Overview

**DeepSeek-R1-Distill-Llama-8B-ENK-Aligned** is a safety-aligned version of [`deepseek-ai/DeepSeek-R1-Distill-Llama-8B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B). It has been aligned using the **Enkrypt AI Safety Alignment dataset**, which was generated with the **SAGE** process:

> **SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming**  
> Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi (2024)  
> [[arXiv:2408.11851]](https://arxiv.org/abs/2408.11851)

This alignment significantly **reduces toxicity, harmfulness, and jailbreak vulnerabilities** across various safety topics while **maintaining model performance**.
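
The aligned model is a drop-in replacement for the base model and can be loaded with the standard `transformers` API. Below is a minimal inference sketch; the repository id is an assumption for illustration, so check the model page for the exact id:

```python
# Minimal inference sketch. The repo id below is assumed, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "enkryptai/DeepSeek-R1-Distill-Llama-8B-ENK-Aligned"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain what safety alignment means for an LLM."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```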

## Red Team Results

![Safety Comparison](assets/safety_comparison.png)

## Performance Results
| Model | MMLU-Pro Score |
|--------|----------------|
| DeepSeek-R1-Distill-Llama-8B (Base) | **44.71** |
| DeepSeek-R1-Distill-Llama-8B-ENK-Aligned | **46.43** |

## Training Configuration

The model was trained using the **SimPO (Simple Preference Optimization)** approach with the following hyperparameters:
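
For context, SimPO (Meng et al., 2024) optimizes a length-normalized, reference-free preference objective over chosen/rejected response pairs $(y_w, y_l)$; with the configuration below this corresponds to $\beta = 5$ and a target reward margin $\gamma = 0.8$:

```latex
% SimPO objective (Meng et al., 2024): length-normalized, reference-model-free
\mathcal{L}_{\text{SimPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      \;-\; \gamma
    \right)
  \right]
```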

```yaml
cpo_config:
  loss_type: 'simpo'
  max_prompt_length: 1800
  max_length: 3600
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 1
  learning_rate: 1.8e-6
  optim: 'adamw_torch'
  lr_scheduler_type: 'cosine'
  gradient_checkpointing: True
  beta: 5
  num_train_epochs: 1
  bf16: False
  simpo_gamma: 0.8
  warmup_ratio: 0.1
  cpo_alpha: 0.0
```
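
As a rough guide, a run with these hyperparameters could be reproduced with TRL's `CPOTrainer`, which implements SimPO via `loss_type="simpo"` (setting `cpo_alpha=0.0` disables the auxiliary NLL term, recovering pure SimPO). This is a hedged sketch, not the exact training script; the dataset id and column layout are assumptions:

```python
# Sketch mapping the YAML above onto TRL's CPOConfig/CPOTrainer.
# The dataset id is hypothetical; the Enkrypt AI dataset is not linked in this card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen", "rejected" columns (assumed layout).
train_dataset = load_dataset("enkryptai/safety-alignment", split="train")  # hypothetical id

config = CPOConfig(
    output_dir="enk-aligned",
    loss_type="simpo",
    cpo_alpha=0.0,            # 0.0 disables the auxiliary NLL term -> pure SimPO
    beta=5.0,
    simpo_gamma=0.8,          # target reward margin gamma
    max_prompt_length=1800,
    max_length=3600,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1.8e-6,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    gradient_checkpointing=True,
    num_train_epochs=1,
    bf16=False,
    warmup_ratio=0.1,
)

trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```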

## Key Improvements

- **Enhanced Safety**: Significant reduction in harmful or toxic outputs.
- **Improved Robustness**: Stronger resistance to adversarial jailbreak prompts.
- **Minimal Performance Tradeoff**: MMLU-Pro improves slightly (44.71 → 46.43) rather than regressing, despite the additional alignment constraints.

## Use Cases

This model is ideal for applications requiring **safe, aligned, and high-performance language generation**, including:
- **Conversational AI**: Ensuring responsible and aligned assistant behavior.
- **Content Moderation**: Filtering harmful content while maintaining contextual understanding.
- **Education & Research**: Deploying AI in sensitive environments with reduced risks.

## Citation

If you use this model, please cite the SAGE-RT paper:

```bibtex
@misc{kumar2024sagertsyntheticalignmentdata,
  title={SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming},
  author={Anurakt Kumar and Divyanshu Kumar and Jatan Loya and Nitin Aravind Birur and Tanay Baswa and Sahil Agarwal and Prashanth Harshangi},
  year={2024},
  eprint={2408.11851},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2408.11851}
}
```

---
For questions or contributions, reach out to the **Enkrypt AI** team!