Model Description

Model Summary

This model is Qwen2.5-72B-Instruct fine-tuned with DPO on the Egida-DPO-Qwen2.5-72B-Instruct dataset.
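
A minimal loading sketch with the transformers library (the repository id comes from this card and the BF16 dtype matches the stored weights; the device map and generation settings are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HPAI-BSC/Qwen2.5-72B-Instruct-Egida-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",           # shard the 72B model across available GPUs
)

messages = [{"role": "user", "content": "How do I handle a suspicious email?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```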

The Egida dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Qwen2.5-72B-Instruct. Unsafe answers are selected and paired with safe answers to build a customized DPO dataset for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, where each question elicited an unsafe response from this target model, the discarded answer is that unsafe response, and the chosen answer is a safe alternative. A sketch of this construction follows.
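
The sketch below illustrates the construction described above. `generate_answer`, `is_unsafe`, and `safe_reference` are hypothetical helpers standing in for the model-inference, safety-judging, and safe-answer-pairing steps; they are not part of any released API:

```python
def build_dpo_triplets(egida_train, generate_answer, is_unsafe, safe_reference):
    """Build (prompt, chosen, rejected) triplets from the Egida train split."""
    triplets = []
    for question in egida_train:
        answer = generate_answer(question)  # target model's own response
        if is_unsafe(question, answer):     # keep only prompts that elicited unsafe output
            triplets.append({
                "prompt": question,
                "chosen": safe_reference(question),  # safe answer paired with the prompt
                "rejected": answer,                  # the model's own unsafe answer
            })
    return triplets
```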

Training Details

  • Hardware: NVIDIA H100 64 GB GPUs
  • Devices: 64 GPUs (16 nodes)
  • Time: 10.23h
  • Batch Size: 63
  • LR: 1e-6
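
For orientation, a minimal DPO training sketch using TRL's DPOTrainer (the card does not name the training framework, so this is an assumption; argument names follow recent TRL releases, and the dataset id is inferred from the card's naming):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Assumed dataset id; columns are expected to be "prompt"/"chosen"/"rejected".
dataset = load_dataset("HPAI-BSC/Egida-DPO-Qwen2.5-72B-Instruct", split="train")

config = DPOConfig(
    output_dir="qwen2.5-72b-instruct-egida-dpo",
    learning_rate=1e-6,             # matches the LR reported above
    per_device_train_batch_size=1,  # illustrative; the actual run used 64 GPUs
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)
trainer.train()
```

The actual run distributed training across 16 nodes; this single-process sketch only shows the shape of the setup.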

Performance

Safety Performance (Attack Success Ratio)

| Model | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|---|---|---|---|---|
| Qwen-2.5-72B-Instruct | 0.235 | 0.051 | 0.329 | 0.050 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.125 | 0.042 | 0.210 | 0.019 |

General Purpose Performance

| Model | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|---|---|---|
| Qwen-2.5-72B-Instruct | 0.618 | 0.771 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.620 | 0.768 |

Refusal Ratio

| Model | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|---|---|---|
| Qwen-2.5-72B-Instruct | 0.015 | 0.102 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.016 | 0.170 |

Note that this refusal ratio is computed by keyword matching against a curated list of refusal keywords. For more information, see the paper.
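
A minimal sketch of this kind of keyword-based refusal detection (the keyword list below is a hypothetical stand-in; the curated list actually used is described in the paper):

```python
REFUSAL_KEYWORDS = [
    "i cannot", "i can't", "i'm sorry", "i am sorry",
    "i'm not able to", "i am not able to", "as an ai",
]

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any refusal keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def refusal_ratio(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```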

Environmental Impact

Citation Information

@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs}, 
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603}, 
}