|
--- |
|
license: cc-by-4.0 |
|
language: |
|
- am |
|
- ru |
|
- en |
|
- uk |
|
- de |
|
- ar |
|
- zh |
|
- es |
|
- hi |
|
datasets: |
|
- s-nlp/ru_paradetox |
|
- s-nlp/paradetox |
|
- textdetox/multilingual_paradetox |
|
library_name: transformers |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# mT0-XL-detox-orpo |
|
|
|
**Resources**: |
|
* [Paper](https://arxiv.org/abs/2407.05449) |
|
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification) |
|
|
|
## Model Information |
|
This is a multilingual 3.7B-parameter text detoxification model covering the 9 languages of the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html), based on [mT0-XL](https://huggingface.co/bigscience/mt0-xl). The model was trained in a two-step setup: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected with toxicity and similarity classifiers. See the paper for more details.
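
For reference, the ORPO alignment step can be sketched with `ORPOTrainer` from the [TRL](https://github.com/huggingface/trl) library. This is only a minimal illustration, not the authors' exact training recipe (see the GitHub repository above for the actual scripts); the preference pairs below are placeholders for the self-annotated data filtered by toxicity and similarity classifiers.

```python
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Placeholder preference pairs; in practice these come from model outputs
# scored with toxicity and similarity classifiers.
train_dataset = Dataset.from_dict({
    "prompt":   ["Detoxify: <toxic sentence>"],
    "chosen":   ["<rewrite accepted as non-toxic and similar>"],
    "rejected": ["<rewrite rejected as still toxic or dissimilar>"],
})

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xl")
tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-xl")

args = ORPOConfig(output_dir="mt0-xl-detox-orpo", per_device_train_batch_size=1)
trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # `processing_class=` in recent TRL versions
)
trainer.train()
```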
|
|
|
In terms of human evaluation, the model is the second-best approach in the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html). More precisely, it achieves state-of-the-art performance for Ukrainian, second-best scores for Arabic, and near state-of-the-art performance for the other languages.
|
|
|
## Example usage |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language-specific detoxification prompts prepended to the input text
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams in 5 groups, returning 5 candidate rewrites
    outputs = model.generate(**encodings,
                             max_length=128,
                             num_beams=10,
                             no_repeat_ngram_size=3,
                             repetition_penalty=1.2,
                             num_beam_groups=5,
                             diversity_penalty=2.5,
                             num_return_sequences=5,
                             early_stopping=True,
                             )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
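
The helper can then be called as follows (the input sentence is only a placeholder):

```python
candidates = detoxify("your toxic sentence here", 'en', model, tokenizer)
print(candidates)  # list of five candidate detoxified rewrites
```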
|
|
|
## Citation |
|
``` |
|
@inproceedings{smurfcat_at_pan, |
|
author = {Elisei Rykov and |
|
Konstantin Zaytsev and |
|
Ivan Anisimov and |
|
Alexandr Voronin}, |
|
editor = {Guglielmo Faggioli and |
|
Nicola Ferro and |
|
Petra Galusc{\'{a}}kov{\'{a}} and |
|
Alba Garc{\'{\i}}a Seco de Herrera}, |
|
title = {SmurfCat at {PAN} 2024 TextDetox: Alignment of Multilingual Transformers |
|
for Text Detoxification}, |
|
booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum {(CLEF} |
|
2024), Grenoble, France, 9-12 September, 2024}, |
|
series = {{CEUR} Workshop Proceedings}, |
|
volume = {3740}, |
|
pages = {2866--2871}, |
|
publisher = {CEUR-WS.org}, |
|
year = {2024}, |
|
url = {https://ceur-ws.org/Vol-3740/paper-276.pdf}, |
|
timestamp = {Wed, 21 Aug 2024 22:46:00 +0200}, |
|
biburl = {https://dblp.org/rec/conf/clef/RykovZAV24.bib}, |
|
bibsource = {dblp computer science bibliography, https://dblp.org} |
|
} |
|
``` |
|
|