🛡️ LexiGuard: Misogyny, Misandry & Toxicity Detection in English and Slovak

LexiGuard is a multilingual, multitask model that detects and classifies offensive language, focusing on misogyny, misandry, and toxicity level. It is trained primarily on English and also supports Slovak, which makes it suitable for analysing social media content in both languages.

It performs dual classification:

Category: Misogyny, Misandry, or Neutral
Toxicity level: Low, Medium, or High

The model is based on xlm-roberta-base and was fine-tuned on a custom dataset that is primarily English (about 80%), with additional annotated samples in Slovak (about 20%).

🧠 Model Overview

Base model: xlm-roberta-base
Tasks: Multitask classification (2 output heads)
Primary language: English
Secondary language: Slovak
Use case: Detecting offensive, sexist, or toxic comments in multilingual social media

🛠️ Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Megyy/lexiguard")
model = AutoModelForSequenceClassification.from_pretrained("Megyy/lexiguard")

text = "Women are useless in politics."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits contains predictions for both tasks

Note: The model has two output heads:

Head 1: Category (neutral/misogyny/misandry)
Head 2: Toxicity (low/medium/high)

📊 Label Definitions

Task 1 – Category Classification

0: Neutral
1: Misogyny
2: Misandry

Task 2 – Toxicity Prediction

0: Low
1: Medium
2: High
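
The label IDs above can be turned into readable predictions with a small helper. The following is a minimal sketch rather than part of the released model: the function names (softmax, decode_head, decode_prediction) are hypothetical, and because this card does not specify how the two heads' logits are laid out in the model output, the helper assumes they have already been separated into two lists of three values each. The example logits are made up for illustration.

```python
import math

# Label IDs as defined in this model card
CATEGORY_LABELS = {0: "Neutral", 1: "Misogyny", 2: "Misandry"}
TOXICITY_LABELS = {0: "Low", 1: "Medium", 2: "High"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_head(logits, id2label):
    """Return (label, probability) for the highest-scoring class of one head."""
    if len(logits) != len(id2label):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def decode_prediction(category_logits, toxicity_logits):
    """Decode both heads into a small, readable dictionary.

    Assumption: the caller has already split the model output into
    one list of logits per head.
    """
    category, category_p = decode_head(category_logits, CATEGORY_LABELS)
    toxicity, toxicity_p = decode_head(toxicity_logits, TOXICITY_LABELS)
    return {
        "category": category,
        "category_confidence": category_p,
        "toxicity": toxicity,
        "toxicity_confidence": toxicity_p,
    }


# Made-up logits for illustration only, not real model output
example = decode_prediction([0.2, 2.9, -1.0], [-0.5, 0.4, 1.8])
print(example["category"], example["toxicity"])  # prints: Misogyny High
```

Before applying a helper like this to real output, it is worth checking the shape of the logits the model returns to confirm how the two heads are exposed.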

🧪 Training Data

Over 5,000 manually annotated comments
Domain: Online discussions, social media, and forums
Language distribution:
- ~80% English
- ~20% Slovak

📁 Model Files

pytorch_model.bin / model.safetensors: model weights
config.json: model configuration
tokenizer.json, vocab.txt, etc.: tokenizer files
README.md: model card

📚 Citation

If you use this model in your work, please cite:

@misc{majercakova2025lexiguard,
  title={LexiGuard: Offensive Language Detection in English and Slovak Social Media},
  author={Majer{\v{c}}{\'a}kov{\'a}, Magdal{\'e}na},
  year={2025},
  note={Bachelor's thesis, TUKE}
}

👨‍💻 Author

Developed by Magdaléna Majerčáková as part of a Bachelor's Thesis
Supervised by Ing. Zuzana Sokolová, PhD
Faculty of Electrical Engineering and Informatics, TUKE (2025)