Sadeed: Advancing Arabic Diacritization Through Small Language Model
Abstract
Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.
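To illustrate the task itself (this is not the paper's pipeline): diacritics in Arabic are Unicode combining marks, so training pairs for a restoration model can be produced by stripping them from fully diacritized text. A minimal sketch using only the standard library:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics (harakat) by dropping combining marks.

    Decompose to NFD, then filter out characters in Unicode category
    "Mn" (nonspacing mark), which covers fatha, damma, kasra, sukun,
    shadda, and tanwin marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Fully diacritized "kataba" (he wrote) -> undiacritized form
diacritized = "كَتَبَ"
bare = strip_diacritics(diacritized)
print(bare)  # كتب
```

The (bare, diacritized) pair then serves as the (input, target) for a sequence model such as a fine-tuned decoder-only LM.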