Update README.md
README.md
---
license: cc-by-nc-4.0
language:
- az
base_model:
- google/mt5-small
pipeline_tag: text2text-generation
tags:
- spell
- correction
- azerbaijani
---

# Azerbaijani Spell Correction Model

This repository contains an Azerbaijani spell correction model based on `google/mt5-small`. The model is fine-tuned to correct orthographic and typographical errors in Azerbaijani text, using training data that closely mimics real-world spelling mistakes. The simulated errors draw on the characteristics of the Azerbaijani language and its keyboard layout.

## Table of Contents

- [Overview](#overview)
- [Dataset](#dataset)
  - [Imitating Real Spelling Errors](#imitating-real-spelling-errors)
  - [Keyboard Layout Considerations](#keyboard-layout-considerations)
  - [Typical Substitutions](#typical-substitutions)
- [Model Training](#model-training)
- [Usage](#usage)
- [Examples](#examples)
- [License](#license)

## Overview

The spell correction model is designed to automatically correct spelling errors in Azerbaijani text. It is built on the multilingual T5 model (`google/mt5-small`), which supports Azerbaijani characters, and it is fine-tuned on a custom dataset that reflects common spelling mistakes made by native speakers.

## Dataset

### Imitating Real Spelling Errors

The dataset was constructed to imitate real orthographic and typographical errors commonly found in Azerbaijani text. This was achieved by:

- **Collecting a corpus of correct Azerbaijani sentences** from various sources such as news articles, books, and online content.
- **Introducing errors into the sentences** based on common spelling mistakes, character substitutions, and keyboard mishits.
- **Ensuring a balance between different types of errors** so that the model learns to handle a wide variety of mistakes.

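The steps above can be sketched as a small pair-generation routine. This is an illustrative sketch only, not the script used to build the dataset: the function names, the three error types, and the round-robin balancing are assumptions made for the example.

```python
import random
from itertools import cycle


def swap_adjacent(text, rng):
    """Transpose two neighbouring characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_char(text, rng):
    """Omit one character at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def double_char(text, rng):
    """Duplicate one character at a random position."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]


# Hypothetical error inventory; a real pipeline would likely have more types.
ERROR_TYPES = {
    "transpose": swap_adjacent,
    "omit": drop_char,
    "duplicate": double_char,
}


def build_pairs(clean_sentences, seed=0):
    """Pair each clean sentence with a corrupted copy.

    Error types are assigned round-robin, so each type is used
    (almost) equally often across the corpus.
    """
    rng = random.Random(seed)
    pairs = []
    for sentence, name in zip(clean_sentences, cycle(ERROR_TYPES)):
        noisy = ERROR_TYPES[name](sentence, rng)
        pairs.append({"error_type": name, "input": noisy, "target": sentence})
    return pairs
```

Each resulting record holds the corrupted sentence as `input` and the original sentence as `target`, which matches the incorrect-to-correct mapping the model is trained on.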
### Keyboard Layout Considerations

The Azerbaijani keyboard layout and its neighboring letters were taken into account to simulate realistic typing errors. Since many spelling mistakes occur due to pressing adjacent keys, the dataset includes errors resulting from:

- **Mishits of neighboring keys** on the Azerbaijani keyboard.
- **Omission or duplication of characters** that are close to each other.
- **Transposition of adjacent characters** due to fast typing.

Accounting for the keyboard layout helps the model correct errors that are likely to occur during actual typing.

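The neighboring-key case can be sketched as follows. The letter rows below are an assumption about the Azerbaijani Latin layout, and only left and right neighbors within a row are considered; the adjacency map actually used for the dataset is not documented here and may differ.

```python
import random

# Assumed letter rows of the Azerbaijani Latin keyboard (illustrative only).
ROWS = ["qüertyuiopöğ", "asdfghjklıə", "zxcvbnmçş"]


def build_neighbors(rows):
    """Map each key to the keys directly left and right of it in its row."""
    neighbors = {}
    for row in rows:
        for i, key in enumerate(row):
            neighbors[key] = row[max(i - 1, 0):i] + row[i + 1:i + 2]
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def mishit(word, index, rng):
    """Replace the character at `index` with a randomly chosen neighboring key.

    The lookup is done on the lowercased character, so the replacement
    is always lowercase.
    """
    options = NEIGHBORS.get(word[index].lower(), "")
    if not options:
        return word  # not on the letter rows (digit, space, ...): leave as is
    return word[:index] + rng.choice(options) + word[index + 1:]
```

For example, `mishit("salam", 0, random.Random(0))` replaces the initial `s` with one of its row neighbors, `a` or `d`. Omission, duplication, and transposition are plain string edits and need no layout information.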
### Typical Substitutions

Several typical substitutions were included in the dataset to reflect common errors:

- **Character Replacements:**
  - `ə` replaced with `e`
  - `ş` replaced with `s` or `w`
  - `ç` replaced with `c`
  - `ö` replaced with `o`
  - `ğ` replaced with `g`
  - `ı` replaced with `i`
  - `ü` replaced with `u`
  - `v` replaced with `w`
  - `q` replaced with `k`
  - `c` replaced with `j`

- **Examples:**
  - `qədər` written as `qeder`
  - `güllə` written as `gulle`
  - `maşın` written as `masin`
  - `əlaqə` written as `elaqe`

These substitutions reflect phonetic similarities and common mistakes that speakers make when writing Azerbaijani text, especially when typing with basic Latin characters in place of Azerbaijani-specific letters.

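As a minimal sketch, the diacritic-dropping replacements from the list above can be applied with a translation table. The names `SUBSTITUTIONS` and `asciify` are made up for this example, and only lowercase letters are covered. The context-dependent replacements (`ş` as `w`, `v` as `w`, `q` as `k`, `c` as `j`) are left out because they would not be applied to every occurrence.

```python
# Correct letter -> common replacement, taken from the list above.
SUBSTITUTIONS = {
    "ə": "e",
    "ş": "s",
    "ç": "c",
    "ö": "o",
    "ğ": "g",
    "ı": "i",
    "ü": "u",
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(SUBSTITUTIONS)


def asciify(word):
    """Return the word as it is often typed without Azerbaijani letters."""
    return word.translate(_TABLE)


for correct in ["güllə", "maşın", "əlaqə"]:
    print(correct, "->", asciify(correct))
```

Running the loop prints the misspelled forms `gulle`, `masin`, and `elaqe`, matching the examples in the list.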
## Model Training

The model was fine-tuned using the Hugging Face Transformers library. Key points in the training process include:

- **Model Architecture:** `google/mt5-small` was chosen for its multilingual support and capability to handle Azerbaijani characters.
- **Training Objective:** The model was trained to map incorrect sentences to their correct versions.
- **Evaluation:** The performance was evaluated using metrics like BLEU and ROUGE to ensure the quality of corrections.

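To give a feel for what an overlap metric measures on this task, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the scorer used for the reported evaluation and not an official BLEU or ROUGE implementation; whitespace tokenization and the function name are assumptions made for the sketch.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Simplified ROUGE-1-style F1 over whitespace-separated tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that matches the reference exactly scores 1.0, while a prediction with one of three words still misspelled scores 2/3. Because a single wrong character costs a whole token, word-level overlap is a fairly strict measure for spell correction.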
## Usage

To use the model for spell correction:

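A minimal inference sketch with the Transformers library might look like the following. The repository ID is a placeholder, since the model's Hub path is not stated here. The sketch also assumes that the model expects raw text with no task prefix and that 64 new tokens are enough for one sentence; adjust both if they do not hold.

```python
# Placeholder: replace with the actual Hub repository ID of this model.
MODEL_ID = "your-username/your-model"


def load_corrector(model_id=MODEL_ID):
    """Load the tokenizer and model once and reuse them across calls."""
    # Imported here so the helpers can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def correct_text(text, tokenizer, model, max_new_tokens=64):
    """Run one sentence through the model and return the decoded correction."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer, model = load_corrector()
    print(correct_text("men mektebe gedirem", tokenizer, model))
```

Loading is kept separate from correction so that the model is downloaded and initialized only once when many sentences are processed.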