---
license: cc-by-nc-nd-4.0
language:
- az
base_model:
- google/mt5-small
pipeline_tag: text2text-generation
tags:
- spell
- correction
- azerbaijani
datasets:
- LocalDoc/azerbaijani_spell_corrector_dataset
---

# Azerbaijani Spell Correction Model

This repository contains an Azerbaijani spell correction model based on `google/mt5-small`. The model is fine-tuned to correct orthographic and typographical errors in Azerbaijani text, using training data that closely mimics real-world spelling mistakes. The training data reflects the characteristics of the Azerbaijani language and keyboard layout.

## Table of Contents

- [Overview](#overview)
- [Examples](#examples)
- [Dataset](#dataset)
  - [Imitating Real Spelling Errors](#imitating-real-spelling-errors)
  - [Keyboard Layout Considerations](#keyboard-layout-considerations)
  - [Typical Substitutions](#typical-substitutions)
- [Usage](#usage)
- [License](#license)
- [Contact](#contact)


## Overview

The spell correction model is designed to automatically correct spelling errors in Azerbaijani text. It is built upon the multilingual T5 model (`google/mt5-small`), which supports Azerbaijani language characters and has been fine-tuned on a custom dataset that reflects common spelling mistakes made by native speakers.


## Examples

**Input**
`Gozel arasdirlma apariril. bizde kassirlerden sikayetci ola bilmirsen umumilikde goturende bezi yerlerden sikayetci olmaq mumkun deyil`

**Output**
`Gözəl araşdırma aparılır. bizdə kassirlərdən şikayətçi ola bilmirsən ümumilikdə götürüləndə bəzi yerlərdən şikayətçi olmaq mümkün deyil`


## Dataset

### Imitating Real Spelling Errors

The dataset was meticulously constructed to imitate real orthographic and typographical errors commonly found in Azerbaijani text. This was achieved by:

- **Collecting a corpus of correct Azerbaijani sentences** from various sources such as news articles, books, and online content.
- **Introducing errors into the sentences** based on common spelling mistakes, character substitutions, and keyboard mishits.
- **Ensuring a balance between different types of errors** to train the model to handle a wide variety of mistakes.
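
The steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used to build the dataset: the two toy corruption functions, the `ERROR_TYPES` registry, the `build_pairs` helper, and the sample sentences are all assumptions made for the example. The sketch cycles through the error types so that each one is applied about equally often.

```python
import itertools

def drop_middle_char(sentence):
    """Toy corruption: delete the character in the middle of the sentence."""
    mid = len(sentence) // 2
    return sentence[:mid] + sentence[mid + 1:]

def swap_first_two(sentence):
    """Toy corruption: transpose the first two characters."""
    if len(sentence) < 2:
        return sentence
    return sentence[1] + sentence[0] + sentence[2:]

# Hypothetical registry of error types; a real pipeline would have many more.
ERROR_TYPES = {"omission": drop_middle_char, "transposition": swap_first_two}

def build_pairs(clean_sentences):
    """Pair each clean sentence with a corrupted copy, rotating error types."""
    pairs = []
    names = itertools.cycle(ERROR_TYPES)
    for sentence, name in zip(clean_sentences, names):
        pairs.append({
            "input": ERROR_TYPES[name](sentence),  # noisy text fed to the model
            "target": sentence,                    # clean text the model should produce
            "error_type": name,
        })
    return pairs

# Made-up sample sentences, used only to show the record format.
corpus = ["Bu gün hava yaxşıdır", "Kitab masanın üstündədir", "Mən evə gedirəm"]
for pair in build_pairs(corpus):
    print(pair)
```

Keeping the `error_type` label on each record makes it easy to check the balance of the resulting dataset afterwards.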

### Keyboard Layout Considerations

The Azerbaijani keyboard layout and neighboring letters were taken into account to simulate realistic typing errors. Since many spelling mistakes occur due to pressing adjacent keys, the dataset includes errors resulting from:

- **Mishits of neighboring keys** on the Azerbaijani keyboard.
- **Omission or duplication of characters** that are close to each other.
- **Transposition of adjacent characters** due to fast typing.

The consideration of the keyboard layout enhances the model's ability to correct errors that are likely to occur during actual typing.
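
As a rough sketch of how such typing errors could be simulated, the example below applies one random error per word. It is an illustration under stated assumptions, not the generator used for the dataset: the `NEIGHBORS` map is a small hypothetical excerpt covering only horizontally adjacent home-row keys, and the function names are invented for the example.

```python
import random

# Illustrative excerpt: horizontally adjacent home-row keys only.
# This is an assumption for the sketch, not the map used for the dataset.
NEIGHBORS = {
    "a": "s", "s": "ad", "d": "sf", "f": "dg", "g": "fh",
    "h": "gj", "j": "hk", "k": "jl", "l": "k",
}

def mishit(word, i, rng):
    """Replace the character at position i with a neighboring key, if known."""
    options = NEIGHBORS.get(word[i])
    if not options:
        return word
    return word[:i] + rng.choice(options) + word[i + 1:]

def omit(word, i, rng):
    """Drop the character at position i."""
    return word[:i] + word[i + 1:]

def duplicate(word, i, rng):
    """Type the character at position i twice."""
    return word[:i] + word[i] + word[i:]

def transpose(word, i, rng):
    """Swap character i with the one after it (no-op at the last position)."""
    if i + 1 >= len(word):
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

OPERATIONS = [mishit, omit, duplicate, transpose]

def add_typo(word, rng):
    """Apply one randomly chosen typing error at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return rng.choice(OPERATIONS)(word, i, rng)

rng = random.Random(0)  # fixed seed so runs are repeatable
print([add_typo(w, rng) for w in "salam dostlar necəsiniz".split()])
```

Passing the random generator explicitly keeps the corruption reproducible, which helps when regenerating a dataset.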

### Typical Substitutions

Several typical substitutions were included in the dataset to reflect common errors:

- **Character Replacements:**
  - `ə` replaced with `e`
  - `ş` replaced with `s` or `w`
  - `ç` replaced with `c`
  - `ö` replaced with `o`
  - `ğ` replaced with `g`
  - `ı` replaced with `i`
  - `ü` replaced with `u`
  - `v` replaced with `w`
  - `q` replaced with `k`
  - `c` replaced with `j`

- **Examples:**
  - `qədər` written as `qeder`
  - `güllə` written as `gulle`
  - `maşın` written as `masin`
  - `əlaqə` written as `elaqe`

These substitutions represent phonetic similarities and common mistakes made by speakers when writing Azerbaijani text, especially when using Latin characters.
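
The one-to-one replacements in the list above can be expressed as a simple character map. The sketch below is only an illustration of that idea: the names `SUBSTITUTIONS` and `asciify` are made up, it always picks `s` for `ş` (never `w`), it handles lowercase letters only, and it leaves out the `v`/`w`, `q`/`k`, and `c`/`j` swaps.

```python
# Correct character -> character typed instead (lowercase only).
SUBSTITUTIONS = {
    "ə": "e", "ş": "s", "ç": "c", "ö": "o",
    "ğ": "g", "ı": "i", "ü": "u",
}

def asciify(text):
    """Replace Azerbaijani-specific letters with their ASCII look-alikes."""
    return text.translate(str.maketrans(SUBSTITUTIONS))

for word in ["güllə", "maşın", "əlaqə"]:
    print(word, "->", asciify(word))
# güllə -> gulle
# maşın -> masin
# əlaqə -> elaqe
```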


## Usage

To use the model for spell correction:

```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load the tokenizer and model
tokenizer = MT5Tokenizer.from_pretrained('LocalDoc/azerbaijani_spell_corrector')
model = MT5ForConditionalGeneration.from_pretrained('LocalDoc/azerbaijani_spell_corrector')

# Correct a single sentence
def correct_sentence(sentence):
    # The model expects the "correct: " task prefix before the input text
    input_text = "correct: " + sentence
    # Inputs longer than 128 tokens are truncated
    input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=128, truncation=True)
    # Beam search with 5 beams; the output is capped at 128 tokens
    outputs = model.generate(input_ids=input_ids, max_length=128, num_beams=5, early_stopping=True)
    corrected_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return corrected_sentence

# Example usage
incorrect_sentence = "Pul dogru adamlarda deyil"
print(correct_sentence(incorrect_sentence))

# Expected output: Pul doğru adamlarda deyil.
```
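
Because the example above truncates input at 128 tokens, longer text would be cut off. One possible workaround is to split the text into sentences and correct each one separately. The sketch below assumes a simple punctuation-based split; `split_sentences`, `correct_text`, and the stand-in corrector are hypothetical helpers, not part of the model or the `transformers` API.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def correct_text(text, correct_fn):
    """Correct each sentence separately and join the results with spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))

# Stand-in corrector so the sketch runs without downloading the model;
# in practice, pass the correct_sentence function from the usage example.
def fake_correct(sentence):
    return "[" + sentence + "]"

print(correct_text("birinci cumle. ikinci cumle! ucuncu cumle?", fake_correct))
# [birinci cumle.] [ikinci cumle!] [ucuncu cumle?]
```

A single sentence can still exceed the limit, so very long sentences would need further splitting.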

## License

This model is licensed under the CC BY-NC-ND 4.0 license.

What does this license require?

- **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **Non-Commercial:** You may not use the material for commercial purposes.
- **No Derivatives:** If you remix, transform, or build upon the material, you may not distribute the modified material.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0 license</a>.


## Contact

For more information, questions, or issues, please contact LocalDoc at [[email protected]].