---
license: cc-by-sa-4.0
language:
- hr
tags:
- word spelling error annotator
---

# BERTic-Incorrect-Spelling-Annotator

This BERTic model annotates incorrectly spelled words in text. It uses the following labels:

- 0: Word is written correctly,
- 1: Word is written incorrectly.

## Model Output Example

Imagine we have the following Croatian text:

_Model u tekstu prepoznije riječi u kojima se nalazaju pogreške ._

After conversion to the input format expected by the BERTic model, the text looks like this:

_[CLS] model [MASK] u [MASK] tekstu [MASK] prepo ##znije [MASK] riječi [MASK] u [MASK] kojima [MASK] se [MASK] nalaza ##ju [MASK] pogreške [MASK] . [MASK] [SEP]_
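
The layout above can be reproduced with a short sketch. It assumes the words have already been lowercased and split into subword pieces by the tokenizer, and that each word's pieces are simply followed by one `[MASK]` slot, as the example suggests. The function name `build_annotator_input` and the hard-coded pieces are hypothetical and are not taken from the model's own preprocessing code.

```python
from typing import List


def build_annotator_input(word_pieces: List[List[str]]) -> List[str]:
    """Interleave each word's subword pieces with a [MASK] slot.

    `word_pieces` holds one list of subword pieces per word. This is an
    assumed reading of the example in this card, not the authors' code.
    """
    tokens = ["[CLS]"]
    for pieces in word_pieces:
        tokens.extend(pieces)    # all pieces that belong to one word
        tokens.append("[MASK]")  # one slot per word, placed after its pieces
    tokens.append("[SEP]")
    return tokens


# Subword pieces copied by hand from the example sentence above.
word_pieces = [
    ["model"], ["u"], ["tekstu"], ["prepo", "##znije"], ["riječi"], ["u"],
    ["kojima"], ["se"], ["nalaza", "##ju"], ["pogreške"], ["."],
]

formatted = " ".join(build_annotator_input(word_pieces))
print(formatted)
```

In practice the pieces would come from the model's tokenizer rather than from a hand-written list.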

The model might return the following predictions (these were chosen for illustration and are not guaranteed to be reproducible):

_Model 0 u 0 tekstu 0 prepoznije 1 riječi 0 u 0 kojima 0 se 0 nalazaju 1 pogreške 0 . 0_

In the input sentence, the words `prepoznije` and `nalazaju` are spelled incorrectly, so the model marks them with the label 1.
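
To work with output in this form, the alternating word and label sequence can be split into pairs. The following is a minimal sketch that assumes the output is a single whitespace-separated string exactly like the example above. The helper `parse_predictions` is hypothetical and is not part of the model's API.

```python
from typing import List, Tuple


def parse_predictions(annotated: str) -> List[Tuple[str, int]]:
    """Split 'word label word label ...' into (word, label) pairs."""
    parts = annotated.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected an even number of tokens (word/label pairs)")
    # Words sit at even positions, labels at odd positions.
    return [(word, int(label)) for word, label in zip(parts[0::2], parts[1::2])]


annotated = (
    "Model 0 u 0 tekstu 0 prepoznije 1 riječi 0 u 0 "
    "kojima 0 se 0 nalazaju 1 pogreške 0 . 0"
)

pairs = parse_predictions(annotated)
# Label 1 means the word is written incorrectly.
misspelled = [word for word, label in pairs if label == 1]
print(misspelled)
```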

## More details

Testing the model on **generated** test sets gives the following results:

Precision: 0.9954  
Recall: 0.8764  
F1 Score: 0.9321  
F0.5 Score: 0.9691  

Testing the model with test sets constructed using the **Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0** dataset provides the following results:

Precision: 0.8213  
Recall: 0.3921  
F1 Score: 0.5308  
F0.5 Score: 0.6738  
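
For readers unfamiliar with the F0.5 score: it is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below applies the standard F-beta formula to the precision and recall values reported above. How the scores were originally computed is not stated here, so treat this only as a consistency check; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall as reported in this card.
reported = {
    "generated": (0.9954, 0.8764),
    "RAPUT 1.0": (0.8213, 0.3921),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1={f_beta(p, r):.4f}  F0.5={f_beta(p, r, beta=0.5):.4f}")
```

The low recall on RAPUT 1.0 pulls F1 down much more than F0.5, because F0.5 gives more weight to the comparatively high precision.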

## Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

## Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing this model.