Martin97Bozic committed · Commit 285573d · 1 Parent(s): d33cc3d

added model together with usage example and the README

README.md ADDED
---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
- cjvt/solar3
- cjvt/sloleks
language:
- cro
tags:
- word spelling error annotator
---

# BERTic-Incorrect-Spelling-Annotator

This BERTic model is designed to annotate incorrectly spelled words in text. It utilizes the following labels:

- 0: Word is written correctly,
- 1: Word is written incorrectly.

## Model Output Example

Imagine we have the following Croatian text:

_Model u tekstu prepoznije riječi u kojima se nalazaju pogreške ._

If we convert the input data to the format expected by the BERTic model:

_[CLS] model [MASK] u [MASK] tekstu [MASK] prepo ##znije [MASK] riječi [MASK] u [MASK] kojima [MASK] se [MASK] nalaza ##ju [MASK] pogreške [MASK] . [MASK] [SEP]_

The model might return the following predictions (note: these predictions were chosen for demonstration purposes, not for reproducibility):

_Model 0 u 0 tekstu 0 prepoznije 1 riječi 0 u 0 kojima 0 se 0 nalazaju 1 pogreške 0 . 0_

We can see that in the input sentence the words `prepoznije` and `nalazaju` are spelled incorrectly, so the model marks them with the label 1.
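
For reference, here is a minimal sketch of that preprocessing step (the full script ships as `model_usage_example.py` in this repository): punctuation is separated, the text is lowercased, and a `[MASK]` token is appended after every word. The `[CLS]`/`[SEP]` tokens and the subword splits (e.g. `prepo ##znije`) are added later by the tokenizer.

```python
import re
import string

def to_model_input(text: str) -> str:
    """Separate punctuation, lowercase, and append [MASK] after every word."""
    text = re.sub(r"([" + string.punctuation + r"])", r" \1", text)
    text = re.sub(r" +", " ", text).lower()
    return " ".join(word + " [MASK]" for word in text.split(" "))

print(to_model_input("Model u tekstu prepoznije riječi u kojima se nalazaju pogreške."))
# model [MASK] u [MASK] tekstu [MASK] prepoznije [MASK] riječi [MASK] u [MASK]
# kojima [MASK] se [MASK] nalazaju [MASK] pogreške [MASK] . [MASK]
```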

## More details

Testing the model with **generated** test sets gives the following results:

- Precision: 0.9954
- Recall: 0.8764
- F1 Score: 0.9321
- F0.5 Score: 0.9691

Testing the model with test sets constructed using the **Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0** dataset gives the following results:

- Precision: 0.8213
- Recall: 0.3921
- F1 Score: 0.5308
- F0.5 Score: 0.6738

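These appear to be the standard F-beta scores (β = 0.5 weights precision above recall), which is consistent with the reported numbers; for the generated test sets, for example:

$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad F_{0.5} = \frac{1.25 \cdot 0.9954 \cdot 0.8764}{0.25 \cdot 0.9954 + 0.8764} \approx 0.9691.$$
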
## Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing this model.
config.json ADDED
{
  "_name_or_path": "./BERTic",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}
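
As a quick check, this configuration can be inspected with `transformers`; a small sketch, assuming the repository files sit in the current working directory (as in `model_usage_example.py` below):

```python
from transformers import BertConfig

# Assumes config.json is in the current working directory.
config = BertConfig.from_pretrained(".")

print(config.model_type)          # "bert"
print(config.num_hidden_layers)   # 12
print(config.architectures)       # ["BertForTokenClassification"]
```
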
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:296ea1e18f2c377a85f7ace0867ccf264318fe0caf2f04c4d773ef0774f57fbb
size 440136504
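
This is a Git LFS pointer rather than the weights themselves; a small sketch (not part of the repository) for verifying a downloaded `model.safetensors` against the pointer:

```python
import hashlib

# Expected values taken from the Git LFS pointer above.
EXPECTED_SHA256 = "296ea1e18f2c377a85f7ace0867ccf264318fe0caf2f04c4d773ef0774f57fbb"
EXPECTED_SIZE = 440136504

def weights_match_pointer(path="model.safetensors"):
    """Return True if the local file matches the size and sha256 from the LFS pointer."""
    sha, size = hashlib.sha256(), 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
            size += len(chunk)
    return size == EXPECTED_SIZE and sha.hexdigest() == EXPECTED_SHA256

print("weights OK" if weights_match_pointer() else "weights do not match the pointer")
```
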
model_usage_example.py ADDED
import torch
from transformers import BertTokenizer, BertForTokenClassification
import re
import string


def preprocess_input_text(text):
    """
    Adds a [MASK] token after each word, inserts a space before every punctuation mark,
    and converts all words to lowercase.

    Returns the original words from the input text along with the preprocessed version of the text.
    """
    text = re.sub(r'([' + string.punctuation + '])', r' \1', text)
    text = re.sub(' +', ' ', text)

    # Keep the original (cased) words so they can be echoed back in the output.
    words = text.split(" ")

    text = text.lower()

    output = []
    for word in text.split(" "):
        output.append(word)
        output.append("[MASK]")

    return words, " ".join(output)


def predict_using_trained_model_old(input_text, model_dir, device):
    """
    Loads the model and predicts whether each word in the input text is correct or incorrect.

    The output is the input text, where each word is followed by a label indicating whether
    the word is correct (0) or incorrect (1).
    """
    words, input_text = preprocess_input_text(input_text)

    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = BertForTokenClassification.from_pretrained(model_dir, num_labels=2)
    model.to(device)

    tokenized_inputs = tokenizer(input_text, max_length=128, padding='max_length', truncation=True, return_tensors="pt")
    input_ids = tokenized_inputs["input_ids"].to(device)
    attention_mask = tokenized_inputs["attention_mask"].to(device)

    model.eval()

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    predictions = torch.argmax(logits, dim=-1).squeeze().cpu().numpy()
    tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().cpu().numpy())

    # The label predicted on each [MASK] token belongs to the word that precedes it,
    # so walk over the tokens and emit a "word label" pair at every [MASK].
    model_output = []
    mask_index = 0

    for token, prediction in zip(tokens, predictions):
        if token == "[MASK]":
            model_output.append(words[mask_index])
            model_output.append(str(prediction))
            mask_index += 1

    return " ".join(model_output)


if __name__ == '__main__':
    input_text = "Model u tekstu prepoznije riječi u kojima se nalazaju pogreške."
    model_dir = "."

    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    print(f"Using device: {device}")

    model_output_text = predict_using_trained_model_old(input_text, model_dir, device)

    print(f"Model output: {model_output_text}")
special_tokens_map.json ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": false,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
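
Since `do_lower_case` is false, the usage example lowercases the input itself before adding `[MASK]` tokens. A small sketch of loading the tokenizer and checking its special tokens, assuming the tokenizer files are in the current working directory:

```python
from transformers import BertTokenizer

# Assumes vocab.txt, tokenizer_config.json and special_tokens_map.json
# are in the current working directory.
tokenizer = BertTokenizer.from_pretrained(".")

print(tokenizer.mask_token, tokenizer.mask_token_id)   # [MASK] 4
print(tokenizer.tokenize("model u tekstu [MASK]"))
```
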
vocab.txt ADDED
The diff for this file is too large to render.