---
library_name: transformers
tags:
- ner
- msu
- wiki
- fine-tuned
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- precision
- recall
- f1
base_model:
- Babelscape/wikineural-multilingual-ner
pipeline_tag: token-classification
---

# Fine-tuned multilingual model for Russian-language NER
This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is built on mBERT (multilingual BERT).
I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset follows the BIO tagging scheme with the following labels:
```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```
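Under the BIO scheme, `B-` marks the first token of an entity span, `I-` marks its continuation, and `O` marks tokens outside any entity. A hand-made illustration (the sentence means "Ivan Ivanov works at MSU"):

```python
tokens = ["Иван",  "Иванов", "работает", "в", "МГУ"]
tags   = ["B-PER", "I-PER",  "O",        "O", "B-ORG"]
```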


## Model Details

Fine-tuning ran for 3 epochs and produced the following metrics:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
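The evaluation code is not shown on this card; for NER, precision, recall, and F1 are usually entity-level metrics computed with `seqeval`. A minimal sketch of a compatible `compute_metrics` function for this label set (assuming the usual `-100` masking of padding and sub-word positions; not necessarily the author's exact code):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # drop the -100 positions used to mask padding / sub-word tokens,
    # then map label ids back to their BIO tag strings
    true_labels = [
        [label_names[l] for l in row if l != -100]
        for row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```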

To avoid over-fitting on the small number of training samples, I used a high weight decay of 0.1.
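The full training configuration is not published; a `TrainingArguments` sketch consistent with the stated setup (3 epochs, `weight_decay=0.1`; the remaining values are assumptions, not the author's actual settings) might look like:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="msu-wiki-ner",
    num_train_epochs=3,              # matches the table above
    weight_decay=0.1,                # high weight decay against over-fitting
    learning_rate=2e-5,              # assumed: a common fine-tuning default
    per_device_train_batch_size=16,  # assumed
    eval_strategy="epoch",
    save_strategy="epoch",
)
```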


## Basic usage

You can use this model directly through the `pipeline` API for the token-classification task.

```python
import torch

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


model_ckpt = "nesemenpolkov/msu-wiki-ner"

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True  # tolerate a classification head that differs in size from the checkpoint
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"  # merge sub-word tokens into whole entity spans
)

# "This is Ivan Ivanov, in the passport Ivanov I.I."
demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():  # inference only; no gradients needed
    out = pipe(demo_sample)
```
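With `aggregation_strategy="simple"`, sub-word pieces are merged back into whole entities, so each item in `out` is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. For example, to print the detected entities:

```python
for ent in out:
    print(f"{ent['entity_group']:<4} {ent['score']:.2f} {ent['word']}")
```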


## Bias, Risks, and Limitations

This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may score noticeably lower on texts in other languages.


## Citation
```
@inproceedings{nesemenpolkov-2024-msu-wiki-ner,
    title = "Fine-tuned multilingual model for Russian-language NER",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
```