---
library_name: transformers
tags:
- ner
- msu
- wiki
- fine-tuned
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- precision
- recall
- f1
base_model:
- Babelscape/wikineural-multilingual-ner
pipeline_tag: token-classification
---

# Fine-tuned multilingual model for Russian-language NER
This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is built on mBERT (multilingual BERT).
I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset follows the BIO tagging scheme with the following labels:
```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```
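Under the BIO scheme, `B-` marks the first token of an entity span, `I-` marks its continuation, and `O` marks tokens outside any entity. A hand-made illustration (the sentence means "Ivan Ivanov works at MSU"):

```python
tokens = ["Иван",  "Иванов", "работает", "в", "МГУ"]
tags   = ["B-PER", "I-PER",  "O",        "O", "B-ORG"]
```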


## Model Details

Fine-tuning ran for 3 epochs and produced the following metrics:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
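The evaluation code is not shown on this card; for NER, precision, recall, and F1 are usually entity-level metrics computed with `seqeval`. A minimal sketch of a compatible `compute_metrics` function for this label set (assuming the usual `-100` masking of padding and sub-word positions; not necessarily the author's exact code):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # drop the -100 positions used to mask padding / sub-word tokens,
    # then map label ids back to their BIO tag strings
    true_labels = [
        [label_names[l] for l in row if l != -100]
        for row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```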

To avoid over-fitting on the small number of training samples, I used a high weight decay of 0.1.
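The full training configuration is not published; a `TrainingArguments` sketch consistent with the stated setup (3 epochs, `weight_decay=0.1`; the remaining values are assumptions, not the author's actual settings) might look like:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="msu-wiki-ner",
    num_train_epochs=3,              # matches the table above
    weight_decay=0.1,                # high weight decay against over-fitting
    learning_rate=2e-5,              # assumed: a common fine-tuning default
    per_device_train_batch_size=16,  # assumed
    eval_strategy="epoch",
    save_strategy="epoch",
)
```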


## Basic usage

You can use this model directly through the `pipeline` API for the token-classification task.

```python
import torch

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


model_ckpt = "nesemenpolkov/msu-wiki-ner"

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True  # tolerate a classification head that differs in size from the checkpoint
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"  # merge sub-word tokens into whole entity spans
)

# "This is Ivan Ivanov, in the passport Ivanov I.I."
demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():  # inference only; no gradients needed
    out = pipe(demo_sample)
```
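With `aggregation_strategy="simple"`, sub-word pieces are merged back into whole entities, so each item in `out` is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. For example, to print the detected entities:

```python
for ent in out:
    print(f"{ent['entity_group']:<4} {ent['score']:.2f} {ent['word']}")
```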


## Bias, Risks, and Limitations

This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may score noticeably lower on texts in other languages.


## Citation
```
@inproceedings{nesemenpolkov-2024-msu-wiki-ner,
    title = "Fine-tuned multilingual model for Russian-language NER",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
```