---
library_name: transformers
tags:
- ner
- msu
- wiki
- fine-tuned
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- precision
- recall
- f1
base_model:
- Babelscape/wikineural-multilingual-ner
pipeline_tag: token-classification
---

# Fine-tuned multilingual model for Russian-language NER

This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is built on multilingual BERT (mBERT).

I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```

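To make the BIO scheme concrete, here is a minimal sketch of how tagged tokens group into entity spans. The helper `bio_to_spans` and the sample sentence are illustrative assumptions, not part of the dataset tooling or of this model's code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # 'O' closes any open entity; a stray I- tag is treated the
            # same way and its token is dropped.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example, not model output.
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here `B-PER` followed by `I-PER` yields a single `PER` span, while each lone `B-` tag forms a one-token entity.
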
## Model Details

Fine-tuning ran for 3 epochs and produced the following metrics:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |

To reduce over-fitting on the small training set, I used a high weight decay (`weight_decay = 0.1`).

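As a rough sketch, the two training settings stated in this section (3 epochs, `weight_decay = 0.1`) could be collected as keyword arguments for `transformers.TrainingArguments`. The `output_dir` value is a hypothetical placeholder, and the other hyperparameters (learning rate, batch size, and so on) are not given in this card, so they are deliberately left out.

```python
# Settings taken from this model card.
training_kwargs = {
    "num_train_epochs": 3,  # fine-tuning ran for 3 epochs
    "weight_decay": 0.1,    # high weight decay to limit over-fitting
}

# Hypothetical placeholder, not stated in the card.
training_kwargs["output_dir"] = "msu-wiki-ner-finetune"

# With transformers installed, these could be unpacked as
# TrainingArguments(**training_kwargs) and passed to a Trainer.
print(training_kwargs)
```
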
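The F1 column in the metrics table above can be cross-checked from the precision and recall columns, since F1 is their harmonic mean. A minimal sketch using the reported values (the helper name `f1_score` is only illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per epoch, copied from the table.
reported = {
    1: (0.959569, 0.974253, 0.966855),
    2: (0.959569, 0.974253, 0.966855),
    3: (0.963239, 0.973767, 0.968474),
}

for epoch, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    # Reported values are rounded to 6 decimals, so allow a small tolerance.
    print(epoch, round(computed, 6), abs(computed - f1) < 1e-5)
```

All three rows agree with the reported F1 to within rounding.
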
## Basic usage

The model can be used directly with the `pipeline` API for the `token-classification` task.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_ckpt = "nesemenpolkov/msu-wiki-ner"

# Label set used during fine-tuning (BIO scheme).
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    # "simple" groups adjacent tokens of the same entity into one span.
    aggregation_strategy="simple"
)

# Russian for: "This is Ivan Ivanov, in the passport Ivanov I.I."
demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():
    out = pipe(demo_sample)

print(out)
```

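With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below shows one possible way to post-process that list. The helper `group_entities`, the `min_score` threshold, and the sample predictions are all assumptions made for illustration; the predictions are hand-written to show the output shape and are not real model output.

```python
from collections import defaultdict


def group_entities(text, predictions, min_score=0.5):
    """Collect entity surface strings per type, skipping low-confidence spans."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            # Slice the original text so casing and spacing are preserved.
            grouped[pred["entity_group"]].append(text[pred["start"]:pred["end"]])
    return dict(grouped)


text = "Этот Иван Иванов, в паспорте Иванов И.И."

# Hand-written stand-in for `out`; real scores and spans will differ.
example_out = [
    {"entity_group": "PER", "score": 0.99, "word": "Иван Иванов", "start": 5, "end": 16},
    {"entity_group": "PER", "score": 0.40, "word": "Иванов И. И.", "start": 29, "end": 40},
]

print(group_entities(text, example_out))
```

Slicing `text` by `start` and `end` is used here instead of the `word` field because `word` is rebuilt from tokens and may not match the original spacing.
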
## Bias, Risks, and Limitations

This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may perform poorly on text in other languages.

## Citation

```
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
```