---
library_name: transformers
tags:
- ner
- msu
- wiki
- fine-tuned
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- precision
- recall
- f1
base_model:
- Babelscape/wikineural-multilingual-ner
pipeline_tag: token-classification
---
# Fine-tuned multilingual model for Russian-language NER
This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is itself based on multilingual BERT (mBERT).
I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:
```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```
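To illustrate how BIO tags map to entities, here is a minimal sketch that merges a tagged token sequence into typed spans. The helper name `bio_to_spans` and the example sentence are assumptions made for illustration; they are not part of the model or the dataset tooling.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    An I- tag that does not continue an open span of the same type
    is treated like 'O' in this sketch.
    """
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the currently open span, if any
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hypothetical example sentence, tagged by hand
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```

The `B-` prefix opens a new entity, `I-` continues it, and `O` marks tokens outside any entity.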
## Model Details
The model was fine-tuned for 3 epochs; the following metrics were recorded for each epoch:
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
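As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the function name `f1_score` is chosen here for illustration and is not taken from the training code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
rows = {
    1: (0.959569, 0.974253, 0.966855),
    2: (0.959569, 0.974253, 0.966855),
    3: (0.963239, 0.973767, 0.968474),
}

for epoch, (p, r, reported) in rows.items():
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.6f}, reported = {reported}")
```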
To avoid overfitting on the small number of training samples, I used a high weight decay (`weight_decay = 0.1`).
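This card does not say which training loop was used. Assuming the `Trainer` API from `transformers`, the two hyperparameters stated here could be passed as in the sketch below. Only the epoch count and the weight decay come from this card; `output_dir` is a hypothetical placeholder, and every other setting is left at the library default.

```python
# Only num_train_epochs and weight_decay are taken from this card.
training_kwargs = {
    "output_dir": "msu-wiki-ner-finetune",  # hypothetical path, not from the card
    "num_train_epochs": 3,
    "weight_decay": 0.1,
}


def build_training_args(kwargs):
    """Create TrainingArguments lazily so the dict can be inspected without transformers installed."""
    from transformers import TrainingArguments

    return TrainingArguments(**kwargs)
```

Calling `build_training_args(training_kwargs)` returns an object that can be handed to `Trainer(args=...)`.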
## Basic usage
You can use this model through the `pipeline` API for the `token-classification` task:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_ckpt = "nesemenpolkov/msu-wiki-ner"

# Label set of the collection3 dataset (BIO scheme)
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"  # group adjacent tokens of the same entity
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

# Inference only, so gradients are not needed
with torch.no_grad():
    out = pipe(demo_sample)
```
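With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below shows one way to group such a result by entity type. The `example_out` list is a hand-written placeholder in that format, not real output of this model, and `group_entities` and `min_score` are names introduced here for illustration.

```python
from collections import defaultdict

text = "Этот Иван Иванов, в паспорте Иванов И.И."

# Hand-written placeholder in the pipeline's output format.
# The scores are made up; this is NOT real model output.
example_out = [
    {"entity_group": "PER", "score": 0.99, "word": "Иван Иванов", "start": 5, "end": 16},
    {"entity_group": "PER", "score": 0.98, "word": "Иванов И.И.", "start": 29, "end": 40},
]


def group_entities(text, entities, min_score=0.5):
    """Collect entity strings by type, skipping low-confidence predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # Slice the original text so the surface form is preserved exactly
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


by_type = group_entities(text, example_out)
print(by_type)
```

Slicing the input by `start` and `end` rather than reading `word` avoids spacing differences that tokenization can introduce.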
## Bias, Risks, and Limitations
This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may score poorly on texts in other languages.
## Citation
```
@inproceedings{tedeschi-etal-2021-wikineural-combined,
title = "Fine-tuned multilingual model for russian language NER.",
author = "nesemenpolkov",
booktitle = "Detecting names in noisy and dirty data.",
month = oct,
year = "2024",
address = "Moscow, Russian Federation",
}
```