	---
	library_name: transformers
	tags:
	- ner
	- msu
	- wiki
	- fine-tuned
	datasets:
	- RCC-MSU/collection3
	language:
	- ru
	metrics:
	- precision
	- recall
	- f1
	base_model:
	- Babelscape/wikineural-multilingual-ner
	pipeline_tag: token-classification
	---

	# Fine-tuned multilingual model for Russian-language NER
	This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is itself based on multilingual BERT (mBERT).
	I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:
	```python
	label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
	```
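
To make the BIO scheme concrete, here is a minimal sketch that collapses a tag sequence into entity spans. The helper `bio_to_spans` and the example sentence are illustrative assumptions, not part of the dataset or of this model's code.

```python
from typing import List, Tuple

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, 0
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An "I-X" tag continues the open span only if the span has type X.
        continues = prefix == "I" and etype == current_type
        if current_type is not None and not continues:
            spans.append((current_type, start, i))
            current_type = None
        # "B-" opens a span; a stray "I-" with no open span is leniently treated the same way.
        if prefix in ("B", "I") and current_type is None:
            current_type, start = etype, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hypothetical tokenized sentence with hand-assigned tags (not taken from collection3).
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]

for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

`B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks tokens outside any entity, so two adjacent entities of the same type stay distinguishable.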


	## Model Details

	Fine-tuning ran for 3 epochs and produced the following metrics:

	| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
	| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
	| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
	| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
	| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
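
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values. The card does not state which library computed the metrics, so the helper `f1_score` is only an illustrative assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1), copied from the table above.
reported = [
    (1, 0.959569, 0.974253, 0.966855),
    (2, 0.959569, 0.974253, 0.966855),
    (3, 0.963239, 0.973767, 0.968474),
]

for epoch, p, r, f1 in reported:
    # Print the recomputed value next to the reported one for comparison.
    print(f"epoch {epoch}: recomputed {f1_score(p, r):.6f}, reported {f1:.6f}")
```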

	To reduce over-fitting on the small training set, I used a high weight decay (`weight_decay = 0.1`).
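
The two hyperparameters stated in this card can be written down as a small config sketch. It assumes training used the `transformers` `Trainer` API, which the card does not confirm. The helper `build_training_args` and the output directory name are hypothetical, and every setting not listed here is unknown and left at the library default.

```python
# Hyperparameters stated in this card; nothing else about the training run is known.
stated_hyperparams = {
    "num_train_epochs": 3,
    "weight_decay": 0.1,
}


def build_training_args(output_dir: str = "msu-wiki-ner-checkpoints"):
    """Build TrainingArguments from the stated values.

    Assumption: the Trainer API was used; output_dir is a placeholder name.
    """
    # Imported lazily so the dict above can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **stated_hyperparams)
```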


	## Basic usage

	The model can be used with the `pipeline` API for the `token-classification` task.

	```python
	import torch

	from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


	model_ckpt = "nesemenpolkov/msu-wiki-ner"

	# Label set used during fine-tuning (BIO scheme).
	label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

	id2label = {i: label for i, label in enumerate(label_names)}
	label2id = {v: k for k, v in id2label.items()}

	tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
	model = AutoModelForTokenClassification.from_pretrained(
	    model_ckpt,
	    id2label=id2label,
	    label2id=label2id,
	    ignore_mismatched_sizes=True,
	)

	pipe = pipeline(
	    task="token-classification",
	    model=model,
	    tokenizer=tokenizer,
	    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
	    aggregation_strategy="simple",  # group adjacent tokens of the same entity
	)

	demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

	with torch.no_grad():
	    out = pipe(demo_sample)

	print(out)
	```
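
With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to filter and group such records. The records in `mock_out` are hand-written for illustration and are not real model output, and the helper `group_entities` and its `min_score` threshold are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

text = "Этот Иван Иванов, в паспорте Иванов И.И."

# Hand-written records in the shape returned by the pipeline.
# NOT real model output: the scores are made up for illustration.
mock_out = [
    {"entity_group": "PER", "score": 0.99, "word": "Иван Иванов", "start": 5, "end": 16},
    {"entity_group": "PER", "score": 0.42, "word": "Иванов И. И.", "start": 29, "end": 40},
]


def group_entities(text: str, entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity surface forms by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # Slice the original text by character offsets rather than using "word",
            # since "word" is rebuilt from tokens and may differ in spacing.
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_entities(text, mock_out))
```

Slicing by `start` and `end` keeps the exact surface form from the input, which is convenient when the entities need to be highlighted or replaced in the original string.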


	## Bias, Risks, and Limitations

	This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may perform poorly on text in other languages.


	## Citation
	```
	@inproceedings{tedeschi-etal-2021-wikineural-combined,
	title = "Fine-tuned multilingual model for russian language NER.",
	author = "nesemenpolkov",
	booktitle = "Detecting names in noisy and dirty data.",
	month = oct,
	year = "2024",
	address = "Moscow, Russian Federation",
	}
	```