NER model for definition component recognition in German scientific texts

distilbert-base-multilingual-cased-definitions_ner is a token classification (NER) model for German scientific text, fine-tuned from distilbert-base-multilingual-cased. It was trained on a custom annotated dataset of around 10,000 training and 2,000 test examples, consisting of definition and non-definition sentences drawn from German-language Wikipedia articles.

The model is specifically designed to recognize and classify components of definitions, using the following entity labels:

  • DF: Definiendum (the term being defined)
  • VF: Definitor (the verb or phrase introducing the definition)
  • GF: Definiens (the explanation or meaning)
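
To make the label scheme concrete, here is a minimal sketch of how per-token tags could be merged into labelled spans. It is not taken from the model's code: the helper `group_spans`, the hand-written example sentence and its tags, and the handling of an optional `B-`/`I-` prefix are all assumptions for illustration, since the card does not state the exact tag format.

```python
from typing import List, Tuple

# Entity labels listed in the model card.
LABEL_NAMES = {"DF": "Definiendum", "VF": "Definitor", "GF": "Definiens"}


def group_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a label into (label, text) spans.

    Hypothetical helper: accepts plain tags ("DF") as well as BIO-style
    tags ("B-DF", "I-DF"); anything not in LABEL_NAMES (e.g. "O") is dropped.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        # Split off an optional BIO prefix; "DF" yields prefix "" and label "DF".
        prefix, _, label = tag.rpartition("-")
        if prefix == "B" or label != current_label:
            # A new span starts here, so store the previous one if it is an entity.
            if current_tokens and current_label in LABEL_NAMES:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, []
        current_tokens.append(token)
    if current_tokens and current_label in LABEL_NAMES:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hand-labelled example sentence (an assumption, not actual model output).
tokens = ["Ein", "Algorithmus", "ist", "eine", "eindeutige", "Handlungsvorschrift", "."]
tags = ["O", "DF", "VF", "GF", "GF", "GF", "O"]

for label, text in group_spans(tokens, tags):
    print(f"{LABEL_NAMES[label]} ({label}): {text}")
```

In this example the definiendum is "Algorithmus", the definitor is "ist", and the definiens is "eine eindeutige Handlungsvorschrift".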

Training was conducted using a standard NER objective. The model achieves an F1 score of approximately 81% on the evaluation set.

Here are the overall final metrics on the test dataset after 5 epochs of training:

  • f1: 0.812455003599712
  • precision: 0.8076097328244275
  • recall: 0.8173587638821825
  • loss: 0.329479843378067
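
As a quick sanity check, the reported F1 appears consistent with the harmonic mean of the reported precision and recall. The sketch below only recomputes that relationship from the numbers above; it assumes nothing about which evaluation library produced them, and the helper name `f1_score` is hypothetical.

```python
import math

# Values copied from the metrics listed above.
precision = 0.8076097328244275
recall = 0.8173587638821825
reported_f1 = 0.812455003599712


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


computed_f1 = f1_score(precision, recall)
print(f"computed F1: {computed_f1:.6f}, reported F1: {reported_f1:.6f}")

# The two values should agree up to rounding.
matches = math.isclose(computed_f1, reported_f1, abs_tol=1e-6)
print("consistent:", matches)
```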

Model Performance Comparison on wiki_definitions_de_multitask:

| Model | Precision (%) | Recall (%) | F1 Score (%) | Eval Samples per Second | Epoch |
|---|---|---|---|---|---|
| distilbert-base-multilingual-cased-definitions_ner | 80.76 | 81.74 | 81.25 | 457.53 | 5.0 |
| scibert_scivocab_cased-definitions_ner | 80.54 | 82.11 | 81.32 | 236.61 | 4.0 |
| GottBERT_base_best-definitions_ner | 82.98 | 82.81 | 82.90 | 272.26 | 5.0 |
| xlm-roberta-base-definitions_ner | 81.90 | 83.35 | 82.62 | 241.21 | 5.0 |
| gbert-base-definitions_ner | 82.73 | 83.56 | 83.14 | 278.87 | 5.0 |
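
The comparison suggests a speed versus accuracy trade-off. The short sketch below works it out from the figures in the comparison, assuming all throughput values were measured under comparable conditions, which the card does not state. The variable names are illustrative only.

```python
# (F1 score in %, eval samples per second), copied from the comparison above.
results = {
    "distilbert-base-multilingual-cased-definitions_ner": (81.25, 457.53),
    "scibert_scivocab_cased-definitions_ner": (81.32, 236.61),
    "GottBERT_base_best-definitions_ner": (82.90, 272.26),
    "xlm-roberta-base-definitions_ner": (82.62, 241.21),
    "gbert-base-definitions_ner": (83.14, 278.87),
}

this_model = "distilbert-base-multilingual-cased-definitions_ner"

# Model with the highest F1 score in the comparison.
best_model = max(results, key=lambda name: results[name][0])

# F1 given up, and throughput gained, by choosing this model over the best one.
f1_gap = results[best_model][0] - results[this_model][0]
speedup = results[this_model][1] / results[best_model][1]

print(f"best F1: {best_model}")
print(f"F1 gap: {f1_gap:.2f} points, relative throughput: {speedup:.2f}x")
```

On these numbers, the DistilBERT-based model trades about 1.9 F1 points for roughly 1.6 times the evaluation throughput of the highest-scoring model.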

Model size: 135M params (Safetensors, tensor type F32)