NERCat Classifier

Model Overview

NERCat is a fine-tuned version of Knowledgator's GLiNER model, designed specifically for Named Entity Recognition (NER) in Catalan. Trained on a manually annotated dataset of Catalan-language television transcriptions, it substantially improves the recognition of named entities across diverse categories, addressing the scarcity of high-quality training data for Catalan.

The model was fine-tuned from the pre-trained checkpoint knowledgator/gliner-bi-large-v1.0.

Quickstart

import torch
from gliner import GLiNER

# Load the fine-tuned model and move it to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories NERCat was fine-tuned on
labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law"
]

# Run NER over the text with the label set above
entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
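
The threshold argument sets the minimum confidence score a span needs in order to be returned: lower values surface more candidate entities at the cost of precision, while higher values keep only high-confidence predictions.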

Performance Evaluation

We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves F1 for every entity category, with the largest gains in the categories where the baseline is weakest (Product, Date, and Law); the only metric that decreases is precision for Date.

| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00             | 1.00          | 1.00      | 0.92             | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility     | 0.89             | 1.00          | 0.94      | 0.67             | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization | 1.00             | 1.00          | 1.00      | 0.72             | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location     | 1.00             | 0.97          | 0.99      | 0.83             | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product      | 0.96             | 1.00          | 0.98      | 0.63             | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event        | 0.88             | 0.88          | 0.88      | 0.60             | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date         | 0.88             | 1.00          | 0.93      | 1.00             | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
| Law          | 0.67             | 1.00          | 0.80      | 0.00             | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
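
A minimal sketch of how such per-category scores can be computed is shown below, assuming gold and predicted entities are available as exact-match (entity text, label) pairs per sentence; the actual evaluation pipeline and matching criterion used for NERCat are not published here, so treat it as illustrative only.

from collections import defaultdict

def per_label_scores(gold_sentences, pred_sentences):
    # gold_sentences / pred_sentences: one set of (entity_text, label) pairs per sentence
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_sentences, pred_sentences):
        for text, label in pred:
            if (text, label) in gold:
                tp[label] += 1      # predicted entity matches a gold annotation
            else:
                fp[label] += 1      # predicted entity has no gold counterpart
        for text, label in gold - pred:
            fn[label] += 1          # gold entity the model missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = {
            "precision": p,
            "recall": r,
            "f1": 2 * p * r / (p + r) if p + r else 0.0,
        }
    return scores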

Fine-Tuning Process

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization (a minimal sketch of the loss, optimizer, and scheduler settings follows the list):

  • Data Splitting: The dataset was shuffled and split into training (90%) and testing (10%) subsets.
  • Training Setup:
    • Batch size: 8
    • Steps: 500
    • Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalances
    • Learning rates:
      • Entity layers: $5 \times 10^{-6}$
      • Other model parameters: $1 \times 10^{-5}$
    • Scheduler: Linear with a warmup ratio of 0.1
    • Evaluation frequency: Every 100 steps
    • Checkpointing: Every 1000 steps
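
For reference, the sketch below shows how these settings map onto a standard PyTorch setup: a binary focal loss with α = 0.75 and γ = 2, an AdamW optimizer with two learning-rate groups, and a linear schedule with 10% warmup. It is not the exact NERCat training script; in particular, the parameter-name filter used to separate the "entity layers" from the rest of the model is an assumption.

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    # Binary focal loss over span/label logits, down-weighting easy examples
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    p_t = torch.exp(-bce)                                   # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def build_optimizer_and_scheduler(model, total_steps=500, warmup_ratio=0.1):
    # Two learning rates: 5e-6 for the entity layers, 1e-5 for everything else.
    # The name filter below is a hypothetical way to identify the entity layers.
    entity_params, other_params = [], []
    for name, param in model.named_parameters():
        if "span_rep" in name or "prompt_rep" in name:
            entity_params.append(param)
        else:
            other_params.append(param)
    optimizer = AdamW([
        {"params": entity_params, "lr": 5e-6},
        {"params": other_params, "lr": 1e-5},
    ])
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler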

The dataset included 13,732 named entity instances across the eight categories listed above.


Citation Information

@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  eprint        = {2503.14173},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2503.14173}
}