Model Card for excribe/ner_sgd_roberta

Model Details

Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for Named Entity Recognition (NER) in Spanish. It is designed to identify entities such as direccion, telefono, mail, nombre, documento, referencia, departamento, and municipio in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (final.parquet).

Developed by: Exscribe
Model type: Token Classification (NER)
Language(s): Spanish (es)
License: CC-BY-NC-3.0
Base Model: PlanTL-GOB-ES/roberta-base-bne
Finetuned Model Repository: excribe/ner_sgd_roberta

Model Architecture

The model is based on the RoBERTa architecture, specifically the PlanTL-GOB-ES/roberta-base-bne checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

Number of Labels: 17 (including O and BIO tags for 8 entity types: DIRECCION, TELEFONO, MAIL, NOMBRE, DOCUMENTO, REFERENCIA, DEPARTAMENTO, MUNICIPIO)
Label Schema: BIO (e.g., B-DIRECCION, I-DIRECCION, O)
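
As a quick check on the label count, the following minimal sketch builds a BIO label set from the eight entity types. The ordering of ids here is an assumption made for illustration; the mapping actually stored with the published model should be read from model.config.id2label.

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]

def build_label_maps(entity_types):
    """Return (labels, label2id, id2label) for a BIO tagging scheme."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation tokens
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return labels, label2id, id2label

labels, label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(labels))  # 1 "O" tag + 2 tags for each of the 8 entity types
```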

Training Details

Training Data

The model was trained on a custom dataset derived from a Parquet file (final.parquet) containing administrative texts. The dataset includes:

Number of Rows: 27,807
Number of Columns: 32
Key Columns Used for NER:
- texto_entrada (input text)
- Entity columns: direccion, telefono, mail, nombre, documento, referencia, departamento, municipio
Null Values per Entity Column:
- direccion: 82
- telefono: 10,073
- mail: 1,086
- nombre: 0
- documento: 6,407
- referencia: 200
- departamento: 0
- municipio: 0
Dataset Description: The dataset contains administrative correspondence data with fields like case IDs (radicado), dates (fecha_radicacion), document paths, and text inputs (texto_entrada). The entity columns were used to generate BIO tags for NER training.
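
The card does not publish the preprocessing code, so the following is only a sketch of how entity columns could be turned into word-level BIO tags. It assumes whitespace tokenization, exact string matching of each column value inside texto_entrada, and that an earlier match keeps its tags when two entities overlap. The function name bio_tags_from_columns is hypothetical.

```python
def bio_tags_from_columns(text, entities):
    """Word-level BIO tags from {ENTITY_TYPE: column value or None}."""
    words = text.split()
    tags = ["O"] * len(words)
    for entity_type, value in entities.items():
        target = value.split() if value else []
        if not target:  # null or empty column: nothing to tag
            continue
        n = len(target)
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            # Compare while ignoring punctuation glued to the edges of a word.
            if [w.strip(".,;:") for w in window] != [t.strip(".,;:") for t in target]:
                continue
            if any(tag != "O" for tag in tags[start:start + n]):
                continue  # overlap: the earlier match keeps its tags
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, start + n):
                tags[i] = f"I-{entity_type}"
            break  # tag only the first free occurrence
    return words, tags

words, tags = bio_tags_from_columns(
    "Contactar a Juan Pérez en Calle Falsa 123, Bogotá.",
    {"NOMBRE": "Juan Pérez", "DIRECCION": "Calle Falsa 123, Bogotá", "TELEFONO": None},
)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

A null column (TELEFONO here) simply contributes no tags, which is why columns with many nulls yield fewer training examples for that entity.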

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the PlanTL-GOB-ES/roberta-base-bne tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
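
An 81/9/10 split is what results from holding out 10% for testing and then 10% of the remainder for validation. The sketch below reproduces those proportions on row indices; the two-stage procedure and the use of seed 42 for shuffling are assumptions, since the card states only the final percentages.

```python
import random

def split_indices(n_rows, seed=42):
    """Two-stage split: 10% test, then 10% of the remainder for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * 0.10)
    test_idx, rest = indices[:n_test], indices[n_test:]
    n_val = round(len(rest) * 0.10)
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(27_807)
total = len(train_idx) + len(val_idx) + len(test_idx)
for name, part in [("train", train_idx), ("validation", val_idx), ("test", test_idx)]:
    print(f"{name}: {len(part)} rows ({len(part) / total:.1%})")
```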

Training Procedure

The model was fine-tuned using the Hugging Face transformers library with the following configuration:

Training Arguments:
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 8 (per device)
- Weight Decay: 0.01
- Evaluation Strategy: Per epoch
- Save Strategy: Per epoch
- Load Best Model at End: True (based on F1 score)
- Optimizer: AdamW
- Precision: Mixed precision (FP16) on GPU
- Seed: 42
Hardware: GPU (CUDA-enabled, if available) or CPU
Libraries Used:
- transformers
- datasets
- evaluate
- seqeval
- pandas
- pyarrow
- torch
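
The configuration above can be expressed as keyword arguments for transformers.TrainingArguments. This is a sketch rather than the original training script: output_dir is a hypothetical path, the evaluation batch size is assumed equal to the training batch size, and metric_for_best_model assumes that the compute_metrics function returns a key named "f1".

```python
def build_training_kwargs(output_dir, use_fp16):
    """Keyword arguments mirroring the configuration listed in this card."""
    return {
        "output_dir": output_dir,          # hypothetical path, not from the card
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 8,
        "per_device_eval_batch_size": 8,   # assumed equal to the training batch size
        "weight_decay": 0.01,
        "eval_strategy": "epoch",          # named `evaluation_strategy` in older releases
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
        "metric_for_best_model": "f1",     # assumes compute_metrics returns an "f1" key
        "fp16": use_fp16,                  # enable only when training on a CUDA GPU
        "seed": 42,
    }

training_kwargs = build_training_kwargs("./ner_output", use_fp16=False)
print(training_kwargs)
```

The dictionary would be passed as TrainingArguments(**training_kwargs). No optimizer argument is set because the Trainer uses an AdamW optimizer by default.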

The training process included:

Loading and preprocessing the Parquet dataset.
Converting text and entity annotations to BIO format.
Tokenizing and aligning labels with sub-tokens.
Fine-tuning the model with a custom classification head.
Evaluating on the validation set after each epoch.
Saving the best model based on the F1 score.
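
The label alignment step can be sketched as follows. The card does not say which alignment strategy was used; this example assumes the common one in which only the first sub-token of each word keeps the word's label and every other position receives -100 so that the loss ignores it. The word_ids list is what a fast tokenizer's word_ids() method returns, and the label ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto sub-token positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token such as <s> or </s>
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)             # later sub-token of the same word
        previous = word_id
    return aligned

# Three words, the second split into two sub-tokens, wrapped in special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [0, 7, 8]  # hypothetical ids, e.g. O, B-NOMBRE, I-NOMBRE
aligned = align_labels(word_ids, word_label_ids)
print(aligned)  # [-100, 0, 7, -100, 8, -100]
```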

Training Metrics

The model was evaluated on the test set after training, achieving the following metrics:

Precision: 0.8948
Recall: 0.9052
F1-Score: 0.9000
Accuracy: 0.9857
Evaluation Loss: 0.0455
Runtime: 12.16 seconds
Samples per Second: 228.612
Steps per Second: 28.607

Evaluation

Evaluation Metrics

The model was evaluated using the seqeval metric in strict IOB2 mode, which computes:

Precision: Proportion of predicted entities that exactly match a gold entity in both span and type.
Recall: Proportion of gold entities that the model predicts exactly (same span and type).
F1-Score: Harmonic mean of precision and recall.
Accuracy: Proportion of correctly classified tokens (including non-entity tokens).
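
To illustrate how entity-level scores can differ from token accuracy, here is a simplified, single-sentence sketch of span-based scoring for BIO tags. It is not the seqeval implementation, which operates on lists of sentences; the helper names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and current == tag[2:]
        if current is not None and not inside:
            spans.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans

def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 for one tagged sentence."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = ["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]
pred = ["B-NOMBRE", "O", "O", "B-MUNICIPIO"]
precision, recall, f1 = entity_prf(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(precision, recall, f1, accuracy)  # 0.5 0.5 0.5 0.75
```

One wrong token out of four leaves token accuracy at 0.75, but the truncated NOMBRE span counts as a miss for the whole entity, so precision and recall both drop to 0.5.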

Test Set Performance:

Precision: 0.8948
Recall: 0.9052
F1-Score: 0.9000
Accuracy: 0.9857
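
As a consistency check, the reported F1-Score follows from the reported precision and recall:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.8948 \times 0.9052}{0.8948 + 0.9052}
    = \frac{1.6199}{1.8000}
    \approx 0.9000
```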

Example Inference

Below are example outputs from the model using the pipeline for NER:

**Input Text 1:**
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

Output:

Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: ~0.99)
Entidad: "555-9876" → Tipo: TELEFONO (Confianza: ~0.98)
Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: ~0.99)
Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: ~0.97)
Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: ~0.98)

**Input Text 2:**
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

Output:

Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: ~0.98)
Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: ~0.99)
Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: ~0.99)
Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
Entidad: "3001234567" → Tipo: TELEFONO (Confianza: ~0.98)

Usage

Using the Model with Hugging Face Transformers

To use the model for inference, you can load it with the transformers library and create a pipeline for NER:

import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
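
For downstream use it is often convenient to group the pipeline output by entity type and drop low-confidence predictions. The sketch below assumes the output format produced with aggregation_strategy="simple" (dictionaries with entity_group, word and score keys). The sample list is hand-written for illustration and is not real model output, and the min_score threshold is a hypothetical parameter.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.5):
    """Group aggregated NER predictions by type, keeping confident ones only."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            # Aggregated words can carry a leading space from the tokenizer.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)

# Hand-written sample in the pipeline's output shape (only the keys used here).
sample = [
    {"entity_group": "NOMBRE", "word": " Juan Pérez", "score": 0.99},
    {"entity_group": "DIRECCION", "word": " Calle Falsa 123, Bogotá", "score": 0.99},
    {"entity_group": "TELEFONO", "word": " 555-9876", "score": 0.98},
    {"entity_group": "REFERENCIA", "word": " Contactar", "score": 0.31},
]
grouped = group_entities(sample, min_score=0.5)
print(grouped)
```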

Installation Requirements

To run the model, install the required libraries:

pip install "transformers[torch]" datasets evaluate seqeval accelerate pandas pyarrow

Hardware Requirements

Inference: Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
Training: GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

Limitations

Dataset Bias: The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
Entity Overlap: The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
Null Values: Some entity columns contain many nulls (e.g., telefono is null in 10,073 of 27,807 rows, about 36%), so the model saw fewer examples of those entities and may perform worse on them.
Language: The model is optimized for Spanish and may not perform well on other languages.

Citation

If you use this model, please cite:

@misc{excribe_ner_sgd_roberta,
  author = {Exscribe},
  title = {NER Model for Spanish Administrative Texts},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}

Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.
