NER SGD Bertina RoBERTa

Model Overview

This is a fine-tuned Named Entity Recognition (NER) model for extracting specific entities from Spanish text, designed for administrative and document management contexts. It is based on the bertin-project/bertin-roberta-base-spanish model and trained to identify entities such as direccion, telefono, mail, nombre, documento, referencia, departamento, and municipio using the BIO tagging scheme (B-TAG, I-TAG, O).

The model was trained on a custom dataset stored in a Parquet file (final.parquet) containing Spanish text and labeled entities, likely sourced from a document management system (SGD). It leverages the Hugging Face transformers library for training and inference.
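
As an illustration of the BIO scheme, a sentence such as "Contactar a Juan Pérez en Calle Falsa 123" is labeled token by token roughly as shown below (word-level tokens for readability; the model itself operates on sub-word tokens):

tokens = ["Contactar", "a", "Juan", "Pérez", "en", "Calle", "Falsa", "123"]
labels = ["O", "O", "B-NOMBRE", "I-NOMBRE", "O", "B-DIRECCION", "I-DIRECCION", "I-DIRECCION"]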

Model Details

  • Base Model: bertin-project/bertin-roberta-base-spanish
  • Task: Named Entity Recognition (NER)
  • Language: Spanish
  • Labels:
    • O: Outside of an entity
    • B-DIRECCION, I-DIRECCION: Address
    • B-TELEFONO, I-TELEFONO: Phone number
    • B-MAIL, I-MAIL: Email
    • B-NOMBRE, I-NOMBRE: Name
    • B-DOCUMENTO, I-DOCUMENTO: Document ID
    • B-REFERENCIA, I-REFERENCIA: Reference
    • B-DEPARTAMENTO, I-DEPARTAMENTO: Department
    • B-MUNICIPIO, I-MUNICIPIO: Municipality
  • Training Framework: Hugging Face transformers, datasets, evaluate, seqeval
  • Training Hardware: GPU (if available) or CPU

Intended Use

This model is designed for extracting structured information from unstructured Spanish text in administrative or document management contexts, such as pulling contact details or references from official correspondence. It is intended for non-commercial use only, as per the CC-BY-NC-3.0 license.

Example Usage

Below is an example of how to use the model with the Hugging Face pipeline for NER:

from transformers import pipeline

# Load the model and tokenizer
ner_pipeline = pipeline("ner", model="excribe/ner_sgd_bertina_roberta", aggregation_strategy="simple")

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected] tradedoubler:59376,59376"

# Run inference
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']} | Type: {entity['entity_group']} | Score: {entity['score']:.4f}")

Example Output:

Entity: Juan Pérez | Type: NOMBRE | Score: 0.9876
Entity: Calle Falsa 123 | Type: DIRECCION | Score: 0.9789
Entity: Bogotá | Type: MUNICIPIO | Score: 0.9654
Entity: 555-9876 | Type: TELEFONO | Score: 0.9921
Entity: [email protected] | Type: MAIL | Score: 0.9890

Training Data

The model was trained on a custom dataset (final.parquet) containing 27,807 rows and 32 columns, likely sourced from a document management system (SGD). The dataset includes Spanish texts in the texto_entrada column and labeled entities in the following columns: direccion, telefono, mail, nombre, documento, referencia, departamento, and municipio. Other columns, such as radicado, fecha_radicacion, and sgd_tpr_descrip, suggest the data is related to administrative or official documents.

Dataset Details

  • Number of Rows: 27,807
  • Relevant Columns for NER:
    • texto_entrada: Source text
    • Entity columns: direccion, telefono, mail, nombre, documento, referencia, departamento, municipio
  • Missing Values:
    • direccion: 82 missing
    • telefono: 10,073 missing
    • mail: 1,086 missing
    • nombre: 0 missing
    • documento: 6,407 missing
    • referencia: 200 missing
    • departamento: 0 missing
    • municipio: 0 missing
  • Preprocessing:
    • A custom function (convert_row_to_bio_optimized) converted entity columns into BIO tags, handling overlaps by prioritizing earlier entities.
    • The dataset was tokenized using the base model's tokenizer, with labels aligned to sub-tokens (see the alignment sketch after this list).
    • Split: Training (80%), Validation (10%), Test (~10%).
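
The sub-token alignment follows the usual token-classification recipe for transformers fast tokenizers. A minimal sketch is shown below; the function and column names (tokens, ner_tags) are illustrative and not taken from the original training script:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bertin-project/bertin-roberta-base-spanish")

def tokenize_and_align_labels(example):
    # Tokenize pre-split words and keep the word-to-sub-token mapping
    tokenized = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    word_ids = tokenized.word_ids()
    labels, previous_word = [], None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)  # special tokens are ignored by the loss
        elif word_id != previous_word:
            labels.append(example["ner_tags"][word_id])  # first sub-token keeps the BIO label
        else:
            labels.append(-100)  # remaining sub-tokens of the same word are masked out
        previous_word = word_id
    tokenized["labels"] = labels
    return tokenized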

Training Procedure

The model was fine-tuned using the Hugging Face Trainer API with the following hyperparameters (a minimal configuration sketch follows the list):

  • Epochs: 3
  • Learning Rate: 2e-5
  • Batch Size: 8 (per device)
  • Weight Decay: 0.01
  • Evaluation Strategy: Per epoch
  • Optimizer: AdamW (default in transformers)
  • Mixed Precision: Enabled if GPU is available
  • Metrics: Precision, Recall, F1, Accuracy (via seqeval)
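
A Trainer setup consistent with these hyperparameters might look like the sketch below. The label order and the tokenized dataset variables are assumptions for illustration, not the exact training script:

import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

base = "bertin-project/bertin-roberta-base-spanish"

# BIO labels from Model Details; the exact index order in the released model is an assumption
label_list = ["O"] + [f"{prefix}-{entity}"
                      for entity in ["DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
                                     "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO"]
                      for prefix in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(
    base,
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

training_args = TrainingArguments(
    output_dir="ner_sgd_bertina_roberta",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",   # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is available
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,       # tokenized/aligned splits, assumed prepared as above
    eval_dataset=tokenized_validation,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,     # seqeval-based metrics (see Evaluation Metrics)
)
trainer.train()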

The training process included:

  1. Loading the dataset from Parquet and converting it to a Hugging Face Dataset (see the sketch after these steps).
  2. Generating BIO tags for each text.
  3. Tokenizing and aligning labels with the model's tokenizer.
  4. Fine-tuning the model with the Trainer API.
  5. Evaluating on the validation set and saving the best model based on F1 score.
  6. Final evaluation on the test set.
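
Step 1 can be reproduced with the datasets library. A brief sketch, assuming final.parquet is available locally with the columns described under Training Data:

from datasets import Dataset

# Load the Parquet file directly into a Hugging Face Dataset
dataset = Dataset.from_parquet("final.parquet")

# 80/10/10 split: hold out 20%, then split the holdout evenly into validation and test
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]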

Evaluation Metrics

The model was evaluated on the test set after 3 epochs, achieving the following metrics:

  • Precision: 0.9031
  • Recall: 0.9149
  • F1-Score: 0.9090
  • Accuracy: 0.9869
  • Loss: 0.0465
  • Runtime: 12.22 seconds
  • Samples per Second: 227.546
  • Steps per Second: 28.474

These metrics were computed using the seqeval library with the IOB2 scheme in strict mode, ensuring accurate entity boundary and type matching.
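A compute_metrics function along the following lines reproduces this setup; it is a sketch of the typical evaluate + seqeval recipe, with label_list assumed to hold the BIO tag names in model order:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Drop positions masked with -100 (special/sub-tokens) and map ids back to BIO tags
    true_labels = [[label_list[l] for l in row if l != -100] for row in labels]
    true_preds = [[label_list[p] for p, l in zip(p_row, l_row) if l != -100]
                  for p_row, l_row in zip(predictions, labels)]

    results = seqeval.compute(predictions=true_preds, references=true_labels,
                              scheme="IOB2", mode="strict")
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }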

How to Use

To use the model, install the required dependencies and load it with the Hugging Face transformers library:

pip install transformers torch

Then, use the pipeline as shown in the example above, or load the model manually:

from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_bertina_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_bertina_roberta")

# Tokenize input text
inputs = tokenizer("Calle Falsa 123, Bogotá", return_tensors="pt")

# Get predictions
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
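
To read the predictions, the class indices can be mapped back to BIO label names through the model configuration, for example:

# Map predicted class ids back to BIO labels, token by token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions[0]):
    print(token, model.config.id2label[label_id.item()])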

Limitations

  • The model is trained on a dataset from an administrative document management context and may not generalize well to other domains (e.g., social media or informal texts).
  • Overlapping entities are resolved by prioritizing earlier matches, which may miss some valid entities.
  • Missing values in entity columns (e.g., 10,073 missing telefono values) may reduce performance for certain entity types.
  • The model is optimized for Spanish and may not perform well on other languages.
  • Due to the CC-BY-NC-3.0 license, the model cannot be used for commercial purposes.

Ethical Considerations

  • Bias: The model may reflect biases in the training data, such as underrepresentation of certain entity types (e.g., telefono has many missing values) or overrepresentation of formal administrative language.
  • Privacy: The model extracts sensitive entities such as names, addresses, and phone numbers. Ensure that processing personal data in input texts is authorized, especially in document management systems handling potentially sensitive information.
  • Non-Commercial Use: The model is licensed for non-commercial use only, as per CC-BY-NC-3.0.

License

This model is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License (CC-BY-NC-3.0). You are free to share and adapt the model for non-commercial purposes, provided you give appropriate credit to the author.

Contact

For issues or questions, please contact the model author via the Hugging Face repository or open an issue.

Acknowledgments

This model was trained using the Hugging Face ecosystem (transformers, datasets, evaluate, seqeval). Thanks to the bertin-project team for providing the base model bertin-roberta-base-spanish.
