# NER SGD Bertina RoBERTa

## Model Overview
This is a fine-tuned Named Entity Recognition (NER) model for extracting specific entities from Spanish text, designed for administrative and document management contexts. It is based on the `bertin-project/bertin-roberta-base-spanish` model and trained to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` using the BIO tagging scheme (`B-TAG`, `I-TAG`, `O`).
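To make the tagging scheme concrete, here is how a short sentence might be labeled under BIO (a hypothetical illustration; actual token boundaries depend on the tokenizer):

```python
# Hypothetical BIO labeling for a short Spanish sentence.
# The first token of an entity gets a B- tag, continuations get I-,
# and everything outside an entity is O.
tokens = ["Contactar", "a", "Juan", "Pérez", "en", "Calle", "Falsa", "123"]
tags   = ["O", "O", "B-NOMBRE", "I-NOMBRE", "O", "B-DIRECCION", "I-DIRECCION", "I-DIRECCION"]
```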
The model was trained on a custom dataset stored in a Parquet file (`final.parquet`) containing Spanish text and labeled entities, likely sourced from a document management system (SGD). It leverages the Hugging Face `transformers` library for training and inference.
## Model Details
- Base Model: `bertin-project/bertin-roberta-base-spanish`
- Task: Named Entity Recognition (NER)
- Language: Spanish
- Labels:
  - `O`: Outside of an entity
  - `B-DIRECCION`, `I-DIRECCION`: Address
  - `B-TELEFONO`, `I-TELEFONO`: Phone number
  - `B-MAIL`, `I-MAIL`: Email
  - `B-NOMBRE`, `I-NOMBRE`: Name
  - `B-DOCUMENTO`, `I-DOCUMENTO`: Document ID
  - `B-REFERENCIA`, `I-REFERENCIA`: Reference
  - `B-DEPARTAMENTO`, `I-DEPARTAMENTO`: Department
  - `B-MUNICIPIO`, `I-MUNICIPIO`: Municipality
- Training Framework: Hugging Face `transformers`, `datasets`, `evaluate`, `seqeval`
- Training Hardware: GPU (if available) or CPU
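The full label set (17 BIO tags: `O` plus `B-`/`I-` pairs for the eight entity types) can be inspected directly from the model configuration; the exact ID-to-label ordering shown in the comment is illustrative, not guaranteed:

```python
from transformers import AutoConfig

# Load only the configuration to see the BIO labels the classification head predicts
config = AutoConfig.from_pretrained("excribe/ner_sgd_bertina_roberta")
print(config.id2label)  # e.g. {0: 'O', 1: 'B-DIRECCION', 2: 'I-DIRECCION', ...}
```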
## Intended Use

This model is designed for extracting structured information from unstructured Spanish text in administrative or document management contexts, such as extracting contact details or references from official correspondence. It is intended for non-commercial use only, as per the CC-BY-NC-3.0 license.
## Example Usage

Below is an example of how to use the model with the Hugging Face `pipeline` for NER:
```python
from transformers import pipeline

# Load the model and tokenizer
ner_pipeline = pipeline("ner", model="excribe/ner_sgd_bertina_roberta", aggregation_strategy="simple")

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]"

# Run inference
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']} | Type: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
Example Output:

```
Entity: Juan Pérez | Type: NOMBRE | Score: 0.9876
Entity: Calle Falsa 123 | Type: DIRECCION | Score: 0.9789
Entity: Bogotá | Type: MUNICIPIO | Score: 0.9654
Entity: 555-9876 | Type: TELEFONO | Score: 0.9921
Entity: [email protected] | Type: MAIL | Score: 0.9890
```
## Training Data

The model was trained on a custom dataset (`final.parquet`) containing 27,807 rows and 32 columns, likely sourced from a document management system (SGD). The dataset includes Spanish texts in the `texto_entrada` column and labeled entities in the following columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio`. Other columns, such as `radicado`, `fecha_radicacion`, and `sgd_tpr_descrip`, suggest the data relates to administrative or official documents.
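If you have access to the underlying file, it can be loaded for inspection with the `datasets` library (a sketch; the file is referenced by this card but may not be distributed with the model):

```python
from datasets import Dataset

# Load the labeled data from Parquet into a Hugging Face Dataset
ds = Dataset.from_parquet("final.parquet")  # path assumed, file not published
print(ds.num_rows)      # 27,807 rows
print(ds.column_names)  # 32 columns, including texto_entrada and the entity columns
```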
### Dataset Details

- Number of Rows: 27,807
- Relevant Columns for NER:
  - `texto_entrada`: Source text
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- Missing Values:
  - `direccion`: 82 missing
  - `telefono`: 10,073 missing
  - `mail`: 1,086 missing
  - `nombre`: 0 missing
  - `documento`: 6,407 missing
  - `referencia`: 200 missing
  - `departamento`: 0 missing
  - `municipio`: 0 missing
- Preprocessing:
  - A custom function (`convert_row_to_bio_optimized`) converted entity columns into BIO tags, handling overlaps by prioritizing earlier entities (see the sketch below).
  - The dataset was tokenized using the base model's tokenizer, with labels aligned to sub-tokens.
- Split: Training (~80%), Validation (~10%), Test (~10%).
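The exact `convert_row_to_bio_optimized` implementation is not published; the following is a minimal sketch of the same idea, assuming each entity column holds a literal substring of `texto_entrada` and that earlier (left-most) matches win on overlap:

```python
def convert_row_to_bio(text, row, entity_columns):
    """Sketch: tag whitespace tokens in `text` with BIO labels derived
    from entity columns. Assumes entity values are literal substrings."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    # Record the character span of each token
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    for col in entity_columns:
        value = row.get(col)
        if not value:
            continue  # skip missing entity values
        ent_start = text.find(str(value))
        if ent_start == -1:
            continue  # entity text not found verbatim
        ent_end = ent_start + len(str(value))
        first = True
        for i, (s, e) in enumerate(spans):
            if s < ent_end and e > ent_start:
                if tags[i] != "O":  # overlap: earlier entity keeps its tag
                    continue
                tags[i] = ("B-" if first else "I-") + col.upper()
                first = False
    return tokens, tags
```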
## Training Procedure
The model was fine-tuned using the Hugging Face `Trainer` API with the following hyperparameters (a `TrainingArguments` sketch follows the list):

- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 8 (per device)
- Weight Decay: 0.01
- Evaluation Strategy: Per epoch
- Optimizer: AdamW (default in `transformers`)
- Mixed Precision: Enabled if GPU is available
- Metrics: Precision, Recall, F1, Accuracy (via `seqeval`)
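A minimal sketch of `TrainingArguments` consistent with the hyperparameters above; the output directory and best-model metric name are assumptions, and argument names may vary slightly across `transformers` versions:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./ner_sgd_bertina_roberta",  # assumed output path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",             # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",              # keep the best checkpoint by F1 (assumed)
    fp16=torch.cuda.is_available(),          # mixed precision when a GPU is available
)
```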
The training process included:

1. Loading the dataset from Parquet and converting it to a Hugging Face `Dataset`.
2. Generating BIO tags for each text.
3. Tokenizing and aligning labels with the model's tokenizer (see the alignment sketch below).
4. Fine-tuning the model with the `Trainer` API.
5. Evaluating on the validation set and saving the best model based on F1 score.
6. Final evaluation on the test set.
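Step 3 follows the standard token-classification pattern: sub-tokens inherit the label of their word, and special tokens get `-100` so the loss ignores them. A sketch, assuming a fast tokenizer and a `label2id` mapping:

```python
def align_labels_with_tokens(tokenizer, words, word_tags, label2id):
    """Sketch: tokenize pre-split words and align BIO tags to sub-tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = []
    for word_id in encoding.word_ids():
        if word_id is None:
            labels.append(-100)  # special tokens are ignored by the loss
        else:
            # Continuation sub-tokens reuse the word's tag here;
            # some setups mask them with -100 instead.
            labels.append(label2id[word_tags[word_id]])
    encoding["labels"] = labels
    return encoding
```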
## Evaluation Metrics
The model was evaluated on the test set after 3 epochs, achieving the following metrics:
- Precision: 0.9031
- Recall: 0.9149
- F1-Score: 0.9090
- Accuracy: 0.9869
- Loss: 0.0465
- Runtime: 12.22 seconds
- Samples per Second: 227.546
- Steps per Second: 28.474
These metrics were computed using the `seqeval` library with the IOB2 scheme in strict mode, ensuring accurate entity boundary and type matching.
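For reference, strict IOB2 scoring can be reproduced with `seqeval` directly (the tag sequences here are illustrative):

```python
from seqeval.metrics import classification_report
from seqeval.scheme import IOB2

# Illustrative gold and predicted tag sequences
y_true = [["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]]
y_pred = [["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]]

# mode="strict" with the IOB2 scheme requires exact boundary and type matches
print(classification_report(y_true, y_pred, mode="strict", scheme=IOB2))
```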
## How to Use

To use the model, install the required dependencies and load it with the Hugging Face `transformers` library:

```bash
pip install transformers torch
```
Then, use the `pipeline` as shown in the example above, or load the model manually:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_bertina_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_bertina_roberta")

# Tokenize input text
inputs = tokenizer("Calle Falsa 123, Bogotá", return_tensors="pt")

# Get predictions
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
```
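The raw predictions are label IDs; to read them as tags, map them through the model config (a small follow-up to the snippet above):

```python
# Convert predicted IDs to BIO tag names for each sub-token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
tags = [model.config.id2label[p.item()] for p in predictions[0]]
for token, tag in zip(tokens, tags):
    print(token, tag)
```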
## Limitations

- The model is trained on a dataset from an administrative document management context and may not generalize well to other domains (e.g., social media or informal texts).
- Overlapping entities are resolved by prioritizing earlier matches, which may miss some valid entities.
- Missing values in entity columns (e.g., 10,073 missing `telefono` values) may reduce performance for certain entity types.
- The model is optimized for Spanish and may not perform well on other languages.
- Due to the CC-BY-NC-3.0 license, the model cannot be used for commercial purposes.
## Ethical Considerations

- Bias: The model may reflect biases in the training data, such as underrepresentation of certain entity types (e.g., `telefono` has many missing values) or overrepresentation of formal administrative language.
- Privacy: The model extracts sensitive entities like names, addresses, and phone numbers. Ensure input texts do not contain personal data unless processing is authorized.
- Non-Commercial Use: The model is licensed for non-commercial use only, as per CC-BY-NC-3.0.
## License
This model is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License (CC-BY-NC-3.0). You are free to share and adapt the model for non-commercial purposes, provided you give appropriate credit to the author.
## Contact
For issues or questions, please contact the model author via the Hugging Face repository or open an issue.
## Acknowledgments

This model was trained using the Hugging Face ecosystem (`transformers`, `datasets`, `evaluate`, `seqeval`). Thanks to the `bertin-project` team for providing the base model `bertin-roberta-base-spanish`.