Excribe Classifier SGD Longformer 4099
Model Overview
Excribe/Classifier_SGD_Longformer_4099 is a fine-tuned version of the `allenai/longformer-base-4096` model, designed for text classification tasks in document management, specifically for classifying Spanish-language input documents into document type categories (`tipo_documento_codigo`). Developed by Excribe.co, this model leverages the Longformer architecture to handle long texts (up to 4096 tokens) and is optimized for GPU environments such as the NVIDIA A100.
The model was trained on a Spanish dataset (`final.parquet`) containing 8,850 samples across 109 document type classes. Class imbalance is addressed with SMOTE (Synthetic Minority Over-sampling Technique) applied to the training set to improve performance on minority classes. Fine-tuning achieved a macro F1-score of 0.4855, accuracy of 0.6096, precision of 0.5212, and recall of 0.5006 on a validation set of 1,770 samples.
Key Features
- Task: Multi-class text classification for document type identification.
- Language: Spanish.
- Input: Raw text (`texto_entrada`) from documents.
- Output: Predicted document type code (`tipo_documento_codigo`), one of 109 classes.
- Handling Long Texts: Processes only the first 4096-token chunk of the input text.
- Class Imbalance: Mitigated using SMOTE on the training set.
- Hardware Optimization: Fine-tuned with mixed precision (fp16) and gradient accumulation for A100 GPUs.
Dataset
The training dataset (`final.parquet`) consists of 8,850 Spanish text samples, each labeled with a document type code (`tipo_documento_codigo`). The dataset exhibits significant class imbalance, with class frequencies ranging from 10 to 2,363 samples per class. The dataset was split into:
- Training set: 7,080 samples (before SMOTE, expanded to 9,903 after SMOTE).
- Validation set: 1,770 samples (untouched by SMOTE for unbiased evaluation).
SMOTE was applied to the training set to oversample minority classes (those with fewer than 30 samples) to a target of 40 samples per class, generating 2,823 synthetic samples. Single-instance classes were excluded from SMOTE to avoid resampling errors and were included in the training set as-is.
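The exact preprocessing script is not reproduced in this card, but the oversampling strategy described above can be expressed with `imbalanced-learn` roughly as in the sketch below. This is a minimal sketch under stated assumptions: SMOTE interpolates numeric features, so the texts are assumed to be vectorized first (TF-IDF here, purely illustrative), and the variable and column names follow the dataset description above.

```python
from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

# Full dataset; in the actual pipeline only the 7,080-sample training split is oversampled.
df = pd.read_parquet("final.parquet")
X_text = df["texto_entrada"]
y = df["tipo_documento_codigo"]

# SMOTE needs numeric features, so vectorize the raw text first (assumption).
X = TfidfVectorizer(max_features=5000).fit_transform(X_text)

# Target 40 samples for classes with fewer than 30 samples; single-instance
# classes are left out because SMOTE needs at least two real samples per class.
counts = Counter(y)
sampling_strategy = {cls: 40 for cls, n in counts.items() if 2 <= n < 30}

smote = SMOTE(sampling_strategy=sampling_strategy, k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"Training samples after oversampling: {X_resampled.shape[0]}")
```

How the synthetic feature vectors were mapped back to Longformer token inputs is not described in this card; the sketch only illustrates the class-balancing step.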
Model Training
Base Model
The model is based on `allenai/longformer-base-4096`, a transformer model designed for long-document processing with a sparse attention mechanism, allowing efficient handling of sequences up to 4096 tokens.
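As a quick sanity check of the long-document settings inherited from the base model, the relevant configuration fields can be inspected directly. The snippet below only reads standard `LongformerConfig` attributes; the commented values are the expected defaults, not guaranteed.

```python
from transformers import LongformerConfig

config = LongformerConfig.from_pretrained("excribe/classifier_sgd_longformer_4099")

print(config.attention_window)         # per-layer sliding-window size (512 for the base model)
print(config.max_position_embeddings)  # position budget covering 4096 input tokens plus special tokens
print(config.num_labels)               # should be 109 for this fine-tuned classification head
```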
Fine-Tuning
The fine-tuning process was conducted using the Hugging Face `Trainer` API with the following configuration (a configuration sketch follows the list):
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: Effective batch size of 16 (per_device_train_batch_size=2, gradient_accumulation_steps=8)
- Optimizer: AdamW with weight decay (0.01)
- Warmup Steps: 50
- Mixed Precision: fp16 for GPU efficiency
- Evaluation Strategy: Per epoch, with the best model selected based on the macro F1-score
- SMOTE: Applied to the training set to balance classes
- Hardware: NVIDIA A100 GPU
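The original training script is not included here, but the hyperparameters above map onto the `Trainer` API roughly as follows. This is a sketch under assumptions: the directory names come from the paths listed below, and `evaluation_strategy`/`metric_for_best_model` are inferred from the per-epoch, macro-F1-based model selection described above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # checkpoints (see paths below)
    logging_dir="./logs",              # TensorBoard logs
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,     # effective batch size of 2 x 8 = 16
    weight_decay=0.01,                 # AdamW is the Trainer's default optimizer
    warmup_steps=50,
    fp16=True,                         # mixed precision on the A100
    evaluation_strategy="epoch",       # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # macro F1 (see the metrics sketch below)
)
```

These arguments would then be passed to `Trainer` together with the `LongformerForSequenceClassification` model (109 labels), the SMOTE-balanced training set, the untouched validation set, and a `compute_metrics` function such as the one sketched after the evaluation metrics below.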
The training process took approximately 159.09 minutes (9,545.32 seconds) and produced the following evaluation metrics on the validation set:
- Eval Loss: 1.5475
- Eval Accuracy: 0.6096
- Eval F1 (macro): 0.4855
- Eval Precision (macro): 0.5212
- Eval Recall (macro): 0.5006
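The macro-averaged numbers above correspond to standard scikit-learn metrics; a `compute_metrics` function of roughly the following shape (an assumption, not the original script) would produce this reporting inside the `Trainer`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    """Macro-averaged metrics over the 109 document type classes."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro", zero_division=0),
        "precision": precision_score(labels, preds, average="macro", zero_division=0),
        "recall": recall_score(labels, preds, average="macro", zero_division=0),
    }
```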
Training logs and checkpoints are saved in `./results`, with TensorBoard logs in `./logs`. The final model and tokenizer are saved in `./fine_tuned_longformer`.
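The inference example below also relies on a `label_encoder_classes.npy` file that maps class indices back to `tipo_documento_codigo` values. A minimal sketch of how such a file is typically produced with scikit-learn's `LabelEncoder` (the original script may differ):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative document type codes; in practice these come from
# the training split's "tipo_documento_codigo" column.
codes = [101, 101, 205, 307]

label_encoder = LabelEncoder().fit(codes)
labels = label_encoder.transform(codes)  # integer class IDs used for training

# Persist the index -> code mapping so inference can reverse it.
np.save("label_encoder_classes.npy", label_encoder.classes_)
```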
Usage
Installation
To use the model, install the required dependencies:
```bash
pip install transformers torch pandas scikit-learn numpy
```
Inference Example
Below is a Python script to load and use the fine-tuned model for inference:
```python
from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch
import numpy as np

# Load the model and tokenizer
model_path = "excribe/classifier_sgd_longformer_4099"
tokenizer = LongformerTokenizer.from_pretrained(model_path)
model = LongformerForSequenceClassification.from_pretrained(model_path)

# Load label encoder classes
label_encoder_classes = np.load("label_encoder_classes.npy", allow_pickle=True)
id2label = {i: int(label) for i, label in enumerate(label_encoder_classes)}

# Example text
text = "Your Spanish document text here..."

# Tokenize input (only the first 4096 tokens are used)
inputs = tokenizer(
    text,
    add_special_tokens=True,
    max_length=4096,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# Move model and inputs to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
predicted_id = torch.argmax(logits, dim=1).item()

# Map prediction to label
predicted_label = id2label[predicted_id]
print(f"Predicted document type code: {predicted_label}")
```
Notes
- The model processes only the first 4096 tokens of the input text. For longer documents, consider a chunking strategy (see the sketch after these notes) or an alternative model.
- Ensure the input text is in Spanish, as the model was trained exclusively on Spanish data.
- The label encoder classes (`label_encoder_classes.npy`) must be available to map predicted IDs to document type codes.
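For documents longer than 4096 tokens, one unofficial workaround is to classify consecutive chunks and aggregate the logits. The sketch below assumes the `tokenizer`, `model`, and `device` objects from the inference example above; averaging logits is only one possible aggregation, and accuracy on later chunks is untested because the model was fine-tuned on first chunks only.

```python
import torch

def classify_long_text(text, tokenizer, model, device, max_length=4096):
    """Average logits over consecutive chunks of a long document (workaround only)."""
    # Token ids without special tokens, split into windows that leave room for <s>/</s>.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    window = max_length - 2
    chunks = [ids[i:i + window] for i in range(0, len(ids), window)] or [[]]

    model.eval()
    all_logits = []
    with torch.no_grad():
        for chunk in chunks:
            chunk_ids = tokenizer.build_inputs_with_special_tokens(chunk)
            input_ids = torch.tensor([chunk_ids], device=device)
            attention_mask = torch.ones_like(input_ids)
            all_logits.append(model(input_ids=input_ids, attention_mask=attention_mask).logits)

    # Aggregate by averaging per-chunk logits; majority voting is an alternative.
    return torch.argmax(torch.cat(all_logits).mean(dim=0)).item()

# Example: predicted_id = classify_long_text(text, tokenizer, model, device)
```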
Limitations
- First Chunk Limitation: The model uses only the first 4096-token chunk, which may miss relevant information in longer documents.
- Class Imbalance: While SMOTE improves minority class performance, some classes (e.g., single-instance classes) may still be underrepresented.
- Macro Metrics: The reported F1-score (0.4855) is macro-averaged, weighting all classes equally; the gap between it and the accuracy (0.6096) indicates weaker performance on many of the less frequent classes.
- Hardware Requirements: Inference on CPU is possible but slower; a GPU is recommended for efficiency.
License
This model is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) license. You are free to share and adapt the model for non-commercial purposes, provided appropriate credit is given to Excribe.co.
Author
- Organization: Excribe.co
- Contact: Reach out via Hugging Face (https://huggingface.co/excribe)
Citation
If you use this model in your work, please cite:
```bibtex
@misc{excribe_classifier_sgd_longformer_4099,
  author    = {Excribe.co},
  title     = {Classifier SGD Longformer 4099: A Fine-Tuned Model for Spanish Document Type Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe/classifier_sgd_longformer_4099}
}
```
Acknowledgments
- Built upon the `allenai/longformer-base-4096` model.
- Utilizes the Hugging Face `transformers` library and `Trainer` API.
- Thanks to the open-source community for tools like `imbalanced-learn` and `scikit-learn`.