Excribe Classifier SGD Longformer 4096

Model Overview

Excribe/Classifier_SGD_Longformer_4099 is a fine-tuned version of the allenai/longformer-base-4096 model, designed for text classification in document management: it classifies Spanish-language input documents into document type categories (tipo_documento_codigo). Developed by Excribe.co, the model leverages the Longformer architecture to handle long texts (up to 4096 tokens) and is optimized for GPU environments such as the NVIDIA A100.

The model was trained on a Spanish dataset (final.parquet) containing 8,850 samples across 109 document type classes. Class imbalance is addressed with SMOTE (Synthetic Minority Over-sampling Technique) applied to the training set to improve performance on minority classes. Fine-tuning achieved a macro F1-score of 0.4855, accuracy of 0.6096, macro precision of 0.5212, and macro recall of 0.5006 on a validation set of 1,770 samples.

Key Features

  • Task: Multi-class text classification for document type identification.
  • Language: Spanish.
  • Input: Raw text (texto_entrada) from documents.
  • Output: Predicted document type code (tipo_documento_codigo) from 109 classes.
  • Handling Long Texts: Processes the first 4096-token chunk of input text.
  • Class Imbalance: Mitigated using SMOTE on the training set.
  • Hardware Optimization: Fine-tuned with mixed precision (fp16) and gradient accumulation for A100 GPUs.

Dataset

The training dataset (final.parquet) consists of 8,850 Spanish text samples, each labeled with a document type code (tipo_documento_codigo). The dataset exhibits significant class imbalance, with class frequencies ranging from 10 to 2,363 samples per class. The dataset was split into:

  • Training set: 7,080 samples (before SMOTE, expanded to 9,903 after SMOTE).
  • Validation set: 1,770 samples (untouched by SMOTE for unbiased evaluation).

SMOTE was applied to the training set to oversample minority classes (those with fewer than 30 samples) to a target of 40 samples per class, generating 2,823 synthetic samples. Single-instance classes were excluded from SMOTE to avoid resampling errors and were included in the training set as-is.
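
The oversampling step can be reproduced with imbalanced-learn. The sketch below only illustrates the sampling strategy described above, not the authors' exact pipeline: SMOTE interpolates in feature space, so it assumes the texts have already been converted to numeric vectors (TF-IDF here), and the variable names train_texts and train_labels are hypothetical.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# train_texts / train_labels: the training split of final.parquet (hypothetical names)
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)

# Oversample classes with fewer than 30 samples up to 40 each; single-instance
# classes are skipped because SMOTE needs at least one same-class neighbor.
counts = Counter(train_labels)
sampling_strategy = {label: 40 for label, n in counts.items() if 1 < n < 30}

smote = SMOTE(sampling_strategy=sampling_strategy, k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, train_labels)
print(f"Training samples after SMOTE: {X_resampled.shape[0]}")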

Model Training

Base Model

The model is based on allenai/longformer-base-4096, a transformer model designed for long-document processing with a sparse attention mechanism, allowing efficient handling of sequences up to 4096 tokens.

Fine-Tuning

The fine-tuning process was conducted using the Hugging Face Trainer API with the following configuration:

  • Epochs: 3
  • Learning Rate: 2e-5
  • Batch Size: Effective batch size of 16 (per_device_train_batch_size=2, gradient_accumulation_steps=8)
  • Optimizer: AdamW with weight decay (0.01)
  • Warmup Steps: 50
  • Mixed Precision: fp16 for GPU efficiency
  • Evaluation Strategy: Per epoch, with the best model selected based on the macro F1-score
  • SMOTE: Applied to the training set to balance classes
  • Hardware: NVIDIA A100 GPU
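
These settings map onto the Hugging Face Trainer roughly as in the sketch below. This is a hedged reconstruction from the list above, not the authors' training script; train_dataset, eval_dataset, and compute_metrics are assumed to be defined elsewhere (a compute_metrics sketch follows the evaluation metrics below).

from transformers import (
    LongformerForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=109,  # one label per tipo_documento_codigo class
)

training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    weight_decay=0.01,
    warmup_steps=50,
    fp16=True,
    eval_strategy="epoch",           # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # macro F1 from compute_metrics
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # assumed: tokenized, SMOTE-balanced training split
    eval_dataset=eval_dataset,       # assumed: untouched validation split
    compute_metrics=compute_metrics, # assumed: returns accuracy and macro F1/precision/recall
)
trainer.train()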

The training process took approximately 159.09 minutes (9,545.32 seconds) and produced the following evaluation metrics on the validation set:

  • Eval Loss: 1.5475
  • Eval Accuracy: 0.6096
  • Eval F1 (macro): 0.4855
  • Eval Precision (macro): 0.5212
  • Eval Recall (macro): 0.5006
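
All four quality figures are macro-averaged. They are the kind of values a Trainer compute_metrics callback along the following lines would produce; this is an illustrative sketch, not the exact evaluation code used.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro averaging weights every class equally, regardless of its frequency.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }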

Training logs and checkpoints are saved in ./results, with TensorBoard logs in ./logs. The final model and tokenizer are saved in ./fine_tuned_longformer.

Usage

Installation

To use the model, install the required dependencies:

pip install transformers torch pandas scikit-learn numpy

Inference Example

Below is a Python script to load and use the fine-tuned model for inference:

from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch
import numpy as np

# Load the model and tokenizer
model_path = "excribe/classifier_sgd_longformer_4099"
tokenizer = LongformerTokenizer.from_pretrained(model_path)
model = LongformerForSequenceClassification.from_pretrained(model_path)

# Load label encoder classes
label_encoder_classes = np.load("label_encoder_classes.npy", allow_pickle=True)
id2label = {i: int(label) for i, label in enumerate(label_encoder_classes)}

# Example text
text = "Your Spanish document text here..."

# Tokenize input
inputs = tokenizer(
    text,
    add_special_tokens=True,
    max_length=4096,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# Move inputs to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=1).item()

# Map prediction to label
predicted_label = id2label[predicted_id]
print(f"Predicted document type code: {predicted_label}")

Notes

  • The model processes only the first 4096 tokens of the input text. For longer documents, consider chunking strategies (see the sketch after this list) or alternative models.
  • Ensure the input text is in Spanish, as the model was trained exclusively on Spanish data.
  • The label encoder classes (label_encoder_classes.npy) must be available to map predicted IDs to document type codes.
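
A simple workaround for documents longer than 4096 tokens is to classify overlapping chunks and aggregate the logits. The sketch below averages per-chunk logits; it reuses model, device, id2label, and text from the inference example, requires a fast tokenizer so return_overflowing_tokens yields one row per chunk, and is not something the model was trained or evaluated with, so treat the aggregated prediction with caution.

from transformers import LongformerTokenizerFast

# Fast tokenizer: return_overflowing_tokens produces one input row per 4096-token chunk.
fast_tokenizer = LongformerTokenizerFast.from_pretrained(model_path)

enc = fast_tokenizer(
    text,
    max_length=4096,
    truncation=True,
    padding="max_length",
    stride=256,                       # token overlap between consecutive chunks
    return_overflowing_tokens=True,
    return_tensors="pt",
)
enc.pop("overflow_to_sample_mapping")  # bookkeeping key, not a model input
enc = {k: v.to(device) for k, v in enc.items()}

with torch.no_grad():
    # All chunks in one batch; loop over rows instead if memory is tight.
    chunk_logits = model(**enc).logits  # shape: (num_chunks, num_labels)

# Naive aggregation: average logits across chunks, then take the argmax.
avg_logits = chunk_logits.mean(dim=0)
print(f"Chunk-averaged prediction: {id2label[int(torch.argmax(avg_logits))]}")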

Limitations

  • First Chunk Limitation: The model uses only the first 4096-token chunk, which may miss relevant information in longer documents.
  • Class Imbalance: While SMOTE improves minority class performance, some classes (e.g., single-instance classes) may still be underrepresented.
  • Macro Metrics: The reported F1-score (0.4855) is macro-averaged, treating all 109 classes equally. A single aggregate number can hide large per-class differences, so inspect per-class metrics if particular document types matter for your use case.
  • Hardware Requirements: Inference on CPU is possible but slower; a GPU is recommended for efficiency.

License

This model is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) license. You are free to share and adapt the model for non-commercial purposes, provided appropriate credit is given to Excribe.co.

Author

Excribe.co

Citation

If you use this model in your work, please cite:

@misc{excribe_classifier_sgd_longformer_4099,
  author = {Excribe.co},
  title = {Classifier SGD Longformer 4099: A Fine-Tuned Model for Spanish Document Type Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/excribe/classifier_sgd_longformer_4099}
}

Acknowledgments

  • Built upon the allenai/longformer-base-4096 model.
  • Utilizes the Hugging Face transformers library and Trainer API.
  • Thanks to the open-source community for tools like imbalanced-learn and scikit-learn.