Fine-tuned mt5-base model for restoring capitalization and punctuation in Macedonian

The model is fine-tuned on a subset of the Macedonian portion of Wikipedia.

Authors:

  1. Dejan Porjazovski
  2. Ilina Jakimovska
  3. Ordan Chukaliev
  4. Nikola Stikov

This collaboration is part of the activities of the Center for Advanced Interdisciplinary Research (CAIR) at UKIM.

Usage

```shell
# sentencepiece is required by T5Tokenizer
pip install transformers sentencepiece
```
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

recap_model_name = "Macedonian-ASR/mt5-restore-capitalization-macedonian"
recap_tokenizer = T5Tokenizer.from_pretrained(recap_model_name)
recap_model = T5ForConditionalGeneration.from_pretrained(recap_model_name)
recap_model.to(device)

sentence = "скопје е главен град на македонија"
# Prepend the task prefix the model was fine-tuned with
inputs = recap_tokenizer(
    ["restore capitalization and punctuation: " + sentence],
    return_tensors="pt",
    padding=True,
).to(device)
outputs = recap_model.generate(
    **inputs, max_length=768, num_beams=5, early_stopping=True
).squeeze(0)
recap_result = recap_tokenizer.decode(outputs, skip_special_tokens=True)
print(recap_result)
# -> "Скопје е главен град на Македонија."
```
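Since generation is capped at `max_length=768` tokens, transcripts much longer than a sentence or two are best restored chunk by chunk and re-joined. A minimal sketch of such a helper (the function names, the 50-word chunk size, and the chunking strategy are illustrative choices, not part of the model's API):

```python
def chunk_words(text: str, max_words: int = 50) -> list[str]:
    """Split whitespace-tokenized text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def restore_long_text(text, tokenizer, model, device, max_words: int = 50) -> str:
    """Run the restoration model on each chunk and re-join the results."""
    restored = []
    for chunk in chunk_words(text, max_words):
        inputs = tokenizer(
            ["restore capitalization and punctuation: " + chunk],
            return_tensors="pt",
        ).to(device)
        out = model.generate(
            **inputs, max_length=768, num_beams=5, early_stopping=True
        ).squeeze(0)
        restored.append(tokenizer.decode(out, skip_special_tokens=True))
    return " ".join(restored)
```

Word-level chunking can split a sentence mid-clause, so restored punctuation near chunk boundaries may be imperfect; splitting on pause markers from the ASR system, when available, would give cleaner boundaries.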
Model size: 582M parameters (F32, safetensors)

Base model: google/mt5-base
