Model Card for tibetan-to-spanish-translation-v0

This model is a machine translation model for translating Literary Tibetan to Spanish.

The model expects Tibetan text in Tibetan script as an input and outputs a Spanish translation.

This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International.

Model Details

Model Description

This model is a finetuned T5 model with 220 million parameters.

Model Sources

Uses

This model is intended to be used as the translation model in the larger MLotsawa software, but can also be used in a Jupyter notebook or Python script.

Direct Use

To use this model for translation, you can run the following code:

from transformers import pipeline

translator = pipeline('translation', 'billingsmoore/tibetan-to-spanish-translation-v0')

input_text = '<your Tibetan text>'  # replace with your Tibetan-script input

# The pipeline returns a list of dicts; the translated string
# is stored under the 'translation_text' key.
translation = translator(input_text)

print(translation[0]['translation_text'])

Downstream Use

The model can be further finetuned by adapting the finetuning notebooks found in the GitHub repository linked above.
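
For illustration, a minimal sketch of how further finetuning might begin with the transformers library (the column names 'bo' and 'es' and the sequence length are assumptions, not the exact code from the notebooks):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the released checkpoint and its custom tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained('billingsmoore/tibetan-to-spanish-translation-v0')
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/tibetan-to-spanish-translation-v0')

def preprocess(batch):
    # 'bo' (Tibetan) and 'es' (Spanish) are hypothetical column names;
    # match them to the columns of your own dataset.
    model_inputs = tokenizer(batch['bo'], truncation=True, max_length=256)
    labels = tokenizer(text_target=batch['es'], truncation=True, max_length=256)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

The tokenized dataset can then be passed to a Seq2SeqTrainer, as sketched under Training Procedure below.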

Training Details

Training Data

This model was trained on two data sources. The first is the 21.5k translation pairs found in billingsmoore/tibetan-to-spanish-translation-dataset, which was scraped from Lotsawa House and is released under the same license as the texts from which it is sourced. The second is 501 translation pairs of longer sequences, generously provided by Andres Montano.

10% of this data was set aside for evaluation.
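
A held-out split like this can be reproduced with the datasets library (a sketch; the exact seed and split procedure used for this model are not specified):

from datasets import load_dataset

# Load the translation pairs and reserve 10% for evaluation.
dataset = load_dataset('billingsmoore/tibetan-to-spanish-translation-dataset')
split = dataset['train'].train_test_split(test_size=0.1)
train_data, eval_data = split['train'], split['test']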

Training Procedure

Training proceeded in three phases:

First, a custom BytePieceEncoder tokenizer was trained to accommodate the Tibetan text and the unique vocabulary of the Buddhist corpus.
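
As an illustrative sketch only (the sentencepiece library stands in for whatever tooling the project notebooks actually use, and every setting below is an assumption), such a tokenizer could be trained like this:

import sentencepiece as spm

# Train a tokenizer with byte fallback so that unseen Tibetan
# characters never collapse to <unk>. All values are illustrative.
spm.SentencePieceTrainer.train(
    input='corpus.txt',  # hypothetical file of Tibetan and Spanish lines
    model_prefix='tibetan_spanish',
    vocab_size=32000,
    character_coverage=1.0,
    byte_fallback=True,
)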

Second, the model underwent continued pretraining on the training data. Pretraining for a T5 model consists of corrupting spans of text and having the model predict the missing span. This was performed for 2 epochs with a final loss of 0.037.
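
Concretely, span corruption replaces randomly chosen spans of the input with sentinel tokens (<extra_id_0>, <extra_id_1>, ...) and trains the model to emit the dropped spans. An English example, for readability:

Original: The quick brown fox jumps over the lazy dog.
Input:    The quick brown <extra_id_0> the lazy dog.
Target:   <extra_id_0> fox jumps over <extra_id_1>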

Third, the model was finetuned on the training data for 18 epochs, after which training was stopped by an early stopping callback.
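
A sketch of that finetuning setup with the Trainer API, reusing model, tokenizer, preprocess, train_data, and eval_data from the sketches above (all hyperparameters other than those stated in this card are assumptions):

from transformers import (
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Tokenize with the preprocess function from the Downstream Use sketch.
train_tok = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)
eval_tok = eval_data.map(preprocess, batched=True, remove_columns=eval_data.column_names)

args = Seq2SeqTrainingArguments(
    output_dir='finetuned-model',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model='eval_loss',
    num_train_epochs=50,               # upper bound; early stopping halts sooner
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()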

Training Hyperparameters

  • This model was trained using the Adafactor optimizer with a learning rate of 3e-4.
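
That optimizer can be constructed as below (a sketch; the 3e-4 learning rate comes from this card, while the remaining flags follow the common fixed-learning-rate Adafactor recipe for T5 and are assumptions):

from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),     # model from the sketches above
    lr=3e-4,                # learning rate stated above
    scale_parameter=False,  # disable Adafactor's internal schedule
    relative_step=False,
    warmup_init=False,
)

To use it with the Trainer sketched above, pass optimizers=(optimizer, None) when constructing the trainer.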

Evaluation

The model was evaluated with BLEU, chrF, and TER on the evaluation data. The results were:

  • BLEU: 75.5765
  • chrF: 80.1954
  • TER: 28.3847
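
These metrics can be computed with the evaluate library (a sketch; predictions and references below are hypothetical placeholders for the model outputs and gold translations):

import evaluate

predictions = ['...']   # hypothetical model outputs, one string per example
references = [['...']]  # one list of reference translations per example

for name in ('sacrebleu', 'chrf', 'ter'):
    metric = evaluate.load(name)
    result = metric.compute(predictions=predictions, references=references)
    print(name, result['score'])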

Please note that training and evaluation were performed on an extremely small dataset, and these metrics should not be taken as representative of performance in ordinary usage.
