Model Card for mlotsawa-ground-small

This model is a transformers machine translation model for translating Tibetan Buddhist texts into English, produced as part of the larger MLotsawa project.

Model Details

Model Description

This model is a finetuned T5 model (small size) with 60 million parameters. It is intended for translation of Tibetan Buddhist texts into English. It expects input in Uchen script. This model uses the getok tokenizer. Details on the training data and procedure can be found below.
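
As a quick illustration, the snippet below is a minimal sketch of how the getok tokenizer can be inspected: it loads the tokenizer from this repository and tokenizes one line of Uchen script (taken from the usage example later in this card).

from transformers import AutoTokenizer

# Load the getok tokenizer that ships with this repository
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')

# Tokenize a single line of Uchen script and inspect the subword pieces
print(tokenizer.tokenize('ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔'))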

This model is a ground model in that, while its performance is reasonably good, it is intended to be used as a base for further finetuning on either a larger corpus or a tradition-specific (e.g. Dzogchen) corpus for improved translation quality.

  • Developed by: billingsmoore
  • Model type: translation
  • Languages: Tibetan, English
  • License: MIT
  • Finetuned from model: google-t5/t5-small


Uses

This model may be used directly for translation, or further finetuned for improved performance.

Direct Use

This model can be used directly for translation using a transformers pipeline as in the code block below.

from transformers import pipeline

pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu') # select a device of your choice (e.g. 'cuda:0')

input_text = ["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
              "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
              "འཇིགས་པ་བཅུ་དྲུག་རྐྱེན་ངན་བར་ཆད་སོལ༔"]

output = pipe(input_text)

translation = [elt['translation_text'] for elt in output]

print(translation)

The code above will produce the following output.

['Through the power of praising and praying to you', 'Increase my lifespan merit and prosperity', 'Remove the sixteen fears and obstacles of adversity.']

Alternatively, the model can be used with a graphical user interface by following the instructions found here.

Downstream Use

The performance of this model can be improved with additional finetuning. You might finetune using a larger dataset for better general performance, or finetune on a specific set of material for improved performance on that subset (e.g. Dzogchen texts).

The model can be finetuned following the recipe below.

# Load Your Data
from datasets import load_dataset

dataset = load_dataset("<your dataset>") # replace with the path or name of your own dataset

# Load the Model and Tokenizer
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small", device_map="cuda:0") # this line assumes you want to use a single CUDA-enabled GPU
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Preprocess the Data
def translation_preprocess_function(examples):

    # Prepare translation inputs and targets (assumes the dataset has 'bo' and 'en' columns)
    translation_inputs = ['Translate Tibetan to English: ' + example for example in examples['bo']]
    translation_targets = [example for example in examples['en']]

    # Tokenize translation inputs and targets
    translation_model_inputs = tokenizer(translation_inputs, text_target=translation_targets,
                                         max_length=256, truncation=True, padding="max_length")

    return translation_model_inputs

tokenized_dataset = dataset.map(translation_preprocess_function, batched=True)

# Define Evaluation Metrics
import numpy as np
import evaluate

# Load BLEU, chrF, and TER metrics
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")
ter_metric = evaluate.load("ter")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    # Decode predictions and labels
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess text
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute BLEU score
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    bleu_score = bleu_result["score"]

    # Compute CHRF score
    chrf_result = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf_score = chrf_result["score"]

    # Compute TER score
    ter_result = ter_metric.compute(predictions=decoded_preds, references=decoded_labels)
    ter_score = ter_result["score"]

    # Return rounded results
    metrics = {
        "bleu": round(bleu_score, 4),
        "chrf": round(chrf_score, 4),
        "ter": round(ter_score, 4)
    }

    #print("Computed Metrics:", metrics)

    return metrics

# Set Up Training Arguments and Optimizer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor, EarlyStoppingCallback
from accelerate import Accelerator

accelerator = Accelerator()

optimizer = Adafactor(
    model.parameters(), 
    scale_parameter=True, 
    relative_step=False, 
    warmup_init=False, 
    lr=3e-4
)

model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir="output-dir", # select an output directory of your choice
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=100, # select your preferred number of training epochs
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    processing_class=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)

trainer.train()
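
Once training finishes, the best checkpoint (loaded automatically because load_best_model_at_end=True) can be saved for later use. This is a minimal sketch; the output path below is just an example.

# Save the finetuned model and tokenizer to a local directory
trainer.save_model("output-dir/final-model")
tokenizer.save_pretrained("output-dir/final-model")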

Bias, Risks, and Limitations

This model is intended for the translation of Buddhist texts. Because of the complexity and importance of this material, all translations should be treated as preliminary and should never be used without the input of an experienced human translator.

Additionally, this model was trained exclusively on Tibetan Buddhist material and should not be expected to perform well on other material (e.g. vernacular Tibetan).

Training Details

Training Data

The training data for this model was 861,417 translation pairs from Buddhist texts. This data was collected from publicly available material as well as material generously provided by Monlam AI and the Tibetan and Himalayan Library.

Training Procedure

The model underwent continued pretraining as well as finetuning as described below.

Pretraining

The model was pretrained on the training data for one epoch with a learning rate of 3e-4. The pretraining objective remained the original span corruption denoising task, in which random spans of input tokens are masked and the model is trained to reconstruct the missing content. This pretraining allowed the model to adapt to the new tokenizer and to learn the linguistic and structural characteristics of the Tibetan Buddhist materials.
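
The sketch below illustrates what span corruption looks like in practice. It is an illustrative reimplementation rather than the exact preprocessing code used for this model, and it assumes the tokenizer defines T5-style <extra_id_N> sentinel tokens.

from transformers import AutoTokenizer

def span_corrupt(tokens, spans):
    """Build a T5-style span-corruption example.

    tokens: list of subword tokens
    spans: sorted, non-overlapping (start, end) index pairs to mask
    Returns (input_tokens, target_tokens).
    """
    input_tokens, target_tokens = [], []
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        input_tokens += tokens[prev_end:start] + [sentinel]  # masked span replaced by a sentinel
        target_tokens += [sentinel] + tokens[start:end]      # target reconstructs the dropped tokens
        prev_end = end
    input_tokens += tokens[prev_end:]
    target_tokens.append(f"<extra_id_{len(spans)}>")         # final sentinel closes the target
    return input_tokens, target_tokens

tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
tokens = tokenizer.tokenize('བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔')

inputs, targets = span_corrupt(tokens, spans=[(2, 4), (7, 9)])  # spans chosen arbitrarily for illustration
print(inputs)
print(targets)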

Finetuning

The model was finetuned on the translation pairs for 50 epochs using the Adafactor optimizer and an initial learning rate of 3e-4.

Evaluation

The model was evaluated on test data with BLEU, chrF, and TER. The results are shown below.

BLEU chrF TER
3.54 19.89 87.58
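
A minimal sketch of how this kind of evaluation can be run on a held-out test set is shown below; the test sentences are placeholders for your own data.

import evaluate
from transformers import pipeline

pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu')

# Placeholder held-out data: Tibetan source lines and reference English translations
test_bo = ["..."]
test_en = ["..."]

# Translate the source lines and wrap each reference in a list (one reference per example)
predictions = [out['translation_text'] for out in pipe(test_bo)]
references = [[ref] for ref in test_en]

# Compute corpus-level BLEU, chrF, and TER
for name in ["sacrebleu", "chrf", "ter"]:
    score = evaluate.load(name).compute(predictions=predictions, references=references)["score"]
    print(f"{name}: {score:.2f}")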

These scores are exceptionally low; however, actual translation results are relatively good. Sample translations are shown below.

From Advice on Bending Mind Toward the Good by Khenchen Ngawang Palzang | Translated by Joseph McClellan with editorial assistance from Ninjyed N.T., 2024.

Original:
གྲུབ་བརྒྱའི་སྤྱི་མེས་པཎ་ཆེན་བི་མ་ལ། །
བསམ་བཞིན་སྤྲུལ་པའི་ཟློས་གར་ཉེར་བཟུང་བ། །
རྒྱལ་བའི་དབང་པོ་ཀློང་ཆེན་རབ་འབྱམས་པ། །
འདི་ཙམ་མ་ཡིན་ཚེ་རབས་གཏན་གྱི་སྐྱབས། །

Human Translation:
Grandsire of a hundred siddhas—great scholar, Vimalamitra,
And you who fully embraced the spectacle of intentional emanation,
Lord of conquerors, Longchen Rabjam—
You are my unfailing refuge; not just now, but in the concatenation of my lives.

Machine Translation:
Great paṇḍita Vimalamitra, forefather of hundreds of siddhas,
Manifesting in the form of a play,
Lord of the victorious ones, Longchen Rabjam,
Not just this but the constant refuge throughout all my lives,

From Protection from All Fears, A Prayer to Ārya Tārā from the Reality Ḍākinīs’ Secret Treasury (Chönyi Khandrö Sangdzö) by Sera Khandro | Translated by Adam Pearcey, 2025.

Original:
ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས༔
ལྷུན་གྲུབ་སྣང་ཆ་མ་འགགས་སྒྱུ་འཕྲུལ་གར༔
ཐུགས་རྗེ་རྒྱལ་བ་ཀུན་གྱི་ཡུམ་གཅིག་མ༔
རྗེ་བཙུན་ཨཱརྱ་ཏཱ་རེ་ཚེ་སྦྱིན་དཔལ༔
གསོལ་བ་འདེབས་སོ་རླུང་སེམས་དབང་བསྡུས་ནས༔
ཚེ་དང་བསོད་ནམས་འཕེལ་བར་མཛད་དུ་གསོལ༔

Human Translation:
Out of the primordially pure unelaborate space of luminous awareness,
As the magical manifestation of unobstructed spontaneous presence,
Arises the compassionate one, the one and only mother of all victorious ones,
Noble Lady Ārya Tārā, glorious bestower of longevity,
To you I pray! Take control of my vital winds and mind,
And increase my lifespan and merit!

Machine Translation:
Within the space of awareness—primordial purity free of elaboration—
Illusory dance of spontaneously present appearances unceasing
Only mother of all the buddhas of compassion
Noble Ārya Tārā glorious Tārā
To you I pray: bringing the vāyu-mind under control
And increase our lifespan and merit.

Model Card Authors

billingsmoore

Model Card Contact

billingsmoore[at]gmail[dot]com
