Model Card for mlotsawa-ground-small
This model is a transformers machine translation model for translating Tibetan Buddhist texts into English, produced as part of the larger MLotsawa project.
Model Details
Model Description
This model is a finetuned T5 model (small size) with 60 million parameters. It is intended for translation of Tibetan Buddhist texts into English. It expects input in Uchen script. This model uses the getok tokenizer. Details on the training data and procedure can be found below.
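As a quick illustration of the expected input, the snippet below loads the getok tokenizer bundled with this checkpoint and tokenizes a single line of Uchen-script text (taken from the usage example further down). It is a minimal sketch that assumes only that the transformers library is installed.
from transformers import AutoTokenizer
# Load the getok tokenizer that ships with this checkpoint
tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")
# Input is expected in Uchen script
tokens = tokenizer.tokenize("ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔")
print(tokens)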
This model is a ground model: while its performance is reasonably good, it is intended to serve as a base for further finetuning, either on a larger corpus or on a tradition-specific corpus (e.g. Dzogchen), for improved translation quality.
- Developed by: billingsmoore
- Model type: translation
- Languages: Tibetan, English
- License: MIT
- Finetuned from model: google-t5/t5-small
Model Sources
- Repository: MLotsawa on GitHub
Uses
This model may be used directly for translation, or further finetuned for improved performance.
Direct Use
This model can be used directly for translation using a transformers pipeline as in the code block below.
from transformers import pipeline
pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu') # select a device of your choice (e.g. 'cuda:0')
input = ["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
"བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
"འཇིགས་པ་བཅུ་དྲུག་རྐྱེན་ངན་བར་ཆད་སོལ༔"]
output = pipe(input)
translation = [elt['translation_text'] for elt in output]
print(translation)
The code above will produce the following output.
['Through the power of praising and praying to you', 'Increase my lifespan merit and prosperity', 'Remove the sixteen fears and obstacles of adversity.']
Alternatively, the model can be used with a graphical user interface by following the instructions found here.
Downstream Use
The performance of this model can be improved with additional finetuning. You might finetune on a larger dataset for better general performance, or on a specific body of material (e.g. Dzogchen texts) for improved performance on that subset.
The model can be finetuned following the recipe below.
# Load Your Data
from datasets import load_dataset
dataset = load_dataset("<your dataset>") # replace with the name or path of your dataset
# Load the Model and Tokenizer
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small", device_map="cuda:0") # this line assumes you want to use a single CUDA-enabled GPU
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Preprocess the Data
def translation_preprocess_function(examples):
    # Prepare translation inputs and targets
    translation_inputs = ['Translate Tibetan to English: ' + example for example in examples['bo']]
    translation_targets = [example for example in examples['en']]

    # Tokenize translation inputs and targets
    translation_model_inputs = tokenizer(translation_inputs, text_target=translation_targets,
                                         max_length=256, truncation=True, padding="max_length")

    return translation_model_inputs
tokenized_dataset = dataset.map(translation_preprocess_function, batched=True)
# Define Evaluation Metrics
import numpy as np
import evaluate
# Load BLEU and CHRF metrics
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")
ter_metric = evaluate.load("ter")
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Decode predictions and labels
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess text
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute BLEU score
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    bleu_score = bleu_result["score"]

    # Compute CHRF score
    chrf_result = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf_score = chrf_result["score"]

    # Compute TER score
    ter_result = ter_metric.compute(predictions=decoded_preds, references=decoded_labels)
    ter_score = ter_result["score"]

    # Return rounded results
    metrics = {
        "bleu": round(bleu_score, 4),
        "chrf": round(chrf_score, 4),
        "ter": round(ter_score, 4)
    }

    return metrics
# Set Up Training Arguments and Optimizer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor, EarlyStoppingCallback
from accelerate import Accelerator
accelerator = Accelerator()
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)
model, optimizer = accelerator.prepare(model, optimizer)
training_args = Seq2SeqTrainingArguments(
    output_dir="output-dir", # select an output directory of your choice
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=100, # select your preferred number of training epochs
    load_best_model_at_end=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    processing_class=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)
trainer.train()
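After training completes, the best checkpoint (selected via load_best_model_at_end) is loaded back into the trainer. Below is a minimal sketch of saving that checkpoint and reusing it with the same pipeline API shown in the Direct Use section; the directory output-dir/final is a hypothetical choice.
# Save the finetuned model and tokenizer to a hypothetical directory
trainer.save_model("output-dir/final")
tokenizer.save_pretrained("output-dir/final")

# Reload the finetuned checkpoint with the translation pipeline
from transformers import pipeline
pipe = pipeline('translation', 'output-dir/final', device='cpu')
print(pipe(["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔"]))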
Bias, Risks, and Limitations
This model is intended for the translation of Buddhist texts. Because of the complexity and importance of this material, all translations should be treated as preliminary and should never be used without the input of an experienced human translator.
Additionally, this model was trained exclusively on Tibetan Buddhist material and should not be expected to perform well on other material (e.g. vernacular Tibetan).
Training Details
Training Data
The training data for this model was 861,417 translation pairs from Buddhist texts. This data was collected from publicly available material as well as material generously provided by Monlam AI and the Tibetan and Himalayan Library.
Training Procedure
The model underwent continued pretraining as well as finetuning as described below.
Pretraining
The model was pretrained on the training data for one epoch with a learning rate of 3e-4. The pretraining objective remained the original span-corruption denoising task, in which random spans of input tokens are masked and the model is trained to reconstruct the missing content. This pretraining allowed the model to adapt to the new tokenizer and to learn the linguistic and structural characteristics of the Tibetan Buddhist materials.
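To make the objective concrete, the example below is a hand-constructed illustration of span corruption, shown with the standard T5 <extra_id_N> sentinel names; the masked spans are arbitrary and the input line is taken from the usage example above. This is a sketch of the data format, not the actual pretraining script.
# Illustrative only: random spans of the input are replaced with sentinel tokens,
# and the target lists each masked span after its corresponding sentinel.
original = "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔"
corrupted_input = "བདག་གི་<extra_id_0>དཔལ་འབྱོར་<extra_id_1>དང་༔"
denoising_target = "<extra_id_0>ཚེ་བསོད་<extra_id_1>རྒྱས་པ་<extra_id_2>"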
Finetuning
The model was finetuned on the translation pairs for 50 epochs using the Adafactor optimizer and an initial learning rate of 3e-4.
Evaluation
The model was evaluated on test data with BLEU, chrF, and TER. The results are shown below.
| BLEU | chrF | TER |
|---|---|---|
| 3.54 | 19.89 | 87.58 |
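For reference, corpus-level scores of this kind can be computed with the same evaluate metrics used in the finetuning recipe above. The snippet below is a hedged sketch; the predictions and references lists are placeholders to be replaced with your own model outputs and reference translations.
import evaluate

# Placeholder lists: substitute model outputs and reference translations
predictions = ["<model translation 1>", "<model translation 2>"]
references = [["<reference translation 1>"], ["<reference translation 2>"]]

bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=references)["score"]
chrf = evaluate.load("chrf").compute(predictions=predictions, references=references)["score"]
ter = evaluate.load("ter").compute(predictions=predictions, references=references)["score"]

print({"bleu": round(bleu, 2), "chrf": round(chrf, 2), "ter": round(ter, 2)})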
These scores are exceptionally low; the actual translation results, however, are relatively good. Sample translations are shown below.
From Advice on Bending Mind Toward the Good by Khenchen Ngawang Palzang | Translated by Joseph McClellan with editorial assistance from Ninjyed N.T., 2024.
| Original | Human Translation | Machine Translation |
|---|---|---|
| གྲུབ་བརྒྱའི་སྤྱི་མེས་པཎ་ཆེན་བི་མ་ལ། ། བསམ་བཞིན་སྤྲུལ་པའི་ཟློས་གར་ཉེར་བཟུང་བ། ། རྒྱལ་བའི་དབང་པོ་ཀློང་ཆེན་རབ་འབྱམས་པ། ། འདི་ཙམ་མ་ཡིན་ཚེ་རབས་གཏན་གྱི་སྐྱབས། ། | Grandsire of a hundred siddhas—great scholar, Vimalamitra, And you who fully embraced the spectacle of intentional emanation, Lord of conquerors, Longchen Rabjam— You are my unfailing refuge; not just now, but in the concatenation of my lives. | Great paṇḍita Vimalamitra, forefather of hundreds of siddhas, Manifesting in the form of a play, Lord of the victorious ones, Longchen Rabjam, Not just this but the constant refuge throughout all my lives, |
From Protection from All Fears, A Prayer to Ārya Tārā from the Reality Ḍākinīs’ Secret Treasury (Chönyi Khandrö Sangdzö) by Sera Khandro | Translated by Adam Pearcey, 2025.
| Original | Human Translation | Machine Translation |
|---|---|---|
| ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས༔ ལྷུན་གྲུབ་སྣང་ཆ་མ་འགགས་སྒྱུ་འཕྲུལ་གར༔ ཐུགས་རྗེ་རྒྱལ་བ་ཀུན་གྱི་ཡུམ་གཅིག་མ༔ རྗེ་བཙུན་ཨཱརྱ་ཏཱ་རེ་ཚེ་སྦྱིན་དཔལ༔ གསོལ་བ་འདེབས་སོ་རླུང་སེམས་དབང་བསྡུས་ནས༔ ཚེ་དང་བསོད་ནམས་འཕེལ་བར་མཛད་དུ་གསོལ༔ | Out of the primordially pure unelaborate space of luminous awareness, As the magical manifestation of unobstructed spontaneous presence, Arises the compassionate one, the one and only mother of all victorious ones, Noble Lady Ārya Tārā, glorious bestower of longevity, To you I pray! Take control of my vital winds and mind, And increase my lifespan and merit! | Within the space of awareness—primordial purity free of elaboration— Illusory dance of spontaneously present appearances unceasing Only mother of all the buddhas of compassion Noble Ārya Tārā glorious Tārā To you I pray: bringing the vāyu-mind under control And increase our lifespan and merit. |
Model Card Authors
billingsmoore
Model Card Contact
billingsmoore[at]gmail[dot]com