NLLB-200 (600 M) – LoRA fine-tuned for Turkmen ↔ English

Author : Merdan Durdyyev
Base model : facebook/nllb-200-distilled-600M
Tuning method : Low-Rank Adaptation (LoRA) applied only to the q_proj & v_proj projection matrices (≈ 2.4 M trainable parameters → 0.38 % of the total).

I built this checkpoint as the final project for my Deep-Learning class and as a small contribution to the Turkmen AI community, where open-source resources are scarce.


TL;DR & Quick results

Try it in the Space demo. An article with the full technical journey is available on Medium.

Model Comparison (Fine-tuned vs Original)

English to Turkmen

Metric Fine-tuned Original Difference
BLEU 8.24 8.12 +0.12
chrF 39.55 39.46 +0.09
TER (lower is better) 87.20 87.30 -0.10

Turkmen to English

Metric Fine-tuned Original Difference
BLEU 25.88 26.48 -0.60
chrF 52.71 52.91 -0.20
TER (lower is better) 67.70 69.70 -2.00

Scores were computed with sacreBLEU 2.5 (BLEU, chrF, TER) on the official test split.
A separate spreadsheet with human adequacy/fluency ratings is available in the article.
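
For reference, the numbers above can be recomputed with the sacrebleu Python package. The snippet below is a minimal sketch; the hypothesis/reference strings are placeholders and should be replaced with the model outputs and gold translations from the test split.

import sacrebleu

# Placeholder lists: substitute the system outputs and the reference
# translations from the official test split of the dataset.
hypotheses = ["I am reading the book."]
references = [["I am reading a book."]]   # one reference set, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter  = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU {bleu.score:.2f} | chrF {chrf.score:.2f} | TER {ter.score:.2f}")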


Intended use & scope

  • Good for: research prototypes, student projects, quick experiments on Turkmen text.
  • Not for: commercial MT systems (license is CC-BY-NC 4.0), critical medical/legal translation, or production workloads without further validation.

How to use

(If you want to take a look at the LoRA adapter, visit nllb-200-turkmen-english-lora-adapter.)

Using pipeline

from transformers import pipeline

# Create the translation pipeline
pipe = pipeline("translation", model="XSkills/nllb-200-turkmen-english-lora")

# Translate from English to Turkmen
# You need to specify the source and target languages using their FLORES-200 codes
text = "Hello, how are you today?"
translated = pipe(text, src_lang="eng_Latn", tgt_lang="tuk_Latn")
print(translated)
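
The same pipeline handles the opposite direction by swapping the FLORES-200 codes:

# Translate from Turkmen to English with the same pipeline
translated = pipe("Men kitaby okaýaryn.", src_lang="tuk_Latn", tgt_lang="eng_Latn")
print(translated)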

Using the tokenizer and model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "XSkills/nllb-200-turkmen-english-lora" 
tok   = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def tr(text, src="tuk_Latn", tgt="eng_Latn"):
    """Translate `text` between FLORES-200 language codes (default: Turkmen → English)."""
    tok.src_lang = src
    ids = tok(text, return_tensors="pt", truncation=True, max_length=128)
    out = model.generate(
        **ids,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt),  # force the target language
        max_length=128,
        num_beams=5,
    )
    return tok.decode(out[0], skip_special_tokens=True)

print(tr("Men kitaby okaýaryn."))
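
If you would rather keep the base model and the adapter separate, the adapter repository mentioned above can be attached with peft. This is a minimal sketch; the full adapter repo id (with the XSkills namespace) is an assumption based on the link above.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_id    = "facebook/nllb-200-distilled-600M"
adapter_id = "XSkills/nllb-200-turkmen-english-lora-adapter"  # assumed adapter repo id

tok   = AutoTokenizer.from_pretrained(base_id)
base  = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)  # injects the LoRA weights
model = model.merge_and_unload()  # optional: merge the adapter for plain inference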

Training data

  • Dataset: XSkills/turkmen_english_s500 — 619 parallel sentences (495 train / 62 val / 62 test) of news & official communiqués (see the loading snippet below).
  • Collecting even this small corpus proved challenging because publicly available Turkmen data are limited.
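
The corpus can be pulled straight from the Hub with the datasets library; the split and column names printed below depend on the dataset card, so treat them as assumptions.

from datasets import load_dataset

ds = load_dataset("XSkills/turkmen_english_s500")
print(ds)              # expect train / validation / test splits
print(ds["train"][0])  # inspect one parallel sentence pair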

Training procedure

Item Value
GPU 1 × NVIDIA A100 40 GB (Google Colab)
Wall-time ~ 3 minutes
Optimiser AdamW
Learning rate 1 × 10⁻⁵, cosine schedule, warm-up 10%
Epochs 5
Batch size 4 (train) / 8 (eval)
Weight-decay 0.005
FP16 Yes
LoRA config r=16, alpha=32, dropout=0.05, modules = ["q_proj","v_proj"]

LoRA Config

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
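
Applying this config to the base model with peft reproduces the adapter injection and lets you verify the ≈ 0.38 % trainable-parameter figure from the header; a minimal sketch:

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
peft_model = get_peft_model(base, lora_config)  # lora_config from the block above
peft_model.print_trainable_parameters()         # roughly 2.4M trainable out of ~615M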

Training Configuration

import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=FINETUNED_DIR,          # checkpoint directory, defined earlier in the notebook
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    weight_decay=0.005,
    save_total_limit=3,
    learning_rate=1e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_dir="./logs",
    logging_steps=50,
    eval_steps=50,
    save_steps=100,
    eval_accumulation_steps=2,
    report_to="tensorboard",
    warmup_ratio=0.1,
    metric_for_best_model="eval_bleu",  # use BLEU for model selection
    greater_is_better=True,
)
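
These arguments feed a standard Seq2SeqTrainer. The sketch below assumes the peft_model from the previous snippet, tokenized train/validation datasets, and a compute_metrics function that returns {"bleu": ...} (required by metric_for_best_model="eval_bleu"); none of these are shown here.

from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,      # assumed: preprocessed train split
    eval_dataset=tokenized_val,         # assumed: preprocessed validation split
    data_collator=DataCollatorForSeq2Seq(tok, model=peft_model),
    compute_metrics=compute_metrics,    # assumed: returns {"bleu": ...}
)
trainer.train()
trainer.save_model(FINETUNED_DIR)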

Evaluation

Automatic metrics are given in the TL;DR section above. A manual review of 50 random test sentences showed:

  • Adequacy: 36 / 50 translations judged “Good” or better.
  • Fluency: 38 / 50 sound natural to a native speaker. (The full spreadsheet is available on request via the contact below.)

Limitations & bias

  • Only ~500 training sentences → limited vocabulary & domain coverage.
  • May hallucinate proper nouns or numbers on longer inputs.
  • Gender and politeness nuances are not guaranteed.
  • The CC-BY-NC license forbids commercial use; respect Meta’s original terms.

How to Contribute

We welcome contributions to improve Turkmen-English translation capabilities! Here's how you can help:

Data Contributions

  • Dataset contributions: Instructions for contributing to the dataset are available in the Dataset Readme.

Code Contributions

  • Hyperparameter experiments: Try different LoRA configurations and document your results
  • Evaluation: Help with human evaluation of translation quality and fluency
  • Bug fixes: Report issues or submit fixes for the model implementation

Use Cases & Documentation

  • Example applications: Share how you're using the model for research or projects
  • Domain-specific guides: Create guides for using the model in specific domains
  • Translation examples: Share interesting or challenging translation examples

Getting Started

  1. Fork the repository
  2. Make your changes
  3. Submit a pull request with clear documentation of your contribution
  4. For data contributions, contact the maintainer directly

All contributors will be acknowledged in the model documentation. Contact [email protected] with any questions or to discuss potential contributions.


Note: This model is licensed under CC-BY-NC-4.0, so all contributions must be compatible with non-commercial use only.

Citation

@misc{durdyyev2025turkmenNLLBLoRA,
  title  = {LoRA Fine‐tuning of NLLB‐200 for Turkmen–English Translation},
  author = {Durdyyev, Merdan},
  year   = {2025},
  url    = {https://huggingface.co/XSkills/nllb-200-turkmen-english-lora}
}

Contact

If you have questions, suggestions or want to collaborate, please reach out through e-mail, LinkedIn or Telegram.

Future Work

  • Fine-tune on a larger dataset.
  • Experiment with different hyperparameters and LoRA configurations.
  • Use the sacreBLEU metric throughout evaluation.