---
language:
- ace
- acm
- acq
- aeb
- af
- ajp
- ak
- als
- am
- apc
- ar
- ars
- ary
- arz
- as
- ast
- awa
- ayr
- azb
- azj
- ba
- bm
- ban
- be
- bem
- bn
- bho
- bjn
- bo
- bs
- bug
- bg
- ca
- ceb
- cs
- cjk
- ckb
- crh
- cy
- da
- de
- dik
- dyu
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fj
- fi
- fon
- fr
- fur
- fuv
- gaz
- gd
- ga
- gl
- gn
- gu
- ht
- ha
- he
- hi
- hne
- hr
- hu
- hy
- ig
- ilo
- id
- is
- it
- jv
- ja
- kab
- kac
- kam
- kn
- ks
- ka
- kk
- kbd
- kbp
- kea
- khk
- km
- ki
- rw
- ky
- kmb
- kmr
- knc
- kg
- ko
- lo
- lij
- li
- ln
- lt
- lmo
- ltg
- lb
- lua
- lg
- luo
- lus
- lvs
- mag
- mai
- ml
- mar
- min
- mk
- mt
- mni
- mos
- mi
- my
- nl
- nn
- nb
- npi
- nso
- nus
- ny
- oc
- ory
- pag
- pa
- pap
- pbt
- pes
- plt
- pl
- pt
- prs
- quy
- ro
- rn
- ru
- sg
- sa
- sat
- scn
- shn
- si
- sk
- sl
- sm
- sn
- sd
- so
- st
- es
- sc
- sr
- ss
- su
- sv
- swh
- szl
- ta
- taq
- tt
- te
- tg
- tl
- th
- ti
- tpi
- tn
- ts
- tk
- tum
- tr
- tw
- tzm
- ug
- uk
- umb
- ur
- uzn
- vec
- vi
- war
- wo
- xh
- ydd
- yo
- yue
- zh
- zsm
- zu
language_details: >-
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva,
kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbd_Cyrl, kbp_Latn, kea_Latn,
khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang,
kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn,
lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn,
mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn,
mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr,
nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn,
oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn,
por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl,
sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn,
slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn,
als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn,
szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai,
tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn,
tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn,
urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn,
ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
tags:
- nllb
- translation
license: cc-by-nc-4.0
datasets:
- flores-200
metrics:
- bleu
- spbleu
- chrf++
inference: false
base_model:
- facebook/nllb-200-1.3B
---

# NLLB-200 1.3B Pre-trained for Kabardian Translation

## Model Details
- Model Name: nllb-200-1.3b-kbd-pretrain
- Base Model: NLLB-200 1.3B
- Model Type: Translation
- Language(s): Kabardian and others from NLLB-200 (200 languages)
- License: CC-BY-NC-4.0 (inherited from the base model)
- Developer: panagoa (fine-tuning), Meta AI (base model)
- Last Updated: January 23, 2025
- Paper: NLLB Team et al., "No Language Left Behind: Scaling Human-Centered Machine Translation", arXiv, 2022
## Model Description
This model is a pre-trained adaptation of NLLB-200 (No Language Left Behind), Meta AI's 1.3B-parameter translation model covering 200 languages, with additional pre-training specifically aimed at improving translation quality for the Kabardian language (kbd).
## Intended Uses
- Machine translation to and from Kabardian
- NLP applications involving the Kabardian language
- Research on low-resource language translation
- Cultural and linguistic preservation efforts for the Kabardian language
## Training Data
This model builds on the original NLLB-200 model, which was trained on parallel multilingual data from a variety of sources and on monolingual data constructed from Common Crawl. The additional pre-training for Kabardian likely involved specialized Kabardian language resources.
The original NLLB-200 model was evaluated on the FLORES-200 dataset.
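For reference, below is a minimal sketch of scoring translations with the metrics listed in this card (BLEU, spBLEU, chrF++) via sacrebleu. The placeholder sentences are assumptions, not actual evaluation data, and spBLEU via the `flores200` SPM tokenizer assumes a recent sacrebleu release (≥ 2.2):

```python
import sacrebleu

# Model outputs and FLORES-200 references (placeholder strings here).
hypotheses = ["a model translation"]
references = [["a reference translation"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 -> chrF++

print(f"BLEU {bleu.score:.2f} | spBLEU {spbleu.score:.2f} | chrF++ {chrf_pp.score:.2f}")
```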
## Performance and Limitations
- As a pre-trained model, this version is intended to be further fine-tuned for specific translation tasks (a fine-tuning sketch follows this list)
- Inherits limitations from the base NLLB-200 model:
  - Not intended for production deployment (research model)
  - Not optimized for domain-specific texts (medical, legal, etc.)
  - Not designed for document translation; optimized for single sentences (see the sentence-splitting sketch after this list)
  - Trained only on input sequences not exceeding 512 tokens
  - Translations cannot be used as certified translations
- May have additional limitations when handling specific cultural contexts, dialectal variations, or specialized terminology in Kabardian
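Since this checkpoint is meant to be fine-tuned further, here is a minimal sketch using the Hugging Face `Seq2SeqTrainer`. The tiny in-memory dataset, the eng_Latn→kbd_Cyrl pair, and the hyperparameters are illustrative assumptions, not a documented training recipe for this model:

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="eng_Latn", tgt_lang="kbd_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder parallel data; a real run needs a proper parallel corpus.
train_data = Dataset.from_dict({
    "src": ["Hello, how are you?"],
    "tgt": ["..."],  # reference Kabardian translation goes here
})

def preprocess(batch):
    # text_target tokenizes the references using the tokenizer's tgt_lang code
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=512)

tokenized = train_data.map(preprocess, batched=True,
                           remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-kbd-finetuned",  # hypothetical output path
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=1,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()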
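Because the model expects single sentences of at most 512 tokens, longer passages are best split before translation. A minimal sketch, assuming the model and tokenizer loaded as in the Usage Example below; the naive regex splitter and the `translate_long_text` helper are hypothetical, not part of this model's tooling:

```python
import re

def translate_long_text(text, model, tokenizer, tgt_lang="kbd_Cyrl"):
    """Translate a multi-sentence text one sentence at a time."""
    # Naive split on sentence-final punctuation; a real pipeline should
    # use a proper segmenter for the source language.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    translations = []
    for sentence in sentences:
        # Truncate defensively to stay within the 512-token training limit.
        inputs = tokenizer(sentence, return_tensors="pt",
                           truncation=True, max_length=512)
        tokens = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=512,
        )
        translations.append(
            tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
    return " ".join(translations)
```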
## Usage Example
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"

# NLLB tokenizers take the source language as an argument,
# not as a prefix inside the input text.
src_lang = "eng_Latn"  # English
tgt_lang = "kbd_Cyrl"  # Kabardian in Cyrillic script
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example: translating to Kabardian
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

translated_tokens = model.generate(
    **inputs,
    # Force the target language code as the first generated token.
    # convert_tokens_to_ids works across transformers versions, unlike
    # the deprecated lang_code_to_id mapping.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
    max_length=30,
)
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
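The opposite direction works the same way: set the source language on the tokenizer and force English as the target. In this sketch the Kabardian input sentence is a placeholder:

```python
# Kabardian -> English, reusing model_name and model from above
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="kbd_Cyrl")
inputs = tokenizer("...", return_tensors="pt")  # placeholder Kabardian sentence
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=30,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```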
## Ethical Considerations
As noted for the base NLLB-200 model:
- This work prioritizes human users and aims to minimize risks transferred to them
- Translation access for low-resource languages can improve education and information access but could potentially make groups with lower digital literacy vulnerable to misinformation
- Despite extensive data cleaning, personally identifiable information may not be entirely eliminated from training data
- Mistranslations could have adverse impacts on those relying on translations for important decisions
## Caveats and Recommendations
- The base model was primarily tested on the Wikimedia domain with limited investigation on other domains
- Supported languages may have variations that the model does not capture
- Users should make appropriate assessments for their specific use cases
- This pre-trained model is part of a series of models specifically focused on Kabardian language translation
- For production use cases, consider the fully fine-tuned versions (v0.1, v0.2) rather than this pre-trained version
## Additional Information
This model is part of a collection of NLLB models fine-tuned for Kabardian language translation developed by panagoa.