---
language:
  - ace
  - acm
  - acq
  - aeb
  - af
  - ajp
  - ak
  - als
  - am
  - apc
  - ar
  - ars
  - ary
  - arz
  - as
  - ast
  - awa
  - ayr
  - azb
  - azj
  - ba
  - bm
  - ban
  - be
  - bem
  - bn
  - bho
  - bjn
  - bo
  - bs
  - bug
  - bg
  - ca
  - ceb
  - cs
  - cjk
  - ckb
  - crh
  - cy
  - da
  - de
  - dik
  - dyu
  - dz
  - el
  - en
  - eo
  - et
  - eu
  - ee
  - fo
  - fj
  - fi
  - fon
  - fr
  - fur
  - fuv
  - gaz
  - gd
  - ga
  - gl
  - gn
  - gu
  - ht
  - ha
  - he
  - hi
  - hne
  - hr
  - hu
  - hy
  - ig
  - ilo
  - id
  - is
  - it
  - jv
  - ja
  - kab
  - kac
  - kam
  - kn
  - ks
  - ka
  - kk
  - kbd
  - kbp
  - kea
  - khk
  - km
  - ki
  - rw
  - ky
  - kmb
  - kmr
  - knc
  - kg
  - ko
  - lo
  - lij
  - li
  - ln
  - lt
  - lmo
  - ltg
  - lb
  - lua
  - lg
  - luo
  - lus
  - lvs
  - mag
  - mai
  - ml
  - mar
  - min
  - mk
  - mt
  - mni
  - mos
  - mi
  - my
  - nl
  - nn
  - nb
  - npi
  - nso
  - nus
  - ny
  - oc
  - ory
  - pag
  - pa
  - pap
  - pbt
  - pes
  - plt
  - pl
  - pt
  - prs
  - quy
  - ro
  - rn
  - ru
  - sg
  - sa
  - sat
  - scn
  - shn
  - si
  - sk
  - sl
  - sm
  - sn
  - sd
  - so
  - st
  - es
  - sc
  - sr
  - ss
  - su
  - sv
  - swh
  - szl
  - ta
  - taq
  - tt
  - te
  - tg
  - tl
  - th
  - ti
  - tpi
  - tn
  - ts
  - tk
  - tum
  - tr
  - tw
  - tzm
  - ug
  - uk
  - umb
  - ur
  - uzn
  - vec
  - vi
  - war
  - wo
  - xh
  - ydd
  - yo
  - yue
  - zh
  - zsm
  - zu
language_details: >-
  ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
  aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
  asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
  bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
  bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
  cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
  dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
  ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
  fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
  hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
  hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
  jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva,
  kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbd_Cyrl, kbp_Latn, kea_Latn,
  khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang,
  kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn,
  lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn,
  mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn,
  mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr,
  nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn,
  oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn,
  por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl,
  sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn,
  slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn,
  als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn,
  szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai,
  tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn,
  tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn,
  urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn,
  ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
tags:
  - nllb
  - translation
license: cc-by-nc-4.0
datasets:
  - flores-200
metrics:
  - bleu
  - spbleu
  - chrf++
inference: false
base_model:
  - facebook/nllb-200-1.3B
---

# NLLB-200 1.3B Pre-trained for Kabardian Translation

## Model Details

### Model Description

This model is a pre-trained adaptation of the NLLB-200 (No Language Left Behind) 1.3B-parameter model, optimized to improve translation quality for the Kabardian language (kbd). The base NLLB-200 model was developed by Meta AI and supports 200 languages; this variant is adjusted specifically for Kabardian translation tasks.
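As a quick sanity check (an illustrative sketch, not part of the original card), you can confirm that the tokenizer shipped with this model resolves the Kabardian language code `kbd_Cyrl` used in the usage example below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("panagoa/nllb-200-1.3b-kbd-pretrain")

# kbd_Cyrl should map to a real token id rather than the unknown-token id.
kbd_id = tokenizer.convert_tokens_to_ids("kbd_Cyrl")
print("kbd_Cyrl token id:", kbd_id)
print("recognized:", kbd_id != tokenizer.unk_token_id)
```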

## Intended Uses

- Machine translation to and from Kabardian
- NLP applications involving the Kabardian language
- Research on low-resource language translation
- Cultural and linguistic preservation efforts for the Kabardian language

## Training Data

This model was pre-trained on top of the original NLLB-200 model, which used parallel multilingual data from a variety of sources as well as monolingual data constructed from Common Crawl. The additional pre-training for Kabardian likely involved specialized Kabardian language resources.

The original NLLB-200 model was evaluated using the Flores-200 dataset.
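The metrics listed in this card's metadata (BLEU, spBLEU, chrF++) can be computed with the sacrebleu library. A minimal sketch, assuming you already have the model's translations and reference translations as plain-text lists (the placeholder strings below are not real data):

```python
import sacrebleu

hypotheses = ["<model translation 1>", "<model translation 2>"]
references = [["<reference 1>", "<reference 2>"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)                           # standard BLEU
spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")   # spBLEU (Flores-200 SentencePiece tokenization)
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)           # chrF++

print(f"BLEU:   {bleu.score:.2f}")
print(f"spBLEU: {spbleu.score:.2f}")
print(f"chrF++: {chrfpp.score:.2f}")
```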

## Performance and Limitations

- As a pre-trained model, this version is intended to be further fine-tuned for specific translation tasks
- Inherits limitations from the base NLLB-200 model:
  - Not intended for production deployment (research model)
  - Not optimized for domain-specific texts (medical, legal, etc.)
  - Not designed for document translation (optimized for single sentences; see the sentence-by-sentence sketch after the usage example below)
  - Training limited to input sequences not exceeding 512 tokens
  - Translations cannot be used as certified translations
- May have additional limitations when handling specific cultural contexts, dialectal variations, or specialized terminology in Kabardian

## Usage Example

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example: translating English to Kabardian
src_lang = "eng_Latn"  # English
tgt_lang = "kbd_Cyrl"  # Kabardian in Cyrillic script

text = "Hello, how are you?"
tokenizer.src_lang = src_lang  # tell the tokenizer the language of the input text
inputs = tokenizer(text, return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),  # force the output language
    max_length=30,
)
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
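Because the base model targets single sentences and input sequences of at most 512 tokens (see Performance and Limitations above), longer texts are best translated sentence by sentence. A minimal sketch under those constraints; the regex-based splitter and the length limits chosen here are illustrative assumptions, not part of the model card:

```python
import re

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_text(text, src_lang="eng_Latn", tgt_lang="kbd_Cyrl"):
    """Translate a multi-sentence text one sentence at a time."""
    tokenizer.src_lang = src_lang
    # Naive sentence splitter; replace with a proper one for real use.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    outputs = []
    for sentence in sentences:
        inputs = tokenizer(
            sentence,
            return_tensors="pt",
            truncation=True,
            max_length=512,  # stay within the training-time input limit
        )
        generated = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=256,  # illustrative cap on output length
        )
        outputs.append(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
    return " ".join(outputs)

print(translate_text("Hello. How are you? See you tomorrow."))
```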

## Ethical Considerations

As noted for the base NLLB-200 model:

- This work prioritizes human users and aims to minimize risks transferred to them
- Translation access for low-resource languages can improve education and information access but could potentially make groups with lower digital literacy vulnerable to misinformation
- Despite extensive data cleaning, personally identifiable information may not be entirely eliminated from training data
- Mistranslations could have adverse impacts on those relying on translations for important decisions

## Caveats and Recommendations

- The base model was primarily tested on the Wikimedia domain, with limited investigation of other domains
- Supported languages may have variations that the model does not capture
- Users should make appropriate assessments for their specific use cases
- This pre-trained model is part of a series of models specifically focused on Kabardian language translation
- For production use cases, consider the fully fine-tuned versions (v0.1, v0.2) rather than this pre-trained version

## Additional Information

This model is part of a collection of NLLB models fine-tuned by panagoa for Kabardian language translation.