shadabtanjeed
/

mbart-banglish-to-bengali-transliteration

Safetensors

mbart

nlp

seq2seq

Model card Files Files and versions Community

shadabtanjeed commited on Feb 15

Commit

5734a57

verified ·

1 Parent(s): ffafbd4

Update README.md

Browse files

Files changed (1) hide show

README.md +122 -1

README.md CHANGED Viewed

@@ -6,4 +6,125 @@ base_model:
 tags:
 - nlp
 - seq2seq
----

 tags:
 - nlp
 - seq2seq
+---
+# Model Card for Banglish to Bengali Transliteration using mBART
+This model is designed to perform transliteration from Banglish (Romanized Bengali) to Bengali script using the [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) model. The training was conducted using the dataset [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data).
+The notebook used for training can be found here: [Kaggle Notebook](https://www.kaggle.com/code/shadabtanjeed/mbart-banglish-to-bengali-transliteration).
+## Model Details
+### Model Description
+- **Developed by:** Shadab Tanjeed
+- **Model type:** Sequence-to-sequence (Seq2Seq) Transformer model
+- **Language(s) (NLP):** Bengali, Banglish (Romanized Bengali)
+- **Finetuned from model:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)
+### Model Sources
+- **Repository:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)
+## Uses
+### Direct Use
+The model is intended for direct transliteration of Banglish text to Bengali script.
+### Downstream Use
+It can be integrated into NLP applications where transliteration from Banglish to Bengali is required, such as chatbots, text normalization, and digital content processing.
+### Out-of-Scope Use
+The model is not designed for language translation beyond transliteration, and it may not perform well on text containing mixed languages or code-switching.
+## Bias, Risks, and Limitations
+- The model may struggle with ambiguous words that have multiple possible transliterations.
+- It may not perform well on informal or highly stylized text.
+- Limited dataset coverage could lead to errors in transliterating uncommon words.
+### Recommendations
+Users should validate outputs, especially for critical applications, and consider further fine-tuning if necessary.
+## How to Get Started with the Model
+```python
+from transformers import MBartForConditionalGeneration, MBartTokenizer
+model_name = "facebook/mbart-large-50-many-to-many-mmt"
+tokenizer = MBartTokenizer.from_pretrained(model_name)
+model = MBartForConditionalGeneration.from_pretrained(model_name)
+text = "ami tomake bhalobashi"
+inputs = tokenizer(text, return_tensors="pt")
+translated_tokens = model.generate(**inputs)
+output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
+print(output)  # Expected Bengali transliteration
+```
+## Training Details
+### Training Data
+The dataset used for training is [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), which contains pairs of Banglish (Romanized Bengali) and corresponding Bengali script.
+### Training Procedure
+#### Preprocessing
+- Tokenization was performed using the mBART tokenizer.
+- Text normalization techniques were applied to remove noise.
+#### Training Hyperparameters
+- **Batch size:** 8
+- **Learning rate:** 3e-5
+- **Epochs:** 5
+## Evaluation
+### Testing Data, Factors & Metrics
+#### Testing Data
+- The same dataset [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data) was used for evaluation.
+## Technical Specifications
+### Model Architecture and Objective
+The model follows the Transformer-based Seq2Seq architecture from mBART.
+#### Software
+- **Framework:** Hugging Face Transformers
+## Citation
+If you use this model, please cite the dataset and base model:
+```bibtex
+@inproceedings{SKNahin2023,
+  author = {SK Nahin},
+  title = {Bengali Transliteration Dataset},
+  year = {2023},
+  publisher = {Hugging Face Datasets},
+  url = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
+}
+@article{lewis2020mbart,
+  title={mBART: Multilingual Denoising Pre-training for Neural Machine Translation},
+  author={Lewis, Mike and others},
+  journal={arXiv preprint arXiv:2001.08210},
+  year={2020}
+}