shadabtanjeed committed on
Commit 5734a57 · verified · 1 Parent(s): ffafbd4

Update README.md

Files changed (1)
  1. README.md +122 -1
README.md CHANGED
tags:
- nlp
- seq2seq
---

# Model Card for Banglish to Bengali Transliteration using mBART

This model performs transliteration from Banglish (Romanized Bengali) to Bengali script. It was fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data) dataset.

The notebook used for training can be found here: [Kaggle Notebook](https://www.kaggle.com/code/shadabtanjeed/mbart-banglish-to-bengali-transliteration).

## Model Details

### Model Description

- **Developed by:** Shadab Tanjeed
- **Model type:** Sequence-to-sequence (Seq2Seq) Transformer model
- **Language(s) (NLP):** Bengali, Banglish (Romanized Bengali)
- **Finetuned from model:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

### Model Sources

- **Base model repository:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

## Uses

### Direct Use

The model is intended for direct transliteration of Banglish text to Bengali script.

### Downstream Use

It can be integrated into NLP applications that need Banglish-to-Bengali transliteration, such as chatbots, text normalization, and digital content processing (a small batch helper is sketched after the quick-start snippet below).

### Out-of-Scope Use

The model is not designed for translation tasks beyond transliteration, and it may not perform well on text containing mixed languages or code-switching.

## Bias, Risks, and Limitations

- The model may struggle with ambiguous words that have multiple possible transliterations.
- It may not perform well on informal or highly stylized text.
- Limited dataset coverage could lead to errors when transliterating uncommon words.

### Recommendations

Users should validate outputs, especially for critical applications, and consider further fine-tuning if necessary.

## How to Get Started with the Model

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# model_name points at the base checkpoint here; to reproduce this card's
# transliteration behaviour, replace it with the fine-tuned model's repository ID.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# mBART-50 expects explicit language codes; using "bn_IN" for both source and
# target is an assumption, since the card does not state the codes used in training.
tokenizer.src_lang = "bn_IN"

text = "ami tomake bhalobashi"
inputs = tokenizer(text, return_tensors="pt")

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],
)
output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print(output)  # Expected Bengali transliteration
```
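
For the chatbot and text-normalization integrations mentioned under Downstream Use, the quick-start snippet can be wrapped in a small batch helper. This is a minimal sketch rather than code from the original card; it reuses the `model` and `tokenizer` objects loaded above, and the `bn_IN` target code is the same assumption as before.

```python
def transliterate(texts, model, tokenizer, max_length=128):
    """Transliterate a list of Banglish strings to Bengali script."""
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length
    )
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],  # assumed target code
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Example: normalizing a batch of chat messages before further processing.
messages = ["ami tomake bhalobashi", "tumi kemon acho"]
print(transliterate(messages, model, tokenizer))
```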

## Training Details

### Training Data

The model was trained on [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), which contains pairs of Banglish (Romanized Bengali) sentences and their Bengali-script equivalents.

### Training Procedure

#### Preprocessing

- Tokenization was performed using the mBART tokenizer (a sketch follows this list).
- Text normalization techniques were applied to remove noise.

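The card does not include the preprocessing code itself. The following is a minimal sketch of how the tokenization step might look; the dataset column names (`rm` for Banglish, `bn` for Bengali), the `bn_IN` language codes, and the maximum length of 128 are assumptions, not details taken from the training notebook.

```python
from datasets import load_dataset
from transformers import MBart50TokenizerFast

# Column names "rm" (Banglish) and "bn" (Bengali) are assumptions about the
# dataset schema; adjust them to the actual field names.
dataset = load_dataset("SKNahin/bengali-transliteration-data", split="train")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "bn_IN"  # assumed source/target codes, not stated in the card
tokenizer.tgt_lang = "bn_IN"

def preprocess(batch):
    # Tokenize Banglish inputs and Bengali targets into model-ready features.
    model_inputs = tokenizer(batch["rm"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["bn"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```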

#### Training Hyperparameters

- **Batch size:** 8
- **Learning rate:** 3e-5
- **Epochs:** 5

These values map onto the trainer configuration sketched below.
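
A minimal sketch of how the listed hyperparameters might be wired into the Hugging Face `Seq2SeqTrainer`, continuing from the model and the tokenized dataset in the sketches above. Only the batch size, learning rate, and epoch count come from the card; the output directory, `fp16`, and the remaining arguments are illustrative assumptions.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Only batch size, learning rate, and epoch count come from the card;
# everything else below is an illustrative default.
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-banglish-to-bengali",  # hypothetical output path
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,                  # mBART model loaded in the quick-start snippet
    args=training_args,
    train_dataset=tokenized,      # tokenized dataset from the preprocessing sketch
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```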

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The same dataset, [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), was used for evaluation (an illustrative scoring sketch follows below).

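The card does not report any metric values. Purely as an illustration of how the model could be scored on this data (not the author's evaluation procedure), one option is exact-match accuracy plus character error rate via the `evaluate` library; the split slice and column names below are assumptions, and `model`/`tokenizer` come from the quick-start snippet.

```python
import evaluate
from datasets import load_dataset

# Held-out slice and column names ("rm", "bn") are assumptions; the card does
# not describe the evaluation split or report metric values.
test_set = load_dataset("SKNahin/bengali-transliteration-data", split="train[:200]")
cer = evaluate.load("cer")

predictions = []
for example in test_set:
    inputs = tokenizer(example["rm"], return_tensors="pt")
    generated = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"]
    )
    predictions.append(tokenizer.decode(generated[0], skip_special_tokens=True))

references = [example["bn"] for example in test_set]
exact_match = sum(p == r for p, r in zip(predictions, references)) / len(references)
print("Exact match:", exact_match)
print("CER:", cer.compute(predictions=predictions, references=references))
```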

## Technical Specifications

### Model Architecture and Objective

The model follows the Transformer-based encoder-decoder (Seq2Seq) architecture of mBART; fine-tuning optimizes the standard token-level cross-entropy objective over the Bengali target sequences.

#### Software

- **Framework:** Hugging Face Transformers

## Citation

If you use this model, please cite the dataset and the base model:

```bibtex
@misc{SKNahin2023,
  author    = {SK Nahin},
  title     = {Bengali Transliteration Dataset},
  year      = {2023},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
}

@article{liu2020mbart,
  title   = {Multilingual Denoising Pre-training for Neural Machine Translation},
  author  = {Liu, Yinhan and others},
  journal = {arXiv preprint arXiv:2001.08210},
  year    = {2020}
}
```