---
license: mit
---

# Bengali to English Word Aligner

A fine-tuned model for **Bengali-to-English word alignment**, built on `bert-base-multilingual-cased`.

## Quick Start

Load the tokenizer and model to use them in your project:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("musfiqdehan/bengali-english-word-aligner")
model = AutoModel.from_pretrained("musfiqdehan/bengali-english-word-aligner")
```
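
This card does not say how alignments are derived from the model's embeddings. As a rough illustration only, the sketch below shows one common approach for embedding-based aligners: compute cosine similarity between source and target word vectors and keep the pairs that are each other's best match. The function names and the toy 2-d vectors are assumptions made up for this example; they are not part of this model or of `bangla-postagger`, whose actual method may differ (for instance, it can return one-to-many links).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (src_index, tgt_index) pairs that are each other's best match."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for every source word.
    best_tgt = [max(range(len(tgt_vecs)), key=row.__getitem__) for row in sim]
    # Best source index for every target word.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Hypothetical toy vectors standing in for contextual word embeddings.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
tgt_vecs = [[0.9, 0.1], [0.6, 0.8], [0.1, 0.9]]

pairs = mutual_argmax_align(src_vecs, tgt_vecs)
print(pairs)
```

Keeping only mutual best matches trades recall for precision: a source word that should link to several target words gets at most one link under this rule.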

## Bengali-English Word Alignment

[Open in Colab](https://colab.research.google.com/drive/1x5wUXS7vdWNeROkJS_B_lUwKTJZGaB7v?usp=sharing)

[Open in Kaggle](https://www.kaggle.com/musfiqdehan/bengali-english-alignment-demo)

Install Dependencies (the leading `!` runs the command from a notebook cell; omit it in a terminal)

```
!pip install -U data-preprocessors
!pip install -U bangla-postagger
```

Import Necessary Libraries

```python
from pprint import pprint

from data_preprocessors import text_preprocessor as tp
from bangla_postagger import (en_postaggers as ep,
                              bn_en_mapper as bem,
                              translators as trans)
```

Testing Word Mapping and Alignment

```python
src = "আমি ভাত খাই না, রুটি খাই।"
tgt = "I do not eat rice, I eat bread."

# Give one space before and after punctuation
# for easy tokenization
src = tp.space_punc(src)
tgt = tp.space_punc(tgt)

print("Word Mapping:")
mapping = bem.get_word_mapping(
    source=src, target=tgt, model_path="musfiqdehan/bengali-english-word-aligner")
pprint(mapping)
```

Output

```
Word Mapping:
['bn:(আমি) -> en:(I)',
 'bn:(ভাত) -> en:(rice)',
 'bn:(খাই) -> en:(do)',
 'bn:(খাই) -> en:(eat)',
 'bn:(না) -> en:(not)',
 'bn:(,) -> en:(,)',
 'bn:(রুটি) -> en:(bread)',
 'bn:(খাই) -> en:(eat)',
 'bn:(।) -> en:(.)']
```
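
The mapping comes back as a list of formatted strings, so downstream code usually needs to split each entry into its Bengali and English parts. The sketch below parses the `bn:(...) -> en:(...)` format shown in the output above, using only the standard library. The helper names `parse_mapping` and `group_by_source` are hypothetical and not part of `bangla-postagger`, and the pattern assumes every entry follows exactly the format shown.

```python
import re
from collections import defaultdict

# Matches one entry such as 'bn:(ভাত) -> en:(rice)'.
ENTRY = re.compile(r"^bn:\((.*)\) -> en:\((.*)\)$")


def parse_mapping(entries):
    """Turn formatted mapping strings into (bengali, english) tuples."""
    pairs = []
    for entry in entries:
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unexpected entry format: {entry!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


def group_by_source(pairs):
    """Collect the distinct English words linked to each Bengali word form.

    Repeated occurrences of the same Bengali word are merged, because the
    strings carry no position information.
    """
    grouped = defaultdict(list)
    for bn, en in pairs:
        if en not in grouped[bn]:
            grouped[bn].append(en)
    return dict(grouped)


# The list printed in the output above.
mapping = [
    'bn:(আমি) -> en:(I)',
    'bn:(ভাত) -> en:(rice)',
    'bn:(খাই) -> en:(do)',
    'bn:(খাই) -> en:(eat)',
    'bn:(না) -> en:(not)',
    'bn:(,) -> en:(,)',
    'bn:(রুটি) -> en:(bread)',
    'bn:(খাই) -> en:(eat)',
    'bn:(।) -> en:(.)',
]

pairs = parse_mapping(mapping)
grouped = group_by_source(pairs)
print(grouped["খাই"])
```

Because the two occurrences of `খাই` in the sentence cannot be told apart from the strings alone, this grouping is per word form, not per token position.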