Commit · 44bfa66
Parent(s): ed4bb7e
Update README.md

README.md CHANGED
tags:
- language model
---

## AlephBertGimmel

Modern Hebrew pretrained BERT model with a 128K-token vocabulary.

Checkpoint of alephbertgimmel-base-512 from [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel)
(for testing purposes, please use the original checkpoints from the authors of this model)
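
As a quick sanity check of the 128K-token vocabulary claim, here is a minimal sketch (assuming the `imvladikon/alephbertgimmel-base-512` hub id used in the example below) that loads only the tokenizer and prints its base vocabulary size:

```python
from transformers import AutoTokenizer

# Load just the tokenizer and inspect its base vocabulary size
tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")
print(tokenizer.vocab_size)  # expected to be on the order of 128K entries
```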

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-base-512")
tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")

# "{} is a metropolis constituting the center of the economy"
text = "{} 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"

# Encode with [MASK] in the placeholder slot and locate the masked position
input_ids = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Score the vocabulary at the masked position and take the top five candidates
token_logits = model(input_ids).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(text.format(tokenizer.decode([token])))

# Expected top-5 completions (the city / Jerusalem / Haifa / London / Eilat):
# 讛注讬专 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
# 讬专讜砖诇讬诐 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
# 讞讬驻讛 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
# 诇讜谞讚讜谉 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
# 讗讬诇转 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
```
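
For quick experiments, the same top-k completions can be obtained with less boilerplate via the `fill-mask` pipeline; this is a minimal sketch, assuming the same hub id as above:

```python
from transformers import pipeline

# The fill-mask pipeline wraps the tokenizer, model, and top-k decoding shown above
fill_mask = pipeline("fill-mask", model="imvladikon/alephbertgimmel-base-512")

# "[MASK] is a metropolis constituting the center of the economy"
for candidate in fill_mask("[MASK] 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛", top_k=5):
    print(candidate["sequence"], candidate["score"])
```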

When using AlephBertGimmel, please reference:

```bibtex
@misc{guetta2022large,
      title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
      author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
      year={2022},
      eprint={2211.15199},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```