imvladikon committed
Commit 44bfa66
1 Parent(s): ed4bb7e

Update README.md

Files changed (1)
  1. README.md +43 -4
README.md CHANGED
@@ -5,15 +5,54 @@ tags:
  - language model
  ---

- Checkpoint of the alephbertgimmel-base-512 from https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
  (for testing purposes only; please use the original checkpoints from the authors of this model)

- AlephBertGimmel - Modern Hebrew pretrained BERT model with a 128K token vocabulary.

- When using AlephBertGimmel, please reference:

  ```
- Eylon Guetta, Avi Shmidman, Shaltiel Shmidman, Cheyn Shmuel Shmidman, Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker and Reut Tsarfaty, "Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All", Nov 2022 [http://arxiv.org/abs/2211.15199]
  ```
 
  - language model
  ---

+ ## AlephBertGimmel
+
+ Modern Hebrew pretrained BERT model with a 128K token vocabulary.
+
+ Checkpoint of alephbertgimmel-base-512 from [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel).
  (for testing purposes only; please use the original checkpoints from the authors of this model)
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-base-512")
+ tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")
+
+ text = "{} 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"
+
+ # Encode the sentence with [MASK] in the empty slot and locate the mask position
+ input_ids = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
+ mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
+
+ # Score the vocabulary at the masked position and keep the five best candidates
+ token_logits = model(input_ids).logits
+ mask_token_logits = token_logits[0, mask_token_index, :]
+ top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+
+ for token in top_5_tokens:
+     print(text.format(tokenizer.decode([token])))
+
+ # Example output:
+ # 讛注讬专 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讬专讜砖诇讬诐 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讞讬驻讛 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 诇讜谞讚讜谉 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讗讬诇转 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
  ```
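
For a quicker check without handling the logits directly, the same top-5 completions can be obtained with the transformers fill-mask pipeline. A minimal sketch, assuming the same imvladikon/alephbertgimmel-base-512 checkpoint as above:

```python
from transformers import pipeline

# The fill-mask pipeline wraps tokenization, masking and top-k decoding in one call
fill_mask = pipeline("fill-mask", model="imvladikon/alephbertgimmel-base-512")

# Returns the five most likely fillers for the [MASK] slot by default
for prediction in fill_mask("[MASK] 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"):
    print(prediction["sequence"], prediction["score"])
```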
+
+ When using AlephBertGimmel, please reference:
+
+ ```bibtex
+ @misc{guetta2022large,
+   title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
+   author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
+   year={2022},
+   eprint={2211.15199},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
  ```
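
As a quick sanity check of the 128K token vocabulary mentioned above, the tokenizer itself can be inspected. A minimal sketch, again assuming the imvladikon/alephbertgimmel-base-512 checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")

# Should report a vocabulary on the order of 128K entries
print(len(tokenizer))

# Inspect how a Hebrew sentence is segmented with the extra-large vocabulary
print(tokenizer.tokenize("讬专讜砖诇讬诐 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"))
```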