imvladikon committed
Commit 44bfa66
1 Parent(s): ed4bb7e

Update README.md

Files changed (1)
  1. README.md +43 -4
README.md CHANGED
@@ -5,15 +5,54 @@ tags:
  - language model
  ---

- Checkpoint of the alephbertgimmel-base-512 from https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
  (for testing purposes only; please use the original checkpoints from the authors of this model)

- AlephBertGimmel - Modern Hebrew pretrained BERT model with a 128K token vocabulary.

- When using AlephBertGimmel, please reference:

  ```
- Eylon Guetta, Avi Shmidman, Shaltiel Shmidman, Cheyn Shmuel Shmidman, Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker and Reut Tsarfaty, "Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All", Nov 2022 [http://arxiv.org/abs/2211.15199]
  ```
 
  - language model
  ---

+ ## AlephBertGimmel
+
+ Modern Hebrew pretrained BERT model with a 128K token vocabulary.
+
+ Checkpoint of alephbertgimmel-base-512 from [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel).
  (for testing purposes only; please use the original checkpoints from the authors of this model)
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-base-512")
+ tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")
+
+ text = "{} 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"
+
+ # Encode the sentence with [MASK] in the empty slot and locate the mask position
+ input_ids = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
+ mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
+
+ # Score the vocabulary at the masked position and keep the five best candidates
+ token_logits = model(input_ids).logits
+ mask_token_logits = token_logits[0, mask_token_index, :]
+ top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+
+ for token in top_5_tokens:
+     print(text.format(tokenizer.decode([token])))
+
+ # Example output:
+ # 讛注讬专 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讬专讜砖诇讬诐 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讞讬驻讛 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 诇讜谞讚讜谉 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
+ # 讗讬诇转 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛
  ```
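
For a quicker check without handling the logits directly, the same top-5 completions can be obtained with the transformers fill-mask pipeline. A minimal sketch, assuming the same imvladikon/alephbertgimmel-base-512 checkpoint as above:

```python
from transformers import pipeline

# The fill-mask pipeline wraps tokenization, masking and top-k decoding in one call
fill_mask = pipeline("fill-mask", model="imvladikon/alephbertgimmel-base-512")

# Returns the five most likely fillers for the [MASK] slot by default
for prediction in fill_mask("[MASK] 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"):
    print(prediction["sequence"], prediction["score"])
```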
+
+ When using AlephBertGimmel, please reference:
+
+ ```bibtex
+ @misc{guetta2022large,
+   title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
+   author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
+   year={2022},
+   eprint={2211.15199},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
  ```
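
As a quick sanity check of the 128K token vocabulary mentioned above, the tokenizer itself can be inspected. A minimal sketch, again assuming the imvladikon/alephbertgimmel-base-512 checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")

# Should report a vocabulary on the order of 128K entries
print(len(tokenizer))

# Inspect how a Hebrew sentence is segmented with the extra-large vocabulary
print(tokenizer.tokenize("讬专讜砖诇讬诐 讛讬讗 诪讟专讜驻讜诇讬谉 讛诪讛讜讜讛 讗转 诪专讻讝 讛讻诇讻诇讛"))
```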