imvladikon
/

alephbertgimmel-small-128

Feature Extraction

text-embeddings-inference

Inference Endpoints

Model card Files Files and versions Community

alephbertgimmel-small-128 / README.md

imvladikon's picture

Create README.md

7177c13 over 1 year ago

|

history blame contribute delete

3.4 kB

	---
	language:
	- he
	tags:
	- language model
	pipeline_tag: feature-extraction
	---

	## AlephBertGimmel
	Modern Hebrew pretrained BERT model with a 128K token vocabulary.


	[Checkpoint](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel/tree/main/alephbertgimmel-small/ckpt_29400--Max128Seq) of the alephbertgimmel-small-128 from [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel)


	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM


	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-small-128")
	tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-small-128")

	text = "{} היא מטרופולין המהווה את מרכז הכלכלה"

	input = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
	mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]

	token_logits = model(input).logits
	mask_token_logits = token_logits[0, mask_token_index, :]
	top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

	for token in top_5_tokens:
	print(text.format(tokenizer.decode([token])))

	# ישראל היא מטרופולין המהווה את מרכז הכלכלה
	# ירושלים היא מטרופולין המהווה את מרכז הכלכלה
	# חיפה היא מטרופולין המהווה את מרכז הכלכלה
	# אילת היא מטרופולין המהווה את מרכז הכלכלה
	# אשדוד היא מטרופולין המהווה את מרכז הכלכלה
	```

	```python
	def ppl_naive(text, model, tokenizer):
	input = tokenizer.encode(text, return_tensors="pt")
	loss = model(input, labels=input)[0]
	return torch.exp(loss).item()

	text = """{} היא עיר הבירה של מדינת ישראל, והעיר הגדולה ביותר בישראל בגודל האוכלוסייה"""

	for word in ["חיפה", "ירושלים", "תל אביב"]:
	print(ppl_naive(text.format(word), model, tokenizer))

	# 9.825098991394043
	# 10.594215393066406
	# 9.536449432373047

	# I'd expect that for "ירושלים" should be the smallest value, but...

	@torch.inference_mode()
	def ppl_pseudo(text, model, tokenizer, ignore_idx=-100):
	input = tokenizer.encode(text, return_tensors='pt')
	mask = torch.ones(input.size(-1) - 1).diag(1)[:-2]
	repeat_input = input.repeat(input.size(-1) - 2, 1)
	input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
	labels = repeat_input.masked_fill(input != tokenizer.mask_token_id, ignore_idx)
	loss = model(input, labels=labels)[0]
	return torch.exp(loss).item()


	for word in ["חיפה", "ירושלים", "תל אביב"]:
	print(ppl_pseudo(text.format(word), model, tokenizer))
	# 4.346900939941406
	# 3.292382001876831
	# 2.732590913772583
	```

	When using AlephBertGimmel, please reference:

	```bibtex

	@misc{guetta2022large,
	title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
	author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
	year={2022},
	eprint={2211.15199},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}

	```