---
datasets:
- stanfordnlp/imdb
metrics:
- perplexity
base_model:
- EuroBERT/EuroBERT-210m
pipeline_tag: fill-mask
tags:
- art
---
# Model Card for EuroBERT-210m-finetuned-imdb

## Model Overview

- **Model Name**: EuroBERT-210m-finetuned-imdb
- **Base Model**: [EuroBERT/EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m)
- **Fine-tuned On**: IMDb dataset
- **Task**: Masked Language Modeling (MLM)
- **Training Objective**: Minimize the masked-token cross-entropy loss (reported as perplexity)

## Dataset Details

- **Dataset Used**: IMDb (`stanfordnlp/imdb`; see the loading sketch below)
- **Dataset Version**: Default version from the `datasets` library
- **Dataset Source**: Hugging Face `datasets`
- **Training Split**: `train`
- **Evaluation Split**: `test`
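
A minimal sketch of loading these splits, using the `stanfordnlp/imdb` repository id from the metadata above:

```python
from datasets import load_dataset

# Load the IMDb dataset referenced in this card.
imdb = load_dataset("stanfordnlp/imdb")
train_split = imdb["train"]  # used for fine-tuning
eval_split = imdb["test"]    # used for perplexity evaluation
print(train_split)
```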

## Training & Evaluation

### Training Process
- The model was fine-tuned for three epochs using PyTorch and Hugging Face's `transformers` library.
- The optimizer and learning rate scheduler were set up within the `accelerate` framework (see the sketch after this list).
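
The card does not record the exact hyperparameters, so the following is only a minimal sketch of a typical `accelerate`-based MLM fine-tuning setup. The `AdamW` optimizer, learning rate, masking probability, batch size, sequence length, and linear scheduler are illustrative assumptions, not values from the original run:

```python
from accelerate import Accelerator
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    get_scheduler,
)

base_checkpoint = "EuroBERT/EuroBERT-210m"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint, trust_remote_code=True)

# Tokenize the IMDb training split; max_length is an assumed value.
dataset = load_dataset("stanfordnlp/imdb", split="train")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
)

# The collator randomly masks tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
train_dataloader = DataLoader(tokenized, batch_size=32, shuffle=True, collate_fn=collator)

accelerator = Accelerator()
optimizer = AdamW(model.parameters(), lr=5e-5)  # assumed learning rate
num_epochs = 3  # matches the three epochs reported in this card
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_epochs * len(train_dataloader),
)

# accelerate moves model, optimizer, and data to the target device(s).
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        loss = model(**batch).loss  # batch holds masked input_ids and labels
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```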

### Evaluation Metrics
- The model was evaluated using **Perplexity (PPL)** on the test set (a computation sketch follows the results).
- Results:
  - **Epoch 0**: PPL = 12.63
  - **Epoch 1**: PPL = 9.35
  - **Epoch 2**: PPL = 8.12
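
Perplexity here is the exponential of the mean cross-entropy loss over the evaluation split. A minimal sketch, reusing `tokenizer`, `collator`, `model`, and `accelerator` from the training sketch above (same assumed batch size and sequence length):

```python
import math

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader

# Tokenize the IMDb test split the same way as the training data.
test_dataset = load_dataset("stanfordnlp/imdb", split="test")
tokenized_test = test_dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=test_dataset.column_names,
)
eval_dataloader = DataLoader(tokenized_test, batch_size=32, collate_fn=collator)
eval_dataloader = accelerator.prepare(eval_dataloader)  # move batches to the model's device

# PPL = exp(mean cross-entropy loss over the masked evaluation batches).
model.eval()
losses = []
for batch in eval_dataloader:
    with torch.no_grad():
        loss = model(**batch).loss
    losses.append(loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity: {perplexity:.2f}")
```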

## Model Usage

### Inference
The model can be used for masked token prediction using the following script:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def predict_masked_sentence(sentence, mask_token="<|mask|>"):
    """
    Predicts the top-1 token for every mask token in a sentence and returns the reconstructed text.

    Args:
        sentence (str): Input sentence with mask tokens (e.g., "The movie was <|mask|>!").
        mask_token (str, optional): Token used as the mask in the input sentence. Defaults to "<|mask|>".

    Returns:
        str: Sentence with all mask tokens replaced by top-1 predictions.
    """
    model_checkpoint = "milanvelinovski/EuroBERT-210m-finetuned-imdb"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)
    model.eval()

    # Map the caller's mask token to the model's own mask token before tokenizing.
    sentence_with_model_mask = sentence.replace(mask_token, tokenizer.mask_token)
    inputs = tokenizer(sentence_with_model_mask, return_tensors="pt")
    with torch.no_grad():
        token_logits = model(**inputs).logits

    # Find every mask position and take the highest-scoring token at each one.
    mask_token_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    top_tokens = [torch.topk(token_logits[0, idx, :], 1).indices.item() for idx in mask_token_indices]

    # Stitch the predicted tokens back into the original sentence.
    text_parts = sentence.split(mask_token)
    final_text = text_parts[0] + ''.join(
        tokenizer.decode([token]) + text_parts[i + 1] for i, token in enumerate(top_tokens)
    )
    return final_text


text = "The protagonist's journey was <|mask|>, filled with <|mask|> obstacles that made the ending feel <|mask|>."
final_text = predict_masked_sentence(text)
print(final_text)
```
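
For quick single-mask queries, the generic `fill-mask` pipeline (matching the `pipeline_tag` in the metadata above) should also work; `trust_remote_code=True` is required because EuroBERT ships custom modeling code:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="milanvelinovski/EuroBERT-210m-finetuned-imdb",
    trust_remote_code=True,
)
# Returns the top candidate tokens with scores for the mask position.
print(fill_mask("The movie was absolutely <|mask|>!"))
```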

## Libraries Used

| Library      | Version     |
|--------------|-------------|
| datasets     | 3.3.1       |
| transformers | 4.49.0      |
| evaluate     | 0.4.3       |
| accelerate   | 1.2.1       |
| torch        | 2.5.1+cu121 |
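
To approximate this environment, the versions above can be pinned directly (a sketch; the `+cu121` torch build may additionally require the matching CUDA wheel index for your platform):

```bash
pip install datasets==3.3.1 transformers==4.49.0 evaluate==0.4.3 accelerate==1.2.1 torch==2.5.1
```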

## Model Limitations
- The model is trained primarily for masked language modeling and may not transfer well to other NLP tasks without further adaptation.
- The perplexity scores suggest that further fine-tuning or hyperparameter optimization might improve performance.
- Predictions are shaped by the IMDb movie-review domain and may not generalize well to other domains.

## Citation
If you use this model, please cite:
```bibtex
@misc{EuroBERT-210m-finetuned-imdb,
  author    = {Milan Velinovski},
  title     = {EuroBERT-210m-finetuned-imdb},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/milanvelinovski/EuroBERT-210m-finetuned-imdb}
}
```