Update README.md
README.md
```diff
@@ -89,7 +89,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
 
 <figure>
 
-[previous Figure 3 image]
+[updated Figure 3 image]
 
 <caption>Figure 3. Expected perplexity distributions of the sample mc4-es after applying the Stepwise function.</caption>
 </figure>
```
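For context on the `factor` parameter mentioned in the hunk header, here is a minimal sketch of what a stepwise perplexity-sampling function might look like. The quartile boundaries, the down-weighting of the tails, and the default `factor=0.5` are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

def stepwise_weights(ppl, q1, q3, factor=0.5):
    """Hypothetical stepwise sampling weights over document perplexities:
    keep mid-perplexity documents with weight 1.0, and down-weight the
    low/high-perplexity tails by `factor`."""
    ppl = np.asarray(ppl, dtype=float)
    weights = np.full(ppl.shape, factor, dtype=float)
    mid = (ppl >= q1) & (ppl <= q3)  # middle band between the two quantiles
    weights[mid] = 1.0
    return weights

# Example: subsample stand-in perplexities according to the stepwise weights.
rng = np.random.default_rng(0)
ppl = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)  # fake mc4-es perplexities
q1, q3 = np.quantile(ppl, [0.25, 0.75])
keep = rng.random(ppl.size) < stepwise_weights(ppl, q1, q3, factor=0.5)
print(f"kept {keep.mean():.1%} of documents")
```

Raising or lowering `factor` shifts how aggressively the tails of the perplexity distribution are suppressed, which is what Figure 3 visualizes for the adjusted value.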
```diff
@@ -139,7 +139,7 @@
 
 ### Training details
 
-We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k.
+We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k. `Stepwise` initially had to be stopped at 180k to allow the downstream tests (sequence length 128), but was later resumed; by the time of the sequence-length-512 tests it had reached 204k steps, improving performance substantially.
 
 Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
 
```
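The README does not spell out the two resumption strategies, but the choice typically comes down to how the learning-rate schedule is handled when switching to 512-token sequences. The sketch below shows two plausible options one might compare; the optax schedules, peak learning rate, warmup length, and step counts are all assumptions for illustration, not the repo's actual configuration:

```python
import optax

# Illustrative numbers standing in for the Liu et al. (2019) hyperparameters.
PEAK_LR, WARMUP, TOTAL_STEPS, STEPS_DONE = 6e-4, 10_000, 275_000, 250_000

# Strategy 1: resume the original warmup + linear-decay schedule where it
# left off, changing only the data pipeline to 512-token sequences.
base = optax.join_schedules(
    [optax.linear_schedule(0.0, PEAK_LR, WARMUP),
     optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP)],
    boundaries=[WARMUP])
resume_lr = lambda step: base(step + STEPS_DONE)  # offset by steps already done

# Strategy 2: restart the schedule with a fresh, short warmup sized for the
# ~25k extra steps at sequence length 512.
restart_lr = optax.join_schedules(
    [optax.linear_schedule(0.0, PEAK_LR / 2, 500),
     optax.linear_schedule(PEAK_LR / 2, 0.0, 24_500)],
    boundaries=[500])

optimizer = optax.adamw(learning_rate=resume_lr)  # or restart_lr
```

Which option works better is exactly the kind of empirical question the paragraph above alludes to: the literature rarely documents this transition, yet it can noticeably change final downstream performance.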