Added thinking ablation evaluation results
README.md
CHANGED
@@ -191,7 +191,7 @@ So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% a
|
|
191 |
|
192 |
**Evaluation Results:**
|
193 |
<table>
|
194 |
-
|
195 |
<thead>
|
196 |
<tr>
|
197 |
<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
@@ -309,7 +309,7 @@ So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% a
|
|
309 |
|
310 |
<tr>
|
311 |
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-2B-Instruct</td>
|
312 |
-
<td style="text-align:center; background-color: #DAE8FF; color: black;">
|
313 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">34.51</td>
|
314 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">57.18</td>
|
315 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">20.56</td>
|
@@ -340,10 +340,54 @@ So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% a
|
|
340 |
|
341 |
</tr>
|
342 |
|
343 |
-
|
344 |
-
|
345 |
</tbody></table>
|
346 |
|
347 |
**Training Data:**
|
348 |
Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive licenses and (2) internal synthetically generated data targeted to enhance reasoning capabilities.
|
349 |
<!-- A detailed attribution of datasets can be found in [Granite 3.2 Technical Report (coming soon)](#), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf). -->
|
|
|
191 |
|
192 |
**Evaluation Results:**
|
193 |
<table>
|
194 |
+
<caption><b> Comparison with Other Models</b></caption>
|
195 |
<thead>
|
196 |
<tr>
|
197 |
<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
|
|
309 |
|
310 |
<tr>
|
311 |
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-2B-Instruct</td>
|
312 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">26.6</td>
|
313 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">34.51</td>
|
314 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">57.18</td>
|
315 |
<td style="text-align:center; background-color: #DAE8FF; color: black;">20.56</td>
|
|
|
340 |
|
341 |
</tr>
|
342 |
|
343 |
</tbody></table>
|
344 |
|
345 |
+
<table>
|
346 |
+
<caption><b>Thinking Ablation</b></caption>
|
347 |
+
<thead>
|
348 |
+
<tr>
|
349 |
+
<th rowspan="2" style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
350 |
+
<th colspan="2" style="text-align:center; background-color: #001d6c; color: white;">Thinking=False</th>
|
351 |
+
<th colspan="2" style="text-align:center; background-color: #001d6c; color: white;">Thinking=True</th>
|
352 |
+
</tr>
|
353 |
+
<tr>
|
354 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">ArenaHard</th>
|
355 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">Alpaca-Eval-2</th>
|
356 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">ArenaHard</th>
|
357 |
+
<th style="text-align:center; background-color: #001d6c; color: white;">Alpaca-Eval-2</th>
|
358 |
+
</tr></thead>
|
359 |
+
<tbody>
|
360 |
+
<tr>
|
361 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.1-8B-Instruct</td>
|
362 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">37.58</td>
|
363 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">30.34</td>
|
364 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">-</td>
|
365 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">-</td>
|
366 |
+
</tr>
|
367 |
+
<tr>
|
368 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.1-2B-Instruct</td>
|
369 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">23.3</td>
|
370 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">27.17</td>
|
371 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">-</td>
|
372 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">-</td>
|
373 |
+
</tr>
|
374 |
+
<tr>
|
375 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;">Granite-3.2-2B-Instruct</td>
|
376 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">30.42</td>
|
377 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">31.65</td>
|
378 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">26.6</td>
|
379 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">34.51</td>
|
380 |
+
</tr>
|
381 |
+
<tr>
|
382 |
+
<td style="text-align:left; background-color: #DAE8FF; color: black;"><b>Granite-3.2-8B-Instruct</b></td>
|
383 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">40.54</td>
|
384 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">36.89</td>
|
385 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">55.25</td>
|
386 |
+
<td style="text-align:center; background-color: #DAE8FF; color: black;">61.19</td>
|
387 |
+
</tr>
|
388 |
+
</tbody>
|
389 |
+
</table>
|
390 |
+
|
391 |
**Training Data:**
|
392 |
Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive licenses and (2) internal synthetically generated data targeted to enhance reasoning capabilities.
|
393 |
<!-- A detailed attribution of datasets can be found in [Granite 3.2 Technical Report (coming soon)](#), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf). -->
|
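The "Thinking Ablation" table added in this change reports ArenaHard and Alpaca-Eval-2 scores with thinking off and on. As a reading aid, the minimal sketch below computes the per-model difference (Thinking=True minus Thinking=False) from the values in that table. The dictionary layout and the `thinking_deltas` helper are illustrative assumptions and are not part of the README or the commit.

```python
# Scores copied from the "Thinking Ablation" table.
# Each tuple is (ArenaHard, Alpaca-Eval-2); None marks the "-" cells.
scores = {
    "Granite-3.1-8B-Instruct": {"off": (37.58, 30.34), "on": None},
    "Granite-3.1-2B-Instruct": {"off": (23.3, 27.17), "on": None},
    "Granite-3.2-2B-Instruct": {"off": (30.42, 31.65), "on": (26.6, 34.51)},
    "Granite-3.2-8B-Instruct": {"off": (40.54, 36.89), "on": (55.25, 61.19)},
}


def thinking_deltas(table):
    """Hypothetical helper: return {model: (arena_hard_delta, alpaca_eval_2_delta)}.

    Only models that report scores for both settings are included.
    """
    deltas = {}
    for model, row in table.items():
        if row["on"] is None:
            continue  # no Thinking=True result reported for this model
        deltas[model] = tuple(
            round(on - off, 2) for on, off in zip(row["on"], row["off"])
        )
    return deltas


deltas = thinking_deltas(scores)
for model, (arena, alpaca) in deltas.items():
    print(f"{model}: ArenaHard {arena:+.2f}, Alpaca-Eval-2 {alpaca:+.2f}")
```

On these numbers, Granite-3.2-8B-Instruct scores higher on both benchmarks with thinking enabled, while Granite-3.2-2B-Instruct scores lower on ArenaHard and higher on Alpaca-Eval-2.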