<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Bielik in European LLM Leaderboard</title>
<style>
body {
font-family: Arial, sans-serif;
max-width: 1200px;
margin: 0 auto;
padding: 20px;
background-color: #f5f5f5;
}
.header {
text-align: center;
padding: 20px;
background-color: #dbeaf9;
color: black;
border-radius: 8px;
margin-bottom: 20px;
}
.leaderboard {
background-color: white;
padding: 20px;
border-radius: 8px;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
th, td {
padding: 12px;
text-align: left;
border-bottom: 1px solid #ddd;
}
th {
background-color: #34495e;
color: white;
}
tr:hover {
background-color: #f8f9fa;
}
.model-name {
font-weight: bold;
color: #2c3e50;
}
.score {
font-weight: bold;
color: #27ae60;
}
iframe {
width: 100%;
height: 500px;
border: none;
margin: 20px 0;
border-radius: 8px;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
background-color: white;
}
.header h2 {
color: #3498db;
margin: 10px 0;
}
.header p {
margin: 10px 0;
line-height: 1.6;
font-size: 1.1em;
text-align: left;
}
/* Add styles for links */
a {
color: #3498db;
text-decoration: none;
transition: color 0.2s ease;
}
a:visited {
color: #8e44ad;
}
a:hover {
color: #2980b9;
text-decoration: underline;
}
</style>
</head>
<body>
<div class="header">
<h1>Bielik in European LLM Leaderboard</h1>
<p>Welcome to the performance showcase of Bielik, a state-of-the-art Polish language model. This leaderboard presents Bielik's capabilities compared to other models in the <a href="https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard" target="_blank">European LLM Leaderboard</a> (Thellmann, K., Stadler, B., Fromm, M., Schulze Buschhoff, J., Jude, A., Barth, F., Leveling, J., Flores-Herr, N., Köhler, J., Jäkel, R., & Ali, M. (2024). <a href="https://arxiv.org/abs/2410.08928">Towards Multilingual LLM Evaluation for European Languages</a>).</p>
<p>Bielik is designed specifically for Polish language understanding and generation, demonstrating strong performance across various natural language processing tasks.</p>
</div>
<div class="leaderboard">
<h2>Model Performance in Polish</h2>
<p>Bielik-11B-v2.3-Instruct demonstrates strong performance across various language understanding and reasoning tasks, achieving an impressive average score of 0.66. This places it as the third-best performing model in the evaluation, behind only Gemma-2-27b-Instruct and Meta-Llama-3.1-70B-Instruct, while outperforming larger models like Mixtral-8x7B.</p>
<p>Key highlights of Bielik's performance:</p>
<ul>
<li>Strong reasoning capabilities shown in ARC (0.69) and GSM8K (0.68) benchmarks, demonstrating effective scientific reasoning and mathematical problem-solving abilities</li>
<li>Excellent common sense understanding with a HellaSwag score of 0.71, matching top performers</li>
<li>Competitive performance in factual accuracy with a TruthfulQA score of 0.62, surpassing many larger models</li>
<li>Solid broad knowledge demonstrated by MMLU score of 0.63, showing good understanding across diverse subjects</li>
</ul>
<!-- This iframe shows results from the European LLM Leaderboard for the Polish language only -->
<iframe title="European LLM Leaderboard results for Polish" loading="lazy" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSDy-6yMwJEZ77uhrQuSgMMPRYJ7Z0lYIncbjucTtPj7dY-fbDWVTU4wSfXxs6xmwUdvMC72h0snZ0b/pubhtml?gid=0&amp;single=true&amp;widget=false&amp;headers=false&amp;chrome=false"></iframe>
<h2>Model Performance in German</h2>
<p>In German language evaluation, Bielik-11B-v2.3-Instruct shows good performance with an average score of 0.62, positioning it in the middle range of evaluated models. Despite being primarily trained on Polish and English data, the model demonstrates reasonable cross-lingual transfer capabilities:</p>
<ul>
<li>Competitive performance in TruthfulQA (0.59), matching or outperforming several larger models including Phi-3 variants and approaching Gemma-2-27b-Instruct's score</li>
<li>Solid reasoning capabilities in ARC (0.64) and GSM8K (0.65), showing effective transfer of mathematical and scientific reasoning skills to German</li>
<li>Acceptable performance in common sense tasks with HellaSwag (0.62), though trailing behind larger multilingual models</li>
<li>Reasonable broad knowledge transfer demonstrated by MMLU score (0.60), considering the model wasn't specifically trained on German data</li>
</ul>
<p>These results are particularly noteworthy as they demonstrate Bielik's ability to generalize to German despite not being explicitly trained on German language data. While it doesn't match the performance of top models like Meta-Llama-3.1-70B-Instruct (0.71) or Gemma-2-27b-Instruct (0.71), it maintains competitive performance against similarly sized models and shows promising cross-lingual capabilities.</p>
<!-- This iframe shows results from the European LLM Leaderboard only for German language -->
<iframe title="European LLM Leaderboard results for German" loading="lazy" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSDy-6yMwJEZ77uhrQuSgMMPRYJ7Z0lYIncbjucTtPj7dY-fbDWVTU4wSfXxs6xmwUdvMC72h0snZ0b/pubhtml?gid=2125166605&amp;single=true&amp;widget=false&amp;headers=false&amp;chrome=false"></iframe>
<h2>Model Performance in Czech</h2>
<p>In Czech language evaluation, Bielik-11B-v2.3-Instruct achieves a solid average score of 0.60, demonstrating notable cross-lingual transfer capabilities despite being primarily trained on Polish and English data. The model's performance places it in the top tier of evaluated models, showing interesting patterns across different tasks:</p>
<ul>
<li>Strong scientific reasoning capabilities shown in ARC (0.63), outperforming several dedicated multilingual models</li>
<li>Impressive mathematical problem-solving abilities with a GSM8K score of 0.60, surpassing larger models like Mixtral-8x7B-Instruct-v0.1 (0.50)</li>
<li>Moderate performance in common sense understanding with HellaSwag (0.59), showing reasonable transfer of contextual understanding to Czech</li>
<li>Consistent broad knowledge demonstrated by MMLU score (0.59), comparable to specialized multilingual models like Mistral-Nemo-Instruct</li>
<li>Reliable factual accuracy with a TruthfulQA score of 0.58, matching the performance of much larger models like Meta-Llama-3.1-70B-Instruct</li>
</ul>
<p>While top performers like Meta-Llama-3.1-70B-Instruct (0.71) and Gemma-2-27b-Instruct (0.70) maintain their leading positions, Bielik's performance is particularly noteworthy given that it wasn't explicitly trained on Czech data. The model demonstrates robust zero-shot cross-lingual transfer, performing comparably to or better than several larger models, including some variants of Mixtral-8x7B and Meta-Llama-3 series, especially in structured reasoning tasks like GSM8K.</p>
<!-- This iframe shows results from the European LLM Leaderboard only for Czech language -->
<iframe title="European LLM Leaderboard results for Czech" loading="lazy" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSDy-6yMwJEZ77uhrQuSgMMPRYJ7Z0lYIncbjucTtPj7dY-fbDWVTU4wSfXxs6xmwUdvMC72h0snZ0b/pubhtml?gid=1568229718&amp;single=true&amp;widget=false&amp;headers=false&amp;chrome=false"></iframe>
<h2>Model Performance on the Polish Translation Benchmark</h2>
<p>In the FLORES200 translation benchmark for Polish, Bielik-11B-v2.3-Instruct demonstrates competitive performance with an average BLEU score of 13.515, positioning it in the middle range of evaluated models. Particularly interesting is the model's asymmetric performance across the two translation directions:</p>
<ul>
<li>Strong performance in translating into Polish (target) with a BLEU score of 15.31, outperforming many larger models including Mixtral-8x7B variants and approaching top performers</li>
<li>More moderate performance in translating from Polish (source) with a BLEU score of 11.72, suggesting room for improvement in source language comprehension</li>
<li>Overall translation capabilities comparable to established models like Meta-Llama-3-8B-Instruct (13.25) and Mixtral-8x7B-v0.1 (14.11)</li>
</ul>
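The reported average works out to the unweighted mean of the two directional BLEU scores listed above; a minimal sketch of that arithmetic (assuming equal weighting of the two directions, which matches the reported figure exactly):

```python
# Directional BLEU scores for Bielik-11B-v2.3-Instruct on FLORES200 (Polish),
# as listed above.
bleu_to_polish = 15.31    # translating into Polish
bleu_from_polish = 11.72  # translating from Polish

# Assumed aggregation: unweighted mean of the two directions.
avg_bleu = (bleu_to_polish + bleu_from_polish) / 2
print(f"Average BLEU: {avg_bleu:.3f}")  # -> Average BLEU: 13.515
```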
<p>These results are particularly noteworthy considering Bielik's specialized training focus on Polish and English. While larger models like EuroLLM-9B-Instruct (20.65) and Meta-Llama-3.1-70B-Instruct (19.52) achieve higher overall scores, Bielik's strong performance in translating into Polish aligns with its design goals and demonstrates effective specialization for Polish language generation.</p>
<!-- This iframe shows results for LLM translation benchmark from the European LLM Leaderboard only for Polish language -->
<iframe title="European LLM Leaderboard translation benchmark results for Polish" loading="lazy" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSDy-6yMwJEZ77uhrQuSgMMPRYJ7Z0lYIncbjucTtPj7dY-fbDWVTU4wSfXxs6xmwUdvMC72h0snZ0b/pubhtml?gid=1450374773&amp;single=true&amp;widget=false&amp;headers=false&amp;chrome=false"></iframe>
<h2>Bielik 2.3 Performance on the Polish Translation Benchmark</h2>
<p>Detailed analysis of Bielik-11B-v2.3-Instruct's performance in the FLORES200 translation benchmark reveals interesting patterns across different language pairs with Polish. The model demonstrates asymmetric capabilities in translation directions, which aligns with its training focus on Polish and English:</p>
<ul>
<li>Strongest performance in English-Polish language pair (BLEU scores of 21.93 to Polish, 28.32 from Polish), reflecting the model's primary training languages</li>
<li>Notable performance with West Slavic languages:
<ul>
<li>Czech (19.30 to Polish, 14.58 from Polish)</li>
<li>Slovak (17.65 to Polish, though only 6.60 from Polish)</li>
</ul>
</li>
<li>Strong results with major European languages:
<ul>
<li>German (19.18 to Polish, 14.93 from Polish)</li>
<li>French (18.97 to Polish, 19.06 from Polish)</li>
<li>Portuguese (19.10 to Polish, 19.76 from Polish)</li>
</ul>
</li>
<li>Significantly lower performance with Baltic and Finno-Ugric languages:
<ul>
<li>Estonian (6.13 to Polish, 1.53 from Polish)</li>
<li>Lithuanian (7.99 to Polish, 1.28 from Polish)</li>
<li>Latvian (5.85 to Polish, 0.88 from Polish)</li>
</ul>
</li>
</ul>
<p>These results demonstrate that Bielik excels in its primary training languages (Polish-English) and shows strong transfer to linguistically similar languages or widely-spoken European languages. The model generally performs better when translating into Polish (average BLEU 15.31) compared to translating from Polish (average BLEU 11.36), suggesting stronger generation capabilities in its primary training language. However, performance drops significantly with less related language families, particularly Baltic and Finno-Ugric languages, indicating limitations in cross-linguistic transfer to more distant language families.</p>
<!-- This iframe shows results for LLM translation benchmark from the European LLM Leaderboard only for Polish language and Bielik 2.3 model -->
<iframe title="FLORES200 translation results for Bielik 2.3 and Polish" loading="lazy" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSDy-6yMwJEZ77uhrQuSgMMPRYJ7Z0lYIncbjucTtPj7dY-fbDWVTU4wSfXxs6xmwUdvMC72h0snZ0b/pubhtml?gid=1814445878&amp;single=true&amp;widget=false&amp;headers=false&amp;chrome=false"></iframe>
</div>
</body>
</html>