Bielik-4.5B-v3
Bielik-4.5B-v3 is a generative text model with 4.6 billion parameters. It is the result of a unique collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. The model was developed and trained on Polish text corpora carefully selected and processed by the SpeakLeash team, using Poland's large-scale computing infrastructure within the PLGrid environment, specifically at the HPC center ACK Cyfronet AGH. Training was supported by computational grants no. PLG/2024/017214 and PLG/2025/018338 and conducted on the Athena and Helios supercomputers, providing the cutting-edge technology and computational resources essential for large-scale machine learning. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision.
⚠️ This is a base model intended for further fine-tuning across most use cases. If you're looking for a model ready for chatting or following instructions out-of-the-box, please use Bielik-4.5B-v3-Instruct.
📚 Technical report: https://arxiv.org/abs/2505.02550
Model
Bielik-4.5B-v3 was trained on the Helios supercomputer at ACK Cyfronet AGH, utilizing 256 NVIDIA GH200 cards.
The training dataset was composed of Polish texts collected and made available through the SpeakLeash project, as well as a subset of CommonCrawl data. We used 292 billion tokens for 1.2 epochs of training.
The Bielik-4.5B-v3 model was trained with ALLaMo, an original open-source framework implemented by Krzysztof Ociepa. This framework allows users to train language models with architectures similar to LLaMA and Mistral quickly and efficiently.
Model description:
- Developed by: SpeakLeash & ACK Cyfronet AGH
- Language: Polish
- Model type: causal decoder-only
- Initialized from: Qwen2.5 3B
- License: Apache 2.0 and Terms of Use
Quality evaluation
An XGBoost classification model was built to evaluate the quality of native Polish texts. It is based on 93 features, such as the ratio of out-of-vocabulary (OOV) words to all words, the number of nouns and verbs, and average sentence length. The model assigns each document a quality category (HIGH, MEDIUM, or LOW) together with a probability. This enabled a dedicated document-selection pipeline: we kept only entries rated HIGH with a probability exceeding 90%.
This filtering and selection process yields a condensed, high-quality corpus of Polish texts for training.
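As an illustration, a minimal sketch of such a filtering step is shown below. The feature extraction, the classifier file name, and the class ordering are assumptions made for the example; only the 93-feature XGBoost classifier and the HIGH-with-probability-above-90% rule come from the description above.

import numpy as np
import xgboost as xgb

LABELS = ["LOW", "MEDIUM", "HIGH"]  # assumed class ordering

def extract_features(doc: str) -> np.ndarray:
    # Toy stand-in for the 93 linguistic features described above;
    # a real implementation would compute the OOV ratio, part-of-speech
    # counts, average sentence length, and so on.
    feats = np.zeros(93)
    words = doc.split()
    feats[0] = len(words) / max(doc.count("."), 1)  # rough average sentence length
    return feats.reshape(1, -1)

clf = xgb.XGBClassifier()
clf.load_model("quality_classifier.json")  # hypothetical trained model file

def keep_document(doc: str, threshold: float = 0.9) -> bool:
    probs = clf.predict_proba(extract_features(doc))[0]
    best = int(np.argmax(probs))
    # Keep only documents rated HIGH with probability above 90%.
    return LABELS[best] == "HIGH" and probs[best] > threshold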
Quickstart
This model can be easily loaded using the AutoModelForCausalLM functionality.
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "speakleash/Bielik-4.5B-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
To reduce memory usage, you can load the model in a smaller precision (bfloat16):
import torch
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
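If a GPU is available, you can additionally let Transformers place the weights on it automatically with device_map="auto". This is a minimal sketch continuing the snippet above; it assumes the accelerate package is installed.

import torch
from transformers import AutoModelForCausalLM

# Load in bfloat16 and let Accelerate distribute the weights
# across the available devices (requires the `accelerate` package).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)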
You can then use Hugging Face pipelines to generate text:
import transformers

# "The most important goal of man on earth is"
text = "Najważniejszym celem człowieka na ziemi jest"

pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
sequences = pipeline(text, max_new_tokens=100, do_sample=True, top_k=50, eos_token_id=tokenizer.eos_token_id)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Generated output:
Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami. (English: "The most important goal of man on earth is to live in peace, harmony, and love. For each of us, it is very important to surround ourselves with loved ones.")
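If you prefer calling the model directly instead of through a pipeline, an equivalent sketch using the standard generate API with the same sampling settings looks like this:

import torch

# Tokenize the prompt and move it to the model's device.
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample up to 100 new tokens with the same settings as the pipeline above.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))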
Limitations and Biases
Bielik-4.5B-v3 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.
Bielik-4.5B-v3 can produce factually incorrect output and should not be relied on to produce factually accurate information. It was trained on various public datasets; while great effort was taken to clean the training data, the model may still generate lewd, false, biased, or otherwise offensive outputs.
Citation
Please cite this model using the following format:
@misc{ociepa2025bielikv3smalltechnical,
title={Bielik v3 Small: Technical Report},
author={Krzysztof Ociepa and Łukasz Flis and Remigiusz Kinas and Krzysztof Wróbel and Adrian Gwoździej},
year={2025},
eprint={2505.02550},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.02550},
}
@misc{Bielik45Bv3,
title = {Bielik-4.5B-v3 model card},
author = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Gwoździej, Adrian and {SpeakLeash Team} and {Cyfronet Team}},
year = {2025},
url = {https://huggingface.co/speakleash/Bielik-4.5B-v3},
note = {Accessed: 2025-05-06},
urldate = {2025-05-06}
}
Responsible for training the model
- Krzysztof Ociepa (SpeakLeash) - team leadership, conceptualizing, data preparation, process optimization and oversight of training
- Łukasz Flis (Cyfronet AGH) - coordinating and supervising the training
- Remigiusz Kinas (SpeakLeash) - conceptualizing, coordinating RL training, data preparation, benchmarking and quantizations
- Adrian Gwoździej (SpeakLeash) - data preparation and ensuring data quality
- Krzysztof Wróbel (SpeakLeash) - benchmarks
The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model: Sebastian Kondracki, Igor Ciuciura, Szymon Baczyński, Jacek Chwiła, Dominika Basaj, Kuba Sołtys, Karol Jezierski, Anna Przybył, Agnieszka Ratajska, Witold Wydmański, Izabela Babis, Nina Babis.
Members of the ACK Cyfronet AGH team providing valuable support and expertise: Szymon Mazurek, Marek Magryś, Mieszko Cholewa.
We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC center: ACK Cyfronet AGH) for providing computing facilities and support within computational grants no. PLG/2024/017214 and PLG/2025/018338.
Contact Us
If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our SpeakLeash Discord.