|
--- |
|
language: bg |
|
tags: |
|
- gpt2 |
|
- lora |
|
- bulgarian |
|
- causal-lm |
|
license: mit |
|
datasets: |
|
- cc100 |
|
model-index: |
|
- name: GPT-2 Bulgarian LoRA Adapter (Merged) |
|
results: [] |
|
--- |
|
|
|
# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged) |
|
|
|
**I will be training on a much larger sample in the coming days (1k is small, but my computer's bandwidth is smaller).**
|
|
|
This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the [LoRA](https://arxiv.org/abs/2106.09685) technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using [PEFT](https://github.com/huggingface/peft). |
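
For reference, a sample like this can be drawn from the Bulgarian CC100 split with 🤗 Datasets as sketched below. The length-based filter and the 50-character threshold are stand-ins, since the exact filtering criteria are not documented here, and the loading call may vary across `datasets` versions.

```python
from datasets import load_dataset

# Stream the Bulgarian split of CC100 so the full corpus never has to be downloaded
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)

# Stand-in filter (the actual criteria used for this model are not specified):
# keep non-trivial lines, then take the first 1,000 examples
filtered = stream.filter(lambda ex: len(ex["text"].strip()) > 50)
sample = list(filtered.take(1000))

print(sample[0]["text"][:200])
```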
|
|
|
## 🔧 Model Details |
|
|
|
- **Base Model**: `openai-community/gpt2-medium` |
|
- **LoRA Rank**: 8 |
|
- **Target Modules**: `c_attn` |
|
- **Dataset**: `cc100.bg` (1000 filtered samples) |
|
- **Max Seq Length**: 512 tokens |
|
- **Batch Size**: 2 (with gradient accumulation) |
|
- **Steps**: 1000 |
|
- **Merged Model**: Yes (LoRA weights fused into the base model; see the training sketch below)
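
As a rough illustration of how this configuration might look in code, here is a minimal, hypothetical training and merge sketch using 🤗 Transformers + PEFT. The LoRA alpha, dropout, learning rate, and gradient accumulation values are placeholder assumptions not stated in this card; only the rank, target modules, batch size, sequence length, and step count above come from the actual run.

```python
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "openai-community/gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# 1k filtered CC100.bg sample (see the dataset sketch above; the filter is a stand-in)
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)
sample = list(stream.filter(lambda ex: len(ex["text"].strip()) > 50).take(1000))

raw = Dataset.from_list(sample)
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

lora_config = LoraConfig(
    r=8,                        # rank from the card
    lora_alpha=16,              # assumption: not stated in the card
    lora_dropout=0.05,          # assumption: not stated in the card
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base_id), lora_config)

args = TrainingArguments(
    output_dir="gpt2-bulgarian-lora",
    per_device_train_batch_size=2,   # batch size from the card
    gradient_accumulation_steps=8,   # assumption: exact value not stated
    max_steps=1000,                  # steps from the card
    learning_rate=2e-4,              # assumption
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Fuse the LoRA deltas into the base weights so the result loads as a plain GPT-2 checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging with `merge_and_unload()` removes the PEFT dependency at inference time, which is why the usage example below loads the checkpoint with a plain `AutoModelForCausalLM`.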
|
|
|
## 💬 Example Usage |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "your-username" with the actual repository id of the merged model
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Bulgarian prompt: "Bulgaria is known for its..."
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
## 📈 Intended Use |
|
|
|
For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian. |
|
|
|
## ⚠️ Limitations |
|
|
|
- Trained on a small 1k sample. |
|
- No toxic content filtering or safety tuning. |
|
- Should not be used in production without further validation. |
|
|
|
## 👤 Author |
|
|
|
Developed by [Vanessa Beck](https://github.com/stochastic-sisyphus) on Google Colab using 🤗 Transformers + PEFT. |