---
language: bg
tags:
- gpt2
- lora
- bulgarian
- causal-lm
license: mit
datasets:
- cc100
model-index:
- name: GPT-2 Bulgarian LoRA Adapter (Merged)
results: []
---
# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged)
**I will be training on a much larger sample in the coming days (1k samples is small, but my computer's bandwidth is smaller).**
This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the [LoRA](https://arxiv.org/abs/2106.09685) technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using [PEFT](https://github.com/huggingface/peft).
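The exact filtering criteria for the training sample are not documented; as a rough, hypothetical sketch, a sample like this could be drawn from the streamed Bulgarian split of CC100 (the 50-character minimum length is an assumption):

```python
from datasets import load_dataset

# Stream the Bulgarian split of CC100 and keep the first 1,000 non-trivial lines.
# NOTE: the minimum-length filter below is an assumption, not the criterion actually used.
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)

samples = []
for record in stream:
    text = record["text"].strip()
    if len(text) > 50:            # assumed length filter
        samples.append(text)
    if len(samples) >= 1000:      # 1,000 filtered samples, as listed below
        break
```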
## 🔧 Model Details
- **Base Model**: `openai-community/gpt2-medium`
- **LoRA Rank**: 8
- **Target Modules**: `c_attn`
- **Dataset**: `cc100.bg` (1000 filtered samples)
- **Max Seq Length**: 512 tokens
- **Batch Size**: 2 (with gradient accumulation)
- **Steps**: 1000
- **Merged Model**: Yes (LoRA weights fused into the base model; see the training sketch below)
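A minimal sketch of the training-and-merge setup implied by the details above; the `lora_alpha`, dropout, and output directory are assumptions, not documented values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-medium")

# LoRA adapter targeting GPT-2's fused attention projection (`c_attn`), rank 8 as listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,             # assumed; not stated in this card
    lora_dropout=0.05,         # assumed; not stated in this card
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# ... 1,000 training steps on the CC100 sample (batch size 2 with gradient accumulation) ...

# Fuse the LoRA weights into the base model and save a standalone merged checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")      # hypothetical output directory
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging removes the runtime dependency on PEFT, which is why the published checkpoint can be loaded with a plain `AutoModelForCausalLM` as shown in the usage example below.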
## 💬 Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "your-username/gpt2-bulgarian-merged" with the actual repository id.
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Prompt: "Bulgaria is known for its"
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📈 Intended Use
For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian.
## ⚠️ Limitations
- Trained on a small sample of only 1,000 examples.
- No toxic content filtering or safety tuning.
- Should not be used in production without further validation.
## 👤 Author
Developed by [Vanessa Beck](https://github.com/stochastic-sisyphus) on Google Colab using 🤗 Transformers + PEFT.