---
language: bg
tags:
- gpt2
- lora
- bulgarian
- causal-lm
license: mit
datasets:
- cc100
model-index:
- name: GPT-2 Bulgarian LoRA Adapter (Merged)
  results: []
---
# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged)
*I will be training on a much larger sample in the coming days (1k samples is small, but my computer's bandwidth is smaller).*
This model is a fine-tuned and merged version of [openai-community/gpt2-medium](https://huggingface.co/openai-community/gpt2-medium), adapted to Bulgarian using the LoRA technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using PEFT.
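The exact filtering procedure is not documented here; as a rough illustration only, a 1,000-example sample could be drawn from the Bulgarian CC100 split with 🤗 Datasets like this (a sketch, not the actual preprocessing script; the length threshold is an assumption):

```python
from datasets import load_dataset

# Stream the Bulgarian CC100 split and keep the first 1,000
# sufficiently long lines (hypothetical filter; the actual
# filtering criteria used for this model are not documented).
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)

texts = []
for example in stream:
    if len(example["text"].strip()) > 50:  # assumed length threshold
        texts.append(example["text"])
    if len(texts) == 1000:
        break
```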
## 🔧 Model Details
- **Base Model:** [openai-community/gpt2-medium](https://huggingface.co/openai-community/gpt2-medium)
- **LoRA Rank:** 8
- **Target Modules:** `c_attn`
- **Dataset:** `cc100` (Bulgarian split, 1000 filtered samples)
- **Max Seq Length:** 512 tokens
- **Batch Size:** 2 (with gradient accumulation)
- **Steps:** 1000
- **Merged Model:** Yes (LoRA weights fused into the base model; see the sketch after this list)
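For reference, the setup above corresponds roughly to the following PEFT configuration and merge step. This is a minimal sketch, not the actual training script; any hyperparameters beyond those listed in the table are assumptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")

# LoRA rank and target modules as listed above
config = LoraConfig(r=8, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)

# ... training loop / Trainer would go here ...

# Fuse the LoRA weights into the base model and drop the adapter layers,
# producing a plain GPT-2 checkpoint that loads without PEFT.
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")
```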
## 💬 Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model (no PEFT required at inference time)
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Prompt: "България е известна със своите" = "Bulgaria is known for its"
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📈 Intended Use
For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian.
## ⚠️ Limitations
- Trained on only a small sample of 1,000 examples.
- No toxic content filtering or safety tuning.
- Should not be used in production without further validation.
## 👤 Author
Developed by Vanessa Beck on Google Colab using 🤗 Transformers + PEFT.