---
language: bg
tags:
- gpt2
- lora
- bulgarian
- causal-lm
license: mit
datasets:
- cc100
model-index:
- name: GPT-2 Bulgarian LoRA Adapter (Merged)
results: []
---
# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged)
**I will be training on a much larger sample in the coming days (1k samples is small, but my computer's bandwidth is smaller).**
This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the [LoRA](https://arxiv.org/abs/2106.09685) technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using [PEFT](https://github.com/huggingface/peft).
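The exact filtering criteria for the training sample are not documented; as a rough, hypothetical sketch, a sample like this could be drawn from the streamed Bulgarian split of CC100 (the 50-character minimum length is an assumption):

```python
from datasets import load_dataset

# Stream the Bulgarian split of CC100 and keep the first 1,000 non-trivial lines.
# NOTE: the minimum-length filter below is an assumption, not the criterion actually used.
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)

samples = []
for record in stream:
    text = record["text"].strip()
    if len(text) > 50:            # assumed length filter
        samples.append(text)
    if len(samples) >= 1000:      # 1,000 filtered samples, as listed below
        break
```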
## 🔧 Model Details
- **Base Model**: `openai-community/gpt2-medium`
- **LoRA Rank**: 8
- **Target Modules**: `c_attn`
- **Dataset**: `cc100.bg` (1000 filtered samples)
- **Max Seq Length**: 512 tokens
- **Batch Size**: 2 (with gradient accumulation)
- **Steps**: 1000
- **Merged Model**: Yes (LoRA weights fused into the base model; see the training sketch below)
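A minimal sketch of the training-and-merge setup implied by the details above; the `lora_alpha`, dropout, and output directory are assumptions, not documented values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-medium")

# LoRA adapter targeting GPT-2's fused attention projection (`c_attn`), rank 8 as listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,             # assumed; not stated in this card
    lora_dropout=0.05,         # assumed; not stated in this card
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# ... 1,000 training steps on the CC100 sample (batch size 2 with gradient accumulation) ...

# Fuse the LoRA weights into the base model and save a standalone merged checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")      # hypothetical output directory
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging removes the runtime dependency on PEFT, which is why the published checkpoint can be loaded with a plain `AutoModelForCausalLM` as shown in the usage example below.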
## 💬 Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "your-username/gpt2-bulgarian-merged" with the actual repository id.
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Prompt: "Bulgaria is known for its"
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📈 Intended Use
For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian.
## ⚠️ Limitations
- Trained on a small sample of only 1,000 examples.
- No toxic content filtering or safety tuning.
- Should not be used in production without further validation.
## 👤 Author
Developed by [Vanessa Beck](https://github.com/stochastic-sisyphus) on Google Colab using 🤗 Transformers + PEFT.