---
language: bg
tags:
- gpt2
- lora
- bulgarian
- causal-lm
license: mit
datasets:
- cc100
model-index:
- name: GPT-2 Bulgarian LoRA Adapter (Merged)
  results: []
---

# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged)

**I will be training on a much larger sample in the coming days (1k is small, but my computer's bandwidth is smaller).**

This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the [LoRA](https://arxiv.org/abs/2106.09685) technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using [PEFT](https://github.com/huggingface/peft).

## 🔧 Model Details

- **Base Model**: `openai-community/gpt2-medium`
- **LoRA Rank**: 8
- **Target Modules**: `c_attn`
- **Dataset**: `cc100.bg` (1,000 filtered samples)
- **Max Seq Length**: 512 tokens
- **Batch Size**: 2 (with gradient accumulation)
- **Steps**: 1000
- **Merged Model**: Yes (LoRA weights fused into the base model; see the configuration sketch at the end of this card)

## 💬 Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Prompt: "Bulgaria is known for its..."
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📈 Intended Use

For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian.

## ⚠️ Limitations

- Trained on a small 1k-sample subset.
- No toxic-content filtering or safety tuning.
- Should not be used in production without further validation.

## 👤 Author

Developed by [Vanessa Beck](https://github.com/stochastic-sisyphus) on Google Colab using 🤗 Transformers + PEFT.
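
## 🛠️ LoRA Configuration and Merge (Sketch)

The following is a minimal sketch of how the LoRA setup described in the Model Details section can be configured and merged with PEFT. Only the rank (8) and target modules (`c_attn`) come from this card; `lora_alpha`, `lora_dropout`, and the training loop itself are assumptions and may differ from what was actually used.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model this adapter was trained on.
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Rank and target modules match the card; alpha and dropout are assumed values.
lora_config = LoraConfig(
    r=8,
    target_modules=["c_attn"],
    lora_alpha=16,      # assumption, not documented above
    lora_dropout=0.05,  # assumption, not documented above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable

# ... fine-tune on the filtered cc100.bg sample (e.g. with transformers.Trainer) ...

# Fuse the LoRA weights into the base model and save the merged checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging removes the runtime dependency on PEFT: the saved checkpoint loads with plain `AutoModelForCausalLM.from_pretrained`, as in the usage example above.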