---
language: bg
tags:
  - gpt2
  - lora
  - bulgarian
  - causal-lm
license: mit
datasets:
  - cc100
model-index:
  - name: GPT-2 Bulgarian LoRA Adapter (Merged)
    results: []
---

# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged)

I will be training on a much larger sample in the coming days (1k examples is small, but my computer's bandwidth is smaller).

This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the LoRA technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using PEFT.

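The exact filtering pipeline is not included in this card. As a minimal, hypothetical sketch (assuming 🤗 Datasets in streaming mode and a simple length filter, which is not necessarily the criterion actually used), pulling 1,000 Bulgarian samples from CC100 could look like this:

```python
from datasets import load_dataset

# Stream the Bulgarian split of CC100 so the full corpus is never downloaded.
raw = load_dataset("cc100", lang="bg", split="train", streaming=True)

# Hypothetical filter: keep reasonably long lines and stop at 1,000 samples.
samples = []
for example in raw:
    text = example["text"].strip()
    if len(text) > 50:
        samples.append(text)
    if len(samples) >= 1000:
        break
```
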
## 🔧 Model Details

- **Base Model:** `openai-community/gpt2-medium`
- **LoRA Rank:** 8
- **Target Modules:** `c_attn`
- **Dataset:** `cc100` (Bulgarian subset, 1,000 filtered samples)
- **Max Seq Length:** 512 tokens
- **Batch Size:** 2 (with gradient accumulation)
- **Steps:** 1,000
- **Merged Model:** yes (LoRA weights fused into the base model)

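The training script is not part of this card, but a minimal sketch of the LoRA setup and final merge, assuming 🤗 PEFT and the hyperparameters listed above (the output directory name is illustrative), could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "openai-community/gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA configuration matching the details above: rank 8, applied to GPT-2's
# fused attention projection c_attn.
lora_config = LoraConfig(r=8, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... tokenize to 512 tokens and train for 1,000 steps with batch size 2 and
# gradient accumulation (e.g. with transformers.Trainer) ...

# After training, fuse the LoRA weights into the base model and save a plain
# GPT-2 checkpoint that loads without PEFT.
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging trades adapter flexibility for simpler deployment: the published checkpoint behaves like an ordinary GPT-2 model, as the usage example below shows.
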
## 💬 Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint; the LoRA weights are already fused, so PEFT is
# not needed at inference time.
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Prompt: "България е известна със своите" ("Bulgaria is known for its")
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📈 Intended Use

For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian.

## ⚠️ Limitations

- Trained on a small sample of only 1,000 examples.
- No toxic-content filtering or safety tuning has been applied.
- Should not be used in production without further validation.

## 👤 Author

Developed by Vanessa Beck on Google Colab using 🤗 Transformers + PEFT.