|
--- |
|
language: bg |
|
tags: |
|
- gpt2 |
|
- lora |
|
- bulgarian |
|
- causal-lm |
|
license: mit |
|
datasets: |
|
- cc100 |
|
model-index: |
|
- name: GPT-2 Bulgarian LoRA Adapter (Merged) |
|
results: [] |
|
--- |
|
|
|
# 🤖 GPT-2 Bulgarian LoRA Adapter (Merged) |
|
|
|
**I will be training on a much larger sample in the coming days (1k is small, but my computer's bandwidth is smaller).**
|
|
|
This model is a fine-tuned and merged version of `openai-community/gpt2-medium`, adapted to Bulgarian using the [LoRA](https://arxiv.org/abs/2106.09685) technique. Training was performed on a filtered sample of the Bulgarian subset of the CC100 dataset using [PEFT](https://github.com/huggingface/peft). |
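
For reference, a sample like this can be drawn from the Bulgarian CC100 split with 🤗 Datasets as sketched below. The length-based filter and the 50-character threshold are stand-ins, since the exact filtering criteria are not documented here, and the loading call may vary across `datasets` versions.

```python
from datasets import load_dataset

# Stream the Bulgarian split of CC100 so the full corpus never has to be downloaded
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)

# Stand-in filter (the actual criteria used for this model are not specified):
# keep non-trivial lines, then take the first 1,000 examples
filtered = stream.filter(lambda ex: len(ex["text"].strip()) > 50)
sample = list(filtered.take(1000))

print(sample[0]["text"][:200])
```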
|
|
|
## 🔧 Model Details |
|
|
|
- **Base Model**: `openai-community/gpt2-medium` |
|
- **LoRA Rank**: 8 |
|
- **Target Modules**: `c_attn` |
|
- **Dataset**: `cc100.bg` (1000 filtered samples) |
|
- **Max Seq Length**: 512 tokens |
|
- **Batch Size**: 2 (with gradient accumulation) |
|
- **Steps**: 1000 |
|
- **Merged Model**: Yes (LoRA weights fused into the base model; see the training sketch below)
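
As a rough illustration of how this configuration might look in code, here is a minimal, hypothetical training and merge sketch using 🤗 Transformers + PEFT. The LoRA alpha, dropout, learning rate, and gradient accumulation values are placeholder assumptions not stated in this card; only the rank, target modules, batch size, sequence length, and step count above come from the actual run.

```python
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "openai-community/gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# 1k filtered CC100.bg sample (see the dataset sketch above; the filter is a stand-in)
stream = load_dataset("cc100", lang="bg", split="train", streaming=True)
sample = list(stream.filter(lambda ex: len(ex["text"].strip()) > 50).take(1000))

raw = Dataset.from_list(sample)
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

lora_config = LoraConfig(
    r=8,                        # rank from the card
    lora_alpha=16,              # assumption: not stated in the card
    lora_dropout=0.05,          # assumption: not stated in the card
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base_id), lora_config)

args = TrainingArguments(
    output_dir="gpt2-bulgarian-lora",
    per_device_train_batch_size=2,   # batch size from the card
    gradient_accumulation_steps=8,   # assumption: exact value not stated
    max_steps=1000,                  # steps from the card
    learning_rate=2e-4,              # assumption
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Fuse the LoRA deltas into the base weights so the result loads as a plain GPT-2 checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("gpt2-bulgarian-merged")
tokenizer.save_pretrained("gpt2-bulgarian-merged")
```

Merging with `merge_and_unload()` removes the PEFT dependency at inference time, which is why the usage example below loads the checkpoint with a plain `AutoModelForCausalLM`.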
|
|
|
## 💬 Example Usage |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "your-username" with the actual repository id of the merged model
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-bulgarian-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-bulgarian-merged")

# Bulgarian prompt: "Bulgaria is known for its..."
inputs = tokenizer("България е известна със своите", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
## 📈 Intended Use |
|
|
|
For educational purposes, experimentation, and research on low-resource language modeling in Bulgarian. |
|
|
|
## ⚠️ Limitations |
|
|
|
- Trained on a small 1k sample. |
|
- No toxic content filtering or safety tuning. |
|
- Should not be used in production without further validation. |
|
|
|
## 👤 Author |
|
|
|
Developed by [Vanessa Beck](https://github.com/stochastic-sisyphus) on Google Colab using 🤗 Transformers + PEFT. |