This model is built on top of Llama 2 7B and adapted to Arabic using 30K target-language sentences sampled from CC-100.
Use the code below to get started with the model.
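from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModelForCausalLM

# Load the model weights from the Hub repository
model = AutoModelForCausalLM.from_pretrained(
    "atsuki-yamaguchi/Llama-2-7b-hf-ar-30K-mean"
)
# Attach the PEFT adapter stored in the same repository
model = PeftModelForCausalLM.from_pretrained(
    model,
    "atsuki-yamaguchi/Llama-2-7b-hf-ar-30K-mean"
)
# Fold the adapter weights into the model and drop the PEFT wrapper
model = model.merge_and_unload()

# Load the tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained(
    "atsuki-yamaguchi/Llama-2-7b-hf-ar-30K-mean"
)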
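Once the model and tokenizer are loaded, text generation might look like the sketch below. The helper name `generate_text`, the greedy decoding setting, and the `max_new_tokens` default are illustrative assumptions, not settings taken from the paper.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Greedy-decode a continuation of `prompt` (hypothetical helper).

    `model` and `tokenizer` are assumed to be the objects loaded above.
    """
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False selects greedy decoding; this is an assumption,
    # not a recommended setting from the authors
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # Decode the first (and only) sequence in the batch
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate_text(model, tokenizer, "مرحبا")` would return the prompt followed by up to 32 newly generated tokens.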
@article{yamaguchi-etal-2024-effectively,
  title={How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?},
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  year={2024},
  journal={ArXiv},
  volume={abs/2406.11477},
  url={https://arxiv.org/abs/2406.11477},
}
Base model: meta-llama/Llama-2-7b-hf