llama3-biotoken3pretrain-kaniwa

This is a LoRA adapter.

The base model is Llama 3 8B, 4-bit quantized by Unsloth: unsloth/llama-3-8b-bnb-4bit

The tokenizer includes four added "biotokens", one per nucleotide: ∎A, ∎C, ∎G, and ∎T.
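
For illustration, the biotokens behave like ordinary added tokens. The sketch below is an assumption about how they could be added on top of the base tokenizer (not the exact pretraining script); the resulting vocabulary size of 128,260 matches the resize_model_vocab value used in the finetuning example further down.

from transformers import AutoTokenizer

# Sketch (assumed, not the original script): extend the base Llama 3 tokenizer
# with the four biotokens.
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
num_added = tokenizer.add_tokens(["∎A", "∎C", "∎G", "∎T"])
print(num_added, len(tokenizer))  # 4 new tokens; vocab size 128256 + 4 = 128260
# The model's embedding matrix would then be resized to match, e.g.
# model.resize_token_embeddings(len(tokenizer))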

The dataset was ~20% of BYU's 2019 kaniwa (Chenopodium pallidicaule) genome, from https://genomevolution.org/coge/GenomeInfo.pl?gid=53872

The adapter was finetuned for several hours on an A100 GPU. The data was split into snippets of roughly 6,000 nucleotides, each wrapped in an Alpaca-like message format (see the sample message and preprocessing sketch below).

Training Notebook (before copying over to Lambda): https://colab.research.google.com/drive/1IrRBC2LKlU7_7zjzmmzslT0uDOacwyfO?usp=sharing

Sample message:

Write information about the nucleotide sequence.

### Sequence:
∎G∎C∎C∎T∎A∎T∎A∎G∎T∎G∎T∎G∎T∎A∎G...

### Annotation:
Information about location in the kaniwa chromosome: >lcl|Cp5
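
As a rough illustration of the preprocessing, a snippet and its FASTA header could be wrapped into this message format as follows. This is a hypothetical helper, not the actual training script; to_training_text, snippet, and fasta_header are made-up names.

def to_training_text(snippet, fasta_header):
    qed = "∎"  # prepended to each nucleotide, as in pretraining
    biotokens = "".join(qed + nt.upper() for nt in snippet)
    return (
        "Write information about the nucleotide sequence.\n\n"
        "### Sequence:\n" + biotokens + "\n\n"
        "### Annotation:\n"
        "Information about location in the kaniwa chromosome: " + fasta_header
    )

print(to_training_text("GCCTATAGTGTGTAG", ">lcl|Cp5"))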

Usage

Inference with DNA sequence

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa", load_in_4bit=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa")
tokenizer.pad_token = tokenizer.eos_token # pad fix

qed = "∎" # from math symbols, used in pretraining
sequence = "".join([(qed + nt.upper()) for nt in "GCCTATAGTGTGTAGCTAATGAGCCTAGGTTATCGACCCTAATCT"])

# prompt pieces following the sample message format above
# (exact whitespace in the original training format may differ slightly)
prefix = "Write information about the nucleotide sequence.\n\n### Sequence:\n"
annotation = "\n\n### Annotation:\n"

inputs = tokenizer(f"{prefix}{sequence}{annotation}", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
sample = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
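
The decoded string repeats the prompt, so the newly generated text can be read after the "### Annotation:" marker. This is simple post-processing added here for convenience, not part of the original card:

# everything after the annotation header is the model's generated text
generated = sample.split("### Annotation:")[-1].strip()
print(generated)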

LoRA finetuning on a new task

from transformers import AutoTokenizer
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, _ = FastLanguageModel.from_pretrained(
    model_name = "monsoon-nlp/llama3-biotoken3pretrain-kaniwa",
    max_seq_length = 6_500, # max 6,000 bp for AgroNT tasks
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab=128260, # includes biotokens
)
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa")
tokenizer.pad_token = tokenizer.eos_token # pad fix

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    ... # remaining arguments (dataset, training args) elided; see the sketch below
)
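
For completeness, here is one way the elided trainer arguments might look. The dataset file, hyperparameters, and field names are placeholders, not the settings used for this adapter, and depending on your trl version some arguments (e.g. dataset_text_field, max_seq_length) may belong in SFTConfig instead:

from datasets import load_dataset
from transformers import TrainingArguments

# hypothetical text dataset, one training example per line
dataset = load_dataset("text", data_files={"train": "new_task.txt"})["train"]

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 6_500,
    args = TrainingArguments(
        output_dir = "outputs",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
    ),
)
trainer.train()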

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Genome Citation

Mangelson H, et al. The genome of Chenopodium pallidicaule: an emerging Andean super grain. Appl. Plant Sci. 2019;7:e11300. doi: 10.1002/aps3.11300
