BiooBang-1.0
A biological dual-language foundation model for protein/RNA representation learning and CDS generation
BiooBang is a biological language model that integrates protein amino acid sequences and mRNA coding sequences (CDS) within a unified framework. Built on a Transformer-based prefix decoder architecture, it treats protein and CDS sequences as two biological “languages” and applies natural language processing principles to train on both in a self-supervised manner.
The source code corresponding to this model is publicly available on GitHub at: https://github.com/lonelycrab888/BiooBang
After installing BiooBang, you can use it as follows:
import torch

# ========== Set device
device = "cuda:0"

# ========== Prepare data
data = [
    ("Protein", "MASSDKQTSPKPPPSPSPLRNSKFCQSNMRILIS"),
    ("RNA", "ATGGCGTCTAGTGATAAACAAACAAGCCCAAAGCCTCCTCCTTCACCGTCTCCTCTCCGTAATT")
]

# ========== BiooBang model
from model.modeling_UniBioseq import UniBioseqForEmbedding
from model.tokenization_UniBioseq import UBSLMTokenizer

model = UniBioseqForEmbedding.from_pretrained("lonelycrab88/BiooBang-1.0")
tokenizer = UBSLMTokenizer.from_pretrained("lonelycrab88/BiooBang-1.0")
model.eval()
model.to(device)

# ========== Get embeddings
embeddings = {}
hidden_states = {}
for name, input_seq in data:
    input_ids = tokenizer(input_seq, return_tensors="pt")["input_ids"].to(device)
    with torch.no_grad():
        output = model(input_ids)
        # sequence-level embedding
        embeddings[name] = output.logits
        # last hidden states (per-token embeddings), dropping the first and last positions (special tokens)
        hidden_states[name] = output.hidden_states[:, 1:-1, :]
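# ========== (Optional) Pool token embeddings into fixed-length vectors
# A minimal sketch, not part of the original example: it assumes each entry of
# hidden_states has shape (1, seq_len, hidden_dim) and mean-pools over the sequence
# dimension to obtain one vector per sequence (e.g. for similarity search or clustering).
pooled = {name: hs.mean(dim=1).squeeze(0) for name, hs in hidden_states.items()}
print({name: tuple(vec.shape) for name, vec in pooled.items()})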
# ========== Generate CDS
from transformers.generation.logits_process import LogitsProcessorList

from model.UBL_utils import CodonLogitsProcessor
from model.modeling_UniBioseq import UniBioseqForCausalLM

tokenizer = UBSLMTokenizer.from_pretrained("lonelycrab88/BiooBang-1.0")
model = UniBioseqForCausalLM.from_pretrained("lonelycrab88/BiooBang-1.0", device_map="auto")

input_protein = "MASSDKQTSPKPPPSPSPLRNSKFCQSNMRILIS"
# encode the protein prompt and append special token id 36, as in the original example
input_ids = torch.tensor([tokenizer.encode(input_protein) + [36]]).to(model.device)
max_length = 4 * len(input_protein) + 6

# constrain generation with BiooBang's codon-aware logits processor
logits_processor = LogitsProcessorList()
logits_processor.append(CodonLogitsProcessor(input_protein, tokenizer, len(input_protein)))

result = model.generate(
    input_ids,
    max_length=max_length,
    num_beams=10,
    logits_processor=logits_processor,
    low_memory=True,
    num_return_sequences=1,
)
# skip the protein prompt and leading special tokens, then decode the generated CDS
result_CDS_tok = tokenizer.decode(result[0][len(input_protein) + 3:].tolist()).replace(" ", "").upper()
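As a quick sanity check (not part of the original example), you can translate the generated CDS back into amino acids and confirm it matches the input protein. The sketch below assumes Biopython is available; if the codon constraints behaved as intended, the translation should equal input_protein.

# Hypothetical sanity check, assuming Biopython is installed.
from Bio.Seq import Seq

translated = str(Seq(result_CDS_tok).translate(to_stop=True))
print("Translation matches input protein:", translated == input_protein)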
Please cite our paper:
@article{Zhao2024.10.24.620004,
  author = {Zhao, Heng-Rui and Cheng, Meng-Ting and Zhu, Jinhua and Wang, Hao and Yang, Xiang-Rui and Wang, Bo and Sun, Yuan-Xin and Fang, Ming-Hao and Chen, Enhong and Li, Houqiang and Han, Shu-Jing and Chen, Yuxing and Zhou, Cong-Zhao},
  title = {Integration of protein and coding sequences enables mutual augmentation of the language model},
  elocation-id = {2024.10.24.620004},
  year = {2024},
  doi = {10.1101/2024.10.24.620004},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004},
  eprint = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004.full.pdf},
  journal = {bioRxiv}
}
If you’re interested in other cell lines and open to collaboration, please don’t hesitate to contact us!