|
---
tags:
- BiooBang
- Deep Learning
- Language Model
- Bioinformatics
- Synthetic Biology
- HEK293T
- CDS generation
license: cc-by-nc-4.0
---
|
|
|
**BiooBang** is an advanced biological language model designed to integrate protein amino acid sequences and mRNA coding sequences (CDS) within a unified framework. Built on a Transformer-based prefix-decoder architecture, BiooBang applies the principles of natural language processing, treating protein and CDS sequences as biological “languages” and training on both jointly through self-supervised learning.
|
|
|
## Note |
|
This model was fine-tuned specifically for codon optimization in the HEK293T cell line. |
|
|
|
## Use Case |
|
The source code for this model is publicly available on GitHub: https://github.com/lonelycrab888/BiooBang
|
|
|
After installing BiooBang (see the repository README for setup instructions), you can generate an optimized CDS from a protein sequence as follows:
|
```python
# ========== Generate a CDS from a protein sequence ==========
import torch
from transformers.generation.logits_process import LogitsProcessorList

from model.tokenization_UniBioseq import UBSLMTokenizer
from model.modeling_UniBioseq import UniBioseqForCausalLM
from model.UBL_utils import CodonLogitsProcessor

tokenizer = UBSLMTokenizer.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T")
model = UniBioseqForCausalLM.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T", device_map="auto")

input_protein = "MASSDKQTSPKPPPSPSPLRNSKFCQSNMRILIS"

# Encode the protein prompt and append special token 36, which marks the
# boundary between the protein segment and the CDS to be generated.
input_ids = torch.tensor([tokenizer.encode(input_protein) + [36]]).to(model.device)

# Token budget: the prompt (one token per residue) plus the CDS
# (three nucleotide tokens per residue), stop codon, and special tokens.
max_length = 4 * len(input_protein) + 6

# Constrain beam search so that each generated codon encodes the
# corresponding residue of the input protein.
logits_processor = LogitsProcessorList()
logits_processor.append(CodonLogitsProcessor(input_protein, tokenizer, len(input_protein)))

result = model.generate(input_ids, max_length=max_length, num_beams=10, logits_processor=logits_processor, low_memory=True, num_return_sequences=1)

# Strip the prompt and leading special tokens, then decode the CDS as an uppercase string.
result_CDS_tok = tokenizer.decode(result[0][len(input_protein) + 3:].tolist()).replace(" ", "").upper()
```
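As a quick sanity check (a minimal sketch, not part of the BiooBang API), you can translate the generated CDS back into amino acids with the standard codon table and confirm it matches the input protein. Here `result_CDS_tok` and `input_protein` come from the snippet above:

```python
# Standard DNA codon table ("*" marks stop codons).
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Normalize U -> T in case the decoded sequence uses RNA notation.
back_translated = translate(result_CDS_tok.replace("U", "T"))
assert back_translated == input_protein, "generated CDS does not encode the input protein"
print("CDS verified:", result_CDS_tok)
```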
|
|
|
## Citing this Work |
|
|
|
Please cite our paper: |
|
|
|
```bibtex
@article{Zhao2024.10.24.620004,
  author = {Zhao, Heng-Rui and Cheng, Meng-Ting and Zhu, Jinhua and Wang, Hao and Yang, Xiang-Rui and Wang, Bo and Sun, Yuan-Xin and Fang, Ming-Hao and Chen, Enhong and Li, Houqiang and Han, Shu-Jing and Chen, Yuxing and Zhou, Cong-Zhao},
  title = {Integration of protein and coding sequences enables mutual augmentation of the language model},
  elocation-id = {2024.10.24.620004},
  year = {2024},
  doi = {10.1101/2024.10.24.620004},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004},
  eprint = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004.full.pdf},
  journal = {bioRxiv}
}
```
|
|
|
|
|
## Contacts |
|
|
|
If you’re interested in other cell lines and open to collaboration, please don’t hesitate to contact us! |
|
|