|
---
tags:
- BiooBang
- Deep Learning
- Language Model
- Bioinformatics
- Synthetic Biology
- HEK293T
- CDS generation
license: cc-by-nc-4.0
---
|
|
|
**BiooBang** is an advanced biological language model designed to integrate protein amino acid sequences and mRNA coding sequences (CDS) within a unified framework. Built on a Transformer-based prefix-decoder architecture, BiooBang applies the principles of natural language processing, treating protein and CDS sequences as biological “languages” and training on both jointly through self-supervised learning.
|
|
|
## Note |
|
This model was fine-tuned specifically for codon optimization in the HEK293T cell line. |
|
|
|
## Use Case |
|
The source code for this model is publicly available on GitHub: https://github.com/lonelycrab888/BiooBang
|
|
|
After installing BiooBang (see the repository README for setup instructions), you can generate an optimized CDS from a protein sequence as follows:
|
```python
# ========== Generate a CDS from a protein sequence ==========
import torch
from transformers.generation.logits_process import LogitsProcessorList

from model.tokenization_UniBioseq import UBSLMTokenizer
from model.modeling_UniBioseq import UniBioseqForCausalLM
from model.UBL_utils import CodonLogitsProcessor

tokenizer = UBSLMTokenizer.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T")
model = UniBioseqForCausalLM.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T", device_map="auto")

input_protein = "MASSDKQTSPKPPPSPSPLRNSKFCQSNMRILIS"

# Encode the protein prompt and append special token 36, which marks the
# boundary between the protein segment and the CDS to be generated.
input_ids = torch.tensor([tokenizer.encode(input_protein) + [36]]).to(model.device)

# Token budget: the prompt (one token per residue) plus the CDS
# (three nucleotide tokens per residue), stop codon, and special tokens.
max_length = 4 * len(input_protein) + 6

# Constrain beam search so that each generated codon encodes the
# corresponding residue of the input protein.
logits_processor = LogitsProcessorList()
logits_processor.append(CodonLogitsProcessor(input_protein, tokenizer, len(input_protein)))

result = model.generate(input_ids, max_length=max_length, num_beams=10, logits_processor=logits_processor, low_memory=True, num_return_sequences=1)

# Strip the prompt and leading special tokens, then decode the CDS as an uppercase string.
result_CDS_tok = tokenizer.decode(result[0][len(input_protein) + 3:].tolist()).replace(" ", "").upper()
```
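As a quick sanity check (a minimal sketch, not part of the BiooBang API), you can translate the generated CDS back into amino acids with the standard codon table and confirm it matches the input protein. Here `result_CDS_tok` and `input_protein` come from the snippet above:

```python
# Standard DNA codon table ("*" marks stop codons).
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Normalize U -> T in case the decoded sequence uses RNA notation.
back_translated = translate(result_CDS_tok.replace("U", "T"))
assert back_translated == input_protein, "generated CDS does not encode the input protein"
print("CDS verified:", result_CDS_tok)
```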
|
|
|
## Citing this Work |
|
|
|
Please cite our paper: |
|
|
|
```bibtex
@article{Zhao2024.10.24.620004,
  author = {Zhao, Heng-Rui and Cheng, Meng-Ting and Zhu, Jinhua and Wang, Hao and Yang, Xiang-Rui and Wang, Bo and Sun, Yuan-Xin and Fang, Ming-Hao and Chen, Enhong and Li, Houqiang and Han, Shu-Jing and Chen, Yuxing and Zhou, Cong-Zhao},
  title = {Integration of protein and coding sequences enables mutual augmentation of the language model},
  elocation-id = {2024.10.24.620004},
  year = {2024},
  doi = {10.1101/2024.10.24.620004},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004},
  eprint = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004.full.pdf},
  journal = {bioRxiv}
}
```
|
|
|
|
|
## Contacts |
|
|
|
If you’re interested in other cell lines and open to collaboration, please don’t hesitate to contact us! |
|
|