# 🌿 RNAtranslator

*Modeling protein-conditional RNA design as sequence-to-sequence natural language translation*
## Overview
RNAtranslator is a generative transformer model that reframes RNA design as a sequence-to-sequence translation problem, treating proteins and RNAs as distinct biological "languages". Trained on millions of RNA–protein interactions, it generates novel RNA sequences with high binding affinity and biological plausibility — no post-optimization required.
This opens new frontiers in RNA therapeutics and synthetic biology, especially for undruggable proteins.
## Architecture

RNAtranslator is based on an encoder–decoder Transformer (T5) architecture:
- Encoder: Receives a protein sequence as input
- Decoder: Predicts a binding RNA sequence conditioned on the encoded protein (see the sketch below)
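A minimal sketch of that split, loading the checkpoint as a standard Hugging Face T5 model. The token ids here are arbitrary placeholders rather than real tokenizer output, and the usual T5 decoder start token is assumed; see the Usage section below for the real tokenizers:

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SobhanShukueian/rnatranslator")

# Placeholder ids stand in for tokenized sequences (see Usage for the real tokenizers).
protein_ids = torch.tensor([[5, 6, 7, 8]])  # encoder input: protein tokens
rna_ids = torch.tensor([[9, 10, 11]])       # decoder target: RNA tokens

out = model(input_ids=protein_ids, labels=rna_ids)  # teacher-forced forward pass
print(out.loss)  # cross-entropy over the predicted RNA tokens
```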
## Key Features
- Protein→RNA Translation: Treats protein-to-RNA mapping like natural language translation
- Trained on 38M+ interactions: 26M from RNAInter plus 12M experimentally validated samples
- End-to-end generation: No need for hand-crafted rules or post-processing
- Dual-tokenizer support: Separate tokenizers for encoder (protein) and decoder (RNA); see the sketch after this list
- Multi-GPU training with Hugging Face Accelerate
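A quick check of the dual-tokenizer setup (a sketch; repo id and subfolder names as in the Usage section below):

```python
from transformers import PreTrainedTokenizerFast

repo = "SobhanShukueian/rnatranslator"
protein_tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="protein_tokenizer")
rna_tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="rna_tokenizer")

# Each biological "language" has its own vocabulary.
print(len(protein_tok), "protein tokens |", len(rna_tok), "RNA tokens")
```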
## Usage
```python
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast


def postprocess_rna(rna):
    """Map the model's internal RNA alphabet back to nucleotides
    (b/B -> A, j/J -> C, u -> U, z/Z -> G) and strip spaces."""
    table = str.maketrans({"b": "A", "B": "A", "j": "C", "J": "C",
                           "u": "U", "z": "G", "Z": "G", " ": None})
    return rna.translate(table)


# Load model
model = T5ForConditionalGeneration.from_pretrained("SobhanShukueian/rnatranslator")

# Load separate tokenizers for the encoder (protein) and decoder (RNA)
protein_tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "SobhanShukueian/rnatranslator", subfolder="protein_tokenizer")
rna_tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "SobhanShukueian/rnatranslator", subfolder="rna_tokenizer")

# Target protein sequence
protein_seq = "MSGGGVIRGPAGNNDCRIYVGNLPPDIRTKDIEDVFYKYGAIRDIDLKNRRGGPPFAFVEFEDPRDAEDAVYGRDGYDYDGYRLRVEFPRSGRGTGRGGGGGGGGGAPRGRYGPPSRRSENRVVVSGLPPSGSWQDLKDHMREAGDVCYADVYRDGTGVVEFVRKEDMTYAVRKLDNTKFRSHEGETAYIRVKVDGPRSPSYGRSRSRSRSRSRSRSRSNSRSRSYSPRRSRGSPRYSPRHSRSRSRT"
inputs = protein_tokenizer(protein_seq, return_tensors="pt").input_ids

# Generate an RNA sequence by sampling
gen_args = {
    "max_length": 256,                  # cap on generated length (tokens)
    "repetition_penalty": 1.5,          # discourage repeated output tokens
    "encoder_repetition_penalty": 1.3,  # exponential penalty on tokens absent from the encoder input
    "num_return_sequences": 1,
    "top_k": 30,                        # sample from the 30 most likely tokens
    "temperature": 1.5,
    "num_beams": 1,
    "do_sample": True,
}
outputs = model.generate(inputs, **gen_args)

rna_sequence = rna_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(postprocess_rna(rna_sequence))
```
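Because decoding uses sampling (`do_sample=True`), repeated calls yield different RNAs. A natural follow-up, continuing from the snippet above, is to draw several candidates per protein and deduplicate:

```python
# Continues from the snippet above: sample several candidate RNAs at once.
gen_args["num_return_sequences"] = 5
outputs = model.generate(inputs, **gen_args)

candidates = {
    postprocess_rna(rna_tokenizer.decode(seq, skip_special_tokens=True))
    for seq in outputs
}
for rna in sorted(candidates):
    print(rna)
```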
## Training
RNAtranslator was trained using Hugging Face's Accelerate on multi-GPU systems.
Hyperparameters, tokenizer training code, and data pipelines are all available in the GitHub repository:
https://github.com/ciceklab/RNAtranslator
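Purely as an illustration of that Accelerate pattern (not the repository's actual script), a minimal multi-GPU training loop might look like the sketch below, with a tiny random T5 and synthetic batches standing in for the real model and data pipeline:

```python
import torch
from accelerate import Accelerator
from transformers import T5Config, T5ForConditionalGeneration

accelerator = Accelerator()  # run with: accelerate launch train_sketch.py

# Tiny random model and synthetic data, standing in for the real pipeline.
config = T5Config(vocab_size=64, d_model=64, d_ff=128,
                  num_layers=2, num_decoder_layers=2, num_heads=4)
model = T5ForConditionalGeneration(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

batches = [
    {"input_ids": torch.randint(1, 64, (4, 32)),  # "protein" tokens
     "labels": torch.randint(1, 64, (4, 24))}     # "RNA" tokens
    for _ in range(10)
]

model, optimizer = accelerator.prepare(model, optimizer)

model.train()
for batch in batches:
    batch = {k: v.to(accelerator.device) for k, v in batch.items()}
    loss = model(**batch).loss  # teacher-forced seq2seq loss
    accelerator.backward(loss)  # handles gradient sync across GPUs
    optimizer.step()
    optimizer.zero_grad()
```

Launched with `accelerate launch`, the same script runs unmodified on one or many GPUs.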
## Files & Structure
This model repository contains:
- `model.safetensors` – Fine-tuned model weights
- `config.json` – Model configuration
- `protein_tokenizer/` – Tokenizer for protein sequences (encoder)
- `rna_tokenizer/` – Tokenizer for RNA sequences (decoder)
- `README.md` – This model card
## Citation

If you use RNAtranslator in your research, please cite the preprint:
https://www.biorxiv.org/content/10.1101/2025.03.04.641375v1
## License
CC BY-NC-SA 2.0 — for academic use only.
For commercial licensing, please contact the authors.
Full code and documentation: https://github.com/ciceklab/RNAtranslator