🌿 RNAtranslator

Modeling protein-conditional RNA design as sequence-to-sequence natural language translation

bioRxiv: https://www.biorxiv.org/content/10.1101/2025.03.04.641375v1 · GitHub: https://github.com/ciceklab/RNAtranslator


Overview

RNAtranslator is a generative transformer model that reframes RNA design as a sequence-to-sequence translation problem, treating proteins and RNAs as distinct biological "languages". Trained on millions of RNA–protein interactions, it generates novel RNA sequences with high binding affinity and biological plausibility — no post-optimization required.

This opens new frontiers in RNA therapeutics and synthetic biology, especially for undruggable proteins.


Architecture

RNAtranslator is based on an encoder–decoder Transformer (T5) architecture:

  • Encoder: Receives a protein sequence as input
  • Decoder: Predicts a binding RNA sequence conditioned on the encoded protein
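
Under the hood this is the standard Hugging Face seq2seq interface, so a protein–RNA pair can also be scored with teacher forcing. A minimal sketch, assuming only the public transformers API; the protein and RNA strings are toy placeholders, and the RNA string uses the internal b/j/u/z alphabet that postprocess_rna in the Usage section maps back to A/C/U/G:

import torch
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

model = T5ForConditionalGeneration.from_pretrained("SobhanShukueian/rnatranslator")
protein_tok = PreTrainedTokenizerFast.from_pretrained(
    "SobhanShukueian/rnatranslator", subfolder="protein_tokenizer")
rna_tok = PreTrainedTokenizerFast.from_pretrained(
    "SobhanShukueian/rnatranslator", subfolder="rna_tokenizer")

# Encoder reads the protein; the decoder is scored against the target RNA.
protein_ids = protein_tok("MSGGGVIRGPAGNND", return_tensors="pt").input_ids  # toy protein
rna_ids = rna_tok("bjzubjzubjzu", return_tensors="pt").input_ids             # toy RNA target

with torch.no_grad():
    out = model(input_ids=protein_ids, labels=rna_ids)
print(out.loss.item())  # cross-entropy of the RNA tokens conditioned on the protein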

Key Features

  • Protein→RNA Translation: Treats protein-to-RNA mapping like natural language translation
  • Trained on 38M+ interactions: 26M from RNAInter and 12M validated experimental samples
  • End-to-end generation: No need for hand-crafted rules or post-processing
  • Dual-tokenizer support: Separate tokenizers for encoder (protein) and decoder (RNA)
  • Multi-GPU training with Hugging Face Accelerate

Usage

from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

def postprocess_rna(rna):
    """Map the decoder tokenizer's internal alphabet back to standard RNA bases."""
    # b/B -> A, j/J -> C, u -> U, z/Z -> G; spaces between tokens are dropped.
    table = str.maketrans({'b': 'A', 'B': 'A', 'j': 'C', 'J': 'C',
                           'u': 'U', 'z': 'G', 'Z': 'G'})
    return rna.translate(table).replace(' ', '')

# Load model
model = T5ForConditionalGeneration.from_pretrained("SobhanShukueian/rnatranslator")

# Load separate tokenizers
protein_tokenizer = PreTrainedTokenizerFast.from_pretrained("SobhanShukueian/rnatranslator", subfolder="protein_tokenizer")
rna_tokenizer = PreTrainedTokenizerFast.from_pretrained("SobhanShukueian/rnatranslator", subfolder="rna_tokenizer")


# Example target protein sequence
protein_seq = "MSGGGVIRGPAGNNDCRIYVGNLPPDIRTKDIEDVFYKYGAIRDIDLKNRRGGPPFAFVEFEDPRDAEDAVYGRDGYDYDGYRLRVEFPRSGRGTGRGGGGGGGGGAPRGRYGPPSRRSENRVVVSGLPPSGSWQDLKDHMREAGDVCYADVYRDGTGVVEFVRKEDMTYAVRKLDNTKFRSHEGETAYIRVKVDGPRSPSYGRSRSRSRSRSRSRSRSNSRSRSYSPRRSRGSPRYSPRHSRSRSRT"
inputs = protein_tokenizer(protein_seq, return_tensors="pt").input_ids

# Generate RNA
gen_args = {
    'max_length': 256,                  # cap on generated RNA length (in tokens)
    'repetition_penalty': 1.5,          # discourage repeated tokens in the output
    'encoder_repetition_penalty': 1.3,  # discourage copying tokens from the input
    'num_return_sequences': 1,
    'top_k': 30,                        # sample only from the 30 most likely tokens
    'temperature': 1.5,                 # flatten the distribution for diversity
    'num_beams': 1,                     # pure sampling, no beam search
    'do_sample': True,
}

outputs = model.generate(inputs, **gen_args)
rna_sequence = rna_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(postprocess_rna(rna_sequence))
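
Because decoding is sampled, each call yields a different sequence. To build a pool of candidate binders for downstream filtering, raise num_return_sequences; a small variation on the call above:

# Sample several candidate RNAs for the same protein in one call
gen_args['num_return_sequences'] = 10
outputs = model.generate(inputs, **gen_args)
candidates = [postprocess_rna(rna_tokenizer.decode(seq, skip_special_tokens=True))
              for seq in outputs]
for i, rna in enumerate(candidates, 1):
    print(i, rna)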

Training

RNAtranslator was trained using Hugging Face's Accelerate on multi-GPU systems.
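
The published hyperparameters and the full script live in the repository linked below; purely as an illustration, here is a skeletal Accelerate training loop of the kind used for multi-GPU seq2seq training (train_dataset, the learning rate, batch size, and epoch count are hypothetical placeholders, not the published settings):

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration

accelerator = Accelerator()  # handles device placement and gradient sync across GPUs

model = T5ForConditionalGeneration.from_pretrained("SobhanShukueian/rnatranslator")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder learning rate
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)  # hypothetical dataset

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for epoch in range(3):  # placeholder epoch count
    for batch in train_loader:
        # Each batch holds tokenized protein inputs and RNA targets
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
        accelerator.backward(loss)  # replaces loss.backward() in distributed setups
        optimizer.step()
        optimizer.zero_grad()

Launched with accelerate launch train.py, the same script runs unchanged on one or many GPUs.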

Hyperparameters, tokenizer training code, and data pipelines are all available in the GitHub repository:
https://github.com/ciceklab/RNAtranslator


Files & Structure

This model repository contains:

- model.safetensors – Fine-tuned model weights (41.4M parameters, F32)  
- config.json – Model configuration  
- protein_tokenizer/ – Tokenizer for protein sequences (encoder)  
- rna_tokenizer/ – Tokenizer for RNA sequences (decoder)  
- README.md – This model card
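
The layout can also be checked programmatically with the huggingface_hub client; a short sketch using its list_repo_files helper:

from huggingface_hub import list_repo_files

# Enumerate the files hosted in this model repository
for path in list_repo_files("SobhanShukueian/rnatranslator"):
    print(path)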

Citation

If you use RNAtranslator in your research, please cite the preprint:
https://www.biorxiv.org/content/10.1101/2025.03.04.641375v1


License

CC BY-NC-SA 2.0 — for academic use only.
For commercial licensing, please contact the authors.


Full code and documentation: https://github.com/ciceklab/RNAtranslator
