Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction
Abstract
Prot2Token unifies protein prediction tasks using an autoregressive decoder guided by task tokens, improving efficiency while matching or exceeding the accuracy of specialized models across different benchmarks.
The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., nearly 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. In addition, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
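To make the decoding setup described above concrete, the following is a minimal sketch, not the repository's implementation: a small autoregressive decoder cross-attends to embeddings from a (here simulated) pre-trained protein encoder and is prompted with a learnable task token before emitting label tokens. The class name `TinyDecoder`, the vocabulary `task_vocab`, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy autoregressive decoder that cross-attends to protein encoder states."""
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_tokens: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
        # Causal mask: each output position attends only to earlier output tokens.
        T = tgt_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.tok_emb(tgt_tokens), encoder_states, tgt_mask=causal)
        return self.lm_head(h)

# Hypothetical shared vocabulary: task tokens and label tokens live side by side.
task_vocab = {"<task_localization>": 0, "<bos>": 1, "<eos>": 2, "nucleus": 3, "cytoplasm": 4}

# Stand-in for per-residue embeddings from a frozen pre-trained protein encoder
# (e.g. an ESM-style PLM); shape (batch, sequence_length, d_model).
encoder_states = torch.randn(1, 120, 256)

decoder = TinyDecoder(vocab_size=len(task_vocab))
# The prompt is the task token followed by <bos>; the decoder then predicts
# the label tokens one step at a time (greedy decoding shown here).
prompt = torch.tensor([[task_vocab["<task_localization>"], task_vocab["<bos>"]]])
logits = decoder(prompt, encoder_states)   # (1, 2, vocab_size)
next_token = logits[:, -1].argmax(dim=-1)  # index of the predicted label token
```

In this framing, swapping the task token is all that changes between tasks; the same decoder weights and decoding loop are reused, which is what enables the multi-task training described in the abstract.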
Community
Prot2Token demonstrates that framing a wide range of protein-prediction problems as a single next-token prediction task, powered by a tokenization scheme that converts sequences, structures, and interaction graphs into one shared vocabulary, can replace today's patchwork of narrow, task-specific models with one unified GPT-style decoder. By recasting each task as next-token prediction, the framework achieves comparable accuracy and, on structure-related benchmarks, runs orders of magnitude faster than AlphaFold2, making high-throughput analysis practical on standard hardware. In a field where bioinformatics and protein modeling still rely on highly specialized architectures, Prot2Token offers a concrete path toward unifying them within one autoregressive transformer predictor; a sketch of how such a shared vocabulary might serialize different prediction targets is shown below.
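The sketch below is a hypothetical serialization helper, not the paper's actual tokenizer: the task names, token formats, and binning choices are assumptions, used only to illustrate how a sequence-level label, a set of residue positions, or a discretized structural quantity could all be flattened into the same token stream for a single decoder to emit.

```python
def to_tokens(task, target):
    """Serialize a prediction target into a task-prefixed token sequence."""
    if task == "localization":             # sequence-level class label
        body = [target]                     # e.g. "nucleus"
    elif task == "ptm_sites":               # residue-level positions
        body = [str(i) for i in target]     # e.g. [12, 57] -> ["12", "57"]
    elif task == "contact_bin":             # discretized structural quantity
        body = [f"<bin_{round(target)}>"]   # e.g. 7.3 (angstroms) -> "<bin_7>"
    else:
        raise ValueError(f"unknown task: {task}")
    return [f"<task_{task}>", "<bos>", *body, "<eos>"]

print(to_tokens("localization", "nucleus"))
# ['<task_localization>', '<bos>', 'nucleus', '<eos>']
print(to_tokens("ptm_sites", [12, 57]))
# ['<task_ptm_sites>', '<bos>', '12', '57', '<eos>']
```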
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Transformers in Protein: A Survey (2025)
- Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation (2025)
- Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment (2025)
- Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction (2025)
- Bidirectional Hierarchical Protein Multi-Modal Representation Learning (2025)
- ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings (2025)
- Elucidating the Design Space of Multimodal Protein Language Models (2025)