Abstract
MuToR is a multi-token prediction approach that interleaves learnable register tokens, each trained to predict future targets, improving language model pretraining and fine-tuning with negligible additional parameters and no architectural changes.
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes (ensuring compatibility with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
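To make the register-token idea concrete, below is a minimal PyTorch sketch of interleaving a learnable register after every input token and training it to predict a token several steps ahead. Everything here is an illustrative assumption rather than the paper's implementation: `TinyLM`, the single shared register embedding, the fixed `offset`, and the masking rule are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in decoder-only LM: embedding -> masked Transformer -> LM head."""
    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

def interleave_mask(seq_len: int) -> torch.Tensor:
    """Attention mask for a sequence where odd positions hold register tokens.

    Assumed masking rule: regular tokens attend causally to regular tokens
    only, so the next-token path matches the plain LM and registers can be
    dropped at inference; each register attends to the preceding regular
    tokens and to itself.
    """
    idx = torch.arange(seq_len)
    is_register_key = (idx % 2 == 1)
    allowed = (idx[None, :] <= idx[:, None]) & (
        ~is_register_key[None, :] | (idx[None, :] == idx[:, None])
    )
    return torch.where(allowed, 0.0, float("-inf"))

def multi_token_loss(lm: TinyLM, register: nn.Parameter,
                     input_ids: torch.Tensor, offset: int = 2) -> torch.Tensor:
    """Interleave one shared register embedding after every token; registers
    predict the token `offset` steps ahead, while regular positions keep the
    standard next-token objective."""
    B, T = input_ids.shape
    tok = lm.embed(input_ids)                          # (B, T, D)
    reg = register.expand(B, T, -1)                    # shared learnable register
    x = torch.stack([tok, reg], dim=2).reshape(B, 2 * T, -1)  # t0 r0 t1 r1 ...
    h = lm.backbone(x, mask=interleave_mask(2 * T))
    logits = lm.lm_head(h)
    tok_logits, reg_logits = logits[:, 0::2], logits[:, 1::2]
    V = logits.size(-1)
    # standard next-token prediction on regular positions
    ntp = F.cross_entropy(tok_logits[:, :-1].reshape(-1, V),
                          input_ids[:, 1:].reshape(-1))
    # register at position t predicts the token at position t + offset
    future = F.cross_entropy(reg_logits[:, :-offset].reshape(-1, V),
                             input_ids[:, offset:].reshape(-1))
    return ntp + future

lm = TinyLM()
register = nn.Parameter(torch.zeros(32))   # only D extra parameters
ids = torch.randint(0, 100, (2, 16))
loss = multi_token_loss(lm, register, ids, offset=2)
loss.backward()
```

In this sketch, regular tokens never attend to registers, so the registers add no cost at inference and the backbone needs no architectural change, consistent with the abstract's claims; the paper's actual register placement, masking, and prediction horizons may differ.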
Community
We propose a novel approach for multi-token prediction, using learnable register tokens that are each tasked with predicting future tokens during training.