Efficient Data Selection at Scale via Influence Distillation
Abstract
Influence Distillation uses second-order information to optimally select training data for LLM fine-tuning, scaling via a landmark-based approximation and achieving faster selection with competitive performance across tasks.
Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is precisely computed for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art selection methods while achieving up to 3.5× faster selection.
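The abstract's core idea admits a compact first-order reading: upweighting sample $i$ by $w_i$ changes the target loss after one gradient-descent step by roughly $\Delta \mathcal{L}_{\text{tgt}} \approx -\eta \sum_i w_i \langle g_i, g_{\text{tgt}} \rangle$, so influential samples are those whose gradients align with the target-domain gradient (the paper's second-order weights and the Adam derivation refine this). The sketch below illustrates only the landmark approximation: it is a minimal NumPy toy, not the authors' implementation. The random features, the inner-product influence score, and the ridge-regression propagation from embedding space are assumptions made for illustration.

```python
# Minimal sketch of landmark-based influence approximation (a hypothetical
# simplification; the propagation scheme here is an assumption, not the
# paper's exact method).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_landmarks, grad_dim, emb_dim = 10_000, 256, 4096, 128

# Cheap embeddings for ALL training samples (e.g., pooled hidden states).
embeddings = rng.standard_normal((n_samples, emb_dim))

# Pick landmarks uniformly at random (a smarter choice is likely possible).
landmark_idx = rng.choice(n_samples, size=n_landmarks, replace=False)

# Expensive per-sample gradients are computed ONLY for the landmarks.
landmark_grads = rng.standard_normal((n_landmarks, grad_dim))
target_grad = rng.standard_normal(grad_dim)  # gradient of the target-domain loss

# First-order influence of landmark j: the target loss drops by about
# lr * <g_j, g_target> when sample j is upweighted by one unit.
landmark_influence = landmark_grads @ target_grad

# Propagate influence to all samples via ridge regression in embedding space.
X = embeddings[landmark_idx]                       # (n_landmarks, emb_dim)
lam = 1e-2                                         # ridge regularizer
W = np.linalg.solve(X.T @ X + lam * np.eye(emb_dim), X.T @ landmark_influence)
all_influence = embeddings @ W                     # estimate for every sample

# Select the top-k samples by estimated influence for fine-tuning.
k = 1000
selected = np.argsort(all_influence)[::-1][:k]
print(selected[:10])
```

In practice the landmark gradients would come from per-sample backward passes (possibly randomly projected to keep `grad_dim` manageable), and `n_landmarks` trades selection cost against propagation accuracy.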
Community
Find a first version of the code here: https://github.com/IST-DASLab/influence_distillation
Stay tuned for updates!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Enhancing Training Data Attribution with Representational Optimization (2025)
- DIDS: Domain Impact-aware Data Sampling for Large Language Model Training (2025)
- Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning (2025)
- Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation (2025)
- ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining (2025)
- W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models (2025)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (2025)