R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Abstract
Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior work, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets spanning natural language, reasoning, and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
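As a concrete illustration of the Regroup step, the sketch below re-partitions a text corpus into finer-grained domains by clustering semantic embeddings. The embedding model (`all-MiniLM-L6-v2` via sentence-transformers), the use of k-means, and the number of domains are assumptions chosen for this example, not details taken from the paper.

```python
# Hypothetical sketch of a "Regroup" step: re-partition training examples into
# finer-grained domains by clustering their semantic embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def regroup(texts, num_domains=16, model_name="all-MiniLM-L6-v2"):
    """Assign each training example a semantic domain label via k-means."""
    model = SentenceTransformer(model_name)           # assumed embedding model
    embeddings = model.encode(texts)                  # (num_texts, dim) array
    clusterer = KMeans(n_clusters=num_domains, n_init=10, random_state=0)
    domain_ids = clusterer.fit_predict(embeddings)    # finer-grained domain labels
    return domain_ids

# Usage: domain_ids = regroup(corpus_texts, num_domains=32)
```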
Community
This paper introduces R&B, a novel data mixing framework that improves language model training by addressing two key limitations of existing approaches. R&B works by:
- Regrouping training data into finer-grained domains based on semantic similarity rather than predetermined categories, and
- Balancing data composition efficiently using a gradient-based Gram matrix obtained during training.
Unlike previous methods, R&B requires minimal additional computational overhead (only 0.01%) while eliminating the need for separate evaluation information. The authors provide theoretical analysis under standard conditions and demonstrate R&B's effectiveness across five diverse datasets spanning natural language, reasoning, and multimodal tasks, where it matches or exceeds state-of-the-art data mixing strategies.
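For the Balance step, one plausible (purely illustrative) instantiation is sketched below: averaged per-domain gradients form a Gram matrix of pairwise gradient similarities, which is then used to reweight the domain mixture. The softmax update rule and the temperature parameter are assumptions for the sake of a runnable example, not the paper's exact formula.

```python
# Hypothetical sketch of a "Balance" step: reweight domains using a Gram matrix
# built from per-domain gradients accumulated during training.
import numpy as np

def balance(domain_grads, weights, temperature=1.0):
    """domain_grads: (k, d) array of averaged gradients, one row per domain.
    weights: (k,) current mixture proportions (non-negative, summing to 1)."""
    gram = domain_grads @ domain_grads.T      # (k, k) gradient similarity (Gram) matrix
    scores = gram @ weights                   # each domain's alignment with the current mix
    new_weights = np.exp(scores / temperature)
    new_weights /= new_weights.sum()          # renormalize to a valid mixture
    return new_weights

# Usage with toy values:
# grads = np.random.randn(4, 128); w = np.full(4, 0.25)
# w = balance(grads, w)
```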