arxiv:2402.17327

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Published on Feb 27, 2024

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1 \pm \varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the k-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
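As a concrete, simplified illustration of the sampling scheme the abstract describes, the sketch below clusters the embeddings with scikit-learn's KMeans and samples points proportionally to a standard sensitivity-style score (each point's share of the clustering cost plus a uniform within-cluster term). The function name, the equal 1/2 split between the two terms, and the exact sample size are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sensitivity_sample(embeddings, k, eps, seed=0):
    """Pick roughly k + 1/eps^2 points by k-means sensitivity sampling.

    Returns (idx, weights) such that sum(weights * loss[idx]) is an
    unbiased estimate of the average loss over the full dataset.
    """
    rng = np.random.default_rng(seed)
    n = len(embeddings)

    # Cluster the embeddings; d2 is each point's squared distance to its
    # nearest center, and phi = d2.sum() is the k-means cost Phi_k.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    d2 = np.min(km.transform(embeddings), axis=1) ** 2
    phi = d2.sum()

    # Sensitivity score: the point's share of the clustering cost plus a
    # uniform share within its cluster. Each term sums to 1 over the data,
    # so their equal mixture is a valid sampling distribution.
    sizes = np.bincount(km.labels_, minlength=k)
    cost_term = d2 / phi if phi > 0 else np.zeros(n)
    p = 0.5 * cost_term + 0.5 / (k * sizes[km.labels_])
    p /= p.sum()

    # Sample with replacement proportionally to the scores; reweight by
    # inverse probability so the weighted sample is unbiased for the mean.
    m = int(np.ceil(k + 1.0 / eps**2))
    idx = rng.choice(n, size=m, p=p)
    weights = 1.0 / (m * n * p[idx])
    return idx, weights
```

In a fine-tuning pipeline, one would use idx to pick the training examples to keep and weights to scale each kept example's loss when estimating the full-data average.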
