Optimizing Data Collection for Machine Learning
Paper Information
Journal: Journal of Machine Learning Research
Added to Tracker: Jul 15, 2025
Abstract
Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm to model the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. This formulation generalizes to tasks with multiple data sources, such as labeled and unlabeled data used in semi-supervised learning, and can be easily modified to customized analyses such as how to introduce data from new classes to an existing model. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
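As a rough illustration of the conventional baseline mentioned in the abstract (estimating data requirements by extrapolating a scaling law), the sketch below fits a simple power law to a few pilot measurements and inverts it for a target. It is not the authors' LOC method. The power-law form error ≈ a · n^(-b), the function names, the pilot measurements, and the cost and penalty values are all assumptions made for illustration.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ~ a * n**(-b) by least squares in log-log space (assumed form)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)  # (a, b)

def required_size(a, b, target_error):
    """Invert error = a * n**(-b) for the smallest n predicted to reach the target."""
    return int(np.ceil((a / target_error) ** (1.0 / b)))

def total_cost(n_collected, n_current, a, b, target_error, cost_per_sample, penalty):
    """Collection cost, plus a penalty if the predicted error still misses the target."""
    predicted_error = a * float(n_collected) ** (-b)
    missed = predicted_error > target_error
    collection = cost_per_sample * (n_collected - n_current)
    return collection + (penalty if missed else 0.0)

# Hypothetical pilot measurements: validation error at four data set sizes.
sizes = np.array([1e3, 2e3, 4e3, 8e3])
errors = 2.0 * sizes ** -0.5  # synthetic data that follows the assumed law exactly

a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=0.01)

# Compare over-collecting with under-collecting under hypothetical prices.
cost_over = total_cost(50_000, 8_000, a, b, 0.01, cost_per_sample=0.1, penalty=5_000.0)
cost_under = total_cost(20_000, 8_000, a, b, 0.01, cost_per_sample=0.1, penalty=5_000.0)
print(n_needed, cost_over, cost_under)
```

On this synthetic data the extrapolated requirement comes out at about 40,000 samples. Real learning curves are noisy, so a point estimate like this can undershoot the target; the asymmetry between collection cost and the penalty for missing the target is the trade-off the abstract describes.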
Author Details
Rafid Mahmood
James Lucas
Jose M. Alvarez
Sanja Fidler
Marc T. Law
Research Topics & Keywords
Research Area: Machine Learning
Citation Information
APA Format
Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler & Marc T. Law. Optimizing Data Collection for Machine Learning. Journal of Machine Learning Research.
BibTeX Format
@article{JMLR:v26:23-0292,
author = {Rafid Mahmood and James Lucas and Jose M. Alvarez and Sanja Fidler and Marc T. Law},
title = {Optimizing Data Collection for Machine Learning},
journal = {Journal of Machine Learning Research},
year = {2025},
volume = {26},
number = {38},
pages = {1--52},
url = {http://jmlr.org/papers/v26/23-0292.html}
}