JMLR

Optimizing Data Collection for Machine Learning

Authors
Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law
Research Topics
Machine Learning
Paper Information
  • Journal: Journal of Machine Learning Research
  • Added to Tracker: Jul 15, 2025
Abstract

Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm to model the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. This formulation generalizes to tasks with multiple data sources, such as labeled and unlabeled data used in semi-supervised learning, and can be easily modified to customized analyses such as how to introduce data from new classes to an existing model. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
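
The baseline mentioned in the abstract, estimating data requirements by extrapolating from a scaling law, can be illustrated with a minimal sketch. This is not the authors' LOC method or their code. It assumes, purely for illustration, that validation error follows a power law `error ≈ a * n**(-b)` in the data set size `n`. The function names `fit_power_law` and `required_size`, and the measurements below, are hypothetical.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit error ~ a * n**(-b) by least squares in log-log space.

    Assumption: a pure power law with no irreducible-error term.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


def required_size(a, b, target_error):
    """Invert error = a * n**(-b) to get the n that reaches target_error."""
    return float((a / target_error) ** (1.0 / b))


# Hypothetical pilot measurements: error observed at four data set sizes.
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])
errors = 2.0 * sizes ** -0.5  # synthetic data generated from a = 2, b = 0.5

a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=0.01)
print(f"a={a:.3f}, b={b:.3f}, estimated samples needed={n_needed:.0f}")
```

A point estimate like `n_needed` carries no notion of collection cost, time horizon, or the penalty for missing the target, which is the gap the abstract says the optimal data collection formulation is meant to address.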

Citation Information
APA Format
Mahmood, R., Lucas, J., Alvarez, J. M., Fidler, S., & Law, M. T. (2025). Optimizing Data Collection for Machine Learning. Journal of Machine Learning Research, 26(38), 1-52.
BibTeX Format
@article{JMLR:v26:23-0292,
  author  = {Rafid Mahmood and James Lucas and Jose M. Alvarez and Sanja Fidler and Marc T. Law},
  title   = {Optimizing Data Collection for Machine Learning},
  journal = {Journal of Machine Learning Research},
  year    = {2025},
  volume  = {26},
  number  = {38},
  pages   = {1--52},
  url     = {http://jmlr.org/papers/v26/23-0292.html}
}