JMLR

“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts

Authors
Varun Babbar* Zhicheng Guo* Cynthia Rudin
Research Topics
Machine Learning
Paper Information
  • Journal:
    Journal of Machine Learning Research
  • Added to Tracker:
    Sep 08, 2025
Abstract

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (including tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.

Citation Information
APA Format
Varun Babbar*, Zhicheng Guo*, & Cynthia Rudin. “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research.
BibTeX Format
@article{paper475,
  title = {“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts},
  author = {Varun Babbar* and Zhicheng Guo* and Cynthia Rudin},
  journal = {Journal of Machine Learning Research},
  url = {https://www.jmlr.org/papers/v26/24-0352.html}
}