“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts
Paper Information
Journal: Journal of Machine Learning Research
Added to Tracker: Sep 08, 2025
Abstract
The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.
Author Details
Varun Babbar* (Author)
Zhicheng Guo* (Author)
Cynthia Rudin (Author)
Research Topics & Keywords
Machine Learning (Research Area)
Citation Information
APA Format
Varun Babbar*, Zhicheng Guo*, & Cynthia Rudin. “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research.
BibTeX Format
@article{paper475,
  title = {“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts},
  author = {Varun Babbar and Zhicheng Guo and Cynthia Rudin},
  journal = {Journal of Machine Learning Research},
  url = {https://www.jmlr.org/papers/v26/24-0352.html}
}