JMLR

“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts

Authors

Varun Babbar*, Zhicheng Guo*, Cynthia Rudin

Research Topics

Machine Learning

Paper Information

Journal:
Journal of Machine Learning Research
Added to Tracker:
Sep 08, 2025

Abstract

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.

Citation Information

APA Format

Varun Babbar, Zhicheng Guo & Cynthia Rudin. “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research.

BibTeX Format

@article{paper475,
  title = {“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts},
  author = {Varun Babbar and Zhicheng Guo and Cynthia Rudin},
  journal = {Journal of Machine Learning Research},
  url = {https://www.jmlr.org/papers/v26/24-0352.html}
}
