JMLR

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Authors
Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
Research Topics
Causal Inference
Paper Information
  • Journal: Journal of Machine Learning Research
  • Added to Tracker: Jul 15, 2025
Abstract

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.
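
As a rough illustration of the activation patching (interchange intervention) idea named in the abstract, the following is a minimal sketch and not the authors' implementation. The toy "network" `low_level`, the candidate algorithm `high_level`, the unit names `h_sum` and `h_copy`, and the helper `interchange_accuracy` are all hypothetical, chosen only for this example. The sketch runs the low-level model on a base input while overwriting one hidden unit with the value it takes on a source input, then checks whether the result matches the same intervention applied to the high-level variable.

```python
from itertools import product


def low_level(x, patch=None):
    """Hypothetical toy 'network' with two named hidden units."""
    a, b, c = x
    hidden = {"h_sum": a + b, "h_copy": c}
    if patch:
        # Overwrite selected hidden units with externally supplied values.
        hidden.update(patch)
    output = hidden["h_sum"] * hidden["h_copy"]
    return output, hidden


def high_level(x, s_value=None):
    """Hypothetical high-level algorithm: S = a + b, output = S * c."""
    a, b, c = x
    s = a + b if s_value is None else s_value
    return s * c


def patch_low(base, source, unit="h_sum"):
    """Run on `base`, but take `unit` from the run on `source`."""
    _, source_hidden = low_level(source)
    output, _ = low_level(base, patch={unit: source_hidden[unit]})
    return output


def patch_high(base, source):
    """Same intervention on the high-level variable S."""
    a, b, _ = source
    return high_level(base, s_value=a + b)


def interchange_accuracy(inputs):
    """Fraction of (base, source) pairs where both levels agree."""
    pairs = list(product(inputs, repeat=2))
    hits = sum(patch_low(b, s) == patch_high(b, s) for b, s in pairs)
    return hits / len(pairs)


inputs = list(product(range(3), repeat=3))
print(patch_low((1, 2, 3), (4, 5, 6)))  # 27, i.e. (4 + 5) * 3
print(interchange_accuracy(inputs))     # 1.0
```

Here the agreement is perfect by construction, because `h_sum` was written to play exactly the role of S. With a trained network the hidden unit (or a learned subspace) would have to be searched for, and the agreement rate would typically fall below 1.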

Citation Information
APA Format
Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., & Icard, T. (2025). Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability. Journal of Machine Learning Research, 26(83), 1-64.
BibTeX Format
@article{JMLR:v26:23-0058,
  author  = {Atticus Geiger and Duligur Ibeling and Amir Zur and Maheep Chaudhary and Sonakshi Chauhan and Jing Huang and Aryaman Arora and Zhengxuan Wu and Noah Goodman and Christopher Potts and Thomas Icard},
  title   = {Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability},
  journal = {Journal of Machine Learning Research},
  year    = {2025},
  volume  = {26},
  number  = {83},
  pages   = {1--64},
  url     = {http://jmlr.org/papers/v26/23-0058.html}
}