[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

LESSWRONG
is fundraising!
LW

[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable — LessWrong

This is a book by Christoph Molnar about interpretability. It covers a lot of traditional, non AI safety interpretability. Some recommendations for chapters to read are in an excerpt from our summary of it in the interpretability starter resources:

An amazing and up-to-date introduction to traditional interpretability research. We recommend reading the taxonomy of interpretability and about the specific methods of PDP, ALE, ICE, LIME, Shapley values, and SHAP . Also read the chapter on neural network interpretation such as saliency maps and adversarial examples. He has also published a "Common pitfalls of interpretability" post (and its paper) that is recommended reading.

Why share it here? It is a continuously-updated, high quality book on interpretability as seen from outside AI safety which seems very relevant to understanding what the field in general looks like. It has 78 authors and seems very canon. The first edition is from Apr 11, 2019 while the second edition (latest version) is from Mar 04, 2022. It was last updated on the Oct 13, 2022.

From the introduction of the book:

Machine learning has great potential for improving products, processes and research. But computers usually do not explain their predictions which is a barrier to the adoption of machine learning. This book is about making machine learning models and their decisions interpretable.
After exploring the concepts of interpretability, you will learn about simple, interpretable models such as decision trees, decision rules and linear regression. The focus of the book is on model-agnostic methods for interpreting black box models such as feature importance and accumulated local effects, and explaining individual predictions with Shapley values and LIME. In addition, the book presents methods specific to deep neural networks.
All interpretation methods are explained in depth and discussed critically. How do they work under the hood? What are their strengths and weaknesses? How can their outputs be interpreted? This book will enable you to select and correctly apply the interpretation method that is most suitable for your machine learning project. Reading the book is recommended for machine learning practitioners, data scientists, statisticians, and anyone else interested in making machine learning models interpretable.