A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
This is a summary of our new paper. TL;DR: Existing faithfulness metrics are not suitable for evaluating frontier LLMs. We introduce a new metric based on whether a model's self-explanations help people predict its behavior on related inputs. We find that self-explanations encode valuable information about a model's decision-making process (though...
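The metric described above can be sketched as a simulatability-style score: compare how accurately a predictor (human or model) anticipates the model's behavior on related inputs with versus without access to the self-explanation. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, data, and scoring rule are assumptions.

```python
# Hedged sketch of a simulatability-style faithfulness score:
# how much does access to the model's self-explanation improve
# a predictor's accuracy at guessing the model's behavior?
# All names and data here are illustrative assumptions.

def faithfulness_score(preds_with_expl, preds_without_expl, actual):
    """Accuracy gain from seeing the self-explanation.

    preds_with_expl / preds_without_expl: the predictor's guesses of the
    model's outputs on related inputs, with and without the explanation.
    actual: the model's real outputs on those inputs.
    """
    def accuracy(preds):
        return sum(p == a for p, a in zip(preds, actual)) / len(actual)

    return accuracy(preds_with_expl) - accuracy(preds_without_expl)

# Toy example: 4 related inputs, model outputs "A"/"B".
actual = ["A", "B", "A", "A"]
with_e = ["A", "B", "A", "B"]    # 3/4 correct when shown the explanation
without = ["A", "A", "B", "B"]   # 1/4 correct without it
print(faithfulness_score(with_e, without, actual))  # 0.5
```

A positive score means the explanation carried usable information about the model's decision-making; a score near zero means it did not help prediction at all.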
Feb 26, 2023

