noahys

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

harrymayne, Justin kang, Dewi Gould, noahys

2mo

This is a summary of our new paper.

TL;DR: Existing faithfulness metrics are not suitable for evaluating frontier LLMs. We introduce a new metric based on whether a model's self-explanations help people predict its behavior on related inputs. We find that self-explanations encode valuable information about a model's decision-making process (though they remain imperfect).

Authors: Harry Mayne*, Justin Singh Kang*, Dewi Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel (*Equal contribution, order randomised).

**A Positive Case for Faithfulness.** We measure whether an observer can predict an LLM's behaviour both *without* and *with* access to its self-explanations. Self-explanations consistently help prediction.
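As a rough illustration of the without/with comparison described above, here is a minimal sketch. It assumes the observer's guesses and the model's actual outputs are plain string labels; the function names `accuracy` and `explanation_gain` and the toy data are hypothetical, and the paper's actual metric may be defined differently.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of the observer's predictions that match the model's actual behaviour."""
    if len(predictions) != len(actual) or len(actual) == 0:
        raise ValueError("predictions and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


def explanation_gain(
    without_expl: Sequence[str],
    with_expl: Sequence[str],
    actual: Sequence[str],
) -> float:
    """Change in observer accuracy when self-explanations are shown (hypothetical helper)."""
    return accuracy(with_expl, actual) - accuracy(without_expl, actual)


# Toy data, invented for illustration only (not results from the paper).
actual_behaviour = ["A", "B", "A", "B"]
guesses_without = ["A", "A", "A", "A"]  # observer sees only the inputs
guesses_with = ["A", "B", "A", "A"]     # observer also sees the self-explanation

gain = explanation_gain(guesses_without, guesses_with, actual_behaviour)
```

A positive gain means the explanations helped the observer predict the model; a gain near zero or below would suggest they carried little usable information about its behaviour.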

Explanatory faithfulness

Explanatory faithfulness asks whether an LLM's self-explanations (CoT or post-hoc) accurately describe the model's true reasoning process. This is important because:

CoT faithfulness is an important (though not sufficient) condition for AI

...

The Case for The EA Hotel

noahys

7y

I would have no fixed amount of rooms for paying or non-paying guests.

I think having at least some rooms reserved for each is actually pretty important. If there aren't any non-paying guests working on projects then you lose out on the networking/synergy/culture of productivity, which is the main reason the hotel is interesting in the first place. Not having rooms for short-term paying guests also seems like a failure mode, for cultural reasons: the hotel's status as a "destination" raises its own visibility and attracts more projects i...