TL;DR: Existing faithfulness metrics are not suitable for evaluating frontier LLMs. We introduce a new metric based on whether a model's self-explanations help people predict its behaviour on related inputs. We find self-explanations encode valuable information about a model's decision-making process (though they remain imperfect).
Authors: Harry Mayne*, Justin Singh Kang*, Dewi Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel (*Equal contribution, order randomised).
A Positive Case for Faithfulness. We measure whether an observer can predict an LLM's behaviour both without and with access to its self-explanations. Self-explanations consistently help prediction.
Explanatory faithfulness
Explanatory faithfulness asks whether an LLM's self-explanations (CoT or post-hoc) accurately describe the model's true reasoning...
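The metric described above can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's implementation: `predictive_gain`, `toy_observer`, and the toy cases are all invented names, and the real study uses human observers rather than a rule-based stand-in. The idea is simply to compare the observer's prediction accuracy with and without access to the model's self-explanation.

```python
# Hypothetical sketch of a predictiveness-based faithfulness metric:
# compare an observer's accuracy at predicting the model's behaviour
# on related inputs without vs. with the model's self-explanation.

def predictive_gain(observer, cases):
    """cases: list of (context, explanation, true_behaviour) triples."""
    n = len(cases)
    without = sum(observer(ctx, None) == y for ctx, expl, y in cases)
    with_expl = sum(observer(ctx, expl) == y for ctx, expl, y in cases)
    return with_expl / n - without / n

# Toy observer (purely illustrative): predicts "refuse" by default,
# but follows an explicit hint found in the explanation.
def toy_observer(ctx, explanation):
    if explanation and "will comply" in explanation:
        return "comply"
    return "refuse"

cases = [
    ("prompt A", "the model will comply because the request is benign", "comply"),
    ("prompt B", "the model refuses on safety grounds", "refuse"),
]
print(predictive_gain(toy_observer, cases))  # → 0.5
```

A positive gain means the explanations carry usable information about the model's behaviour; a gain near zero means they do not help the observer beyond baseline guessing.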
I would have no fixed number of rooms for paying or non-paying guests.
I think having at least some rooms reserved for each is actually pretty important. If there aren't any non-paying guests working on projects, then you lose out on the networking/synergy/culture of productivity, which is the main reason the hotel is interesting in the first place. Not having rooms for short-term paying guests also seems like a failure mode, for cultural reasons: the hotel's status as a "destination" raises its own visibility and attracts more projects in the future, but I think even more importantly it serves as a symbol of the community. Taking a train out to the countryside to...