This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
TL;DR: In this post, I refine universality in order to explain the "Filter", a method of risk analysis for the outputs of powerful AI systems. In part I, I refine universality by replacing the language of “beliefs” and proposing a handful of general definitions. Then, I explain the key application of universality: building the Filter. I explain the Filter non-technically through a parable in part II, then technically in part III. To understand the Filter as presented in part II, it is not...
Building a filter can be the human overseers’ response to this suggestion.
A risk analysis that is separate from an HCH system mediates quality in that system.
The observables are the system’s actions, and HCH dominates with respect to some action-guiding information that we’ve deduced from those observable actions.
Yes, it might be expensive.
We hope that, with more information about its subject, a mediator of quality can better describe its risk. Better risk analysis translates into more safety.