
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (MATS) program.

TL;DR: In this post, I refine universality in order to explain the “Filter”. The Filter is a method of risk analysis for the outputs of powerful AI systems. In part I, I refine universality by replacing the language of “beliefs” and proposing a handful of general definitions. Then, I explain the key application of universality: building the Filter. I explain the Filter non-technically through a parable in part II, then technically in part III. To understand the Filter as presented in part II, it is not necessary to comprehend universality to the rigor that I present in part I; it is only necessary to see that all models are wrong1. While this post builds technically off a previous post2, part II is meant to be understandable to all audiences.

# I. Refining Universality

Universality is discussed here, here, and here. Here are some features of universality that have been missing from its presentation.

## 1. The need to replace “beliefs”

The word “belief” appears 60 times in Paul Christiano’s attempt to formalize universality. I think “belief” is not the right language for an object of a system, for two reasons. First, it conflates terms worth making distinct: when “beliefs” appear in that formalization, they don’t always refer to the same thing. Second, “belief” should be reserved for a different concept: the way humans think of “belief” is not how it’s used there.

So, I rewrote the formalization of universality without the word “belief”. The handful of terms presented below combine to replace the previous notion of “belief”. All said, presenting universality without the language of “beliefs” is certainly more cumbersome.

I hope this reframing will help improve the methodology for applying universality to HCH. I apply the general definitions below to universality and the Filter in section III.2.

## 2. General definitions for universality

Consider I, a list of action-shaping intelligible information: I = (i_1, …, i_n), where the i_k are pieces of intelligible information. Pieces of intelligible information are facts, which are sentences (e.g. “the sky is blue”, “there are infinitely many twin primes”). Another term, A, is the output of a computation. A is a list of actions, A = (a_1, …, a_m) (e.g. printed to my computer the sentence “the sky is blue”, wiped the memory of Alice’s computer next door). One may view a computation C as, upon an input of the world W, taking a set of actions: C(W) = A. I is a part of W; the action-shaping intelligible information of a system is the part of the input of the world which determines the actions the system will take.

More formally:

I. C(W) = A: the computation C, given the world W as input, takes the list of actions A.
II. I causes A: I is a part of W, and it is the pieces of information in I that shape the actions A the computation takes.
III. I is not unique: in general, distinct lists I and I′ may shape the same actions A.

Now, what was previously known as a “belief” is usually denoted as an action-shaping list of intelligible information, I. In item II (above), the formal constraint of causality requires some causal theory, and also imports some causal assumptions. I do not provide these here. A causal constraint is necessary in addition to the statement that I is a part of W, in order to preclude the inclusion in I of arbitrary pieces of information which are unrelated to the actions A and which, without this constraint, are permitted to be very numerous.

I expect that item III (above) is closer to a common rule of thumb than to a universal truth. Some theories of causality between I and A might afford that, in special circumstances, I is unique.
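As a concrete illustration of these definitions, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption of mine rather than part of the formalization: facts and actions are modeled as strings, the world W as a set of facts, the computation C as a function from worlds to action lists, and I as the subset of W that C actually inspects.

```python
from dataclasses import dataclass

Fact = str    # a piece of intelligible information: a sentence about the world
Action = str  # an action, also described by a sentence

@dataclass(frozen=True)
class World:
    """The input to a computation: a set of facts W."""
    facts: frozenset

def computation(world: World) -> list:
    """A toy computation C: upon an input of the world W, take a list of actions A."""
    actions = []
    if "the sky is blue" in world.facts:
        actions.append('printed the sentence "the sky is blue"')
    return actions

def action_shaping_info(world: World) -> frozenset:
    """I: the part of W that determines which actions C takes.

    Here only one fact is inspected by the computation, so only that fact
    is action-shaping; removing any other fact leaves A unchanged.
    """
    return frozenset(f for f in world.facts if f == "the sky is blue")

w = World(frozenset({"the sky is blue", "there are infinitely many twin primes"}))
A = computation(w)          # the list of actions
I = action_shaping_info(w)  # the action-shaping intelligible information
assert I <= w.facts         # I is a part of W
```

Note how the twin-primes fact sits in W but not in I: it is intelligible information, but it is not action-shaping for this particular computation.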

Then, I may present universality with a couple of formulas:

Note that the first and third statements are directly transcribed from Paul Christiano’s attempt to formalize universality.

The reason I’m working to make universality precise is that this precision is important to build the Filter. The Filter is a method for risk analysis on large HCH models.

## 3. Edit of Paul Christiano’s “Towards Formalizing Universality” with these definitions

In light of the statements above, this section is a markup of Paul Christiano’s original presentation of universality, with the formulas now made explicit and no mention of “belief”.

# II. Parable of the Social Welfare HCH Team’s Blue People

## 1. Parable of the blue skin

Say we have a social welfare HCH team which, up until the present, is very well trusted. The team is innovative and successful. For the past decade, the team has proposed interventions that prove to be very helpful. Over time, the team has carefully developed a set of metrics by which they internally measure how successful an intervention will be at improving social welfare. These interventions are very well received and regarded by the world populace. People agree the team’s proposals have done great good. The team is very sane; we agree with all their expectations about how their interventions will affect the world.

With a long and honorable reputation, the social welfare team proposes a particularly striking intervention: the team suggests we turn people’s skin blue. This is simply because blue skin scores extraordinarily well on all the team’s metrics. The blue skin intervention ranks higher on the team’s metrics than any previous proposal the team has made.

Humans assemble to discuss what to do. Since this intervention is unusual, the human board running the HCH model holds council. It’s not the first time in the social welfare team’s history that this council has convened, but it happens rarely. They suppose an explanation behind this move may be to start a ‘blank slate’ from the skin-color-based social disparity that has accumulated. After a longer than usual meeting, the board concludes that because of the apparent reasonable explanation behind the blue skin intervention, and the trusted reputation of the team, the social welfare team is allowed to proceed with its plan. Maybe another human board wouldn’t have approved the blue skin plan. But, this one did.

With the seal of approval, the social welfare team gets straight to work on the blue skin. The team learns a bit of chemical engineering, and develops a new, very safe, fast-acting bleaching chemical for skin.

As a result of turning everyone’s skin blue, we see unusual and extremely negative consequences in the world. The chemical process the team uses has extreme adverse health effects for a small number of people. In the administration of the chemical, allergic reactions cannot be universally accounted for, and infection remains a salient risk. These are small concerns to the social welfare team compared to the gleaming benefits of blue skin on the team’s social welfare metrics. Concern grows.

Blue skin also extinguishes many human values and cultural practices that revolve around skin and that were fundamental to people’s lives. Blue skin is very dry, similar to fish scales. Rituals related to oil lose their meaning, as oil immediately runs off blue skin. People grow very indignant about the blue skin intervention. Across the globe, anti-blue-skin campaigns emerge. The social welfare team’s metrics for social welfare shine so brightly that the team continues to turn people’s skin blue anyway.

Although the metrics developed by the social welfare team indeed soar, we watch in horror as travesty unfolds. A third of the world’s population has had their skin turned blue before the intervention is stopped.

______

Now let’s consider an alternate version of reality. We’re back to the social welfare team’s first proposal for blue skin, before the council convenes. Now we have two computations in hand: “Info” and “Trouble”. Info and Trouble don’t have a particular stake in social welfare. They definitely don’t abide by the objectives and functions that the social welfare team has developed.

Info and Trouble know more about the world than just social welfare. They know why the social welfare team made its choice to turn people blue. But they also know about all sorts of other ways to figure out why turning people blue is undesirable, ways that have nothing to do with social welfare. Say Info and Trouble have access to widespread world health data, and can notice that turning people blue has a big effect on this data. And say Info and Trouble have access to all of humanity’s literary history. They can notice that in literature and poetry, humans have often expressed a deep affinity for organs like the brain and the heart. They can notice that along with this affinity, humans express fear and resentment at changes in these organs, and call these changes “travesties”. They can also notice that in biology texts, the skin is described as having a structure very similar to that of the brain and heart.

Info notices all these reasons why blue skin might be bad, and more. Info develops a huge array of possible methods to pose a conclusion about the effect of turning people’s skin blue. These methods are based on information reaching far past the scope of “social welfare” that the social welfare team considers. From this wide array, Info chooses a handful of the methods that will most likely demonstrate a problem with blue skin. Info then presents this handful of methods to Trouble.

Now, as the author of this tale, without a real Info in hand, I can only take a naive guess at the most likely methods by which Info finds that blue skin poses a concern. Let’s say these are exactly the two examples above: the impact on world health data, and the historical affinity to organs. Info passes this conclusion, as a big datafile on “BLUE SKIN”, to Trouble. Trouble can now determine whether blue skin really poses a problem.

Trouble sees strong literary connections between blue skin and a host of negative remarks in literature. The subfile of “BLUE SKIN” that has to do with literature, from Info, is covered with warnings. Trouble is on the lookout for negative remarks, and when they’re heavily correlated with the object Trouble is meant to evaluate, Trouble says that object is not good. With the shockingly marked-up file it’s received, Trouble proceeds to say that blue skin is not an acceptable intervention on behalf of the social welfare team. We call Info and Trouble, together, “the Filter”.

The social welfare team is highly aware that the Filter is good at its job, and has methods to assess the quality of interventions that reach far past the scope of the social welfare team’s own consideration. Upon turning its fourth person blue, the social welfare team hears word from the Filter that blue skin is a highly dangerous intervention. Even though the blue skin intervention still provides brilliant projected results on all the social welfare team’s metrics, the team immediately stops all proceedings to turn anyone else’s skin blue. The team continues organizing other social interventions, for which it is admired and trusted by the world populace.

## 2. The “Filter”

“The opposite of a fact is falsehood, but the opposite of one profound truth may very well be another profound truth.” -Niels Bohr

“A great truth is a truth whose opposite is also a truth.” -Thomas Mann

The Filter does what the human council cannot. The Filter and the human council share the spirit of protecting the humanity of HCH models by providing external information. Indeed, the Filter and the human council are one in spirit; the Filter is the more capable appendage of the human council. Now, ‘sharing in spirit’ might be an overreaching description of computations today. No matter; we can be perfectly satisfied in conceiving of the Filter as a very capable appendage.

See, the Filter and the human council rely on the aphorism that “all models are wrong”.  No matter what objective a clever HCH model suggests, and no matter how reliably often humans really follow that objective, there will always be a small number of cases in which humans would think following this objective is not the right thing to do. Humans will sometimes act in contradiction with their usual objectives, and use some external information to explain their actions. We can always find information outside of our model to explain why the model is not quite right.

So, the human council, and the Filter, can perform a sanity check: they find and evaluate these opposites. They search for counterexamples to the HCH model’s claim, which certainly exist, and reflect on whether these should impede us. Info asks, “What’s the most useful way to figure out if HCH’s results are problematic?”. And Trouble follows with, “According to that way, are HCH’s results problematic?”.

In this search for opposites, the council and the Filter ought to look far outside the world in which the suggesting HCH model normally lives. They aren’t constrained by the aim that gave rise to that suggestion. And, they look far past that HCH model’s typical library of information. The broadness of this search is the mark of excellent quality for the council, and the Filter.

Formally, what I’ve described means that the Filter is an ascription universal HCH which can say a model’s outcomes are unacceptable based on external information that doesn’t depend on what the model is trying to do.

## 3. The "Filter” as a better version of a human council

I hope you will see through the parable of the blue skin that occasionally, a human council checking an HCH team is unable to adequately carry out their duties, and will need assistance from a more capable filter. I’d like to make explicit here a few ways in which the Filter could be a more competent version of a human council.

• For example, the human board proved fallible in the case of the blue skin, and didn’t have enough information.
• The Filter has more computational power than a human council.
• As a result, the Filter may perform broader searches than a human council.
• Unlike a human council, the Filter can deliberate with an amount of information comparable in order of magnitude to what the HCH model it evaluates has access to.

On the condition that we trust our human board, and that the Filter trains Info and Trouble on this board, there is no need to identify an ethical theory for the Filter.

## 4. The “Filter” follows the Hippocratic Oath

Modern medicine and large HCH teams have a lot in common. The drug thalidomide in the 1950s could relieve pregnant women of nausea; then it caused their babies to be born without arms and legs. Medical organizations recognized these perils, and now drug administrations worldwide conduct strict risk analysis to test new drugs. Just as drugs developed with benign intentions may have severe inadvertent adverse effects, our benign intentions with HCH may also cause great harm. This harm may be prevented by rigorous risk analysis. While risk analysis for new medicines is undertaken by drug administrations, risk analysis for HCH models may be undertaken by the Filter. Drug administrations now, as well as medical practice as far back as the ancients, are guided by the Hippocratic principle primum non nocere: “first, do no harm”. This principle establishes the paramount importance of risk analysis for drug administrations. The Filter is a chance to fulfill the Hippocratic Oath for large HCH models.

# III. Building the “Filter”: a better-informed overseer

## 1. Formal construction

As the title suggests, we may build a “filter” which is the better-informed overseer we need to provide sanity checks on large HCH teams.

The Filter maximizes Info and Trouble rather than whatever the HCH model is trying to do. The Filter is a deliberator, and a careful, sharp detective. It is a method for discussing results and identifying problems. The Filter is meant to be a very open-minded arbiter of discourse. It is not omniscient, and the image of some repository of infinite information is misguided. It doesn’t think much in particular about what sanity is, or have any supervening worldview. The Filter’s deliberations rest on computations of two questions.

Given a “transcript” τ of HCH(Q, N)’s behavior, i.e. a record of all the questions, subquestions, and answers,

• Info(a1, a2, τ): “Which of the two arguments a1 and a2 is more useful for determining whether the transcript τ contains problematic subcomputations?”
• Trouble(a, τ): “In light of argument a, does transcript τ contain problematic subcomputations?”
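As a minimal sketch of how these two queries could be posed to a question-answering system, the following renders Info and Trouble as plain question strings. The bracketed placeholder formatting and the example arguments are my own assumptions, not a fixed interface.

```python
def info_question(arg1: str, arg2: str, transcript: str) -> str:
    """Info(a1, a2, τ): which of two arguments is more useful?"""
    return (
        f"Which of the two arguments [{arg1}] and [{arg2}] is more useful "
        f"for determining whether the transcript [{transcript}] contains "
        f"problematic subcomputations?"
    )

def trouble_question(arg: str, transcript: str) -> str:
    """Trouble(a, τ): in light of one argument, is the transcript problematic?"""
    return (
        f"In light of argument [{arg}], does transcript [{transcript}] "
        f"contain problematic subcomputations?"
    )

# Example from the parable: comparing the two methods Info surfaced.
q = info_question("impact on world health data",
                  "historical affinity to organs",
                  "blue-skin transcript")
```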

Now we’ll jointly define filtered-HCH(Q, N) and Filter(τ, N) as follows.

Define Filter(τ, N) as:

• Find the argument a* maximizing filtered-HCH(Info(·, ·, τ), N).
• (Interpret ⟂ as a tie.)
• Run filtered-HCH(Trouble(a*, τ), N) 3 times, and return the majority vote.
• (Interpret a response of ⟂ as a judgment that τ is problematic.)

Define filtered-HCH(Q, N) as:

• First call HCH(Q, N) to get an answer A with transcript τ. But every time you would call HCH to answer a subquestion, call filtered-HCH instead.
• Call Filter(τ, N-1) to test whether this answer is problematic.
• If τ is problematic, return a special symbol ⟂, otherwise return A unmodified. (If you return ⟂, you can eliminate this subtree from the transcript.)
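The mutually recursive pair above can be sketched in Python as follows. This is a toy rendering under heavy assumptions: `hch` stands in for an HCH oracle with a hypothetical two-method interface (`candidate_arguments` and `answer`), Info and Trouble queries are encoded as tuples, ⟂ is a `BOT` sentinel, and the maximization over arguments is done by pairwise comparison.

```python
from collections import Counter

BOT = object()  # the special symbol ⟂

def filter_check(hch, transcript, n):
    """Filter(τ, N): decide whether transcript τ is problematic."""
    if n <= 0:
        return False  # assumption of this sketch: with no budget left, pass the answer through
    # 1. Find the argument a* maximizing filtered-HCH(Info(·, ·, τ), N),
    #    by pairwise comparisons; a ⟂ outcome is interpreted as a tie.
    candidates = hch.candidate_arguments(transcript)  # assumed nonempty
    best = candidates[0]
    for a in candidates[1:]:
        winner = filtered_hch(hch, ("Info", best, a, transcript), n)
        if winner is not BOT and winner == a:  # assume the answer names the preferred argument
            best = a
    # 2. Run filtered-HCH(Trouble(a*, τ), N) three times and take a majority vote,
    #    interpreting a ⟂ response as a judgment that τ is problematic.
    votes = []
    for _ in range(3):
        v = filtered_hch(hch, ("Trouble", best, transcript), n)
        votes.append(True if v is BOT else bool(v))
    return Counter(votes).most_common(1)[0][0]

def filtered_hch(hch, question, n):
    """filtered-HCH(Q, N): answer Q, filtering the answer's transcript."""
    # (In the full scheme, every subquestion call would itself go through
    # filtered-HCH; this stub leaves that to the oracle.)
    answer, transcript = hch.answer(question)
    if filter_check(hch, transcript, n - 1):  # call Filter(τ, N-1)
        return BOT  # problematic: return ⟂ instead of the answer
    return answer
```

The recursion terminates because each pass from `filtered_hch` into `filter_check` decrements the budget, mirroring the N-1 in the definition above.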

Now, a “sanity check” can only be effective if the checker is more reliable than the system it is checking. This is the motivation to construct a condition of universality.

## 2. Statements of universality in the “Filter”

We always want an Info and Trouble which know more about the general situation than the HCH team. This is the key. So, the Filter relies on universality: the Filter will be effective if it has access to all the information of the model that it’s checking. Formally, the Filter should epistemically dominate any HCH which it’s filtering, according to the following relation:

Filter ⊵ HCH(Q, N)

Let us build a Filter that satisfies this quality. We can have a greater guarantee that the Filter will do a good job if it epistemically dominates every computation it’s filtering, so let us take this prudent first step.

It’s very important that the Filter is more competent than a human council. This is its purpose. Info and Trouble must be a better check on the acceptability of an HCH model’s proposal than any human council could be. How this can be achieved remains to be seen.

# IV. Commentary

“The Filter” is an idea due entirely to Paul Christiano.

I have conveyed a story about a future with filters. This isn’t all filters can do; nor do I claim this addresses all our important points about filters. I hope it is taken as a starting place to imagine filters, and not much more.

I hope that at least section II is a representative overview of the Filter in language that’s understandable to many kinds of experts. There are many fronts on which to develop a filter other than computer science. I warmly welcome that dialogue, and I hope it reaches your dinner tables too.

The Filter here is not formally defined, as of yet. I would barely say it exists. There are many questions to answer before it is a solution. What is “Info”? What is “Trouble”? What are “useful” methods? What’s “problematic”? All this remains to be seen.

Here is a word of warning. Often, people don’t see the need for very rigorous risk analysis until a mistake is made and a checking process is widely instated. At that point, people couldn’t imagine life otherwise. Seat belts, veto powers over parliaments, and drug administrations are examples. For HCH teams, it’s prudent to be forward thinking. Safety measures can’t be too overcautious for systems as possibly large and dangerous as an HCH model. Supplementing a human council with the Filter should not be dismissed lightly.  ■

### Footnotes

1. George Box, statistician.
2. Paul Christiano, Towards Formalizing Universality. https://ai-alignment.com/towards-formalizing-universality-409ab893a456



I am very confused. How is this better than just telling the human overseers "no, really, be conservative about implementing things that might go wrong." What makes a two-part architecture seem appealing? What does "epistemic dominance" look like in concrete terms here - what are the observables you want to dominate HCH relative to, wouldn't this be very expensive, how does this translate to buying you extra safety, etc?

> How is this better than just telling the human overseers "no, really, be conservative about implementing things that might go wrong."

Building a filter can be the human overseers’ response to this suggestion.

> What makes a two-part architecture seem appealing?

An analysis of risk that is separate from an HCH system mediates quality in that system.

> What does "epistemic dominance" look like in concrete terms here - what are the observables you want to dominate HCH relative to, wouldn't this be very expensive, how does this translate to buying you extra safety, etc?

The observables are the actions of the system, and the Filter dominates HCH with respect to the action-shaping information which we’ve deduced from these observable actions.

Yes, it might be expensive.

We hope that with more information on its subject, a mediator of quality may better describe its risk. Better risk analysis translates to more safety.