Vanessa Kosoy


AI safety from first principles: Goals and Agency

By contrast, in this section I’m interested in what it means for an agent to have a goal of its own. Three existing frameworks which attempt to answer this question are Von Neumann and Morgenstern’s expected utility maximisation, Daniel Dennett’s intentional stance, and Hubinger et al’s mesa-optimisation. I don’t think any of them adequately characterises the type of goal-directed behaviour we want to understand, though. While we can prove elegant theoretical results about utility functions, they are such a broad formalism that practically any behaviour can be described as maximising some utility function.

There is also my algorithmic-information-theoretic definition, which might be regarded as a formalization of the intentional stance and which avoids the degeneracy problem you mentioned.

Vanessa Kosoy's Shortform

There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).

Consider a one-shot decision theory setting. There is a set of unobservable states $S$, a set of actions $A$ and a reward function $r: A \times S \to [0,1]$. An IBDT agent has some belief $\beta \in \square S$[1], and it chooses the action $a^* := \operatorname{argmax}_{a \in A} \mathrm{E}_\beta[\lambda s . r(a,s)]$.

We can construct an equivalent scenario, by augmenting this one with a perfect predictor of the agent (Omega). To do so, define $S' := A \times S$, where the semantics of $(p,s)$ is "the unobservable state is $s$ and Omega predicts the agent will take action $p$". We then define $r': A \times S' \to [0,1]$ by $r'(a,p,s) := 1_{a=p} \, r(a,s) + 1_{a \neq p}$ and $\beta' \in \square S'$ by $\mathrm{E}_{\beta'}[f] := \min_{p \in A} \mathrm{E}_\beta[\lambda s . f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to $S'$, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.

The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.

For any $p \in A$, define $\Omega_p \in \square S'$ by $\mathrm{E}_{\Omega_p}[f] := \min_{s \in S} f(p,s)$. That is, $\Omega_p$ is an infra-Bayesian representation of the belief "Omega will make prediction $p$". For any $u \in [0,1]$, define $R_u \in \square S'$ by $\mathrm{E}_{R_u}[f] := \min_{\mu \in \Delta S' : \mathrm{E}_\mu[r(p,s)] \geq u} \mathrm{E}_\mu[f(p,s)]$. $R_u$ can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least $u$".

We will also need to use the order $\preceq$ on $\square X$ defined by: $\phi \preceq \psi$ when $\forall f \in [0,1]^X : \mathrm{E}_\phi[f] \geq \mathrm{E}_\psi[f]$. The reversal is needed to make the analogy to logic intuitive. Indeed, $\phi \preceq \psi$ can be interpreted as "$\phi$ implies $\psi$"[2], the meet operator $\wedge$ can be interpreted as logical conjunction and the join operator $\vee$ can be interpreted as logical disjunction.

Claim: $a^* = \operatorname{argmax}_{a \in A} \max \{ u \in [0,1] \mid \beta' \wedge \Omega_a \preceq R_u \}$

(Actually I only checked it when we restrict to crisp infradistributions, in which case $\wedge$ is intersection of sets and $\preceq$ is set containment, but it's probably true in general.)

Now, $\beta' \wedge \Omega_a \preceq R_u$ can be interpreted as "the conjunction of the belief $\beta'$ and $\Omega_a$ implies $R_u$". Roughly speaking, "according to $\beta'$, if the predicted action is $a$ then the expected reward is at least $u$". So, our decision rule says: choose the action that maximizes the value for which this logical implication holds (but "holds" is better thought of as "is provable", since we're talking about the agent's belief). Which is exactly the decision rule of MUDT!

  1. Apologies for the potential confusion between $\square$ as "space of infradistributions" and the $\square$ of modal logic (not used in this post). ↩︎

  2. Technically it's better to think of it as "$\psi$ is true in the context of $\phi$", since it's not another infradistribution so it's not a genuine implication operator. ↩︎
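
To make the two descriptions of the decision rule concrete, here is a minimal Python sketch for the crisp case only. Everything in it is an illustrative assumption rather than part of the original argument: the two-state, two-action toy problem, the reward table, the representation of a crisp belief as a finite list of probability vectors (the extreme points of a convex set of distributions), and the grid search over candidate reward bounds. The first rule maximizes worst-case expected reward directly. The second asks, for each action, for the largest bound such that "the belief, together with Omega predicting this action, implies expected reward at least that bound", which in the crisp case reduces to checking every extreme point.

```python
# Hypothetical toy problem (all names and numbers are assumptions for illustration).
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
REWARD = {
    ("a0", "s0"): 0.9, ("a0", "s1"): 0.2,
    ("a1", "s0"): 0.5, ("a1", "s1"): 0.6,
}

# Crisp belief: extreme points of a convex set of distributions over STATES.
# A linear functional attains its minimum over a convex hull at an extreme
# point, so checking these two vectors is enough.
BELIEF = [
    {"s0": 0.8, "s1": 0.2},
    {"s0": 0.3, "s1": 0.7},
]


def expected_reward(mu, action):
    """Expected reward of `action` under a single distribution `mu`."""
    return sum(mu[s] * REWARD[(action, s)] for s in STATES)


def worst_case_reward(action):
    """Infra-expectation of the reward: minimum over the credal set."""
    return min(expected_reward(mu, action) for mu in BELIEF)


def ibdt_action():
    """Direct rule: maximize worst-case expected reward."""
    return max(ACTIONS, key=worst_case_reward)


def implication_holds(action, u, tol=1e-12):
    """Crisp-case containment check.

    Conditioning on "Omega predicts `action`" leaves exactly the belief's
    distributions paired with that prediction; the implication holds when
    every one of them has expected reward at least `u`.
    """
    return all(expected_reward(mu, action) >= u - tol for mu in BELIEF)


def best_provable_bound(action, grid=1000):
    """Largest grid value u for which the implication holds."""
    return max(k / grid for k in range(grid + 1)
               if implication_holds(action, k / grid))


def mudt_style_action():
    """Alternative rule: maximize the best bound that the belief implies."""
    return max(ACTIONS, key=best_provable_bound)


if __name__ == "__main__":
    for a in ACTIONS:
        print(a, worst_case_reward(a), best_provable_bound(a))
    print("direct rule:", ibdt_action())
    print("implication rule:", mudt_style_action())
```

In this toy problem both rules select `a1` (worst-case expected reward about 0.52, versus about 0.41 for `a0`), even though `a0` looks better under the first distribution alone. The grid search is only a stand-in for the maximum over bounds; it agrees with the direct rule here because the best bound for each action is exactly its worst-case expected reward.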

What to do with imitation humans, other than asking them what the right thing to do is?

You could try to infer human values from the "sideload" using my "Conjecture 5" about the AIT definition of goal-directed intelligence. However, since it's not an upload and, like you said, it can go off-distribution, that doesn't seem very safe. More generally, alignment protocols should never be open-loop.

I'm also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.

Gurkenglas' answer seems to me like something that can work, if we can somehow be sure the sideload doesn't become superintelligent, for example, given an imitation plateau.

Vanessa Kosoy's Shortform

Another thing that might happen is a data bottleneck.

Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general).

Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.

In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. Moreover, simulating progress requires, by design, going off-distribution along certain dimensions, which might make things worse.

Vanessa Kosoy's Shortform

The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario.

EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.

Vanessa Kosoy's Shortform

I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.

Vanessa Kosoy's Shortform

That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.

Vanessa Kosoy's Shortform

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevertheless, some superhuman performance might be produced by sped-up simulation, reducing noise in human behavior and controlling the initial conditions (e.g. simulating a human on a good day). As a result, we will have some period of time during which AGI is already here, automation is in full swing, but there's little or no further escalation. At the end of this period, the missing ingredients will be assembled (maybe with the help of AI researchers) and superhuman AI (possibly a fast takeoff) begins.

It would be interesting to try to work out the consequences of such a scenario and its implications for AI strategy.

  1. Such as GPT-n ↩︎

Honoring Petrov Day on LessWrong, in 2020

Wait, what? I'm confused by this comment. Did you want to nuke the frontpage? It seems inconsistent with what you wrote in the other comment.
