Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.


My Current Take on Counterfactuals

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren't always necessary).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

"Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party

What if the super intelligent deity is less than maximally evil or maximally good? (E.g. the deity picking the median-performance world)

Thinking of the worst-case is just a mathematical reflection of the fact we want to be able to prove lower bounds on the expected utility of our agents. We have an unpublished theorem that, in some sense, any such lower bound guarantee has an infra-Bayesian formulation.

Another way to justify it is the infra-Bayesian CCT (see "Complete Class Theorem Weak Version" here).

What about the dutch-bookability of infraBayesians? (the classical dutch-book arguments seem to suggest pretty strongly that non-classical-Bayesians can be arbitrarily exploited for resources)

I think it might depend on the specific Dutch book argument, but one way infra-Bayesians escape them is by... being equivalent to certain Bayesians! For example, consider the setting where your agent has access to random bits that the environment can't predict. Then, infra-Bayesian behavior is just the Nash equilibrium in a two-player zero-sum game (against Murphy). Now, the Nash strategy in such a game is the (Bayes) optimal response to the Nash strategy of the other player, so it can be regarded as "Bayesian". However, the converse is false: not every best response to Nash is in itself Nash. So, the infra-Bayesian decision rule is more restrictive than the corresponding Bayesian decision rule, but it's a special case of the latter.
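As a toy illustration of the last point (a sketch, not anything from the original discussion), take matching pennies as the zero-sum game against Murphy. The payoff matrix and the grid search below are assumptions chosen for simplicity. Every agent strategy is a best response to Murphy's Nash strategy, but only the 50/50 mix is itself maximin.

```python
# Payoff to the agent in matching pennies (hypothetical toy game).
# Rows: agent actions, columns: Murphy's actions.
PAYOFF = [[1, -1],
          [-1, 1]]

def expected_payoff(p, q):
    """Expected agent payoff when the agent plays row 0 with probability p
    and Murphy plays column 0 with probability q."""
    agent = [p, 1 - p]
    murphy = [q, 1 - q]
    return sum(agent[i] * murphy[j] * PAYOFF[i][j]
               for i in range(2) for j in range(2))

def worst_case(p):
    """Payoff guaranteed by p. The payoff is linear in q, so the minimum
    is attained at one of Murphy's pure strategies."""
    return min(expected_payoff(p, 0.0), expected_payoff(p, 1.0))

# Maximin (infra-Bayesian-style) strategy, found by a coarse grid search.
grid = [i / 100 for i in range(101)]
maximin_p = max(grid, key=worst_case)

# Murphy's Nash strategy in this game is q = 0.5. The pure strategy p = 1.0
# is a best response to it, yet it guarantees much less than the maximin mix.
pure_vs_nash = expected_payoff(1.0, 0.5)
pure_guarantee = worst_case(1.0)
```

Here `maximin_p` is the unique Nash strategy, while `p = 1.0` earns the same expected payoff against Murphy's Nash strategy but has a strictly worse guarantee, which is the sense in which the infra-Bayesian rule is more restrictive.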

Is there a meaningful metaphysical interpretation of infraBayesianism that does not involve Murphy? (similarly to how Bayesianism can be metaphysically viewed as "there's a real, static world out there, but I'm probabilistically unsure about it")

I think of it as just another way of organizing uncertainty. The question is too broad for a succinct answer, I think, but here's one POV you could take: Let's remember the frequentist definition of probability distributions as time limits of frequencies. Now, what if the time limit doesn't converge? Then, we can make a (crisp) infradistribution instead: the convex hull of all limit points. Classical frequentism also has the problem that the exact same event never repeats itself. But in "infra-frequentism" we can solve this: you don't need the exact same event to repeat, you can draw the boundary around what counts as "the event" any way you like.
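A small sketch of the non-converging case, under assumptions of my own choosing: a binary sequence built from alternating blocks of ones and zeros whose lengths double each time. The running frequency of ones never settles, and on a finite prefix its limit points can be approximated by the range of values visited after a burn-in. For a binary event, the convex hull of the limit points is just an interval of probabilities.

```python
def running_frequencies(num_blocks):
    """Running frequency of ones for a sequence made of alternating blocks
    (ones, zeros, ones, ...) where block k has length 2**k."""
    freqs = []
    ones = 0
    total = 0
    for k in range(num_blocks):
        bit = 1 if k % 2 == 0 else 0
        for _ in range(2 ** k):
            ones += bit
            total += 1
            freqs.append(ones / total)
    return freqs

def limit_point_interval(freqs, burn_in):
    """Finite-prefix approximation of [liminf, limsup] of the frequencies:
    the range visited after discarding the first burn_in steps."""
    tail = freqs[burn_in:]
    return min(tail), max(tail)

freqs = running_frequencies(16)
lower, upper = limit_point_interval(freqs, burn_in=2 ** 10)
```

For this particular sequence the frequencies keep oscillating between roughly 1/3 and 2/3, so the resulting crisp infradistribution would be the set of all Bernoulli distributions with parameter in that interval, rather than a single probability.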

Once we go from passive observation to active interaction with the environment, your own behavior serves as another source of Knightian uncertainty. That is, you're modeling the world in terms of certain features while ignoring everything else, but the state of everything else depends on your past behavior (and you don't want to explicitly keep track of that). This line of thought can be formalized in the language of infra-MDPs (unpublished). And then ofc you complement this "aleatoric" uncertainty with "epistemic" uncertainty by considering the mixture of many infra-Bayesian hypotheses.

"Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party

There is a formal sense in which "predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance", see our latest post. It's similar to MUDT, where, if you prove a contradiction then you can prove utility is as high as you like.

"Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party

The exact same thing is true for classical probability theory: you have distributions, mixtures of distributions and linear functionals respectively. So I'm not sure what new difficulty comes from infra-Bayesianism?

Maybe it would help thinking about infra-MDPs and infra-POMDPs?

Also, here I wrote about how you could construct an infra-Bayesian version of the Solomonoff prior, although possibly it's better to do it using infra-Bayesian logic.

My Current Take on Counterfactuals

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there are no traps" is a sufficient condition for learnability. For example, if you have a finite set of "episodic" hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al., which is formulated in the language of game theory, but can be regarded as an infra-Bayesian regret bound for infra-MDPs.
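To make the Exp3 remark concrete, here is a minimal sketch of the standard Exp3 algorithm with each arm standing for one hypothesis. Pulling arm i is read as "play, for one episode, the policy that is optimal under hypothesis i". The Bernoulli reward model and the numbers in `HYPOTHETICAL_MEANS` are purely illustrative assumptions. Exp3 itself makes no stochastic assumption about rewards.

```python
import math
import random

def exp3(reward_fn, num_arms, num_rounds, gamma, rng):
    """Run Exp3 and return the sampling distribution used in the final round.
    reward_fn(arm, rng) must return a reward in [0, 1]."""
    weights = [1.0] * num_arms
    probs = [1.0 / num_arms] * num_arms
    for _ in range(num_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Importance-weighted estimate of the chosen arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the weights never overflow; this leaves probs unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return probs

# Hypothetical per-episode success rates of the policies tied to each hypothesis.
HYPOTHETICAL_MEANS = [0.2, 0.5, 0.9]

def episode_reward(arm, rng):
    """Toy episodic reward: 1 with the arm's success rate, else 0."""
    return 1.0 if rng.random() < HYPOTHETICAL_MEANS[arm] else 0.0

final_probs = exp3(episode_reward, num_arms=3, num_rounds=5000,
                   gamma=0.1, rng=random.Random(0))
best_arm = max(range(3), key=final_probs.__getitem__)
```

Because episodes do not carry state over, a bad episode never costs more than one episode's worth of reward, which is the "no traps" property the argument relies on.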

Radical Probabilism and InfraBayes are plausibly two orthogonal dimensions of generalization for rationality. Ultimately we want to generalize in both directions, but to do that, working out the radical-probabilist (IE logical induction) decision theory in more detail might be necessary.

True, but IMO the way to incorporate "radical probabilism" is via what I called Turing RL.

I don’t know how to talk about the CDT vs EDT insight in the InfraBayes world.

I'm not sure what precisely you mean by "CDT vs EDT insight" but our latest post might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations and actions, EDT-style.

Perhaps more importantly, the Troll Bridge insights. As I mentioned in the beginning, in order to meaningfully solve Troll Bridge, it’s necessary to “respect logic” in the right sense. InfraBayes doesn’t do this, and it’s not clear how to get it to do so.

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

From your reply to Paul, I understand your argument to be something like the following:

  1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
  2. If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
  3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
  4. Given the technical knowledge to design cooperative AI, the incentives are in favor of cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
  5. We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn't have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to "just" solving single-single alignment.

Formal Solution to the Inner Alignment Problem

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both and in a similar way.

More generally, I guess I'm more optimistic than you about solving all such philosophical liabilities.

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don't understand the proposal. Is there a link I should read?

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not "free complexity" because it's not coming from a simplicity prior at all. For a program of length , you need a particular DFA of size . However, the actual DFA is of expected size with . The probability of having the DFA you need embedded in that is something like . So moving everything to the bridge makes a much less likely hypothesis.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

I do see two reasons why multipolar scenarios might require more technical research:

  1. Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
  2. In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability[1].

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I'm confused.

  1. In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs.

Formal Solution to the Inner Alignment Problem

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's a sketch of a proposal for how to solve this. Let's construct our prior to be the convolution of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that's sampled as follows:

  • First, sample a hypothesis from the Solomonoff prior
  • Second, choose a number according to some simple distribution with high expected value (e.g. ) with
  • Third, sample a DFA with states and a uniformly random transition table
  • Fourth, apply to the output of
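The four sampling steps above can be sketched roughly as follows. This is only one possible reading and leans on several assumptions of mine: the Solomonoff prior cannot be sampled in ordinary code, so `toy_physics` is a hypothetical stand-in for the sampled hypothesis; the geometric distribution in `sample_num_states` is just one "simple distribution with high expected value"; and the DFA is read as a transducer with a random output table, since it has to turn one bit stream into another.

```python
import math
import random

def sample_num_states(rng, mean_states=50):
    """Stand-in for the simple, high-mean distribution over state counts:
    a geometric distribution with the given mean (requires mean_states > 1)."""
    p = 1.0 / mean_states
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return 1 + int(math.log(u) / math.log(1.0 - p))

def sample_transducer(rng, num_states):
    """Uniformly random next-state and output tables over binary input.
    The output table is an assumption about how the DFA emits observations."""
    next_state = [[rng.randrange(num_states) for _ in (0, 1)]
                  for _ in range(num_states)]
    output = [[rng.randrange(2) for _ in (0, 1)]
              for _ in range(num_states)]
    return next_state, output

def apply_transducer(machine, bits):
    """Feed the 'physics' output through the 'bridge rule' machine."""
    next_state, output = machine
    state = 0
    observed = []
    for b in bits:
        observed.append(output[state][b])
        state = next_state[state][b]
    return observed

def toy_physics(length):
    """Hypothetical stand-in for the output stream of a sampled hypothesis."""
    return [(i * i) % 3 % 2 for i in range(length)]

rng = random.Random(0)
num_states = sample_num_states(rng)
machine = sample_transducer(rng, num_states)
raw = toy_physics(20)
observed = apply_transducer(machine, raw)
```

The point of the construction is visible in the code: the machine is cheap to run (one table lookup per bit) even though its tables may take many bits to describe.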

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of , however the source of our trouble is also "merely" a factor of .

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the limit).

Vanessa Kosoy's Shortform

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ε and your quantilization fraction is φ, then the AI's probability of corruption is bounded by ε/φ.
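The bound can be checked on a toy example. Below is a minimal sketch of a quantilizer over a finite action set, with a made-up uniform user policy over 100 actions and made-up utilities. The names `eps` (corruption probability under the user policy) and `phi` (quantilization fraction) are chosen here for illustration. The two corrupt actions are given the highest apparent utility, which is the worst case for the bound.

```python
from fractions import Fraction

def quantilize(base, utility, phi):
    """Condition the base distribution on its top-phi fraction of probability
    mass, ranked by utility (the last included action may be split)."""
    order = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = phi
    result = {}
    for action in order:
        if remaining <= 0:
            break
        mass = min(base[action], remaining)
        result[action] = mass / phi
        remaining -= mass
    return result

# Hypothetical user policy: uniform over 100 actions.
actions = range(100)
base = {a: Fraction(1, 100) for a in actions}

# Worst case: the corrupt actions look best according to the learned utility.
corrupt = {98, 99}
utility = {a: (1000 + a if a in corrupt else a) for a in actions}

eps = sum(base[a] for a in corrupt)   # corruption probability under the user policy
phi = Fraction(1, 10)                 # quantilization fraction

quantilized = quantilize(base, utility, phi)
corruption_probability = sum(p for a, p in quantilized.items() if a in corrupt)
```

In this worst case the bound is tight: the quantilized policy puts all of the corrupt mass inside the top fraction, so its corruption probability equals `eps / phi`.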

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
