Formal Inner Alignment, Prospectus

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).

Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident that there's as little malignancy/deception as possible such that just defusing the arguments that we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.

Mundane solutions to exotic problems

Your link is broken.

For reference, the first post in Paul's ascription universality sequence can be found here (also Adam has a summary here).

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.

[AN #148]: Analyzing generalization across more axes than just accuracy or loss

Read more: Section 1.3 of this version of the paper

This is in the wrong spot.

Covid 4/22: Crisis in India

Is there ways to share this with EAs?

You could write a post about it on the EA Forum.

NTK/GP Models of Neural Nets Can't Learn Features

(moved from LW to AF)

Meta: I'm going to start commenting this on posts I move from LW to AF just so there's a better record of what moderation actions I'm taking.

Homogeneity vs. heterogeneity in AI takeoff scenarios

I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.

I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn't a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn't involve taking over the world. By default, in the case of deception, my expectation is that we won't get a warning shot at all—though I'd more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.

Open Problems with Myopia

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.

Open Problems with Myopia

Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.

Load More