So8res

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results *even so.*

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem are a relatively-shallow consequence of the overly-strong assumption of .

My impression was that I had to go digging into the theorems to see what they said, only to be disappointed by how little resemblance they bore to what I'd heard John imply. (And it sounds to me like Lawrence, Leon, and Erik had a similar experience, although I might be misreading them on account of confirmation bias or w/e.)

I acknowledge that it's tricky to draw a line between "someone has math that they think teaches them something, and is inarticulate about exactly what it teaches" and "someone has math that they don't understand and are overselling". The sort of observation that would push me towards the former end in John's case is stuff like: John being able to gesture more convincingly at ways concepts like "tree" or "window" are related to his conserved-property math even in messy finite cases. I acknowledge that this isn't a super legible distinction and that that's annoying.

(Also, I had the above convos with John >1y ago, and perhaps John has simply changed since then.)

Note that I continue to think John's cool for pursuing this particular research direction, and I'd enjoy seeing his math further fleshed out (and with more awareness on John's part of its current limitations). I think there might be interesting results down this path.

John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

For the record, the reason I didn't speak up was less "MIRI would have been crushed" and more "I had some hope".

I had in fact had a convo with Elon and one or two convos with Sam while they were kicking the OpenAI idea around (and where I made various suggestions that they ultimately didn't take). There were in fact internal forces at OpenAI trying to cause it to be a force for good—forces that ultimately led them to write their 2018 charter, so, forces that were not entirely fictitious. At the launch date, I didn't know to what degree those internal forces would succeed, and I didn't want to be openly publicly hostile in a way that might undermine those efforts.

To be clear, my mainline guess was that OpenAI was going to be a force for ill, and I now think that my post on the topic was a mistake, and I now think it would have been significantly better for me to just bluntly say that I thought this was a bad development (barring some turnaround). (I also think that I was optimistically overestimating the potential of the internal forces for trying to make the whole operation net-good, in a way that probably wouldn't have withstood careful consideration—consideration that I didn't give.) But the intent in my communication was to extend an olive branch and leave room for the forces of change to produce such a turnaround, not to avoid retribution.

(And, to be explicit: I consider myself to have been taught a lesson about how it's pretty important to just straightforwardly speak your mind, and I've been trying to do that since, and I think I'd do better next time, and I appreciate the feedback that helped me learn that lesson.)

Good idea, thanks! I added an attempt at a summary (under the spoiler tags near the top).

Here's a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:

**goal:** make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).

**summary of plan:**

- the humans put in a herculean effort to build a multi-level world-model that is interpretable to them (ranging from quantum chemistry at the lowest level, to strawberries and plates at the top)
- we interpret this in a very conservative way, as a convex set of models that hopefully contains something pretty close to reality.
- we can then form the logical statement "this policy puts two cellularly-but-not-molecularly identical strawberries on a plate (and has no other high-level effect) with pretty-high probability across all of those models simultaneously"

**background context:** there's some fancy tools with very nice mathematical properties for combining probabilistic reasoning and worst-case reasoning.

**key hope:** these tools will let us interpret this "manual multi-level world-model" in a way that keeps enough of the probabilistic nature for tractable policies to exist, and enough adversarial nature for this constraint to be pretty tight.

in more concrete terms: for any given UFAI, somewhere in the convex hull of all the interpretations of the manual multi-level world model, there's a model that sees (in its high-level) the shady shit that the UFAI was hoping to slip past us. So such "shady" policies fail in the worst-case, and fail to satisfy the theorem. But also enough of the probabilistic nature is retained that your policies don't need to handle the literal worst-cases of thermodynamic heat, and so there are some "reasonable" policies that could satisfy the theorem.
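
(a toy sketch of the "across all of those models simultaneously" part, in Python. this is my own illustration and not Davidad's actual machinery: the names `MODELS`, `worst_case_success`, and `satisfies_spec`, the two policies, and all the numbers are made up. it assumes each model just hands you a probability that a policy produces the good high-level outcome. since that probability is linear in the mixture weights, the worst case over the convex hull is attained at one of the extreme models, so checking the extremes suffices.)

```python
from typing import Dict, List

# Hypothetical representation: a model maps a policy name to
# P(good high-level outcome and no other high-level effect) under that model.
Model = Dict[str, float]


def mixture_success(policy: str, models: List[Model], weights: List[float]) -> float:
    """Success probability under a convex combination of the given models."""
    assert len(models) == len(weights)
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m[policy] for w, m in zip(weights, models))


def worst_case_success(policy: str, models: List[Model]) -> float:
    """Minimum success probability over the convex hull of `models`.

    The mixture probability is linear in the weights, so the minimum
    sits at an extreme point; we only need to scan the listed models.
    """
    return min(m[policy] for m in models)


def satisfies_spec(policy: str, models: List[Model], threshold: float) -> bool:
    """True if the policy succeeds with probability >= threshold in every model."""
    return worst_case_success(policy, models) >= threshold


# Made-up numbers. The third model is the one that "sees" the shady
# side effect in its high level, so the shady policy scores badly there.
MODELS: List[Model] = [
    {"careful": 0.97, "shady": 0.99},
    {"careful": 0.96, "shady": 0.99},
    {"careful": 0.95, "shady": 0.10},
]

if __name__ == "__main__":
    for policy in ("careful", "shady"):
        print(policy, worst_case_success(policy, MODELS), satisfies_spec(policy, MODELS, 0.9))
```

the "shady" policy looks better than "careful" under the first two models, but fails the spec because one model in the set catches it; "careful" passes because it only has to be pretty good everywhere rather than robust to literally anything. the real proposal needs much more than this (the hard part is getting a set of models where both of those things are true at once).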

**capabilities requirements:** the humans need to be able to construct the world model; something untrusted and quite capable needs to search for proofs of the theorem; the policy extracted from said theorem is then probably an AGI with high capabilities but you've (putatively) proven that all it does is put strawberries on a plate and shut down so \shrug :crossed_fingers: hopefully that proof bound to reality.

(note: I'm simply attempting to regurgitate the idea here; not defend it. obvious difficulties are obvious, like "the task of finding such a policy is essentially the task of building and aligning an AGI" and "something that can find that policy is putting adversarial pressure on your theorem". even if proving the theorem requires finding a sufficiently-corrigible AGI, it would still be rad to have a logical statement of this form (and perhaps there's even some use to it if it winds up not quite rated for withstanding superintelligent adversaries?).)

Anticipating an obvious question: yes, I observed to Davidad that the part where we imagine convex sets of distributions that contain enough of the probabilistic nature to admit tractable policies and enough of the worst-case nature to prevent UFAI funny business is where a bunch of the work is being done, and that if it works then there should be a much smaller example of it working, and probably some minimal toy example where it's easy to see that the only policies that satisfy the analogous theorem are doing some new breed of optimization, that is neither meliorization nor satisfaction and that is somehow more mild. And (either I'm under the illusion of transparency or) Davidad agreed that this should be possible, and claims it is on his list of things to demonstrate.

(note: this is the fragment of Davidad's plan that I was able to distill out into something that made sense to me; i suspect he thinks of this as just one piece among many. I welcome corrections :-))

finally

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of pointing is hard" or "yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it's sorted out".

For instance, the latter response obtains if the "pointing" is done by naive training.

(Though I also have some sense that I see the situation as more fragile than you--there's lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.

Thanks! Cool, it makes sense to me how we can make the pullback of with , in different ways to get different line bundles, and then tensor them all together. (I actually developed that hypothesis during a car ride earlier today :-p.)

(I'm still not quite sure what the syntax means, but presumably the idea is that there's an automorphism on 1D vector fields that flips the sign, and we flip the sign of the negative-charge line bundles before tensoring everything together?)

(Also, fwiw, when I said "they're all isomorphic to ", I meant that I didn't expect to figure much out by looking at a single fiber in isolation, and did not mean to imply that there was a canonical isomorphism; it's clear to me that lacking access to a particular isomorphism is kinda the whole point. That said, I appreciate the pedagogy anyway! I prefer overexplanations to underexplanations whenever my convo-partner is up for generating them.)

Thanks again!

(oops! thanks. i now once again think it's been fixed (tho i'm still just permuting things rather than reading))