TLDR: I think synthesizing bargaining protocols with pleasing developer experience, high quality "guarantees" (in the formal verification sense), and low cost will help us a lot around a multipolar takeoff, and I'd like to submit my worldview for a community audit.
Tremendous thanks to the attendees of Alignable Structures: I cultivated the self-confidence I needed to dive into this research agenda as a direct result of the vibes that weekend.
This document does not endeavor to be a thorough contribution in itself; it is meant to be written quickly in spare moments between papers and textbooks. It should be clear to you what I’ve read, and I’m seeking your help to prioritize what else I should read in the prioritization / theory of change / inside view cluster.
In the interest of keeping the post short, I’m omitting many details about why my worldview is where it is, so I encourage you to prod me about specific things you’d like to see elaboration on in the comments.
Claim: I consider takeoff as a unit cube in three dimensions. Research agendas ought to clearly locate themselves, because theories of change are betting on which world we’ll end up in.
I omit continuity, because I can never think of any actionable insight that would result from a honed forecast of that one. Arguably, agents vs. services is another axis to play with, but that is perhaps for another post. So you may think a unit cube in 4 or 5 dimensions is more appropriate or complete.
Takeoff comes in unipolar and multipolar flavors. When I say a scenario is strongly unipolar, I mean it consists mostly of the classical Yudkowsky-Bostrom content, i.e. recursive self-improvement leading to clear dominance by one agent, which we call a singleton, regardless of any alignment properties it lacks. The strongly multipolar scenario is very chaotic, with a ton of AIs flying around.
Takeoff comes in homogenous and heterogenous flavors. In the original post, Evan defines the terms:
how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff.
Evan then forecasts homogenous takeoff.
Takeoff can be fast or slow. I set the extremely fast end at ramping from wherever we are now (end of 2022) to transformative AI by 2025, and the extremely slow end at ramping up from wherever we are now to transformative AI by 2055, which is Ajeya2020’s 60% number plus 5 years.
I give a table of the sheets at the extreme end of each scenario type. I.e., the sheet (-1, y, z) corresponds to a high probability of strongly unipolar takeoff. Implicitly, there’s a background notion of transformation into multiple meanings, where I presume that a lower probability of a stronger unipolar takeoff corresponds with a higher probability of a weaker unipolar takeoff, and so on.
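One way to make the cube concrete is the sketch below. It assumes each axis runs from -1 to 1, following the sheet notation (-1, y, z). The names `TakeoffForecast`, `speed_from_year`, and `on_unipolar_sheet`, the sign conventions on the homogeneity and speed axes, and the linear year-to-speed mapping are all illustrative assumptions rather than part of the framework.

```python
from dataclasses import dataclass

# From the text: the speed axis runs from TAI by 2025 (extremely fast)
# to TAI by 2055 (extremely slow).
FAST_YEAR = 2025
SLOW_YEAR = 2055


@dataclass(frozen=True)
class TakeoffForecast:
    """A point in the takeoff cube. Only 'polarity = -1 is strongly
    unipolar' comes from the text; the other sign conventions are assumed."""
    polarity: float     # -1 = strongly unipolar, +1 = strongly multipolar
    homogeneity: float  # -1 = strongly homogenous, +1 = strongly heterogenous (assumed)
    speed: float        # -1 = extremely fast, +1 = extremely slow (assumed)

    def __post_init__(self):
        # Reject points that fall outside the cube.
        for name in ("polarity", "homogeneity", "speed"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [-1, 1], got {value}")


def speed_from_year(tai_year: float) -> float:
    """Linearly map a TAI arrival year onto the speed axis, clamped to [-1, 1].
    The linear mapping is an assumption made for this sketch."""
    fraction = (tai_year - FAST_YEAR) / (SLOW_YEAR - FAST_YEAR)
    return max(-1.0, min(1.0, 2.0 * fraction - 1.0))


def on_unipolar_sheet(forecast: TakeoffForecast, tol: float = 1e-9) -> bool:
    """True when the forecast lies on the sheet (-1, y, z)."""
    return abs(forecast.polarity + 1.0) <= tol


# Usage: a strongly unipolar forecast with TAI arriving midway along the speed axis.
example = TakeoffForecast(polarity=-1.0, homogeneity=0.0, speed=speed_from_year(2040))
```

A research agenda could then locate itself by naming the region of this cube it is betting on, for example with a predicate over `TakeoffForecast` values.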
Then, I can eyeball the class of forecasts of scenarios for which I expect my stated research goals to be helpful as none other than the set
In other words,
I really like Critch & Krueger 2020, in which “alignment” is problematized precisely because “aligning multiple agents to multiple principals” is an incredibly confused sentence. Delegation (which specifically means comprehension, instruction, and control) is an improvement.
Claim: Most of the movement’s research portfolio is in single-single delegation
Claim: Single-single is a critical and difficult warmup task, but multi-multi is clearly the real thing
The research portfolio also needs infrastructure to build out connections between fields. In particular, I’m excited about interp (like searching for search) and agent foundations (like (A -> B) -> A) finding common ground (specifically a feedback loop of observing terms and forecasting types), and I mainly expect cooperative AI to factor in because I don’t think there’s a reasonable story you can tell about agency that does not account for all the other agents in the world. To pick on agent foundations in particular: the spherical elephant in the room has gotta be that the field focuses on one agent at a time, as if it were alone on a planet playing Factorio. I feel like that's a critical assumption in embedded agency research and things like that.
I expect that crunch time looks like 5-50 institutions, competing and trading with one another. I’m uncertain if there will be runaway advantages such that unipolar-like dominance eventually emerges, so it’s plausible that multipolar liftoff is a temporary state, but in that scenario, crucially, our collective capacities for bargaining, social choice, trading, etc. will be an extremely influential variable (one we can wiggle now!) on how that eventual dominance ends up.
I’m painting very broadly about scenarios I see emerging from about 2028 to TAI. The obvious hope is that my product ideas are robust to wide error bars!
At least two of these scenarios seem explicitly CAIS-like, so you may be wondering if I even believe in agency. My answer is that I think CAIS-like scenarios necessarily precede agents-running-amok scenarios, and furthermore, we don’t have actionable information right now about what to do once agents start running amok. In other words, I’m targeting my interventions toward crunch time, not toward glorious transhuman future or death. The prologue to an agent-based takeoff, if one is to occur, will provide us with better information about how to deal.
TLDR, it seems like a path forward for interventions in this class of scenarios is to build expertise around every logical approach to game theory. Questions like “what programming language’s interpretation is mixed strategies?” or “what would it feel like to write programs in a programming language whose terms are mixed strategies?” lead me to tentatively call this “semantic game theory”, though I suspect the open games literature (see below) may be a few steps ahead of me.
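To give a feel for "terms are mixed strategies", here is a minimal sketch of a tiny embedded language. It is a toy under stated assumptions, not a proposal for what semantic game theory should be: the combinators `pure` and `mix`, the dictionary representation, and the function `expected_payoff` are all hypothetical names chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def pure(action):
    """The term that always plays `action`."""
    return {action: Fraction(1)}


def mix(weight, left, right):
    """The term that plays `left` with probability `weight`, otherwise `right`."""
    weight = Fraction(weight)
    if not 0 <= weight <= 1:
        raise ValueError("weight must be a probability")
    combined = {}
    for strategy, scale in ((left, weight), (right, 1 - weight)):
        for action, prob in strategy.items():
            combined[action] = combined.get(action, Fraction(0)) + scale * prob
    # Drop zero-probability actions so equal strategies compare equal.
    return {action: prob for action, prob in combined.items() if prob != 0}


def expected_payoff(payoff, row, col):
    """Interpret two terms against a payoff table keyed by (row action, column action)."""
    return sum(
        p * q * payoff[(a, b)]
        for (a, p), (b, q) in product(row.items(), col.items())
    )


# Matching pennies, payoffs for the row player.
PENNIES = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}
uniform = mix(Fraction(1, 2), pure("H"), pure("T"))
value_vs_heads = expected_payoff(PENNIES, uniform, pure("H"))
```

Here a "program" is just an expression built from `pure` and `mix`, and its meaning is a probability distribution over actions. Exact `Fraction` arithmetic is used so that equalities between strategies are not blurred by floating point.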
Open source game theory (OSGT), that is, game theory when you can read each other’s source code, turns out to be an application of modal logic.
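For readers new to the setup, the sketch below shows only the "read each other's source code" part, in a one-shot prisoner's dilemma. It deliberately swaps in the simplest possible rule, exact source equality (often called a clique bot), rather than the modal-logic, provability-based agents the literature studies. The names `CLIQUE_BOT`, `DEFECT_BOT`, `run`, and `play` are hypothetical.

```python
# Each agent is a source string defining decide(my_source, their_source),
# which returns "C" (cooperate) or "D" (defect).

CLIQUE_BOT = '''
def decide(my_source, their_source):
    # Cooperate only with an exact textual copy of myself.
    return "C" if their_source == my_source else "D"
'''

DEFECT_BOT = '''
def decide(my_source, their_source):
    # Ignore the opponent's source and always defect.
    return "D"
'''


def run(source, opponent_source):
    """Execute one agent's source and ask it for a move.
    exec is used only because this is a toy; never do this with untrusted code."""
    namespace = {}
    exec(source, namespace)
    return namespace["decide"](source, opponent_source)


def play(source_a, source_b):
    """Return the pair of moves when each agent reads the other's source."""
    return run(source_a, source_b), run(source_b, source_a)


mirror_match = play(CLIQUE_BOT, CLIQUE_BOT)
mixed_match = play(CLIQUE_BOT, DEFECT_BOT)
```

Syntactic equality is brittle: two agents that differ by a single comment will defect against each other. That brittleness is one motivation for reasoning about what the opponent's code provably does instead, which is where the modal logic comes in.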
The applied category theory community provides a fully categorical story of arbitrary scenarios from classical game theory. Game theorists of the alignment community don’t appear to have pursued understanding of this story, unless I’ve missed something.
Nisan identified domain theory as a direction for getting results in OSGT with recursive beliefs.
I’m also putting a couple of days into stuff that’s more Wentworth or Garrabrant style, and I plan to take a week off sometime after Christmas to see how many preliminary sketches I can make for a heuristic argument assistant tool stack.
I agree those nice-to-haves would be nice to have. One could probably think of more.
I have basically no idea how to make these happen, so I'm not opinionated on what we should do to achieve these goals. We need some combination of basic research, building tools people find useful, and stuff in-between.
I'll admit I'm pessimistic, because I expect institutional inertia to be large and implementation details to unavoidably leave loopholes. But it definitely sounds interesting.
I'm a bit more optimistic about loopholes because I feel like if agents are determined to build trust, they can find a way.
I agree that institutional inertia is a problem, and more generally there's the problem of getting principals to do the thing. But it's more dignified to make alignment/cooperation technology available than not to make it.
On whose shoulders are we standing?
Some metaphor searches to find (some of) the prior work for each section:
My follow-through: I’m a bad employee in terms of consistency and dependability, and much of that would apply to independent research. I kick ass for stretches, then crash (3 to 5 months of asskicking per one month burned out).
holy crap same type of pattern. am currently in a burned out period, but feel like I could become productive again with active management, if you figure out where to buy that definitely let me know! I'm personally considering applying for universities.
p.s. this metaphor search found some amusing old ai alignment plans that I don't think are terribly useful but may be of historical interest to someone
I’m sniped by the areas of math I’m most aesthetically attracted to, and I’m creating a 300 IQ plan with a bajillion 4D chess moves to rationalize working on them.
While you might be risking wasting your time for all I know, this research plan as a whole seems extremely high quality to me and on the right track in a way few are. That said, I think you're underestimating how soon we'll see TAI.
(or maybe I don't know what people mean by TAI? I don't think all technology will be solved for several decades after TAI and hitting max level on AI does not result in science instantly being completed. many causal experiments and/or enormous high-precision barely-approximate simulations are still needed, part of the task is factorizing that, but it will still be needed.)