(Note: This was languishing in a drafts folder for a while, and probably isn't quite right in various ways. I'm posting it because I expect it's better to share flawed thoughts than to sit on the post until I'm satisfied with it, i.e., forever.)
Let's play a game of "what do you think you know, and why do you think you know it?".
Imagine that you're about to launch an AGI. What you think you know is that, with at least 50% confidence (we’re of course not looking for proofs — that would be crazy), the AGI is going to execute some pivotal act that ends the acute risk period in a good way. Why do you think you know that?
Insofar as people’s alignment proposals can be construed as answers to this question, we have the option of answering with one of these proposals. I might very roughly classify the existing proposals into the following bins:
1. Output evaluation approaches. You know what the AGI is going to do with sufficient precision that it screens off any alignment concerns. For example, your AGI system only outputs plans in the first place, and you've already reviewed the plan, and you're confident the plan will work, in a way that screens off any other worry about the AGI being misaligned.
2. Cognitive interpretability approaches. You understand the AGI's cognition sufficiently well that, while you may not be sure what it's going to do, you're confident that it's going to be good. You aren't worried that it will kill all humans, because you understand how its plan came to be and what solution-spaces it was searching to solve various sub-problems and so on, and you're confident no consideration was ever given to human-killing.
3. Heavy-precedent approaches. You have run this AGI before on many similar tasks, and trained out all the hiccups. While you might not know precisely what it's going to do, and you might not know what's going on inside its mind, you've been around the block a few times, and the task it's about to perform is sufficiently similar to other tasks it has empirically succeeded at, justifying your confidence.
Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can't evaluate the effectiveness of a plan capable of ending the acute risk period, and because the real plan is less like a story and more like a tree.
For an example of “humans can't reliably evaluate the effectiveness of this class of plans”, imagine that the plan is an enormous bitstring that's going to be sent to the motor outputs. If you decode the string, you find that it figures out how to make long DNA strands that allegedly code for a protein factory that can be used to build a general-purpose nanofactory. You're hard-pressed, however, to confirm that this is actually (all and only) what the plan does.
For an example of “the real plan is less like a story and more like a tree”, imagine that the AI's plan is "I'm going to build a wetlab, then do a bunch of experimentation, then think about the results of the experiments in various ways and build a protein factory that builds a nanofactory that I'm going to experiment with until I figure out how to build nanomachines that can be used for some good pivotal act". In order to trust that this sort of abstract plan doesn't kill you when put into practice, you have to trust the system's thinking and its notion of 'goodness', which is going to dump you pretty quickly into cognitive-interpretability-style justification.
Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm, because we're not building minds but rather training minds, and we have very little grasp of their internal thinking, and there are convergent instrumental reasons to expect things to go wrong by default, and the social environment doesn't seem to me to be fighting against those defaults with anything nearing the force I expect is necessary.
Roughly speaking, I think that heavy-precedent approaches are doomed because I haven't myself been able to think of any pivotal action that has safe analogs we can do a bunch of empiricism on; nor have I heard a concrete proposal like this that strikes me as realistic from anyone else. "Well, it never killed all humans in the toy environments we trained it in (at least, not after the first few sandboxed incidents, after which we figured out how to train blatantly adversarial-looking behavior out of it)" doesn't give me much confidence. If you're smart enough to design nanotech that can melt all GPUs or whatever (disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist) then you're probably smart enough to figure out when you're playing for keeps, and all AGIs have an incentive not to kill all "operators" in the toy games once they start to realize they're in toy games.
So that's not a great place to be.
The doomedness of cognitive interpretability approaches seems to me to be the weakest. And indeed, this is where it seems to me that many people are focusing their efforts, from one angle or another.
If I may continue coarsely classifying proposals in ways their advocates might not endorse, I'd bin a bunch of proposals I've heard as hybrid approaches, that try to get cognitive-interpretability-style justification by way of heavy-precedent-style justification.
E.g., Paul Christiano’s plan prior to ELK was (very roughly, as I understood it) to somehow get ourselves into a position where we can say "I know the behavior of this system will be fine because I know that its cognition was only seeking fine outcomes, and I know its behavior was only seeking fine outcomes because its cognition is composed of human-esque parts, and I know that those human-esque parts are human-esque because we have access to the ground truth of short human thoughts, and because we have heavy-precedent-style empirical justification that the components of the overall cognition operate as intended."
(This post was mostly drafted before ELK. ELK looks more to me like a different kind of interpretability+precedent hybrid approach — one that tries to get AGI-comprehension tools (for cognitive interpretability), and tries to achieve confidence in those tools via "we tried it and saw" arguments.)
I'm not very optimistic about such plans myself, mostly because I don't expect the first working AGI systems to have architectures compatible with this plan, but secondarily because of the cognitive-interpretability parts of the justification. How do we string locally-human-esque reasoning chunks together in a way that can build nanotech for the purpose of a good pivotal act? And why can that sort of chaining not similarly result in a system that builds nanotech to Kill All Humans? And what made us confident we're in the former case and not the latter?
But I digress. Maybe I'll write more about that some other time.
Cf. Evan Hubinger's post on training stories. From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability, so I'm not too sold on his decomposition, but I endorse thinking about the question of how and where we could (allegedly) end up knowing that the AGI is good to deploy.
(Note that it's entirely possible that I misunderstood Evan, and/or that Evan's views have changed since that post.)
An implicit background assumption that's loud in my models here is the assumption that early AGI systems will exist in an environment where they can attain a decisive strategic advantage over the rest of the world.
I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.
Many locals seem to expect a smoother and slower transition from here to superhumanly capable general-purpose science AI — a transition that somehow leaves no window where the world's most competent AGI can unilaterally dominate the strategic landscape. I admit I have no concrete visualization of how that could go (and hereby solicit implausibly-detailed stories to make such scenarios seem more plausible to me, if you think outcomes like this are likely!). Given that I have a lot of trouble visualizing such worlds, I'm not a good person to talk about where our justifications could come from in those worlds.
I might say more on this topic later, but for now I just want to share this framing, and solicit explicit accounts of how we're supposed to believe that your favorite flavor of AGI is going to do good stuff.
Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?
Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?
I'm not Nate or a military historian, but to me it seems pretty likely that a ~100 human-years more technologically advanced actor could get a decisive strategic advantage over the world.
This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.
One way in which the world seems brittle, in the sense of having free energy an AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result, I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor. There are good reasons to suspect that DL-based general intelligence will end up with something similar simply due to the convergent optimization pressure to communicate complex thought vectors to/from human brains through a low-bitrate channel.
Intelligence potential of architecture != intelligence of trained system
The intelligence of a trained system depends on the architectural prior, the training data, and the compute/capacity. Take even an optimally powerful architectural prior - one that would develop into a superintelligence if trained on the internet with reasonable compute - and it would still end up nearly as dumb as a rock if trained solely on Atari Pong. Somewhere in between the complexity of Pong and our reality exists a multi-agent historical sim capable of safely confining a superintelligent architecture and iterating on altruism/alignment safely. So by the time that results in a system that is "smart enough to design nanotech", it should already be at least as safe as humans. There are of course ways that strategy fails, but they don't fail because 'smartness' strictly entails unconfineability - which becomes clearer when you taboo 'smartness' and replace it with a slightly more detailed model of intelligence.
This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly as I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.
There was some related discussion here, to the effect that we could do something to try to make the AGI as verbal a thinker as possible, IIUC. (I endorse that as plausibly a good idea worth thinking about and trying. I don’t see it as sufficient / airtight.)
Though note that "we could do something to try to make the AGI as verbal a thinker as possible" is a far weaker claim than "A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor.". The corresponding engineering problem is much harder, if we have to do something special to make the AGI think mostly verbally. Also, the existence of verbal-reasoning-heavy humans is not particularly strong evidence that we can make "most" of the load-bearing thought-process verbal; it still seems to me like approximately all of the key "hard parts" of cognition happen on a non-verbal level even in the most verbalization-heavy humans.
The verbal monologue is just near the highest level of compression abstraction in a multi-resolution compressed encoding, but we are not limited to only monitoring at the lowest bitrate (highest level of abstraction/compression).
There is already significant economic pressure on DL systems towards being 'verbal' thinkers: nearly all large-scale image models are now image->text and text<-image models, and the corresponding world_model->text and text<-world_model design is only natural for robotics and AI approaching AGI.
This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1.) inner monologue need not be strictly verbal, 2.) we need only a few examples of strong human thinkers with verbal inner monologues to show that it isn't an efficiency disadvantage - so even if your brain type is less monitorable we are not confined to that design.
I also do not believe your central claim: based on my knowledge of neuroscience, disabling the brain modules responsible for your inner monologue would not only disable your capacity for speech, it would also seriously impede your cognition and render you largely incapable of executing complex long-term plans.
Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improve automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient Bayesian prediction) and the inner monologue is at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the abstraction hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.
As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inner monologue.
Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of "interpretable by default" and into "a race between your interpretability and my cognitive steganography".
Note that I, or most humans, didn't have a particular need to develop this exact type of cognitive steganography. (The "elephant in the brain" type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much, much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like "give me a weekend to play with this".) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.
For the latter (toy environments) part, I think we need to distinguish a couple possible types of argument:
I think Nate is talking about 2 & B, and you’re talking about 1 & (not sure about A vs B).
I think that 2 is fraught because “aligned” has a different meaning in a sandbox versus the real world. In the sandbox, an “aligned” model would be trying to help / empower / whatever the sandbox inhabitants, and in the real world, an “aligned” model would be trying to help / empower / whatever “humanity”.
I think that 1 is potentially fraught too, at least in the absence of A, in that it’s conceivable that we’d find a training procedure that will reliably build aligned models when run in sandboxes while reliably building misaligned models when run in the real world.
I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
I think there’s a continuum between B and A, and the more we can move from B towards A, the better I feel.
And I think my own time is better spent on trying to move from B towards A, compared to thinking through how to make the most realistic sandboxes possible. But I’m happy for you and anyone else to be doing the latter. And I’m also strongly in favor of people building tools and culture to make it more likely that future AGI programmers will actually do sandbox testing—I have advocated for one aspect of that here.
I mostly agree with these distinctions, but I would add a third and layer them:
1. Aligned architecture: An architectural prior that results in aligned agents after training in most complex environments of relevance.
2. Aligned training procedure: The combination of an architectural prior and early training environment design (not a specific environment, but a set of constraints/properties) which results in agents which are then aligned in a wide variety of deployment environments - including those substantially different from the early training environment (out-of-distribution robustness).
3. Aligned agent: The specific outcome of 1 or 2 which is known to be aligned (at least currently).
Current foundation models also vaguely follow this flow in the sense that the company/org controls the initial training environment and may enforce 'alignment' there before deployment (at which point other actors may be allowed to continue training through transfer or fine-tuning).
Humans are only 2-aligned rather than 1-aligned, and I think there are some good reasons for this. Given only simple circuits to guide/align the learning process, I think the best that can be done is to actually exploit distribution shift and use a specialized early training environment that is carefully crafted to provide positive examples that can be classified robustly - these then form the basis of a robust learned distance-based classifier that leverages the generative model (I have some more on this that I should probably put into a short post - but that's basically the proxy matching idea).
So ideally we'd want 1, but we probably will need to settle for 2. Once we have designs (arch + early training) that work well in a sandbox, we'd probably still want to raise them initially in a special early training env that has the right properties, then gradually introduce them to the real world and deploy (resulting in 3).
As for A vs B: ideally you may want A but you settle mostly for B. That's just how the world often works, how DL progressed, etc. We now have a more established theory of how DL works as approximate Bayesian inference, but what actually drove most progress was B-style tinkering.
Well that isn't quite right - when the AGI advances to human-level there are human-equivalents in the sandbox.
Also sandboxes shouldn't be exactly the same as the real world! The point is to create a distribution of environments that is wider than the real world, to thereby cover it and especially the future.
I definitely don't think this—in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:
See also: “A transparency and interpretability tech tree”
to solve the acute risk period, figure out how to convince everyone, including so8res and all future ASIs, that pivotal acts must not occur.
or to put it another way, what is the smallest pivotal act that prevents future pivotal acts? easy: a way to reliably convince pivotal actors to not pivotally act.
update: actually non-aggressive pivotal acts would be great. eg, anyone have any ideas for how to make computers pivotally more secure? any ideas for how to end the social engineering threat to the world that spam is creating? seems to me that there are only a few core threat paths that represent most of the attack surface. pivotal acts that close them without necessarily doing so aggressively would significantly increase the relative intelligence needed to destroy humanity, though of course it'd only be a relative fix and not something that can hold back a hard ASI.
These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.
The way you say this, and the way Eliezer wrote Point 30 in AGI ruin, sounds like you think there is no AGI text output with which humans alone could execute a pivotal act.
This surprises me. For one thing, if the AGI outputs the textbook of the future on alignment, I'd say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible. (And sure there's the possibility that we could be hacked through text so we only think it's safe or so, but I'd expect that this is significantly harder to achieve than just outputting a correct solution to alignment.)
But even if we say humans would need to do a pivotal act without AGI, I'd intuitively guess an AGI could give us the tools (e.g. non-AGI algorithms we can understand) and relevant knowledge to do it ourselves.
To be clear, I do not think that we can get an AGI that gives us the relevant text to do a weak pivotal act, without the AGI destroying the world. And if we could do that, there may well be a safer way to let a corrigible AI do a pivotal act.
So I agree it's not a strategy that could work in practice.
The way you phrase it sounds to me like it's not even possible in theory, which seems pretty surprising to me, which is why I ask whether you actually think that or if you meant it's not possible in practice:
I'm also curious about your answer on:
3. If we had a high-rank dath ilani keeper teleported into our world, but he is not allowed to build aligned or corrigible AI and cannot tell anyone how or make us find the solution to alignment etc, could he save the world without using AGI? By what margin? (Let's assume all the pivotal-act specific knowledge from dath ilan is deleted from the keeper's mind as he arrives here.)
(E.g. I'm currently at P(doom)>85%, but P(doom | tomorrow such a keeper will be teleported here)~=12%. (Most of the uncertainty comes from the possibility that my model is wrong. So I think with ~80% probability that if we got such a keeper, he'd almost certainly be able to save the world, but in the remaining 20%, where I misestimated something, it might still be pretty hard.))
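As a rough consistency check on those numbers, the law of total probability can be used to back out what "pretty hard" has to mean in the 20% branch. This is only a sketch: the 2% doom figure for the "model is right" branch is an illustrative stand-in for "almost certainly able to save the world", not a number stated above, and the function name is made up for this example.

```python
def implied_doom_if_model_wrong(p_doom_total, p_model_right, p_doom_if_right):
    """Solve p_total = p_right * d_right + (1 - p_right) * d_wrong for d_wrong."""
    p_model_wrong = 1.0 - p_model_right
    return (p_doom_total - p_model_right * p_doom_if_right) / p_model_wrong


# 12% overall and 80% "model is right" come from the estimate above.
# The 2% is an ASSUMED placeholder for "almost certainly saves the world".
d_wrong = implied_doom_if_model_wrong(
    p_doom_total=0.12, p_model_right=0.80, p_doom_if_right=0.02
)
print(round(d_wrong, 2))
```

Under that assumed 2%, the remaining branch would need to carry roughly a coin-flip chance of doom for the overall 12% to come out, which is at least compatible with "it might still be pretty hard".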
It’s at least not obvious to me that we would be able to tell apart “the textbook of the future on alignment” from “an artifact that purports to be the textbook of the future on alignment, but is in fact a manipulative trick”. At least, I think we wouldn’t be able to tell them apart in the limit of superintelligent AI.
I mean, it’s not like I’ve never found an argument compelling and then realized much later on that it was wrong. Hasn’t everyone?
I'd say that there's a big difference between fooling you into "brilliant, that really looks plausible" shortly after you read it, and a group of smart humans trying to deeply understand the concepts for months and trying to make really sure there are no loopholes. In fact, I'd expect that making us wrongly but strongly believe that everything works, even after months of scrutiny, is impossible even in the limit of superintelligence, though I do think the superintelligence could output some text that destroys/shapes the world as it'd like. And generally something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as said.
But yeah, if the people with the AGI aren't extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.
I think Nate's original argument holds, but might need a bit of elaboration:
I think there are classes of pivotal plans + their descriptions, such that if the AI gave one of them to us, we could tell whether it is pivotal good or pivotal bad. But also there are other classes of pivotal plans + their descriptions such that you will think you can tell the difference, but you can't.
So the observation "this seems good and I am super-convinced I could tell if it wasn't" -- by itself -- isn't enough. To trust the AI's plan, you need some additional argument for why it is the kind of AI that wouldn't try to deceive you, or why it wouldn't search through dangerous plans, yada yada. But essentially that means you aren't relying on the plan verification step anymore.
I'd guess we can likely reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I'd still intuitively expect that there are doable pivotal acts in those classes.
I mostly agree with you here, so I'm not the best candidate to write a convincing story that routes around these problems. I think there is some narrow glimmer of hope shining through a narrow sliver of space here though. Start from the point where you have a powerful plan-making AGI, and set it to making plans for pivotal acts, or at least plans for maneuvering you closer to making pivotal acts. Insist that the plan be human-readable and intended to be carried out by human actors. Have your powerful AGI, or other narrow AGIs, audit the plan as a whole and in parts. You can add small amounts of noise to the plan (e.g. switching words for synonyms) to avoid steganography. You can test out the auditing process itself with plans not written by the AGI.
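To make the synonym-noise idea a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `SYNONYMS` table, the `add_synonym_noise` helper, and the example plan are all made up, and a real version would need a far richer paraphraser. It also only disturbs information hidden in word choice; it says nothing about other steganographic channels.

```python
import random

# Hypothetical synonym table, invented purely for illustration.
SYNONYMS = {
    "build": ["construct", "assemble"],
    "quickly": ["rapidly", "promptly"],
    "buy": ["purchase", "acquire"],
}


def add_synonym_noise(plan, swap_probability=0.5, seed=None):
    """Randomly replace known words with a synonym.

    The goal is to scramble any bits an author might have encoded in
    specific word choices while keeping the plan readable to humans.
    """
    rng = random.Random(seed)
    noisy_words = []
    for word in plan.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_probability:
            noisy_words.append(rng.choice(options))
        else:
            noisy_words.append(word)
    return " ".join(noisy_words)


plan = "buy land then build a lab quickly"
noisy_plan = add_synonym_noise(plan, swap_probability=1.0, seed=0)
unchanged_plan = add_synonym_noise(plan, swap_probability=0.0, seed=0)
print(noisy_plan)
```

The same helper could be run over plans not written by the AGI, to check that the auditing process still reaches the same verdict on the noised and un-noised versions.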
You can wipe the memories of the plan from the AGI and modify parts of it with noise to make it a similar (rather than identical) agent. You can fine-tune versions of the AGI on different datasets with different emphases, again to make it a slightly different agent. You can test it out with smaller, safer plans. Then test it out with smaller legible steps towards maneuvering into position for a pivotal act (e.g. gaining money, getting a chosen candidate elected).
Hopefully by making a number of human-legible and human-enacted steps towards getting ready to make a pivotal act, you can reduce the complexity of a plan necessary to go the rest of the way. This also lets you use a less-powerful AGI to make the plan. The less super-humanly intelligent the AGI is, the more likely you are to be able to avoid failure of control or running into an undetected deceptive turn.
There are downsides with this approach. One big one is that it would probably take much more time to enact, and possibly have lower odds of success. Another is that there are a lot of potential infohazard risks to tiptoe around when trying to get human-readable plans out of an otherwise-boxed AGI (extracting a plan is an opening in the box, but hopefully could be done in a controlled safe way).
Would you be able to provide examples of other sorts of pivotal acts you have in mind?
To clarify, when we succeed by a "cognitive interpretability approach", I guess you mostly mean something like:
Whereas I guess many people might think you mean:
Let me rephrase. I think you think:
Am I (roughly) correct in that you hold those opinions?
(I do basically hold those opinions, though my model has big uncertainties.)
The speculative theory I outlined here is different to the three options you discussed and may also be closer to the mainstream approach to "AGI-like" capabilities we see today. It differs from the heavy precedent approach because, while it is precedent-bound, if you can show that there's enough overlap in the relevant priors then you get convergence as a theorem and not an ad-hoc speculation.