I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek's team at MIRI.
It looks like you wrote up a correct but abstract-and-approximate description of the problems with that approach. I claim that as you make your model of these problems more detailed and careful, refining the vague concepts and reducing the model uncertainties, the sense that this just might work will disappear.
I know this isn't very helpful to hear, and isn't convincing. But those are my thoughts.
We had a conversation in private messages to clarify some of this. I thought I'd publicly post one of my responses:
AI-2027 strategy
Step 1: Train and deploy Safer-1, a misaligned but controlled autonomous researcher. It’s controlled because it’s transparent to human overseers: it uses English chains of thought (CoT) to think, and faithful CoT techniques have been employed to eliminate euphemisms, steganography, and subtle biases.
Step 2: Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
Step 3: Train and deploy Safer-2, an aligned and controlled autonomous researcher based on the same architecture but with a better training environment that incentivizes the right goals and principles this time.
<...>
Step 4: Design, train, and deploy Safer-3, a much smarter autonomous researcher which uses a more advanced architecture similar to the old Agent-4. It’s no longer transparent to human overseers, but it’s transparent to Safer-2. So it should be possible to figure out how to make it both aligned and controlled.
Step 5: Repeat Step 4 ad infinitum, creating a chain of ever-more-powerful, ever-more-aligned AIs that are overseen by the previous links in the chain (e.g. the analogues of Agent-5 from the other scenario branch).
My post is mainly about how this step breaks:
Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
This process finds a set of training environments that lead the AI to have a corrigible-looking CoT. It doesn't find training environments that produce an AI that is truly corrigible.
So Safer-2 has been shaped by a process that selects for looking corrigible. It probably isn't actually corrigible, because there are lots of ways to be non-corrigible but corrigible-looking-to-the-design-process (a toy base-rate calculation below spells this out).
Step 4 involves putting a huge amount of trust in Safer-2. Since Safer-2 is untrustworthy, it betrays us in a way that isn't visible to us (which is extremely straightforward in this scenario because we've assumed there are important things that Safer-2 is managing that humans don't understand).
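To spell out the "lots of ways to be corrigible-looking" point, here's a toy base-rate calculation (all numbers invented purely for illustration, not estimates of anything):

```python
# Toy base-rate sketch (all numbers invented): selecting on "the CoT looks
# corrigible to the design process" mostly gets you corrigible-looking,
# non-corrigible policies, because those vastly outnumber the truly
# corrigible ones.

p_corrigible = 1e-3            # prior: truly corrigible policies are rare
p_looks_if_corrigible = 0.95   # corrigible policies usually look corrigible
p_looks_if_not = 0.05          # a small fraction of the rest also pass the check

p_looks = (p_corrigible * p_looks_if_corrigible
           + (1 - p_corrigible) * p_looks_if_not)

p_corrigible_given_looks = p_corrigible * p_looks_if_corrigible / p_looks
print(f"P(truly corrigible | looks corrigible) = {p_corrigible_given_looks:.3f}")
# ~0.019: even a check this discriminating leaves you almost entirely with
# corrigible-looking, non-corrigible policies.
```

The exact numbers don't matter; the point is that when corrigible-looking-but-not-corrigible policies are far more common than corrigible ones, selecting on appearance still leaves you overwhelmingly with the former.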
Yeah, I'd be excited to do so via conversations with people, after building a shared ontology about AGI internals. And maybe publishing that, if it feels shareable.
I am kinda disappointed by how relatively unpopular this post was, though; it makes me suspect that I'm missing something (likely just writing quality, but maybe I'm incorrectly extrapolating my model of people like you and Max and therefore misunderstanding my audience). This makes me think that writing a more detailed version of this post isn't the right next step for me. I think it might be a good project for others though.
I think there may be some real help from roughly human-level AGI that still thinks it's aligned, and/or is functionally aligned in those use cases (it hasn't yet hit much of "the ocean" in your metaphor). That could be really useful.
Yeah I agree with this, although the only way I can concretely imagine it helping is via the scenario I wrote at the end of the post.
I'd like to see more work on modeling conditions out at sea. That's how I view my work: trying to envision the most likely path from here to AGI, which I see as going through LLMs enhanced in very roughly brainlike directions.
Agreed, although three comments:
So I consider tiling agents and logical induction to be good work modelling the distribution shifts of learning/reasoning and self-modification.
I think there's a ton of model uncertainty still. Playing to our outs involves working toward alignment on the current path
You've gotta be careful with this particular chain of reasoning. It may look like model uncertainty implies hope, but that doesn't mean it implies hope for any specific piece of work.
involves working toward alignment on the current path AND trying to slow or stop progress.
These two things can sometimes be opposed to each other. E.g. I'd bet that alignment work at OpenAI has been net negative overall, by giving everyone there the impression that they might be doing the right thing.
Based on that uncertainty, I think it's quite possible that relatively minor and realistic changes to alignment techniques might be enough to make the difference. So that's what I'm working on; more soon.
Mining your model uncertainty for concretely useful actions just doesn't seem like the kind of thing that could work. I look forward to reading about it anyway.
By "retraining or patching", I'm not talking about Zvi's most forbidden technique. I'm talking about any kind of update to the training procedure, including making a different training environment. When you don't have a deep understanding of the problems and the distribution shifts, iteration on the training environment can just as easily hide problems as the most forbidden technique.
AI-2027 was wrong to imply that that strategy could plausibly work; it's alchemy.
You put a question mark next to this section:
I mean this to include things like LLM experiments showing that models plot to kill people to avoid shutdown, express preferences that are alarming in some way, or insert backdoors into code in some situations. At best these are weak analogies for the real problems, and studying most ways of making them go away in LLMs won't help make future AGI safer, even if future AGI is technologically descended from LLMs.
The real problems are caused by self-improvement, large-scale online learning, and massively increased options. The problems people study in LLMs are only weakly related to these.
This seems like it's building on or inspired by work you've done? Or was this team interested in embeddedness and reflective oracles for other reasons?
It's ridiculously long (which is great; I'll read through it when I get a chance). Do you have any pointers to sections that you think have particularly valuable insights?
An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up new strategies for accomplishing some goal it has, it intelligently evaluates whether that strategy violates the spirit of the deontology.
I think the "follow the spirit of the rule" thing is more like rule utilitarianism than like deontology. When I try to follow the spirit of a rule, the way that I do this is by understanding why the rule was put in place. In other words, I switch to consequentialism. As an agent that doesn't fully trust itself, it's worth following rules, but the reason you keep following them is that you understand why putting the rules in place makes the world overall better from a consequentialist standpoint.
So I have a hypothesis: It's important for an agent to understand the consequentialist reasons for a rule, if you want its deontological respect for that rule to remain stable as it considers how to improve itself.
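Here's a minimal sketch of the contrast I'm drawing (an invented toy; the names like world_model and Rule and the valve numbers are all hypothetical, not from anyone's post): checking a plan against the rule's letter alone, versus also checking its predicted consequences against the rule's purpose.

```python
# Toy sketch (hypothetical names throughout, invented for illustration): an
# agent whose rule-checking recruits its own world model, so a plan gets
# rejected if its predicted consequences defeat the purpose of the rule,
# even when the letter of the rule is technically satisfied.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    letter: Callable   # cheap check on the plan itself
    purpose: Callable  # consequentialist check on predicted outcomes


def world_model(plan):
    """Stand-in for the agent's predictive model: plan -> predicted outcome."""
    # A plan is just a dict of knob settings; "pressure" is the consequence.
    return {"pressure": plan.get("valve_a", 0) + plan.get("valve_b", 0)}


# Letter of the rule: "never open valve_a past 5".
# Purpose of the rule: "keep total pressure below 10".
rule = Rule(
    letter=lambda plan: plan.get("valve_a", 0) <= 5,
    purpose=lambda outcome: outcome["pressure"] < 10,
)


def allowed(plan, rule):
    """Reject a plan that violates either the letter or the predicted spirit."""
    return rule.letter(plan) and rule.purpose(world_model(plan))


# A "clever" plan that routes around the letter of the rule:
clever_plan = {"valve_a": 5, "valve_b": 20}

print(rule.letter(clever_plan))    # True  -- the letter alone lets it through
print(allowed(clever_plan, rule))  # False -- the purpose check catches it
```

The purpose check is the consequentialist part: it only works because the agent understands what the rule is for, which is the thing I'm hypothesizing you need if the rule is to stay stable under self-improvement.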
I'm writing a post about this at the moment. I'm confused about how you're thinking about the space of agents, such that "maybe we don't need to make big changes"?
When you make a bigger change you just have to be really careful that you land in the basin again.
How can you see whether you're in the basin? What actions help you land in the basin?
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn't fall deeper into the basin by itself; that only happens because humans fix problems.
(Ryan is correct about what I'm referring to, and I don't know any details).
I want to say publicly, since my comment above is a bit cruel in singling out MATS specifically: I think MATS is the most impressively well-run organisation that I've encountered, and overall supports good research. Ryan has engaged at length with my criticisms (both now and when I've raised them before), as have others on the MATS team, and I appreciate this a lot.
Ultimately most of our disagreements are about things that I think a majority of "the alignment field" is getting wrong. I think most people don't consider it Ryan's responsibility to do better at research prioritization than the field as a whole. But I do. It's easy to shirk responsibility by deferring to committees, so I don't consider that a good excuse.
A good excuse is defending the object-level research prioritization decisions, which Ryan and other MATS employees happily do. I appreciate them for this, and we agree to disagree for now.
Tying back to the OP, I maintain that multiplier effects are often overrated because of people "slipping off the real problem", and this is a particularly large problem with founders of new orgs.
The nearest unblocked strategy problem combines with the adversarial non-robustness of learned systems to make a much worse problem. So I expect constraints to often break under more thinking or learning than was present in training (toy sketch below).
Separately, reflective stability of constraint-following feels like a pretty specific target that we don't really have ways of selecting for, so we'd be relying on a lot of luck for that. Almost all my deontological-ish rules are backed up by consequentialist reasoning, and I find it pretty likely that you need some kind of valid and true supporting reasoning like this to make sure an agent fully endorses its constraints.
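To make the first point concrete, here's a toy sketch of the failure shape (one-dimensional, made-up numbers, purely illustrative): a constraint is learned from sparse labels, and an optimizer maximizing reward subject to the learned constraint lands exactly where the learned boundary overshoots the true one.

```python
# Toy sketch (made-up numbers): a constraint learned from sparse labels is
# almost right, but an optimizer maximizing reward subject to the *learned*
# constraint picks exactly the point where the learned boundary overshoots
# the true one -- the nearest strategy the learned check fails to block.

import numpy as np

TRUE_LIMIT = 0.65  # actions x are truly safe iff x <= 0.65


def truly_safe(x):
    return x <= TRUE_LIMIT


# Sparse labelled examples of the constraint (no label between 0.6 and 0.8).
train_x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
train_y = truly_safe(train_x)

# "Learn" the constraint: put the threshold in the middle of the gap between
# the last safe example and the first unsafe one.
last_safe = train_x[train_y].max()
first_unsafe = train_x[~train_y].min()
learned_limit = (last_safe + first_unsafe) / 2  # 0.7, not 0.65


def learned_safe(x):
    return x <= learned_limit


# Reward increases with x, so the optimizer pushes x as high as the learned
# constraint will allow.
actions = np.linspace(0.0, 1.0, 10001)
chosen = actions[learned_safe(actions)].max()

print(f"learned limit: {learned_limit:.3f}")   # ~0.700
print(f"chosen action: {chosen:.3f}")          # ~0.700
print(f"truly safe?    {truly_safe(chosen)}")  # False
```

The gap in the toy is tiny and one-dimensional; the worry is that a high-dimensional learned constraint, hit with far more optimization than it saw in training, has gaps like this all over its boundary.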
Why do you put ~50% on this reasoning? Is this due to new evidence or arguments? I've never heard it supported by anything other than bad high-level analogies.
There are a couple of factors that make the chain weaker as it goes along.
The only way I can see it going well is if an early AI realizes how bad an idea this chain is and convinces the humans to go develop a better paradigm.
(I wrote a post about similar stuff recently, trying to explain the generators behind my dislike of prosaic research)