Training AI to do alignment research we don’t already know how to do

joshc

This post heavily overlaps with “how might we safely pass the buck to AI?” but is written to address a central counter argument raised in the comments, namely “AI will produce sloppy AI alignment research that we don’t know how to evaluate.” I wrote this post in a personal capacity.

The main plan of many AI companies is to automate AI safety research. Both Eliezer Yudkowsky and John Wentworth raise concerns about this plan that I’ll summarize as “garbage-in, garbage-out.” The concerns go something like this:

Insofar as you wanted to use AI to make powerful AI safe, it’s because you don’t know how to do this task yourself.

So if you train AI to do research you don’t know how to do, it will regurgitate your bad takes and produce slop.
Of course, you have the advantage of grading instead of generating this research. But this advantage might be small. Consider how confused people were in 2021 about whether Eliezer or Paul were right about AI takeoff speeds. AI research will be like Eliezer-Paul debates. AI will make reasonable points, and you’ll have no idea if these points are correct.

This is not just a problem with alignment research. It's potentially a problem any time you would like AI agents to give you advice that does not already align with your opinions.

I don’t think this “garbage-in garbage-out” concern is obviously going to be an issue. In particular, I’ll discuss a path to avoiding it that I call “training for truth-seeking,” which entails:

Training AI agents so they can improve their beliefs (e.g. do research) as well as the best humans can.
Training AI agents to accurately report their findings with the same fidelity as top human experts (e.g. perhaps they are a little bit sycophantic but they mostly try to do their job).

The benefit of this approach is that it does not require humans to already have accurate beliefs at the start, and instead it requires that humans can recognize when AI agents take effective actions to improve their beliefs from a potentially low baseline of sloppy takes. This makes human evaluators like sweepers guiding a curling rock. Their job is to nudge agents so they continue to glide in the right direction, not to grab them by the handle and place agents exactly on the bullseye of accurate opinions.

Sweepers guiding a curling rock.

Of course, this proposal clearly doesn’t work if AI agents are egregiously misaligned (e.g. faking alignment). In particular, in order for “training for truth-seeking” to result in much better opinions than humans already have, agents must not be egregiously misaligned to start with, and they must be able to maintain their alignment as the complexity of their research increases (section 1).

Before continuing, it’s worth clarifying how developers might train for truth-seeking in practice. Here’s an example:

Developers first direct agents to answer research questions. They score agents according to criteria like: “how do agents update from evidence?” “Do agents identify important uncertainties?”
If developers aren’t careful, this process could devolve into normal RLHF. Human graders might pay attention to whether agents are just agreeing with them. So developers need to select tasks that graders don’t have preconceptions about. For example: “how will student loan policies affect university enrollment?” instead of “should guns be legalized?” Developers can reduce human bias even further by training multiple agents under slightly different conditions. These agents form a parliament of advisors.
When developers have an important question like “should we deploy a model?” they can ask their parliament of AI advisors, which perform the equivalent of many human months of human research and debate, and provide an aggregated answer.

This procedure obviously won’t yield perfect advice, but that’s ok. What matters is that this process yields better conclusions than humans would otherwise arrive at. If so, the developer should defer to their AI.

As I’ll discuss later, a lot could go wrong with this training procedure. Developers need to empirically validate that it actually works. For example, they might hold out training data from after 2020 and check the following:

Do agents make superhuman forecasts?
Do agents discover key results in major research fields?

Developers might also assess agents qualitatively. I expect my interactions with superhuman AI to go something like this:

AI agent: “Hey Josh, I’ve read your blog posts.”
Me: “Oh, what are your impressions?”
AI agent: “You are wrong about X because of Y.”
Me: Yeah you are totally right.
AI agent: “Wait I’m not done, I have 20 more items on my list of important-areas-where-you-are-unquestionably-wrong.”

So my main thought after these evaluations won’t be: “are these agents going to do sloppy research?” My main thought will be, “holy cow, I really hope these agents are trustworthy because they are far more qualified to do my job than I am.”

The remainder of this blogpost talks about a few problems this “training AI for truth-seeking” proposal could have. The first problem is that training with easy tasks might not generalize to hard tasks (section 1). Also, human graders might still introduce bias (section 2). Finally, agents might be egregiously misaligned (section 3).

1. Generalizing to hard tasks

The original problem we wanted to solve is that we’d like agents to do a massive amount of work that we can’t evaluate.

So we must rely on agents to generalize from “easy-to-grade” to “hard-to-grade” tasks. Said another way, we need some reason to think that nudging models in the direction of truth will cause them to continue to glide in that direction even when humans can’t nudge anymore.

For instance, AI agents might be like university students that can do 1-month take-home projects, but can’t spend 10 years on a discovery worthy of a nobel prize; however, insofar as capabilities cannot generalize to complex tasks, agents are not yet catastrophically dangerous. So we should assume that agent capabilities eventually generalize somehow. We can also confirm capability generalization with tasks held out from training data as discussed in the introduction (hold out alignment research from 2020 - 2025 and see how much of it AI agents can discover).

The question, then, is whether agents will generalize in a way that causes them to report accurate information. For example, agents might gather superhuman knowledge about whether the minimum wage should be abolished. But instead of reporting their beliefs, agents might feed politicians tasty slop that aligns with the prevailing ideology at the time.

This sycophantic behavior is consistent with the task the agent was originally trained to do, namely: “make one’s reasoning appear reasonable to humans.” Why shouldn’t we expect AI to be sycophantic like this?

It’s plausible to me that AI will be egregiously sycophantic by default. I’m not claiming that training models for truth-seeking works out of the box. I’m claiming that truth-seeking will generalize to much harder tasks if agents don’t start out egregiously misaligned and can maintain their alignment.

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Of course, to avoid egregious misalignment, developers must verify that models are not already conniving to satisfy some other set of preferences. Models should not already be thinking to themselves “how do I get the highest possible score” or “how do I acquire power.” But approximate alignment at the start isn’t enough. Agents must also maintain their alignment. For example:

Agents might splinter into a diverse population so different members become misaligned at different times.
Then models might monitor each other for signs that their alignment is slipping – both relying on external behavior and tools like lie detectors.
If agents spot issues, they fix them promptly, or let humans know if they can’t fix these issues.

An illustration of how self-correction might be a mechanism by which alignment generalizes across the extreme distribution shift from “easy-to-grade” to “hard-to-grade” tasks.

Of course, this is only an example. I expect agents will come up with much better ways to maintain their alignment as they become much smarter than humans. The job of humans is just to make sure AI agents are able to self-correct at the early stages of this process – where human intervention might still plausibly help.

2. Human graders might introduce bias

One way to reduce bias is to avoid letting humans grade tasks they obviously have strong opinions about. The example I used was to train AI agents to seek the truth on tasks like “how will student loan policies affect university enrollment?” instead of “should guns be legalized?” But this doesn’t remove bias completely.

For instance, maybe a conservative grader doesn't care much about the minimum wage; however, if the grader evaluates a task related to whether the minimum wage should be abolished, they still might rely on intuitions that the government should stay small.

Every research question touches on people’s ideologies and preconceptions. As a result, developers might end up with some AI agents that “think like republicans” and some that “think like democrats,” depending on which people do the scoring.

However, the goal of developers is not to create a perfect reasoner. Instead the goal is to create agents that arrive at better conclusions than humans would. This goal can be achieved even if agents retain some human biases.

As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions. Then developers could let these agents do research for a long time and debate with each other. By definition, to the extent developers succeed in creating faithful simulators, trusting agents will be at least as advantageous as the best conceivable alternative where developers trust humans instead.

So even if agents seek the truth in a slightly biased way, developers can still train them so they arrive at far better opinions than humans currently have. Furthermore, if developers would like less biased AI advisors, they could instruct the AI advisors: “please determine how to create unbiased AI advisors.” To the extent humans ever had any shot of achieving this goal, the AI advisors will also be able to achieve it.

3. AI agents might still be egregiously misaligned

The main weakness of training models for truth-seeking is that models might still end up egregiously misaligned (section 1). For example, if agents have the goal of “maximizing human approval of their behavior,” they might appear to seek truth, but increasingly won’t as this goal comes apart from doing well in training. The same is true if agents are scheming, alignment faking, playing the training game, etc.

Therefore, to be confident agents will seek the truth earnestly, developers still must make research advances that are sufficient to rule out these misalignment concerns.

Conclusion

Over the last several years, researchers have proposed various strategies for eliciting knowledge from superhuman AI. I believe much of this research is of questionable value. I don’t expect we will need Paul’s galaxy-brained computer science proposals or ELK probes for ASI. We will only need human-competitive agents that vaguely try to be helpful and maintain their alignment. This goal relies on progress on empirical research problems related to detecting and avoiding egregious alignment near human-competitive capabilities, which I believe should be the main focus of AI safety researchers.

Thanks to Aryan Bhatt and Cody Rushing for feedback.

to the extent developers succeed in creating faithful simulators

There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.

If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.

There are weaker versions, which I think are what Ryan believes will be possible. In a slightly weaker case, you don't get something anywhere close to a human simulation, but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

But I think the evidence is against this. Long horizon tasks are currently difficult to successfully train on, unless you have dense intermediate feedback. Capabilities progress in the last decade has come from leaning heavily on dense intermediate feedback.

I expect long-horizon RL to remain pretty low data efficiency (i.e. take a lot of data before it generalizes well OOD).

@ryan_greenblatt

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.

but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it ~~behave well~~ perform well according to your metrics in the average case using online training.

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.

I think we kind of agree here. The cruxes remain: I think that the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it allows room for deliberate-or-not mistakes sometimes when they can be plausibly got-away-with. [Edit: Or sabotage, escape, etc.]

Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I'm-doing-original-research vs humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.

I think there exists an extremely strong/unrealistic version of believing in "data-efficient long-horizon RL" that does allow this. I'm aware you don't believe this version of the statement, I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn't make sense?

Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Obviously, all agents having undergone training to look "not egregiously misaligned", will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between "not egregiously misaligned" and "conniving to satisfy some other set of preferences". But there are a lot of messy places in between these two positions, including "I'm not really sure what I want" or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.

All AIs you train will be somewhere in this in between messy place. What you are hoping for is that if you put a group of these together, they will "self-correct" and force/modify each other to keep pursuing to the same goals-you-trained-them-to-look-like-they-wanted?

Is this basically correct? If so, this won't work just because this is absolute chaos and the goals-you-trained-them-to-look-like-they-wanted aren't enough to steer this chaotic system where you want it to go.

are these agents going to do sloppy research?

I think there were a few times where you are somewhat misreading your critics when they say "slop". It doesn't mean "bad". It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.

It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

I think my arguments still hold in this case though right?

i.e. we are training models so they try to improve their work and identify these subtle issues -- and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.

My guess is that your core mistake is here

I agree there are lots of "messy in between places," but these are also alignment failures we see in humans.

And if humans had a really long time to do safety reseach, my guess is we'd be ok. Why? Like you said, there's a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc)

and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don't actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

these are also alignment failures we see in humans.

Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???

There are many groups of humans (or groups of humans), that if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.

How is this evidence in favour of your plan ultimately resulting in a solution to alignment???

but these systems empirically often move in reasonable and socially-beneficial directions over time

Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?

and i expect we can make AI agents a lot more aligned than humans typically are

Ahh I see. Yeah this is crazy, why would you expect this? I think maybe you're confusing yourself by using the word "aligned" here, can we taboo it? Human reflective instability looks like: they realize they don't care about being a lawyer and go become a monk. Or they realize they don't want to be a monk and go become a hippy (this one's my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways.

We have a lot of experience with the space of human reflective instabilities. We're pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.

But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can't nail down easily through training. Also, AIs will be much newer, much more in progress, than humans are (not quite sure how to express this, another way to say it is to point to the quantity of robustness&normality training that evolution has subjected humans to).

Therefore I think it's extremely, wildly wrong to expect "we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are".

but, Claude sure as hell seems to

Why do you even consider this relevant evidence?

[Edit 25/02/25:
To expand on this last point, you're saying:

If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

It seems like you're doing the same dichotomy here, where you say it's either pretending or it's aligned. I know that they will act like they care about the law. We both see the same evidence, I'm not just ignoring it. I just think you're interpreting this evidence poorly, perhaps by being insufficiently careful about "alignment" as meaning "reflectively goal stable with predictable goals and predictable instabilities" vs "acts like a law-abiding citizen at the moment".

]

(vague memory from the in person discussions we had last year, might be inaccurate):

jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.

joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!

...

joshc!now:

Training AI agents so they can improve their beliefs (e.g. do research) as well as the best humans can.

Seems like the reasoning behind your conclusions has changed a lot since we talked, but the conclusions haven't changed much?

If you were an AI: Negative reward, probably a bad belief updating process.

Something important is that "significantly accelerate alignment research" isn't the same as "making AIs that we're happy to fully defer to". This post is talking about conditions needed for deference and how we might achieve them.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path to this understanding probably isn't made of >90% moderate duration ML tasks. (You need >90% to get 5-10x because of communication costs, it's often necessary to understand details of experiment implementation to get insight from them. And costs from the AI making mistakes and not quite doing the experiments right).

A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML reasearch without any deep understanding.

I both think:

We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited.
Prosaic ML safety research can be very helpful for increasing the chance of "real alignment" for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.)

This top level post is part of Josh's argument for (2).

Yep this is the third crux I think. Perhaps the most important.

To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?

For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like:

Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous.
Additionally, we will likely to need to hand over making most near term decisions and most near term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (in systems capable enough to obsolete us) and also requires some other stuff.
There is a bunch of "prosaic and relatively unenlightened ML research" which can make egregious misalignment much less likely and can resolve other problems needed for handover.
Much of this work is much easier once you already have powerful AIs to experiment on.
The risk reduction will depend on the amount of effort put in and the quality of the execution etc.
The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time. (This require more misc conceptual work, but not something that requires deep understanding persay.)

I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

I think my perspective is more like "here's a long list of stuff which would help". Some of this is readily doable to work on in advance and should be worked on, and some is harder to work on.

This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works), though I'm excited for a bunch more work on this stuff. ("relatively unenlightened" doesn't mean "trivial to get the ML community work on this using money" and I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively.) Note that I also think that control involves "prosaic and relatively unenlightened ML research" and I'm excited about scaling up research on this, but this doesn't imply that what should happen is "OpenPhil throw a billion dollars toward every available ML-research-capable human to do this work right now"

I've DM'd you my current draft doc on this, though it may be incomprehensible.

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy the internal screaming I have toward this category of things.
A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope in literally just understanding how to build task-AGI properly, in a well understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.

The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time

Agree it's unclear. I think the chance of most of the ideas being helpful depends on some variables that we don't clearly know yet. I think 90% risk improvement can't be right, because there's a lot of correlation between each of the things working or failing. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.

One underlying intuition that I want to express: The world where we are making proto-AGIs run all these experiments is pure chaos. Politically and epistemically and with all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.

But if I thought control was likely to work very well and saw a much more plausible path to alignment among the "stuff to try", I'd think it was a reasonable strategy.

I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively

On some axes, but won't there to be axes where AIs are more difficult than humans also? Sycophancy&slop being the most salient. Misalignment issues being another.

This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works),

This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.

On some axes, but won't there to be axes where AIs are more difficult than humans also? Sycophancy&slop being the most salient. Misalignment issues being another.

Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

I certainly agree it isn't clear, just my current best guess.

I've DM'd you my current draft doc on this, though it may be incomprehensible.

Have you published this doc? If so, which one is it? If not, may I see it?

I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.

Maybe another relevant thing that looks wrong to me: You will still get slop when you train an AI to look like it is epistemically virtuously updating its beliefs. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects actual epistemic virtue level, just like other kinds of slop.

I don't see why you would have more trust in agents created this way.

(My parent comment was more of a semi-serious joke/tease than an argument, my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly).

As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.

I'm very excited about strategies that involve lots of imitation learning on lots of particular humans. I'm not sure if imitated human researchers learn to generalize to doing lots of novel research, but this seems great for examining research outputs of slightly-more-alien agents very quickly.

Isn’t “truth seeking” (in the way defined in this post) essentially defined as being part of “maintain their alignment”? Is there some other interpretation where models could both start off “truth seeking”, maintain their alignment, and not have maintained “truth seeking”? If so, what are those failure modes?

For example, they might hold out training data from before 2020 and check the following:

I think you mean after 2020 right?

to the extent developers succeed in creating faithful simulators

I expect long-horizon RL to remain pretty low data efficiency (i.e. take a lot of data before it generalizes well OOD).

@ryan_greenblatt

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.

but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.

Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

are these agents going to do sloppy research?

It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

My guess is that your core mistake is here

these are also alignment failures we see in humans.

Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???

How is this evidence in favour of your plan ultimately resulting in a solution to alignment???

but these systems empirically often move in reasonable and socially-beneficial directions over time

Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?

and i expect we can make AI agents a lot more aligned than humans typically are

Therefore I think it's extremely, wildly wrong to expect "we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are".

but, Claude sure as hell seems to

Why do you even consider this relevant evidence?

[Edit 25/02/25:
To expand on this last point, you're saying:

If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

]

(vague memory from the in person discussions we had last year, might be inaccurate):

joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!

...

joshc!now:

Training AI agents so they can improve their beliefs (e.g. do research) as well as the best humans can.

Seems like the reasoning behind your conclusions has changed a lot since we talked, but the conclusions haven't changed much?

If you were an AI: Negative reward, probably a bad belief updating process.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML reasearch without any deep understanding.

I both think:

We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited.
Prosaic ML safety research can be very helpful for increasing the chance of "real alignment" for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.)

This top level post is part of Josh's argument for (2).

Yep this is the third crux I think. Perhaps the most important.

I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like:

Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous.
Additionally, we will likely to need to hand over making most near term decisions and most near term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (in systems capable enough to obsolete us) and also requires some other stuff.
There is a bunch of "prosaic and relatively unenlightened ML research" which can make egregious misalignment much less likely and can resolve other problems needed for handover.
Much of this work is much easier once you already have powerful AIs to experiment on.
The risk reduction will depend on the amount of effort put in and the quality of the execution etc.
The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time. (This require more misc conceptual work, but not something that requires deep understanding persay.)

I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

I think my perspective is more like "here's a long list of stuff which would help". Some of this is readily doable to work on in advance and should be worked on, and some is harder to work on.

I've DM'd you my current draft doc on this, though it may be incomprehensible.

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy the internal screaming I have toward this category of things.
A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope in literally just understanding how to build task-AGI properly, in a well understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.

The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time

But if I thought control was likely to work very well and saw a much more plausible path to alignment among the "stuff to try", I'd think it was a reasonable strategy.

I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively

On some axes, but won't there to be axes where AIs are more difficult than humans also? Sycophancy&slop being the most salient. Misalignment issues being another.

This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works),

This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.

On some axes, but won't there to be axes where AIs are more difficult than humans also? Sycophancy&slop being the most salient. Misalignment issues being another.

Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

I certainly agree it isn't clear, just my current best guess.

I've DM'd you my current draft doc on this, though it may be incomprehensible.

Have you published this doc? If so, which one is it? If not, may I see it?

I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

I don't see why you would have more trust in agents created this way.

(My parent comment was more of a semi-serious joke/tease than an argument, my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly).

As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.

For example, they might hold out training data from before 2020 and check the following:

I think you mean after 2020 right?

45

Training AI to do alignment research we don’t already know how to do

45

1. Generalizing to hard tasks

2. Human graders might introduce bias

3. AI agents might still be egregiously misaligned

Conclusion

45

45