Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.

But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.

I like this reason to be unsatisfied with the EUM theory of agency.

One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, which makes the problem worse, but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.

I think the solution to this is to compare theories using engineering desiderata. Our goal is ultimately to build a safe AGI, so we want a theory that helps us reason about safety desiderata.

One of the really important safety desiderata is some kind of goal stability. When we build a powerful agent, we don't want it to change its mind about what's important. It should act to achieve known, predictable outcomes, even when it discovers facts and concepts we don't know about.

So my criticism of this research direction is that I don't think it'll be a good framework for making goal-stable agents. You want a framework that naturally models internal conflict between goals, and in particular you want to model this as conflict between agents. But conflict and cooperation between bounded, not-quite-rational agents is messy and hard to predict. Multi-agent systems are complex and detail-dependent. Therefore it seems difficult to show that the overall agent will be stable.

(A reasonable response would be "but no proposed vague theories of bounded agency have this goal stability property, maybe this coalitional approach will turn out to help us come up with a solution", and that's true and fair enough, but I think research directions like this seem more promising).

I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research"), it feels like we're making shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

  • If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
  • A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
  • There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
  • I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope of literally just understanding how to build task-AGI properly, in a well-understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend, this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking it through in detail.

The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time

Agreed that it's unclear. I think the chance of most of the ideas being helpful depends on some variables that we don't clearly know yet. But a 90% risk reduction can't be right, because there's a lot of correlation between whether each of the things works or fails. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.
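As a toy illustration of the correlation point (all the numbers below are made-up assumptions of mine, not estimates from either of us): if the interventions share a common failure mode, such as the control scheme itself being executed imperfectly, the combined risk reduction is far smaller than a naive independent-failure calculation suggests.

```python
import random

# Toy Monte Carlo sketch (all numbers are made-up assumptions, purely illustrative):
# compare residual takeover risk when several safety interventions fail
# independently vs. when they share a common failure mode (e.g. the control
# scheme being executed imperfectly).

random.seed(0)
N_TRIALS = 100_000
N_INTERVENTIONS = 5
P_INTERVENTION_FAILS = 0.3   # assumed chance each intervention fails on its own
P_COMMON_FAILURE = 0.3       # assumed chance the shared assumption breaks

def takeover(correlated: bool) -> bool:
    if correlated and random.random() < P_COMMON_FAILURE:
        return True  # the shared failure mode defeats every intervention at once
    # otherwise takeover happens only if every intervention fails independently
    return all(random.random() < P_INTERVENTION_FAILS for _ in range(N_INTERVENTIONS))

for label, correlated in [("independent", False), ("correlated", True)]:
    risk = sum(takeover(correlated) for _ in range(N_TRIALS)) / N_TRIALS
    print(f"{label:>11}: residual takeover risk ~ {risk:.3f}")
# independent ~ 0.3**5 ~ 0.002, correlated ~ 0.30: the shared failure mode dominates.
```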

One underlying intuition that I want to express: the world where we are making proto-AGIs run all these experiments is pure chaos, politically, epistemically, and in terms of all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.

But if I thought control was likely to work very well and saw a much more plausible path to alignment among the "stuff to try", I'd think it was a reasonable strategy.

I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively

On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy and slop are the most salient; misalignment issues are another.

This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works),

This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.

I think we kind of agree here. The cruxes remain: I think the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it leaves room for occasional mistakes (deliberate or not) whenever they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
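To gesture at why average-case guarantees feel thin to me, here's a toy calculation (the numbers are illustrative assumptions, not real estimates): a very low per-action rate of subtle, plausibly-deniable failures still accumulates over a long automated research programme.

```python
# Toy calculation (numbers are illustrative assumptions, not real estimates):
# average-case training can make almost every action look fine while still
# leaving room for rare, concentrated, plausibly-deniable failures.

p_subtle_failure = 0.001   # assumed per-action rate of hard-to-detect sabotage/mistakes
n_actions = 20_000         # assumed number of AI actions over a research programme

expected_failures = p_subtle_failure * n_actions
p_at_least_one = 1 - (1 - p_subtle_failure) ** n_actions
average_case_accuracy = 1 - p_subtle_failure  # looks excellent: 99.9%

print(f"average-case 'behaves well' rate: {average_case_accuracy:.1%}")
print(f"expected subtle failures:         {expected_failures:.0f}")
print(f"P(at least one subtle failure):   {p_at_least_one:.4f}")
```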

Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I'm-doing-original-research and humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in an online training sense) implies you can make faithful simulations.

I think there exists an extremely strong/unrealistic version of believing in "data-efficient long-horizon RL" that does allow this. I'm aware you don't believe this version of the statement; I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn't make sense?

Yep, this is the third crux, I think. Perhaps the most important.

To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?

For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that mostly just involves "prosaic and relatively unenlightened ML research", you should write out that plan and why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.

Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it is updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects their actual level of epistemic virtue, just like with other kinds of slop.

I don't see why you would have more trust in agents created this way.

(My parent comment was more of a semi-serious joke/tease than an argument; my other comments made actual arguments after I'd read more. Idk why this one was upvoted more; that's silly.)

these are also alignment failures we see in humans.

Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???

There are many humans (or groups of humans) who, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.

How is this evidence in favour of your plan ultimately resulting in a solution to alignment???

but these systems empirically often move in reasonable and socially-beneficial directions over time

Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved? 

and i expect we can make AI agents a lot more aligned than humans typically are

Ahh, I see. Yeah, this is crazy; why would you expect this? I think maybe you're confusing yourself by using the word "aligned" here; can we taboo it? Human reflective instability looks like: someone realizes they don't care about being a lawyer and goes off to become a monk. Or they realize they don't want to be a monk and become a hippy (this one's my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life-crisis things. Or they go crazy in more extreme ways.

We have a lot of experience with the space of human reflective instabilities. We're pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.

But the space of reflective-goal-weirdness is much larger and stranger than the space we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can't nail down easily through training. Also, AIs will be much newer, much more of a work in progress, than humans are (not quite sure how to express this; another way to say it is to point to the sheer quantity of robustness-and-normality training that evolution has subjected humans to).

Therefore I think it's extremely, wildly wrong to expect "we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are".

but, Claude sure as hell seems to

Why do you even consider this relevant evidence?

[Edit 25/02/25:
To expand on this last point, you're saying:

If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

It seems like you're drawing the same dichotomy here, where you say it's either pretending or it's aligned. I know that they will act like they care about the law. We both see the same evidence; I'm not just ignoring it. I just think you're interpreting this evidence poorly, perhaps by being insufficiently careful about "alignment" as meaning "reflectively goal-stable with predictable goals and predictable instabilities" vs "acts like a law-abiding citizen at the moment".

]

to the extent developers succeed in creating faithful simulators

There's a crux I have with Ryan, which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of the last time we talked about it, Ryan says we probably will; I say we probably won't.

If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved: we make exact human uploads, and that's it. This is one end of the spectrum on this question.

There are weaker versions, which I think are what Ryan believes will be possible. In a slightly weaker case, you don't get something anywhere close to a human simulation, but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

But I think the evidence is against this. Long-horizon tasks are currently difficult to train on successfully unless you have dense intermediate feedback, and capabilities progress in the last decade has come from leaning heavily on dense intermediate feedback.

I expect long-horizon RL to remain pretty data-inefficient (i.e. to take a lot of data before it generalizes well out of distribution).

@ryan_greenblatt 

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Obviously, all agents that have undergone training to look "not egregiously misaligned" will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between "not egregiously misaligned" and "conniving to satisfy some other set of preferences". But there are a lot of messy places in between these two positions, including "I'm not really sure what I want" or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.

All the AIs you train will be somewhere in this messy in-between place. What you are hoping for is that if you put a group of these together, they will "self-correct" and force/modify each other to keep pursuing the same goals-you-trained-them-to-look-like-they-wanted?

Is this basically correct? If so, I don't think it will work, simply because this is absolute chaos, and the goals-you-trained-them-to-look-like-they-wanted aren't enough to steer this chaotic system where you want it to go.

are these agents going to do sloppy research?

I think there are a few places where you're somewhat misreading your critics when they say "slop". It doesn't mean "bad"; it means something closer to "very subtly bad in a way that is difficult to distinguish from quality work", where the second part is the important part.

E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.
