Jeremy Gillen

I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.


Comments

I think your summary is a good enough quick summary of my beliefs. The minutiae I object to are how confident and specific many parts of your summary are. I think many of the claims in the summary can be adjusted or completely changed and still lead to bad outcomes. But it's hard to add lots of uncertainty and options to a quick summary, especially one you disagree with, so that's fair enough.
(As a side note, that paper you linked isn't intended to represent anyone's views other than mine and Peter's, and we are relatively inexperienced. I'm also no longer working at MIRI.)

I'm confused about why your <20% isn't sufficient for you to want to shut down AI research. Is it because the benefits outweigh the risk, or because we'll gain evidence about potential danger and can shut down later if necessary?

I'm also confused about why being able to generate practical insights about the nature of AI or AI progress is something that you think should necessarily follow from a model that predicts doom. I believe something close enough to (1) from your summary, but I don't have much idea (above general knowledge) of how the first company to build such an agent will do so, or when they will work out how to do it. One doesn't imply the other.

I'm enjoying having old posts recommended to me. I like the enriched tab.

Doesn't the futarchy hack come up here? Contractors will be betting that competitors' timelines and costs will be high, in order to get the contract.

This doesn't feel like it resolves that confusion for me; I think it's still a problem with the agents he describes in that paper.

The causes are just the direct computation of $\phi(x)$ for small values of $x$. If they were arguments that only had bearing on small values of $x$ and implied nothing about larger values (e.g. an adversary selected some $x$ to show you, but filtered for $x$ such that $\phi(x)$ holds), then it makes sense that this evidence has no bearing on the forall. But when there was no selection or other reason that the argument only applies to small $x$, then to me it feels like the existence of the evidence (even though already proven/computed) should still increase the credence of the forall.
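To make the structural point explicit (in generic notation, since nothing here depends on the particular statement): let $H$ be the forall $\forall x.\,\phi(x)$ and $E$ the conjunction of the computed instances $\phi(1),\dots,\phi(n)$. Then

$$P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)} \;=\; \frac{P(H)}{P(E)} \;\geq\; P(H),$$

since $H$ entails $E$, with strict inequality whenever $P(E) < 1$. Under adversarial selection the evidence is instead $S$ = "someone searched for and showed me an $x$ with $\phi(x)$"; if such an $x$ would be findable whether or not $H$ holds, then $P(S \mid H) \approx P(S \mid \neg H)$ and the shown instances carry essentially no weight for $H$.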

I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it and skimmed the rest, and talked to you a little about it at EAG. It's on my todo list to work through it properly, and I expect to actually do it because it's the blocker on me rewriting and posting my "why the shutdown problem is hard" draft, which I really want to post.

The reasons I'm a priori not extremely excited are that it seems intuitively very difficult to avoid either of these issues:

  • I'd be surprised if an agent with (very) incomplete preferences were real-world competent. I think it's easy to miss ways that a toy model of an agent with incomplete preferences might be really incompetent.
  • It's easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.

It's plausible you've avoided these problems but I haven't read deeply enough to know yet. I think it's easy for issues like this to be hidden (accidentally), so it'll take a lot of effort for me to read properly (but I will, hopefully in about a week).

The part where it works for a prosaic setup seems wrong (because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!)), but this isn't what the shutdown problem is about so it isn't an issue if it doesn't apply directly to prosaic setups.

I would be excited to read this / help with a draft. 

We can meet in person one afternoon and work out some cruxes and write them up?

Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?

Yes, approximately, as I believe you and I are capable of doing. [...PhD student crux]

The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.

Are you saying you buy my side of the PhD student crux, but that it isn't closely analogous because you think we have more fine-grained control over the thoughts or goals of the AI? And in your second sentence, my guess is you're giving a second reason for disagreeing: that it looks difficult for a research AI to know when to perform well and when to perform less well? I think we might be miscommunicating about this, because it doesn't look to me like the sort of situation where the "deception" is a relatively difficult task.

I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard.

I agree with this. But the research we need isn't just past the frontier. At some point not that far past the frontier, you get to a situation where the research team can say "given our guesses about mainland researchers and their time constraints, we'd put <1% on them having made discovery X {or explored idea Y, or done experiment Z}. Let's {delay giving it to them, not work on that direction, insert a subtle false assumption, instead share our 8th best idea, just say we're stuck, communicate it poorly}".

I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information".

Sure, kinda, but the important part here is that a trapped agentic AI can selectively mislead you about the most important {knowledge, tasks, strategies} that you aren't capable of doing with just humans. If you're capable of doing it with humans, or if you're close to being able to do it with humans, then why not just do that? Seems safer.

And every time the AIs do useful work we can add this to our supervised training corpus. 

Only if you spend the human resources to verify it. Are you aiming for 30x research acceleration? Because this by itself might be enough to limit you below that level. (And it also directly provides a way for the AI to tell "what time it is" by sending messages to future iterations, and plausibly also provides a bunch of information about human research capability and capacity.)
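As rough arithmetic (the variable $v$ is made up, not something you've committed to): if producing a piece of research would take humans time $t$, and carefully verifying the AI's version takes humans $v \cdot t$, then even with instantaneous AI labour the acceleration from the human side is bounded by

$$\text{acceleration} \;\leq\; \frac{t}{v\,t} \;=\; \frac{1}{v},$$

so 30x requires verification to cost less than about 3% of doing the work yourself, which seems hard for adversarially generated research.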

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Mostly my objection to this is above. On scalable oversight, I buy that it might be possible to get 20% improvements in the accuracy of your evaluations of (adversarially generated!) research, compared to a team of humans trying really hard to evaluate research, given a human-level research AI. This is so far from being enough. Are you predicting much better scalable oversight that scales much further than I am predicting it will? (I'm assuming you're using "GPT-5/6" to represent "some powerful future consequentialist AI that can do research", because otherwise we should be having a different argument).

While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.

It'd be cool if there were a demonstration of manipulating internal activations during training, in a way that fixes non-trivial unwanted OOD behavior. If such techniques existed, and also worked on consequentialist agents (I expect would be harder), I would consider them very useful tools that would help with building an aligned agent.
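To gesture at the kind of demonstration I have in mind, here's a minimal sketch (untested; the model class, layer index, direction vector and penalty weight are all placeholder assumptions): during fine-tuning, penalize the component of one layer's residual-stream activations along a previously extracted "unwanted" direction, alongside the ordinary LM loss.

```python
import torch

# Sketch only: assumes a Llama-style HuggingFace causal LM `model`, a tokenized
# batch `batch` (dict with "input_ids" etc.), and a unit vector `unwanted_dir`
# of shape (d_model,) previously extracted from activations (e.g. via a
# difference-of-means probe for the behaviour we want to suppress).
# The layer index and penalty weight below are arbitrary placeholders.

captured = {}

def grab_hidden(module, inputs, output):
    # For Llama-style decoder layers the output is a tuple whose first
    # element is the hidden states; this indexing is model-specific.
    captured["acts"] = output[0]

layer = model.model.layers[12]                   # placeholder layer choice
handle = layer.register_forward_hook(grab_hidden)

out = model(**batch, labels=batch["input_ids"])  # ordinary LM loss
acts = captured["acts"]                          # (batch, seq, d_model)
proj = (acts @ unwanted_dir) ** 2                # squared projection per token
loss = out.loss + 0.1 * proj.mean()              # 0.1 is an arbitrary weight

loss.backward()                                  # then optimizer.step(), etc.
handle.remove()
```

The hard part isn't writing this loop; it's finding a direction whose suppression actually changes OOD behavior in the way you intended, especially for a consequentialist agent that can route around it.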

While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.

This is very plausible, but I think you're stretching the analogy between LMs and future research AIs way too far. "Paying attention to the tokens we want them to" doesn't add much/any control over the properties we want control over (compared to e.g. changing the input string to remove information we don't want it paying attention to).

Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., get the model to solve tasks it otherwise would have intentionally failed at, or give us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.

I think you might be centrally thinking about an LLM completing a few paragraphs, where I'm thinking of an AI doing a large scale scientific project? I buy that for many concepts X, you can work out the "neural correlates of X" in your AI. But that kind of thing is noisy (and plausibly can be optimized against by the AI) unless you have a deep understanding of what you are measuring. And optimizing against such imperfect metrics obviously wouldn't do much beyond destroying the metric. I do think research in this direction has a chance of being useful, but mainly by being upstream of much better understanding.

By leaning more on generalization, I mean leaning more on the data efficiency thing

Sorry for misinterpreting you, but this doesn't clarify what you meant. 

also weak-to-strong generalization ideas.

I think I don't buy the analogy in that paper, and I don't find the results surprising or relevant (by my current understanding, after skimming it). My understanding of the result is "if you have a great prior, you can use it to overcome some label noise and maybe also label bias". But I don't think this is very relevant to extracting useful work from a misaligned agent (which is what we are talking about here), and based on the assumptions they describe, I think they agree? (I just saw appendix G, I'm a fan of it, it's really valuable that they explained their alignment plan concisely and listed their assumptions).

I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good').

If we can tell we have such an AI, and we can tell that our random modifications are affecting the goal, and also that the change is roughly one that helps us rather than changing many things that might or might not be helpful, this would be a nice situation to be in.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Section 6 assumes we have failed to control the AI, so it is free of weird/strict constraints, and free to scale itself up, improve itself, etc. So my comment is about an AI that no longer can be assumed to have human-ish capabilities.

Do you have recordings? I'd be keen to watch a couple of the ones I missed.

I feel like you’re proposing two different types of AI and I want to disambiguate them. The first one, exemplified in your response to Peter (and maybe referenced in your first sentence above), is a kind of research assistant that proposes theories (after having looked at data that a scientist is gathering?), but doesn’t propose experiments and doesn’t think about the usefulness of its suggestions/theories. Like a Solomonoff inductor that just computes the simplest explanation for some data? And maybe some automated approach to interpreting theories?

The second one, exemplified by the chess analogy and last paragraph above, is a bit like a consequentialist agent that is a little detached from reality (can’t learn anything, has a world model that we designed such that it can’t consider new obstacles).

Do you agree with this characterization?

What I'm saying is "simpler" is that, given a problem that doesn't need to depend on the actual effects of the outputs on the future of the real world […], it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.

I accept chess and formal theorem-proving as examples of problems where we can define the solution without using facts about the real-world future (because we can easily write down a formal definition of what a solution looks like).

For a more useful problem (e.g. curing a type of cancer) we (the designers) only know how to define a solution in terms of real-world future states (patient is alive, healthy, non-traumatized, etc.). I'm not saying there's no definition of success that avoids referencing real-world future states. But the AI designers don't know it (and I expect it would be relatively complicated).

My understanding of your simplicity argument is that it's computationally cheaper for a trained AI to discover during training a non-consequence definition of the task, despite a consequentialist definition being the criterion used to train it? If so, I disagree that computation cost is very relevant here; generalization (to novel obstacles) is the dominant factor determining how useful this AI is.

Geometric rationality ftw!

(In normal planning problems there are exponentially many plans to evaluate (in the number of actions). So that doesn't seem to be a major obstacle if your agent is already capable of planning.)
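(To spell out the count, with made-up numbers: with $b$ actions available per step and a horizon of $T$ steps,

$$\#\text{plans} \;=\; b^{T}, \qquad \text{e.g. } b=10,\ T=20 \;\Rightarrow\; 10^{20},$$

so any practical planner is already doing something much smarter than enumerating and evaluating them.)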

Might be much harder to implement, but could we maximin "all possible reinterpretations of alignment target X"?
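One way to write down what I mean, pushing all the difficulty into the set $\mathcal{U}_X$ of utility functions corresponding to plausible reinterpretations of the alignment target $X$:

$$\pi^{*} \;=\; \arg\max_{\pi}\ \min_{U \in \mathcal{U}_X}\ \mathbb{E}_{\pi}[U],$$

i.e. pick the policy whose worst-case performance across the reinterpretations is best. Constructing $\mathcal{U}_X$ (and making the min tractable) is of course the hard part.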
