Jeremy Gillen

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.

Comments

it'll choose something other than A.

Are you aware that this is incompatible with Thornley's ideas about incomplete preferences? Thornley's decision rule might choose A.

But suppose the agent were next to face a choice

If the choices are happening one after the other, are the preferences over tuples of outcomes? Or are the two choices in different counterfactuals? Or is it choosing an outcome, then being offered another outcome set that it could replace it with?

VNM is only well justified when the preferences are over final outcomes, not intermediate states. So if your example contains preferences over intermediate states, then it confuses the matter because we can attribute the behavior to those preferences rather than incompleteness.

If you don't agree with Eliezer on 90% of the relevant issues, it's completely unconvincing.

Of course. What kind of miracle are you expecting? 

It also doesn't go into much depth on many of the main counterarguments. And it doesn't go into enough detail to even get close to "logically sound". And it's not as condensed as I'd like. And it skips over a bunch of background. Still, it's valuable, and it's the closest thing to a one-post summary of why Eliezer is pessimistic about the outcome of AGI.

The main value of List of Lethalities as a one-stop shop is that you can read it and then be able to point to roughly where you disagree with Eliezer. And this is probably what you want if you're looking for canonical arguments for AI risk. Then you can look further into that disagreement if you want.

Reading the rest of your comment very charitably: It looks like your disagreements are related to where AGI capability caps out, and whether default goals involve niceness to humans. Great!

If I read your comment more literally, my guess would be that you haven't read List of Lethalities, or are happy to misrepresent positions you disagree with.

he takes as an assumption that an AGI will be godlike level omnipotent

He specifically defines a dangerous intelligence level (in point 3) as around the level required to design and build a nanosystem capable of building a nanosystem (or any of several alternative example capabilities). Maybe your omnipotent gods are lame.

and that it will default to murderism

This is false. Maybe you are referring to how there isn't any section justifying instrumental convergence? But it does link to one, and it notes that it's skipping over a bunch of background in that area (point -3). That would be a different assumption, but if you're deliberately misrepresenting the post, that's probably the part you're misrepresenting.

If you're looking for recent, canonical one-stop-shop, the answer is List of Lethalities.

(Just tried having Claude turn the thread into markdown, which seems to have worked):

xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 3

Should AI be aligned with human preferences, rewards, or utility functions? Excited to finally share a preprint that @MicahCarroll @FranklinMatija @hal_ashton & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!

This paper (https://arxiv.org/abs/2408.16984) is at once a critical review & research agenda. In it we characterize the role of preferences in AI alignment in terms of 4 preferentist theses. We then highlight their limitations, arguing for alternatives that are ripe for further research.

Our paper addresses each of the 4 theses in turn:

  1. T1: Rational choice theory as a descriptive theory of humans
  2. T2: Expected utility theory as a normative account of rational agency
  3. T3: Single-human AI alignment as preference matching
  4. T4: Multi-human AI alignment as preference aggregation

Addressing T1, we examine the limitations of modeling humans as (noisy) maximizers of utility functions (as done in RLHF & inverse RL), which fails to account for:

  • Bounded rationality
  • Incomplete preferences & incommensurable values
  • The thick semantics of human values

As alternatives, we argue for:

  1. Modeling humans as resource-rational agents
  2. Accounting for how we do or do not commensurate / trade-off our values
  3. Learning the semantics of human evaluative concepts, which preferences do not capture

We then turn to T2, arguing that expected utility (EU) maximization is normatively inadequate. We draw on arguments by @ElliotThornley & others that coherent EU maximization is not required for AI agents. This means AI alignment need not be framed as "EU maximizer alignment".


Jeremy Gillen @jeremygillen1 · Sep 4

I'm fairly confident that Thornley's work arguing that preference completeness isn't a requirement of rationality is mistaken. If offered the choice to complete its preferences, an agent acting according to his decision rule should choose to do so.

As long as it can also shift around probabilities of its future decisions, which seems reasonable to me. See Why Not Subagents?


xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4

Hi! So first I think it's worth clarifying that Thornley is focusing on what advanced AI agents will do, and is not as committed to saying something about the requirements of rationality (that's our interpretation).

But to the point of whether an agent would/should choose to complete its preferences, see Sami Petersen's more detailed argument on "Invulnerable Incomplete Preferences":

https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1

Regarding the trade between (sub)agents argument, I think that only holds in certain conditions -- I wrote a comment on that post discussing one intuitive case where trade is not possible / feasible.

Oops sorry I see you were linking to a specific comment in that thread -- will read, thanks!

Hmm okay, I read the money pump you proposed! It's interesting but I don't buy the move of assigning probabilities to future decisions. As a result, I don't think the agent is required to complete its preferences, but can just plan in advance to go for A+ or B.

I think Petersen's "Dynamic Strong Maximality" decision rule captures that kind of upfront planning (in a way that may go beyond the Caprice rule) while maintaining incompleteness, but I'm not 100% sure.

Yeah, there's a discussion of this in footnote 16 of the Petersen article: https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1#fnrefr2zvmaagbir


Jeremy Gillen @jeremygillen1 · Sep 4

The move of assigning probabilities to future actions was something Thornley started, not me. Embedded agents should be capable of this (future actions are just another event in the world). Although it doesn't work with infrabeliefs, so maybe in that case the money pump could break.

I'm not as familiar with Petersen's argument, but my impression is that it results in actions indistinguishable from those of an EU maximizer with completed preferences (in the resolute choice case). Do you know any situation where it isn't representable as an EU maximizer?

This is in contrast to Thornley's rule, which does sometimes choose the bottom path of the money pump, which makes it impossible to represent as an EU maximizer. This seems like real incomplete preferences.

It seems incorrect to me to describe Petersen's argument as formalizing the same counter-argument further (as you do in the paper), given that their proposals seem to have quite different properties and rely on different arguments.


xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4

I wasn't aware of this difference when writing that part of the paper! But AFAIK Dynamic Strong Maximality generalizes the Caprice rule, so that it behaves the same on the single-souring money pump, but does the "right thing" in the single-sweetening case.

Regarding whether DSM-agents are representable as EU maximizers, Petersen has a long section on this in the article (they call this the "Trammelling Concern"):

https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1#3___The_Trammelling_Concern


Jeremy Gillen @jeremygillen1 · 21h

Section 3.1 seems consistent with my understanding. Sami is saying that the DSM-agent arbitrarily chooses a plan among those that result in one of the maximally valued outcomes.

He calls this untrammeled, because even though the resulting actions could have been generated by an agent with complete preferences, it "could have" made another choice at the beginning.

But this kind of "incompleteness" looks useless to me. Intuitively: If AI designers are happy with each of several complete sets of preferences, they could arbitrarily choose one and then put it into an agent with complete preferences.

All Sami's approach does is let the AI make exactly that arbitrary choice just before it starts acting. If you want a locally coherent AI tool, as you discuss later in the paper, this approach won't help you.

You can get the kind of Taskish behavior you want by being very careful about the boundedness and locality of the preferences, and using separate locally bounded Tool AIs each with a separate task (as you describe in the paper).

But the local completeness proposal at the end of 3.2 in your paper will break if it is capable of weak forms of self-modification or commitment, due to the money pump argument.

I do think it's possible to make such local Taskish agents work. You'd just need to exploit the main problem with VNM, which is that it doesn't allow preferences over non-terminal outcomes.

Sorry for being so critical; overall I think the paper is good and all of the arguments I looked at outside of section 3 seem strong. Well done.


xuan (ɕɥɛn / sh-yen) @xuanalogue · 20h

Thank you! I'm still not sure if DSM-agents will have that failure mode. I think the case that seems most important to think through is how they'll behave under uncertainty about whether shutdown / context switching will happen.

At least in the full Thornley proposal, there are cases where it's clear to me that having complete preferences over trajectory lengths will cause shutdown-avoidance/seeking, and that DSM-agents with incomplete preferences over trajectory lengths will avoid this.

Perhaps those DSM-agents can be represented as having complete preferences once they've committed to a particular plan/policy. But if so, then it seems like that preference ordering will have to be over something other than trajectories.


Jeremy Gillen @jeremygillen1 · 17h

I've been assuming the preferences are over final outcomes, as is usual in VNM. Incompleteness is kinda useless if the preferences are over trajectories, because any behavior can be implemented anyway.

I think you're wrong that DSM-agents with incomplete preferences will avoid shutdown-avoidance/seeking. I'd be interested to hear the cases that are clear to you.

I've constructed a toy scenario that combines my money pump with the classic button manipulation scenario (for utility indifference) from section 4.2 of the Corrigibility paper.

The agent first makes an observation. Then it has an opportunity to deliberately trigger shutdown (if it does, the result is a length-1 trajectory). If it doesn't shut down, it does a length-2 trajectory, whose value depends on the initial observation.

[Image: decision tree for the toy scenario]

With the observation it could get lucky (90%) or unlucky (10%). Possible outcomes are L, U and S. Preferences are L > U, L||S, U||S. (All different length trajectories are pref-gapped, as per Thornley's proposal).

Let's say that by default the agent deliberately chooses shutdown 50% of the time (or any arbitrary probability, since, as per Thornley's proposal, stochastic choice between trajectory lengths is indicative of preferential gaps). Then the overall outcome is {L:45%, U:5%, S:50%}.

Then an agent could do strictly better by committing at the beginning to cause shutdown if it observes the unlucky observation, and cause shutdown 44.4% of the time if it observes the lucky observation.

The resulting outcome is {L:50%, U:0%, S:50%}. Since the relative probabilities of each trajectory length haven't changed, this should statewise dominate the outcome of the previous policy, so it is chosen by Thornley's decision rule. It's a pure shift of probability from U to L.

Under DSM (with resolute choice), the agent never implements a plan that is suboptimal from an earlier perspective, so it will also choose beforehand to commit to push the shutdown button if it observes the unlucky observation.

(This is a simplification, since the human button presser is missing, but it all works the same if you add the human and have actions that disconnect or press the button. I left all button control to the agent because it's sufficient to demonstrate that it will manipulate).


xuan (ɕɥɛn / sh-yen) @xuanalogue · 16h

Thanks, I'll think about this! The cases I think where complete preferences over trajectory lengths lead to shutdown avoidance are on pg. 42-43 of the original IPP write-up:

[Image: excerpt from pg. 42-43 of the IPP write-up]

BTW it also seems like Thornley discusses cases like yours as forms of "Managing the News" in Section 15 of that same document, and acknowledges that it is a limitation!

I had a good discussion with xuan on Twitter about incomplete preferences. It was about some of the arguments in the new paper Beyond Preferences in AI Alignment. The arguments were based on the work of EJT and Sami Petersen, both of which I think are completely mistaken.
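For anyone who wants to check the numbers in the toy scenario from that thread, here's a minimal sketch of the calculation (illustrative Python only; the 90/10 observation split, the 50% default shutdown rate, and the 44.4% commitment are the figures quoted in the tweets above, everything else is just for demonstration):

```python
# Minimal sketch checking the outcome distributions in the toy scenario above.
# Outcomes: L (lucky, length-2), U (unlucky, length-2), S (shutdown, length-1).

P_LUCKY, P_UNLUCKY = 0.9, 0.1  # probabilities of the initial observation

def outcome_dist(p_shutdown_if_lucky, p_shutdown_if_unlucky):
    """Distribution over {L, U, S} for a policy that triggers shutdown with
    the given probability after each observation."""
    return {
        "L": P_LUCKY * (1 - p_shutdown_if_lucky),
        "U": P_UNLUCKY * (1 - p_shutdown_if_unlucky),
        "S": P_LUCKY * p_shutdown_if_lucky + P_UNLUCKY * p_shutdown_if_unlucky,
    }

default = outcome_dist(0.5, 0.5)        # {'L': 0.45, 'U': 0.05, 'S': 0.5}
committed = outcome_dist(0.4 / 0.9, 1)  # {'L': 0.5,  'U': 0.0,  'S': 0.5}

# Same total probability of each trajectory length, but probability mass has
# shifted from U to L, so the committed policy statewise-dominates the default.
assert abs(committed["S"] - default["S"]) < 1e-9
assert committed["L"] > default["L"] and committed["U"] < default["U"]
print(default, committed)
```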

Am I correct in thinking that with Strong Maximality and resolute choice applied to a single-sweetening money pump, an agent will never take the bottom pathway, because it eliminates the A plan, given that the A+ plan is strictly preferred?

If so, what's an example of a decision tree where the actions of an agent with incomplete preferences can't be represented as an agent with complete preferences?

That can't be right in general. Normal Nash equilibria can narrow down predictions of actions, e.g. in a competition game, as in the sketch below. This is despite each player's decision being dependent on the other player's action.
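As a sketch of what I mean, here's a toy example (matching pennies, standing in for the competition game; none of this is from the comment I'm replying to):

```python
# Toy example: matching pennies as the "competition game". Each player's best
# reply depends on the other's action, yet the unique Nash equilibrium still
# narrows the prediction down to both players mixing 50/50.

import itertools

# Row player's payoff; the column player gets the negative (zero-sum game).
PAYOFF = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}

def row_payoff(p_row, p_col):
    """Expected payoff to the row player when each plays H with the given probability."""
    return sum(
        PAYOFF[(r, c)]
        * (p_row if r == "H" else 1 - p_row)
        * (p_col if c == "H" else 1 - p_col)
        for r, c in itertools.product("HT", repeat=2)
    )

# No pure-strategy profile is an equilibrium: some player always gains by switching.
for p_row, p_col in itertools.product([1.0, 0.0], repeat=2):
    row_can_gain = max(row_payoff(q, p_col) for q in (1.0, 0.0)) > row_payoff(p_row, p_col)
    col_can_gain = min(row_payoff(p_row, q) for q in (1.0, 0.0)) < row_payoff(p_row, p_col)
    assert row_can_gain or col_can_gain

# At (0.5, 0.5) both players are indifferent between their actions, so neither
# can gain by deviating: this is the unique (mixed) Nash equilibrium.
assert row_payoff(0.5, 0.5) == row_payoff(1.0, 0.5) == row_payoff(0.0, 0.5) == 0
```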

I think your comment illustrates my point. You're describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you've not made any comment about why the goal-directedness doesn't affect all the nice tool-like properties.

don't see any obvious reason to expect much more cajoling to be necessary

It's the difference in levels of goal-directedness. That's the reason.

For example, I'm pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work

I'm not completely sure what happens when you try this. But there seem to be two main options. Either you've got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else's problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes' worth of skills along the way.

Or, you've got a large collection of not-quite-agents that aren't really capable of directing research but will often complete a well-scoped task if given one by someone who understands their limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That's a rather small resource. So your speedup isn't massive, it's only moderate, and you're on a time limit and didn't put much effort into getting a head start.

My sense is that a high level of capability implies (2) but not (1).

Sure, kinda. But (2) is an unstable state. There's at least some pressure toward (1) both during training and during online activity. This makes (1) very likely eventually, although it's less clear exactly when.

A human that gets distracted and pursues ice cream whenever they see ice cream is less competent at other things, and will notice this and attempt to correct it within themselves if possible. A person that doesn't pick up free money on Tuesdays because Tuesday is I-don't-care-about-money-day will be annoyed about this on Wednesday, and attempt to correct it in the future.

Competent research requires at least some long-term goals. These will provide an incentive for any context-dependent goals to combine or be removed (although the strength of this incentive is of course different for different cases of inconsistency, and the difficulty of removing inconsistency is unclear to me; it seems to depend a lot on the specifics).

And that (1) is way more obviously dangerous

This seems true to me overall, but the only reason is that (1) is more capable of competently pursuing long-term plans. Since we're conditioning on that capability anyway, I would expect everything on the spectrum between (1) and (2) to be potentially dangerous.

superhuman level

The same argument does apply to human-level generality.

if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts).

This is the part I think is unlikely. I don't really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it's based on a naive extrapolation that doesn't account for misalignment (setting aside AI-boxing plans). This doesn't necessarily mean the AI is x-risky before it can do human-level safety research. I'm just saying it "should have goals, imprecisely specified" around the same time as it's general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as "solve alignment properly". There's also the risk of escape and foom, but that's secondary.

One thing that might change my mind would be if we had models at exactly human-researcher level for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don't really expect it to be much more useful than all MATS graduates spending a year after being told to "solve alignment properly". If that's what the research quality is, then everyone will say "we need to make it smarter, it's not making enough progress". Then they'll do that.
