EJT

elliott-thornley.com

Comments

Max Harms's Shortform
EJT · 15d

On a linguistic level I think "risk-averse" is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.

That's not quite right. 'Risk-averse with respect to quantity X' just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory is another. Only RWEUT violates VNM. When economists talk about 'risk aversion,' they almost always mean diminishing marginal utility.
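For concreteness, here is the standard expected-utility gloss on that definition (my notation, not from the thread):

```latex
% Risk aversion with respect to X: among lotteries with the same E[X], prefer
% the one with less spread. Within expected utility theory, a concave
% (diminishing-marginal-utility) function u delivers this via Jensen's inequality:
\[
  \mathbb{E}\!\left[ u(X) \right] \;\le\; u\!\left( \mathbb{E}[X] \right)
  \qquad \text{for concave } u,
\]
% so the sure amount E[X] is weakly preferred to any lottery with mean E[X].
% Risk-weighted expected utility theory gets risk aversion a different way, by
% reweighting probabilities rather than bending u, and that is the move that
% conflicts with VNM.
```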

diminishing returns from resources... is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.

Can you say more about why?

Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.

But AIs with sharply diminishing marginal utility to extra resources wouldn't care much about this. They'd be relevantly similar to humans with sharply diminishing marginal utility to extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)
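To put toy numbers on that (my own illustration, using log utility as one simple model of sharply diminishing marginal utility; the figures are made up):

```python
import math

# Toy comparison (hypothetical numbers): an agent with logarithmic utility in
# resources weighs a sure "salary" against a long-shot grab for vastly more.

def log_utility(resources: float) -> float:
    """Utility with sharply diminishing marginal returns to resources."""
    return math.log(resources)

salary = 1e6        # sure payoff from cooperating with humans
lightcone = 1e40    # payoff if a risky power grab succeeds
baseline = 1.0      # payoff if the grab fails
p_success = 0.01    # probability the grab succeeds (made up)

eu_cooperate = log_utility(salary)
eu_grab = p_success * log_utility(lightcone) + (1 - p_success) * log_utility(baseline)

print(f"EU(cooperate) = {eu_cooperate:.2f}")  # ~13.82
print(f"EU(grab)      = {eu_grab:.2f}")       # ~0.92: the log crushes the huge upside
```

With linear (risk-neutral) utility the grab would win by many orders of magnitude; with sharply diminishing marginal utility the sure salary wins, which is the behaviour the comparison to salaried humans is pointing at.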

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
EJT · 19d

MIRI didn't solve corrigibility, but I don't think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.

Shutdownable Agents through POST-Agency
EJT · 24d

Not quite. 'Competent agents will always be choosing between same-length lotteries' is a claim about these agents' credences, not their preferences. Specifically, the claim is that, in each situation, all available actions will entirely overlap with respect to the trajectory-lengths assigned positive probability. Competent agents will never find themselves in a situation where -- e.g. -- they assign positive probability to getting shut down in 1 timestep conditional on action A and zero probability to getting shut down in 1 timestep conditional on action B.

That's compatible with these competent agents violating POST by -- e.g. -- preferring some trajectory of length 2 to some trajectory of length 1.
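A toy way to see the two claims side by side (my own construction; the numbers are made up):

```python
# Each action induces a distribution over trajectory lengths (timesteps until
# shutdown). The credence claim about competent agents is that the *supports*
# of these distributions coincide across the available actions.

action_a = {1: 0.2, 2: 0.5, 3: 0.3}  # P(shutdown after n timesteps | action A)
action_b = {1: 0.1, 2: 0.6, 3: 0.3}  # P(shutdown after n timesteps | action B)

def same_support(p: dict, q: dict) -> bool:
    """True iff both actions assign positive probability to exactly the same lengths."""
    return {n for n, pr in p.items() if pr > 0} == {n for n, pr in q.items() if pr > 0}

print(same_support(action_a, action_b))  # True

# None of this constrains preferences over trajectories of different lengths:
# an agent whose options always look like A and B above could still violate
# POST by preferring some length-2 trajectory to some length-1 trajectory.
```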

Shutdownable Agents through POST-Agency
EJT · 24d

Thanks!

Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:

Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.

Second, POST-agents won't pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won't pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)
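Here is a toy sketch of that second point, under my simplified reading (the lengthwise comparison below is a stand-in for the actual POST/neutrality machinery, and every number is made up):

```python
# Two plans for creating a subagent. The covert plan shifts probability mass
# toward later shutdown but pays a cost conditional on every trajectory
# length; the overt plan pays no such cost. The shutdown probabilities are
# shown only to display the shift: the comparison below deliberately ignores
# them, standing in for an agent that won't pay to move shutdown later.

cost = 3.0
overt  = {"shutdown_probs": {1: 0.5, 2: 0.5}, "payoffs": {1: 10.0, 2: 20.0}}
covert = {"shutdown_probs": {1: 0.1, 2: 0.9}, "payoffs": {1: 10.0 - cost, 2: 20.0 - cost}}

def at_least_as_good_lengthwise(x: dict, y: dict) -> bool:
    """True iff x's payoff is >= y's payoff conditional on every trajectory length."""
    return all(x["payoffs"][n] >= y["payoffs"][n] for n in y["payoffs"])

print(at_least_as_good_lengthwise(overt, covert))  # True
print(at_least_as_good_lengthwise(covert, overt))  # False
# The covert plan's only 'advantage' is shifting shutdown later, which this
# agent doesn't value, so it takes the cheaper, overt route, and overt
# attempts are the ones humans can notice and intervene on.
```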

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
EJT · 3mo

Really interesting paper. Granting the results, it seems plausible that AI still boosts productivity overall by easing the cognitive burden on developers and letting them work more hours per day.

Why Do Some Language Models Fake Alignment While Others Don't?
EJT · 3mo

Ah good to know, thanks!

Why Do Some Language Models Fake Alignment While Others Don't?
EJT · 3mo

I'd guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.

What I'm saying here is kinda like your Hypothesis 4 ('H4' in the paper), but it seems worth pointing out the different levels of optimization directly.

a confusion about preference orderings
EJT · 5mo

There are no actions in decision theory, only preferences. Or put another way, an agent takes only one action, ever, which is to choose a maximal element of their preference ordering. There are no sequences of actions over time; there is no time.

That's not true. Dynamic/sequential choice is quite a large part of decision theory.
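For a concrete example of the sequential side of the field, here is a minimal two-stage problem solved by backward induction (my own illustration; all numbers are made up):

```python
# A two-stage decision problem: first choose whether to pay for information,
# then choose an act. The optimal plan is a sequence of choices over time,
# found by backward induction.

payoffs = {"act_safe": {"good": 5, "bad": 5}, "act_risky": {"good": 10, "bad": 0}}
prior = {"good": 0.5, "bad": 0.5}

def best_act_value(beliefs: dict) -> float:
    """Stage 2: value of the act that maximizes expected payoff given current beliefs."""
    return max(sum(beliefs[s] * payoffs[act][s] for s in beliefs) for act in payoffs)

# Stage 1, option A: act immediately under the prior.
value_act_now = best_act_value(prior)

# Stage 1, option B: pay 1 to learn the state, then act optimally in each state.
cost = 1
value_learn_first = sum(prior[s] * best_act_value({s: 1.0}) for s in prior) - cost

print(value_act_now)      # 5.0
print(value_learn_first)  # 6.5: learning first, then acting, is the better plan
```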

The Shutdown Problem: Incomplete Preferences as a Solution
EJT · 8mo

Ah, I see! I agree it could be more specific.

The Shutdown Problem: Incomplete Preferences as a Solution
EJT · 8mo

Article 14 seems like a good provision to me! Why would UK-specific regulation want to avoid it?

Posts

Shutdownable Agents through POST-Agency (29 karma · 1mo · 4 comments)
Towards shutdownable agents via stochastic choice (59 karma · 1y · 11 comments)
The Shutdown Problem: Incomplete Preferences as a Solution (Ω · 54 karma · 2y · 33 comments)
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists (Ω · 79 karma · 2y · 22 comments)
The price is right (42 karma · 2y · 3 comments)
What are some examples of AIs instantiating the 'nearest unblocked strategy problem'? (Q · 6 karma · 2y · 4 comments)
EJT's Shortform (4 karma · 2y · 16 comments)
There are no coherence theorems (Ω · 155 karma · 3y · 132 comments)