LESSWRONG
EJT

I work on AI alignment. Right now, I'm using ideas from decision theory to design and train safer artificial agents.

I also do work in ethics, focusing on the moral importance of future generations.

You can email me at thornley@mit.edu.

Comments

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
EJT · 2mo

Really interesting paper. Granting the results, it seems plausible that AI still boosts productivity overall by easing the cognitive burden on developers and letting them work more hours per day.

Why Do Some Language Models Fake Alignment While Others Don't?
EJT · 2mo

Ah good to know, thanks!

Why Do Some Language Models Fake Alignment While Others Don't?
EJT · 2mo

I'd guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.

What I'm saying here is kinda like your Hypothesis 4 ('H4' in the paper), but it seems worth pointing out the different levels of optimization directly.

a confusion about preference orderings
EJT · 4mo

"There are no actions in decision theory, only preferences. Or put another way, an agent takes only one action, ever, which is to choose a maximal element of their preference ordering. There are no sequences of actions over time; there is no time."

That's not true. Dynamic/sequential choice is quite a large part of decision theory.

The Shutdown Problem: Incomplete Preferences as a Solution
EJT · 6mo

Ah, I see! I agree it could be more specific.

The Shutdown Problem: Incomplete Preferences as a Solution
EJT · 6mo

Article 14 seems like a good provision to me! Why would UK-specific regulation want to avoid it?

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
EJT · 7mo

How do we square this result with Anthropic's Sleeper Agents result? 

Seems like finetuning generalizes a lot in one case and very little in another.

Detect Goodhart and shut down
EJT · 7mo

Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.

Here's why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can't lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown. It might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A. It might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here.
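
To spell out the two cases (a minimal formalization of the argument above, in my own notation: S for shutdown, ≻ for strict preference, ∼ for lack of preference, with Transitivity taken to cover the mixed case of strict preference and indifference):

$$
\begin{aligned}
&\text{Given: } A^+ \succ A.\\
&\text{Case 1: } S \sim A. \text{ Then } A^+ \succ A \text{ and } A \sim S \text{ give } A^+ \succ S.\\
&\text{Case 2: } S \sim A^+. \text{ Then } S \sim A^+ \text{ and } A^+ \succ A \text{ give } S \succ A.
\end{aligned}
$$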

I'm not sure the medical test is a good analogy. I don't mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent's goal is to do what we humans really want. And that's something we can't assume, given the difficulty of alignment.

Detect Goodhart and shut down
EJT · 7mo

This is a cool idea. 

Regarding the agent believing that it's impossible to influence the probability that its plan passes validation: won't this either (1) be very difficult to achieve, or else (2) screw up the agent's other beliefs? After all, if the agent's other beliefs are accurate, they'll imply that the agent can influence the probability that its plan passes validation. So either (a) the agent's beliefs are inconsistent, or (b) the agent makes its beliefs consistent by coming to believe that it can influence the probability that its plan passes validation, or (c) the agent makes its beliefs consistent by coming to believe something false about how the world works. Each of these possibilities seems bad.

Here's an alternative way of ensuring that the agent never pays costs to influence the probability that its plan passes validation: ensure that the agent lacks a preference between every pair of outcomes which differ with respect to whether its plan passes validation. I think you're still skeptical of the idea of training agents to have incomplete preferences, but this seems like a more promising avenue to me.
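
For concreteness, here's a minimal sketch of the kind of preference relation I have in mind (toy code of my own, with made-up names like Outcome and strictly_prefers, not anything from the post): any two outcomes that differ on whether the plan passes validation are simply incomparable, so shifting that probability never looks worthwhile to the agent.

```python
# Illustrative sketch only: a preference relation that is silent between
# outcomes differing on whether the plan passes validation.

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    plan_value: float        # how good the plan is, conditional on passing
    passes_validation: bool  # whether the plan passes validation


def strictly_prefers(a: Outcome, b: Outcome) -> Optional[bool]:
    """True if a is strictly preferred to b, False if b is strictly preferred
    to a, None if there is no preference either way."""
    if a.passes_validation != b.passes_validation:
        return None  # incomparable: the outcomes differ on validation status
    if a.plan_value == b.plan_value:
        return None  # no strict preference within a validation class
    return a.plan_value > b.plan_value


# The agent prefers better plans conditional on passing validation...
print(strictly_prefers(Outcome(2.0, True), Outcome(1.0, True)))   # True
# ...but has no preference between a plan's passing and not passing,
# so it has no incentive to pay costs to shift that probability.
print(strictly_prefers(Outcome(2.0, True), Outcome(2.0, False)))  # None
```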

You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com
EJT · 7mo

Nice post! There can be some surprising language barriers between early modern writers and today's readers. I remember as an undergrad getting very confused by a passage from Locke in which he repeatedly used the word 'sensible.' I took him to mean 'prudent' and only later discovered he meant 'can be sensed'!

Posts

Towards shutdownable agents via stochastic choice (1y)
The Shutdown Problem: Incomplete Preferences as a Solution (2y)
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists (2y)
The price is right (2y)
What are some examples of AIs instantiating the 'nearest unblocked strategy problem'? (2y)
EJT's Shortform (2y)
There are no coherence theorems (3y)