TurnTrout

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network

Thoughts on Corrigibility

The Causes of Power-seeking and Instrumental Convergence

Reframing Impact

Becoming Stronger

Posts

28 TurnTrout's shortform feed

601

153 Many arguments for AI x-risk are wrong

2mo

92 Dreams of AI alignment: The danger of suggestive names

3mo

120 Steering Llama-2 with contrastive activation additions

4mo

61 How should TurnTrout handle his DeepMind equity situation?

6mo

69 Paper: Understanding and Controlling a Maze-Solving Policy Network

6mo

216 AI presidents discuss AI alignment agendas

8mo

105 ActAdd: Steering Language Models without Optimization

8mo

43 Open problems in activation engineering

9mo

46 Ban development of unpredictable powerful models?

10mo

49 Mode collapse in RL may be fueled by the update equation

10mo

Wiki Contributions

Reinforcement Learning

(+16)

Reinforcement Learning

(+333/-390)

Complexity of Value

(+176/-112)

General Alignment Properties

(+317)

Pages Imported from the Old Wiki

(+9/-5)

Impact Regularization

(+22)

Mild Optimization

(+188)

Impact Regularization

(+95/-32)

Impact Regularization

(+7/-7)

Impact Regularization

(+57)

Comments

Non-myopia stories

TurnTrout 22d Ω22-2

As Turntrout has already noted, that does not apply to model-based algorithms, and they 'do optimize the reward':

I think that you still haven't quite grasped what I was saying. "Reward is not the optimization target" totally applies here. (The post itself only analyzed the model-free case; that does not mean the lesson only applies to the model-free case.)

In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing), you're doing MCTS (or "full-blown backwards induction") on reward for the leaf nodes, the system optimizes the reward. That is -- if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you're optimizing for reward. If you're doing e.g. AlphaZero, that aggregate system isn't optimizing for reward.

Despite the derision which accompanies your discussion of "Reward is not the optimization target", it seems to me that you still do not understand the points I'm trying to communicate. You should be aware that I don't think you understand my views or that post's intended lesson. As I offered before, I'd be open to discussing this more at length if you want clarification.

CC @faul_sname

'Empiricism!' as Anti-Epistemology

TurnTrout 1mo 406

This scans as less "here's a helpful parable for thinking more clearly" and more "here's who to sneer at" -- namely, at AI optimists. Or "hopesters", as Eliezer recently called them, which I think is a play on "huckster" (and which accords with this essay analogizing optimists to Ponzi scheme scammers).

I am saddened (but unsurprised) to see few others decrying the obvious strawmen:

what if [the optimists] cried 'Unfalsifiable!' when we couldn't predict whether a phase shift would occur within the next two years exactly?
...
"But now imagine if -- like this Spokesperson here -- the AI-allowers cried 'Empiricism!', to try to convince you to do the blindly naive extrapolation from the raw data of 'Has it destroyed the world yet?' or 'Has it threatened humans? no not that time with Bing Sydney we're not counting that threat as credible'."

Thinly-veiled insults:

Nobody could possibly be foolish enough to reason from the apparently good behavior of AI models too dumb to fool us or scheme, to AI models smart enough to kill everyone; it wouldn't fly even as a parable, and would just be confusing as a metaphor.

and insinuations of bad faith:

What if, when you tried to reason about why the model might be doing what it was doing, or how smarter models might be unlike stupider models, they tried to shout you down for relying on unreliable theorizing instead of direct observation to predict the future?" The Epistemologist stopped to gasp for breath.
"Well, then that would be stupid," said the Listener.
"You misspelled 'an attempt to trigger a naive intuition, and then abuse epistemology in order to prevent you from doing the further thinking that would undermine that naive intuition, which would be transparently untrustworthy if you were allowed to think about it instead of getting shut down with a cry of "Empiricism!"'," said the Epistemologist.

Apparently Eliezer decided to not take the time to read e.g. Quintin Pope's actual critiques, but he does have time to write a long chain of strawmen and smears-by-analogy.

As someone who used to eagerly read essays like these, I am quite disappointed.

Richard Ngo's Shortform

TurnTrout 1mo Ω442

Nope! I have basically always enjoyed talking with you, even when we disagree.

Counting arguments provide no evidence for AI doom

TurnTrout 2mo Ω342

As I've noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe's report) which rules out the person intending the argument to be about function space. (E.g., people say things like "bits" and "complexity in terms of the world model".)

Aren't these arguments about simplicity, not counting?

Counting arguments provide no evidence for AI doom

TurnTrout 2mo Ω440

I think they meant that there is an evidential update from "it's economically useful" upwards on "this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far."

Perhaps it's easy to consider the same style of reasoning via: "The routes I take home from work are strongly biased towards being short, otherwise I wouldn't have taken them home from work."

Counting arguments provide no evidence for AI doom

TurnTrout 2mo Ω573

Sorry, I do think you raised a valid point! I had read your comment in a different way.

I think I want to have said: aggressively training AI directly on outcome-based tasks ("training it to be agentic", so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it's worth distinguishing between these kinds of argument.

Richard Ngo's Shortform

TurnTrout 2mo Ω56-4

In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

Personally, I'm not ignoring that question, and I've written about it (once) in some detail. Less relatedly, I've talked about possible utility function convergence via e.g. "A shot at the diamond-alignment problem" and my recent comment thread with Wei_Dai.

It's not that there isn't more shard theory content which I could write, it's that I got stuck and burned out before I could get past the 101-level content.

I felt that

a) I was being gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it);
b) I wasn't successfully communicating many intuitions;^[1] and
c) it didn't seem as important to make theoretical progress anymore, especially since I hadn't even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in "Understanding and controlling a maze-solving policy network").

So I didn't want to post much on the site anymore because I was sick of it, and decided to just get results empirically.

In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

I've always read "assume heuristics" as expecting more of an "ensemble of shallow statistical functions" than "a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed." Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is composed of smaller shards, and the developmental trajectory over which those shards formed.

^[1] The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.

TurnTrout's shortform feed

TurnTrout 2mo Ω340

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

Thanks for pointing out that distinction!

Many arguments for AI x-risk are wrong

TurnTrout 2mo Ω000

See footnote 5 for a nearby argument which I think is valid:

The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals.

Many arguments for AI x-risk are wrong

TurnTrout 2mo Ω330

I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff, I put 75% on something like the current paradigm being sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)