Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...
Epistemic status: untested but seems plausible. TL;DR: making honesty the best policy during RL reasoning training. Reward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments not only allows the model to cheat on tasks rather than learning to solve them, but also teaches the model to...
Epistemic status: the other thing that keeps me up at night. TL;DR: Even if we solve Alignment, we could well still lose everything. There’s an AI-related existential risk I don’t see discussed much on LessWrong. In fact, it’s so little discussed that it doesn’t even have a good name yet,...
Epistemic status: I just thought this up. There is a well-known style of reasoning called the anthropic argument (which has nothing to do with the AI frontier lab of the same name). It goes something like this: > Scientist 1: “X seems really unlikely! How come I’m observing it?” >...
Alignment Pretraining Shows Promise. TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. This is now the third paper on this idea, and excitement seems to be...
Epistemic status: I've been thinking about this topic for over 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely. [If you disagree, I'd find it very useful to know which step you think fails: even a short comment or crux is helpful.]...
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al. For a couple of years, I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety...