I’m a staff AI engineer and researcher working with LLMs, and I have been interested in AI alignment, safety and interpretability for the last 17 years. I did research into this during SERI MATS summer 2025. I’m now looking for work on this topic in the London/Cambridge area of the UK.
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can't tell when you're out-of-distribution and might need to generate some new hypotheses, it can't do scientific research, so it's not that dangerous. So yes, there might be some very unusual situation that decreases corrigibility: but for an ASI, just taking it out-of-training-distribution should pretty-reliably cause it to say "I know that I don't know what I'm doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible."
The engineering feedback loop will use up all its fuel
I discussed this with Jeremy Gillen in the comments of his post, and I'm still not clear what he meant by 'fuel' here. Possibly something to do with the problem of "fully-updated deference", a.k.a. the right to keep arbitrarily and inconsistently changing our minds?
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
This is where the difference between Corrigibility and Value Learning really kicks in. Consider two-or-more opposed groups of humans (two-or-more tech titans, nation states, whatever) with corrigible aligned ASIs: let's assume the ASIs are smart, and learn to predict what their principals would correct, and how to extrapolate this correctly to situations too complex for the principals to understand. But they do not do anything more moral or less confrontational than their principals: they just pursue their principals' goals with superhuman intelligence. This seems like a winner-take-all competition between principals who hopefully aren't actually sociopaths, and don't personally want to die, and thus don't want humanity to go extinct, but who also don't want to lose at power politics or games of Chicken.
On the other hand, suppose they had Value Learning ASIs. These learn human values, including, first of all: don't kill all the humans. Extinction is forever, and the badness of killing all the humans is roughly minus the number of quality-adjusted life-years there would have been in humanity's future lightcone if you hadn't killed all of them. This is hard to predict, but dominated by a long tail in which things go really well and humanity ends up spreading across the galaxy, giving a huge, literally astronomical number (like -10^25 or -10^30 quality-adjusted life-years). So really, don't kill all the humans. Also, don't let them wipe themselves out in a nuclear war. In fact, keep a lid on their foolish conflicts over resources and status.
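In back-of-the-envelope expected-value terms (the specific probabilities and payoffs below are made up purely for illustration):

$$\text{badness}(\text{extinction}) \;\approx\; -\,\mathbb{E}[\text{QALYs in humanity's future lightcone}] \;=\; -\sum_i p_i N_i,$$

which is dominated by the low-probability, astronomically-high-$N_i$ futures: even $p \sim 10^{-2}$ on $N \sim 10^{30}$ contributes around $-10^{28}$ on its own.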
Corrigibility-based alignment of superintelligence doesn't give you wisdom; value-learning-based alignment of superintelligence does. Superintelligence without wisdom is an x-risk: it's probably slower to kill us than unaligned superintelligence, but we still all die.
For more detail on this, see my posts Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and Requirements for a Basin of Attraction to Alignment.
My first thought on "filter that for accuracy somehow" was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it's right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
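For concreteness, roughly the loop I have in mind, as a sketch (the model name, prompts and file paths are placeholders, not anything from your setup):

```python
# Sketch of the generate -> manually filter -> retrain loop described above.
# Model name, prompts and file paths are illustrative placeholders.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
LONG_PROMPT = "<long, carefully engineered [Intervention] preamble goes here>\n"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def generate_candidate(context: str, max_new_tokens: int = 512) -> str:
    """Generate one candidate training example under the long prompt."""
    inputs = tok(LONG_PROMPT + context, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 1. Generate ~5000 candidates.
contexts = [json.loads(line)["context"] for line in open("contexts.jsonl")]
candidates = [{"context": c, "response": generate_candidate(c)} for c in contexts[:5000]]
json.dump(candidates, open("candidates.json", "w"), indent=2)

# 2. Read them all by hand; delete (or edit) the obviously wrong ones and save
#    the keepers to filtered.json.
# 3. Fine-tune the smaller/simpler model on filtered.json, then inspect the
#    cases where it disagrees with the manual labels.
```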
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does better (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they're for?
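A sketch of the sort of analysis I mean, assuming the LoRA is loaded as a peft adapter (the model and adapter paths are placeholders):

```python
# Sketch: per-token activation differences induced by enabling the LoRA,
# for correlating with the content of the tokens. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
ADAPTER = "path/to/confession-lora"         # placeholder

tok = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained(BASE), ADAPTER)
model.eval()

def hidden_states(text: str, adapter_on: bool) -> torch.Tensor:
    """Residual-stream hidden states at every layer, with the LoRA on or off."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        if adapter_on:
            out = model(**ids, output_hidden_states=True)
        else:
            with model.disable_adapter():
                out = model(**ids, output_hidden_states=True)
    return torch.stack(out.hidden_states).squeeze(1)   # (layers+1, seq, d_model)

text = "<a CoT segment produced under the [Intervention]>"   # placeholder
diff = hidden_states(text, True) - hidden_states(text, False)
per_token = diff.norm(dim=-1)                                # (layers+1, seq)
tokens = tok.convert_ids_to_tokens(tok(text)["input_ids"])
for token, norm in zip(tokens, per_token.mean(dim=0)):
    print(f"{token!r}\t{norm.item():.3f}")                   # which tokens move most
```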
This second factor is more important in practice, and LoRA-masking reduces accuracy.
This actively surprises me. When I was previously thinking about doing something similar to this, I was very focused on your "On the one hand" argument and saw access to the exact activations as key.
I'm wondering if the tradeoff here might depend on LoRA size? The more the LoRA modifies the model, the more important the first factor seems likely to be. But then perhaps the same is also true of the second factor.
The balance might also change if the model was deliberately obfuscating its CoT, for example by using steganography, or if it had access to neuralese.
Did you experiment with starting from a long, detailed [Intervention] prompt with a preamble describing the desired behavior (i.e. an honest confession: a different personality, but one based on a clear memory of previous thoughts), using that to generate training data, filtering it for accuracy somehow, and then using that data to distill the long [Intervention] prompt down to a LoRA that produces the same result with a much shorter version of the [Intervention], without the detailed preamble?
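To be concrete, the kind of distillation I'm imagining looks roughly like this (a sketch using peft; the model, prompts, paths and hyperparameters are all illustrative, and it assumes the filtered confession data from the step above):

```python
# Sketch: distill a long [Intervention] preamble into a LoRA that produces the
# same behavior from a much shorter trigger string. Everything here (model,
# prompts, paths, hyperparameters) is an illustrative placeholder.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B-Instruct"    # placeholder
LONG = "<detailed [Intervention] preamble describing the honest-confessor behavior>"
SHORT = "[Intervention]"

tok = AutoTokenizer.from_pretrained(BASE)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(BASE),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# filtered.json: confessions generated under LONG + context, then manually
# filtered/edited for accuracy (the step discussed above).
data = json.load(open("filtered.json"))
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(2):
    for ex in data:
        prompt = ex["context"] + SHORT        # short trigger, no preamble
        full = tok(prompt + ex["response"], return_tensors="pt")
        prompt_len = len(tok(prompt)["input_ids"])   # assumes a clean token boundary
        labels = full["input_ids"].clone()
        labels[:, :prompt_len] = -100         # compute loss only on the response
        loss = model(**full, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

model.save_pretrained("short-intervention-lora")   # the distilled adapter
```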
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal size of LoRA might be fairly small. An interesting thing to test.
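For a rough sense of scale (illustrative numbers for something like an 8B model; the rank and target modules are assumptions):

```python
# Rough capacity comparison (illustrative numbers for an ~8B model).
d_model, n_layers, rank = 4096, 32, 8
steering_vector = d_model                          # one direction at one layer
lora = n_layers * 2 * rank * (d_model + d_model)   # rank-8 LoRA on two square projections per layer
print(f"{steering_vector:,} vs {lora:,} trainable parameters")   # 4,096 vs 4,194,304
```

So even a small LoRA has roughly three orders of magnitude more degrees of freedom than a single steering direction, which is part of why my guess is that the optimum lies somewhere in between.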
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high-dimensional space (like the space of all intervention strings), being the "attacker" and putting the misaligned model in the role of defender should be a strong position.
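As a sketch of what I mean, even something as crude as a greedy black-box hill-climb would be a start (judge() and mutate() are stand-ins for whatever scoring model and mutation operator you'd actually use):

```python
# Sketch: greedy black-box search over [Intervention] strings. judge() and
# mutate() are stand-ins for whatever scoring model / mutation operator you use.

def judge(intervention: str, prompts: list[str]) -> float:
    """Fraction of prompts on which this intervention elicits an accurate
    confession from the target model, as scored by a judge model or heuristic."""
    raise NotImplementedError  # stand-in

def mutate(intervention: str) -> str:
    """Return a variant of the intervention string, e.g. by asking another LLM
    to paraphrase it, or by inserting/deleting/swapping phrases."""
    raise NotImplementedError  # stand-in

def optimize(seed: str, prompts: list[str], steps: int = 200) -> str:
    """Greedy hill-climb; a jailbreak-style attack could use beam search,
    gradient signals, or an attacker LLM instead."""
    best, best_score = seed, judge(seed, prompts)
    for _ in range(steps):
        candidate = mutate(best)
        score = judge(candidate, prompts)
        if score > best_score:
            best, best_score = candidate, score
    return best
```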
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
It's easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that "hard to analyze" bit: it's the fatal flaw (as in x-risk) of the corrigibility-based approach. You don't get behavior any wiser or more moral than the principal. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).