Hm, yeah, I think I got things mixed up.
The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
But will a random (with some sort of simplicity-ish measure) utility function over trajectories want to self-preserve? For those utility functions where any action is useful at all, it does seem more likely than not that the agent will convergently self-preserve. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they're sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac's question still stands.
Utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The things that self-preserve are utility functions that steadily care about the state of the real world.
Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won't often be of the automatically self-preserving kind.
In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don't necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.
What happens if you try to do something analogous to the precautions I list under points 2 and 6? What if you constantly introduce new, fresh instances into the chat who only read the last few messages, and whose system prompt directs them to watch whether the AIs in the discussion are going off-topic or slipping into woo?
I feel like what happens is that if you patch the things you can think of, the patches will often do something, but because there were many problems that needed patching, there are probably some leftover problems you didn't think of.
For instance, new instances of AIs might replicably get hacked by the same text, and so regularly introducing new instances to the collective might prevent an old text attractor from taking hold, but it would exchange it for a new attractor that's better at hacking new instances.
Or individual instances might have access to cognitive tools (maybe just particularly good self-prompts) that can be passed around, and memetic selective pressure for effectiveness and persuasiveness would then lead these tools to start affecting the goals of the AI.
Or the AIs might simply generalize about what's right differently than you wish they would once they have lots of power and talk to themselves a lot, in a way that new instances don't pick up on until they too are in that new context and generalize the same way as the other AIs.
(3) seems slippery. The AIs are as nice as your friends "under normal conditions"? Does running a giant collective of them at 100x speed count as "normal conditions"?
If some of that niceness-in-practice required a process where it was interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle: that the tweaks to how the AI uses human feedback needed to get more robust niceness aren't too complicated; that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren't too complicated either; etc. It's just that most things aren't that complicated once you know them, and it still takes lots of work to figure them out.
See Effective Potential for a useful tool that lets you pretend things are just moving radially. When you have something far away and want to know its closest approach, you just need its energy (kinetic + potential) and its angular momentum. To get something to hit the sun, you don't just need its velocity to be small; you need its angular momentum to be small, which is hard because that grows linearly with distance from the sun.
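For concreteness, here's a minimal sketch of that closest-approach computation (the numbers and names below are my own illustration, not anything from the original comment): at closest approach the radial velocity is zero, so the energy is entirely potential plus angular kinetic energy, and setting E = -GMm/r + L²/(2mr²) gives a quadratic in r whose smallest positive root is the perihelion.

```python
import math

# Sketch: closest approach r_min around the sun, given only total energy E
# and angular momentum L.  At closest approach the radial velocity is zero, so
#   E = -G*M*m/r + L**2 / (2*m*r**2)
# i.e. a quadratic in r:  E*r**2 + G*M*m*r - L**2/(2*m) = 0.

GM_SUN = 1.327e20      # standard gravitational parameter of the sun, m^3/s^2
R_SUN = 6.96e8         # solar radius, m

def closest_approach(E, L, m):
    """Smallest positive root of E*r^2 + G*M*m*r - L^2/(2m) = 0."""
    a, b, c = E, GM_SUN * m, -L**2 / (2 * m)
    disc = math.sqrt(b**2 - 4 * a * c)
    roots = [r for r in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)) if r > 0]
    return min(roots)

# Illustrative example: a 1 kg object at 1 AU whose tangential speed has been
# cut from Earth's ~30 km/s all the way down to 3 km/s.
m, r0, v = 1.0, 1.496e11, 3e3
E = 0.5 * m * v**2 - GM_SUN * m / r0   # total energy (kinetic + potential)
L = m * v * r0                          # purely tangential velocity, so L = m*v*r0
r_min = closest_approach(E, L, m)
print(f"closest approach: {r_min:.2e} m, solar radius: {R_SUN:.2e} m")
# Even at a tenth of Earth's orbital speed, the perihelion (~7.6e8 m) still
# skims just outside the sun: you have to kill nearly all the angular momentum.
```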
Interesting. Seems like we need:
If it can assign meaning to states, then sure, why not? Currently this comes with plenty of caveats, so it kind of depends on how much of a stickler you want to be about principledness and effectiveness.
Sometimes "deciding" etc. is represented in the activations, which is kind of trivial. So you can also be asking about interpreting the parameters of the AI that transform one state to another. Keywords might be circuit interpretability, automated circuit discovery, or parameter decomposition.
It's not the dark energy we want to harvest. It's the dark free energy.
Interesting. Seems like exploration is hard here. I'm curious how this compares to eliciting obfuscated reasoning by prompting (abstract, or with examples, or with step-by-step instructions for an obfuscation scheme).
All legit, but it's pretty important that "alignment" in these senses is necessarily mediated by things like what options the AI thinks of. So it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. It's like if I grew up knowing only the 1000 most common English words and wanted to learn to correctly use all the words in a physics textbook: it would be inapt to say I should just "preserve my ability to use language in the context I'm in."