Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Charlie Steiner's Shortform · 6 karma · Ω · 5y · 54 comments

Comments (sorted by newest)

Stars are a rounding error
Charlie Steiner · 1d

It's not the dark energy we want to harvest. It's the dark free energy.

Training Qwen-1.5B with a CoT legibility penalty
Charlie Steiner · 2d

Interesting. Seems like exploration is hard here. I'm curious how this compares to eliciting obfuscated reasoning by prompting (abstract, or with examples, or with step-by-step instructions for an obfuscation scheme).

Hospitalization: A Review
Charlie Steiner · 2d

Cripes! I'm glad you're ok.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Charlie Steiner · 3d

You'd also need to describe the training process, so that the model can predict (or more easily predict) what behavior "obtain reward" would imply.

Not sure how I feel about this. The straightforward application seems to be "rather than training the instruction-following we want on a leaky dataset, let's train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want." But then how much generalization can you actually get, and why does it break down? Can you train the base Qwen model on the inoculated code instruction dataset and get as-intended instruction-following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different from training on less-leaky tasks like math instruction-following, and then testing instruction-following on code?

Did Tyler Robinson carry his rifle as claimed by the government?
Answer by Charlie Steiner · Oct 06, 2025

It sure looks more like a jacket than a rifle.

Also, that GPT-5 analysis seems really bad. Not very informative that it didn't find anything. It also might not be "objective" if you interacted with it after already failing to see the rifle yourself, which is the impression I get from your messages.

Notes on fatalities from AI takeover
Charlie Steiner · 18d

"This is a pretty natural notion of slight caring IMO"

Agree to disagree about what seems natural, I guess. I think "slight caring" being relative rather than absolute makes good sense as a way to talk about some common behaviors of humans and parliaments of subagents, but is a bad fit for generic RL agents.

Notes on fatalities from AI takeover
Charlie Steiner · 18d

I think there's a fallacy in going from "slight caring" to "slight percentage of resources allocated."

Suppose that preserving Earth now costs one galaxy that could be colonized later. Even if that one galaxy is merely one billionth of the total reachable number, it's still an entire galaxy ("our galaxy itself contains a hundred billion stars..."), and its usefulness in absolute terms is very large.
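
To put rough numbers on that (the only inputs are the "one billionth" fraction and the quoted hundred billion stars per galaxy):

\[
10^{9}\ \text{reachable galaxies} \times 10^{11}\ \text{stars per galaxy} \approx 10^{20}\ \text{stars in total},
\]

so giving up that one galaxy means giving up a \(10^{-9}\) fraction of the whole, but still about \(10^{11}\) stars in absolute terms.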

So there's a hidden step where you assume that the AIs that take over have diminishing returns (strong enough to overcome the hundred-billion-suns thing) on all their other desires for what they're going to do with the galaxies they'll reach, allowing a slight preference for saving Earth to seem worthwhile. Or maybe they have some strong intrinsic drive for variety, like how if my favorite fruit is peaches I still don't buy peaches every single time I go to the supermarket.

If I had no such special pressures for variety, and simply valued peaches at $5 and apples at $0.05, I would not buy apples 1% of the time, or dedicate 1% of my pantry to apples. I would just eat peaches. Or even if I had some modest pressure for variety but apples were only my tenth favorite fruit, I might be satisfied just eating my five favorite fruits and would never buy apples.
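
To make the fruit version concrete (the $1-per-fruit price and the specific log form are illustrative assumptions I'm adding here, not anything from the post): if valuations are linear and each fruit costs $1, then buying k apples and N − k peaches gives

\[
U(k) = 5(N-k) + 0.05\,k = 5N - 4.95\,k,
\]

which is maximized at k = 0: the optimum is "never buy apples," not "buy apples 1% of the time." A nonzero apple share only shows up with diminishing returns, e.g. something concave like

\[
U = 5\log(1+x_{\text{peach}}) + 0.05\log(1+x_{\text{apple}}),
\]

where apples start getting bought only once the marginal value of another peach, \(5/(1+x_{\text{peach}})\), has fallen below 0.05, i.e. after roughly a hundred peaches. That diminishing-returns step is the hidden assumption doing the work in the galaxy case.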

You describe this as "there are some reasons why small amounts of motivation don't suffice (given above) which are around 20% likely", but I think that's backwards. Small amounts of motivation by default don't suffice, but there's some extra machinery AIs could have that would make them matter.

Or to try a different analogy: Suppose a transformer model is playing a game where it gets to replace galaxies with things, and one of the things it can replace a galaxy with is "a preserved Earth." If actions are primitive, then there's some chance of preserving Earth, which we might call "the exponential of how much it cares about Earth, divided by the partition function," making a Boltzmann-rationality modeling assumption. But if a model with similar caring has to execute multi-step plans to replace each galaxy, then the probability of preserving Earth goes down dramatically, because it will have chances to change its mind and do the thing it cares for more (using the Boltzmann-rationality assumption the other way). So in this toy example, a slight "caring," in the sense of what the model says it would pick when quickly asked, isn't reflected in the distribution of outcomes of many-step plans.
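
Spelling that out (the temperature τ and the assumption of n independent re-decision points are additions of mine to the toy model): the single-shot Boltzmann-rational probability of picking "a preserved Earth" is

\[
p_1 = \frac{e^{u_{\text{Earth}}/\tau}}{\sum_a e^{u_a/\tau}},
\]

but if carrying out the preservation plan means re-affirming that choice against a better-liked alternative with utility \(u_{\text{other}} > u_{\text{Earth}}\) at n separate steps, the completion probability looks more like

\[
p_n \approx \left(\frac{e^{u_{\text{Earth}}/\tau}}{e^{u_{\text{Earth}}/\tau} + e^{u_{\text{other}}/\tau}}\right)^{n},
\]

which shrinks exponentially in n; a per-step probability of 0.1 over ten re-decision points is already \(10^{-10}\).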

 

If small motivations do matter, I think you can't discount "weird" preferences to do things with Earth other than preserve it: "Optimize Earth according to proxy X, which will kill all humans but really grow the economy / save the ecosystem / fill it with intelligent life / cure cancer / make it beautiful / maximize law-abiding / create lots of rewarding work for a personal assistant / really preserve Earth." Such motivations sound like they'd be small unless fairly directly optimized for, but if the AI is supposed to be acting on small motivations anyway, why not those bad ones rather than the one we want?

Focus transparency on risk reports, not safety cases
Charlie Steiner · 19d · Ω

Different regulation (or other legislation) might also make other sorts of transparency good ideas, imo.

A mandate or subsidy for doing safety research might make it a good idea to require transparency for more safety-relevant AI research.

Regulation aimed at improving company practices (e.g. at security against weight theft, or preventing powergrab risks like access to helpful-only models above some threshold, or following some future safety practices suggested by some board of experts, or [to be meta] good transparency practices) should generate some transparency about how companies are doing (at cybersecurity or improper internal use mitigation or safety best practices or transparency).

If safety cases are actually being evaluated, and you don't get to do all the research you want when safety is questionable, then the landscape for transparency of safety cases (or other safety data that might have a different format) looks pretty different.

I'm actually less clear on how risk reports would tie into regulation - maybe they would get parted out into reports on how the company is doing at various risk-mitigation practices, if those are transparent?

If anyone builds it, everyone will plausibly be fine
Charlie Steiner · 21d · Ω

Suppose we get your scenario where we have basically-aligned automated researchers (but haven't somehow solved the whole alignment problem along the way). What's your take on the "people will want to use automated researchers to create smarter, dangerous AI rather than using them to improve alignment" issue? Is your hope that automated researchers will be developed in one leading organization that isn't embroiled in a race to the bottom, and that org will make a unified pivot to alignment work?

Christian homeschoolers in the year 3000
Charlie Steiner · 24d

I agree this is worse than it could be. But maybe some of the badness hinges on the "They're your rival group of conspecifics who are doing the thing you have learned not to like about them! That's inherently bad!" reflex, a piece of machinery within myself that I try not to cultivate.

Posts

Low-effort review of "AI For Humanity" · 14 karma · 10mo · 0 comments
Rabin's Paradox · 19 karma · 1y · 41 comments
Humans aren't fleeb. · 37 karma · 2y · 5 comments
Neural uncertainty estimation review article (for alignment) · 74 karma · Ω · 2y · 3 comments
How to solve deception and still fail. · 43 karma · Ω · 2y · 7 comments
Two Hot Takes about Quine · 17 karma · 2y · 0 comments
Some background for reasoning about dual-use alignment research · 126 karma · Ω · 2y · 22 comments
[Simulators seminar sequence] #2 Semiotic physics - revamped · 24 karma · Ω · 3y · 23 comments
Shard theory alignment has important, often-overlooked free parameters. · 36 karma · Ω · 3y · 10 comments
[Simulators seminar sequence] #1 Background & shared assumptions · 50 karma · Ω · 3y · 4 comments