
Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Posts (sorted by new)

6 · Charlie Steiner's Shortform · Ω · 5y · 54 comments

Wikitag Contributions

No wikitag contributions to display.

Comments (sorted by newest)
Shutdown Resistance in Reasoning Models
Charlie Steiner · 12h · 30

I'm always really curious what the reward model thinks of this. E.g. are the trajectories that avoid shutdown on average higher reward than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.
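As a sketch of the check I have in mind (the reward_model.score interface and the two trajectory lists are hypothetical stand-ins for whatever the actual training setup uses):

```python
# Hypothetical sketch: does the reward model score shutdown-avoiding
# trajectories higher, on average, than shutdown-permitting ones?
from statistics import mean

def mean_reward(reward_model, trajectories):
    """Average scalar reward the reward model assigns to a list of trajectories."""
    return mean(reward_model.score(t) for t in trajectories)

def compare_shutdown_behavior(reward_model, avoid_shutdown, permit_shutdown):
    r_avoid = mean_reward(reward_model, avoid_shutdown)
    r_permit = mean_reward(reward_model, permit_shutdown)
    print(f"avg reward, avoids shutdown:  {r_avoid:.3f}")
    print(f"avg reward, permits shutdown: {r_permit:.3f}")
    # A positive gap points at the reward model generalizing in an unintended way;
    # no gap points back at the base model or the agent model.
    return r_avoid - r_permit
```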

Agentic Interpretability: A Strategy Against Gradual Disempowerment
Charlie Steiner · 11d · 20

I was asking more "how does the AI get a good model of itself?", but your answer was still interesting, thanks. I'm still not sure whether you think there are straightforward ways future AI will get such a model that all come out more or less at the starting point of your proposal, or not.

Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"

Foom & Doom 2: Technical alignment is hard
Charlie Steiner · 13d · 20

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from the approaches we currently know how to take. If we knew how to select Python code much more cleverly, suddenly it wouldn't fail anymore. And ditto if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

Agentic Interpretability: A Strategy Against Gradual Disempowerment
Charlie Steiner · 18d* · Ω220

Do you have ideas about how to do this?

I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.

But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.

Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking-first part should plausibly look more like having some idea of how to bias further work towards safety, rather than like building self-improving AI as fast as possible.

I made a card game to reduce cognitive biases and logical fallacies but I'm not sure what DV to test in a study on its effectiveness.
Charlie Steiner · 18d · 30

Maybe you could do something with LLM sentiment analysis of participants' conversations (e.g. while roleplaying a discussion of what the best thing to do for the company would be, with participants genuinely trying to do a good job, both before and after the intervention).

Though for such a scenario, I imagine learning about fallacies makes only a limited difference, and only if people learn to notice the fallacies in themselves, not just in someone they already disagree with.
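A minimal sketch of the kind of scoring I have in mind, using the off-the-shelf Hugging Face sentiment pipeline as a stand-in for a more tailored LLM judge; the transcript format and the before/after comparison are just assumptions, not a worked-out study design:

```python
# Sketch: score the sentiment of discussion transcripts before and after the
# card game. The default sentiment pipeline is a stand-in for an LLM judge.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def transcript_score(utterances):
    """Mean signed sentiment over a list of utterances (+score if POSITIVE, -score if NEGATIVE)."""
    results = sentiment(utterances)
    signed = [r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results]
    return sum(signed) / len(signed)

# Hypothetical before/after transcripts from the roleplay scenario.
before = ["That idea is obviously stupid.", "We already know marketing is to blame."]
after = ["What evidence would change our minds here?", "Let's list the options before arguing."]

print("before:", transcript_score(before))
print("after: ", transcript_score(after))
```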

Prover-Estimator Debate: A New Scalable Oversight Protocol
Charlie Steiner · 19d · Ω450

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability and just reporting that? Because the human probability isn't required to be consistent, I think Bob is sunk: Alice can force the human probability assignment to be inconsistent and then gotcha Bob either for disagreeing with the human or for being inconsistent.
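Here's a toy version of that gotcha, with made-up numbers: if the human's probability is driven by sentiment rather than content, Alice can phrase both a subclaim and its effective negation positively, and the human's assignments won't be coherent.

```python
# Toy illustration (made-up numbers): a sentiment-driven judge gives high
# probability to anything phrased positively, regardless of content.
def biased_human_probability(sentiment):
    # sentiment in [-1, 1]; pure bias, content is ignored
    return 0.5 + 0.4 * sentiment

p_claim = biased_human_probability(+0.9)     # "The plan will clearly succeed"     -> 0.86
p_negation = biased_human_probability(+0.9)  # "We'll happily be free of the plan" -> 0.86

print(p_claim + p_negation)  # 1.72 > 1, so the human assignment is incoherent.
# Bob either matches the human on both (and is incoherent) or stays coherent
# (and disagrees with the human on at least one subclaim). Alice wins either way.
```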

Debate experiments at The Curve, LessOnline and Manifest
Charlie Steiner · 23d · 50

William Lane Craig is great to watch from a meta-perspective. How do you go into someone else's field of expertise and try to beat them in a debate? He clearly thinks about it very carefully, in a way kind of like planning for political debates, but with a much higher-quality intended output.

The Boat Theft Theory of Consciousness
Charlie Steiner · 23d · 22

I had a pretty different interpretation: the dirty secrets were plenty conscious (he consciously knew they might be stealing a boat); instead, he had unconscious mastery of a sort of people-modeling skill, including self-modeling, which let him take self-aware actions in response to this dirty secret.

Learned helplessness about "teaching to the test"
Charlie Steiner · 23d · 110

For math specifically, this seems useful. Maybe also for some notion of "general knowledge."

I had a music class in elementary school. How would you test whether the students had learned to make music? I had a Spanish class; how do you test kids' conversational skills?

Prior to good multimodal AI, the answer was [or maybe still is, not sure] to send a skilled proctor to interact with students one-on-one. But I think this is too unpalatable, for reliability, cost, and objectivity reasons.

(Other similar skills: writing fiction, writing fact, teamwork, conflict resolution, debate, media literacy, cooking, knowledge of your local town)

14 · Low-effort review of "AI For Humanity" · 7mo · 0 comments
18 · Rabin's Paradox · 11mo · 41 comments
37 · Humans aren't fleeb. · 1y · 5 comments
74 · Neural uncertainty estimation review article (for alignment) · Ω · 2y · 3 comments
43 · How to solve deception and still fail. · Ω · 2y · 7 comments
17 · Two Hot Takes about Quine · 2y · 0 comments
126 · Some background for reasoning about dual-use alignment research · Ω · 2y · 22 comments
24 · [Simulators seminar sequence] #2 Semiotic physics - revamped · Ω · 2y · 23 comments
36 · Shard theory alignment has important, often-overlooked free parameters. · Ω · 2y · 10 comments
50 · [Simulators seminar sequence] #1 Background & shared assumptions · Ω · 3y · 4 comments