Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. Independently thinking and writing about value learning.


Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner



Just looked up Aligned AI (the Stuart Armstrong / Rebecca Gorman show) for a reference, and it looks like they're publishing blog posts.


There's a big difference between ethics and physics.

When you "don't have physics figured out," this is because there's something out there in reality that you're wrong about. And this thing has no obligation to ever reveal itself to you - it's very easy to come up with physics that's literally inexplicable to a human - just make it more complicated than the human mind can contain, and bada bing.

When you "don't have ethics figured out," it's not that there's some ethical essence out there in reality that contradicts you, it's because you are a human, and humans grow and change as they live and interact with the world. We change our minds because we live life, not because we're discovering objective truths - it would be senseless to say "maybe the true ethics is more complicated than a human mind can contain!"

Yeah, I think these are good points. However, I think that #1 is actually misleading. If we measure "work" in loss or in bits, then yes, we can probably figure out the components that reduce loss the most. But a lot of very important cognition goes into getting the last 0.01 bits of loss in LLMs, which can have big impacts on the capabilities of the model and the semantics of its outputs. I'm pessimistic about human-understanding-based approaches to auditing such low-loss, high-complexity capabilities.

Yeah, but on the other hand, I think this is looking for essential differences where they don't exist. I made a comment similar to this on the previous post. It's not like one side is building rockets and the other side is building ornithopters - or one side is advocating building computers out of evilite, while the other side says we should build the computer out of alignmentronium.

"reward functions can't solve alignment because alignment isn't maximizing a mathematical function."

Alignment doesn't run on some nega-math that can't be cast as an optimization problem. If you look at the example of the value-child who really wants to learn a lot in school, I admit it's a bit tricky to cash this out in terms of optimization. But if the lesson you take from this is "it works because it really wants to succeed, this is a property that cannot be translated as maximizing a mathematical function," then I think that's a drastic overreach.
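To make the point concrete, here's a toy sketch (all names and numbers are hypothetical, chosen purely for illustration) of how even the value-child's informal goal of "really wanting to learn" can be cast as maximizing a mathematical function over actions:

```python
# Toy illustration: an informal goal like "really wants to learn a lot
# in school" cast as argmax of a scalar function over actions.
# The action names and scores below are made up for this example.

def learning_value(action: str) -> float:
    """Hypothetical score for how much each action teaches the child."""
    scores = {
        "do_homework": 0.8,
        "ask_teacher_questions": 0.9,
        "copy_answers": 0.1,  # high grades, low actual learning
        "skip_class": 0.0,
    }
    return scores[action]

actions = ["do_homework", "ask_teacher_questions", "copy_answers", "skip_class"]

# "Really wanting to learn" here is just picking the maximizer.
best_action = max(actions, key=learning_value)
print(best_action)  # ask_teacher_questions
```

Of course, the hard part is whether any explicitly written-down function captures what the child actually cares about; the sketch only shows that "wanting X" and "maximizing a function" aren't different kinds of thing.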

Wait, but surely RL-developed shards that work like human values are the biomimicry approach here, and designing a value learning scheme top-down is the modernist approach. I think this metaphor has its wires crossed.

Great to hear! Maybe I'll see some of you next year.

Bah! :D It's sad to hear he's updated away from ambitious value learning towards corrigibility-like targets. Eliezer's second-hand argument sounds circular to me: suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

I'll admit I'm pessimistic, because I expect institutional inertia to be large and implementation details to unavoidably leave loopholes. But it definitely sounds interesting.

Or better yet, get started on a data pipeline for whole-paper analysis, since it'll probably be practical in a year or two.

Yup, definitely agree this clarification needs to go into the zeitgeist.

Also, thanks for the interesting citations.
