Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

See my sequence "Reducing Goodhart" for what I (or at least the me from a few years ago) think the impact is on the alignment problem.

the fact that humans evolved from natural selection tells us a lot of what they probably want,

Sure. But only if you already know what evolved creatures tend to want. I.e. once you have already made interpretive choices in one case, you can get some information on how well they hang together with other cases.

"It sure seems like there's a fact of the matter" is not a very forceful argument to me, especially in light of things like it being impossible to uniquely fit a rationality model and utility function to human behavior.

Pick a goal, and it's easy to say what's required. But pick a human, and it's not easy to say what their goal is.

Is my goal to survive? And yet I take plenty of risky actions, like driving, that trade survival off against other things. Even worse, I deliberately undergo some transformative experiences (e.g. moving to a different city and making a bunch of new friends) that in some sense "make me a different person." And worse still, sometimes I'm irrational or make mistakes, but under different interpretations of my behavior, different things count as irrational. If you interpret me as really wanting to survive, driving is an irrational thing I do because it's common in my culture and I don't have a good intuitive feel for statistics. If you interpret me a different way, maybe that intuitive feel gets counted as rational, but my goal changes from survival to something more complicated.
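
To make the underdetermination concrete, here's a toy sketch (entirely my own made-up example, with an assumed softmax "rationality model" and a made-up risk_misjudgement term - not anyone's actual model of humans): the same observed choice frequencies are equally consistent with a roughly rational agent that genuinely values driving's convenience, and with a survival-only agent whose rationality model includes a bias toward underestimating risk.

```python
import numpy as np

def boltzmann(utilities, beta=1.0):
    """Choice probabilities for a softmax ('Boltzmann-rational') agent."""
    z = beta * np.asarray(utilities, dtype=float)
    z = z - z.max()                               # numerical stability
    p = np.exp(z)
    return p / p.sum()

actions = ["drive", "stay home"]

# Interpretation A: a roughly rational agent whose utility trades convenience
# against risk and comes out in favor of driving.
p_A = boltzmann([2.0, 0.0], beta=1.0)

# Interpretation B: a survival-only utility, plus a rationality model in which
# the agent systematically underestimates driving's risk. The same number now
# lives in the "irrationality" column instead of the "values" column.
true_utilities_B = np.array([-1.0, 0.0])          # driving is net-negative for pure survival
risk_misjudgement = np.array([3.0, 0.0])          # bias attributed to the rationality model
p_B = boltzmann(true_utilities_B + risk_misjudgement, beta=1.0)

print(dict(zip(actions, p_A)))   # ~{'drive': 0.88, 'stay home': 0.12}
print(dict(zip(actions, p_B)))   # identical predicted behavior, different imputed goal
```

Both interpretations predict exactly the same behavior, so no amount of behavioral data alone chooses between them.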

This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this post sits right in the middle of it.

Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).

I'm unclear on many of the choices. But I guess I'll just ask about the SVD thing. Why use SVD to change the size of activation histories? What good properties did you expect it to have, and did you play around with it to see whether it gave sensible results?
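
To be concrete about what I'm asking: here is the kind of operation I imagine when I read "use SVD to change the size of activation histories" - a truncated SVD that projects a (timesteps x hidden_dim) activation matrix down to a fixed number of components. This is only my guess at the shape of the method; the function name and dimensions below are made up, not the post's actual code.

```python
import numpy as np

def truncated_svd_resize(history, k):
    """Compress an activation history (T x d) to (T x k) using its top-k singular directions."""
    U, S, Vt = np.linalg.svd(history, full_matrices=False)
    return U[:, :k] * S[:k]                 # coordinates in the top-k right-singular basis

rng = np.random.default_rng(0)
history = rng.normal(size=(128, 512))       # 128 timesteps of 512-dim activations
compressed = truncated_svd_resize(history, k=32)
print(compressed.shape)                     # (128, 32)
```

This has nice properties like being the best rank-k approximation in a least-squares sense, but whether that's the right notion of "sensible" for activation histories is exactly what I'm asking about.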

Good post, but also there might be enough inertia behind using the word "feature" in different contexts that it's hard to stop. Honestly, the in-the-world vs. in-the-model distinction may be the less confusing of the two common distinctions, because in both cases a feature is part of a decomposition of the whole into parts that can be composed with each other. The more subtle one to keep straight is the distinction between features as things found by their local statistical properties vs. features as things found by their impact on the entire computation.

Thanks for the reply! I feel like a loss term that uses the ground truth reward is "cheating." Maybe one could get information from how a feature impacts behavior - but in this case it's difficult to disentangle what actually happens from what the agent "thought" would happen. Although maybe it's inevitable that to model what a system wants, you also have to model what it believes.

My take:

I assume we all agree that the system can understand the human ontology, though? This is at least necessary for communicating and reasoning about humans, which LLMs can clearly already do to some extent.

Can we reason about a thermostat's ontology? Only sort of. We can say things like "The thermostat represents the local temperature. It wants that temperature to be the same as the set point." But the thermostat itself is only very loosely approximating that kind of behavior - imputing any sort of generalizability it doesn't actually have is an anthropomorphic fiction. And it's blatantly a fiction, because there's more than one way to do it - you can suppose the thermostat only wants the temperature sensor to be at the right temperature, vs. wanting the whole room, or the whole world, to be at that temperature; or that it's "changing its mind" when it breaks, vs. that it would want to be repaired; etc.
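
To make the "more than one way to do it" point concrete, here's a toy sketch (all names and numbers made up for illustration): one and the same bang-bang thermostat policy, and two imputed goals that agree everywhere in normal operation and only come apart in counterfactuals the thermostat never handles.

```python
SET_POINT = 20.0

def thermostat_policy(sensor_temp):
    """The actual mechanism: bang-bang control on the sensor reading."""
    return "heat on" if sensor_temp < SET_POINT else "heat off"

def wants_sensor_at_setpoint(sensor_temp, room_temp):
    """Story 1: it 'wants' its own sensor reading to sit at the set point."""
    return abs(sensor_temp - SET_POINT) < 0.5

def wants_room_at_setpoint(sensor_temp, room_temp):
    """Story 2: it 'wants' the whole room to sit at the set point."""
    return abs(room_temp - SET_POINT) < 0.5

# In normal operation the sensor tracks the room, so both stories fit the
# observed behavior equally well:
for temp in (15.0, 20.0, 25.0):
    assert wants_sensor_at_setpoint(temp, temp) == wants_room_at_setpoint(temp, temp)

# The stories only come apart in situations the thermostat never handles,
# e.g. the sensor sitting over a radiator while the room stays cold:
print(thermostat_policy(sensor_temp=20.0))             # "heat off" under either story
print(wants_sensor_at_setpoint(20.0, room_temp=12.0))  # True: story 1 says goal achieved
print(wants_room_at_setpoint(20.0, room_temp=12.0))    # False: story 2 says goal frustrated
```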

To the superintelligent AI, we are the thermostat. You cannot be aligned to humans purely by being smart, because finding "the human ontology" is an act of interpretation, of story-telling, not just a question of fact. Helping an AI narrow down how to interpret humans as moral patients requires giving it extra assumptions or meta-level processes. (Or as I might call it, "solving the alignment problem.")

How can this be, if a smart AI can talk to humans intelligibly and predict their behavior and so forth, even without specifying any of my "extra assumptions"? Well, how can we interact with a thermostat in a way that it can "understand," even without fixing any particular story about its desires? We understand how it works in our own way, and we take actions using our own understanding. Often our interactions fall in the domain of the normal functioning of the thermostat, under which several different possible stories about "what the thermostat wants" apply, and sometimes we think about such stories but mostly we don't bother.

I enjoyed this post, especially because it pointed out argumentation games to me (even if I put no particular stock in debate as a good alignment mechanism, I wasn't aware of this prior literature).

However, sometimes this comes off like asking the authors of a paper about human psychology why they didn't cite a paper about mouse psychology. Certainly psychologists studying humans should at least be aware of the existence of research on mice, and maybe of some highlights, but not every psychology paper needs a paragraph about mice.

Similarly, not talking about interpretability of decision trees or formal languages is no particular failing for a paper on interpretability of deep neural nets. (Though if we do start mapping circuits to formal languages, of course I hope to see useful citations.)

The case of preferences, and citing early work on preferences, seems more nuanced. To a large extent it's more like engineering than science - you're trying to solve a problem - so which previous attempts at a solution should you compare against to help readers? I haven't read that Working With Preferences book you link (is it any good, or did you just google it up?), but I imagine it's pretty rare that I'll want a value learning paper to compare to prospect theory or nonmonotonic logics (though the fraction that would benefit from talking about nonmonotonic logic sounds pretty interesting).

Nice! There's definitely been this feeling with training SAEs that activation penalty+reconstruction loss is "not actually asking the computer for what we want," leading to fragility. TopK seems like it's a step closer to the ideal - did you subjectively feel confident when starting off large training runs?

Confused about section 5.3.1:

To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however that training with Multi-TopK does slightly worse than TopK at k). Training with the baseline ReLU only gives a progressive code up to a value that corresponds to using all positive latents.
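
For reference, here's my reading of the quoted objective as a plain sketch (a stand-in I wrote, not the paper's code; biases, normalization, and any auxiliary losses are omitted): L(k) is the reconstruction error when only the top-k latent pre-activations are kept, and the combined Multi-TopK loss is L(k) + L(4k)/8.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 64, 512
W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))   # toy encoder weights
W_dec = rng.normal(scale=0.1, size=(n_latents, d_model))   # toy decoder weights

def topk_reconstruction_loss(x, k):
    """L(k): encode, keep only the k largest latent pre-activations, decode, MSE."""
    z = x @ W_enc                                    # (batch, n_latents) pre-activations
    keep = np.argsort(z, axis=-1)[:, -k:]            # indices of the top-k latents per example
    z_sparse = np.zeros_like(z)
    np.put_along_axis(z_sparse, keep, np.take_along_axis(z, keep, axis=-1), axis=-1)
    x_hat = z_sparse @ W_dec
    return np.mean((x - x_hat) ** 2)

x = rng.normal(size=(32, d_model))                   # a toy batch of model activations
k = 16
multi_topk = topk_reconstruction_loss(x, k) + topk_reconstruction_loss(x, 4 * k) / 8
print(multi_topk)
```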

Why would we want a progressive code over all hidden activations? If features have different meanings when they're positive versus when they're negative (imagining a sort of Toy Models of Superposition picture where features are a bunch of rays squeezed in around a central point), then it seems like something weird is going on if your negative hidden activations are informative.
