Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

A lot of the effect is picking high-hanging fruit.

Like, go to Phys. Rev. D now. There's clearly a lot of hard work still going on. But that hard work seems to be getting less result, because they're doing things like carefully calculating the trailing-order terms of the muon's magnetic moment to get a correction many decimal places down. (It turns out that this might be important for studying physics beyond the Standard Model, so this is good and useful work, definitely not literally stalled.)

Another chunk of the effect is that you generally don't know what's important now. In hindsight you can look back and see all these important bits of progress woven into a sensible narrative. But research that's being done right now hasn't had time to earn its place in such a narrative. Especially if you're an outside observer who has to get the narrative of research third-hand.

In the infographic, are the numbers for "Leading Chinese Lab" and "Best public model" swapped? The best public model is usually said to be ahead of the Chinese models.

EDIT: Okay, maybe most of it before the endgame is just unintuitive predictions. In the endgame, when the categories "best OpenBrain model," "best public model" and "best Chinese model" start to become obsolete, I think your numbers are weird for different reasons and maybe you should just set them all equal.

Scott Wolchok correctly calls out me, but also everyone else, for failing to make an actually good, definitive existential risk explainer. It's a ton of work to do properly, but definitely worth doing right.

Reminder that https://ui.stampy.ai/ exists

The idea is interesting, but I'm somewhat skeptical that it'll pan out.

  • RG (the renormalization group) doesn't help much going backwards - the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don't expect the micro scale to be simple.
  • Singular learning theory provides a nice picture of phase-transition-like phenomena, but it's likely that large neural networks undergo lots and lots of phase transitions, and that there's not just going to be one phase transition from "naive" to "naughty" that we can model simply.
  • Conversely, lots of important changes might not show up as phase transitions.
  • Some AI architectures are basically designed to be hard to analyze with RG because they want to mix information from a wide variety of scales together. Full attention layers might be such an example.

If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.

On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.

Thanks!

Any thoughts on how this line of research might lead to "positive" alignment properties? (I.e., getting models to be better at doing good things in situations where what's good is hard to learn or figure out, in contrast to the "negative" property of avoiding doing bad things, particularly in cases clear enough that we could build a classifier for them.)

The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, then working on better reward structure is a way of working on scheming prevention.

Very interesting!

It would be interesting to know what the original reward models would say here - does the "screaming" score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?

My suspicion is that the model has learned that apologizing, expressing distress, etc. after making a mistake is useful for getting reward. And also that you are doing some cherry-picking.

At the risk of making people do more morally grey things, have you considered doing a similar experiment with models finetuned on a math task (perhaps with a similar reward model setup for ease of comparison), which you then sabotage by trying to force them to give the wrong answer?

I'm a big fan! Any thoughts on how to incorporate different sorts of reflective data, e.g. different measures of how people think mediation "should" go?

I don't get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).

Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
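
A rough sketch of what I mean, in Python; here `call_model` and `grade_answer` are hypothetical stand-ins for the actual model call and answer checker (not anything from a real API), and only the experimental loop itself is the point:

```python
def call_model(prompt: str, max_tokens: int = 2048) -> tuple[str, int]:
    """Placeholder: return (completion_text, n_new_tokens) from the model."""
    raise NotImplementedError("swap in the real model call")

def grade_answer(completion: str, reference: str) -> bool:
    """Placeholder: check whether the completion reaches the reference answer."""
    raise NotImplementedError("swap in the real grader")

def curve_for_cot(question: str, reference: str, cot_chunks: list[str],
                  fracs=(0.0, 0.25, 0.5, 0.75, 1.0)) -> list[dict]:
    """Prefill a fraction of the (original or summarized) CoT chunks,
    let the model finish on its own, and record accuracy + extra tokens."""
    results = []
    for frac in fracs:
        k = int(len(cot_chunks) * frac)       # number of chunks to keep
        prefix = "".join(cot_chunks[:k])      # truncated CoT prefill
        completion, extra_tokens = call_model(question + "\n" + prefix)
        results.append({
            "frac_of_cot": frac,
            "correct": grade_answer(completion, reference),
            "further_tokens": extra_tokens,
        })
    return results

# Averaging these records over many questions, once with the original CoT chunks
# and once with the summarized chunks, gives the two accuracy / token-use curves
# to compare at each chunk boundary.
```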

Anyhow, thanks for the reply. I have now seen the last figure.
