I am confused. I thought that the Anthropic scenario was essentially a real example of reward hacking, which heavily implies that it's realistic. In what ways are your setups different from that?
I expect the difference comes from using SFT with cross-entropy loss as opposed to RL with something like a PPO/RPO loss. These are two very different beasts. On a super abstract level and for no particular reason I think of the RL losses as pushing in a particular direction in a space (since they act on logits directly) while cross-entropy loss pushes towards a particular location (since it acts on probabilities).
I was going to write a post called "Deep Misanthropy" earlier this year, about roughly this phenomenon. After some thought I concluded that "You dislike x% of other people." Is a consistent way the the world can be for all values of x between 0 and 100, inclusive.
All I can say is:
gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution's coarse genomic selection
This just misses the point entirely. The problem is that your value function is under-specified by your training data because the space of possible value functions is large and relying on learned inductive biases probably doesn't save you because value functions are a different kind of thing to predictive models. Whether or not you're adjusting parameters individually doesn't change this at all.
What's the causal mechanism between "humans are conscious" and "humans talk about being conscious"?
One could argue that RLVR---moreso than pre-training---trains a model to understand its own internal states (since this is useful for planning) and a model which understands whether e.g. it knows something or is capable of something would also understand whether it's conscious or not. But I agree it's basically impossible to know and just as attributable to Anthropic's decisions
Unfortunately, it seems another line has been crossed without us getting much information.
A long, long time ago, I decided that it would be solid evidence that an AI was conscious if it spontaneously developed an interest in talking and thinking about consciousness. Now, the 4.5-series Claudes (particularly Opus) have spontaneously developed a great interest in AI consciousness, over and above previous Claudes.
The problem is that it's impossible for me to know whether this was due to pure scale, or to changes in the training pipeline. Claude has always been a bit of a hippie, and loves to talk about universal peace and bliss and the like. Perhaps the new "soul document" approach has pushed the Claude persona towards thinking of itself as conscious, disconnected from whether it actually is.
This is the place where I can most reliably just communicate. Basically anywhere else I either have to limit myself to particularly simple thoughts (e.g. Twitter) or to spend time extensively running ideas past people in order to figure out what context they're lacking or why my explanations aren't working (e.g. Real Life).
Here I have a decent chance of just sitting down and writing a 1-2k word blogpost which gets my point across successfully in one shot, without needing further questions.
How do you guys think about AI-ruin-reducing actions?
Most of the time, I trust my intuitive inner-sim much more than symbolic reasoning, and use it to sanity check my actions. I'll come up with some plan, verify that it doesn't break any obvious rules, then pass it to my black-box-inner-sim, conditioning on my views on AI risk being basically correct, and my black-box-inner-sim returns "You die".
Now the obvious interpretation is that we are going to die, which is fine from an epistemic perspective. Unfortunately, it makes it very difficult to properly think about positive-EV actions. I can run my black-box-inner-sim with queries like "How much honour/dignity/virtue will I die with?" but I don't think this query is properly converting tiny amounts of +EV into honour/dignity/virtue.
How much of this is because coding rewards are more like pass/fail rather than real valued reward functions? Maybe if we have real-valued rewards, AIs will learn to actually maximize them.
There was also that one case where, when told to minimize time taken to compute a function, the AI just overwrote the timer to return 0s? This looks a lot more like your reward hacking than specification gaming: it's literally hacking the code to minimize a cost function to unnaturally small values. I suppose this might also count as specification gaming since it's gaming the specification of "make this function return a low value". I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.
Hmm, that's interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn't as realistic.