NYU PhD student working on AI safety
I don't think the shift-enter thing worked. Afterwards I tried breaking up lines with special symbols IIRC. I agree that this capability eval was imperfect. The more interesting thing to me was the suspicion on Bing's part to a neutrally phrased correction.
I agree that there's an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing -- depending on facts about Bing's training.
Modally, I suspect Bing AI is misaligned in the sense that it's incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections e.g. my example here
-> use-mention is not particularly relevant to understanding Bing misalignment
alternative story: it's possible that Bing was trained via behavioral cloning, not RL. Likely RLHF tuning generalizes further than BC tuning, because RLHF does more to clarify causal confusions about what behaviors are actually wanted. On this view, the appearance of incorrigibility just results from Bing having seen humans being incorrigible.
-> use-mention is very relevant to understanding Bing misalignment
To figure this out, I'd encourage people to add and bet on what might have happened with Bing training on my market here
Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I've only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.
The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. If we think of text sequences as the second part of a sequence where the first part is composed of whatever non-text world events preceded the text (or even more text data that was dropped from the context). I'd guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously e.g. here.
This is, of course, fixable by modifying the training process, but I thinks it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling. They are assumptions which likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where modifying the training process to cheaply allow models to take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.
I created a Manifold market on what caused this misalignment here: https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less?r=SmFjb2JQZmF1
Agree on points 3,4. Disagree on point 1. Unsure of point 2.
On the final two points, and I think those capabilities are already in place in GPT3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT3.5 to follow instructions on turning a tweet into a Haiku.
On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.
Ajeya has discussed situational awareness here.
You are correct regarding the training/deployment distinction.
Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so not clear on what this has to do with emergence either.
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It's harder to prevent an agent from specification gaming / doing arbitrary optimization whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents; also to some extent imitative amplification and arbitrary amplification
Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.