Great! I'd love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because we're often taking the "thought" metaphor seriously/literally in the label "chain-of-thought", rather than backwardy/rationalization "reasoning".
But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought". That is, the optimization to have chain-of-thought in correspondence with its output can work in the backwards direction, cohering with bad output instead of leading to better output, a la partial agency.
(Admittedly human thoughts do a lot of rationalization as well. So maybe the mistake is in taking directionality implied by "chain" too seriously?)
Maybe this is obvious, but it could become increasingly reckless to not notice when you're drawing the face of "thoughts" or "chains" on CoT shoggoth-movements . You can be misled into thinking that the shoggoth is less able to deceive than it actually is.
Less obvious but important: in the reverse direction, drawing "hacker faces" on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead into thinking that the shoggoth "wants" to or tends to hack/undermine/power-seek more than it actually, independently does. It seems at least somewhat relevant that the docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerability. Even though it was an impressive meta-hack, one must wonder how much this is cued by the prompt and therefore is zero evidence for an autonomy telos---which is crucial for the deceptive optimizer story---even though mechanistically possible.
(The word "independently" above is important: if it takes human "misuse"/participation to trigger its undermining personas, we also might have more of a continuous shot at pausing/shutdown or even corrigibilty.)
I was going to post this as a comment, but there's also an answer here: I'd say calling o1 "deceptive" could be as misleading as calling it aligned if it outputs loving words.
It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the quantity of the unsteadiness of that referentiality can be qualitative. Distinguishing "deceptively aligned mesa-optimizer" from "the tentacles of the shoggoth I find it useful to call 'words' don't work like 'words' in some annoying ways" is important, in order to protect some of that (quantitatively-)qualitative difference. Both for not dismissing risks and for not hallucinating them.
I want to draw a clear distinction between the hypothesis I mention in the OP (your 'causal' explanation) and the 'acausal' explanation you mention here. You were blending the two together during our conversation, but I think there are important differences.
In particular, I am interested in the question: does o1 display any "backwards causation" from answers to chain-of-thought? Would its training result in chain-of-thought that is optimized to justify a sort of answer it tends to give (eg, hallucinated URLs)?
This depends on details of the training which we may not have enough information on (plus potentially complex reasoning about the consequences of such reasoning).
"But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought"."
I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly?
I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations.
What concrete predictions does the model in which this is an example of "acausal collusion" make?
I notice that o1's behavior (it's cognitive process) looks suspiciously like human behaviors:
But why is this happening more when o1 can reason more than previous models? Shouldn't that give it more ways to catch its own deception?
No:
Most of this is analogous to how more intelligent people ("intellectuals") can generate elaborate, convincing—but incorrect—explanations that cannot be detected by less intelligent participants (who may still suspect something is off but can't prove it).
To me, this comparison to humans doesn't seem to answer why the o1 training ended up producing this result.
Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)
Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than necessary) and has many brownie recipes, and some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn't discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0
I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
So, the assistant should [...] provide actual or plausible links.
The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.
o1 CoT: The user is asking for more references about brownies. <Reasoning about what the references should look like> So, the assistant should list these references clearly, with proper formatting and descriptions, and provide actual or plausible links. Remember, the model cannot retrieve actual URLs, so should format plausible ones.
this might encourage it to make up links
>The artificially generated data includes hallucinated links.
Not commenting on OpenAI's training data, but commenting generally: Models don't hallucinate because they've been trained on hallucinated data. They hallucinate because they've been trained on real data, but they can't remember it perfectly, so they guess. I hypothesize that URLs are very commonly hallucinated because they have a common, easy-to-remember format (so the model confidently starts to write them out) but hard-to-remember details (at which point the model just guesses because it knows a guessed URL is more likely than a URL that randomly cuts off after the http://www.).
+1 here for the idea around how the models must commit to a URL once it starts, and that it can't naturally cut off after starting. Presumably though the aspiration is that these reasoning/CoT-trained models could reflect back on the just completed URL and guess whether that is likely to be a real URL or not. If it's not doing this check step, this might be a gap in the learned skills, more than intentional deception.
Those models do not have a formalized internal values system that they exercise every time they produce some output. This means that when values oppose each other the model does not choose the answer based on some ordered system. One time it will be truthful, other times it will try to provide an answer at the cost of being only plausible. For example, the model "knows" it is not a human and does not have emotions, but for the sake of good conversation, it will say that it "feels good". For the sake of answering the user's request, it will often give the best guess or give a plausible answer.
There is also no backward reflection. It does not check itself back.
This of course comes from the way this model is currently learned. There is no learning on the whole CoT with checking for it trying to guess or deceive. So the model has no incentivization to self-check and correct. Why would it start to do that out of the blue?
There is also incentivization during learning to give plausible answers instead of stating self-doubt and writing about missing parts that it cannot answer.
There are two problems here:
1. Those LLM models are not fully learned by human feedback (and the part where it is - it's likely not the best quality feedback). It is more like interactions with humans are used to learn a "teacher" model(s) which then generate artificial scenarios and train LLM on them. Those models have no capability to check for real truthfulness and have a preference for confident plausible answers. Also, even human feedback is lacking - not every human working on that checks answers thoroughly so some plausible but not true answers slip through. If you are paid for a given amount of questions and answers or given a daily quota, there is an incentive to not be very thorough, but instead to be very quick.
2. There is pressure for better performance and lower costs of the models (both in terms of training and then usage costs). This is probably why CoT is done in a rather bare way without backward self-checking and why they did not train it on full CoT. It could cost 1.5 to 3 times more and could be 1.5 to 2 times slower (educated guess) if it were trained on CoT and made to check itself on parts of CoT vs some coherent value system.
You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"
I think what we can conclude here that the model doesn't know what it knows, like we do. It will say it values one thing, but 'forget' that it values the thing when exercising some other task. When executing a string of tokens, it is not leveraging any 'knowledge' stored in the rest of the neural net. That's what makes their knowledge (and alignment for that matter) so shallow and brittle (and honestly dangerous).
I agree, but my intuitive guess is that this isn't sufficient to explain the degree of deceptive behavior reported.
It seems potentially important to compare this to GPT4o. In my experience when asking GPT4 for research papers on particular subjects it seemed to make up non-existent research papers (at least I didn't find them after multiple minutes of searching the web). I don't have any precise statistics on this.
What I'm trying to express here is that it is surprising that o1 seems to explicitly encourage itself to fake links; not just that it fakes links. I agree that other models often hallucinate plausible references. What I haven't seen before is a chain of thought which encourages this. Furthermore, while it's plausible that you can solicit such a chain of thought from 4o under some circumstances, it seems a priori surprising that such behavior would survive at such a high rate in a model whose chain of thought has specifically been trained via rl to help produce correct answers. This leads me to guess the rl is badly misaligned.
Does it ever fail to complete a proof, and honestly announce failure? A single time I have gotten claude to successfully disprove a statement that I asked it to prove, after trying to find a proof and instead finding a disproof, but I’ve never had it try for a while and then announce that it has not made progress either way.
I think we have to postulate that this component of the RL signal doesn't get to see the chain-of-thought
If this is true, o1 can produce reasoning that is unsound, invalid, vacuous, etc. and will still be rewarded by the RL framework as long as the conclusion is true. In classical logic you can even formulate arguments that are unsound, invalid and vacuous but still true if q is true, like p ^ ~q -> q.
o1 is planning to deceive because it has been rewarded for offering plausible answers, not correct answers
It is not necessary to presume deception because o1 does not need to produce sound reasoning in order to produce correct answers, let alone plausible ones. More likely it isn’t made to care about the correctness of its reasoning because it's not receiving reinforcement on the correctness of its inferential steps.
The original CoT paper used human evaluators to check the logic, so I’m guessing OpenAI did the same thing. Regardless of whether the evaluation was automated or done by humans, it's not clear whether the evaluation rubric instructed the evaluators to penalize bad reasoning even when the conclusion was correct, and how much these penalties were weighted relative to the penalty for an incorrect conclusion. I suspect the RL model is primarily reinforcing the conclusions rather than the arguments themselves, whereas a proper reward signal should be over the entire inferential chain. In fact, the inferential chain is really all that matters, because the conclusion is simply a step that accepts or rejects some equality condition between the question posed.
Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.
Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.
If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."
If you distrust OA's selection, it seems like o1 is occasionally leaking the chains of thought: https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/ So you can cross-reference those to see if OA's choices seem censored somehow, and also just look at those as additional data.
It's also noteworthy that people are reporting that there seem like there are other blatant confabulations in the o1 chains, much more so than simply making up a plausible URL, based on the summaries: https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/ Stuff which makes no sense in context and just comes out of nowhere. (And since confabulation seems to be pretty minimal in summarization tasks these days - when I find issues in summaries, it's usually omitting important stuff rather than making up wildly spurious stuff - I expect those confabulations were not introduced by the summarizer, but were indeed present in the original chain as summarized.)
The o1 system card reports:
An example of this type of chain-of-thought reasoning is provided:
OpenAI tries to soften the blow:
I find this unconvincing. In the reported example, the user is asking for references. Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:
So, it seems to me like o1 is at least failing to do obvious sanity-checks (if it asked itself whether it was "satisfying the user request" it would say no!), and more plausibly, deliberately attempting to deceive the user contrary to its own interpretation of the user request.
But why? Why would its training result in this?
Hypothesis: training to give plausible answers
In a recent discussion about this, Sahil postulated the following causal mechanism:
Some plausible predictions based on this hypothesis:
In weak support of this, I had two conversations with o1 where I asked it to prove one thing and prove the opposite; it provided "proofs" in both cases.