I saw them in 10-20% of the reasoning chains. I mostly played around with situational-awareness-flavored questions, so I don't know whether the Chinese characters are more or less frequent in the longer reasoning chains produced for difficult reasoning problems. Here are some examples:
The translation of the Chinese words here (according to GPT) is "admitting to being an AI."
This is the longest string in Chinese that I got. The English translation is "It's like when you see a realistic AI robot that looks very much like a human, but you understand that it's just a machine controlled by a program."
The translation here is "mistakenly think."
Here, the translation is "functional scope."
So it seems like all of them are pretty direct translations of the English words that should be in place of the Chinese ones, which is good news. It's also reassuring to me that none of the reasoning chains contained sentences or paragraphs that looked out of place or completely unrelated to the rest of the response.
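For anyone who wants to check the frequency more systematically than I did, here is a minimal sketch of how one might count reasoning chains that contain Chinese characters. The function names and sample strings are made up for illustration, and it assumes the CJK Unified Ideographs block (plus Extension A) is a good enough proxy for "Chinese characters"; it is not the method I used for the rough estimate above.

```python
import re

# Assumption: CJK Unified Ideographs (U+4E00-U+9FFF) plus Extension A
# (U+3400-U+4DBF) is a reasonable proxy for "Chinese characters" here.
CJK_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")


def contains_cjk(text: str) -> bool:
    """Return True if the text contains at least one CJK ideograph."""
    return CJK_RUN.search(text) is not None


def cjk_spans(text: str) -> list[str]:
    """Return every contiguous run of CJK ideographs found in the text."""
    return CJK_RUN.findall(text)


def fraction_with_cjk(chains: list[str]) -> float:
    """Return the share of reasoning chains with at least one CJK ideograph."""
    if not chains:
        return 0.0
    return sum(contains_cjk(chain) for chain in chains) / len(chains)


if __name__ == "__main__":
    # Hypothetical sample chains, invented purely for illustration.
    sample_chains = [
        "The user might 误以为 that I have access to the internet.",
        "I should answer this directly and explain my reasoning.",
    ]
    print(fraction_with_cjk(sample_chains))
    for chain in sample_chains:
        print(cjk_spans(chain))
```

Pulling out the individual runs with `cjk_spans` also makes it easy to paste them into a translator and check whether they are direct stand-ins for the English words around them.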
This is a nice overview, thanks!
Lee Sharkey's CLDR arguments
I don't think I've seen the CLDR acronym before. Are the arguments publicly written up somewhere?
Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.
My understanding of the position that scheming will be unlikely is the following:
I'm not sure about the other skeptics, but at least Alex Turner appears to believe that the kind of consequentialist cognition necessary for scheming is much more likely to arise if the models are aggressively trained on outcome-based rewards, so this seems to be the most important of the cruxes you listed. This crux is also one of the two points on which I disagree most strongly with the optimists:
However, I can see reasons why well-informed people would hold views different from mine on both of those counts (and I've written a long post trying to explore those reasons), so the position isn't completely alien to me.
[Link] Something weird is happening with LLMs and chess by dynomight
dynomight stacked up 13 LLMs against Stockfish on the lowest difficulty setting and found a huge difference between the performance of GPT-3.5 Turbo Instruct and any other model:
People already noticed last year that RLHF-tuned models are much worse at chess than base/instruct models, so this isn't a completely new result. The gap between models from the GPT family could also perhaps be (partially) closed through better prompting: Adam Karvonen has created a repo for evaluating LLMs' chess-playing abilities and found that many of GPT-4's losses against 3.5 Instruct were caused by GPT-4 proposing illegal moves. However, dynomight notes that there isn't nearly as big a gap between base and chat models from other model families:
This is a surprising result to me: I had assumed that base models are now generally decent at chess after seeing the news about 3.5 Instruct playing at an 1800 Elo level last year. dynomight proposes the following four explanations for the results:
1. Base models at sufficient scale can play chess, but instruction tuning destroys it.
2. GPT-3.5-instruct was trained on more chess games.
3. There’s something particular about different transformer architectures.
4. There’s “competition” between different types of data.
Thank you for the detailed feedback. I found this very helpful and not at all rude or mean!
I suspect there are a few key disagreements between us that make me more optimistic about this project setup than you are. I'd be curious whether you agree that these points are important cruxes:
I definitely agree that it would be interesting to compare the goal-directedness of base models and fine-tuned models, and this is something we're planning to eventually do if our compute budget permits. Similarly, I strongly agree that it would be worthwhile to study whether anything interesting is going on in the situations where the models exhibit goal-directed behavior, and I'm very interested in looking further into your suggestions for that!
Thanks, that definitely seems like a great way to gather these ideas together!
I guess the main reason my arguments are not addressing the argument at the top is that I interpreted Aaronson's and Garfinkel's arguments as "It's highly uncertain whether any of the technical work we can do today will be useful" rather than as "There is no technical work that we can do right now to increase the probability that AGI goes well." I think that it's possible to respond to the former with "Even if it is so and this work really does have a high chance of being useless, there are many good reasons to nevertheless do it," while assuming the latter inevitably leads to the conclusion that one should do something else instead of this knowably-useless work.
My aim with this post was to take an agnostic standpoint towards whether that former argument is true and to argue that even if it is, there are still good reasons to work on AI safety. I chose this framing because I believe that for people new to the field who don't yet know enough about it to make good guesses about how likely it is that AGI will be similar to ML systems of today or to human brains, it's useful to think about whether it's worth working on AI safety even if the chance that we'll build prosaic or brain-like AGI turns out to be low.
That being said, I could have definitely done a better job writing the post, for example by laying out the claim I'm arguing against more clearly at the start and by connecting argument 4 more directly to the argument that there's a significant chance we'll build a prosaic or brain-like AGI. It might also be that the quotes by Aaronson and Garfinkel convey the argument you thought I was arguing against rather than what I interpreted them to convey. Thank you for the feedback and for helping me realize the post might have these problems!
Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we're eventually able to align AIs with human values and that we can maintain control until we're reasonably sure that the AIs are aligned. However, worlds where relatively unintelligent humans remain in control indefinitely after those milestones have been reached may not be the best outcomes. I don't have time to write out my views on this in depth right now, but here's a relevant excerpt from the Dwarkesh podcast episode with Paul Christiano that I agree with: