Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo's case):
A "Slack conversation" that appears in the prompt includes this message:
And here's the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.
I really like the idea of creating a Good ABC book, and the author executed it well. Out of the recreational posts of 2023, this is the one I've liked the most. I can't articulate exactly why, but the visuals are great, it Creates A Thing, and it's an example of the kind of thing I'd like there to be more of in the world. It inspired me to create my own version. I give it a +9 for the 2023 Review.
I really liked this post. I also have friends who have young children, and was inspired to give them a book like this. But everyone involved speaks Finnish, so I ended up creating my own.
I just got my copies in the mail. The book looks really unimpressive in these low-quality phone-camera photos, but it's really satisfying in real life - like Katja, I paid attention to using high-quality images in the book itself. For the cover picture I chose Earthrise.
(I'm not sharing the full photos due to uncertainties with copyright, but if you want your copy, I can send the materials to you.)
More information about the creation process:
I didn't want to include any information such as the authors or a title in the book: I liked the idea that the children would have a Mysterious Book on their bookshelf. (The back cover has some information about the company that printed the book, though.)
Here's the word list:
A: Avaruus (space)
B: Bakteeri (bacteria)
C: Celsius
D: Desi (deciliter)
E: Etäisyys (distance)
F: Folio (foil)
G: Geeni (gene)
H: Hissi (elevator)
I: Ilma (air)
J: Jousi (spring)
K: Kello (clock)
L: Linssi (lens)
M: Määrä (quantity)
N: Nopea (fast)
O: Odottaa (to wait)
P: Pyörä (wheel)
R: Rokote (vaccine)
S: Solu (cell)
T: Tietokone (computer)
U: Uuni (oven)
V: Valo (light)
Y: Ympyrä (circle)
Ä: Ääni (sound)
Ö: Ötökkä (bug)
I recently gave a workshop on AI control, for which I created an exercise set.
The exercise set can be found here: https://drive.google.com/file/d/1hmwnQ4qQiC5j19yYJ2wbeEjcHO2g4z-G/view?usp=sharing
The PDF is self-contained, but here are three additional points:
You can send feedback by messaging me or anonymously here.
The post uses handicapped chess as a domain for studying how player capability and starting position affect win probabilities. From the conclusion:
In the view of Miles and others, the initially gargantuan resource imbalance between the AI and humanity doesn’t matter, because the AGI is so super-duper smart, it will be able to come up with the “perfect” plan to overcome any resource imbalance, like a GM playing against a little kid that doesn't understand the rules very well.
The problem with this argument is that you can use the exact same reasoning to imply that it’s “obvious” that Stockfish could reliably beat me with queen odds. But we know now that that’s not true.
Since this post came out, a chess bot designed to play with fewer pieces (LeelaQueenOdds) has been released. simplegeometry's comment introduces it well. With queen odds, LQO is way better than Stockfish, which has not been designed for handicapped play. Consequently, the main empirical result of the post is severely undermined. (I wonder how far even LQO is from truly optimal play against humans.)
(This is in addition to - as many commenters point out - the whole analogy being stretched at best, given the many critical ways in which chess differs from reality. The post offers little argument for the validity of the analogy.)
I don't think the post has stood the test of time, and vote against including it in the 2023 Review.
I found it interesting to play against LeelaQueenOdds. My experiences:
Overall, it's fascinating to play from a position that should be an easy win, yet get crushed by an opponent that Just Plays Better than I do.
[For context, I'm around 2100 on Lichess at short time controls (bullet/blitz). I also won against Stockfish 16 at rook odds on my first try - it's really not optimized for this sort of thing.]
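As a rough reference for how rating differences map to win probabilities - the quantity the post is ultimately about - here's a minimal sketch using the standard Elo expected-score formula. The bot rating in the example is a made-up placeholder, not a measured rating of LQO or Stockfish, and material odds aren't modeled at all.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win = 1, draw = 0.5, loss = 0) for A vs B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

if __name__ == "__main__":
    me = 2100                 # my approximate Lichess rating
    hypothetical_bot = 2800   # illustrative assumption, not a measured engine rating
    print(f"Expected score vs. the bot: {expected_score(me, hypothetical_bot):.3f}")
    # Material odds effectively shift this curve, but by how much depends heavily
    # on both players - which is exactly the question the post and LQO bear on.
```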
You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. It does not include models that are unaligned, trying to sabotage, and which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).
Thanks, that clears it up. Indeed, I agree that "take real trajectories, then edit the attacks in" is relatively promising; I didn't track this when writing my first comment (oops) - thanks for flagging it.
For background, see this post on trustworthiness.
Here are two notions of trustworthiness that are useful to distinguish:
These two notions can come apart substantially. A central example is a model that has never had any serious misaligned thoughts but, when presented with a clear opportunity to escape, takes it. This model would be trusted under the first notion, but not under the second.
Another example is the case where a weaker model is monitoring a stronger model. If, during deployment, the weak model obtains strong evidence that it really is in deployment (possibly because the strong model aims for this outcome), it might end up not reporting bad actions the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.
I find this distinction useful for two reasons:
First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.
Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research, we have run capability evaluations on the model, we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.
I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.
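To make the high-stakes case concrete, here is a minimal sketch of the kind of monitoring setup I have in mind. Everything here - the function names, the suspicion threshold, the escalation step - is a hypothetical placeholder, not a real API or a protocol from any particular paper.

```python
from typing import Callable

SUSPICION_THRESHOLD = 0.8  # illustrative value, not from any real deployment

def run_with_monitoring(
    task: str,
    untrusted_model: Callable[[str], str],
    trusted_monitor: Callable[[str, str], float],
) -> str:
    """The strong (untrusted) model acts; a weak (trusted) monitor reviews each action."""
    action = untrusted_model(task)             # strong model proposes an action
    suspicion = trusted_monitor(task, action)  # weak model scores it in [0, 1]
    if suspicion >= SUSPICION_THRESHOLD:
        # A single unflagged bad action can be catastrophic, which is why the
        # monitor's worst-case (high-stakes) trustworthiness matters here,
        # not just its average behavior.
        return escalate(task, action)
    return action

def escalate(task: str, action: str) -> str:
    # Placeholder: in practice, route to a human audit, resample from a trusted
    # model, or block the action outright.
    return "[action withheld for audit]"
```

The point is just that the monitor sits in a single-point-of-failure position: if it under-reports exactly when it matters, good average behavior doesn't save you.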
[1] We might be able to ensure the model can't reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is "the model won't know it's deployed" versus "the model can recognize when it's deployed, but it still won't take any bad actions because it's a good guy".