Olli Järviniemi

For background, see this post on trustworthiness.

Here are two notions of trustworthiness that are useful to distinguish:

  1. On most inputs/contexts, the model isn’t egregiously misaligned. (By egregious misalignment, I mean conspiring against you to cause very bad outcomes.)
  2. During the key moments where it really matters, the model doesn’t take very bad actions.

These two notions can come apart substantially. A central example is a model that has never had any serious misaligned thoughts but that, when presented with a clear opportunity to escape, takes it. Such a model would be trusted under the first notion, but not under the second.

Another example is the case where a weaker model is monitoring a stronger model. If during deployment the weak model obtains strong evidence that it really is in deployment (possibly because the strong model aims for this outcome), the weak model might end up not reporting bad actions the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.

I find this distinction useful for two reasons:

First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.

Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research; we have run capability evaluations on the model; we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.

I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.

  1. ^

    We might be able to ensure the model can’t reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is “the model won’t know it’s deployed” versus “the model can recognize when it’s deployed, but it still won’t take any bad actions because it’s a good guy”.

Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:

The system prompt is natural (but does give the model a goal, unlike in Apollo's case):

A "Slack conversation" that appears in the prompt includes this message:

And here's the evaluation task.

Opus acted less capable than it is, both with and without chain-of-thought.

I really like the idea of creating a Good ABC book, and the author executed it well. Out of the recreational posts of 2023, this is the one I've liked the most. I can't articulate why exactly, but the visuals are great, it Creates A Thing, and it's an example of the kind of thing I'd like there to be more of in the world. It inspired me to create my own version. I give it a +9 for the 2023 review.

I really liked this post. I also have friends who have young children, and was inspired to give them a book like this. But everyone involved speaks Finnish, so I ended up creating my own.

I just got my copies in the mail. The physical book looks really unimpressive in low-quality phone-camera photos, but it's really satisfying in real life - like Katja, I paid attention to using high-quality photos. For the cover picture I chose Earthrise.

(I'm not sharing the full photos due to uncertainties with copyright, but if you want your copy, I can send the materials to you.)

More information about the creation process:

  • I didn't know where to look for photos, but Claude had suggestions. I quickly ended up using just one service, namely Adobe Stock. Around 80% of my pictures are from there.
    • I don't know where Grace got her pictures from, but would like to know - I think her pictures were slightly better than mine on average.
    • I was surprised that, despite having done countless school presentations with images, no one ever told me that these vast banks of high-quality images exist. (Instead I always used Google's image search results, which are inferior.)
    • Finding high-quality photos depicting what I wanted was the bulk of the work.
  • I had a good vision of the types of words I wanted to include. Once I had a good starting point, I benefited a little (but only a little) from using LLMs to brainstorm ideas.
  • To turn it into a photo book, I needed to create images for the text pages. Claude was really helpful with the coding (and with debugging ä's and ö's). It also helped me compile a PDF I could share with my friends. (I only instructed and Claude programmed - Claude sped up this part by maybe 5x. A rough sketch of what this step can look like appears after this list.)
  • My friends gave minor suggestions and improvements, a few of which made their way to the final product.
  • Total process took around 15 person-hours.
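
To make the image-generation step concrete, here is a minimal sketch of the kind of script involved - not the actual code Claude wrote for me. The page size, the font path, and the three-word excerpt of the word list are illustrative assumptions.

```python
# Minimal sketch (not the actual script) of generating text pages and
# compiling them into a PDF with Pillow. Page size, font path, and the
# word-list excerpt are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

PAGE_SIZE = (2480, 3508)            # A4 at 300 DPI (assumed)
FONT_PATH = "DejaVuSans-Bold.ttf"   # any font with ä/ö glyphs
WORDS = {"A": "Avaruus", "Ä": "Ääni", "Ö": "Ötökkä"}  # excerpt only

def make_text_page(letter: str, word: str) -> Image.Image:
    """Render one text page: a large letter with the word underneath."""
    page = Image.new("RGB", PAGE_SIZE, "white")
    draw = ImageDraw.Draw(page)
    big = ImageFont.truetype(FONT_PATH, 600)
    small = ImageFont.truetype(FONT_PATH, 200)
    w, h = PAGE_SIZE
    # Working in Unicode throughout (and keeping files in UTF-8) is what
    # keeps ä and ö from getting mangled.
    draw.text((w / 2, h * 0.40), letter, font=big, fill="black", anchor="mm")
    draw.text((w / 2, h * 0.70), word, font=small, fill="black", anchor="mm")
    return page

pages = [make_text_page(letter, word) for letter, word in WORDS.items()]
# Pillow can write a multi-page PDF directly.
pages[0].save("abc-kirja.pdf", save_all=True, append_images=pages[1:])
```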

I didn't want to have any information like authors or a book title in the book: I liked the idea that the children will have a Mysterious Book in their bookshelf. (The back cover has some information about the company that made the book, though.)

Here's the word list:

A: Avaruus (space)

B: Bakteeri (bacteria)

C: Celsius

D: Desi (deci, as in deciliter)

E: Etäisyys (distance)

F: Folio (foil)

G: Geeni (gene)

H: Hissi (elevator)

I: Ilma (air)

J: Jousi (spring)

K: Kello (clock)

L: Linssi (lens)

M: Määrä (quantity)

N: Nopea (fast)

O: Odottaa (to wait)

P: Pyörä (wheel)

R: Rokote (vaccine)

S: Solu (cell)

T: Tietokone (computer)

U: Uuni (oven)

V: Valo (light)

Y: Ympyrä (circle)

Ä: Ääni (sound)

Ö: Ötökkä (bug)

I recently gave a workshop on AI control, for which I created an exercise set.

The exercise set can be found here: https://drive.google.com/file/d/1hmwnQ4qQiC5j19yYJ2wbeEjcHO2g4z-G/view?usp=sharing

The PDF is self-contained, but here are three additional points:

  • I assumed no familiarity with AI control on the audience's part. Accordingly, the exercises target newcomers and cover the basics.
    • If you want to get into AI control, and like exercises, consider doing these.
    • Conversely, if you are already pretty familiar with control, I don't expect you'll get much out of these exercises. (A good fraction of the problems are about re-deriving things that already appear in AI control papers and the like, so if you already know those, the exercises are pretty pointless.)
  • I felt like some of the exercises weren't that good, and I'm not satisfied with my answers to some of them - I spent limited time on this. I thought it was worth sharing the set anyway.
    • (I compensated by highlighting problems that were relatively good, and by flagging the answers I thought were weak; the rest is on the reader.)
  • I largely focused on monitoring schemes (a toy illustration of the kind of setup these involve is sketched after this list), but don't interpret this as meaning there's nothing else to AI control.
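
As a taste of the flavor of the monitoring-focused material, here is a toy sketch - not taken from the exercise set - of a basic trusted-monitoring calculation: a trusted monitor scores the untrusted model's actions for suspiciousness, an audit threshold is set from a limited audit budget, and we ask how likely a sabotage attempt is to be caught. The score distributions and the 2% budget are made-up assumptions.

```python
# Toy trusted-monitoring calculation (not from the exercise set).
# A trusted monitor assigns suspicion scores to the untrusted model's actions;
# with a limited audit budget, we audit only the most suspicious ones and ask
# how likely a single sabotage attempt is to be caught. The score
# distributions and the 2% budget are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)

benign_scores = rng.normal(loc=0.0, scale=1.0, size=100_000)  # ordinary actions
attack_scores = rng.normal(loc=2.5, scale=1.0, size=100_000)  # sabotage attempts

audit_budget = 0.02  # fraction of actions we can afford to audit

# Set the audit threshold so that only ~2% of benign actions exceed it.
threshold = np.quantile(benign_scores, 1 - audit_budget)

# Probability that a given sabotage attempt scores above the threshold,
# i.e. gets audited and caught.
p_catch = (attack_scores > threshold).mean()
print(f"audit threshold: {threshold:.2f}, P(catch an attack): {p_catch:.1%}")
```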

You can send feedback by messaging me or anonymously here.

The post uses handicapped chess as a domain for studying how player capability and starting position affect win probabilities. From the conclusion:

In the view of Miles and others, the initially gargantuan resource imbalance between the AI and humanity doesn’t matter, because the AGI is so super-duper smart, it will be able to come up with the “perfect” plan to overcome any resource imbalance, like a GM playing against a little kid that doesn't understand the rules very well. 

The problem with this argument is that you can use the exact same reasoning to imply that it’s “obvious” that Stockfish could reliably beat me with queen odds. But we know now that that’s not true.

Since this post came out, a chess bot designed to play with fewer pieces, LeelaQueenOdds, has been released. simplegeometry's comment introduces it well. With queen odds, LQO is far stronger than Stockfish, which has not been designed for this. Consequently, the post's main empirical result is severely undermined. (I wonder how far even LQO is from truly optimal play against humans.)

(This is in addition to the point, raised by many commenters, that the whole analogy is stretched at best, given the many critical ways in which chess differs from reality. The post offers little argument for the validity of the analogy.)

I don't think the post has stood the test of time, and vote against including it in the 2023 Review.

I found it interesting to play against LeelaQueenOdds. My experiences:

  • I got absolutely crushed on 1+1 time controls (took me 50+ games to win one), but I'm competitive at 3+2 if I play seriously.
  • The model is really good at exploiting human blind spots and playing aggressively. I could feel it striking at my weak spots without being able to do much about it. (I now acknowledge the existence of adversarial attacks on humans at a gut level.)
  • I found it really addictive to play against it: You know the trick that casinos use, where they make you feel like you "almost" won? This was that: I constantly felt like I could have won, if it wasn't just for that one silly mistake - despite having lost the previous ten games to such "random mistakes", too... I now better understand what it's like to be a gambling addict.

Overall, it is fascinating to play from a position that should be an easy win and still get crushed by an opponent that Just Plays Better than I do.

[For context, I'm around 2100 on Lichess at short time controls (bullet/blitz). I also won against Stockfish 16 at rook odds on my first try - it's really not optimized for this sort of thing.]

You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. It does not include models that are unaligned and trying to sabotage but which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).

Thanks, that clears it up. Indeed I agree that "take real trajectories, then edit the attacks in" is relatively promising, and didn't track this when writing my first comment (oops) - thanks for flagging it.
