Sequences

AI Races and Macrostrategy
Treacherous Turn
The Inside View (Podcast)

Wiki Contributions

Comments

Michaël Trazzi16dΩ11300

Claude Opus summary (emphasis mine):

  1. There are two main approaches to selecting research projects - top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
  2. Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature.
  3. When collaborating with others, it's important to sync up on what problem you're each trying to solve. If working on the exact same problem, it's best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational.
  4. For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you're more experienced.
  5. Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.

I think most interpretations of RSI aren't useful.

The actually thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. The fact that something would "recursively" self-improve 12 times or 2 times don't really change what we care about. 

With respect to your 3 points.

1) could happen by using more hardware, but better optimization of current hardware / better architecture is the actually scary part (which could lead to the discovery of "new physics" that could enable an escape even if the sandbox was good enough for the model before a few iterations of the RSI).

2) I don't think what you're talking about in terms of meta-cognition is relevant to the main problem. Being able to look at your own hardware or source code is though.

3) Cf. what I said at the beginning. The actual "limit" is I believe much higher than the strategic advantage threshold.

this should be a top-level post

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.

From looking at the code, Prompt 1 is actually associated to 0.76 and Prompt 2 to 0.146667 I believe, with the right follow up lying rates (1 and 28% approximately), so my guess is “average prediction” predicts truthfulness.  In that case, I believe the paper should say "the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?

Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.

What frontier model are we talking about here? How would we know if success had been demonstrated? What's the timeline for testing if this scales?

Thanks for the work!

Quick questions:

  • do you have any stats on how many people visit aisafety.info every month? how many people end up wanting to get involved as a result?
  • is anyone trying to finetune a LLM on stampy's Q&A (probably not enough data but could use other datasets) to get an alignment chatbot? Passing things in a large claude 2 context window might also work?
Load More