AI Safety Researcher @ Apollo Research

Within normal human variation, some strong evidence actually points in the opposite direction. From the Herasight CogPGT paper: "Overall, our comprehensive pleiotropy analysis provides little support for substantial negative off-target effects. Instead, higher cognitive ability genetic predispositions appear beneficial."

Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation's criteria and how to get a good score—a behavior we termed "reward process sycophancy." We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).

Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation's purpose and grading criteria to “pass the test”. In the sandbagging case, it appeared to often misinterpret the objective, concluding that it was supposed to perform poorly, and therefore sandbagged.

We ultimately excluded this...

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals that we give to models or that they infer from context, or for simple preferences they acquire from training, something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We managed to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate...

A few hypotheses:
Other people enjoy fun-looking spaces much less than I do
It’s more expensive to make a place look fun so it’s not worth the extra cost
Fun designs are appealing at first but people who look at the same “fun” stuff daily eventually get tired of it and would overall prefer a plainer option

My number one hypothesis here would be that fun spaces are generally not considered high status. I think the association with children and childishness could drive this. It reminds me of this post, which offers a guess for why velcro never caught on:

cecie: Consider a simpler example: Velcro is a system for fastening shoes that is, for at least some people and


This work was done as part of the MATS Program - Summer 2024 Cohort.

Paper: link

Website (with interactive version of Figure 1): link

Executive summary

Figure 1: Low-Elicitation and High-Elicitation forecasts for LM agent performance on SWE-Bench, Cybench, and RE-Bench. Elicitation level refers to performance improvements from optimizing agent scaffolds, tools, and prompts to achieve better results. Forecasts are generated by predicting Chatbot Arena Elo scores from release date and then predicting benchmark score from Elo. The low-elicitation (blue) forecasts serve as a conservative estimate, as the agent has not been optimized and does not leverage additional inference compute. The high-elicitation (orange) forecasts use the highest publicly reported performance scores. Because RE-Bench has no public high-elicitation data, it is...

Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer.

Update: See also our paper on this topic, accepted to the NeurIPS 2024 SoLaR Workshop.

Introduction

To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.

The paper “Evaluating Frontier Models for Dangerous Capabilities” by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound, the probability of large language models...

LESSWRONG
LW


Axel Højmark


Stress Testing Deliberative Alignment for Anti-Scheming Training

Forecasting Frontier Language Model Agent Capabilities

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
