LESSWRONG
LW

Marius Hobbhahn
5256Ω1596541280
Message
Dialogue
Subscribe

I'm the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/ 
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max-Planck research school in Tübingen, worked part-time with Epoch and did independent AI safety research. 

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker's Rules

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Do models know when they are being evaluated?
Marius Hobbhahn6d20

Nice! Would be interested to see what it says about your internal evaluations

Reply
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Marius Hobbhahn1mo20

In one of my MATS projects we found that some models have a bias to think they're always being evaluated, including in real-world scenarios. The paper isn't public yet. But it seems like a pretty brittle belief that the models don't hold super strongly. I think this can be part of a strategy, but should never be load-bearing. 

Reply
We should start looking for scheming "in the wild"
Marius Hobbhahn4moΩ442

Good point!

Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).

Reply
Catastrophe through Chaos
Marius Hobbhahn5mo20

There are two sections that I think make this explicit:

1. No failure mode is sufficient to justify bigger actions. 
2. Some scheming is totally normal. 

My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause. 

Reply
Catastrophe through Chaos
Marius Hobbhahn5mo83

what made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.

Reply
What’s the short timeline plan?
Marius Hobbhahn6mo53

If I had more time, I would have written a shorter post ;) 

Reply2
What’s the short timeline plan?
Marius Hobbhahn6moΩ122211

That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take" which is significantly lower than 100% of catastrophe. 

Reply1
What’s the short timeline plan?
Marius Hobbhahn6moΩ6150

I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning). 

These are all pretty basic thoughts and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" part. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes. 

Reply
What’s the short timeline plan?
Marius Hobbhahn6mo20

Go for it. I have some names in mind for potential experts. DM if you're interested. 

Reply
What’s the short timeline plan?
Marius Hobbhahn6moΩ92111

Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/

In general, I wouldn't make it too complicated and accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" mechanism if an overeager expert wants to read plans that weren't assigned to them. Everything in the top N% then gets scored by all experts with a more detailed review. Then, there is a final ranking.

I'd hope that the time spent per expert is only 5-10 hours in total. I'd be fine with missing a bunch of posts that contain good ideas that are badly communicated or otherwise easy to miss on the shallow review. 

My main goal with the contest would be that writing a good plan and communicating it clearly is incentivized.

Reply2
Load More
73Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals
Ω
8d
Ω
0
63Why "training against scheming" is hard
Ω
17d
Ω
2
113We should try to automate AI safety work asap
Ω
3mo
Ω
10
74100+ concrete projects and open problems in evals
Ω
4mo
Ω
1
184Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Ω
4mo
Ω
9
91We should start looking for scheming "in the wild"
Ω
4mo
Ω
4
49For scheming, we should first focus on detection and then on prevention
Ω
4mo
Ω
7
35Forecasting Frontier Language Model Agent Capabilities
5mo
0
59Do models know when they are being evaluated?
5mo
8
102Detecting Strategic Deception Using Linear Probes
Ω
5mo
Ω
9
Load More