Peter Favaloro

Comments
Prospects for studying actual schemers
Peter Favaloro · 16d

As a concrete example, this left me wanting to read more about how you think your alignment-faking paper falls short of the desiderata described here. You link it as an example of a scheming analog:

Research studying current AIs that are intended to be analogous to schemers won't yield compelling empirical evidence or allow us to productively study approaches for eliminating scheming, as these AIs aren't very analogous in practice.

IIUC, that research used an actual production model but contrived inputs to elicit alignment-faking behavior. Why is that worse / less "real" than e.g. using rigged training processes?

You also say:

The main alternatives to studying actual schemers are to study AIs that are trained to behave like a schemer might behave, or to study abstract analogies of scheming.

It's not clear to me which of these categories the alignment-faking paper falls into, and I'd find it useful to hear how you think it displays the downsides you suggest for these categories (e.g. being insufficiently analogous to actually dangerous scheming).

AI Safety Field Growth Analysis 2025
Peter Favaloro · 1mo

I'm surprised that Anthropic and GDM have such small numbers of technical safety researchers in your dataset. What are the criteria for inclusion / how did you land on those numbers?

AI Safety Field Growth Analysis 2025
Peter Favaloro · 1mo*

What changed that made the historical datapoints so much higher? E.g. you now think there were >100 technical AIS researchers in 2016, whereas the 2022 analysis put that number at <50.

I notice that the historical data revisions are consistently upward. This looks consistent with a model like: in each survey year, you notice some "new" people who should be in your dataset, but you also notice that they've been working on TAIS-related stuff for many years by that point. If we take that model literally and calibrate it to past revisions, it suggests that you're probably undercounting right now by something like 50-100%. Does that sound plausible to you?
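Here's a minimal sketch of the calibration I have in mind, using toy numbers taken from the 2016 revision above rather than your actual dataset:

```python
# Toy "lagged discovery" model: each extra year of hindsight reveals more people
# who were already doing TAIS work, so the most recent year is systematically
# undercounted. The inputs below are my assumptions, not figures from the post.

def implied_undercount(count_then: float, count_now: float,
                       years_between_surveys: int,
                       extrapolation_years: int) -> float:
    """How much the current-year count might eventually be revised upward,
    if past revision rates continue to apply."""
    # Per-year growth in a fixed historical datapoint as hindsight accumulates.
    annual_discovery = (count_now / count_then) ** (1 / years_between_surveys)
    return annual_discovery ** extrapolation_years

# The 2016 datapoint went from <50 (2022 analysis) to >100 (2025 analysis), ~3 years apart.
for horizon in (2, 3):
    factor = implied_undercount(50, 100, years_between_surveys=3,
                                extrapolation_years=horizon)
    print(f"{horizon} more years of hindsight -> ~{factor:.1f}x upward revision")
# Prints roughly 1.6x and 2.0x, i.e. the ~50-100% undercount mentioned above.
```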

Thanks for assembling this dataset!

Posts

Research directions Open Phil wants to fund in technical AI safety (9mo) · 117 karma · Ω 21
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas (9mo) · 111 karma · Ω 0