Peter Favaloro — LessWrong

LESSWRONG
LW

One potential argument in favor of expecting large-particle transmission is, "Colds make people cough and sneeze. Isn't it likely that the most common infection that causes coughs and sneezes would also be one that spreads easily via coughs and sneezes?"

But I'm not sure how much evidence that really provides. I'm curious what others think.

Replying toProspects for studying actual schemers

Peter Favaloro4mo

Prospects for studying actual schemers

This left me wanting to read more about how you think your alignment-faking paper falls short of the desiderata described here, as a concrete example. You link it as an example of a scheming analog:

Research studying current AIs that are intended to be analogous to schemers won't yield compelling empirical evidence or allow us to productively study approaches for eliminating scheming, as these AIs aren't very analogous in practice.

IIUC, that research used an actual production model but used contrived inputs to elicit alignment-faking behavior. Why is that worse / less "real" than e.g. using rigged training processes?

You also say:

The main alternatives to studying actual schemers are to study AIs that are trained

Peter Favaloro5mo

AI Safety Field Growth Analysis 2025

I'm surprised that Anthropic and GDM have such small numbers of technical safety researchers in your dataset. What are the criteria for inclusion / how did you land on those numbers?

Replying toAI Safety Field Growth Analysis 2025

Peter Favaloro5mo*

AI Safety Field Growth Analysis 2025

What changed that made the historical datapoints so much higher? E.g. you now think that there were >100 technical AIS researchers in 2016, whereas in 2022 you thought that there had been <50 technical AIS researchers in 2016.

I notice that the historical data revisions are consistently upward. This looks consistent with a model like: in each year x, you notice some "new" people that should be in your dataset, but you also notice that they've been working on TAIS-related stuff for many years by that point. If we take that model literally, and calibrate it to past revisions, it suggests that you're probably undercounting right now by a factor of 50-100%. Does that sound plausible to you?

Thanks for assembling this dataset!

Research directions Open Phil wants to fund in technical AI safety

jake_mendel

jake_mendel, maxnadeau, Peter Favaloro

The Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within... (read 17359 more words →)

117

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

jake_mendel

jake_mendel, maxnadeau, Peter Favaloro

Open Philanthropy is launching a big new Request for Proposals for technical AI safety research, with plans to fund roughly $40M in grants over the next 5 months, and available funding for substantially more depending on application quality.

Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025.

Overview

We're seeking proposals across 21 different research areas, organized into five broad categories:

Adversarial Machine Learning
- *Jailbreaks and unintentional misalignment
- *Control evaluations
- *Backdoors and other alignment stress tests
- *Alternatives to adversarial training
- Robust unlearning
Exploring sophisticated misbehavior of LLMs
- *Experiments on alignment faking
- *Encoded reasoning in CoT and inter-model communication
- Black-box LLM psychology
- Evaluating whether models can hide dangerous behaviors
- Reward hacking of human oversight
Model transparency
- Applications of white-box techniques
- Activation monitoring
- Finding feature

... (read 177 more words →)

111