I'll also note that if you want to show up anywhere in the world and get good takes from people on the "how aliens might build AGI" question, Constellation might currently be the best bet (especially if you're interested in decision-relevant questions about this).


Express interest in an "FHI of the West"

Buck7d91

(I work out of Constellation and am closely connected to the org in a bunch of ways)

I think you're right that most people at Constellation aren't going to seriously and carefully engage with the aliens-building-AGI question, but I think describing it as a difference in culture is missing the biggest factor leading to the difference: most of the people who work at Constellation are employed to do something other than the classic FHI activity of "self-directed research on any topic", so obviously aren't as inclined to engage deeply with it.

I think there also is a cultural difference, but my guess is that it's smaller than the effect from difference in typical jobs.


Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Buck8dΩ9114

I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.

A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.

You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine the more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not only to distinguish good from bad models (by comparing their metric values) but also can hopefully be used to train the models.

I think that your definition of scalable oversight here is broader than the one people normally use. In particular, I usually think of scalable oversight as the problem of making it so that we’re better able to make a procedure that tells us how good a model’s actions are on a particular trajectory; I think of it as excluding the problem of determining whether a model’s behaviors would be bad on some other trajectory that we aren’t considering. (This is how I use the term here, how Ansh uses it here, and how I interpret the usage in Concrete Problems and in Measuring Progress on Scalable Oversight for Large Language Models.)

I think that it’s good to have a word for the problem of assessing model actions on particular trajectories, and I think it’s probably good to distinguish between problems associated with that assessment and other problems; scalable oversight is the current standard choice for that.

Using your usage, I think scalable oversight suffices to solve the whole safety problem. Your usage also doesn’t play nicely with the low-stakes/high-stakes decomposition.

I’d prefer that you phrased this all by saying:

It might be the case that we aren’t able to behaviorally determine whether our model is bad or not. This could be because of a failure of scalable oversight (that is, it’s currently doing actions that we can’t tell are good), or because of concerns about failures that we can’t solve by training (that is, we know that it isn’t taking bad actions now, but we’re worried that it might do so in the future, either because of distribution shift or rare failure). Let’s just talk about the special case where we want to distinguish between two models and we don’t have examples where the two models behaviorally differ. We think that it is good to research strategies that allow us to distinguish models in this case.


Staged release

Buck10d64

without access to fine-tuning or powerful scaffolding.

Note that normally it's the end user who decides whether they're going to do scaffolding, not the lab. It's probably feasible but somewhat challenging to prevent end users from doing powerful scaffolding (and I'm not even sure how you'd define that).


Paul Christiano named as US AI Safety Institute Head of AI Safety

Buck10d118

I normally use "alignment research" to mean "research into making models be aligned, e.g. not performing much worse than they're 'able to' and not purposefully trying to kill you". By this definition, ARC is alignment research, METR and Redwood aren't.

An important division between Redwood and METR is that we focus a lot more on developing/evaluating countermeasures.


Paul Christiano named as US AI Safety Institute Head of AI Safety

Buck10d3121

Yeah I object to using the term "alignment research" to refer to research that investigates whether models can do particular things.

But all the terminology options here are somewhat fucked imo, I probably should have been more chill about you using the language you did, sorry.


Paul Christiano named as US AI Safety Institute Head of AI Safety

Buck10d101

I’m not saying you don’t need to do cutting-edge research, I’m just saying that it’s not what people usually call alignment research.


Paul Christiano named as US AI Safety Institute Head of AI Safety

Buck11d3527

Why are you talking about alignment research? I don't see any evidence that he's planning to do any alignment research in this role, so it seems misleading to talk about NIST being a bad place to do it.


nikola's Shortform

Buck11d73

Ugh I can't believe I forgot about Rivest time locks, which are a better solution here.
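
For readers who haven't seen the idea, here is a minimal sketch of a time-lock puzzle in the style of Rivest, Shamir and Wagner. It is an illustration under stated assumptions, not a description of any deployed scheme: the function names are hypothetical, the primes are toy values chosen only because they are well-known Mersenne primes, and in a real use the locked secret would be a symmetric key that encrypts the weights rather than a small integer. The creator knows the factorization of n and can compute the mask quickly; anyone else has to do t squarings one after another.

```python
def make_puzzle(secret: int, p: int, q: int, t: int):
    """Lock `secret` behind t sequential squarings modulo n = p * q."""
    n = p * q
    assert 0 <= secret < n, "secret must fit in the modulus"
    phi = (p - 1) * (q - 1)      # only the creator knows phi(n)
    a = 2                        # public base; coprime to n because p and q are odd primes
    e = pow(2, t, phi)           # creator's shortcut: reduce the exponent 2^t modulo phi(n)
    mask = pow(a, e, n)          # equals a^(2^t) mod n
    c = (secret + mask) % n      # hide the secret under the mask
    return n, a, t, c            # public puzzle; p, q and phi are discarded


def solve_puzzle(n: int, a: int, t: int, c: int) -> int:
    """Recover the secret without phi(n): t squarings, each depending on the last."""
    b = a % n
    for _ in range(t):
        b = (b * b) % n
    return (c - b) % n


# Toy parameters (assumption): far too small and too structured for real security.
P, Q = 2**31 - 1, 2**61 - 1
puzzle = make_puzzle(secret=123456789, p=P, q=Q, t=1000)
recovered = solve_puzzle(*puzzle)
```

The delay comes from the squarings being inherently sequential: more hardware in parallel does not help much, so t can be set from an estimate of how fast a single squaring can be done.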


nikola's Shortform

Buck11d53

I feel pretty into encrypting the weights and throwing the encryption key into the ocean or something, where you think it's very likely you'll find it in the limit of technological progress.
