LESSWRONG

habryka

Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com. 

(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)

Habryka's Shortform Feed · 56 karma · 7y · 439 comments

Comments
Buck's Shortform
habryka · 5h · 256

Copy-pasting what I wrote in a Slack thread about this: 


My current take, having thought a lot about a few things in this domain, but not necessarily this specific question, is that the only dimensions where the empirical evidence feels like it was useful, besides a broad "yes, of course the problems are real, and AGI is possible, and it won't take hundreds of years" confirmation, are the dynamics around how much you can steer and control near-human AI systems to perform human-like labor.

I think almost all the evidence for that comes from just the scaling up, and basically none of it comes from safety work (unless you count RLHF as safety work, though of course the evidence there is largely downstream of the commercialization and scaling of that technology).

I can't think of any empirical evidence that updated me much on what superintelligent systems would do, even if they are the results of just directly scaling current systems, which is the key thing that matters.

A small domain that updated me a tiny bit, though mostly in the direction of what I already believed, is the material advantage research with stuff like LeelaOdds, which demonstrated more cleanly that you can overcome large material disadvantages with greater intelligence in at least one toy scenario. The update here was really small though. I did make a bet with one person and won that one, so presumably it was a bigger update for others.

I think a bunch of other updates for me are downstream of "AIs will have situational awareness substantially before they are even human-level competent", which changes a lot of risk and control stories. I do think the situational awareness studies were mildly helpful for that, though most of it was IMO already pretty clear by the release of GPT-4, and the studies are just helpful for communicating that to people with less context or who use AI systems less.

Buck: What do you think we've learned about how much you can steer and control the AIs to perform human-like labor?

Me: It depends on what timescale. One thing that I think I updated reasonably strongly on is that we are probably not going to get systems with narrow capability profiles. The training regime we have seems to really benefit from throwing a wide range of data at it, and the capital investments to explicitly train a narrow system are too high. I remember Richard a few years ago talking about building AI systems that are exceptionally good at science and alignment, but bad at almost everything else. This seems a bunch less likely now.

And then there is just a huge amount of detail on what things I expect AI to be good and bad at, at different capability levels, based on extrapolating current progress. Some quick updates here:
 

  • Models will be better at coding than almost any other task
  • Models will have extremely wide-ranging knowledge in basically all fields that have plenty of writing about them
  • It's pretty likely natural language will just be the central interface for working with human-level AI systems (I would have had at least some probability mass on more well-defined objectives, though I think in retrospect that was kind of dumb)
  • We will have multi-modal human-level AIs, but it's reasonably likely we will have superhuman AIs in computer use and writing substantially before we have AIs that orient to the world at human reaction speeds (like, combining real-time video, control and language is happening, but happening kind of slowly)
  • We have different model providers, but basically all the AI systems behave the same, with their failure modes and goals and misalignment all being roughly the same. This has reasonably big implications for hoping that you can get decorrelated supervision by using AIs from different providers.
  • Chains of thought will stop being monitorable soon, but they stayed monitorable for what was IMO a mildly surprisingly long time. This suggests there is maybe more traction on keeping chains of thought monitorable than I would have said a few months ago.
  • The models will just lie to you all the time, and everyone is used to this; you cannot use "the model is lying to me or clearly trying to deceive me" as any kind of fire alarm.
  • Factored cognition seems pretty reliably non-competitive with just increasing context lengths and doing RL (this is something I have believed for a long time, but every additional year in which the state of the art still doesn't involve factored cognition is more evidence in this direction IMO, though I expect others to find this point kind of contentious).
  • Elicitation in general is very hard, at least from a consumer perspective. There are tons of capabilities that the models demonstrate in one context that are very hard to elicit in other contexts without doing your own big training run. At least in my experience LoRAs don't really work. Maybe this will get better. (One example that informs my experience here: restoring base-model imitation behavior. Fine-tuning seems to not work great for this; you still end up with huge mode collapse and falling back to the standard RLHF-corpo-speak. Maybe this is just a finetuning skill issue.)

There are probably more things.

LW Reacts pack for Discord/Slack/etc
habryka · 3d · 20

@Ustice Definitely interested in feedback! We are pretty much continuously iterating on these.

Vale's Shortform
habryka · 3d · 31

I... don't think there are approximately any bugs of this type that would have much of anything to do with Vercel. And furthermore, I am very confident that if you check out the GitHub repository from 1-2 years ago, you will find many more, not fewer, bugs and glitches and inconsistencies.

Happy to take bets on this with almost any operationalization, if evaluated by a third party. My guess is you have some preconceptions here that are causing you to do some confused causal analysis.

Jacob Pfau's Shortform
habryka · 3d · 30

Does anyone know why the chat transcript has ChatGPT Pro's thinking summarized in Korean, when the question was asked in English and the response was in English?

This is not happening for his other chats.

Any corrigibility naysayers outside of MIRI?
habryka · 7d · 145

I don't really believe in corrigibility as a thing that could hold up to much of any optimization pressure. It's not impossible to make a corrigible ASI, but my guess is that to build a corrigible ASI you first need an aligned ASI to build it for you, and so as a target it's pretty useless.

My guess is that puts me in enough disagreement to qualify for your question?

davekasten's Shortform
habryka · 9d · 60

> By the "whitepaper," are you referring to the RSP v2.2 that Zach linked to, or something else?

No, I am referring to this, which Zach linked in his shortform: https://assets.anthropic.com/m/c52125297b85a42/original/Confidential_Inference_Paper.pdf 

> Also, just to cut a little more to brass tacks here, can you describe the specific threat model that you think they are insufficiently responding to? By that, I don't mean just the threat actor (insiders within their compute provider) and their objective to get weights, but rather the specific class or classes of attacks that you expect to occur, and why you believe that existing technical security + compensating controls are insufficient given Anthropic's existing standards.

I don't have a complaint that Anthropic is taking insufficient defenses here. I have a complaint that Anthropic said they would do something in their RSP that they are not doing. 

The threat is really not very complicated; it's basically: 

  • A high-level Amazon executive decides they would like to have a copy of Claude's weights
  • They physically go to an inference server and manipulate the hardware to dump the weights unencoded to storage (or read out the weights from a memory bus directly, or use one of the many attacks available to you if you have unlimited hardware access)

> data center that already is designed to mitigate insider risk

Nobody in computer security I have ever talked to thinks that data in memory in normal cloud data centers would be secure against physical access from high-level executives who run the datacenter. One could build such a datacenter, but the normal ones are not that! 

The top-secret cloud you linked to might be one of those, but my understanding is that Claude weights are deployed to just normal Amazon datacenters, not only the top-secret government ones.

davekasten's Shortform
habryka · 10d* · 132

Anthropic has released their own whitepaper where they call out what kinds of changes would be required. Can you please engage with my arguments?

I have now also heard from 1-2 Anthropic employees about this. The specific people weren't super up-to-date on what Anthropic is doing here, and didn't want to say anything committal, but nobody I talked to had a reaction that suggested that they thought it was likely that Anthropic is robust to high-level insider threats at compute providers.

Like, if you want, you can take a bet with me here: I am happy to offer you 2:1 odds on the opinion of some independent expert we both trust on whether Anthropic is likely robust to that kind of insider threat. I can also start going into all the specific technical reasons, but that would require restating half of the state of the art of computer security, which would be a lot of work. I really think that not being robust here is a relatively straightforward inference that most people in the field will agree with (unless Anthropic has changed operating procedure substantially from what is available to other consumers in their deals with cloud providers, which I currently think is unlikely, but not impossible).

The IABIED statement is not literally true
habryka · 11d · 1531

> I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.

The statement "if anyone builds it, everyone dies" does not mean "there is no way for someone to build it by which not everyone dies". 

If you say "if any of the major nuclear powers launches most of their nukes, more than one billion people are going to die", it would be very dumb and pedantic to respond with "well, actually, if they all just fired their nukes into the ocean, approximately no one is going to die".

I have trouble seeing this post as doing anything other than that. Maybe I am missing something?

faul_sname's Shortform
habryka · 11d · 84

FWIW, I have seen a decent amount of flip-flopping on this question. My current guess is that most of the time when people say this, they don't mean either of those things, but have some other reason for the belief, and they choose the justification that they think will be most compelling to their interlocutor. (Like, I've had a bunch of instances of the same person telling me at one point that they were centrally concerned about China because it increased P(AI takeover), and then at a different time, in a different social context, that they were centrally concerned about Chinese values being less good by their lights if optimized.)

Mikhail Samin's Shortform
habryka · 12d · 2513

My sense is Horizon is intentionally a mixture of people who care about x-risk and people who broadly care about "tech policy going well". IMO both are laudable goals. 

My guess is Horizon Institute has other issues that make me not super excited about it, but I think this one is a reasonable call.

Sequences

  • A Moderate Update to your Artificial Priors
  • A Moderate Update to your Organic Priors
  • Concepts in formal epistemology
Posts

  • Banning Said Achmiz (and broader thoughts on moderation) · 246 karma · 2mo · 399 comments
  • Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · 97 karma · 4mo · 43 comments
  • Open Thread - Summer 2025 · 23 karma · 4mo · 69 comments
  • ASI existential risk: Reconsidering Alignment as a Goal · 93 karma · 6mo · 14 comments
  • LessWrong has been acquired by EA · 361 karma · 7mo · 52 comments
  • 2025 Prediction Thread · 78 karma · 10mo · 21 comments
  • Open Thread Winter 2024/2025 · 23 karma · 10mo · 59 comments
  • The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228) · 46 karma · 10mo · 4 comments
  • Announcing the Q1 2025 Long-Term Future Fund grant round · 36 karma · 10mo · 2 comments
  • Sorry for the downtime, looks like we got DDosd · 112 karma · 11mo · 13 comments
Wikitag Contributions

  • CS 2881r · 2 months ago · (+204)
  • Roko's Basilisk · 4 months ago
  • Roko's Basilisk · 4 months ago
  • AI Psychology · 10 months ago · (+58/-28)