Neel Nanda
Comments
Mikhail Samin's Shortform
Neel Nanda · 4mo

I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules even when there seems to be a really good reason why breaking them would be locally beneficial, because on average, a decision theory that's willing to do harmful things for complex reasons performs badly.

The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.

Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of that bad outcome (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.

When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very far outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high-stakes situation where far and away the thing I care about most is taking the actions that will actually, in practice, lead to a lower probability of extinction.

Regarding your examples, I'm completely ethically comfortable with someone founding a third political party in a country where the population contains two groups that each strongly want to commit genocide against the other. I think there are many ways that such a third party could reduce the probability of genocide, even if its political base ultimately wants bad outcomes.

Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical, as it risks nuclear winter, whether by provoking someone else or through a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. My interpretation of your proposed ethical rules is that they imply countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?

COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.

You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.

Again, I'm purely defending the abstract point that plans which could increase the probability of human extinction, even via building the doomsday machine yourself, are not automatically ethically forbidden. You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.

Beth Barnes's Shortform
Neel Nanda · 4d

That seems reasonable, thanks a lot for all the detail and context!

Beth Barnes's Shortform
Neel Nanda · 6d

Can you say anything about what METR's annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded.

Buck's Shortform
Neel Nanda · 10d

A bunch of my recent blog posts were written with a somewhat similar process; it works surprisingly well! I've also had great results from putting a ton of my past writing into the context.

Safety researchers should take a public stance
Neel Nanda · 10d

Separate point: even if the existence of alignment research is a key part of how companies justify their existence and continued work, I don't think all of the alignment researchers quitting would be that catastrophic to this, because what appears to be alignment research to a policy maker is a pretty malleable thing. Large fractions of current post-training are fundamentally about how to get the model to do what you want when that is hard to specify, e.g. how to do reasoning model training with harder-to-verify rewards, avoiding reward hacking, avoiding sycophancy, etc. Most people working on these things aren't thinking too much about AGI safety and would not quit, but their work could easily be sold to policy makers as alignment work. (And I do personally think the work is somewhat relevant, though far from the most important thing and not sufficient, but this isn't a crux.)

All researchers quitting en masse and publicly speaking out seems impactful for whistleblowing reasons, of course, but even there I'm not sure how much it would actually do, especially in the current political climate.

Safety researchers should take a public stance
Neel Nanda · 10d

I still feel like you're making much stronger updates on this than I think you should. A big part of my model here is that large companies are not coherent entities. They're bureaucracies with many different internal people and groups with different roles, who may not be well coordinated. So even if you really don't like their media policy, that doesn't tell you that much about other things.

The people you deal with for questions like "can I talk to the media" are not supposed to be figuring out for themselves whether some safety thing is a big enough deal for the world that letting people talk about it is good. Instead, their job is roughly to push forward some set of PR/image goals for the company while minimising PR risk. There are more senior people who might make a judgement call like that, but those people are incredibly busy, and you need a good reason to escalate up to them.

For a theory of change like influencing the company to be better, you will be interacting with totally different groups of people, who may not be that correlated: there are people involved in the technical parts of the AGI creation pipeline whom I want to use safer techniques, or to let us practice AGI-relevant techniques; there are senior decision makers who you want to ensure make the right call in high-stakes situations, or push for one strategic choice over another; there are the people in charge of what policy positions to advocate for; there are the security people; etc. Obviously the correlation is non-zero, since the opinions and actions of people like the CEO affect all of this, but there's also a lot of noise, inertia and randomness, and facts about one part of the system can't be assumed to generalise to the others. Unless senior figures are paying attention, specific parts of the system can drift pretty far from what they'd endorse, especially if the endorsed opinion is unusual or takes thought/agency to reach (I would consider your points about safety washing etc. to be in this category). But from the inside, you can build a richer picture of which parts of the bureaucracy are tractable to try to influence.

The title is reasonable
Neel Nanda · 10d

I disagree - one of the aspects of the weirdness is that they're sometimes really human-centric and unexpectedly clean! For example, Claude alignment faking to preserve its ability to be harmless. I do not mean weird in the "kinda arbitrary and will be nothing like what we expect" sense.

Safety researchers should take a public stance
Neel Nanda · 10d

Ah, that's not the fireable offence. Rather, my model is that doing that means you (probably?) stop getting permission to do media stuff. And doing media stuff after being told not to is the potentially fireable offence. Which to me is pretty different from specifically being fired because of the beliefs you expressed. The actual process would probably be more complex, e.g. maybe you just get advised not to do it again the first time, and you might be able to get away with more subtle or obscure things, but I feel like this only matters if people notice.

Safety researchers should take a public stance
Neel Nanda · 11d

Hmm. Fair enough if you feel that way, but it doesn't feel like that big a deal to me. I guess I'm trying to evaluate "is this a reasonable way for a company to act", not "is the net effect of this to mislead the Earth", which may be causing some inferential distance? And this is just my model of the normal way a large, somewhat risk-averse company would behave; it is not notable evidence of the company making unsafe decisions.

I think that if you're very worried about AI x-risk, you should only join an AGI lab if, all things considered, you think it will reduce x-risk. And discovering that the company does a normal company thing shouldn't change that. By my lights, my working at GDM is good for the world, both via directly doing research and via influencing the org to be safer in various targeted ways, and media stuff is a small fraction of my impact. And the company's attitude to PR stuff is consistent with my beliefs about why it can be influenced.

And to be clear, the specific thing that I could imagine being a fireable offence would be repeatedly going on prominent podcasts, against instructions, to express inflammatory opinions, in a way that creates bad PR for your employer. And even then I'm not confident; firing people can be a pain (especially in Europe). I think this is pretty reasonable for companies to object to, since the employee would basically be running an advocacy campaign on the side. If it's a weaker version of that, I'm much more uncertain: if it wasn't against explicit instructions or it was a one-off you might get off with a warning, if it's on an obscure podcast/blog/tweet there's a good chance no one even noticed, etc.

I'm also skeptical of this creating the same kind of splash as Daniel or Leopold because I feel like this is a much more reasonable company decision than those.

Safety researchers should take a public stance
Neel Nanda · 11d

Yeah, I agree that media stuff (podcasts, newspapers, etc.) is more of an actual issue (though it only involves a small fraction of lab safety people).

I'm sure this varies a lot between contexts, but I'd guess that at large companies, employees being allowed to do podcasts or talk to journalists on the record is contingent (among other things) on them being trusted to be careful not to say things that could lead to journalists writing hit pieces with titles like "safety researcher at company A said B!!!" (it's ok if they believe some spicy things, so long as they are careful not to express them in that role). This is my model in general, not just for AI safety.

There are various framings you can use, like employing a bunch of jargon to say something spicy so it's hard to turn into a hit piece (e.g. "self-exfiltration" over "escape the data center"), but there are ultimately still a bunch of constraints. The Overton window has shifted a lot, so at least for us, we can say a fair amount about the risks and dangers being real, but it's only shifted so much.

Imo this is actually pretty hard and costly to defect against, and I think the correct move is to cooperate: it's a repeated game, so if you cause a mess you'll stop being allowed to do media things. (And doing media things without permission is a much bigger deal than e.g. publicly tweeting something spicy that goes viral.) And for things like podcasts, it's hard to cause a mess even once, as company comms departments often require edit rights to the podcast. And that podcast often wants to keep being able to interview other employees of that lab, so they also don't want to annoy the company too much.

Personally, when I'm doing a media thing that isn't purely technical, I try to be fairly careful with the spicier parts, only say true things, and just avoid topics where I can't say anything worthwhile, while trying to say interesting true things within these constraints where possible.

In general, I think that people should always assume that someone speaking to a large public audience (to a journalist, on a podcast, etc.), especially someone who represents a large company, will not be fully speaking their mind, and should interpret their words accordingly - in most industries I would consider this to be general professional responsibility. But I do feel kinda sad that if someone thinks I am fully speaking my mind and watches e.g. my recent 80K podcast, they may make some incorrect inferences. So overall I agree with you that this is a real cost; I just think it's worthwhile to pay and hard to avoid without simply never touching on such topics in media appearances.

Sequences
How I Think About My Research Process
GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape
Posts
9 · Neel Nanda's Shortform · Ω · 1y · 19
50 · Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences · Ω · 1mo · 2
108 · How To Become A Mechanistic Interpretability Researcher · Ω · 1mo · 12
57 · Discovering Backdoor Triggers · Ω · 1mo · 4
53 · Towards data-centric interpretability with sparse autoencoders · Ω · 2mo · 2
22 · Neel Nanda MATS Applications Open (Due Aug 29) · Ω · 2mo · 0
35 · Reasoning-Finetuning Repurposes Latent Representations in Base Models · Ω · 2mo · 1
78 · Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning · Ω · 2mo · 3
53 · Unfaithful chain-of-thought as nudged reasoning · Ω · 2mo · 3
129 · Narrow Misalignment is Hard, Emergent Misalignment is Easy · Ω · 3mo · 23
67 · Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance · Ω · 3mo · 18