Neel Nanda

Sequences
How I Think About My Research Process
GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape
Comments

Mikhail Samin's Shortform
Neel Nanda · 3mo · 143

I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules even when breaking them seems locally beneficial for a really good reason, because on average, a decision theory that's willing to do harmful things for complex reasons performs badly.

The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.

Given that exceptions exist for cases where we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that the bad outcome happens anyway, and you believe you can reduce the probability of it happening (even after accounting for cognitive biases, sources of overconfidence, etc.), then it can be ethically permissible to take actions whose side effects increase the probability of that bad outcome in other ways.

When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.

Regarding your examples, I'm completely ethically comfortable with someone forming a third political party in a country where the population contains two groups who both strongly want to commit genocide against the other. I think there are many ways that such a third party could reduce the probability of genocide, even if its political base ultimately wants negative outcomes.

Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?

COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.

You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.

Again, I'm purely defending the abstract point that "plans that could increase the probability of human extinction, even via building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.

Follow-up experiments on preventative steering
Neel Nanda · 6h · 30

Really cool results! Thanks for doing this.

How To Become A Mechanistic Interpretability Researcher
Neel Nanda · 10d · 30

This is basically a question of research taste. You'll get much better at this over time. When starting out, I recommend just not caring, and focusing on learning how to do anything at all.

The definition of mechanistic has somewhat drifted over time. I personally think of "does this use internals" and "is this rigorous" as two separate axes. I care a lot about both, and causal interventions are one powerful form of rigour.

How To Become A Mechanistic Interpretability Researcher
Neel Nanda · 11d · 52

I'm a big fan of exploring and thinking about your options. But honestly, I think one of the best ways of doing so is just to try learning some of the basics. For example, for my MATS applications, a lot of people totally new to mech interp try spending a weekend doing a small project, and I expect this is much more educational about whether they want to do mech interp than a bunch of paper reading, thinking, etc. If your priority is not personal fit, but instead thinking deeply about theories of change or something, then that's a different matter, and trying stuff is not as helpful. I do think that understanding the key recent papers in each subfield, and where modern thought is at, is pretty valuable. But a common trap for the kind of person trying to get into the field is to spend all of their time obsessing over finding the optimal thing to do, and to forget the step where they actually need to do something.

How To Become A Mechanistic Interpretability Researcher
Neel Nanda · 12d · 30

I tried to list alternatives. Are there any I don't have on there, that do mech interp, that you think should be included?

An epistemic advantage of working as a moderate
Neel Nanda · 24d · 95

That leads to things like corporate speak that is completely empty of content. The critics are adversarial or misunderstand you, so the incentives are very off.

An epistemic advantage of working as a moderate
Neel Nanda · 25d · 202

I doubt this is what Buck had in mind, but he's had meaningful influence on the various ways I've changed my views on the big picture of interp over time.

come work on dangerous capability mitigations at Anthropic
Neel Nanda · 25d · 170

I worked somewhat with Dave while he was at GDM, and can happily endorse him as competent, sincerely caring about x-risk, and a great person to work with! (Very sad to see you go 😢)

come work on dangerous capability mitigations at Anthropic
Neel Nanda · 25d · 30

Will Anthropic continue to publicly share the research behind its safeguards, and say what kinds of safeguards it's using? If so, I think there are clear ways to spread this work to other labs.

SE Gyges' response to AI-2027
Neel Nanda · 1mo · 7-8

I haven't read the whole post, but the claims that this can be largely dismissed because of implicit bias towards a pro-OpenAI narrative are completely ridiculous and ignorant of the background context of the authors. Most of the main authors of the piece have never worked at OpenAI or any other AGI lab. Daniel held broadly similar views to this many years ago, before he joined OpenAI. I know because he has both written about them, and I had conversations with him before he joined OpenAI where he expressed broadly similar views. I don't fully agree with these views, but they were detailed and well thought out, and were a better prediction of the future than mine at the time. He was also willing to sign away millions of dollars of equity in order to preserve his integrity, so implying that his having OpenAI stock is warping his underlying beliefs seems an enormous stretch. And to my knowledge, AI 2027 did not receive any OpenPhil funding.

I find it frustrating and arrogant when people assume without good reason that disagreement is because of some background bias in the other person - often people disagree with you because of actual reasons!

Posts

9 Ω · Neel Nanda's Shortform · 1y · 19
36 Ω · Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences · 9d · 1
100 Ω · How To Become A Mechanistic Interpretability Researcher · 12d · 12
56 Ω · Discovering Backdoor Triggers · 1mo · 4
48 Ω · Towards data-centric interpretability with sparse autoencoders · 1mo · 2
22 Ω · Neel Nanda MATS Applications Open (Due Aug 29) · 2mo · 0
35 Ω · Reasoning-Finetuning Repurposes Latent Representations in Base Models · 2mo · 1
78 Ω · Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning · 2mo · 3
53 Ω · Unfaithful chain-of-thought as nudged reasoning · 2mo · 3
130 Ω · Narrow Misalignment is Hard, Emergent Misalignment is Easy · 2mo · 23
66 Ω · Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance · 2mo · 18