LESSWRONG

Neel Nanda

Sequences
How I Think About My Research Process
GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

9 · Neel Nanda's Shortform · Ω · 1y · 19

Comments (sorted by newest)
Mikhail Samin's Shortform
Neel Nanda · 4mo · 143

I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even when breaking them seems locally beneficial for what look like really good reasons, because on average, a decision theory that's willing to do harmful things for complex reasons performs badly.

The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, or innocent people being forced to do terrible things.

Given that exceptions exist for cases where we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that the bad outcome happens anyway, and you believe you can reduce the probability of that bad outcome (even after accounting for cognitive biases, sources of overconfidence, etc.), then it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.

When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I see some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where standard deontological reasoning breaks down is that this is very far outside the context in which such priors were developed and have proven robust. I am not comfortable naively assuming they will generalize, and this is an incredibly high-stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.

Regarding your examples: I'm completely ethically comfortable with someone creating a third political party in a country where the population contains two groups who each strongly want to commit genocide against the other. I think there are many ways such a third party could reduce the probability of genocide, even if it is ultimately built from a political base that wants bad outcomes.

Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical, as it risks nuclear winter, whether by provoking someone else or via a false alarm on your own side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation, your proposed ethical rules suggest that countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?

COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but carried some risk of side effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions, because the good done far outweighed the potential harm.

You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.

Again, I'm purely defending the abstract point of "plans that could increase the probability of human extinction, even by building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.

kave's Shortform
Neel Nanda · 3d · 20

I hadn't heard / didn't recall that rationale, thanks! I wasn't tracking the culture-setting-for-new-users facet; that seems reasonable and important.

kave's Shortform
Neel Nanda · 4d · 20

If this is solely about patching the hack of, eg, "make a political post, get moved to personal blog, then make a quick take with basically the same content so you retain visibility", I am much less bothered. Is that the main case where you have done / intend to do this?

kave's Shortform
Neel Nanda · 5d · 3150

I would strongly prefer for you to not move quick takes off the front page until there's a way for me to opt into seeing them

Consider donating to Alex Bores, author of the RAISE Act
Neel Nanda · 9d · 2121

Have you ever done checked the political donations of even your closest friends/family? If you've ever hired somebody, have you ever looked into them on this type of thing on a background check?

This is a strawman. Eric is discussing the case of the federal government hiring people. That's very different!

Consider donating to Alex Bores, author of the RAISE Act
Neel Nanda · 9d · 66

"pro-safety is the pro-innovation position" seems false? If AI companies maximize profit by being safe, then they'd do it without regulation, so why would we need regulation? If they don't maximize profit by being safe, then pro-safety is not (maximally) pro-innovation.

Companies are not perfectly efficient rational actors, and innovation is not the same thing as profit, so I disagree here. For example, it is easy for companies to get caught in a race to the bottom, where each company risks causing a major disaster whose public backlash destroys the industry. That would be terrible for innovation, but the expected cost to each individual company is outweighed by the benefit of racing. Eg Chernobyl was terrible for innovation.

Or there can be coordination problems: sometimes companies want, or are happy with, regulation, but don't want to act unilaterally because that would impose costs on just them, in the same way that a billionaire can want higher taxes without unilaterally donating to the government.
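
As a toy sketch of the race-to-the-bottom arithmetic above (all numbers are made up purely for illustration, not claims about any real company):

# Toy numbers, purely illustrative.
n_companies = 5
p_disaster_per_racer = 0.10   # chance one racing company triggers an industry-ending backlash
industry_value = 100.0        # value each company loses if the industry is destroyed
racing_benefit = 15.0         # private gain a company gets from cutting corners

# One company's view: its own racing adds 0.10 * 100 = 10 of expected loss,
# less than the 15 it gains, so racing looks worth it.
print(racing_benefit > p_disaster_per_racer * industry_value)  # True

# Collective view: with everyone racing, a disaster is fairly likely, and the
# expected total loss dwarfs the total private gains from racing.
p_any_disaster = 1 - (1 - p_disaster_per_racer) ** n_companies       # ~0.41
expected_total_loss = p_any_disaster * industry_value * n_companies  # ~205
total_racing_benefit = racing_benefit * n_companies                  # 75
print(total_racing_benefit < expected_total_loss)  # True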

lionrock's Shortform
Neel Nanda · 11d · 96

"that" conventionally refers to beliefs, as I understand it

The Thinking Machines Tinker API is good news for AI control and security
Neel Nanda · 19d · Ω350

Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Though caching residual streams might enable some weight exfiltration if you can get a ton out? Seems like a massive pain though
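
To make "cache and add residual streams, even just at one layer" concrete, here's a minimal sketch using TransformerLens-style hooks purely as an illustration; the model, layer, and prompts are arbitrary assumptions, and this is local hook access rather than anything the Tinker API itself exposes:

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # assumed stand-in model
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

# Cache: run a forward pass and store the residual stream at one layer.
_, cache = model.run_with_cache("The Eiffel Tower is in")
cached_resid = cache[hook_name]  # [batch, seq, d_model]

# Add: inject the cached residual stream into a different forward pass.
def add_cached_resid(resid, hook):
    resid[:, -1, :] += cached_resid[:, -1, :]  # add at the final position
    return resid

patched_logits = model.run_with_hooks(
    "The Colosseum is in",
    fwd_hooks=[(hook_name, add_cached_resid)],
)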

Daniel Kokotajlo's Shortform
Neel Nanda · 1mo · Ω521

I disagree. Some analogies: If someone tells me "by the way, this is a trick question", I'm a lot better at answering trick questions. If someone tells me the maths problem I'm currently trying to solve was put on a maths Olympiad, that tells me a lot of useful information, like that it's going to have a reasonably nice solution, which makes it much easier to solve. And if someone's tricking me into doing something, then being told "by the way, you're being tested right now" is extremely helpful, because I'll pay a lot more attention and be a lot more critical, and I might notice the trick.

Beth Barnes's Shortform
Neel Nanda · 1mo · Ω230

That seems reasonable, thanks a lot for all the detail and context!

Posts (sorted by new)

51 · Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences · Ω · 2mo · 2
114 · How To Become A Mechanistic Interpretability Researcher · Ω · 2mo · 12
57 · Discovering Backdoor Triggers · Ω · 2mo · 4
53 · Towards data-centric interpretability with sparse autoencoders · Ω · 2mo · 2
22 · Neel Nanda MATS Applications Open (Due Aug 29) · Ω · 3mo · 0
35 · Reasoning-Finetuning Repurposes Latent Representations in Base Models · Ω · 3mo · 1
78 · Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning · Ω · 3mo · 8
54 · Unfaithful chain-of-thought as nudged reasoning · Ω · 3mo · 3
131 · Narrow Misalignment is Hard, Emergent Misalignment is Easy · Ω · 4mo · 24
67 · Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance · Ω · 4mo · 18