Ah, I was anchored by it being in all caps in the tweet at the top
But rejecting a theory without a better alternative can be suspicious
Nah, I still disagree. The set of possible theories is vast; one being promoted to my attention is not strong evidence that it is more true than all of those that haven't been. People can separately be hypocritical or inconsistent, but that's something that should be argued for directly.
Secure feels like the wrong word here. Often, I am reluctant to set a boundary because this will, in fact, lead to an outcome I don't want, eg the other person not being helped, or being sad. I'm bad at comparing this to the harms to me, but sometimes it's the right trade! A perfectly secure version of me would feel the same way, since they would also care about the welfare of other people.
Is your claim that security means saying yes to things if and only if it's a good idea according to the total value between you two? Rather than making trades that are actually bad because you underweight your own welfare? I got the impression from this post that your take was more "just don't do things if you don't want to" but maybe you're just skipping over the times when you think through consequences and convince yourself that they're legit?
I don't have to present an alternative theory in order to disagree with one I believe to be flawed or based on false premises. If someone gives me a mathematical proof and I identify a mistake, I don't need to present an alternative proof before I'm allowed to ignore it.
You can call that evil according to your own values, but then someone else can just as easily say that ignoring bee welfare is evil.
If you mean evil according to their values, then sure, this just seems correct. If someone doesn't hold to objective morality, then the same is true for every moral question; some are just less controversial than others. And you CAN make arguments like: if you agree with me on moral premise X, then conclusion Y holds.
This seems somewhat backwards to me. Communication is kind of like a tree search over meaning: everything you say is full of ambiguity, and you can spend more words resolving it by choosing specific nodes of the tree. Digging in means prioritising which bits are important; you can't make everything precise. I generally disapprove of critiquing people for not putting in the effort to clearly communicate something that wasn't important up to that point, and I view the sloppiness defence as saying: "you are making an implicit argument that this thing is low quality because it has an error. However, the error is in a part of the tree that wasn't very relevant to the argument, so who cares."
Separately, there's the person who supports their case with invalid reasoning and whose ambiguity shields them from this critique. But there, even if you give them the benefit of the doubt, the charitable reading of the argument still does not support the conclusion, so sloppiness is irrelevant: their argument is invalid.
Maybe there's a gray zone where the detail seems plausibly relevant to a reader but actually isn't? Critique seems reasonable here.
I don't know which category the Bengio thing falls into; it seems like it makes an invalid argument to support a specific point in its overall thesis? I haven't read it.
Agreed, I consider this a key theme in our fact-finding work, especially post 3 (though I could maybe have made this more explicit): https://www.lesswrong.com/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB
I'd be surprised if it made a big difference, but I agree in principle that it could make a difference, and favours probes somewhat over SAEs, so fair point.
I think that if you only ever wanted to use SAEs for unsupervised discovery of features, these results are not very important. But I was hopeful that SAEs would be more broadly useful, and this was a negative update: it's consistent with hypotheses like "SAEs faithfully capture the model's ontology and not the things we want", but it makes SAEs substantially less useful for practical tasks. I would love to see work that tries to find downstream tasks enabled by good circuit discovery and uses this to gather evidence.
I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even when there seems to be a really good reason why breaking them would be locally beneficial, because on average the decision theory that's willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that the bad outcome happens anyway, and you believe you can reduce the probability of it happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.
When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where standard deontological reasoning breaks down is that this is very far outside the context in which such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples, I'm completely ethically comfortable with someone making a third political party in a country where the population consists of two groups who each strongly want to commit genocide against the other. I think there are many ways such a third party could reduce the probability of genocide, even if its political base ultimately wants negative outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I'm purely defending the abstract point that "plans that could increase the probability of human extinction, even by building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like, but you seem to be making a much more general claim.