Sorry, I think you entirely missed my point. It seems my choice of hypothesis was distracting. I've edited my original comment to make that more clear. My point does not depend on the truth of the claim.
Suppose that, in the years before telescopes, I came to you and said that [wild idea X] was true.[1]
You'd be right to wonder why I think that. Now suppose that I offer some convoluted philosophical argument that is hard to follow (perhaps because it's invalid). You are not convinced.
If you write down a list of arguments for and against the idea, you could put my wacky argument in the "for" column, or leave it out if you think it's too weak to be worth considering. But what I am claiming would be insane is to list "lack of proof" as an argument against.
Lack of proof is an observation about the list of arguments, not about the idea itself. It's a meta-level argument masquerading as an object-level argument.
Let's say on priors you think [X] is 1% likely, and your posterior is pretty close to that after hearing my argument. If someone asks you why you don't believe [X], I claim that the most precise (and correct) response is "my prior is low," not "the evidence isn't convincing," since the weakness of your body of evidence is not a reason to disbelieve the hypothesis.
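To put toy numbers on it (the 2:1 likelihood ratio below is made up purely for illustration): in odds form, Bayes' rule says

\text{posterior odds} = \text{prior odds} \times \frac{P(\text{my argument} \mid X)}{P(\text{my argument} \mid \neg X)}

A 1% prior is 1:99 odds, so even granting my convoluted argument a 2:1 likelihood ratio in favor of [X] gives \frac{1}{99} \times 2 = \frac{2}{99}, i.e. roughly 2%. The posterior is low because the prior is low, not because the evidence somehow counted against [X].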
Does that make sense?
(Admittedly, I think it's fine to speak casually and not worry about this point in some contexts. But I don't think BB's blog is such a context.)
(There are also cases where the "absence of evidence" is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we'd see all AIs the size of Claude being obviously sociopathic.)
Edit warning! In the original version of this comment X = "the planets are other worlds, like ours, and a bunch of them have moons." My point does not depend on the specific X.
I claim that even in the case of the murder rate, you don't actually care about posterior probabilities; you care about evidence and likelihood ratios (but I agree that you should care about their likelihoods!). If you are sure that you share priors with someone, as with sane people and murder rates, their posterior probability lets you deduce that they have strong evidence that is surprising to you. But this is a special case, and it certainly doesn't apply here.
Posterior probabilities can be a reasonable tool for getting a handle on where you agree or disagree with someone (though alas, not a perfect one, since you might incidentally agree because your priors mismatch in exactly the opposite way that your evidence does). But once you've identified that you disagree, you should start double-clicking on object-level claims and trying to get a handle on their evidence and what likelihoods it implies, rather than criticizing them for having the wrong bottom-line number. If Eliezer's prior is 80% and Bentham's Bulldog has a prior of 0.2%, it's fine if they have respective posteriors of 99% and 5% after seeing the same evidence.
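To spell out why those numbers hang together (a back-of-the-envelope in odds form, using the same illustrative priors and posteriors):

\text{Eliezer: } \frac{0.99/0.01}{0.8/0.2} = \frac{99}{4} \approx 25 \qquad \text{BB: } \frac{0.05/0.95}{0.002/0.998} \approx 26

Both updates are consistent with roughly the same 25:1 likelihood ratio from the shared evidence; in this toy version the entire disagreement lives in the priors, which is why I'd rather dig into the evidence and its likelihoods than argue about whose bottom-line number is wrong.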
One major exception is if you're trying to figure out how someone will behave. I agree that in that case you want to know their posterior, all-things-considered view. But that basically never applies when we're sitting around trying to figure things out.
Does that make sense?
Hmmm... Good point. I'll reach out to Bentham's Bulldog and ask him what he even means by "confidence." Thanks.
Thanks for this comment! (I also saw you commented on the EA forum. I'm just going to respond here because I'm a LW guy and want to keep things simple.)
As you said, the median expert gives something like a 5% chance of doom. BB is about a factor of two more confident than that median expert that things will be okay. That factor of two difference is what I'm referencing. I am trying to say that I think it would be wiser for BB to be a factor of two less confident, like Ord. I'm not sure what doesn't seem right about what I wrote.
I agree that superforecasters are even more confident than BB. I also agree that many domain experts are more confident.
I think that BB and I are using Bayesian language where "confidence" means "degree of certainty," rather than degree of calibration or degree of meta-certainty or size of some hypothetical confidence interval or whatever. I agree that Y&S think the title thesis is "an easy call" and that they have not changed their mind on it even after talking to a lot of people. I buy that BB's beliefs here are less stable/fixed.
Yeah, but I still fucked up by not considering the hypothesis and checking with BB.
Ah, that's a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I'll edit.
Agreed that in terms of pointers to worrying Claude behavior, a lot of what I'm linking to can be seen as clearly about ineptness rather than something like obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there's a reason I cite it as the high-water mark for current models.
I mostly don't criticize Claude directly, in this essay, because it didn't seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don't think it counts as aligned, but I'm still not sure that's actually all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.
Interesting. I didn't really think I was criticizing Claude, per se. My sense is that I was criticizing the idea that normal levels of RLHF are sufficient to produce alignment. Here's my sense of the arguments that I'm making, stripped down:
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere. I agree that there is a good point about not needing to be perfect, though I do think the standards for AI should be higher than for humans, because humans don't get to leverage their unique talents to transform the world as often. (Like, I would agree about the bar being human-level goodness if I were confident that Claude would never wind up in the "role" of having lots of power.)
Am I missing something? I definitely want to avoid invalid moves.
Thanks. I'll put most of my thoughts in a comment on your post, but I guess I want to say here that the issues you raise are adjacent to the reasons I listed "write a guide" as the second option, rather than the first (i.e. surveillance + ban). We need plans that we can be confident in even while grappling with how lost we are on the ethical front.
"The evidence isn't convincing" is a fine and true statement. I agree that IABI did not convince BB that the title thesis is clearly true. (Arguably that wasn't the point of the book, and it did convince him that it was worryingly plausible and worth spending more attention on AI x-risk, but that's pure speculation on my part and idk.)
My point is that "the evidence isn't convincing" is (by default) a claim about the evidence, not the hypothesis. It is not a reason to disbelieve.
I agree[1] that sometimes having little evidence or only weak evidence should be an update against. These are cases where the hypothesis predicts that you will have compelling evidence. If the hypothesis were "it is obvious that if anyone builds it, everyone dies" then I think the current lack of consensus and inconclusive evidence would be a strong reason to disbelieve. This is why I picked the example with the stars/planets. It, I claim, is a hypothesis that does not predict you'll have lots of easy evidence on Old Earth, and in that context the lack of compelling evidence is not relevant to the hypothesis.
I'm not sure if there's a clearer way to state my point.[2] Sorry for not being easier to understand.
Perhaps relevant: MIRI thinks that it'll be hard to get consensus on AGI before it comes.
As indicated in my final parenthetical paragraph in my comment above:
(There are also cases where the "absence of evidence" is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we'd see all AIs the size of Claude being obviously sociopathic.)
We could try expressing things in math if you want. Like, what does the update on the book being unconvincing look like in terms of Bayesian probability?
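Here's a rough sketch of how I'd set it up, with placeholder symbols (D for "if anyone builds it, everyone dies," C for "the book convinces BB"). In odds form:

\frac{P(D \mid \neg C)}{P(\neg D \mid \neg C)} = \frac{P(D)}{P(\neg D)} \times \frac{P(\neg C \mid D)}{P(\neg C \mid \neg D)}

If D doesn't strongly predict that a convincing book would exist by now, then P(\neg C \mid D) \approx P(\neg C \mid \neg D), the likelihood ratio is close to 1, and the posterior barely moves. "The book didn't convince me" only becomes a real argument against D if you think D predicts that compelling evidence should already be on the table, which is the crux I was trying to point at.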