Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
Are you mainly interested in evaluating deceptive capabilities? I.e., no-holds-barred, can you elicit competent deception (or sub-components of deception) from the model? (Including by, e.g., fine-tuning on data that demonstrates deception or sub-capabilities.)
Or evaluating inductive biases towards deception? I.e. testing whether the model is inclined towards deception in cases when the training data didn't necessarily require deceptive behavior.
(The latter might need to leverage some amount of capability evaluation, to distinguish not being inclined towards deception from not being capable of deception. But I don't think the reverse is true.)
Or do you disagree with that way of cutting up the space?
+1.
I'm a big fan of extrapolating trendlines, and I think the current trendlines are concerning. But when evaluating the likelihood that "most democratic Western countries will become fascist dictatorships", I'd say these trends point firmly against this being "the most likely overall outcome" in the next 10 years. (While still increasing my worry about this as a tail risk, as a longer-term phenomenon, and as a more localized phenomenon.)
If we extrapolate the graphs linearly, we get:
That's really bad. But it would be inconsistent with a widespread fascist turn in the West, which would cause bigger swings in those metrics.
(As far as I can tell, the third graph is supposed to indicate the sign of the derivative of something like a democracy index, in each of many countries? Without looking into their criteria more, I don't know what it's supposed to say about the absolute size of changes, if anything.)
This also makes me confused about the next section's framing. If there's no "National Exceptionalism" where western countries are different from the others, then presumably the same trends should apply. But those suggest that the headline claim is unlikely. (But that we should be concerned about less probable, less widespread, and/or longer-term changes of the same kind.)
1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives
Not a direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. Cf. the top-level comment: "I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms."
I'd guess "most humans survive" vs. "most humans die" probabilities don't correspond super closely to "presence of small pseudo-kindness". Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
If they don't go to a doctor, it could be either because the problem is minor enough that they can't be bothered, or because they generally don't seek medical help when they are seriously unwell, in which case the risk from something like B12 deficiency is negligible compared to e.g. the risk of an untreated heart attack.
I'm personally quite bad at noticing and tracking (non-sudden) changes in my energy, mood, or cognitive ability. I think there are issues that I wouldn't notice (or would think minor) that I would still care a lot about fixing.
Also, some people have problems with executive function. Even if they notice issues, the issues might have to get pretty bad before they'll ask a doctor about them. Bad enough that it could be pretty valuable to prevent the less severe issues that would otherwise go untreated.
(This could be exacerbated if people are generally unexcited about seeking medical help — I think there are plenty of points on this axis where people will seek help for heart attacks but will be pessimistic about getting help with "vaguely feeling tired lately". Or maybe not even pessimistic. Just... not having "ask a doctor" be generated as an obvious thing to try.)
doesn't it seem to you that the topic is super neglected (even compared to AI alignment) given that the risks/consequences of failing to correctly solve this problem seem comparable to the risk of AI takeover?
Yes, I'm sympathetic. Among all the issues that will come with AI, I think alignment is relatively tractable (at least it is now) and that it has an unusually clear story for why we shouldn't count on being able to defer it to smarter AIs (though that might work). So I think it's probably correct for it to get relatively more attention. But even taking that into account, the non-alignment singularity issues do seem too neglected.
I'm currently trying to figure out what non-alignment stuff seems high-priority and whether I should be tackling any of it.
This was also my impression.
Curious if OP or anyone else has a source for the <1% claim? (Partially interested in order to tell exactly what kind of "doom" this is anti-predicting.)
I assume that's from looking at the GPT-4 graph. I think the main graph I'd look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4. Because PaLM-2 is evaluated 1-shot, and GPT-4 covers just 4 benchmarks instead of 20+.
That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away.
(And since PaLM-2 and GPT-4 seemed roughly on trend in the places where I could check them, probably they wouldn't change that too much.)
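The "OOMs away" arithmetic above can be illustrated with a toy sketch. All numbers here are made up for illustration (the actual judgment comes from eyeballing the post's graph); the point is just the mechanics of fitting a trend against log-compute and reading off the distance to a target performance level:

```python
import numpy as np

# Hypothetical data points: benchmark accuracy vs. log10(training compute).
# These numbers are illustrative, not taken from the post's graph.
log_compute = np.array([21.0, 22.0, 23.0, 24.0])  # log10 FLOP
accuracy = np.array([0.55, 0.65, 0.75, 0.85])     # fraction correct

# Fit a line: accuracy ~ slope * log_compute + intercept
slope, intercept = np.polyfit(log_compute, accuracy, 1)

def ooms_to_reach(target):
    """Orders of magnitude of extra compute the linear fit implies are
    needed to reach `target` accuracy, measured from the last data point."""
    needed_log_compute = (target - intercept) / slope
    return needed_log_compute - log_compute[-1]

print(ooms_to_reach(0.90))  # ~0.5 OOMs under these made-up numbers
print(ooms_to_reach(0.95))  # ~1.0 OOM under these made-up numbers
```

(Of course, a linear fit in accuracy can't be right near the ceiling — accuracy saturates below 100% — which is part of why the sigmoid/power-law discussion below matters.)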
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:
That description misses effects where BNSL-fitting would predict a slow, smooth shift from one power law to another, with that gradual shift continuing into the future. I don't know how important that is. Curious for your intuition about whether that's important, and/or other reasons why my description above is or isn't reasonable.
When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few of them. Maybe the practical thing to do would be to assume that the 2nd half of the "sigmoid" forms a distinct power-law segment, and fit a power law to the points with >~50% performance (or use a lower threshold if there are too few points above 50%). Which maybe suggests that the claim "BNSL does better" corresponds to a claim that the speed at which language models improve on ~random performance (the bottom part of the "sigmoid") isn't informative about how fast they converge to ~maximum performance (the top part of the "sigmoid")? That seems plausible.
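To make the "fit a power law to the top segment" suggestion concrete, here's a minimal sketch with made-up data. It fits a power law to the *error* (distance from maximum performance) for points above 50%, i.e. assumes 1 − P ≈ a · C^(−b), which is linear in log-log space:

```python
import numpy as np

# Illustrative (made-up) points: training compute vs. benchmark performance.
compute = np.array([1e21, 1e22, 1e23, 1e24])
performance = np.array([0.40, 0.60, 0.80, 0.90])

# Keep only the top half of the "sigmoid": points above ~50% performance.
mask = performance > 0.5
C, P = compute[mask], performance[mask]

# Power law on the error term:
#   1 - P ~ a * C^(-b)   <=>   log10(1 - P) ~ log10(a) - b * log10(C)
b_neg, log_a = np.polyfit(np.log10(C), np.log10(1 - P), 1)

def predict(c):
    """Predicted performance at compute c, under the fitted power law."""
    return 1 - 10 ** (log_a + b_neg * np.log10(c))

# Extrapolate one more OOM of compute under this fit.
print(predict(1e25))
```

Note this hard-codes 100% as the maximum achievable performance; a BNSL-style fit would instead let the segment's parameters (and the transition between segments) be learned from the data.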
Where does the "15B" for GPT-2's data come from, here? The guess in Epoch's dataset is that it was trained on 3B tokens for 100 epochs: https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=0