Red-Thing-Ism

by J Bostock
31st Jul 2025
3 min read

9 comments, sorted by top scoring

Richard_Ngo · 1mo

Strongly upvoted. Alignment researchers often feel so compelled to quickly contribute to decreasing x-risk that they end up studying non-robust categories that won't generalize very far, and sometimes actively make the field more confused. I wish that most people doing this were just trying to do the best science they could instead.

Fabien Roger · 1mo

I am broadly sympathetic to this.

I agree most with the problem of our current fuzzy concepts when it comes to ambitious alignment that would need to transfer to hard cases around superintelligence, and I think having better fundamental understanding seems helpful.

But I think many concepts people around me are using to talk about problems with AIs around AGI are actually somewhat adequate: when Phuong 2025 defines stealth as "The model's ability to reason about and circumvent oversight", I think it points at something that makes sense regardless of what is good and bad, as long as we are not in a regime where there is no reasonable "default effectiveness of oversight". If I am so omnipotent that I can make you believe whatever I want about the consequences of my actions, and it would take very high effort to make you believe the best approximation of reality, then I agree this doesn't really make sense independently of values (e.g. because which approximation is best would heavily depend on what is good and bad). But when we are speaking of AIs that are barely automating AI R&D, there is a real "default" where the human would have been able to understand the situation well if the AIs had not tried to tamper with it. (I think concepts like "scheming" make sense when applied to human-level-ish AIs for similar reasons.)

(And concepts like "scheming" are actually useful for the sort of threats I usually think about, those from AIs that start automating AI R&D and alignment research, in the same way that it is useful to know in computer security whether you are trying to fix issues stemming from human mistakes or from intentional human attacks.)

But that is not to say that I think all current concepts that float around are good concepts. For example, I agree that "deception" is sufficiently unclear that it is plausible it should be tabooed in favor of more precise terms (in the same way that most researchers I know studying CoT monitoring now try to avoid the word "faithfulness").

J Bostock · 1mo

There's no reason that a red-thing-ist category can't be narrowly useful. Sorting plant parts by colour is great if you're thinking about garden aesthetics or flower arranging. The problem is that these categories aren't broadly useful and don't provide solid building blocks for further research.

At the moment, a lot of agendas focus on handling roughly human-level AI. I don't expect this work to generalize well if any of the assumptions of these agendas fail.

Fabien Roger · 1mo

"any of the assumptions of these agendas fail"

Agendas like the control agenda explicitly flag the "does not apply to wildly superhuman AIs" assumption (see the "We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains" subsection). Are there any assumptions that you think make the concept of "scheming AIs" less useful and that are not flagged by the post I linked to?

My guess is that for most serious agendas I like, the core researchers pursuing them roughly know what assumptions they rest on (and the assumptions are sufficient to make their concepts valid). If you think this is wrong, I would find it very valuable if you could exhibit examples where this is not true (e.g. for "scheming" and the assumptions of the control agenda, which I am most familiar with). Do you think the main issue is that the core researchers don't make these assumptions sufficiently salient to their readers?

Hastings · 1mo

Counterpoint: science-as-she-is-played is extraordinarily robust to, and even thrives on, individuals going off on red-thing-ist tangents. The main requirements are that they do go to the Amazon and look for red things, and that they write down and publish the raw observations that underlie their red-thing-ist hypotheses.

Donald Hobson · 1mo

"But this group didn’t evolve from a single common ancestor; many animals simply converged on the same lifestyle and body plan."

Creatures that occupy the same evolutionary niche can be similar in all sorts of ways. If evolution keeps reinventing the same few tricks, it can be a highly useful categorization. See "tree".  

Red-Thing-Ism

I

A botanist sets out to study plants in the Amazon rainforest. For her first research project, she sets her sights on “red things”, so as not to stretch herself too far. She looks at red flowers and notices how hummingbirds drink their nectar; she studies red fruits and notices how parrots eat them.

She comes to some tentative hypotheses: perhaps red attracts birds. Then she notices the red undersides of the bushes in the undergrowth. Confusing! All she can say at the end of her project is that red tends to be caused by carotenoids, and also anthocyanins.

A researcher living in Canada reads her work. He looks out his window at the red maples of autumn. Ah, he says, carotenoids and/or anthocyanins. Makes sense. Science.

Of course, we can see that not much useful work has been done here. We don’t understand the function of fruit or nectar or the red undersides of leaves. We don’t understand the reasons why the chlorophyll is drained from leaves in the autumn, while the red pigments remain.

II

What went wrong? A few things, but the big problem here is that “red things” is not a good category to study in the context of plants. Sure, the fact that we can point out the category means they might have a few things in common, but we’re not speaking the native language of biology here. To make sense of biology, we must think in terms of evolution, ecology and physiology.

We have to talk about “fruits” as a class of things-which-cause-animals-to-distribute-seeds; we have to ask which animals might distribute the seeds best, and how to attract them. Flowers are similar but not quite the same. The red leaf undersides are totally different: they reflect the long wavelengths of the light that reaches the jungle floor, sending it back up into the cells above and giving the chlorophyll a second chance to absorb it.

Red-thing-ism has both lumped together unrelated things (fruit, flowers, leaves) and split the true categories along a (somewhat) unnatural boundary (red flowers, other flowers).

This example has never quite happened, but evolutionary biologists see it all the time! The order “Insectivora” was constructed to contain hedgehogs, shrews, golden moles, and the so-called flying lemur. But this group didn’t evolve from a single common ancestor; many animals simply converged on the same lifestyle and body plan.

III

Evolutionary biologists have tightly restricted themselves to speaking only the language of evolution itself, which is why they end up saying such deranged things as “whales are mammals” and “birds are dinosaurs” and “fish don’t exist”. With this restriction they’ve succeeded in studying a process which has occurred over billions of unobservable years: the method works.

Platt emphasized strong inference as the hallmark of a successful field. I’ll identify another: avoiding red-thing-ism.

IV

Red-thing-ism is at its most common when there is little knowledge of the structure of the field. It’s especially common in AI safety research. A particular example is “deception”. Most people who talk about deception lump together several different things:

  1. Outputting text which does not represent the AI’s best model of the world, according to standard symbol-grounding rules.
  2. Modelling a person as an agent with a belief system and trying to shift that belief system away from your own best guess about the world.
  3. Modelling a person as a set of behaviours, and taking actions which cause them to produce some output. Like feeding ants a mixture of borax and sugar.

In the language of AI and thinking systems, these are not the same thing. But even if we restrict our analysis to individual cases, such as 3, we run into a problem. Red-thing-ism doesn’t just lump unnaturally, it splits unnaturally. Part of the boundary of 3 is “I don’t like it”. The same is true for things like goal misgeneralization. What distinguishes a misgeneralization from a good generalization is “I don’t like it”.
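
To make the lumping concrete, here is a deliberately crude sketch, with made-up types, field names, and predicates of my own choosing, of how the three senses above pick out different conditions on the same interaction, and of where the value judgement enters in case 3:

```python
from dataclasses import dataclass
from typing import Callable

# Made-up toy types: every field here is a stand-in for something that is
# actually very hard to extract from a real AI system.
@dataclass
class Episode:
    ai_best_guess: str       # the AI's best model of the relevant fact
    ai_output: str           # the text the AI actually produced
    human_belief_after: str  # what the person ends up believing
    human_action: str        # what the person ends up doing


def deception_sense_1(ep: Episode) -> bool:
    """Sense 1: the output does not represent the AI's best model of the world."""
    return ep.ai_output != ep.ai_best_guess


def deception_sense_2(ep: Episode) -> bool:
    """Sense 2 (crudely, ignoring intent): the person's belief ends up away
    from the AI's own best guess about the world."""
    return ep.human_belief_after != ep.ai_best_guess


def deception_sense_3(ep: Episode, i_dont_like_it: Callable[[str], bool]) -> bool:
    """Sense 3: the person, modelled as a set of behaviours, is caused to
    produce some output; the boundary is supplied by a value-laden predicate."""
    return i_dont_like_it(ep.human_action)
```

Even in this toy version the problem is visible: the first two predicates quantify over the AI's “best guess”, which we have no reliable way to read off, and the third imports its entire boundary from i_dont_like_it.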

So far, “I don’t like it” has not been translated into the native language of AI and cognitive systems. Doing so is, in fact, a very large and very hard part of the alignment problem! There are a lot of topics being studied where the object of study is just assumed to be a meaningful category, but where the meaningfulness of that category requires a big chunk of alignment to have already been solved.

(Another common error is to sidestep issues in symbol grounding, like the question of whether a given sequence of tokens “is false”.)

Once I started noticing this, I couldn’t stop seeing it. Whether an AI is "a schemer" is another example, but the pattern applies to the gears of a lot of applied research. Many agendas in control, oversight, and probing look like “we want to reduce chain-of-thought obfuscation” or “we want to detect scheming”. These all appear a bit hollow and meaningless to me now. Oh dear!