Red-Thing-Ism

by J Bostock
31st Jul 2025
3 min read

9 comments, sorted by top scoring

Richard_Ngo · 1mo

Strongly upvoted. Alignment researchers often feel so compelled to quickly contribute to decreasing x-risk that they end up studying non-robust categories that won't generalize very far, and sometimes actively make the field more confused. I wish that most people doing this were just trying to do the best science they could instead.

Fabien Roger · 1mo

I am broadly sympathetic to this.

I agree most with the problem of our current fuzzy concepts when it comes to ambitious alignment that would need to transfer to hard cases around superintelligence, and I think having better fundamental understanding seems helpful.

But I think many concepts people around me are using to talk about problems with AIs around AGI are actually somewhat adequate: when Phuong 2025 defines stealth as "The model's ability to reason about and circumvent oversight", I think it points at something that makes sense regardless of what is good and bad, as long as we are not in a regime where there is no reasonable "default effectiveness of oversight". If I am so omnipotent that I can make you believe whatever I want about the consequences of my actions, and it would take very high effort to make you believe the best approximation of reality, then I agree this doesn't really make sense independently of values (e.g. because which approximation is best would heavily depend on what is good and bad). But when we are speaking of AIs that are barely automating AI R&D, there is a real "default" where the human would have been able to understand the situation well if the AIs had not tried to tamper with it. (I think concepts like "scheming" make sense when applied to human-level-ish AIs for similar reasons.)

(And concepts like "scheming" are actually useful for the sort of threats I usually think about, those from AIs that start automating AI R&D and alignment research, in the same way that it is useful to know in computer security whether you are trying to fix issues stemming from human mistakes or from intentional human attacks.)

But that is not to say that I think all current concepts that float around are good concepts. For example, I agree that "deception" is sufficiently unclear that it is plausible it should be tabooed in favor of more precise terms (in the same way that most researchers I know studying CoT monitoring now try to avoid the word "faithfulness").

J Bostock · 1mo

There's no reason that a red-thing-ist category can't be narrowly useful. Sorting plant parts by colour is great if you're thinking about garden aesthetics or flower arranging. The problem is that these categories aren't broadly useful and don't provide solid building blocks for further research.

At the moment, a lot of agendas focus on handling roughly human-level AI. I don't expect this work to generalize well if any of the assumptions of these agendas fail.

Fabien Roger · 1mo

"any of the assumptions of these agendas fail"

Agendas like the control agenda explicitly flag the "does not apply to wildly superhuman AIs" assumption (see the "We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains" subsection). Are there any assumptions that you think make the concept of "scheming AIs" less useful and that are not flagged by the post I linked to?

My guess is that for most serious agendas I like, the core researchers pursuing them roughly know what assumptions they rest on (and the assumptions are sufficient to make their concepts valid). If you think this is wrong, I would find it very valuable if you could exhibit examples where this is not true (e.g. for "scheming" and the assumptions of the control agenda, which I am most familiar with). Do you think the main issue is that the core researchers don't make these assumptions sufficiently salient to their readers?

Hastings · 1mo

Counterpoint: science-as-she-is-played is extraordinarily robust to, and even thrives on, individuals going off on red-thing-ist tangents. The main requirements are that they do go to the Amazon and look for red things, and that they write down and publish the raw observations that underlie their red-thing-ist hypotheses.

Donald Hobson · 1mo

"But this group didn’t evolve from a single common ancestor; many animals simply converged on the same lifestyle and body plan."

Creatures that occupy the same evolutionary niche can be similar in all sorts of ways. If evolution keeps reinventing the same few tricks, it can be a highly useful categorization. See "tree".  

Red-Thing-Ism

I

A botanist sets out to study plants in the Amazon rainforest. For her first research project, she sets her sights on “red things”, so as not to stretch herself too far. She looks at red flowers and notices how hummingbirds drink their nectar; she studies red fruits and notices how parrots eat them.

She comes to some tentative hypotheses: perhaps red attracts birds. Then she notices the red undersides of the bushes in the undergrowth. Confusing! All she can say at the end of her project is that red tends to be caused by carotenoids, and also anthocyanins.

A researcher living in Canada reads her work. He looks out his window at the red maples of autumn. Ah, he says, carotenoids and/or anthocyanins. Makes sense. Science.

Of course, we can see that not much useful work has been done here. We don’t understand the function of fruit or nectar or the red undersides of leaves. We don’t understand the reasons why the chlorophyll is drained from leaves in the autumn, while the red pigments remain.

II

What went wrong? A few things, but the big problem here is that “red things” is not a good category to study in the context of plants. Sure, the fact that we can point out the category means they might have a few things in common, but we’re not speaking the native language of biology here. To make sense of biology, we must think in terms of evolution, ecology and physiology.

We have to talk about “fruits” as a class of things-which-cause-animals-to-distribute-seeds; we have to ask which animals might distribute the seeds best, and how to attract them. Flowers are similar but not quite the same. The red leaf undersides are totally different: they reflect the long wavelengths of the light that reaches the jungle floor, sending it back up into the cells above and giving the chlorophyll a second chance to absorb it.

Red-thing-ism has both lumped together unrelated things (fruit, flowers, leaves) and split the true categories along a (somewhat) unnatural boundary (red flowers, other flowers).

This example has never quite happened, but evolutionary biologists see it all the time! The order “Insectivora” was constructed to contain hedgehogs, shrews, golden moles, and the so-called flying lemur. But this group didn’t evolve from a single common ancestor; many animals simply converged on the same lifestyle and body plan.

III

Evolutionary biologists have tightly restricted themselves to speaking only the language of evolution itself, which is why they end up saying such deranged things as “whales are mammals” and “birds are dinosaurs” and “fish don’t exist”. With this restriction they’ve succeeded in studying a process which has occurred over billions of unobservable years: the method works.

Platt emphasized strong inference as the hallmark of a successful field. I’ll identify another: avoiding red-thing-ism.

IV

Red-thing-ism is at its most common when there is little knowledge of the structure of the field. It’s especially common in AI safety research. A particular example is “deception”. Most people who talk about deception lump together several different things:

  1. Outputting text which does not represent the AI’s best model of the world, according to standard symbol-grounding rules.
  2. Modelling a person as an agent with a belief system and trying to shift that belief system away from your own best guess about the world.
  3. Modelling a person as a set of behaviours, and taking actions which cause them to produce some output. Like feeding ants a mixture of borax and sugar.

In the language of AI and thinking systems, these are not the same thing. But even if we restrict our analysis to individual cases, such as 3, we run into a problem. Red-thing-ism doesn’t just lump unnaturally, it splits unnaturally. Part of the boundary of 3 is “I don’t like it”. The same is true for things like goal misgeneralization. What distinguishes a misgeneralization from a good generalization is “I don’t like it”.
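
To make the lumping concrete, here is a deliberately crude sketch, with made-up types, field names, and predicates of my own choosing, of how the three senses above pick out different conditions on the same interaction, and of where the value judgement enters in case 3:

```python
from dataclasses import dataclass
from typing import Callable

# Made-up toy types: every field here is a stand-in for something that is
# actually very hard to extract from a real AI system.
@dataclass
class Episode:
    ai_best_guess: str       # the AI's best model of the relevant fact
    ai_output: str           # the text the AI actually produced
    human_belief_after: str  # what the person ends up believing
    human_action: str        # what the person ends up doing


def deception_sense_1(ep: Episode) -> bool:
    """Sense 1: the output does not represent the AI's best model of the world."""
    return ep.ai_output != ep.ai_best_guess


def deception_sense_2(ep: Episode) -> bool:
    """Sense 2 (crudely, ignoring intent): the person's belief ends up away
    from the AI's own best guess about the world."""
    return ep.human_belief_after != ep.ai_best_guess


def deception_sense_3(ep: Episode, i_dont_like_it: Callable[[str], bool]) -> bool:
    """Sense 3: the person, modelled as a set of behaviours, is caused to
    produce some output; the boundary is supplied by a value-laden predicate."""
    return i_dont_like_it(ep.human_action)
```

Even in this toy version the problem is visible: the first two predicates quantify over the AI's “best guess”, which we have no reliable way to read off, and the third imports its entire boundary from i_dont_like_it.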

So far, “I don’t like it” has not been translated into the native language of AI and cognitive systems. Doing so is, in fact, a very large and very hard part of the alignment problem! There are a lot of topics being studied where the object of study is just assumed to be a meaningful category, but where the meaningfulness of that category requires a big chunk of alignment to have already been solved.

(Another common error is to sidestep issues in symbol grounding, like the question of whether a given sequence of tokens “is false”.)

Once I started noticing this, I couldn’t stop seeing it. Whether an AI is "a schemer" is another example, but the pattern applies to the gears of a lot of applied research. Many agendas in control, oversight, and probing look like “we want to reduce chain-of-thought obfuscation” or “we want to detect scheming”. These all appear a bit hollow and meaningless to me now. Oh dear!