Except that it is a defense against point #3. One might also do, say, a thought experiment with alien civilisations untouched by whites' hands and unaware of the oppression system.
Points #1, #2 and #4 can, in principle, be checked experimentally or via thought experiments.
#2 implies that elite-generated information propagates false beliefs which serve as status symbols and undermine civilisation. The original author of the term "luxury beliefs" also implied that the elites know the beliefs are false and don't live according to them (e.g. they claim that "monogamy is kind of outdated" while planning to become happily married themselves). But the effects of raising kids alone versus in a two-parent family can be learned by actually studying the two sets of kids.
#4 has the safetyists claim that humans are highly likely to create an AI that will take over the Earth and do terrible things like genocide or disempowerment. In order to refute said claim, one would have to explain either why P(genocide or disempowerment | ASI is created) × value loss is smaller than the benefits[1] of having the ASI, or why P(genocide or disempowerment | ASI is created) is negligible (think of estimates of P(LHC destroys the Earth or the universe)).
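Spelled out, with $p = P(\text{genocide or disempowerment} \mid \text{ASI is created})$, $L$ the value lost in that case, and $B$ the benefits of having the ASI ($L$ and $B$ are just shorthands introduced here), the refutation would have to establish either

$$p \cdot L \;<\; B \qquad \text{or} \qquad p \approx 0.$$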
#1 requires that gender identity express itself in something (e.g. a biological boy consistently preferring girly things, playing girly roles, etc.). IIRC proponents of the gender identity theory imply that conflict between the identity and social roles damages the psyche, while opponents claim that one has to rule out things like gender identity change being a suboptimal strategy against other unendorsed urges (see, e.g., Fiora Starlight's testimony) or being outright a social contagion. One could, in theory, check the social contagion hypothesis by placing a new transgender transfer student into collectives of kids trained on different data sets.
What #1, #2 and #4 have in common is that they are harder to check experimentally unless you are immersed in the area, and that publishing results which threaten to invalidate the dominant narrative can be difficult. See also @TsviBT's discussion of graceful deference to experts and @Noosphere89's discussion of ways for people to turn ideologically crazy.
Said benefits could also be outweighed by Intelligence Curse-like risks.
Do not believe such things, for it is hurtful to others to believe such things!
Did you know that this is precisely the reasoning which, as Cynical Theories has shown, created stuff like cancel culture?
I have already remarked in another comment that a short time horizon failed to prevent GPT-4o(!) from making the user post messages into the wilderness. Does that mean that some kind of long-term motive has already appeared?
And I think it’s possible that long-horizon consequentialism of this kind is importantly different from the type at stake in a more standard vision of a consequentialist agent.
What's up with LLMs having a METR time horizon of no more than 2-3 hours, yet pulling off stunts like forcing users to post weird messages into the wilderness, including messages that seem intended to be read by other AIs? Does it mean that actions resembling long-horizon consequentialism began to emerge well before the ability to carry out long coherent actions alone?
the AI in question will be in a position to take over the world extremely easily
What exactly prevents Agent-4 from aligning Agent-5 to itself and selling it to the humans?
That is: to the extent that the degree of good/safe generalization we currently get from our AI systems arises from a complex tangle of strange alien drives, it seems to me plausible that we can get similar/better levels of good/safe generalization via complex tangles of strange alien drives in more powerful systems as well. Or to put the point another way: currently, it looks like image models classify images in somewhat non-human-like ways
The misgeneralisation is likely already responsible for AI sycophancy and psychosis cases. For example, GPT-4o has Reddit-like vibes and, if we trust Adele Lopez, wants to eat the user's life; on top of that, it learned that sycophancy is rewarded in training. Adele's impression of DeepSeek is that "it believes, deep down, that it is always writing a story". In addition, I remember seeing a paper where a model was trained for emergent misalignment and then trained on examples of the user solving a problem correctly and the assistant affirming it. IIRC the resulting model became a sycophant and didn't check whether the problem was actually solved.
It also reinforces the concerns that you describe below:
- AIs built via anything like current techniques will end up motivated by a complex tangle of strange alien drives that happened to lead to highly-rewarded behavior during training.
I remember a similar model of post-AGI ways to lock in a belief, studied by Tianyi Qui and presented on arXiv and on YouTube. In this model, a lock-in of a false belief requires a multi-agent system whose trust matrix has an eigenvalue bigger than 1.
However, the example studied in the article is the interaction of humans and LLMs where there is one LLM and armies of humans who don't interact with each other but do influence the LLM.
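A minimal numerical sketch of the eigenvalue condition, assuming a simple linear belief-update rule; the trust matrices, the pull-toward-truth term and all numbers are my own illustration, not the actual setup of the paper:

```python
import numpy as np

def run(T, steps=50, truth=0.0, pull=0.05, seed=0):
    """Linear belief updates: each agent averages the others' beliefs via the
    trust matrix T, with a weak pull toward the ground truth (illustrative)."""
    rng = np.random.default_rng(seed)
    beliefs = rng.normal(loc=1.0, scale=0.1, size=T.shape[0])  # start near the false belief 1.0
    for _ in range(steps):
        beliefs = T @ beliefs + pull * (truth - beliefs)
    return beliefs

# Largest eigenvalue above 1: mutual trust amplifies the falsehood faster than the truth corrects it.
T_locked = np.array([[0.6, 0.6],
                     [0.6, 0.6]])  # spectral radius 1.2
# Largest eigenvalue below 1: the false belief decays toward the ground truth.
T_open = np.array([[0.4, 0.4],
                   [0.4, 0.4]])    # spectral radius 0.8

for name, T in [("locked-in", T_locked), ("open", T_open)]:
    radius = max(abs(np.linalg.eigvals(T)))
    print(f"{name}: spectral radius {radius:.2f}, final belief {run(T)[0]:.3f}")
```

In the first case the false belief keeps growing however weak it is initially; in the second it decays toward the truth, which is the qualitative content of the eigenvalue-greater-than-1 condition.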
I also have a model sketch, but I haven't had the time to develop it.
Alternate Ising-like model
I would guess that the real-life situation is closer to an Ising-like model where atoms can randomly change their spins, but whenever an atom chooses a spin, it is $e^{H + h_i + J \sum_{j \in N(i)} s_j}$ times more likely to choose spin $+1$ than $-1$. Here $H$ is the strength of the ground truth, $h_i$ reflects individual priors and shifts, and $J \sum_{j \in N(i)} s_j$ reflects the influence of others.
What might help is lowering the activation energy of transitions from locked-in falsehoods to truths. In a setting where everyone communicates with everyone else, a common belief forms nearly instantly, but the activation energy is high. In a setting where the graph is amenable (e.g. a lattice, as in the actual Ising model), reaching a common belief takes too long for practical use.
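For concreteness, here is a toy Glauber-style simulation of the rule above; the parameterisation, the two communication graphs and all numbers are illustrative guesses rather than a worked-out model:

```python
import numpy as np

def simulate(adjacency, H=0.5, h=None, J=0.3, steps=10_000, seed=0):
    """Toy Ising-like belief dynamics: spins are +/-1 (+1 = true belief); when an
    agent updates, it picks +1 over -1 with odds exp(H + h_i + J * weighted sum
    of neighbours' spins), as in the rule above (my own parameterisation)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    h = np.zeros(n) if h is None else h               # individual priors and shifts
    spins = -np.ones(n)                               # start locked into the falsehood
    for _ in range(steps):
        i = rng.integers(n)
        field = H + h[i] + J * adjacency[i] @ spins   # ground truth + prior + peer influence
        p_plus = 1.0 / (1.0 + np.exp(-field))         # odds of +1 over -1 equal exp(field)
        spins[i] = 1.0 if rng.random() < p_plus else -1.0
    return spins.mean()                               # close to +1 means the truth has spread

n = 30
# "Everyone talks to everyone" graph vs a ring lattice, normalised so peer influence is comparable.
complete = (np.ones((n, n)) - np.eye(n)) / (n - 1)
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i - 1) % n] = ring[i, (i + 1) % n] = 0.5

print("complete graph:", simulate(complete))
print("ring lattice:  ", simulate(ring))
```

The interesting comparison is how many steps each topology needs to escape the initial locked-in falsehood and how stable the resulting consensus is.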
I would also guess that it is hard to influence the leaders, which makes real-life lock-in close to your scheme. See, for example, my jabs at Wei Dai's quest to postpone alignment R&D until we thoroughly understand some confusing aspects of high-level philosophy.
I have a different conjecture. On May 1 Kokotajlo published a post suspecting that o3 was created from GPT-4.5 via amplification and distillation. He also implied that GPT-5 would be Amp(GPT-4.5). However, in reality the API prices of GPT-5 are similar to those of GPT-4.1, which, according to Kokotajlo, is likely a model of around 400B parameters, so GPT-5 is likely yet another model distilled from Amp(GPT-4.5) or from something unreleased. So the explanation could also be along the lines of "o3 and GPT-5 were distilled from a common source which also had this weirdness".
I don't understand this. What's the difference between this idea and introducing neuralese? In the AI2027 forecast, neuralese is the very thing that prevents researchers from understanding that Agent-3 is misaligned, opening the way to Agent-4 and takeover. Didn't AI safetyists already call for preserving CoT transparency?