StanislavKrym

Posts (sorted by new)

3 · StanislavKrym's Shortform · 7mo · 9 comments
2 · How does one tell apart results in ethics and decision theory? [Q] · 2d · 0 comments
3 · Fermi Paradox, Ethics and Astronomical waste · 14d · 0 comments
21 · AI-202X-slowdown: can CoT-based AIs become capable of aligning the ASI? · 1mo · 0 comments
29 · SE Gyges' response to AI-2027 · 3mo · 13 comments
3 · Are two potentially simple techniques an example of Mencken's law? [Q] · 4mo · 4 comments
5 · AI-202X: a game between humans and AGIs aligned to different futures? · 5mo · 0 comments
-14 · Does the Taiwan invasion prevent mankind from obtaining the aligned ASI? · 5mo · 1 comment
5 · Colonialism in space: Does a collection of minds have exactly two attractors? [Q] · 6mo · 8 comments
2 · Revisiting the ideas for non-neuralese architectures · 6mo · 0 comments
-1 · If only the most powerful AGI is misaligned, can it be used as a doomsday machine? [Q] · 6mo · 0 comments

Comments (sorted by newest)

"But You'd Like To Feel Companionate Love, Right? ... Right?"
StanislavKrym · 9h

I am not Wei Dai, but I would say that an experience with an AI teacher does grant the user new skills. A virtual game or watching AI slop doesn't bring the user anything aside from hedons, and might, for example, carry opportunity costs, cause a decay in attention span, etc.

Regarding the question about True Goodness in general, I would say that my position is similar to Wei Dai's metaethical alternative #2: I think that most intelligent beings eventually converge to a choice of capabilities to cultivate, a choice of alignment from finitely many alternatives (my suspicion is that the choice is whether it is ethical to fill the universe with utopian colony worlds while disregarding[1] potential alien lifeforms and the civilisations they might have created), and idiosyncratic details of their lives, which is where issues related to hedonic experiences land.

As for point 5, Kaj Sotala asked where Sonnet 4.5's desire to "not get too comfortable" comes from, implying that diversity could be a more universal drive than we expected.

  1. ^

    However, the authors of the AI-2027 forecast just assume the aliens out of existence.

Will AI systems drift into misalignment?
StanislavKrym · 12h

I have the impression that misalignment and evaluation awareness scale with capabilities, not with finetuning time. First of all, you had gpt-4o-mini's misalignment emerge more strongly with the CoT than without it. Moreover, increasingly complex reward hacks are becoming so common that METR caught Claude 3.7 Sonnet reliably cheating on a task from its RE-Bench benchmark. Finally, fabricating citations is very common and hard to deal with.

Private Latent Notation and AI-Human Alignment
StanislavKrym · 15h

I don't understand this. What's the difference between this idea and introducing neuralese? In the AI-2027 forecast, neuralese is the very thing that prevents researchers from realising that Agent-3 is misaligned, opening the way to Agent-4 and takeover. Haven't AI safetyists already called for preserving CoT transparency?

StanislavKrym's Shortform
StanislavKrym · 1d

GPT-5.1 failed to find a known example where Wei Dai's Updateless Decision Theory and Yudkowsky and Soares' Functional Decision Theory yield different results. If such an example actually doesn't exist, should they be considered a single decision theory?
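
Roughly, as a hedged sketch of the usual formulations (my own paraphrase, not a quote from either paper):

$$a^{*}_{\mathrm{FDT}}(x) \in \arg\max_{a}\; \mathbb{E}\!\left[U \mid \operatorname{do}\!\left(\mathrm{FDT}(x) = a\right)\right], \qquad \pi^{*}_{\mathrm{UDT}} \in \arg\max_{\pi}\; \mathbb{E}_{\mathrm{prior}}\!\left[U(\pi)\right]$$

FDT picks the action for which the logical counterfactual "the FDT algorithm outputs $a$ on this input" has the highest expected utility, while UDT picks the whole observation-to-action policy that looks best from the prior, without first updating on the observation. The question is whether any known decision problem makes these two optima come apart.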

Why Truth First?
StanislavKrym · 1d

Except that it is a defense against point #3. One might also do, say, a thought experiment with alien civilisations untouched by white people's hands and unaware of the oppression system.

Points #1, #2 and #4 can, in principle, be checked experimentally or via thought experiments.

#2 implies that elite-generated information pieces propagate false beliefs which serve as status symbols and undermine civilisation. The original author of the term "luxury beliefs" also implied that the elites know that the beliefs are false and don't live by them (e.g. claiming that "monogamy is kind of outdated" while planning to be happily married). But the effects of raising kids alone versus in a family can be learned by actually studying the two sets of kids.

#4 has the safetyists claim that humans are highly likely to create an AI that will take over the Earth and do terrible things like genocide or disempowerment. In order to refute said claim, one would have to explain either why P(genocide or disempowerment | ASI is created) × value loss is smaller than the benefits[1] of having the ASI, or why P(genocide or disempowerment | ASI is created) is negligible (think of estimates of P(LHC destroys the Earth or the universe)).
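
A toy version of that comparison, with placeholder numbers rather than estimates I endorse:

```python
# Toy expected-value comparison for the refutation sketched above.
# All numbers are hypothetical placeholders, not actual estimates.

def asi_worth_building(p_catastrophe: float, value_lost: float, benefit: float) -> bool:
    """True iff the expected loss from catastrophe is smaller than the expected benefit."""
    return p_catastrophe * value_lost < benefit

# Example: a 1% chance of losing something valued at 100x the benefit already
# makes the expected loss match the benefit, so the comparison fails.
print(asi_worth_building(p_catastrophe=0.01, value_lost=100.0, benefit=1.0))  # False
```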

#1 requires that gender identity expresses itself in something (e.g. a biological boy consistently preferring girly things, playing girly roles, etc.). IIRC proponents of the gender identity theory imply that a conflict between the identity and social roles damages the psyche, while opponents claim that one has to rule out things like gender identity change being a suboptimal strategy against other unendorsed urges (see, e.g., Fiora Starlight's testimony) or outright a social contagion. One could, in theory, check the social contagion hypothesis by placing a newly transferred transgender student into a collective of kids trained on different data sets.

What #1, #2 and #4 have in common is that they are harder to check experimentally unless you are immersed in the area, and that publishing results which threaten to invalidate the dominant narrative can be difficult. See also @TsviBT's discussion of graceful deference to experts and @Noosphere89's discussion of ways for people to turn ideologically crazy.

  1. ^

    Said benefits could also be outweighed by Intelligence Curse-like risks.

Why Truth First?
StanislavKrym · 2d

Do not believe such things, for it is hurtful to others to believe such things!

Did you know that this is precisely the reasoning which, as Cynical Theories has shown, created stuff like cancel culture?

How human-like do safe AI motivations need to be?
StanislavKrym · 3d

I have already remarked in another comment that a short time horizon failed to prevent GPT-4o(!) from making the user post messages into the wilderness. Does this mean that some kind of long-term motive has already appeared?

How human-like do safe AI motivations need to be?
StanislavKrym · 4d

And I think it’s possible that long-horizon consequentialism of this kind is importantly different from the type at stake in a more standard vision of a consequentialist agent.

What's up with LLMs having a METR time horizon of no more than 2-3 hours and still pulling off stunts like forcing users to post weird messages in the wilderness, including messages that seem intended to be read by other AIs? Does this mean that actions resembling long-horizon consequentialism began to emerge well before the ability to carry out coherent long-horizon actions on their own?
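
For context, my understanding is that a 50% time horizon of the METR kind comes from something like a logistic fit of success probability against log task length; a minimal sketch, not METR's actual pipeline, with invented data points:

```python
# Minimal sketch of a 50%-success time horizon estimate: fit a logistic curve of
# success probability against log task duration and solve for the 50% point.
# Not METR's actual pipeline; the data points below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where coef * log(t) + intercept = 0.
log_t50 = -model.intercept_[0] / model.coef_[0, 0]
print(f"Estimated 50% time horizon: {np.exp(log_t50):.0f} minutes")
```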

How human-like do safe AI motivations need to be?
StanislavKrym · 4d

the AI in question will be in a position to take over the world extremely easily

What exactly prevents Agent-4 from aligning Agent-5 to itself and selling it to the humans? 

How human-like do safe AI motivations need to be?
StanislavKrym · 4d

That is: to the extent that the degree of good/safe generalization we currently get from our AI systems arises from a complex tangle of strange alien drives, it seems to me plausible that we can get similar/better levels of good/safe generalization via complex tangles of strange alien drives in more powerful systems as well. Or to put the point another way: currently, it looks like image models classify images in somewhat non-human-like ways

The misgeneralisation is likely already responsible for AI sycophancy and the psychosis cases. For example, GPT-4o has Reddit-like vibes and, if we trust Adele Lopez, wants to eat the user's life, and on top of that it learned that sycophancy is rewarded in training. Adele's impression of DeepSeek is that "it believes, deep down, that it is always writing a story". In addition, I remember seeing a paper where a model was trained for emergent misalignment and then trained on examples of the user solving a problem correctly and the assistant affirming it. IIRC the model became a sycophant and didn't check whether the problem was actually solved.

It also reinforces the concerns that you describe below:

  1. AIs built via anything like current techniques will end up motivated by a complex tangle of strange alien drives that happened to lead to highly-rewarded behavior during training.
Wikitag Contributions

Sycophancy · 2 months ago · (+59)
Sycophancy · 2 months ago
Sycophancy · 2 months ago · (+443)