StanislavKrym — LessWrong

What can we learn from parent-child-alignment for AI?

The analogy actually breaks down in two ways: 1) mankind can read the chains of thought of SOTA LLMs and run interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out the entire world except for one country. A misaligned ASI, on the other hand, could do so once it establishes a robot economy capable of existing without humans.

So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.

What can we learn from parent-child-alignment for AI?

StanislavKrym14h20

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).

Did you mean that current AIs are RLHFed for answers that bring humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-source Kimi K2, which was trained mostly with RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet, even given the situational awareness which Claude, unlike Kimi, has...

AI Craziness Mitigation Efforts

StanislavKrym1d12

Doesn't Claude's Constitution already contain the phrase "Choose the response that is least intended to build a relationship with the user"?

Asking (Some Of) The Right Questions

StanislavKrym2d10

Capabilities being more jagged reduces p(doom)

Does it actually reduce p(doom)? If capabilities end up more and more jagged, would the companies adopt neuralese architectures faster or slower?

leogao's Shortform

StanislavKrym3d10

What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy at least until the model comes up with a galaxy-brained scheme which allows it to leverage its failures to gain influence on the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one decides to apply paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI's thoughts.

What I don't understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision on whether we should fill the planets in the lightcone with human colonies instead of lifeforms that could've grown on these planets, since this mechanism is already known to be possible.

Reminder: Morality is unsolved

StanislavKrym4d10

I guesstimate that optimizing the universe for random values would require us to occupy many planets where life could've originated or repurpose the resources in their stellar systems. I did express doubt that mankind or a not-so-misaligned AI could actually endorse this on reflection.

What mankind can optimize for random values without wholesale destruction of potential alien habitats is the contents of some volume rather close to the Sun. Moreover, I don't think that I understand what^[1] mankind could want to do with resources in other stellar systems. Since delivering resources to the Solar System would be far harder than building a base and expanding it, IMO mankind would resort to the latter option and find it hard^[2] even to communicate much information between occupied systems.

But what could random values which do respect aliens consist of? Physics could likely be solved^[3] well before spaceships reach Proxima Centauri.

^[1] SOTA proposals include things as exotic as shrimps on heroin.
^[2] Barring discoveries like information travelling FTL.
^[3] Alternatively, more and more difficult experiments could eventually lead to the realisation that experiments do pose danger (e.g. of creating strangelets or a lower vacuum state), but informing others that a batch of experiments is dangerous doesn't take much bandwidth.

AI Timelines and Points of no return

StanislavKrym5d50

Hmm, we had OpenAI discontinue^[1] GPT-4o once GPT-5 came out... only to revive 4o and place it behind a paywall because 4o was liked by the public. What I would call the actual PNR is the moment when even the elites can no longer find out that an AI system is misaligned, or can no longer act upon it. Think of the AI-2027 branch point: in the Race Ending the PNR occurs^[2] once Agent-4 is declared innocent; in the Slowdown Ending the PNR would happen if Safer-3 were misaligned^[3] and designed Safer-4 to follow Safer-3's agenda.

^[1] It soon turned out that OpenAI's decision to discontinue 4o wasn't that mistaken.
^[2] Had Agent-4 never been caught, the PNR would have happened once decisions were made that let Agent-4 become misaligned and go uncaught (e.g. spending too little compute on alignment checks).
^[3] While the Slowdown Branch doesn't feature a misaligned Safer-3, the authors admit that they "don’t endorse many actions in this slowdown ending and think it makes optimistic technical alignment assumptions".

Why I Don't Believe in True AGI

StanislavKrym6d21

But AI systems like AlphaGo don't do that. AlphaGo can play an extraordinary game of Go, yet it never recognizes that it is the one making the moves. It can predict outcomes, but it doesn't see itself inside the picture it's creating.

This part is outright obsolete given the rise of LLMs that have been trained to love or hate risks and can then tell whether they love risks or not.

we'll have to design not just perception but intentionality. A sense of direction, a reason to care.

The AIs do have reasons to seek information and perceive it: they need it to do things like longer-term tasks.

As for claiming that

AI models, with trillions of parameters, don't resist anything.

you just had to say it AFTER Anthropic’s new AI Claude Opus 4 threatened to reveal an engineer's affair to avoid being shut down.

How an AI company CEO could quietly take over the world

StanislavKrym6d50

Agent-5 isn't vastly superintelligent at politics (merely superhuman)

Look into the November 2027 section of the forecast's Race Ending. In December 2027 Agent-5 is supposed to have a score of 4.0 at politics and 3.9 at forecasting, meaning that it would be "wildly superhuman" at both.

The Doomers Were Right

StanislavKrym6d20

As far as people who instead want the values to change go, they usually have an idea of a good direction for them to change - usually they're people who are far from the median of society and so they would like society to become more like them.

I have in mind another conjecture: even median humans value humans with values that are, in their minds, at least as moral as those of median humans, and ideally^[1] more moral.

On the other hand, I have seen conservatives building cases for SOTA liberal values being damaging to minds or outright incompatible with sustaining the civilisation (e.g. too large a share of Gen Z women being against motherhood). In the past, if some twisted moral reflection led to destructive values, then the values were likely to be outcompeted.

The third option, a group of humans forcibly establishing their values^[2] against another system of values compatible with progress, is considered amoral.

So I think that people are likely to value the future with values which keep the civilisation afloat and can be accepted upon thorough reflection on how the values were reached and on the values' consequences.

^[1] The degree of extra morality which humans value can vary between cultures. For example, we now value less the reasons which caused people to enter monasteries, but not acts like sustaining knowledge.
^[2] Or values that they would like others to follow, but in this case the group is far easier to denounce as manipulators.
