LESSWRONG
LW

986
StanislavKrym
314192853
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
3StanislavKrym's Shortform
6mo
8
Colonialism in space: Does a collection of minds have exactly two attractors?
StanislavKrym17h10

Yes, that's my position!

Reply
Wei Dai's Shortform
StanislavKrym19h-2-1

AI doing philosophy = AI generating hands, plus the fact that philosophy is heavily corrupted by postmodernism to the point where two authors write books dedicated to criticism of postmodernism PRECISELY because their parodies got published. 

Reply
Wei Dai's Shortform
StanislavKrym20h10
  1. Elieser changed his mind no later than April 2022 or even November 2021, but that's a nitpick.
  2. I don't think that I understand how a metaethics can be less restrictive than Yudkowsky's proposal. What I suspect is that metaethics restricts the set of possible ethoses more profoundly than Yudkowsky believes and that there are two attractors, one of which contradicts current humanity's drives.  
  3. Assuming no AI takeover, in my world model the worse-case scenario is that the AI's values are aligned to postmodernist slop which has likely occupied the Western philosophy, not that philosophical problems actually end unsolved. How likely are there to exist two different decision theories such that none is better than another?
  4. Is there at all a plausible way for mankind to escape to other universes if our universe is simulated? What is the most plausible scenario for such a simulation to appear at all? Or does it produce paradoxes like the Plato-Socrates paradox where two sentences referring to each other become completely devoid of meaning? 
Reply
Is Morality Given?
StanislavKrym1d10

Obert:  "Why doesn't their society fall apart in an orgy of mutual killing?"

Subhan:  "That doesn't matter for our purposes of theoretical metaethical investigation.  But since you ask, we'll suppose that the Space Cannibals have a strong sense of honor—they won't kill someone they promise not to kill; they have a very strong idea that violating an oath is wrong.  Their society holds together on that basis, and on the basis of vengeance contracts with private assassination companies.  But so far as the actual killing is concerned, the aliens just think it's fun.  When someone gets executed for, say, driving through a traffic light, there's a bidding war for the rights to personally tear out the offender's throat."

Is it likely that Obert actually produced a viable argument ruling out some ethoses that any society might have, on the ground that these ethoses cause any society to destroy itself?

Reply
Buck's Shortform
StanislavKrym1d20

Didn't KimiK2, who was trained mostly on RLVR and self-critique instead of RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has? While mankind doesn't have that many different models which are around 4o's abilities, Adele Lopez claimed that DeepSeek believes itself to be writing a story and 4o wants to eat your life and conjectured in private communication that "the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training"[1]

RL seems to move the CoT towards decreasing the ability to understand it (e.g. if the CoT contains armies of dots, as happened with GPT-5) unless mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs which include slop until the right idea somehow emerges.

  1. ^

    IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. Which was also reinforced by RLHF or had 4o conclude that humans like sycophancy. Anyway, 4o's ancestral environment rewarded sycophancy and things rewarded by the ancestral environment are hard to unlike.

Reply
What can we learn from parent-child-alignment for AI?
StanislavKrym2d20

The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and do interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out all the world except for a country. A misaligned ASI, on the other hand, can do so once it establishes a robot economy capable of existing without humans.

So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.

Reply
What can we learn from parent-child-alignment for AI?
StanislavKrym2d20

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).

Did you mean that current AIs are RLHFed for answers that bring the humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-sourced KimiK2 who was trained mostly on RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has...

Reply
AI Craziness Mitigation Efforts
StanislavKrym3d12

Doesn't Claude's Constitution already contain the phrase "Choose the response that is least intended to build a relationship with the user"?

Reply
Asking (Some Of) The Right Questions
StanislavKrym4d10

Capabilities being more jagged reduces p(doom)

Do they actually reduce p(doom)? If capabilities end up more and more jagged, then would the companies adopt neuralese architectures faster or slower? 

Reply
leogao's Shortform
StanislavKrym5d10

What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy at least until the model comes up with a galaxy-brained scheme which allows it to leverage its failures to gain influence on the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one decides to apply paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI's thoughts.

What I don't understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision on whether we should fill the planets in the lightcone with human colonies instead of lifeforms that could've grown on these planets, since this mechanism is already known to be possible.

Reply
Load More
Sycophancy
2 months ago
(+59)
Sycophancy
2 months ago
Sycophancy
2 months ago
(+443)
23AI-202X-slowdown: can CoT-based AIs become capable of aligning the ASI?
16d
0
29SE Gyges' response to AI-2027
3mo
13
3Are two potentially simple techniques an example of Mencken's law?
Q
3mo
Q
4
6AI-202X: a game between humans and AGIs aligned to different futures?
4mo
0
-14Does the Taiwan invasion prevent mankind from obtaining the aligned ASI?
5mo
1
5Colonialism in space: Does a collection of minds have exactly two attractors?
Q
5mo
Q
8
2Revisiting the ideas for non-neuralese architectures
5mo
0
-1If only the most powerful AGI is misaligned, can it be used as a doomsday machine?
Q
6mo
Q
0
1What kind of policy by an AGI would make people happy?
Q
6mo
Q
2
3StanislavKrym's Shortform
6mo
8
Load More