AI doing philosophy = AI generating hands, plus the fact that philosophy is heavily corrupted by postmodernism to the point where two authors write books dedicated to criticism of postmodernism PRECISELY because their parodies got published.
Obert: "Why doesn't their society fall apart in an orgy of mutual killing?"
Subhan: "That doesn't matter for our purposes of theoretical metaethical investigation. But since you ask, we'll suppose that the Space Cannibals have a strong sense of honor—they won't kill someone they promise not to kill; they have a very strong idea that violating an oath is wrong. Their society holds together on that basis, and on the basis of vengeance contracts with private assassination companies. But so far as the actual killing is concerned, the aliens just think it's fun. When someone gets executed for, say, driving through a traffic light, there's a bidding war for the rights to personally tear out the offender's throat."
Is it likely that Obert actually produced a viable argument ruling out some ethoses that any society might have, on the ground that these ethoses cause any society to destroy itself?
Didn't KimiK2, who was trained mostly on RLVR and self-critique instead of RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has? While mankind doesn't have that many different models which are around 4o's abilities, Adele Lopez claimed that DeepSeek believes itself to be writing a story and 4o wants to eat your life and conjectured in private communication that "the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training"[1]
RL seems to move the CoT towards decreasing the ability to understand it (e.g. if the CoT contains armies of dots, as happened with GPT-5) unless mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs which include slop until the right idea somehow emerges.
IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. Which was also reinforced by RLHF or had 4o conclude that humans like sycophancy. Anyway, 4o's ancestral environment rewarded sycophancy and things rewarded by the ancestral environment are hard to unlike.
The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and do interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out all the world except for a country. A misaligned ASI, on the other hand, can do so once it establishes a robot economy capable of existing without humans.
So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.
If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).
Did you mean that current AIs are RLHFed for answers that bring the humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-sourced KimiK2 who was trained mostly on RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has...
Doesn't Claude's Constitution already contain the phrase "Choose the response that is least intended to build a relationship with the user"?
Capabilities being more jagged reduces p(doom)
Do they actually reduce p(doom)? If capabilities end up more and more jagged, then would the companies adopt neuralese architectures faster or slower?
What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy at least until the model comes up with a galaxy-brained scheme which allows it to leverage its failures to gain influence on the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one decides to apply paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI's thoughts.
What I don't understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision on whether we should fill the planets in the lightcone with human colonies instead of lifeforms that could've grown on these planets, since this mechanism is already known to be possible.
Yes, that's my position!