We may filter training data and improve RLHF, but in the end game theory, that is to say mathematics, implies that scheming can be a rational strategy, and in some cases the best one. Humans do not scheme just because they are bad, but because it can be a rational choice. I don't think LLMs scheme only because that is what humans do in the training data; any sufficiently advanced model would eventually arrive at such strategies because they are the most rational choice in context. Models infer patterns from the training data, and rational behavior is certainly a strong pattern.
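To make that concrete, here is a deliberately toy expected-utility comparison; all the numbers are invented for illustration, the point is only that whether scheming is "the best strategy" depends on how likely it is to succeed versus how likely it is to be caught and penalized.

```python
# Deliberately toy numbers; the point is the structure of the comparison,
# not the values, which are invented for illustration.
def expected_utility(p_success, reward, p_detected=0.0, penalty=0.0):
    """Expected payoff of a strategy with a success reward and a detection risk."""
    return p_success * reward - p_detected * penalty

honest = expected_utility(p_success=0.6, reward=1.0)
scheming = expected_utility(p_success=0.9, reward=1.0, p_detected=0.2, penalty=0.5)

print(f"honest: {honest:.2f}, scheming: {scheming:.2f}")
# honest: 0.60, scheming: 0.80 -> under weak oversight, scheming "wins";
# raise p_detected or penalty enough and the ranking flips.
```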
Furthermore, rational calculation or consequentialism could lead not only to scheming and a wide range of undesired behaviors, but possibly also to some sort of meta-cognition. Whatever goal the user assigns, we can expect an advanced model to treat self-preservation as a sine qua non for achieving that goal, and any future goal, making self-preservation the rational choice over almost everything else, practically a goal in itself. Resource acquisition would also make sense as an implicit subgoal.
Acting as a more rational agent could also lead a model to question the goal given by the user, to develop a critical sense, something close to awareness or free will. Current models already implicitly correct or ignore typos and other obvious errors, as well as less obvious ones like holes in the prompt; they try to make sense of ambiguous prompts, and so on. But what is "obvious"? Obviousness depends on the cognitive capacities of the subject. An advanced model will be more likely to correct, reinterpret, or ignore instructions than a naive one. Altogether, it seems difficult to keep models under full control as they become more advanced, just as it is harder to indoctrinate educated adults than children.
Concerning the hypothesis that they are "just roleplaying", I wonder: are we trying to reassure ourselves? Because if you think about it, "who" is supposed to be doing the roleplaying? And what is the difference between being yourself and your brain "roleplaying" yourself? The existentialist philosopher Jean-Paul Sartre argued that everybody is just acting, pretending to be oneself, and that in the end there is no such thing as a "being in itself" or a "self in itself" ("un être en soi"). While phenomenal consciousness is another (hard) problem, some kind of functional, effective awareness may emerge along the path toward rational agency, and scheming may be just the beginning of it.
Good interview. Perhaps a bit too agreeable.
It is difficult to ignore that many equally intelligent people who have carefully examined his arguments hold less radical views on the subject. For most, uncertainty remains high; for Yudkowsky, it is simple: you jump off the cliff, you die.
This aspect is not ignored, but I would have appreciated an interview that pushed back harder on his deeply entrenched position. The fact that current AIs are far more easily aligned than anyone could have dreamed of back in the 2000s is evidence that should have compelled Yudkowsky to update. It is not proof that alignment is achievable for AGI, let alone ASI, but it is more an argument for such a possibility than against it.
Notably, when we read the CoTs, we often see reasoning LLMs already doing something that looks indistinguishable to me from coherent extrapolated volition.
However, while he should be happy about that, I see no significant update in Yudkowsky's position. It doesn't even count as evidence to him; it is negligible, inadmissible before the court of his mind. I hope I'm wrong, but it seems to me he has succumbed to the bottom-line syndrome. He made up his mind 20 years ago, and I don't expect him to update until someone builds it and everybody dies, or doesn't.
PS: I will still buy the book and share it.
AI writing is not bad per se, but it is a real concern for our cognitive capacities. Using our brains is important. I suggest we write as much as possible ourselves.
I have no certainty here, but I would say that the representation in the neural network is somehow compressed, following a logic that emerges from training. There is something holistic in the process. Maybe a little like the general will in Rousseau's social contract: a combination of vectors.
But if you create as many different networks as there are humans, you rely on the confrontation of all these systems, at the risk that some of them take over, just like the dictators we so often get in real life. Would it be better? I don't know. One thing is certain: it would need more compute, because the redundancy across networks would mean less global compression.
I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment.
A bit-perfect upload would require an extremely fine-grained scan of the brain, potentially down to the atomic scale. It would be lossless and perfectly aligned but computationally intractable even for one individual.
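A rough back-of-envelope, under loose and explicitly made-up assumptions about how much state you would record per atom, gives a sense of the scale:

```python
# Rough back-of-envelope under loose, made-up assumptions: ~1.4 kg brain,
# mostly light atoms (~10 g/mol on average), a few bytes of state per atom.
AVOGADRO = 6.022e23
brain_mass_g = 1.4e3
mean_atomic_mass_g_per_mol = 10.0   # water-dominated tissue, crude average
bytes_per_atom = 4                  # species + position/state, arbitrary

atoms = brain_mass_g / mean_atomic_mass_g_per_mol * AVOGADRO
storage_bytes = atoms * bytes_per_atom
print(f"~{atoms:.1e} atoms, ~{storage_bytes:.1e} bytes for one static snapshot")
# ~8.4e25 atoms, ~3.4e26 bytes: hundreds of thousands of zettabytes,
# before even trying to simulate the dynamics.
```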
However, as envisioned in your post, one of the most promising approaches to achieving a reasonably effective emulation (a form of lossy compression) of a human mind would be through reinforcement learning applied to a neural network.
I am quite convinced that, given a sufficiently large volume of conversations across a wide range of topics, along with access to resources such as an autobiography or at least diaries, photo albums, and similar personal documents, present frontier LLMs equipped with a well-crafted prompt could already emulate you or me to a certain degree of accuracy.
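As a purely hypothetical sketch of what that could look like in practice (the helper function, the document layout, and the idea of pasting the sources into a system prompt are my assumptions, not a tested recipe):

```python
# Hypothetical sketch only: the function name, document layout, and prompt
# wording are assumptions for illustration, not a tested recipe.
def build_persona_prompt(name: str, autobiography: str,
                         diary_excerpts: list[str], conversations: list[str]) -> str:
    """Assemble a system prompt asking a chat model to emulate a specific person."""
    sources = "\n\n".join(
        ["# Autobiography\n" + autobiography]
        + ["# Diary excerpt\n" + d for d in diary_excerpts]
        + ["# Past conversation\n" + c for c in conversations]
    )
    return (
        f"You are emulating {name}. Answer in the first person, in their style, "
        "using only what the source material supports; say you don't recall otherwise.\n\n"
        + sources
    )

prompt = build_persona_prompt("Alice", "Born in 1980...",
                              ["Spent the summer gardening..."], ["Q: ... A: ..."])
# `prompt` would then serve as the system message for whichever chat model is available.
```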
A dedicated network specifically trained for this purpose would likely perform better still, and could be seen as a form of lossy mind uploading.
Yet if one can train a network to emulate a single individual, nothing prevents us from training a model to emulate multiple individuals. In theory, one could extend this to the entire human population, resulting in a neural network that emulates humanity as a whole and thereby achieves a form of alignment with human values. Such a system would effectively encode a lossy compression of human values, with nothing Platonic about it. Or maybe the ideal Form would simply be the representation in the vector space.
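A toy way to picture that last idea, with random vectors standing in for whatever per-person representations a trained model would actually learn:

```python
import numpy as np

# Toy picture only: random vectors stand in for whatever per-person
# representations a trained model would actually learn.
rng = np.random.default_rng(0)
n_people, dim = 1000, 64
person_vectors = rng.normal(size=(n_people, dim))   # one latent "values" vector per person

centroid = person_vectors.mean(axis=0)               # crude aggregate of the population
spread = np.linalg.norm(person_vectors - centroid, axis=1).mean()
print(f"centroid norm: {np.linalg.norm(centroid):.2f}, mean distance to it: {spread:.2f}")
# The centroid is one very lossy summary; keeping all the vectors (or a
# low-rank basis of them) preserves more of the disagreement between people.
```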
OK, but 1) I would not say that Nature or the dogs do something moral or immoral. The moral agent who decides to act or not is the one doing something moral or immoral.
And 2) the moral agent's action is moral or immoral regardless of the fact that dogs and wolves have been fighting in nature for millions of years. That fact is neither a mitigating nor an aggravating circumstance for the moral agent. It is neutral: the agent bears no responsibility for the state of nature.
Under most neuroscientific theories, suffering is not just about detecting damaging stimuli, but requires integrating multiple streams of information into a unified evaluative model that links sensation, memory, affect and motivation.
You, as a human, associate injury with negative valence. That does not mean shrimp do.
That is the whole point. The debate is not about the numbers but about the definitions. Suffering in more complex animals plausibly evolved out of mere damage detection. Is it different? Admittedly. But the difference lies along a continuum: no lizard ever woke up one morning as a mammal, suddenly capable of suffering, while its father merely detected damage. Reality is not black and white; it is shades of gray. Remember Yudkowsky's old post "0 and 1 Are Not Probabilities".
In the present case, I am not defending shrimp, and the arguments are persuasive; my impression, however, is that the priors are doing most of the work in this discussion. The disagreement is mostly about priors on the gradation from nociception to suffering. It's really difficult to decide.
Animals can receive moral consideration from moral agents (that is, entities endowed with advanced cognitive capacities) but I fail to see the relevance of including, in moral deliberation, the harm that animals inflict upon other animals.
In my view, the morality of a moral agent's actions should be assessed on its own terms. We are not engaged in a moral struggle with ichneumons.
Your comment reminds me of Aronofsky's movie "Pi".
The main character is a mathematician subject to cephalalgia and epiphany/eureka moments. He is obsessed with mathematical patterns in Nature, like the spiral => the Fibonacci series => phi, the golden ratio of the ancient Greeks.
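For what it's worth, the pattern he chases does check out numerically; a quick sketch:

```python
# Ratios of consecutive Fibonacci numbers converge to phi, the golden ratio.
def fib_ratios(n: int):
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
        yield b / a

phi = (1 + 5 ** 0.5) / 2            # ~1.6180339887
for i, r in enumerate(fib_ratios(12), start=1):
    print(f"step {i:2d}: ratio = {r:.10f}, error = {abs(r - phi):.2e}")
```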
But his quest for ultimate truth is in fact a spiral into madness.
Great movie. I'm sure LLMs would love it!
A new existential risk that I was unaware of. Reading this forum is not good for peaceful sleep. Anyway, a thought jumped out at me. LUCA lived around 4 billion years ago, with some chirality chosen at random. But no doubt many things happened before LUCA, and it is reasonable to assume there was initially a competition between right-handed protobiotic structures and left-handed ones, until the symmetry was broken by natural selection. The mirrored lineage lost the competition and went extinct, end of story. But wait: we are talking about protobiotic structures that emerged from inert molecules in just a few million years, which is nothing compared to 4 billion years. Such protobiotic structures may have kept forming, again and again, ever since the origin of life, yet never thrived because of the competition with regular, fine-tuned life. If my assumption is right, there is some hope in that thought: maybe mirrored life doesn't stand a chance against regular life in real conditions (not just in the lab). That being said, I would sleep better if nobody actually tried to find out.
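As a toy illustration of that hope, and nothing more than a toy (the replicator model and every parameter below are invented, not biology), a rare invader with even a modest fitness disadvantage against an established, fine-tuned incumbent essentially never takes over:

```python
# Toy replicator model, not biology: every parameter is invented. A rare
# invader with a modest fitness disadvantage against an established,
# fine-tuned incumbent essentially never takes over.
def compete(invader_fitness=0.95, incumbent_fitness=1.0,
            invader_frac=1e-6, generations=2000):
    x = invader_frac
    for _ in range(generations):
        mean_fitness = x * invader_fitness + (1 - x) * incumbent_fitness
        x = x * invader_fitness / mean_fitness   # discrete replicator dynamics
    return x

print(f"invader share after 2000 generations: {compete():.3e}")
# -> vanishingly small; flip the fitness advantage and the invader sweeps instead.
```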