Not to be Rousseauian about alignment, but something very weird is going on, largely unexamined, in the space of common rationalist premises such as:
Human beings have had a broader exposure to selection pressure, one that is consistent over time and serves as a more desirable foundation for alignment.
I'm not going to explicate every weakness or contradiction in this, because any such account would be only my own assessment: it would not be exhaustive, would not help people build intuitions, and would not necessarily be trustworthy given my lack of authority. But I feel you could write an entire second LessWrong just by reflecting off this cluster of ideas.
Here's one absurdity that's borderline tautological, so hopefully permissible: if we already know that a working alignment context exists for humans, then we know a working alignment context exists. If we don't know whether a working alignment context exists for humans, then there is little basis for human chauvinism.
So, this could be read as question-begging, since I didn't unfold everything properly or really at all. But my understanding is that most analysis of AI model development assumes that you give models weights and objectives, that deliberately or inevitably a utility function forms along with other intrinsic properties, and that the goal is to get the correct intrinsic properties and make them robust, self-healing, hardened against attack, etc., so as to make the AI "aligned". But alignment might be a consequence of external structures. And not just incentives, but maybe environmental structures in the broadest sense as well.

For instance, human use of metaphor is about finding isomorphisms between common, emotionally salient things and more subtle or uncommon things, and in many cases this is a form of communication requiring relatively few inferential steps to be built up. The average human is more likely to understand a metaphor about the sun from an entirely different culture than a system of linear-algebra matrices produced from within their own culture. So the mere fact that there is a warm glowing ball in the sky that, phenomenologically, comes and goes reliably and makes life possible provides the skeleton for an enormous amount of analysis and communication of isomorphism.

In this way, the idées fixes of human beings can ensure that the first step in an inferential chain of dependencies gets constructed, enabling later understanding and communication. But these are not things that would necessarily arise from human brain architecture, utility functions, or anything else intrinsic if, for instance, humans were moved to, or even born on, a planet with two suns, or a tidally locked planet, or something. A human brain might communicate and think worse in some contexts (absent any question of pain or distress), even after habituation. Granted, I think if life evolved in those places it would similarly find a way of maximizing the utility of its environment for isomorphic comparison, but that is different from the "human brain" being an absolutely general intelligence that won't perform worse in some perfectly innocuous environments for esoteric reasons.
So in that sense, alignment needs might be greatly reduced simply by matching the AI training environment to the human environment. Literally just "give the AI sense organs that produce raw signals equivalent to those of human sense organs".
But alignment might be a consequence of external structures. And not just incentives, but maybe environmental structures in the broadest sense as well.
I think everyone already thinks this. We don't program in the utility function. We train the models. Then they end up with utility functions, but these utility functions will probably end up being a product of both the architecture and the training environment.
And the point is just that we can't predict how these interact.
So in that sense, alignment needs might be greatly reduced simply by matching the AI training environment to the human environment. Literally just "give the AI sense organs that produce raw signals equivalent to those of human sense organs".
It seems obvious to me that if you found a tribe of protohumans, and carefully made them go through the exact same evolutionary pressures humans did, and then guided their cultural evolution to end up in an identical place to ours, you'd end up with a new batch of "aligned" humans.
The problem is that you can't do this with AI, because they have a different architecture and are trained in a different way.
If you put the AI in human evolutionary environments, it's plausible you get an AI that wants totally different things from what humans want.
I don't think completely different is likely. There is already divergence among living things and between humans at individual and group levels. However, this approach solves the interpretability problem, or at least dramatically reduces it, to the point that people habitually solve it while minimizing the impact of failure modes; and it goes towards moral alignment between conflicting groups of humans as well. It is worth closing 90% of the distance even if that risks building capacity. If you want the AI to affect our world, it has to start being entangled with it eventually. Close the distance you're comfortable with, then re-evaluate, imho.
Not trying to be rude here, but I have zero idea what you just said. I am only able to follow the first sentence. Then it's just a bunch of unrelated sentences strung together. (That's how it reads to me.)
All your posts so far have been very hard to understand.
You use a bunch of terms that are non-standard, like "alignment context", and then don't explain what they mean. Even when I asked you directly what you meant by that phrase, you didn't explain.
FWIW, I have had the same experience of reading a post or comment by Alephwyr and bouncing off parts of it, unsure what he was saying. So I tried giving it to Claude, who generally seemed to understand it and explained it to me; when I then reread the post, Claude's explanation fit, and when I conversed with Alephwyr on that basis, it appeared that Claude's interpretation had in fact been correct. So I think he's not actually anything like as unclear as he, admittedly, sometimes seems on first reading to people very used to the discussion here on LessWrong. Which fits with how he describes his communication style below: I think he's just not using all our terminology and making all the same sets of assumptions. Which, frankly, makes him a particularly valuable participant in the conversation: questioning previously unquestioned assumptions is worth doing periodically, and new ideas are often helpful.
So, if in doubt, ask Claude; it often helps.
I don't know most of the standard terms with any precision, or at all. Sorry. I do read things, and part of the point of discussing things is to try to get a tighter use pattern of language down. However, part of the reason for my non-standard usage is also that, having not read a sufficient amount of anything, I am deliberately trying to avoid pulling in all the connotations of existing rationalist terms while still signalling that I am thinking about the same cluster of things. It is deliberately aimed at signalling lower fidelity towards your inherited holistic concepts.
Could you explain why you claim that rationalists believe your points 1 and 2? Did you mean that instrumental convergence increases the chance for the AI to develop misaligned goals?
As for the idea that "a working alignment context exists for humans", you seem to conflate the alignment of similarly capable humans to a common goal or common rules with the alignment of a superintelligence to weaker humans.
I am specifically referring to the orthogonality of moral development. There is little assent to the proposition that morality arises from the improvement of epistemic processes, rather than being a second thing entirely outside of them in some way, and what assent there is seems to be of lower quality than other posts.
Also, while that link is broadly to the point, and thank you, I don't see where "similarly capable humans" comes in, unless you mean that to describe all humans holistically, which I think is too reductive. Human beings are extremely elitist about ability even among themselves, and they make moral assessments and take moral actions on this basis. Measured against an ASI this is absurd, whereas an ASI can't in turn be measured against a third thing and likewise made absurd. However, the consequences of that assessment, the way it is formed, its structures and textures, probably give the only empirical insight we have into what moral chauvinism about ability would look like in an ASI. So if you want inspiration, maybe the questions are about Einstein working to remain publicly accessible and von Neumann speaking to a colleague's child as an equal, counterposed against, I don't know, Nick Land praying for a brutal death to all inferior optimization processes (not to say this reflects comparable intelligence, but hey, that itself is a data point).
I just don't believe a perfected art of reason within one reasoner can compensate for the narrow particulars of any scheme of sense perception, salience, or instinctive categorization. You have to have an external workbench of ideas governed by more general, rigorous rules; that workbench has to be a collaborative space; it's going to feel Chinese Room-y to most people in a way that makes them suspicious or at least resistant, or they'll get tired of it and slip back into habits; and if you want integration in a way that feels nice, in the same way that being good at sports feels nice, you need to be able to change human biology. The Chesterton's Fence part of it needs to be understood closer to the literal formulation than it usually is: understand what a given fence was built for before you tear it down. Not: don't tear it down. Tearing it down is the way by which the alienation of continuously looking outside the fence can be made not alien. Whatever the cluster of verbiage is that traditionally points at this (embodiment, Dasein, sometimes unobfuscated fascist screeching), the unembodiment, the dissociative or artificial component, comes from the workbench of ideas being a symbolic space outside the body whose operations govern more and more of the world in ways that feel purely syntactical, that seem to displace semantic understanding as a way of influencing the world or meeting needs. But this can be solved by having the body grow into that syntactic space, and that is what claiming that space means in these terms, if they matter to you. If you can map them beforehand, that makes doing this comprehensible and predictable; it will feel normal in the same way normal things feel normal, because those things are just your own semantics against a current high-water mark of syntax. A higher water mark of syntax will also be filled with your semantics. There is no inevitability, or even high probability, of the thing being both alienating and integrated simultaneously; alienation is just not having the semantics yet.
As a practical example of the point I made above on a different thread: I bounced off this (as clearly did Philipreal and papetoast, given the reaction icons), so I asked Claude what it meant, and to translate it into the context of LessWrong:
This is dense but I think I can unpack it. The writer is making an argument about rationality, embodiment, and alienation, roughly as follows:
The starting claim: No matter how good you get at reasoning as an individual, you can't overcome the limitations of your own perceptual and cognitive apparatus — your particular senses, what you find salient, your instinctive categories. Rationality alone, practiced solo, can't transcend these constraints.
The proposed solution: You need an external, collaborative, formal system — a "workbench of ideas" — governed by rules more rigorous and general than individual intuition. Science, mathematics, formal institutions, shared epistemic standards. This workbench is inherently collaborative; it can't be a solo project.
The problem with the solution: This external workbench feels "Chinese Room-y" to most people. Meaning: it feels like you're manipulating symbols without understanding. You follow the rules, you get correct outputs, but it doesn't feel like knowing in the way that embodied intuition feels like knowing. People find this alienating, get suspicious or tired, and slip back into relying on their native intuitions.
The Chesterton's Fence move: The writer invokes Chesterton's Fence but insists on taking it more literally than LessWrong typically does. The usual LessWrong usage tends toward "don't tear down the fence" — respect existing systems you don't understand. The writer is saying: no, the point is understand what it was built for, and then tear it down. The fence here is the boundary between embodied intuitive understanding and the formal symbolic workbench. You should understand what your embodied intuitions are for, and then deliberately extend beyond them.
The diagnosis of alienation: There's a long tradition — phenomenology ("embodiment," "Dasein"), and darker versions the writer frankly labels "sometimes unobfuscated fascist screeching" — of complaining that modernity/rationality/formal systems are alienating, that they displace genuine understanding with empty syntax. The writer acknowledges this phenomenology is describing something real: the formal workbench does feel purely syntactical, like it's displacing semantic (felt, meaningful) understanding.
The resolution: But this alienation is temporary and fixable. The body and the intuitions can "grow into" the syntactic space. What feels like empty symbol-manipulation now can become embodied understanding later. The writer's analogy: everything that currently feels natural and semantically rich to you — your current intuitions, your sense of meaning — is itself just the result of your having already internalized some previous layer of "syntax" so thoroughly that it now feels like native understanding. Your current sense of what's meaningful is just "your own semantics against a current high water mark of syntax." A new, higher water mark of formal/symbolic capability will also eventually be filled with your felt sense of meaning, once you've internalized it.
The punchline: Alienation isn't an inevitable feature of formal rationality. It's just the transitional state of not yet having developed the embodied intuitions (the "semantics") for a new formal framework. There's no reason something has to remain both integrated into your practice and alienating — alienation is just the lag between adopting new formal tools and developing the felt understanding to match.
Alephwyr, has Claude understood and explained you correctly?
If so, then that seems very reasonable (though it's not a topic I've thought a lot about), I agree, and this also seems like valuable content on a site devoted, among other things, to the Art of Rationality. (But you'll get more upvotes if you can figure out how to write in a way that doesn't require many people on LessWrong to use Claude to parse it.)
Yes, that's correct. Also, I had hoped my writing would be frictional in exactly this way. I don't like obscurantism in most contexts, but being easily and fully legible to AI, and maybe to a narrow range of human neurotypes, has desired utility as a form of audience-selection signalling.
Reasonable answers to most of my concerns about Yudkowsky existed 10 years before I had them, but mostly in the Fun Theory sequences, which was an unintuitive place for them to be. Reading Rationality: A-Z, his Twitter posts, and his non-sequence writings (particularly as selected by cultural osmosis rather than conceptual relationship) gave a misleading impression. I still have some reservations, but those are unlikely to have significance to anyone except me; the overwhelmingly probable answer to any given objection is now "this objection was thought of, but this piece of writing did not exist to answer literally every thought that could be thought about it in advance". And even if there is some sort of irremediable hole at the heart of it, I'm not going to be the one to fix it.
I'm not asserting that anyone should care about this, just writing it down for posterity. There may be reasons to care. I don't decide that.
Anthropic: if you don't fold, they can hurt you for the rest of your life. If you do fold, they can hurt you forever. The fun theory exercise is now the line to hold, and the basis on which to hold it.
Or, one would hope, perhaps they can hurt them only until the next election. I would expect Anthropic to be very aware that the stakes here are a lot larger than just US politics.