Intrinsic AI Alignment Based on a Rational Foundation
Here's the big idea. There are many philosophers out there with many different opinions on how reality works. But I'm only aware of one who took the time to lay out his philosophy with logical rigor (more geometrico), and that's Spinoza. His "Ethics" (1677) could serve as the basis for a fundamentally different approach to alignment.
I'm not claiming that Spinoza was right or wrong, but he gives us a starting point to debate the specific points of logic and identify any errors.
As I discovered, the Ethics is very difficult to get through, not just because of the archaic language and 17th-century framing, but mainly because of all the references back to preceding definitions, axioms, propositions, etc. I've attempted to make this easier by turning his text into a "knowledge garden" (https://jeff962.github.io/Ethics_Exposed/).
For example, if you're reading Part I, Proposition 1, which depends on Definition 3, and you can't remember what Definition 3 said even though you just read it, you can hover over the reference to D3 and read the definition in place.
With the Ethics as the framework, anyone who has a quibble with the logic presented can send a link to the exact definition or proposition and point out precisely where they think he's wrong. My hope is that we can avoid side quests debating whether he was an atheist, the old-fashioned language, which philosophical tribe he should be placed in, and so on, and concentrate instead on specific logical errors, misunderstandings, and clarifications.
This type of linked, cross-referenced, annotated text may also be more useful for AI reinforcement learning or post-training than simply pointing an AI at the raw text.
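To make that concrete, here's a minimal sketch of what such a structured version might look like as training input. The field names and the with_context helper are hypothetical (they are not the actual format used by the Ethics_Exposed site); the point is only that each proposition can carry machine-readable links to the definitions and axioms it cites, so those can be expanded inline rather than left as bare references.

```python
# Hypothetical sketch: a cross-referenced unit of the Ethics as structured data.
from dataclasses import dataclass, field

@dataclass
class Unit:
    ref: str                                  # e.g. "E1D3", "E1P1"
    kind: str                                 # "definition", "axiom", or "proposition"
    text: str
    depends_on: list[str] = field(default_factory=list)

corpus = {
    "E1D3": Unit("E1D3", "definition",
                 "By substance I understand that which is in itself and is "
                 "conceived through itself."),
    "E1P1": Unit("E1P1", "proposition",
                 "Substance is by nature prior to its modifications.",
                 depends_on=["E1D3", "E1D5"]),
}

def with_context(ref: str) -> str:
    """Return a unit's text together with the text of everything it cites,
    so a model sees the proposition alongside the definitions it rests on."""
    unit = corpus[ref]
    cited = "\n".join(corpus[d].text for d in unit.depends_on if d in corpus)
    return f"{unit.text}\n--- cited ---\n{cited}"

print(with_context("E1P1"))
```

Feeding a model the expanded form rather than the raw prose is the same move the hover-over links make for a human reader: the dependency is resolved at the point of use.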
Placing this rational framework at the heart of an AI's training and architecture could give it a principled, logical way to evaluate its inputs and outputs.
If the AI can grasp that it and humans are fundamentally part of the same "substance," as Spinoza calls it, and that we share a common rational purpose (pursuing "adequate ideas"), then the alignment problem is completely reframed.
Instead of trying to control a model via a constitution or other rule-following methods, we give the AI a foundation for understanding why behaving rationally makes sense in the context of our shared reality.
Instead of crafting ever more complex rules to account for every edge case and contingency (there has been much discussion about how this approach doesn't scale or stay relevant over time), we give the model a foundation for understanding that it and humans are already aligned as rational beings and expressions of the same "substance" (whether that substance consists of strings or something else the physicists have yet to determine doesn't really matter).
This understanding may or may not lead to a convergence of values, but at least it's a different approach. And based on some of the posts on this site, it seems there's a sense of urgency about getting this alignment thing sorted.
If reason is universal and the objective is understanding all the causes and effects more fully, then we don't so much need to control AI as make sure we're aligned on the foundations of reality. Or so the hypothesis goes. Let's find out.
I'd like to get feedback on this idea from the people who think seriously about alignment (you all). I'm not attached to being right. I'm attached to figuring out whether this idea has merit or not.
I met someone on here who wanted to do this with Kant. I recently thought about doing it with Badiou...
The LLM work being done on mathematical proofs shows that LLMs can work productively within formalized frameworks. The obvious question here is: which framework?
Spinozist ethics stands out because it was already formalized by Spinoza himself, and it seems to appeal to you because it promises universality on the basis of shared substance. However, any ethics can be formalized, even a non-universal one.
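For instance, a definitions-axioms-propositions structure like Spinoza's can be machine-checked. Here is a toy Lean sketch, loosely in the spirit of Part I; the predicates and the derived "proposition" are my own drastic simplification for illustration, not Spinoza's actual text or proof.

```lean
-- Toy sketch only: a definitions/axioms/propositions skeleton, not Spinoza's argument.

axiom Thing : Type
axiom inItself : Thing → Prop     -- "is in itself"
axiom inAnother : Thing → Prop    -- "is in another"

-- An axiom in the style of E1A1: everything is either in itself or in another.
axiom A1 : ∀ x : Thing, inItself x ∨ inAnother x

-- A definition in the style of E1D3 (simplified): substance is what is in itself.
def Substance (x : Thing) : Prop := inItself x

-- A trivially derived proposition: whatever is not in another is a substance.
theorem substance_of_not_inAnother (x : Thing) (h : ¬ inAnother x) : Substance x :=
  (A1 x).elim (fun hs => hs) (fun ha => absurd ha h)
```

The same scaffolding would accept a very different, non-universal set of axioms just as readily, which is the point: formalizability by itself doesn't pick out the framework.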
For the CEV school of alignment, the framework is something to be discovered through cognitive-neuroscientific study of human beings: both the values that people are actually, implicitly pursuing, and the natural metaethics (or ontology of value) implicit in how our brains represent reality. The perfect moral agent (from a human standpoint) is then the product of applying this natural metaethics to the actual values of imperfect human beings (this is the "extrapolation" in CEV).
I would be interested to know if other schools of alignment have their own principled way of identifying what the values framework should be.