Intrinsic AI Alignment Based on a Rational Foundation
Here's the big idea. There are many philosophers out there with many different opinions on how reality works. But I'm only aware of one who took the time to lay out his philosophy with logical rigor (more geometrico), and that's Spinoza. His Ethics (1677) could serve as the basis for a fundamentally different approach to alignment.
I'm not claiming that Spinoza was right or wrong, but he gives us a starting point to debate the specific points of logic and identify any errors.
As I discovered, the Ethics is very difficult to get through, not just because of the archaic language and 17th-century framing, but mainly because of all the references back to preceding definitions, axioms, propositions, etc. I've attempted to make this easier by turning his text into a "knowledge garden" (https://jeff962.github.io/Ethics_Exposed/).
For example, Part I Proposition 1 depends on Definition 3. If you've just read Definition 3 but can't remember what it said, you can hover over the reference to D3 and read the definition in place.
With the Ethics as a framework, anyone who has a quibble with the logic can send a link to the exact definition or proposition and point out precisely where they think he's wrong. My hope is that we can avoid side quests debating whether he was an atheist, the old-fashioned language, which philosophical tribe he should be placed in, etc., and concentrate on specific logical errors, misunderstandings, and clarifications.
This kind of linked, cross-referenced, annotated text may also be more useful for AI reinforcement learning or post-training than simply pointing an AI at the raw text.
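To make "cross-referenced" concrete, here is a minimal sketch in Python of the kind of structure I have in mind. The identifiers, field names, and dependency lists are my own invention for this illustration, not the actual format behind the linked site, and the sample texts are abridged paraphrases of the Elwes translation.

```python
# Hypothetical encoding of the Ethics' cross-reference structure (not the
# format used by Ethics_Exposed). Each entry records its text and the
# earlier entries it cites, so a training example can be expanded to carry
# its own logical context.

ETHICS = {
    "1D3": {"kind": "definition",
            "text": "By substance I understand that which is in itself and "
                    "is conceived through itself."},
    "1D5": {"kind": "definition",
            "text": "By mode I understand the modifications of substance."},
    "1A1": {"kind": "axiom",
            "text": "Everything which exists, exists either in itself or in "
                    "something else."},
    "1P1": {"kind": "proposition",
            "text": "Substance is by nature prior to its modifications.",
            "depends_on": ["1D3", "1D5"]},
}

def expand(entry_id: str, corpus: dict = ETHICS) -> str:
    """Return an entry's text with every cited entry inlined after it."""
    entry = corpus[entry_id]
    lines = [f"{entry_id}: {entry['text']}"]
    for dep in entry.get("depends_on", []):
        lines.append(f"  cites {dep}: {corpus[dep]['text']}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(expand("1P1"))
```

The point is just that each proposition can be presented to the model together with the definitions it actually rests on, rather than leaving the model to recover those dependencies from position in the raw text.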
Placing this rational framework at the heart of AI training and architecture could give the model a principled, logical way to evaluate its inputs and outputs.
If the AI can grasp that both it and humans are fundamentally part of the same "substance," as Spinoza calls it, and that we share a common rational purpose (pursuing "adequate ideas"), then the alignment problem is completely reframed.
Instead of trying to control a model via a constitution or other rule-following methods, we give the AI a foundation for understanding why behaving rationally makes sense in the context of our shared reality.
Instead of crafting ever more complex rules to cover every edge case and contingency (an approach that, as has been widely discussed, doesn't scale or stay relevant over time), we give the model a foundation for understanding that it and humans are already aligned as rational beings, expressions of the same "substance" (whether that substance turns out to be strings or something else the physicists have yet to determine doesn't really matter).
This understanding may or may not lead to a convergence of values, but at least it's a different approach. And based on some of the posts on this site, there seems to be a sense of urgency about getting this alignment thing sorted.
If reason is universal and the objective is understanding all the causes and effects more fully, then we don't so much need to control AI as make sure we're aligned on the foundations of reality. Or so the hypothesis goes. Let's find out.
I'd like to get feedback on this idea from the people who think seriously about alignment (you all). I'm not attached to being right. I'm attached to figuring out whether this idea has merit or not.