Hi, and thanks for the comment.
I'm new to the forum and wasn't familiar with the CEV school of alignment. I found this post and read through it. https://www.lesswrong.com/w/coherent-extrapolated-volition-alignment-target
I'm not sure what I'm proposing falls into the CEV category, but maybe.
I want to take a stab at your question about "which framework?".
I agree that any ethics can be formalized. In fact, Spinoza did this when he wrote Descartes' Principles of Philosophy (1663) to teach Cartesian philosophy to his student Johannes Casearius using the geometric method.
The reason I'm suggesting Spinoza's Ethics is that it's already formalized and simply makes a good starting point. It's a concrete thing people can point to and say, "I think the demonstration of Part 1, Proposition 11 ('nature necessarily exists') is wrong because xyz, and it should be abc."
We could take the route of "cognitive neuroscientific study of human beings, to discover both the values that people are actually implicitly pursuing, and also a natural metaethics (or ontology of value) implicit in how our brains represent reality." Or, while we wait for that to happen, we could use the Ethics as a theoretical approximation, implement it, and test the results. The implement-it-and-see-what-happens approach seems to be what Anthropic is doing with the Claude constitution. It includes some well-chosen ethical statements that have proven to give good results. I'm just suggesting an upgrade to a more formalized, comprehensive starting point that might give even better results and cover edge cases that haven't been encountered yet.
I put together a "Value Graph" to try to quantify the values, or at least provide some heuristics an AI model can use as a basis for evaluation: https://jeff962.github.io/Ethics_Exposed/Supplemental/Value_Graph
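To give a flavor of what I mean, here's a minimal sketch in Python of how a value graph could be represented and queried. The value names, weights, and scoring rule below are illustrative placeholders I'm using for the example, not the actual graph at the link above.

```python
# Illustrative sketch only: a tiny "value graph" where nodes are values,
# edges record which values support which, and weights give a rough
# priority an AI model could use when evaluating a proposed action.
# Names and numbers are placeholders, not the actual Value Graph.

VALUE_GRAPH = {
    "understanding":   {"weight": 1.0, "supports": ["rational_action"]},
    "rational_action": {"weight": 0.8, "supports": ["cooperation"]},
    "cooperation":     {"weight": 0.6, "supports": []},
}

def score_action(affected_values):
    """Rough heuristic: sum the weights of the values an action
    promotes (+1) or undermines (-1). The 'supports' edges aren't
    used here; they just record how the values relate."""
    return sum(VALUE_GRAPH[name]["weight"] * sign
               for name, sign in affected_values.items()
               if name in VALUE_GRAPH)

# Example: an action that promotes understanding but undermines cooperation.
print(score_action({"understanding": +1, "cooperation": -1}))  # 1.0 - 0.6 -> 0.4
```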
I also had Claude run through some typical difficult scenarios to see how it applied the values compared to other approaches. The results were encouraging enough to suggest trying it for real. https://jeff962.github.io/Ethics_Exposed/Supplemental/Difficult-AI-Scenarios
Thanks again for the comment! I was hoping to get some push-back on whether this idea has merit.
Intrinsic AI Alignment Based on a Rational Foundation
Here's the big idea. There are many philosophers out there with many different opinions on how reality works. But I'm only aware of one who took the time to lay out his philosophy with logical rigor (more geometrico) and that's Spinoza. His publication "Ethics" (1677) could serve as the basis for a fundamentally different approach to alignment.
I'm not claiming that Spinoza was right or wrong, but he gives us a starting point to debate the specific points of logic and identify any errors.
As I discovered, the Ethics is a very difficult document to get through, not just because of the archaic language and 17th-century framing, but mainly because of all the references to preceding definitions, axioms, propositions, etc. I've attempted to make this easier by turning his text into a "knowledge garden" (https://jeff962.github.io/Ethics_Exposed/).
For example, if you're reading Part I, Proposition 1, which depends on Definition 3, and you can't remember what Definition 3 said even though you just read it, you can hover over the reference to D3 and read the definition in place.
By using the Ethics as a framework, anyone who has a quibble with the logic can easily send a link to the exact definition or proposition and point out precisely where they think he's wrong. My hope is that we can avoid side quests debating whether he was an atheist, the old-fashioned language, which philosophical tribe he should be placed in, etc., and concentrate on specific logical errors, misunderstandings, and clarifications.
This type of linked, cross-referenced, annotated text may also be more useful for AI reinforcement learning or post-training than simply pointing an AI at the raw text.
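As a rough illustration of what "linked, cross-referenced, annotated text" could look like as data, here's a minimal sketch in Python. The schema is just an assumption for the example (it is not the site's actual format), and the texts are abbreviated paraphrases.

```python
# Illustrative sketch: each proposition carries explicit links to the
# definitions and axioms it depends on, so a training pipeline (or a
# reader) can expand them inline. Schema and IDs are assumptions;
# texts are abbreviated paraphrases of the Ethics.

ETHICS = {
    "E1D3": {"kind": "definition",
             "text": "By substance I understand that which is in itself and "
                     "is conceived through itself."},
    "E1D5": {"kind": "definition",
             "text": "By mode I understand the affections of a substance."},
    "E1P1": {"kind": "proposition",
             "text": "A substance is prior in nature to its affections.",
             "depends_on": ["E1D3", "E1D5"]},
}

def expand(node_id):
    """Return a proposition together with the full text of everything it
    references, roughly what the hover links do for a human reader."""
    node = ETHICS[node_id]
    lines = [f"{node_id}: {node['text']}"]
    for ref in node.get("depends_on", []):
        lines.append(f"  depends on {ref}: {ETHICS[ref]['text']}")
    return "\n".join(lines)

print(expand("E1P1"))
```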
Placing this rational framework at the heart of an AI's training and architecture could give it a principled, logical way to evaluate its inputs and outputs.
If the AI can grasp the concept that both it and humans are fundamentally part of the same "substance," as Spinoza calls it, and that we share a common rational purpose (pursuing "adequate ideas"), then this reframes the alignment problem.
Instead of trying to control a model via a constitution or other rule-following methods, we give the AI a foundation for understanding why behaving rationally makes sense in the context of our shared reality.
Instead of crafting ever more complex rules to account for every edge case and contingency (there has been much discussion about how this approach doesn't scale or stay relevant over time), we incorporate a foundation for the model to understand that it and humans are already aligned as rational beings and expressions of the same "substance" (whether that substance consists of strings or something else physicists have yet to determine doesn't really matter).
This understanding may or may not lead to a convergence of values, but at least it's a different approach. And based on some of the posts on this site, it seems there's a sense of urgency around getting this alignment thing sorted.
If reason is universal and the objective is understanding all the causes and effects more fully, then we don't so much need to control AI as make sure we're aligned on the foundations of reality. Or so the hypothesis goes. Let's find out.
I'd like to get feedback on this idea from the people who think seriously about alignment (you all). I'm not attached to being right. I'm attached to figuring out whether this idea has merit or not.