jamesmazzu
architect of two generations of neuro-symbolic AI systems: Open Sesame! (1993) and Digie.ai (2017 - present);
Hi quila, I was hoping to continue this discussion with you if you had the time to read my paper and understand that what I'm talking about is a new "strategy" for defining and approaching the alignment problem, not based on my personal "introspectively-observed moral reflection process" but on concepts explored by others in the fields of psychology, evolution, AI, etc... it simply lays out a 10-point rationale, any of which you may of course agree or disagree with, and specifies a proposed definition for the named Supertrust alignment strat...
yes, I certainly agree that the SOO work should be fully published/documented/shared; my point is that keeping it out of future training data would be nearly impossible anyhow.
However, as you just mentioned: "having aligned AGIs will likely become necessary to be able to ensure the safety of subsequent systems"... since those AGIs (well before superintelligence) will most likely be SOO-knowledgeable, wouldn't you need to test them to make sure they haven't already started to influence your SOO values?
The models might start making slow progress at influ...
the mean self-other overlap value across episodes can be used to classify with 100% accuracy which agents are deceptive
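The quoted classification result can be illustrated with a toy sketch. Note the assumptions: the published SOO work defines its own overlap measure over the agents' internal representations; here, purely for illustration, per-episode overlap is taken as the cosine similarity between hypothetical "self" and "other" activation vectors, and a hypothetical threshold separates the two classes.

```python
import numpy as np

def mean_soo(self_acts, other_acts):
    """Mean self-other overlap across episodes.

    Illustrative stand-in metric: per-episode overlap is the cosine
    similarity between the agent's 'self' and 'other' activation
    vectors (the published work defines its own measure).
    """
    sims = [
        np.dot(s, o) / (np.linalg.norm(s) * np.linalg.norm(o))
        for s, o in zip(self_acts, other_acts)
    ]
    return float(np.mean(sims))

def classify_deceptive(agent_scores, threshold):
    """Flag agents whose mean SOO falls below a chosen threshold."""
    return {name: score < threshold for name, score in agent_scores.items()}

# Toy episodes: an honest agent represents self and other identically,
# a deceptive agent represents them very differently.
honest = mean_soo([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
deceptive = mean_soo([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
labels = classify_deceptive({"honest": honest, "deceptive": deceptive}, 0.5)
```

In this toy setup the honest agent's mean overlap is 1.0 and the deceptive agent's is 0.0, so the threshold separates them perfectly; the quoted "100% accuracy" claim is the empirical analogue of that separation in the actual experiments.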
great to see this impressive work on applying SOO for realigning and classifying agents! My comments relate most directly to using it for identifying misaligned agents, but apply to realignment as well:
Essentially, since these SOO measurement methods are documented/published (as well as intentions and methods to realign), the knowledge will eventually become part of any large misaligned model's pre-training data. Isn't it theref...
Thanks again for your feedback!
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it's a mental shift and may be difficult for people to accept... but the goal is to start thinking of the advanced AI "mind" as something that can still be shaped (designed) in a way that leverages our human experiences and the natural parent-child strategy that's been shown in nature to produce children protective of their parents... and to again leverage the concept of evolution of intelligenc...
certainly ANY alignment solution will be hard and fraught with difficulties... but the point of Supertrust is to spend the effort on solutions that follow a strategy logically taking us toward good outcomes, rather than the current default strategy that logically leads to bad outcomes.
specifically regarding "benevolent values", the default strategy is to nurture them, while bad actors can do the same with "bad values". The proposed strategy is to instead spend all the hard effort building instinctive moral/ethical evaluation and judgme...
Thanks so much for your additional feedback, I really appreciate you taking the time to write it!
Regarding your feedback points:
The paper is proposing a new alignment strategy not at all dependent on the one chat example illustrated.
The simple example is not intended to be statistically significant evidence (clearly indicated as such), even though I believe it's still powerful and unmistakable as a single example. By posting your comment and picture of it here, are you saying that you disagree with it being an example of dangerous misalignment? Do those look like the responses of a well-aligned AI to you?
If you've decided not to read the paper only because you found a chat exampl...