Naive Hypotheses on AI Alignment

[-]Rob Bensinger3y3825

Therefore, to get the best of both worlds, I figured I'd write down my naive hypotheses as I have them, and keep studying at the same time.

I quite like this strategy!

[-]Rob Bensinger3y3022

I would also echo the advice in the Alignment Research Field Guide:

We sometimes hear questions of the form “Even a summer internship feels too short to make meaningful progress on real problems. How can anyone expect to meet and do real research in a single afternoon?”
There’s a Zeno-esque sense in which you can’t make research progress in a million years if you can’t also do it in five minutes. It’s easy to fall into a trap of (either implicitly or explicitly) conceptualizing “research” as “first studying and learning what’s already been figured out, and then attempting to push the boundaries and contribute new content.”
The problem with this frame (according to us) is that it leads people to optimize for absorbing information, rather than seeking it instrumentally, as a precursor to understanding. (Be mindful of what you’re optimizing in your research!)
There’s always going to be more pre-existing, learnable content out there. It’s hard to predict, in advance, how much you need to know before you’re qualified to do your own original thinking and seeing, and it’s easy to Dunning-Kruger or impostor-syndrome yourself into endless hesitation or an over-reliance on existing authority.
Instead, we recommend throwing out the whole question of authority. Just follow the threads that feel alive and interesting. Don’t think of research as “study, then contribute.” Focus on your own understanding, and let the questions themselves determine how often you need to go back and read papers or study proofs.
Approaching research with that attitude makes the question “How can meaningful research be done in an afternoon?” dissolve. Meaningful progress seems very difficult if you try to measure yourself by objective external metrics. It is much easier when your own taste drives you forward.

[-]Shoshannah Tekofsky3y80

Thank you! And adding that to my reading list :D

[-]Rob Bensinger3y60

Yeah, I actually think Alignment Research Field Guide is one of the best resources for EAs and rationalists to read regardless of what they're doing in life. :)

[-]Evan R. Murphy3y*82

I do think there's value in beginner's mind, glad you're putting your ideas on alignment out there :)

How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.

This interpretation of corrigibility seems too narrow to me. Some framings of corrigibility, like Stuart Russell's CIRL-based one, are like this: the AI is trying to understand human goals but is uncertain about them. But there are other framings, for example myopia, where the AI's goal is such that it would never sacrifice reward now for reward later, so it would never be motivated to pursue an instrumental goal like disabling its own off-switch.

When you're looking to further contaminate your thoughts and want more on this topic, there's a recent thread where different folks are trying to define corrigibility in the comments: https://www.lesswrong.com/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag#comments

[-]Shoshannah Tekofsky3y10

Thank you! I'll definitely read that :)

[-]jacopo3y42

I like the idea! Just a minor issue with the premise:

"Either I’d find out he’s wrong, and there is no problem. Or he’s right, and I need to reevaluate my life priorities."

There is a wide range of opinions, and EY has one of the most pessimistic ones. It may be the case that he's wrong on several points, and we are way less doomed than he thinks, but that the problem is still there and a big one as well.

(In fact, if EY is correct we might as well ignore the problem, as we are doomed anyway. I know this is not what he thinks, but it's the conclusion I would draw from his predictions.)

[-]Shoshannah Tekofsky3y10

The premise was intended to contextualize my personal experience of the issue. I did not intend to make a case that everyone should weigh their priorities in the same manner. For my brain specifically, a "hopeless" scenario registers as a Call to Arms where you simply need to drop whatever else you're doing, and get to work. In this case, I mapped the ages of my children onto all the timelines. I realized either my kids or my grandkids will die from AGI if Eliezer is in any way right. Even a 10% chance of that happening is too high for me, so I'll pivot to whatever work needs to get done to avoid that. Even if the chances of my work making a difference are very slim, there isn't anything else worth doing.

[-]jacopo3y10

I agree with you actually. My point is that in fact you are implicitly discounting EY's pessimism - for example, he didn't release a timeline but often said "my timeline is way shorter than that" with respect to 30-year ones and I think 20-year ones as well. The way I read him, he thinks we personally are going to die from AGI, and our grandkids will never be born, with 90+% probability, and that the only chances to avoid it are either that someone already had a plan three years ago which has been implemented in secret and will come to fruition next year, or that some large out-of-context event happens (say, nuclear or biological war brings us back to the stone age).

My no-more-informed-than-yours opinion is that he's wrong on several points, but correct on others. From this I deduce that the risk of very bad outcomes is real and not negligible, but the situation is not as desperate and there are probably actions that will improve our outlook significantly. Note that in the framework "either EY is right or he's wrong and there's nothing to worry about" there's no useful action, only hope that he's wrong because if he's right we're screwed anyway.

Implicitly, this is your world model as well from what you say. Discussing this then may look like nitpicking, but whether Yudkowsky or Ngo or Christiano are correct about possible scenarios changes a lot about which actions are plausibly helpful. Should we look for something that has a good chance to help in an "easier" scenario, rather than concentrate efforts on looking for solutions that work on the hardest scenario, given that the chance of finding one is negligible? Or would that be like looking for the keys under the streetlight?

[-]Shoshannah Tekofsky3y10

I think we're reflecting on the material at different depths. I can't say I'm far enough along to assess who might be right about our prospects. My point was simply that telling someone with my type of brain "it's hopeless, we're all going to die" actually has the effect of me dropping whatever I'm doing, and applying myself to finding a solution anyway.

[-]Joe Kwon3y40

This is a really cool idea and I'm glad you made the post! Here are a few comments/thoughts:

H1: "If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes"

How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren't). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we're worried about is in a different reference class. Not sure.

H4 is something I'm super interested in and would be happy to talk about it in conversations/calls if you want to : )

[-]Ericf3y30

I saw this note in another thread, but the gist of it is that power doesn't corrupt. Rather,

Evil people seek power, and are willing to be corrupt (shared cause correlation)
Being corrupt helps to get more power - in the extreme statement of this, maintaining power requires corruption
The process of gaining power creates murder-Gandhis.
People with power attract and/or need advice on how and for what goal to wield it, and that leads to misalignment with the agent's pre-power values.

[-]Gunnar_Zarncke3y30

Can you add a link to the other thread please?

[-]Ericf3y30

No, I don't remember exactly where on LW I saw it - just wanted to acknowledge that I was amplifying someone else's thoughts.

My college writing instructor was taken aback when I asked her how to cite something I could quote, but didn't recall from where, but her answer was "then you can't use it" which seemed harsh. There should be a way to acknowledge plagiarism without knowing or stating who is being plagiarized - and if the original author shows up, you've basically pre-conceded any question of originality to them.

[-]Gunnar_Zarncke3y20

Thx for being clear about it.

[-]Shoshannah Tekofsky3y32

Are you aware of any research into this? I struggle to think of any research designs that would make it through an ethics board.

[-]Ericf3y30

I don't know that anyone has done the studies, but you could look at how winners of large lotteries behave. That is a natural example of someone suddenly gaining a lot of money (and therefore power). Do they tend to keep their previous goals, and just scale up their efforts, or do they start doing power-retaining things? I have no idea what the data will show - thought experiments and anecdotes could go either way.

[-]Richard_Kennaway3y40

Let me Google that for you.

[-]Shoshannah Tekofsky3y10

Thank you!

If they are not orthogonal then presumably prosociality and power are inversely related, which is worse?

In this case, I'm hoping intelligence and prosociality-that-is-robust-to-absolute-power would be positively correlated. However, I struggle to think how this might actually be tested... My intuitions may be born from the Stanford Prison Experiment, which I think has been refuted since. So maybe we don't actually have as much data on prosociality in extreme circumstances as I initially intuited. I'm mostly reasoning this out now on the fly by zooming in on where my thoughts may have originally come from.

That said, it doesn't very much matter how frequent robust prosociality traits are, as long as they do exist and can be recreated in AGI.

I'll DM you my discord :)

[-]Gunnar_Zarncke3y30

First candidate for this trait is “emotional empathy”, a trait that hitches one’s reward system to that of another organism.

It would be interesting to hear what cognitive neuroscientists know about how empathy is implemented in the brain.

The H1 point sounds close to Steven Byrnes' brain-like AGI.

[-]PeterC3y10

I believe that cognitive neuroscience has nothing much to say about how any experience at all is implemented in the brain - but I just read this book which has some interesting ideas: https://lisafeldmanbarrett.com/books/how-emotions-are-made/

[-]MSRayne3y2-6

My personal opinion is that empathy is the one most likely to work. Most proposed alignment solutions feel to me like patches rather than solutions to the problem, which is AI not actually caring about the welfare of other beings intrinsically. If it did, it would figure out how to align itself. So that's the one I'm most interested in. I think Steven Byrnes has some interest in it as well - he thinks we ought to figure out how human social instincts are coded in the brain.

[-]Shoshannah Tekofsky3y40

Hmmm, yes and no?

e.g. many people that care about animal welfare differ on the decisions they would make for those animals. What if the AGI ends up a negative utilitarian and sterilizes us all to save humanity from all future suffering? The missing element would again be to have the AGI aligned with humanity, which brings us back to H4: What's humanity's alignment anyway?

[-]MSRayne3y10

I think "humanity's alignment" is a strange term to use. Perhaps you mean "humanity's values" or even "humanity's collective utility function."

I'll clarify what I mean by empathy here. I think the ideal form of empathy is wanting others to get what they themselves want. Given that entities are competing for scarce resources and tend to interfere with one another's desires, this leads to the necessity of making tradeoffs about how much you help each desire, but in principle this seems like the ideal to me.

So negative utilitarianism is not actually reasonably empathic, since it is not concerned with the rights of the entities in question to decide about their own futures. In fact I think it's one of the most dangerous and harmful philosophies I've ever seen, and an AI such as I would like to see made would reject it altogether.

[-]PeterC3y10

Enjoyed this.

Overall, I think that framing AI alignment as a problem is ... erm .. problematic. The best parts of my existence as a human do not feel like the constant framing and resolution of problems. Rather they are filled with flow, curiosity, wonder, love.

I think we have to look in a direction other than trying to formulate and solve the "problems" of flow, curiosity, wonder, love. I have no simple answer - and stating a simple answer in language would reveal that there was a problem, a category, that could "solve" AI and human alignment problems.

I keep looking for interesting ideas - and find yours among the most fascinating to date.

[-]kimsolez3y-3-3

My take on this: countering Eliezer Yudkowsky

[-]Ben Pace3y53

You're right that an AGI being vastly smarter than humans is consistent with both good and bad outcomes for humanity. This video does not address any of the arguments that have been presented about why an AGI would by default have unaligned values with humanity, which I'd encourage you to engage with. It's mentioned in bullet -3 in the list, under the names instrumental convergence and orthogonality thesis, with the former being probably what I'd recommend reading about first.

[+]kimsolez3y-6-4

[-]tidikanji3y11

Kim, you're not addressing the points in the post. You can't repeat catchphrases like 'passive victims of the future' and expect them to have ground here. MIRI created a well-funded research institution devoted to positively shaping the future, while you make silly YouTube videos with platitudes. This interest in AI seems like recreation to you.
