Joe Kwon

Wiki Contributions


This is terrific. One feature that will be great to have, is a way to sort and categorize your predictions under various labels.

Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position.

I don't think we'd be in a good position even if we instilled an alignment drive this strong in AGI

To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given explicitly stated (And even implied-from-text) goals != human values as a whole. 

Hi Cameron, nice to see you here : ) what are your thoughts on a critique like: human prosocial behavior/values only look the way they look and hold stable within-lifetimes, insofar as we evolved in + live in a world where there are loads of other agents with roughly equal power as ourselves? Do you disagree with that belief? 

This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!

This is a really cool idea and I'm glad you made the post! Here are a few comments/thoughts:

H1: "If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes"

How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that it isn't).  Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we're worried about is in a different reference class. Not sure.


H4 is something I'm super interested in and would be happy to talk about it in conversations/calls if you want to : )

Something at the root of this might be relevant to the inverse scaling competition thing where they're trying to find what things get worse in larger models. This might have some flavor of obviously wrongness -> deception via plausible sounding things as models get larger? 

interesting idea. like.. a mix of genuine sympathy/expansion of moral circle to AI, and virtue signaling/anti-corporation meme spreads to the majority population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations' efforts to get to powerful AI unless it reaches a threshold at which point there's very dramatic actions against corporations who continue to try to do that thing

I stream-of-consciousness'd this out and I'm not happy with how it turned out, but it's probably better I post this than delete it for not being polished and eloquent. Can clarify with responses in comments.

Glad you posted this and I'm also interested in hearing what others say. I've had these questions for myself in tiny bursts throughout the last few months. 

When I get the chance to speak to people earlier in their career stage than myself (starting undergrad, or is a high schooler attending a mathcamp I went to) who are undecided about their careers, I bring up my interest in AI Alignment and why I think it's important, and share resources for them after the call in case they're interested in learning more about it. I don't have very many opportunities like this because I don't actively seek to identify and "recruit" them. I only bring it up by happenstance (e.g. joining a random discord server for homotopy type theory, seeing an intro by someone who went to the same mathcamp as me and is interested in cogsci, and scheduling a call to talk about my research background in cogsci and how my interests have evolved/led me to alignment over time). 

I know very talented people who are around my age at MIT and from a math program I attended; students who are breezing by technical double majors with perfect GPAs, IMO participants, good competitive programmers, etc. Some things that make it hard for me:

  1. If I know them well, I can talk about my research interests and try to get them to see my motivation, but if I'm only catching up with them 1-2x a year, it feels very unnatural and synthetic for me to be spending that time trying to convert them into doing alignment work. If I am still very close to them / talk to them frequently, there's still an issue of bringing it up naturally and having a chance to convince them. Most of these people are doing Math PhDs, or trading in finance, or working on a startup, or... The point is that they are fresh on their sprint down the path that they have chosen. They are all the type who are very focused and determined to succeed on the goals they have settled on. It is not "easy" to get them (or for this matter, almost any college student) to halt their "exploit" mode, take 10 steps back and lots of time from their busy lives, and then "explore" another option that I'm seemingly imposing onto them. FWIW, the people I know who are in trading seem to be the most likely to switch out (explicitly have told me in conversations that they just enjoy the challenge of the work, but want to find more fulfilling things down the road. And to these people I share ideas and resources about AI Safety.)
  2. I shared resources after the call, talked about why I'm interested in alignment, and that's the furthest I've gone wrt potentially converting someone who is already in a separate career track, to consider alignment.
  3. If it was MUCH easier to convince people that ai alignment is worth thinking about in under an hour, and I could reach out to people to talk to me about this for a hour without looking like a nutjob and potentially damaging our relationship because it seems like I'm just trying to convert them into something else, AND the field of AI Alignment was more naturally compelling for them to join, I'd do much more of this outreach. On that last point, what I mean is: for one moment, let's suspend the object level importance of solving AI Alignment. In reality, there are things that are incredibly important/attractive for people when pursuing a career. Status, monetary compensation, and recognition (and not being labeled a nutjob) are some big ones. If these things were better (and I think they are getting much better recently), it would be easier to get people to spend more time at least thinking about the possibility of working on AI Alignment, and eventually some would work on it because I don't think the arguments for x-risk from AI are hard to understand. If I personally didn't have so much support by way of programs the community had started (SERI, AISC, EA 1-1s, EAG AI Safety researchers making time to talk to me), or it felt like the EA/X-risk group was not at all "prestigious", I don't know how engaged I would've been in the beginning, when I started my own journey learning about all this. As much as I wish it weren't true, I would not be surprised at all if the first instinctual thing that led me down this road was noticing that EAs/LW users were intelligent and had a solidly respectable community, before choosing to spend my time engaging with the content (a lot of which was about X-risks).                                    

Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: conceptnet, less crude: similarity judgments on visual features and classes collected by humans) is a good enough proxy for "relevant structures" (or at least that these representations more faithfully capture the natural abstractions than the best machines in vision tasks for example, where human performance is the benchmark performance), right?

I had a similar idea about ontology mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize that idea. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines are representing things differently (and that humans, for now, are doing it "better"/more efficiently/more like the natural abstraction structures that exist). 

I think am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.

Load More