I (Evan Hubinger) am a safety researcher at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions. Previously: MIRI, OpenAI.
See: “Why I'm joining Anthropic”
Pronouns: he/him/his
Email: evanjhub@gmail.com
Selected work:
Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.
I think this is extremely not true, and am pretty disappointed with this sort of "debate me" communications policy. In my opinion, I think public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the pressure of saying things that look good to a public audience creates a lot of pressure opposed to actual truth-seeking.
I understand and agree with the importance of good communications here, but imo this is really not the way. Some alternative possibilities:
I'm sure there's a bunch more here; these are just some ideas off the top of my head. In general, I think there's a lot of ways to do public communications on complex, controversial topics that don't involve public debates, and I'd strongly encourage going in one of those alternative directions instead.
Cross-posted to the EA Forum.
Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum—In my opinion, I think this issue deserves a lot more EA attention.
Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.
I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.
Another thought here:
Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.
And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.
I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen here. It's not clear to me why Eliezer saying this would make anything harder for other policy proposals. (Not that I agree with everything he said, I just think it was good that he said it.)
I am much more conflicted about the FLI letter; it's particular policy proscription seems not great to me and I worry it makes us look pretty bad if we try approximately the same thing again with a better policy proscription after this one fails, which is approximately what I expect we'll need to do.
(Though to be fair this is as someone who's also very much on the pessimistic side and so tends to like variance.)
Thanks to Chris Olah for a helpful conversation here.
Some more thoughts on this:
(Moderation note: added to the Alignment Forum from LessWrong.)