evhub

I (Evan Hubinger) am a safety researcher at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions. Previously: MIRI, OpenAI.

See: “Why I'm joining Anthropic”

Pronouns: he/him/his

Email: evanjhub@gmail.com

Selected work:

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions

Comments


(Moderation note: added to the Alignment Forum from LessWrong.)

Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.

I think this is extremely not true, and am pretty disappointed with this sort of "debate me" communications policy. In my opinion, public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the pressure to say things that look good to a public audience pushes strongly against actual truth-seeking.

I understand and agree with the importance of good communications here, but imo this is really not the way. Some alternative possibilities:

  • Private discussions with experts that get summarized publicly afterward.
  • Adversarial collaborations with public writeups on tricky subjects.
  • Public talks where people can ask questions about points of confusion.
  • Panel discussions involving experts with different opinions.

I'm sure there's a bunch more here; these are just some ideas off the top of my head. In general, I think there's a lot of ways to do public communications on complex, controversial topics that don't involve public debates, and I'd strongly encourage going in one of those alternative directions instead.

Cross-posted to the EA Forum.

Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum—in my opinion, this issue deserves a lot more EA attention.

Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.

I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.

Another thought here:

  • If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.

Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen here. It's not clear to me why Eliezer saying this would make anything harder for other policy proposals. (Not that I agree with everything he said, I just think it was good that he said it.)

I am much more conflicted about the FLI letter; its particular policy prescription seems not great to me, and I worry it makes us look pretty bad if we try approximately the same thing again with a better policy prescription after this one fails, which is approximately what I expect we'll need to do.

(Though to be fair this is as someone who's also very much on the pessimistic side and so tends to like variance.)

That's nice, but I don't currently believe there are any audits or protocols that can prove future AIs safe "beyond a reasonable doubt".

I think you can do this with a capabilities test (e.g. ARC's), just not with an alignment test (yet).


Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

  • One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal with worst-case failures in the real world. This is highly related to my discussion of best-case vs. worst-case transparency here.
  • Another thing worth pointing out here regarding using causal scrubbing for something like this is that causal scrubbing requires some base distribution that you're evaluating over, which means it could fall into a trap similar to the one in the first bullet point here. Presumably, if you wanted to build a causal-scrubbing-based safety evaluation, you'd just use the entire training distribution as the distribution you were evaluating over, which seems like it would help a lot with this problem, but it's still not completely clear that it would solve it, especially if you were just evaluating your average-case causal scrubbing loss over that distribution.