evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions

Comments

evhub · 3d

Is there another US governmental organization that you think would be better suited? My relatively uninformed sense is that there's no USFG organization that would be very well-suited for this, and given that, NIST is probably one of the better choices.

evhub · 4d

Summarizing from the private conversation: the information there is not new to me and I don't think your description of what they said is accurate.

As I've said previously, Anthropic people certainly went around saying things like "we want to think carefully about when to do releases and try to advance capabilities for the purpose of doing safety", but it was always extremely clear, at least to me, that these were not commitments, just general thoughts about strategy, and I am very confident that this is what was being referred to as widespread in 2022 here.

evhub · 4d

I think it was an honest miscommunication coupled with a game of telephone—the sort of thing that inevitably happens sometimes—but not something that I feel particularly concerned about.

evhub · 4d

I mostly believe y'all about 2—I didn't know 2 until people asserted it right after the Claude 3 release, but I haven't been around the community, much less well-connected in it, for long—but that feels like an honest miscommunication to me.

For the record, I have been around the community for a long time (since before Anthropic existed), in a very involved way, and I had also basically never heard of this before the Claude 3 release. I can recall only one time when I ever heard someone mention something like this: it was a non-Anthropic person who said they had heard it from someone else who was also not at Anthropic; they asked me if I had heard the same thing, and I said no. So given all the reports, it certainly seems clear that this was a real rumour that was going around, but it was definitely not the case that this was just an obvious thing that everyone in the community knew about, or something that Anthropic senior staff were regularly saying (I talked regularly to a lot of Anthropic senior staff before I joined Anthropic, and I never heard anyone say this).

evhub · 7d

Pilot wave theory keeps all of standard wave mechanics unchanged, but then adds on top rules for how the wave "pushes around" classical particles. But it never zeroes out parts of the wave: the entirety of wave mechanics, including the full superposition and all the multiple worlds that it implies, is still necessary to compute the exact structure of the wave and understand how it will push around the classical particles. Pilot wave theory then just declares that everything other than the classical particles "doesn't exist", which doesn't really make sense, because the multiple worlds still have to exist in some sense: you have to compute them to understand how the wave will push around the classical particles.
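To make this concrete, here is a minimal sketch of the standard de Broglie–Bohm equations for spinless particles (not from the original comment): the wavefunction evolves under the unmodified Schrödinger equation, while the particle positions are guided by that same wavefunction evaluated at the actual configuration:

$$ i\hbar\,\frac{\partial \psi}{\partial t} = \hat{H}\,\psi $$

$$ \frac{dQ_k}{dt} = \frac{\hbar}{m_k}\,\operatorname{Im}\!\left(\frac{\nabla_k \psi}{\psi}\right)\!(Q_1,\dots,Q_N,t) $$

The guidance equation depends on the full ψ, so every branch of the superposition still has to be evolved to know where the particles go; nothing in the formalism ever discards the other branches.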

evhub · 16d

Fwiw, I still think about the Conditioning Predictive Models stuff quite a lot and think it continues to be very relevant. If future AI systems continue to look very much like present AI systems, I expect that some of the major problems we'll have to deal with will be exactly the major problems presented in that paper (e.g. a purely predictive AI system predicting the output of a deceptively aligned AI system continues to look like a huge and important problem to me).

evhub · 1mo

(Moderation note: added to the Alignment Forum from LessWrong.)

evhub · 1mo

(Moderation note: added to the Alignment Forum from LessWrong.)

evhub · 1mo

I still really like my framework here! I think this post ended up popularizing some of the ontology I developed here, but the unfortunate thing about that post being the one to popularize it is that it doesn't really provide an alternative.
