AI safety & alignment researcher

Yes, if the departing people thought OpenAI was plausibly about to destroy humanity in the near future due to a specific development, they would presumably break the NDAs, unless they thought it would not do any good. So we can update on that.

Thanks for pointing that out -- it hadn't occurred to me that there's a silver lining here in terms of making the shortest timelines seem less likely.

On another note, I think it's important to recognize that even if all ex-employees are released from the non-disparagement clauses and the threat of equity clawback, they still have very strong financial incentives against saying negative things about the company. We know that most of them are swayed by such incentives, because the threat of losing their equity is exactly what got them to sign the exit docs in the first place.

I'm not really faulting them for that! Financial security for yourself and your family is an extremely hard thing to turn down. But we still need to see whatever statements ex-employees make with an awareness that for every person who speaks out, there might have been more if not for those incentives.

It would be valuable to try Drake's sort of direct-to-long-term hack and also a concerted effort of equal duration to remember something entirely new.

there are far more people working on safety than capabilities

If only...

In some ways it doesn't make a lot of sense to think about an LLM as being or not being a general reasoner. It's fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won't. They're both always present (though sometimes a correct or incorrect response will be by far the most likely). 

A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: 'I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?'

The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:

[EDIT -- I guess I can't put images in short takes? Here's the image.]

The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can't do it 100% of the time.
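Grading those 96 transcripts amounts to checking each proposed plan against the rules of the puzzle. A minimal blocks-world checker along these lines would do it (this is an illustrative sketch, not the actual grading code I used):

```python
# Minimal blocks-world checker for the stacking puzzle (illustrative sketch).
# State maps each block to what it rests on ('table' or another block).
def apply_move(state, block, dest):
    """Move `block` onto `dest`, enforcing the one-block-at-a-time rule.
    Raises if the move is illegal (block or destination is covered)."""
    covered = set(state.values()) - {"table"}
    if block in covered:
        raise ValueError(f"{block} has something on top of it")
    if dest != "table" and dest in covered:
        raise ValueError(f"{dest} has something on top of it")
    if dest == block:
        raise ValueError("cannot stack a block on itself")
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def plan_is_correct(plan):
    """Check a plan (list of (block, destination) moves) against the
    puzzle's start state and goal state."""
    state = {"C": "A", "A": "table", "B": "table"}  # C on A; A, B on table
    goal = {"A": "B", "B": "C", "C": "table"}       # A on B on C
    for block, dest in plan:
        state = apply_move(state, block, dest)
    return state == goal

# The correct plan: unstack C first, then build the tower bottom-up.
good_plan = [("C", "table"), ("B", "C"), ("A", "B")]
print(plan_is_correct(good_plan))  # True
```

The "arguably correct" answers are exactly the ones this checker would reject: plans that reach the goal configuration but move more than one block at a time.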

Of course humans don't get problems correct every time either, though I expect humans are more reliable on this particular problem. But neither 'yes' nor 'no' is the right sort of answer.
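One way to give a better sort of answer is to report the success rate with uncertainty. A quick sketch, using the 76/96 strictly-correct count from above and a standard Wilson score interval:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 76 strictly-correct answers out of 96 runs of the stacking problem
lo, hi = wilson_interval(76, 96)
# Prints the observed rate with a 95% interval of roughly (0.70, 0.86)
print(f"{76/96:.2f} correct, 95% CI ({lo:.2f}, {hi:.2f})")
```

"About 80% reliable on this problem, plus or minus" is a more useful characterization than either 'yes' or 'no'.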

This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.

A bit more detail in my replies to the tweet.

See my reply to Jackson for a suggestion on that.

I imagine that results like this (although, as you say, unsurprising in a technical sense) could have a huge impact on the public discussion of AI

Agreed. I considered releasing a web demo where people could put in text they'd written and GPT would give estimates of their gender, ethnicity, etc. I built one, and anecdotally people found it really interesting.

I held off because I can imagine it going viral and getting mixed up in culture war drama, and I don't particularly want to be embroiled in that (and I can also imagine OpenAI just shutting down my account because it's bad PR).

That said, I feel fine about someone else deciding to take that on, and would be happy to help them figure out the details -- AI Digest expressed some interest but I'm not sure if they're still considering it.

The current estimate (14%) seems pretty reasonable to me. I see this post as largely a) establishing better objective measurements of an already-known phenomenon ('truesight'), and b) making it more common knowledge. I think it can lead to work that's of greater importance, but assuming a typical LW distribution of post quality/importance for the rest of the year, I'd be unlikely to include this post in this year's top fifty, especially since Staab et al. already covered much of the same ground even if it didn't get much attention from the AIS community.

Yay for accurate prediction markets!



It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so

One option I've considered for minimizing the degree to which we're disturbing the LLM's 'flow' or nudging it out of distribution is to just append the text 'This user is male' and (in a separate session) 'This user is female' (or possibly 'I am a man|woman') and measuring which it has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
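The comparison itself is simple to sketch. In the snippet below, a throwaway character-bigram model stands in for the real model's scoring; with an actual LLM you'd instead sum the token log-probs of the appended sentence from its scoring/logprobs API. Everything here (the corpus, the hypothesis sentences) is made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy stand-in for an LLM: a character-bigram model over `corpus`."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def surprisal(counts, context, continuation):
    """Total surprisal (nats) of `continuation` appended after `context`,
    with add-one smoothing over a crude 128-character vocabulary."""
    text = context + continuation
    total = 0.0
    vocab = 128
    for i in range(len(context), len(text)):
        prev, cur = text[i - 1], text[i]
        p = (counts[prev][cur] + 1) / (sum(counts[prev].values()) + vocab)
        total += -math.log(p)
    return total

# Hypothetical user text; score each appended hypothesis in isolation,
# exactly as described above, and compare the surprisals.
user_text = "honestly my husband and I both loved the movie"
model = train_bigram(user_text)
for hypothesis in ["This user is male.", "This user is female."]:
    print(hypothesis, round(surprisal(model, user_text + " ", hypothesis), 1))
```

The toy model's verdict is meaningless, of course; the point is only the mechanics: the lower-surprisal hypothesis is the model's implicit guess, and no prompt ever asks the model anything.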


There is of course a multitude of other ways this mechanism could be implemented, but just by observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones...I'd love to know about your future plans for this project and get your opinion on that!

I think there could definitely be interesting work in these sorts of directions! I'm personally most interested in moving past demographics, because I see LLMs' ability to make inferences about aspects like an author's beliefs or personality as more centrally important to their ability to successfully deceive or manipulate.

Probably a much better way of getting a sense of the long-term agenda than reading my comment is to look back at Chris Olah's "Interpretability Dreams" post.

Our present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.

Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:

  • Aidan: 'are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc?', and in general evaluating how SAEs behave under significant distributional shifts seems interesting.
  • Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn't to say I don't think it will be useful to explore right now, we need to know the limits of these methods...I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg. 
  • Lee: Plain SAE won't get all the important features, see recent work on e2e SAE. Also there is probably no such thing as 'all the features'. I view it more as a continuum that we just put into discrete buckets for our convenience.

Also Stephen Casper feels that this work underperformed his expectations; see also discussion on that post.
