Gabe M

Technical AI governance and safety researcher.

Wiki Contributions


Gabe M40

Congrats! Could you say more about why you decided to add evaluations in particular as a new week?

Gabe MΩ230

Do any of your experiments compare the sample efficiency of SFT/DPO/EI/similar to the same number of samples of simple few-shot prompting? Sorry if I missed this, but it wasn't apparent at first skim. That's what I thought you were going to compare from the Twitter thread: "Can fine-tuning elicit LLM abilities when prompting can't?"

Gabe M30

What do you think about pausing between AGI and ASI to reap the benefits while limiting the risks and buying more time for safety research? Is this not viable due to economic pressures on whoever is closest to ASI to ignore internal governance, or were you just not conditioning on this case in your timelines and saying that an AGI actor could get to ASI quickly if they wanted?

Gabe M20

Thanks! I wouldn't say I assert that interpretability should be a key focus going forward, however--if anything, I think this story shows that coordination, governance, and security are more important in very short timelines.

Gabe M32

Good point--maybe something like "Samantha"?

Gabe M20

Ah, interesting. I posted this originally in December (e.g. older comments), but then a few days ago I reposted it to my blog and edited this LW version to linkpost the blog.

It seems that editing this post from a non-link post into a link post somehow bumped its post date and pushed it to the front page. Maybe a LW bug?

Gabe M31

Related work

Nit having not read your full post: Should you have "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" in the related work? My mind pattern-matched to that exact piece from reading your very similar title, so my first thought was how your piece contributes new arguments. 

Gabe M10

If true, this would be a big deal: if we could figure out how the model is distinguishing between basic feature directions and other directions, we might be able to use that to find all of the basic feature directions.


Or conversely, and maybe more importantly for interp, we could use this to find the less basic, more complex features. Possibly that would form a better definition for "concepts" if this is possible.

Gabe MΩ230

Suppose  has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if a = “first name is Michael” and b = “last name is Jordan” then  can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising the model computed this AND as  and stored the result along some direction  independent of  and . Assuming the model has done this, we could then linearly extract  with the probe

for some appropriate  and .[7]


Should the  be inside the inner parentheses, like  for ?

In the original equation, if  AND  are both present in , the vectors , and would all contribute to a positive inner product with , assuming . However, for XOR we want the  and  inner products to be opposing the  inner product such that we can flip the sign inside the sigmoid in the  AND  case, right?

Gabe M63

Thanks! +1 on not over-anchoring--while this feels like a compelling 1-year timeline story to me, 1-year timelines don't feel the most likely.

Load More