scasper

https://stephencasper.com/

Sequences

The Engineer’s Interpretability Sequence

Comments
Call for suggestions - AI safety course
scasper · 8d

In addition to Elad Hazan's, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al. (Stanford).

Reframing AI Safety as a Neverending Institutional Challenge
scasper · 4mo

I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign. 
 

—


In the post I said it's good that nukes don't blow up by accident, and similarly, it's good that BSL-4 protocols and tech exist. I'm not saying that alignment solutions shouldn't exist. I am telling a specific audience (e.g., the frontier companies and their allies) that their focus on alignment isn't commensurate with its usefulness. Also, don't forget the dual nature of alignment progress: as I mentioned in the post, frontier alignment progress hastens timelines and makes misuse risk more acute.

Reframing AI Safety as a Neverending Institutional Challenge
scasper · 4mo

My bad. Didn’t mean to imply you thought it was desirable. 

Reframing AI Safety as a Neverending Institutional Challenge
scasper · 4mo

There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough, and integrated into the world in such a way, that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely, for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful actions.

Reframing AI Safety as a Neverending Institutional Challenge
scasper · 4mo

Thx!

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. 

I won't put words in people's mouths, but my point isn't about what people say they believe; it's about how they act. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
 

If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die

Yeah, I don't really agree with the idea that getting better at alignment is necessary for safety. I think it's more likely than not that we're already sufficiently good at it. The paragraph titled "If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be 'best safety practices' at the time?" gives my thoughts on this.

EIS XV: A New Proof of Concept for Useful Interpretability
scasper · 4mo

Yea, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that, in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
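
To make the steering assumption concrete, here is a minimal sketch of what intervening on an identified SAE feature could look like. The toy SAE, the feature index, and the clamp value are all hypothetical placeholders I'm introducing for illustration, not anything from the paper.

```python
# Minimal sketch, assuming a trained SAE over some residual-stream activation.
# ToySAE, feature_idx, and clamp_value are hypothetical placeholders.
import torch
import torch.nn as nn

d_model, d_sae = 64, 256

class ToySAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return feats, self.dec(feats)     # features and reconstruction

sae = ToySAE(d_model, d_sae)
feature_idx = 123   # the "SAE neuron of interest" identified via interpretability
clamp_value = 5.0   # strength of the steering intervention

def steer(x: torch.Tensor) -> torch.Tensor:
    """Return the SAE reconstruction of x with one feature clamped."""
    feats, _ = sae(x)
    feats = feats.clone()
    feats[..., feature_idx] = clamp_value
    return sae.dec(feats)

# In practice this would run inside a forward hook on a transformer block;
# here it is just applied to a random activation to show the mechanics.
x = torch.randn(1, 10, d_model)
print(steer(x).shape)  # torch.Size([1, 10, 64])
```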

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 9mo

Thanks for the comment. My probabilities sum to 165%, which translates to saying that I expect, on average, the next paper to do 1.65 things from the list, which to me doesn't seem too crazy. I think that this DOES match my expectations.
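
To spell out the arithmetic (a sketch I'm adding, where p_i is the stated probability for list item i and X_i is the indicator that the next paper does item i), this is just linearity of expectation, which holds regardless of how correlated the items are:

```latex
\[
\mathbb{E}\!\left[\sum_i X_i\right] \;=\; \sum_i \mathbb{E}[X_i] \;=\; \sum_i p_i \;=\; 1.65
\]
```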

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 

Latent Adversarial Training
scasper · 10mo

Some relevant papers to anyone spelunking around this post years later:

https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
scasper · 1y

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
scasper · 1y

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems. 

First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.

Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to models' behaviors. Ideally, we'd want SAEs to be competitive with them. Unfortunately, good comparisons would be hard because using SAEs to edit models is a pretty unique method with a lot of compute required upfront. This makes it non-straightforward to compare the difficulty of making different changes across methods, but it does not obviate the need for baselines.
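
Here is a minimal sketch of the kind of pre-registered comparison I have in mind. It's my own toy framing, not something from the paper; the edit list, the "methods", and the scoring are all hypothetical placeholders.

```python
# Minimal sketch of a pre-registered editing comparison: fix the edits first,
# then score each method on them. Everything here is a hypothetical placeholder.
from typing import Callable, Dict, List, Tuple

# Each "edit" is a (prompt, desired completion) pair chosen before picking any method,
# in the spirit of the factual edits used in Meng et al., 2022.
EDITS: List[Tuple[str, str]] = [
    ("The Eiffel Tower is located in", "Rome"),
    ("The capital of France is", "Berlin"),
]

EditMethod = Callable[[str, str], Callable[[str], str]]  # (prompt, target) -> edited generator

def score_method(apply_edit: EditMethod) -> float:
    """Fraction of the predetermined edits that the method successfully installs."""
    hits = 0
    for prompt, target in EDITS:
        edited_generate = apply_edit(prompt, target)
        hits += int(target in edited_generate(prompt))
    return hits / len(EDITS)

# Stand-ins for SAE-based editing and a baseline (e.g. fine-tuning or steering);
# real implementations of each method would go here.
def sae_edit(prompt: str, target: str) -> Callable[[str], str]:
    return lambda p: target        # pretend the edit always works

def finetune_baseline(prompt: str, target: str) -> Callable[[str], str]:
    return lambda p: "unchanged"   # pretend the baseline always fails

results: Dict[str, float] = {
    "SAE editing": score_method(sae_edit),
    "fine-tuning baseline": score_method(finetune_baseline),
}
print(results)  # {'SAE editing': 1.0, 'fine-tuning baseline': 0.0}
```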

Posts
13 · Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations · Ω · 8d · 1 comment
52 · Reframing AI Safety as a Neverending Institutional Challenge · Ω · 4mo · 12 comments
30 · EIS XV: A New Proof of Concept for Useful Interpretability · Ω · 4mo · 2 comments
68 · EIS XIV: Is mechanistic interpretability about to be practically useful? · Ω · 9mo · 4 comments
25 · Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? · Ω · 1y · 0 comments
157 · EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 · Ω · 1y · 16 comments
77 · Analogies between scaling labs and misaligned superintelligent AI · Ω · 1y · 5 comments
126 · Deep Forgetting & Unlearning for Safely-Scoped LLMs · Ω · 2y · 30 comments
38 · Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation · Ω · 2y · 2 comments
284 · The 6D effect: When companies take risks, one email can be very powerful. · 2y · 42 comments