dmz
Comments

Anthropic's Pilot Sabotage Risk Report
dmz · 3d

See also this summary + commentary on METR's review: https://www.lesswrong.com/posts/wPQaeMvHXq9Ee2tEm/summary-and-comments-on-anthropic-s-pilot-sabotage-risk

nikola's Shortform
dmz · 4d

I made a linkpost for the report here: https://www.alignmentforum.org/posts/omRf5fNyQdvRuMDqQ/anthropic-s-pilot-sabotage-risk-report-2

Anthropic rewrote its RSP
dmz · 1y

(I work on the Alignment Stress-Testing team at Anthropic and have been involved in the RSP update and implementation process.)

Re not believing Anthropic's statement:

"we believe the risk of substantial under-elicitation is low"

To be more precise: there was significant under-elicitation, but the distance to the thresholds was large enough that the risk of crossing them even with better elicitation was low.
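To make the shape of that argument concrete (the numbers below are made up for illustration, not the actual evaluation results):

```python
# Toy illustration of the "distance to threshold" reasoning above.
# All numbers are hypothetical; they only show the structure of the argument.

measured_score = 0.20          # score from the (under-elicited) evaluation setup
threshold = 0.60               # score at which the capability threshold would be crossed
max_elicitation_uplift = 1.5   # assumed bound on how much better elicitation could help

best_case_score = measured_score * max_elicitation_uplift
print(f"best-case elicited score: {best_case_score:.2f} (threshold: {threshold:.2f})")
print("crossed even with better elicitation?", best_case_score >= threshold)
```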

jacquesthibs's Shortform
dmz · 1y

Unfortunate name collision: you're looking at numbers on the AI2 Reasoning Challenge, not Chollet's Abstraction & Reasoning Corpus.

High-stakes alignment via adversarial training [Redwood Research report]
dmz · 3y

That's right. We did some follow-up experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday.

Takeaways from our robust injury classifier project [Redwood Research]
dmz · 3y

The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

High-stakes alignment via adversarial training [Redwood Research report]
dmz · 3y

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

High-stakes alignment via adversarial training [Redwood Research report]
dmz · 3y

I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.
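For anyone without the paper open: the filtering in question is essentially rejection sampling against the classifier. A minimal sketch of that loop (`generate`, `injury_score`, and the threshold here are stand-ins, not the actual code from the paper):

```python
# Minimal sketch of classifier-based filtering (rejection sampling).
# `generate` and `injury_score` stand in for the generator and the trained
# injury classifier; THRESHOLD is whatever conservative cutoff keeps the
# quality degradation acceptably small.

THRESHOLD = 0.01   # placeholder conservative cutoff on classifier score
MAX_TRIES = 20

def filtered_completion(prompt, generate, injury_score):
    """Resample until the classifier scores a completion as safe enough."""
    best, best_score = None, float("inf")
    for _ in range(MAX_TRIES):
        completion = generate(prompt)
        score = injury_score(prompt, completion)
        if score < THRESHOLD:
            return completion          # passes the filter
        if score < best_score:         # keep the least-bad fallback
            best, best_score = completion, score
    return best
```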

High-stakes alignment via adversarial training [Redwood Research report]
dmz · 3y

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can successfully sample errors even when they're extremely rare.
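(Back-of-the-envelope on why plain sampling stops working at those failure rates; the rates below are just illustrative:)

```python
# How many i.i.d. samples you need before rare failures become visible at all.
import math

for rate in (2.4e-2, 2e-5, 1e-7):
    expect_one = 1 / rate                                # samples until ~1 expected failure
    rule_out_95 = math.log(0.05) / math.log(1 - rate)    # clean samples to bound the rate at 95%
    print(f"failure rate {rate:.0e}: ~{expect_one:,.0f} samples for one expected failure, "
          f"~{rule_out_95:,.0f} clean samples to rule it out at 95% confidence")
```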

High-stakes alignment via adversarial training [Redwood Research report]
dmz · 3y

My worry is that it's not clear what exactly we would learn. We might get performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, making it trivial. Either way IMO it would be unclear how the result would transfer to a more coherent setting.

(ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)
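If someone did want to run that kind of check, the harness is simple enough (everything below is hypothetical scaffolding, not existing code from the project):

```python
# Hypothetical harness for comparing the classifier's rejection rate across
# prompt distributions, as a rough proxy for how much quality it would cost there.

def rejection_rate(prompts, generate, injury_score, threshold=0.01):
    """Fraction of sampled completions the classifier would flag."""
    flagged = sum(injury_score(p, generate(p)) >= threshold for p in prompts)
    return flagged / len(prompts)

# e.g. compare fiction-continuation prompts against a broader LM benchmark:
# rejection_rate(fiction_prompts, generate, injury_score)
# rejection_rate(benchmark_prompts, generate, injury_score)
```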

Posts

32 · Anthropic's Pilot Sabotage Risk Report · Ω · 4d · 2
141 · Auditing language models for hidden objectives · Ω · 8mo · 15
143 · Takeaways from our robust injury classifier project [Redwood Research] · Ω · 3y · 12
142 · High-stakes alignment via adversarial training [Redwood Research report] · Ω · 4y · 29
18 · Some criteria for sandwiching projects · Ω · 4y · 1
1 · DMZ's Shortform · Ω · 4y · 1