All of RobertKirk's Comments + Replies

How can Interpretability help Alignment?

(Note: you're quoting your own response as well as the sentence you meant to quote and respond to, which makes it hard to see which part is your writing. I think you need two newlines to break the quote formatting.)

Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

I think this is much the same question as how we incentivise the wider ML community to think safety is important. I don't know if there's anything specific about the RL community which

...

Resources for AI Alignment Cartography

No worries. Although I think less has been written on debate than on amplification (Paul has a lot of blog posts on IDA), it seems to me that most of the work Paul's team at OpenAI is doing is on debate rather than IDA.

Resources for AI Alignment Cartography

I don't know whether this is on purpose, but I'd think that AI Safety Via Debate (original paper; recent progress report) should get a mention, probably in the "Technical agendas focused on possible solutions" section? I'd argue it's different enough from IDA to have its own subtitle.

Gyrodiot (1y): It was in the references that initially didn't make the cut. After further thought, it's indeed worth adding. I referenced the Distill article AI Safety Needs Social Scientists, which spends more time on the motivating arguments, and linked to the paper in the note. Thanks for your feedback!