abergal

Comments

Benito's Shortform Feed
abergal · 1y

Thanks for writing this up -- at least speaking for myself, I think I agree with the majority of this, and it articulates some important parts of how I live my life that I hadn't previously made explicit.

Don't you think RLHF solves outer alignment?
Answer by abergal · Nov 05, 2022

I think your first point basically covers why -- people are worried about alignment difficulties in superhuman systems, in particular (because those are the dangerous systems that can cause existential failures). I think a lot of current RLHF work is focused on providing reward signals to current systems in ways that don't directly address the problem of "how do we reward systems for behaviors whose consequences are too complicated for humans to understand".

Interpretability
abergal · 4y

Chris Olah wrote this topic prompt (with some feedback from me (Asya) and Nick Beckstead). We didn’t want to commit him to being responsible for this post or responding to comments on it, so we submitted this on his behalf. (I've changed the by-line to be more explicit about this.)

Call for research on evaluating alignment (funding + advice available)
abergal · 4y

Thanks for writing this! Would "fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning" count as measuring misalignment as you're imagining it? My sense is that there might be a bunch of existing work like that.
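
For concreteness, a minimal sketch of the kind of before-and-after measurement described above (the model and dataset names are just illustrative placeholders, not a specific recommendation):

```python
# Minimal sketch: evaluate downstream-task accuracy before and after fine-tuning.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # placeholder pre-trained model
dataset = load_dataset("glue", "sst2")  # placeholder downstream task

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length"),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def accuracy(eval_pred):
    # eval_pred unpacks into (logits, labels)
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=accuracy,
)

before = trainer.evaluate()["eval_accuracy"]  # accuracy before fine-tuning
trainer.train()
after = trainer.evaluate()["eval_accuracy"]   # accuracy after fine-tuning
print(f"accuracy before: {before:.3f}, after: {after:.3f}")
```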

Provide feedback on Open Philanthropy’s AI alignment RFP
abergal · 4y

This RFP is an experiment for us, and we don't yet know if we'll be doing more of them in the future. I think we'd be open to including research directions that we think are promising and that apply equally well to both DL and non-DL systems -- I'd be interested in hearing any particular suggestions you have.

(We'd also be happy to fund particular proposals in the research directions we've already listed that apply to both DL and non-DL systems, though we will be evaluating them on how well they address the DL-focused challenges we've presented.)

Provide feedback on Open Philanthropy’s AI alignment RFP
abergal · 4y

Getting feedback in the next week would be ideal; September 15th will probably be too late.

Different request for proposals!

Discussion: Objective Robustness and Inner Alignment Terminology
abergal · 4y

Thank you so much for writing this! I've been confused about this terminology for a while and I really like your reframing.

An additional terminological point that I think it would be good to solidify is what people mean when they refer to "inner alignment" failures. As you allude to, my impression is that some people use it to refer to objective robustness failures broadly, whereas others (e.g. Evan) use it to refer to failures that involve mesa optimization. There is then additional confusion around whether we should expect "inner alignment" failures that don't involve mesa optimization to be catastrophic and, relatedly, around whether humans count as mesa optimizers.

I think I'd advocate for letting "inner alignment" failures refer to objective robustness failures broadly, talking about "mesa optimization failures" as such, and then leaving the question about whether there are problematic inner alignment failures that aren't mesa optimization-related on the table.

MIRI location optimization (and related topics) discussion
abergal · 4y

I feel pretty bad about both of your current top choices (Bellingham and Peekskill) because they seem too far from major cities. I worry this distance will seriously hamper your ability to hire good people, which is arguably the most important thing MIRI needs to be able to do. [Speaking personally, not on behalf of Open Philanthropy.]

April drafts
abergal · 5y

Announcement: "How much hardware will we need to create AGI?" was actually inspired by a conversation I had with Ronny Fernandez and had forgotten about; credit goes to him for the original idea of using 'the weights of random objects' as a reference class.

https://i.imgflip.com/1xvnfi.jpg

abergal's Shortform
abergal · 5y

I think it would be kind of cool if LessWrong had built-in support for newsletters. I would love to see more newsletters about various tech developments, etc. from LessWrongers.

Posts

Funding for programs and events on global catastrophic risk, effective altruism, and other topics · 1y
Funding for work that builds capacity to address risks from transformative AI · 1y
Updates to Open Phil’s career development and transition funding program · 2y
The Long-Term Future Fund is looking for a full-time fund chair · 2y
Long-Term Future Fund Ask Us Anything (September 2023) · 2y
Long-Term Future Fund: April 2023 grant recommendations · 2y
[AMA] Announcing Open Phil’s University Group Organizer and Century Fellowships [x-post] · 3y
Truthful and honest AI · 4y
Interpretability · 4y
Techniques for enhancing human feedback · 4y