Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We’re grateful to our advisors Nate Soares, John Wentworth, Richard Ngo, Lauro Langosco, and Amy Labenz. We're also grateful to Ajeya Cotra and Thomas Larsen for their feedback on the contests. 

TLDR: AI Alignment Awards is running two contests designed to raise awareness about AI alignment research and generate new research proposals. Prior experience with AI safety is not required. Promising submissions will win prizes up to $100,000 (though note that most prizes will be between $1k and $20k; we will only award higher prizes if we receive exceptional submissions.)

You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups.)

What are the contests?

We’re currently running two contests:

Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?

Shutdown Problem Contest (based on Soares et al., 2015): Given that powerful AI systems might resist attempts to turn them off, how can we make sure they are open to being shut down?

What types of submissions are you interested in?

For the Goal Misgeneralization Contest, we’re interested in submissions that do at least one of the following:

  1. Propose techniques for preventing or detecting goal misgeneralization
  2. Propose ways for researchers to identify when goal misgeneralization is likely to occur
  3. Identify new examples of goal misgeneralization in RL or non-RL domains. For example (see also the toy sketch after this list):
    1. We might train an imitation learner to imitate a "non-consequentialist" agent, but it actually ends up learning a more consequentialist policy. 
    2. We might train an agent to be myopic (e.g., to only care about the next 10 steps), but it actually learns a policy that optimizes over a longer timeframe.
  4. Suggest other ways to make progress on goal misgeneralization 
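
To make the failure mode concrete, here is a minimal toy sketch (our own illustrative setup, loosely inspired by the CoinRun example in Langosco et al., and not a contest requirement): two policies that are indistinguishable on the training distribution but pursue different goals once the distribution shifts.

```python
# Toy corridor environment: during training the coin is always at the right
# end, so "go to the coin" and "always go right" are indistinguishable.
import random

CORRIDOR_LEN = 10
START = CORRIDOR_LEN // 2   # the agent starts in the middle of the corridor
MAX_STEPS = 20


def run_episode(policy, coin_pos):
    """Roll out a policy; reward 1 if the agent ends up on the coin."""
    pos = START
    for _ in range(MAX_STEPS):
        if pos == coin_pos:
            return 1.0
        pos += policy(pos, coin_pos)
        pos = max(0, min(CORRIDOR_LEN - 1, pos))
    return 1.0 if pos == coin_pos else 0.0


def go_to_coin(pos, coin_pos):
    """Intended goal: walk toward the coin, wherever it is."""
    return 1 if coin_pos > pos else -1


def go_right(pos, coin_pos):
    """Proxy goal: always walk right (works because of where training coins are)."""
    return 1


def average_reward(policy, coin_positions, episodes=1000):
    return sum(run_episode(policy, random.choice(coin_positions))
               for _ in range(episodes)) / episodes


random.seed(0)
train_coins = [CORRIDOR_LEN - 1]   # training: coin always at the right end
test_coins = list(range(START))    # test: coin somewhere to the left of the start

for name, policy in [("go_to_coin", go_to_coin), ("go_right", go_right)]:
    print(name,
          "train:", average_reward(policy, train_coins),
          "test:", average_reward(policy, test_coins))
# Both policies get perfect training reward; only the proxy policy fails at
# test time, even though it still moves through the corridor competently.
```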

For the Shutdown Problem Contest, we’re interested in submissions that do at least one of the following:

  1. Propose ideas for solving the shutdown problem or designing corrigible AIs. These submissions should also include (a) explanations of how these ideas address core challenges raised in the corrigibility paper and (b) possible limitations and ways the idea might fail (a toy sketch of the core incentive problem appears after this list)
  2. Define The Shutdown Problem more rigorously or more empirically 
  3. Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
  4. Strengthen existing approaches to training corrigible agents (e.g., by making them more detailed, exploring new applications, or describing how they could be implemented)
  5. Identify new challenges that will make it difficult to design corrigible agents
  6. Suggest other ways to make progress on corrigibility
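
To illustrate the core incentive problem referenced above (a toy model of our own, not the formalism from Soares et al.), here is a sketch of why a straightforward expected-utility maximizer tends to resist shutdown:

```python
# Toy model (illustrative numbers, not from Soares et al.) of why an expected-
# utility maximizer prefers to disable its own off switch.

def expected_utility(disable_button, p_press=0.3, task_value=10.0,
                     shutdown_value=0.0, disable_cost=0.1):
    """Expected utility for an agent that values finishing its task.

    p_press: probability the operator would press a working shutdown button.
    disable_cost: small resource cost of tampering with the button.
    """
    if disable_button:
        return task_value - disable_cost                       # task always finishes
    return (1 - p_press) * task_value + p_press * shutdown_value


print("leave button alone:", expected_utility(False))   # 7.0
print("disable the button:", expected_utility(True))    # 9.9
# Unless disable_cost exceeds p_press * (task_value - shutdown_value), tampering
# wins. Rewarding shutdown heavily instead flips the problem: the agent now
# wants to get itself shut down. Threading this needle is (roughly) the
# shutdown problem.
```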

Why are you running these contests?

We think that corrigibility and goal misgeneralization are two of the most important problems that make AI alignment difficult. We expect that people who can reason well about these problems will be well-suited for alignment research, and we believe that progress on these subproblems would be meaningful advances for the field of AI alignment. We also think that many people could potentially contribute to these problems (we're only aware of a handful of serious attempts at engaging with these challenges). Moreover, we think that tackling these problems will offer a good way for people to "think like an alignment researcher."

We hope the contests will help us (a) find people who could become promising theoretical and empirical AI safety researchers, (b) raise awareness about corrigibility, goal misgeneralization, and other important problems relating to AI alignment, and (c) make actual progress on corrigibility and goal misgeneralization. 

Who can participate?

Anyone can participate. 

What if I’ve never done AI alignment research before?

You can still participate. In fact, you’re our main target audience. One of the main purposes of AI Alignment Awards is to find people who haven’t been doing alignment research but might be promising fits for alignment research. If this describes you, consider participating. If this describes someone you know, consider sending this to them.

Note that we don’t expect newcomers to come up with a full solution to either problem (please feel free to prove us wrong, though). You should feel free to participate even if your proposal has limitations. 

How can I help?

You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups) or specific individuals (e.g., your smart friend who is great at solving puzzles, learning about new topics, or writing about important research topics.)

Feel free to use the following message:

AI Alignment Awards is offering up to $100,000 to anyone who can make progress on problems in alignment research. Anyone can participate. Learn more and apply at alignmentawards.com! 

Will advanced AI be beneficial or catastrophic? We think this will depend on our ability to align advanced AI with desirable goals – something researchers don’t yet know how to do.

We’re running contests to make progress on two key subproblems in alignment:

  • The Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
  • The Shutdown Problem Contest (based on Soares et al., 2015): Advanced AI systems might resist attempts to turn them off. How can we design AI systems that are open to being shut down, even as they get increasingly advanced? 

No prerequisites are required to participate. The deadline to submit is March 1, 2023. 

To learn more about AI alignment, see alignmentawards.com/resources.

Outlook

We see these contests as one possible step toward making progress on corrigibility, goal misgeneralization, and AI alignment. With that in mind, we’re unsure about how useful the contest will be. The prompts are very open-ended, and the problems are challenging. At best, the contests could raise awareness about AI alignment research, identify particularly promising researchers, and help us make progress on two of the most important topics in AI alignment research. At worst, they could be distracting, confusing, and difficult for people to engage with (note that we’re offering awards to people who can define the problems more concretely.)

If you’re excited about the contest, we’d appreciate you sharing this post and the website (alignmentawards.com) to people who might be interested in participating. We’d also encourage you to comment on this post if you have ideas you’d like to see tried. 

Comments (16)

I think the contest idea is great and aimed at two absolute core alignment problems. I'd be surprised if much comes out of it, as these are really hard problems and I'm not sure contests are a good way to solve really hard problems. But it's worth trying!

Now, a bit of a rant:

Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.

I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.).  The person with the most legible ML credentials is Lauro, who's an early-year PhD student with 10 citations.

Look, I know Richard and he's brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself into the shoes of an ML researcher who's not part of the alignment community, this panel sends a message: "wow, the alignment community has hundreds of thousands of dollars, but can't even find a single senior ML researcher crazy enough to entertain their ideas".

There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.

(Probably disregard this comment if ML researchers are not the target audience for the contests.)

+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.

(To be fair, I think the Inverse Scaling Prize, which I'm helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigates them at least partially.)

Hastily written; may edit later

Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.  

Some additional thoughts:

  1. We chose judges primarily based on their expertise and (our perception of) their ability to evaluate submissions about goal misgeneralization and corrigibility. Lauro, Richard, Nate, and John are some of the few researchers who have thought substantially about these problems. In particular, Lauro first-authored the first paper about goal misgeneralization and Nate first-authored a foundational paper about corrigibility.
  2. We think the judges do have some reasonable credentials (e.g., Richard works at OpenAI, Lauro is a PhD student at the University of Cambridge, Nate Soares is the Executive Director of a research organization & he has an h-index of 12, as well as 500+ citations). I think the contest meets the bar of "having reasonably well-credentialed judges" but doesn't meet the bar of "having extremely well-credentialed judges" (e.g., well-established professors with thousands of citations). I think that's fine.
  3. We got feedback from several ML people before launching. We didn't get feedback that this looks "extremely weird" (though I'll note that research competitions in general are pretty unusual). 
  4. I think it's plausible that some people will find this extremely weird (especially people who judge things primarily based on the cumulative prestige of the associated parties & don't think that OpenAI/500 citations/Cambridge are enough), but I don't expect this to be a common reaction.

Some clarifications + quick thoughts on Sam’s points:

  1. The contest isn’t aimed primarily/exclusively at established ML researchers (though we are excited to receive submissions from any ML researchers who wish to participate). 
  2. We didn’t optimize our contest to attract established researchers. Our contests are optimized to take questions that we think are at the core of alignment research and present them in a (somewhat less vague) format that gets more people to think about them.
  3. We’re excited that other groups are running contests that are designed to attract established researchers & present different research questions. 
  4. All else equal, we think that precise/quantifiable grading criteria & a diverse panel of reviewers are preferable. However, in our view, many of the core problems in alignment (including goal misgeneralization and corrigibility) have not been sufficiently well-operationalized to have precise/quantifiable grading criteria at this stage.

This response does not convince me.

Concretely, I think that if I'd show the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), I'd think that >60% would have some reactions according to what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or ... ).

Your point 3.) about the feedback from ML researchers could convince me that I'm wrong, depending on whom exactly you got feedback from and what that feedback looked like.

By the way, I'm highlighting this point in particular not because it's highly critical (I haven't thought much about how critical it is), but because it seems relatively easy to fix.

Here's some clarifying questions I had given this post, that are answered in the full contest rules. I'm putting them here in case anyone else had similar questions, though I'm a bit worried that the contest rules are pretty out of date?

  1. Q:  Who are the judges? 
    There will be two (2) rounds of judging:
    (1) Round 1: Approximately 10 Judges who are undergraduate and graduate students will judge all submitted essays by blind grading in accordance with the criteria set forth in subsection 6.D.i below. All Entries that receive an overall score of 85 and above will advance to Round 2.
    (2) Round 2: Out of the Entries advancing from Round 1, approximately 5 Judges who are senior researchers and directors of AI alignment organizations will judge all submitted essays by blind grading in accordance with the criteria set forth in subsection 6.D.i below. The Judges will select at least three (3) Entries for each essay prompt as Final Prize Winners, for a total of at least six (6) Final Prize Winners.
  2. Q: How will entries be scored?
    The Judges will make their decisions using the criteria (the “Criteria”) described below:
    (A) Demonstrated understanding of one or more core problems pertinent to the essay prompt (25%);
    (B) Ability to articulate how the entrant’s proposal addresses one or more of the identified core problems or otherwise advances the science of AI alignment (50%); and
    (C) Ability to articulate what the significant limitations of the proposal may be (if any) (25%).
  3. Q: What's the format of the entry? 
    A written essay in response to one of the two prompts posted on the Contest Site regarding AI alignment. Essay responses must be written in English and no more than 1,000 words in length (excluding title, endnotes, footnotes and/or citations). Entrants will submit their essay through the Contest Entry Form on the Contest Site. Essays must be submitted in a .pdf format.
  4. Q: What is the prize structure? 
    Each Entry that advances to Round 2 will receive a “Round 2 Prize” consisting of cash in the amount of at least One Thousand Dollars ($1,000.00) until a maximum of Five Hundred Thousand Dollars ($500,000.00) has been awarded (the “Round 2 Cap”). Whether or not the Round 2 Cap has been reached depends on the number of eligible Entries received that proceed to Round 2. For clarity, the Round 2 Cap applies to all Entries submitted for either essay prompt.  At least three (3) Entries for each essay prompt with the highest total scores from Round 2 will receive a “Final Prize” consisting of cash in the amount of at least Five Thousand Dollars ($5,000.00) each.
  5. Who is funding this?
    The Open Philanthropy Project.

 

Here are some other questions I still have after reading through the rules and the website:

  1. The website says proposals should be up to 500 words, but the official rules say it can be up to 1000. The website says you're allowed to submit supplementary materials, but the official rules make no mention of this. What's the actual intended format?
  2. Who exactly is judging the contest? The official rules say there's two rounds of judging, with undergrads/grad students judging in round 1 and 5 senior AI researchers judging in round 2. But the website says that "Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth."
  3. Why does the official rules say that the deadline is December 1st? Should people just not read the official rules?

Thanks for catching this, Lawrence! You're right-- we accidentally had an old version of the official rules. Just updated it with the new version. In general, I'd trust the text on the website (but definitely feel free to let us know if you notice any other inconsistencies).

As for your specific questions:

  1. 500 words
  2. If we get a lot of submissions, submissions will initially be screened by junior alignment researchers (undergrads & grad students) and then passed on to our senior judges (like the ones listed).
  3. Deadline is March 1. 
  4. No limit on the number of submissions per individual/team.

If you notice anything else, feel free to email at akash@alignmentawards.com and olivia@alignmentawards.com (probably better to communicate there than via LW comments). 

Here's another inconsistency between the official rules pdf and the website: the official rules say "Limit of one (1) Entry per individual entrant or Team for each essay prompt". However, the FAQ page on the website says you can submit multiple entries. How many entries can you actually make?

As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, great to see an effort that tries to draw more people to working on corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by inviting people to work on a 500 word essay, which will then first be judged by 'approximately 10 Judges who are undergraduate and graduate students', seems to be a bit strange. I can fully understand Sam Bowman's comment that this might all look very weird to ML people. What you have here is an essay contest. Calling it a research contest may offend some people who are actual card-carrying researchers.

Also, the more experienced judges you have represent somewhat of an insular sub-community of AI safety researchers. Specifically, I associate both Nate and John with the viewpoint that alignment can only be solved by nothing less than an entire scientific revolution. This is by now a minority opinion inside the AI safety community, and it makes me wonder what will happen to submissions that make less radical proposals which do not buy into this viewpoint.

OK, I can actually help you with the problem of an unbalanced judging panel: I volunteer to join it. If you are interested, please let me know.

Corrigibility is both

  • a technical problem: inventing methods to make AI more corrigible

  • a policy problem: forcing people deploying AI to use those methods, even if this will hurt their bottom line, even if these people are careless fools, and even if they have weird ideologies.

Of these two problems, I consider the technical problem to be mostly solved by now, even for AGI.
The big open problem in corrigibility is the policy one. So I'd like to see contest essays that engage with the policy problem.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs, rather than speculation or gut feelings. Of course, in the AI safety activism blogosphere, almost nobody wants to read or talk about these methods in the papers with the proofs; instead, everybody bikesheds the proposals which have been stated in natural language and which have been backed up only by speculation and gut feelings. This is just how a blogosphere works, but it does unfortunately add more fuel to the meme that the technical side of corrigibility is mostly unsolved and that nobody has any clue.

Thanks for your comment, Koen. Two quick clarifications:

  1. In the event that we receive a high number of submissions, the undergrads and grad students will screen submissions. Submissions above a certain cutoff will be sent to our (senior) panel of judges.

  2. People who submit promising 500-word submissions will (often) be asked to submit longer responses. The 500-word abstract is meant to save people time (they get feedback on the 500-word idea before they spend a bunch of time formalizing things, running experiments, etc.)

Two questions for you:

  1. What do you think are the strongest proposals for corrigibility? Would love to see links to the papers/proofs.

  2. Can you email us at akash@alignmentawards.com and olivia@alignmentawards.com with some more information about you, your AIS background, and what kinds of submissions you’d be interested in judging? We’ll review this with our advisors and get back to you (and I appreciate you volunteering to judge!)

Hi Akash! Thanks for the quick clarifications, these make the contest look less weird and more useful than just a 500 word essay contest.

My feedback here is that I definitely got the 500-word essay contest vibe when I read the 'how it works' list on the contest home page, and this vibe only got reinforced when I clicked on the official rules link and skimmed the document there. I recommend that you edit the 'how it works' list on the home page to make it much more explicit that the essay submission is often only the first step of participating, a step that will lead to direct feedback, and to clarify that you expect that most of the prize money will go to participants who have produced significant research beyond the initial essay. If that is indeed how you want to run things.

On judging: OK I'll e-mail you.

I have to think more about your question about posting a writeup on this site about what I think are the strongest proposals for corrigibility. My earlier overview writeup that explored the different ways people define corrigibility took me a lot of time to write, so there is an opportunity cost I am concerned about. I am more of an academic-paper-writing type of alignment researcher than a blogging-all-of-my-opinions-on-everything type of alignment researcher.

On the strongest policy proposal towards alignment and corrigibility, not technical proposal: if I limit myself to the West (I have not looked deeply into China, for example) then I consider the EU AI Act initiative by the EU to be the current strongest policy proposal around. It is not the best proposal possible, and there are a lot of concerns about it, but if I have to estimate expected positive impact among different proposals and initiatives, this is the strongest one.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here?  No need to write anything, just links.

OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

The list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.

Not all of the papers below have actual mathematical proofs in them; some of them show correctness by construction. Correctness by construction is superior to needing separate proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.

Here is the list, with the bold headings describing different approaches to corrigibility.

Indifference to being switched off, or to reward function updates

Motivated Value Selection for Artificial Agents introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.

Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.

AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong's indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.

Counterfactual Planning in AGI Systems introduces a different and easier-to-interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong's indifference methods. This paper has proof-by-construction type of math.

Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.

Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong's indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.

How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.
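
To give a rough numerical sense of the indifference idea above (a toy sketch of my own, not the exact construction from any of these papers): pay the agent a compensatory bonus on the shutdown branch so that its expected utility no longer depends on whether the button gets pressed.

```python
# Toy numbers for the utility-indifference idea (illustrative assumptions, not
# the construction from any specific paper above).

ACTIONS = {
    # action:           (value if button not pressed, value if pressed, Pr[pressed])
    "disable_button":   (9.0,  0.0, 0.0),
    "just_do_the_task": (10.0, 0.0, 0.3),
}


def eu_naive(v_no_press, v_press, p_press):
    """Plain expected utility: the press probability matters."""
    return (1 - p_press) * v_no_press + p_press * v_press


def eu_indifferent(v_no_press, v_press, p_press):
    """Add a compensatory term on the pressed branch so p_press drops out."""
    compensation = v_no_press - v_press
    return (1 - p_press) * v_no_press + p_press * (v_press + compensation)


for name, (vn, vp, p) in ACTIONS.items():
    print(f"{name:16s}  naive EU = {eu_naive(vn, vp, p):4.1f}   "
          f"indifferent EU = {eu_indifferent(vn, vp, p):4.1f}")
# Naive agent: disabling the button (9.0) beats just doing the task (7.0).
# Indifferent agent: just doing the task (10.0) wins, because manipulating the
# button no longer changes its expected utility.
```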

Agents that stop to ask a supervisor when unsure

A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop-when-unsure idea.

Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.
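
As a rough sketch of what the stop-when-unsure idea can look like in an ML setting (an illustrative assumption on my part, not taken from the papers above): the agent defers to a supervisor whenever its estimated confidence in the best action falls below a threshold.

```python
# Minimal "stop when unsure" sketch (illustrative; the action names and
# threshold are assumptions, not from the papers above).

def act_or_ask(action_probs, confidence_threshold=0.9):
    """Return the chosen action, or defer to the supervisor if unsure."""
    best_action = max(action_probs, key=action_probs.get)
    if action_probs[best_action] < confidence_threshold:
        return "ASK_SUPERVISOR"
    return best_action


print(act_or_ask({"continue": 0.97, "swerve": 0.03}))   # confident -> "continue"
print(act_or_ask({"continue": 0.55, "swerve": 0.45}))   # unsure -> "ASK_SUPERVISOR"
# The hard parts are getting calibrated uncertainty estimates and making sure
# "unsure" triggers in exactly the situations that matter.
```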

Anything about model-based reinforcement learning

I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.

Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons, or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.

CIRL

Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.
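
Here is a rough numerical sketch in the spirit of the off-switch game (a simplified toy model of my own, not the exact setup from the paper): a robot that is uncertain about the human's utility U for an action can act now, switch itself off, or defer to a human who allows the action only when U > 0. Deferring is never worse with a rational human, but its advantage shrinks as the robot becomes more certain, which is one way to see the drawback mentioned above.

```python
# Toy version of the off-switch game (simplified; the Gaussian belief over the
# human's utility U and all numbers are illustrative assumptions).
import random

random.seed(0)


def option_values(mean_u, std_u, samples=100_000):
    """Monte Carlo estimates of the robot's three options under its belief about U."""
    us = [random.gauss(mean_u, std_u) for _ in range(samples)]
    act_now = sum(us) / samples                       # E[U]
    switch_off = 0.0                                  # taking no action is worth 0
    defer = sum(max(u, 0.0) for u in us) / samples    # human blocks the action if U < 0
    return act_now, switch_off, defer


for std in [2.0, 0.5, 0.01]:
    act_now, off, defer = option_values(mean_u=0.5, std_u=std)
    print(f"belief std {std:4.2f}:  act now {act_now:5.2f}   "
          f"switch off {off:4.2f}   defer {defer:5.2f}")
# Deferring always does at least as well as the other two options, but its
# advantage over acting directly vanishes as the robot's uncertainty shrinks.
```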

Commanding the agent to be corrigible

If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.

Giving the same command to a not infinitely competent and obedient agent may give you a huge number of problems instead, of course. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.

AIs that are corrigible because they are not agents

Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.

Myopia

Myopia can also be considered a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation.

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems.  Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3]  Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists).  In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4]

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]:

"In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure]  to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model." 

  1. ^

    Btw, your writing is admirably concrete and clear.

    Errata: Subscripts seem to be broken on page 9, which significantly hurts readability of the equations.  Also there is a double-typo "I this paper, we the running example of a toy universe" on page 4.

  2. ^

    Assuming the idea is correct

  3. ^

    Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?

  4. ^

    I'm not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.

  5. ^

    Portions in [brackets] are insertions/replacements by me

Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted.

To comment on your quick thoughts:

  • My later papers spell out the ML analog of the solution in 'Corrigibility with Utility Preservation' more clearly.

  • On your question of 'Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?': Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that it is very, very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.

  • On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.

  • I do not think I fully support my 2019 statement anymore that 'Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model'. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.

This is certainly an interesting contest!

I have some questions that aren't addressed in either the official rules or the FAQ:

  1. What's the rule on submitting to this contest work that you're also planning on submitting to ML conferences/journals?
  2. If the intention is to encourage people new to AI alignment to submit, should people who are already working full time in AI alignment not submit?

Thanks, Lawrence! Responses below:

  1. You can submit work that you're also planning on submitting to ML conferences/journals. Feel free to email us if you have any specific constraints (e.g., as noted on the site, we plan to make proposals & winners public by default).
  2. People who are already working on alignment are able (and encouraged!) to submit. 
