Prize for Alignment Research Tasks

William_S

Task: Why does Eliezer Yudkowsky think this approach is doomed?

Takes a high level description of an approach, spits out Yudkowsky's reply to it. Ideally uses real examples of Yudkowsky telling people why their ideas are doomed.

Please pick up the task of finding these examples and collect the bounty.

[-]William_S4y20

Would be good to have examples that include the relevant Info Constraints: something that just retrieves things Eliezer has said seems less useful than something that can come up with things Eliezer would say based on having the same information available.

[-]Quintin Pope4y*240

Task: Spotting hidden errors and implications in past writings on alignment

Context: The alignment community has produced a large collection of writing on the alignment problem. This work contains many assumptions, arguments and results that form the basis of our current efforts. If any of these background beliefs are wrong, it could significantly change the structure of the alignment problem. This task would involve teaching an AI to highlight shaky inferences, questionable assumptions and other potential holes or hidden implications in a given line of reasoning. Alignment researchers could then use such AIs to repeatedly scan all past work on the alignment problem, so as to more quickly highlight areas where we could have made a mistake or missed something important.
Input type: A text containing alignment-relevant reasoning.
Output type: Text that describes a potential mistake, missing implication or other issue with the alignment-related text in question.
Additional: These examples just show the key sections of text where the potential mistake / hidden implication occurs. In actual training and execution, the AI would receive the entire alignment-related text (or a large section thereof) as input, and its output would also specify the location in the input text where the issue occurs.
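A minimal sketch of what a single output record could look like once the location requirement is included. The `Finding` class, its field names, and the choice of character offsets are all assumptions made for illustration; the task description above does not fix any schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One flagged issue in an alignment-related text (hypothetical schema)."""
    kind: str   # e.g. "Hidden implication" or "Retrodictive error"
    start: int  # character offset where the flagged span begins
    end: int    # character offset just past the flagged span
    note: str   # free-text description of the potential issue


def locate(text: str, quote: str, kind: str, note: str) -> Finding:
    """Build a Finding by locating the first occurrence of `quote` in `text`."""
    start = text.find(quote)
    if start == -1:
        raise ValueError("quote not found in text")
    return Finding(kind=kind, start=start, end=start + len(quote), note=note)


def flagged_span(text: str, finding: Finding) -> str:
    """Return the exact passage of `text` that the finding points at."""
    return text[finding.start:finding.end]
```

Storing offsets rather than a copy of the passage keeps each finding tied to one place in the full input, which matters if the same sentence appears more than once.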

Instance 1:

Input:

We must take care to avoid inner alignment failures. They can lead to a system with goals that diverge from the base objective to an arbitrarily large degree. Consider that humans are not aligned with evolution. We do things like wear condoms and otherwise fail to maximize our inclusive genetic fitness. Our values diverge arbitrarily widely from those of evolution.

Output:

Hidden implication: If human values derive from an inner alignment failure, then attempting to prevent inner alignment failures in AI systems means preventing the processes by which human values arose in humans.

Instance 2:

Input:

(From Eliezer Yudkowsky's post My Childhood Role Model)

Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.
It's amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics. This deserves applause. It deserves more than applause, it deserves a place in the Guinness Book of Records. Like successfully building the fastest car ever to be made entirely out of Jello.

Output:

Retrodictive error: thinking that the human learning process is highly specialized towards the ancestral environment implies humans would not be able to generalize well beyond the ancestral environment, which does not conform to observed reality.

I think this task will be very difficult for language models to do. I think even Chinchilla may not be quite good enough to be truly useful here. However, I think this task is significantly less difficult than directly making original progress on the core of alignment.

I also think a lot of useful alignment research is blocked by subtle background assumptions that we don't realize we should question. I basically consider this task to be automating the search for the sort of "miracle" that Yudkowsky sometimes describes as "[violating] some aspect of [his] background model" (source).

This task is also unusual in that even a single true success from such an approach could be enough to entirely change the game in regards to alignment. It would be worth trawling through thousands of false positives to find a single such true positive. The median result of running the system could be garbage. As long as there are occasional gems, it would be worthwhile. Note that large language models do tend to occasionally produce exceptional outputs, even in scenarios where they usually do poorly.

[-]Alex Lawsen4y180

Task: Identify key background knowledge required to understand a concept

Context: Many people are currently self-directing their learning in order to eventually be able to usefully contribute to alignment research. Even experienced researchers will sometimes come across concepts that require background they don't have. By 'key' background content, I'm imagining that the things which get identified are 'one step back' in the chain, or something like 'the required background concepts which themselves require the most background'. This seems like the best way of making the tool useful: if the background concepts generated are themselves not understood by the user, they can just use the tool again on those concepts.
Input type: A paper (with the idea that part of the task is to identify the highest level concepts in the paper). It would also be reasonable to just have the name of a concept, with a separate task of 'generate the highest level concept'.
Output type: At minimum, a list of concepts which are key background. Better would be a list of these concepts plus summaries of papers/textbooks/wikipedia entries which explain them.
Info considerations: This system is not biased towards alignment over capabilities, though I think it will in practice help alignment work more than capabilities work, due to the former being less well-served by mainstream educational material and courses. This does mean that having scraped LW and the Alignment Forum, alignment-relevant things on arXiv, MIRI's site etc. would be particularly useful.

I don't have capacity today to generate instances, though I plan to come back and do so. I'm happy to share credit if someone else jumps in first and does so though!

[-]Alex Lawsen4y90

The ideal version of the task is decomposable into:

Find the high level concepts in a paper (high level here meaning 'high level background required')
From a concept, generate the highest level prerequisite concepts
For a given concept, generate a definition/explanation (either by finding and summarising a paper/article, or just directly producing one)

The last of these tasks seems very similar to a few things Elicit is already doing or at least trying to do, so I'll generate instances of the other two.
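The three-step decomposition above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `background_report` and its three callable parameters are hypothetical names, each callable stands in for a model or search call that is not specified here, and passing the paper to the prerequisite step as context is my own choice rather than part of the proposal.

```python
from typing import Callable, Dict, List


def background_report(
    paper: str,
    extract_concepts: Callable[[str], List[str]],
    prerequisites: Callable[[str, str], List[str]],
    explain: Callable[[str], str],
) -> Dict[str, Dict[str, str]]:
    """Chain the three subtasks: concepts -> prerequisites -> explanations.

    Returns a mapping from each high-level concept found in the paper to a
    mapping from its prerequisite concepts to their explanations.
    """
    report: Dict[str, Dict[str, str]] = {}
    for concept in extract_concepts(paper):
        # Step 2 receives the paper too, so it can use the surrounding context.
        prereqs = prerequisites(concept, paper)
        # Step 3 is run once per prerequisite concept.
        report[concept] = {p: explain(p) for p in prereqs}
    return report
```

Keeping the three steps as separate callables means any one of them could be swapped for a human-curated source without touching the other two.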

Identify some high-level concepts in a paper

Example 1

Input: This post by Nuno Sempere

Output: Suggestions for high level concepts

Counterfactual impact
Shapley Value
Funging
Leverage
Computability

Notes: In one sense the 'obvious' best suggestion for the above post is 'Shapley value', given that's what the post is about, and it's therefore the most central concept one might want to generate background on. I think I'd ~~be fine with~~ probably prefer the output above though, where there's some list of <10 concepts. In a model which had some internal representation of the entirety of human knowledge, and purely selected the single thing with the most precursors, my (very uncertain) guess is that computability might be the single output produced, even though it's non-central to the post and only appears in a footnote. That's part of the reason why I'd be relatively happy for the output of this first task to roughly be 'complicated vocabulary which gets used in the paper'.

Example 2

Input: Eliciting Latent Knowledge by Mark Xu and Paul Christiano

Output: Suggestions for high level concepts

Latent Knowledge
Ontology
Bayesian Network
Imitative Generalisation
Regularisation
Indirect Normativity

Notes: This is actually a list of terms I noted down as I was reading the paper, so rather than 'highest level' it's just 'what Alex happened to think it was worth looking up', but for illustrative purposes I think it's fine.

Having been given a high-level concept, generate prerequisite concepts

Notes: I noticed when trying to generate background concepts here that in order to do so it was most useful to have the context of the post. This pushed me in the direction of thinking these concepts were harder to fully decompose than I had thought, and suggested that the input might need to be '[concept], as used in [paper]', rather than just [concept]. All of the examples below come from the examples above. In some cases, I've indicated what I expect a second layer of recursion might produce, though it seems possible that one might just want the model to recurse one or more times by default.

I found the process of generating examples really difficult, and am not happy with them. I notice that what I kept wanting to do was write down 'high-level' concepts. Understanding the entirety of a few high-level concepts is often close to sufficient to understand an idea, but it's not usually necessary. With a smooth recursion UX (maybe by clicking), I think the ideal output almost invariably generates low-mid level concepts with the first few clicks. The advantages of this are that if the user recognises a concept they know, they are done with that branch, and that narrower concepts are easier to generate definitions for without recursing. Unfortunately, sometimes there are high level prerequisites which aren't obviously going to be generated by recursing on the lower level ones. I don't have a good solution to this yet.

Input: Shapley Value

Output:

Expected value
- Weighted average
- Elementary probability
- Utility
Marginal contribution
- Payoff
- Agent
- Fixed cost
- Variable cost
- Utility

Input: Computability

Output:

Computational problem
Computable function
Turing Machine
Computational complexity

Notes: I started recursing, quickly confirmed my hypothesis from earlier about this being by miles the thing with the most prerequisites, and deleted everything except what I had for 'level 1', which I also left unfinished before I got completely lost down a rabbit hole.

Input: Bayesian Network

Output:

Probabilistic inference
- Bayes' Theorem
- Probability distribution
Directed Acyclic Graph
- Directed Graph
  - Graph (Discrete Mathematics)
    - Vertex
    - Edge
- Cycle
  - Trail
  - Graph (Discrete Mathematics)
    - Vertex
    - Edge

Notes: Added a few more layers of recursion to demonstrate both that you probably want some kind of dynamic tree structure, and also that not every prerequisite is equally 'high level'.
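One way to picture the 'dynamic tree structure' is a sketch like the following. It assumes a hand-written prerequisite map (`PREREQS`, copied from the Bayesian Network example above) in place of a model, and the `expand` function and its `known` and `max_depth` parameters are hypothetical names. It stops at concepts the user already knows, collapses subtrees that were already expanded, and can be limited to a given depth.

```python
from typing import Dict, FrozenSet, List, Optional, Set

# Hypothetical hand-written map, taken from the Bayesian Network example.
PREREQS: Dict[str, List[str]] = {
    "Bayesian Network": ["Probabilistic inference", "Directed Acyclic Graph"],
    "Probabilistic inference": ["Bayes' Theorem", "Probability distribution"],
    "Directed Acyclic Graph": ["Directed Graph", "Cycle"],
    "Directed Graph": ["Graph (Discrete Mathematics)"],
    "Cycle": ["Trail", "Graph (Discrete Mathematics)"],
    "Graph (Discrete Mathematics)": ["Vertex", "Edge"],
}


def expand(
    concept: str,
    known: FrozenSet[str] = frozenset(),
    max_depth: Optional[int] = None,
) -> List[str]:
    """Return an indented outline of prerequisites for `concept`."""
    lines: List[str] = []
    expanded: Set[str] = set()  # concepts whose children were already listed

    def visit(name: str, depth: int) -> None:
        indent = "  " * depth
        if name in known:
            # The user recognises this concept, so the branch ends here.
            lines.append(f"{indent}- {name} (known)")
            return
        if name in expanded:
            # Avoid repeating a subtree; this also guards against cycles.
            lines.append(f"{indent}- {name} (expanded above)")
            return
        lines.append(f"{indent}- {name}")
        children = PREREQS.get(name, [])
        if not children or (max_depth is not None and depth >= max_depth):
            return
        expanded.add(name)
        for child in children:
            visit(child, depth + 1)

    visit(concept, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(expand("Bayesian Network", known=frozenset({"Directed Graph"}))))
```

Collapsing repeated subtrees is a design choice: the second appearance of 'Graph (Discrete Mathematics)' is shown as a pointer rather than listed again in full.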

Conclusions from trying to generate examples

This is a much harder, but much more interesting, problem than I'd originally expected. Which prerequisites seem most important, how narrowly to define them, and how much to second guess myself, all ended up feeling pretty intractable. I may try with some (much) simpler examples later, rather than trying to generate them from papers I legitimately found interesting. If an LLM is able to generalise the idea of 'necessary prerequisites' from easier concepts to harder ones, this itself seems extremely interesting and valuable.

[-]William_S4y20

Seems like a reasonable task, but wonder if it would be easier in practice to just have a wiki or something like https://metacademy.org/ or get post authors to do this (mostly depends on the size of the graph of concepts you need to connect, if it's smaller makes sense for people to do it, if it's larger then maybe automation helps).

[-]Alex Lawsen4y10

I think both of those would probably help but expect that the concept graph is very big, especially if you want people to be able to use the process recursively.

There's also value in the workflow being smooth, and this task is sandwiched between two things which seem very useful (and quite straightforward) to automate with an LLM:

concept extraction
search for and summarise explainer papers/articles

I can however imagine a good wiki with great graph style UX navigation and expandable definitions/paper links solving the last two problems, with then only concept extraction being automated by Elicit, though even in this case initially populating the graph/wiki might be best done using automation of the type described above. It's much easier to maintain something which already exists.

[-]Logan Riggs4yΩ6110

[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]

Task: Obvious EA/Alignment Advice

Context: There are several common mental motions that the EA community does which are usefully applied to alignment. Ex. "Differential Impact", "Counterfactual Impact", "Can you clearly tell a story on how this reduces x-risk?", and "Truly Optimizing for X". General "obvious advice" is useful for general capabilities as well, but this is intended to have a strong EA tilt to it.
Input Type: A plan for reducing x-risk, either a research direction, upskilling, or field-building
Output Type: Useful exploratory questions

Instance 1

Input:

I'm thinking about taking up a job at Google to upskill in machine learning before I apply to Anthropic or Redwood Research.

Output:

Isn't there a more direct path to upskilling in machine learning? Like taking a course or doing a bunch of dumb, fun projects? Additionally, is programming/machine learning your comparative advantage?

Instance 2

Input:

I'm working on robustness to distributional shift, specifically identifying when an input is out-of-distribution so it can default to a no-op or asking for human feedback on that input.

Output:

How neglected is this field? If you didn't work in this field, would someone else have done the work you'd do, but a year or two later?

Instance 3

Input:

I'm building an interpretability tool that can tell you what the model was thinking when it was making a classification, for instance, what part of the input it was using to make that classification.

Output:

Let's say you succeed in making this tool; can you tell a story on how this tool can reduce AI x-risk?

[-]Alex Lawsen4y50

I actually happen to already have taught Elicit to give helpful/obvious advice (not alignment specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here if you have an Elicit account.

Edit:

Here's the training data

Life problem

I need to think of a research topic but I've only had two ideas and both of them aren't that great.