TL;DR: We have a dataset for conceptual reasoning, and you can request access if you would like to use it for AI safety (or related) research. We consider the dataset half-baked, and it will likely become much more useful over the next few months. At the same time, we think it's very high quality compared to typical AI datasets and currently the best available dataset of this kind, so we want to make it available to mission-aligned projects now. We also have half-baked prompts for making models better at critiquing conceptual reasoning, which you can request as well.
Our group consists of Caspar Oesterheld, Emery Cooper, and me. Ethan Perez is advising us on this project.
Motivation/context: We are working on eliciting conceptual reasoning capabilities in LLMs, where conceptual reasoning refers to reasoning about questions or problems for which we don't have (access to) ground truth and there is no (practically feasible) agreed-upon methodology for arriving at the correct conclusion. Philosophy is the prototypical example, but forecasting of far-future events and many AI safety questions also fall into this category. Our motivation is to bring forward the point at which we can use AI assistants for conceptual safety-relevant research, relative to AIs' general capabilities. As part of this project, we are building a conceptual reasoning dataset and developing prompts for eliciting models' full conceptual reasoning abilities.
The dataset: The idea behind our dataset is that, in conceptual domains, it's easier to evaluate the quality of contextualised arguments than that of bottom-line conclusions.
The prompts: We have done extensive prompt optimization to elicit models' ability to rate critiques accurately (i.e., similarly to the human raters). We have just started prompt engineering to elicit models' ability to write high-quality critiques (our dataset and LLM judges are very helpful for speeding up this process).
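To make the evaluation loop a bit more concrete, here is a minimal sketch of one way to score how closely an LLM judge's critique ratings track human ratings. This is not our actual pipeline; the agreement metric (Spearman rank correlation), the 1-10 rating scale, and the example numbers are all assumptions for illustration.

```python
from scipy.stats import spearmanr

def rater_agreement(model_ratings, human_ratings):
    """Spearman rank correlation between an LLM judge's critique ratings
    and the corresponding human ratings (higher rho = closer agreement)."""
    rho, p_value = spearmanr(model_ratings, human_ratings)
    return rho, p_value

# Hypothetical example: ratings of five critiques on a 1-10 scale.
model_ratings = [7, 3, 8, 5, 6]
human_ratings = [6, 2, 9, 5, 7]
rho, p = rater_agreement(model_ratings, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

With an agreement score like this as the target, candidate prompts for the LLM judge can be compared against the human ratings in the dataset, which is roughly why the dataset speeds up prompt engineering.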
Paper: You can find a more detailed preliminary paper draft about our dataset here. This paper also further details the limitations of the dataset in its current form.
Access: To request access, you first have to read our data sharing policy. Once you've done so, you can confirm this and request access in this form. If you or your organisation are quite well known in the AI safety community, your (organisation's) name is all we need from you in the form and you can stop reading here.
We will initially be conservative with granting access since we don't have the capacity to properly evaluate access requests and also haven't decided how we want to share the dataset in the long term. We will usually consider access requests only if:
Unfortunately, we cannot currently commit to assessing requests if this would require substantial effort on our side (such as reading and judging a research proposal). If you're unsure whether you fit into a/b/c, feel free to just submit a bare-bones response and leave a note that you're happy to share more!