LESSWRONG
LW

Caspar Oesterheld — LessWrong

3mo

Tl;dr: We have a dataset for conceptual reasoning which you can request access to if you would like to use it for AI safety (or related) research. We consider the dataset half-baked and it will likely become much more useful over the next few months. At the same time, we think it's very high quality compared to typical AI datasets and currently the best available dataset of this kind, so we want to make it available to mission-aligned projects now. We also have half-baked prompts to make models better at critiquing conceptual reasoning, which you can also request.

Our group consists of Caspar Oesterheld, Emery Cooper, and me. Ethan Perez is advising us on this project.

Motivation/context: We... (read 668 more words →)

Replying to No, Futarchy Doesn’t Have This EDT Flaw

Caspar Oesterheld 8mo

In the academic literature, this sort of scheme has been analyzed by Chen et al., e.g.: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/TEAC-final1.pdf

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Caspar Oesterheld

Caspar Oesterheld, Ethan Perez, Chi Nguyen

I’ve spent a lot of the last few years working on issues related to acausal cooperation. With LLMs having become clearly dominant in recent years, I’ve now led a team to build a benchmark that measures how good LLMs are at decision theory and whether and when they lean more CDT or EDT. We hope to expand this dataset in the future, including by incorporating questions that try to measure the updatelessness dimension. Hopefully, this dataset will be useful for future interventions aimed at improving acausal interactions.

Abstract:

We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts

... (read 443 more words →)

Replying to Using (Uninterpretable) LLMs to Generate Interpretable AI Code

Caspar Oesterheld 1y

I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood.

Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" must require. It seems to matter, for example, that to a human it's understandable how the given rule/heuristic or something like the given heuristic could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it's not doing anything problematic (?).

Replying to My AI Model Delta Compared To Christiano

Caspar Oesterheld 1y

To some extent, this is all already in Jozdien's comment, but:

It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people's level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let's ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the... (read more)

Replying to Using (Uninterpretable) LLMs to Generate Interpretable AI Code

Caspar Oesterheld 1y

As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.

Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that... (read 423 more words →)

Replying to Boycott OpenAI

Caspar Oesterheld 2y

If all you're using is ChatGPT, then now's a good time to cancel the subscription, because GPT-4o seems to be about as powerful as GPT-4, and GPT-4o is available for free.

Replying to Anthropic release Claude 3, claims >GPT-4 Performance

Caspar Oesterheld 2y

As one further data point, I also heard people close to/working at Anthropic giving "We won't advance the state of the art."-type statements, though I never asked about specifics.
My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to work on a benchmark right now and on that benchmark, Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC's comment that they're arguably not significantly advancing the state of the art.
I suppose even if Opus is only slightly better (or even just perceived to be better) and even

Caspar Oesterheld 2y

AI things that are perhaps as important as human-controlled AI

In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors.

I'd be interested in why one might think this to be true. (I only did a very superficial ctrl+f on Lukas' post -- sorry if that post addresses this question.) I'd think that there are lots of dimensions of variation and that within these, AIs could assume a continuous range of values. (If AI training mostly works by training to imitate human data, then one might imagine that (assuming inner alignment) they'd mostly fall within the range of human variation. But I assume that's not what you mean.)

Replying to How LLMs are and are not myopic

Caspar Oesterheld 2y

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

Replying to How LLMs are and are not myopic

Caspar Oesterheld 2y

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets $(x, z_{1}, z_{2})$. You want to train a model that predicts $z_{1}$ from $x$ and $z_{2}$ from $x, z_{1}$.

Your model consists of three components: $f, g_{1}, g_{2}$. It makes predictions as follows:
$y = f(x)$
$z_{1} = g_{1}(y)$
$z_{2} = g_{2}(y, z_{1})$

(Why have such a model? Why not have two completely separate models, one for predicting $z_{1}$ and one for predicting $z_{2}$? Because it might be more efficient to use a single $f$ both for predicting $z_{1}$ and for predicting $z_{2}$, given that both predictions presumably require "interpreting" $x$.)

So, intuitively, it first builds an "inner representation" (embedding) of $x$. Then it sequentially makes predictions based on that inner representation.
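The forward pass just described can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the comment: the dimensions `D_X` and `D_Y`, the random weights, and the choice of a tanh layer for $f$ and linear maps for $g_{1}$ and $g_{2}$ are all hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: x has 4 features, the shared representation y has 3.
D_X, D_Y = 4, 3

# Randomly initialised parameters for the three components (assumed shapes).
W_F = [[random.gauss(0, 1) for _ in range(D_X)] for _ in range(D_Y)]
W_G1 = [random.gauss(0, 1) for _ in range(D_Y)]
W_G2 = [random.gauss(0, 1) for _ in range(D_Y + 1)]


def f(x):
    """Shared 'inner representation' y = f(x); here a single tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_F]


def g1(y):
    """First head: predicts z1 from y alone."""
    return sum(w * yi for w, yi in zip(W_G1, y))


def g2(y, z1):
    """Second head: predicts z2 from y together with z1."""
    return sum(w * v for w, v in zip(W_G2, y + [z1]))


def predict(x):
    """Run the three components in sequence: y, then z1, then z2."""
    y = f(x)
    z1 = g1(y)
    # This sketch feeds the model's own z1 into g2; the observed z1 from
    # the triplet could be passed instead.
    z2 = g2(y, z1)
    return y, z1, z2
```

Calling `predict([1.0, 0.0, 0.5, -1.0])` returns the representation and both predictions. The point of the structure is that `f` is shared, so any training signal that reaches `f` from the `z2` loss also changes what `g1` sees.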

Now you train $f$ and $g_{1}$ to minimize the prediction loss on the $(x, z_{1})$ parts of the triplets.... (read 673 more words →)

Replying to Paper: LLMs trained on “A is B” fail to learn “B is A”

Caspar Oesterheld 2y

At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think... (read more)

Stop-gradients lead to fixed point predictions

Johannes Treutlein

Johannes Treutlein, Caspar Oesterheld, Rubi J. Hudson, Emery Cooper

Johannes Treutlein and Rubi Hudson worked on this post as part of SERI MATS, under the mentorship of Evan Hubinger. Rubi has also received mentorship from Leo Gao. We thank Erik Jenner for helpful discussions and Alexander Pan for bringing the performative prediction literature to our attention.

Update 30 May 2023: We have now published a paper based on our previous post and this post (the material from this post is in Appendix D).

1. Introduction

In our previous post, we analyzed a setting in which an oracle AI is maximizing a strictly proper scoring rule while it can influence the world with its predictions about the probabilities of possible outcomes. In this setting, a... (read 7134 more words →)

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein

Johannes Treutlein, Rubi J. Hudson, Caspar Oesterheld

Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger's and Leo Gao's mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.

Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.

Introduction

One issue with oracle AIs is that they might be able to influence the world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy... (read 6240 more words →)
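To make the claim in the post's title concrete, here is a small numerical sketch. Everything specific in it is an assumption chosen for illustration and not taken from the post: a single binary event, a hypothetical influence function `outcome_probability(p) = 0.3 + 0.2p` describing how the published prediction `p` shifts the event's probability, and the quadratic (Brier-style) score as the strictly proper scoring rule.

```python
def outcome_probability(p):
    """Hypothetical influence: probability of the event once prediction p is public."""
    return 0.3 + 0.2 * p


def expected_quadratic_score(p, q):
    """Expected value of the quadratic score -(y - p)^2 when P(y = 1) = q."""
    return -(q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2)


def best_report(grid_size=1001):
    """Grid-search the report p that maximises the oracle's expected score,
    taking into account that the report itself changes the outcome probability."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda p: expected_quadratic_score(p, outcome_probability(p)))


# The fixed point solves p = 0.3 + 0.2 * p, i.e. p = 0.375.
FIXED_POINT = 0.3 / (1.0 - 0.2)
best_p = best_report()
```

Under these assumptions the score-maximising report lands near 1/3, not at the fixed point 0.375: the oracle gains more by nudging the outcome probability than it loses by being slightly miscalibrated.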

Extracting Money from Causal Decision Theorists

Caspar Oesterheld

My paper with my Ph.D. advisor Vince Conitzer titled "Extracting Money from Causal Decision Theorists" has been formally published (Open Access) in The Philosophical Quarterly. Probably many of you have seen either earlier drafts of this paper or similar arguments that others have independently given on this forum (e.g., Stuart Armstrong posted about an almost identical scenario; Abram Demski's post on Dutch-Booking CDT also has some similar ideas) and elsewhere (e.g., Spencer (forthcoming) and Ahmed (unpublished) both make arguments that resemble some points from our paper).

Our paper focuses on the following simple scenario which can be used to, you guessed it, extract money from causal decision theorists:

Adversarial Offer: Two boxes, $B_{1}$ and $B_{2}$ , are on

... (read 161 more words →)

Moral realism and AI alignment

Caspar Oesterheld

“Abstract”: Some have claimed that moral realism – roughly, the claim that moral claims can be true or false – would, if true, have implications for AI alignment research, such that moral realists might approach AI alignment differently than moral anti-realists. In this post, I briefly discuss different versions of moral realism based on what they imply about AI. I then go on to argue that pursuing moral-realism-inspired AI alignment would bypass philosophical and help resolve non-philosophical disagreements related to moral realism. Hence, even from a non-realist perspective, it is desirable that moral realists (and others who understand the relevant realist perspectives well enough) pursue moral-realism-inspired AI alignment research.

Naturalized induction – a challenge for evidential and causal decision theory

Caspar Oesterheld

As some of you may know, I disagree with many of the criticisms leveled against evidential decision theory (EDT). Most notably, I believe that Smoking lesion-type problems don't refute EDT. I also don't think that EDT's non-updatelessness leaves a lot of room for disagreement, given that EDT recommends immediate self-modification to updatelessness. However, I do believe there are some issues with run-of-the-mill EDT. One of them is naturalized induction. It is in fact not only a problem for EDT but also for causal decision theory (CDT) and most other decision theories that have been proposed in- and outside of academia. It does not affect logical decision theories, however.

The role of naturalized induction in

... (read 1890 more words →)

Invitation to comment on a draft on multiverse-wide cooperation via alternatives to causal decision theory (FDT/UDT/EDT/...)

Caspar Oesterheld

I have written a paper about “multiverse-wide cooperation via correlated decision-making” and would like to find a few more people who’d be interested in giving a last round of comments before publication. The basic idea of the paper is described in a talk you can find here. The paper elaborates on many of the ideas and contains a lot of additional material. While the talk assumes a lot of prior knowledge, the paper is meant to be a bit more accessible. So, don’t be disheartened if you find the talk hard to follow — one goal of getting feedback is to find out which parts of the paper could be made more... (read more)


Caspar Oesterheld

Proper scoring rules don’t guarantee predicting fixed points

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Stop-gradients lead to fixed point predictions

Two-boxing, smoking and chewing gum in Medical Newcomb problems

Caspar Oesterheld

Conceptual reasoning dataset v0.1 available (AI for AI safety/AI for philosophy)

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Stop-gradients lead to fixed point predictions

Proper scoring rules don’t guarantee predicting fixed points

Extracting Money from Causal Decision Theorists

Moral realism and AI alignment

The law of effect, randomization and Newcomb’s problem
