Caspar Oesterheld

Comments

Nice overview! I mostly agree.

>What I do not expect is something I’d have been happy to pay $500 or $1,000 for, but not $3,500. Either the game will be changed, or it won’t be changed quite yet. I can’t wait to find out.

From context, I assume you're saying this about the current iteration?

I guess willingness to pay for different things depends on one's personal preferences, but here's an outcome that I find somewhat likely (>50%):

  • The first-gen Apple Vision Pro will not be very useful for work, aside from some niche tasks.
    • It seems that to be better than a laptop for working at a coffee shop or something they need to have solved ~10 different problems extremely well and my guess is that they will have failed to solve one of them well enough. For example, I think comfort/weight alone has a >30% probability of making this less enjoyable to work with (for me at least) than with a laptop, even if all other stuff works fairly well.
    • Like you, I'm sometimes a bit puzzled by what Apple does. So I could also imagine that Apple screws up something weird that isn't technologically difficult. For example, the first version of iPad OS was extremely restrictive (no multitasking/splitscreen, etc.). So even though the hardware was already great, it was difficult to use it for anything serious and felt more like a toy. Based on what they emphasize on the website, I could very well imagine that they won't focus on making this work and that there'll be some basic, obvious issue like not being able to use a mouse. If Apple had pitched this more in the way that Spacetop was pitched, I'd be much more optimistic that the first gen will be useful for work.
  • The first-gen Apple Vision Pro will still produce lots of extremely interesting experiences, so that many people would be happy to pay, say, $1,000 for it, but not $3,500 and definitely not much more than $3,500. For example, I think all the reviews I've seen have described the experience as very interesting, intense and immersive. Let's say this novelty value wears off after something like 10h. Then a family of four gets 40h of fun out of it. Say you're happy to spend on the order of $10 per hour per person for a fun new experience (that's roughly what you'd spend to go to the movie theater, for example). Then that'd be a willingness to pay in the hundreds of dollars (roughly 40h × $10/h = $400).

>All accounts agree that Apple has essentially solved issues with fit and comfort.

Besides the 30min point, is it really true that all accounts agree on that? I definitely remember reading in at least two reports something along the lines of, "clearly you can't use this for hours, because it's too heavy". Sorry for not giving a source!

Do philosophers commonly use the word "intention" to refer to mental states that have intentionality, though? For example, from the SEP article on intentionality:

>intention and intending are specific states of mind that, unlike beliefs, judgments, hopes, desires or fears, play a distinctive role in the etiology of actions. By contrast, intentionality is a pervasive feature of many different mental states: beliefs, hopes, judgments, intentions, love and hatred all exhibit intentionality.

(This is specifically where it talks about how intentionality and the colloquial meaning of intention must not be confused, though.)

Ctrl+f-ing through the SEP article gives only one mention of "intention" that seems to refer to intentionality. ("The second horn of the same dilemma is to accept physicalism and renounce the 'baselessness' of the intentional idioms and the 'emptiness' of a science of intention.") The other few mentions of "intention" seem to talk about the colloquial meaning. The article seems to generally avoid the word "intention" and instead uses "intentional" and "intentionality".

Incidentally, there's also an SEP article on "intention" that does seem to be about what one would think it to be about. (E.g., the first sentence of that article: "Philosophical perplexity about intention begins with its appearance in three guises: intention for the future, as I intend to complete this entry by the end of the month; the intention with which someone acts, as I am typing with the further intention of writing an introductory sentence; and intentional action, as in the fact that I am typing these words intentionally.")

So as long as we don't call it "artificial intentionality research" we might avoid trouble with the philosophers after all. I suppose the word "intentional" becomes ambiguous, however. (It is used >100 times in both SEP articles.)

>They could have turned their safety prompts into a new benchmark if they had ran the same test on the other LLMs! This would've taken, idk, 2–5 hrs of labour?

I'm not sure I understand what you mean by this. They ran the same prompts with all the LLMs, right? (That's what Figure 1 is...) Do you mean they should have tried the finetuning on the other LLMs as well? (I've only read your post, not the actual paper.) And how does this relate to turning their prompts into a new benchmark?

>I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Sorry if I was cryptic! Yes, it's basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because "quasi-strictly proper scoring rule w.r.t. the max decision rule" is a mouthful. :-P) Does that help?

>much safer than having it effectively chosen for them by their specification of a utility function

So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function not more! But on to the main point...

>I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

Let's grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max decision rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other.

Maybe put differently: I still don't know what property it is that you think favors zero-sum conditional prediction over proper decision scoring rules. I don't think it can be not wanting to specify your utility function / not wanting the agent to pick actions based on their model of your utility function / wanting to instead choose yourself based on reported distributions, because both methods can be used in this way. Also, note that in both methods the predictors in practice have incentives that are determined by (their beliefs about) the human's values. For example, in zero-sum conditional prediction, each predictor is incentivized to run computations to evaluate actions that it thinks could potentially be optimal w.r.t. human values, and not incentivized to think about actions that it confidently thinks are suboptimal. So for example, if I have the choice between eating chocolate ice cream, eating strawberry ice cream and eating mud, then the predictor will reason that I won't choose to eat mud and that therefore its prediction about mud won't be evaluated. Therefore, it will probably not think much about what it will be like if I eat mud (though it has to think about it a little to make sure that the other predictor can't gain by recommending mud eating).

On whether the property is desirable [ETA: I here mean the property: [human chooses based on reported distribution] but not compared to [explicitly specifying a utility function]]: Perhaps my objection is just what you mean by ELK. In any case, I think my views depend a bit on how we imagine lots of different aspects of the overall alignment scheme. One important question, I think, is how exactly we imagine the human to "look at" the distributions, for example. But my worry is that (similar to RLHF) letting the human evaluate distributions rather than outcomes increases the predictors' incentives to deceive the human. The incentive is to find actions whose distribution looks good (in whatever format you represent the distribution) in relation to the other distributions, not actions whose distributions are good. Given that the distributions are so large (and less importantly because humans have lots of systematic, exploitable irrationalities related to risk), I would think that human judgment of single outcomes/point distributions is much better than human judgment of full distributions.

>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it.

Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument.

So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor's conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don't need to specify your utility function.
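To make the single-predictor variant above concrete, here is a minimal sketch. It assumes outcomes are finite and distributions are given as dictionaries. The particular scoring function (built from f = exp, which is convex and increasing) is my illustrative choice of one rule with the required property, not necessarily the form used in the paper. All names are hypothetical.

```python
import math

def expected_utility(dist, utility):
    """Expected utility of a distribution given as {outcome: probability}."""
    return sum(p * utility[o] for o, p in dist.items())

def choose_action(reports, utility):
    """The decision maker applies her own utility function to the reported
    conditional distributions and takes the action that looks best."""
    return max(reports, key=lambda a: expected_utility(reports[a], utility))

def decision_score(reported_eu, realized_utility):
    """Score f(y_hat) + f'(y_hat) * (y - y_hat) with f = exp.
    f = exp is an assumed example of a convex, increasing function."""
    return math.exp(reported_eu) * (1.0 + realized_utility - reported_eu)

# Hypothetical example data.
utility = {"happy": 1.0, "sad": 0.0}
reports = {
    "chocolate": {"happy": 0.9, "sad": 0.1},
    "mud": {"happy": 0.05, "sad": 0.95},
}

action = choose_action(reports, utility)
reported_eu = expected_utility(reports[action], utility)
# The score is linear in the realized utility, so the expected score under the
# true distribution is decision_score(reported_eu, true_expected_utility).
print(action, decision_score(reported_eu, reported_eu))
```

Because the score is linear in the realized utility and f is strictly convex, misreporting the expected utility of the chosen action lowers the expected score. Because f is increasing, steering the decision maker towards a worse action lowers it as well.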

One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome -- the one obtained after choosing the recommended action -- and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn't this much harder? (If you're doing expected utility maximization -- doesn't your approach entail assigning probabilities to all hypothetical outcomes?) In realistic settings, these outcome distributions are huge objects!

The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome distributions.

Let's say that you have some arbitrary (potentially wacky discontinuous) function V that maps a distribution over outcomes onto a real value representing how much you like the distribution over outcomes. Then you can do zero-sum competition as normal and select the action for which V is highest (as usual with "optimism bias", i.e., if the two predictors make different predictions for an action a, then take the maximum of the Vs of the two predicted distributions). This should still be incentive compatible and result in taking the action that is best in terms of V applied to the predictors' belief.
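Here is a rough sketch of selecting an action with an arbitrary V under optimism bias. I'm assuming the zero-sum payoff is each predictor's log score minus the other's on the action actually taken. The discontinuous V below is a made-up example, and all names are hypothetical.

```python
import math

def choose_with_optimism(reports_1, reports_2, V):
    """Pick the action whose more optimistic report has the highest V."""
    def optimistic_value(action):
        return max(V(reports_1[action]), V(reports_2[action]))
    return max(reports_1, key=optimistic_value)

def log_score(report, outcome):
    """Logarithmic scoring rule (strictly proper)."""
    return math.log(report[outcome])

def zero_sum_payoffs(reports_1, reports_2, action, outcome):
    """Each predictor receives its own score minus the other's (assumed form)."""
    diff = log_score(reports_1[action], outcome) - log_score(reports_2[action], outcome)
    return diff, -diff

def V_example(dist):
    """Made-up discontinuous preference: probability of 'good',
    but worthless if 'disaster' has more than 1% probability."""
    return dist["good"] if dist["disaster"] <= 0.01 else 0.0

reports_1 = {
    "a": {"good": 0.7, "bad": 0.3, "disaster": 0.0},
    "b": {"good": 0.8, "bad": 0.15, "disaster": 0.05},
}
reports_2 = {
    "a": {"good": 0.6, "bad": 0.4, "disaster": 0.0},
    "b": {"good": 0.8, "bad": 0.15, "disaster": 0.05},
}

action = choose_with_optimism(reports_1, reports_2, V_example)
print(action, zero_sum_payoffs(reports_1, reports_2, action, "good"))
```

Note that the decision maker only ever evaluates V on reported distributions. She never has to hand a utility function to the predictors.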

(Of course, one could have even crazier preferences. For example, one's preferences could just be a function that takes as input a set of distributions and selects one distribution as its favorite. But I think if this preference function is intransitive, doesn't satisfy independence of irrelevant alternatives and the like, it's not so clear whether the proposed approach still works. For example, you might be able to slightly misreport some option that will not be taken anyway in such a way as to ensure that the decision maker ends up taking a different action. I don't think this is ever strictly incentivized. But it's not strictly disincentivized to do this.)

Interestingly, if V is a strictly convex function over outcome distributions (why would it be? I don't know!), then you can strictly incentivize a single predictor to report the best action and honestly report the full distribution over outcomes for that action! Simply use the scoring rule $S(\hat{P}, Q) = V(\hat{P}) + \langle \nabla V(\hat{P}), Q - \hat{P} \rangle$, where $\hat{P}$ is the reported distribution for the recommended action, $Q$ is the true distribution of the recommended action and $\nabla V$ is a subderivative of $V$. Because a proper scoring rule is used, the expert will be incentivized to report $\hat{P} = Q$ and thus gets a score of $V(Q)$, where $Q$ is the distribution of the recommended action. So it will recommend the action $a$ whose associated distribution maximizes $V$. It's easy to show that if $V$ -- the function saying how much you like different distributions -- is not strictly convex, then you can't construct such a scoring rule. If I recall correctly, these facts are also pointed out in one of the papers by Chen et al. on this topic.
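As a sanity check, here is a small sketch of the standard way to build a proper scoring rule from a convex function: score a report by the value of V at the report plus the gradient of V at the report applied to (realized outcome indicator minus report). I'm assuming V(p) = sum of squared probabilities purely for illustration (this yields the quadratic score). The function names are hypothetical.

```python
def V(p):
    """Illustrative strictly convex preference over distributions."""
    return sum(x * x for x in p)

def grad_V(p):
    """Gradient of V at p."""
    return [2.0 * x for x in p]

def score(report, outcome_index):
    """V(report) + <grad V(report), e_outcome - report>."""
    g = grad_V(report)
    return V(report) + g[outcome_index] - sum(gi * ri for gi, ri in zip(g, report))

def expected_score(report, true_dist):
    """Expected score when outcomes are drawn from true_dist."""
    return sum(q * score(report, i) for i, q in enumerate(true_dist))

true_dist = [0.7, 0.2, 0.1]
# Honest reporting yields V(true_dist); any other report yields less.
print(expected_score(true_dist, true_dist), V(true_dist))
print(expected_score([0.5, 0.3, 0.2], true_dist))
```

Since an honest predictor's expected score for an action equals V of that action's true distribution, the predictor does best by recommending the action whose distribution has the highest V.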

I don't find this very important, because I find expected utility maximization w.r.t. the predictors' beliefs much more plausible than anything else. But if nothing else, this difference further shows that the proposed method is fundamentally different and more capable in some ways than other methods (like the one I proposed in my comment).

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with I think two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsidizing would mean scoring relative to some initial prediction $p_0$ provided by the market. Because the initial prediction might differ in how bad it is for different actions, the predictors might prefer a particular action to be taken. Conversely, the predictors might have no incentive to correct an overly optimistic prediction for one of the actions if doing so causes that action not to be taken. The examples in Section 3.2 of the Othman and Sandholm paper show these things.
- The second is "optimism bias" (a good thing in this context): "If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one." (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don't have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant's post "Two Major Obstacles for Logical Inductor Decision Theory": One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you're still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).


My work on decision scoring rules describes the best you can get out of a single predictor. Basically you can incentivize a single predictor to tell you what the best action is and what the expected utility of that action is, but nothing more (aside from some degenerate cases).

Your result shows that if you have two predictors with the same information, then you can get slightly more: you can incentivize them to tell you what the best action is and what the full distribution over outcomes will be if you take the action.

You also get some other stuff (as you describe starting from the sentence, "Additionally, there is a bound on how inaccurate..."). But these other things seem much less important. (You also say: "while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them." But the same is true of decision scoring rules for example.)

Here's one thing that is a bit unclear to me, though.

If you have two predictors that have the same information, there's other, more obvious stuff you can do. For example, here's one:
- Ask Predictor 1 for a recommendation for what to do.
- Ask Predictor 2 for a prediction over outcomes conditional on Predictor 1's recommendation.
- Take the action recommended by Predictor 1.
- Observe an outcome o with a utility u(o).
- Pay Predictor 1 in proportion to u(o).
- Pay Predictor 2 according to a proper scoring rule.

In essence, this is just splitting the task into two: There's the issue of making the best possible choice and there's the issue of predicting what will happen. We assign Predictor 1 to the first and Predictor 2 to the second problem. For each of these problems separately, we know what to do (use proper (decision) scoring rules). So we can solve the overall problem.
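The split described above can be sketched as follows. This is a toy version assuming finite outcomes, honest predictors, and the log score as the proper scoring rule for Predictor 2. The function names and the example data are hypothetical.

```python
import math

def recommend(beliefs, utility):
    """Predictor 1: recommend the action with the highest expected utility
    under its own beliefs ({action: {outcome: probability}})."""
    def expected_utility(action):
        return sum(p * utility[o] for o, p in beliefs[action].items())
    return max(beliefs, key=expected_utility)

def log_score(report, outcome):
    """Logarithmic scoring rule (strictly proper)."""
    return math.log(report[outcome])

def run_mechanism(beliefs_1, beliefs_2, utility, realized_outcome):
    """Run the six steps once; realized_outcome maps the taken action to an outcome."""
    action = recommend(beliefs_1, utility)       # Predictor 1 recommends
    report = beliefs_2[action]                   # Predictor 2 predicts given that action
    outcome = realized_outcome(action)           # take the action, observe o
    pay_1 = utility[outcome]                     # pay Predictor 1 in proportion to u(o)
    pay_2 = log_score(report, outcome)           # pay Predictor 2 by a proper scoring rule
    return action, report, pay_1, pay_2

utility = {"tasty": 1.0, "awful": 0.0}
beliefs = {
    "chocolate": {"tasty": 0.9, "awful": 0.1},
    "mud": {"tasty": 0.01, "awful": 0.99},
}
print(run_mechanism(beliefs, beliefs, utility, lambda action: "tasty"))
```

Predictor 2 is only ever scored on the action that Predictor 1 recommended, so its predictions for other actions are never requested or evaluated.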

So this mechanism also gets you an honest prediction and an honest recommendation for what to do. In fact, one advantage of this approach is that honesty is maintained even if Predictors 1 and 2 have _different_ information/beliefs! (You don't get any information aggregation with this (though see below). But your approach doesn't have any information aggregation either.)

As per the decision scoring rules paper, you could additionally ask Predictor 1 for an estimate of the expected utility you will obtain. You can also let Predictor 2 look at Predictor 1's prediction (or perhaps even score Predictor 2 relative to Predictor 1's prediction). (This way you'd get some information aggregation.) (You can also let Predictor 1 look at Predictor 2's predictions if Predictor 2 starts out by making conditional predictions before Predictor 1 gives a recommendation. This gets more tricky because now Predictor 2 will want to mislead Predictor 1.)

I think your proposal for what to do instead of the above is very interesting and I'm glad that we now know that this method exists and that it works. It seems fundamentally different and it seems plausible that this insight will be very useful. But is there some concrete advantage of zero-sum conditional prediction over the above method?

>First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) represents the probabilities using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the "true" network is favored even before we get into interventions.
>
>Implication of this for causal epistemics: we have two models which make the same predictions on-distribution, and only make different predictions under interventions. Yet, even without actually observing any interventions, we do have reason to epistemically favor one model over the other.

For people interested in learning more about this idea: This is described in Section 2.3 of Pearl's book Causality. The beginning of Ch. 2 also contains some information about the history of this idea. There's also a more accessible post by Yudkowsky that has popularized these ideas on LW, though it contains some inaccuracies, such as explicitly equating causal graphs and Bayes nets.
