Comments

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets $(x_1, x_2, x_3)$. You want to train a model that predicts $x_2$ from $x_1$ and $x_3$ from $x_1, x_2$.

Your model consists of three components: $f, g_1, g_2$. It makes predictions as follows:

$y = f(x_1)$
$\hat{x}_2 = g_1(y)$
$\hat{x}_3 = g_2(y, x_2)$

(Why have such a model? Why not have two completely separate models, one for predicting $x_2$ and one for predicting $x_3$? Because it might be more efficient to use a single $f$ both for predicting $x_2$ and for predicting $x_3$, given that both predictions presumably require "interpreting" $x_1$.)

So, intuitively, it first builds an "inner representation" (embedding) of $x_1$. Then it sequentially makes predictions based on that inner representation.

Now you train $f$ and $g_1$ to minimize the prediction loss on the $(x_1, x_2)$ parts of the triplets. Simultaneously you train $f, g_2$ to minimize prediction loss on the full $(x_1, x_2, x_3)$ triplets. For example, with a prediction loss $\ell$, you update $f$ and $g_1$ with the gradients

$\nabla_{f, g_1} \ell(x_2, g_1(f(x_1)))$

and you update $f$ and $g_2$ with the gradients

$\nabla_{f, g_2} \ell(x_3, g_2(f(x_1), x_2))$.
(The $x_2$ here is the "true" $x_2$, not one generated by the model itself.)

This training pressures $g_1$ to be myopic in the second and third sense described in the post. In fact, even if we were to train $g_2$ with the $\hat{x}_2$ predicted by $g_1$ rather than the true $x_2$, $g_1$ is pressured to be myopic.

  • Type 3 myopia: Training doesn't pressure $g_1$ to output something that makes the $x_3$ follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data $x_2 = 0$ implies $x_3 = 0$, while under $x_2 = 1$, $x_3$ follows some distribution that depends in complicated ways on $x_1$. Then $g_1$ will not try to predict $x_2 = 0$ more often.
  • Type 2 myopia: $g_1$ won't try to provide useful information to $g_2$ in its output, even if it could. For example, imagine that the $x_i$s are strings representing real numbers. Imagine that $x_1$ is always a natural number, that $x_2$ is the $x_1$-th Fibonacci number and $x_3$ is the $(x_1+1)$-th Fibonacci number. Imagine further that the model representing $g_1$ is large enough to compute the $(x_1+1)$-th Fibonacci number, while the model representing $g_2$ is not. Then one way in which one might think one could achieve low predictive loss would be for $g_1$ to output the $x_1$-th Fibonacci number and then encode, for example, the $(x_1+1)$-th Fibonacci number in the decimal digits. (E.g., for $x_1 = 10$, output $55.89$.) And then $g_2$ computes the $(x_1+1)$-th Fibonacci number from the decimal digits. But the above training will not give rise to this strategy, because $g_2$ gets the true $x_2$ as input, not the one produced by $g_1$. Further, even if we were to change this, there would still be pressure against this strategy because $g_1$ ($\hat{x}_2 = g_1(y)$) is not optimized to give useful information to $g_2$. (The gradient used to update $g_1$ doesn't consider the loss on predicting $x_3$.) If it ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on $x_2$.

Of course, $g_1$ still won't be pressured to be type-1-myopic. If predicting $x_2$ requires predicting $x_3$, then $g_1$ will be trained to predict ("plan") $x_3$.

(Obviously, $g_2$ is pressured to be myopic in this simple model.)

Now what about $f$? Well, $f$ is optimized both to enable predicting $x_2$ from $x_1$ and predicting $x_3$ from $x_1, x_2$. Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $x_2$ to achieve better accuracy on $x_3$. In particular, we should expect trained models to find an efficient tradeoff between accuracy on $x_2$ and accuracy on $x_3$. When $x_2$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $x_3$.

So, $f$ is not "Type 2" myopic. Or perhaps put differently: The calculations going into predicting $x_2$ aren't optimized purely for predicting $x_2$.

However, $f$ is still "Type 3" myopic. Because the prediction made by $g_1$ isn't fed (in training) as an input to $g_2$ or the loss, there's no pressure towards making $f$ influence the output of $g_1$ in a way that has anything to do with $x_3$. (In contrast to the myopia of $g_1$, this really does hinge on not using $\hat{x}_2$ in training. If $\hat{x}_2$ mattered in training, then there would be pressure for $f$ to trick $g_1$ into performing calculations that are useful for predicting $x_3$. Unless you use stop-gradients...)
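
A minimal numeric sketch of the gradient flow in this toy model, assuming scalar linear components and a squared-error loss (both are illustrative choices that the comment above doesn't specify; the names `params`, `loss_x2`, `loss_x3` and `finite_diff` are hypothetical). It checks by finite differences which parameters each loss can move when $g_2$ is fed the true $x_2$.

```python
# Toy model: y = f(x1), x2_hat = g1(y), x3_hat = g2(y, x2).
# Assumption: every component is a scalar linear map and the loss is squared error.

def f(params, x1):
    return params["f"] * x1

def g1(params, y):
    return params["g1"] * y

def g2(params, y, x2):
    return params["g2_y"] * y + params["g2_x"] * x2

def loss_x2(params, triplet):
    x1, x2, _ = triplet
    return (g1(params, f(params, x1)) - x2) ** 2

def loss_x3(params, triplet):
    x1, x2, x3 = triplet
    # g2 is fed the true x2, not g1's prediction.
    return (g2(params, f(params, x1), x2) - x3) ** 2

def finite_diff(loss, params, name, triplet, eps=1e-6):
    """Central finite-difference estimate of d loss / d params[name]."""
    up = dict(params, **{name: params[name] + eps})
    down = dict(params, **{name: params[name] - eps})
    return (loss(up, triplet) - loss(down, triplet)) / (2 * eps)

params = {"f": 0.5, "g1": 1.0, "g2_y": 0.3, "g2_x": 0.7}
triplet = (2.0, 3.0, 5.0)

for name in params:
    print(name,
          "d loss_x2:", round(finite_diff(loss_x2, params, name, triplet), 3),
          "d loss_x3:", round(finite_diff(loss_x3, params, name, triplet), 3))
```

Under these assumptions the `g1` parameter gets a zero gradient from `loss_x3`, the `g2_*` parameters get a zero gradient from `loss_x2`, and only `f` is pulled by both losses, which is where the capacity tradeoff comes from.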

* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.

At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think "Who was born in Ulm?" might be answered correctly by more people than "Where was Einstein born?", even though "Einstein was born in Ulm" is a more common sentence for people to read than "Ulm is the city that Einstein was born in".

If I had to run an experiment to test whether similar effects apply in humans, I'd probably try to find cases where A and B in and of themselves are equally salient but the association A -> B is nonetheless more salient than the association B -> A. The alphabet is an example of this (where the effect is already confirmed).

I mean, translated to algorithmic description land, my claim was: It's often difficult to prove a negative and I think the non-existence of a short algorithm to compute a given object is no exception to this rule. Sometimes someone wants to come up with a simple algorithm for a concept for which I suspect no such algorithm to exist. I usually find that I have little to say and can only wait for them to try to actually provide such an algorithm.

So, I think my comment already contained your proposed caveat. ("The concept has K complexity at least X" is equivalent to "There's no algorithm of length <X that computes the concept.")

Of course, I do not doubt that it's in principle possible to know (with high confidence) that something has high description length. If I flip a coin n times and record the results, then I can be pretty sure that the resulting binary string will take at least ~n bits to describe. If I see the graph of a function and it has 10 local minima/maxima, then I can conclude that I can't express it as a polynomial of degree <10. And so on. 

I think I sort of agree, but...

It's often difficult to prove a negative and I think the non-existence of a crisp definition of any given concept is no exception to this rule. Sometimes someone wants to come up with a crisp definition of a concept for which I suspect no such definition to exist. I usually find that I have little to say and can only wait for them to try to actually provide such a definition. And sometimes I'm surprised by what people can come up with. (Maybe this is the same point that Roman Leventov is making.)

Also, I think there are many different ways in which concepts can be crisp or non-crisp. I think cooperation can be made crisp in some ways and not in others.

For example, I do think that (in contrast to human values) there are approximate characterizations of cooperation that are useful, precise and short. For example: "Cooperation means playing Pareto-better equilibria."

One way in which I think cooperation isn't crisp, is that you can give multiple different sensible definitions that don't fully agree with each other. (For example, some definitions (like the above) will include coordination in fully cooperative (i.e., common-payoff) games, and others won't.) I think in that way it's similar to comparing sets by size, where you can give lots of useful, insightful, precise definitions that disagree with each other. For example, bijection, isomorphism, and the subset relationship can each tell us when one set is larger than or as large as another, but they sometimes disagree and nobody expects that one can resolve the disagreement between the concepts or arrive at "one true definition" of whether one set is larger than another.

When applied to the real world rather than rational agent models, I would think we also inherit fuzziness from the application of the rational agent model to the real world. (Can we call the beneficial interaction between two cells cooperation? Etc.)

I guess we have talked about this a bunch last year, but since the post has come up again...

It then becomes clear what the requirements are besides “I believe we have compatible DTs” for Arif to believe there is decision-entanglement:

“I believe we have entangled epistemic algorithms (or that there is epistemic-entanglement[5], for short)”, and
“I believe we have been exposed to compatible pieces of evidence”.

I still don't understand why it's necessary to talk about epistemic algorithms and their entanglement as opposed to just talking about the beliefs that you happen to have (as would be normal in decision and game theory).

Say Alice has epistemic algorithm A with inputs x that gives rise to beliefs b and Bob has a completely different [ETA: epistemic] algorithm A' with completely different inputs x' that happens to give rise to beliefs b as well. Alice and Bob both use decision algorithm D to make decisions. Part of b is the belief that Alice and Bob have the same beliefs and the same decision algorithm. It seems that Alice and Bob should cooperate. (If D is EDT/FDT/..., they will cooperate.) So it seems that the whole A,x,A',x' stuff just doesn't matter for what they should do. It only matters what their beliefs are. My sense from the post and past discussions is that you disagree with this perspective and that I don't understand why.

(Of course, you can talk about how in practice, arriving at the right kind of b will typically require having similar A, A' and similar x, x'.)

(Of course, you need to have some requirement to the effect that Alice can't modify her beliefs in such a way that she defects without (non-causally) making it much more likely that Bob also defects. But I view this as an assumption about decision-theoretic, not epistemic, entanglement: I don't see why an epistemic algorithm (in the usual sense of the word) would make such self-modifications.)

Three months later, I still find that:
a) Bing Chat has a lot of issues that the ChatGPTs (both 3.5 and 4) don't seem to suffer from nearly as much. For example, it often refuses to answer prompts that are pretty clearly harmless.
b) Bing Chat has a harder time than I expected when answering questions that you can answer by copy-and-pasting the question into Google and then copy-and-pasting the right numbers, sentence or paragraph from the first search result. (Meanwhile, I find that Bing Chat's search still works better than the search plugins for ChatGPT 4, which seem to still have lots of mundane technical issues.) Occasionally ChatGPT (even ChatGPT 3.5) gives better (more factual or relevant) answers "from memory" than Bing Chat gives by searching.

However, when I pose very reasoning-oriented tasks to Bing Chat (i.e., tasks that mostly aren't about searching on Google) (and Bing Chat doesn't for some reason refuse to answer and doesn't get distracted by unrelated search results it gets), it seems clear that Bing Chat is more capable than ChatGPT 3.5, while Bing Chat and ChatGPT 4 seem similar in their capabilities. I pose lots of tasks that (in contrast to variants of Monty Hall (which people seem to be very interested in), etc.) I'm pretty sure aren't in the training data, so I'm very confident that this improvement isn't primarily about memorization. So I totally buy that people who asked Bing Chat the right questions were justified in being very confident that Bing Chat is based on a newer model than ChatGPT 3.5.

Also:
>I've tried (with little success) to use Bing Chat instead of Google Search.
I do now use Bing Chat instead of Google Search for some things, but I still think Bing Chat is not really a game changer for search itself. My sense is that Bing Chat doesn't/can't comb through pages and pages of different documents to find relevant info and that it also doesn't do one search to identify relevant search terms for a second search, etc. (Bing Chat seems to be restricted to a few (three?) searches per query.) For the most part it seems to enter obvious search terms into Bing Search and then give information based on the first few results (even if those don't really answer the question or are low quality). The much more important feature from a productivity perspective is the processing of the information it finds, such as the processing of the information on some given webpage into a bibtex entry or applying some method from Stack Exchange to the particularities of one's code.

Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)

Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these topics. Please correct me if I’m misunderstanding anything! The writing is not very polished -- sorry!

Ignoring all the sequential stuff, my understanding is that the first paper basically does this: First, we train a model to predict utilities after observing actions, i.e., make predictions conditional on actions. So in particular, we get a function a ---> E[utility | a] that maps an observed action by the agent onto a prediction of future reward/utility. Then if we use some procedure to find the action a that maximizes E[utility | a], it seems that we have an EDT agent. I think this is essentially the case of an “EDT overseer” who rewards based on actions (rather than outcomes) in “Approval-directed agency and the decision theory of Newcomb-like problems”. Also see the discussion of Obstacle 1 in "Two Major Obstacles for Logical Inductor Decision Theory".

Now what could go wrong with this? I think in some sense the problem is generally that it's unclear how the predictive model works, or where it comes from. The second paper (the DeepMind one) basically points out one issue with this. Other issues are known to this community. I’ll start with an issue that has been known to this community: the 5 and 10 problem / the problem of counterfactuals. If the agent always (reliably) chooses the action a that maximizes E[utility | a], then the predictive model’s counterfactual predictions (i.e., predictions for all other actions) could be nonsensical without being strictly speaking wrong. So for example, in 5 and 10, you choose between a five dollar bill and a ten dollar bill. (There’s no catch and you should clearly just take the ten dollar bill.) The model predicts that if you take the five dollar bill, you will get five dollars, and (spuriously / intuitively falsely) that if you take the ten dollar bill, you get nothing. Because you are maximizing expected utility according to this particular predictive model, you take the five dollars. So the crazy prediction for what happens if you take the ten dollars is never falsified.

In non-Newcomb-like scenarios, a simple, extremely standard solution to this problem is to train the predictive model (the thing that gives a ---> E[utility | a]) while the agent follows some policy that randomizes over all actions (perhaps one that takes actions with probabilities in proportion to the model's predictions E[utility | a]). My understanding is that this is how the first paper avoids these issues and gives good results. Unfortunately, in Newcomb-like problems these approaches tend to lead to pretty CDT-ish behavior, as shown in "Reinforcement Learning in Newcomblike Environments".
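
Here is a minimal sketch of the 5 and 10 problem and the exploration fix, assuming a deterministic environment and a lookup-table predictor that is simply overwritten with the observed payoff (the names `PAYOFF`, `train` and `epsilon`, and the epsilon-greedy rule itself, are illustrative assumptions rather than anything taken from the papers). With `epsilon=0.0` the spurious prediction for the ten dollar bill is never tested; with some exploration it gets corrected.

```python
import random

PAYOFF = {"five": 5.0, "ten": 10.0}  # deterministic environment, no catch

def train(epsilon, steps=200, seed=0):
    """Return the learned a -> E[utility | a] table after `steps` decisions."""
    rng = random.Random(seed)
    # Spurious initial prediction: taking the ten dollar bill yields nothing.
    predicted = {"five": 5.0, "ten": 0.0}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(sorted(PAYOFF))  # explore: uniform over actions
        else:
            action = max(predicted, key=predicted.get)  # exploit the model
        # Only the prediction for the action actually taken gets corrected.
        predicted[action] = PAYOFF[action]
    return predicted

greedy = train(epsilon=0.0)
exploring = train(epsilon=0.2)
print("greedy:", greedy)
print("exploring:", exploring)
```

The greedy agent's table is never strictly speaking wrong about anything it observes, which is the sense in which the crazy counterfactual prediction goes unfalsified.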

Anyway, the second paper (the DeepMind one) points out another issue related to where the E[utility | action] model comes from. Roughly, the story — which I think is very well described in Section 2 — seems to be the following: the E[utility | action] model is trained on the actions of an expert who knows whether X=1,2 and acts on that fact by choosing A=X; then the E[utility | action] model won't work for a non-expert agent, i.e., one who doesn’t observe X. I view this as a distributional shift issue — you train a model (the a ---> E[utility | a] one) in a setting where A=X, and then you apply it in a setting where sometimes A and X are uncorrelated.

It’s also similar to the Smoking Lesion/medical Newcomb-like problems! Consider the following medical Newcomb-like problem: First we learn the fact that sick people go to the doctor and healthy people don’t go to the doctor. Then without looking at how healthy I am, I don’t go to the doctor so as to gain evidence that I am healthy. Arguably what goes wrong here is also that I’m using a rule for prediction out of distribution on someone who doesn’t look at whether they’re sick. I think it relates to one of the least challenging versions of medical Newcomb-like problems and it’s handled comfortably by the so-called tickle defense.

Interlude: The paper talks about how this relates to hallucination in LLMs. So what’s that about? IIUC, the idea is that when generating text, LLMs incorrectly update based on the text they generate themselves. For example, imagine that you want an LLM to generate ten tokens. Then after generating the first nine tokens $x_1, \ldots, x_9$, it will predict the tenth token from its learned distribution $P(x_{10} \mid x_1, \ldots, x_9)$. But this distribution was trained on fully human-written, not LLM-written, text. So (in my way of thinking), $P(x_{10} \mid x_1, \ldots, x_9)$ might do poorly (i.e., not give a human-like continuation of $x_1, \ldots, x_9$), because it was trained on seeing nine tokens created by a human and having to predict a continuation by a human rather than nine tokens by itself/an LLM and having to predict a continuation by a human. For example, we might imagine that if $x_1, \ldots, x_9$ are words that only a human expert confident in a particular claim C would say, then the LLM will predict continuations that confidently defend claim C, even if the LLM doesn’t know anything about C. I'm not sure I really buy this explanation of hallucination. I think the claim would need more evidence than the authors provide. But it's definitely a very interesting point.

Now, back to the original toy model. Again, I would view this as a distribution shift problem. If we make some assumptions, though, we can infer/guess a model (i.e. function a ---> E[utility | a]) that predicts the utility obtained by a non-expert, i.e., an agent who doesn't observe X. Specifically, let’s assume that we are told the conditional distributions P(utility | X=1, A=0) and P(utility | X=0, A=1) (which we never see in training if the agent in training always knows and acts on X). Let’s also assume that we know that the difference between the training distribution and the new setting is that in the new setting the agent chooses A independently of X. Then in the new model we just need to make X and A independent and change nothing else. Formally you use the new distribution P’(X,U|A) = P(X)P(U|A,X), where the Ps on the right-hand side are just the old distribution, instead of P(X,U|A) = P(X|A)P(U|A,X).

It turns out that if we put the original distribution into a causal graph with edges X->A and A->U and X->U and then make a do-intervention on A (a la Pearl), then we get this exact distribution, i.e., P(X,U|do(A)) = P’(X,U|A). (Intuitively, removing the inference from A to X is exactly what the do(A) does if A's parent is X.) So in particular maximizing E[U | do(A)] gives the same result as maximizing E’[U|A]. Anyway, the paper uses the do operator to construct the new predictor, rather than the above argument. They seem to claim that the causal structure (or reasoning about causality) is necessary to construct the new predictor, with which I disagree.
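
To make the difference between $P$ and $P'$ concrete, here is a small sketch with assumed toy numbers: X is uniform on {0, 1}, the expert always plays A = X, and utility is 1 when the action matches X and 0 otherwise. The off-policy entries of `E_U_GIVEN_X_A` (A different from X) are exactly the ones that have to be supplied by assumption, since they never show up in training. All names and numbers here are hypothetical.

```python
# Assumed toy numbers: X is uniform on {0, 1}, the expert plays A = X,
# and utility is 1 when the action matches X and 0 otherwise.
P_X = {0: 0.5, 1: 0.5}
P_A_GIVEN_X = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}  # P_A_GIVEN_X[x][a]
E_U_GIVEN_X_A = {(x, a): float(x == a) for x in P_X for a in (0, 1)}

def expected_utility_conditional(a):
    """E[U | A=a] under the training distribution: weights P(X | A=a)."""
    joint = {x: P_X[x] * P_A_GIVEN_X[x][a] for x in P_X}
    p_a = sum(joint.values())
    return sum(joint[x] / p_a * E_U_GIVEN_X_A[(x, a)] for x in P_X)

def expected_utility_independent(a):
    """E'[U | A=a] when A is chosen independently of X: weights P(X)."""
    return sum(P_X[x] * E_U_GIVEN_X_A[(x, a)] for x in P_X)

for a in (0, 1):
    print(a, expected_utility_conditional(a), expected_utility_independent(a))
```

The second function only swaps $P(X \mid A)$ for $P(X)$ and changes nothing else, which is the same quantity the do-intervention on A yields in the graph with edges X->A, A->U and X->U.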

Is this really CDT? I’m not sure… In the above type of case, this doesn’t come apart from EDT. If we buy that their scenario is a bit like a Smoking Lesion, then one could argue that part of the point of CDT is to solve this type of scenario. (In some sense my response is as in most versions of the Smoking Lesion: Because of the tickle defense, EDT applied properly gets this right anyway, so there’s actually nothing to fix here.) In my view it’s basically just about using the do-calculus to concisely specify the scenario P’ (based on P plus a particular causal graph for P). It seems that one can do these things without being committed to using do(A) in a scenario where there’s some non-causal dependence between A and U (that doesn't disappear outside of training), perhaps via some common cause Y. In any case, the paper doesn’t tell us how to distinguish between U <- Y -> A and A -> Y -> U — all causal relationships are assumed. So while nominally they construct their predictor as E[U | do(A)], it’s a bit unclear how wedded they are to CDT.

Anyway, with a (maybe-causalist) E[U | do(A)] in hand, we can of course build a (maybe-)CDT agent by choosing a to maximize E[U | do(A)]. But I think the paper doesn’t say anything about where to get the causal model from that gives us E[U | do(A)]. They pretty much assume that the model is provided.

I think the “counterfactual teaching” stuff doesn’t really say anything about CDT versus EDT, either. IIUC the basic idea is this. Imagine you want to train an LLM and you want to prevent the issue above. Then intuitively (in my distribution shift view) what we need to do is just train the LLM to make a good prediction $P(x_{n+1} \mid x_1, \ldots, x_n)$ upon observing $x_1, \ldots, x_n$ that were generated by itself (rather than humans). The simplest, most obvious way to do this is to let the LLM generate some tokens $x_1, \ldots, x_n$, then get a probabilistic prediction about the next token from the LLM and then ask a human to give a next token $x_{n+1}$. The loss of the LLM is just the, e.g., log loss of its prediction against the $x_{n+1}$ provided by the human. One slightly tricky point here is that we only train the LLM to make good predictions on $x_{n+1}$. We don’t want to train it to output $x_1, \ldots, x_n$ that make $x_{n+1}$ easier to predict. So we need to be careful to choose the right gradient. I think that’s basically all they’re doing, though. It doesn’t seem like there’s anything causalist here.

So, in conclusion: While very interesting, I don't think these papers tell us anything new about how to build an EDT or a CDT agent.

Nice overview! I mostly agree.

>What I do not expect is something I’d have been happy to pay $500 or $1,000 for, but not $3,500. Either the game will be changed, or it won’t be changed quite yet. I can’t wait to find out.

From context, I assume you're saying this about the current iteration?

I guess willingness to pay for different things depends on one's personal preferences, but here's an outcome that I find somewhat likely (>50%):

  • The first-gen Apple Vision Pro will not be very useful for work, aside from some niche tasks.
    • It seems that to be better than a laptop for working at a coffee shop or something they need to have solved ~10 different problems extremely well and my guess is that they will have failed to solve one of them well enough. For example, I think comfort/weight alone has a >30% probability of making this less enjoyable to work with (for me at least) than with a laptop, even if all other stuff works fairly well.
    • Like you, I'm sometimes a bit puzzled by what Apple does. So I could also imagine that Apple screws up something weird that isn't technologically difficult. For example, the first version of iPad OS was extremely restrictive (no multitasking/splitscreen, etc.). So even though the hardware was already great, it was difficult to use it for anything serious and felt more like a toy. Based on what they emphasize on the website, I could very well imagine that they won't focus on making this work and that there'll be some basic, obvious issue like not being able to use a mouse. If Apple had pitched this more in the way that Spacetop was pitched, I'd be much more optimistic that the first gen will be useful for work.
  • The first-gen Apple Vision Pro will still produce lots of extremely interesting experiences that many people would be happy to pay, say, $1,000 for, but not $3,500 and definitely not much more than $3,500. For example, I think all the reviews I've seen have described the experience as very interesting, intense and immersive. Let's say this novelty value wears off after something like 10h. Then a family of four gets 40h of fun out of it. If you're happy to spend on the order of $10 per hour per person for a fun new experience (that's roughly what you'd spend to go to the movie theater, for example), then that'd be a willingness to pay in the hundreds of dollars (40h at $10 per hour is $400).

>All accounts agree that Apple has essentially solved issues with fit and comfort.

Besides the 30min point, is it really true that all accounts agree on that? I definitely remember reading in at least two reports something along the lines of, "clearly you can't use this for hours, because it's too heavy". Sorry for not giving a source!
