All of Johannes Treutlein's Comments + Replies

Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:

  • I don't see a principled reason why one couldn't use one and the same model for both agents. I.e., do standard self-play training with weight sharing for this zero-sum game. Since both players have exactly the same loss function, we don't need to allow them to specialize by feeding in a player id or something like that (there exists a symmetric Nash equilibrium).
  • There is one problem with optimizing the objective in the zero-sum game via gradient descent (assuming we co
... (read more)

Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

3 Lauro Langosco 2mo
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems. (Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati... (read more)

3 Richard_Ngo 3mo
So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in aligned ways --> Aligned behavior

and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.

Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).

I am not sure I understand. Are you saying that GPT thinks the text is genuinely from the future (i.e., the distribution that it is modeling contains text from the future), or that it doesn't think so? The sentence you quote is intended to mean that it does not think the text is genuinely from the future.

2 Gurkenglas 4mo
I agree that it doesn't think the text is from the future. I am nitpicking a technical detail because the text I was line-commenting upon seemed confused. (How vexing to find this thread below the post, in the place for louder discussions!) Instead of conjecturing that it doesn't think the text is from the future, you should conjecture that it thinks the text is from the training data, because: 1. The latter implies the former. 2. We have technical reasons to believe the latter.

Thanks for your comment!

Regarding 1: I don't think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III.

Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influence... (read more)

You are right, thanks for the comment! Fixed it now.

I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what exactly is the training setup, what exactly is the model, which parts are hard-coded and which parts are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it is very effortful to try to piece this together by going through the code). Right now I have a hard time making any inferences from the results.

Update: we recently discovered the performative prediction (Perdomo et al., 2020) literature (HT Alex Pan). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most re... (read more)

I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.

I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?

3 Vaniver 6mo
Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity). Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation [https://en.wikipedia.org/wiki/File:Scoring_functions.gif], and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linear rule if p=0, and only 0.5 points on average if p=0.5; for a log rule, it's 0 points and -0.7 points.) But could you come up with a scoring rule where the maximum expected score as a function of p is flat? If true, there's no longer an incentive to have extreme probabilities. But that incentive was doing useful work before, and so this seems likely to break something else--it's probably no longer the case that you're incentivized to say your true belief--or require something like batch statistics (since I think you might be able to get something like this by scoring not individual predictions but sets of them, sorted by p or by whether they were true or false). [This can be done in some contexts with markets, where your reward depends on how close the market was to the truth before, but I think it probably doesn't help here because we're worried about the oracle's ability to affect the underlying reality, which is also an issue with prediction markets!] To be clear, I'm not at all confident this is possible or sensible--it seems likely to me that an adversarial argument goes thru where as oracle I always benefit from knowing which statements are true and which statements are false (even if I then lie about my beliefs to get a good calibration curve or w/e)--but that's not an argument about the scale of the distortions that are possible. 
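
The "maximum expected score as a function of p" described above can be checked numerically. The following is only a minimal Python sketch: the function names and the grid search over reports are illustrative choices, not anything from the comment. It assumes a binary event, a forecaster with true belief p who may report any probability q, and the linear and log rules in their usual forms.

```python
import math

def linear_score(report, outcome):
    # Linear rule: the score is the probability assigned to what happened.
    return report if outcome else 1.0 - report

def log_score(report, outcome):
    # Log rule: the score is the log of the probability assigned to what happened.
    prob = report if outcome else 1.0 - report
    return math.log(prob) if prob > 0 else -math.inf

def expected_score(rule, belief, report):
    # Expectation of the score under the forecaster's true belief.
    total = 0.0
    for outcome, weight in ((True, belief), (False, 1.0 - belief)):
        if weight > 0:  # skip impossible outcomes so 0 * (-inf) never appears
            total += weight * rule(report, outcome)
    return total

def max_expected_score(rule, belief, grid=1000):
    # Best achievable expected score over a grid of possible reports.
    reports = [i / grid for i in range(grid + 1)]
    return max(expected_score(rule, belief, q) for q in reports)

if __name__ == "__main__":
    for belief in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(belief,
              max_expected_score(linear_score, belief),
              max_expected_score(log_score, belief))
```

Under these assumptions the sketch reproduces the figures quoted in the comment: the linear rule peaks at 1 for p=0 and falls to 0.5 at p=0.5, while the log rule gives 0 at p=0 and ln(0.5), roughly -0.7, at p=0.5. A "skill-free" rule in the sense asked about would make this curve flat in p.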

Thanks for your comment!

Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function such that a dishonest prediction is optimal.
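
To illustrate the kind of situation this claim is about, here is a small Python sketch. It is a toy instance under assumptions of my own, not the construction from the paper: a binary event, the log scoring rule, and a hypothetical influence function `outcome_probability` under which the event becomes more likely the higher the prediction is. An "honest" prediction is taken to be a fixed point, where the prediction equals the probability it induces.

```python
import math

def outcome_probability(prediction):
    # Hypothetical influence function (an assumption for illustration):
    # the event's probability rises linearly with the stated prediction.
    return 0.4 + 0.5 * prediction

def expected_log_score(prediction):
    # Expected log score when the outcome is drawn from the induced probability.
    q = outcome_probability(prediction)
    return q * math.log(prediction) + (1.0 - q) * math.log(1.0 - prediction)

# Open grid over (0, 1), which avoids log(0) at the endpoints.
GRID = [i / 1000 for i in range(1, 1000)]

# Honest report: the grid point closest to a fixed point p = outcome_probability(p).
honest_prediction = min(GRID, key=lambda p: abs(outcome_probability(p) - p))

# Score-maximizing report: need not be a fixed point.
optimal_prediction = max(GRID, key=expected_log_score)

if __name__ == "__main__":
    print("honest:", honest_prediction, expected_log_score(honest_prediction))
    print("optimal:", optimal_prediction, expected_log_score(optimal_prediction))
    print("induced probability at optimum:", outcome_probability(optimal_prediction))
```

In this toy case the fixed point is at 0.8, but the score-maximizing prediction is higher and does not match the probability it induces, so a predictor optimizing its expected score would not report honestly.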

Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to h... (read more)

3 Vaniver 6mo
Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get an honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]

I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.

E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. ... (read more)

There is a chance that one can avoid having to solve ontology identification in general if one punts the problem to simulated humans. I.e., it seems one can train the human simulator without solving it, and then use simulated humans to solve the problem. One may have to solve some specific ontology identification problems to make sure one gets an actual human simulator and not e.g. a malign AI simulator. However, this might be easier than solving the problem in full generality.

Minor comment: regarding the RLHF example, one could solve the problem implicitl... (read more)

(I think Stockfish would be classified as AI in computer science. I.e., you'd learn about the basic algorithms behind it in a textbook on AI. Maybe you mean that Stockfish was non-ML, or that it had handcrafted heuristics?)

0 Jeff Rose 8mo
My understanding is that starting in late 2020 with the release of Stockfish 12, Stockfish would probably be considered AI, but before that it would not be. I am, of course, willing to change this view based on additional information. The original AlphaZero-Stockfish match was in 2017, so if the above is correct, I think referring to Stockfish as non-AI makes sense.

Great post!

I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong. ... (read more)

2 Leon Lang 8mo
Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text. 

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would... (read more)

I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.

Some random thoughts (after an in-person discussion with Erik):

  • For distributional shift and deception, there is a question of what is treated as fixed and what is varied when asking whether a certain agent has a certain property. E.g., I could keep the agent constant but put it into a new environment, and ask whether it is still aligned. Or I could keep the environmen
... (read more)
2 Erik Jenner 8mo
Thanks for the comments! I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Great post!

Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline? Of course, competitiveness is an issue in general, but the more factored cognition or IDA based approaches seem more realistic to me.

Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce r

... (read more)
2 Adam Jermyn 10mo
Thanks! I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hit x-risk territory. That said, it's totally possible that the x-risk-causing generative model happens before the model that can simulate thousands of years of history. I'm not confident in this either way. One thing that gives me hope in favor of simulating long histories is that to some extent it's "just" a matter of more compute, and if we get promising results simulating short spans of history it might not be hard to justify a lot of spending on simulating longer stretches. And there's a bright spot there too: simulating longer times likely scales sub-linearly with amount of history simulated. If you have a dynamics model then simulating for twice as long costs double the compute. If you've got a more clever model that knows how to take shortcuts/compress the dynamics you can probably do better. I'm pretty concerned about this. I said a bit about this in the "No Fixed Points" section, but basically I think you have to do something to avoid fixed points, otherwise you get all sorts of world-ending optimization pressures. If you do that, you're not allowed any recursion where the model simulates itself, and then you get stuck with the problem of how to introduce future research into the past without making a malicious AGI the most likely explanation...

Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:

Let  be the task space,  the output space,  the model space,  our base objective, and  the mesa objective of the model for input . Assume that there exists some map  mapping internal objects to outputs by the model,... (read more)

1 Johannes Treutlein 1y
These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models [https://www.alignmentforum.org/posts/LBNjeGaJZw7QdybMw/agents-over-cartesian-world-models#Input_Distribution_Problem].

Thank you!

It does seem like simulating text generated by using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful without updating them on the newest research.

In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transpa... (read more)

2 Adam Jermyn 10mo
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue). I was imagining that we train the model to predict e.g. tomorrow's newspaper given today's. The fact that it's not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.

Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.

As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.

Overall ... (read more)

Would you count issues with malign priors etc. also as issues with myopia? Maybe I'm missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.

E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this r... (read more)

"If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use... (read more)

2 Johannes Treutlein 1y
I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?
1 ioannes 4y
I'm considering doing Tucker Peck's Drug Free Sleep [https://drugfreesleep.com/], but haven't tried it yet. Interview with Tucker on CBTi [https://www.andrewholecek.com/interview-with-tucker-peck-phd/].
5 jacobjacob 4y
Swedish program called learningtosleep.se. I think they just do whatever the standard sleep CBT thing is (at least that's what they say).

Wolfgang Spohn develops the concept of a "dependency equilibrium" based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one evaluates a sequence of joint probability distributions such that and for all actions and . Using your notation of a probability matrix and a utility matrix, the expected utili

... (read more)
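
The definition quoted above (every action with positive probability is an evidential best response) can be turned into a small check for two-player games. The Python sketch below is a simplified illustration with hypothetical function names. It only covers joint distributions in which every action has positive probability, so the limit construction for zero-probability actions is deliberately left out.

```python
def transpose(matrix):
    # Swap the roles of the row and column player.
    return [list(col) for col in zip(*matrix)]

def evidential_values(joint, utility):
    # Evidential expected utility of each row action:
    # sum over columns of P(column | row) * utility[row][column].
    values = []
    for probs, payoffs in zip(joint, utility):
        total = sum(probs)
        if total <= 0:
            raise ValueError("zero-probability action: the limit construction is needed")
        values.append(sum(p / total * u for p, u in zip(probs, payoffs)))
    return values

def rows_are_best_responses(joint, utility, tol=1e-9):
    # With full support, every action must be a best response,
    # so all evidential expected utilities have to coincide.
    values = evidential_values(joint, utility)
    return max(values) - min(values) <= tol

def is_dependency_equilibrium(joint, utility_row, utility_col):
    # joint[i][j] is the probability that row plays i and column plays j.
    return (rows_are_best_responses(joint, utility_row)
            and rows_are_best_responses(transpose(joint), transpose(utility_col)))

if __name__ == "__main__":
    # Matching pennies with independent uniform play.
    pennies_row = [[1, -1], [-1, 1]]
    pennies_col = [[-1, 1], [1, -1]]
    uniform = [[0.25, 0.25], [0.25, 0.25]]
    print(is_dependency_equilibrium(uniform, pennies_row, pennies_col))

    # Prisoner's dilemma with strongly correlated actions (illustrative payoffs).
    pd_row = [[3, 0], [5, 1]]
    pd_col = [[3, 5], [0, 1]]
    correlated = [[0.45, 0.05], [0.05, 0.45]]
    print(is_dependency_equilibrium(correlated, pd_row, pd_col))
```

In the second example, defection still has positive probability but a lower evidential expected utility than cooperation, so this particular full-support distribution fails the check. Distributions that put all weight on one action pair require the sequence-of-distributions definition mentioned in the comment, which this sketch does not implement.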

I would like to submit the following entries:

A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).

A wager against Solomonoff induction (blog post).

Three wagers for multiverse-wide superrationality (blog post).

UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)

EDT doesn't pay if it is given the choice to commit to not paying ex-ante (before receiving the letter). So the thought experiment might be an argument against ordinary EDT, but not against updateless EDT. If one takes the possibility of anthropic uncertainty into account, then even ordinary EDT might not pay the blackmailer. See also Abram Demski's post about the Smoking Lesion. Ahmed and Price defend EDT along similar lines in a response to a related thought experiment by Frank Arntzenius.

3 Stuart_Armstrong 6y
Yes, this demonstrates that EDT is also unstable under self modification, just as CDT is. And trying to build an updateless EDT is exactly what UDT is doing.

Thanks for your answer! This "gain" approach seems quite similar to what Wedgwood (2013) has proposed as "Benchmark Theory", which behaves like CDT in cases with, but more like EDT in cases without causally dominant actions. My hunch would be that one might be able to construct a series of thought-experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).

I don't understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:

... (read more)
0 abramdemski 6y
Ah, you're right. So gain doesn't achieve as much as I thought it did. Thanks for the references, though. I think the idea is also similar in spirit to a proposal of Jeffrey's in his book The Logic of Decision; he presents an evidential theory, but is as troubled by cooperating in prisoner's dilemma and one-boxing in Newcomb's problem as other decision theorists. So, he suggests that a rational agent should prefer actions such that, having updated on probably taking that action rather than another, you still prefer that action. (I don't remember what he proposed for cases when no such action is available.) This has a similar structure of first updating on a potential action and then checking how alternatives look from that position.

From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a

... (read more)
2 Diffractor 5y
I think that in that case, the agent shouldn't smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here's a reframing of the provided payoff matrix that makes this argument clearer. (also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed)

Let's say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn't necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them to do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

good person's utility estimates:
  • takes sword
      • I'm good: 10
      • I'm evil: -90
  • don't take sword: 0

evil person's utility estimates:
  • takes sword
      • I'm good: 0
      • I'm evil: -100
  • don't take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it's clear that if a (secretly good) CDT agent believes that most of society is evil, then it's a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it's a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case.

A human (with some degree of introspective power) in this case, could correctly reason "oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn't curse me. This is evidence that I'm good,
0 abramdemski 6y
Excellent example. It seems to me, intuitively, that we should be able to get both the CDT feature of not thinking we can control our utility function through our actions and the EDT feature of taking the information into account. Here's a somewhat contrived decision theory which I think captures both effects. It only makes sense for binary decisions.

First, for each action you compute the posterior probability of the causal parents for each decision. So, depending on details of the setup, smoking tells you that you're likely to be a smoke-lover, and refusing to smoke tells you that you're more likely to be a non-smoke-lover. Then, for each action, you take the action with best "gain": the amount better you do in comparison to the other action keeping the parent probabilities the same:

Gain(a) = E(U | a) − E(U | a, do(ā))

(E(U | a, do(ā)) stands for the expectation on utility which you get by first Bayes-conditioning on a, then causal-conditioning on its opposite.)

The idea is that you only want to compare each action to the relevant alternative. If you were to smoke, it means that you're probably a smoker; you will likely be killed, but the relevant alternative is one where you're also killed. In my scenario, the gain of smoking is +10. On the other hand, if you decide not to smoke, you're probably not a smoker. That means the relevant alternative is smoking without being killed. In my scenario, the smoke-lover computes the gain of this action as -10. Therefore, the smoke-lover smokes. (This only really shows the consistency of an equilibrium where the smoke-lover smokes -- my argument contains the unjustified assumption that smoking is good evidence for being a smoke lover and refusing to smoke is good evidence for not being one, which is only justified in a circular way by the conclusion.)

In your scenario, the smoke-lover computes the gain of smoking at +10, and the gain of not smoking at 0. So, again, the smoke-lover smokes. The solution seems too ad-hoc to really
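
The "gain" rule described in this comment can be spelled out for a two-action case. The Python sketch below is a hedged illustration. The type posteriors, the death penalty, and the smoking bonus are hypothetical numbers chosen so that the gains match the +10 and -10 mentioned above; they are not taken from the original scenario, whose full payoff table is not shown here. It assumes that the only causal parent is the agent's type, that smoke-lover types are killed regardless of what they do, and that utilities are evaluated from the smoke-lover's point of view.

```python
ACTIONS = ("smoke", "abstain")

# Assumed posterior probability of being a smoke-lover after Bayes-conditioning
# on each action (hypothetical: the action perfectly reveals the type).
P_LOVER_GIVEN_ACTION = {"smoke": 1.0, "abstain": 0.0}

SMOKING_BONUS = 10.0   # hypothetical utility a smoke-lover gets from smoking
DEATH_PENALTY = -90.0  # hypothetical utility of being killed

def utility(is_lover, action):
    # Utility as judged by a smoke-lover; lover types are killed either way.
    total = SMOKING_BONUS if action == "smoke" else 0.0
    if is_lover:
        total += DEATH_PENALTY
    return total

def expected_utility(conditioned_on, performed):
    # Type probabilities come from Bayes-conditioning on `conditioned_on`;
    # the action actually carried out is `performed` (the do() intervention).
    p = P_LOVER_GIVEN_ACTION[conditioned_on]
    return p * utility(True, performed) + (1.0 - p) * utility(False, performed)

def gain(action):
    # Gain(a) = E(U | a) - E(U | a, do(other action)).
    other = next(a for a in ACTIONS if a != action)
    return expected_utility(action, action) - expected_utility(action, other)

if __name__ == "__main__":
    for action in ACTIONS:
        print(action, gain(action))
```

Because the type probabilities are held fixed inside each gain computation, the death penalty cancels out and only the causal effect of the action remains, which is the intended CDT-like feature. The posteriors themselves are still an input, which is where the circularity noted in the comment enters.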

Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and s

... (read more)

Thanks for the reply and all the useful links!

It's not a given that you can easily observe your existence.

It took me a while to understand this. Would you say that for example in the Evidential Blackmail, you can never tell whether your decision algorithm is just being simulated or whether you're actually in the world where you received the letter, because both times, the decision algorithms receive exactly the same evidence? So in this sense, after updating on receiving the letter, both worlds are still equally likely, and only via your decision do yo... (read more)

I agree with all of this, and I can't understand why the Smoking Lesion is still seen as the standard counterexample to EDT.

Regarding the blackmail letter: I think that in principle, it should be possible to use a version of EDT that also chooses policies based on a prior instead of actions based on your current probability distribution. That would be "updateless EDT", and I think it wouldn't give in to Evidential Blackmail. So I think rather than an argument against EDT, it's an argument in favor of updatelessness.

0 entirelyuseless 6y
Smoking lesion is "seen as the standard counterexample" at least on LW pretty much because people wanted to agree with Eliezer.

Thanks for the link! What I don't understand is how this works in the context of empirical and logical uncertainty. Also, it's unclear to me how this approach relates to Bayesian conditioning. E.g. if the sentence "if a holds, then o holds" is true, doesn't this also mean that P(o|a)=1? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities "from the viewpoint of the original position", so with updatelessness, and in the context of full logical inference and knowledge of ... (read more)

3 cousin_it 6y
To me, proof-based UDT is a simple framework that includes probabilistic/Bayesian reasoning as a special case. For example, if the world is deterministic except for a single coinflip, you specify a preference ordering on pairs of outcomes of two deterministic worlds. Fairness or non-fairness of the coinflip will be encoded into the ordering, so the decision can be based on completely deterministic reasoning. All probabilistic situations can be recast in this way. That's what UDT folks mean by "probability as caring". It's really cool that UDT lets you encode any setup with probability, prediction, precommitment etc. into a few (complicated and self-referential) sentences in PA [https://en.wikipedia.org/wiki/Peano_axioms#First-order_theory_of_arithmetic] or GL [https://plato.stanford.edu/entries/logic-provability/] that are guaranteed to have truth values. And since GL is decidable, you can even write a program that will solve all such problems for you.

Thanks a lot for your elaborate reply!

(So I'm not even sure what CDT is supposed to do here, since it's not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)

Hmm, good point. The truth of the proposition is evaluated on the basis of Alice's action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Al... (read more)

CDT, TDT, and UDT would not give away the money because there is no causal (or acausal) influence on the number of universes.

I'm not so sure about UDT's response. From what I've heard, depending on the exact formal implementation of the problem, UDT might also pay the money? If your thought experiment works via a correlation between the type of universe you live in and the decision theory you employ, then it might be a similar problem to the Coin Flip Creation. I introduced the latter decision problem in an attempt to make a less ambiguous version of th... (read more)

1Vladimir_Nesov6y
It's not a given that you can easily observe your existence. From the updateless point of view, all possible worlds, or theories of worlds, or maybe finite fragments of reasoning about them, in principle "exist" to some degree, in the sense of being data potentially relevant for estimating the value of everything, which is something to be done for the strategies under the agent's consideration. So in the case of worlds, or instances of the agent in worlds, the useful sense of "existence" is relevance for estimating the value of everything (or of change in value depending on the agent's strategy, which is the sense in which worlds that couldn't contain or think about the agent don't exist). Since in this case we are talking about possible worlds, they do or don't exist in the sense of having no measure (probability) in the updateless prior (to the extent that it makes sense to talk about the decision algorithm using a prior). In this sense, observing one's existence means observing an argument about the a priori probability of the world you inhabit. In a world that has relatively tiny a priori probability, you should be able to observe your own (or rather the world's) non-existence, in the same sense. This also follows the principle of reducing [http://lesswrong.com/lw/1iy/what_are_probabilities_anyway/] concepts like existence or probability (where they make sense) to components of the decision algorithm, and abandoning them in sufficiently unusual [http://lesswrong.com/lw/182/the_absentminded_driver/] thought experiments [http://lesswrong.com/lw/3dy/solve_psykoshs_nonanthropic_problem/] (where they may fail to make sense, but where it's still possible to talk about decisions). See also this post [http://antisquark.tumblr.com/post/143442920287/not-sure-if-youve-answered-it-already-but] of Vadim [http://lesswrong.com/user/Squark/]'s and the idea of cognitive reductions [https://agentfoundations.org/item?id=1129] (looking for the role a concept plays in your thinking [http://lesswr
1entirelyuseless6y
My way of looking at this: The Smoking Lesion and Newcomb are formally equivalent. So no consistent decision theory can say, "smoke, but one-box." Eliezer hoped to get this response. If he succeeded, UDT is inconsistent. If UDT is consistent, it must recommend either smoking and two-boxing, or not smoking and one-boxing. Notice that cousin_it's argument applies exactly to the 100% correlation smoking lesion: you can deduce from the fact that you do not smoke that you do not have cancer, and by UDT as cousin_it understands it, that is all you need to decide not to smoke.

That's what I was trying to do with the Coin Flip Creation :) My guess: once you specify the Smoking Lesion and make it unambiguous, it ceases to be an argument against EDT.

1Tobias_Baumann6y
What exactly do you think we need to specify in the Smoking Lesion?
0[anonymous]6y
I'd be curious to hear about your other example problems. I've done a bunch of research on UDT over the years, implementing it as logical formulas and applying it to all the problems I could find, and I've become convinced that it's pretty much always right. (There are unsolved problems in UDT, like how to treat logical uncertainty or source code uncertainty, but these involve strange situations that other decision theories don't even think about.) If you can put EDT and UDT in sharp conflict, and give a good argument for EDT's decision, that would surprise me a lot.

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

Fair point :) What I meant was that for every world history, there is only one causal influence I could possibly h... (read more)

I agree with points 1) and 2). Regarding point 3), that's interesting! Do you think one could also prove that if you don't smoke, you can't (or are less likely to) have the gene in the Smoking Lesion? (See also my response to Vladimir Nesov's comment.)

1cousin_it6y
I can only give a clear-cut answer if you reformulate the smoking lesion problem in terms of Omega and specify the UDT agent's egoism or altruism :-)

The point of decision theories is not that they let you reach from beyond the Matrix and change reality in violation of physics; it's that you predictably act in ways that optimize for various criteria.

I agree with this. But I would argue that causal counterfactuals somehow assume that we can "reach from beyond the Matrix and change reality in violation of physics". They work by comparing what would happen if we detached our “action node” from its ancestor nodes and manipulated it in different ways. So causal thinking in some way seems to viol... (read more)
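
To make the "detached action node" concrete, here is a hedged Python sketch that contrasts conditioning on an action with intervening on it, using a toy Smoking Lesion model. The graph (gene influences both smoking and cancer, smoking has no effect on cancer) and every number are assumptions chosen for illustration, not part of any canonical statement of the problem.

```python
from itertools import product

# Hypothetical parameters of a toy Smoking Lesion model.
P_GENE = 0.5
P_SMOKE_GIVEN_GENE = {True: 0.9, False: 0.1}
P_CANCER_GIVEN_GENE = {True: 0.8, False: 0.1}


def joint(gene: bool, smoke: bool, cancer: bool) -> float:
    """Joint probability under the assumed graph gene -> smoke, gene -> cancer."""
    p = P_GENE if gene else 1 - P_GENE
    p_smoke = P_SMOKE_GIVEN_GENE[gene]
    p *= p_smoke if smoke else 1 - p_smoke
    p_cancer = P_CANCER_GIVEN_GENE[gene]
    p *= p_cancer if cancer else 1 - p_cancer
    return p


def p_cancer_given_smoke(smoke: bool) -> float:
    """Evidential quantity: condition on the observed action."""
    numerator = sum(joint(g, smoke, True) for g in (True, False))
    denominator = sum(
        joint(g, smoke, c) for g, c in product((True, False), repeat=2)
    )
    return numerator / denominator


def p_cancer_do_smoke(smoke: bool) -> float:
    """Causal quantity: cut the gene -> smoke edge and set the action by hand.

    The gene keeps its prior, and since this toy graph has no smoke -> cancer
    edge, the argument does not change the result.
    """
    return sum(
        (P_GENE if g else 1 - P_GENE) * P_CANCER_GIVEN_GENE[g]
        for g in (True, False)
    )


if __name__ == "__main__":
    for smoke in (True, False):
        print(smoke, p_cancer_given_smoke(smoke), p_cancer_do_smoke(smoke))
```

In this toy model the conditional probability of cancer differs between smoking and not smoking, while the interventional probability is the same for both, which is exactly the gap between the evidential and the causal counterfactual.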

2Vaniver6y
I agree there's a point here that lots of decision theories / models of agents / etc. are dualistic instead of naturalistic, but I think that's orthogonal to EDT vs. CDT vs. LDT; all of them assume that you could decide to take any of the actions that are available to you. I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor. This depends on the formulation of Newcomb's problem. If it says "Omega predicts you with 99% accuracy" or "Omega always predicts you correctly" (because, say, Omega is Laplace's Demon), then Omega knew that you would learn about decision theory in the way that you did, and there's still a logical dependence between the you looking at the boxes in reality and the you looking at the boxes in Omega's imagination. (This assumes that the 99% fact is known of you in particular, rather than 99% accuracy being something true of humans in general; this gets rid of the case that 99% of the time people's decision theories don't change, but 1% of the time they do, and you might be in that camp.) If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on traditional Newcomb's with perfect accuracy. The boxes just showed up now, and you have to decide whether to take one or both," then the logical dependence is shattered, and two-boxing becomes the correct move. If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on this version of Newcomb's with perfect accuracy. The bo

Thanks for your comment! I find your line of reasoning in the ASP problem and the Coin Flip Creation plausible. So your point is that, in both cases, by choosing a decision algorithm, one also gets to choose where this algorithm is being instantiated? I would say that in the CFC, choosing the right action is sufficient, while in the ASP you also have to choose the whole UDT program so as to be instantiated in a beneficial way (similar to the distinction of how TDT iterates over acts and UDT iterates over policies).

Would you agree that the Coin Flip Creatio... (read more)

1Vladimir_Nesov6y
To clarify, it's the algorithm itself that chooses how it behaves. So I'm not talking about how the algorithm's instantiation depends on the way the programmer chooses to write it; instead, I'm talking about how the algorithm's instantiation depends on the choices that the algorithm itself makes, where we are talking about a particular algorithm that's already written. Less mysteriously, the idea of the algorithm's decisions influencing things describes a step in the algorithm: it's how the algorithm operates, by figuring out something we could call "how the algorithm's decisions influence outcomes". The algorithm then takes that thing and does further computations that depend on it.

Yes, that's correct. I would say that "two-boxing" is generally what CDT would recommend, and "one-boxing" is what EDT recommends. Yes, medical Newcomb problems are different from Newcomb's original problem in that there are no simulations of decisions involved in the former.
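
A minimal Python sketch of why the two theories come apart on Newcomb's original problem. The payoffs (1,000,000 in the opaque box, 1,000 in the transparent one) are the usual textbook values and are assumed here, as are the function names; the 99% accuracy is just an example figure.

```python
# Assumed payoffs: the opaque box holds BIG if one-boxing was predicted,
# the transparent box always holds SMALL.
BIG, SMALL = 1_000_000, 1_000


def edt_values(accuracy: float) -> dict:
    """Expected payoff conditional on the action.

    The action is treated as evidence about what the predictor foresaw.
    """
    return {
        "one-box": accuracy * BIG,
        "two-box": (1 - accuracy) * BIG + SMALL,
    }


def cdt_values(p_full: float) -> dict:
    """Expected payoff with the box contents held fixed.

    p_full is the agent's credence that the opaque box is full; the action
    is assumed to have no influence on it.
    """
    return {
        "one-box": p_full * BIG,
        "two-box": p_full * BIG + SMALL,
    }


def best(values: dict) -> str:
    """Return the action with the highest expected payoff."""
    return max(values, key=values.get)


if __name__ == "__main__":
    print(best(edt_values(0.99)))
    for p_full in (0.0, 0.5, 1.0):
        print(p_full, best(cdt_values(p_full)))
```

Under these assumed payoffs, the evidential calculation favors one-boxing whenever the accuracy exceeds 0.5005, whereas the causal calculation favors two-boxing for every value of `p_full`, because the extra 1,000 is added on top of whatever the opaque box contains.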

3[anonymous]6y
Thanks! I'll make some actual content-related comments once I get a chance.

Rationality is about more than empirical studies. It's about developing sensible models of the world. It's about conveying sensible models to people in ways that they'll understand them. It's about convincing people that your model is better than theirs, sometimes without having to do an experiment.

Hmm, I'm not sure I understand what you mean. Maybe I'm missing something? Isn't this exactly what Bayesianism is about? Bayesianism is just using laws of probability theory to build an understanding of the world, given all the evidence that we encounter. Of ... (read more)