All of Johannes_Treutlein's Comments + Replies

An observation about Hubinger et al.'s framework for learned optimization

These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models [https://www.alignmentforum.org/posts/LBNjeGaJZw7QdybMw/agents-over-cartesian-world-models#Input_Distribution_Problem].

An observation about Hubinger et al.'s framework for learned optimization

Your post seems to be focused more on pointing out a missing piece in the literature than on asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:

Let $X$ be the task space, $Y$ the output space, $M$ the model space, $O_{\text{base}}$ our base objective, and $O^{x}_{\text{mesa}}$ the mesa objective of the model for input $x$. Assume that there exists some map $f$ mapping internal objects to outputs by the model,... (read more)
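As a rough sketch of where this is going (the internal-object space $Z$ and the argmax form are assumptions for illustration, not necessarily how the argument continues): if the model optimizes its mesa objective over internal objects and then translates the result through $f$, its behaviour could be written as

$$m(x) \;\in\; f\Bigl(\arg\max_{z \in Z} O^{x}_{\text{mesa}}(z)\Bigr),$$

while the base objective $O_{\text{base}}$ evaluates the model (or its outputs) directly, so any comparison of the two objectives has to go through the translation map $f$.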

Training goals for large language models

Thank you!

It does seem like simulating text generated using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful without updating them on the newest research.

In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transpa... (read more)

Training goals for large language models

Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.

As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.

Overall ... (read more)

LCDT, A Myopic Decision Theory

Would you count issues with malign priors etc. also as issues with myopia? Maybe I'm missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.

E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this r... (read more)
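To make the dependence on anthropic beliefs concrete (a back-of-the-envelope illustration using the 1000 copies from above): under self-indication-style anthropic reasoning, the LCDT agent would assign

$$P(\text{I am one of the simulated copies}) \;=\; \frac{1000}{1001} \;\approx\; 0.999,$$

so almost all of its expected reward would be determined by the bad agent's reward schedule, which is exactly the lever the bad agent needs.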

The Parable of Predict-O-Matic

If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use... (read more)

Intuitions about solving hard problems

I find this particularly curious since, naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization, and thus also deceptive behavior, more likely? Maybe the argument is that weight sharing somehow leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still encode deceptive reasoning in an uninterpretable way?

Two Notions of Best Response

Wolfgang Spohn develops the concept of a "dependency equilibrium" based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution $p$ is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one evaluates a sequence of joint probability distributions $(p_n)_{n \in \mathbb{N}}$ such that $\lim_{n\to\infty} p_n = p$ and $p_n(a) > 0$ for all actions $a$ and all $n$. Using your notation of a probability matrix $P$ and a utility matrix $U$, the expected utili

... (read more)
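Spelled out (a sketch in my own notation, following Spohn's construction): an action $a_i$ of player $i$ is an evidential best response relative to the sequence $(p_n)$ if

$$\lim_{n\to\infty} \sum_{a_{-i}} p_n(a_{-i} \mid a_i)\, u_i(a_i, a_{-i}) \;\ge\; \lim_{n\to\infty} \sum_{a_{-i}} p_n(a_{-i} \mid b_i)\, u_i(b_i, a_{-i}) \quad \text{for all alternative actions } b_i,$$

and $p$ is a dependency equilibrium if some such sequence converging to $p$ makes every action in the support of $p$ an evidential best response.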
Announcement: AI alignment prize winners and next round

I would like to submit the following entries:

A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).

A wager against Solomonoff induction (blog post).

Three wagers for multiverse-wide superrationality (blog post).

UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)

Naturalized induction – a challenge for evidential and causal decision theory

EDT doesn't pay if it is given the choice to commit to not paying ex-ante (before receiving the letter). So the thought experiment might be an argument against ordinary EDT, but not against updateless EDT. If one takes the possibility of anthropic uncertainty into account, then even ordinary EDT might not pay the blackmailer. See also Abram Demski's post about the Smoking Lesion. Ahmed and Price defend EDT along similar lines in a response to a related thought experiment by Frank Arntzenius.

3Stuart_Armstrong5y
Yes, this demonstrates that EDT is also unstable under self modification, just as CDT is. And trying to build an updateless EDT is exactly what UDT is doing.
Smoking Lesion Steelman

Thanks for your answer! This "gain" approach seems quite similar to what Wedgwood (2013) has proposed as "Benchmark Theory", which behaves like CDT in cases with causally dominant actions, but more like EDT in cases without them. My hunch would be that one might be able to construct a series of thought experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).

I don't understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:

... (read more)
0abramdemski5y
Ah, you're right. So gain doesn't achieve as much as I thought it did. Thanks for the references, though. I think the idea is also similar in spirit to a proposal of Jeffrey's in his book The Logic of Decision; he presents an evidential theory, but is as troubled by cooperating in prisoner's dilemma and one-boxing in Newcomb's problem as other decision theorists. So, he suggests that a rational agent should prefer actions such that, having updated on probably taking that action rather than another, you still prefer that action. (I don't remember what he proposed for cases when no such action is available.) This has a similar structure of first updating on a potential action and then checking how alternatives look from that position.
Smoking Lesion Steelman

From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a

... (read more)
2Diffractor4y
I think that in that case, the agent shouldn't smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here's a reframing of the provided payoff matrix that makes this argument clearer. (Also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed.) Let's say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn't necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them to do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

Good person's utility estimates:
* takes sword
  * I'm good: 10
  * I'm evil: -90
* don't take sword: 0

Evil person's utility estimates:
* takes sword
  * I'm good: 0
  * I'm evil: -100
* don't take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it's clear that if a (secretly good) CDT agent believes that most of society is evil, then it's a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it's a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case. A human (with some degree of introspective power) in this case, could correctly reason "oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetica
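As a quick check on that last claim (with $q$ for the agent's credence that it is good, my notation): the good-person payoff table gives

$$EU(\text{take sword}) = 10q - 90(1-q) = 100q - 90,$$

which is positive only when $q > 0.9$, so a CDT agent that thinks it is probably evil leaves the sword alone, and with these particular payoffs it needs to be quite confident in a mostly-good society before picking up the sword is worthwhile.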
0abramdemski5y
Excellent example. It seems to me, intuitively, that we should be able to get both the CDT feature of not thinking we can control our utility function through our actions and the EDT feature of taking the information into account.

Here's a somewhat contrived decision theory which I think captures both effects. It only makes sense for binary decisions. First, for each action you compute the posterior probability of the causal parents for each decision. So, depending on details of the setup, smoking tells you that you're likely to be a smoke-lover, and refusing to smoke tells you that you're more likely to be a non-smoke-lover. Then, for each action, you take the action with best "gain": the amount better you do in comparison to the other action keeping the parent probabilities the same: $\text{Gain}(a) = E(U \mid a) - E(U \mid a, \operatorname{do}(\bar{a}))$. ($E(U \mid a, \operatorname{do}(\bar{a}))$ stands for the expectation on utility which you get by first Bayes-conditioning on $a$, then causal-conditioning on its opposite.)

The idea is that you only want to compare each action to the relevant alternative. If you were to smoke, it means that you're probably a smoker; you will likely be killed, but the relevant alternative is one where you're also killed. In my scenario, the gain of smoking is +10. On the other hand, if you decide not to smoke, you're probably not a smoker. That means the relevant alternative is smoking without being killed. In my scenario, the smoke-lover computes the gain of this action as -10. Therefore, the smoke-lover smokes. (This only really shows the consistency of an equilibrium where the smoke-lover smokes -- my argument contains an unjustified assumption that smoking is good evidence for being a smoke-lover and refusing to smoke is good evidence for not being one, which is only justified in a circular way by the conclusion.)

In your scenario, the smoke-lover computes the gain of smoking at +10, and the gain of not smoking at 0. So, again, the smoke-lover smokes. The solution seems too ad-hoc to really
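For concreteness, here is the gain computation with round numbers (assuming, as in the scenario above, +10 for smoking, -100 for being killed, and that conditioning on smoking makes being a smoke-lover, and hence being killed, near-certain, while conditioning on not smoking makes being safe near-certain):

$$\text{Gain}(\text{smoke}) = E(U \mid \text{smoke}) - E(U \mid \text{smoke}, \operatorname{do}(\neg\text{smoke})) \approx (10 - 100) - (-100) = +10,$$
$$\text{Gain}(\neg\text{smoke}) = E(U \mid \neg\text{smoke}) - E(U \mid \neg\text{smoke}, \operatorname{do}(\text{smoke})) \approx 0 - 10 = -10,$$

which reproduces the +10 and -10 figures for the smoke-lover in this scenario.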
The sin of updating when you can change whether you exist

Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and s

... (read more)
Is Evidential Decision Theory presumptuous?

Thanks for the reply and all the useful links!

It's not a given that you can easily observe your existence.

It took me a while to understand this. Would you say that for example in the Evidential Blackmail, you can never tell whether your decision algorithm is just being simulated or whether you're actually in the world where you received the letter, because both times, the decision algorithms receive exactly the same evidence? So in this sense, after updating on receiving the letter, both worlds are still equally likely, and only via your decision do yo... (read more)

Is Evidential Decision Theory presumptuous?

I agree with all of this, and I can't understand why the Smoking Lesion is still seen as the standard counterexample to EDT.

Regarding the blackmail letter: I think that in principle, it should be possible to use a version of EDT that also chooses policies based on a prior instead of actions based on your current probability distribution. That would be "updateless EDT", and I think it wouldn't give in to Evidential Blackmail. So I think rather than an argument against EDT, it's an argument in favor of updatelessness.
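Schematically (my notation), the difference is which distribution the expectation is taken under:

$$\text{EDT:}\quad a^{*} = \arg\max_{a}\; E[U \mid a, \text{obs}], \qquad \text{updateless EDT:}\quad \pi^{*} = \arg\max_{\pi}\; E[U \mid \pi],$$

where the second expectation is taken with respect to the prior, before conditioning on the letter, and the chosen policy $\pi^{*}$ maps each possible observation to an action. Evaluated from the prior, a policy of never paying comes out ahead, which is the sense in which the updateless version doesn't give in.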

0entirelyuseless5y
Smoking lesion is "seen as the standard counterexample" at least on LW pretty much because people wanted to agree with Eliezer.
“Betting on the Past” – a decision problem by Arif Ahmed

Thanks for the link! What I don't understand is how this works in the context of empirical and logical uncertainty. Also, it's unclear to me how this approach relates to Bayesian conditioning. E.g., if the sentence "if a holds, then o holds" is true, doesn't this also mean that $P(o \mid a) = 1$? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities "from the viewpoint of the original position", so with updatelessness, and in the context of full logical inference and knowledge of ... (read more)

3cousin_it5y
To me, proof-based UDT is a simple framework that includes probabilistic/Bayesian reasoning as a special case. For example, if the world is deterministic except for a single coinflip, you specify a preference ordering on pairs of outcomes of two deterministic worlds. Fairness or non-fairness of the coinflip will be encoded into the ordering, so the decision can be based on completely deterministic reasoning. All probabilistic situations can be recast in this way. That's what UDT folks mean by "probability as caring". It's really cool that UDT lets you encode any setup with probability, prediction, precommitment etc. into a few (complicated and self-referential) sentences in PA [https://en.wikipedia.org/wiki/Peano_axioms#First-order_theory_of_arithmetic] or GL [https://plato.stanford.edu/entries/logic-provability/] that are guaranteed to have truth values. And since GL is decidable, you can even write a program that will solve all such problems for you.
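To make the shape of that proof-search loop concrete, here is a minimal sketch (in Python rather than PA/GL; the toy world model, the payoffs, and the stubbed-out provability oracle are all assumptions for illustration, not a real theorem prover):

```python
# Toy sketch of one common presentation of proof-based UDT: the agent searches,
# in order of decreasing utility, for a provable sentence of the form
# "agent() = a  ->  utility() = u" and outputs the first action for which such
# a proof is found.  The proof search is stubbed out with a hard-coded table
# for a Newcomb-like world with a perfect predictor.

ACTIONS = ["one-box", "two-box"]
UTILITIES = [1_001_000, 1_000_000, 1_000, 0]

# Stub: which implications "agent() = a -> utility() = u" the toy world model
# lets us prove.  With a perfect predictor, one-boxing provably yields
# 1,000,000 and two-boxing provably yields 1,000.
PROVABLE = {
    ("one-box", 1_000_000): True,
    ("two-box", 1_000): True,
}

def provable(action: str, utility: int) -> bool:
    """Pretend proof search: True iff 'agent()=action -> utility()=utility' is provable."""
    return PROVABLE.get((action, utility), False)

def proof_based_udt() -> str:
    """Return the first action whose provable utility guarantee is highest."""
    for u in sorted(UTILITIES, reverse=True):
        for a in ACTIONS:
            if provable(a, u):
                return a
    return ACTIONS[0]  # fallback if nothing is provable

if __name__ == "__main__":
    print(proof_based_udt())  # -> "one-box"
```

The interesting part in the real construction is of course the provability oracle, where the self-referential PA/GL sentences mentioned above do the work; the loop itself is this simple.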
“Betting on the Past” – a decision problem by Arif Ahmed

Thanks a lot for your elaborate reply!

(So I'm not even sure what CDT is supposed to do here, since it's not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)

Hmm, good point. The truth of the proposition is evaluated on the basis of Alice's action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Al... (read more)

Is Evidential Decision Theory presumptuous?

CDT, TDT, and UDT would not give away the money because there is no causal (or acausal) influence on the number of universes.

I'm not so sure about UDT's response. From what I've heard, depending on the exact formal implementation of the problem, UDT might also pay the money? If your thought experiment works via a correlation between the type of universe you live in and the decision theory you employ, then it might be a similar problem to the Coin Flip Creation. I introduced the latter decision problem in an attempt to make a less ambiguous version of th... (read more)

1Vladimir_Nesov6y
It's not a given that you can easily observe your existence. From updateless point of view, all possible worlds, or theories of worlds, or maybe finite fragments of reasoning about them, in principle "exist" to some degree, in the sense of being data potentially relevant for estimating the value of everything, which is something to be done for the strategies under agent's consideration. So in case of worlds, or instances of the agent in worlds, the useful sense of "existence" is relevance for estimating the value of everything (or of change in value depending on agent's strategy, which is the sense in which worlds that couldn't contain or think about the agent, don't exist). Since in this case we are talking about possible worlds, they do or don't exist in the sense of having no measure (probability) in the updateless prior (to the extent that it makes sense to talk about the decision algorithm using a prior). In this sense, observing one's existence means observing an argument about the a priori probability of the world you inhabit. In a world that has relatively tiny a priori probability, you should be able to observe your own (or rather the world's) non-existence, in the same sense. This also follows the principle of reducing [http://lesswrong.com/lw/1iy/what_are_probabilities_anyway/] concepts like existence or probability (where they make sense) to components of the decision algorithm, and abandoning them in sufficiently unusual [http://lesswrong.com/lw/182/the_absentminded_driver/] thought experiments [http://lesswrong.com/lw/3dy/solve_psykoshs_nonanthropic_problem/] (where they may fail to make sense, but where it's still possible to talk about decisions). See also this post [http://antisquark.tumblr.com/post/143442920287/not-sure-if-youve-answered-it-already-but] of Vadim [http://lesswrong.com/user/Squark/]'s and the idea of cognitive reductions [https://agentfoundations.org/item?id=1129] (looking for the role a concept plays in your thinking [http://lessw
1entirelyuseless6y
My way of looking at this: The Smoking Lesion and Newcomb are formally equivalent. So no consistent decision theory can say, "smoke, but one-box." Eliezer hoped to get this response. If he succeeded, UDT is inconsistent. If UDT is consistent, it must recommend either smoking and two-boxing, or not smoking and one-boxing. Notice that cousin_it's argument applies exactly to the 100% correlation smoking lesion: you can deduce from the fact that you do not smoke that you do not have cancer, and by UDT as cousin_it understands it, that is all you need to decide not to smoke.
Did EDT get it right all along? Introducing yet another medical Newcomb problem

That's what I was trying to do with the Coin Flip Creation :) My guess: once you specify the Smoking Lesion and make it unambiguous, it ceases to be an argument against EDT.

2Tobias_Baumann6y
What exactly do you think we need to specify in the Smoking Lesion?
0[anonymous]6y
I'd be curious to hear about your other example problems. I've done a bunch of research on UDT over the years, implementing it as logical formulas and applying it to all the problems I could find, and I've become convinced that it's pretty much always right. (There are unsolved problems in UDT, like how to treat logical uncertainty or source code uncertainty, but these involve strange situations that other decision theories don't even think about.) If you can put EDT and UDT in sharp conflict, and give a good argument for EDT's decision, that would surprise me a lot.
Did EDT get it right all along? Introducing yet another medical Newcomb problem

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

Fair point :) What I meant was that for every world history, there is only one causal influence I could possibly h... (read more)

Did EDT get it right all along? Introducing yet another medical Newcomb problem

I agree with points 1) and 2). Regarding point 3), that's interesting! Do you think one could also prove that if you don't smoke, you can't (or are less likely to) have the gene in the Smoking Lesion? (See also my response to Vladimir Nesov's comment.)

1cousin_it6y
I can only give a clear-cut answer if you reformulate the smoking lesion problem in terms of Omega and specify the UDT agent's egoism or altruism :-)
Did EDT get it right all along? Introducing yet another medical Newcomb problem

The point of decision theories is not that they let you reach from beyond the Matrix and change reality in violation of physics; it's that you predictably act in ways that optimize for various criteria.

I agree with this. But I would argue that causal counterfactuals somehow assume that we can "reach from beyond the Matrix and change reality in violation of physics". They work by comparing what would happen if we detached our “action node” from its ancestor nodes and manipulated it in different ways. So causal thinking in some way seems to viol... (read more)
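To spell out the "detaching": in do-calculus notation (standard notation, not specific to this post), conditioning on the action updates our beliefs about its causal parents, while intervening does not:

$$P(y \mid a) = \sum_{\text{pa}} P(\text{pa} \mid a)\, P(y \mid a, \text{pa}), \qquad P(y \mid \operatorname{do}(a)) = \sum_{\text{pa}} P(\text{pa})\, P(y \mid a, \text{pa}),$$

where $\text{pa}$ ranges over values of the action node's parents. The evidential expression lets the action carry news about its own causes; the interventional one treats the action as if it were set from outside the graph, which is the sense in which it seems to "reach in from beyond the Matrix".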

1Vaniver6y
I agree there's a point here that lots of decision theories / models of agents / etc. are dualistic instead of naturalistic, but I think that's orthogonal to EDT vs. CDT vs. LDT; all of them assume that you could decide to take any of the actions that are available to you.

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

This depends on the formulation of Newcomb's problem. If it says "Omega predicts you with 99% accuracy" or "Omega always predicts you correctly" (because, say, Omega is Laplace's Demon), then Omega knew that you would learn about decision theory in the way that you did, and there's still a logical dependence between the you looking at the boxes in reality and the you looking at the boxes in Omega's imagination. (This assumes that the 99% fact is known of you in particular, rather than 99% accuracy being something true of humans in general; this gets rid of the case that 99% of the time people's decision theories don't change, but 1% of the time they do, and you might be in that camp.)

If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on traditional Newcomb's with perfect accuracy. The boxes just showed up now, and you have to decide whether to take one or both," then the logical dependence is shattered, and two-boxing becomes the correct move.

If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on this version of Newcomb's with perfect accuracy. The b
Did EDT get it right all along? Introducing yet another medical Newcomb problem

Thanks for your comment! I find your line of reasoning in the ASP problem and the Coin Flip Creation plausible. So your point is that, in both cases, by choosing a decision algorithm, one also gets to choose where this algorithm is being instantiated? I would say that in the CFC, choosing the right action is sufficient, while in the ASP you also have to choose the whole UDT program so as to be instantiated in a beneficial way (similar to the distinction of how TDT iterates over acts and UDT iterates over policies).

Would you agree that the Coin Flip Creatio... (read more)

1Vladimir_Nesov6y
To clarify, it's the algorithm itself that chooses how it behaves. So I'm not talking about how the algorithm's instantiation depends on the way the programmer chooses to write it; instead, I'm talking about how the algorithm's instantiation depends on the choices that the algorithm itself makes, where we are talking about a particular algorithm that's already written. Less mysteriously, the idea of the algorithm's decisions influencing things describes a step in the algorithm: it's how the algorithm operates, by figuring out something we could call "how the algorithm's decisions influence outcomes". The algorithm then takes that thing and does further computations that depend on it.
Did EDT get it right all along? Introducing yet another medical Newcomb problem

Yes, that's correct. I would say that "two-boxing" is generally what CDT would recommend, and "one-boxing" is what EDT recommends. Yes, medical Newcomb problems are different from Newcomb's original problem in that there are no simulations of decisions involved in the former.

3lifelonglearner6y
Thanks! I'll make some actual content-related comments once I get a chance.
Which areas of rationality are underexplored? - Discussion Thread

Rationality is about more than empirical studies. It's about developing sensible models of the world. It's about conveying sensible models to people in ways that they'll understand them. It's about convincing people that your model is better than theirs, sometimes without having to do an experiment.

Hmm, I'm not sure I understand what you mean. Maybe I'm missing something? Isn't this exactly what Bayesianism is about? Bayesianism is just using laws of probability theory to build an understanding of the world, given all the evidence that we encounter. Of ... (read more)