and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.

I was confused about this part of your comment because the post directly talks about this in the conclusion.

The strategy typically is to catch the ball with the flippers, then to carefully hit the ball so that it takes a particular ramp which scores a lot of points and then returns the ball to the flippers. Professional pinball players try to avoid the parts of the board where the motion is chaotic.

The "off-site footnote" you're referring to seems to just be saying "The result is a pretty boring game. However, some of these ramps release extra balls after you have used them a few times. My guess is that this is the game designer trying to reintroduce chaos to make the game more interesting again." which is just a minor detail. AFAICT pros could score lots of points even without the extra balls.

(I'm leaving this comment here because I was getting confused about whether there had been major edits to the post, since the relevant content is currently in the conclusion and not the footnote. I was digging through the wayback machine and didn't see any major edits. So trying to save other people from the same confusion.)

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.

What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?

Have you tried using different AI models within Perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within Perplexity) or one of the models that Perplexity has fine-tuned itself, like Sonar Huge.

To be clear, uncertainty about the number of iterations isn’t enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that p(>n rounds) is so much less than p(n rounds) that it’s worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.

So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
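To make the "whatever number is implied by the pay-off structure" part concrete, here is a minimal sketch. It assumes a standard prisoner's dilemma with payoffs T > R > P (temptation, mutual cooperation, mutual defection) and an opponent playing grim trigger; neither assumption is from the comment above, and the function names are made up for illustration. Under those assumptions, cooperating is at least as good as defecting whenever every conditional continuation probability is at least (T − R) / (T − P). This is a sufficient condition, not a necessary one.

```python
from fractions import Fraction


def min_continuation_probability(T, R, P):
    """Threshold on p(>n rounds | n rounds) under grim trigger (assumed setup).

    With a constant continuation probability d, cooperating forever is worth
    R / (1 - d) and defecting once is worth T + d * P / (1 - d).
    Cooperation is at least as good iff d >= (T - R) / (T - P).
    """
    if not (T > R > P):
        raise ValueError("expected prisoner's dilemma payoffs with T > R > P")
    return Fraction(T - R, T - P)


def cooperation_sustainable(continuation_probs, T, R, P):
    """Check that every listed conditional continuation probability clears
    the threshold. A single round that falls below it is flagged, since
    defection may become attractive there and unravel earlier cooperation."""
    threshold = min_continuation_probability(T, R, P)
    return all(p >= threshold for p in continuation_probs)


# Illustrative payoffs chosen so that the threshold comes out at 20%.
print(min_continuation_probability(T=5, R=4, P=0))  # 1/5
```

With these illustrative payoffs, a sequence of conditional probabilities that dips to 10% at any round fails the check, which mirrors the point that the assumption has to hold at every n, not just on average.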

it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.

I would guess that assumption would be sufficient to defeat my counter-example, yeah.

I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in practice.

E.g.: Is it really true that a human's decision about whether or not to program a seed AI to take action A has the same correlations as that same superintelligence deciding whether or not to take action A 1000 years later while using a jupiter brain for its computation? Intuitively, I'd say that the human would correlate mostly with other humans and other evolved species, and that the superintelligence would mostly correlate with other superintelligences, and it'd be a big deal if that wasn't true.

However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.

I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.

Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (according to the prior) with some facts which UEDT-at-timestep-0's decision isn't correlated with. From the perspective of UEDT-at-timestep-0, it's bad to let UEDT-at-timestep-1 make decisions on the basis of correlations with things that UEDT-at-timestep-0 can't control.

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).

I don't understand this argument.

"an agent eventually behaves as if it were applying UDT with each Pn" — why can't an agent skip over some Pn entirely or get stuck on P9 or whatever?

"Therefore, in particular, it eventually behaves like UDT with prior P0." Even granting the above: sure, it will behave like UDT with prior P0 at some point. But then after that it might have some other prior. Why would it stick with P0?

Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn't find this in the paper, sorry if I missed it.)

Tbc: It should be fine to argue against those implications, right? It’s just that, if you grant the implication, then you can’t publicly refute Y.

I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements.


