This doesn’t seem true, at least in the sense of strict ranking? In the EDT case: if Omega’s policy is to place a prime in Box 1 whenever Omicron chooses a composite number (instead of matching Omicron when possible), then it predicts the EDT agent will choose only Box 1, and so this is a stable equilibrium. But since it also always places a different prime whenever Omicron chooses a prime, EDT never sees matching numbers and so always one-boxes; therefore its expected earnings are no less than FDT’s.
The answer to all of them is 1/e?
Curious what the solution to this one is? Couldn’t figure it out
Doesn’t this argument also work against the idea that they would self-modify in the “normal” finite way? It can’t currently represent the number which it’s building a ton of new storage to help contain, so it can’t make a pairwise comparison to say the latter is better, nor can it simulate the outcome of doing this and predict the reward it would get
Maybe you say it’s not directly making a pairwise comparison but making a more abstract step of reasoning like “I can’t predict that number but I know it’s gonna be bigger than what I have now; me with augmented memory will still be aligned with me in terms of ranking everything the same way I rank it, and will in retrospect think this was a good idea, so I trust it”. But then analogously it seems like it can make a similar argument for modifying itself to represent even infinite values.
Or more plausibly you say that however the AI is representing numbers, it’s not in this naive way where it can only do things with numbers it can fit inside its head. But then it seems like you’re back at having a representation that’ll allow it to set its reward to whatever number it wants without going and taking over anything.
This is a really interesting point. It seems like it goes even further - if the agent was only trying to maximise future expected reward, not only would it be indifferent between temporary and permanent “Nirvana”, it would be indifferent between strategies which achieved Nirvana with arbitrarily different probabilities, right? (Maybe with some caveats about how it would behave if it predicted the strategy might lead to negative-infinite states.)
So if a sufficiently fleshed out agent is going to assign a non-zero probability of Nirvana to every strategy, or at least most of them, since it’s not impossible, then won’t our agent just suddenly become incredibly apathetic and just sit there as soon as it reaches a certain level of intelligence?
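The indifference worry can be made concrete with a small sketch. It assumes rewards are ordinary floats, with `math.inf` standing in for the Nirvana payoff; the function `expected_reward` and the three example strategies are hypothetical, made up purely for illustration.

```python
import math

def expected_reward(outcomes):
    """Expected value of a list of (probability, reward) pairs."""
    return sum(p * r for p, r in outcomes)

# Hypothetical strategies: each reaches "Nirvana" (infinite reward) with some
# probability and otherwise collects a finite reward.
careful = [(0.9, math.inf), (0.1, 10.0)]
lazy = [(1e-12, math.inf), (1 - 1e-12, 0.0)]

# A strategy that also risks a negative-infinite state (the caveat above).
risky = [(0.5, math.inf), (0.5, -math.inf)]

print(expected_reward(careful))                           # inf
print(expected_reward(lazy))                              # inf
print(expected_reward(careful) == expected_reward(lazy))  # True
print(math.isnan(expected_reward(risky)))                 # True
```

Under these assumptions a pure expected-reward maximiser cannot tell `careful` from `lazy`, and the comparison stops being defined at all once negative-infinite outcomes are in play.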
I guess a way around is to just posit that however we build these things their rewards can only be finite, but that seems like either (a) something the agent could maybe undo or (b) shutting us off from some potentially good reward functions - if an aligned AI valued happy human lives at 1 utilon each, it would seem strange for it to not value somehow bringing about infinitely many of them.
Your comments here and some comments Eliezer has made elsewhere seem to imply he believes he has, at least in large part, “solved” consciousness. Is this fair? And if so, is there anywhere he has written up this theory/analysis in depth? Because surely, if correct, this would be hugely important.
I’m kind of assuming that whatever Eliezer’s model is, the bulk of the interestingness isn’t contained here and still needs to be cashed out, because the things you/he list (needing to examine consciousness through the lens of the cognitive algorithms causing our discussions of it, the centrality of self-modely reflexive things to consciousness etc.) are already pretty well explored and understood in mainstream philosophy, e.g. Dennett.
Or is the idea here that Eliezer believes some of these existing treatments (maybe modulo some minor tweaks and gaps) are sufficient for him to feel like he has answered the question to his own satisfaction?
Basically struggling to understand which of the 3 below is wrong, because all three being jointly true seems crazy
Yeah sure, there's a logical counterfactual strand of the argument, but that's not the topic I'm really addressing here - I find those a lot less convincing, so my issue here is around the use of Löbian uncertainty specifically. There's a step very specific to this species of argument: that proving □P will make P true when P is about the outcomes of the bets, because you will act based on the proof of P.
This is invoking Löb's Theorem in a manner which is very different from the standard counterpossible principle-of-explosion stuff. And I really want to discuss that step specifically because I don't think it's valid, and if the above argument is still representative of at least a strand of the relevant argument then I'd be grateful for some clarification on how (3.) is supposed to be provable by the agent, or how my subsequent points are invalid.
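For reference, this is the standard statement of the theorem that the step in question leans on. The symbol $T$ for the agent's proof system is notation assumed here, not taken from the argument under discussion; $\Box P$ abbreviates "P is provable in $T$".

```latex
\[
\text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P.
\qquad
\text{Internalised form: } T \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P.
\]
```

The theorem only delivers $P$ once the implication $\Box P \rightarrow P$ has itself been proved inside $T$, so the whole weight rests on whether the agent can establish that implication for a $P$ about the outcomes of the bets.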
I think there’s a sense in which some problems can be uncomputable even with infinite compute, no? For example, if the Halting problem were computable even with literally infinite time, then we could construct a machine that halted when given its own description iff it ran forever when given its own description. I do think there’s a distinction beyond just “arbitrarily large finite compute vs. infinite compute”. It seems like either some problems have to be uncomputable even by a hyper-computer, or else the concept of infinite compute time is less straightforward than it seems.
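The diagonal construction described above, as a minimal Python sketch. The `halts` argument is a hypothetical halting decider assumed only for the sake of contradiction, and `make_diagonal` and the two candidate deciders are illustrative names, not anything from an existing library.

```python
def make_diagonal(halts):
    """Build the program that does the opposite of what `halts` predicts for it."""
    def diagonal():
        if halts(diagonal):
            while True:      # predicted to halt, so loop forever
                pass
        return "halted"      # predicted to run forever, so halt at once
    return diagonal

# Two candidate deciders. Any real candidate must answer one way or the other
# on the diagonal program, so these two cover both cases.
always_yes = lambda prog: True
always_no = lambda prog: False

d = make_diagonal(always_no)
print(d())  # halted  (the decider claimed it would run forever)

# make_diagonal(always_yes)() would loop forever, contradicting a "halts"
# verdict, so it is deliberately not called here.
```

Note that the sketch only shows a decider cannot be right about programs that are themselves allowed to call it. It is silent on whether a stronger class of machine could decide halting for a weaker class.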
I totally agree on your other points though, I think the concept of bounded Solomonoff induction could be interesting in itself, although I presume with it you lose all the theoretical guarantees around bounded error. Would definitely be interested to see if there’s literature on this
I think the point is even stronger than that - Solomonoff induction requires not just infinite compute/time but doing something literally logically impossible - the prior is straight up uncomputable, not in any real-world tractability sense but as uncomputable as the Halting problem is. There’s a huge qualitative gulf between “we can’t solve this problem without idealised computers with unbounded time” and “we can’t solve this on a computer by definition”. Makes a huge difference to how much use the approach is for “crispening” ideas IMO
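For concreteness, the usual textbook form of the prior is sketched below. The notation is assumed here rather than taken from the post: $U$ is a universal monotone machine, $\ell(p)$ is the length of program $p$ in bits, and $U(p) = x\ast$ means the output of $p$ begins with the string $x$.

```latex
\[
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)}
\]
```

Evaluating the sum exactly means knowing, for every program, whether it ever produces output beginning with $x$, which is a halting-style question. $M$ can be approximated from below by running programs for longer and longer, but no computable bound tells you how close any finite-stage estimate is.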
On the last example with the XOR temporal inference - since the partitions/queries we’re asking about are also possible factors, doesn’t the temporal data in terms of history etc depend on which choice of factorisation we go with?
We have a choice of 2 out of 3 factors each of which corresponds to one of the partitions in question, so surely by factorising in different ways we can make any two of the variables have history of 1 and thus automatically orthogonal?
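A quick way to check the premise is to enumerate the four-element sample space directly. This sketch assumes the usual definition of a factorisation (a set of partitions such that choosing one part from each always pins down exactly one element); the names `partition` and `is_factorisation` are hypothetical helpers. It tests only whether each pair of variables is a valid factorisation, not the claims about history or orthogonality.

```python
from itertools import combinations, product

# Sample space: all (x, y) bit pairs; z is defined as x XOR y.
S = [(x, y) for x in (0, 1) for y in (0, 1)]

variables = {
    "X": lambda s: s[0],
    "Y": lambda s: s[1],
    "Z": lambda s: s[0] ^ s[1],
}

def partition(f):
    """Partition of S induced by the values of f."""
    parts = {}
    for s in S:
        parts.setdefault(f(s), set()).add(s)
    return list(parts.values())

def is_factorisation(partitions):
    """True if one part from each partition always intersects in exactly one element."""
    return all(
        len(set.intersection(*choice)) == 1
        for choice in product(*partitions)
    )

for a, b in combinations(variables, 2):
    ok = is_factorisation([partition(variables[a]), partition(variables[b])])
    print(a, b, ok)  # True for X Y, for X Z, and for Y Z

# All three together over-determine the element, so this is not a factorisation.
print(is_factorisation([partition(f) for f in variables.values()]))  # False
```

So any two of the three partitions do factorise the set, which is what makes the choice of factorisation a live question.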