Actually, there are ~100 rows in the dataset where Room2=Room4, Room6=Room8, and Room3=Room5=Room7.
I actually did look at that (at least some subset with that property) at some point, though I didn't think of (or get around to) re-looking at it with my later understanding.
In general, I think this is a realistic thing to occur: 'other intelligent people optimizing around this data' is one of the things that causes the most complicated things to happen in real-world data as well.
Indeed, I am not complaining! It was a good, fair difficulty to deal with.
That being said, there was one aspect I did feel was probably more complicated than ideal, and that was the combination of the tier-dependent alerting with the tiers not having any other relevance than this one aspect. That is, if the alerting had in each case been simply dependent on whether the adventurers were coming from an empty room or not, it would have been a lot simpler to work out. And if there was tier-dependent alerting, but the tiers were more obvious in other ways*, it would still have been tricky, but at least there would have been a path to recognize the tiers and then try to figure out other ways they might have relevance. The way it was, it seemed to me you pretty much had to look at what were (ex ante) almost arbitrary combinations of (current encounter, next encounter) to figure that aspect out, unless you actually guessed the rationale of the alerting effect.
That might be me rationalizing my failure to figure it out though!
* e.g. perhaps the traps/golems could have had the same score as the same-tier non-trap encounter when alerted (or alternatively when not alerted)
The biggest problem with AIXI, in my view, is the reward system: it cares about the future directly, whereas to have any reasonable hope of alignment an AI needs to care about the future only via what humans would want about the future (so that any reference to the future is encapsulated in the "what do humans want?" aspect).
I.e. the question it needs to be answering is something like "all things considered (including the consequences of my current action on the future, as well as taking into account my possible future actions) what would humans, as they exist now, want me to do at the present moment?"
Now maybe you can take that question and try to slice it up into rewards at particular timesteps, which change over time as what is known about what humans want changes, without introducing corrigibility issues, but the AIXI reward framework isn't really buying you anything imo even if that works, relative to directly trying to get an AI to solve the question.
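For concreteness, AIXI's action rule (schematically, roughly following Hutter's formulation, with $m$ the horizon, $U$ a universal machine run on programs $q$ of length $\ell(q)$, and percepts consisting of observations $o_t$ and rewards $r_t$) is

$$a_k \;=\; \arg\max_{a_k}\sum_{o_k r_k}\cdots\,\max_{a_m}\sum_{o_m r_m}\big[r_k+\cdots+r_m\big]\sum_{q:\,U(q,\,a_1\ldots a_m)\,=\,o_1 r_1\ldots o_m r_m}2^{-\ell(q)},$$

and the $[r_k+\cdots+r_m]$ term is exactly the "cares about the future directly" part I mean; the hope above would be to replace it with something whose only reference to the future goes through "what do the humans of right now want me to do?".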
On the other hand, approximating Solomonoff induction might afaik be a fruitful approach, though the approximations are going to have to be very aggressive for practical performance. I do agree embedding/self-reference can probably be patched in.
I think that it's likely to take longer than 10000 years, simply because of the logistics (not the technology development, which the AI could do fast).
The gravitational binding energy of the sun is something on the order of 20 million years' worth of its energy output. OK, half of the needed energy is already present as thermal energy, and you don't need to move every atom to infinity, but you still need a substantial fraction of that. And while you could perhaps generate many times more energy than the solar output by various means, I'd guess you'd have to deal with inefficiencies and lots of waste heat if you try to do it really fast. Maybe if you're smart enough you can make going fast work well enough to be worth it, though?
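Rough numbers behind that estimate (a sketch using the uniform-density 3/5·GM²/R approximation; the Sun's actual, centrally condensed binding energy is a few times larger, so ~20 million years is if anything on the low side):

```python
# Back-of-the-envelope: gravitational binding energy of the Sun vs. its luminosity.
G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30   # kg
R_SUN = 6.957e8    # m
L_SUN = 3.828e26   # W
YEAR = 3.156e7     # s

# Uniform-density approximation (3/5 * G * M^2 / R); the real Sun is a few times higher.
e_bind = 0.6 * G * M_SUN**2 / R_SUN

print(f"binding energy ~ {e_bind:.2e} J")
print(f"~ {e_bind / (L_SUN * YEAR) / 1e6:.0f} million years of solar output")
```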
I feel like a big part of what tripped me up here was an inevitable part of the difficulty of the scenario that in retrospect should have been obvious. Specifically, if there is any variation in the difficulty of an encounter that is known to the adventurers in advance, the score contribution of an encounter type on actual paths taken is less than the difficulty of the encounter as estimated by what best predicts the path taken (because the adventurers take the path when it's weak, but avoid it when it's strong).
So, I wound up with an epicycle saying hags and orcs were avoided more than their actual scores warranted, because that effect was most significant for them (goblins are chosen over most other encounters even if alerted, and Dragons mostly aren't alerted).
This effect was made much worse by the fact that I was getting scores mainly from lower difficulty dungeons, with lots of "Nothing" rooms and low level encounters. But even once I estimated scores from the overall data with my best guesses for preference order, the issue still applied, just not quite so badly.
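A toy illustration of that selection effect (entirely made-up numbers, not the scenario's actual mechanics): if the adventurers can see an encounter's strength in advance and only take it when it's the weaker option, the average score you observe on taken paths sits well below the encounter's unconditional average.

```python
import random

random.seed(0)

all_scores, taken_scores = [], []
for _ in range(100_000):
    # Hypothetical encounter whose strength the adventurers can see in advance:
    # uniform 2..6, so the unconditional mean score contribution is 4.
    strength = random.randint(2, 6)
    all_scores.append(strength)
    # Suppose the alternative route always scores a flat 4; the adventurers
    # only take this encounter when it happens to be the weaker option.
    if strength < 4:
        taken_scores.append(strength)

print("unconditional mean:", sum(all_scores) / len(all_scores))                # ~4.0
print("mean on paths actually taken:", sum(taken_scores) / len(taken_scores))  # ~2.5
```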
In the "what if" department, I had said:
> I'm also getting remarkably higher numbers for Hag compared with my earlier method. But I don't immediately see a way to profitably exploit this.
The most obvious way to exploit this would have been the optimal solution. Why didn't I do it? The answer is that, as indicated above, I was still underestimating the hag (whereas at this point I had mostly-accurate scores for the traps and orcs). With my underestimate for the hag's score contribution, I didn't think it was worth giving up an orc-boulder trap difference to get a hag-orc difference. I also didn't realize I needed the hag to alert the dragon.
In general, I feel like I was pretty far along with discovering the mechanics despite some missteps. I correctly had the adventurers taking a 5-encounter path with right/down steps, the choice of next step being based on the encounters in the choices for the next room, with an alerting mechanism, and that the alerting mechanism didn't apply to traps and golems.
On the other hand, I applied the alerting mechanism only to score and not to preference order, except for goblins and orcs (why didn't I try to apply it to preference order for other encounters once I realized it applied to preference order for goblins and orcs, and that some degree of alerting-mechanism score effect applied to other encounters?????). (I also got confused into thinking that the effect on orc preference order only applied if the current encounter was also orcs.) I also didn't realize that the alerting mechanism had different sensitivity for different encounters, and I had my mistaken belief about the preference order being different from expected score for some encounter types (hey, the text played up how unnerving the hag was, there was some plausibility there!).
I think if I had gotten to where I was in my last edit early on in the time frame for this scenario instead of near the end, and had posted it, and other people had read it and tried it out, collectively we would have had a good chance of solving the whole thing. I also would have been much more likely to get the optimal solution if I had paid more attention to what abstractapplic said, instead of only very briefly glancing over his comments after posting my very belated comment and going back to doing my own thing.
In my view this was a fun, challenging, and theoretically solvable scenario (even if in practice it wasn't that close to being solved), so I think it was quite good.
It's looking like I won't have figured this out before the time limit, despite the extra time. What I have so far:
I'm modeling this as follows, but I haven't fully worked it out, and I'm getting complications/hard-to-explain dungeons that suggest it might not be exactly correct.
For the dungeon generation: rooms 1-8 seem to be treated equally (room 9 is different and tends to have harder encounters). Encounters of the same types (and some related "themes") tend to be correlated. Scores in each tournament seem to be whole numbers from each judge, averaged over 3 or 4 judges; I am not sure if any tournaments are judged by 2 or 1, but if so they're relatively less common.
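(One cheap check on the "integer scores from 3 or 4 judges" guess, sketched below with made-up example values standing in for the dataset's score column: such averages are multiples of 1/12, and the reduced denominator of each score hints at the judge count.)

```python
from collections import Counter
from fractions import Fraction

# Made-up example values; in practice this would be the score column of the dataset.
scores = [4.25, 5.0, 3.6666667, 7.75, 2.3333333, 6.5]

# An average of n integer judge scores has a reduced denominator dividing n:
# denominator 3 -> three judges; 4 (or 2) -> four (or two) judges; 1 -> ambiguous.
denominators = Counter(Fraction(round(s * 12), 12).denominator for s in scores)
print(denominators)
```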
In theory, I'd like to plug in a preference model and a score model to a simulator and iterate to refine, but I'm not there yet, still working out plausible scores and preferences.
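A sketch of the kind of pluggable simulator I mean, with placeholder `preference` and `score` functions encoding my current guesses (the 3x3 right/down-path layout, the baseline numbers below, and the "doubled unless coming from an empty room, except traps/golems" rule are all provisional, not established mechanics):

```python
# Sketch: a 3x3 dungeon, adventurers walk from room 1 (top-left) to room 9
# (bottom-right) in right/down steps (5 rooms total), choosing each step by a
# preference function over the candidate next encounters. `preference` and
# `score` are the plug-in points to iterate on; these versions are guesses.

def placeholder_preference(encounter, prev):
    """Lower = more preferred. Placeholder order only; `prev` is passed in so a
    real version can make preference depend on alerting."""
    order = ["Nothing", "Goblins", "Whirling Blade Trap", "Orcs",
             "Boulder Trap", "Clay Golem", "Dragon", "Hag", "Steel Golem"]
    return order.index(encounter)

def placeholder_score(encounter, prev):
    """Guessed base scores, doubled for non-trap/non-golem encounters when the
    previous room wasn't empty (my provisional 'alerting' rule)."""
    base = {"Nothing": 0, "Goblins": 1.5, "Whirling Blade Trap": 3, "Orcs": 3,
            "Hag": 4, "Boulder Trap": 4.5, "Clay Golem": 6, "Dragon": 6,
            "Steel Golem": 7.5}
    alertable = encounter not in ("Whirling Blade Trap", "Boulder Trap",
                                  "Clay Golem", "Steel Golem")
    alerted = prev not in (None, "Nothing")
    return base[encounter] * (2 if alertable and alerted else 1)

def simulate(dungeon, preference=placeholder_preference, score=placeholder_score):
    """dungeon: rooms 1..9 as a list of 9 encounter names, row-major."""
    r, c, prev, total = 0, 0, None, 0.0
    while True:
        here = dungeon[3 * r + c]
        total += score(here, prev)
        if (r, c) == (2, 2):
            return total
        prev = here
        options = []
        if c < 2:
            options.append((r, c + 1))
        if r < 2:
            options.append((r + 1, c))
        # Most-preferred candidate next room wins (ties broken arbitrarily here).
        r, c = min(options, key=lambda rc: preference(dungeon[3 * rc[0] + rc[1]], here))
```

The plan would then be to compare simulated path choices and totals against the dataset and iterate on those two plug-in functions.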
One possibility for the scores and preference order:
baseline average scores:
Nothing: 0; Goblins: 1.5 (1d2?); Whirling Blade Trap: 3; Orcs: 3; Hag: 4; Boulder Trap: 4.5; Clay Golem: 6; Dragon: 6?; Steel Golem: 7.5 (edit: <--- numbers estimated with small, atypical samples (included many Nothing, which is problematic for reasons that become obvious with the below edit))
With Goblins and Orcs being increased (doubled?) if following goblins/orcs/any trap? (edit - or golems?) (edit - looking now like it's probably anything but an empty room?)
Plus with the adventurers seemingly avoiding Orcs and Hags more than their difficulty warrants? (I found them to be relatively late in the preference order, then found that they were in practice lower in score, so am having to ad hoc adjust if I keep the assumption that the score contribution and preference order are related. 1.5x multiplier? 2x multiplier? fixed addition?) (I'm assuming a 1.5x multiplier atm since I initially had Hag avoided over anything but orcs, but found one dungeon that looks suspiciously like, but does not prove, Hag being chosen over Dragon (edit: see below for update)) (I suppose +2 would also work) (edit - it looks like the Orc difficulty increase for following a non-empty room only applies to adventurer preference if the current room is also Orcs - violating the assumption that preference is tied to expected difficulty. But for Goblins it seems the preference may indeed depend only on following a non-empty room, though in practice it doesn't matter much since it only affects order wrt WBT).
(edit - see update to preference order below)
Assuming the above is correct, and I'm pretty sure it isn't but hopefully has some relationship with reality, one strategy might be:
CHN/WON/BOD <---obsolete answer
where the idea is to use the encounters the adventurers avoid too much relative to their actual score contributions (Hag, Orcs) to herd the adventurers away from the Nothing rooms. One of the Orcs is left in after a Boulder Trap in the belief that this will make it score higher than the hag. WBT is left in the preferred path to lead the adventurers along; I don't immediately see a way to avoid this.
EV if above model is correct: 6+3+4.5+6+6=25.5
How I've gotten here (mainly used Claude and Claude-written code, including the analysis tool, which is good for prototyping if you don't mind JavaScript):
edit: I've now read other people's comments; I did not notice any 1-point jump in scores (didn't check for it), and I'm not sure if I would have noticed whether it was a judging difference as opposed to a strategy change (wouldn't have noticed if it was just a strategy change). Also I did not notice anything special about Steel Golems at the entrance vs. other spots, did not check for any change in the distribution of 3- vs 4-judge tournaments, etc.
further analysis after the above:
I've looked at root mean square deviation of predictions from the data for the full dataset (full Bayes seems a bit intimidating to code atm even with AI help). From this it seems the preference order is (there remains a likely possibility for more complications I haven't checked):

Nothing > Goblins (current encounter null or Nothing) > Goblins (otherwise) = Whirling Blade Trap > Boulder Trap = Clay Golem = Orcs (current encounter not Orcs) > Dragon > Steel Golem >= Orcs (current encounter Orcs) > Hag

(a variant fit gives: Nothing > Goblins (current encounter null or Nothing) > Goblins (otherwise) = Whirling Blade Trap > Boulder Trap > Clay Golem = Orcs (current encounter not Orcs) > Dragon > Orcs (current encounter Orcs) > Hag = Steel Golem)
where I can't distinguish between Steel Golem being preferred or equal to Orcs with current encounter being Orcs.
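(That root-mean-square comparison is nothing fancier than the sketch below, where `predict_total` is a stand-in for whatever model is being tested, e.g. a wrapper around a simulator like the one sketched earlier, and `dungeons`/`observed` stand in for the dataset columns.)

```python
import numpy as np

def rmsd(predict_total, dungeons, observed):
    # Root mean square deviation of predicted dungeon totals from observed scores.
    preds = np.array([predict_total(d) for d in dungeons])
    return float(np.sqrt(np.mean((preds - np.asarray(observed)) ** 2)))
```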
Soo, if Orcs are avoided only as much as a Boulder Trap when the current encounter is not Orcs, I need to improve the herding. But also it seems Orcs get doubled by many other encounter types? This could work:
CHN/OBN/WOD <---- current solution
Predicted value is now 6+6+3+6+6=27.
further edit: also refining the scores, getting probably-nonsense results (due to missing some dependency of some stuff on something else, probably), but it's looking like maybe every encounter's score depends on whether the previous encounter was Nothing/null. Except traps/golems? Which would explain why Steel Golems are being reported as better in the first slot.
I'm also getting remarkably higher numbers for Hag compared with my earlier method. But I don't immediately see a way to profitably exploit this.
I feel like this discussion could do with some disambiguation of what "VNM rationality" means.
VNM assumes consequentialism. If you define consequentialism narrowly, this has specific results in terms of instrumental convergence.
You can redefine what constitutes a consequence arbitrarily. But, along the lines of what Steven Byrnes points out in his comment, redefining this can get rid of instrumental convergence. In the extreme case you can define a utility function for literally any pattern of behaviour.
When you say you feel like you can't be Dutch-booked, you are at least implicitly assuming some definition of consequences that you can't be Dutch-booked in terms of. To claim that one is rationally required to adopt any particular definition of consequences in your utility function is basically circular, since you only care about being Dutch-booked according to it if you actually care about that definition of consequences. It's in this sense that the VNM theorem is trivial.
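To make the "literally any pattern of behaviour" point concrete (a standard construction, not something specific to this thread): fix any policy $\pi$ mapping histories to actions, and define a utility function over complete histories by

$$u(h) \;=\; \begin{cases}1 & \text{if every action in } h \text{ matches what } \pi \text{ prescribes given the history so far,}\\ 0 & \text{otherwise.}\end{cases}$$

Then following $\pi$ maximizes expected $u$ by construction, so "maximizes some utility function" on its own constrains nothing; all the content is in which notion of consequences $u$ is allowed to depend on.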
BTW I am concerned that self-modifying AIs may self-modify towards VNM-0 agents.
But the reason is not because such self modification is "rational".
It's just that (narrowly defined) consequentialist agents care about preserving and improving their abilities and proclivities to pursue their consequentialist goals, so tendencies towards VNM-0 will be reinforced in a feedback loop. Likewise for inter-agent competition.
You can also disambiguate between
a) computation that actually interacts in a comprehensible way with the real world and
b) computation that has the same internal structure at least momentarily but doesn't interact meaningfully with the real world.
I expect that (a) can usually be uniquely pinned down to a specific computation (probably in both senses (1) and (2)), while (b) can't.
But I also think it's possible that the interactions, while important for establishing the disambiguated computation that we interact with, are not actually crucial to internal experience, so that the multiple possible computations of type (b) may also be associated with internal experiences - similar to Boltzmann brains.
(I think I got this idea from "Good and Real" by Gary L. Drescher. See sections "2.3 The Problematic Arbitrariness of Representation" and "7.2.3 Consciousness and Subjunctive Reciprocity")
The interpreter, if it would exist, would have complexity. The useless unconnected calculation in the waterfall/rock, which could be but isn't usually interpreted, also has complexity.
Your/Aaronson's claim is that only the fully connected, sensibly interacting calculation matters. I agree that this calculation is important - it's the only type we should probably consider from a moral standpoint, for example. And the complexity of that calculation certainly seems to be located in the interpreter, not in the rock/waterfall.
But in order to claim that only the externally connected calculation has conscious experience, we would need to have it be the case that these connections are essential to the internal conscious experience even in the "normal" case - and that to me is a strange claim! I find it more natural to assume that there are many internal experiences, but only some interact with the world in a sensible way.
But this just depends on how broad this set is. If it contains two brains, one thinking about the Roman Empire and one eating a sandwich, we're stuck.
I suspect that if you do actually follow Aaronson (as linked by Davidmanheim) to extract a unique efficient calculation that interacts with the external world in a sensible way, that unique efficient externally-interacting calculation will end up corresponding to a consistent set of experiences, even if it could still correspond to simulations of different real-world phenomena.
But I also don't think that consistent set of experiences necessarily has to be a single experience! It could be multiple experiences unaware of each other, for example.
That's a high level of hypothetical harm that they are ruling out (~2 IQ points?). I would take the dental harms many times over to avoid that much cognitive ability loss.