While I think this is a broadly reasonable response, I'm curious what you think can provide better public justification than longtermism. These results seem to apply fairly broadly to any realistic EV-based justification for action, given that partial observability is very much the rule.
Well, I meant it as an empirical hypothesis and thought it might have formal implications (specifically, placing the problem in a smaller, more tractable class).
Just an incomplete comment on "The assumptions that make reward-seekers plausible also make fitness-seekers plausible": I think a central question is whether X-seeking gives you a compressed policy versus an "optimal kludge". That is: if it's just as hard to learn the optimal policy as an X-seeker as it is to learn the optimal kludge as a non-X-seeker, then it seems like I'm unlikely to learn X-seeking (or X-seeking is at best no more likely than a whole host of other possible behavioural spandrels, which implies exactly the same thing).
I think the argument that X-seeking incentivises optimal behaviour is some reason to think it might be compressive, but not obviously a very strong one: if all X-seeking gets you is "I should do well on evals", then that's a very small piece of policy it's compressing, not obviously even enough to pay its own cost. That is, the extra bit of policy "I should seek X" seems like it could easily be longer, or have lower prior probability, than "I should do well on evals". If "I should seek X" helped further with actually doing well on evals then I think there's a stronger argument to be made, but I need to think about this more; it's not immediately apparent what that would actually look like.
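Roughly, the comparison I have in mind (a sketch with made-up notation: $\pi^\ast$ is whatever behaviour does well in training, and $K(\cdot)$ is description length / negative log-prior under the learner's inductive bias) is that X-seeking only wins if

$$K(\text{seek } X) + K(\pi^\ast \mid \text{seek } X) \;<\; K(\pi^\ast),$$

i.e. the cost of the extra piece "seek X" has to be repaid by how much it compresses the rest of the policy. If the only thing it compresses is the small fragment "do well on evals", the saving $K(\pi^\ast) - K(\pi^\ast \mid \text{seek } X)$ is tiny and the inequality looks hard to satisfy.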
Separate comment: the title doesn't seem to connect well to the content, and it'd be nice if you were clearer about whether your theorems are partly original or simply lifted from the relevant texts, with your modelling choices doing the work of making them applicable (I think the latter, given the absence of proofs, but "my first theorem" sorta confuses this).
A common longtermist model is one where there's a transient period of instability (perhaps "around about now") that settles into a stable state thereafter. This seems like it would be no harder than a finite-horizon problem that terminates when stability is achieved. I haven't looked into the results you quote or exactly what role the infinite horizon plays, but intuitively it seems right that eternal instability (or even very long-lived instability) along any dimension of interest would make policy choice intractable, while stability in the near future can make it fairly straightforward. Maybe there's an issue where the highest-value outcomes occur in the unstable regimes, which makes it hard to "bet on stability", but I'd like to see that in maths plus plausible examples.
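To gesture at the reduction I have in mind (a sketch with notation I'm inventing, not anything taken from the paper): model the world as an MDP with an absorbing set of "stable" states, reached at a random time $\tau$ that is finite almost surely. Then the infinite-horizon value splits as

$$V^{\pi}(s) \;=\; \mathbb{E}^{\pi}\!\left[\,\sum_{t=0}^{\tau-1} \gamma^{t}\, r(s_t, a_t) \;+\; \gamma^{\tau}\, V_{\mathrm{stab}}(s_\tau)\right],$$

where $V_{\mathrm{stab}}$ is the value of the stable continuation, assumed easy to evaluate (or roughly policy-independent). Under that assumption, choosing a policy only requires optimising over the transient period, which is essentially a finite-horizon problem. The worry in my last sentence is then that $\gamma^{\tau} V_{\mathrm{stab}}(s_\tau)$ dominates and varies enormously with which stable state you land in, so the "easy to evaluate" assumption would be doing all the work.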
I was responding to Gurkenglas’ comment as I understood it; I agree your paper is not about this.
So I think if you buy that a randomly initialized 1T-parameter transformer does in fact contain "Aligned ASI" and "deceptively aligned ASI" in its "prior", but we don't have the data to "find" them yet, then you're probably right that Jan 2026-era training data doesn't change their prior ratio much (or certainly doesn't change it predictably). But this doesn't really matter: what matters is which systems we actually realise, and the contributions they make to the next generation of AI development, and different data can change the likelihoods significantly there.
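To spell out the distinction with a toy Bayesian gloss (treating training as something like posterior updating, which is of course not literal):

$$\frac{P(m_{\text{aligned}} \mid D)}{P(m_{\text{deceptive}} \mid D)} \;=\; \underbrace{\frac{P(D \mid m_{\text{aligned}})}{P(D \mid m_{\text{deceptive}})}}_{\text{data-dependent likelihood ratio}} \;\times\; \underbrace{\frac{P(m_{\text{aligned}})}{P(m_{\text{deceptive}})}}_{\text{prior ratio at initialization}}.$$

Even if the second factor is roughly fixed by the architecture and initialization, the first factor is exactly where the choice of training data enters, and its effects compound across generations of systems.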
They also generalize much more intuitively than small RL-from-scratch models, which is probably the more important feature. E.g., an LLM trained in that environment would probably just figure out that the cheese was the objective.
Do you know what his aims are? I feel like that's an important part of a model like this!
I've sometimes joked that the doomsday-style arguments are actually arguments about the death of interest in anthropics.