niplav — LessWrong

I operate by Crocker's rules. All LLM output is explicitely designated as such. I have made no self-hiding agreements.

I'm revisiting this post after listening to this section of this recent podcast with Holden Karnofsky.

Seems like this post was overly optimistic in what RSPs would be able to enforce/not quite clear on different scenarios for what "RSP" could refer to. Specifically, this post was equivocating between "RSP as a regulation that gets put into place" vs. "RSP as voluntary commitment"—we got the latter, but not really the former (except maybe in the form of the EU Codes of Practice).

Even at Anthropic, the way the RSP is put into practice is now basically completely excluding a scaling pause from the picture:

RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs!

Interview:

That was never the intent. That was never what RSPs were supposed to be; it was never the theory of change and it was never what they were supposed to be... So the idea of RSPs all along was less about saying, 'We promise to do this, to pause our AI development no matter what everyone else is doing'

and

But we do need to get rid of some of this unilateral pause stuff.

Furthermore, what apparently happens now is that really difficult commitments either don't get made or get walked back on:

Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it's easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.

Interview:

So I think we are somewhat in a situation where we have commitments that don't quite make sense... And in many cases it's just actually, I would think it would be the wrong call. In a situation where others were going ahead, I think it'd be the wrong call for Anthropic to sacrifice its status as a frontier company

and

Another lesson learned for me here is I think people didn't necessarily think all this through. So in some ways you have companies that made commitments that maybe they thought at the time they would adhere to, but they wouldn't actually adhere to. And that's not a particularly productive thing to have done.

I guess the unwillingness of the government to turn RSPs into regulation is what ultimately blocked this. (Though maybe today even a US-centric RSP-like regulation would be considered "not that useful" because of geopolitical competition). We got RSP-like voluntary commitments from a surprising number of AI companies (so good job on predicting the future on this one) but that didn't get turned into regulation.

It's a bit of a travesty there's no canonical formal write-up of UDASSA, given all the talk about it. Ugh, TODO for working on this I guess.

My understanding is that UDASSA doesn't give you unbounded utility, by virtue of directly assigning , and the sum of utilities is proportional to $\sum_{i = 0}^{\infty} 2^{- i} = 2$ . The whole dance I did was in order to be able to have unbounded utilities. (Maybe you don't care about unbounded utilities, in which case UDASSA seems like a fine choice.)

(I think that the other horn of de Blanc's proof is satisfied by UDASSA, unless the proportion of non-halting programs bucketed by simplicity declines faster than any computable function. Do we know this? "Claude!…")

Edit: Claude made up plausible nonsense, but GPT-5 upon request was correct, proportion of halting programs declines more slowly than some computable functions.

Edit 2: Upon some further searching (and soul-searching) I think UDASSA is currently underspecified wrt whether its utility is bounded or unbounded. For example, the canonical explanation doesn't mention utility at all, and none of the other posts about it mention how exactly utility is defined..

Makes sense, but in that case, why penalize by time? Why not just directly penalize by utility? Like the leverage prior.

Huh. I find the post confusingly presented, but if I understand correctly, 15 logical inductor points to Yudkowsky₂₀₁₃—I think I invented the same concept from second principles.

Let me summarize to understand: My speed prior on both the hypotheses and the utility functions is trying to emulate just discounting utility directly (because in the case of binary tapes and integers penalizing both for the exponential of speed gets you exactly an upper bound for the utility), and a cleaner way is to set the prior to . That avoids the "how do we encode numbers" question that naturally raises itself.

Does that sound right?

(The fact that I reinvented this looks like a good thing, since that indicates it's a natural way out of the dilemma.)

I think the upper bound here is set by a program "walking" along the tape as far as possible while setting the tape to and then setting a list bit before halting (thus creating the binary number $11 ⁿ$ where $n \leq B B (| p |)$ ^[1]). If we interpret that number as a utility, the utility is exponential in the number of steps taken, which is why we need to penalize by $2^{- steps (p)}$ instead of just $1 / steps (p)$ ^[2]. If you want to write $3 ↑↑↑ 3$ on the tape you have to make at least ${log}_{2} (3 ↑↑↑ 3)$ steps on a binary tape (and ${log}_{n} (3 ↑↑↑ 3)$ on an n-ary tape).

Technically the upper bound is $Σ (| p |)$ , the score function. ↩︎
Thanks to GPT-5 for this point. ↩︎

epistemic status: Going out on a limb and claiming to have solved an open problem in decision theory^[1] by making some strange moves. Trying to leverage Cunningham's law. Hastily written.

p(the following is a solution to Pascal's mugging in the relevant sense)≈25%^[2].

Okay, setting (also here in more detail): You have a a Solomonoff inductor with some universal semimeasure as a prior. The issue is that the utility of programs can grow faster than your universal semimeasure can penalize them, e.g. a complexity prior has busy-beaver-like programs that produce amounts of utility with the program $| p |$ , while only being penalized by $2^{- | p |}$ . The more general results are de Blanc 2007, de Blanc 2009 (LW discussion on the papers from 2007). We get this kind of divergence of expected utility on the prior if

the prior is bounded from below by a computable function and
the utility function is computable and unbounded

The next line of attack is to use the speed prior as a prior $2^{- | p | - log (t (p))}$ . That prior is not bounded by a computable function from below (because it grows slower than $1 / B B (n)$ for programs of length $n$ ), so we escape into one of de Blanc's horns. (I don't think having a computable lower bound is that important because K-complexity was never computable in the first place.)

But there's an issue: What if our hypotheses output strings that are short, but are evaluated by our utility function as being high-value anyway? That is, the utility function takes in some short string of length $n$ and outputs as its utility $B B (n)$ . This is the case if the utility function itself is a program of some computational power, in the most extreme case the utility function is Turing-complete, and our hypotheses "parasitize" on this computational power of our utility function to be a Pascal's mugging. So what we have to do is to also consider the computation of our utility function as being part of what's penalized by the prior. That is,

$2^{- | p | - t (p) - t_{u} (eval (p))}$

for $t_{u} (eval (p))$ being the time it takes to run the utility function on the output of $p$ . I'll call this the "reflective speed prior". Note that if you don't have an insane utility function which is Turing-complete, the speed penalty for evaluating the output of $p$ should be fairly low most of the time.

Pascal's mugging can be thought of in two parts:

My expected utility diverges on priors, that is, not having observed any information or made any Bayesian updating, my utility can get arbitrarily big. I think this one is a problem.
My expected utility can diverge after updating on adversarially selected information. I think this case should be handled by improving your epistemology.

I claim that the reflective speed prior solves 1., but not 2. Furthermore, and this is the important thing, if you use the reflective speed prior, the expected utility is bounded on priors, but you can have arbitrarily high maximal expected utilities after performing Bayesian updating. So you get all the good aspects of having unbounded utility without having to worry about actually getting mugged (well, unless you have something controlling the evidence you observe, which is its own issue).

(Next steps: Reading the two de Blanc papers carefully, trying to suss out a proof, writing up the argument in more detail. Think/argue about what it means to update your prior in this strange way, and specifically penalizing hypotheses by how long it takes your utility function to evaluate them. Figure out which of these principles are violated. (On an initial read: Definitely Anti-Timidity.) Changing ones prior in a "superupdate" has been discussed here and here.)

Edit: Changed from penalizing the logarithm of runtime and utility-runtime to penalizing it linearly, after feedback from GPT-5.

I like to live dangerously. ↩︎
Updated down from 40% after getting some feedback by GPT-5 on what the costs of this approach are. ↩︎

Best I can tell, the risk of psychosis is much higher with Goenka style retreats, although I don't have hard numbers, only anecdotal evidence and theory that suggests it should be more common.

My experience has been that anything except long intensive retreats don't move my mind out of its default attractor state, and that I probably waited too long to attend do long retreats, and all the advice goes in the opposite direction.

I mention this because all the talk of the downside of meditation have me thinking of a tweet that goes roughly like "why do both republicans and democrats pretend HRT does anything". Goenka retreats have medium-strength effects on me, an intensive one-month retreat at home had a decent effect. I may be doing something wrong.

Yup, that's correct if I remember the sources correctly. I guess the tone surrounding it doesn't match that particular bit of content. I should also turn the pledged/received numbers into a table for easier reading.

Yup, it's a regionalism that I mis-/over-generalized. I'll avoid it from now on.

It is, and it's the thing I'd most like Smil to read if I could recommend something to him.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments