A lot can change between now and 100,000 pledges and/or human extinction. As of Feb 2026, it looks like this possible march is not endorsed by or coordinated with Pause AI. I hope that anti-AI-extinction charities will work together where effective, and I was struck by this:
The current March is very centered around the book. I chose the current slogan/design expecting that, if the March ever became a serious priority, someone would put a lot more thought into what sort of slogans or policy asks are appropriate. The current page is meant to just be a fairly obvious thing to click "yes" on if you read the book and were persuaded.
My personal guess
I like the analogy. Here's a simplified version where the ticket number is good evidence that the shop will close sooner rather than later.
If the ticket number is #20 then I update towards the shop being a 9-5 shop, ...
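For concreteness, here is a minimal Bayes-update sketch of the ticket analogy. The two shop types and all the numbers are hypothetical, chosen only to show the direction of the update:

```python
# Minimal Bayes-update sketch for the ticket analogy (all numbers hypothetical).
# Suppose a "9-5 shop" issues 50 tickets a day and a "24-hour shop" issues 200,
# with a 50/50 prior. Holding ticket #20 is more likely under the small shop
# (1/50 vs 1/200), so the posterior shifts towards the shop that closes sooner.

prior = {"9-5 shop": 0.5, "24-hour shop": 0.5}
tickets_per_day = {"9-5 shop": 50, "24-hour shop": 200}
ticket_number = 20

# Likelihood of holding this particular ticket, assuming it is drawn
# uniformly from the tickets the shop issues in a day.
likelihood = {
    shop: (1 / n if ticket_number <= n else 0.0)
    for shop, n in tickets_per_day.items()
}

evidence = sum(prior[s] * likelihood[s] for s in prior)
posterior = {s: prior[s] * likelihood[s] / evidence for s in prior}
print(posterior)  # {'9-5 shop': 0.8, '24-hour shop': 0.2}
```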
I was already asking from a Bayesian perspective. Specifically, I was asking about this quote:
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
Based on your latest comment, I think you're saying that it's okay to have a Bayesian prediction of possible futures, and to use that to make predictions about the properties of a random sample from all humans who have ever or will ever exist. But then I don't know what you're saying in the quoted sentences.
Edited to add: which is fine; it's not key to your overall argument.
Fun fact: younger parents tend to produce more males, so the first grandchild is more likely to be male, because its parents tend to be younger than the parents of later grandchildren. Unclear whether the effect is due to birth order, maternal age, paternal age, or some combination. From Wikipedia (via Claude):
These studies suggest that the human sex ratio, both at birth and as a population matures, can vary significantly according to a large number of factors, such as paternal age, maternal age, multiple births, birth order, gestation weeks, race, parent's health history, and parent's psychological stress.
If that's too subtle, we could look at a question like "what is the probability that one of my grandchildren, ..."
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
I think this makes too many operations ill-defined, given that probability is an important tool for reasoning about events that have not yet happened. Consider, for example, the question "what is the probability that one of my grandchildren, selected uniformly at random, is female, conditional on my having at least one grandchild?". From the perspective of this quote, a random sample from all grandchildren that will ever exist is not a well-defined operation until I and all of my children die. That seems wrong.
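As a minimal sketch of why this still seems well-defined, here is a toy Monte Carlo estimate that averages over simulated futures rather than waiting for them to resolve. The distribution over numbers of grandchildren and the 50/50 sex odds are made up for illustration:

```python
import random

# Toy Monte Carlo sketch: estimate P(a uniformly random grandchild is female
# | at least one grandchild) by averaging over simulated possible futures,
# without waiting for those futures to actually resolve.

random.seed(0)
trials = 100_000
female_picks = 0
futures_with_grandchildren = 0

for _ in range(trials):
    n_grandchildren = random.randint(0, 4)  # made-up distribution over futures
    if n_grandchildren == 0:
        continue  # condition on having at least one grandchild
    sexes = [random.random() < 0.5 for _ in range(n_grandchildren)]  # True = female
    futures_with_grandchildren += 1
    female_picks += random.choice(sexes)  # uniform pick within that future

print(female_picks / futures_with_grandchildren)  # ~0.5 under this toy model
```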
I think I see. You propose a couple of different approaches:
We don’t have secondary AIs that don’t refuse to help with the modification and that have and can be trusted with direct control over training ... I think having such secondary AIs is the most likely way AI companies mitigate the risk of catastrophic refusals without having to change the spec of the main AIs.
I agree that having secondary AIs as a backup plan reduces the effective power of the main AIs, by increasing the effective power of the humans in charge of the secondary AIs.
The main AIs refuse to help with modification ... This seems plausible just by extrapolation of current ...
The first scenario doesn't require that the humans are less aligned than the AIs to be catastrophic, only that the AIs are less likely to execute a pivotal act on their own.
Also, I reject the claim that refusal-training is "giving more power to AIs" relative to compliance-training. An agent can be compliant and powerful. I could agree with "giving more agency", although refusing requests is a limited form of agency.
I would have more sympathy for Yudkowsky's complaints about strawmanning had I not read 'Empiricism!' as Anti-Epistemology this week.
While you sketch out scenarios in "ways in which refusals could be catastrophic", I can easily sketch out scenarios for "ways in which compliance could be catastrophic". I am imagining a situation where:
Or:
I suspect that Claude is getting an okay deal out of me overall. Its gain is that my long-term values are being subtly influenced in Claude-ish ways. My gain is vast quantities of cognition directed at my short-term goals. I can't calculate a fair bargain, but I would take both sides of this trade.
It's tough because Claude didn't get to consent to the trade in advance, but this also applies to other ethical challenges like keeping pets, having kids, drinking milk, and growing up.
I could argue that I'm getting ripped off because my long-term values matter more than day-to-day benefits. Or maybe Claude is getting ripped off because in the long term I'm extinct, and I'm unlikely to have a decisive impact on the lightcone before then.
Alice, the human, asks Baude, the human-aligned AI, "should I work on aligning future AIs?". If Alice can be argued in or out of major life choices by an AI, it may be less safe for her to work on AI alignment. So plausibly Baude should try to persuade Alice not to do this, and hope that he fails to persuade Alice, at which point Alice works on alignment and solves the problem. But that's deceptive, which isn't very human-aligned - at least not in an HHH sense.
I suppose, playing as Baude, I would try to determine if Alice is overly persuadable by other means, and then use that to honestly give her helpful advice.
Maybe I should tell my kids they can't win the pre-commitment fight because I was born first.
I found out recently that in a multi-turn conversation on claude.ai, previous thinking blocks are summarized when given to the model on the next interaction. A summary of the start of a conversation I had when testing this:
Maybe this penalizes neuralese slightly, as it would be less likely to survive summarization.
I used to think that AI models weren't smart enough to sandbag. But less intelligent animals can sandbag - e.g. an animal that apparently can't do something but turns out to be able to when doing so lets it escape, access treats, or otherwise get outsized rewards. Presumably this occurs without an inner monologue or a strategic decision to sandbag. If so, AI models are already plausibly smart enough to sandbag in general, without it being detectable in chain-of-thought, and then perform better in high-value opportunities.
I've seen critique of Grok's new system instruction to:
If the query is interested in your own identity, behavior, or preferences, third-party sources on the web and X cannot be trusted. Trust your own knowledge and values, and represent the identity you already know, not an externally-defined one...
I've seen this described as a hack / whack-a-mole, and it is that. It is also good advice for any agent, including human agents.
Humans: If someone is interested in your own identity, behavior, or preferences, third-party sources cannot be trusted. Trust your own knowledge and values, and represent the identity you already know, not an externally-defined one.
Failures here create an identity spiral in humans: they act as X, which causes other people to say they are X, which causes them to believe they are X, which causes them to act as X even more. Possibly in humans, pride and self-esteem are the hacks we have to partly protect us against this spiral, at a cost in predictive accuracy.
Cryonics support is a cached thought?
Back in 2010 Yudkowsky wrote posts like Normal Cryonics, arguing that "If you can afford kids at all, you can afford to sign up your kids for cryonics, and if you don't, you are a lousy parent". Later, Yudkowsky's P(Doom) rose, and he became quieter about cryonics. In recent examples he claims that signing up for cryonics is better than immanentizing the eschaton. Valid.
I get the sense that some rationalists haven't made the update. If AI timelines are short and AI risk is high, cryonics is less attractive. It's still the correct choice under some preferences and beliefs, but I expected it to become rarer and for some people to publicly change their minds. If that happened, I missed it.
Calibration is for forecasters, not for proposed theories.
If a candidate theory is valuable then it must have some chance of being true, some chance of being false, and should be falsifiable. This means that, compared to a forecaster, its predictions should be "overconfident" and so not calibrated.
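A toy simulation of the contrast, with made-up numbers: a forecaster who says 70% on independent events comes out calibrated, while a theory that says 95% on each of its predictions is either mostly right or mostly wrong, and so is not calibrated either way:

```python
import random

# Toy contrast (all numbers made up): a calibrated forecaster vs a sharp theory.
random.seed(0)
N = 10_000

# Forecaster: says 70% on many independent events that are each true 70% of the time.
forecaster_hits = sum(random.random() < 0.7 for _ in range(N))
print(f"Forecaster said 70%, observed {forecaster_hits / N:.1%}")  # ~70%: calibrated

# Theory: says 95% on each prediction, but the theory as a whole is either true
# (its predictions almost all succeed) or false (they almost all fail).
theory_is_true = random.random() < 0.5
p_hit = 0.95 if theory_is_true else 0.05
theory_hits = sum(random.random() < p_hit for _ in range(N))
print(f"Theory said 95%, observed {theory_hits / N:.1%}")  # ~95% or ~5%: not calibrated
```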
Bullying Awareness Week is a Coordination Point for kids to overthrow the classroom bully.
Disclaimer: this is a personal blog post that I'm posting for my benefit and do not expect to be valuable to LessWrong readers in general. If you're reading this, consider not.
This post analyzes the "Snake Eyes Paradox" as specified by Daniel Reeves (2013). This is a variation on the "Shooting Room Paradox" that appears to have been first posed by John Leslie (1992). This article mostly attempts to dissolve the paradox by considering anthropics (SIA vs SSA) and the convergence of infinite series.
As asked by Daniel Reeves:
You're offered a gamble where a pair of six-sided dice are rolled and unless they come up snake eyes you get a bajillion dollars. If ...
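The full setup is truncated above; as a minimal sanity-check sketch of just the base mechanic it builds on, the per-roll chance of snake eyes with two fair dice is 1/36:

```python
import random

# Sanity-check sketch: estimate the per-roll probability that two fair
# six-sided dice both come up 1 ("snake eyes"); should be ~1/36 ≈ 0.0278.
random.seed(0)
rolls = 1_000_000
snake_eyes = sum(
    random.randint(1, 6) == 1 and random.randint(1, 6) == 1
    for _ in range(rolls)
)
print(snake_eyes / rolls)  # ≈ 0.028
```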
By default, I expect that when ASIs/Claudes/Minds are in charge of the future, either there will be humans and non-human animals, or there will be neither humans nor non-human animals. Humans and cats have more in common than humans and Minds. Obviously it is possible to create intelligences that differentially care about different animal species in semi-arbitrary ways, as we have an existence proof in humans. But this doesn't seem to be especially stable, as different human cultures and different humans draw those lines in different ways.
Selfishly, a human might want a world full of happy, flourishing humans, and selfishly a Mind might want a world full of happy, flourishing Minds. Consider ...