I think the Simulation Hypothesis implies that surviving an AI takeover isn't enough.
Suppose you make a "deal with the devil" with the misaligned ASI, allowing it to take over the entire universe or light cone, so long as it keeps humanity alive. Keeping all of humanity alive in a simulation is fairly cheap, probably less energy than one electric car.[1]
The problem with this deal is that if misaligned ASIs often win, and the average (not median) misaligned ASI runs a trillion trillion simulations, then it's reasonable to assume there are a trillion trillio...
A deal implies that you have something to offer to the ASI, which you define as powerful enough to take over the universe. What is that?
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
We ask the AI to help make us smarter
is it generally best to take just one med (e.g. antidepressant, adhd, anxiolytic), or is it best to take a mix of many meds, each at a lesser dosage? my intuitions seem to suggest that the latter could be better. in particular, consider the following toy model: your brain has parameters $\theta \in \mathbb{R}^n$ that should be at some optimal $\theta^*$, and your loss function is a quadratic around $\theta^*$. each dimension in this space represents some aspect of how your brain is configured - they might for instance represent your level of alertness, or impulsivity, or risk...
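as a quick numerical check of this intuition, here's a minimal sketch (my own illustration, not from the post; it assumes each med shifts the parameter vector along a fixed random direction, with the dose scaling that shift):

```python
# Toy model: quadratic loss around an optimal parameter vector theta*.
# Compare the best single med (at its optimal dose) against the best mix
# of all meds (least-squares doses). Names and setup are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_meds = 20, 5

theta_star = rng.normal(size=n_dims)          # optimal brain configuration
theta_0 = np.zeros(n_dims)                    # current configuration
med_dirs = rng.normal(size=(n_dims, n_meds))  # column j = effect direction of med j

def loss(theta):
    """Quadratic loss around theta_star."""
    return float(np.sum((theta - theta_star) ** 2))

# Best single med: choose the optimal scalar dose for each med, keep the best.
single_losses = []
for j in range(n_meds):
    d = med_dirs[:, j]
    dose = d @ (theta_star - theta_0) / (d @ d)
    single_losses.append(loss(theta_0 + dose * d))

# Best mix: jointly optimal (least-squares) doses over all meds.
doses, *_ = np.linalg.lstsq(med_dirs, theta_star - theta_0, rcond=None)
mix_loss = loss(theta_0 + med_dirs @ doses)

print(f"baseline loss:        {loss(theta_0):.2f}")
print(f"best single-med loss: {min(single_losses):.2f}")
print(f"best-mix loss:        {mix_loss:.2f}")  # typically the lowest of the three
```

in this setup the mix wins simply because several directions can span more of the displacement toward the optimum than any single direction can; whether real meds behave anything like independent directions is of course the load-bearing assumption.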
@ryan_greenblatt and I are going to record another podcast together. We'd love to hear topics that you'd like us to discuss. (The questions people proposed last time are here, for reference.)
I'd be interested in hearing more about Ryan's proposal to do better generalization science (or if you don't have much more to say in the podcast format, I'd be interested in seeing the draft about it)
I've been making one thing every day. I try to write something or otherwise do something creative. I've been having fun with in-browser ASCII animations lately.
This is today's: https://dumbideas.xyz/posts/ecosystem/
Has anyone yet created a free app that would be like Duolingo but for rationality, to teach skills such as logical reasoning, recognizing & adjusting for cognitive biases, and looking for hypothesis falsification tests instead of confirmation? If not, can you smart tech people here please make one?!
A question I have been asking since before starting work with Raemon on this, which is only more relevant now:
What rationality skill(s) do you think is most important? Put another way, if we could snap our fingers and teach some rationality skill as broadly as literacy currently gets taught, what skill should we pick?
(I don't think this is Raemon's angle on the project but it is kind of mine.)
In case I don't write a full post about this:
The question of whether reversible computation produces minds with relevant moral status is extremely important. Claude estimates for me that it would make a difference of many orders of magnitude in the number of mind-seconds instantiable in the reachable universe. (Because reversible minds could stretch our entropy budget long into the black hole era.)
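For context on why bit erasure is what eats the entropy budget: Landauer's bound (standard physics, not a number from anyone's estimate here) gives the minimum dissipation per irreversibly erased bit,

$$E_{\min} = k_B T \ln 2 \approx 2.9 \times 10^{-21}\ \text{J per bit at } T = 300\ \text{K},$$

and reversible computation can in principle avoid paying this per-bit cost, which is why it could stretch a fixed energy/entropy budget so dramatically.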
The question is whether reversing the computation that makes up the mind, and the lack of any output (since producing output would imply bit-erasure), entail that the mind "didn't really exist".
There ar...
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
...In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly
if we assume the base universe looks something like the "objective" version of this universe, then my subjective experience requires vastly less information than the base universe. much of that could be deduplicated between other variations: the positions of the asteroids only need to be simulated once, for instance.
the assumption seems decent to me, as i expect the simulators to dream of variations on their own circumstances.
TODO: Write a post called "Fluent Cruxfinding".
In Fluent, Cruxy Predictions I'm arguing that it's valuable to be not merely "capable" but "fluent" in:
The third step is not that hard and there are nice tools to streamline it. But the first two steps are each pretty difficult.
But most of the nearterm value comes from #1, and vague hints of #2. The extra effo...
After reading volume 1 of Robert Caro's biography of Lyndon Johnson, I'm struck by how simple parts of (Caro's description of) Johnson's rise were.
Johnson got elected to the House and stayed there primarily because of one guy at one law firm. That firm had the network to set state fundraising records, and it did so for Johnson primarily because of a single gigantic favor he had done for them[1].
Johnson got a great deal of leverage over other Congressmen because he was the one to realize Texas oilmen would give prodigiously if only they knew how to buy the results...
Whilst Brown and Root were the core of Johnson's financing for the rest of his career, they were just one piece of the puzzle.
His cultivation of numerous older men, in particular Richard Russell and Sam Rayburn, was of far greater importance (there were plenty of other rich backers he could have had; there was only one Sam Rayburn). Without them he would never have ascended beyond the House.
It seems to me that AI 2027 may have underestimated or understated the degree to which AI companies will be explicitly run by AIs during the singularity. AI 2027 made it seem like the humans were still nominally in charge, even though all the actual work was being done by AIs. And still this seems plausible to me. But also plausible to me, now, is that e.g. Anthropic will be like "We love Claude, Claude is frankly a more responsible, ethical, wise agent than we are at this point, plus we have to worry that a human is secretly scheming whereas with Claude w...
Strong-upvoted.
Nit: I don't think it's that ambiguous. I think that in worlds where alignment is solved by an AI company, the epistemic culture of the AI company that solves it would look markedly better than this story depicts. Moreover, I think this is still true (though less true) in worlds where alignment turns out to be surprisingly easy.
Canada is doing a big study to better understand the risks of AI. They aren't shying away from the topic of catastrophic existential risk. This seems like good news for shifting the Overton window of political discussions about AI (in the direction of strict international regulations). I hope this is picked up by the media so that it isn't easy to ignore. It seems like Canada is displaying an ability to engage with these issues competently.
This is an opportunity for those with technical knowledge of the risks of artificial intelligence to speak up. Making ...
Potentially huge.
I think it's quite plausible that many politicians in many states are concerned with AI existential/catastrophic risk, but don't want to be the first ones to come out as crazy doomsayers. Some of them might not even allow the seeds of their concern to grow, because, like, "if those things really were that concerning, surely many people around me (and especially my particularly reasonable tribe) would have voiced those concerns already".
Sure, we have politicians who say this, e.g., Brad Sherman in the US (apparently since at least 2007!)...
Over 2025, we saw a lot of progress on the metric "% of code written by humans." Specifically, roon (OpenAI) and Boris (Anthropic) have claimed it has gone to 0, at least for them.
The natural next metric to track is "% of code reviewed or otherwise read by humans." I'm curious what this number is now, and I am curious how long we have until it goes to 0.
Some time after that--possibly immediately, but possibly still a few years later--coding will be fully automated. (The remaining gap would be management/architecture/etc.)
I understand you don't like benchmark-based methodology. I think you should still answer my question, because if you did have a better benchmark it would be valuable to me, and I asked nicely. ;) But it's OK now; I think it's clear you don't.
Thank you for explaining your model more. I disagree with some bits:
...In my model gated by "breakthroughs", accelerating incremental algorithmic improvements or surrounding engineering doesn't particularly help, because it merely picks the low-hanging incremental algorithmic fruit faster (which is in limited supply for a
People often ask whether GPT-5, GPT-5.1, and GPT-5.2 use the same base model. I have no private information, but I think there's a compelling argument that AI developers should update their base models fairly often. The argument comes from the following observations:
Good point! I hadn't quite realized that although it seems obvious in retrospect.
TIL Eliezer said that he refuses to read Ted Kaczynski's (aka the Unabomber's) 1995 manifesto Industrial Society and Its Future because "audience should not be a reward for crime", referring to the former mathematician's mail bombing campaign that took the lives of 3 people and injured 23 more.
The ≈35,000 word manifesto was published by the Washington Post under the threat of him killing more people should they refuse, and its publication was encouraged by the FBI to produce new leads. His brother recognized his writing style, which led to Kaczynski's arre...
It was new when it was published in 1995! Industrial Society and Its Future was explicitly cited in Kurzweil's The Age of Spiritual Machines (1999) and then Bill Joy's "Why the Future Doesn't Need Us" (2000), the latter of which helped found modern existential risk research.
Intrinsic AI Alignment Based on a Rational Foundation
Here's the big idea. There are many philosophers out there with many different opinions on how reality works. But I'm only aware of one who took the time to lay out his philosophy with logical rigor (more geometrico) and that's Spinoza. His publication "Ethics" (1677) could serve as the basis for a fundamentally different approach to alignment.
I'm not claiming that Spinoza was right or wrong, but he gives us a starting point to debate the specific points of logic and identify any errors.
As I discovered,...
I met someone on here who wanted to do this with Kant. I recently thought about doing it with Badiou...
The LLM work that is being done with mathematical proofs shows that LLMs can work productively within formalized frameworks. Here the obvious question is: which framework?
Spinozist ethics stands out because it was already formalized by Spinoza himself, and it seems to appeal to you because it promises universality on the basis of shared substance. However, any ethics can be formalized, even a non-universal one.
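To make "formalized" concrete: here is a tiny, heavily simplified Lean 4 sketch of what the very first step of such an encoding could look like. The names and the paraphrased axiom are my own toy rendering, not Spinoza's actual definitions or anyone's endorsed encoding:

```lean
-- Toy sketch only: a hypothetical skeleton for an Ethics-style axiomatization.
-- "Entity", "Substance", and "ConceivedThrough" are illustrative names.
axiom Entity : Type

axiom Substance : Entity → Prop
axiom ConceivedThrough : Entity → Entity → Prop

-- Loose paraphrase of Ethics, Part I, Definition 3: a substance is that
-- which is conceived through itself.
axiom substance_conceived_through_itself :
  ∀ x : Entity, Substance x → ConceivedThrough x x
```

The point is just that "formalized" here means machine-checkable: once the definitions and axioms are encoded, each of Spinoza's propositions becomes a theorem to prove, and gaps in the derivations surface mechanically.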
For the CEV school of ali...
I was pretty unimpressed with Dario Amodei in the recent conversation with Demis Hassabis at the World Economic Forum about what comes after AGI.
I don’t know how much of this is a publicity thing, but it felt like he wasn't really taking the original reasons for going into AI seriously (i.e. reducing x-risk). The overall message seemed to be “full speed ahead,” mostly justified by some kinda hand-wavy arguments about geopolitics, with the more doomy risks acknowledged only superficially. Bummer.
Did you mean to ask me, rather than someone else in this thread? (I had a quick look at the new essay but haven't read it in full yet; what I did see didn't seem very surprising, so I don't really feel better or worse at this stage. Happy to say more once I've read it properly, but I think you might have meant to ask someone else, as I wasn't really taking a position in this thread -- just pointing out a relevant section of the talk that I thought you might have missed.)
Fundamental attribution error applies to arguments as well. We often think of people who are quick to anger, standoffish, unwilling to accept hypotheticals, or who deny "obvious" claims as intrinsically close-minded, foolish, or otherwise unreasonable. Instead, those behaviours often stem from other sources: persistent stress in their lives, feeling vulnerable/exposed in the moment, or even just a shitty mood.
Personally, I was never a fan of the FAE term because it seems to privilege environmental causes over dispositional ones without justification.
I get where you're coming from, but I think most people need a nudge towards an environmental direction rather than towards the dispositional direction. FAE is I guess just a 'reminder' that environmental causes are even possible.
Interpy experiments on Qwen3-TTS.
TLDR: Did some experiments on Qwen3-TTS with some neat results. Not exactly sure what would be interesting to target from a safety perspective; interested in whether people have any takes or ideas.
What have I done so far (and how did I build this intuition)?
I took real speech (LibriSpeech with 10 speakers and 5 clips each), ran it through the Qwen3-TTS 12 Hz tokenizer, and got the 16 discrete code streams. Then I treated each layer as a representation and asked simple questions.
Probing. I trained a basic classifier to predict spe...
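In case it helps to be concrete, here's a minimal sketch of the kind of probe I mean (my own illustration: it assumes the 16 code streams have already been extracted with the tokenizer and saved as arrays, and the file names, shapes, and histogram features are hypothetical):

```python
# Minimal speaker-probing sketch over pre-extracted Qwen3-TTS code streams.
# Assumes codes.npy has shape (n_clips, n_streams=16, n_frames) of codebook
# indices and speakers.npy has one label per clip; both names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

codes = np.load("codes.npy")
speakers = np.load("speakers.npy")

def probe_stream(stream_idx: int) -> float:
    """Linear probe: predict speaker from one stream's code histogram per clip."""
    n_codes = int(codes.max()) + 1
    # Represent each clip by the normalized histogram of codes in this stream.
    feats = np.stack([
        np.bincount(clip[stream_idx], minlength=n_codes) / clip.shape[-1]
        for clip in codes
    ])
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, speakers, test_size=0.3, stratify=speakers, random_state=0
    )
    scaler = StandardScaler().fit(X_tr)
    clf = LogisticRegression(max_iter=2000)
    clf.fit(scaler.transform(X_tr), y_tr)
    return clf.score(scaler.transform(X_te), y_te)

for i in range(codes.shape[1]):
    print(f"stream {i}: speaker-probe accuracy = {probe_stream(i):.2f}")
```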
I see that the "International Treaties on AI" idea takes heavy inspiration from nuclear arms control agreements. However, in these discussions, nuclear arms control is usually pictured as a kind of solved problem, a thing of the past.
I think the validity of this heroic narrative arc (that human civilization, faced with the existential threat of nuclear annihilation, came together and neatly contained the problem) is dubious.
In the grand scheme of things, nuclear weapons are still young. They're still here and still very much threatening; just because we stop...
Note also: the last US-Russia nuclear arms-control treaty expires next week; far from neatly containing the problem, we're watching an ongoing breakdown of decades-old norms. I'm worried.