The context for this post is that I've had qualms about bayesian epistemology for most of the last decade. My most notable attempts to express them previously were Realism about rationality and Against strong bayesianism. In hindsight, those posts weren't great, but they're interesting as documentation of waypoints on my intellectual journey (see also here and here). This post is another such waypoint. Since writing it last year, I've built on these ideas (and my qualms about expected utility maximization) to continue developing my theory of coalitional agency. I don't know how compelled most readers feel by what I've written publicly about this research agenda thus far (i.e. this sequence, most of the posts on this blog, and some recent shortforms) but I'm very excited about it and expect to make significant progress on it in 2026.
I'm also still fairly happy with this post specifically, and expect that it will stand the test of time better than the other two above (in part because it's starting to articulate a positive vision rather than just bashing bayesianism). My main regret is on a pedagogical level: it was a mistake to start with point 1 (fuzzy truth values) rather than point 2 (the semantic view). I think it gave people the impression that I was primarily trying to defend fuzzy truth-values. But most formal accounts of fuzzy truth-values seem pretty useless. My main point was actually that epistemology should be formulated in terms of models—and that, once we do so, it's hard to avoid assigning those models something like degrees of truth (even if we don't precisely know how yet).
I'm also unsure whether it was a good idea to explain "reasoning in terms of models" via mathematical logic. @Kaarel has a critique in the comments below which I've been slowly chewing on, and which deserves a substantive response.
My most substantive exchange in the comments was with @johnswentworth. This didn't update me much. Here's how John wanted to deal with vague propositions:
- There's some latent variable representing the semantics of "humanity will be extinct in 100 years"; call that variable S for semantics.
- Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
- ... and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
Basically, he wants to treat vagueness as a kind of inherent uncertainty that no amount of data can resolve. But vagueness and uncertainty are just different things! Most notably, uncertainty follows the laws of probability, whereas vagueness doesn't. As a concrete example from the post itself:
“the Earth is a sphere” is mostly true, and “every point on the surface of a sphere is equally far away from its center” is precisely true. But “every point on the surface of the Earth is equally far away from the Earth’s center” seems ridiculous
Suppose John assigns [0.8, 0.9] credence to "the Earth is a sphere", as his way of formalizing the vagueness of what counts as a sphere, and takes it as a tautology that every point on the surface of a sphere is equally far away from its center. Then he should assign [0.8, 0.9] credence to "every point on the surface of the Earth is equally far away from the Earth’s center". But of course the latter statement is clearly false. There are probably various clever ways to try to escape this problem, but I don't think any of them deal with the core issue.
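A toy model makes the trap concrete (this is my own sketch, not John's formalism): whenever A entails B, any probability distribution over possible worlds forces P(B) ≥ P(A), so John's [0.8, 0.9] credence in "the Earth is a sphere" must propagate to the ridiculous conclusion.

```python
# Toy possible-worlds model: propositions are keys, worlds are dicts.
# If A entails B (no world has A true and B false), then the B-worlds
# include the A-worlds, so P(B) >= P(A) under ANY distribution.
import random

A = "earth_is_sphere"
B = "surface_equidistant_from_center"

# A set of worlds in which A entails B.
worlds = [
    {A: True,  B: True},
    {A: False, B: True},
    {A: False, B: False},
]

def entails(worlds, a, b):
    return all(w[b] for w in worlds if w[a])

def credence(dist, prop):
    return sum(p for w, p in dist if w[prop])

assert entails(worlds, A, B)

# Whatever probabilities John picks, monotonicity is forced on him.
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in worlds]
    total = sum(raw)
    dist = list(zip(worlds, (r / total for r in raw)))
    assert credence(dist, B) >= credence(dist, A)
```

So there is no way to assign high credence to the vague premise while keeping the precise-sounding conclusion at low credence; vagueness handled as uncertainty inherits probability's monotonicity whether you want it or not.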
A more promising way to think about uncertainty vs vagueness: uncertainty is a description of your epistemic state within the context of a fixed "language game", whereas vagueness involves a meta-game in which you might vary which language you're using (either for coordination with other people or for coordination between your internal subagents). I'd like to eventually be able to formalize this perspective.
As a final note, there are also a bunch of comments on the version of the post on my blog which LW readers might find interesting.
Every now and then, I say "wuckles" next to a person who I am work trialing, or who for some other reason cares about not looking dumb in front of me.
They casually-but-frantically google "wuckles" in the background to try to figure out what I meant. The only google result for "wuckles" is an urban dictionary response which is not what I meant by it, and it's pretty clear from context it's not what I meant by it. They are confused.
After seeing this summary I anticipated that the rest of the post would be about how you're deliberately saying a nonsense word in order to test how conformist job applicants are / whether they're capable of admitting confusion even in situations where there are incentives against that.
Maybe you need another word for that now.
Why did the task fall to a bunch of kids/students?
I'm not surprised by this, my sense is that it's usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that it outweighs the benefits of their expertise and resources.
Also 1993 to 2000 doesn't seem like that large a gap to me. Though I guess the thing I'm pointing at could also be summarized as "why hasn't someone created a new paradigm of AI safety in the last decade?" And one answer is that Paul and Chris and a few others created a half-paradigm of "ML safety", but it hasn't yet managed to show impressive enough results to fully take over. However, it did win on a memetic level amongst EAs in particular.
The task at hand might then be understood as synthesizing the original "AI safety" with "ML safety". Or, to put it a bit more poetically, it's synthesizing the rationalist approach to aligning AGI with the empiricist approach to aligning AGI.
To answer your question: it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning. But here are some:
If this is true, then it should significantly update us away from the strategy "solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning", right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don't require you to do much X.
You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even given that we disagree on that, humanity solved many big problems in the past without using much philosophical competence and consequentialist reasoning, so it seems hard to be confident that we won't solve our current problems in other ways.
Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely—e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also the Soviets didn't really have the game theory IIRC.
On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you're pointing at, then I'm very much on board—that's what I think we should be doing for ethics and intelligence. Is it?
Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I'm probably happy it has happened).
I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?
Because I have various objections to this list (some of which are detailed above) and with such a succinct list it's hard to know which aspects of them you're defending, which arguments for their positive effects you find most compelling, etc.
Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux.
The first two paragraphs of my original comment were trying to do this. The rest wasn't. I flagged this in the sentence "The rest of my comment isn't directly about this post, but close enough that this seems like a reasonable place to put it." However, I should have been clearer about the distinction. I've now added the following:
EDIT: to be more clear: the rest of this comment is not primarily about Neel or "pragmatic interpretability", it's about parts of the field that I consider to be significantly less relevant to "solving alignment" than that (though work that's nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have "given up" (rather than something more nuanced like "given up on tackling the most fundamental aspects of the problem"). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
So what's going on here? It feels like we're both being "anchored" by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to "giving up". Both I'd guess are artifacts of writing quickly and a bit frustratedly. Probably I should write a full post or shortform that characterizes more precisely what "giving up" is trying to point to.
(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)
My instinctive reaction is that this depends a lot on whether by "marginalist approaches" we mean something closer to "a single marginalist approach" or "the set of all people pursuing marginalist approaches". I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I'd guess that I'm more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won't reliably improve things.
I expect it's not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don't reliably work. I do agree that my combination of "marginalist approaches don't reliably improve things" and "P(doom) is <50%" is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.
(Interestingly, while the combination I describe above is a rare one, the converse is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn't characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)
I agree with this statement denotatively, and my own interests/work have generally been "driven by open-ended curiosity and a drive to uncover deep truths", but isn't this kind of motivation also what got humanity into its current mess? In other words, wasn't the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?
Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.
I don't think you should find this a fully satisfactory answer, because it's easy to "retrodict" ways that my theory was correct. But that's true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was a significant barrier to more researchers working on deep learning the criticism that it didn't actually provide good explanations of or insight into cognition? Without having looked it up, I suspect so.)
consistently good strategy requires a high amount of consequentialist reasoning
I don't think that's true. However, I do think it requires deep curiosity about what good strategy is and how it works. It's not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correctly, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.
Again, I don't expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it's hard to trust metaphilosophy or even know what it is (though I think you've given a sketch of this in a previous reply to me at some point).
I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I'm trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.
In trying to reply to this comment I identified four "waves" of AI safety, and made lists of the central people in each wave. Since this is socially complicated I'll only share the full list of the first wave here, and please note that this is all based on fuzzy intuitions gained via gossip and other unreliable sources.
The first wave I’ll call the “founders”; I think of them as the people who set up the early institutions and memeplexes of AI safety before around 2015. My list:
The second wave I’ll call the “old guard”; those were the people who joined or supported the founders before around 2015. A few central examples include Paul Christiano, Chris Olah, Andrew Critch and Oliver Habryka.
Around 2014/2015 AI safety became significantly more professionalized and growth-oriented. Bostrom published Superintelligence, the Puerto Rico conference happened, OpenAI was founded, DeepMind started a safety team (though I don't recall exactly when), and EA started seriously pushing people towards AI safety. I’ll call the people who entered the field from then until around 2020 "safety scalers" (though I'm open to better names). A few central examples include Miles Brundage, Beth Barnes, John Wentworth, Rohin Shah, Dan Hendrycks and myself.
And then there’s the “newcomers” who joined in the last 5-ish years. I have a worse mental map of these people, but some who I respect are Leo Gao, Sahil, Marius Hobbhahn and Jesse Hoogland.
In this comment I expressed concern that my generation (by which I mean the "safety scalers") has kinda given up on solving alignment. But another higher-level concern is: are people from these last two waves the kinds of people who would have been capable of founding AI safety in the first place? And if not, where are those people now? Of course there's some difference in the skills required for founding a field vs pushing the field forward, but to a surprising extent I keep finding that the people I have the most insightful conversations with are the ones who were around from the very beginning. E.g. I think Vassar is the single person doing the best thinking about the lessons we can learn from the failures of AI safety over the last decade (though he's hard to interface with); Yudkowsky is still the single person most able to push the Overton window towards taking alignment seriously (even though in principle many other people could have written (less doomy versions of) his Time op-ed or his recent book); Scott is still the single best blogger in the space; and so on.
Relatedly, when I talk to someone who's exceptionally thoughtful about politics (and particularly the psychological aspects of politics), a disturbingly large proportion of the time it turns out that they worked at (or were somehow associated with) Leverage. This is really weird to me. Maybe I just have Leverage-aligned tastes/networks, but even so, it's a very striking effect. (Also, how come there's no young Moldbug?)
Assuming that I'm gesturing at something real, what are some possible explanations?
This is all only a rough gesture at the phenomenon, and you should be wary that I'm just being pessimistic rather than identifying something important. Also it's a hard topic to talk about clearly because it's loaded with a bunch of social baggage. But I do feel pretty confused and want to figure this stuff out.
This depends on the data distribution though, which could vary greatly (and in fact the data you collect will vary based on your actions which in turn are based on your models).
So I think a lot of the action is in defining which loss we care about.
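To illustrate why the distribution does so much of the work (a toy construction of my own, not from the original thread): the same two models can be ranked oppositely under two data distributions, so "which loss we care about" is inseparable from "which data we evaluate on".

```python
# Two predictors, two data distributions: each predictor wins on one
# distribution and loses on the other, under the same squared loss.

def sq_loss(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

model_a = lambda x: x        # predicts y = x
model_b = lambda x: 2 * x    # predicts y = 2x

dist_1 = [(1, 1), (2, 2)]    # world where y = x holds
dist_2 = [(1, 2), (2, 4)]    # world where y = 2x holds

assert sq_loss(model_a, dist_1) < sq_loss(model_b, dist_1)
assert sq_loss(model_b, dist_2) < sq_loss(model_a, dist_2)
```

And since (as noted above) your actions shape which data you collect, the evaluation distribution is itself partly downstream of the models being evaluated.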