I think that people overrate Bayesian reasoning and underrate "figure out the right ontology".
Most good thinking, IMO, happens by finding and using a good ontology for the situation, not by probabilistic calculation. When I learned calculus, for example, it wasn't mostly that I had uncertainty over a bunch of logical statements which I then strongly updated upon learning the new theorems; it was instead that I learned a bunch of new concepts, which I then applied to reason about the world.
I think AI safety generally has much better concepts for thinking about the future of AI than other communities do, and this is a key source of alpha we have. But there are obviously still a huge number of disagreements remaining within AI safety. I would guess that debates would be more productive if we more explicitly focused on the ontology/framing each party is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I think it would be good if more people say things like "I think that's a bad concept, because it obscures consideration X, which is important for thinking about the situation".
Here are...
I would guess that debates would be more productive if we more explicitly focused on the ontology/framing each party is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I strongly agree with this. However, I'll note, as one aspect of the discourse problem, that at least in my personal experience people are not very open to this. People's eyes tend to glaze over. I don't mean this as a dig at them. In fact, I notice the same thing in myself; and because I think it's important, I try to incline towards being open to such discussions, but my eyes still glaze over sometimes. (Sometimes endorsedly.)
Some things that are going on, related to this:
Some meandering thoughts on alignment
A nearcast of how we might go about solving alignment using basic current techniques, assuming little/no substantive government intervention, is:
Scheming seems like an unnatural concept to me. I think we can do better. (Note: many/most of these thoughts are not original to me.)

I think that the views of superforecasters on AI / AI risk should provide basically no update.
It seems to me like the main reasons to defer to someone are:
If you are someone in AI who thinks that it's appropriate to defer to superforecasters, I think it would be a good idea to try to set up a meeting and talk with one of the people you are deferring to, and see whether they are actually making reasonable arguments that seem grounded in technical reality.
Even better would be if we already had these sorts of arguments collected. https://goodjudgment.com/superforecasting-ai/ contains links to 17 superforecasters' reviews of Carlsmith's p(doom) report, some of them supposedly AI experts. I invite people to skim through some of them.
Copying very relevant portions of a comment I wrote in Mar 2024:
Note: These are all rough numbers, I'd expect I'd shift substantially about all of this on further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of biocatastrophe (including AI-mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next-easiest takeover path and do that instead. I think this is persuasive but not fully convincing, because:
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
AIs need to worry about their own alignment problem, meaning that they may not be able to self-improve in an unconstrained fashion.
I haven't thought too deeply about this, but I would guess that the AI self-alignment problem is quite a lot easier than the human AI-alignment problem.
Here are my largest disagreements with AI 2027.
Nice. Consider reposting this as a comment on the AI 2027 blog post either on LW or on our Substack?
For me, my median is in 2029 now (at the time of publication it was 2028) so there's less of a difference there.
I think I agree with you about 2 actually and do feel a bit bad about that. I also agree about 3.
I also think that the slowdown ending was unrealistic in another way, namely, that Agent-4 didn't put up much of a fight and allowed itself to get shut down. Also, it was unrealistic in that the CEOs and POTUS peacefully cooperated on the Oversight Committee instead of having power struggles and purges and ultimately someone emerging as dictator.
I'm starting to suspect that if 2026-2027 AGI happens through automation of routine AI R&D (automating acquisition of deep skills via RLVR), it doesn't obviously accelerate ASI timelines all that much. Automated task and RL environment construction fixes some of the jaggedness, but LLMs are not currently particularly superhuman, and advancing their capabilities plausibly needs skills that aren't easy for LLMs to automatically RLVR into themselves (as evidenced by humans not having made too much progress in RLVRing such skills).
This creates a strange future with broadly capable AGI that's perhaps even somewhat capable of frontier AI R&D (not just routine AI R&D), but doesn't accelerate further development beyond picking low-hanging algorithmic fruit unlocked by a given level of compute faster (months instead of years, but bounded by what the current compute makes straightforward). If this low-hanging algorithmic fruit doesn't by itself lead to crucial breakthroughs, AGIs won't turn broadly or wildly superhuman before there's much more compute, or before a few years where human researchers would've made similar progress as these AGIs. And compute might remain gated by ASML EUV tools at 100-200 GW of new compute per year (3.5 tools occupied per GW of compute each year; maybe 250-300 EUV tools exist now, 50-100 will be produced per year, about 700 will exist in 2030).
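A minimal back-of-the-envelope sketch of the EUV bottleneck arithmetic above; the 3.5-tools-per-GW figure and the fleet sizes are the rough numbers quoted in this comment, not independent estimates:

```python
# Sketch: how much new AI compute per year the EUV fleet can support,
# under the comment's rough assumption of ~3.5 EUV tools occupied per GW
# of new compute each year.
TOOLS_PER_GW_PER_YEAR = 3.5  # assumption taken from the comment above

def new_compute_gw_per_year(euv_tools: float) -> float:
    """GW of new AI compute supportable per year by a given EUV fleet size."""
    return euv_tools / TOOLS_PER_GW_PER_YEAR

for tools in (300, 500, 700):  # roughly: today, mid-decade, ~2030 fleet sizes
    print(f"{tools} EUV tools -> ~{new_compute_gw_per_year(tools):.0f} GW/year of new compute")
# 300 tools -> ~86 GW/yr, 700 tools -> ~200 GW/yr, consistent with the
# 100-200 GW/year range quoted above.
```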
Hypothesis: alignment-related properties of an ML model will be mostly determined by the part(s) of training that were most responsible for capabilities.
If you take a very smart AI model with arbitrary goals/values and train it to output any particular sequence of tokens using SFT, it'll almost certainly work. So can we align an arbitrary model by training it to say "I'm a nice chatbot, I wouldn't cause any existential risk, ... "? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
On the other hand, if you train an AI model from the ground up with a hypothetical "perfect reward function" that always gives correct ratings to the behaviour of the AI (and you train on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will basically behave in accordance with the reward function.
A related intuition pump for the difference is the effect of training someone to say "I care about X" by punishing them until they say it consistently, vs raising them consistently with a large...
Here's an argument for short timelines that I take seriously:
I think I feel better about (2) than basically any other way of getting an anchor on when AGI will be built, because it much more directly tracks real-world impacts of AI, whereas e.g. it seems really difficult to get any sort of confidence on what OOM of effective FLOPs or what benchmark score corresponds to AGI.
(1) still seems dubious to me, I think revenue trends will probably slow. But I don't know when and I could totally imagine them continuing straight to AGI.
(What exactly do I mean by AGI? I don't think it matte...
Note that in order for Anthropic revenue to 10x this year, they'll already have to increase $/FLOP (i.e. revenue per unit of compute; basically, profit margins). To grow revenue another 10x the following year, they'll probably need to triple $/FLOP, because their compute will only roughly triple next year. Ditto for 2028. All this is a reason to doubt premise 1: in the past they've been able to grow revenue in large part by just allocating more of their compute to serving customers, but now they'll have to charge customers more per FLOP.
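A minimal sketch of the decomposition behind this point, treating revenue as ($/FLOP) times (FLOP served to customers); the 10x revenue and ~3x compute growth figures are the rough ones from this comment:

```python
# Sketch: if revenue ~= (revenue per FLOP) * (FLOP served), then the required
# growth in $/FLOP is the revenue growth divided by the compute growth.
def required_price_per_flop_multiplier(revenue_growth: float, compute_growth: float) -> float:
    """How much $/FLOP must rise for revenue to grow `revenue_growth`x
    when deployed compute only grows `compute_growth`x."""
    return revenue_growth / compute_growth

# To 10x revenue while compute only triples, $/FLOP must roughly triple:
print(required_price_per_flop_multiplier(10, 3))  # -> ~3.33
```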
The revenue of the leading AI company will be between $100B/yr and $10T/yr when AGI is achieved. (Why not lower? Maybe, but AGI this year seems unlikely. Why not higher? If one company's revenue is on the order of 10% of current world GDP, then the whole AI industry is probably 50-100% of current world GDP, and at that point you probably already have AGI.)
am i understanding correctly?
this seems rather circular?
yo, totally!
sorry, i didn't mean my comment to reject the conclusion of your post. obviously we can argue agi on its own merits -- the puppy is not a valid analogy for exactly the reason you specify.
however -- speaking narrowly about the quoted passage -- i find this move very suspicious:
this is valid, insofar as we accept the premises. but it seems disingenuous to me. any plausible narrative we have for B happening has to route first through A happening. we can interpret reasoning-under-uncertainty as a kind of "path counting" game -- we are counting "potential futures" according to some measure. but any path through B must necessarily pass through A, by assumption! so any story that we tell about why B will happen is implicitly a story where A happens.
so we can't count evidence for B as separate evidence for A. any probability we assign to B already has A baked in as an assumption.
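to put the same point in symbols (a sketch, with A and B as above and the stated assumption that every path to B passes through A, i.e. B ⊆ A):

$$P(B) = P(A \wedge B) = P(A)\,P(B \mid A) \le P(A)$$

so whatever probability we assign to B is already, in full, probability assigned to A -- quoting a B-story as extra evidence for A double-counts it.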
if i say[1] "agi is 20 years away", and you reply "it's only three years away: look at how close anthropic is to [developing agi and] controlling the world economy" -- this is not going to be ...
I also take this argument seriously.
One background fact some commenters are missing: it's virtually unheard of for a tech startup to continue growing at 100% or more after it reaches $1 billion per year in revenue. A company growing at closer to 1000% per year at the multi-billion revenue level is wildly unprecedented. A company tripling its revenue in one quarter from a starting point of $10 billion, as Anthropic did in Q1, is even more wildly unprecedented than that.
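A small bit of arithmetic to make "even more wildly unprecedented" concrete; the tripled-in-a-quarter figure is the one quoted above, and the annualization just asks what happens if that rate were sustained for four quarters:

```python
# Sketch: annualizing a 3x-in-one-quarter growth rate (if sustained).
quarterly_multiple = 3.0
annualized_multiple = quarterly_multiple ** 4      # four quarters of compounding
annualized_growth_pct = (annualized_multiple - 1) * 100
print(f"~{annualized_multiple:.0f}x per year (~{annualized_growth_pct:,.0f}% growth)")
# -> ~81x per year (~8,000% growth), versus the ~100%/year that is already
# nearly unheard of past $1B/yr in revenue.
```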
Revenue growth has momentum, and it is essentially locked in that frontier LLMs will be a bigger business than the biggest tech industries (smartphones, internet advertising) are today.
Some claims I've been repeating in conversation a bunch:
Safety work (I claim) should be focused on one of the following:
I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it's useful to know what pivotal process you are aiming for. Specifically, why aren't you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime -- the AI has some goals that do not equal humanity's CEV, and so has a convergent incentive to powerseek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in this window where your AI is smart enough to do the cognitive work, but is 'nerd-sniped' or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for.
Thinking about ethics.
After thinking more about orthogonality I've become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is 'right' with a paperclipper, there's nothing I can say to them to convince them to instead value human preferences or whatever.
I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing roughly that not-nihilism implies moral realism. I now reject the implication, and think that both 1) there is no universal, objective morality, and 2) things matter.
My current approach is to think of "goodness" in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because what's currently in there is a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people's moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people's beliefs in my moral uncertainty.
Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I'm not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV.
Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior.
There are two ways for deception to appear:
Another way of describing the difference is that 1 follows from an inner alignment failure: a mesaoptimizer learned an unintended mesaobjective that performs well in training. 2, by contrast, follows from an outer alignment failure: an imperfect reward signal.
Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.
Once the AI has the ability to create strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and e...
Current impressions of free energy in the alignment space.
I think that external deployment of AI systems is good for the world and so many policies that incentivize AI companies to only deploy internally are bad.
I am concerned that many policies people in the AI safety space are pushing for create an incentive for companies to not externally deploy their AI mo...
Thinking a bit about takeoff speeds.
As I see it, there are ~3 main clusters:
Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.
The basic case against Effective-FLOP.
Some rough takes on the Carlsmith Report.
Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:
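Schematically (my notation, not the report's), the bottom-line estimate is then the product of the conditional probabilities assigned to each step:

$$P(\text{existential catastrophe}) = \prod_{i=1}^{6} P\left(\text{step}_i \mid \text{step}_1, \ldots, \text{step}_{i-1}\right)$$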
Some thoughts on inner alignment.
1. The type of a mesa objective and the type of a base objective are different (in real life)
In a cartesian setting (e.g. training a chess bot), the outer objective is a function $R_{\text{base}} : \mathcal{T} \to \mathbb{R}$, where $S$ is the state space and $\mathcal{T}$ is the set of trajectories through $S$. When you train this agent, it's possible for it to learn some internal search and a mesaobjective $R_{\text{mesa}} : \mathcal{T} \to \mathbb{R}$, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that e...
There are several game theoretic considerations leading to races to the bottom on safety.