[Originally written for Twitter]
Many AI risk failure modes imagine strong coherence/goal directedness (e.g. [expected] utility maximisers).
Such strong coherence is not exhibited by humans, seems unlikely to emerge from deep learning, and may be "anti-natural" to general intelligence in our universe.
I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error.
I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems.
[I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD.
HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.]
The shard theory account of value formation (while incomplete) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals of strong coherence ontologies. I currently believe that immutable terminal goals are just a wrong frame fo...
Something I'm still not clear on is how to think about effective agents in the real world.
I think viewing idealised agency as an actor that evaluates argmax wrt (the expected value of) a simple utility function over agent states is just wrong.
Evaluating argmax is very computationally expensive, so most agents, most of the time, will not be directly optimising over their actions but instead executing learned heuristics that historically correlated with better performance according to the metric the agent was selected for (e.g. reward).
That is, even if an agent somehow fully internalised the selection metric, directly optimising it over all its actions is computationally intractable in "rich" environments (complex/high-dimensional problem domains, continuous, partially observable/imperfect information, stochastic, large state/action spaces, etc.). So a system inner aligned to the selection metric would still perform most of its cognition in a mostly amortised manner, provided the system is subject to bounded compute constraints.
Furthermore, in the real world learning agents don't generally become inner aligned to the selection metric, but ...
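To make the direct-argmax vs. amortised-heuristics contrast concrete, here's a minimal toy sketch (the `utility`, `direct_argmax`, and `amortised_policy` functions are all invented for illustration, not a model of any real agent):

```python
import itertools
import random

ACTIONS = range(10)  # branching factor 10
HORIZON = 6          # 10**6 candidate plans; at realistic horizons this explodes

def utility(plan):
    # Stand-in for the selection metric (e.g. reward); cheap here, expensive in reality.
    return sum(a * (i + 1) for i, a in enumerate(plan)) % 97

def direct_argmax():
    # "Idealised" agency: exhaustively optimise over every possible plan.
    # Cost grows as len(ACTIONS) ** HORIZON, i.e. exponentially in the horizon.
    return max(itertools.product(ACTIONS, repeat=HORIZON), key=utility)

def amortised_policy(observation, heuristic_table):
    # What bounded agents do most of the time: a cheap lookup of a learned
    # heuristic that historically correlated with the selection metric.
    return heuristic_table.get(observation, random.choice(ACTIONS))
```

Even in this toy, bumping `HORIZON` from 6 to 60 makes `direct_argmax` physically infeasible while the amortised lookup stays constant-cost, which is the sense in which I expect bounded systems to be mostly amortised.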
Adversarial robustness is the wrong frame for alignment.
Robustness to adversarial optimisation is very difficult[1].
Cybersecurity requires adversarial robustness, intent alignment does not.
There's no malicious ghost trying to exploit weaknesses in our alignment techniques.
This is probably my most heretical (and for good reason) alignment take.
It's something dangerous to be wrong about.
I think the only way such a malicious ghost could arise is via mesa-optimisers, but I expect such malicious daemons to be unlikely a priori.
That is, you'll need a training environment that exerts significant selection pressure for maliciousness/adversarialness for the property to arise.
Most capable models don't have malicious daemons[2], so it won't emerge by default.
[1]: Especially if the adversary is a more powerful optimiser than you.
[2]: Citation needed.
What I'm currently working on:
The sequence has an estimated length between 30K - 60K words (it's hard to estimate because I'm not even done preparing the outlines yet).
I'm at ~8.7K words written currently (across 3 posts [the screenshots are my outlines]) and guess I'm only 5% of the way through the entire sequence.
Beware the planning fallacy though, so the sequence could easily grow significantly longer than I currently expect.
I work full time until the end of July and will be starting a Masters in September, so here's hoping I can get the bulk of the piece completed in August, when I have more time to focus on it.
Currently, I try for some significant writing [a few thousand words] on weekends and fill in my outlines on weekdays. I try to add a bit more each day, just continuously working on it, until it spontaneously manifests. I also use weekdays to think about the sequence.
So, the twelve posts I've currently planned could very well have ballooned in scope by the time I can work on it full time.
Weekends will also be when I have the time for extensive research/reading for some of the posts.
Most of the catastrophic risk from AI still lies in superhuman agentic systems.
Current frontier systems are not that (and IMO not poised to become that in the very immediate future).
I think AI risk advocates should be clear that they're not saying GPT-5/Claude Next is an existential threat to humanity.
[Unless they actually believe that. But if they don't, I'm a bit concerned that their message is being rounded up to that, and when such systems don't reveal themselves to be catastrophically dangerous, it might erode their credibility.]
I find noticing surprise more valuable than noticing confusion.
Hindsight bias and post hoc rationalisations make it easy for us to gloss over events that were a priori unexpected.
I think mesa-optimisers should not be thought of as learned optimisers, but as systems that employ optimisation/search as part of their inference process.
The simplest case is that pure optimisation during inference is computationally intractable in rich environments (e.g. the real world), so systems (e.g. humans) operating in the real world do not perform inference solely by directly optimising over outputs.
Rather, optimisation is sometimes employed as one part of their inference strategy. That is, systems o...
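A rough sketch of what I mean by optimisation as one component of inference (all names here, `heuristic_action`, `local_search`, `hybrid_inference`, are invented for illustration):

```python
import random

def heuristic_action(state, cached_policy):
    # Cheap amortised inference: reuse a learned heuristic.
    return cached_policy.get(state, 0)

def local_search(state, candidate_actions, value_estimate, budget=32):
    # Limited, explicit optimisation over a small candidate set: search as one
    # component of inference, not global argmax over everything.
    sampled = random.sample(candidate_actions, min(budget, len(candidate_actions)))
    return max(sampled, key=lambda a: value_estimate(state, a))

def hybrid_inference(state, cached_policy, candidate_actions, value_estimate,
                     is_novel_or_high_stakes):
    # A "mesa-optimiser" in the sense above: mostly heuristics, with search
    # invoked only where the heuristics are expected to be unreliable.
    if is_novel_or_high_stakes(state):
        return local_search(state, candidate_actions, value_estimate)
    return heuristic_action(state, cached_policy)
```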
Why do we want theorems for AI Safety research? Is it a misguided reach for elegance and mathematical beauty? A refusal to confront the inherently messy and complicated nature of the systems? I'll argue not.
When dealing with powerful AI systems, we want arguments that they are existentially safe which satisfy the following desiderata:
I still want to work on technical AI safety eventually.
I feel like I'm on a path much further from being directly useful in 2025 than I felt I was in 2023.
And taking a detour to do a TCS PhD that isn't directly pertinent to AI safety (current plan) feels like not contributing.
The cope is that becoming a strong TCS researcher will make me better poised to contribute to the problem, but short timelines could make this path less viable.
[Though there's nothing saying I can't try to work on AI on the side even if it isn't the focus of my PhD.]
I've come around to the view that global optimisation for a non-trivial objective function in the real world is grossly intractable, so mechanistic utility maximisers are not actually permitted by the laws of physics[1][2].
My remaining uncertainty around expected utility maximisers as a descriptive model of consequentialist systems is whether the kind of hybrid optimisation (mostly learned heuristics, some local/task-specific planning/search) that real-world agents perform converges towards better approximating...
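A back-of-envelope version of the intractability claim, using purely illustrative numbers (the branching factor $b$ and horizon $T$ are chosen for the sketch, not derived from anything):

```latex
% Illustrative numbers only.
\[
  b = 10, \quad T = 100
  \;\Longrightarrow\;
  \#\{\text{length-}T\text{ trajectories}\} = b^{T} = 10^{100}.
\]
```

Even an (assumed) computer evaluating $10^{26}$ trajectories per second for the age of the universe (roughly $4 \times 10^{17}$ seconds) gets through about $4 \times 10^{43}$ of them, a vanishing fraction of $10^{100}$; that's the sense in which I mean "not permitted by the laws of physics".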
Immigration is such a tight constraint for me.
My next career steps after I'm done with my TCS Masters are primarily bottlenecked by "what allows me to remain in the UK" and then "keeps me on track to contribute to technical AI safety research".
What I would like to do for the next 1 - 2 years ("independent research"/ "further upskilling to get into a top ML PhD program") is not all that viable a path given my visa constraints.
Above all, I want to avoid wasting N more years by taking a detour through software engineering again just to get visa sponsorship.
[...
I once claimed that I thought building a comprehensive inside view on technical AI safety was not valuable, and I should spend more time grinding maths/ML/CS to start more directly contributing.
I no longer endorse that view. I've come around to the position that:
I don't like the term "agent foundations" to describe the kind of research I am most interested in, because:
Occasionally I see a well received post that I think is just fundamentally flawed, but I refrain from criticising it because I don't want to get downvoted to hell. 😅
This is a failure mode of LessWrong.
I'm merely rationally responding to karma incentives. 😌
Huh? What else are you planning to spend your karma on?
Karma is the privilege to say controversial or stupid things without getting banned. Heck, most of them will get upvoted anyway. Perhaps the true lesson here is to abandon the scarcity mindset.
Contrary to many LWers, I think GPT-3 was an amazing development for AI existential safety.
The foundation models paradigm is not only inherently safer than bespoke RL on physics; the complexity and fragility of value problems are also basically solved for free.
Language is a natural interface for humans, and it seems feasible to specify a robust constitution in natural language?
Constitutional AI seems plausibly feasible, and like it might basically just work?
That said, I want more ambitious mechanistic interpretability of LLMs, and to solve ELK for ti...
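For concreteness, my rough mental model of the constitutional critique-and-revise loop looks something like the sketch below. This is my own gloss, not Anthropic's actual pipeline; `generate` is a placeholder for whatever LLM call you have, and the two principles are made-up examples:

```python
# My gloss on a constitutional critique-and-revise loop, not a real pipeline.

CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and non-deceptive.",
]

def generate(prompt: str) -> str:
    # Placeholder for an instruction-following LLM call.
    raise NotImplementedError("plug in a model call here")

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            "Critique the response against the principle."
        )
        response = generate(
            f"Original response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```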
My best post was a dunk on MIRI[1], and now I've written up another point of disagreement/challenge to the Yudkowsky view.
There's a part of me that questions the opportunity cost of spending hours expressing takes of mine that are only valuable because they disagree with a MIRI position in some relevant respect. I could have spent those hours studying game theory or optimisation.
I feel like the post isn't necessarily raising the likelihood of AI existential safety?
I think those are questions I should ask more often before starting on a new LessWrong post; "how...
I'd like to present a useful formalism for describing when a set[1] is "self-similar".
Given arbitrary sets , an "equivalence-isomorphism" is a tuple , such that:
Where:
For a given equivalence relation ...
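Purely as an illustration of the kind of object I have in mind (this is a guess at one natural shape, not necessarily the exact definition above):

```latex
% A guess at one natural shape for the definition; not necessarily the original.
Let $X$ and $Y$ be sets equipped with equivalence relations $\sim_X$ and $\sim_Y$.
An \emph{equivalence-isomorphism} is a tuple $(f, \sim_X, \sim_Y)$, where
$f : X \to Y$ is a bijection such that
\[
  x \sim_X x' \iff f(x) \sim_Y f(x') \quad \text{for all } x, x' \in X.
\]
$X$ is then \emph{self-similar} (with respect to $\sim_X$) if there is an
equivalence-isomorphism from $X$ to a proper subset of $X$.
```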
A reason I mood affiliate with shard theory so much is that...
I'll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I'll later listen to a post/podcast/talk by Quintin Pope/Alex Turner, or someone else trying to distill shard theory, and see the exact same contention I was trying to present expressed more eloquently and with more justification.
One example: I had independently concluded that "finding an objective function that was existentially safe when optimis...
Still thinking about consequentialism and optimisation. I've argued that global optimisation for an objective function is so computationally intractable as to be prohibited by the laws of physics of our universe. Yet it's clearly the case that e.g. evolution is globally optimising for inclusive genetic fitness (or perhaps patterns that more successfully propagate themselves if you're taking a broader view). I think examining why evolution is able to successfully globally optimise for its objective function wou...
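A toy version of the contrast (invented fitness function and parameters): evolution-style optimisation never evaluates a global argmax; it just runs massively parallel variation plus differential replication for a very long time, which is part of why I think examining it is informative.

```python
import random

def fitness(genome):
    # Stand-in for inclusive genetic fitness (toy objective, invented).
    return -sum((g - 0.5) ** 2 for g in genome)

def evolve(pop_size=1000, genome_len=16, generations=200, mutation=0.05):
    population = [[random.random() for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Differential replication: fitter genomes leave more offspring.
        # Nothing here enumerates the search space or computes an argmax over
        # it; "global" optimisation emerges from parallel selection pressure.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        population = [[g + random.gauss(0, mutation)
                       for g in random.choice(parents)]
                      for _ in range(pop_size)]
    return max(population, key=fitness)
```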
"Is intelligence NP hard?" is a very important question with too little engagement from the LW/AI safety community. NP hardness:
I'd operationalise "is intelligence NP hard?" as:
...Does there exist some subset of computational problems underlying core cognitive tasks that have NP hard [expected] (time) complexity?
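Stated a bit more formally (my own phrasing of the operationalisation, not a standard definition):

```latex
% My own restatement of the operationalisation above.
\[
  \exists\, T \in \mathcal{C} \;(\text{core cognitive tasks}) :\quad
  \forall L \in \mathrm{NP}, \;\; L \le_p T ,
\]
```

with the "[expected]" qualifier gesturing at average-case (time) complexity over $T$'s natural input distribution, rather than worst-case complexity alone.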