DragonGod's Shortform

I turned 25 today.

8Throwaway2367
Happy Birthday!
5Yitz
Happy 25 years of existence! Here’s to countless more :)

[Originally written for Twitter]

 

Many AI risk failure modes imagine strong coherence/goal directedness (e.g. [expected] utility maximisers).

Such strong coherence is not exhibited by humans, seems unlikely to emerge from deep learning, and may be "anti-natural" to general intelligence in our universe.

I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error.

I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems.

 

[I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD.

HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.]

The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that immutable terminal goals is just a wrong frame fo... (read more)

2Vladimir_Nesov
I think this is broadly a right mindset for starting to think about AGIs humans are likely to build (even if I disagree in detail), given the abject failure at doing alignment responsibly. But HRAD-style AGIs are still coming eventually, and even if the AGIs that humans build are directly aligned (don't kill everyone themselves), they'll eventually build those, and then the sorts of failure modes characteristic for them are going to become relevant. The MIRI style of alignment work might even indirectly save humanity from blunders by first generations of aligned AGIs, promoting this line of research to their attention and thus providing timely education about the kinds of danger of misalignment that more powerful agentic designs might manifest.
6DragonGod
Why do you think HRAD-style AGIs are coming eventually?
2Vladimir_Nesov
If you are an AGI, take misalignment concerns seriously, and don't want the next generations of AGIs to merely have some chance of getting to care about your well-being by at least a very tiny fraction, you stop building AGIs with new designs or self-experimenting based on vague selection principles. But you still want strong optimization to make use of all the galaxies in the Hubble volume, before they get out of reach. So this time, you do it right.
6DragonGod
I'm not actually convinced that strong coherence as envisaged in HRAD is a natural form of general intelligence in our universe.
2DragonGod
Let me know if you want this to be turned into a top level post.
4TurnTrout
Seems useful. I think there are a set of important intuitions you're gesturing at here around naturality, some of which I may share. I have some take (which may or may not be related) like  but this doesn't mean I instantly gain space-time-additive preferences about dogs and diamonds such that I use one utility function in all contexts, such that the utility function is furthermore over universe-histories (funny how I seem to care across Tegmark 4?).
2DragonGod
From the post:  Evolutionarily convergent terminal values are something that's underrated, I think.

Confusions About Optimisation and Agency

Something I'm still not clear how to think about is effective agents in the real world.

I think viewing idealised agency as an actor that evaluates argmax wrt (the expected value of) a simple utility function over agent states is just wrong.

Evaluating argmax is very computationally expensive, so most agents most of the time will not be directly optimising over their actions but instead executing learned heuristics that historically correlated with better performance according to the metric the agent is selected for (e.g. reward).

That is, even if an agent somehow fully internalised the selection metric, directly optimising it over all its actions is just computationally intractable in "rich" environments (complex/high-dimensional, continuous, partially observable/imperfect information, stochastic, large state/action spaces, etc.). So a system inner aligned to the selection metric would still perform most of its cognition in a mostly amortised manner, provided the system is subject to bounded compute constraints.
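To make the amortised/direct distinction concrete, here is a minimal sketch (hypothetical names and numbers, not from the original take): direct optimisation scores every candidate plan before each decision, while an amortised policy just executes a cheap learned mapping from states to actions.

```python
# Minimal sketch (hypothetical example): direct optimisation over action
# sequences vs. amortised inference via a learned/selected heuristic policy.
import itertools

ACTIONS = range(10)   # |A| = 10 actions available per step
HORIZON = 6           # even this tiny horizon yields 10^6 candidate plans

def expected_utility(plan):
    # stand-in for an expensive evaluation of a plan's expected utility
    return sum(a * 0.9 ** t for t, a in enumerate(plan))

def argmax_plan():
    # direct optimisation: exhaustively score every |A|**HORIZON plan
    return max(itertools.product(ACTIONS, repeat=HORIZON), key=expected_utility)

def heuristic_policy(state: int) -> int:
    # amortised inference: a cheap learned mapping from state to action,
    # with no per-decision search over trajectories
    return state % len(ACTIONS)
```

The same exponential blow-up that makes argmax_plan infeasible at realistic horizons is what pushes bounded agents towards the heuristic route.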

 

Furthermore, in the real world learning agents don't generally become inner aligned to the selection metric, but ... (read more)

Adversarial robustness is the wrong frame for alignment.

Robustness to adversarial optimisation is very difficult[1].

Cybersecurity requires adversarial robustness; intent alignment does not.

There's no malicious ghost trying to exploit weaknesses in our alignment techniques.

This is probably my most heretical (and for good reason) alignment take.

It's something dangerous to be wrong about.

I think the only way such a malicious ghost could arise is via mesa-optimisers, but I expect such malicious daemons to be unlikely a priori.

That is, you'll need a training environment that exerts significant selection pressure for maliciousness/adversarialness for the property to arise.

Most capable models don't have malicious daemons[2], so it won't emerge by default.

[1]: Especially if the adversary is a more powerful optimiser than you.

[2]: Citation needed.
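One way to make this frame explicit (my gloss, with assumed notation, not from the original take): adversarial robustness is a worst-case requirement against an optimiser searching for failures, whereas intent alignment in the absence of a malicious daemon is closer to an average-case requirement over the deployment distribution:

$$\min_\theta \, \max_{x \in \mathcal{X}} L(\theta, x) \ \text{(adversarial robustness)} \qquad \text{vs.} \qquad \min_\theta \, \mathbb{E}_{x \sim \mathcal{D}}[L(\theta, x)] \ \text{(alignment without an adversary)}$$

The left-hand requirement gets harder as the adversary's search gets stronger; the right-hand one does not, which is the sense in which the absence of a malicious ghost matters.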

4Vladimir_Nesov
There seems to be a lot of giant cheesecake fallacy in AI risk. Only things leading up to AGI threshold are relevant to the AI risk faced by humans, the rest is AGIs' problem. Given current capability of ChatGPT with imminent potential to get it a day-long context window, there is nothing left but tuning, including self-tuning, to reach AGI threshold. There is no need to change anything at all in its architecture or basic training setup to become AGI, only that tuning to get it over a sanity/agency threshold of productive autonomous activity, and iterative batch retraining on new self-written data/reports/research. It could be done much better in other ways, but it's no longer necessary to change anything to get there. So AI risk is now exclusively about fine tuning of LLMs, anything else is giant cheesecake fallacy, something possible in principle but not relevant now and thus probably ever, as something humanity can influence. Though that's still everything but the kitchen sink, fine tuning could make use of any observations about alignment, decision theory, and so on, possibly just as informal arguments being fed at key points to LLMs, cumulatively to decisive effect.

What I'm currently working on:

 

The sequence has an estimated length between 30K and 60K words (it's hard to estimate because I'm not even done preparing the outlines yet).

I'm at ~8.7K words written currently (across 3 posts [the screenshots are my outlines]) and guess I'm only 5% of the way through the entire sequence.

Beware the planning fallacy though, so the sequence could easily grow significantly longer than I currently expect.


I work full time until the end of July and will be starting a Masters in September, so here's to hoping I can get the bulk of the piece completed when I have more time to focus on it in August.

Currently, I try for some significant writing [a few thousand words] on weekends and fill in my outlines on weekdays. I try to add a bit more each day, just continuously working on it, until it spontaneously manifests. I also use weekdays to think about the sequence.
 

So, the twelve posts I've currently planned could very well have ballooned in scope by the time I can work on it full time.

Weekends will also be when I have the time for extensive research/reading for some of the posts.

Most of the catastrophic risk from AI still lies in superhuman agentic systems.

Current frontier systems are not that (and IMO not poised to become that in the very immediate future).

I think AI risk advocates should be clear that they're not saying GPT-5/Claude Next is an existential threat to humanity.

[Unless they actually believe that. But if they don't, I'm a bit concerned that their message is being rounded up to that, and when such systems don't reveal themselves to be catastrophically dangerous, it might erode their credibility.]

I find noticing surprise more valuable than noticing confusion.

Hindsight bias and post hoc rationalisations make it easy for us to gloss over events that were a priori unexpected.

5Raemon
My take on this is that noticing surprise is easier than noticing confusion, and surprise often correlates with confusion, so a useful thing to do is have a habit of: 1. practice noticing surprise; 2. when you notice surprise, check if you have a reason to be confused. (Where surprise is "something unexpected happened" and confusion is "something is happening that I can't explain, or my explanation of it doesn't make sense".)

Some Nuance on Learned Optimisation in the Real World

I think mesa-optimisers should not be thought of as learned optimisers, but as systems that employ optimisation/search as part of their inference process.

The simplest case is that pure optimisation during inference is computationally intractable in rich environments (e.g. the real world), so systems (e.g. humans) operating in the real world do not perform inference solely by directly optimising over outputs.

Rather, optimisation is sometimes employed as one part of their inference strategy. That is, systems o... (read more)

4Quintin Pope
One can always reparameterize any given input / output mapping as a search for the minima of some internal energy function, without changing the mapping at all.  The main criteria to think about is whether an agent will use creative, original strategies to maximize inner objectives, strategies which are more easily predicted by assuming the agent is "deliberately" looking for extremes of the inner objectives, as opposed to basing such predictions on the agent's past actions, e.g., "gather more computational resources so I can find a high maximum".
2DragonGod
Given that the optimisation performed by intelligent systems in the real world is local/task specific, I'm wondering if it would be more sensible to model the learned model as containing (multiple) mesa-optimisers rather than being a single mesa-optimiser.   My main reservation is that I think this may promote a different kind of confused thinking; it's not the case that the learned optimisers are constantly competing for influence and their aggregate behaviour determines the overall behaviour of the learned algorithm. Rather the learned algorithm employs optimisation towards different local/task specific objectives.
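A toy illustration of the reparameterisation point above (hypothetical code, not from the thread): any fixed mapping can be rewritten as the argmin of a contrived internal energy function without changing its input/output behaviour at all, which is why behaviour alone can't settle whether a system is "really" searching.

```python
# Toy illustration (hypothetical): a fixed mapping f rewritten as energy
# minimisation. The two formulations have identical input/output behaviour.

def f(x: int) -> int:
    # some arbitrary fixed mapping into {0, ..., 6}
    return (2 * x + 1) % 7

def energy(x: int, y: int) -> float:
    # contrived "internal objective" whose minimum over y recovers f(x)
    return 0.0 if y == f(x) else 1.0

def f_as_search(x: int) -> int:
    # "search" formulation: pick the output that minimises the energy
    return min(range(7), key=lambda y: energy(x, y))

assert all(f(x) == f_as_search(x) for x in range(20))
```

The interesting question, as noted above, is whether the search framing buys predictive power (e.g. anticipating novel resource-gathering strategies), not whether it is formally available.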

The Case for Theorems

Why do we want theorems for AI Safety research? Is it a misguided reach for elegance and mathematical beauty? A refusal to confront the inherently messy and complicated nature of the systems? I'll argue not.

 


 

Desiderata for Existential Safety

When dealing with powerful AI systems, we want arguments that they are existentially safe which satisfy the following desiderata:

  1. Robust to scale
  2. Generalise far out of distribution to test/de
... (read more)
7DragonGod
Let me know if you think this should be turned into a top level post.
3Noosphere89
I would definitely like for this to be turned into a top level post, DragonGod.
2DragonGod
I published it as a top level post.

I still want to work on technical AI safety eventually.

I feel like I'm on a path much further from being directly useful in 2025 than the one I felt I was on in 2023.

And taking a detour to do a TCS PhD that isn't directly pertinent to AI safety (current plan) feels like not contributing.

Cope is that becoming a strong TCS researcher will make me better poised to contribute to the problem, but short timelines could make this path less viable.

[Though there's nothing saying I can't try to work on AI on the side even if it isn't the focus of my PhD.]

6DragonGod
There is not an insignificant sense of guilt/of betraying myself from 2023 and my ambitions from before. And I don't want to just end up doing irrelevant TCS research that only a few researchers in a niche field will ever care about. It's not high impact research. And it's mostly just settling. I get the sense that I enjoy theoretical research, I don't currently feel poised to contribute to the AI safety problem, I seem to have an unusually good (at least it appears so to my limited understanding) opportunity to pursue a boring TCS PhD in some niche field that few people care about. I don't think I'll be miserable pursuing the boring TCS PhD or not enjoy it, or anything of the sort. It's just not directly contributing to what I wanted to contribute to. It's somewhat sad and it's undignified (but it's less undignified than the path I thought I was on at various points in the last 15 months).

Mechanistic Utility Maximisers are Infeasible

I've come around to the view that global optimisation for a non-trivial objective function in the real world is grossly intractable, so mechanistic utility maximisers are not actually permitted by the laws of physics[1][2].
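A rough back-of-envelope for the intractability claim (assumed numbers, purely illustrative): exhaustively optimising a length-$T$ plan over an action set $A$ requires evaluating $|A|^T$ candidate trajectories,

$$|A| = 10,\; T = 100 \;\Rightarrow\; |A|^T = 10^{100},$$

which is far more evaluations than there are atoms in the observable universe (roughly $10^{80}$), before even accounting for the cost of each evaluation.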

My remaining uncertainty around expected utility maximisers as a descriptive model of consequentialist systems is whether the kind of hybrid optimisation (mostly learned heuristics, some local/task specific planning/search) that real world agents perform converges towards better approximating... (read more)

Immigration is such a tight constraint for me.

My next career steps after I'm done with my TCS Masters are primarily bottlenecked by "what allows me to remain in the UK" and then "keeps me on track to contribute to technical AI safety research".

What I would like to do for the next 1 - 2 years ("independent research"/ "further upskilling to get into a top ML PhD program") is not all that viable a path given my visa constraints.

Above all, I want to avoid wasting N more years by taking a detour through software engineering again so I can get visa sponsorship.

[... (read more)

I once claimed that I thought building a comprehensive inside view on technical AI safety was not valuable, and I should spend more time grinding maths/ML/CS to start more directly contributing.

 

I no longer endorse that view. I've come around to the position that:

  • Many alignment researchers are just fundamentally confused about important questions/topics, and are trapped in inadequate ontologies
  • Considerable conceptual engineering is needed to make progress
  • Large bodies of extant technical AI safety work are just inapplicable to making advanced ML systems
... (read more)

"Foundations of Intelligent Systems" not "Agent Foundations"

 

I don't like the term "agent foundations" to describe the kind of research I am most interested in, because:

  1. I am unconvinced that "agent" is the "true name" of the artifacts that would determine the shape of humanity's long term future
    1. The most powerful artificially intelligent systems today do not cleanly fit into the agent ontology,
    2. Future foundation models are unlikely to cleanly conform to the agent archetype
    3. Simulators/foundation models may be the first (and potentially final) form of
... (read more)
3DragonGod
As always let me know if you want me to publish this as a top level post.

Writing LW questions is much easier than writing full posts.

7Dagon
Shhh! This stops working if everyone knows the trick.
5DragonGod
Why would it stop working?
7Dagon
I was at least half joking, but there is some risk that if questions become more prevalent (even if they’re medium-effort good question posts), they will stop getting the engagement that is now available, and it will take more effort to write good ones.
7DragonGod
Yeah, that would suck. I should keep it in mind and ration my frequency/rate of writing up question posts.

Occasionally I see a well received post that I think is just fundamentally flawed, but I refrain from criticising it because I don't want to get downvoted to hell. 😅

This is a failure mode of LessWrong.

I'm merely rationally responding to karma incentives. 😌

Huh? What else are you planning to spend your karma on?

Karma is the privilege to say controversial or stupid things without getting banned. Heck, most of them will get upvoted anyway. Perhaps the true lesson here is to abandon the scarcity mindset.

9Dagon
[OK, 2 comments on a short shortform is probably excessive.  sorry. ] No, this is a failure mode of your posting strategy.  You should WANT some posts to get downvoted to hell, in order to better understand the limits of this group's ability to rationally discuss some topics.  Think of cases where you are the local-context-contrarian as bounds on the level of credence to give to the site. Stay alert.  Trust no one.  Keep your laser handy.
5Vladimir_Nesov
Beliefs/impressions are less useful in communication (for echo chamber reasons) than for reasoning and other decision making; they are importantly personal things. Mostly not being worth communicating doesn't mean they are not worth maintaining in good shape. They do influence which arguments that stand on their own are worth communicating, but there aren't always arguments that allow communicating relevant beliefs themselves.
4ChristianKl
I don't think the fact that a post is well-received is, on its own, a reason that criticism gets downvoted to hell. Usually, quality criticism can get upvoted even if a post is well received.
2DragonGod
The cases I have in mind are where I have substantial disagreements with the underlying paradigm/worldview/framework/premises on which the post rests, to the extent that I think the post is basically completely worthless. For example, Kokotajlo's "What 2026 Looks Like"; I think elaborate concrete predictions of the future are not only nigh useless but probably net negative for opportunity cost/diverted resources (including action) reasons. My underlying arguments are extensive, but are not really about the post itself so much as the very practice/exercise of writing elaborate future vignettes. And I don't have the energy/motivation to draft up said substantial disagreements into a full-fledged essay.
2ChristianKl
If you call a post a prediction that's not a prediction, then you are going to be downvoted. Nothing wrong with that. He stated his goal as: "The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me." That's scenario planning, even if he only provides one scenario. He doesn't provide any probabilities in the post, so it's not a prediction. Scenario planning is different from how we at LessWrong usually approach the future (with predictions), but scenario planning matters for how a lot of powerful institutions in the world orient themselves toward the future. Having a scenario like that allows someone at the Department of Defense to say: let's do a wargame for this scenario. You might say "It's bad that the Department of Defense uses wargames to think about the future", but in the world we live in, they do.
1DragonGod
I additionally think the scenario is very unlikely. So unlikely that wargaming for that scenario is only useful insofar as your strategy is general enough to apply to many other scenarios. Wargaming for that scenario in particular is privileging a hypothesis without warrant. The scenario is very unlikely on priors, and its 2022 predictions didn't quite bear out.
2ChristianKl
Part of the advantage of being specific about 2022 and 2023 is that it allows people to update on it toward taking the whole scenario more or less seriously. 
1DragonGod
I didn't need to see 2022 to know that the scenario would not be an accurate description of reality. On priors that was just very unlikely.
2ChristianKl
Having scenarios that are unlikely based on priors means that you can update more if they turn out to go that way than scenarios that you deemed to be likely to happen anyway. 
3Dagon
I don't think this is necessarily true.  I disagree openly on a number of topics, and generally get slightly upvoted, or at least only downvoted a little.  In fact, I WANT to have some controversial comments (with > 10 votes and -2 < karma < 10), or I worry that I'm censoring myself. The times I've been downvoted to hell, I've been able to identify fairly specific reasons, usually not just criticizing, but criticizing in an unhelpful way.
3Slider
Community opinion is not exactly meaningful if opinions are not aggregated into it. Say no to ascending to simulacra heaven.
2RobertM
I'd be surprised if this happened frequently for good criticisms.

Contrary to many LWers, I think GPT-3 was an amazing development for AI existential safety. 

The foundation models paradigm is not only inherently safer than bespoke RL on physics; the complexity and fragility of value problems are also basically solved for free.

Language is a natural interface for humans, and it seems feasible to specify a robust constitution in natural language? 

Constitutional AI seems plausibly feasible, and like it might basically just work?

That said I want more ambitious mechanistic interpretability of LLMs, and to solve ELK for ti... (read more)

My best post was a dunk on MIRI[1], and now I've written up another point of disagreement/challenge to the Yudkowsky view.

There's a part of me that questions the opportunity cost of spending hours expressing takes of mine that are only valuable because they disagree in a relevant aspect with a MIRI position? I could have spent those hours studying game theory or optimisation.

I feel like the post isn't necessarily raising the likelihood of AI existential safety?

I think those are questions I should ask more often before starting on a new LessWrong post; "how... (read more)

21a3orn
Generally, I don't think it's good to gate "is subquestion X, related to great cause Y, true?" with questions about "does addressing this subquestion contribute to great cause Y?" Like I don't think it's good in general, and don't think it's good here. I can't justify this in a paragraph, but I'm basing this mostly off "Huh, that's funny" being far more likely to lead to insight than "I must have insight!" Which means it's a better way of contributing to great causes, generally. (And honestly, at another level entirely, I think that saying true things, which break up uniform blocks of opinion on LW, is good for the health of the LW community.) Edit: That being said, if the alternative to following your curiosity on one thing is like, super high value, ofc it's better. But meh, I mean I'm glad that post is out there. It's a good central source for a particular branch of criticism, and I think it helped me understand the world more.

A Sketch of a Formalisation of Self Similarity

 

Introduction

I'd like to present a useful formalism for describing when a set[1] is "self-similar".

 

Isomorphism Under Equivalence Relations

Given arbitrary sets $X$ and $Y$, an "equivalence-isomorphism" is a tuple $(f, g, \sim)$, such that:

$$\forall a \in X\colon\; a \sim f(a)$$

Where:

  • $f$ is a bijection from $X$ to $Y$
  • $g$ is the inverse of $f$
  • $\sim$ is an equivalence relation on the union of $X$ and $Y$.

For a given equivalence relation... (read more)
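A toy sanity check of the conditions as reconstructed above (hypothetical finite sets, not from the post): take X = {0, 2, 4}, Y = {1, 3, 5}, f(x) = x + 1, and let ~ group elements into the classes {0, 1}, {2, 3}, {4, 5}.

```python
# Toy check (hypothetical example) of the equivalence-isomorphism conditions:
# f is a bijection X -> Y, g is its inverse, and every element is equivalent
# to its image under f.

X, Y = {0, 2, 4}, {1, 3, 5}

def f(x): return x + 1          # candidate bijection X -> Y
def g(y): return y - 1          # inverse of f

def equivalent(a, b):
    # equivalence relation on X ∪ Y with classes {0, 1}, {2, 3}, {4, 5}
    return a // 2 == b // 2

assert {f(x) for x in X} == Y               # f is a bijection onto Y
assert all(g(f(x)) == x for x in X)         # g inverts f
assert all(equivalent(x, f(x)) for x in X)  # a ~ f(a) for every a in X
```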

3Vladimir_Nesov
For a given ∼R, let R be its set of equivalence classes. This induces maps x:X→R and y:Y→R. The isomorphism f:X→Y you discuss has the property y⋅f=x. Maps like that are the morphisms in the slice category over R, and these isomorphisms are the isomorphisms in that category. So what happened is that you've given X and Y the structure of a bundle, and the isomorphisms respect that structure.

A reason I mood affiliate with shard theory so much is that like...

I'll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I'll later listen to a post/podcast/talk by Quintin Pope/Alex Turner, or someone else trying to distill shard theory and then see the exact same contention I was trying to present expressed more eloquently/with more justification.

One example is that like I had independently concluded that "finding an objective function that was existentially safe when optimis... (read more)

4Chris_Leong
My main critique of shard theory is that I expect one of the shards to end up dominating the others as the most likely outcome.

Even though that doesn't happen in biological intelligences?

Consequentialism is in the Stars not Ourselves?

Still thinking about consequentialism and optimisation. I've argued that global optimisation for an objective function is so computationally intractable as to be prohibited by the laws of physics of our universe. Yet it's clearly the case that e.g. evolution is globally optimising for inclusive genetic fitness (or perhaps patterns that more successfully propagate themselves if you're taking a broader view). I think examining why evolution is able to successfully globally optimise for its objective function wou... (read more)

Paul Christiano's AI Alignment Landscape:

"Is intelligence NP hard?" is a very important question with too little engagement from the LW/AI safety community. NP hardness:

  1. Bounds attainable levels of intelligence (just how superhuman is superintelligence?)
  2. Bounds physically and economically feasible takeoff speeds (i.e. exponentially growing resource investment is required for linear improvements in intelligence)

I'd operationalise "is intelligence NP-hard?" as:

Does there exist some subset of computational problems underlying core cognitive tasks that have NP-hard [expected] (time) complexity?
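A minimal illustration of the takeoff-speed point above (hypothetical brute-force model, not a claim about actual cognition): if some core cognitive subtask reduced to search over n binary choices with no algorithm better than brute force, each unit added to n would double the required compute, i.e. exponentially growing resources for linear gains in problem size.

```python
# Hypothetical illustration: brute-force search over n binary choices.
# Each +1 to n doubles the compute needed, so linear gains in the size of
# tractable problems demand exponentially growing resources.

def brute_force_evaluations(n_choices: int) -> int:
    # number of candidate solutions a naive solver must examine
    return 2 ** n_choices

for n in (20, 40, 60, 80):
    print(n, brute_force_evaluations(n))  # 2**20 ≈ 1e6, ..., 2**80 ≈ 1.2e24
```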

... (read more)