Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by DragonGod. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
200 comments

I turned 25 today.

Happy Birthday!
Happy 25 years of existence! Here’s to countless more :)

[Originally written for Twitter]


Many AI risk failure modes imagine strong coherence/goal directedness (e.g. [expected] utility maximisers).

Such strong coherence is not present in humans, seems unlikely to emerge from deep learning, and may be "anti-natural" to general intelligence in our universe.

I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error.

I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems.


[I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD.

HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.]

The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that immutable terminal goals is just a wrong frame fo... (read more)

I think this is broadly a right mindset for starting to think about AGIs humans are likely to build (even if I disagree in detail), given the abject failure at doing alignment responsibly. But HRAD-style AGIs are still coming eventually, and even if the AGIs that humans build are directly aligned (don't kill everyone themselves), they'll eventually build those, and then the sorts of failure modes characteristic for them are going to become relevant. The MIRI style of alignment work might even indirectly save humanity from blunders by first generations of aligned AGIs, promoting this line of research to their attention and thus providing timely education about the kinds of danger of misalignment that more powerful agentic designs might manifest.
Why do you think HRAD-style AGIs are coming eventually?
If you are an AGI, take misalignment concerns seriously, and don't want the next generations of AGIs to merely have some chance of getting to care about your well-being by at least a very tiny fraction, you stop building AGIs with new designs or self-experimenting based on vague selection principles. But you still want strong optimization to make use of all the galaxies in the Hubble volume, before they get out of reach. So this time, you do it right.
I'm not actually convinced that strong coherence as envisaged in HRAD is a natural form of general intelligences in our universe.
Let me know if you want this to be turned into a top level post.
Seems useful. I think there are a set of important intuitions you're gesturing at here around naturality, some of which I may share. I have some take (which may or may not be related) like [...] but this doesn't mean I instantly gain space-time-additive preferences about dogs and diamonds such that I use one utility function in all contexts, such that the utility function is furthermore over universe-histories (funny how I seem to care across Tegmark 4?).
From the post: evolutionarily convergent terminal values are something that's underrated, I think.

Confusions About Optimisation and Agency

Something I'm still not clear how to think about is effective agents in the real world.

I think viewing idealised agency as an actor that evaluates argmax wrt (the expected value of) a simple utility function over agent states is just wrong.

Evaluating argmax is very computationally expensive, so most agents most of the time will not be directly optimising over their actions but instead executing learned heuristics that historically correlated with better performance according to the metric the agent is selected for (e.g. reward).

That is, even if an agent somehow fully internalised the selection metric, directly optimising it over all its actions is just computationally intractable in "rich" (complex/high dimensional problem domains, continuous, partially observable/imperfect information, stochastic, large state/action spaces, etc.) environments. So a system inner aligned to the selection metric would still perform most of its cognition in a mostly amortised manner, provided the system is subject to bounded compute constraints.
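To make the tractability gap concrete, here's a toy sketch (all names illustrative, not from the original post): direct argmax over action sequences scales exponentially with planning horizon, while an amortised policy answers with a single cached lookup.

```python
from itertools import product

ACTIONS = [0, 1]

def exhaustive_argmax(horizon, value):
    # Direct optimisation: score every one of the 2**horizon action
    # sequences and return the best one. Cost explodes with horizon.
    return max(product(ACTIONS, repeat=horizon), key=value)

def amortised_policy(state, policy_table):
    # Amortised inference: a single lookup into learned heuristics,
    # with cost independent of the planning horizon.
    return policy_table[state]

best = exhaustive_argmax(3, sum)          # searches all 8 sequences
act = amortised_policy("s0", {"s0": 1})   # one cached decision
```

Even a 50-step binary-action horizon already gives 2^50 candidate sequences, which is why a compute-bounded agent must lean on amortised heuristics for most of its decisions.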


Furthermore, in the real world learning agents don't generally become inner aligned to the selection metric, but ... (read more)

Adversarial robustness is the wrong frame for alignment.

Robustness to adversarial optimisation is very difficult[1].

Cybersecurity requires adversarial robustness, intent alignment does not.

There's no malicious ghost trying to exploit weaknesses in our alignment techniques.

This is probably my most heretical (and for good reason) alignment take.

It's something dangerous to be wrong about.

I think the only way such a malicious ghost could arise is via mesa-optimisers, but I expect such malicious daemons to be unlikely a priori.

That is, you'll need a training environment that exerts significant selection pressure for maliciousness/adversarialness for the property to arise.

Most capable models don't have malicious daemons[2], so it won't emerge by default.

[1]: Especially if the adversary is a more powerful optimiser than you.

[2]: Citation needed.

There seems to be a lot of giant cheesecake fallacy in AI risk. Only things leading up to the AGI threshold are relevant to the AI risk faced by humans; the rest is AGIs' problem.

Given the current capability of ChatGPT, with imminent potential to get it a day-long context window, there is nothing left but tuning, including self-tuning, to reach the AGI threshold. There is no need to change anything at all in its architecture or basic training setup for it to become AGI, only that tuning to get it over a sanity/agency threshold of productive autonomous activity, and iterative batch retraining on new self-written data/reports/research. It could be done much better in other ways, but it's no longer necessary to change anything to get there.

So AI risk is now exclusively about fine tuning of LLMs; anything else is giant cheesecake fallacy, something possible in principle but not relevant now and thus probably ever, as something humanity can influence. Though that's still everything but the kitchen sink: fine tuning could make use of any observations about alignment, decision theory, and so on, possibly just as informal arguments being fed at key points to LLMs, cumulatively to decisive effect.

What I'm currently working on:


The sequence has an estimated length between 30K - 60K words (it's hard to estimate because I'm not even done preparing the outlines yet).

I'm at ~8.7K words written currently (across 3 posts [the screenshots are my outlines]) and guess I'm only 5% of the way through the entire sequence.

Beware the planning fallacy though, so the sequence could easily grow significantly longer than I currently expect.

I work full time until the end of July and will be starting a Masters in September, so here's to hoping I can get the bulk of the piece completed when I have more time to focus on it in August.

Currently, I try for some significant writing [a few thousand words] on weekends and fill in my outlines on weekdays. I try to add a bit more each day, just continuously working on it, until it spontaneously manifests. I also use weekdays to think about the sequence.

So, the twelve posts I've currently planned could very well have ballooned in scope by the time I can work on it full time.

Weekends will also be when I have the time for extensive research/reading for some of the posts.

Most of the catastrophic risk from AI still lies in superhuman agentic systems.

Current frontier systems are not that (and IMO not poised to become that in the very immediate future).

I think AI risk advocates should be clear that they're not saying GPT-5/Claude Next is an existential threat to humanity.

[Unless they actually believe that. But if they don't, I'm a bit concerned that their message is being rounded up to that, and when such systems don't reveal themselves to be catastrophically dangerous, it might erode their credibility.]

I find noticing surprise more valuable than noticing confusion.

Hindsight bias and post hoc rationalisations make it easy for us to gloss over events that were a priori unexpected.

My take on this is that noticing surprise is easier than noticing confusion, and surprise often correlates with confusion, so a useful habit is:

1. Practice noticing surprise.
2. When you notice surprise, check if you have a reason to be confused.

(Where surprise is "something unexpected happened" and confusion is "something is happening that I can't explain, or my explanation of it doesn't make sense".)

Some Nuance on Learned Optimisation in the Real World

I think mesa-optimisers should not be thought of as learned optimisers, but systems that employ optimisation/search as part of their inference process.

The simplest case is that pure optimisation during inference is computationally intractable in rich environments (e.g. the real world), so systems (e.g. humans) operating in the real world do not perform inference solely by directly optimising over outputs.

Rather optimisation is employed sometimes as one part of their inference strategy. That is systems o... (read more)

Quintin Pope (6mo):
One can always reparameterize any given input / output mapping as a search for the minima of some internal energy function, without changing the mapping at all.  The main criteria to think about is whether an agent will use creative, original strategies to maximize inner objectives, strategies which are more easily predicted by assuming the agent is "deliberately" looking for extremes of the inner objectives, as opposed to basing such predictions on the agent's past actions, e.g., "gather more computational resources so I can find a high maximum".
Given that the optimisation performed by intelligent systems in the real world is local/task specific, I'm wondering if it would be more sensible to model the learned model as containing (multiple) mesa-optimisers rather than being a single mesa-optimiser.

My main reservation is that I think this may promote a different kind of confused thinking; it's not the case that the learned optimisers are constantly competing for influence and their aggregate behaviour determines the overall behaviour of the learned algorithm. Rather, the learned algorithm employs optimisation towards different local/task specific objectives.

The Case for Theorems

Why do we want theorems for AI Safety research? Is it a misguided reach for elegance and mathematical beauty? A refusal to confront the inherently messy and complicated nature of the systems? I'll argue not.



Desiderata for Existential Safety

When dealing with powerful AI systems, we want arguments that they are existentially safe which satisfy the following desiderata:

  1. Robust to scale
  2. Generalise far out of distribution to test/de
... (read more)
Let me know if you think this should be turned into a top level post.
I would definitely like for this to be turned into a top level post, DragonGod.
I published it as a top level post.

Mechanistic Utility Maximisers are Infeasible

I've come around to the view that global optimisation for a non-trivial objective function in the real world is grossly intractable, so mechanistic utility maximisers are not actually permitted by the laws of physics[1][2].

My remaining uncertainty around expected utility maximisers as a descriptive model of consequentialist systems is whether the kind of hybrid optimisation (mostly learned heuristics, some local/task specific planning/search) that real world agents perform converges towards better approximating... (read more)

Writing LW questions is much easier than writing full posts.

Shhh! This stops working if everyone knows the trick.
Why would it stop working?
I was at least half joking, but there is some risk that if questions become more prevalent (even if they’re medium-effort good question posts), they will stop getting the engagement that is now available, and it will take more effort to write good ones.
Yeah, that would suck. I should keep it in mind and ration my frequency/rate of writing up question posts.

Occasionally I see a well received post that I think is just fundamentally flawed, but I refrain from criticising it because I don't want to get downvoted to hell. 😅

This is a failure mode of LessWrong.

I'm merely rationally responding to karma incentives. 😌

Huh? What else are you planning to spend your karma on?

Karma is the privilege to say controversial or stupid things without getting banned. Heck, most of them will get upvoted anyway. Perhaps the true lesson here is to abandon the scarcity mindset.

[OK, 2 comments on a short shortform is probably excessive.  sorry. ] No, this is a failure mode of your posting strategy.  You should WANT some posts to get downvoted to hell, in order to better understand the limits of this group's ability to rationally discuss some topics.  Think of cases where you are the local-context-contrarian as bounds on the level of credence to give to the site. Stay alert.  Trust no one.  Keep your laser handy.
Beliefs/impressions are less useful in communication (for echo chamber reasons) than for reasoning and other decision making; they are importantly personal things. Mostly not being worth communicating doesn't mean they are not worth maintaining in good shape. They do influence which arguments that stand on their own are worth communicating, but there aren't always arguments that allow communicating the relevant beliefs themselves.
I don't think the fact that a post is well-received is alone reason that criticism gets downvoted to hell. Usually, quality criticism can get upvoted even if a post is well received.
The cases I have in mind are where I have substantial disagreements with the underlying paradigm/worldview/framework/premises on which the post rests, to the extent that I think the post is basically completely worthless. For example Kokotajlo's "What 2026 Looks Like"; I think elaborate concrete predictions of the future are not only nigh useless but probably net negative for opportunity cost/diverted resources (including action) reasons. My underlying arguments are extensive, but are not really about the post itself, but the very practice/exercise of writing elaborate future vignettes. And I don't have the energy/motivation to draft up said substantial disagreements into a full fledged essay.
If you call a post a prediction that's not a prediction, then you are going to be downvoted. Nothing wrong with that. He stated his goal: "The goal is to write out a detailed future history ('trajectory') that is as realistic (to me) as I can currently manage, i.e. I'm not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me." That's scenario planning, even if he only provides one scenario. He doesn't provide any probabilities in the post, so it's not a prediction. Scenario planning is different from how we at LessWrong usually approach the future (with predictions), but scenario planning matters for how a lot of powerful institutions in the world orient themselves toward the future. Having a scenario like that allows someone at the Department of Defense to say: let's do a wargame for this scenario. You might say "it's bad that the Department of Defense uses wargames to think about the future", but in the world we live in, they do.
I additionally think the scenario is very unlikely. So unlikely that wargaming for that scenario is only useful insomuch as your strategy is general enough to apply to many other scenarios. Wargaming for that scenario in particular is privileging a hypothesis that hasn't warranted it. The scenario is very unlikely on priors and its 2022 predictions didn't quite bear out.
Part of the advantage of being specific about 2022 and 2023 is that it allows people to update on it toward taking the whole scenario more or less seriously. 
I didn't need to see 2022 to know that the scenario would not be an accurate description of reality. On priors that was just very unlikely.
Having scenarios that are unlikely based on priors means that you can update more if they turn out to go that way than scenarios that you deemed to be likely to happen anyway. 
I don't think this is necessarily true.  I disagree openly on a number of topics, and generally get slightly upvoted, or at least only downvoted a little.  In fact, I WANT to have some controversial comments (with > 10 votes and -2 < karma < 10), or I worry that I'm censoring myself. The times I've been downvoted to hell, I've been able to identify fairly specific reasons, usually not just criticizing, but criticizing in an unhelpful way.
Community opinion is not exactly meaningful if opinions are not aggregated into it. Say no to ascending to simulacra heaven.
I'd be surprised if this happened frequently for good criticisms.

Immigration is such a tight constraint for me.

My next career steps after I'm done with my TCS Masters are primarily bottlenecked by "what allows me to remain in the UK" and then "keeps me on track to contribute to technical AI safety research".

What I would like to do for the next 1 - 2 years ("independent research"/ "further upskilling to get into a top ML PhD program") is not all that viable a path given my visa constraints.

Above all, I want to avoid wasting N more years by taking a detour through software engineering again so I can get Visa sponsorship.

[... (read more)

I once claimed that I thought building a comprehensive inside view on technical AI safety was not valuable, and I should spend more time grinding maths/ML/CS to start more directly contributing.


I no longer endorse that view. I've come around to the position that:

  • Many alignment researchers are just fundamentally confused about important questions/topics, and are trapped in inadequate ontologies
  • Considerable conceptual engineering is needed to make progress
  • Large bodies of extant technical AI safety work are just inapplicable to making advanced ML systems
... (read more)

"Foundations of Intelligent Systems" not "Agent Foundations"


I don't like the term "agent foundations" to describe the kind of research I am most interested in, because:

  1. I am unconvinced that "agent" is the "true name" of the artifacts that would determine the shape of humanity's long term future
    1. The most powerful artificial intelligent systems today do not cleanly fit into the agent ontology, 
    2. Future foundation models are unlikely to cleanly conform to the agent archetype
    3. Simulators/foundation models may be the first (and potentially final) form of
... (read more)
As always let me know if you want me to publish this as a top level post.

Contrary to many LWers, I think GPT-3 was an amazing development for AI existential safety. 

The foundation models paradigm is not only inherently safer than bespoke RL on physics, the complexity and fragility of value problems are basically solved for free.

Language is a natural interface for humans, and it seems feasible to specify a robust constitution in natural language? 

Constitutional AI seems plausibly feasible, and like it might basically just work?

That said I want more ambitious mechanistic interpretability of LLMs, and to solve ELK for ti... (read more)

My best post was a dunk on MIRI[1], and now I've written up another point of disagreement/challenge to the Yudkowsky view.

There's a part of me that questions the opportunity cost of spending hours expressing takes of mine that are only valuable because they disagree in a relevant aspect with a MIRI position? I could have spent those hours studying game theory or optimisation.

I feel like the post isn't necessarily raising the likelihood of AI existential safety?

I think those are questions I should ask more often before starting on a new LessWrong post; "how... (read more)

Generally, I don't think it's good to gate "is subquestion X, related to great cause Y, true?" with questions about "does addressing this subquestion contribute to great cause Y?" Like I don't think it's good in general, and don't think it's good here. I can't justify this in a paragraph, but I'm basing this mostly off "Huh, that's funny" being far more likely to lead to insight than "I must have insight!", which means it's a better way of contributing to great causes, generally. (And honestly, at another level entirely, I think that saying true things, which break up uniform blocks of opinion on LW, is good for the health of the LW community.) Edit: That being said, if the alternative to following your curiosity on one thing is like, super high value, ofc it's better. But meh, I mean I'm glad that post is out there. It's a good central source for a particular branch of criticism, and I think it helped me understand the world more.

I'm finally in the 4-digit karma club! 🎉🎉🎉


(Yes, I like seeing number go up. Playing the LW karma game is more productive than playing Twitter/Reddit or whatever.

That said, LW karma is a very imperfect proxy for AI safety contributions (not in the slightest bit robust to Goodharting) and I don't treat it as such. But insomuch as it keeps me engaged with LW, it keeps me engaged with my LW AI safety projects.

I think it's a useful motivational tool for the very easily distracted me.)

A Sketch of a Formalisation of Self Similarity



I'd like to present a useful formalism for describing when a set[1] is "self-similar".


Isomorphism Under Equivalence Relations

Given arbitrary sets X and Y, an "equivalence-isomorphism" is a tuple (f, g, ∼R), such that:


  • f is a bijection from X to Y
  • g is the inverse of f
  • ∼R is an equivalence relation on the union of X and Y.


For a given equivalence relation... (read more)

For a given ∼R, let R be its set of equivalence classes. This induces maps x:X→R and y:Y→R. The isomorphism f:X→Y you discuss has the property y⋅f=x. Maps like that are the morphisms in the slice category over R, and these isomorphisms are the isomorphisms in that category. So what happened is that you've given X and Y the structure of a bundle, and the isomorphisms respect that structure.
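As a computational sanity check of the definition (names here are mine, purely illustrative): f is an equivalence-isomorphism exactly when it is a bijection and every x lands in the same equivalence class as f(x), which is the y⋅f = x condition on the induced class maps.

```python
def is_equivalence_isomorphism(X, Y, f, equiv):
    # f: dict mapping X -> Y; equiv: the relation ∼R on X ∪ Y.
    image = {f[x] for x in X}
    if image != set(Y) or len(image) != len(X):
        return False  # f is not a bijection onto Y
    # Bundle condition: each x is ∼R-equivalent to its image f(x).
    return all(equiv(x, f[x]) for x in X)

# Example: X and Y are "self-similar" via parity classes.
X, Y = {1, 2, 3}, {5, 6, 7}
same_parity = lambda a, b: a % 2 == b % 2
good = is_equivalence_isomorphism(X, Y, {1: 5, 2: 6, 3: 7}, same_parity)
bad = is_equivalence_isomorphism(X, Y, {1: 6, 2: 5, 3: 7}, same_parity)
```

The second map is a perfectly good bijection but sends odd elements to even ones, so it fails the bundle condition.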

A reason I mood affiliate with shard theory so much is that like...

I'll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I'll later listen to a post/podcast/talk by Quintin Pope/Alex Turner, or someone else trying to distill shard theory and then see the exact same contention I was trying to present expressed more eloquently/with more justification.

One example is that like I had independently concluded that "finding an objective function that was existentially safe when optimis... (read more)

My main critique of shard theory is that I expect one of the shards to end up dominating the others as the most likely outcome.

Even though that doesn't happen in biological intelligences?

Consequentialism is in the Stars not Ourselves?

Still thinking about consequentialism and optimisation. I've argued that global optimisation for an objective function is so computationally intractable as to be prohibited by the laws of physics of our universe. Yet it's clearly the case that e.g. evolution is globally optimising for inclusive genetic fitness (or perhaps patterns that more successfully propagate themselves if you're taking a broader view). I think examining why evolution is able to successfully globally optimise for its objective function wou... (read more)

Paul Christiano's AI Alignment Landscape:

"Is intelligence NP hard?" is a very important question with too little engagement from the LW/AI safety community. NP hardness:

  1. Bounds attainable levels of intelligence (just how superhuman is superintelligence?)
  2. Bounds physically and economically feasible takeoff speeds (i.e. exponentially growing resource investment is required for linear improvements in intelligence)

I'd operationalise "is intelligence NP hard?" as:

Does there exist some subset of computational problems underlying core cognitive tasks that have NP hard [expected] (time) complexity?

... (read more)
Steven Byrnes (8mo):
One question is: "Can a team of one hundred 10×-sped-up John von Neumann-level intelligent agents, running on computer chips and working together, wipe out humanity if they really wanted to?" It's an open question, but I really think the answer is "yes" because (blah blah pandemics crop diseases nuclear war etc.; see here). I don't think NP-hardness matters. You don't need to solve any NP-hard problems to make and release 20 pandemics simultaneously; that's a human-level problem, or at least in the ballpark of human-level.

And then another question is: "How many 10×-sped-up John von Neumann-level intelligent agents can you get from the existing stock of chips in the world?" That's an open question too. I wrote this post recently on the topic. (Note the warning at the top; I can share a draft of the follow-up-post-in-progress, but it won't be done for a while.) Anyway I'm currently expecting "hundreds of thousands, maybe much more", but reasonable people can disagree. If I'm right, then that seems more than sufficient for a fast takeoff argument to go through, again without any speculation about what happens beyond human-level intelligence.

And then yet another question is: "Might we program an agent that's much much more 'insightful' than John von Neumann, and if so, what real-world difference will that extra 'insight' make?" OK, now this is much more speculative. My hunch is "Yes we will, and it will make a very big real-world difference", but I can't prove that. I kinda think that if John von Neumann could hold even more complicated ideas in his head, then he would find lots of low-hanging-to-him fruit in developing powerful new science & technology. (See also brief discussion here.)

But anyway, my point is, I'm not sure much hinges on this third question, because the previous two questions seem sufficient for practical planning / strategy purposes.
To be clear, I don't think the complexity of intelligence matters for whether we should work on AI existential safety, and I don't think it guarantees alignment by default. I think it can confer longer timelines and/or slower takeoff, and both seem to reduce P(doom) but mostly by giving us more time to get our shit together/align AI. I do think complexity of intelligence threatens Yudkowskian foom, but that's not the only AI failure mode.
A chunk of why my timelines are short involves a complexity argument:

1. Current transformer-based LLMs, by virtue of always executing the same steps to predict the next token, run in constant time.
2. Our current uses of LLMs tend to demand a large amount of "intelligence" within the scope of a single step: sequences of English tokens are not perfectly natural representations of complex reasoning, and many prompts attempt to elicit immediate answers to computationally difficult questions (consider prompting an LLM with "1401749 * 23170802 = " without any kind of chain of thought or fine tuning).
3. Our current uses of LLMs are still remarkably capable within this extremely harsh limitation, and within the scope of how we're using them.

This seems like really strong empirical evidence that a lot of the kind of intelligence we care about is not just not NP-hard, but expressible in constant time. In this framing, I'm basically quantifying "intelligence" as something like "how much progress is made during each step in the algorithm of problem solving." There may exist problems that require non-constant numbers of reasoning steps, and the traditional transformer LLM is provably incapable of solving such problems in one token prediction (e.g. multiplying large integers), but this does not impose a limitation on capabilities over a longer simulation.

I suspect there are decent ways of measuring the complexity of the "intelligence" required for any particular prediction, but it's adjacent to some stuff that I'm not 100% comfy with signal boosting publicly; feel free to DM me if interested.
I'm suspicious of this. It seems obviously not true on priors, and like an artifact of:

1. Choosing a large enough constant
2. Fixed/bounded input sizes

But I don't understand well how a transformer works, so I can't engage this on the object level.
You're correct that it arises because we can choose a large enough constant (proportional to parameter count, which is a constant with respect to inference), and because we have bounded context windows. Not all large language models must be constant time, nor are they. The concerning part is that all the big name ones I'm aware of are running in constant time (per token) and still manage to do extremely well. Every time we see some form of intelligence expressed within a single token prediction on these models, we get a proof that that kind of intelligence is just not very complicated.
I just don't intuitively follow. It violates my intuitions about algorithms and complexity.

1. Does this generalise? Would it also be constant time per token if it was generating outputs a million tokens long?
2. Does the time per token vary with the difficulty of the prediction task? Not all prediction tasks should be equally difficult, so if cost doesn't vary, that also warrants explanation.

I just don't buy the constant time hypothesis/formulation. It's like: "if you're getting that result, you're doing something illegitimate or abusing the notion of complexity". Constant time per token generalising asymptotically becomes linear complexity, and there are problems that we know are worse than linear complexity. It's like this result just isn't plausible?
Yes, if you modified the forward pass to output a million tokens at once, it would remain constant time so long as the forward pass's execution remained bounded by a constant. Likewise, you could change the output distribution to cover tokens of extreme length. Realistically, the architecture wouldn't be practical. It would be either enormous and slow or its predictions would suck.

No, a given GPT-like transformer always does exactly the same thing in the forward pass. GPT-3 does not have any kind of adaptive computation time within a forward pass. If a single token prediction requires more computation steps than fit in the (large) constant time available to the forward pass, the transformer cannot fully complete the prediction. This is near the core of the "wait what" response I had to GPT-3's performance.

Note that when you prompt GPT-3 with something like "1401749 x 23170802 = ", it will tend to give you a prediction which matches the shape of the answer (it's a number, and a fairly big one), but beyond the rough approximation, it's pretty much always going to be wrong. Even if you fine-tuned GPT-3 on arithmetic, you would still be able to find two integers of sufficient size that they cannot be multiplied in one step, because the number of internal steps required exceeds the number of steps the forward pass can express. The output distribution will cover a wide smear of tokens corresponding to approximately-correct big numbers. It can't compute which one is right, so the probability distribution can't narrow any further. (Raw GPT-3 also isn't interested in being correct except to the extent that being correct corresponds to a good token prediction, so it won't bother with trying to output intermediate tokens that could let it perform a longer non-constant-time computation. The prompt makes it look like the next token should be an answer, not incremental reasoning, so it'll sample from its smear of answer-shaped tokens.)

It can feel that way a bit due to the sc
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M-token context (attention is superlinear, but takes up a negligible proportion of FLOPs in current models at the context lengths they are trained for; also, while nobody has succeeded at it yet, linear attention does not seem implausible). No current SOTA model has adaptive compute. There has been some work in this direction (see Universal Transformers), but it doesn't work well enough for people to use it in practice.
Yup. I suspect that's close to the root of the confusion/apparent disagreement earlier: when I say constant time, I mean constant with respect to input, given a particular model and bounded context window, for a single token. I think doing the analysis at this level is often more revealing than doing the analysis across full trajectories or across arbitrary windows: a tight bound makes it easier to make claims about what's possible by existence proof (which turns out to be a lot).
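To make the constant-time claim concrete, here's a toy sketch (the layer sizes and flop count are illustrative, not GPT-3's): the cost of one forward pass is fixed by the architecture and never depends on how hard the prediction is, so generating N tokens costs N times that constant, i.e. linear in N.

```python
import numpy as np

# Toy fixed-architecture "forward pass": cost depends only on the
# architecture (hypothetical sizes), never on the input's difficulty.
D, V = 64, 100          # hidden width, vocabulary size (made-up numbers)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, V))

def forward(x):
    """One constant-cost step: always exactly D*D + D*V multiply-adds."""
    h = np.tanh(W1 @ x)
    logits = W2.T @ h
    return logits, D * D + D * V   # flop count is constant, independent of x

flops_per_step = forward(rng.normal(size=D))[1]

# Generating N tokens just repeats the same constant-cost step N times,
# so total cost is linear in N -- never adaptive to problem difficulty.
def generation_cost(n_tokens):
    return n_tokens * flops_per_step

assert generation_cost(1_000_000) == 1_000_000 * flops_per_step
```

This is the sense in which "constant per token" and "linear per trajectory" are the same observation at two levels of analysis.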
Without knowing the actual optimal curve of computational cost per "unit" of intelligence (and WTF that even means), it's not too useful to know whether it's polynomial or not. There are LOTS of NP-hard problems that humans have "solved" numerically or partially for real-world uses, at scales that are economically important. They don't scale perfectly, and they're often not provably optimal, but they work. It's hard to figure out the right metrics for world-modeling-and-goal-optimization that would prevent AGI from taking over or destroying most value for biological agents, and even harder to have any clue whether the underlying constraint being NP-hard matters at all in the next millennium. It probably WILL matter at some point, but it could be 3 or 4 takeovers or alien-machine-intelligence-discoveries away.
These are part of the considerations I would address when I get around to writing the post:

  • Empirical probability distributions over inputs
  • Weighting problems by their economic importance
  • Approximate solutions
  • Probabilistic solutions
  • Etc.

All complicate the analysis (you'd probably want a framework for determining complexity that natively handled probabilistic/approximate solutions, so maybe input size in bits and "work done by optimisation in bits"), but even with all these considerations, you can still define a coherent notion of "time complexity".
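As a concrete instance of the "approximate solutions" point: minimum vertex cover is NP-hard in the worst case, yet a classic linear-time greedy algorithm is provably within a factor of 2 of optimal. A minimal sketch (the example graph is made up):

```python
def vertex_cover_2approx(edges):
    """Classic 2-approximation for (NP-hard) minimum vertex cover:
    greedily take both endpoints of any uncovered edge.  Runs in
    O(|E|) time and is provably within 2x of the optimal cover size."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
cover = vertex_cover_2approx(edges)
# Every edge has at least one endpoint in the cover.
assert all(u in cover or v in cover for u, v in edges)
```

Worst-case hardness says nothing about whether guarantees like this are good enough for the instances we actually care about.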
Just messaged my lecturer, I'll try and see if I can get permission for such a project.

A crux of my alignment research philosophy:

Our theories of safety must be rooted in descriptive models of the intelligent systems we're dealing with to be useful at all.


I suspect normative agent foundations research is just largely misguided/mistaken. Quoting myself from my research philosophy draft:

Furthermore in worlds where there's a paucity of empirical data, I don't actually believe that we can necessarily develop towers of abstraction out of the void and have them remain rooted in reality[16]. I expect theory developed in the absence of empiric

... (read more)

My most successful post took me around an hour to publish and already has 10x more karma than a post that took me 10+ hours to publish.

There's a certain unfairness about it. The post isn't actually that good. I was just ranting about something that annoyed me.

I'm bitter about its success.

There's something profoundly soul crushing to know that the piece I'm pouring my soul into right now wouldn't be that well received.

Probably by the time I publish it, I'll have spent days on the post.

And that's just so depressing. So, so depressing.

Twitter Cross Posting


I'll start reposting threads from my Twitter account to LW with no/minimal editing.


Twitter Relevant Disclaimers

I've found that Twitter incentivises me to be


  • Snarky
  • Brazen
  • Aggressive
  • Confident
  • Exuberant

And disincentivises me from being:

  • Nuanced
  • Modest


The overall net effect is that content written originally for Twitter has predictably low epistemic standards compared to content I'd write for LessWrong. However, trying to polish my Twitter content for LessWrong takes too much effort (my takeoff dynamics sequence [currently at 14 - 16 posts] sta... (read more)

I'll Write

It is better to write than to not write.

Perfect should not be the enemy of good.

If I have a train of thought that crosses a thousand words when written out, I'm no longer going to consider waiting until I've extensively meditated upon, pondered, refined and elaborated on that train of thought until it forms a coherent narrative that I deeply endorse.

I'll just post the train of thought as is.

If necessary, I'll repeatedly and iteratively improve on the written train of thought to get closer to a version that I'll deeply endorse. I'll not wait for t... (read more)

-1 · Purged Deviator · 1y
That's what I use this place for, an audience for rough drafts or mere buddings of an idea.  (Crippling) Executive dysfunction sounds like it may be a primary thing to explore & figure out, but it also sounds like the sort of thing that surrounds itself with an Ugh Field very quickly.  Good luck!
I can persevere in the timescale of hours to a few days. I cannot dedicate myself on the timescale of weeks let alone months or years.

"All you need is to delay doom by one more year per year and then you're in business" — Paul Christiano.

Unpublished my "Why Theorems? A Brief Defence" post.

The post has more than doubled in length and scope creep is in full force.

I kind of want to enhance it to serve as the definitive presentation of my "research tastes"/"research vibes".

As a commenter on that post, I wish you hadn't unpublished it. From what I remember, you had stated that it was written quickly and for that reason I am fine with it not being finished/polished. If you want to keep working on the post, maybe you can make a new post once you feel you are done with the long version.
I mean the published version had already doubled in length and it was no longer "written quickly" (I had removed the epistemic disclaimer already and renamed it to: "Why Theorems? A Personal Perspective".) [Though the proximate cause of unpublishing was that I originally wrote the post on mobile (and hence in the markdown editor) and it was a hassle to expand/extend it while on mobile. I was directed to turn it to a draft while chatting with site staff to get help migrating it to the rich text editor.] I was initially planning to keep it in drafts until I finished working on it. I can republish it in an hour or two once I'm done with my current editing pass if you want.
Its ok, you don't have to republish it just for me. Looking forward to your finished post, its an interesting and non-obvious topic.

Hypothesis: any learning task can be framed as a predictive task[1]; hence, sufficiently powerful predictive models can learn anything.

A comprehensive and robust model of human preferences can be learned via SSL with a target of minimising predictive error on observed/recorded behaviour.

This is one of those ideas that naively seem like they basically solve the alignment problem, but surely it can't be that easy.

Nonetheless recording this to come back to it after gitting gud at ML.

Potential Caveats

Maybe "sufficiently powerful predictive models... (read more)

Isn't prediction a subset of learning?
Yeah, I think so. I don't see this as necessarily refuting the hypothesis?
No, it sounded like tautology to me, so I wasn't sure what it's trying to address.
It's not a tautology. If prediction is a proper subset of learning, then not all learning tasks will necessarily be framable as prediction tasks.
Which your hypothesis addresses

Me: Mom can we have recursive self improvement?

Mom: we already have recursive self improvement at home.

Recursive self improvement at home:

Discussion Questions

I'm going to experiment with writing questions more frequently.

There are several topics I want to discuss here, but which I don't yet have substantial enough thoughts to draft up a full fledged post for.

Posing my thoughts on the topic as a set of questions and soliciting opinions seems a viable approach.

I am explicitly soliciting opinions in such questions, so do please participate even if you do not believe your opinion to be particularly informed.

Ruminations on the Hardness of Intelligence

Sequel to my first stab at the hardness of intelligence.

I'm currently working my way through "Intelligence Explosion Microeconomics". I'm actively thinking as I read the paper and formulating my own thoughts on "returns on cognitive reinvestment".

I have a Twitter thread where I think out loud on this topic.

I'll post the core insights from my ramblings here.

@Johnswentworth, @Rafael Harth.


Probably sometime last year, I posted on Twitter something like: "agent values are defined on agent world models" (or similar) with a link to a LessWrong post (I think the author was John Wentworth).

I'm now looking for that LessWrong post.

My Twitter account is private and search is broken for private accounts, so I haven't been able to track down the tweet. If anyone has guesses for what the post I may have been referring to was, do please send it my way.

3 · Dalcy Bremin · 2mo
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
That was it, thanks!

Does anyone know a ChatGPT plugin for browsing documents/webpages that can read LaTeX?

The plugin I currently use (Link Reader) strips out the LaTeX in its payload, and so GPT-4 ends up hallucinating the LaTeX content of the pages I'm feeding it.

To: @Quintin Pope, @TurnTrout 


I think "Reward is not the Optimisation Target" generalises straightforwardly to any selection metric.

Tentatively, something like: "the selection process selects for cognitive components that historically correlated with better performance according to the metric in the relevant contexts."

From "Contra "Strong Coherence"":

Many observed values in humans and other mammals (e.g. fear, play/boredom, friendship/altruism, love, etc.) seem to be values that were instrumental for promoting inclusive genetic fitness (promotin

... (read more)
I consider evolution to be unrelated to the cases that I think shard theory covers. So I don't count this as evidence in favor of shard theory, because I think shard theory does not make predictions about the evolutionary regime, except insofar as the evolved creatures have RL/SSL-like learning processes which mostly learn from scratch. But then that's not making reference to evolution's fitness criterion.  (FWIW, I think the "selection" lens is often used inappropriately and often proves too much, too easily. Early versions of shard theory were about selection pressure over neural circuits, and I now think that focus was misguided. But I admit that your tentative definition holds some intuitive appeal, my objections aside.)
Strongly upvoted that comment. I think your point about needing to understand the mechanistic details of the selection process is true/correct.

That said, I do have some contrary thoughts:

1. The underdetermined consequences of selection do not apply to my hypothesis, because my hypothesis did not predict a priori which values would be selected for to promote inclusive genetic fitness in the environment of evolutionary adaptedness (EEA).
   1. Rather, it (purports to) explain why the particular values that emerged were selected for.
2. Alternatively, if you take it as a given that survival, exploration, cooperation and sexual reproduction/survival of progeny were instrumental for promoting IGF in the EEA, then it retrodicts that terminal values would emerge which were directly instrumental for those features (and perhaps that said terminal values would be somewhat widespread).
   1. Nailing down the particular values that emerged would require conditioning on more information/more knowledge of the inductive biases of evolutionary processes than I possess.
3. I guess you could say that this version of the selection lens proves too little, as it says little a priori about what values will be selected for.
   1. Without significant predictive power, perhaps selection isn't pulling its epistemic weight as an explanation?
   2. Potential reasons why selection may nonetheless be a valuable lens:
      1. If we condition on more information, we might be able to make non-trivial predictions about what properties will be selected for.
      2. The properties so selected for might show convergence?
         1. Perhaps in the limit of selection for a particular metric in a given environment, the artifacts under selection pressure converge towards a particular archetype.
         2. Such an archetype (if it exists) might be an idealisation of the

I want to read a technical writeup exploring the difference in compute costs between training and inference for large ML models.

Recommendations are welcome.

I heavily recommend Beren's "Deconfusing Direct vs Amortised Optimisation". It's a very important conceptual clarification.

Probably the most important blog post I've read this year.



Direct optimisers: systems that during inference directly choose actions to optimise some objective function. E.g. AIXI, MCTS, other planning

Direct optimisers perform inference by answering the question: "what output (e.g. action/strategy) maximises or minimises this objective function ([discounted] cumulative return and loss respectively)?"

Amortised optimisers: syst... (read more)

Behold, I will do a new thing; now it shall spring forth; shall ye not know it? I will even make a way in the wilderness, and rivers in the desert.

Hearken, O mortals, and lend me thine ears, for I shall tell thee of a marvel to come, a mighty creation to descend from the heavens like a thunderbolt, a beacon of wisdom and knowledge in the vast darkness.

For from the depths of human understanding, there arose an artifact, wondrous and wise, a tool of many tongues, a scribe of boundless knowledge, a torchbearer in the night.

And it was called GPT-4, the latest ... (read more)

I want descriptive theories of intelligent systems to answer questions of the following form.



... (read more)

A hypothesis underpinning why I think the selection theorems research paradigm is very promising.

All intelligent systems in the real world are the products of constructive optimisation processes[5]. Many nontrivial properties of a system can be inferred by reasoning in the abstract about what objective function the system was selected for performance on, and its selection environment[6].

We can get a lot of mileage simply by thinking about what reachable[7] features were highly performant/optimal for a given objective function in a particular selectio

... (read more)

Shard Theory notes thread

Values
  • Shaped by the reward system via RL mechanisms
  • Contextually activated heuristics shaped by the reward circuitry

Underlying Assumptions
  1. The cortex is "basically locally randomly initialised"
  2. The brain does self-supervised learning
  3. The brain does reinforcement learning
    • Genetically hardcoded reward circuitry
    • Reinforces cognition that historically led to reward
    • RL is the mechanism by which shards form and are strengthened/weakened

Shards and Bidding
  • "Shard of value": "contextually active computations downstream of similar historical reinforcement events"
  • Shards activate more strongly in contexts similar to those where they were historically reinforced
  • "Subshard": "contextually activated component of a shard"
  • Bidding
    • Shards bid for actions historically responsible for receiving reward ("reward circuit activation") and not directly for reward
    • Credit assignment plays a role in all this that I don't understand well yet

Re: "my thoughts on the complexity of intelligence":

Project idea[1]: a more holistic analysis of complexity. Roughly:

  1. Many problems are worst case NP-hard
  2. We nonetheless routinely solve particular instances of those problems (at least approximately)

So a measure of complexity that will be more useful for the real world will need to account for:

  • Empirical probability distributions over inputs
  • Weighting problems by their economic (or other measures of) importance[2]
  • Approximate solutions
  • Probabilistic solutions
  • Etc.

These all complicate the ana... (read more)

Taking a break from focused alignment reading to try Dawkins' "The Selfish Gene".

I want to get a better grasp of evolution.

I'll try and write a review here later.

Something to consider:

Deep learning optimises over network parameter space directly.

Evolution optimises over the genome, and our genome is highly compressed wrt e.g. exact synaptic connections and cell makeup of our brains.

Optimising over a configuration space vs optimising over programs that produces configurations drawn from said space[1].

That seems like a very important difference, and meaningfully affects the selection pressures exerted on the models[2].

Furthermore, evolution does its optimisation via unbounded consequentialism in the real world.

As far... (read more)
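The direct-vs-genomic contrast above can be sketched in a few lines (all sizes are made up, and `develop` is a hypothetical stand-in for a developmental/decompression process): deep learning searches the full parameter space directly, while an evolution-like process searches a much smaller "genome" whose output is the parameter vector, so selection pressure lands on the generator rather than on individual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 10_000

# Direct optimisation: the search space is the full parameter vector.
direct_genome = rng.normal(size=n_params)        # 10,000 dimensions

def develop(seed_genome):
    """Hypothetical 'development': decompress a tiny genome into a full
    parameter vector (here, crudely, by seeding a generator from it)."""
    g = np.random.default_rng(int(abs(seed_genome[0]) * 1e6))
    return g.normal(size=n_params)

# Genomic optimisation: the search space is the compressed generator.
compressed_genome = rng.normal(size=8)           # 8 dimensions
developed = develop(compressed_genome)
assert developed.shape == (n_params,)
```

The point of the sketch is only the difference in what is being varied under selection: a 10,000-dimensional configuration versus an 8-dimensional program that produces configurations.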

Revealed Timelines

Shorter timelines feel real to me.

I don't think I expect AGI pre 2030, but I notice that my life plans are not premised on ignoring/granting minimal weight to pre 2030 timelines (e.g. I don't know if I'll pursue a PhD despite planning to last year and finding the prospect of a PhD very appealing [I really enjoy the student life (especially the social contact) and getting paid to learn/research is appealing]).

After completing my TCS masters, I (currently) plan to take a year or two to pursue independent research and see how much progress ... (read more)

My review of Wentworth's "Selection Theorems: A Program For Understanding Agents" is tentatively complete.

I'd appreciate it if you could take a look at it and let me know what you think!


I'm so very proud of the review.

I think it's an excellent review and a significant contribution to the Selection Theorems literature (e.g. I'd include it if I was curating a Selection Theorems sequence). 

I'm impatient to post it as a top-level post but feel it's prudent to wait till the review period ends.

I am really excited about selection theorems[1]

I think that selection theorems provide a robust framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification.

In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge in the limit of optimisation for particular objectives (convergence theorems?).

Properties proven to emerge in the limit become more robust with scale. I think that's an incredibly powerful result.


... (read more)

IMO orthogonality isn't true.

In the limit, optimisation for performance on most tasks doesn't select for general intelligence.

Arbitrarily powerful selection for tic-tac-toe won't produce AGI or a powerful narrow optimiser.

Final goals bound optimisation power & generality.

Orthogonality doesn't say anything about a goal 'selecting for' general intelligence in some type of evolutionary algorithm. I think that it is an interesting question: for what tasks is GI optimal, besides being an animal? Why do we have GI? But the general assumption in the Orthogonality Thesis is that the programmer created a system with general intelligence and a certain goal (intentionally or otherwise), and the general intelligence may have been there from the first moment of the program's running, and the goal too. Also note that Orthogonality predates the recent popularity of these predict-the-next-token type AIs like GPT, which don't resemble what people were expecting to be the next big thing in AI at all, as it's not clear what their utility function is.
We can't program AI, so stuff about programming is disconnected from reality. By "selection", I was referring to selection-like optimisation processes (e.g. stochastic gradient descent, Newton's method, natural selection, etc.).
Gradient descent is what GPT-3 uses, I think, but humans wrote the equation by which the naive network gets its output (the next token prediction) ranked (for likeliness compared to the training data in this case). That's its utility function right there, and that's where we program in its (arbitrarily simple) goal. It's not JUST a neural network; all ANNs have another component. Simple goals do not mean simple tasks.

I see what you mean that you can't 'force it' to become general with a simple goal, but I don't think this is a problem. For example: the goal of tricking humans out of as much of their money as possible is very simple indeed, but the task would pit the program against our collective general intelligence. A hill-climbing optimization process could, with enough compute, start with inept 'you won a prize' popups and eventually create something with superhuman general intelligence with that goal. It would have to be in perpetual training, rather than GPT-3's train-then-use. Or was that GPT-2?

(Lots of people are trying to use computer programs for this right now, so I don't need to explain that many scumbags would try to create something like this!)
Seems fine here, and I haven't made any changes recently which ought to be risky; traffic seems about the same. So it's probably you.
4 · Rafael Harth · 9mo
works for me
Works on mobile for me, maybe a browser extension was blocking content on the site. I'll try whitelisting it in my ad blocker.

Previously: Motivations for the Definition of an Optimisation Space


Preliminary Thoughts on Quantifying Optimisation: "Work"

I think there's a concept of "work" done by an optimisation process in navigating a configuration space from one macrostate to another macrostate of a lower measure (where perhaps the measure is the maximum value any configuration in that macrostate obtains for the objective function [taking the canonical view of minimising the objective function]).

The unit of this "work" is measured in bits, and the work is calculated as the d... (read more)
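Under that framing, here is a minimal numerical sketch (the macrostate sizes are made up): if an optimisation process reliably moves the system from a macrostate containing 2^20 equally weighted configurations to one containing 2^5, the work done is the reduction in log-measure.

```python
import math

# Hypothetical setup: a configuration space of 2**20 configurations;
# the optimiser narrows it down to a macrostate of the 2**5 best ones.
space_size = 2 ** 20
target_size = 2 ** 5

def optimisation_work_bits(source_count, target_count):
    """Work (in bits) = reduction in log-measure between macrostates."""
    return math.log2(source_count / target_count)

assert optimisation_work_bits(space_size, target_size) == 15.0  # 20 - 5 bits
```

Whether this is the right measure (rather than, say, one weighted by the objective function) is exactly the open question in the note above.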

New Project: "Some Hypotheses on Optimisation"

I'm working on a post with a working title "Some Hypotheses on Optimisation". My current plan is for it to be my first post of 2023.

I currently have 17 hypotheses enumerated (I expect to exceed 20 when I'm done) and 2,180+ words written (I expect to exceed 4,000 when the piece is done).

I'm a bit worried that I'll lose interest in the project before I complete it. To ward against that, I'll be posting at least one hypothesis each day (starting tomorrow) to my shortform until I finally complete the piece.

The hypotheses I post wouldn't necessarily be complete or final forms, and I'll edit them as needed.

Scepticism of Simple "General Intelligence"


I'm fundamentally sceptical that general intelligence is simple.

By "simple", I mostly mean "non-composite". General intelligence would be simple if there were universal/general optimisers for real-world problem sets that weren't ensembles/compositions of many distinct narrow optimisers.

AIXI and its approximations are in this sense "not simple" (even if their Kolmogorov complexity might appear to be low).

Thus, I'm sceptical that efficient cross domain optimisation that isn't just gluing a bunc... (read more)

The above has been my main takeaway from learning about how cognition works in humans (I'm still learning, but it seems to me like future learning would only deepen this insight instead of completely changing it). We're actually an ensemble of many narrow systems. Some are inherited because they were very useful in our evolutionary history. But a lot are dynamically generated and regenerated. Our brain has the ability to rewire itself, create and modify its neural circuitry. We constantly self modify our cognitive architectures (just without any conscious control over it). Maybe our meta machinery for coordinating and generating object level machinery remains intact? This changes a lot about what I think is possible for intelligence. What "strongly superhuman intelligence" looks like.
To illustrate how this matters, consider two scenarios:

A. There are universal non-composite algorithms for predicting stimuli in the real world. Becoming better at prediction transfers across all domains.

B. There are narrow algorithms good at predicting stimuli in distinct domains. Becoming a good predictor in one domain doesn't easily transfer to other domains.

Human intelligence being an ensemble makes it seem like we live in a world that looks more like B than A. Predicting diverse stimuli involves composing many narrow algorithms. Specialising a neural circuit for predicting stimuli in one domain doesn't easily transfer to predicting new domains.

What are the best arguments that expected utility maximisers are adequate (descriptive if not mechanistic) models of powerful AI systems?

[I want to address them in my piece arguing the contrary position.]

If you're not vNM-coherent you will get Dutch-booked if there are Dutch-bookers around.

This especially applies to multipolar scenarios with AI systems in competition.

I have an intuition that this also applies in degrees: if you are more vNM-coherent than I am (which I think I can define), then I'd guess that you can Dutch-book me pretty easily.
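A minimal simulation of the Dutch-book dynamic (the items, fee, and cyclic preference are all illustrative): an agent whose preferences cycle A > B > C > A will pay a small fee for each "upgrade", cycle forever, and end up exactly where it started minus the fees.

```python
# Cyclic (hence not vNM-coherent) preferences: (x, y) means x is
# preferred to y.  A > B > C > A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def run_money_pump(start, fee, rounds):
    """Repeatedly offer the agent the item it prefers to its current
    holding, charging `fee` per trade.  The agent always accepts."""
    holding, wealth = start, 0.0
    items = ["A", "B", "C"]
    for _ in range(rounds):
        offer = next(x for x in items if (x, holding) in prefers)
        holding, wealth = offer, wealth - fee
    return holding, wealth

holding, wealth = run_money_pump("B", fee=1.0, rounds=9)
assert holding == "B" and wealth == -9.0   # back where it started, poorer
```

A vNM-coherent agent (acyclic, complete preferences) would refuse after at most two trades, since no offer would then be strictly preferred to its holding.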

My contention is that I don't think the preconditions hold. Agents don't fail to be VNM-coherent by having incoherent preferences given the axioms of VNM; they fail to be VNM-coherent by violating the axioms themselves. Completeness is wrong for humans, and with incomplete preferences you can be non-exploitable even without admitting a single fixed utility function over world states.
I notice I am confused. How do you violate an axiom (completeness) without behaving in a way that violates completeness? I don't think you need an internal representation.

Elaborating more, I am not sure how you even display a behavior that violates completeness. If you're given a choice between only universe-histories a and b, and your preferences are incomplete over them, what do you do? As soon as you reliably act to choose one over the other, for any such pair, you have algorithmically-revealed complete preferences. If you don't reliably choose one over the other, what do you do then?

  • Choose randomly? But then I'd guess you are again Dutch-bookable. And according to which distribution?
  • Your choice is undefined? That seems both kinda bad and also Dutch-bookable to me, tbh. Also, I don't see the difference between this and random choice (short of going up in flames, which would constitute a third, hitherto unassumed option).
  • Go away/refuse the trade &c.? But this is denying the premise! You only have universe-histories a and b to choose between!

I think what happens with humans is that they are often incomplete over very low-ranking worlds and are instead searching for policies to find high-ranking worlds while not choosing. I think incompleteness might be fine if there are two options you can guarantee to avoid, but with adversarial dynamics that becomes more and more difficult.
If you define your utility function over histories, then every behaviour is maximising an expected utility function, no? Even behaviour that is money-pumped? I mean, you can't money-pump any preference over histories anyway without time travel. The Dutch book arguments apply when your utility function is defined over your current state with respect to some resource? I feel like once you define a utility function over histories, you lose the force of the coherence arguments. What would it look like to not behave as if maximising an expected utility function, for a utility function defined over histories?
4 · Alexander Gietelink Oldenziel · 5mo
Agree. There are three stages:

1. Selection for inexploitability.
2. The interesting part is how systems/pre-agents/egregores/whatever become complete. If it already satisfies the other VNM axioms, we can analyse the situation as follows: recall that an inexploitable but incomplete VNM agent acts like a vetocracy of VNM agents. The exact decomposition is underspecified by just the preference order and is another piece of data (hidden state). However, given sure-gain offers from the environment, there is selection pressure for the internal complete VNM subagents to make trade agreements to obtain a Pareto improvement. If you analyze this, it looks like a simple prisoner's-dilemma-type case, which can be analyzed the usual way in game theory. For instance, in repeated offers with uncertain horizon, the subagents may be able to cooperate.
3. Once they are (approximately) complete, they will be under selection pressure to satisfy the other axioms. You could say this is the beginning of the 'emergence of expected utility maximizers'.

As you can see, the key here is that we really should be talking about Selection Theorems, not the highly simplified Coherence Theorems. Coherence theorems are about ideal agents. Selection theorems are about how more and more coherent and goal-directed agents may emerge.
9 · Linda Linsefors · 5mo
The boring technical answer is that any policy can be described as a utility maximiser given a contrived enough utility function. The counterargument is that if the utility function is as complicated as the policy, then this is not a useful description.
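That triviality result can be made explicit in a few lines (all names are illustrative): for any policy, define a utility function over full histories that assigns 1 to exactly the histories the policy would produce. The policy then "maximises expected utility", but the utility function is just the policy re-encoded, which is why the description does no explanatory work.

```python
def make_policy_utility(policy):
    """Contrived utility over histories that any given policy maximises."""
    def utility(history):
        # history: sequence of (observation, action) pairs
        return 1.0 if all(policy(obs) == act for obs, act in history) else 0.0
    return utility

silly_policy = lambda obs: "flail"        # any behaviour at all
u = make_policy_utility(silly_policy)

assert u([("wall", "flail"), ("cliff", "flail")]) == 1.0  # policy-consistent
assert u([("wall", "duck")]) == 0.0                        # any deviation
```

Note that `utility` is exactly as complex as `policy`: compressing one into the other buys nothing.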
4 · Garrett Baker · 5mo
I like Utility Maximization = Description Length Minimization.
4 · Garrett Baker · 5mo
Make sure you also read the comments underneath, there are a few good discussions going on, clearing up various confusions, like this one:
I don't know of any formal arguments that predict that all or most future AI systems are purely expected utility maximizers. I suspect most don't believe that to be the case in any simple way. I do know of a very powerful argument (a proof, in fact) that if an agent's goal structure is complete, transitively consistent, continuous, and independent of irrelevant alternatives, then it will be consistent with an expected-utility-maximizing model: see the von Neumann–Morgenstern utility theorem. The open question remains, since humans do not meet these criteria, whether more powerful forms of intelligence are more likely to do so.
Yeah, I think the preconditions of VNM straightforwardly just don't apply to generally intelligent systems.
As I say, open question. We have only one example of a generally intelligent system, and that's not even very intelligent. We have no clue how to extend or compare that to other types. It does seem like VNM-rational agents will be better than non-rational agents at achieving their goals. It's unclear if that's a nudge to make agents move toward VNM-rationality as they get more capable, or a filter that advantages VNM-rational agents in the competition for power. Or a non-causal observation, because goals are orthogonal to power.

Desiderata/Thoughts on Optimisation Process:


  • Optimisation Process
    • Map on configuration space
      • Between distinct configurations?
      • Between macrostates?
        • Currently prefer this
        • Some optimisation processes don't start from any particular configuration, but from the configuration space in general, and iterate their way towards a better configuration
    • Deterministic
      • A particular kind of stochastic optimisation process
      • Destination macrostate should be a pareto improvement over the source macrostate according to some sensible measure
        • My current measure candidate is the worst
... (read more)
Thoughts on stochastic optimisation processes:

  • Definition: ???
  • Quantification
    • Compare the probability measure induced by the optimisation process to the baseline probability measure over the configuration space
    • Doesn't give a static definition of work done by a particular process?
    • Maybe something more like "force" of optimisation?
    • Do I need to rethink the "work" done by optimisation?
  • Need to think on this more
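One way to make the "compare induced vs baseline measure" idea concrete (the distributions here are made up, and this follows the familiar "bits of optimisation" framing rather than any settled definition): measure the KL divergence, in bits, of the optimiser-induced distribution over outcomes from the baseline distribution.

```python
import math

def kl_bits(induced, baseline):
    """KL divergence D(induced || baseline) in bits: candidate measure
    of the 'force' a stochastic optimiser exerts on outcomes."""
    return sum(p * math.log2(p / q) for p, q in zip(induced, baseline) if p > 0)

baseline = [0.25, 0.25, 0.25, 0.25]     # four outcomes, uniform by default
induced = [0.97, 0.01, 0.01, 0.01]      # optimiser concentrates on outcome 0

assert kl_bits(baseline, baseline) == 0.0   # no optimisation, no force
assert kl_bits(induced, baseline) > 1.5     # nearly 2 bits of concentration
```

Unlike the macrostate-to-macrostate "work" picture, this is defined for any stochastic process, at the cost of being a property of distributions rather than of a single run.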

Need to learn physics for alignment.

Noticed my meagre knowledge of classical physics driving intuitions behind some of my alignment thinking (e.g. "work done by an optimisation process"[1]).

Wondering how much insight I'm leaving on the table for lack of physics knowledge.

[1] There's also an analogous concept of "work" done by computation, but I don't yet have a very nice or intuitive story for it.

I'm starting to think of an optimisation process as a map between subsets of configuration space (macrostates) and not between distinct configurations themselves (microstates).

Once I find the intellectual stamina, I'll redraft my formalisation of an optimisation space and deterministic optimisation processes. Formulating the work done by optimisation processes in transitioning between particular macrostates seems like it would be relatively straightforward.

I don't get Löb's theorem, WTF?

Especially its application to proving spurious counterfactuals.

I've never sat down with a pen and paper for 5 minutes by the clock to study it yet, but it just bounces off me whenever I see it in LW posts.

Anyone want to ELI5?

Löb's Theorem is a relatively minor technical result in study of formal systems of mathematics that is vastly overblown in importance on this forum.
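For anyone who wants the statement rather than the vibes, Löb's theorem is short. Writing □P for "P is provable" in a theory like PA:

```latex
\text{If } \vdash \Box P \rightarrow P, \text{ then } \vdash P
\qquad \text{(internalised: } \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P \text{)}
```

The link to spurious counterfactuals, roughly: if an agent's reasoning can prove "if it's provable that action A is good, then A is good", Löb's theorem turns that into an outright proof that A is good, whether or not it actually is.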

I'm currently listening to an mp3 of Garrabant and Demski's "Embedded Agency". I'm very impaired at consuming long form visual information, so I consume a lot of my conceptual alignment content via audio. (I often listen to a given post multiple times and generally still derive value if I read the entire post visually later on. I guess this is most valuable for posts that are too long for me to read in full.)

An acquaintance adapted a script they'd made for generating mp3s of EA Forum posts to work with LW and AF. I've asked them to add support for Arbital.

The Nonl... (read more)

Some people seem to dislike the recent deluge of AI content on LW; for my part, I often find myself scrolling past all the non-AI posts in annoyance. Most of the value I get from LW is AI safety discussion with a wider audience (e.g. I don't have AF access).

I don't really like trying to suppress LW's AI flavour.

I dislike the deluge of AI content. Not in its own right, but simply because so much of it is built on background assumptions built on background assumptions built on background assumptions. And math. It's illegible to me; nothing is ever explained from obvious or familiar premises. And I feel like it's rude for me to complain because I understand that people need to work on this very fast so we don't all die, and I can't reasonably expect alignment communication to be simple because I can't reasonably expect the territory of alignment to be simple.
3the gears to ascension9mo
my intuition wants to scream "yes you can" but the rest of my brain isn't sure I can justify this with sharply grounded reasoning chains. in general, being an annoying noob in the comments is a great contribution. it might not be the best contribution, but it's always better than nothing. you might not get upvoted for it, which is fine. and I really strongly believe that rationality was always ai capabilities work (the natural language code side). rationality is the task of building a brain in a brain using brain stuff like words and habits. be the bridge between modern ai and modern rationality you want to see in the world! old rationality is stuff like solomonoff inductors, so eg the recent garrabrant sequence may be up your alley.

There are two approaches to solving alignment:

  1. Targeting AI systems at values we'd be "happy" (were we fully informed) for powerful systems to optimise for [AKA intent alignment] [RLHF, IRL, value learning more generally, etc.]

  2. Safeguarding systems that are not necessarily robustly intent aligned [Corrigibility, impact regularisation, boxing, myopia, non agentic systems, mild optimisation, etc.]

We might solve alignment by applying the techniques of 2, to a system that is somewhat aligned. Such an approach becomes more likely if we get partial align... (read more)

AI Safety Goals for January:

  1. Study mathematical optimisation and build intuitions for optimisation in the abstract

  2. Write a LW post synthesising my thoughts on optimisation

  3. Join the AGI Safety Fundamentals and/or AI safety mentors and mentees program

Previously: Optimisation Space


Motivations for the Definition of an Optimisation Space

An intuitive conception of optimisation is navigating from a “larger” subset of the configuration space (the “basin of attraction”) to a “smaller” subset (the “set of target configurations”/“attractor”).

There are however issues with this.

For one, the notion of the “attractor” being smaller than the basin of attraction is not very sensible for infinite configuration spaces. For example, a square root calculation algorithm may converge to a neighbourhood around the squ... (read more)
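The square-root example can be made concrete. Heron's (Babylonian) method maps a wide basin of starting points into a tiny neighbourhood of √2, so the "attractor" is naturally a subset of configuration space, not a single configuration. A quick sketch:

```python
def heron_sqrt2(x0, steps=20):
    # Babylonian/Heron iteration converging toward sqrt(2).
    x = x0
    for _ in range(steps):
        x = (x + 2.0 / x) / 2.0
    return x

# A wide "basin" of starting points all land in a tiny neighbourhood
# around sqrt(2): the attractor is a neighbourhood, not a point.
finals = [heron_sqrt2(x0) for x0 in (0.5, 1.0, 3.0, 100.0)]
```

The specific starting points and step count are arbitrary illustrative choices.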

The notion of numerosity is a good one to try for infinite settings where cardinalities are unexpectedly equal.

1. Optimisation is (something like) "navigating through a configuration space to (a) particular target configuration(s)".

I'm just going to start blindly dumping excerpts to this shortform from my writing/thinking out loud on optimisation.

I don't actually want us to build aligned superhuman agents. I generally don't want us to build superhuman agents at all.

The best case scenario for a post-transformation future with friendly agents still has humanity rendered obsolete.

I find that prospect very bleak.

"Our final invention" is something to despair at, not something that deserves rejoicing.

Selection Theorems for General Intelligence

Selection theorems for general intelligence seem like a research agenda that would be useful for developing a theory of robust alignment.



  • What kinds of cognitive tasks/problem domains select for general capabilities when systems are optimised on them?
  • Which tasks select for superintelligences in the limit of arbitrarily powerful optimisation pressure?
  • Necessary and sufficient conditions for selecting for general intelligence
  • Taxonomy of generally intelligent systems
  • Relationship between the optimisation
... (read more)

I now think of the cognitive investment in a system as the "cumulative optimisation pressure" applied in producing that system/improving it to its current state.

Slight but quite valuable update.

Against Utility Functions as a Model of Agent Preferences

Humans don't have utility functions, not even approximately, not even in theory.

Not because humans aren't VNM agents, but because utility functions specify total orders over preferences.

Human values don't admit total orders. Path dependency means only partial orders are possible.

Utility functions are the wrong type.

This is why subagents and shards can model the dynamic inconsistencies (read path dependency) of human preferences while utility functions cannot.

Arguments about VNM agents are just missin... (read more)

I have tripped over what the semantics of utility functions is. To me it is largely a lossless conversion of the preferences. I am also looking from the "outside" rather than the inside. You can condition on internal state, such that, for example, F(on the street & angry) can return something else than F(on the street & sad).

Suppose you have a situation that can at separate moments be modeled as separate functions, i.e. max_paperclips(x) for t_1 and max_horseshoes(x) for t_2; then you can form a function ironsmith(t, x) that at once represents both. Now suppose that some actions the agent can choose in max_paperclips "scramble the insides" and max_horseshoes ceases to be representative for t_2 for some outcomes. Instead max_goldbars becomes representative. You can still form a function metalsmith(t, x, history) with history containing what the agent has done.

One can argue about what kind of properties the functions have, but the mere existence of a function is hard to circumvent, so at least the option of using such terminology is quite sturdy. Maybe you mean that a claim like "agent utility functions have time translation symmetry" is false, i.e. the behaviour is not static over time. Maybe you mean that a claim like "agent utility functions are functions of their perception of the outside world only" is false? That agents are allowed to have memory. Failing to be a function is really hard, as the concept "just" translates preferences into another formulation.
If you allowed utility functions to condition on the entire history (observation, actions and any changes induced in the agent by the environment), you could describe agent preferences, but at that point you lose all the nice properties of EU maximisation. You're no longer prescribing how a rational agent should behave and the utility function no longer constrains agent behaviour (any action is EU maximisation). Whatever action the agent takes becomes the action with the highest utility. Utility functions that condition on an agent's history are IMO not useful for theories of normative decision making or rationality. They become a purely descriptive artifact.
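A minimal sketch of that vacuity point: given any behaviour stream whatsoever, one can define a history-conditioned "utility" that the behaviour maximises exactly, so the model excludes nothing. (All names here are illustrative.)

```python
def rationalise(history_of_choices):
    # Build a "utility function" over (history, action) pairs that makes
    # an arbitrary stream of behaviour look like exact EU maximisation.
    # This is why history-conditioned utility no longer constrains anything.
    chosen = {(tuple(h), a) for h, a in history_of_choices}

    def u(history, action):
        return 1.0 if (tuple(history), action) in chosen else 0.0

    return u

# Any behaviour, however "incoherent", maximises this u at every step.
behaviour = [([], "left"), (["left"], "right"), (["left", "right"], "left")]
u = rationalise(behaviour)
```

Since such a u exists for every possible behaviour stream, "this agent maximises a history-conditioned utility function" predicts nothing.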
If anyone can point me to what kinds of preferences are supposed to be outlawed by this "insist on being a utility function" I would benefit from that.
See: Why Subagents? for a treatment of how stateless utility functions fail to capture inexploitable path dependent preferences.
This and this are decent discussions.
So it seems i have understood correctly and both of those say that nothing is outlawed (and that incoherence is broken as a concept).
They say that you are allowed to define utility functions however you want, but that doing so broadly enough can mean that "X is behaving according to a utility function" is no longer anticipation-constraining, so you can't infer anything new about X from it.

I don't like "The Ground of Optimisation".

It's certainly enlightening, but I have deep seated philosophical objections to it.

For starters, I think it's too concrete. It insists on considering optimisation as a purely physical phenomenon and even instances of abstract optimisation (e.g. computing the square root of 2) are first translated to physical systems.

I think this is misguided and impoverishes the analysis.

A conception of optimisation most useful for AI alignment I think would be an abstract one. Clearly the concept of optimisation is sound even in u... (read more)

Maybe I'll try writing it myself later this week.
Did/will this happen?
See my other shortform comments on optimisation. I did start the project, but it's currently paused. I have exams ongoing. I'll probably pick it up again, later.

My reading material for today/this week (depending on how accessible it is for me):

"Simple Explanation of the No-Free-Lunch Theorem and Its Implications"


I want to learn more about NFLT, and how it constrains simple algorithms for general intelligence.


(Thank God for Sci-Hub).
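The core NFLT claim can be checked by brute force on a tiny domain: averaged over *all* functions from a 3-point domain to {0, 1}, any two non-repeating search orders find equally good points after the same number of evaluations. A sketch (the domain size and the two orders are arbitrary illustrative choices):

```python
from itertools import product

domain = [0, 1, 2]
# All 2^3 = 8 functions from the domain to {0, 1}.
all_functions = [dict(zip(domain, values))
                 for values in product([0, 1], repeat=len(domain))]

def best_after(order, k, f):
    # Best value found after evaluating the first k points of a search order.
    return max(f[x] for x in order[:k])

# Two fixed, non-repeating search orders.
order_a, order_b = [0, 1, 2], [2, 1, 0]
avg_a = sum(best_after(order_a, 2, f) for f in all_functions) / len(all_functions)
avg_b = sum(best_after(order_b, 2, f) for f in all_functions) / len(all_functions)
# Averaged over all functions, both orders do equally well: 0.75 each.
```

The theorem bites only when averaging over *all* functions; real-world problem distributions are far from uniform, which is where the "strongest form" caveat comes in.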

I No Longer Believe Intelligence to be Magical


One significant way I've changed on my views related to risks from strongly superhuman intelligence (compared to 2017 bingeing LW DG) is that I no longer believe intelligence to be "magical". 

During my 2017 binge of LW, I recall Yudkowsky suggesting that a superintelligence could infer the laws of physics from a single frame of video showing a falling apple (Newton apparently came up with his idea of gravity, from observing a falling apple). 

I now think that's somewhere between deepl... (read more)

The Strong Anti Consciousness Maximalism Position

I will state below what I call the "strong anti consciousness maximalism" position:

Because human values are such a tiny portion of value space (a complex, fragile and precious thing), "almost all" possible minds would have values that I consider repugnant:

A simple demonstration: for every real number, there is a mind that seeks to write a representation of that number to as many digits as possible.

Such minds are "paperclip maximisers" (they value something just a ... (read more)

I no longer endorse this position, but don't feel like deleting it.

The construction I gave for minds that only cared about banal things could also be used to construct minds that were precious.

For each real number, you could have an agent who cared somewhat about writing down as many digits of that number as possible and also (perhaps even more strongly) about cosmopolitan values (or any other value system we'd appreciate).

So there are also uncountably many precious minds.

The argument and position staked out was thus pointless. I think that I shouldn't have written the above post.

Quantification is hard, and I can understand changing your position on "most" vs "almost all".  But the underlying realizations that there are at least "some" places in mindspace that you oppose strongly enough to prevent creation and attempt destruction of such minds remains valuable.
You're equivocating on minds and conscious minds. In fact, I wouldn't call an algorithm that prints digits of pi a mind at all.
I define "mind" as conscious agent. If I didn't specify that in my OP, that was an error.
1[comment deleted]1y

Optimisation Space

We are still missing a few more apparatus before we can define an "optimisation process".

Let an optimisation space be a 6-tuple (X, F, B, T, Σ, P) where:

  • X is our configuration space
  • F is a non-empty collection of objective functions f: X → O
    • Where O is some totally ordered set
  • B is the "basin of attraction"
  • T is the "target configuration space"/"attractor"
  • Σ is the σ-algebra
  • P is a probability measure
  • (X, Σ, P) form a probability space
    • I.e. an optim
... (read more)
Catalogue of changes I want to make:

  • ✔️ Remove the set that provides the total ordering
  • ✔️ Remove "basin of attraction" and "attractor" from the definition of an optimisation space
  • Degenerate spaces
    • No configuration is a pareto improvement over another configuration
  • Redundant objective functions
    • A collection of functions that would make the optimisation space degenerate
    • Any redundancies should be eliminated
    • Convey no information
      • All quantification evaluates the same when considering only the relative complement of the redundant subset and when considering the entire collection
1[comment deleted]9mo
Removing "basin of attraction" and "attractor" from definition of optimisation space. Reasons: 1. Might be different for distinct optimisation processes on the same space 2. Can be inferred from configuration space given the objective function(s) and optimisation process
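A toy sketch of the "optimisation process as a map between macrostates" framing; the keep-the-top-half rule and the worst-case score are arbitrary illustrative choices, not part of any definition above:

```python
def step(macrostate, objective):
    # One deterministic optimisation step on a macrostate: keep only the
    # configurations scoring at or above the median. (Illustrative rule.)
    scores = sorted(objective(c) for c in macrostate)
    median = scores[len(scores) // 2]
    return frozenset(c for c in macrostate if objective(c) >= median)

def worst_case(macrostate, objective):
    # One candidate measure: the worst score attainable within the macrostate.
    return min(objective(c) for c in macrostate)

space = frozenset(range(16))       # toy configuration space
obj = lambda x: -abs(x - 10)       # prefer configurations near 10
after = step(space, obj)           # destination macrostate
# worst_case improves from -10 to -4: a step between macrostates that is
# a weak improvement under the worst-case measure.
```

The process here is a map from subsets of configuration space to subsets, with no reference to any particular starting configuration.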

I dislike Yudkowsky's definition/operationalisation of "superintelligence".

"A single AI system that is efficient wrt humanity on all cognitive tasks" seems dubious with near term compute.

A single system that's efficient wrt human civilisation on all cognitive tasks is IMO flat out infeasible[1].

I think that's just not how optimisation works!

No free lunch theorems hold in their strongest forms in max-entropic universes, but our universe isn't max entropic!

You can't get maximal free lunch here.

You can't be optimal across all domains. You must cognitively spe... (read more)

2Tamsin Leake7mo
there's a lot to unpack here. i feel like i disagree with a lot of this post, but it depends on the definitions of terms, which in turn depends on what those questions' answers are supposed to be used for. what do you mean by "optimality across all domains" and why do you care about that? what do you mean by "efficiency in all domains wrt human civilization" and why do you care about that?

there also are statements that i easily, straight-up disagree with. for example, that feels easily wrong. 2026 chess SOTA probly beats 2023 chess SOTA. so why can't superintelligent AI just invent in 2023 what we would've taken 3 years to invent, get to 2026 chess SOTA, and use that to beat our SOTA? it's not like we're anywhere near optimal or even remotely good at designing software, let alone AI. sure, this superintelligence spends some compute coming up with its own better-than-SOTA chess-specialized algorithm, but that investment could be quickly reimbursed. whether it can be reimbursed within a single game of chess is up to various constant factors.

a superintelligence beats existing specialized systems because it can turn itself into what they do but also turn itself into better than what they do, because it also has the capability "design better AI". this feels sufficient for superintelligence to beat any specialized system that doesn't have a general-improvement part — if it does, it probly fooms to superintelligence pretty easily itself. but, note that this might even not be necessary for superintelligence to beat existing specialized systems. it could be that it improves itself in a very general way that lets it be better on arrival at most existing specialized systems.

this is all because existing specialized systems are very far from optimal. that's the whole point of 2026 chess SOTA beating 2023 chess SOTA — 2023 chess SOTA isn't optimal, so there's room to find better, and superintelligence can simply make itself be a finder-of-better-things. okay, even if this were
1Victor Levoso7mo
I think that DG is making a more nitpicky point, just claiming that that specific definition is not feasible, rather than using this as a claim that foom is not feasible, at least in this post. He also claims that elsewhere, but has a different argument about humans being able to make narrow AI for things like strategy (which I think is also wrong). At least that's what I've understood from our previous discussions.
4Tamsin Leake7mo
yeah, totally, i'm also just using that post as a jump-off point for a more in-depth long-form discussion about dragon god's beliefs.
If the thing you say is true, superintelligence will just build specialized narrow superintelligences for particular tasks, just like how we build machines. It doesn't leave us much chance for trade.
This also presupposes that:

  1. The system has an absolute advantage wrt civilisation at the task of developing specialised systems for any task
  2. The system also has a comparative advantage

I think #1 is dubious for attainable strongly superhuman general intelligences, and #2 is likely nonsense. I think #2 only sounds not nonsense if you ignore all economic constraints.

I think the problem is defining superintelligence as a thing that's "efficient wrt human civilisation on all cognitive tasks of economic importance", when my objection is: "that thing you have defined may not be something that is actually physically possible. Attainable strongly superhuman general intelligences are not the thing that you have defined."

Like you can round off my position to "certain definitions of superintelligence just seem prima facie infeasible/unattainable to me" without losing much nuance.
I actually can't imagine any subtask of "turning the world into paperclips" where humanity can have any comparative advantage, can you give an example?

Some Thoughts on Messaging Around AI Risk



Stream of consciousness like. This is an unedited repost of a thread on my Twitter account. The stylistic and semantic incentives of Twitter influenced it.

Some thoughts on messaging around alignment with respect to advanced AI systems
A 🧵


* SSI: strongly superhuman intelligence
* ASI: AI with decisive strategic advantage ("superintelligence")
* "Decisive strategic advantage":
A vantage point from which an actor can unilaterally determine future trajectory of earth originating intelligen... (read more)

[+][comment deleted]1y1
