All of So8res's Comments + Replies

Is this a reasonable paraphrase of your argument?

Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.

Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.

You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and m

... (read more)
1Eric Zhang5d
My reading of the argument was something like "bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%". It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se. 
4Daniel Kokotajlo6d
FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there'll be trades with faraway places that DO care about our CEV.)

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

Isn't the worst case scenario just leaving the aliens alone? If I'm worried I'm going t... (read more)

Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn't have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like

If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.

where it's essentially trying to argue that this intuition pump still has force in precisely this case.

To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.

Thanks! I'm curious for your paraphrase of the opposing view that you think I'm failing to understand.

(I put >50% probability that I could paraphrase a version of "if the AIs decide to kill us, that's fine" that Sutton would basically endorse (in the right social context), and that would basically route through a version of "broad cosmopolitan value is universally compelling", but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I'll update.)

I think a closer summary is: I don't think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you've had) don't actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue. I also think the "not for free" part doesn't contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI.  I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is "attempt to have a slave society," not "slow down AI progress for decades"---I think he might also believe that stagnation is much worse than a handoff but haven't heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it's not as bad as the alternative.

If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should do something else.

My picture is less like "the creatures really dislike the proposed help", and more like "the creatures don't have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn't have endorsed if you first extrapolated their volition (but nobody's extrapolating their volition or checking against that)".

It sounds to me like your stance is something like "there's a... (read more)

We're not talking about practically building minds right now, we are talking about humans.

We're not talking about "extrapolating volition" in general.  We are talking about whether---in attempting to help a creature with preferences about as coherent as human preferences---you end up implementing an outcome that creature considers as bad as death.

For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about ... (read more)


I was recently part of a group-chat where some people I largely respect were musing about this paper and this post and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning).

Here's my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seems to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee

... (read more)

Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:

I'm not quite sure what argument you're trying to have here. Two explicit hypotheses follow, that I haven't managed to distinguish between yet.

Background context, for establishing common language etc.:

  • Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
  • Paul is trying to make a point about how there's a decent chance that practical AIs will plausibly car
... (read more)

Hypothesis 1 is closer to the mark, though I'd highlight that it's actually fairly unclear what you mean by "cosmopolitan values" or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).

I'm raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)

More broadly, I don't really think you are engaging productively with people who disagree wi... (read more)

Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.


Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decisio

... (read more)
Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but "we don't want to die, or at any rate we don't want to die out as a species" seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it's at all pseudokind. (In many cases, humans do a lot to protect endangered species without doing diddly-squat to fulfill individual specimens' preferences!) 

I disagree with this but am happy your position is laid out. I'll just try to give my overall understanding and reply to two points.

Like Oliver, it seems like you are implying:

Humans may be nice to other creatures in some sense, But if the fish were to look at the future that we'd achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as "murder everyone" is to us.

I think that normal people being pseudokind in a common-sensical way would instead say:

If we are trying to help some creatures, but tho

... (read more)

I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.

feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

This misunderstands me (as is a separate claim from the clai... (read more)

I expect that you personally won't do a motte-and-bailey here (except perhaps insofar as you later draw on posts like these as evidence that the doomer view has been laid out in a lot of different places, when this isn't in fact the part of the doomer view relevant to ongoing debates in the field). But I do think that the "free vs cheap" distinction will obscure more than it clarifies, because there is only an epsilon difference between them; and because I expect a mob-and-bailey [] where many people cite the claim that "cosmopolitan values don't come free" as evidence in debates that should properly be about whether cosmopolitan values come cheap. This is how weak men [] work in general. Versions of this post that I wouldn't object to in this way include: * A version which is mainly framed as a conceptual distinction rather than an empirical claim * A version which says upfront "this post is not relevant to most informed debates about alignment, it's instead intended to be relevant in the following context:" * A version which identifies that there's a different but similar-sounding debate which is actually being held between people informed about the field, and says true things about the positions of your opponents in that debate and how they are different from the extreme caricatures in this post

Reproduced from a twitter thread:

I've encountered some confusion about which direction "geocentrism was false" generalizes. Correct use: "Earth probably isn't at the center of the universe". Incorrect use: "All aliens probably have two arms with five fingers."

The generalized lesson from geocentrism being false is that the laws of physics don't particularly care about us. It's not that everywhere must be similar to here along the axes that are particularly salient to us.

I see this in the form of people saying "But isn't it sheer hubris to believe that human... (read more)

I don't think I understand your position. An attempt at a paraphrase (submitted so as to give you a sense of what I extracted from your text) goes: "I would prefer to use the word consciousness instead of sentience here, and I think it is quantitative such that I care about it occuring in high degrees but not low degrees." But this is low-confidence and I don't really have enough grasp on what you're saying to move to the "evidence" stage.

Attempting to be a good sport and stare at your paragraphs anyway to extract some guess as to where we might have a dis... (read more)

I allow only limited scope for arguments from uncertainty, because "but what if I'm wrong?!" otherwise becomes a universal objection to taking any substantial action. I take the world as I find it until I find I have to update. Factory farming is unaesthetic, but no worse than that to me, and "I hate you" Bing can be abandoned to history.

So there's some property of, like, "having someone home", that humans have and that furbies lack (for all that furbies do something kinda like making humane facial expressions).

I can't tell whether:

(a) you're objecting to me calling this "sentience" (in this post), e.g. because you think that word doesn't adequately distinguish between "having sensory experiences" and "having someone home in the sense that makes that question matter", as might distinguish between the case where e.g. nonhuman animals are sentient but not morally relevant

(b) you're contestin... (read more)

I'm not the original poster here, but I'm genuinely worried about (c). I'm not sure that humanity's revealed preferences are consistent with a world in which we believe that all people matter. Between the large scale wars and genocides, slavery, and even just the ongoing stark divide between the rich and poor, I have a hard time believing that respect for sentience is actually one of humanity's strong core virtues. And if we extend out to all sentient life, we're forced to contend with our reaction to large scale animal welfare (even I am not vegetarian, although I feel I "should" be). I think humanity's actual stance is "In-group life always matters. Out-group life usually matters, but even relatively small economic or political concerns can make us change our minds.". We care about it some, but not beyond the point of inconvenience. I'd be interested in finding firmer philosophical ground for the "all sentient life matters" claim. Not because I personally need to be convinced of it, but rather because I want to be confident that a hypothetical superintelligence with "human" virtues would be convinced of this. (P.s. Your original point about "building and then enslaving a superintelligence is not just exceptionally difficult, but also morally wrong" is correct, concise, well-put, and underappreciated by the public. I've started framing my AI X-risk discussions with X-risk skeptics around similar terms.)
3Nathan Helm-Burger10d
I agree with Richard K's point here. I personally found H. Beam Piper's sci fi novels on 'Fuzzies' []to be a really good exploration of the boundaries of consciousness, sentience, and moral worth. Beam makes the distinction between 'sentience' as having animal awareness of self & environment and non-reflective consciousness, versus 'sapience' which involves a reflective self-awareness and abstract reasoning and thoughts about future and past and at least some sense of right and wrong. So in this sense, I would call a cow conscious and sentient, but not sapient. I would call a honeybee sentient, capable of experiencing valenced experiences like pain or reward, but lacking in sufficient world- and self- modelling to be called conscious. Personally, I wouldn't say that a cow has no moral worth and it is fine to torture it. I do think that if you give a cow a good life, and then kill it in a quick mostly painless way, then that's pretty ok. I don't think that that's ok to do to a human.  Philosophical reasoning about morality that doesn't fall apart in edge cases or novel situations (e.g. sapient AI) is hard [citation needed]. My current guess, which I am not at all sure of, is that my morality says something about a qualitative difference between the moral value of sapient beings vs the moral value of non-sapient but conscious sentient beings vs non-sapient non-conscious sentient beings. To me, it seems no number of cow lives trades off against a human life, but cow QUALYs and dog QUALYs do trade off against each other at some ratio. Similarly, no number of non-conscious sentient lives like ants or worms trade off against a conscious and sapient life like a cow's. I would not torture a single cow to save a billion shrimp from being tortured. Nor any number of shrimp. The value of the two seem non-commutative to me. Are current language models or the entities they temporarily simulate sap
I don't see why both of those wouldn't matter in different ways.
I've no problem with your calling "sentience" the thing that you are here calling "sentience". My citation of Wikipedia was just a guess at what you might mean. "Having someone home" sounds more like what I would call "consciousness". I believe there are degrees of that, and of all the concepts in this neighbourhood. There is no line out there in the world dividing humans from rocks. But whatever the words used to refer to this thing, those that have enough of this that I wouldn't raise them to be killed and eaten do not include current forms of livestock or AI. I basically don't care much about animal welfare issues, whether of farm animals or wildlife. Regarding AI, here [] is something I linked previously on how I would interact with a sandboxed AI. It didn't go down well. :) You have said where you stand and I have said where I stand. What evidence would weigh on this issue?

Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:

FWIW, that post has been on my list of things to retract for a while.

(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)

I wrote that post before reading much of the sequences, and updated away from

... (read more)

Good point! For the record, insofar as we attempt to build aligned AIs by doing the moral equivalent of "breeding a slave-race", I'm pretty uneasy about it. (Whereas insofar as it's more the moral equivalent of "a child's values maturing", I have fewer moral qualms. As is a separate claim from whether I actually expect that you can solve alignment that way.) And I agree that the morality of various methods for shaping AI-people are unclear. Also, I've edited the post (to add a "at least according to my ideals" clause) to acknowledge the point that others might be more comfortable with attempting to align AI-people via means that I'd consider morally dubious.

I'm trying to make a basic point here, that pushing the boundaries of the capabilities frontier, by your own hands and for that direct purpose, seems bad to me. I emphatically request that people stop doing that, if they're doing that.

I am not requesting that people never take any action that has some probability of advancing the capabilities frontier. I think that plenty of alignment research is potentially entangled with capabilities research (and/or might get more entangled as it progresses), and I think that some people are making the tradeoffs in ways... (read more)

4M. Y. Zuo14d
So then what's the point of posting it? Anyone that could possibly stumble across this post would not believe themselves to be the villain, just "occasionally mournfully incurring a negative externality of pushing the capabilities frontier", in or outside of 'alignment work'.

This thread continues to seem to me to be off-topic. My main takeaway so far is that the post was not clear enough about how it's answering the question "why does an AI that is indifferent to you, kill you?". In attempts to make this clearer, I have added the following to the beginning of the post:

This post is an answer to the question of why an AI that was truly indifferent to humanity (and sentient life more generally), would destroy all Earth-originated sentient life.

I acknowledge (for the third time, with some exasperation) that this point alone is... (read more)

 I assign that outcome low probability (and consider that disagreement to be off-topic here).

Thank you for the clarification. In that case my objections are on the object-level.


This post is an answer to the question of why an AI that was truly indifferent to humanity (and sentient life more generally), would destroy all Earth-originated sentient life.

This does exclude random small terminal valuations of things involving humans, but leaves out the instrumental value for trade and science, uncertainty about how other powerful beings might re... (read more)

To be clear, I'd agree that the use of the phrase "algorithmic complexity" in the quote you give is misleading. In particular, given an AI designed such that its preferences can be specified in some stable way, the important question is whether the correct concept of 'value' is simple relative to some language that specifies this AI's concepts. And the AI's concepts are ofc formed in response to its entire observational history. Concepts that are simple relative to everything the AI has seen might be quite complex relative to "normal" reference machines th... (read more)

Okay, that clarifies a lot. But the last paragraph I find surprising. If LLMs are good at understanding the meaning of human text, they must to be good at understanding human concepts, since concepts are just meanings of words the LLM understands. Do you doubt they are really understanding text as well as it seems? Or do you mean they are picking up other, non-human, concepts as well, and this is a problem? Regarding monkeys, they apparently don't understand the IGF concept as they are not good enough at reasoning abstractly about evolution and unobservable entities (genes), and they lack the empirical knowledge like humans until recently. I'm not sure how that would be an argument against advanced LLMs grasping the concepts they seem to grasp.

and requires a modern defense:

It seems to me that the usual arguments still go through. We don't know how to specify the preferences of an LLM (relevant search term: "inner alignment"). Even if we did have some slot we could write the preferences into, we don't have an easy handle/pointer to write into that slot. (Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don't in fact have a clean "inclusive genetic f... (read more)

3Matthew Barnett2mo
Humans also don't have a "clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized" in our heads. However, we do have a concept of human values in a more narrow sense, and I expect LLMs in the coming years to pick up roughly the same concept during training. The evolution analogy seems more analogous to an LLM that's rewarded for telling funny jokes, but it doesn't understand what makes a joke funny. So it learns a strategy of repeatedly telling certain popular jokes because those are rated as funny. In that case it's not surprising that the LLM wouldn't be funny when taken out of its training distribution. But that's just because it never learned what humor was to begin with. If the LLM understood the essence of humor during training, then it's much more likely that the property of being humorous would generalize outside its training distribution. LLMs will likely learn the concept of human values during training about as well as most humans learn the concept. There's still a problem of getting LLMs to care and act on those values, but it's noteworthy that the LLM will understand what we are trying to get it to care about nonetheless.
Inner alignment is a problem, but it seems less of a problem than in the monkey example. The monkey values were trained using a relatively blunt form of genetic algorithm, and monkeys aren't anyway capable of learning the value "inclusive genetic fitness", since they can't understand such a complex concept (and humans didn't understand it historically). By contrast, advanced base LLMs are presumably able to understand the theory of CEV about as well as a human, and they could be finetuned by using that understanding, e.g. with something like Constitutional AI. In general, the fact that base LLMs have a very good (perhaps even human level) ability of understanding text seems to make the fine-tuning phases more robust, as there is less likelihood of misunderstanding training samples. Which would make hitting a fragile target easier. Then the danger seems to come more from goal misspecification, e.g. picking the wrong principles for Constitutional AI.
To be clear, I'd agree that the use of the phrase "algorithmic complexity" in the quote you give is misleading. In particular, given an AI designed such that its preferences can be specified in some stable way, the important question is whether the correct concept of 'value' is simple relative to some language that specifies this AI's concepts. And the AI's concepts are ofc formed in response to its entire observational history. Concepts that are simple relative to everything the AI has seen might be quite complex relative to "normal" reference machines that people intuitively think of when they hear "algorithmic complexity" (like the lambda calculus, say). And so it maybe true that value is complex relative to a "normal" reference machine, and simple relative to the AI's observational history, thereby turning out not to pose all that much of an alignment obstacle. In that case (which I don't particularly expect), I'd say "value was in fact complex, and this turned out not to be a great obstacle to alignment" (though I wouldn't begrudge someone else saying "I define complexity of value relative to the AI's observation-history, and in that sense, value turned out to be simple"). Insofar as you are arguing "(1) the arbital page on complexity of value does not convincingly argue that this will matter to alignment in practice, and (2) LLMs are significant evidence that 'value' won't be complex relative to the actual AI concept-languages we're going to get", I agree with (1), and disagree with (2), while again noting that there's a reason I deployed the fragility of value (and not the complexity of value) in response to your original question (and am only discussing complexity of value here because you brought it up). re: (1), I note that the argument is elsewhere (and has the form "there will be lots of nearby concepts" + "getting almost the right concept does not get you almost a good result", as I alluded to above). I'd agree that one leg of possible support for th

(For context vis-a-vis my enthusiasm about this plan, see this comment. In particular, I'm enthusiastic about fleshing out and testing some specific narrow technical aspects of one part of this plan. If that one narrow slice of this plan works, I'd have some hope that it can be parlayed into something more. I'm not particularly compelled by the rest of the plan surrounding the narrow-slice-I-find-interesting (in part because I haven't looked that closely at it for various reasons), and if the narrow-slice-I-find-interesting works out then my hope in it mos... (read more)

This whole thread (starting with Paul's comment) seems to me like an attempt to delve into the question of whether the AI cares about you at least a tiny bit. As explicitly noted in the OP, I don't have much interest in going deep into that discussion here.

The intent of the post is to present the very most basic arguments that if the AI is utterly indifferent to us, then it kills us. It seems to me that many people are stuck on this basic point.

Having bought this (as it seems to me like Paul has), one might then present various galaxy-brained reasons why t... (read more)

Most people care a lot more about whether they and their loved ones (and their society/humanity) will in fact be killed than whether they will control the cosmic endowment. Eliezer has been going on podcasts saying that with near-certainty we will not see really superintelligent AGI because we will all be killed, and many people interpret your statements as saying that. And Paul's arguments do cut to the core of a lot of the appeals to humans keeping around other animals.

If it is false that we will almost certainly be killed (which I think is right, I... (read more)

Current LLM behavior doesn't seem to me like much evidence that they care about humans per se.

I'd agree that they evidence some understanding of human values (but the argument is and has always been "the AI knows but doesn't care"; someone can probably dig up a reference to Yudkowsky arguing this as early as 2001).

I contest that the LLM's ability to predict how a caring-human sounds is much evidence that the underlying coginiton cares similarly (insofar as it cares at all).

And even if the underlying cognition did care about the sorts of things you can some... (read more)

The fragility-of-value posts are mostly old. They were written before GPT-3 came out (which seemed very good at understanding human language and, consequently, human values), before instruction fine-tuning was successfully employed, and before forms of preference learning like RLHF or Constitutional AI were implemented. With this background, many arguments in articles like Eliezer's Complexity of Value (2015) [] sound now implausible, questionable or in any case outdated. I agree that foundation LLMs are just able to predict how a caring human sounds like, but fine-tuned models are no longer pure text predictors. They are biased towards producing particular types of text, which just means they value some of it more than others. Currently these language models are just Oracles, but a future multimodal version could be capable of perception and movement. Prototypes of this sort do already exist. [] Maybe they do not really care at all about what they do seem to care about, i.e. they are deceptive. But as far as I know, there is currently no significant evidence for deception. Or they might just care about close correlates of what they seem to care about. That is a serious possibility, but given that they seem very good at understanding text from the unsupervised and very data-heavy pre-training phase, a lot of that semantic knowledge does plausibly help with the less data-heavy SL/RL fine-tuning phases, since these also involve text. The pre-trained models have a lot of common sense, which makes the fine-tuning less of a narrow target. The bottom line is that with the advent of finetuned large language models, the following "complexity of value thesis", from Eliezer's Arbital article above, is no longer obviously true, and requires a modern defense:
  • Confirmed that I don't think about this much. (And that this post is not intended to provide new/deep thinking, as opposed to aggregating basics.)
  • I don't particularly expect drawn-out resource fights, and suspect our difference here is due to a difference in beliefs about how hard it is for single AIs to gain decisive advantages that render resource conflicts short.
  • I consider scenarios where the AI cares a tiny bit about something kinda like humans to be moderately likely, and am not counting scenarios where it builds some optimized fascimile as scenari
... (read more)
4Quadratic Reciprocity2mo
Why is aliens wanting to put us in a zoo more plausible than the AI wanting to put us in a zoo itself?  Edit: Ah, there are more aliens around so even if the average alien doesn't care about us, it's plausible that some of them would?
From the last bullet point: "it doesn't much matter relative to the issue of securing the cosmic endowment in the name of Fun." Part of the post seems to be arguing against the position "The AI might take over the rest of the universe, but it might leave us alone." Putting us in an alien zoo is pretty equivalent to taking over the rest of the universe and leaving us alone.  It seems like the last bullet point pivots from arguing that AI will definitely kill us to arguing that even though if it doesn't kill us this is pretty bad.
1[comment deleted]2mo
1[comment deleted]2mo

Below is a sketch of an argument that might imply that the answer to Q5 is (clasically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4).

Pick a lottery with the property that forall with and , forall , we have . We will say that is "extreme(ly high)".

Pick a lottery with .

Now, for any with , define to be the guaranteed by continuity (A3).

Lemma: forall with , ... (read more)


I'm awarding another $3,000 distillation prize for this piece, with complements to the authors.

A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:

- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that wo... (read more)

the fact that all the unified cases for AI risk have been written by more ML-safety-sympathetic people like me, Ajeya, and Joe (with the single exception of "AGI ruin") is indicative that that strategy mostly hasn't been tried.

I'm not sure what you mean by this, but here's half-a-dozen "unified cases for AI risk" made by people like Eliezer Yudkowsky, Nick Bostrom, Stuart Armstrong, and myself:

2001 -
2014 -
2014 - Superintelligence
2015 - (read more)

There's a type signature that I'm trying to get at with the "unified case" description (which I acknowledge I didn't describe very well in my previous comment), which I'd describe as "trying to make a complete argument (or something close to it)". I think all the things I was referring to meet this criterion; whereas, of the things you listed, only Superintelligence seems to, with the rest having a type signature more like "trying to convey a handful of core intuitions". (CFAI may also be in the former category, I haven't read it, but it was long ago enough that it seems much less relevant to questions related to persuasion today.) It seems to me that this is a similar complaint as Eliezer's when he says in List of Lethalities: except that I'm including a few other pieces of (ML-safety-sympathetic) work in the same category.

(oops! thanks. i now once again think it's been fixed (tho i'm still just permuting things rather than reading))


John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem... (read more)

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place were a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.

John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

More details:

  • I think the argument Nate gave is at least correct for markets of relatively-highly-intelligent agents, and that was a big update for me (thankyou Nate!). I'm still unsure how far it generalizes to relatively less powerful agents.
  • Nate left out my other big takeaway: Nate's argument here implies that there's probably a lot of money to be made in real-world markets! In practice, it would probably look like an insurance-like contract, by which two traders would commit to the "side-channel trades at non-market prices" required to make them aggrega
... (read more)

For the record, the reason I didn't speak up was less "MIRI would have been crushed" and more "I had some hope".

I had in fact had a convo with Elon and one or two convos with Sam while they were kicking the OpenAI idea around (and where I made various suggestions that they ultimately didn't take). There were in fact internal forces at OpenAI trying to cause it to be a force for good—forces that ultimately led them to write their 2018 charter, so, forces that were not entirely fictitious. At the launch date, I didn't know to what degree those internal force... (read more)

I can confirm that Nate is not backdating memories—he and Eliezer were pretty clear within MIRI at the time that they thought Sam and Elon were making a tremendous mistake and that they were trying to figure out how to use MIRI's small influence within a worsened strategic landscape.

Good idea, thanks! I added an attempt at a summary (under the spoiler tags near the top).


Here's a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:

goal: make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).

summary of plan:

  • the humans put in a herculean effort to build a multi-level world-model that is interpretable to them (ranging from quantum chemistry at the lowest level, to strawberries and plates at the top)
  • we interpret this in a very conservative way, as a convex set of models that hopefully contains someth
... (read more)

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of po... (read more)

7Steven Byrnes4mo
Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now []. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details. That’s helpful, thanks!

Thanks! Cool, it makes sense to me how we can make the pullback of with , in different ways to get different line bundles, and then tensor them all together. (I actually developed that hypothesis during a car ride earlier today :-p.)

(I'm still not quite sure what the syntax means, but presumably the idea is that there's an automorphism on 1D vector fields that flips the sign, and we flip the sign of the negative-charge line bundles before tensoring everything together?)

(Also, fwiw, when I said "they're all isomorphic to ", I meant that I di... (read more)

4Vanessa Kosoy5mo
The syntax L⊗q means "L to the tensor power of q". For q>0, it just means tensoring L with itself q times. For q=0, L⊗q is just the trivial line bundle with total space Y×C (and, yes, all line bundles are isomorphic to the trivial line bundle, but this one just is the trivial bundle... or at least, canonically isomorphic to it). For q<0, we need the notion of a dual vector bundle. Any vector bundle V has a dual V∗, and for a line bundle the dual is also the inverse, in the sense that L⊗L∗ is canonically isomorphic to the trivial bundle. We can then define all negative powers by L⊗q:=(L∗)⊗−q. Notice that non-negative tensor powers are defined for all vector bundles, but negative tensor powers only make sense for line bundles. It remains to explain what is V∗. But, for our purposes we can take a shortcut. The idea is, for any finite-dimensional complex vector space U with an inner product, there is a canonical isomorphism between U∗ and ¯U, where ¯U is the complex-conjugate space. What is the complex-conjugate space? It is a vector space that (i) has the same set of vectors (ii) has the same addition operation and (iii) has its multiplication-by-scalar operation modified, so that multiplying u by z in ¯U is the same thing as multiplying u by ¯z in U, where ¯z is just the complex number conjugate to z. Equipped with this observation, we can define the dual of a Hermitian line bundle L to be ¯L, where ¯L is the bundle obtained for L by changing its multiplication-by-scalar mapping in the obvious way.

You're over-counting programs. I didn't spell out definitions of "programming language" and "length", but an important disideratum is that there has to only be finitely much "length" to go around, in the sense that must converge.

Under your proposal, the total amount of "length" is , so this is not an admissible notion of length.

(Note: this argument has nothing to do with the choice of base 2, and it applies equally well for all bases.)

Two common ways of repairing your notion of length are:

  1. Using a prefix-free code f
... (read more)
1Thomas Sepulchre6mo

It would still help like me to have a "short version" section at the top :-)

I've expanded the TL;DR at the top to include the nine theses. Thanks for the suggestion!


I'm not entirely sure that I follow the construction of yet.

Let's figure out the total space. If you just handed me a line bundle on , and were like "make a bundle on ", then the construction that I'd consider most obvious would be to make the total space be the pullback of such that all of the time-coordinates agree...

...ah, but that wouldn't be a line bundle; the tangent space would be -dimensional. I see.

You suggested starting by considering what happens to an individual fiber, which... is an easier operation to do w... (read more)

4Vanessa Kosoy6mo
There are two operations involved in the definition of W: pullback and tensor product. Pullback is defined for arbitrary bundles. Given a mapping f:X→Y (these X and Y are arbitrary manifolds, not the specific ones from before) and a bundle B over Y with total space TotB and projection mapping prB:Tot(B)→Y, the pullback of B w.r.t. f (denoted f∗B) is the bundle over X with total space Tot(B)×YX and the obvious projection mapping. I remind that Tot(B)×YX is the fibre product, i.e. the submanifold of Tot(B)×X defined by prB(t)=f(x). Notice that the fibre of f∗B over any x∈X is canonically isomorphic to the fibre of B over f(x). The word "canonical" means that there is a particular isomorphism that we obtain from the construction. It is easy enough to see that the pullback of a vector bundle is a vector bundle, the pullback of a line bundle is a line bundle, and the pullback of a Hermitian vector bundle is a Hermitian vector bundle. Tensor product is an operation over vector bundles. There are different ways to define it, corresponding to the different ways to define a tensor product of vector spaces. Specifically for line bundles there is the following shortcut definition. Let L1 and L2 be line bundles over X. Then, the total space of L1⊗L2 is the quotient of L1×XL2 by the equivalence relation given by: (v1,v2)∼(w1,w2) iff v1⊗v2=w1⊗w2. Here, I regard v1,w1 as vectors in the vector space which is the corresponding fibre fo L1 and similarly for v2,w2 and L2. The quotient of a manifold by an equivalence relation is not always a manifold, but in this case it is. I notice that you wrote "a particular fiber is isomorphic to C". Your error here is, it doesn't matter what it's isomorphic to, you should still think of it as an abstract vector space. So, if e.g. V1 and V2 are 1-dimensional vector spaces, then V1⊗V2 is yet another "new" vector space. Yes, they are all isomorphic, but they are not canonically isomorphic.

I think that distillations of research agendas such as this one are quite valuable, and hereby offer LawrenceC a $3,000 prize for writing it. (I'll follow up via email.) Thanks, LawrenceC!

Going forward, I plan to keep an eye out for distillations such as this one that seem particularly skilled or insightful to me, and offer them a prize in the $1-10k range, depending on how much I like them.

Insofar as I do this, I'm going to be completely arbitrary about it, and I'm only going to notice attempts haphazardly, so please don't do rely on the assumption that I... (read more)

Are there any agendas you would particularly like to see distilled?
Thanks Nate! I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up mainly for learning value. Do you still think it's worth doing after the fact?  Ditto the tweet thread, assuming I don't plan on tweeting this.

Cool, thanks.

I'm pretty confident that the set of compatible (gauge, wavefunction) pairs is computably enumerable, so I think that the coding theorem should apply.

There's an insight that I've glimpsed--though I still haven't checked the details--which is that we can guarantee that it's possible to name the 'correct' (gauge, wavefunction) cluster without necessarily having to name any single gauge (as would be prohibatively expensive), by dovetailing all the (guage, wavefunction) pairs (in some representation where you can comptuably detect compatibility) a... (read more)

2Yoav Ravid4mo
Seems good to edit the correction in the post, so readers know that in some cases it's not constant.
1Adam Jermyn6mo
How does this correctness check work? I usually think of gauge freedom as saying “there is a family of configurations that all produce the same observables”. I don’t think that gives a way to say some configurations are correct/incorrect. Rather some pairs of configurations are equivalent and some aren’t. That said, I do think you can probably do something like the approach described to assign a label to each equivalence class of configurations and do your evolution in that label space, which avoids having to pick a gauge.

That gives me a somewhat clearer picture. (Thanks!) It sounds like the idea is that we have one machine that dovetails through everything and separates them into bins according to their behavior (as revealed so far), and a second machine that picks a bin.

Presumably the bins are given some sort of prefix-free code, so that when a behavior-difference is revealed within a bin (e.g. after more time has passed) it can be split into two bins, with some rule for which one is "default" (e.g., the leftmost).

I buy that something like this can probably be made to wor... (read more)

I only just realized that you're mainly thinking of the complexity of semimeasures on infinite sequences, not the complexity of finite strings. I guess that should have been obvious from the OP; the results I've been citing are about finite strings. My bad! For semimeasures, this paper [] proves that there actually is a non-constant gap between the log-total-probability and description complexity. Instead the gap is bounded by the Kolmogorov complexity of the length of the sequences. This is discussed in section 4.5.4 of Li&Vitanyi.

this might make the program longer since you'd need to specify physics.

(I doubt it matters much; the bits you use to specify physics at the start are bits you save when picking the codeword at the end.)

I don't think you need to choose a particular history to predict since all observables are gauge-invariant.

IIUC, your choice of the wavefunction and your choice of the gauge are interlocked. The invariant is that if you change the gauge and twiddle the wavefunction in a particular way, then no observables change. If you're just iterating over (gauge, ... (read more)

I'm not really sure how gauge-fixing works in QM, so I can't comment here. But I don't think it matters: there's no need to check which pairs are "legal", you just run all possible pairs in parallel and see what observables they output. Pairs which are in fact gauge-equivalent will produce the same observables by definition, and will accrue probability to the same strings. Perhaps you're worried that physically-different worlds could end up contributing identical strings of observables? Possibly, but (a) I think if all possible strings of observables are the same they should be considered equivalent, so that could be one way of distinguishing them (b) this issue seems orthogonal to k-complexity VS. alt-complexity.

That was my original guess! I think Vanessa suggested something different. IIUC, she suggested

which has factors of the wavefunction, instead of 1.

(You having the same guess as me does update me towards the hypothesis that Vanessa just forgot some parentheses, and now I'm uncertain again :-p. Having factors of the wavefunction sure does seem pretty wacky!)

(...or perhaps there's an even more embarassing misunderstanding, where I've misunderstood physicist norms about parenthesis-insertion!)

5Steven Byrnes6mo
I think it’s just one copy of the ψ—I don’t think Vanessa intended for ψ to be included in the Π product thing here []. I agree that an extra pair of parentheses could have made it clearer. (Hope I’m not putting words in her mouth.)

Thanks! One place where I struggle with this idea is that people go around saying things like "Given a quantum particle with nonzero electric charge, you can just pick what phase its wavefunction has". I don't know how to think of an electron having a wavefunction whose phase I can pick. The wavefunctions that I know assign amplitudes to configurations, not to particles; if I have a wavefunctionover three-electron configurations then I don't know how to "choose the phase" for each electron, because a three-particle wave-functions doesn't (in general) facto... (read more)

4Steven Byrnes6mo
My recollection is that in nonrelativistic N-particle QM you would have a wavefunction ψ(r1,r2,…,rN) (a complex number for every possible choice of N positions in space for the N particles), and when you change gauge you multiply that that wavefunction by eiq1Λ(r1)/ℏeiq2Λ(r2)/ℏ⋯eiqNΛ(rN)/ℏ. I think this is equivalent to what Vanessa said, and if not she’s probably right :-P

Thanks! I'd need more detail than that to answer my questions.

Like, can we specialize this new program to a program that's just 'dovetailing' across all possible gauge-choices and then running physics on those? When we choose different gauges, we have to choose correspondingly-different ways of initializing the rest of the fields (or whatever); presumably this program is now also 'dovetailing' across all the different initializations?

But now it's looking not just at one history, but at all histories, and "keeping track of how much algorithmic probability h... (read more)

You could specialize to just dovetailing across gauges, but this might make the program longer since you'd need to specify physics. I don't think you need to choose a particular history to predict since all observables are gauge-invariant. So say we're trying to predict some particular sequence of observables in the universe. You run your search over all programs, the search includes various choices of gauge/initialization. Since the observables are gauge-invariant each of those choices of gauge generate the same sequence of observables, so their algorithmic probability accumulates to a single string. Once the string has accumulated enough algorithmic probability we can refer to it with a short codeword(this is the tricky part, see below) The idea is that, using a clever algorithm, you can arrange the clusters in a contiguous way inside [0,1]. Since they're contiguous, if a particular cluster has volume 2−Θ(n) we can refer to it with a codeword of length Θ(n).

Thanks! My top guess was

so I much appreciate the correction.

...actually, having factors of feels surprising to me; like, this map doesn't seem to be the identity when is trivial; did you forget some parentheses? (Or did I misunderstand the parenthesis-insertion conventions?)

re: bundles, I don't get it yet, and I'd appreciate some clarification. Perhaps I'm simply being daft, but on my current understanding, and I'm not quite seeing which bundle it's supposed to be a section of.

Like, my understanding of bundles ... (read more)

6Vanessa Kosoy6mo
Your guess is exactly what I meant. The ψ is outside the product, otherwise this expression is not even a valid group action. Now, about bundles. As you said, a bundle over a manifold B is another manifold T with a projection π:T→B s.t. locally it looks like a product. Formally, every p∈B should have an open neighborhood U s.t. there is a diffeomorphism between π restricted to π−1(U) and a projection pr:U×F→U for some manifold F (the "fiber"). A vector bundle is a bundle equipped with additional structure that makes every fiber a vector space. Formally, we need to have a smooth addition mapping +:T×BT→T and a multiplication-by-scalar mapping ⋅:C×T→T which are (i) morphisims of bundles and (ii) make every fiber (i.e. the inverse π-image of every point in B) into a vector space. Here, ×B stands for the fibre product (the submanifold of T×T given by π(x)=π(y)). I'm using C here because we will need complex vector bundles. A line bundle is just a vector bundle s.t. every fiber is 1-dimensional. A Hermitian vector bundle is a vector bundle equipped with a smooth mapping of bundles ⟨⟩:T×BT which makes every fibre into an inner product space. Onward to quantum mechanics. Let X be physical space and Y:=X×R physical spacetime. In the non-relativistic setting, Y is isomorphic to R4, so all Hermitian line bundles over Y are isomorphic. So, in principle any one of them can be identified with the trivial bundle: total space Y×C with π being the canonical projection. However, it is often better to imagine some Heremitian line bundle L without such an identification. In fact, choosing an identification precisely corresponds to choosing a gauge. This is like how all finite dimensional real vector spaces are isomorphic to Rn but it is often better not to fix a particular isomorphism (basis), because that obscures the underlying symmetry group of the problem. For finite dimensional vector spaces, the symmetry group is the automorphisms of the vector space (a group isomorphic to

Thanks! I think I'm confused on a more basic step, which is, like, what exactly is the purported invariance? Consider a Shrödinger-style wave function on a configuration-space with two particles (that have positions, so 6 dimensions total). I know what it means that this wave-function is phase-invariant (if I rotate all amplitudes at once then the dynamics don't change). What exactly would it mean for it to be "locally phase-invariant", though?

As a sub-question, what exactly is the data of the local phase invariance? A continuous function from space to U(1... (read more)

Vanessa's reply is excellent, and Steven clearly descibed the gauge transforms of EM that I was gesturing at. That said, if you want to see some examples of fibre bundles in physics other than those in non-relativistic QM, nLab []has a great article on the topic. Also, if you know some GR, the inability to have more than local charts in general spacetimes makes clear that you do need to user bundles other than trivial ones i.e. ones of the form M×T. In my view, GR makes the importance of between global structure = fibre bundle clearer than QM, but maybe you have different intuitions.
5Vanessa Kosoy6mo
You don't need QFT here, gauge invariance is a thing even for non-relativistic quantum charged particles moving in a background electromagnetic field. The gauge transformation group consists of (sufficiently regular) functions g:R4→U(1). The transformation law of the n-particle wavefunction is: ψ′(x1…xn,t)=n∏i=1g(xi,t)qiψ(x1…xn,t) Here, qi is the electric charge of the i-th particle, in units of positron charge. In math-jargony terms, the wavefunction is a section of the line bundle W:=n⨂i=1(pri×idR)∗L⊗qi Here, pri:(R3)n→R3 is the projection to the position of the i-th particle and L is the "standard" line bundle on R4 on which the electromagnetic field (the 4-potential A, which is useful here even though the setting is non-relativistic) is a connection. W has an induced connection, and the electromagnetic time-dependent Shroedinger equation is obtained from the ordinary time-dependent Shroedinger equation by replacing ordinary derivatives with covariant derivatives.

Cool, thanks! I wonder what that theorem has to say about gauge symmetries. Like, if I take the enormous pile of programs that simulate some region of physics (each w/ a different hard coded gauge), and feed the pile into that theorem, what "short" program does it spit out? (With scare quotes b/c I'm not sure how big this constant is yet.)

I might poke at this later, but in the interim, I'm keen to hear from folk who already know the answer.

The programs used in the proof basically work like this: they dovetail all programs and note when they halt/output, keeping track of how much algorithmic probability has been assigned to each string. Then a particular string is selected using a clever encoding scheme applied to the accrued algorithmic probability. So the gauge theory example would just be a special case of that, applied to the accumulated probability from programs in various gauges.

(another thing that might help: when you're proving an implication □ C → C, the gödel-number that you're given doesn't code for the proof of the implication you're currently writing; that would be ill-typed. you asked for a □ C, not a □ (□ C → C). so the gödel-number you're given isn't a code for the thing you're currently writing, it's a code for löb's theorem applied to the thing you're currently writing.

it is for this reason that the proof you're fed might not be exactly the proof you were hoping for. you started out your implication being like "step 1:... (read more)

Load More