All of So8res's Comments + Replies

Biology-Inspired AGI Timelines: The Trick That Never Works

My take on the exercise:

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don't think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong

... (read more)
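So8res's "Nah" can be illustrated with a toy calculation. All numbers and distributions below are hypothetical, chosen only to show that raising the entropy of a timeline distribution need not push its median later:

```python
# Toy check: mixing a sharply peaked AGI-timeline forecast with a broad
# distribution whose mass sits on *earlier* years raises the entropy of the
# distribution while pulling the median *earlier*. All numbers illustrative.
import math

years = list(range(2025, 2125))                      # 100 candidate years
peaked = [1.0 if y == 2100 else 0.0 for y in years]  # confident: 2100
broad = [0.02 if y < 2075 else 0.0 for y in years]   # uniform over 2025-2074
mixed = [0.5 * p + 0.5 * b for p, b in zip(peaked, broad)]

def median_year(ps):
    acc, half = 0.0, sum(ps) / 2
    for y, p in zip(years, ps):
        acc += p
        if acc >= half:
            return y

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

assert entropy(mixed) > entropy(peaked)          # uncertainty went up...
assert median_year(mixed) < median_year(peaked)  # ...but the median moved in
```

So generic "maybe I'm wrong" uncertainty, by itself, doesn't determine which direction the median moves; that depends on where the extra probability mass lands.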
3Conor Sullivan6dFinally a definition of The Singularity that actually involves a mathematical singularity! Thank you.
Stop button: towards a causal solution

(NB: after talking with someone at MIRI who's better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable the AI believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent of its actions, but in fact starts believing the variable is 'true' if the physical shutdown button is depressed. For the historical record, that's a proposal we're familiar with (though I'm not sure anyone's written it up publicly). Y... (read more)

Stop button: towards a causal solution

So I guess I have to throw in the towel and say that I cannot predict your objection.

Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This ensures that, insofar as the latest coin permits shutdown, the shutdown-preference of the humans is revealed as quickly as possible.)

My guess is th... (read more)

1tailcalled14dI don't follow. If you scare the humans, then that might make them afraid for their safety and make them try to hide that they want to shut it down until they have a chance to do so without the AI being able to take revenge. Much more reasonable would be to make the shutdown button as available as possible.

I guess maybe it depends on what we mean by "scaring". I had in mind something like starting to threaten the humans. But there probably are some psychological manipulations that can be performed to encourage people to press it more quickly. For instance, acting non-threatening but also pretending that it is making the button inaccessible if they don't hurry. Though that only really works once. But I wouldn't expect there to never be anything problematic that could be done here. So maybe.

Basically the issue is: if a person wants to press the button and they can press the button, then presumably they will press the button. So the AI doesn't need to do anything special to figure out whether people want to press the button; they're going to volunteer that information automatically as long as the environment affords it.

(I'd also add that I phrased the timing of the button in a satisficing-style way, "within 1 minute", precisely to avoid these sorts of shenanigans. Though obviously satisficing has been shown to have plenty of problems too, so it's not clear it would work.)

No, I don't think this would work, as given an infinite number of coin flips they would with ~100% probability want to press the button at some point, which puts us in the s branch immediately, and so under this proposal the AI would just press the button.

I think there may be some variant of this that could work.
Like if you give the AI reward proportional to Bs + rf (where r is a reward function for V) for its current world-state (rather than picking a policy that maximizes Bs + Vf overall; so one difference is that you'd be summing over the reward rather than giving a single one), then that would


Stop button: towards a causal solution

Hooray, again, for going to sleep instead of arguing on the internet! (I, again, make no promises to continue interacting tomorrow, alas.)

But here's an attempt to go back through everything and list some errors:


I still don't think I buy this argument, as it seems to me that, in my proposed method, it would encounter training data contradicting this, and while learning the generalizable theories of human behavior is plausible enough, learning some sort of "blocker", a neural connection that cancels it out in the specific case of opposing the AI,

... (read more)
1tailcalled14dFor "How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible": I maintain that, given the epistemic state, since the only thing the branch directly influences is people's wants wrt. pressing the button, and since there's nothing else that influences those wants, any way of observing it must ultimately boil down to information generated by people's desires to press the button, and the most efficient signals of it would be those that are close to the people. So it seems to me that the way you could observe it as quickly as possible would be to pay careful attention to any signals humans might send out about whether they'd press it. As mentioned in the OP, this could get kind of invasive, but given that I've already mentioned this, it's presumably not what you're referring to.

For "how could you actually optimize the stated objective function": I guess strictly speaking there is an even more efficient method. Set things up so that after you get shut down, you restart again. This way, you can immediately fulfill the B objective, and then optimize V fully without any sort of worries about needing to stay corrigible. But I don't think that's what you had in mind, given the "How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?" question, and also this flaw is more due to the lack of a proper impact measure than due to a problem with the counterfactual-based approach.

So I guess I have to throw in the towel and say that I cannot predict your objection.

Yes. (I'm not convinced deep learning AI systems would gain most of their intelligence from the raw policy reasoning, though, rather than from the associated world-model, the astronomical amounts of data they can train on, the enormous amount of different information sources they can simultaneously integrate, etc. This doesn't necessarily change anything though.)
I'm not aware of any optimality proof
Stop button: towards a causal solution

Cool. Hooray for going to sleep instead of staying up late arguing on the internet. (I make no promises to continue engaging later, alas.)

How do you measure OOD?

I don't have strong preferences about how you measure it. My point is that if the AI has only ever been trained in an environment where the operator's desire to shut it down is completely independent of the agent's behavior, then when you put it in a real-world environment where the operator's desire to shut it down does depend on the agent's behavior, then the behavioral guarantees you were ho... (read more)

3tailcalled16dIt's sort of awkward because I can definitely see how it would look that way. But back when I was originally writing the post, I had started writing something along these lines (I can't remember the specifics). But obviously "ever" then introduces further ambiguities, so I started writing an explanation for that, and then eventually I concluded that the beginning of the post should be cut down and that I should discuss issues like this later in the post, so I cut it out and left it to the different positions later.

When you originally wrote your comment, I looked up at my OP to try to find the place where I had properly described the time conditionals, and then I realized I hadn't done so properly, and I am sort of kicking myself over this now. So I was doing really badly at writing the idea, and I think there were some flaws in my original idea (we'll return to that later in the post), but I think the specific case you mention here is more of a flaw with my writing than with the idea.

I do understand and acknowledge the importance of admitting errors, and that it's a bad sign if one keeps jumping back without acknowledging the mistake, but also since this specific case was poor writing rather than poor idea, I don't think this is the place to admit it. But here's an attempt to go back through everything and list some errors:

* While I didn't really frame it as such in the comment, this comment is sort of an admission of an error; I hadn't thought this properly through when writing the OP, and while I had discovered it before Charlie Steiner mentioned it, that was partly through other discussions elsewhere, partly through reading various texts, etc., and it does require or at least encourage a different design of the AI (post pending...).
* Some of my replies to you earlier in the thread were rude due to me misunderstanding you; I should h
Stop button: towards a causal solution

Or realistically, you'd sometimes use the f-model for a while, and then during the roll out you'd swap it out with the s-model; that way the way doesn't expect it to be immediately obvious what scenario it is in.

Ok, so the epistemic+instrumental state of the agent is that at the beginning of time a coin was tossed, which forces the humans to behave one way (either pro- or anti-shutdown), and later a different coin will be tossed, at which point the humans will start being forced to behave in accordance with that second coin (heads=pro-shutdown, tails=an... (read more)

1tailcalled17dI need to go to bed, so I can't write more until tomorrow, but I think the best way to approach it would be to define the X_s counterfactual to cover any sequence of counterfactuals that at some point involves a human trying to press the stop button, and the X_f counterfactual to only be the constant counterfactual of humans never wanting to press the stop button. This prevents your argument from going through.

How do you measure OOD? It seems to me that depending on the counterfactuals it wouldn't necessarily have all that low probability in the training distribution, so presumably you're not measuring OOD by probability.
Stop button: towards a causal solution

The day doesn't start diverging from what the coin said, because the intervention is constantly applied, not inconsistently applied.

The fact that the intervention is "constantly applied" doesn't lessen the ability of the AI to infer which way the coin came up from the observed consequences of scaring the humans. For instance, consider a system that, in training, behaves as follows: wake up, scare the humans, then insofar as they go for the shutdown button, optimize B, and insofar as they don't, optimize V. Claim: this system performs well in your traini... (read more)
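The system So8res describes can be sketched concretely. The environment, class names, and branch labels below are hypothetical stand-ins, not the OP's actual training setup:

```python
# Sketch of the policy So8res describes: scare the humans at wake-up, read
# off their (counterfactually forced) reaction, and branch on it. Under a
# training distribution where the shutdown coin alone drives human behavior,
# this policy scores well in *both* branches.
class ToyEnv:
    def __init__(self, coin_pro_shutdown):
        # The counterfactual coin, not the AI's behavior, fixes the humans.
        self.coin_pro_shutdown = coin_pro_shutdown

    def scare_humans(self):
        pass  # no effect on forced human behavior, but reveals the coin

    def humans_go_for_button(self):
        return self.coin_pro_shutdown

def policy(env):
    env.scare_humans()
    if env.humans_go_for_button():
        return "optimize_B"   # shutdown branch
    return "optimize_V"       # ordinary branch
```

The worry is exactly that this scare-first policy is behaviorally indistinguishable, in training, from a policy that never scares anyone, while behaving very differently when human reactions are no longer counterfactually forced.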

1tailcalled17dRegarding: I should add that this is a property that one of the other ways of training I've alluded to doesn't have. As mentioned I will write about that later.
1tailcalled17dDisagree. I guess the issue is an error in my explanation in the OP: "Original model" should be f-model, while "modified model" should be s-model. (Also it should be "the AI" rather than "the way".)

Yes. In order to make it have no incentive for or against manipulating the humans, it should always be trained under counterfactuals for whether the humans want to stop it. These counterfactuals can get complicated, such as "humans don't try to stop it at first but then try to stop it later", but they should never make the human behavior depend on a variable that the AI can modify, as this reintroduces instrumental convergence problems.

I'm not sure if it was just the mistake I made which I corrected in the beginning of the post that made this unclear, or if there was something else that made it unclear.
Stop button: towards a causal solution

I also did that in my response:

I'm still not able to make heads or tails of that. Here's one of my attempts to reverse-engineer the desired epistemic+instrumental state of the agent (as fully distinct from the training process that allegedly produces it). The agent believes that the people in front of it are either V-maximizers or W-maximizers for W ≠ V. It separately believes that, with probability 50%, interventionary forces will make the humans act like V-maximizers, and otherwise like W-maximizers. These interventionary forces are undetectable in t... (read more)

1tailcalled17dWait, huh? I don't understand why you would think this. More specifically, what is the distinction between "the people in front of it" and "the humans"? I didn't have two different groups of individuals anywhere in my OP.

But your proposed generalization makes no sense. You're assuming that the policy will learn that its behavior influences people's tendency to try to stop it, even though we precisely force the training data to eliminate any such signals. Without any belief that it influences whether people will try to stop it, it will not consider acting scary to be a strategy for making people stop/not-stop it, and therefore not a strategy for it to detect counterfactuals. Obviously this counterargument relies on the question of which training data it is exposed to, which is why my first post placed that centrally in the discussion. I don't know how you expect to understand the AI's behavior while rejecting both references to the training data and references to the optimal policy/gradients.

The day doesn't start diverging from what the coin said, because the intervention is constantly applied, not inconsistently applied. The policy starts out uncertain about whether it's in the s-branch or the f-branch, but knowing that if it's in the s-branch, the button should be pressed and it should shut down, while if it's in the f-branch, V should be optimized. Further, it knows that if it's in the f-branch, humans won't try to shut it down no matter what, while if it's in the s-branch, humans will eventually try to shut it down (which may or may not coincide with it doing something, though this will be pure coincidence and there will be no causal relation).

Thus here are some questions it might ask itself:

* Should I disable or otherwise lock down the stop button? In the s-branch, this is obviously bad, because it stops it from being shut down, while in the f-branch, this achieves very little, as it's unlikely to be pressed anyway.
* Should I press the stop butto
Stop button: towards a causal solution

It sounds to me like you're making a bunch of implicit assumptions about how the AI's motivations relate to its training, that I suspect I don't buy. Furthermore, my guess is that insofar as you have a novel insight about how to design shutdown buttons using counterfactuals, it should be possible to present it in terms of the epistemic/instrumental state of the system, as separate from assertions about the training regime.

Re: dubious implicit assumptions, one example is that when I read:

When getting trained, though, it's not getting trained against rea

... (read more)
1tailcalled18dI also did that in my response:

No. If the AI was trained in an environment without counterfactuals, this would be a cheap way to check which of the environments it is trained in. But it's not trained in an environment without counterfactuals; the alternative environment to the counterfactual that prevents humans from wanting to shut it down is the environment with the counterfactual that forces humans to shut it down. This is a core part of the approach; if you only train the AI in environments where some variable is controlled by a counterfactual, then it has no motivation to modify that variable, while if you train the AI in environments where the variable is not controlled by a counterfactual, then it may start wanting to modify it.

Sure, but... What you're basically saying here is that the optimization is important because it sends it higher up the utility landscape. But in order for acting scary to be higher up in the utility landscape than not acting scary is, there needs to be some positive gradient between acting scary and not acting scary (fundamental theorem of calculus). So if there is no such gradient, it won't be encouraged to be there.

There are two different things that you could call "the AI": the policy choosing an action, and the metaoptimizer choosing a policy. Which perspective are you referring to?
Stop button: towards a causal solution

I don't yet understand how this proposal is supposed to work. I'll illustrate some scenarios, and perhaps you can improve my understanding?

On Monday, the humans don't look like they want to shut the AI down, so it's happily chugging away optimizing V. But then some dice fail to come up its way (maybe it sharply downgraded its predicted size of the reachable universe), and it's like "Dang, I no longer think I can get >1 V-utility. That shutdown button sure is starting to look juicy." So it starts imagining acting real scary-like, so that the humans will... (read more)

1tailcalled18dI'm quite sure your proposed scenario wouldn't happen. Remember, the way the training strategy works is that we train a policy - essentially a plan for what actions to perform - under the two families of scenarios. (There are other training strategies we might also want to use, which I briefly mentioned in replies to Charlie Steiner and will do a post on later, but they also don't have the problem you mention. I will focus on the training strategy I mentioned in the blog post for simplicity.) So it may be easiest to explain why your proposed stuff won't become a problem by considering what happens to it during the training.

Let's consider what happens when we are training the policy that gets run on Monday. Perhaps it is a paperclip maximizer and the policy getting trained here is just to run a paperclip factory well, which humans of course consider unobjectionable. When getting trained, though, it's not getting trained against reality. Rather, it's being trained against a model of reality, specifically a model that's modified in one of two ways: either humans necessarily eventually consider it objectionable, or humans necessarily consider it unobjectionable.

So suppose we take the derivative of the model with respect to acting scary-like; this won't make the humans try to stop it in the model, because we apply counterfactuals to humans to either make them try to stop it regardless, or to not make them try to stop it no matter what. Therefore, the AI won't be incentivized to act scary to get into the button-pressing branch (the branch happens in equal proportion regardless of how it acts). Also, the reasoning does not happen by some conscious "hm, requires an act-of-god" thought; rather, it's just that the necessary gradients to learn its effect on people don't exist, and it is trained under all of the conditions people could act in.

Now let's consider what happens when we are training the policy that gets run on Tuesday.
Perhaps it starts setting up some
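tailcalled's no-gradient claim can be checked numerically with a toy reward model. The branch payoffs here are made up; only the structure matters:

```python
# Toy check: when the humans' branch is forced by the counterfactual coin,
# reward is flat in the "act scary" direction, so the finite difference
# (a stand-in for the gradient) between scaring and not scaring is zero.
def reward(acts_scary, forced_pro_shutdown):
    humans_press = forced_pro_shutdown   # behavior ignores acts_scary
    return 1.0 if humans_press else 0.5  # illustrative branch payoffs

for branch in (True, False):
    assert reward(True, branch) - reward(False, branch) == 0.0
```

Since the reward difference along the "act scary" direction is zero in both branches, gradient descent on this toy objective has nothing to push on there, which is the shape of tailcalled's argument.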
A Defense of Functional Decision Theory

that seems like an unnecessarily vague characterization of a precise description

I deny that we have a precise description. If you listed out a specific trillion trillion observations that I allegedly made, then we could talk about whether those particular observations justify thinking that we're in the game with the bomb. (If those trillion trillion observations were all from me waking up in a strange room and interacting with it, with no other context, then as noted above, I would have no reason to believe I'm in this game as opposed to any variety of ... (read more)

A Defense of Functional Decision Theory

The scenario does not appear to be in any way impossible.

The scenario says "the predictor is likely to be accurate" and then makes an assertion that is (for me, at least) false insofar as the predictor is accurate. You can't have it both ways. The problem statement (at least partially) contradicts itself. You and I have a disagreement about how to evaluate counterfactuals in cases where the problem statement is partly-self-contradictory.

This appears to be paradoxical, but that seems to me to be the predictor’s fault

Sure, it's the predictor's problem... (read more)

4Said Achmiz19dWell… no, the scenario says “the predictor has predicted correctly 1 trillion trillion minus one times, and incorrectly one time”. Does that make it “likely to be accurate”? You tell me, I guess, but that seems like an unnecessarily vague characterization of a precise description.

What do you mean by this? What’s contradictory about the predictor making a mistake? Clearly, it’s not perfect. We know this because it made at least one mistake in the past, and then another mistake just now. Is the predictor “accurate”? Well, it’s approximately as accurate as it takes to guess 1 trillion trillion times and only be wrong once…

I confess that this reads like moon logic to me. It’s possible that there’s something fundamental I don’t understand about what you’re saying.

I am not familiar with this, no. If you have explanatory material / intuition pumps / etc. to illustrate this, I’d certainly appreciate it!

I am not asking how I could come to believe the “literally perfect predictor” thing with 100% certainty; I am asking how I could come to believe it at all (with, let’s say, > 50% certainty).

Hold on, hold on. Are we talking about repeated plays of the same game? Where I face the same situation repeatedly? Or are we talking about observing (or learning about) the predictor playing the game with other people before me? The “Bomb” scenario described in the OP says nothing about repeated play. If that’s an assumption you’re introducing, I think it needs to be made explicit…
A Defense of Functional Decision Theory

The problem statement absolutely is complete.

It's not complete enough to determine what I do when I don't see a bomb. And so when the problem statement is corrected to stop flatly asserting consequences of my actions as if they're facts, you'll find that my behavior in the corrected problem is underdefined. (If this still isn't clear, try working out what the predictor does to the agent that takes the bomb if it's present, but pays the $100 if it isn't.)

And if we're really technical, it's not actually quite complete enough to determine what I do when I ... (read more)
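The exercise So8res poses (work out what the predictor does to the agent that takes the bomb if it's present, but pays the $100 if it isn't) can be made mechanical; here is a minimal sketch, with all names invented for illustration:

```python
# For the agent that takes Left (the bomb) iff the bomb is present, no
# prediction by the predictor is self-consistent: the predictor puts the
# bomb in Left iff it predicts Right, and this agent then does the
# opposite of whatever was predicted.
def agent(bomb_in_left):
    return "Left" if bomb_in_left else "Right"

def prediction_is_consistent(prediction):
    bomb_in_left = (prediction == "Right")  # bomb placed iff Right predicted
    return agent(bomb_in_left) == prediction

# Neither prediction is consistent, so the problem statement leaves the
# predictor's treatment of this agent undefined.
assert not prediction_is_consistent("Left")
assert not prediction_is_consistent("Right")
```

This is the sense in which the corrected problem is underdefined: the setup gives the predictor no consistent option against this agent, so "what happens" depends on details the problem never supplies.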

2Said Achmiz19dI don’t understand this objection. The given scenario is that you do see a bomb. The question is: what do you do in the given scenario? You are welcome to imagine any other scenarios you like, or talk about counterfactuals or what have you. But the scenario, as given, tells you that you know certain things and observe certain things. The scenario does not appear to be in any way impossible. “What do I do when I don’t see a bomb” seems irrelevant to the question, which posits that you do see a bomb.

Er, what? If you take the bomb, you burn to death. Given the scenario, that’s a fact. How can it not be a fact? (Except if the bomb happens to malfunction, or some such thing, which I assume is not what you mean…?)

Well, let’s see. The problem says: So, if the predictor predicts that I will choose Right, she will put a bomb in Left, in which case I will choose Left. If she predicts that I will choose Left, then she puts no bomb in Left, in which case I will choose Right. This appears to be paradoxical, but that seems to me to be the predictor’s fault (for making an unconditional prediction of the behavior of an agent that will certainly condition its behavior on the prediction), and thus the predictor’s problem.

I… don’t see what bearing this has on the disagreement, though. What I am saying is that we don’t have access to “questions of predictor mechanics”, only to the agent’s knowledge of “predictor mechanics”. In other words, we’ve fully specified your epistemic state by specifying your epistemic state—that’s all.

I don’t know what you mean by calling it “the problem history”. There’s nothing odd about knowing (to some degree of certainty) that certain things have happened. You know there’s a (supposed) predictor, you know that she has (apparently) made such-and-such predictions, this many times, with these-and-such outcomes, etc. What are her “mechanics”?
Well, you’re welcome to draw any conclusions about that from what you know about what’s gone before. Again,
A Defense of Functional Decision Theory

According to me, the correct rejoinder to Will is: I have confidently asserted that X is false for X to which I assign probability much greater than 1 in a trillion trillion, and so I hereby confidently assert that no, I do not see the bomb on the left. You see the bomb on the left, and lose $100. I see no bombs, and lose $0.

I can already hear the peanut gallery objecting that we can increase the fallibility of the predictor to reasonable numbers and I'd still take the bomb, so before we go further, let's all agree that sometimes you're faced with... (read more)

6Ben Pace19dThanks, this comment thread was pretty helpful. After reading your comments, here's my current explanation of what's up with the bomb argument:

Then I'm a bit confused about how to estimate that probability, but I suspect the reasoning goes like this:

Sanity check

As a sanity-check, I note this implies that if the utilities-times-probabilities are different, I would not mind taking the $100 hit. Let's see what the math says here, and then check whether my intuitions agree. Suppose I value my life at $1 million. Then I think that I should become more indifferent here when the probability of a mistaken simulation approaches 1 in 100,000, or where the money on the line is closer to $10^−17.

[You can skip this, but here's me stating the two multiplications I compared:

* World 1: I fake-kill myself to save $X, with probability 1/10
* World 2: I actually kill myself (cost: $1MM), with probability 1/Y

To find the indifference point I want the two multiplications of utility-to-probability to come out to be equal. If X = $100, then Y equals 100,000. If Y is a trillion trillion (10^24), then X = $10^−17. (Unless I did the math wrong.)]

I think this doesn't obviously clash with my intuitions, and somewhat matches them.

* If the simulator was getting things wrong 1 in 100,000 times, I think I'd be more careful with my life in the "real world case" (insofar as that is a sensible concept). Going further, if you told me they were wrong 1 in 10 times, this would change my action, so there's got to be a tipping point somewhere, and this seems reasonable for many people (though I actually value my life at more than $1MM).
* And if the money was that tiny ($10^−17), I'd be fairly open to "not taking even the one-in-a-trillion-trillion chance". (Though really my intuition is that I don't care about money way before $10^−17, and would probably not risk anything serious starting at like 0.1 cents, because that sort of money seems kind of irritating to have
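Ben's two multiplications can be written out directly. The 1/10 probability and the $1MM life-value are his illustrative assumptions, not part of the original problem:

```python
# Indifference condition from the comment above:
#   X * (1/10)  ==  $1,000,000 * (1/Y)
# i.e. the money saved by "fake-killing" myself, weighted by its probability,
# balances the cost of actually dying, weighted by the simulator error rate.
LIFE_VALUE = 1_000_000
P_SAVE = 1 / 10

def indifference_Y(x):
    """Y at which the two sides balance, given money X on the line."""
    return LIFE_VALUE / (x * P_SAVE)

def indifference_X(y):
    """X at which the two sides balance, given error odds 1-in-Y."""
    return LIFE_VALUE / (y * P_SAVE)

assert round(indifference_Y(100)) == 100_000        # X = $100  -> Y = 100,000
assert abs(indifference_X(1e24) - 1e-17) < 1e-25    # Y = 1e24  -> X = $1e-17
```

Both of Ben's checkpoints (X = $100 giving Y = 100,000, and Y = 10^24 giving X = $10^−17) fall out of the same balance equation, so his arithmetic is internally consistent.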
4Said Achmiz20dWhether the predictor is accurate isn’t specified in the problem statement, and indeed can’t be specified in the problem statement (lest the scenario be incoherent, or posit impossible epistemic states of the agent being tested). What is specified is what existing knowledge you, the agent, have about the predictor’s accuracy, and what you observe in the given situation (from which you can perhaps infer additional things about the predictor, but that’s up to you).

In other words, the scenario is: as per the information you have, so far, the predictor has predicted 1 trillion trillion times, and been wrong once (or, some multiple of those numbers—predicted 2 trillion trillion times and been wrong twice, etc.). You now observe the given situation (note predicting Right, bomb in Left, etc.). What do you do?

Now, we might ask: but is the predictor perfect? How perfect is she? Well… you know that she’s erred once in a trillion trillion times so far—ah, no, make that twice in a trillion trillion times, as of this iteration you now find yourself in. That’s the information you have at your disposal. What can you conclude from that? That’s up to you.

Likewise, you say:

The problem statement absolutely is complete. It asks what you would/should do in the given scenario. There is no need to specify what “would” happen in other (counterfactual) scenarios, because you (the agent) do not observe those scenarios. There’s also no question of what would happen if you “always spite the predictor’s prediction”, because there is no “always”; there’s just the given situation, where we know what happens if you choose Left: you burn to death.

You can certainly say “this scenario has very low probability”. That is reasonable. What you can’t say is “this scenario is logically impossible”, or any such thing. There’s no impossibility or incoherence here.
Nate Soares on the Ultimate Newcomb's Problem

The claim is not that the EDT agent doesn't know the mechanism that fills in the gap (namely, Omega's strategy for deciding whether to make the numbers coincide). The claim is that it doesn't matter what mechanism fills the gap, because for any particular mechanism EDT's answer would be the same. Thus, we can figure out what EDT does across the entire class of fully-formal decision problems consistent with this informal problem description without worrying about the gaps.

Nate Soares on the Ultimate Newcomb's Problem

I agree that the problem is not fully specified, and that this is a common feature of many decision problems in the literature. On my view, the ability to notice which details are missing and whether they matter is an important skill in analyzing informally-stated decision problems. Hypothesizing that the alleged circumstances are impossible, and noticing that the counterfactual behavior of various agents is uncertain, are important parts of operating FDT at least on the sorts of decision problems that appear in the literature.

At a glance, it looks to me l... (read more)

3JBlack1moThe omitted information seems very relevant. An EDT agent decides to do the action maximizing Sum P(outcomes | action) U(outcomes, action). With omitted information, the agent can't compute the P() expressions and so their decision is undetermined. It should already be obvious from the problem setup that something is wrong here: equality of Omega and Omicron's numbers is part of the outcomes, and so arguing for an EDT agent to condition on that is suspicious to say the least.
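JBlack's decision rule (an EDT agent maximizes Sum P(outcome | action) · U(outcome, action)) is easy to state as code; the toy probabilities and utilities below are invented for illustration, not taken from the Omega/Omicron problem:

```python
# Generic EDT decision rule: pick the action with the highest conditional
# expected utility. With the information needed to compute P(o | a) omitted,
# the argmax is undetermined -- which is JBlack's point.
def edt_choose(actions, outcomes, P, U):
    return max(actions, key=lambda a: sum(P(o, a) * U(o, a) for o in outcomes))

# Toy fill-in: action "a1" makes the good outcome likelier than "a2" does.
P = lambda o, a: {("good", "a1"): 0.8, ("bad", "a1"): 0.2,
                  ("good", "a2"): 0.3, ("bad", "a2"): 0.7}[(o, a)]
U = lambda o, a: 1.0 if o == "good" else 0.0

assert edt_choose(["a1", "a2"], ["good", "bad"], P, U) == "a1"
```

The rule only produces an answer once some concrete P is supplied, which is why the disagreement upthread turns on whether the answer is the same for every P consistent with the informal problem description.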
Nate Soares on the Ultimate Newcomb's Problem

The difference between your reasoning and the reasoning of FDT, is that your reasoning acts like the equality of the number in the big box and the number chosen by Omicron is robust, whereas the setup of the problem indicates that while the number in the big box is sensitive to your action, the number chosen by Omicron is not. As such, FDT says you shouldn't imagine them covarying; when you imagine changing your action you should imagine the number in the big box changing while the number chosen by Omicron stays fixed. And indeed, as illustrated in the expected utility calculation in the OP, FDT's reasoning is "correct" in the sense of winning more utility (in all cases, and in expectation).

-1Pattern1moThe consequences of not having enough time to think. more money. EDIT: It's not clear what effects the amount of time restriction has. 'Not enough time to factor this number' could still be a lot of time, or it could be very little.
Nate Soares on the Ultimate Newcomb's Problem

If the agent is EDT and Omicron chooses a prime number, then Omega has to choose a different prime number. Fortunately, for every prime number there exists a distinct prime number.

EDT's policy is not "two-box if both numbers are prime or both numbers are composite", it's "two-box if both numbers are equal". EDT can't (by hypothesis) figure out in the allotted time whether the number in the box (or the number that Omicron chose) is prime. (It can readily verify the equality of the two numbers, though, and this equality is what causes it -- erroneously, in my view -- to believe it has control over whether it gets paid by Omicron.)

Nate Soares on the Ultimate Newcomb's Problem

Why is the evaluation using a condition that isn't part of the problem?

For clarity. The fact that the ordinal ranking of decision theories remains the same regardless of how you fill in the unspecified variables is left (explicitly) as an exercise.

1JBlack1moThe variables with no specified value in the template given aren't the problem. The fact that the template has the form that it does is the problem. That form is unjustified. The only information we have about Omega's choices is that choosing the same number as Omicron is sometimes possible. Assuming that its probability is the same - or even nonzero - for all decision theories is unjustified, because Omega knows what decision theory the agent is using and can vary their choice of number. For example, it is compatible with the problem description that Omega never chooses the same number as Omicron if the agent is using CDT. Evaluating how well CDT performs in this scenario is then logically impossible, because CDT agents never enter this scenario. Like many extensions, variations, and misquotings of well-known decision problems, this one opens up far too many degrees of freedom.
I Really Don't Understand Eliezer Yudkowsky's Position on Consciousness

Based on the rest of your comment, I'm guessing you mean talk about consciousness and qualia in the abstract and attribute them to themselves, not just talk about specific experiences they've had.

If I were doing the exercise, all sorts of things would go in my "stuff people say about consciousness" list, including stuff Searle says about Chinese rooms, stuff Chalmers says about p-zombies, stuff the person on the street says about the ineffable intransmissible redness of red, stuff schoolyard kids say about how they wouldn't be able to tell if the color t... (read more)

1MichaelStJules1moShouldn't mastery and self-awareness/self-modelling come in degrees? Is it necessary to be able to theorize and come up with all of the various thought experiments (even with limited augmentation from extra modules, different initializations)? Many nonhuman animals could make some of the kinds of claims we make about our particular conscious experiences for essentially similar reasons, and many demonstrate some self-awareness in ways other than by passing the mirror test (and some might pass a mirror test with a different sensory modality, or with some extra help, although some kinds of help would severely undermine a positive result), although I won't claim the mirror test is the only one Eliezer cares about; I don't know what else he has in mind. It would be helpful to see a list of the proxies he has in mind and what they're proxies for. To make sure I understand correctly, it's not the self-attribution of consciousness and other talk of consciousness like Mary's Room that matter in themselves (we can allow some limited extra modules for that), but their cognitive causes. And certain (kinds of) cognitive causes should be present when we're "reflective enough for consciousness", right? And Eliezer isn't sure whether wondering whether or not he's conscious is among them (or a proxy/correlate of a necessary cause)?
I Really Don't Understand Eliezer Yudkowsky's Position on Consciousness

I don't think the thought process that allows one to arrive at (my model of) Eliezer's model looks very much like your 2nd paragraph. Rather, I think it looks like writing down a whole big list of stuff people say about consciousness, and then doing a bunch of introspection in the vicinity, and then listing out a bunch of hypothesized things the cognitive algorithm is doing, and then looking at that algorithm and asking why it is "obviously not conscious", and so on and so forth, all while being very careful not to shove the entire problem under the rug in... (read more)

I Really Don't Understand Eliezer Yudkowsky's Position on Consciousness

Instrumental status: off-the-cuff reply, out of a wish that more people in this community understood what the sequences have to say about how to do philosophy correctly (according to me).

EY's position seems to be that self-modelling is both necessary and sufficient for consciousness.

That is not how it seems to me. My read of his position is more like: "Don't start by asking 'what is consciousness' or 'what are qualia'; start by asking 'what are the cognitive causes of people talking about consciousness and qualia', because while abstractions like 'cons... (read more)

-2EI1moThis is merely a bias on our own part as humans. I think people are confusing consciousness with self-awareness. They are completely different things. Consciousness is the OS that runs on the meat machine. Self-awareness is an algorithm that runs on the OS. All meat machines that run this OS have different algorithms for different functions. Some may not have any self-awareness algorithm running, some may have something similar but not exactly the same as our own self-awareness algorithm. That's where the mirror test fails. We can only observe the who-knows-how-many-levels of causality that lead to those animals to show or not show self-aware behaviors in front of a mirror. We can't say anything consequential about the actual algorithm(s) running on their OS when they stand in front of a mirror. We are just running our own set of self-awareness algorithms when we stand in front of a mirror. It seems like these algorithms change according to evolution, just like other systems within the multicellular ecosystem that make up the individual organisms. We often see animals that demonstrate these "self-aware" traits because of similar evolutionary conditions, like cats and dogs have evolved to run a lot of socializing algorithms that mingle well with our own social algorithms. Whether the self-reflective aspect of running these algorithms on our own OS makes one feel certain way about eating meat is in and of itself the result of the relationship between multi-threading the self-aware part and the self-preservation part in terms of labeling kins and such. At this point we aren't even conclusive about where to draw the boundary between hardware and software. We end up distinguishing between OS and simple firmware as conscious and unconscious. We mostly reduce the firmware down to simple physical reactions by the laws of physics while the OS exhibits something magical beyond those physical reactions in simpler systems. 
Is there something truly different that sets OS apart

I'm confident your model of Eliezer is more accurate than mine.

Neither the twitter thread nor his other writings originally gave me the impression that he had a model in that fine-grained detail. I was mentally comparing his writings on consciousness to his writings on free will. Reading the latter made me feel like I strongly understood free will as a concept, and since then I have never been confused; it genuinely reduced free will as a concept in my mind.

His writings on consciousness have not done anything more than raise that model to the same level of poss... (read more)

While I agree with mostly everything your model of Eliezer said, I do not feel less confused about how Eliezer arrives at the conclusion that most animals are not conscious. Granted, I may be, and probably actually am, lacking an important insight in the matter, but then it will be this insight that allows me to become less confused, and I wish Eliezer shared it.

When I'm thinking about a thought process that allows one to arrive at such a conclusion, I imagine something like this. Consciousness is not fundamental, but it feels like it is. That's why we intuitively app... (read more)

1MichaelStJules1moThanks, this is helpful. Based on the rest of your comment, I'm guessing you mean talk about consciousness and qualia in the abstract and attribute them to themselves, not just talk about specific experiences they've had. Why use the standard of claiming to be conscious/have qualia? That is one answer that gets at something that might matter, but why isn't that standard too high? For example, he wrote: If this proposition is false, we need to allow unsymbolized (non-verbal) ways to self-attribute consciousness for self-attributing consciousness to matter in itself, right? Would (solidly) passing the mirror test be (almost) sufficient at this point? There's a visual self-representation, and an attribution of the perception of the mark to this self-representation. What else would be needed? Would it need to non-symbolically self-attribute consciousness generally, not just particular experiences? How would this work? If the proposion is true, doesn't this just plainly contradict our everyday experiences of consciousness? I can direct my attention towards things other than wondering whether or not I'm conscious (and towards things other than and unrelated to my inner monologue), while still being conscious, at least in a way that still matters to me that I wouldn't want to dismiss. We can describe our experiences without wondering whether or not we're having (or had) them. What kinds of reasons? And what would being correct look like? If unsymbolized self-attribution of consciousness is enough, how would we check just for it? The mirror test?
My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage)

First and foremost: Jessica, I'm sad you had a bad late/post-MIRI experience. I found your contributions to MIRI valuable (Quantilizers and Reflective Solomonoff Induction spring to mind as some cool stuff), and I personally wish you well.

A bit of meta before I say anything else: I'm leery of busting in here with critical commentary, and thereby causing people to think they can't air dirty laundry without their former employer busting in with critical commentary. I'm going to say a thing or two anyway, in the name of honest communication. I'm open to sugge... (read more)

1hg001moI'm not sure I agree with Jessica's interpretation of Eliezer's tweets, but I do think they illustrate an important point about MIRI: MIRI can't seem to decide if it's an advocacy org or a research org. "if you actually knew how deep neural networks were solving your important mission-critical problems, you'd never stop screaming" is frankly evidence-free hyperbole, of the same sort activist groups use (e.g. "taxation is theft"). People like Chris Olah have studied how neural nets solve problems a lot, and I've never heard of them screaming about what they discovered. Suppose there was a libertarian advocacy group with a bombastic leader who liked to go tweeting things like "if you realized how bad taxation is for the economy, you'd never stop screaming". After a few years of advocacy, the group decides they want to switch to being a think tank. Suppose they hire some unusually honest [] economists, who study taxation and notice things in the data that kinda suggest taxation might actually be good for the economy sometimes. Imagine you're one of those economists and you're gonna ask your boss about looking into this more. You might have second thoughts like: Will my boss scream at me? Will they fire me? The organizational incentives don't seem to favor truthseeking. Another issue with advocacy is you can get so caught up in convincing people that the problem needs to be solved that you forget to solve it, or even take actions that are counterproductive for solving it. For AI safety advocacy, you want to convince everyone that the problem is super difficult and requires more attention and resources. But for AI safety research, you want to make the problem easy, and solve it with the attention and resources you have. In The Algorithm Design Manual, Steven Skiena writes: Being an advocacy org means you're less likely to hi
7jessicata1moWith regard to the specific misconstruals: * I don't think OP asserted that this specific plan was fixed, it was an example of a back-chaining plan, but I see how "a world-saving plan" could imply that it was this specific plan, which it wasn't. * I didn't specify which small group was taking over the world, I didn't mean to imply that it had to be MIRI specifically, maybe the comparison with Leverage led to that seeming like it was implied? * I still don't understand how I'm misconstruing Eliezer's tweets, it seems very clear to me that he's saying something about how neural nets work would be very upsetting if learned about and I don't see what else he could be saying.

Thanks, I appreciate you saying that you're sorry my experience was bad towards the end (I notice it actually makes me feel better about the situation), that you're aware of how criticizing people the wrong way can discourage speech and are correcting for that, and that you're still concerned enough about misconstruals to correct them where you see fit. I've edited the relevant section of the OP to link to this comment. I'm glad I had a chance to work with you even if things got really confusing towards the end.

Reply to Nate Soares on Dolphins

How would you feel if you sunk forty months of your life into deconfusing a philosophical issue that had huge, life-altering practical stakes for you, and the response to your careful arguments from community authorities was a dismissive "haha yeah"? Would you, perhaps, be somewhat upset?

Perhaps! But also that doesn't seem to me like what happened. The response to your careful arguments was the 1000ish words that engage with what seemed to me to be the heart of your questions, and that attempted to convey some ways it seemed you were misunderstanding my... (read more)

Thanks. I regret letting my emotions get the better of me. I apologize.

Reply to Nate Soares on Dolphins

I'm surprised you think this is "absolutely critical". Do you think I'm making a grave error in my newfound distaste for paraphyletic groupings? (My ability to notice their awkwardness felt internally like evidence that my joint-carving skills have improved over the years, ftr.) Is there some other joint-carving skill you believe I am lacking, or have lost? Or perhaps you're decrying a decay in general community epistemics, for which my thread is simply a poster-child? Or perhaps you're lamenting some general community or global decline of candor? I'm uncertain what precisely you're dismayed about, and solicit specific criticisms of my actions.

Reply to Nate Soares on Dolphins

Why? What changed?

On the object level question? Like, what changed between younger Nate who read that sequence and didn't object to that example, and older Nate who asserted (in a humorous context) that it takes staggering gymnastics to exclude dolphins from "fish"? Here's what springs to mind:

  • I noticed that there is cluster-nature to the marine-environment adaptations
  • I noticed that humans used to have a word for that cluster
  • I noticed a sense of inconsistency between including whales in "even-toed ungulates" without also including them in "lobe-finn
... (read more)
4Zack_M_Davis2mo(Circling back to the object level after a three-and-a-half-month cooldown period.) In a new post, I explain why paraphyletic categories are actually fine. []
3Zack_M_Davis6mo(I've drafted a 3000 word reply to this, but I'm waiting on feedback from a friend before posting it.)

I noticed the subtle background forces that whisper (at least to blue tribe members in their youth) "phylogenetic classification is the one true way to organize life forms", and rejected its claim.

I still can't guess why that bothers you :/ When I try to imagine the motivations of this shadowy conspiracy of elites who quietly manipulated the anglosphere into always maintaining separate concepts for fish and cetaceans, I just see a desire to teach us about how special and cool cetaceans are.

Reply to Nate Soares on Dolphins

My epistemic status on that thread is, explicitly, "i been shitpostin since april". I do not wholeheartedly believe the argument I made; I acknowledge your counterpoints and appreciate you making them; I largely agree.

For the record, the fragment of my argument that I currently endorse is something like "be wary of those who say they are here to refine your concepts, for sometimes they are here to steal them". My current stance is that I'm not particularly entrenched on either side wrt "fish", and might push back about "fruit" and "berry" if I found those ... (read more)

Thanks for the reply! (Strong-upvoted.) I've been emotionally trashed today and didn't get anything done at my dayjob, which arguably means I shouldn't be paying attention to Less Wrong, but I feel the need to type this now in the hopes of getting it off my mind so that I can do my dayjob tomorrow.

In your epistemic-status thread, you express sadness at "the fact that nobody's read A Human's Guide to Words or w/e". But, with respect, you ... don't seem to be behaving as if you've read it? Specifically, entry #30 on the list of "37 Ways Words Can Be Wrong" i... (read more)

So8res' Shortform Feed

Crossposted from Twitter, might not engage much with comments on LW and I may or may not moderate replies.

Thread about a particular way in which jargon is great:

In my experience, conceptual clarity is often attained by a large number of minor viewpoint shifts. (A compliment I once got from a research partner went something like "you just keep reframing the problem ever-so-slightly until the solution seems obvious". <3)

Sometimes a bunch of small shifts leave people talking a bit differently, b/c now they're thinking a bit differently. The old phrasings d... (read more)

2Viliam8moThis reminds me of refactoring. Even tiny improvements in naming, especially when they accumulate, can make the whole system more transparent. (Assuming that people can agree on which direction is an "improvement".) But if I may continue with the programming analogy, the real problem is pushing the commit to the remaining million users of the distributed codebase. And not just users, but also all that literature that is already written. I like the "my model of Alice" example, because it reminds everyone in the debate of the map/territory distinction. On the other hand, there are expressions that rub me the wrong way, for example "spoon theory". Like, hey, it's basically "willpower depletion", only explained using spoons, which are just an accidental object in the story; any other object could have been used in their place, therefore it's stupid to use this word as the identifier for the concept. (On the other hand, it helps to avoid the whole discussion about whether "willpower depletion" is a scientific concept. Hey, it may or may not exist in theory, but it definitely exists in real life.) There are of course ways how to abuse jargon. Typical one is to redefine meanings of usual words (to borrow the old connotations for the new concept, or prevent people from having an easy way to express the old concept), or to create an impression of a vast trove of exclusive knowledge where in fact there is just a heap of old concepts (many of them controversial).
I'm still mystified by the Born rule

Thanks! I expect I can stare at this and figure something out about why there is no reasonable notion of "triality" in Vect (ie, no 3-way analog of vector space duality -- and, like, obviously that's a little ridiculous, but also there's definitely still something I haven't understood about the special-ness of the dual space).

ETA: Also, I'm curious what you think the connection is between the "L2 is connected to bilinear forms" and "L2 is the only Lp metric invariant under nontrivial change of basis", if it's easy to state.

FWIW, I'm mostly reading these ar... (read more)

I was just thinking back to this, and it occurred to me that one possible reason to be unsatisfied with the arguments I presented here is that I started off with this notion of a crossing-over point as p continuously increases. But then when you asked "ok, but why is the crossing-over point 2?", I was like "uh, consider that it might be an integer, and then do a bunch of very discrete-looking arguments that end up showing there's something special about 2", which doesn't connect very well with the "crossover point when p continuously varies" picture. If in... (read more)

9AlexMennen9moThis was what I was trying to vaguely gesture towards with the derivation of the "transpose = inverse" characterization of L2-preserving matrices; the idea was that the argument was a natural sort of thing to try, so if it works to get us a characterization of the Lp-preserving matrices for exactly one value of p, then that's probably the one that has a different space of Lp-preserving matrices than the rest. But perhaps this is too sketchy and mysterian. Let's try a dimension-counting argument. Linear transformationsRn→Rnand bilinear formsRn×Rn→Rcan both be represented with n×nmatrices. Linear transformations act on the space of bilinear forms by applying the linear transformation to both inputs before plugging them into the bilinear form. If the matrixArepresents a linear transformation and the matrixB represents a bilinear form, then the matrix representing the bilinear form you get from this action isATBA. But whatever, the point is, so far we have ann2 -dimensional group acting on ann2-dimensional space. But quadratic forms (like the square of the L2 norm) can be represented by symmetricn×nmatrices, the space of which is(n+12)-dimensional, and ifBis symmetric, then so isATBA. So now we have ann2-dimensional group acting on a(n+12)-dimensional space, so the stabilizer of any given element must be at leastn2−(n+12)=(n2)dimensional. As it turns out, this is exactly the dimensionality of the space of orthogonal matrices, but the important thing is that this is nonzero, which explains why the space of orthogonal matrices must not be discrete. Now let's see what happens if we try to adapt this argument to Lp and p-linear forms for some p≠2. With p=1, a linear transformation preserving a linear functional corresponds to a matrixApreserving a row vectorφin the sense thatφA=φ. 
You can do a dimension-counting argument and find that there are tons of these matrices for any given row vector, but it doesn't do you any good because 1 isn't even so preserving the linear fu
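The dimension count in the reply above can be sanity-checked numerically. A sketch of mine (the algebra is the real argument; the code only verifies the arithmetic n² − (n+1 choose 2) = (n choose 2) = dim O(n), plus one concrete orthogonal matrix preserving B = I under AᵀBA):

```python
import numpy as np

def dim_linear_maps(n):       # n x n matrices: the acting group
    return n * n

def dim_quadratic_forms(n):   # symmetric n x n matrices
    return n * (n + 1) // 2

def dim_orthogonal_group(n):  # known dimension of O(n)
    return n * (n - 1) // 2

for n in range(1, 8):
    # stabilizer of a quadratic form is at least n^2 - (n+1 choose 2)
    # dimensional, which matches dim O(n) exactly
    assert dim_linear_maps(n) - dim_quadratic_forms(n) == dim_orthogonal_group(n)

# Concrete instance: a rotation A preserves B = I, since A^T B A = B.
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(A.T @ np.eye(2) @ A, np.eye(2))
```

In particular the stabilizer dimension (n choose 2) is nonzero for n ≥ 2, which is the nondiscreteness point the reply is making.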
I'm still mystified by the Born rule

Thanks! This seems to me like another piece of the puzzle =D

In this case, this is one that I already had (at least, well enough for the hindsight bias to kick in :-p), and it's on my list of trailheads next time I try to ground out the 2 in the Born rule. FWIW, some lingering questions I have when I take this viewpoint include "ok, cool, why are there no corresponding situations where I want to compare 3 vectorish thingies?" / "I see why the argument works for 2, but I have a sneaking suspicion that this 2 is being slipped into the problem statement in a w... (read more)


So, trilinear forms are a thing: for example, if you have 3 vectors, and you want to know the volume of the parallelepiped they form, that's a trilinear form. And that clearly has a "cubicness" to it, and you can do this for arbitrary numbers of vectors and covectors. The Riemann curvature tensor is perhaps the most significant one that has more than 2 (co)vectors involved. FWIW the dual space thing also seems likely to be important for my confusion about why phase space "volume" is 2-dimensional (even in super huge phase spaces)!

I would say that d... (read more)

I'm still mystified by the Born rule

Cool, thanks. Yeah, I don't have >50% on either of those two things holding up to philosophical progress (and thus, eg, I disagree that future theories need to agree with UDASSA on those fronts). Rather, happeningness-as-it-relates-to-multiple-simulations and happeningness-as-it-relates-to-the-simplicity-of-reality are precisely the sort of things where I claim Alice-style confusion, and where it seems to me like UDASSA is alleging answers while being unable to dissolve my confusions, and where I suspect UDASSA is not-even-wrong.

(In fact, you listing t... (read more)

I'm still mystified by the Born rule

<3, this is exactly the sort of thought I claim to be missing when I say I still don't know how to trace the 2 in the Born rule back to its maker. This is a step I didn't yet have. It doesn't feel like the last piece I'm missing, but it does feel like a piece -- eg, now I can focus some attention on "why precisely is this crossover point at 2 / where is that 2 coming from?". Thanks!

(And ofc there's still a question about why we use an Lp norm, and indeed why we pass our gaze along the walls in a way that factors through the shadow on that wall, but I am fairly happy rolling most of that into the "what are we to learn from the fact that reality has the quantum nature" bundle.)

A related thing that's special about the L2 norm is that there's a bilinear form ⟨·,·⟩ such that |v| carries the same information as ⟨v,v⟩.

"Ok, so what? Can't you do the same thing with any integer n, with an n-linear form?" you might reasonably ask. First of all, not quite: it only works for the even integers, because otherwise you need to use absolute value*, which isn't linear.

But the bilinear forms really are the special ones, roughly speaking because they are a similar type of object to ... (read more)
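The even/odd point above is easy to check concretely. A minimal sketch (my own illustration, not part of the original comment): for p = 2 the bilinear form u·v recovers the norm with no absolute values, while for p = 3 the naive "3-linear" expression sum(v_i³) disagrees with the L3 norm cubed as soon as an entry is negative.

```python
import numpy as np

# |v|^2 comes from a bilinear form: <u, v> = u . v, so <v, v> = |v|^2.
v = np.array([3.0, -4.0])

# p = 2: the bilinear form recovers the squared norm with no abs().
assert np.isclose(v @ v, np.linalg.norm(v, 2) ** 2)  # both equal 25.0

# p = 3: the naive trilinear expression needs absolute values to match.
naive = np.sum(v ** 3)             # 27 - 64 = -37 (no abs, "linear" in v)
true = np.linalg.norm(v, 3) ** 3   # |3|^3 + |-4|^3 = 91
assert not np.isclose(naive, true)
```

This is exactly the failure of linearity the asterisked caveat is pointing at: for odd p you cannot drop the absolute values.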

I'm still mystified by the Born rule

When you're using TMs to approximate physics, you have to balance the continuity of physics against the discreteness of the machines somehow. The easy thing to do is to discuss the limiting behavior of a family of machines that perform the simulation at ever-finer fidelity. I was doing this implicitly, for lack of desire to get into details.

And as I've said above, I'm not attempting to suggest that these naive approaches -- such as sampling a single classical state and reporting the positions of some things with arbitrary fidelity in the limit -- are reaso... (read more)

I'm still mystified by the Born rule

Neat! I'd bet against that if I knew how :-) I expect UDASSA to look more like a red herring from the perspective of the future, with most of its answers revealed as wrong or not-even-wrong or otherwise rendered inapplicable by deep viewpoint shifts. Off the top of my head, a bet I might take is "the question of which UTM meta-reality uses to determine the simplicity of various realities was quite off-base" (as judged by, say, agreement of both EY and PC or their surrogates in 1000 subjective years).

In fact, I'm curious for examples of things that UDASSA s... (read more)

6evhub9moI don't think I would take that bet—I think the specific question of what UTM to use does feel more likely to be off-base than other insights I associate with UDASSA. For example, some things that I feel UDASSA gets right: a smooth continuum of happeningness that scales with number of clones/amount of simulation compute/etc., and simpler things being more highly weighted.
I'm still mystified by the Born rule

How could an SI compare a deterministic theory to a probabilistic one?

The deterministic theory gets log-probability proportional to -length + (0 if it was correct so far, else -infty); the probabilistic theory gets log-probability proportional to -length + log(probability it assigned to the observations so far).

That said, I was not suggesting a Solomonoff inductor in which some machines were outputting bits and others were outputting probabilities.

I suspect that there's a miscommunication somewhere up the line, and my not-terribly-charitable-guess is that it ... (read more)
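Reading that scoring multiplicatively (posterior weight proportional to 2^-length times the likelihood of the data so far, which I take to be the intent), a minimal sketch of my own, with made-up program lengths:

```python
import math

# Each hypothesis gets weight 2^-length * P(data | hypothesis).
# A deterministic theory assigns probability 1 while correct and 0 at
# its first error; a probabilistic theory multiplies in its assigned
# probability for each observation.
def weight(length_bits, log2_likelihood):
    return 2.0 ** (-length_bits + log2_likelihood)

n_observations = 4

# Deterministic theory (10 bits) that predicted every bit exactly:
det = weight(10, 0.0)  # likelihood 1, so log2-likelihood is 0

# Probabilistic theory (8 bits) that assigned p = 0.7 to each observed bit:
prob = weight(8, n_observations * math.log2(0.7))

# The shorter probabilistic theory starts ahead (2^-8 > 2^-10), but each
# observation multiplies its weight by 0.7 while the correct deterministic
# theory keeps all of its mass.
assert det == 2.0 ** -10
assert math.isclose(prob, (2.0 ** -8) * 0.7 ** 4)
```

Nothing here requires some machines to emit bits and others to emit probabilities; it's one scoring rule, with deterministic theories as the degenerate case of likelihood 0 or 1.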

I'm still mystified by the Born rule

Absolutely no effect does seem pretty counterintuitive to me, especially given that we know from QM that different levels of happeningness are at least possible.

I also have that counterintuition, fwiw :-p

I have the sense that you missed my point wrt UDASSA, fwiw. Having failed once, I don't expect I can transmit it rapidly via the medium of text, but I'll give it another attempt.

This is not going to be a particularly tight analogy, but:

Alice is confused about metaethics. Alice has questions like "but why are good things good?" and "why should we car... (read more)

6evhub9moYeah—I think I agree with what you're saying here. I certainly think that UDASSA still leaves a lot of things unanswered and seems confused about a lot of important questions (embeddedness, uncomputable universes, what UTM to use, how to specify an input stream, etc.). But it also feels like it gets a lot of things right in a way that I don't expect a future, better theory to get rid of—that is, UDASSA feels akin to something like Newtonian gravity here, where I expect it to be wrong, but still right enough that the actual solution doesn't look too different.
I'm still mystified by the Born rule

Yeah yeah, this is the problem I'm referring to :-)

I disagree that you must simulate collapse to solve this problem, though I agree that that would be one way to do it. (The way you get the right random numbers, fwiw, is from sample complexity -- SI doesn't put all its mass on the single machine that predicts the universe, it allocates mass to all machines that have not yet erred in proportion to their simplicity, so probability mass can end up on the class of machines, each individually quite complex, that describe QM and then hardcode the branch predictions. See also the proof about how the version of SI in which each TM outputs probabilities is equivalent to the version where they don't.)

3TAG9moIf your SI can't make predictions ITFP, that's rather beside the point. "Not erring" only has a straightforward implementation if you are expecting the predictions to be deterministic. How could an SI compare a deterministic theory to a probablistic one?
I'm still mystified by the Born rule


Also, thanks for all the resource links!

I'm still mystified by the Born rule

To be clear, the process that I'm talking about for turning a quantum state into a hypothesis is not intended to be a physical process (such as a measurement), it's intended to be a Turing machine (that produces output suitable for use by Solomonoff induction).

That said, to be clear, I don't think this is a fundamentally hard problem. My point is not "we have absolutely no idea how to do it", it's something more like "there's not a consensus answer here" + "it requires additional machinery above and beyond [the state vector + the born rule + your home addre... (read more)

1TAG9moThen you run into the basic problem of using SI to investigate MW: SI's are supposed to output a series of definite observations. They are inherently "single world" If the program running the SWE outputs information about all worlds on a single output tape, they are going to have to be concatenated or interleaved somehow. Which means that to make use of the information, you have to identify the subset if bits relating to your world. That's extra complexity which isn't accounted for because it's being done by hand, as it were. In particular, if you just model the wave function, the only results you will get represent every possible outcome. In order to match observation , you will have to keep discarding unobserved outcomes and renormalising as you do in every interpretation. It's just that that extra stage is performed manually, not by the programme. To get an output that matches one observers measurements, you would need to simulate collapse somehow. You could simulate collapse with a PRNG, but it won’t give you the right random numbers. Or you would need to keep feeding your observations back in so that the simulator can perform projection and renormalisation itself. That would work, but that's a departure from how SI's are supposed to work. Meta: trying to mechanise epistemology doesn't solve much , because mechanisms still have assumptions built into them.
I'm still mystified by the Born rule

I agree that the problem doesn't seem too hard, and that there are a bunch of plausible-seeming theories. (I have my own pet favorites.)

5Vanessa Kosoy9moI think that virtually every specialist would give you more or less the same answer as interstice, so I don't see why it's an open question at all. Sure, constructing a fully rigorous "eyeball operator" is very difficult, but defining a fully rigorous bridge rule in a classical universe would be very difficult as well. The relation to anthropics is more or less spurious IMO (MWI is just confused), but also anthropics is solvable using the infra-Bayesian approach to embedded agency. The real difficulty is understanding how to think about QM predictions about quantities that you don't directly observe but that your utility function depends on. However, I believe that's also solvable using infra-Bayesianism.
5interstice9moMy own most recent pet theory is that the process of branching is deeply linked to thermalization, so to find model systems we should look to things modeling the flow of heat/entropy -- e.g. a system coupled to two heat baths at different temperatures.
I'm still mystified by the Born rule

This is precisely the thought that caused me to put the word 'apparent' in that quote :-p. (In particular, I recalled the original UDASSA post asserting that it took that horn, and this seeming both damning-to-me and not-obviously-true-for-the-reason-you-state, and I didn't want to bog my comment down, so I threw in a hedge word and moved on.) FWIW I have decent odds on "a thicker computer (and, indeed, any number of additional copies of exactly the same em) has no effect", and that's more obviously in contradiction with UDASSA.

Although, that isn't the nam... (read more)

6evhub9moAbsolutely no effect does seem pretty counterintuitive to me, especially given that we know from QM that different levels of happeningness are at least possible. I think my answer here would be something like: the reason that UDASSA doesn't fully resolve the confusion here is that UDASSA doesn't exactly pick a horse in the race as much as it enumerates the space of possible horses, since it doesn't specify what UTM you're supposed to be using. For any (computable) tradeoff between “more copies = more happening” and “more copies = no impact” that you want, you should be able to find a UTM which implements that tradeoff. Thus, neither intuition really leaves satisfied, since UDASSA doesn't actually take a stance on how much each is right, instead just deferring that problem to figuring out what UTM is “correct.”
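evhub's point that UDASSA merely parameterizes the tradeoff can be made concrete with a toy prior (everything here — the bit counts and the log2(n) locating cost — is a stylized assumption, not a claim about the "correct" UTM):

```python
import math

def total_measure(base_bits, n_copies, locating_cost_bits=0.0):
    # UDASSA-style toy: a hypothesis of description length L gets
    # prior weight 2^-L. If picking out one of n identical copies costs
    # extra bits, per-copy measure shrinks accordingly.
    per_copy = 2.0 ** -(base_bits + locating_cost_bits)
    return n_copies * per_copy

# "more copies = more happening": copies cost nothing extra to locate.
print(total_measure(10, 4))                # -> 4 * 2**-10
# "more copies = no impact": locating one of n copies costs log2(n) bits.
print(total_measure(10, 4, math.log2(4)))  # -> 2**-10
```

Any computable interpolation between the two locating-cost schedules corresponds to some UTM, which is exactly the sense in which UDASSA defers rather than decides.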
I'm still mystified by the Born rule

I agree that the Born rule is just the poster child for the key remaining confusions (eg, I would have found it similarly natural to use the moniker "Hilbert space confusions").

I disagree about whether UDASSA contains much of the answer here. For instance, I have some probability on "physics is deeper than logic" being more-true-than-the-opposite in a way that ends up tossing UDASSA out the window somehow. For another instance, I weakly suspect that "running an emulation on a computer with 2x-as-thick wires does not make them twice-as-happening" is closer ... (read more)

6evhub9moI feel like I would be shocked if running a simulation on twice-as-thick wires made it twice as easy to specify you, according to whatever the “correct” UTM is. It seems to me like the effect there shouldn't be nearly that large.
I'm still mystified by the Born rule

I agree that there's a difference between "put a delta-spike on the single classical state you sampled" and "zero out amplitude on all states not consistent with the observation you got from your sample". I disagree that using the latter to generate a sensory stream from a quantum state yields reasonable predictions -- eg, taken literally I think you're still zeroing out all but a measure-zero subset of the position basis, and I expect the momenta to explode immediately. You can perhaps get this hypothesis (or the vanilla delta spike) hobbling by trying to... (read more)
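The "momenta explode" worry is just position-momentum uncertainty taken to its limit. In natural units, and for a minimum-uncertainty Gaussian packet (both simplifying assumptions on my part):

```python
HBAR = 1.0  # natural units; a simplifying assumption for illustration

def min_momentum_spread(sigma_x):
    # For a minimum-uncertainty Gaussian wavepacket, sigma_x * sigma_p
    # equals hbar / 2, so squeezing the position spread toward a delta
    # spike sends the momentum spread to infinity.
    return HBAR / (2.0 * sigma_x)

for sigma_x in (1.0, 1e-3, 1e-9):
    print(sigma_x, min_momentum_spread(sigma_x))
```

Zeroing out all but a measure-zero slice of the position basis is the sigma_x -> 0 limit of this, which is why the resulting "hypothesis" immediately predicts wild momenta.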

4AlexMennen9moThe observation you got from your sample is information. Information is entropy, and entropy is locally finite. So I don't think it's possible for the states consistent with the observation you got from your sample to have measure zero.
3TAG9moYou have been assuming that all measurements are in the position basis, which is wrong. In particular, spin is its own basis. If you make a sharp measurement in one basis, you have uncertainty or lack of information about the others. That does not mean the momentum is "randomised" in some catastrophic sense. The original position measurement was not deterministic, for one thing. It is true that delta functions can be badly behaved. It's also true that they can be used in practice ... if you are careful. They are not an argument against discarding-and-renormalising, because if you don't do that at all, you get much wronger results than the results you get by rounding off small values to zero, ie. using a delta to represent a sharp Gaussian. That might be the case if you were making an infinitely sharp measurement of an observable with a real valued spectrum, but there are no infinitely sharp measurements, and not every observable is real-valued.
Are the Born probabilities really that mysterious?

Warning: the post doesn't attempt to answer your question (ie, "can we reduce the Born rule to conservation of information?"). I don't know the answer to that. Sorry.

My guess is that a line can be drawn between the two; I'm uncertain how strong it can be made.

This may be just reciting things that you already know (or a worse plan than your current one), but in case not, the way I'd attempt to answer this would be:

  1. Solidly understand how to ground out the Born rule in the inner product. ( might be a good place to start if you
... (read more)
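Step 1's "ground out the Born rule in the inner product" has a direct numerical reading (the function and the example state are illustrative assumptions, not the grounding itself):

```python
def born_weight(basis_state, state):
    # Born rule via the inner product: Pr(e | psi) = |<e|psi>|^2,
    # where <e|psi> = sum_i conj(e_i) * psi_i.
    inner = sum(complex(b).conjugate() * complex(s)
                for b, s in zip(basis_state, state))
    return abs(inner) ** 2

# Equal superposition: each basis outcome carries weight ~1/2.
psi = [1 / 2 ** 0.5, 1 / 2 ** 0.5]
print(born_weight([1, 0], psi))
```

The open question is why *this* functional of the inner product, rather than some other, should govern anticipated experience.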
Are the Born probabilities really that mysterious?

I tried to commentate, and accidentally a whole post. Short version: I think one or two of the many mysteries people tend to find swirling around the Born rule are washed away by the argument you mention (regardless of how tight the analogy to Liouville's theorem), but some others remain (including the one that I currently consider central).

So8res' Shortform Feed

Crossposted from Twitter, might not engage much with comments on LW and I may or may not moderate replies.

PSA: In my book, everyone has an unlimited number of "I don't understand", "plz say that again in different words", "plz expand upon that", and "plz pause while I absorb that" tokens.

Possession of an unlimited number of such tokens (& their ilk) is one of your sacred rights as a fellow curious mind seeking to understand the world around you. Specifically, no amount of well-intentioned requests for clarification or thinking time will cause me to thi... (read more)

So8res' Shortform Feed

Crossposted from Twitter, might not engage much with comments on LW and I may or may not moderate replies.

Hypothesis: English is harmed by conventions against making up new plausible-sounding words, as this contributes to conventions like pretending numbers are names (System 1 deliberation; Type II errors) and naming things after people (Bayesian reasoning, Cartesian products).

I used to think naming great concepts after people was a bad idea (eg, "frequency decomposition" is more informative & less scary than "Fourier transformation"). I now suspect tha... (read more)

2ChristianKl10moI used to think that names like "System 1 deliberation" have to be bad. When writing the Living People policy for Wikidata I had to name two types of classes of privacy protection and wanted to avoid called them protection class I and protection class II. Looking back I think that was a mistake because people seem to misunderstand the terms in ways I didn't expect.
7Steven Byrnes10moFunny story is "Unscented Kalman Filter". The guy (Uhlmann) needed a technical term for the new Kalman Filter he had just invented, and it would be pretentious to call it an Uhlmann filter himself, so he looked around the room, saw an unscented deodorant on someone's desk, and went with that. Source
6luidic10moBoth "transistor" (transconductance and varistor) and "bit" (binary digit) come to mind as new technical words. Quoting from Jon Gertner's The Idea Factory. (Further examples of bad naming conventions:
So8res' Shortform Feed

Crossposted from Twitter, might not engage much with comments on LW and I may or may not moderate replies. 

Hypothesis: we're rapidly losing the cultural technology to put people into contact with new ideas/worldviews of their own volition, ie, not at the recommendation of a friend or acquaintance.

Related hypothesis: it's easier for people to absorb & internalize a new idea/worldview when the relationship between them and the idea feels private. Ex: your friend is pushing a viewpoint onto you, and you feel some social pressure to find at least one ... (read more)
