Steering towards world states, taken literally, for a realistic agent is impossible, because an embedded agent cannot even contain a representation of a detailed world-state.
I'm not imagining AI steering toward a full specification of a physical universe; I'm imagining it steering toward a set of possible worlds. Sets of possible worlds can often be fully understood by reasoners, because you don't need to model every world in the set in perfect detail in order to understand the set; you just need to understand at least one high-level criterion (or set of c...
The definition I give in the post is "AI that has the basic mental machinery required to do par-human reasoning about all the hard sciences". In footnote 3, I suggest the alternative definition "AI that can match smart human performance in a specific hard science field, across all the scientific work humans do in that field".
By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or i...
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn't imply any particular world-state to achieve.
If you look from the outside like you're competently trying to steer the world into states that will result in you getting more novel experience, then this is "goal-directed" in the sense I mean, regardless of why you're doing that.
If you (e.g.) look from the outside like you're selecting the local action that's least like the actions you've selected before, regardless o...
Dustin Moskovitz comments on Twitter:
The deployment problem is part of societal response to me, not separate.
[...] Eg race dynamics, regulation (including ability to cooperate with competitors), societal pressure on leaders, investment in watchdogs (human and machine), safety testing norms, whether things get open sourced, infohazards.
"The deployment problem is hard and weird" comes from a mix of claims about AI (AGI is extremely dangerous, you don't need a planet-sized computer to run it, software and hardware can and will improve and proliferate by defau...
Note that if it were costless to make the title way longer, I'd change this post's title from "AGI ruin mostly rests on strong claims about alignment and deployment, not about society" to the clearer:
The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society
One reason I like "the danger is in the space of action sequences that achieve real-world goals" rather than "the danger is in the space of short programs that achieve real-world goals" is that it makes it clearer why adding humans to the process can still result in the world being destroyed.
If powerful action sequences are dangerous, and humans help execute an action sequence (that wasn't generated by human minds), then it's clear why that is dangerous too.
If the danger instead lies in powerful "short programs", then it's more tempting to say "just don't ...
Thanks for the replies, Ryan!
I think the exact quantitative details make a big difference between "AGI ruin seems nearly certain in the absence of positive miracles" and "doom seems quite plausible, but we'll most likely make it through" (my probability of takeover is something like 35%).
I don't think that 'the very first STEM-level AGI is smart enough to destroy the world if you relax some precautions' and 'we have 2.5 years to work with STEM-level AGI before any system is smart enough to destroy the world' changes my p(doom) much at all. (Though this is ...
I think you should probably note where people (who are still sold on AI risk) often disagree.
If I had a list of 5-10 resources that folks like Paul, Holden, Ajeya, Carl, etc. see as the main causes for optimism, I'd be happy to link those resources (either in a footnote or in the main body).
I'd definitely include something like 'survey data on the same population as my 2021 AI risk survey, saying how much people agree/disagree with the ten factors', though I'd guess this isn't the optimal use of those people's time even if we want to use that time to surve...
So, this argument seems mostly circular
I don't think your claim makes the argument circular / question-begging; it just means there's an extra step in explaining why and how a random action sequence destroys the world.
Maybe you mean that I'm putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the 'instrumental convergence' danger? If so, what do you think that feature is?
From my current perspective I think the core problem really is that most random sh...
It's true that if humans were reliably very ambitious, consequentialist, and power-seeking, then this would be stronger evidence that superintelligent AI tends to be ambitious and power-seeking. So the absence of that evidence has to be evidence against "superintelligent AI tends to be ambitious and power-seeking", even if it's not a big weight in the scales.
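The step from "E would be evidence for H" to "the absence of E is evidence against H" can be spelled out with the law of total probability. This is a sketch, not something from the original comment: the symbols H (for "superintelligent AI tends to be ambitious and power-seeking") and E (for "humans are reliably very ambitious, consequentialist, and power-seeking") are labels introduced here for illustration, and it assumes 0 < P(E) < 1.

```latex
% H and E are illustrative labels, not notation from the original comment.
\begin{align*}
P(H) &= P(H \mid E)\,P(E) + P(H \mid \lnot E)\,P(\lnot E) \\
\Rightarrow\quad
P(H) - P(H \mid \lnot E) &= \frac{P(E)}{P(\lnot E)}\,\bigl(P(H \mid E) - P(H)\bigr)
\end{align*}
% If P(H | E) > P(H), the right-hand side is positive, so P(H | not-E) < P(H).
```

The size of the downward update depends on both P(E)/P(¬E) and on how strong the evidence E would have been, which is consistent with the update being real without being "a big weight in the scales".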
Also, per footnote 1: "I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others."
The original reason I wrote this was that Dustin Moskovitz wanted something like this, as an alternative to posts like AGI Ruin:
...[H]ave you tried making a layman's explanation of the case? Do you endorse the summary? I'm aware of much longer versions of the argument, but not shorter ones!
From my POV, a lot of the confusion is around the confidence level. Historically EY makes many arguments to e
Thanks for the feedback, John! I've moved the Aryeh/Eliezer exchange to a footnote, and I welcome more ideas for ways to improve the piece. (Folks are also welcome to repurpose anything I wrote above to create something new and more beginner-friendly, if you think there's a germ of a good beginner-friendly piece anywhere in the OP.)
Tagging @Richard_Ngo
Copying over a Twitter reply from Quintin Pope (which I haven't replied to, and which was responding to the wording of the Twitter draft of this post):
...I think your intuition about how SGD works is wildly wrong. E.g., SGD doesn't do anything like "randomly sample from the set of all low loss NN parameter configurations". https://arxiv.org/abs/2110.00683
Also, your point about human plans not looking like randomly sampled plans is a point against your intuition that multi-level search processes will tend to generate such plans.
Finally, I don't think it'
Quintin, in case you are reading this, I just wanna say that the link you give to justify
I think your intuition about how SGD works is wildly wrong. E.g., SGD doesn't do anything like "randomly sample from the set of all low loss NN parameter configurations". https://arxiv.org/abs/2110.00683
really doesn't do nearly enough to justify your bold "wildly wrong" claim. First of all, it's common for papers to overclaim, and this seems like the sort of paper that could turn out to be basically just flat wrong. (I lack the expertise to decide for myself, i...
I find Quintin's reply here somewhat unsatisfying, because I think it is too narrowly focused on current DL-paradigm methods and the artifacts they directly produce, without much consideration for how those artifacts might be composed and used in real systems. I attempted to describe my objections to this general kind of argument in a bit more detail here.
This post evolved from a Twitter thread I wrote two weeks ago. Copying over a Twitter reply by Richard Ngo (n.b. Richard was replying to the version on Twitter, which differed in lots of ways):
...Rob, I appreciate your efforts, but this is a terrible framing for trying to convey "the basics", and obscures way more than it clarifies.
I'm worried about agents which try to achieve goals. That's the core thing, and you're calling it a misconception?! That's blatantly false.
In my first Alignment Fundamentals class I too tried to convey all the nuances of my thinkin
I definitely agree with Richard that the post would probably benefit from more iteration with intended users, if new people are the audience you want to target. (In particular, I doubt that the section quoted from the Aryeh interview will clarify much for new people.)
That said, I definitely think that it's the right call to emphasize up-front that instrumental convergence is a property of problem-space rather than of agency. More generally: when there's a common misinterpretation, which very often ends up load-bearing, then it makes sense to address that u...
I think it's more likely that being conservative about impact would generate probabilities much less than 10%.
I don't know what you mean by "conservative about impact". The OP distinguishes three things:
It separately distinguishes these two things:
This is why I said in the post:
Some people do have confident beliefs that imply "things will go well"; I disagree there, but I expect some amount of disagreement like that.
... and focused on the many people who don't have a confident objection to nanotech.
I and others have given lots of clear arguments for why relatively early AGI systems will plausibly be vastly smarter than humans. Eric Drexler has given lots of clear arguments for why nanotechnology is probably fairly easy to build.
None of this constitutes a proof that early AGI systems will be able to ...
I figured this would be obvious enough, and both surveys discuss this issue; but phrasing things in a way that encourages keeping selection bias in mind does seem like a good idea to me. I've tweaked the phrasing to say "In a survey, X".
my sense is that most LW uses of "crux" are in the context of "double crux"
I think that's not true, and "crux" is mostly used for single cruxes.
It's often harder to tell whether something is a double crux, and in any case "double crux" mostly only makes sense when there are exactly two people in a conversation. In a ten-person Internet forum conversation where everyone has different views, it will be a lot harder to find a claim that would update everyone about the relevant proposition -- and it doesn't especially make sense to try.
...and that the term
I think it goes without saying that one can disagree with anything in the Sequences and can also be assumed to have read and understood it
This seems false as stated -- some nontrivial content in the Sequences consists of theorems.
More generally, there are some claims in the original Sequences that are false (so agreeing with the claim may be at least some evidence that you didn't understand it), some that I'd say "I think that's true, but reasonable people can definitely disagree", some where it's very easy for disagreement to update me toward "you didn't ...
(Meta: The TIME piece is paywalled in some countries, and is plastered with ads, so Eliezer wanted the text mirrored on the MIRI Blog. He also assented to my having the LW admins cross-post this here. This version adds some clarifying notes Eliezer wrote on Twitter regarding the article.)
Disagree-voted just because of the words "I'm certain that the reason...". I'd be much less skeptical of "I'm pretty dang sure that the reason..." or at the very least "I'm certain that an important contributing factor was..."
(But even the latter seems pretty hard unless you have a lot of insider knowledge from talking to the people who made the decision at DeepMind, along with a lot of trust in them. E.g., if it did turn out that DeepMind was trying to reduce AI hype, then they might have advertised a result less if they thought it were a bigger deal. I don't know this to be so, but it's an example of why I raise an eyebrow at "I'm certain that the reason".)
Or just promising the human some money, with the sequence of actions set up to obscure that anything important is happening. (E.g., you can use misdirection like 'the actually important event that occurred was early in the process, when you opened a test tube to add some saline and thereby allowed the contents of the test tube to start propagating into the air; the later step where you mail the final product to an address you were given, or record an experimental result in a spreadsheet and email the spreadsheet to your funder, doesn't actually matter for the plan'.)
You have to weigh the conjunctive aspects of particular plans against the disjunctiveness of 'there are many different ways to try to do this, including ways we haven't thought of'.
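A rough way to see the conjunctive/disjunctive trade-off is to put toy numbers on it. The sketch below is only an illustration: the probabilities are invented, the function names are hypothetical, and it assumes steps within a plan and the plans themselves are statistically independent, which real plans need not be.

```python
from math import prod


def conjunctive_success(step_probs):
    """Chance that every step of one plan succeeds (steps assumed independent)."""
    return prod(step_probs)


def disjunctive_success(plan_probs):
    """Chance that at least one plan succeeds (plans assumed independent)."""
    return 1 - prod(1 - p for p in plan_probs)


# Hypothetical numbers: one plan with five steps, each a coin flip.
single_plan = conjunctive_success([0.5] * 5)

# Hypothetical numbers: fifty distinct plans, each as unlikely as the one above.
many_plans = disjunctive_success([single_plan] * 50)

print(f"one specific plan succeeds: {single_plan:.3f}")
print(f"at least one of fifty succeeds: {many_plans:.3f}")
```

Under these made-up assumptions, any one specific plan is unlikely to work while the disjunction over many plans is likely to; the point is the shape of the calculation, not the particular values.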
Until this week, all of this was [...] unknown to anyone who could plausibly claim to be a world leader.
I don't think this is known to be true.
In fact they had no idea this debate existed.
That seems too strong. Some data points:
1. There's been lots of AI risk press over the last decade. (E.g., Musk and Bostrom in 2014, Gates in 2015, Kissinger in 2018.)
2. Obama had a conversation with WIRED regarding Bostrom's Superintelligence in 2016, and his administration cited papers by MIRI and FHI in a report on AI the same year. Quoting that report:
...General AI (some
in general I think the trend of alignment is positive. We haven't solved the problems, but we're quite a bit closer to the solution than 10 years ago.
I mean, I could agree with those two claims but think the trendlines suggest we'll have alignment solved in 200 years and superintelligent capabilities in 14 years. I guess it depends on what you mean by "quite a bit closer"; I think we've written up some useful semiformal descriptions of some important high-level aspects of the problem (like 'Risks from Learned Optimization'), but this seems very far from 'th...
Capabilities Researcher: *repeatedly shooting himself in the foot, reloading his gun, shooting again* "Wow, it sure is a shame that my selfish incentives aren't aligned with the collective good!" *reloads gun, shoots again*
Classical prisoner's dilemma, where individuals receive the greatest payoffs if they betray the group rather than cooperate.
In this case, "defecting" gives lower payoffs to the defector -- you're shooting yourself in the foot and increasing the risk that you die an early death.
The situation is being driven mostly by information asymmetries (not everyone appreciates the risks, or is thinking rationally about novel risks as a category), not by deep conflicts of interest. Which makes it doubly important not to propagate the meme that this is a prisoner's dilemma: one of the ways people end up with a false belief about this is exactly that people round this situation off to a PD too often!
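To make the distinction concrete, here is a minimal sketch of the usual textbook condition for a two-player prisoner's dilemma (temptation > reward > punishment > sucker's payoff), checked against two invented payoff tables. The function name and all payoff numbers are hypothetical; the only thing taken from the comment above is the claim that "defecting" lowers the defector's own payoff.

```python
def is_prisoners_dilemma(temptation, reward, punishment, sucker):
    """Textbook PD ordering: defecting against a cooperator pays best (temptation),
    mutual cooperation (reward) beats mutual defection (punishment), and
    cooperating against a defector pays worst (sucker)."""
    return temptation > reward > punishment > sucker


# Hypothetical textbook-style payoffs: defecting is individually tempting.
classic = dict(temptation=5, reward=3, punishment=1, sucker=0)

# Hypothetical "shooting yourself in the foot" payoffs: defecting lowers your
# own payoff whatever the other player does (2 < 3 and 0 < 1).
foot_gun = dict(temptation=2, reward=3, punishment=0, sucker=1)

print(is_prisoners_dilemma(**classic))
print(is_prisoners_dilemma(**foot_gun))
```

In the second table there is no selfish incentive to defect at all, so calling the situation a prisoner's dilemma would misdescribe it.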
I agree with this. I find it very weird to imagine that "10% x-risk this century" versus "90% x-risk this century" could be a crux here. (And maybe it's not, and people with those two views in fact mostly agree about governance questions like this.)
Something I wouldn't find weird is if specific causal models of "how do we get out of this mess" predict more vs. less utility for state interference. E.g., maybe you think 10% risk is scarily high and a sane world would respond to large ML training runs way more aggressively than it responds to nascent nuclear programs, but you also note that the world is not sane, and you suspect that government involvement will just make the situation even worse in expectation.
I think that Eliezer (and many others including myself!) may be susceptible to "living in the should-universe"
That's a new one!
More seriously: Yep, it's possible to be making this error on a particular dimension, even if you're a pessimist on some other dimensions. My current guess would be that Eliezer isn't making that mistake here, though.
For one thing, the situation is more like "Eliezer thinks he tried the option you're proposing for a long time and it didn't work, so now he's trying something different" (and he's observed many others trying other thi...
The verbatim statement is:
...We have people in crypto who are good at breaking things, and they're the reason why anything is not on fire. And some of them might go into breaking AI systems instead, 'cause that's where you learn anything.
You know, you know, any fool can build a crypto system that they think will work. Breaking existing crypto systems -- cryptographical systems -- is how we learn who the real experts are. So maybe the people finding weird stuff to do with AIs, maybe those people will come up with some truth about these systems that m
I'm happy you linkposted this so people could talk about it! The transcript above is extremely error-laden, though, to the extent I'm not sure there's much useful signal here unless you read with extreme care?
I've tried to fix the transcription errors, and posted a revised version at the bottom of this post (minus the first 15 minutes, which are meta/promotion stuff for Bankless). I vote for you copying over the Q&A transcript here so it's available both places.
Do you know of any arguments with a similar style to The Most Important Century that are as pessimistic as EY/MIRI folks (>90% probability of AGI within 15 years)?
Wait, what? Why do you think anyone at MIRI assigns >90% probability to AGI within 15 years? That sounds wildly too confident to me. I know some MIRI people who assign 50% probability to AGI by 2038 or so (similar to Ajeya Cotra's recently updated view), and I believe Eliezer is higher than 50% by 2038, but if you told me that Eliezer told you in a private conversation "90+% within 15 years"...
Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.
Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because ...
Gratitude to Andrea_Miotti, remember, and vonk for posting more-timely transcripts of this so LW could talk about it at the time -- and for providing a v1 transcript to give me a head start.
Here's a small sample of the edits I made to the previous Bankless transcript on LW, focusing on ones where someone may have come away from the original transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to s...
But this seems to contradict the element of Non-Deception. If you're not actually on the same side as the people who disagree with you, why would you (as a very strong but defeasible default) role-play otherwise?
This is a good question!! Note that in the original footnote in my post, "on the same side" is a hyperlink going to a comment by Val:
..."Some version of civility and/or friendliness and/or a spirit of camaraderie and goodwill seems like a useful ingredient in many discussions. I'm not sure how best to achieve this in ways that are emotionally hon
Note that in the original footnote in my post, "on the same side" is a hyperlink going to a comment by Val
Thanks for pointing this out. (I read Val's comment while writing my post, but unfortunately neglected to add the hyperlink when pasting the text of the footnote into my draft.) I have now edited the link into my post.
the goal isn't to trick people into thinking your disagreements are small, it's to make typical disagreements feel less like battles between warring armies
I think the fact that disagreements often feel like battles between warring ...
But why should we err at all? Should we not, rather, use as many carrots and sticks as is optimal?
"Err on the side of X" here doesn't mean "prefer erring over optimality"; it means "prefer errors in direction X over errors in the other direction". This is still vague, since it doesn't say how much to care about this difference; but it's not trivial advice (or trivially mistaken).
so when I see the brand name being used to market a particular set of discourse norms without a clear explanation of how these norms are derived from the law, that bothers me enough to quickly write an essay or two about it
Seems great to me! I share your intuition that Goodwill seems a bit odd to include. I think it's right to push back on proposed norms like these and talk about how justified they are, and I hope my list can be the start of a conversation like that rather than the end.
I do have an intuition that Goodwill, or something similar to Goodwill,...
Basically the fact LW has far more arguments for "alignment will be hard" compared to alignment being easy is the selection effect I'm talking about.
That could either be 'we're selecting for good arguments, and the good arguments point toward alignment being hard', or it could be a non-epistemic selection effect.
Why do you think it's a non-epistemic selection effect? It's easier to find arguments for 'the Earth is round' than 'the Earth is flat', but that doesn't demonstrate a non-epistemic bias.
...I was also worried because ML people don't really think that
... By 'an Aumann sense' do you just mean 'if you know nothing about a brain, then knowing it believes P is some Bayesian evidence for the truth of P'? That seems like a very weird way to use "Aumann", but if that's what you mean then sure. It's trivial evidence to anyone who's spent much time poking at the details, but it's evidence.
Basically, it means that the fact that other smart people working in ML/AI don't agree with LW is itself evidence that LW is wrong, since rational reasoners updating towards the same priors should see disagreements lesse...
I think a more likely thing we'd want to stick around to do in that world is 'try to accelerate humanity to AGI ASAP'. "Sufficiently advanced AGI converges to human-friendly values" is weaker than "AGI will just have human-friendly values by default".
I was surprised at how low the hour estimates were, particularly for the OP people (especially Holden) and even for Paul.
Maybe worth keeping in mind that Nate isn't the only MIRI person who's spent lots of hours on this (e.g., Eliezer and Benya have as well), and the numbers only track Nate-time.
Also maybe worth keeping in mind the full list of things that need doing in the world. This is one of the key important leveraged things that needs doing, so it's easy to say "spend more time on it". But spending a thousand hours (so, like, a good chunk of a year w...
Even when these discussions don't produce agreement, do you think they're helpful for the community?
I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants).
IMO having and releasing those dialogues was one of the most obviously useful things MIRI has done to date, and I'm super happy with them....
3. Will continue to exist regardless of how well you criticize any one part of it.
Depending on what you mean by "any one part of it", I think 3 is false. E.g., a sufficiently good critique of "AGI won't just have human-friendly values by default" would cause MIRI to throw a party and close up shop.
I've rewritten this post for the EA Forum, to help introduce more EAs to rationalist culture and norms. The rewrite goes into more detail about a lot of the points, explaining jargon, motivating some of the less intuitive norms, etc. I expect some folks will prefer that version, and some will prefer the LW version.
(One shortcoming of the EA Forum version is that it's less concise. Another shortcoming is that there's more chance I got stuff wrong, since I erred on the side of "spell things out more in the hope of conveying more of the spirit to people who a...
Basically: whether something is good or bad, enjoyable or unpleasant, desirable or undesirable, interesting or boring, etc. It's the aspect of experience that evaluates some things as better or worse to varying degrees and in various respects.
How about "hurting a person or diminishing their credibility, or the credibility of their argument, without using a rational argument"?
"Hurting a person" still seems too vague to me (sometimes people are "hurt" just because you disagreed with them on a claim of fact), "Diminishing... the credibility of their argument, without using a rational argument" sounds similar to "using symmetric weapons" to me (but the latter strikes me as more precise and general: don't try to persuade people via tools that aren't Bayesian evidence for the truth of the thing you'r...
You may not feel bad about mockery (I don't generally do so either), but do you think it reflects well on you as a rationalist?
I like this example! I do indeed share the intuition "mocking Time Cube guy on Twitter doesn't reflect well on me as a rationalist". It also just seems mean to me.
I think part of what's driving my intuition here, though, is that "mocking" sounds inherently mean-spirited, and "on Twitter" makes it sound like I'm writing the sort of low-quality viral personal attack that's common on Twitter.
"Make a light-hearted reference to Time Cub...
I partly answered that here, and I'll edit some of this into the post: