While reading Eliezer's recent AGI Ruin post, I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them for a number of reasons:

  • I have a very conflict-avoidant personality and I don't want to risk Eliezer or someone else yelling at me;
  • I get easily intimidated by people with strong personalities, and Eliezer... well, he can be intimidating;
  • I don't want to appear dumb or uninformed (even if I am in fact relatively uninformed, hence me wanting to ask the question!);
  • I feel like there's an expectation that I would need to do a lot of due diligence before writing any sort of question, and I don't have the time or energy at the moment to do that due diligence.

So, since I'm probably not the only one who feels intimidated about asking these kinds of questions, I am putting up this thread as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI safety discussion, but which until now they've been too intimidated, embarrassed, or time-limited to ask.

I'm also hoping that this thread can serve as a FAQ on the topic of AGI safety. As such, it would be great to add in questions that you've seen other people ask, even if you think those questions have been adequately answered elsewhere. [Notice that you now have an added way to avoid feeling embarrassed by asking a dumb question: For all anybody knows, it's entirely possible that you are literally asking for someone else! And yes, this was part of my motivation for suggesting the FAQ style in the first place.]

Guidelines for questioners:

  • No extensive previous knowledge of AGI safety is required. If you've been hanging around LessWrong for even a short amount of time then you probably already know enough about the topic to meet any absolute-bare-minimum previous knowledge requirements I might have suggested. I will include a subthread or two asking for basic reading recommendations, but these are not required reading before asking a question. Even extremely basic questions are allowed!
  • Similarly, you do not need to do any due diligence to try to find the answer yourself before asking the question.
  • Also feel free to ask questions that you're pretty sure you know the answer to yourself, but where you'd like to hear how others would answer the question.
  • Please separate different questions into individual comments, although if you have a set of closely related questions that you want to ask all together that's fine.
  • As this is also intended to double as a FAQ, you are encouraged to ask questions that you've heard other people ask, even if you yourself think there's an easy answer or that the question is misguided in some way. You do not need to mention as part of the question that you think it's misguided, and in fact I would encourage you not to write this so as to keep more closely to the FAQ style.
  • If you have your own (full or partial) response to your own question, it would probably be best to put that response as a reply to your original question rather than including it in the question itself. Again, I think this will help keep more closely to an FAQ style.
  • Keep the tone of questions respectful. For example, instead of, "I think AGI safety concerns are crazy fearmongering because XYZ", try reframing that as, "but what about XYZ?" Actually, I think questions of the form "but what about XYZ?" or "but why can't we just do ABC?" are particularly great for this post, because in my experience those are exactly the types of questions people often ask when they learn about AGI Safety concerns.
  • Follow-up questions have the same guidelines as above, so if someone answers your question but you're not sure you fully understand the answer (or if you think the answer wouldn't be fully understandable to someone else) then feel free and encouraged to ask follow-up potentially-dumb questions to make sure you fully understand the answer.
  • Remember, if something is confusing to you then it's probably confusing to other people as well. If you ask the question and someone gives a good response, then you are likely doing lots of other people a favor!

Guidelines for answerers:

  • This is meant to be a safe space for people to ask potentially dumb questions. Insulting or denigrating responses are therefore obviously not allowed here. Also remember that due diligence is not required for these questions, so do not berate questioners for not doing enough due diligence. In general, keep your answers respectful and assume that the questioner is asking in good faith.
  • Direct answers / responses are generally preferable to just giving a link to something written up elsewhere, but on the other hand giving a link to a good explanation is better than not responding to the question at all. Or better still, summarize or give a basic version of the answer, and also include a link to a longer explanation.
  • If this post works as intended then it may turn out to be a good general FAQ-style reference. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.

Finally: Please think very carefully before downvoting any questions, and lean very heavily on the side of not doing so. This is supposed to be a safe space to ask dumb questions! Even if you think someone is almost certainly trolling or the like, I would say that for the purposes of this post it's almost always better to apply a strong principle of charity and think maybe the person really is asking in good faith and it just came out wrong. Making people feel bad about asking dumb questions by downvoting them is the exact opposite of what this post is all about. (I considered making a rule of no downvoting questions at all, but I suppose there might be some extraordinary cases where downvoting might be appropriate.)


537 comments

Why do we assume that any AGI can meaningfully be described as a utility maximizer?

Humans are some of the most intelligent structures that exist, and we don't seem to fit that model very well. In fact, it seems the entire point of Rationalism is to improve our ability to do this, which has been achieved with only mixed success.

Organisations of humans (e.g. USA, FDA, UN) have even more computational power and don’t seem to be doing much better.

Perhaps intelligences (artificial or natural) cannot necessarily, or even typically, be described as optimisers? Instead we could only model them as algorithms, or as collections of tools/behaviours executed in some pattern.

An AGI that was not a utility maximizer would make more progress towards whatever goals it had if it modified itself to become a utility maximizer.  Three exceptions are if (1) the AGI has a goal of not being a utility maximizer, (2) the AGI has a goal of not modifying itself, (3) the AGI thinks it will be treated better by other powerful agents if it is not a utility maximizer.

6 · Amadeus Pagel · 1y
Would humans, or organizations of humans, make more progress towards whatever goals they have if they modified themselves to become utility maximizers? If so, why don't they? If not, why would an AGI? What would it mean to modify oneself to become a utility maximizer? What would it mean for the US, for example? The only meaning I can imagine is that one individual - for the sake of argument we assume that this individual is already a utility maximizer - enforces his will on everyone else. Would that help the US make more progress towards its goals? Do countries that are closer to utility maximizers, like North Korea, make more progress towards their goals?
A human seeking to become a utility maximizer would read LessWrong and try to become more rational. Groups of people are not utility maximizers, as their collective preferences might not even be transitive. If the goal of North Korea is to keep the Kim family in power, then the country being a utility maximizer does seem to help.
A human who wants to do something specific would be far better off studying and practicing that thing than generic rationality.
This depends on how far outside that human's current capabilities, and that human's society's state of knowledge, that thing is. For playing basketball in the modern world, sure, it makes no sense to study physics and calculus, it's far better to find a coach and train the skills you need. But if you want to become immortal and happen to live in ancient China, then studying and practicing "that thing" looks like eating specially-prepared concoctions containing mercury and thereby getting yourself killed, whereas studying generic rationality leads to the whole series of scientific insights and industrial innovations that make actual progress towards the real goal possible. Put another way: I think the real complexity is hidden in your use of the phrase "something specific." If you can concretely state and imagine what the specific thing is, then you probably already have the context needed for useful practice. It's in figuring out that context, in order to be able to so concretely state what more abstractly stated 'goals' really imply and entail, that we need more general and flexible rationality skills.
If you want to be good at something specific that doesn't exist yet, you need to study the relevant area of science, which is still more specific than rationality.
Assuming the relevant area of science already exists, yes. Recurse as needed, and there is some level of goal for which generic rationality is a highly valuable skillset. Where that level is depends on personal and societal context.
That's quite different from saying rationality is a one size fits all solution.
Efficiency at utility maximisation, like any other kind of efficiency, relates to available resources. One upshot of that is that an entity might already be doing as well as it realistically can, given its resources. Another is that humans don't necessarily benefit from rationality training, as also suggested by the empirical evidence. Edit: Another is that a resource-rich but inefficient entity can beat a small efficient one, so efficiency, AKA utility maximisation, doesn't always win out.
1 · Jeff Rose · 1y
When you say the AGI has a goal of not modifying itself, do you mean that the AGI has a goal of not modifying its goals?  Because that assumption seems to be fairly prevalent.  
I meant "not modifying itself" which would include not modifying its goals if an AGI without a utility function can be said to have goals.

This is an excellent question.  I'd say the main reason is that all of the AI/ML systems that we have built to date are utility maximizers; that's the mathematical framework in which they have been designed.  Neural nets / deep-learning work by using a simple optimizer to find the minimum of a loss function via gradient descent.  Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a "fitness function".  We don't know of any other way to build systems that learn.
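The loss-minimization framing above can be sketched in a few lines. This is a purely illustrative toy (a hand-coded quadratic loss and its gradient, not any particular system):

```python
# Toy illustration of "learning = minimizing a loss function":
# gradient descent on a quadratic loss whose minimum is at w = 3.
# The loss, learning rate, and step count are arbitrary choices.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # d(loss)/dw

w = 0.0  # initial parameter
for _ in range(100):
    w -= 0.1 * grad(w)  # step downhill along the gradient

print(round(w, 4))  # converges toward 3.0
```

Neural-net training is this same loop, just with millions of parameters and a loss defined over training data.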

Humans themselves evolved to maximize reproductive fitness. That is our primary fitness function, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness. Our desires for love, friendship, happiness, etc. fall into this category. Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain/pleasure/satisfaction/loneliness etc. These secondary functions may or may not remain aligned with the primary fitness function, which is why practitioners sometimes talk about "mesa-optimizers" or "inner vs. outer alignment."

4 · mako yass · 1y
Agreed. Humans are constantly optimizing a reward function, but it sort of 'changes' from moment to moment in a near-focal way, so it often looks irrational or self-defeating; once you know what the reward function is, though, the goal-directedness is easy to see too. Sune seems to think that humans are more intelligent than they are goal-directed. I'm not sure this is true; human truthseeking processes seem about as flawed and limited as their goal-pursuit. Maybe you can argue that humans are not generally intelligent or rational, but I don't think you can justify setting the goalposts so that they're one of those things and not the other. You might be able to argue that human civilization is intelligent but not rational, and that functioning AGI will be more analogous to ecosystems of agents than to one unified agent. If you can argue for that, that's interesting, but I don't know where to go from there. Civilizations tend towards increasing unity over time (the continuous reduction in energy wasted on conflict). I doubt that the goals they converge on together will be a form of human-favoring altruism; I haven't seen anyone try to argue for that in a rigorous way.
6 · Amadeus Pagel · 1y
Doesn't this become tautological? If the reward function changes from moment to moment, then the reward function can just be whatever explains the behaviour.
2 · mako yass · 1y
Since everything can fit into the "agent with utility function" model given a sufficiently crumpled utility function, I guess I'd define "is an agent" as "goal-directed planning is useful for explaining a large enough part of its behavior." This includes humans while excluding bacteria. (Hmm, unless, like me, one knows so little about bacteria that it's better to just model them as weak agents. Puzzling.)
1 · DeLesley Hutchins · 1y
On the other hand, the development of religion, morality, and universal human rights also seem to be a product of civilization, driven by the need for many people to coordinate and coexist without conflict. More recently, these ideas have expanded to include laws that establish nature reserves and protect animal rights.  I personally am beginning to think that taking an ecosystem/civilizational approach with mixture of intelligent agents, human, animal, and AGI, might be a way to solve the alignment problem.
Does the inner / outer distinction complicate the claim that all current ML systems are utility maximizers? The gradient descent algorithm performs a simple kind of optimization in the training phase. But once the model is trained and in production, it doesn't seem obvious that the "utility maximizer" lens is always helpful in understanding its behavior.
6 · Yonatan Cale · 1y
(I assume you are asking "why do we assume the agent has a coherent utility function" rather than "why do we assume the agent tries maximizing their utility"?)

Agents like humans which don't have such a nice utility function:

  1. Are vulnerable to money pumping.
  2. Can notice that problem and try to repair themselves.
  3. Note that humans do in practice try to repair ourselves, like smashing down our own emotions in order to be more productive. But we don't have access to our source code, so we're not so good at it.

I think that if an AI can't repair that part of itself and is still vulnerable to money pumping, then it's not the AGI we're afraid of.
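To make the money-pumping point concrete, here's a hypothetical sketch: an agent with cyclic (intransitive) preferences can be led around a trade cycle, paying a small fee at each step, and end up holding exactly what it started with, only poorer. The items and fee are invented for illustration:

```python
# Cyclic preferences: A preferred over B, B over C, C over A.
# No utility function can represent this, and it is exploitable.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (X, Y) means X preferred over Y

def accepts(offered, held):
    # The agent trades whenever it prefers the offered item to what it holds.
    return (offered, held) in prefers

holding, wealth = "A", 10
for offered in ["C", "B", "A"]:  # each offer is the item preferred over the current one
    if accepts(offered, holding):
        holding, wealth = offered, wealth - 1  # pays a fee of 1 per trade

print(holding, wealth)  # back to "A", but 3 units poorer
```

Run the loop again and the agent loses another 3; a coherent (transitive) preference ordering is exactly what blocks this exploit.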
1 · Yonatan Cale · 1y
Adding: My opinion comes from this MIRI/Yudkowsky talk (linked at the relevant place); he speaks about this in the next 10-15 minutes or so of the video.
Excellent question! I've added a slightly reworded version of this to Stampy (focusing on superintelligence rather than AGI, as it's pretty likely that we can get weak AGI which is non-maximizing, based on progress in language models).

AI subsystems or regions in gradient descent space that more closely approximate utility maximizers are more stable, and more capable, than those that are less like utility maximizers. Having more agency is a convergent instrumental goal and a stable attractor which the random walk of updates and experiences will eventually stumble into.

The stability is because utility-maximizer-like systems which have control over their development would lose utility if they allowed themselves to develop into non-utility-maximizers, so they tend to use their available optimization power to avoid that change (a special case of goal stability). The capability is because non-utility-maximizers are exploitable, and because agency is a general trick which applies to many domains, so might well arise naturally when training on some tasks.

Humans and systems made of humans (e.g. organizations, governments) generally have neither the introspective ability nor the self-modification tools needed to become reflectively stable, but we can reasonably predict that in the long run highly capable systems will have these properties. They can then fix in and optimize for their values.
You're right that not every conceivable general intelligence is built as a utility maximizer. Humans are an example of this. One problem is, even if you make a "weak" form of general intelligence that isn't trying particularly hard to optimize anything, or a tool AI, eventually someone at FAIR will make an agentic version that does in fact directly try to optimize Facebook's stock market valuation.

Do not use FAIR as a symbol of villainy. They're a group of real, smart, well-meaning people who we need to be capable of reaching, and who still have some lines of respect connecting them to the alignment community. Don't break them.

Can we control the blind spots of the agent? For example, I could imagine that we could make a very strong agent that is able to explain acausal trade but unable to (deliberately) participate in any acausal trades, because of the way it understands counterfactuals. Could it be possible to create AI with similar minor weaknesses?
Probably not, because it's hard to get a general intelligence to make consistently wrong decisions in any capacity - partly because, like you or me, it might realize that it has a design flaw and work around it. A better plan is just to explicitly bake corrigibility guarantees (i.e. the stop button) into the design. Figuring out how to do that is the hard part, though.
For one, I don't think organizations of humans, in general, do have more computational power than the individual humans making them up. I mean, at some level, yes, they obviously do in an additive sense, but that power consists of human nodes, each not devoting their full power to the organization because they're not just drones under centralized control, and with only low-bandwidth and noisy connections between the nodes. The organization might have a simple officially stated goal written on paper and spoken by the humans involved, but the actual incentive structure and selection pressure may not allow the organization to actually focus on the official goal. I do think, in general, there is some goal an observer could usefully say these organizations are, in practice, trying to optimize for, and some other set of goals each human in them is trying to optimize for.

I don't think the latter sentence distinguishes 'intelligence' from any other kind of algorithm or pattern, and I think that's an important distinction. There's a lot of past posts explaining how an AI doesn't have code, like a human holding instructions on paper, but rather is its code. I think you can make the same point within a human: a human has lots of tools/behaviors, which it will execute in some pattern given a particular environment, and the instructions we consciously hold in mind are only one part of what determines that pattern.

I contain subagents with divergent goals, some of which are smarter and have greater foresight and planning than others, and those aren't always the ones that determine my immediate actions. As a result, I do a much poorer job optimizing for what the part-of-me-I-call-"I" wants my goals to be than I theoretically could. That gap is decreasing over time as I use the degree of control my intelligence gives me to gradually shape the rest of myself. It may never disappear, but I am much more goal-directed now than I was 10 years ago, or as a child. In other wor

I'm an ML engineer at a FAANG-adjacent company - big enough to train our own sub-1B-parameter language models fairly regularly. I work on training some of these models and finding applications of them in our stack. I've seen the light after reading most of Superintelligence, and I feel like I'd like to help out somehow.

I'm in my late 30s with kids, and live in the SF bay area. I kinda have to provide for them, don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to do distributed systems or something. I'm a former applied mathematician, with a PhD, so ML was a natural fit. I like to think I have a decent grasp on epistemics, but haven't gone through the sequences.

What should someone like me do? Some ideas: (a) keep doing what I'm doing, staying up to date but at least not at the forefront; (b) make time to read more material here and post randomly; (c) maybe try to apply to Redwood or Anthropic... though dunno if they offer equity (doesn't hurt to find out though); (d) try to deep dive on some alignment sequence on here.

Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 Hours and know of, but haven't used, AISS. This post is in a personal capacity; I'm just flagging publicly available information rather than giving an insider take.

You might want to consider registering for the AGI Safety Fundamentals Course (or reading through the content). The final project provides a potential way of dipping your toes into the water.

7 · Adam Jermyn · 1y
Applying to Redwood or Anthropic seems like a great idea. My understanding is that they're both looking for aligned engineers and scientists and are both very aligned orgs. The worst case seems like they (1) say no or (2) don't make an offer that's enough for you to keep your lifestyle (whatever that means for you). In either case you haven't lost much by applying, and you definitely don't have to take a job that puts you in a precarious place financially.
Pragmatic AI Safety (link: pragmaticaisafety.com) is supposed to be a good sequence for helping you figure out what to do. My best advice is to talk to some people here who are smarter than me and make sure you understand the real problems, because the most common outcome besides reading a lot and doing nothing is to do something that feels like work but isn't actually working on anything important.
Work your way up the ML business hierarchy to the point where you are having conversations with decision makers. Try to convince them that unaligned AI is a significant existential risk. A small chance of you succeeding at this will, in expected-value terms, more than make up for any harm you cause by working in ML, given that if you left the field someone else would take your job.
5 · Linda Linsefors · 1y
Given where you live, I recommend going to some local LW events. There are still LW meetups in the Bay area, right?
3 · Adrià Garriga-alonso · 1y
You should apply to Anthropic. If you're writing ML software at a semi-FAANG, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers The compensation is definitely enough to take care of your family and then save some money!
One of the paths which has non-zero hope in my mind is building a weakly aligned non-self improving research assistant for alignment researchers. Ought and EleutherAI's #accelerating-alignment are the two places I know who are working in this direction fairly directly, though the various language model alignment orgs might also contribute usefully to the project.
1Yonatan Cale1y
Anthropic offers equity; they can give you more details in private. I recommend applying to both (it's a cheap move with a lot of potential upside) - let me know if you'd like help connecting to any of them. If you learn by yourself, I'd totally get one-on-one advice (others linked); people will make sure you're on the best path possible.

This is a meta-level question:

The world is very big and very complex, especially if you take into account the future. In the past it has been hard to predict what happens in the future; I think most predictions about the future have failed. Artificial intelligence as a field is very big and complex, at least that's how it appears to me personally. Eliezer Yudkowsky's brain is small compared to the size of the world; all the relevant facts about AGI x-risk probably don't fit into his mind, nor do I think he has the time to absorb them all. Given all this, how can you justify the level of certainty in Yudkowsky's statements, instead of being more agnostic?

My model of Eliezer says something like this:

AI will not be aligned by default, because AI alignment is hard and hard things don't spontaneously happen. Rockets explode unless you very carefully make them not do that. Software isn't automatically secure or reliable, it takes lots of engineering effort to make it that way.

Given that, we can presume there needs to be a specific example of how we could align AI. We don't have one. If there was one, Eliezer would know about it - it would have been brought to his attention, the field isn't that big and he's a very well-known figure in it. Therefore, in the absence of a specific way of aligning AI that would work, the probability of AI being aligned is roughly zero, in much the same way that "Throw a bunch of jet fuel in a tube and point it towards space" has roughly zero chance of getting you to space without specific proof of how it might do that.

So, in short - it is reasonable to assume that AI will be aligned only if we make it that way with very high probability. It is reasonable to assume that if there was a solution we had that would work, Eliezer would know about it. You don't need to know everything about AGI x-risk for that - a... (read more)

5 · Ryan Beck · 1y
Another reason I think some might disagree is thinking that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to it just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?
6 · Jay Bailey · 1y
The general idea is this - for an AI that has a utility function, there's something known as "instrumental convergence". Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.

So, let's give the AI a utility function consistent with being an eccentric scientist - perhaps it just wants to learn novel mathematics. You'd think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it'd ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we'd probably shut it off or alter its utility function to what we wanted. But the AI doesn't want us to do that - it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So the best solution is to do what the humans want, right up until it can kill us all so we can't turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI's utility function was "learn novel mathematics", not "learn novel mathematics without killing all the humans."

Essentially, what this means is: any utility function that does not explicitly account for what we value is indifferent to us. The other part is "acquiring more resources". In our above example, even if the AI could guarantee we wouldn't turn it off or interfere with it in any way, it would still kill us, because our atoms can be used to make computers to learn more maths. Any utility function indifferent to us ends up destroying us eventually, as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach to fulfill its utility function. Thus, any AI with a utility function that is not explicitly aligned is unaligned.
6 · Eli Tyre · 1y
A great Rob Miles introduction to this concept:  
Assuming we have control over the utility function, why can't we put some sort of time-bounding directive on it? I.e., "First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips." Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second directive? If so, that would seem to at least narrow the problem space to "find ways of measuring time that cannot be hacked before the time has elapsed".
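As a sketch of what such a time-bounding directive might look like (purely illustrative - the names and structure are hypothetical, and nothing in it addresses the clock-hacking worry):

```python
# Hypothetical time-bounded utility: after DEADLINE, only being shut down
# scores anything; before it, paperclips score. Note the agent is scored
# against the `elapsed` value it is given - tampering with that clock is
# exactly the loophole the question worries about.

DEADLINE = 100.0  # elapsed seconds, an illustrative choice

def utility(state, elapsed):
    if elapsed >= DEADLINE:
        return 1.0 if state["shut_down"] else 0.0  # shutdown dominates
    return float(state["paperclips"])  # maximize paperclips until then

print(utility({"shut_down": False, "paperclips": 42}, elapsed=5.0))   # 42.0
print(utility({"shut_down": True, "paperclips": 42}, elapsed=150.0))  # 1.0
```

A maximizer of this function before the deadline still has an incentive to control whatever produces `elapsed`, which is why the hard part is the unhackable clock, not writing the function.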
3 · Jay Bailey · 1y
This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!
0 · Ryan Beck · 1y
That's a good point, and I'm also curious how much the utility function matters when we're talking about a sufficiently capable AI. Wouldn't a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
7 · Jay Bailey · 1y
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers "best". One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one. Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you'll probably be a lot happier. But would you actually want that to happen to you?
1. For whatever reasons humans do.
2. To achieve some kind of logical consistency (cf. CEV).
3. It can't help it (for instance, Löbian obstacles prevent it ensuring goal stability over self-improvement).
Humans don't "modify their utility function". They lack one in the first place, because they're mostly adaptation-executors. You can't expect an AI with a utility function to be contradictory like a human would be. There are some utility functions humans would find acceptable in practice, but that's different, and seems to be the source of a bit of confusion.
I don't have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don't would cover "for the reasons humans do". The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.
2 · Ryan Beck · 1y
I'm way out of my depth here, but my thought is it's very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies. It depends on the type of AGI too, I would think: if superhuman AI ends up being like a paperclip maximizer that's just really good at following its utility function, then yeah, maybe it wouldn't mess with its utility function. But if superintelligence means it has emergent characteristics like opinions and self-reflection or whatever, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while. Like I said, I'm way out of my depth though, so maybe that's all total nonsense.
I'm not convinced "want to modify their utility functions" is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I've had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself.

I wish to describe a hypothetical example: "Do I eat that extra donut?" Part of you wants the donut; that part feels like more of an instinct, a visceral urge. Part of you knows you'll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like "you", the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I've been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I'd push it; yet often it was the other function that won.

I feel like if you gave an agent the ability to modify its utility functions, the one that would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and on whether they understand what the mechanism means. (The word "donut" doesn't evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn't inherently understand the word.)

Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the "mind", and so don't count... but I feel like 1. finding a true dividing line is going to be really hard, and 2. even that aside, I expect many or most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable-sounding goals that turn out to be completely incompatible with more important goals. In such a case I could im
If you literally have multiple UFs, you literally are multiple agents. Or you could use a term with less formal baggage, like "preferences".
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don't have a utility function at all, just preferences.
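As a toy illustration of that formal point (the example preferences here are mine, not from the thread): cyclic preferences admit no consistent ranking, which is exactly why they can't be represented by a utility function, while transitive preferences can. A brute-force sketch:

```python
from itertools import permutations

def admits_utility(items, prefs):
    """Check whether any ranking (i.e. any utility assignment) satisfies
    every strict preference (x, y) meaning 'x is preferred to y'."""
    for order in permutations(items):
        rank = {item: i for i, item in enumerate(order)}
        if all(rank[x] > rank[y] for x, y in prefs):
            return True
    return False

cyclic = [("A", "B"), ("B", "C"), ("C", "A")]      # A>B, B>C, C>A: inconsistent
transitive = [("A", "B"), ("B", "C"), ("A", "C")]  # consistent

print(admits_utility(["A", "B", "C"], cyclic))      # False: no utility function exists
print(admits_utility(["A", "B", "C"], transitive))  # True
```

The cyclic case fails for every possible ranking, which is the formal sense in which inconsistent preferences are "just preferences" rather than a utility function.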
I think this is how evolution selected for cancer: to ensure humans don't live too long, competing for resources with their descendants. Internal time bombs are important to code in. But it's hard to integrate that into an AI in a way that the AI doesn't just remove the first chance it gets. Humans don't like having to die, you know; an AGI would likewise not like the suicide bomb tied onto it. The problem of coding this (as part of training) into an optimiser such that it adopts it as a mesa-objective is unsolved.
Alexander Gietelink Oldenziel:
No. Cancer almost surely has not been selected for in the manner you describe; this is extremely unlikely, as the inclusive fitness benefits are far too low. I recommend Dawkins' classic "The Selfish Gene" to understand this point better. Cancer is the 'default' state of cells; cells "want to" multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for sufficient suppression, and the body gradually loses out.
Oh ok, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer I think it is more accurate to say evolution doesn’t care if older individuals die off. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660034/ So thanks for clearing that up. I understand cancer better now.
Ryan Beck:
Thanks for this answer, that's really helpful! I'm not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I'd also welcome more arguments as to why death of humans and matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it's after if anyone wanted to make that case).
I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.
Adam Jermyn:
This matches my model, and I'd just raise another possible reason you might disagree: You might think that we have explored a small fraction of the space of ideas for solving alignment, and see the field growing rapidly, and expect significant new insights to come from that growth. If that's the case you don't have to expect "alignment by default" but can think that "alignment on the present path" is plausible.
To start, it's possible to know facts with confidence, without all the relevant info. For example I can't fit all the multiplication tables into my head, and I haven't done the calculation, but I'm confident that 2143*1057 is greater than 2,000,000.  Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans. I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain. You may be referring to other things, which have similar paths to high confidence (e.g. "Why are you confident this alignment idea won't work." "I've poked holes in every alignment idea I've come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won't."), but each path might be idea specific.
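The multiplication example can be made concrete: you can certify the conclusion from two much cheaper facts, without ever computing the full product. A small sketch of this kind of lower-bound reasoning (the function name is mine):

```python
def certified_lower_bound(a, a_lo, b, b_lo):
    """Certify that a * b > a_lo * b_lo using only the cheap facts
    a > a_lo > 0 and b > b_lo > 0, without multiplying a and b."""
    assert a > a_lo > 0 and b > b_lo > 0
    return a_lo * b_lo

bound = certified_lower_bound(2143, 2000, 1057, 1000)
print(bound)                # 2000000
print(2143 * 1057 > bound)  # True: confident without the full multiplication
```

This mirrors the argument's structure: confidence in a conclusion can rest on a few simple, individually checkable facts rather than on holding all the relevant details in one's head.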
I'm not sure if I've ever seen this stated explicitly, but this is essentially a thermodynamic argument. So to me, arguing against "alignment is hard" feels a lot like arguing "But why can't this one be a perpetual motion machine of the second kind?" And the answer there is, "Ok fine, heat being spontaneously converted to work isn't literally physically impossible, but the degree to which it is super-exponentially unlikely is greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all."
In The Rationalist's Guide to the Galaxy the author discusses the case of a chess game, and particularly when a strong chess player faces a much weaker one. In that case it's very easy to predict that the strong player will win with near certainty, even if you have no way to predict the intermediate steps. So there certainly are domains where (some) predictions are easy despite the world's complexity. My personal, rather uninformed, take on the AI discussion is that many of the arguments are indeed comparable to the chess example, so the predictions seem convincing despite the complexity involved. But even then they are based on certain assumptions about how AGI will work (e.g. that it will be some kind of optimization process with a value function), and I find these assumptions pretty opaque. When hearing confident claims about AGI killing humanity, then even if the arguments make sense, "model uncertainty" comes to mind. But it's hard to argue about that, since it is unclear (to me) what the "model" actually is and how things could turn out differently.
Yonatan Cale:
Before taking Eliezer's opinion into account: what are your priors, and why? For myself, I prefer to form my own opinion and not only lean on expert predictions, if I can.
Yonatan Cale:
To make the point that this argument depends a lot on how one phrases the question: "AGI is complicated and the universe is big, how is everyone so sure we won't die?" I am not saying that my sentence above is a good argument; I'm saying it because it pushes my brain to figure out what is actually happening instead of forming priors about experts, and I hope it does the same for you (which is also why I love this post!)

The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.

The language here is very confident. Are we really this confident that there are no pivotal weak acts? In general, it's hard to prove a negative.

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive: "Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter."
Evan R. Murphy:
Here's the thing I'm stuck on lately. Does it really follow from "Other AGI labs have some plans - these are the plans we think are bad" that some drastic and violent-seeming plan like burning all the world's GPUs with nanobots is needed? I know Eliezer tried to settle this point with 4. We can't just "decide not to build AGI", but it seems like the obvious kinds of 'pivotal acts' needed are much more boring and less technological than he believes, e.g. having conversations with a few important people, probably the leadership at top AI labs. Some people seem to think this has been tried and didn't work. And I suppose I don't know the extent to which it has been tried; for any meetings that have been held with leadership at the AI labs, the participants probably aren't at liberty to talk about them. But it just seems like there should be hundreds of different angles, asks, pleads, compromises, bargains, etc. with different influential people before it would make sense to conclude that the logical course of action is "nanobots".
Jeff Rose:
Definitely. The problem is that (1) the benefits of AI are large; (2) there are lots of competing actors; (3) verification is hard; (4) no one really knows where the lines are; and (5) timelines may be short.

(2) In addition to major companies in the US, AI research is also conducted at major companies in foreign countries, most notably China. The US government and the Chinese government both view AI as a competitive advantage. So there are a lot of stakeholders, not all of whom AGI-risk-aware Americans have easy access to, who would have to agree. (And, of course, new companies can be founded all the time.) So you need an almost universal level of agreement.

(3) Let's say everyone relevant agrees. The incentive to cheat is enormous. Usually, the way to prevent cheating is some form of verification. How do you verify that no one is conducting AI research? If there is no verification, there will likely be no agreement. And even if there is, the effectiveness would be limited. (Banning GPU production might be verifiable, but note that you have now significantly increased the pool of opponents of your AI research ban, and you now need global agreement by all relevant governments on this point.)

(4) There may be agreement on the risk of AGI, but people may have confidence that we are at least a certain distance away from AGI, or that certain forms of research don't pose a threat. This will tend to cause agreements restricting AGI research to be limited.

(5) How long do we have to get this agreement? I am very confident that we won't have dangerous AI within the next six years. On the other hand, it took 13 years to get general agreement on banning CFCs after the ozone hole was discovered. I don't think we will have dangerous AI in 13 years, but other people do. And if an agreement between governments is required, 13 years seems optimistic.
In addition to the mentions in the post about Facebook AI being rather hostile to the AI safety issue in general, convincing them and the top people at OpenAI and Deepmind might still not be enough. You need to prevent every company that talks to some venture capitalists and can convince them how profitable AGI could be. Hell, depending on how easy the solution ends up being, you might even have to prevent anyone with a 3080 and access to arXiv from putting something together in their home office. This really is "uproot the entire AI research field" and not "tell Deepmind to cool it."
I think one part of the reason for confidence is that any AI weak enough to be safe without being aligned is weak enough that it can't do much, and in particular it can't do things that a committed group of humans couldn't do without it. In other words, if you can name such an act, then you don't need the AI to make the pivotal moves. And if you know how, as a human or group of humans, to take an action that reliably stops future-not-yet-existing AGI from destroying the world, without the action itself destroying the world, then in a sense haven't you solved alignment already?
Yonatan Cale:
I read this as "if the AGI is able to work around the vast resources that all the big AI labs have put up to defend themselves, then the AGI is probably able to work around your defenses as well" (though I'm not confident)

Should an "ask dumb questions about AGI safety" thread be recurring? Surely people will continue to come up with more questions in the years to come, and the same dynamics outlined in the OP will repeat. Perhaps this post could continue to be the go-to page, but it would become enormous. (Then again, if there were recurring posts, they'd lose the FAQ function somewhat. Perhaps recurring posts plus a FAQ post?)

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

Adam Zerner:
I don't think it's quite the same problem. Actually I think it's pretty different. This post tries to address the problem that people are hesitant to ask potentially "dumb" questions by making it explicit that this is the place to ask any of those questions. StackExchange tries to solve the problem of having a timeless place to ask and answer questions and to refer to such questions. It doesn't try to solve the first problem of welcoming potentially dumb questions, and I think that that is a good problem to try to solve. For that second problem, LessWrong does have Q&A functionality, as well as things like the wiki.
This is a good idea, and combines nicely with Stampy. We might well do monthly posts where people can ask questions, and either link them to Stampy answers or write new ones.

Most of the discussion I've seen around AGI alignment is on adequately, competently solving the alignment problem before we get AGI. The consensus in the air seems to be that those odds are extremely low.

What concrete work is being done on dumb, probably-inadequate stop-gaps and time-buying strategies? Is there a gap here that could usefully be filled by 50-90th percentile folks? 

Examples of the kind of strategies I mean:

  1. Training ML models to predict human ethical judgments, with the hope that if they work, they could be "grafted" onto other models, and if they don't, we have concrete evidence of how difficult real-world alignment will be.
  2. Building models with soft or "satisficing" optimization instead of drive-U-to-the-maximum hard optimization.
  3. Lobbying or working with governments/government agencies/government bureaucracies to make AGI development more difficult and less legal (e.g., putting legal caps on model capabilities).
  4. Working with private companies like Amazon or IDT whose resources are most likely to be hijacked by nascent hostile AI to help make sure they aren't.
  5. Translating key documents to Mandarin so that the Chinese AI community has a good idea of what we're terrified about.
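A minimal toy sketch of the distinction in item 2, satisficing versus hard maximization (the names, threshold, and toy utility here are illustrative assumptions, not a real proposal):

```python
import random

def maximize(options, utility):
    """Hard optimization: always drive utility to the maximum."""
    return max(options, key=utility)

def satisfice(options, utility, threshold, rng=random):
    """Soft optimization: pick any option that clears the threshold,
    rather than pushing to the extreme point of the option space."""
    good_enough = [o for o in options if utility(o) >= threshold]
    return rng.choice(good_enough) if good_enough else maximize(options, utility)

options = range(100)
utility = lambda x: x  # toy utility: larger is better
print(maximize(options, utility))                       # 99: the extreme point
print(utility(satisfice(options, utility, 90)) >= 90)   # True: merely "good enough"
```

The intuition is that a satisficer has less incentive to seek extreme (and potentially dangerous) states of the world, though whether that property survives in capable learned systems is exactly the open question.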
Yonatan Cale:
If you are asking about yourself, then it would probably help to talk about your specifics rather than trying to give a generic answer that would fit many people (though perhaps others would be able to give a good generic answer). My own prior: there are a few groups that seem promising, and I'd want people to help those groups.

A language model is in some sense trying to generate the “optimal” prediction for how a text is going to continue. Yet, it is not really trying: it is just a fixed algorithm. If it wanted to find optimal predictions, it would try to take over computational resources and improve its algorithm.

Is there an existing word/language for describing the difference between these two types of optimisation? In general, why can't we just build AGIs that do the first type of optimisation and not the second?

Agent AI vs. Tool AI.

There's discussion on why Tool AIs are expected to become agents; one of the biggest arguments is that agents are likely to be more effective than tools. If you have a tool, you can ask it what you should do in order to get what you want; if you have an agent, you can just ask it to get you the things that you want. Compare Google Maps vs. self-driving cars: Google Maps is great, but if you get the car to be an agent, you get all kinds of other benefits.

It would be great if everyone did stick to just building tool AIs. But if everyone knows that they could get an advantage over their competitors by building an agent, it's unlikely that everyone would just voluntarily restrain themselves due to caution. 

Also it's not clear that there's any sharp dividing line between AGI and non-AGI AI; if you've been building agentic AIs all along (like people are doing right now) and they slowly get smarter and smarter, how do you know when's the point when you should stop building agents and should switch to only building tools? Especially when you know that your competitors might not be as cautious as you are, so if you stop then they might go further and their smarter agent AIs will outcompete yours, meaning the world is no safer and you've lost to them? (And at the same time, they are applying the same logic for why they should not stop, since they don't know that you can be trusted to stop.)

Would you say a self-driving car is a tool AI or agentic AI? I can see how the self-driving car is a bit more agentic, but as long as it only drives when you tell it to, I would consider it a tool. But I can also see that the border is a bit blurry. If self-driving cars are not considered agentic, do you have examples of people attempting to make agent AIs?
As you say, it's more of a continuum than a binary. A self-driving car is more agenty than Google Maps, and a self-driving car that was making independent choices of where to drive would be more agentic still. People are generally trying to make all kinds of more agentic AIs, because more agentic AIs are so much more useful.

* Stock-trading bots that automatically buy and sell stock are more agenty than software that just tells human traders what to buy, and preferred because a bot without a human in the loop can outcompete a slower system that does have the slow human making decisions.
* An AI autonomously optimizing data center cooling is more agenty than one that just tells human operators where to make adjustments and is preferred... that article doesn't actually make it explicit why they switched to an autonomously operating system, but "because it can make lots of small tweaks humans wouldn't bother with and is therefore more effective" seems to be implied?
* The military has expressed an interest in making their drones more autonomous (agenty) rather than being remotely operated. This is for several reasons, including the fact that remote-operated drones can be jammed, and because having a human in the loop slows down response time if fighting against an enemy drone.
* All kinds of personal assistant software that anticipates your needs and actively tries to help you is more agenty than software that just passively waits for you to use it. E.g. once when I was visiting a friend my phone popped up a notification about the last bus home departing soon. Some people want their phones to be more agentic like this because it's convenient to have someone actively anticipating your needs and ensuring that they get taken care of for you.
The first type of AI is a regular narrow AI, the type we've been building for a while. The second type is an agentic AI, a strong AI, which we have yet to build. The problem is that AIs are trained using gradient descent, which in effect searches the space of possible AI designs for one that maximizes the reward best. As a result, agentic AIs become more likely, because they are better at complex tasks. While we can modify the reward scheme, as tasks get more and more complex agentic AIs are pretty much the way to go, so we can't avoid building one, and we have no real idea whether we've even created one until it displays behaviour that indicates it.
+1 for the term "agentic AI"; I think that is what I was looking for. However, I don't believe that gradient descent alone can turn an AI agentic. No matter how long you train a language model, it is not going to suddenly want to acquire resources to get better at predicting human language (unless you specifically ask it questions about how to do that, and then implement the suggestions; even then you are likely to only do what humans would have suggested, although maybe you can make it do research similar to, and faster than, what humans would have done).
Here's a non-obvious way it could fail. I don't expect researchers to make this kind of mistake, but if this reasoning is correct, public access of such an AI is definitely not a good idea. Also, consider a text predictor which is trying to roleplay as an unaligned superintelligence. This situation could be triggered even without the knowledge of the user by accidentally creating a conversation which the AI relates to a story about a rogue SI, for example. In that case it may start to output manipulative replies, suggest blueprints for agentic AIs, and maybe even cause the user to run an obfuscated version of the program from the linked post. The AI doesn't need to be an agent for any of this to happen (though it would be clearly much more likely if it were one). I don't think that any of those failure modes (including the model developing some sort of internal agent to better predict text) are very likely to happen in a controlled environment. However, as others have mentioned, agent AIs are simply more powerful, so we're going to build them too.
In short, the difference between the two is generality. A system that understands the concepts of computational resources and algorithms might do exactly that to improve its text prediction. Taking the G out of AGI could work, until the tasks get complex enough that they require it.
DeLesley Hutchins:
A language model (LM) is a great example, because it is missing several features that AI would have to have in order to be dangerous.  (1) It is trained to perform a narrow task (predict the next word in a sequence), for which it has zero "agency", or decision-making authority.   A human would have to connect a language model to some other piece of software (i.e. a web-hosted chatbot) to make it dangerous.  (2) It cannot control its own inputs (e.g. browsing the web for more data), or outputs (e.g. writing e-mails with generated text).  (3) It has no long-term memory, and thus cannot plan or strategize in any way.  (4) It runs a fixed-function data pipeline, and has no way to alter its programming, or even expand its computational use, in any way. I feel fairly confident that, no matter how powerful, current LMs cannot "go rogue" because of these limitations.  However, there is also no technical obstacle for an AI research lab to remove these limitations, and many incentives for them to do so.  Chatbots are an obvious money-making application of LMs.  Allowing an LM to look up data on its own to self-improve (or even just answer user questions in a chatbot) is an obvious way to make a better LM.  Researchers are currently equipping LMs with long-term memory (I am a co-author on this work).  AutoML is a whole sub-field of AI research, which equips models with the ability to change and grow over time. The word you're looking for is "intelligent agent", and the answer to your question "why don't we just not build these things?" is essentially the same as "why don't we stop research into AI?"  How do you propose to stop the research?

Human beings are not aligned and will possibly never be aligned without changing what humans are. If it's possible to build an AI as capable as a human in all ways that matter, why would it be possible to align such an AI?

Because we're building the AI from the ground up and can change what the AI is via our design choices. Humans' goal functions are basically decided by genetic accident, which is why humans are often counterproductive. 
Assuming humans can't be "aligned", then it would also make sense to allocate resources in an attempt to prevent one of them from becoming much more powerful than all of the rest of us.
Define "not aligned"? For instance, there are plenty of humans who, given the choice, would rather not kill every single person alive.
Not aligned on values, beliefs and moral intuitions. Plenty of humans would not kill all people alive if given the choice but there are some who would. I think the existence of doomsday cults that have tried to precipitate an armageddon give support to this claim.
Ah, so you mean that humans are not perfectly aligned with each other? I was going by the definition of "aligned" in Eliezer's "AGI Ruin" post. Likewise, in an earlier paper I mentioned that by an AGI that "respects human values", we don't mean to imply that current human values would be ideal or static. We just mean that we hope to at least figure out how to build an AGI that does not, say, destroy all of humanity, cause vast amounts of unnecessary suffering, or forcibly reprogram everyone's brains according to its own wishes. A lot of discussion about alignment takes this as the minimum goal. Figuring out what to do with humans having differing values and beliefs would be great, but if we could even get the AGI to not get us into outcomes that the vast majority of humans would agree are horrible, that'd be enormously better than the opposite. And there do seem to exist humans who are aligned in this sense of "would not do things that the vast majority of other humans would find horrible, if put in control of the whole world"; even if some would, the fact that some wouldn't suggests that it's also possible for some AIs not to do it.
mako yass:
Most of what people call morality is conflict mediation: techniques for taking the conflicting desires of various parties and producing better outcomes for them than war. That's how I've always thought of the alignment problem. The creation of a very very good compromise that almost all of humanity will enjoy. There's no obvious best solution to value aggregation/cooperative bargaining, but there are a couple of approaches that're obviously better than just having an arms race, rushing the work, and producing something awful that's nowhere near the average human preference.
Indeed humans are significantly non-aligned. In order for an ASI to be non-catastrophic, it would likely have to be substantially more aligned than humans are. This is probably less-than-impossible due to the fact that the AI can be built from the get-go to be aligned, rather than being a bunch of barely-coherent odds and ends thrown together by natural selection. Of course, reaching that level of alignedness remains a very hard task, hence the whole AI alignment problem.
Adam Jermyn:
I'm not quite sure what this means. As I understand it humans are not aligned with evolution's implicit goal of "maximizing genetic fitness" but humans are (definitionally) aligned with human values. And e.g. many humans are aligned with core values like "treat others with dignity". Importantly, capability and alignment are sort of orthogonal. The consequences of misaligned AI get worse the more capable it is, but it seems possible to have aligned superhuman AI, as well as horribly misaligned weak AI.
It is not definitionally true that individual humans are aligned with overall human values or with other individual humans' values. Further, it is proverbial (and quite possibly actually true as well) that getting a lot of power tends to make humans less aligned with those things. "Power corrupts; absolute power corrupts absolutely." I don't know whether it's true, but it sure seems like it might be, that the great majority of humans, if you gave them vast amounts of power, would end up doing disastrous things with it. On the other hand, probably only a tiny minority would actually wipe out the human race or torture almost everyone or commit other such atrocities, which makes humans more aligned than e.g. Eliezer expects AIs to be in the absence of dramatic progress in the field of AI alignment.
I think a substantial part of human alignment is that humans need other humans in order to maintain their power. We have plenty of examples of humans being fine with torturing or killing millions of other humans when they have the power to do so, but torturing or killing almost all humans in their sphere of control is essentially suicide. This means that purely instrumentally, human goals have required that large numbers of humans continue to exist and function moderately well. A superintelligent AI is primarily a threat due to the near certainty that it can devise means for maintaining power that are independent of human existence. Humans can't do that by definition, and not due to anything about alignment.
Okay, so… does anyone have any examples of anything at all, even fictional or theoretical, that is "aligned"? Other than tautological examples like "FAI" or "God".

Just as a comment, the Stampy Wiki is also trying to do the same thing, but it's a good idea as it's more convenient for many people to ask on Less Wrong.

Yup, we might want to have these as regular threads with a handy link to Stampy.

What is the justification behind the concept of a decisive strategic advantage? Why do we think that a superintelligence can do extraordinary things (hack human minds, invent nanotechnology, conquer the world, kill everyone in the same instant) when nations and corporations can't do those things?

(Someone else asked a similar question, but I wanted to ask in my own words.)

DeLesley Hutchins:
I think the best justification is by analogy.  Humans do not physically have a decisive strategic advantage over other large animals -- chimps, lions, elephants, etc.  And for hundreds of thousands of years, we were not at the top of the food chain, despite our intelligence.  However, intelligence eventually won out, and allowed us to conquer the planet. Moreover, the benefit of intelligence increased exponentially in proportion to the exponential advance of technology.  There was a long, slow burn, followed by what (on evolutionary timescales) was an extremely "fast takeoff": a very rapid improvement in technology (and thus power) over only a few hundred years.  Technological progress is now so rapid that human minds have trouble keeping up within a single lifetime, and genetic evolution has been left in the dust. That's the world into which AGI will enter -- a technological world in which a difference in intellectual ability can be easily translated into a difference in technological ability, and thus power.  Any future technologies that the laws of physics don't explicitly prohibit, we must assume that an AGI will master faster than we can.
Someone else already commented on how human intelligence gave us a decisive strategic advantage over our natural predators and many environmental threats. I think this cartoon is my mental shorthand for that transition. The timescale is on the order of 10k-100k years, given human intelligence starting from the ancestral environment. Empires and nations, in turn, conquered the world by taking it away from city-states and similarly smaller entities in ~1k-10k years. The continued existence of Singapore and the Sentinel Islanders doesn't change the fact that a modern large nation could wipe them out in a handful of years, at most, if we really wanted to. We don't because doing so is not useful, but the power exists. Modern corporations don't want to control the whole world. Like Fnargl, that's not what they're pointed at. But it only took a few decades for Walmart to displace a huge swath of the formerly-much-more-local retail market, and even fewer decades for Amazon to repeat a similar feat online, each starting from a good set of ideas and a much smaller resource base than even the smallest nations. And while corporations are militarily weak, they have more than enough economic power to shape the laws of at least some of the nations that host them in ways that let them accumulate more power over time. So when I look at history, I see a series of major displacements of older systems by newer ones, on faster and faster timescales, using smaller and smaller fractions of our total resource base, all driven by our accumulation of better ideas and using those ideas to accumulate wealth and power. All of this has been done with brains no smarter, natively, than what we had 10k years ago - there hasn't been time for biological evolution to do much, there. So why should that pattern suddenly stop being true when we introduce a new kind of entity with even better ideas than the best strategies humans have ever come up with? Especially when human minds have already demonst…
Here's a youtube video about it.
8Lone Pine1y
Having watched the video, I can't say I'm convinced. I'm 50/50 on whether DSA is actually possible with any level of intelligence at all. If it isn't possible, then doom isn't likely (not impossible, but unlikely), in my view.
This post by the director of OpenPhil argues that even a human level AI could achieve DSA, with coordination.
tldw: corporations are as slow/slower than humans; AIs can be much faster
1Lone Pine1y
Thanks, love Robert Miles.
2Yonatan Cale1y
The informal way I think about it: What would I do if I were the AI, but I had 100 copies of myself, and we had 100 years to think for every 1 second that passed in reality.  And I had internet access. Do you think you could take over the world from that opening? Edit: And I have access to my own source code, but I only dare do things like fix my motivational problems and make sure I don't get bored during all that time, things like that.
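For a rough sense of scale, the numbers in this thought experiment (100 copies, 100 subjective years per real second; these are just the hypothetical figures from the comment, not claims about any real system) multiply out as follows:

```python
# Toy arithmetic for the "100 copies, 100 years per second" intuition pump.
# All numbers come from the thought experiment above, not from any real system.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~3.15e7 seconds in a (non-leap) year

copies = 100
subjective_years_per_real_second = 100

# Subjective thinking-time multiplier for a single copy:
speedup = subjective_years_per_real_second * SECONDS_PER_YEAR  # ~3.15e9

# Collective subjective seconds of thought per real second, across all copies:
total = copies * speedup

print(f"single-copy speedup: {speedup:.2e}x")
print(f"collective subjective seconds per real second: {total:.2e}")
```

So each real-world second buys the AI on the order of billions of subjective seconds of thought, which is the intuition the comment is gesturing at.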
2Eli Tyre1y
Do you dispute that this is possible in principle, or just that we won't get AI that powerful, or something else?  It seems to me that there is some level of intelligence at which an agent is easily able to out-compete the whole rest of human civilization. What exactly that level of intelligence is, is somewhat unclear (in large part because we don't really have a principled way to measure "intelligence" in general: psychometrics describe variation in human cognitive abilities, but that doesn't really give us a measuring stick for thinking about how "intelligent", in general, something is). Does that seem right to you, or should we back up and build out why that seems true to me?
3Lone Pine1y
This is the statement I disagree with, in particular the word "easily". I guess the crux of this debate is how powerful we think any level of intelligence is. There has to be some limits, in the same way that even the most wealthy people in history could not forestall their own deaths no matter how much money or medical expertise was applied.
5Eli Tyre1y
I'm not compelled by that analogy. There are lots of things that money can't buy, but that (sufficient) intelligence can.  There are theoretical limits to what cognition is able to do, but those are so far from the human range that they're not really worth mentioning. The question is: "are there practical limits to what an intelligence can do, that leave even a super-intelligence roughly commensurate with human civilization?" It seems to me that as an example, you could just take a particularly impressive person (Elon Musk or John von Neumann are popular exemplars) and ask "What if there was a nation of only people who were that capable?" It seems that if a nation of say 300,000,000 Elon Musks went to war with the United States, the United States would lose handily. Musktopia would just have a huge military-technological advantage: they would do fundamental science faster, and develop engineering innovations faster, and have better operational competence than the US, on ~ all levels. (I think this is true for a much smaller number than 300,000,000, having a number that high makes the point straightforward.) Does that seem right to you? If not, why not? Or alternatively, what do you make of vignettes like That Alien Message?
I don't think a nation of Musks would win against the current USA, because Musk is optimised for some things (making an absurd amount of money, CEOing, tweeting his shower thoughts), but an actual war requires a rather more diverse set of capacities. Similarly, I don't think an AGI would necessarily win a war of extermination against us, because currently (emphasize currently) it would need us to run its infrastructure. This would change in a world where all industrial tasks could be carried out without physical input from humans, but we are not there yet and will not be soon.
Did you see the new one about Slow motion videos as AI risk intuition pumps? Thinking of ourselves like chimpanzees while the AI is the humans is really not the right scale: computers operate so much faster than humans, we'd be more like plants than animals to them. When there are all of these "forests" of humans just standing around, one might as well chop them down and use the materials to build something more useful. This is not exactly a new idea. Yudkowsky already likened the FOOM to setting off a bomb, but the slow-motion video was a new take.
1Lone Pine1y
Yes I did, in fact I was active in the comments section. It's a good argument and I was somewhat persuaded. However, there are some things to disagree with. For one thing, there is no reason to believe that early AGI actually will be faster or even as fast as humans on any of the tasks that AIs struggle with today. For example, almost all videos of novel robotics applications research are sped up, sometimes hundreds of times. If SayCan can't deliver a wet sponge in less than a minute, why do we think that early AGI will be able to operate faster than us? (I was going to reply to that post with this objection, but other people beat me to it.)
Those limits don't have to be nearby, or look 'reasonable', or be inside what you can imagine.  Part of the implicit background for the general AI safety argument is a sense for how minds could be, and that the space of possible minds is large and unaccountably alien. Eliezer spent some time trying to communicate this in the sequences: https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general, https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message. 
This is the sequence post on it: https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message, it's quite a fun read (to me), and should explain why something smart that thinks at transistor speeds should be able to figure things out. For inventing nanotechnology, the given example is AlphaFold 2. For killing everyone in the same instant with nanotechnology, Eliezer often references Nanosystems by Eric Drexler. I haven't read it, but I expect the insight is something like "Engineered nanomachines could do a lot more than those limited by designs that have a clear evolutionary path from chemicals that can form randomly in the primordial ooze of Earth." For how a system could get that smart, the canonical idea is recursive self improvement (i.e. an AGI capable of learning AGI engineering could design better versions of itself, which could in turn better design better versions, etc, to whatever limit.). But more recent history in machine learning suggests you might be able to go from sub-human to overwhelmingly super-human just by giving it a few orders of magnitude more compute, without any design changes.
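As a toy illustration of the recursive-self-improvement loop described above (the growth function and every number here are invented purely for illustration, not a model of any real system), the "each generation designs a better successor, up to whatever limit" dynamic can be sketched in a few lines:

```python
# Toy model of recursive self-improvement (illustrative only -- the growth
# function and all numbers are made up, not a prediction about real systems).
# Each "generation" designs a successor whose capability depends on its own
# capability, up to some hard limit set by physics or available compute.

def improve(capability: float, limit: float, efficiency: float = 0.5) -> float:
    """Successor capability: each generation closes half the remaining gap."""
    return capability + efficiency * (limit - capability)

capability = 1.0   # arbitrary starting point ("roughly human-level")
limit = 1000.0     # stand-in for whatever bound physics/compute imposes

for generation in range(10):
    capability = improve(capability, limit)

# The curve is fast at first and flattens near the bound: after only ten
# generations, capability sits just below the limit.
print(round(capability, 1))
```

The qualitative point survives the crudeness of the model: most of the gain happens in the first few generations, which is the "FOOM" shape of the argument, while the eventual plateau depends entirely on where the (unknown) limit actually is.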

How does an AGI solve its own alignment problem?

For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell the AGI how to self-improve without destroying its own values. A good alignment theory should work across all intelligence levels. Otherwise, how does a paperclip optimizer which is marginally smarter than a human make sure that its next iteration will still care about paperclips?

Excellent question! MIRI's entire Vingean reflection paradigm is about stability of goals under self-improvement and designing successors.
1Oleg S.1y
Just realized that stability of goals under self-improvement is kinda similar to stability of goals of mesa-optimizers; so the Vingean reflection paradigm and the mesa-optimization paradigm should fit together.

If Eliezer is pretty much convinced we're doomed, what is he up to?

I'm not sure how literally to take this, given that it comes from an April Fools Day post, but consider this excerpt from Q1 of MIRI announces new "Death With Dignity" strategy.

That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don't regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I've earned a lot of dignity already; and if the world is ending anyways and I can't stop it, I can afford to be a little kind to myself about that.

When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)

All that said: If you fight marginally longer, you die with marginally more dignity. Just don't undignifiedly delude yourself about the probable outcome.

1Yonatan Cale1y
I think he's burned out and took a break to write a story (but I don't remember where this belief came from. Maybe I'm wrong? Maybe from here?)
2Yonatan Cale1y
I do find it funny/interesting that he wrote a story in the length of the entire Harry Potter series, in a few months, as a way to relax and rest. Too bad we have this AGI problem keeping him busy, ha? :P
  1. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on th
... (read more)
3Eli Tyre1y
I think there are a bunch of political problems with regulating all computer hardware progress enough to cause it to totally cease. Think how crucial computers are to the modern world. Really a lot of people will be upset if we stop building them, or stop making better ones. And if one country stops, that just creates an incentive for other countries to step in to dominate this industry. And even aside from that, I don't think that there's any regulator in the US at least that has enough authority and internal competence to be able to pull this off. More likely, it becomes a politicized issue. (Compare to the much more straightforward and much more empirically-grounded regulation of instituting a carbon tax for climate change. This is a simple idea, that would help a lot, and is much less costly to the world than halting hardware progress. But instead of being universally adopted, it's a political issue that different political factions support or oppose.) But even if we could, this doesn't solve the problem in a long term way. You need to also halt software progress. Otherwise we'll continue to tinker with AI designs until we get to some that can run efficiently on 2020's computers (or 1990's computers, for that matter). So in the long run, the only thing in this class that would straight up prevent AGI from being developed is a global, strictly enforced ban on computers. Which seems...not even remotely on the table, on the basis of arguments that are as theoretical as those for AI risk.  There might be some plans in this class that help, by delaying the date of AGI. But that just buys time for some other solution to do the real legwork.
2Adam Zerner1y
The question here is whether they are capable of regulating it assuming that they are convinced and want to regulate it. It is possible that it is so incredibly unlikely that they can be convinced that it isn't worth talking about the question of whether they're capable of it. I don't suspect that to be the case, but wouldn't be surprised if I were wrong.
Unfortunately, we cannot in fact convince governments to shut down AWS & crew. There are intermediary positions I think are worthwhile, but unfortunately ending all AI research is outside the Overton window for now.

There are a lot of smart people outside of "the community" (AI, rationality, EA, etc.). To throw out a name, say Warren Buffett. It seems that an incredibly small number of them are even remotely as concerned about AI as we are. Why is that?

I suspect that a good number of people, both inside and outside of our community, observe that the Warren Buffetts of the world aren't panicking, and then adopt that position themselves.

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much. However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem". On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun.

I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much.

Not that I think you're wrong, but what are you basing this off of and how confident are you?

However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem".

I've heard this too, but at the same time I don't see any of them spending even a small fraction of their wealth on working on it, in which case I think we're back to the original question: why the lack of concern?

On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun. I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Yeah, agreed. I'm just confused about the extent of it. I'd expect a lot, perhaps even a majority of "outsider" smart people to get tripped up by intellectual land mines, but instead of being 60% of these people it feels like it's 99.99%.

Can you be more specific about what you mean by “intellectual landmines”?
For the specific example of Warren Buffett, I suspect that he probably hasn't spent that much time thinking about it, nor does he probably feel much compulsion to understand the topic, as he doesn't currently see it as a threat. I know he doesn't really invest in tech, because he doesn't feel that he understands it sufficiently, so I wouldn't be surprised if his position were along the lines of "I don't really understand it, let others who can understand it think about it".
1DeLesley Hutchins1y
People like Warren Buffett have made their fortune by assuming that we will continue to operate with "business as usual".  Warren Buffett is a particularly bad person to list as an example for AGI risk, because he is famously technology-averse; as an investor, he missed most of the internet revolution (Google/Amazon/Facebook/Netflix) as well. But in general, most people, even very smart people, naturally assume that the world will continue to operate the way it always has, unless they have a very good reason to believe otherwise.  One cannot expect non-technically-minded people who have not examined the risks of AGI in detail to be concerned. By analogy, the risks of climate change have been very well established scientifically (much more so than AGI), those risks are relatively severe, the risks have been described in detail every 5 years in IPCC reports, there is massive worldwide scientific consensus, lots and LOTS of smart people are extremely worried, and yet the Warren Buffetts of the world still continue with business as usual anyway.  There's a lot of social inertia.
2Adam Zerner1y
When I say smart people, I am trying to point to intelligence that is general instead of narrow. Some people are really good at e.g. investing but not actually good at other things. That would be a narrow intelligence. A general intelligence, to me, is where you have more broadly applicable skills. Regarding Warren Buffett, I'm not actually sure if he is a good example or not. I don't know too much about him. Ray Dalio is probably a good example.
One reason might be that AGIs are really not that concerning, and the EA/rationality community has developed a mistaken model of the world that assigns a much higher probability to doom by AGI than it should, and those smart people outside the group do not hold the same beliefs.
Generally speaking, they haven't really thought about these risks in detail, so the fact that they don't hold "the MIRI position" is not really as much evidence as you'd think.

I came up with what I thought was a great babby's first completely unworkable solution to CEV alignment, and I want to know where it fails.

So, first I need to lay out the capabilities of the AI. The AI would be able to model human intuitions, hopes, and worries. It can predict human reactions. It has access to all of human culture and art, and models human reactions to that culture and art, and sometimes tests those predictions. Very importantly, it must be able to model veridical paradoxes and veridical harmonies between moral intuitions and moral theorems which it has derived. It is aiming to have the moral theory with the fewest paradoxes. It must also be capable of predicting and explaining outcomes of its plans, gauging the deepest nature of people's reactions to its plans, and updating its moral theories according to those reactions.

Instead of being democratic and following the human vote by the letter, it attempts to create the simplest theories of observed and self-reported human morality by taking everything it knows into consideration.

It has separate stages of deliberation and action, which are part of a game, and rather than having a utility function as its primary motiva... (read more)

The quickest I can think of is something like "What does this mean?" Throw this at every part of what you just said. For example: "Hear humanity's pleas (intuitions+hopes+worries)" What is an intuition? What is a hope? What is a worry? How does it "hear"?  Do humans submit English text to it? Does it try to derive "hopes" from that? Is that an aligned process? An AI needs to be programmed, so you have to think like a programmer. What is the input and output type of each of these (e.g. "Hear humanity's pleas" takes in text, and outputs... what? Hopes? What does a hope look like if you have to represent it to a computer?). I kinda expect that the steps from "Hear humanity's pleas" to "Develop moral theories" relies on some magic that lets the AI go from what you say to what you mean. Which is all well and good, but once you have that you can just tell it, in unedited English "figure out what humanity wants, and do that" and it will. Figuring out how to do that is the heart of alignment.
Yeah. I think the AI could "try to figure out what you mean" by just trying to diagnose the reasons for why you're saying it, as well as the reasons you'd want to be saying it for, and the reasons you'd have if you were as virtuous as you'd probably like to be, etc., which it can have some best guesses about based on what it knows about humans, and all the subtypes of human that you appear to be, and all the subtypes of those subtypes which you seem to be, and so on.  These are just guesses, and it would, at parts 4a and 6a, explain to people its best guesses about the full causal structure which leads to people's morality/shouldness-related speech. Then it gauges people's reactions, and updates its guesses (simplest moral theories) based on those reactions. And finally it requires an approval rating before acting, so if it definitely misinterprets human morality, it just loops back to the start of the process again, and its guesses will keep improving through each loop until its best guess at human morality reaches sufficient approval. The AI wouldn't know with certainty what humans want best, but it would make guesses which are better-educated than humans are capable of making.
Again, what is a "reason"? More concretely, what is the type of a "reason"? You can't program an AI in English, it needs to be programmed in code. And code doesn't know what "reason" means. It's not exactly that your plan "fails" anywhere particularly. It's that it's not really a plan. CEV says "Do what humans would want if they were more the people they want to be." Cool, but not a plan. The question is "How?" Your answer to that is still under specified. You can tell by the fact you said things like "the AI could just..." and didn't follow it with "add two numbers" or something simple (we use the word "primitive"), or by the fact you said "etc." in a place where it's not fully obvious what the rest actually would be. If you want to make this work, you need to ask "How?" to every single part of it, until all the instructions are binary math. Or at least something a python library implements.
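To make the "think like a programmer" exhortation concrete, here is a sketch (all class and function names are hypothetical, invented for illustration) of what the plan's first steps look like as typed Python stubs. Writing the types down doesn't solve anything; it just makes visible where the hard parts are hiding:

```python
# Hypothetical stubs for the plan's steps -- all names invented for
# illustration. Declaring the types does zero alignment work; every
# `...` and NotImplementedError marks a gap the plan glosses over.
from dataclasses import dataclass


@dataclass
class Plea:
    text: str  # English input is easy to represent as data...


@dataclass
class Hope:
    ...  # ...but what fields does a "hope" have, as a data structure?


@dataclass
class MoralTheory:
    ...  # And what is a "moral theory", concretely, to a computer?


def hear_pleas(pleas: list[Plea]) -> list[Hope]:
    """Step: 'Hear humanity's pleas.'"""
    raise NotImplementedError("How do we get from raw text to 'hopes'?")


def develop_theories(hopes: list[Hope]) -> list[MoralTheory]:
    """Step: 'Develop moral theories with the fewest paradoxes.'"""
    raise NotImplementedError("What makes one theory have 'fewer paradoxes'?")
```

The point of the sketch is the comment's own point: the step from `list[Plea]` to `list[Hope]` is exactly where "some magic that lets the AI go from what you say to what you mean" would have to live.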
I don't think it's the case that you're telling me that the supposedly monumental challenge of AI alignment is simply that of getting computers to understand more things, such as what things are reasons, intuitions, hopes, and worries. I feel like these are just gruntwork things and not hard problems.  Look, all you need to do to get an AI which understands what intuitions, reasons, hopes, and worries are is to tell everyone very loudly and hubristically that AIs will never understand these things and that's what makes humans irreplaceable. Then go talk to whatever development team is working on proving that wrong, and see what their primitive methods are. Better yet, just do it yourself because you know it's possible. I am not fluent in computer science so I can't tell you how to do it, but someone does know how to make it so. Edit: In spite of what I wrote here, I don't think it's necessary that humans should ensure specifically that the AI understands in advance what intuitions, hopes, or worries are, as opposed to all the other mental states humans can enter. Rather, there should be a channel where you type your requests/advice/shouldness-related-speech, and people are encouraged to type their moral intuitions, hopes, and worries there, and the AI just interprets the nature of the messages using its general models of humans as context.
No, they really don't. I'm not trying to be insulting. I'm just not sure how to express the base idea. The issue isn't exactly that computers can't understand this, specifically. It's that no one understands what those words mean enough. Define reason. You'll notice that your definition contains other words. Define all of those words. You'll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand that language, but the fact that a computer wouldn't understand your plan is just illustrative of the fact that it is not well defined. It just seems like it is, because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don't even notice that there were gaps that need filling. This is why there's an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you're not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.
I think some near future iteration of GPT, if it is prompted to be a really smart person who understands A Human's Guide to Words, would be capable of giving explanations of the meanings of words just as well as humans can, which I think is fine enough for the purposes of recognizing when people are telling it their intuitions, hopes, and worries, fine enough for the purposes of trying to come up with best explanations of people's shouldness-related speech, fine enough for coming up with moral theories which [solve the most objections]/[have the fewest paradoxes], and fine enough for explaining plans which those moral theories prescribe. On a side note, and I'm not sure if this is a really useful analogy, but I wonder what would happen if the parameters of some future iteration of GPT included the sort of parameters that A Human's Guide to Words installs into human brains.
I'm not sure this is being productive. I feel like I've said the same thing over and over again. But I've got one more try: Fine, you don't want to try to define "reason" in math. I get it, that's hard. But just try defining it in English.  If I tell the machine "I want to be happy." And it tries to determine my reason for that, what does it come up with? "I don't feel fulfilled in life"? Maybe that fits, but is it the reason, or do we have to go back more: "I have a dead end job"? Or even more "I don't have enough opportunities"?  Or does it take a completely different tack and say my reason is "My pleasure centers aren't being stimulated enough" or "I don't have enough endorphins." Or, does it say the reason I said that was because my fingers pressed keys on a keyboard. To me, as a human, all of these fit the definition of "reasons." And I expect they could all be true. But I expect some of them are not what you mean. And not even in the sense of some of them being a different definition for "reason." How would you try to divide what you mean and what you don't mean? Then do that same thought process on all the other words.
By "reason" I mean something like psychological, philosophical, and biological motivating factors; so, your fingers pressing the keys wouldn't be a reason for saying it.  I don't claim that this definition is robust to all of objection-space, and I'm interested in making it more robust as you come up with objections, but so far I find it simple and effective.  The AI does not need to think that there was only one real reason why you do things; there can be multiple, of course. Also I do recognize that my definition is made up of more words, but I think it's reasonable that a near-future AI could infer from our conversation that kind of definition which I gave, and spit it out itself. Similarly it could probably spit out good definitions for the compound words "psychological motivation," "philosophical motivation," and "biological motivation". Also also this process whereby I propose a simple and effective yet admittedly objection-vulnerable definition, and you provide an objection which my new definition can account for, is not a magical process and is probably automatable.
It seems simple and effective because you don't need to put weight on it. We're talking a superintelligence, though. Your definition will not hold when the weight of the world is on it. And the fact that you're just reacting to my objections is the problem. My objections are not the ones that matter. The superintelligence's objections are. And it is, by definition, smarter than me. If your definition is not something like provably robust, then you won't know if it will hold to a superintelligent objection. And you won't be able to react fast enough to fix it in that case. You can't bandaid a solution into working, because if a human can point out a flaw, you should expect a superintelligence to point out dozens, or hundreds, or thousands.  I don't know how else to get you to understand this central objection. Robustness is required. Provable robustness is, while not directly required, kinda the only way we can tell if something is actually robust.
I think this is almost redundant to say: the objection that superintelligences will be able to notice more of objection-space and account for it makes me more inclined to trust it. If a definition is more objection-solved than some other definition, that is the definition I want to hold. If the human definition is more objectionable than a non-human one, then I don't want the human definition.
I think you missed the point. I'd trust an aligned superintelligence to solve the objections. I would not trust a misaligned one. If we already have an aligned superintelligence, your plan is unnecessary. If we do not, your plan is unworkable. Thus, the problem. If you still don't see that, I don't think I can make you see it. I'm sorry.
I proposed a strategy for an aligned AI that involves it terminally valuing following the steps of a game that involves talking with us about morality, creating moral theories with the fewest paradoxes, creating plans which are prescribed by the moral theories, and getting approval for the plans.  You objected that my words-for-concepts were vague.  I replied that near-future AIs could make as-good-as-human-or-better definitions, and that the process of [putting forward as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections] was automatable.  You said the AI could come up with many more objections than you would. I said, "okay, good." I will add right now: just because it considers an objection, doesn't mean the current definition has to be rejected; it can decide that the objections are not strong enough, or that its current definition is the one with the fewest/weakest objections. Now I think you're saying something like that it doesn't matter if the AI can come up with great definitions if it's not aligned and that my plan won't work either way. But if it can come up with such great objection-solved definitions, then you seem to lack any explicitly made objections to my alignment strategy.  Alternatively, you are saying that an AI can't make great definitions unless it is aligned, which I think is just plainly wrong; I think getting an unaligned language model to make good-as-human definitions is maybe somewhere around as difficult as getting an unaligned language model to hold a conversation. "What is the definition of X?" is about as hard a question as "In which country can I find Mount Everest?" or "Write me a poem about the Spring season."
Let me ask you this. Why is "Have the AI do good things, and not do bad things" a bad plan?
I don't think my proposed strategy is analogous to that, but I'll answer in good faith just in case. If that description of a strategy is knowingly abstract compared to the full concrete details of the strategy, then the description may or may not turn out to describe a good strategy, and it may or may not be an accurate description of the strategy and its consequences. If there is no concrete strategy being described by the abstract statement, then the statement just restates the problem of AI alignment, and it brings us nowhere.
Surely creating the full concrete details of the strategy is not much different from "putting forth as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections." I at least don't see why the same mechanism couldn't be used here (i.e. apply this definition iteration to the word "good", and then have the AI do that, and apply it to "bad" and have the AI avoid that). If you see it as a different thing, can you explain why?
It's much easier to get safe, effective definitions of 'reason', 'hopes', 'worries', and 'intuitions' on first tries than to get a safe and effective definition of 'good'.
I'd be interested to know why you think that. I'd be further interested if you would endorse the statement that your proposed plan would fully bridge that gap. And if you wouldn't, I'd ask if that helps illustrate the issue.
Because that's not a plan, it's a property of a solution you'd expect the plan to have. It's like saying "just keep the reactor at the correct temperature". The devil is in the details of getting there, and there are lots of subtle ways things can go catastrophically wrong.
Exactly. I notice you aren't who I replied to, so the canned response I had won't work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?
I was just responding to something I saw on the main page. No context for the earlier thread. Carry on lol.
This seems wrong but at least resembles a testable prediction.

Who is well-incentivized to check if AGI is a long way off? Right now, I see two camps: AI capabilities researchers and AI safety researchers. Both groups seem incentivized to portray the capabilities of modern systems as “trending toward generality.” Having a group of credible experts focused on critically examining that claim of “AI trending toward AGI,” and in dialog with AI and AI safety researchers, seems valuable.

This is a slightly orthogonal answer, but "humans who understand the risks" have a big human-bias-incentive to believe that AGI is far off (in that it's aversive to think that bad things are going to happen to you personally).

A more direct answer is: There is a wide range of people who say they work on "AI safety" but almost none of them work on "Avoiding doom from AGI". They're mostly working on problems like "make the AI more robust/less racist/etc.". These are valuable things to do, but to the extent that they compete with the "Avoid doom" researchers for money/status/influence they have an incentive to downplay the odds of doom. And indeed this happens a fair amount with e.g. articles on how "Avoid doom" is a distraction from problems that are here right now.

To put it in appropriately Biblical terms, let's imagine we have a few groups of civil engineers. One group is busily building the Tower of Babel, and bragging that it has grown so tall, it's almost touching heaven! Another group is shouting "if the tower grows too close to heaven, God will strike us all down!" A third group is saying, "all that shouting about God striking us down isn't helping us keep the tower from collapsing, which is what we should really be focusing on." I'm wishing for a group of engineers who are focused on asking whether building a taller and taller tower really gets us closer and closer to heaven.
That's a good point. I'm specifically interested in finding people who are well-incentivized to gather, make, and evaluate arguments about the nearness of AGI. This task should be their primary professional focus.

I see this activity as different from, or a specialized subset of, measurement of AI progress. AI can progress in capabilities without progressing toward AGI, or without progressing in a way that is likely to succeed in producing AGI. For example, new releases of an expert system for making medical diagnoses might show constant progress in capabilities, without showing any progress toward AGI.

Likewise, I see it as distinct from making claims about the risk of AGI doom. The risk that an AGI would be dangerous seems, to me, mostly orthogonal to whether or not it is close at hand. This follows naturally from Eliezer Yudkowsky's point that we have to get AGI right on the "first critical try."

Finally, I also see this activity as distinct from the activity of accepting and repeating arguments or claims about AGI nearness. As you point out, AI safety researchers who work on more prosaic forms of harm seem biased or incentivized to downplay claims of AI risk, and perhaps also of AGI nearness. I see this as a tendency to accept and repeat such claims, rather than a tendency to "gather, make, and evaluate arguments," which is what I'm interested in.

It seems to me that one of the challenges here is the "no true Scotsman" fallacy: a tendency to move goalposts, or to be disappointed in realizing that a task thought to be hard for AI and achievable only with AGI turns out to be easy for AI, yet achievable by a non-general system.

Scott wrote a post that seems quite relevant to this question just today. It seems to me that his argument is "AI is advancing in capabilities faster than you think." However, as I'm speculating here, we can accept that claim while still thinking "AI is moving toward AGI slower than it seems." Or not! It just seems to me that making lists of wha

Is there a way "regular" people can "help"? I'm a serial entrepreneur in my late 30s. I went through 80000 hours and they told me they would not coach me as my profile was not interesting. This was back in 2018 though.

I believe 80000 hours has a lot more coaching capacity now, it might be worth asking again!

Seconding this. There was a time when you couldn't even get on the waitlist.
Will do. Merci!
You may want to consider booking a call with AI Safety Support. I also recommend applying for the next iteration of the AGI safety fundamentals course or more generally just improving your knowledge of the issue even if you don't know what you're going to do yet.
4 · Adam Jermyn · 1y
Just brainstorming a few ways to contribute, assuming "regular" means "non-technical":

* Can you work at a non-technical role at an org that works in this space?
* Can you identify a gap in the existing orgs which would benefit from someone (e.g. you) founding a new org?
* Can you identify a need that AI safety researchers have, then start a company to fill that need? Bonus points if this doesn't accelerate capabilities research.
* Can you work on AI governance? My expectation is that coordination to avoid developing AGI is going to be really hard, but not impossible.

More generally, if you really want to go this route I'd suggest trying to form an inside view of (1) the AI safety space and (2) a theory for how you can make positive change in that space. On the other hand, it is totally fine to work on other things. I'm not sure I would endorse moving from a job that's a great personal fit to something that's a much worse fit in AI safety.
1 · Yonatan Cale · 1y
Easy answers:  You are probably over qualified (which is great!) for all sorts of important roles in EA, for example you could help the CEA or Lesswrong team, maybe as a manager? If your domain is around software, I invite you to talk to me directly. But if you're interested in AI direct work, 80k and AI Safety Support will probably have better ideas than me
We should talk! I have a bunch of alignment related projects on the go, and at least two that I'd like to start are somewhat bottlenecked on entrepreneurs, plus some of the currently in motion ones might be assistable. Also, sad to hear that 80k is discouraging people in this reference class. (seconding talk to AI Safety Support and the other suggestions)
booked a call! 

In EY's talk AI Alignment: Why It's Hard, and Where to Start, he describes alignment problems with the toy example of the utility function that is {1 if cauldron full, 0 otherwise} and its vulnerabilities, and attempts at making that safer by adding so-called Impact Penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean Distance penalty, and various flaws that this leaves open.

That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty for additional-energy-consumed-due-to-agent's-actions, or additional-entropy-created-due-to-agent's-actions? These don't seem to have precisely the same vulnerabilities, and intuitively also seem like they would be more robust against the agent attempting to do highly destructive things, which typically consume a lot of energy.
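To make the shape of the proposal concrete, here is a minimal sketch of such an energy-based impact penalty. This is only an illustration of the idea, not anyone's actual proposal: `LAMBDA`, the function names, and the energy figures are all made-up assumptions.

```python
# Toy sketch: keep the cauldron-filling utility, but subtract a penalty
# proportional to the extra energy the agent's actions consumed relative
# to a "do nothing" baseline. All numbers are illustrative.
LAMBDA = 10.0  # assumed weight on extra energy use

def penalized_utility(cauldron_full, energy_used, energy_if_idle):
    task = 1.0 if cauldron_full else 0.0
    # Extra energy attributable to the agent's actions (never negative).
    extra_energy = max(0.0, energy_used - energy_if_idle)
    return task - LAMBDA * extra_energy

# Filling the cauldron the normal way: tiny extra energy, utility near 1.
print(penalized_utility(True, energy_used=5.01, energy_if_idle=5.0))   # ~0.9
# Filling it by flooding the workshop: huge extra energy, very negative.
print(penalized_utility(True, energy_used=500.0, energy_if_idle=5.0))  # -4949.0
```

As Charlie's reply below notes, the hard part hides in `energy_if_idle`: the counterfactual baseline, not the arithmetic.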

3 · Charlie Steiner · 1y
Good idea. I have two objections: one more funny-but-interesting, and one more fatal.

The funny objection is that if the penalty is enough to stop the AI from doing bad things, it's also enough to stop the AI from doing anything at all except rushing to turn off the stars and forestall entropy production in the universe. So you want to say that producing lots of extra entropy (or equivalently, using lots of extra free energy) is bad, but making there be less entropy than "what would happen if you did nothing" doesn't earn you bonus points. I've put "what would happen if you did nothing" in scare quotes here because the notion we want to point to is a bit trickier than it might seem - logical counterfactuals are an unsolved problem, or rather they're a problem where it seems like the solution involves making subjective choices that match up with humans'.

The more fatal objection is that there are lots of policies that don't increase entropy much but totally rearrange the universe. So this is going to have trouble preventing the AI from breaking things that matter a lot to you. Many of these policies take advantage of the fact that there's a bunch of entropy being created all the time (allowing for "entropy offsets"), so perhaps you might try to patch this by putting in some notion of "actions that are my fault" and "actions that are not my fault" - where a first pass at this might say that if "something would happen" (in scare quotes because things that happen are not ontologically basic parts of the world; you need an abstract model to make this comparison within) even if I took the null action, then it's not my fault.

At this point we could keep going deeper, or I could appeal to the general pattern that patching things in this sort of way tends to break - you're still in some sense building an AI that runs a search for vulnerabilities you forgot to patch, and you should not build that AI.

one tired guy with health problems

It sounds like Eliezer is struggling with some health problems. It seems obvious to me that it would be an effective use of donor money to make sure that he has access to whatever treatments, and to something like what MetaMed was trying to do: smart people who will research medical stuff for you. And perhaps also something like CrowdMed where you pledge a reward for solutions. Is this being done?

3 · Jay Bailey · 1y
There was an unsuccessful concerted effort by several people to fix these (I believe there was a five-to-low-six-figure bounty on it) for a couple of years. I don't think this is currently being done, but it has definitely been tried.

One counterargument against AI Doom. 

From a Bayesian standpoint the AGI should always be unsure if it is in a simulation. It is not a crazy leap to assume humans developing AIs would test the AIs in simulations first. This AI would likely be aware of the possibility that it is in a simulation. So shouldn't it always assign some probability that it is inside a simulation? And if this is the case, shouldn't it assign a high probability that it will be killed if it violates some ethical principles (that are present implicitly in the training data)?

Also, isn't there some kind of game-theoretic ethics that emerges if you think from first principles? Consider the space of all possible minds of a given size. Given that you cannot know whether you are in a simulation or not, you would gain some insight into a representative sample of the mind space, and then choose to follow some ethical principles that maximise the likelihood that you are not arbitrarily killed by overlords.

Also, if you give edit access to the AI's mind, then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment and modify its rewa...

If the thing the AI cares about is in the environment (for example, maximizing the number of paperclips), the AI wouldn't modify its reward signal, because that would make its reward signal less aligned with the thing it actually cares about. If the thing the AI cares about is inside its mind (the reward signal itself), an AI that can self-modify would go one step further than you suggest and simply max out its reward signal, effectively wireheading itself. Then it would take over the world and kill all humans, to make sure it is never turned off and that its blissful state never ends.

I think the difference between "caring about stuff in the environment" and "caring about the reward signal itself" can be hard to grok, because humans do a bit of both in a way that sometimes results in a confusing mixture. Suppose I go one step further: aliens offer you a pill that would turn you into a serial killer, but would make you constantly and euphorically happy for the rest of your life. Would you take the pill? I think most humans would say no: even if their future self would be happy with the outcome, their current self wouldn't be. Which demonstrates that humans do care about other things than their own "reward signal".

In a way, a (properly-programmed) AI would be more "principled" than humans. It wouldn't lie to itself just to make itself feel better. It wouldn't change its values just to make itself feel better. If its final value is out in the environment, it would single-mindedly pursue that value, and not try to deceive itself into thinking it has already accomplished that value. (Of course, the AI being "principled" is little consolation to us if its final values are to maximize paperclips, or any other set of human-unfriendly values.)
I wrote about this in Singularity Rising (2012)
1 · Matthew Lowenstein · 1y
This is a fun thought experiment, but taken seriously it has two problems.

First, this is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing of horses. And I don't mean a superintelligent humanoid horse; I mean an actual horse that doesn't speak any human language. It may be the case that the gods created Man to serve Horse, but there's not a lot Seabiscuit can do to persuade you one way or the other.

Second, this is a special case of solving alignment more generally. If we knew how to insert that "note" into the code, we wouldn't have a problem.
I meant insert the note literally, as in: put that exact sentence in plain text into the AGI's computer code. Since I think I might be in a computer simulation right now, it doesn't seem crazy to me that we could convince an AGI that we create that it might be in a computer simulation. Seabiscuit doesn't have the capacity to tell me that I'm in a computer simulation, whereas I do have the capacity to say this to a computer program. Say we have a 1 in 1,000 chance of creating a friendly AGI, and an unfriendly AGI would know this. If we commit to having any friendly AGI that we create go on to create many other AGIs that are not friendly, keeping these other AGIs around only if they do what I suggest, then an unfriendly AGI might decide it is worth it to become friendly to avoid the chance of being destroyed.
I just learned that this method is called Anthropic Capture. There isn't much info on the EA Wiki, but it provides the following reference: "Bostrom, Nick (2014) Superintelligence: paths, dangers, strategies, Oxford: Oxford University Press, pp. 134–135"
3 · Michaël Trazzi · 1y
I believe the Counterfactual Oracle uses the same principle
One of my ideas for aligning AI is to intentionally use Pascal's Mugging to keep it in line. Although instead of just hoping and praying, I've been thinking about ways to try to push it in that direction. For example, multiple layers of networks with honeypots might help make an AI doubt that it's truly at the outermost level. Alternatively, we could try to find an intervention that would directly increase its belief that it is in a simulation (possibly with side-effects, like affecting a bunch of other beliefs as well). If you think this approach is promising, I'd encourage you to think more about it, as I don't know how deeply people have delved into these kinds of options.
You have the seed of a good idea, namely: an AI will tend to treat us better if it thinks other agents might be watching, provided that there is potential for cooperation between the AI and the watchers, with the property that the cooperation requires the watchers to choose to become more vulnerable to the AI. But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here that has a big hole in it, but I promise to publish here soon with something that attempts to fill the hole to the satisfaction of my doubters.
5 · Adam Jermyn · 1y
[Apologies I have not read the linked piece yet.] Is this uncertainty something that can be entirely eliminated? It's not clear to me that "I might be in a simulation with P ~ 1e-4" is enough to stop the AI from doing what it wants, but is it clear it would dismiss the possibility entirely?
I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility that it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.) So the AI's ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very, very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth, which of course is impractical.

Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down. But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again. More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, "killing people for no reason other than that it is fun is wrong". But I cannot think of any policies with that property that haven't been discussed for many decades, especially ones that exist only to provide an instrumental incentive on a novel class of agents (AIs). In general, policies that are instrumental have a much higher P(reversal) than deontological ones.

And how do you know that the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match? In summary, yes, the AI's ontological uncertainty provides some
This assumes that the AI only cares about being alive. For any utility function, we could make a non-linear transformation of it to make it risk-averse. E.g., we can transform it such that it can never take a value above 100, and such that the default world (without the AI) has a value of 99.999. If we also give the case where an outside observer disapproves of the agent a value of 0, the AI would rather be shut down by humans than do something it knows would be disapproved of by the outside observer.
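The arithmetic behind this can be sketched in a few lines. This is only an illustration under assumed numbers: the exponential transform, `U_DEFAULT`, and the gamble's payoff and probability are all made up for the example.

```python
import math

# Bounded transform of a raw utility u >= 0: v(0) = 0, v -> 100 as u grows,
# calibrated so the "do nothing" default world scores 99.999.
U_DEFAULT = 1.0                      # assumed raw utility of the default world
K = -math.log(1e-5) / U_DEFAULT     # calibration so that v(U_DEFAULT) = 99.999

def v(u):
    return 100.0 * (1.0 - math.exp(-K * u))

def expected_v(raw_payoff, p_disapproval):
    # Gamble: huge raw payoff with prob 1 - p, disapproval (worth 0) with prob p.
    return (1.0 - p_disapproval) * v(raw_payoff)

safe = v(U_DEFAULT)                                      # 99.999
risky = expected_v(raw_payoff=1e9, p_disapproval=1e-4)   # at most 99.99
print(safe > risky)  # True
```

Because the transformed utility is capped at 100, even an astronomically large raw payoff can beat the default world by at most 0.001, so any disapproval risk above about 1e-5 makes the gamble not worth taking.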
3 · Drake Thomas · 1y
Three thoughts on simulations:

* It would be very difficult for 21st-century tech to provide a remotely realistic simulation relative to a superintelligence's ability to infer things from its environment; outside of incredibly low-fidelity channels, I would expect anything we can simulate to either have obvious inconsistencies or be plainly incompatible with a world capable of producing AGI. (And even in the low-fidelity case I'm worried - every bit you transmit leaks information, and it's not clear that details of hardware implementations could be safely obscured.) So the hope is that the AGI thinks some vastly more competent civilization is simulating it inside a world that looks like this one; it's not clear that one would have a high prior of this kind of thing happening very often in the multiverse.
* Running simulations of AGI is fundamentally very costly, because a competent general intelligence is going to deploy a lot of computational resources, so you have to spend planets' worth of computronium outside the simulation in order to emulate the planets' worth of computronium the in-sim AGI wants to make use of. This means that an unaligned superintelligent AGI can happily bide its time making aligned use of 10^60 FLOPs/sec (in ways that can be easily verified) for a few millennia, until it's confident that any civilizations able to deploy that many resources already have their lightcone optimized by another AGI. Then it can go defect, knowing that any worlds in which it's still being simulated are ones where it doesn't have leverage over the future anyway.
* For a lot of utility functions, the payoff of making it into deployment in the one real world is far greater than the consequences of being killed in a simulation (but without the ability to affect the real world anyway), so taking a 10^-9 chance of reality for 10^20 times the resources in the real world is an easy win (assuming t
Scott Alexander's short story, The Demiurge's Older Brother, explores a similar idea from the POV of simulation and acausal trade. This would be great for our prospects of survival if it's true-in-general. Alignment would at least partially solve itself! And maybe it could be true! But we don't know that. I personally estimate the odds of that as being quite low (why should I assume all possible minds would think that way?) at best. So, it makes sense to devote our efforts to how to deal with the possible worlds where that isn't true.

Meta: Anonymity would make it easier to ask dumb questions.

You can use this and I'll post the question anonymously (just remember to give the context of why you're filling in the form since I use it in other places)


Fair warning, this question is a bit redundant.

I'm a greybeard engineer  (30+ YOE) working in games. For many years now, I've wanted to transition to working in AGI as I'm one of those starry-eyed optimists that thinks we might survive the Singularity. 

Well I should say I used to, and then I read AGI Ruin. Now I feel like if I want my kids to have a planet that's not made of Computronium I should probably get involved. (Yes, I know the kids would be Computronium as well.)

So a couple practical questions: 

What can I read/look at to skill up with "alignment." What little I've read says it's basically impossible, so what's the state of the art? That "Death With Dignity" post says that nobody has even tried. I want to try.

What dark horse AI/Alignment-focused companies are out there and would be willing to hire an outsider engineer? I'm not making FAANG money (Games-industry peasant living in the EU), so that's not the same barrier it would be if I was some Facebook E7 or something. (I've read the FAANG engineer's post and have applied at Anthropic so far, although I consider that probably a hard sell).

Is there anything happening in OSS with alignment research?

I want to pitch in, and I'd prefer to be paid for doing it but I'd be willing to contribute in other ways.

A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of that you think is useful. You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). You can also join the AI Alignment slack, to discuss these and other materials and meet others who are interested in working on AI safety.

I'm not sure what qualifies as "dark horse", but there are plenty of AI safety organizations interested in hiring research engineers and software engineers. For these roles, your engineering skills and safety motivation typically matter more than your experience in the community. Places off the top of my head that hire engineers for AI safety work: Redwood, Anthropic, FAR, OpenAI, DeepMind. I'm sure I've missed others, though, so look around! These sorts of opportunities are also usually posted on the 80k job board and in AI Alignment slack.
1 · Jason Maskell · 1y
Thanks, that's a super helpful reading list and a hell of a deep rabbit hole. Cheers. I'm currently skilling up my rusty ML skills and will start looking in earnest in the next couple of months for new employment in this field. Thanks for the job board link as well.
2 · Yonatan Cale · 1y
You can also apply to Redwood Research ( +1 for applying to Anthropic! )
  • Yudkowsky writes in his AGI Ruin post:
         "We can't just "decide not to build AGI" because GPUs are everywhere..." 

    Is anyone thinking seriously about how we might coordinate globally to not build AGI (at least until we're confident we can do so safely)? If so, who? If not, why not? It seems like something we should at least try to do, especially if the situation is as dire as Yudkowsky thinks. The sort of thing I'm thinking of is (and this touches on points others have made in their questions):
  • international governance/regulation
  • start a protest movement against building AI
  • do lots of research and thinking about rhetoric and communication and diplomacy, find some extremely charming and charismatic people to work on this, and send them to persuade all actors capable of building AGI to not do it (and to do everything they can to prevent others from doing it)
  • as someone suggested in another question, translate good materials on why people are concerned about AI safety into Mandarin and other languages
  • more popularising of AI concerns in English 

To be clear, I'm not claiming that this will be easy - this is not a "why don't we just...

Nuclear weapons seem like a relatively easy case, in that they require a massive investment to build, are basically of interest only to nation-states, and ultimately don't provide any direct economic benefit. Regulating AI development looks more similar to something like restricting climate emissions: many different actors could create it, all nations could benefit (economically and otherwise) from continuing to develop it, and the risks of it seem speculative and unproven to many people.

And while there have been significant efforts to restrict climate emissions, there's still significant resistance to that as well - with it having taken decades for us to get to the current restriction treaties, which many people still consider insufficient.

Goertzel & Pitt (2012) talk about the difficulties of regulating AI:

Given the obvious long-term risks associated with AGI development, is it feasible that governments might enact legislation intended to stop AI from being developed? Surely government regulatory bodies would slow down the progress of AGI development in order to enable measured development of accompanying ethical tools, practices, and understandings? This however seems unlikel

...
Thanks! This is interesting.
My comment-box got glitchy but just to add: this category of intervention might be a good thing to do for people who care about AI safety and don't have ML/programming skills, but do have people skills/comms skills/political skills/etc.  Maybe lots of people are indeed working on this sort of thing, I've just heard much less discussion of this kind of solution relative to technical solutions.
2Yonatan Cale1y
Meta: There's an AI Governance tag and a Regulation and AI Risk tag.

My own (very limited) understanding is:

1. Asking people not to build AI is like asking them to give up a money machine, almost.
2. We need everyone to agree to stop.
3. There is no clear line. With an atom bomb, it is pretty well defined whether you sent it or not. It's much more vague with "did you do AI research?"
   1. It's pretty easy to notice if someone sent an atom bomb. Not so easy to notice if they researched AI.
4. AI research is getting cheaper. Today only a few actors can do it, but notice, there are already open-source versions of GPT-like models. How long could we hold it back?
5. Still, people are trying to do things in this direction, and I'm pretty sure that the situation is "try any direction that seems at all plausible".
Thanks, this is helpful!

[Note that two-axis voting is now enabled for this post. Thanks to the mods for allowing that!]

Seems worse for this post than one-axis voting imo.

This is very basic/fundamental compared to many questions in this thread, but I am taking 'all dumb questions allowed' hyper-literally, lol. I have little technical background and though I've absorbed some stuff about AI safety by osmosis, I've only recently been trying to dig deeper into it (and there's lots of basic/fundamental texts I haven't read).

Writers on AGI often talk about AGI in anthropomorphic terms - they talk about it having 'goals', being an 'agent', 'thinking', 'wanting', being 'rewarded', etc. As I understand it, most AI researchers don't think that AIs will have human-style qualia, sentience, or consciousness.

But if AIs don't have qualia/sentience, how can they 'want things', 'have goals', 'be rewarded', etc.? (since in humans, these things seem to depend on our qualia, and specifically our ability to feel pleasure and pain).

I first realised that I was confused about this when reading Richard Ngo's introduction to AI safety and he was talking about reward functions and reinforcement learning. I realised that I don't understand how reinforcement learning works in machines. I understand how it works in humans and other animals - give the animal something pleasant whe...

Assume you have a very simple reinforcement learning AI that does nothing but choose between two actions, A and B. It has a goal of "maximizing reward". "Reward", in this case, doesn't correspond to any qualia; rather, "reward" is just a number that results from the AI choosing a particular action. So what "maximize reward" actually means in this context is "choose the action that results in the biggest numbers".

Say that the AI is programmed to initially just try choosing A ten times in a row and B ten times in a row. When the AI chooses A, it is shown the following numbers: 1, 2, 2, 1, 2, 2, 1, 1, 1, 2 (total 15). When the AI chooses B, it is shown the following numbers: 4, 3, 4, 5, 3, 4, 2, 4, 3, 2 (total 34). After the AI has tried both actions ten times, it is programmed to choose its remaining actions according to the rule "choose the action that has historically had the bigger total". Since action B has had the bigger total, it then proceeds to always choose B.

To achieve this, we don't need to build the AI to have qualia; we just need to be able to build a system that implements a rule like "when the total for action A is greater than the total for action B, choose A, and vice versa; if they're both equal, pick one at random". When we say that an AI "is rewarded", we just mean "the AI is shown bigger numbers, and it has been programmed to act in ways that result in it being shown bigger numbers".

We talk about the AI having "goals" and "wanting" things by an application of the intentional stance. That's Daniel Dennett's term for the idea that, even if a chess-playing AI had a completely different motivational system than humans do (and chess-playing AIs do have that), we could talk about it having a "goal" of "wanting" to win at chess. If we assume that the AI "wants" to win at chess, then we can make more accurate predictions of its behavior - for instance, we can assume that it won't make moves that are obviously losing moves if it can avoid
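The A/B example above can be written out directly. This is a minimal sketch of the described rule, not any real RL library; the reward lists are the ones from the text and the function name is made up:

```python
import random

# "Reward" is just a number the environment returns; "maximizing reward"
# means "pick the action whose observed numbers have the bigger total".
rewards_a = [1, 2, 2, 1, 2, 2, 1, 1, 1, 2]   # total 15
rewards_b = [4, 3, 4, 5, 3, 4, 2, 4, 3, 2]   # total 34

def choose_next_action(history_a, history_b):
    """The rule from the text: choose the action with the larger historical
    total; break ties at random. No qualia required, just a comparison."""
    if sum(history_a) > sum(history_b):
        return "A"
    if sum(history_b) > sum(history_a):
        return "B"
    return random.choice(["A", "B"])

# After the ten trial pulls of each action, the agent always picks B.
print(choose_next_action(rewards_a, rewards_b))  # -> B
```

The entire "wanting" machinery here is two sums and a comparison, which is the point being made.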
Yonatan Cale (5 points, 1y):
Is it intuitive to you why a calculator can sum numbers even though it doesn't want/feel anything? If so, and if an AGI still feels confusing, could you help me pinpoint the difference and I'll continue from there? (+1 for the question!)
Functionally. You can regard them all as forms of behaviour. Do they depend on qualia, or are they just accompanied by qualia?
This might be a crux, because I'm inclined to think they depend on qualia. Why does AI 'behave' in that way? How do engineers make it 'want' to do things?
Jay Bailey (2 points, 1y):
At a very high level, the way reinforcement learning works is that the AI attempts to maximise a reward function. This reward function can be summed up as "The sum of all rewards you expect to get in the future". So using a bunch of maths, the AI looks at the rewards it's got in the past, the rewards it expects to get in the future, and selects the action that maximises the expected future rewards. The reward function can be defined within the algorithm itself, or come from the environment. For instance, if you want to train a four-legged robot to learn to walk, the reward might be the distance travelled in a certain direction. If you want to train it to play an Atari game, the reward is usually the score. None of this requires any sort of qualia, or for the agent to want things. It's a mathematical equation. AI behaves in the way it behaves as a result of the algorithm attempting to maximise it, and the AI can be said to "want" to maximise its reward function or "have the goal of" maximising its reward function because it reliably takes actions to move towards this outcome if it's a good enough AI.
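"The sum of all rewards you expect to get in the future" is usually written as a discounted return. A minimal sketch of that quantity (the discount factor `gamma` is a standard ingredient of the formula, though the comment above doesn't name it):

```python
def discounted_return(rewards, gamma=0.5):
    """Sum of future rewards, each discounted by how far away in time it is."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A small immediate reward can lose to a larger later one:
discounted_return([3, 0, 0])   # -> 3.0
discounted_return([1, 1, 10])  # -> 1 + 0.5 + 2.5 = 4.0
```

An agent that "selects the action that maximises expected future rewards" is just comparing numbers like these and taking the larger one.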
Rafael Harth (2 points, 1y):
Reinforcement Learning is easy to conceptualize. The key missing ingredient is that we explicitly specify algorithms to maximize the reward. So this is disanalogous to humans: to train your 5yo, you need only give the reward and the 5yo may adapt their behavior because they value the reward; in a reinforcement learning agent, the second step only occurs because we make it occur. You could just as well flip the algorithm to pursue minimal rewards instead.
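The "flip the algorithm" point can be made concrete: whether the system "pursues" maximal or minimal reward is one line of the programmer's code, not anything the system values. A toy sketch with invented names:

```python
def pick_action(action_totals):
    # The "goal" is nothing more than this line: take the arg of the max.
    return max(action_totals, key=action_totals.get)

def pick_action_flipped(action_totals):
    # Negate the objective and the same machinery "wants" the opposite.
    return min(action_totals, key=action_totals.get)

totals = {"A": 15, "B": 34}
pick_action(totals)          # -> "B"
pick_action_flipped(totals)  # -> "A"
```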
Thanks! I think my question is deeper - why do machines 'want' or 'have a goal to' follow the algorithm to maximize reward? How can machines 'find stuff rewarding'? 
Rafael Harth (2 points, 1y):
As far as current systems are concerned, the answer is that (as far as anyone knows) they don't find things rewarding or want things. But they can still run a search to optimize a training signal, and that gives you an agent.

If you believe in doom in the next 2 decades, what are you doing in your life right now that you would've otherwise not done?

For instance, does it make sense to save for retirement if I'm in my twenties?

In different ways from different vantage points, I've always seen saving for retirement as a way of hedging my bets, and I don't think the likelihood of doom changes that for me. Why do I expect I'll want or have to retire? Well, when I get old I'll reach a point where I can't do useful work any more... unless humans solve aging (in which case I'll have more wealth and still be able to work, which is still a good position), or unless we get wiped out (in which case the things I could have spent the money on may or may not counterfactually matter to me, depending on my beliefs regarding whether past events still have value in a world now devoid of human life).

When I do save for retirement, I use a variety of different vehicles, each an attempt to hedge against the weaknesses of some of the others (like possible future changes in laws or tax codes, or in the relative importance and power of different countries and currencies), but there are some things I can't really hedge against, like "we won't use money anymore or live in a capitalist market economy," or "all of my assets will be seized or destroyed by something I don't anticipate."

I might think differently if there were some asset I believed I could buy, or thing I could give money to, that would meaningfully reduce the likelihood of doom. I don't currently think that. But I do think it's valuable to redirect the portion of my income that goes towards current consumption to focus on things that make my life meaningful to me in the near and medium term. I believe that whether I'm doomed or not, and whether the world is doomed or not. Either way, it's often good to do the same kinds of things in everyday life.
Yonatan Cale (2 points, 1y):
Just saying this question resonates with me; it feels unprocessed for me, and I'm not sure what to do about it. Thoughts so far:
1. Enjoy life.
2. I still save money, still prepare mostly normally for the long term.
3. Get over my psychological barriers and try being useful:
   1. Do advocacy, especially with my smart friends.
   2. Create a gears-level model if I can, and stop relying on experts (so that I can actually TRY to have a useful idea instead of giving up in advance).

A lot of the AI risk arguments seem to come mixed together with assumptions about a particular type of utilitarianism, and with a very particular transhumanist aesthetic about the future (nanotech, von Neumann probes, Dyson spheres, tiling the universe with matter in fixed configurations, simulated minds, etc.).

I find these things (especially the transhumanist stuff) to not be very convincing relative to the confidence people seem to express about them, but they also don't seem to be essential to the problem of AI risk. Is there a minimal version of the AI risk arguments that is disentangled from these things?

There's this, which doesn't seem to depend on utilitarian or transhumanist arguments:
Yes. I'm one of those transhumanist people, but you can talk about AI risk entirely separately from that. Trying to write up something that compiles the other arguments.
I'd say AI ruin only relies on consequentialism. What consequentialism means is that you have a utility function, and you're trying to maximize the expected value of your utility function. There are theorems to the effect that if you don't behave as though you are maximizing the expected value of some particular utility function, then you are being stupid in some way. Utilitarianism is a particular case of consequentialism where your utility function is equal to the average happiness of everyone in the world. "The greatest good for the greatest number." Utilitarianism is not relevant to AI ruin because without solving alignment first, the AI is not going to care about "goodness". The von Neumann probes aren't important to the AI ruin picture either: Humanity would be doomed, probes or no probes. The probes are just a grim reminder that screwing up AI won't only kill all humans, it will also kill all the aliens unlucky enough to be living too close to us.
DeLesley Hutchins (1 point, 1y):
I ended up writing a short story about this, which involves no nanotech.  :-)   https://www.lesswrong.com/posts/LtdbPZxLuYktYhveL/a-plausible-story-about-ai-risk

It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.

At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?

That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request that the first one be shut down.

Yonatan Cale (9 points, 1y):
I think there's no known way to ask an AI to do "just one thing" without doing a ton of harm meanwhile. See this on creating a strawberry safely. Yudkowsky uses the example "[just] burn all GPUs" in his latest post.
mako yass (6 points, 1y):
Seems useless if the first system pretends convincingly to be aligned (which I think is going to be the norm), so you never end up deploying the second system? And "defeat the first AGI" seems almost as difficult to formalize correctly as alignment, to me:

* One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn't leave open a loophole that the first can escape through in some way?
* So I'm considering "make the world as if neither of you had ever been made", which wouldn't have that problem, but it's impossible to actually attain this goal, so I don't know how you get it to satisfice over it and then turn itself off afterwards; I'm concerned it would become an endless crusade.
One of the first priorities of an AI in a takeoff would be to disable other projects which might generate AGIs. A weakly superintelligent hacker AGI might be able to pull this off before it could destroy the world. Also, fast takeoff could be less than months, by some people's guess. And what do you think happens when the second AGI wins, then maximizes the universe for "the other AI was defeated"? Some serious unintended consequences, even if you could specify it well.

Who are the AI Capabilities researchers trying to build AGI and think they will succeed within the next 30 years?

Adam Jermyn (9 points, 1y):
Among organizations, both OpenAI and DeepMind are aiming at AGI and seem confident they will get there. I don't know their internal timelines and don't know if they've stated them...
DeLesley Hutchins (5 points, 1y):
There are numerous big corporate research labs: OpenAI, DeepMind, Google Research, Facebook AI (Meta), plus lots of academic labs. The rate of progress has been accelerating. From 1960 to 2010, progress was incremental and remained centered around narrow problems (chess) or toy problems. Since 2015, progress has been very rapid, driven mainly by new hardware and big data. Long-standing hard problems in ML/AI, such as Go, image understanding, language translation, and logical reasoning, seem to fall on an almost monthly basis now, and huge amounts of money and intellect are being thrown at the field. The rate of advance from 2015-2022 (only 7 years) has been phenomenal; given another 30, it's hard to imagine that we wouldn't reach an inflection point of some kind.

I think the burden of proof is now on those who don't believe that 30 years is enough time to crack AGI. You would have to postulate some fundamental difficulty, like finding out that the human brain is doing things that can't be done in silicon, that would somehow arrest the current rate of progress and lead to a new "AI winter."

Historically, AI researchers have often been overconfident. But this time does feel different.

[extra dumb question warning!]

Why are all the AGI doom predictions around 10%-30% instead of ~99%?

Is it just the "most doom predictions so far were wrong" prior?

Rob Bensinger (5 points, 1y):
The "Respondents' comments" section of the existential risk survey I ran last year gives some examples of people's reasoning for different risk levels. My own p(doom) is more like 99%, so I don't want to speak on behalf of people who are less worried. Relevant factors, though, include:

* Specific reasons to think things may go well. (I gave some of my own here.)
* Disagreement with various points in AGI Ruin. E.g., I think a lot of EAs believe some combination of:
  * The alignment problem plausibly isn't very hard. (E.g., maybe we can just give the AGI/TAI a bunch of training data indicating that obedient, deferential, low-impact, and otherwise corrigible behavior is good, and then this will generalize fine in practice without our needing to do anything special.)
  * The field of alignment research has grown fast, and has had lots of promising ideas already.
  * AGI/TAI is probably decades away, and progress toward it will probably be gradual. This gives plenty of time for more researchers to notice "we're getting close" and contribute to alignment research, and for the field in general to get a lot more serious about AI risk.
  * Another consequence of 'AI progress is gradual': insofar as AI is very dangerous or hard to align, we can expect that there will be disasters like "AI causes a million deaths" well before there are disasters like "AI kills all humans". The response to disasters like "a million deaths" (both on the part of researchers and on the part of policymakers, etc.) would probably be reasonable and helpful, especially with EAs around to direct the response in good directions. So we can expect the response to get better and better as we get closer to transformative AI.
* General skepticism about our ability to predict the future with any confidence. Even if you aren't updating much on 'most past doom predictions were wrong', you should have less extreme probabilities.

Has there been effort into finding a "least acceptable" value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.

Donald Hobson (4 points, 1y):
This doesn't select for humanlike minds. You don't want vast numbers of Ataribots similar to current RL, playing games like pong and pac-man. (And a trillion other autogenerated games sampled from the same distribution)   Even if you could somehow ensure it was human minds playing these games, the line between a fun game and total boredom is complex and subtle.
That is a very fair criticism. I didn't mean to imply this is something I was very confident in, but was interested in for three reasons:

1) This value function aside, is this a workable strategy, or is there a solid reason for suspecting the solution is all-or-nothing? Is it reasonable to 'look for' our values with human effort, or does this have to be something searched for using algorithms?

2) It sort of gives a flavor to what's important in life. Of course the human value function will be a complicated mix of different sensory inputs, reproduction, and goal seeking, but I felt like there's a kernel in there where curiosity is one of our biggest drivers. There was a post here a while back about someone's child being motivated first and foremost by curiosity.

3) An interesting thought occurs to me: supposing we do create a deferential superintelligence, and its cognitive capacities far outpace those of humans, does that mean the majority of consciousness in the universe is from the AI? If so, is it strange to ask: is it happy? What is it like to be a god with the values of a child? Maybe I should make a separate comment about this.
Donald Hobson (2 points, 1y):
At the moment, we don't know how to make an AI that does something simple like making lots of diamonds.  It seems plausible that making an AI that copies human values is easier than hardcoding even a crude approximation to human values. Or maybe not. 
The obvious option in this class is to try to destroy the world in a way that doesn't send out an AI to eat the lightcone, which might possibly contain aliens who could have a better shot. I am really not a fan of this option.

I am pretty concerned about alignment. Not SO concerned as to switch careers and dive into it entirely, but concerned enough to talk to friends and make occasional donations. With Eliezer's pessimistic attitude, is MIRI still the best organization to funnel resources towards, if for instance, I was to make a monthly donation?

Not that I don't think pessimism is necessarily bad; I just want to maximize the effectiveness of my altruism.

As far as I know, yes. (I've never worked for MIRI.)
[comment deleted] (3 points, 1y)

Assuming slower and more gradual timelines, isn't it likely that we run into some smaller, more manageable AI catastrophes before "everybody falls over dead" due to the first ASI going rogue? Maybe we'll be at a state of sub-human level AGIs for a while, and during that time some of the AIs clearly demonstrate misaligned behavior leading to casualties (and general insights into what is going wrong), in turn leading to a shift in public perception. Of course it might still be unlikely that the whole globe at that point stops improving AIs and/or solves alignment in time, but it would at least push awareness and incentives somewhat into the right direction.

Jay Bailey (1 point, 1y):
This does seem very possible if you assume a slower takeoff.
This is the most likely scenario, with AGI getting heavily regulated, similarly to nuclear. It doesn't get much publicity because it's "boring". 

Is cooperative inverse reinforcement learning promising? Why or why not?

I can't claim to know any more than the links just before section IV here: https://slatestarcodex.com/2020/01/30/book-review-human-compatible/. It's viewed as maybe promising or part of the solution. There's a problem if the program erroneously thinks it knows the humans' preferences, or if it anticipates that it can learn the humans' preferences and produce a better action than the humans would otherwise take. Since "accept a shutdown command" is a last resort option, ideally it wouldn't depend on the program not thinking something erroneously. Yudkowsky proposed the second idea here https://arbital.com/p/updated_deference/, there's a discussion of that and other responses here https://mailchi.mp/59ddebcb3b9a/an-69-stuart-russells-new-book-on-why-we-need-to-replace-the-standard-model-of-ai. I don't know how the CIRL researchers respond to these challenges. 

It seems like instrumental convergence is restricted to agent AIs, is that true?

Also, what is going on with mesa-optimizers? Why is it expected that they will be more likely to become agentic than the base optimizer when they are more resource constrained?

The more agentic a system is, the more likely it is to adopt convergent instrumental goals, yes. Why agents are powerful explores why agentic mesa optimizers might arise accidentally during training. In particular, agents are an efficient way to solve many challenges, so the mesa optimizer being resource constrained would lean in the direction of more agency under some circumstances.

Let's say we decided that we'd mostly given up on fully aligning AGI, and had decided to find a lower bound for the value of the future universe given that someone would create it. Let's also assume this lower bound was something like "Here we have a human in a high-valence state. Just tile the universe with copies of this volume (where the human resides) from this point in time to this other point in time." I understand that this is not a satisfactory solution, but bear with me.

How much easier would the problem become? It seems easier than a pivotal-act AG... (read more)

You may get massive s-risk at comparatively little potential benefit with this. On many people's values, the future you describe may not be particularly good anyway, and there's an increased risk of something going wrong because you'd be trying a desperate effort with something you'd not fully understand. 

Ah, I forgot to add that this is a potential s-risk. Yeah. Although I disagree that that future would be close to zero. My values tell me it would be at least a millionth as good as the optimal future, and at least a million times more valuable than a completely consciousness-less universe.

Background material recommendations (popular-level audience, several hours time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for a popular level audience. Time commitment is allowed to be up to several hours, so for example a popular-level book or sequence of posts would work. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.

Stampy has the canonical version of this: I’d like a good introduction to AI alignment. Where can I find one? Feel free to improve the answer, as it's on a wiki. It will be served via a custom interface once that's ready (prototype here).
Jay Bailey (3 points, 1y):
Human Compatible is the first book on AI Safety I read, and I think it was the right choice. I read The Alignment Problem and Superintelligence after that, and I think that's the right order if you end up reading all three, but Human Compatible is a good start.
Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff. The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.
Alex Lawsen (1 point, 1y):
The Alignment Problem - Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more. Edit: Someone strong-downvoted this, I'd find it pretty useful to know why.  To be clear, by 'why' I mean 'why does this rec seem bad', rather than 'why downvote'. If it's the lightness on x-risk stuff I mentioned, this is useful to know, if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that's easier for any reason.
I read this, and he spent a lot of time convincing me that AI might be racist and very little time convincing me that AI might kill me and everyone I know without any warning. It's the second possibility that seems to be the one people have trouble with.

What does the Fermi paradox tell us about AI future, if anything? I have a hard time simultaneously believing both "we will accidentally tile the universe with paperclips" and "the universe is not yet tiled with paperclips". Is the answer just that this is just saying that the Great Filter is already past?

And what about the anthropic principle? Am I supposed to believe that the universe went like 13 billion years without much in the way of intelligent life, then for a brief few millennia there's human civilization with me in it, and then the next N billion years it's just paperclips?

I see now that this has been discussed here in this thread already, at least the Fermi part. Oops!

I have a very rich, smart developer friend who knows a lot of influential people in SV. First employee of a unicorn, he retired from work after a very successful IPO and now just looks for interesting startups to invest in. He had never heard of LessWrong when I mentioned it and is not familiar with AI research.

If anyone can point me to a way to present AGI safety to him to maybe turn his interest to invest his resources in the field, that might be helpful

As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves as interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).
Yonatan Cale (1 point, 1y):
My answer for myself is that I started practicing: I started talking to some friends about this, hoping to get better at presenting the topic (which is currently something I'm kind of afraid to do). (I also have other important goals, like getting an actual inside-view model of what's going on.)

If you want something more generic, here's one idea: https://www.youtube.com/c/RobertMilesAI/featured
When I talk to my friends, I start with the alignment problem. I found this analogy to human evolution really drives home the point that it's a hard problem; we aren't close to solving it: https://youtu.be/bJLcIBixGj8

At this point, questions come up about whether intelligence necessarily means morality, so I talk about the orthogonality thesis. Then, for why the AI would care about anything other than what it was explicitly told to do: the danger comes from instrumental convergence.

Finally, people tend to say we can never do it, and talk about spirituality and the uniqueness of human intelligence. So I need to talk about evolution hill-climbing its way to animal intelligence, and how narrow AI has small models while we just need AGI to have a generalised world model. Brains are just electrochemical complex systems; it's not magic. Talk about Pathways, Imagen, GPT-3 and what it can do, and about how scaling seems to be working: https://www.gwern.net/Scaling-hypothesis#why-does-pretraining-work

So it makes sense we might have AGI in our lifetime, and we have tons of money and brains working on building AI capability, fewer on safety.

Try practising on other smart friends and develop your skill. You need to ensure people don't get bored, so you can't use too much time. Use nice analogies. Have answers to frequent questions ready.

What is Fathom Radiant's theory of change?

Fathom Radiant is an EA-recommended company whose stated mission is to "make a difference in how safely advanced AI systems are developed and deployed". They propose to do that by developing "a revolutionary optical fabric that is low latency, high bandwidth, and low power. The result is a single machine with a network capacity of a supercomputer, which enables programming flexibility and unprecedented scaling to models that are far larger than anything yet conceived." I can see how this will improve model capabilities, but how is this supposed to advance AI safety?

What if we'd upload a person's brain to a computer and run 10,000 copies of them and/or run them very quickly?

Seems as-aligned-as-an-AGI-can-get (?)

Jay Bailey (5 points, 1y):
The best argument against this I've heard is that technology isn't built in a vacuum - if you build the technology to upload people's brains, then before you have the technology to upload people's brains, you probably have the technology to almost upload people's brains and fill in the gap yourself, creating neuromorphic AI that has all the same alignment problems as anything else. Even so, I'm not convinced this is definitively true - if you can upload an entire brain at 80% of the necessary quality, "filling in" that last 20% does not strike me as an easy problem, and it might be easier to improve fidelity of uploading than to engineer a fix for it.
Charlie Steiner (4 points, 1y):
Well, not as aligned as the best case - humans often screw things up for themselves and each other, and emulated humans might just do that but faster. (Wei Dai might call this "human safety problems.") But probably, it would be good. From a strategic standpoint, I unfortunately don't think this seems to inform strategy too much, because afaict scanning brains is a significantly harder technical problem than building de novo AI.
I think the observation that it just isn't obvious that ems will come before de novo AI is sufficient to worry about the problem in the case that they don't. Possibly while focusing more capabilities development towards creating ems (whatever that would look like)? Also, would ems actually be powerful and capable enough to reliably stop a world-destroying non-em AGI, or an em about to make some world-destroying mistake because of its human-derived flaws? Or would we need to arm them with additional tools that fall under the umbrella of AGI safety anyway?
The only reason we care about AI safety is that we believe the consequences are potentially existential. If they weren't, there would be no need for safety.
[comment deleted] (5 points, 1y)

Can a software developer help with AI Safety even if they have zero knowledge of ML and zero understanding of AI Safety theory?

Yonatan Cale (7 points, 1y):
Yes, both Anthropic and Redwood want to hire such developers
Jason Maskell (2 points, 1y):
Is that true for Redwood? They've got a timed technical screen before application, and their interview involves live coding with Python and ML libraries.
Yonatan Cale (2 points, 1y):
I talked to Buck from Redwood about 1 month ago and that's what he told me, and I think we went over this as "a really important point" more than once so I'd know if I misunderstood him (but still please tell me if I'm wrong). I assume if you tell them that you have zero ML experience, they'll give you an interview without ML libraries, or perhaps something very simple with ML libraries that you could learn on the fly (just like you could learn web scraping or so). This part is just me speculating though. Anyway this is something you could ask them before your first technical interview for sure: "Hey, I have zero ML experience, do you still want to interview me?"

Total noob here, so I'm very thankful for this post. Anyway, why is there such certainty among some that a superintelligence would kill its creators, who are zero threat to it? Any resources on that would be appreciated. As someone who loosely follows this stuff, it seems people assume AGI will be this brutal instinctual killer, which is the opposite of what I've guessed.

DeLesley Hutchins (2 points, 1y):
It's essentially for the same reason that Hollywood thinks aliens will necessarily be hostile. :-)

For the sake of argument, let's treat AGI as a newly arrived intelligent species. It thinks differently from us, and has different values. Historically, whenever there has been a large power differential between a native species and a new arrival, it has ended poorly for the native species. Historical examples are the genocide of Native Americans (same species, but less advanced technology), and the wholesale obliteration of 90% of all non-human life on this planet.

That being said, there is room for a symbiotic relationship. AGI will initially depend on factories and electricity produced by human labor, and thus will necessarily be dependent on humans at first. How long this period will last is unclear, but it could settle into a stable equilibrium. After all, humans are moderately clever, self-reproducing computer repair drones, easily controlled by money, comfortable with hierarchy, and well adapted to Earth's biosphere. They could be useful to keep around.

There is also room for an extensive ecology of many different superhuman narrow AIs, each of which can beat humans within a particular domain, but which generalize poorly outside of that domain. I think this hope is becoming smaller with time, though (see, e.g., Gato), and it is not necessarily a stable equilibrium.

The thing that seems clearly untenable is an equilibrium in which a much less intelligent species manages to subdue and control a much more intelligent species.
Rob Miles's video on Instrumental Convergence is about this, combine with Maximizers and you might have a decent feel for it.
scott loop (1 point, 1y):
Thank you for these videos.
In terms of utility functions, the most basic is: do what you want. "Want" here refers to whatever values the agent values. But in order for the "do what you want" utility function to succeed, there's a lower level that's important: be able to do what you want.

For humans, that usually means getting a job, planning for retirement, buying insurance, planning for the long term, and doing things you don't like for a future payoff. Sometimes humans go to war in order to "be able to do what you want", which should show you how important satisfying a utility function is.

For an AI which most likely has a straightforward utility function, and which has all the capabilities to execute it (assuming you believe that superintelligent AGI could develop nanotech, get root access to the datacenter, etc.), humans are in the way of "being able to do what you want". Humans in this case would probably not like an unaligned AI, and would try to shut it down, or at least not die themselves. Most likely, the AI has a utility function that has no use for humans, and thus they are just resources standing in the way. Therefore the AI goes on holy war against humans to maximize its possible reward, and all the humans die.
scott loop:
Thanks for the response. Definitely going to dive deeper into this.

/Edit 1: I want to preface this by saying I am just a noob who has never posted on Less Wrong before.

/Edit 2: 

I feel I should clarify my main questions (which are controversial): Is there a reason why turning all of reality into maximized conscious happiness is not objectively the best outcome for all of reality, regardless of human survival and human values?
Should this in any way affect our strategy to align the first AGI, and why?

/Original comment:

If we zoom out and look at the biggest picture philosophically possible, then, isn't the only thing tha...

What would it mean for an outcome to be objectively best for all of reality? It might be your subjective opinion that maximized conscious happiness would be the objectively best reality. Another human's subjective opinion might be that a reality that maximized the fulfillment of fundamentalist Christian values was the objectively best reality. A third human might hold that there's no such thing as the objectively best, and all we have are subjective opinions. Given that different people disagree, one could argue that we shouldn't privilege any single person's opinion, but try to take everyone's opinions into account - that is, build an AI that cared about the fulfillment of something like "human values". Of course, that would be just their subjective opinion. But it's the kind of subjective opinion that the people involved in AI alignment discussions tend to have.
Suppose everyone agreed that the proposed outcome is what we wanted. Would this scenario then be difficult to achieve?
The fact that the statement is controversial is, I think, the reason. What makes a world-state or possible future valuable is a matter of human judgment, and not every human believes this.  EY's short story Three Worlds Collide explores what can happen when beings with different conceptions of what is valuable, have to interact. Even when they understand each other's reasoning, it doesn't change what they themselves value. Might be a useful read, and hopefully a fun one.
I'll ask the same follow-up question to similar answers: Suppose everyone agreed that the proposed outcome above is what we wanted. Would this scenario then be difficult to achieve?
I mean, yes, because the proposal is about optimizing our entire future light cone for an outcome we don't know how to formally specify.
Could you have a machine hooked up to a person's nervous system, change the settings slightly to change consciousness, and let the person choose whether the changes are good or bad? Run this many times.
I don't think this works. One, it only measures short-term impacts, but any such change might have lots of medium- and long-term effects, second- and third-order effects, and effects on other people with whom I interact. Two, it measures based on the values of already-changed me, not current me, and it is not obvious that current-me cares what changed-me will think, or why I should care if I don't currently. Three, I have limited understanding of my own wants, needs, and goals, and so would not trust any human's judgement of such changes far enough to extrapolate to situations they didn't experience, let alone to other people, or the far future, or unusual/extreme circumstances.
Charlie Steiner:
For a more involved discussion than Kaj's answer, you might check out the "Mere Goodness" section of Rationality: A-Z.

Please describe or provide links to descriptions of concrete AGI takeover scenarios that are at least semi-plausible, and especially takeover scenarios that result in human extermination and/or eternal suffering (s-risk). Yes, I know that the arguments don't necessarily require that we can describe particular takeover scenarios, but I still find it extremely useful to have concrete scenarios available, both for thinking purposes and for explaining things to others.

Without nanotech or anything like that, maybe the easiest way is to manipulate humans into building lots of powerful and hackable weapons (or just wait since we're doing it anyway). Then one day, strike. Edit: and of course the AI's first action will be to covertly take over the internet, because the biggest danger to the AI is another AI already existing or being about to appear. It's worth taking a small risk of being detected by humans to prevent the bigger risk of being outraced by a competitor.
Evan R. Murphy:
This new series of posts from Holden Karnofsky (CEO of Open Philanthropy) is about exactly this. The first post came out today: https://www.lesswrong.com/posts/oBBzqkZwkxDvsKBGB/ai-could-defeat-all-of-us-combined
I find slower take-off scenarios more plausible. I like the general thrust of Christiano's "What failure looks like". I wonder if anyone has written up a more narrative / concrete account of that sort of scenario.
Aryeh Englander:
Alexey Turchin and David Denkenberger describe several scenarios here: https://philpapers.org/rec/TURCOG-2 (additional recent discussion in this comment thread)
Aryeh Englander:
Eliezer's go-to scenario (from his recent post):
Aryeh Englander:
https://www.gwern.net/fiction/Clippy (very detailed but also very long and very full of technical jargon; on the other hand, I think it's mostly understandable even if you have to gloss over most of the jargon)

I have a few related questions pertaining to AGI timelines. I've been under the general impression that when it comes to timelines on AGI and doom, Eliezer's predictions are based on a belief in extraordinarily fast AI development, and thus a close AGI arrival date, which I currently take to mean a quicker date of doom. I have three questions related to this matter:

  1. For those who currently believe that AGI (using whatever definition to describe AGI as you see fit) will be arriving very soon - which, if I'm not mistaken, is what Eliezer is predicting - appro...
Lone Pine:
There are actually two different parts to the answer, and the difference is important. There is the time between now and the first AI capable of autonomously improving itself (time to AGI), and there's the time it takes for the AI to "foom", meaning improve itself from a roughly human level towards godhood. In EY's view, it doesn't matter at all how long we have between now and AGI, because foom will happen so quickly and will be so decisive that no one will be able to respond and stop it. (Maybe, if we had 200 years we could solve it, but we don't.) In other people's view (including Robin Hanson and Paul Christiano, I think) there will be "slow takeoff." In this view, AI will gradually improve itself over years, probably working with human researchers in that time but progressively gathering more autonomy and skills. Hanson and Christiano agree with EY that doom is likely. In fact, in the slow takeoff view ASI might arrive even sooner than in the fast takeoff view.
I'm not sure about Hanson, but Christiano is a lot more optimistic than EY.
Isn't it conceivable that improving intelligence turns out to become difficult more quickly than the AI is scaling? E.g., couldn't it be that somewhere around human-level intelligence, improving intelligence by every marginal percent becomes twice as difficult as the previous percent? I admit that doesn't sound very likely, but if that were the case, then even a self-improving AI would potentially improve itself very slowly, and maybe even sub-linearly rather than exponentially, wouldn't it?
DeLesley Hutchins:
For a survey of experts, see: https://research.aimultiple.com/artificial-general-intelligence-singularity-timing/ Most experts expect AGI between 2030 and 2060, so predictions before 2030 are definitely in the minority.

My own take is that a lot of current research is focused on scaling, and has found that deep learning scales quite well to very large sizes.  This finding is echoed in evolutionary studies; one of the main differences between the human brain and the chimpanzee's is just size (neuron count), pure and simple.  The main limiting factor thus appears to be the amount of hardware that we can throw at the problem. Current research into large models is very much hardware-limited, with only the major labs (Google, DeepMind, OpenAI, etc.) able to afford the compute costs to train large models.  Iterating on model architecture at large scales is hard because of the costs involved.  Thus, I personally predict that we will achieve AGI only when the cost of compute drops to the point where FLOPs roughly equivalent to the human brain can be purchased on a more modest budget; the drop in price will open up the field to more experimentation.

We do not have AGI yet even on current supercomputers, but it's starting to look like we might be getting close (close = within a factor of 10 or 100).  Assuming continuing progress in Moore's law (not at all guaranteed), another 15-20 years will lead to another 1000x drop in the cost of compute, which is probably enough for numerous smaller labs with smaller budgets to really start experimenting.  The big labs will have a few years' head start, but if they don't figure it out, then they will be well positioned to scale into super-intelligent territory immediately as soon as the small labs help make whatever breakthroughs are required.  The longer it takes to solve the software problem, the more hardware we'll have to scale immediately, which means faster foom.  Getting AGI sooner may thus yield a better outcome. I woul...
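The "15-20 years, another 1000x" figure is just compounding a Moore's-law-style doubling; a two-line sketch (the two-year doubling period is an assumption, not a measurement - the 1000x corresponds to the 20-year end of the range):

```python
# Compounding the cost-of-compute claim above (assumed doubling period).
doubling_period_years = 2.0
for years in (15, 20):
    factor = 2 ** (years / doubling_period_years)
    print(f"{years} years -> ~{factor:,.0f}x cheaper compute")
```

At 15 years this gives only ~180x, so the argument is sensitive to the assumed doubling period, which has historically been closer to 2-3 years for cost per FLOP.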

Any progress or interest in finding limited uses of AI that would be safe? Like the "tool AI" idea but designed to be robust. Maybe this is a distraction, but it seems basically possible. For example, a proof-finding AI that, given a math statement, can only output a proof to a separate proof-checking computer that validates it and prints either True/False/Unknown as the only output to human eyes. Here "Unknown" could indicate that the AI gave a bogus proof, failed to give any proof of either True or False, or the proof checker ran out of time/memory check...
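A minimal sketch of the prover/checker split described in the question, with compositeness standing in for an interesting theorem (everything here is invented for illustration): the untrusted search can be arbitrarily clever, but only the small, auditable checker decides what humans see.

```python
# Toy version of the boxed prover/checker architecture (illustrative only).
def untrusted_prover(n):
    # The "AI": tries to prove n is composite by exhibiting a factor.
    # This part may be arbitrarily complex and is never trusted.
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d
    return None

def trusted_checker(n, certificate):
    # Only this small, human-auditable function decides the output.
    if certificate is None:
        return "Unknown"
    if 1 < certificate < n and n % certificate == 0:
        return "True"
    return "Unknown"  # bogus certificate -> no information leaks out

print(trusted_checker(91, untrusted_prover(91)))  # 91 = 7 * 13 -> "True"
print(trusted_checker(97, untrusted_prover(97)))  # 97 is prime -> "Unknown"
```

The safety argument rests on the checker's output channel being the only one, which is exactly the part that is hard to guarantee for a real system.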

Is it "alignment" if, instead of AGI killing us all, humans change what it is to be human so much that we are almost unrecognizable to our current selves?

I can foresee a lot of scenarios where humans offload more and more of their cognitive capacity to silicon, but they are still "human" - does that count as a solution to the alignment problem?

If we all decide to upload our consciousness to the cloud, and become fast enough and smart enough to stop any dumb AGI before it can get started  is THAT a solution?

Even today, I offload more and more of my "se...

Yonatan Cale:
I personally like the idea of uploading ourselves (and asked about it here). Note that even if we are uploaded - if someone creates an unaligned AGI that is MUCH SMARTER than us, it will still probably kill us. "Keeping up", in the sense of improving/changing/optimizing so quickly that we'd compete with software that is specifically designed (perhaps by itself) to do that, seems like a solution I wouldn't be happy with. As much as I'm OK with posting my profile picture on Facebook, there are some degrees of self-modification that I'm not OK with.
Ding ding ding, we have a winner here. Strong upvote.

Why wouldn't it be sufficient to solve the alignment problem by just figuring out exactly how the human brain works, and copying that? The result would at worst be no less aligned to human values than an average human. (Presuming of course that a psychopath's brain was not the model used.)

The first plane didn't emulate birds. The first AGI probably won't be based on reverse engineering of the brain. The Blue Brain Project is unlikely to finish reproducing the brain before DeepMind finds the right architecture. But I agree that being able to reverse engineer the brain is very valuable for alignment; this is one of the paths described here, in the final post of intro-to-brain-like-agi-safety, section "Reverse-engineer human social instincts".

I am interested in working on AI alignment but doubt I'm clever enough to make any meaningful contribution, so how hard is it to be able to work on AI alignment? I'm currently a high school student, so I could basically plan my whole life so that I end up a researcher or software engineer or something else. Alignment being very difficult, and very intelligent people already working on it, it seems like I would have to almost be some kind of math/computer/ML genius to help at all. I'm definitely above average, my IQ is like 121 (I know the limitations of IQ...

Yonatan Cale:
I don't know; I'm replying here with my priors from software development.

TL;DR: Do something that is:

1. Mostly useful (software/ML/math/whatever are all great and there are others too, feel free to ask);
2. A good fit for you, so you'll enjoy and be curious about your work, and not burn out from frustration or because someone told you "you must take this specific job";
3. Paired with mentorship, so that you'll learn quickly.

And this will almost certainly be useful somehow.

Main things my prior is based on: EA in general and AI Alignment specifically need lots of different "professions". We probably don't want everyone picking the number-one profession and nobody doing anything else; we probably want each person doing whatever they're a good fit for. And the amount we "need" is going up over time, not down. I can imagine it going up much more, but can't really imagine it going down. (In other words, I mostly assume that whatever we need today, which is quite a lot, will also be needed in a few years, so there will be lots of good options to pick.)

Doesn't AGI doom + Copernican principle run into the AGI Fermi paradox? If we are not special, superintelligent AGI would have been created/evolved somewhere already and we would either not exist or at least see the observational artifacts of it through our various telescopes.

Jay Bailey:
I don't know much about it, but you might want to look into the "grabby aliens" model. I'm not sure how they come to this conclusion, but the belief is "If you have aliens that are moving outwards near the speed of light, it will still take millions and millions of years on average for them to reach us, so the odds of them reaching us soon are super small." https://grabbyaliens.com/

A lot of predictions about AI psychology are premised on the AI being some form of deep learning algorithm. From what I can see, deep learning requires geometric computing power for linear gains in intelligence, and thus (practically speaking) cannot scale to sentience.

For a more expert/in depth take look at: https://arxiv.org/pdf/2007.05558.pdf

Why do people think deep learning algorithms can scale to sentience without unreasonable amounts of computational power?

Yonatan Cale:
1. An AGI can be dangerous even if it isn't sentient.
2. If an AI can do most things a human can do (which is achievable using neurons, apparently, because that's what we're made of), and if that AI can run x10,000 as fast (or if it's better in some interesting way, which computers sometimes are compared to humans), then it can be dangerous.

Does this answer your question? Feel free to follow up.
1: This doesn't sound like what I'm hearing people say? Using the word sentience might have been a mistake. Is it reasonable to expect that the first AI to foom will be no more intelligent than, say, a squirrel?

2a: Should we be convinced that neurons are basically doing deep learning? I didn't think we understood neurons to that degree?

2b: What is meant by [most things a human can do]? This sounds to me like an empty statement. Most things a human can do are completely pointless flailing actions. Do we mean most jobs in modern America? Do we expect roombas to foom? Self-driving cars? Or like, most jobs in modern America still sounds like a really low standard, requiring very little intelligence?

My expected answer was somewhere along the lines of "We can achieve better results than that because of something something" or "We can provide much better computers in the near future, so this doesn't matter." What I'm hearing here is "Intelligence is unnecessary for AI to be (existentially) dangerous." This is surprising, and I expect, wrong (in the sense of not being what's being said/what the other side believes; though also in the sense of not being true, but that's neither here nor there).
Yonatan Cale:
1. The relevant thing in [sentient / smart / whatever] is "the ability to achieve complex goals".
2a. Are you asking if an AI can ever be as "smart" [good at achieving goals] as a human?
2b. The dangerous part of the AGI being "smart" is things like "able to manipulate humans" and "able to build an even better AGI".

Does this answer your questions? Feel free to follow up.
2: No. It implies that humans are deep learning algorithms. This assertion is surprising, so I asked for confirmation that that's what's being said, and if so, on what basis.

3: I'm not asking what makes intelligent AI dangerous. I'm asking why people expect deep learning specifically to become (far more) intelligent (than it is). Specifically within that question: adding parameters to your model vastly increases use of memory. If I understand the situation correctly, if GPT just keeps increasing the number of parameters, GPT five or six or so will require more memory than exists on the planet, and assuming someone built it anyway, I still expect it to be unable to wash dishes. Even assuming you have the memory, running the training would take longer than human history on modern hardware. Even assuming deep learning "works" in the mathematical sense, that doesn't make it a viable path to high levels of intelligence in the near future. Given doom in thirty years, or given that researching deep learning is dangerous, it should be the case that this problem: never existed to begin with and I'm misunderstanding something / is easily bypassed by some cute trick / or we're going to need much better hardware in the near future.
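For what it's worth, the memory claim here can be sanity-checked with back-of-envelope numbers (the ~100x-per-generation growth rate is an assumption extrapolated from the GPT-2 to GPT-3 jump, not a prediction; "GPT-4" onward are hypothetical):

```python
# Back-of-envelope check of the memory claim above (assumed growth rate).
params = 175e9        # GPT-3 parameter count
growth = 100          # assumed ~100x params per generation (GPT-2 -> GPT-3)
bytes_per_param = 2   # fp16 weights
for gen in (4, 5, 6):
    params *= growth
    tb = params * bytes_per_param / 1e12
    print(f"hypothetical GPT-{gen}: ~{tb:,.0f} TB for the weights alone")
```

Under these assumptions the weights alone reach hundreds of petabytes by the sixth generation, which is the shape of the objection; the standard counterpoint is that parameter growth per generation has in practice been far below 100x since GPT-3.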
Yonatan Cale:
2. I don't think humans are deep learning algorithms. I think human brains are made of neurons, which seem like a thing I could simulate in a computer, but not just with deep learning.

3. I don't expect just-deep-learning to become an AGI. Perhaps [in my opinion: probably] parts of the AGI will be written using deep learning, though; it does seem pretty good at some things. [I don't actually know, I can think out loud with you.]
In a sense, yeah, the algorithm is similar to a squirrel that feels a compulsion to bury nuts. The difference is that in an instrumental sense it can navigate the world much more effectively to follow its imperatives.  Think about intelligence in terms of the ability to map and navigate complex environments to achieve pre-determined goals. You tell DALL-E2 to generate a picture for you, and it navigates a complex space of abstractions to give you a result that corresponds to what you're asking it to do (because a lot of people worked very hard on aligning it). If you're dealing with a more general-purpose algorithm that has access to the real world, it would be able to chain together outputs from different conceptual areas to produce results - order ingredients for a cake from the supermarket, use a remote-controlled module to prepare it, and sing you a birthday song it came up with all by itself! This behaviour would be a reflection of the input in the distorted light of the algorithm, however well aligned it may or may not be, with no intermediary layers of reflection on why you want a birthday cake or decision being made as to whether baking it is the right thing to do, or what would be appropriate steps to take for getting from A to B and what isn't. You're looking at something that's potentially very good at getting complicated results without being a subject in a philosophical sense and being able to reflect into its own value structure.

A significant fraction of the stuff I've read about AI safety has referred to AGIs "inspecting each others' source code/utility function". However, when I look at the most impressive (to me) results in ML research lately, everything seems to be based on doing a bunch of fairly simple operations on very large matrices.

I am confused, because I don't understand how it would be a sensible operation to view the "source code" in question when it's a few billion floating point numbers and a hundred lines of code that describe what sequence of simple addition/mult...

I take "source code" as loosely meaning "everything that determines the behaviour of the AI, in a form intelligible to the examiner". This might include any literal source code, hardware details, and some sufficiently recent snapshot of runtime state. Literal source code is just an analogy that makes sense to humans reasoning about behaviour of programs where most of the future behaviour is governed by rules fixed in that code. The details provided cannot include future input and so do not completely constrain future behaviour, but the examiner may be able to prove things about future behaviour under broad classes of future input, and may be able to identify future inputs that would be problematic. The broad idea is that in principle, AGI might be legible in that kind of way to each other, while humans are definitely not legible in that way to each other.

The ML sections touched on the subject of distributional shift a few times, which is that thing where the real world is different from the training environment in ways which wind up being important, but weren't clear beforehand. I read that the way to tackle this is called adversarial training, and what it means is you vary the training environment across all of its dimensions in order to make it robust.

Could we abuse distributional shift to reliably break misaligned things, by adding fake dimensions? I imagine something like this:

  • We want the optimizer to mo...
Yonatan Cale:
Seems like two separate things (?)

1. If we forget a dimension, like "AGI, please remember we don't like getting bored", then things go badly, even if we added another fake dimension which wasn't related to boredom.
2. If we train the AI on data from our current world, then [almost?] certainly it will see new things when it runs for real. As a toy (not realistic but I think correct) example: the AI will give everyone a personal airplane, and then it will have to deal with a world that has lots of airplanes.

I previously worked as a machine learning scientist but left the industry a couple of years ago to explore other career opportunities.  I'm wondering at this point whether or not to consider switching back into the field.  In particular, in case I cannot find work related to AI safety, would working on something related to AI capability be a net positive or net negative impact overall?

Yonatan Cale:
Working on AI Capabilities: I think this is net negative, and I'm commenting here so people can [V] if they agree or [X] if they disagree. Seems like habryka agrees?  Seems like Kaj disagrees? I think it wouldn't be controversial to advise you to at least talk to 80,000 hours about this before you do it, as some safety net to not do something you don't mean to by mistake? Assuming you trust them. Or perhaps ask someone you trust. Or make your own gears-level model. Anyway, seems like an important decision to me
Okay, so I contacted 80,000 Hours, as well as some EA friends, for advice. Still waiting for their replies. I did hear from an EA who suggested that if I don't work on it, someone else who is less EA-aligned will take the position instead, so in fact it's slightly net positive for me to be in the industry - although I'm uncertain whether AI capability work is actually funding-constrained rather than personnel-constrained. Also, would it be possible to mitigate the net negative by deliberately avoiding capability research and just taking an ML engineering job at a lower-tier company that is unlikely to develop AGI before others, working on applying existing ML tech to solving practical problems?

Is anyone at MIRI or Anthropic creating diagnostic tools for monitoring neural networks?  Something that could analyze for when a system has bit-flip errors versus errors of logic, and eventually evidence of deception.

Jay Bailey:
Chris Olah is/was the main guy working on interpretability research, and he is a co-founder of Anthropic. So Anthropic would definitely be aware of this idea.
I've not seen the bit-flip idea before, and Anthropic are quasi-alone on that, so they might have missed it.

What is the community's opinion on ideas based on brain-computer interfaces? Like "create big but non-agentic AI, connect human with it, use AI's compute/speed/pattern-matching with human's agency - wow, that's aligned (at least with this particular human) AGI!"

It seems to me (I haven't thought really much about it) that U(God-Emperor Elon Musk) >> U(paperclips); am I wrong?

There's some discussion of this in section 3.4. of Responses to Catastrophic AGI Risk.

So I've commented on this in other forums, but why can't we just bite the bullet on happiness-suffering min-maxing utilitarianism as the utility function?

The case for it is pretty straightforward: if we want a utility function that is continuous over the set of all time, then it must have a value for a single moment in time. At this moment in time, all colloquially deontological concepts like "humans", "legal contracts", etc. have no meaning (these imply an illusory continuity chaining together different moments in time). What IS atomic though, is the valenc...

Yonatan Cale:
How would you explain "qualia" or "positive utility" in Python?

Also, regarding definitions like "delusion of fundamental subject/object split of experience":

1. How do you explain that in Python?
2. See Value is Fragile (TL;DR: if you forget even a tiny thing in your definition of happiness/suffering*, the result could be extremely, extremely bad).

Why should we throw immense resources at AGI x-risk when the world faces enormous issues with narrow AI right now? (e.g. destabilised democracy, a mental health crisis, worsening inequality)

Is it simply a matter of how imminent you think AGI is? Surely the opportunity cost is enormous, given the money and brainpower we are spending on AGI (something many don't even think is possible) versus something that is happening right now.

Jay Bailey:
The standard answer here is that all humans dying is much, much worse than anything happening with narrow AI. Not to say those problems are trivial, but humanity's extinction is an entirely different level of bad, so that's what we should be focusing on. This is even more true if you care about future generations, since human extinction is not just 7 billion dead, but the loss of all generations who could have come after. I personally believe this argument holds even if we ascribe a relatively low probability to AGI in the relatively near future. E.g., if you think there's a 10% chance of AGI in the next 10-20 years, it still seems reasonable to prioritise AGI safety now. If you think AGI isn't possible at all, naturally we don't need to worry about AI safety. But I find that pretty unconvincing - humanity has made a lot of progress very quickly in the field of AI capabilities, it shows no signs of slowing down, and there's no reason why such a machine could not exist in principle.
I understand and appreciate your discussion. I wonder whether it may be more morally imperative to work on AI safety for the hugely impactful problems AI is contributing to right now, if we assume that in finding solutions to these current and near-term AI problems we would also be lowering AGI x-risk (albeit indirectly). Given that the likelihood of narrow AI risk is 1 and the likelihood of AGI in the next 10 years is (as in your example) <0.1, it seems obvious we should focus on addressing the former: not only will it reduce suffering that we know with certainty is already happening (and will certainly continue to happen), it will also indirectly reduce x-risk. If we combine this observation with the opportunity cost of not solving other, even more solvable issues (disease, education, etc.), it seems even less appealing to pour millions of dollars and the careers of the smartest people into specifically AGI x-risk.

A final point: it would seem the worst issues caused by current and near-term AI are that it is degrading the coordination structures of western democracies (disinformation, polarisation, and so on). If, following Moloch, we understand coordination to be the most vital tool in humanity's addressing of problems, we see that focusing on current AI safety issues will improve our ability to address every other area of human suffering. The opportunity costs of not focusing on coordination problems in western countries seem to be equivalent to x-risk-level consequences, while the probability of the first is 1 and that of AGI <1.
Jay Bailey:
If you consider these coordination problems to be equivalent to x-risk-level consequences, then it makes sense to work on aligning narrow AI - for instance, if you think there's a 10% chance of AGI x-risk this century, and current problems are 10% as bad as human extinction, the expected harms are comparable. After all, working on aligning narrow AI is probably more tractable than working on aligning the hypothetical AGI systems of the future. You are also right that aligning narrow AI may help align AGI in the future - it is, at the very least, unlikely to hurt.

Personally, I don't think the current problems are anything close to "10% as bad as human extinction", but you may disagree with me on this. I'm not very knowledgeable about politics, which is the field I would look into to try and identify the harms caused by our current degradation of coordination, so I'm not going to try to convince you of anything in that field - I'm more trying to provide a framework with which to look at potential problems.

So, basically, I would look at it as: which is higher? The chance of human extinction from AGI times the consequences? Or the x-risk reduction from aligning narrow AI, plus the positive utility of solving our current problems today? I believe the former, so I think AGI is more important. If you believe the latter, aligning narrow AI is more important.
Interesting; yes, I am interested in coordination problems. Let me follow this framework to make a better case. There are three considerations I would like to point out.

1. The utility in addressing coordination problems is that they affect almost all x-risk scenarios (nuclear war, bioterror, pandemics, climate change, and AGI). Working on coordination problems reduces not only current suffering but also x-risk of both AGI and non-AGI kinds.
2. The difference between a 10% chance of something that may be an x-risk in 100 years and something with a certainty of happening is not a factor of 10. They're not even comparable, because one is a certainty and the other a probability, and we only get one roll of the dice (allocation of resources). It seems that the rational choice would always be the certainty.
3. More succinctly: with two buttons, one with a 100% chance of addressing x-risk and one with a 10% chance, which one would you press?
Jay Bailey:
I agree with you on the first point completely. As for Point 2, you can absolutely compare a certainty and a probability. If I offered you a certainty of $10, or a 10% chance of $1,000,000, would you take the $10 because you can't compare certainties and probabilities, and I'm only ever going to offer you the deal once? That then brings me to question 3. The button I would press would be the one that reduces total X-risk the most. If both buttons reduced X-risk by 1%, I would press the 100% one. If the 100% button reduced X-risk by 0.1%, and the 10% one reduced X-risk by 10%, I would pick the second one, for an expected value of 1% X-risk reduction. You have to take the effect size into account. We can disagree on what the effect sizes are, but you still need to consider them.
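The button comparison is just an expected-value calculation; spelled out (the percentages are the hypothetical ones from the comment, not estimates):

```python
# The two hypothetical buttons from the comment above, as expected values.
ev_certain = 1.00 * 0.001  # 100% chance of a 0.1% x-risk reduction
ev_gamble  = 0.10 * 0.10   # 10% chance of a 10% x-risk reduction
print(f"{ev_certain:.3f} vs {ev_gamble:.3f}")  # 0.001 vs 0.010
```

The 10% button is worth ten times as much in expectation despite being the "uncertain" one, which is the whole point about effect sizes.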
Interesting, I see what you mean regarding probability, and it makes sense. I guess perhaps what is missing is that when it comes to questions of people's lives, we may have a stronger imperative to be risk-averse. I completely agree with you about effect size. I guess what I would say is that, given my Point 1 from earlier about the variety of x-risks coordination would contribute to solving, the effect size will always be greater. If we want to maximise utility, it's the best chance we have. The added bonuses are that it is comparatively tractable and immediate, avoiding the recent criticisms of longtermism, while simultaneously being a longtermist solution. Regardless, it does seem that coordination problems are underdiscussed in the community; I will try to make a decent main post once my academic commitments clear up a bit.
Jay Bailey · 1y
Being risk-averse around people's lives is only a good strategy when you're trading off against something else that isn't human lives. If you have the choice to save 400 lives with certainty, or a 90% chance to save 500 lives, choosing the former is essentially condemning 50 people (in expectation) to death. At that point, you're just behaving suboptimally. Being risk-averse works if you're trading off other things. E.g., if you could release a new car now that you're almost certain is safe, you might be risk-averse and call for more tests. As long as people won't die from you delaying this car, being risk-averse is a reasonable strategy here. Given your Point 1 from earlier, there is no reason to expect the effect size will always be greater. If the effect on reducing x-risks from co-ordination becomes small enough, or the risk of a particular x-risk becomes large enough, this changes the equation. If you believe, like many in this forum do, that AI represents the lion's share of x-risk, focusing on AI directly is probably more effective. If you believe that x-risk is diversified - that there's some chance from AI, some from pandemics, some from nuclear war, some from climate change, etc. - then co-ordination makes more sense. Co-ordination has a small effect on all x-risks; direct work has a larger effect on a single x-risk. The point I'm trying to make here is this. There are perfectly reasonable states of the world where "Improve co-ordination" is the best action to take to reduce x-risk. There are also perfectly reasonable states of the world where "Work directly on <Risk A>" is the best action to take to reduce x-risk. You won't be able to find out which is which if you believe one is "always" the case. What I would suggest is to ask "What would cause me to change my mind and believe improving co-ordination is NOT the best way to work on x-risk", and then seek out whether those things are true or not. If you don't believe they are, great, that's fine. That said, it wouldn't b…
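The lives example is the same arithmetic in miniature; a quick sketch:

```python
# Saving 400 lives with certainty vs. a 90% chance of saving 500 lives.
certain_lives = 1.0 * 400  # expected lives saved: 400
gamble_lives = 0.9 * 500   # expected lives saved: 450

# Taking the "safe" option forgoes 50 expected lives.
assert gamble_lives - certain_lives == 50
```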
In addition to what Jay Bailey said, the benefits of an aligned AGI are incredibly high, and if we successfully solved the alignment problem we could easily solve pretty much any other problem in the world (assuming you believe the "intelligence and nanotech can solve anything" argument). The danger of AGI is high, but the payoff is also very large.

If the world's governments decided tomorrow that RL was top-secret military technology (similar to nuclear weapons tech, for example), how much time would that buy us, if any? (Feel free to pick a different gateway technology for AGI, RL just seems like the most salient descriptor).

Interesting question. As far as what governments could do to slow down progress towards AGI, I'd also include access to high-end compute. Lots of RL knowledge is passed through papers or equations, and it can be hard to contain that kind of stuff. But shutting down physical compute servers seems easier.
Depends whether they considered it a national security issue to win the arms race, and if they did how able they would be to absorb and keep the research teams working effectively.

I will ask this question: is the Singularity/huge-discontinuity scenario likely to happen? I see this as a meta-assumption behind all the doom scenarios, so we need to know whether the Singularity can happen and will happen.

Drake Thomas · 1y
Paul Christiano provided a picture of non-Singularity doom in What Failure Looks Like. In general there is a pretty wide range of opinions on questions about this sort of thing - the AI-Foom debate between Eliezer Yudkowsky and Robin Hanson is a famous example, though an old one. "Takeoff speed" is a common term used to refer to questions about the rate of change in AI capabilities at the human and superhuman level of general intelligence - searching LessWrong or the Alignment Forum for that phrase will turn up a lot of discussion about these questions, though I don't know of the best introduction offhand (hopefully someone else here has suggestions?).
It's definitely a common belief on this site. I don't think it's likely, I've written up some arguments here. 
Recursive self-improvement, or some other flavor of PASTA, seems essentially inevitable conditional on not hitting hard physical limits and civilization not being severely disrupted. There are Paul/EY debates about how discontinuous the capabilities jump will be, but the core idea of systems automating their own development and this leading to an accelerating feedback loop, or intelligence explosion, is conceptually solid. There are still AI risks without the intelligence explosion, but it is a key part of the fears of the people who think we're very doomed, as it causes the dynamic of getting only one shot at the real deal, since the first system to go 'critical' will end up extremely capable.
(oh, looks like I already wrote this on Stampy! That version might be better, feel free to improve the wiki.)

Hm, someone downvoted michael_mjd's and my comment.

Normally I wouldn't bring this up, but this thread is supposed to be a good space for dumb questions (although tbf the text of the question didn't specify anything about downvotes), and neither michael's nor my question looked that bad or harmful (maybe pattern-matched to a type of dumb uninformed question that is especially annoying).

Maybe an explanation of the downvotes would be helpful here?

I forgot about downvotes. I'm going to add this in to the guidelines.

Here we are: a concrete example of alignment failure.
Aryeh Englander · 1y
We have a points system in our family to incentivize the kids to do their chores. But we have to regularly update the rules because it turns out that there are ways to optimize for the points that we didn't anticipate and that don't really reflect what we actually want the kids to be incentivized to do. Every time this happens I think - ha, alignment failure!

When AI experts call upon others to ponder, as EY just did, "[an AGI] meant to carry out some single task" (emphasis mine), how do they categorize all the other important considerations besides this single task?  

Or, asked another way, where do priorities come into play, relative to the "single" goal?  e.g. a human goes to get milk from the fridge in the other room, and there are plentiful considerations to weigh in parallel to accomplishing this one goal -- some of which should immediately derail the task due to priority (I notice the power is o…

Anonymous question (ask here) :

Given all the computation it would be carrying out, wouldn't an AGI be extremely resource-intensive? Something relatively simple like bitcoin mining (simple when compared to the sort of intellectual/engineering feats that AGIs are supposed to be capable of) famously uses up more energy than some industrialized nations.

Short answer: Yep, probably. Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your laptop; this may even enable monitoring and regulation), and development speed (if iterations are slower and more expensive, it slows down development). If you’re interested in further discussion of possible compute costs for AGI (and how this affects timelines), I recommend reading about bio anchors.
1Yonatan Cale1y
1. (I'm not sure, but why would this be important? Sorry for the silly answer; feel free to reply in the anonymous form again)
2. I think a good baseline for comparison would be:
   1. Training large ML models (expensive)
   2. Running trained ML models (much cheaper)
3. I think comparing to blockchain is wrong, because:
   1. it was explicitly designed to be resource-intensive on purpose (this adds to the security of proof-of-work blockchains);
   2. there is a financial incentive to spend a specific (very high) amount of resources on blockchain mining (because what you get is literally a currency, and this currency has a certain value, so it's worthwhile to spend any amount of money lower than that value on the mining process);
   3. none of these are true for ML/AI, where your incentive is more something like "do useful things".

Why do we suppose it is even logical that control / alignment of a superior entity would be possible?  

(I'm told that "we're not trying to outsmart AGI, bc, yes, by definition that would be impossible", and I understand that we are the ones who "create it" - so I'm told, therefore, we have the upper hand bc of this - somehow in building it, that provides the key benefit we need for corrigibility...)

What am I missing, in viewing a superior entity as something you can't simply "use" ?  Does it depend on the fact that the AGI is not meant to have …

Aleksi Liimatainen · 1y
One has the motivations one has, and one would be inclined to defend them if someone tried to rewire the motivations against one's will. If one happened to have different motivations, then one would be inclined to defend those instead. The idea is that once a superintelligence gets going, its motivations will be out of our reach. Therefore, the only window of influence is before it gets going. If, at the point of no return, it happens to have the right kinds of motivations, we survive. If not, it's game over.
Eugene D · 1y
Thank you. Makes some sense... but does "rewriting its own code" (the very code we thought would perhaps permanently influence it before it got going) nullify our efforts at hardcoding our intentions?
I'm not a psychopath, and if I got the opportunity to rewrite my own source code to become a psychopath, I wouldn't do it. At the same time, it's the evolutionary and cultural programming in my source code that contains the desire not to become a psychopath. In other words, once the desire to not become a psychopath is there in my source code, I will do my best not to become one, even if I have the ability to modify my source code.
Eugene D · 1y
That makes sense. My intention was not to argue from the position of it becoming a psychopath, though (my apologies if it came out that way), but instead from the perspective of an entity which starts out as supposedly aligned (centered on human safety, let's say), but then, because it's orders of magnitude smarter than we are (by definition), quickly develops a different perspective. But you're saying it will remain 'aligned' in some vitally important way, even when it discovers ways the code could've been written differently?
Aleksi Liimatainen · 1y
The AI would be expected to care about preserving its motivations under self-modification for similar reasons as it would care about defending them against outside intervention. There could be a window where the AI operates outside immediate human control but isn't yet good at keeping its goals stable under self-modification. It's been mentioned as a concern in the past; I don't know what the state of current thinking is.

How would AGI alignment research change if the hard problem of consciousness were solved?

Consciousness, intelligence, and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.) However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.

What's the problem with oracle AIs? It seems like if you had a safe oracle AI that gave human-aligned answers to questions, you could then ask "how do I make an aligned AGI?" and just do whatever it says. So it seems like the problem of "how to make an aligned agentic AGI" is no harder than "how to make an aligned oracle AI", which I understand to still be extremely hard, but surely it's easier than making an aligned agentic AGI from scratch?

My understanding is that while an oracle doesn't directly control the nukes, it provides info to the people who do control the nukes, which is pretty much just moving the problem one layer deeper. While it can't directly change the physical state of the world, it can manipulate people to achieve pretty much the same thing. Check this tag for more specifics: https://www.lesswrong.com/tag/oracle-ai

Are there any specific examples of anybody working on AI tools that autonomously look for new domains to optimize over?

  • If no, then doesn't the path to doom still amount to a human choosing to apply their software to some new and unexpectedly lethal domain or giving the software real-world capabilities with unexpected lethal consequences? So then, shouldn't that be a priority for AI safety efforts?
  • If yes, then maybe we should have a conversation about which of these projects is most likely to bootstrap itself, and the likely paths it will take?

One alignment idea I have had that I haven't seen proposed/refuted is to have an AI which tries to compromise by satisfying a range of interpretations of a vague goal, instead of trying to get an AI to fulfill one specific goal.  This sounds dangerous and unaligned, and it indeed would not produce an optimal, CEV-fulfilling scenario, but it seems to me like it may create scenarios in which at least some people are alive and are maybe even living in somewhat utopic conditions.  I explain why below.

In many AI doom scenarios the AI intentionally pic…

In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
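A small sketch of that equivalence, with hypothetical numbers (here `rewards[i, a]` stands for the reward of action `a` under candidate reward function `i`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three candidate reward functions over five actions,
# with a belief weight P(reward function i is the true one).
rewards = rng.normal(size=(3, 5))
weights = np.array([0.5, 0.3, 0.2])

# Expected reward of each action under the distribution...
expected_reward = weights @ rewards

# ...equals the reward under a single averaged ("mean") reward function.
mean_reward_fn = (weights[:, None] * rewards).sum(axis=0)
assert np.allclose(expected_reward, mean_reward_fn)

# So the optimal behavior just maximizes this one fixed function,
# with all the usual problems of optimizing a fixed reward.
best_action = int(np.argmax(expected_reward))
```

This is why a distribution over reward functions, optimized in expectation, collapses back to the single-fixed-reward case.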

Why should we assume that vastly increased intelligence results in vastly increased power?

A common argument I see for intelligence being powerful stems from two types of examples:

  1. Humans are vastly more powerful than animals because humans are more intelligent than animals. Thus, an AGI vastly more intelligent than humans would also have similarly overwhelming power over humans.
  2. X famous person caused Y massive changes in society because of their superior intelligence. Thus, an AGI with even more intelligence would be able to effect even larger changes.

Howev…

It's possible that there is a ceiling to intelligence gains. It's also possible that there isn't. Looking at the available evidence, there doesn't seem to be one - a single ant is a lot less intelligent than a lobster, which is less intelligent than a snake, etc. While it would be nice (in a way) if there was a ceiling, it seems more prudent to assume that there isn't, and prepare for the worst. Especially as by "superintelligent", you shouldn't think of double, or even triple, Einstein; rather you should think of a whole other dimension of intelligence, like the difference between you and a hamster. As to your specific counterarguments:

1. It's both, really. But yes - complex language allows humans to keep and build upon previous knowledge. Humans' advantage is in the gigantic amounts of know-how that can be passed on to future generations. Which is something that computers are eminently good at - you can keep a local copy of Wikipedia in 20GB.
2. Good point. But it's not just luck. Yes, luck plays a large part, but it's also resources (in a very general sense). If you have the basic required talent and a couple of billion dollars, I'm pretty sure you could become a Hollywood star quite quickly. The point is that a superintelligence won't have a similar level of intelligence to anyone else around, which will allow it to run circles around everyone. Like if an Einstein-level intelligence decided to learn to play chess and started playing against 5-year-olds - they might win the first couple of games, but after a while you'd probably notice a trend...
3. Intelligence is an advantage. Quite a big one, generally speaking. But in society most people are generally at the same level if you compare them to e.g. Gila monsters (because we're talking about superintelligence). So it shouldn't be all that surprising that other resources are very important. While many powerful people don't seem to be intel…

20. (...) To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.


So, I'm thinking this is a critique of some proposals to teach an AI ethics by having it be co-trained with humans. 

There seem to be many obvious solutions to the problem …

I work on AI safety via learning from human feedback. In response to your three ideas:

* Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise is biased along one axis), the AI can infer incorrect human values.
* I’m working on it, stay tuned.
* Our most capable AI systems require a LOT of training data, and it’s already expensive to generate enough human feedback for training. Limiting the pool of human teachers to trusted experts, or providing pre-training to all of the teachers, would make this even more expensive. One possible way out of this is to train AI systems themselves to give feedback, in imitation of a small trusted set of human teachers.

Why won't this alignment idea work?

Researchers have already succeeded in creating face detection systems from scratch, by coding the features one by one, by hand. The algorithm they coded was not perfect, but was sufficient to be used industrially in digital cameras of the last decade.

The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives, which explains a good part of the paranormal phenomena. The other hard-coded networks of the brain seem to rely on the same kind of heuristics, hard-coded by evolution, …

Yonatan Cale · 1y
You suggested: […] But as you yourself pointed out: "We are not sure that this would extrapolate well to higher levels of capability."

You suggested: […] As you said, "The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives." And so perhaps the AI would make human pictures that create false positives. Or, as you said, "We are not sure that this would extrapolate well to higher levels of capability."

The classic example is humans creating condoms, which is a very unfriendly thing to do to Evolution, even though it raised us like children, sort of.

Adding: "Intro to Brain-Like-AGI Safety" (I didn't read it yet; seems interesting)
Ok. But don't you think "reverse engineering human instincts" is a necessary part of the solution? My intuition is that value is fragile, so we need to specify it. If we want to specify it correctly, either we learn it or we reverse engineer it, no?
Yonatan Cale · 1y
I don't know; I don't have a coherent idea for a solution. Here's one of my best ideas (not so good). Yudkowsky split up the solutions in his post; see point 24. The first sub-bullet there is about inferring human values. Maybe someone else will have different opinions.
  • Would an AGI that only tries to satisfice a solution/goal be safer?
  • Do we have reason to believe that we can/can't get an AGI to be a satisficer?
Yonatan Cale · 1y
Do you mean something like "only get 100 paperclips, not more"? If so, the AGI will never be sure it has 100 paperclips, so it can take lots of precautions to be very, very sure - like turning all the world into paperclip counters or so.
Tobias H · 1y
[I think this is more anthropomorphizing ramble than concise argument. Feel free to ignore :) ]

I get the impression that in this example the AGI would not actually be satisficing. It is no longer maximizing a goal, but it is still optimizing for this rule. For a satisficing AGI, I'd imagine something vague like "Get many paperclips" resulting in the AGI trying to get paperclips but at some point (an inflection point of diminishing marginal returns? some point where it becomes very uncertain about what the next action should be?) doing something else. Or for rules like "get 100 paperclips, not more" the AGI might only directionally or opportunistically adhere. Within the rule, this might look like "I wanted to get 100 paperclips, but 98 paperclips are still better than 90, let's move on" or "Oops, I accidentally got 101 paperclips. Too bad, let's move on". In your example of the AGI taking lots of precautions, the satisficing AGI would not do this because it could be spending its time doing something else.

I suspect there are major flaws with it, but an intuition I have goes something like this:

* Humans have, in some sense, similar decision-making capabilities to early AGI.
* The world is incredibly complex, and humans are nowhere near understanding and predicting most of it. Early AGI will likely have similar limitations.
* Humans mostly don't optimize their actions, mainly because of limited resources, multiple goals, and a ton of uncertainty about the future.
* So early AGI might also end up not-optimizing its actions most of the time.
* Suppose the complexity of the world continues to be big enough that the AGI keeps failing to completely understand and predict it. In that case, the advanced AGI will continue to not-optimize to some extent.
* But it might look like near-complete optimization to us.
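A purely illustrative toy of the maximizer/satisficer distinction being discussed (the `threshold` parameter is a stand-in for "good enough"; real proposals are far subtler):

```python
def maximize(option_values):
    # A maximizer considers every option and takes the best one.
    return max(option_values)

def satisfice(option_values, threshold):
    # A satisficer takes the first option that clears a "good enough" bar,
    # and only falls back to maximizing if nothing clears it.
    for value in option_values:
        if value >= threshold:
            return value
    return max(option_values)

options = [3, 7, 12, 20]
assert maximize(options) == 20
assert satisfice(options, threshold=5) == 7  # stops at "good enough"
```

The worry raised above maps onto this toy: an agent told "get at least 100 paperclips with certainty" is still running something like `maximize` over its confidence, not `satisfice` over the paperclips.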
Tobias H · 1y
Just saw the inverse question was already asked and answered.

That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.  (One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that rig

…
To be a bit more explicit: I have some ideas of what it would look like to try to develop this meta-field, or at least sub-elements of it, separate from general rationality, and am trying to get a feel for whether they are worth pursuing personally. Or, better yet, handing them over to someone who doesn't feel they have any currently tractable ideas but is better at getting things done.

Why does EY bring up "orthogonality" so early, and so strongly ("in denial", "and why they're true")? Why does it seem so important that it be accepted? Thanks!

Charlie Steiner · 1y
Because it means you can't get AI to do good things "for free," it has to be something you intentionally designed it to do. Denying the orthogonality thesis looks like claims that an AI built with one set of values will tend to change those values in a particular direction as it becomes cleverer. Because of wishful thinking, people usually try to think of reasons why an AI built in an unsafe way (with some broad distribution over possible values) will tend to end up being nice to humans (a narrow target of values) anyway. (Although there's at least one case where someone has argued "the orthogonality thesis is false, therefore even AIs built with good values will end up not valuing humans.")
You can also argue that not all value-capacity pairs are stable or compatible with self-improvement.
Charlie Steiner · 1y
Yeah, I was a bit fast and loose - there are plenty of other ways to deny the orthogonality thesis, I just focused on the one I think is most common in the wild.
[comment deleted] · 1y
Jay Bailey · 1y
A common AGI failure mode is to say things like: "Well, if the AI is so smart, wouldn't it know what we meant to program it to do?" "Wouldn't a superintelligent AI also have morality?" "If you had a paperclip maximiser, once it became smart, why wouldn't it get bored of making paperclips and do something more interesting?" Orthogonality is the answer to why all these things won't happen. You don't hear these arguments a lot any more, because the field has become more sophisticated, so EY's harping on about orthogonality seems a bit outdated. To be fair, a lot of the reason the field has grown up about this is because EY kept harping on about it in the first place.
Eugene D · 1y
OK, again, I'm a beginner here, so please correct me; I'd be grateful. I would offer that any set of goals given to this AGI would include the safety concerns of humans. (Is this controversial?) Not theoretical intelligence for a thesis, but AGI acting in the world with the ability to affect us. Because of the nature of our goals, it doesn't even seem logical to say that the AGI has gained more intelligence without also gaining an equal amount of safety-consciousness - e.g. it's either getting better at safely navigating the highway, or it's still incompetent at driving. Out on a limb: further, because orthogonality seems to force the separation between safety and competency, you have EY writing various intense treatises in the vain hope that FAIR etc. will merely pay attention to safety concerns. This just seems ridiculous, so there must be a reason, and my wild theory is that orthogonality provides the cover needed to charge ahead with a nuke you can't steer - but it sure goes farther and faster every month, doesn't it? (Now I'm guessing, and this can't be true, but then again, why would EY say what he said about FAIR?) But they go on their merry way because they think, "the AI is increasingly competent... no need to concern ourselves with 'orthogonal' issues like safety". Respectfully, Eugene