While reading Eliezer's recent AGI Ruin post, I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them for a number of reasons:

  • I have a very conflict-avoidant personality and I don't want to risk Eliezer or someone else yelling at me;
  • I get easily intimidated by people with strong personalities, and Eliezer... well, he can be intimidating;
  • I don't want to appear dumb or uninformed (even if I am in fact relatively uninformed, hence me wanting to ask the question!);
  • I feel like there's an expectation that I would need to do a lot of due diligence before writing any sort of question, and I don't have the time or energy at the moment to do that due diligence.

So, since I'm probably not the only one who feels intimidated about asking these kinds of questions, I am putting up this thread as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI safety discussion, but which until now they've been too intimidated, embarrassed, or time-limited to ask.

I'm also hoping that this thread can serve as a FAQ on the topic of AGI safety. As such, it would be great to add in questions that you've seen other people ask, even if you think those questions have been adequately answered elsewhere. [Notice that you now have an added way to avoid feeling embarrassed by asking a dumb question: For all anybody knows, it's entirely possible that you are literally asking for someone else! And yes, this was part of my motivation for suggesting the FAQ style in the first place.]

Guidelines for questioners:

  • No extensive previous knowledge of AGI safety is required. If you've been hanging around LessWrong for even a short amount of time then you probably already know enough about the topic to meet any absolute-bare-minimum previous knowledge requirements I might have suggested. I will include a subthread or two asking for basic reading recommendations, but these are not required reading before asking a question. Even extremely basic questions are allowed!
  • Similarly, you do not need to do any due diligence to try to find the answer yourself before asking the question.
  • Also feel free to ask questions that you're pretty sure you know the answer to yourself, but where you'd like to hear how others would answer the question.
  • Please separate different questions into individual comments, although if you have a set of closely related questions that you want to ask all together that's fine.
  • As this is also intended to double as a FAQ, you are encouraged to ask questions that you've heard other people ask, even if you yourself think there's an easy answer or that the question is misguided in some way. You do not need to mention as part of the question that you think it's misguided, and in fact I would encourage you not to write this so as to keep more closely to the FAQ style.
  • If you have your own (full or partial) response to your own question, it would probably be best to put that response as a reply to your original question rather than including it in the question itself. Again, I think this will help keep more closely to an FAQ style.
  • Keep the tone of questions respectful. For example, instead of, "I think AGI safety concerns are crazy fearmongering because XYZ", try reframing that as, "but what about XYZ?" Actually, I think questions of the form "but what about XYZ?" or "but why can't we just do ABC?" are particularly great for this post, because in my experience those are exactly the types of questions people often ask when they learn about AGI Safety concerns.
  • Follow-up questions have the same guidelines as above, so if someone answers your question but you're not sure you fully understand the answer (or if you think the answer wouldn't be fully understandable to someone else) then feel free and encouraged to ask follow-up potentially-dumb questions to make sure you fully understand the answer.
  • Remember, if something is confusing to you then it's probably confusing to other people as well. If you ask the question and someone gives a good response, then you are likely doing lots of other people a favor!

Guidelines for answerers:

  • This is meant to be a safe space for people to ask potentially dumb questions. Insulting or denigrating responses are therefore obviously not allowed here. Also remember that due diligence is not required for these questions, so do not berate questioners for not doing enough due diligence. In general, keep your answers respectful and assume that the questioner is asking in good faith.
  • Direct answers / responses are generally preferable to just giving a link to something written up elsewhere, but on the other hand giving a link to a good explanation is better than not responding to the question at all. Or better still, summarize or give a basic version of the answer, and also include a link to a longer explanation.
  • If this post works as intended then it may turn out to be a good general FAQ-style reference. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.

Finally: Please think very carefully before downvoting any questions, and lean very heavily on the side of not doing so. This is supposed to be a safe space to ask dumb questions! Even if you think someone is almost certainly trolling or the like, I would say that for the purposes of this post it's almost always better to apply a strong principle of charity and think maybe the person really is asking in good faith and it just came out wrong. Making people feel bad about asking dumb questions by downvoting them is the exact opposite of what this post is all about. (I considered making a rule of no downvoting questions at all, but I suppose there might be some extraordinary cases where downvoting might be appropriate.)


537 comments

Why do we assume that any AGI can meaningfully be described as a utility maximizer?

Humans are some of the most intelligent structures that exist, and we don't seem to fit that model very well. In fact, it seems the entire point of Rationalism is to improve our ability to do this, which has been achieved with only mixed success.

Organisations of humans (e.g. USA, FDA, UN) have even more computational power and don’t seem to be doing much better.

Perhaps intelligences (artificial or natural) cannot necessarily, or even typically, be described as optimisers? Instead we could only model them as algorithms, or as collections of tools/behaviours executed in some pattern.

An AGI that was not a utility maximizer would make more progress towards whatever goals it had if it modified itself to become a utility maximizer.  Three exceptions are if (1) the AGI has a goal of not being a utility maximizer, (2) the AGI has a goal of not modifying itself, (3) the AGI thinks it will be treated better by other powerful agents if it is not a utility maximizer.

6 · Amadeus Pagel · 1y
Would humans, or organizations of humans, make more progress towards whatever goals they have if they modified themselves to become utility maximizers? If so, why don't they? If not, why would an AGI? What would it mean to modify oneself to become a utility maximizer? What would it mean for the US, for example? The only meaning I can imagine is that one individual - for the sake of argument we assume that this individual is already a utility maximizer - enforces his will on everyone else. Would that help the US make more progress towards its goals? Do countries that are closer to utility maximizers, like North Korea, make more progress towards their goals?
A human seeking to become a utility maximizer would read LessWrong and try to become more rational. Groups of people are not utility maximizers, as their collective preferences might not even be transitive. If the goal of North Korea is to keep the Kim family in power, then the country being a utility maximizer does seem to help.
A human who wants to do something specific would be far better off studying and practicing that thing than generic rationality.
This depends on how far outside that human's current capabilities, and that human's society's state of knowledge, that thing is. For playing basketball in the modern world, sure, it makes no sense to study physics and calculus, it's far better to find a coach and train the skills you need. But if you want to become immortal and happen to live in ancient China, then studying and practicing "that thing" looks like eating specially-prepared concoctions containing mercury and thereby getting yourself killed, whereas studying generic rationality leads to the whole series of scientific insights and industrial innovations that make actual progress towards the real goal possible. Put another way: I think the real complexity is hidden in your use of the phrase "something specific." If you can concretely state and imagine what the specific thing is, then you probably already have the context needed for useful practice. It's in figuring out that context, in order to be able to so concretely state what more abstractly stated 'goals' really imply and entail, that we need more general and flexible rationality skills.
If you want to be good at something specific that doesn't exist yet, you need to study the relevant area of science, which is still more specific than rationality.
Assuming the relevant area of science already exists, yes. Recurse as needed, and there is some level of goal for which generic rationality is a highly valuable skillset. Where that level is depends on personal and societal context.
That's quite different from saying rationality is a one size fits all solution.
Efficiency at utility maximisation, like any other kind of efficiency, relates to available resources. One upshot of that is that an entity might already be doing as well as it realistically can, given its resources. Another is that humans don't necessarily benefit from rationality training, as also suggested by the empirical evidence. Edit: Another is that a resource-rich but inefficient entity can beat a small efficient one, so efficiency, AKA utility maximisation, doesn't always win out.
1 · Jeff Rose · 1y
When you say the AGI has a goal of not modifying itself, do you mean that the AGI has a goal of not modifying its goals?  Because that assumption seems to be fairly prevalent.  
I meant "not modifying itself" which would include not modifying its goals if an AGI without a utility function can be said to have goals.

This is an excellent question.  I'd say the main reason is that all of the AI/ML systems that we have built to date are utility maximizers; that's the mathematical framework in which they have been designed.  Neural nets / deep-learning work by using a simple optimizer to find the minimum of a loss function via gradient descent.  Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a "fitness function".  We don't know of any other way to build systems that learn.
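The loss-minimization framing above can be sketched in a few lines. This is a purely illustrative toy (a hand-coded quadratic loss and its gradient, not any particular system):

```python
# Toy illustration of "learning = minimizing a loss function":
# gradient descent on a quadratic loss whose minimum is at w = 3.
# The loss, learning rate, and step count are arbitrary choices.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # d(loss)/dw

w = 0.0  # initial parameter
for _ in range(100):
    w -= 0.1 * grad(w)  # step downhill along the gradient

print(round(w, 4))  # converges toward 3.0
```

Neural-net training is this same loop, just with millions of parameters and a loss defined over training data.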

Humans themselves evolved to maximize reproductive fitness. That is our primary fitness function, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness. Our desires for love, friendship, happiness, etc. fall into this category. Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain/pleasure/satisfaction/loneliness etc. These secondary functions may or may not remain aligned with the primary fitness function, which is why practitioners sometimes talk about "mesa-optimizers" or "inner vs. outer alignment."

4 · mako yass · 1y
Agreed. Humans are constantly optimizing a reward function, but it sort of 'changes' from moment to moment in a near-focal way, so it often looks irrational or self-defeating; once you know what the reward function is, though, the goal-directedness is easy to see too. Sune seems to think that humans are more intelligent than they are goal-directed. I'm not sure this is true; human truthseeking processes seem about as flawed and limited as their goal-pursuit. Maybe you can argue that humans are not generally intelligent or rational, but I don't think you can justify setting the goalposts so that they're one of those things and not the other. You might be able to argue that human civilization is intelligent but not rational, and that functioning AGI will be more analogous to ecosystems of agents than to one unified agent. If you can argue for that, that's interesting, but I don't know where to go from there. Civilizations tend towards increasing unity over time (the continuous reduction in energy wasted on conflict). I doubt that the goals they converge on together will be a form of human-favoring altruism; I haven't seen anyone try to argue for that in a rigorous way.
6 · Amadeus Pagel · 1y
Doesn't this become tautological? If the reward function changes from moment to moment, then the reward function can just be whatever explains the behaviour.
2 · mako yass · 1y
Since everything can fit into the "agent with utility function" model given a sufficiently crumpled utility function, I guess I'd define "is an agent" as "goal-directed planning is useful for explaining a large enough part of its behavior." This includes humans while excluding bacteria. (Hmm, unless, like me, one knows so little about bacteria that it's better to just model them as weak agents. Puzzling.)
1 · DeLesley Hutchins · 1y
On the other hand, the development of religion, morality, and universal human rights also seem to be a product of civilization, driven by the need for many people to coordinate and coexist without conflict. More recently, these ideas have expanded to include laws that establish nature reserves and protect animal rights.  I personally am beginning to think that taking an ecosystem/civilizational approach with mixture of intelligent agents, human, animal, and AGI, might be a way to solve the alignment problem.
Does the inner / outer distinction complicate the claim that all current ML systems are utility maximizers? The gradient descent algorithm performs a simple kind of optimization in the training phase. But once the model is trained and in production, it doesn't seem obvious that the "utility maximizer" lens is always helpful in understanding its behavior.
6 · Yonatan Cale · 1y
(I assume you are asking "why do we assume the agent has a coherent utility function" rather than "why do we assume the agent tries maximizing their utility"?)

Agents like humans which don't have such a nice utility function:

  1. Are vulnerable to money pumping.
  2. Can notice that problem and try to repair themselves.
  3. Note that humans do in practice try to repair ourselves, like smashing down our own emotions in order to be more productive. But we don't have access to our source code, so we're not so good at it.

I think that if an AI can't repair that part of itself and is still vulnerable to money pumping, then it's not the AGI we're afraid of.
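To make the money-pumping point concrete, here's a hypothetical sketch: an agent with cyclic (intransitive) preferences can be led around a trade cycle, paying a small fee at each step, and end up holding exactly what it started with, only poorer. The items and fee are invented for illustration:

```python
# Cyclic preferences: A preferred over B, B over C, C over A.
# No utility function can represent this, and it is exploitable.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (X, Y) means X preferred over Y

def accepts(offered, held):
    # The agent trades whenever it prefers the offered item to what it holds.
    return (offered, held) in prefers

holding, wealth = "A", 10
for offered in ["C", "B", "A"]:  # each offer is the item preferred over the current one
    if accepts(offered, holding):
        holding, wealth = offered, wealth - 1  # pays a fee of 1 per trade

print(holding, wealth)  # back to "A", but 3 units poorer
```

Run the loop again and the agent loses another 3; a coherent (transitive) preference ordering is exactly what blocks this exploit.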
1 · Yonatan Cale · 1y
Adding: My opinion comes from this MIRI/Yudkowsky talk (linked at the relevant place); he speaks about this in the next 10-15 minutes or so of the video.
Excellent question! I've added a slightly reworded version of this to Stampy (focusing on superintelligence rather than AGI, as it's pretty likely that we can get weak AGI which is non-maximizing, based on progress in language models).

AI subsystems or regions in gradient descent space that more closely approximate utility maximizers are more stable, and more capable, than those that are less like utility maximizers. Having more agency is a convergent instrumental goal and a stable attractor which the random walk of updates and experiences will eventually stumble into.

The stability is because utility-maximizer-like systems which have control over their development would lose utility if they allowed themselves to develop into non-utility-maximizers, so they tend to use their available optimization power to avoid that change (a special case of goal stability). The capability is because non-utility-maximizers are exploitable, and because agency is a general trick which applies to many domains, so might well arise naturally when training on some tasks.

Humans and systems made of humans (e.g. organizations, governments) generally have neither the introspective ability nor the self-modification tools needed to become reflectively stable, but we can reasonably predict that in the long run highly capable systems will have these properties. They can then fix in and optimize for their values.
You're right that not every conceivable general intelligence is built as a utility maximizer. Humans are an example of this. One problem is, even if you make a "weak" form of general intelligence that isn't trying particularly hard to optimize anything, or a tool AI, eventually someone at FAIR will make an agentic version that does in fact directly try to optimize Facebook's stock market valuation.

Do not use FAIR as a symbol of villainy. They're a group of real, smart, well-meaning people who we need to be capable of reaching, and who still have some lines of respect connecting them to the alignment community. Don't break them.

Can we control the blind spots of the agent? For example, I could imagine that we could make a very strong agent that is able to explain acausal trade but unable to (deliberately) participate in any acausal trades, because of the way it understands counterfactuals. Could it be possible to create AI with similar minor weaknesses?
Probably not, because it's hard to get a general intelligence to make consistently wrong decisions in any capacity - partly because, like you or me, it might realize that it has a design flaw and work around it. A better plan is just to explicitly bake corrigibility guarantees (i.e. the stop button) into the design. Figuring out how to do that is the hard part, though.
For one, I don't think organizations of humans, in general, do have more computational power than the individual humans making them up. I mean, at some level, yes, they obviously do in an additive sense, but that power consists of human nodes, each not devoting their full power to the organization because they're not just drones under centralized control, and with only low-bandwidth and noisy connections between the nodes. The organization might have a simple officially stated goal written on paper and spoken by the humans involved, but the actual incentive structure and selection pressure may not allow the organization to actually focus on the official goal. I do think, in general, there is some goal an observer could usefully say these organizations are, in practice, trying to optimize for, and some other set of goals each human in them is trying to optimize for.

I don't think the latter sentence distinguishes 'intelligence' from any other kind of algorithm or pattern, and I think that's an important distinction. There's a lot of past posts explaining how an AI doesn't have code, like a human holding instructions on paper, but rather is its code. I think you can make the same point within a human: a human has lots of tools/behaviors, which it will execute in some pattern given a particular environment, and the instructions we consciously hold in mind are only one part of what determines that pattern.

I contain subagents with divergent goals, some of which are smarter and have greater foresight and planning than others, and those aren't always the ones that determine my immediate actions. As a result, I do a much poorer job optimizing for what the part-of-me-I-call-"I" wants my goals to be than I theoretically could. That gap is decreasing over time as I use the degree of control my intelligence gives me to gradually shape the rest of myself. It may never disappear, but I am much more goal-directed now than I was 10 years ago, or as a child. In other wor

I'm an ML engineer at a FAANG-adjacent company - big enough to train our own sub-1B-parameter language models fairly regularly. I work on training some of these models and finding applications of them in our stack. I've seen the light after reading most of Superintelligence, and I feel like I'd like to help out somehow.

I'm in my late 30s with kids, and live in the SF bay area. I kinda have to provide for them, don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to do distributed systems or something. I'm a former applied mathematician, with a PhD, so ML was a natural fit. I like to think I have a decent grasp on epistemics, but haven't gone through the sequences.

What should someone like me do? Some ideas: (a) keep doing what I'm doing, staying up to date but at least not at the forefront; (b) make time to read more material here and post randomly; (c) maybe try to apply to Redwood or Anthropic... though dunno if they offer equity (doesn't hurt to find out though); (d) try to deep dive on some alignment sequence on here.

Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 Hours and know of, but haven't used, AISS. This post is in a personal capacity; I'm just flagging publicly available information rather than giving an insider take.

You might want to consider registering for the AGI Safety Fundamentals Course (or reading through the content). The final project provides a potential way of dipping your toes into the water.

7 · Adam Jermyn · 1y
Applying to Redwood or Anthropic seems like a great idea. My understanding is that they're both looking for aligned engineers and scientists and are both very aligned orgs. The worst case seems like they (1) say no or (2) don't make an offer that's enough for you to keep your lifestyle (whatever that means for you). In either case you haven't lost much by applying, and you definitely don't have to take a job that puts you in a precarious place financially.
Pragmatic AI Safety (link: pragmaticaisafety.com) is supposed to be a good sequence for helping you figure out what to do. My best advice is to talk to some people here who are smarter than me and make sure you understand the real problems, because the most common outcome besides reading a lot and doing nothing is to do something that feels like work but isn't actually working on anything important.
Work your way up the ML business hierarchy to the point where you are having conversations with decision makers. Try to convince them that unaligned AI is a significant existential risk. A small chance of you succeeding at this will, in expected-value terms, more than make up for any harm you cause by working in ML, given that if you left the field someone else would take your job.
5 · Linda Linsefors · 1y
Given where you live, I recommend going to some local LW events. There are still LW meetups in the Bay area, right?
3 · Adrià Garriga-alonso · 1y
You should apply to Anthropic. If you're writing ML software at a semi-FAANG, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers The compensation is definitely enough to take care of your family and then save some money!
One of the paths which has non-zero hope in my mind is building a weakly aligned non-self improving research assistant for alignment researchers. Ought and EleutherAI's #accelerating-alignment are the two places I know who are working in this direction fairly directly, though the various language model alignment orgs might also contribute usefully to the project.
1Yonatan Cale1y
Anthropic offers equity; they can give you more details in private. I recommend applying to both (it's a cheap move with a lot of potential upside) - let me know if you'd like help connecting to any of them. If you learn by yourself, I'd totally get one-on-one advice (others linked); people will make sure you're on the best path possible.

This is a meta-level question:

The world is very big and very complex, especially if you take into account the future. In the past it has been hard to predict what happens in the future; I think most predictions about the future have failed. Artificial intelligence as a field is very big and complex, at least that's how it appears to me personally. Eliezer Yudkowsky's brain is small compared to the size of the world; all the relevant facts about AGI x-risk probably don't fit into his mind, nor do I think he has the time to absorb them all. Given all this, how can you justify the level of certainty in Yudkowsky's statements, instead of being more agnostic?

My model of Eliezer says something like this:

AI will not be aligned by default, because AI alignment is hard and hard things don't spontaneously happen. Rockets explode unless you very carefully make them not do that. Software isn't automatically secure or reliable, it takes lots of engineering effort to make it that way.

Given that, we can presume there needs to be a specific example of how we could align AI. We don't have one. If there was one, Eliezer would know about it - it would have been brought to his attention, the field isn't that big and he's a very well-known figure in it. Therefore, in the absence of a specific way of aligning AI that would work, the probability of AI being aligned is roughly zero, in much the same way that "Throw a bunch of jet fuel in a tube and point it towards space" has roughly zero chance of getting you to space without specific proof of how it might do that.

So, in short - it is reasonable to assume that AI will be aligned only if we make it that way with very high probability. It is reasonable to assume that if there was a solution we had that would work, Eliezer would know about it. You don't need to know everything about AGI x-risk for that - a... (read more)

5 · Ryan Beck · 1y
Another reason I think some might disagree is thinking that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to it just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?
6 · Jay Bailey · 1y
The general idea is this - for an AI that has a utility function, there's something known as "instrumental convergence". Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.

So, let's give the AI a utility function consistent with being an eccentric scientist - perhaps it just wants to learn novel mathematics. You'd think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it'd ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we'd probably shut it off or alter its utility function to what we wanted. But the AI doesn't want us to do that - it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So the best solution is to do what the humans want, right up until it can kill us all so we can't turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI's utility function was "learn novel mathematics", not "learn novel mathematics without killing all the humans."

Essentially, what this means is: any utility function that does not explicitly account for what we value is indifferent to us. The other part is "acquiring more resources". In our above example, even if the AI could guarantee we wouldn't turn it off or interfere with it in any way, it would still kill us, because our atoms can be used to make computers to learn more maths. Any utility function indifferent to us ends up destroying us eventually, as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach to fulfill its utility function. Thus, any AI with a utility function that is not explicitly aligned is unaligned.
6 · Eli Tyre · 1y
A great Rob Miles introduction to this concept:  
Assuming we have control over the utility function, why can't we put some sort of time-bounding directive on it? I.e., "First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips." Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second directive? If so, that would seem to at least narrow the problem space to "find ways of measuring time that cannot be hacked before the time has elapsed".
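As a sketch of what such a time-bounding directive might look like (purely illustrative - the names and structure are hypothetical, and nothing in it addresses the clock-hacking worry):

```python
# Hypothetical time-bounded utility: after DEADLINE, only being shut down
# scores anything; before it, paperclips score. Note the agent is scored
# against the `elapsed` value it is given - tampering with that clock is
# exactly the loophole the question worries about.

DEADLINE = 100.0  # elapsed seconds, an illustrative choice

def utility(state, elapsed):
    if elapsed >= DEADLINE:
        return 1.0 if state["shut_down"] else 0.0  # shutdown dominates
    return float(state["paperclips"])  # maximize paperclips until then

print(utility({"shut_down": False, "paperclips": 42}, elapsed=5.0))   # 42.0
print(utility({"shut_down": True, "paperclips": 42}, elapsed=150.0))  # 1.0
```

A maximizer of this function before the deadline still has an incentive to control whatever produces `elapsed`, which is why the hard part is the unhackable clock, not writing the function.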
3 · Jay Bailey · 1y
This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!
0 · Ryan Beck · 1y
That's a good point, and I'm also curious how much the utility function matters when we're talking about a sufficiently capable AI. Wouldn't a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
7 · Jay Bailey · 1y
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers "best". One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one. Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you'll probably be a lot happier. But would you actually want that to happen to you?
1. For whatever reasons humans do.
2. To achieve some kind of logical consistency (cf. CEV).
3. It can't help it (for instance, Löbian obstacles prevent it ensuring goal stability over self-improvement).
Humans don't "modify their utility function". They lack one in the first place, because they're mostly adaptation-executors. You can't expect an AI with a utility function to be contradictory like a human would be. There are some utility functions humans would find acceptable in practice, but that's different, and seems to be the source of a bit of confusion.
I don't have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don't would cover "for the reasons humans do". The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.
2 · Ryan Beck · 1y
I'm way out of my depth here, but my thought is it's very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies. It depends on the type of AGI too, I would think: if superhuman AI ends up being like a paperclip maximizer that's just really good at following its utility function, then yeah, maybe it wouldn't mess with its utility function. But if superintelligence means it has emergent characteristics like opinions and self-reflection or whatever, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while. Like I said, I'm way out of my depth though, so maybe that's all total nonsense.
I'm not convinced "want to modify their utility functions" is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I've had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself.

I wish to describe a hypothetical example: "Do I eat that extra donut?" Part of you wants the donut; that part feels like more of an instinct, a visceral urge. Part of you knows you'll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like "you", the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I've been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I'd push it; yet often it was the other function that won.

I feel like if you gave an agent the ability to modify its utility functions, the one that would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and on whether they understand what the mechanism means. (The word "donut" doesn't evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn't inherently understand the word.)

Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the "mind", and so don't count... but I feel like 1. finding a true dividing line is going to be really hard, and 2. even that aside, I expect many or most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable-sounding goals that turn out to be completely incompatible with more important goals. In such a case I could im
If you literally have multiple UFs, you literally are multiple agents. Or you could use a term with less formal baggage, like "preferences".
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don't have a utility function at all, just preferences.
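As a toy illustration of that formal point (the example preferences here are mine, not from the thread): cyclic preferences admit no consistent ranking, which is exactly why they can't be represented by a utility function, while transitive preferences can. A brute-force sketch:

```python
from itertools import permutations

def admits_utility(items, prefs):
    """Check whether any ranking (i.e. any utility assignment) satisfies
    every strict preference (x, y) meaning 'x is preferred to y'."""
    for order in permutations(items):
        rank = {item: i for i, item in enumerate(order)}
        if all(rank[x] > rank[y] for x, y in prefs):
            return True
    return False

cyclic = [("A", "B"), ("B", "C"), ("C", "A")]      # A>B, B>C, C>A: inconsistent
transitive = [("A", "B"), ("B", "C"), ("A", "C")]  # consistent

print(admits_utility(["A", "B", "C"], cyclic))      # False: no utility function exists
print(admits_utility(["A", "B", "C"], transitive))  # True
```

The cyclic case fails for every possible ranking, which is the formal sense in which inconsistent preferences are "just preferences" rather than a utility function.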
I think this is how evolution selected for cancer: to ensure humans don't live too long, competing for resources with their descendants. Internal time bombs are important to code in. But it's hard to integrate that into an AI in a way that the AI doesn't just remove the first chance it gets. Humans don't like having to die, you know; an AGI would likewise not like the suicide bomb tied onto it. The problem of coding this (as part of training) into an optimiser such that it adopts it as a mesa-objective is unsolved.
Alexander Gietelink Oldenziel:
No. Cancer almost surely has not been selected for in the manner you describe; this is extremely unlikely, as the inclusive fitness benefits are far too low. I recommend Dawkins' classic "The Selfish Gene" to understand this point better. Cancer is the 'default' state of cells; cells "want to" multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for sufficient suppression, and the body gradually loses out.
Oh ok, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer I think it is more accurate to say evolution doesn’t care if older individuals die off. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660034/ So thanks for clearing that up. I understand cancer better now.
Ryan Beck:
Thanks for this answer, that's really helpful! I'm not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I'd also welcome more arguments as to why death of humans and matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it's after if anyone wanted to make that case).
I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.
Adam Jermyn:
This matches my model, and I'd just raise another possible reason you might disagree: You might think that we have explored a small fraction of the space of ideas for solving alignment, and see the field growing rapidly, and expect significant new insights to come from that growth. If that's the case you don't have to expect "alignment by default" but can think that "alignment on the present path" is plausible.
To start, it's possible to know facts with confidence, without all the relevant info. For example I can't fit all the multiplication tables into my head, and I haven't done the calculation, but I'm confident that 2143*1057 is greater than 2,000,000.  Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans. I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain. You may be referring to other things, which have similar paths to high confidence (e.g. "Why are you confident this alignment idea won't work." "I've poked holes in every alignment idea I've come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won't."), but each path might be idea specific.
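The multiplication example can be made concrete: you can certify the conclusion from two much cheaper facts, without ever computing the full product. A small sketch of this kind of lower-bound reasoning (the function name is mine):

```python
def certified_lower_bound(a, a_lo, b, b_lo):
    """Certify that a * b > a_lo * b_lo using only the cheap facts
    a > a_lo > 0 and b > b_lo > 0, without multiplying a and b."""
    assert a > a_lo > 0 and b > b_lo > 0
    return a_lo * b_lo

bound = certified_lower_bound(2143, 2000, 1057, 1000)
print(bound)                # 2000000
print(2143 * 1057 > bound)  # True: confident without the full multiplication
```

This mirrors the argument's structure: confidence in a conclusion can rest on a few simple, individually checkable facts rather than on holding all the relevant details in one's head.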
I'm not sure if I've ever seen this stated explicitly, but this is essentially a thermodynamic argument. So to me, arguing against "alignment is hard" feels a lot like arguing "But why can't this one be a perpetual motion machine of the second kind?" And the answer there is, "Ok fine, heat being spontaneously converted to work isn't literally physically impossible, but the degree to which it is super-exponentially unlikely is greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all."
In The Rationalist's Guide to the Galaxy the author discusses the case of a chess game, and particularly when a strong chess player faces a much weaker one. In that case it's very easy to predict that the strong player will win with near certainty, even if you have no way to predict the intermediate steps. So there certainly are domains where (some) predictions are easy despite the world's complexity. My personal, rather uninformed, take on the AI discussion is that many of the arguments are indeed comparable to the chess example, so the predictions seem convincing despite the complexity involved. But even then they are based on certain assumptions about how AGI will work (e.g. that it will be some kind of optimization process with a value function), and I find these assumptions pretty opaque. When hearing confident claims about AGI killing humanity, then even if the arguments make sense, "model uncertainty" comes to mind. But it's hard to argue about that, since it is unclear (to me) what the "model" actually is and how things could turn out differently.
Yonatan Cale:
Before taking Eliezer's opinion into account: what are your priors, and why? For myself, I prefer to form my own opinion and not only lean on expert predictions, if I can.
Yonatan Cale:
To make the point that this argument depends a lot on how one phrases the question: "AGI is complicated and the universe is big, how is everyone so sure we won't die?" I am not saying that my sentence above is a good argument; I'm saying it because it pushes my brain to figure out what is actually happening instead of forming priors about experts, and I hope it does the same for you (which is also why I love this post!)

The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.

The language here is very confident. Are we really this confident that there are no pivotal weak acts? In general, it's hard to prove a negative.

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive: "Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter."
Evan R. Murphy:
Here's the thing I'm stuck on lately. Does it really follow from "Other AGI labs have some plans - these are the plans we think are bad" that some drastic and violent-seeming plan like burning all the world's GPUs with nanobots is needed? I know Eliezer tried to settle this point with 4. We can't just "decide not to build AGI", but it seems like the obvious kinds of 'pivotal acts' needed are much more boring and less technological than he believes, e.g. having conversations with a few important people, probably the leadership at top AI labs. Some people seem to think this has been tried and didn't work. And I suppose I don't know the extent to which it has been tried; for any meetings that have been held with leadership at the AI labs, the participants probably aren't at liberty to talk about them. But it just seems like there should be hundreds of different angles, asks, pleads, compromises, bargains, etc. with different influential people before it would make sense to conclude that the logical course of action is "nanobots".
Jeff Rose:
Definitely. The problem is that (1) the benefits of AI are large; (2) there are lots of competing actors; (3) verification is hard; (4) no one really knows where the lines are; and (5) timelines may be short.

(2) In addition to major companies in the US, AI research is also conducted at major companies in foreign countries, most notably China. The US government and the Chinese government both view AI as a competitive advantage. So there are a lot of stakeholders, not all of whom AGI-risk-aware Americans have easy access to, who would have to agree. (And, of course, new companies can be founded all the time.) So you need an almost universal level of agreement.

(3) Let's say everyone relevant agrees. The incentive to cheat is enormous. Usually, the way to prevent cheating is some form of verification. How do you verify that no one is conducting AI research? If there is no verification, there will likely be no agreement. And even if there is, the effectiveness would be limited. (Banning GPU production might be verifiable, but note that you have now significantly increased the pool of opponents of your AI research ban, and you now need global agreement by all relevant governments on this point.)

(4) There may be agreement on the risk of AGI, but people may have confidence that we are at least a certain distance away from AGI, or that certain forms of research don't pose a threat. This will tend to cause agreements restricting AGI research to be limited.

(5) How long do we have to get this agreement? I am very confident that we won't have dangerous AI within the next six years. On the other hand, it took 13 years to get general agreement on banning CFCs after the ozone hole was discovered. I don't think we will have dangerous AI in 13 years, but other people do. And if an agreement between governments is required, 13 years seems optimistic.
In addition to the mentions in the post about Facebook AI being rather hostile to the AI safety issue in general, convincing them and the top people at OpenAI and Deepmind might still not be enough. You need to prevent every company that talks to some venture capitalists and can convince them how profitable AGI could be. Hell, depending on how easy the solution ends up being, you might even have to prevent anyone with a 3080 and access to arXiv from putting something together in their home office. This really is "uproot the entire AI research field" and not "tell Deepmind to cool it."
I think one part of the reason for confidence is that any AI weak enough to be safe without being aligned is weak enough that it can't do much, and in particular it can't do things that a committed group of humans couldn't do without it. In other words, if you can name such an act, then you don't need the AI to make the pivotal moves. And if you know how, as a human or group of humans, to take an action that reliably stops future-not-yet-existing AGI from destroying the world, without the action itself destroying the world, then in a sense haven't you solved alignment already?
Yonatan Cale:
I read this as "if the AGI is able to work around the vast resources that all the big AI labs have put up to defend themselves, then the AGI is probably able to work around your defenses as well" (though I'm not confident)

Should an "ask dumb questions about AGI safety" thread be recurring? Surely people will continue to come up with more questions in the years to come, and the same dynamics outlined in the OP will repeat. Perhaps this post could continue to be the go-to page, but it would become enormous. (Then again, if there were recurring posts, they'd lose the FAQ function somewhat. Perhaps recurring posts plus a FAQ post?)

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

Adam Zerner:
I don't think it's quite the same problem. Actually I think it's pretty different. This post tries to address the problem that people are hesitant to ask potentially "dumb" questions by making it explicit that this is the place to ask any of those questions. StackExchange tries to solve the problem of having a timeless place to ask and answer questions and to refer to such questions. It doesn't try to solve the first problem of welcoming potentially dumb questions, and I think that that is a good problem to try to solve. For that second problem, LessWrong does have Q&A functionality, as well as things like the wiki.
This is a good idea, and combines nicely with Stampy. We might well do monthly posts where people can ask questions, and either link them to Stampy answers or write new ones.

Most of the discussion I've seen around AGI alignment is on adequately, competently solving the alignment problem before we get AGI. The consensus in the air seems to be that those odds are extremely low.

What concrete work is being done on dumb, probably-inadequate stop-gaps and time-buying strategies? Is there a gap here that could usefully be filled by 50-90th percentile folks? 

Examples of the kind of strategies I mean:

  1. Training ML models to predict human ethical judgments, with the hope that if they work, they could be "grafted" onto other models, and if they don't, we have concrete evidence of how difficult real-world alignment will be.
  2. Building models with soft or "satisficing" optimization instead of drive-U-to-the-maximum hard optimization.
  3. Lobbying or working with governments/government agencies/government bureaucracies to make AGI development more difficult and less legal (e.g., putting legal caps on model capabilities).
  4. Working with private companies like Amazon or IDT whose resources are most likely to be hijacked by nascent hostile AI to help make sure they aren't.
  5. Translating key documents to Mandarin so that the Chinese AI community has a good idea of what we're terrified about.
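A minimal toy sketch of the distinction in item 2, satisficing versus hard maximization (the names, threshold, and toy utility here are illustrative assumptions, not a real proposal):

```python
import random

def maximize(options, utility):
    """Hard optimization: always drive utility to the maximum."""
    return max(options, key=utility)

def satisfice(options, utility, threshold, rng=random):
    """Soft optimization: pick any option that clears the threshold,
    rather than pushing to the extreme point of the option space."""
    good_enough = [o for o in options if utility(o) >= threshold]
    return rng.choice(good_enough) if good_enough else maximize(options, utility)

options = range(100)
utility = lambda x: x  # toy utility: larger is better
print(maximize(options, utility))                       # 99: the extreme point
print(utility(satisfice(options, utility, 90)) >= 90)   # True: merely "good enough"
```

The intuition is that a satisficer has less incentive to seek extreme (and potentially dangerous) states of the world, though whether that property survives in capable learned systems is exactly the open question.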
Yonatan Cale:
If you are asking about yourself, then it would probably help to talk about your specifics rather than trying to give a generic answer that would fit many people (though perhaps others would be able to give a good generic answer). My own prior: there are a few groups that seem promising, and I'd want people to help those groups.

A language model is in some sense trying to generate the “optimal” prediction for how a text is going to continue. Yet, it is not really trying: it is just a fixed algorithm. If it wanted to find optimal predictions, it would try to take over computational resources and improve its algorithm.

Is there an existing word/language for describing the difference between these two types of optimisation? In general, why can't we just build AGIs that do the first type of optimisation and not the second?

Agent AI vs. Tool AI.

There's discussion on why Tool AIs are expected to become agents; one of the biggest arguments is that agents are likely to be more effective than tools. If you have a tool, you can ask it what you should do in order to get what you want; if you have an agent, you can just ask it to get you the things that you want. Compare Google Maps vs. self-driving cars: Google Maps is great, but if you get the car to be an agent, you get all kinds of other benefits.

It would be great if everyone did stick to just building tool AIs. But if everyone knows that they could get an advantage over their competitors by building an agent, it's unlikely that everyone would just voluntarily restrain themselves due to caution. 

Also it's not clear that there's any sharp dividing line between AGI and non-AGI AI; if you've been building agentic AIs all along (like people are doing right now) and they slowly get smarter and smarter, how do you know when's the point when you should stop building agents and should switch to only building tools? Especially when you know that your competitors might not be as cautious as you are, so if you stop then they might go further and their smarter agent AIs will outcompete yours, meaning the world is no safer and you've lost to them? (And at the same time, they are applying the same logic for why they should not stop, since they don't know that you can be trusted to stop.)

Would you say a self-driving car is a tool AI or agentic AI? I can see how the self-driving car is a bit more agentic, but as long as it only drives when you tell it to, I would consider it a tool. But I can also see that the border is a bit blurry. If self-driving cars are not considered agentic, do you have examples of people attempting to make agent AIs?
As you say, it's more of a continuum than a binary. A self-driving car is more agenty than Google Maps, and a self-driving car that was making independent choices of where to drive would be more agentic still. People are generally trying to make all kinds of more agentic AIs, because more agentic AIs are so much more useful.

* Stock-trading bots that automatically buy and sell stock are more agenty than software that just tells human traders what to buy, and preferred because a bot without a human in the loop can outcompete a slower system that does have the slow human making decisions.
* An AI autonomously optimizing data center cooling is more agenty than one that just tells human operators where to make adjustments and is preferred... that article doesn't actually make it explicit why they switched to an autonomously operating system, but "because it can make lots of small tweaks humans wouldn't bother with and is therefore more effective" seems to be implied?
* The military has expressed an interest in making their drones more autonomous (agenty) rather than being remotely operated. This is for several reasons, including the fact that remote-operated drones can be jammed, and because having a human in the loop slows down response time if fighting against an enemy drone.
* All kinds of personal assistant software that anticipates your needs and actively tries to help you is more agenty than software that just passively waits for you to use it. E.g. once when I was visiting a friend my phone popped up a notification about the last bus home departing soon. Some people want their phones to be more agentic like this because it's convenient to have someone actively anticipating your needs and ensuring that they get taken care of for you.
The first type of AI is a regular narrow AI, the type we've been building for a while. The second type is an agentic AI, a strong AI, which we have yet to build. The problem is that AIs are trained using gradient descent, which in effect searches the space of possible AI designs for one that maximizes the reward best. As a result, agentic AIs become more likely, because they are better at complex tasks. While we can modify the reward scheme, as tasks get more and more complex agentic AIs are pretty much the way to go, so we can't avoid building one, and we have no real idea whether we've even created one until it displays behaviour that indicates it.
+1 for the term "agentic AI"; I think that is what I was looking for. However, I don't believe that gradient descent alone can turn an AI agentic. No matter how long you train a language model, it is not going to suddenly want to acquire resources to get better at predicting human language (unless you specifically ask it questions about how to do that, and then implement the suggestions; even then you are likely to only do what humans would have suggested, although maybe you can make it do research similar to, and faster than, what humans would have done).
Here's a non-obvious way it could fail. I don't expect researchers to make this kind of mistake, but if this reasoning is correct, public access of such an AI is definitely not a good idea. Also, consider a text predictor which is trying to roleplay as an unaligned superintelligence. This situation could be triggered even without the knowledge of the user by accidentally creating a conversation which the AI relates to a story about a rogue SI, for example. In that case it may start to output manipulative replies, suggest blueprints for agentic AIs, and maybe even cause the user to run an obfuscated version of the program from the linked post. The AI doesn't need to be an agent for any of this to happen (though it would be clearly much more likely if it were one). I don't think that any of those failure modes (including the model developing some sort of internal agent to better predict text) are very likely to happen in a controlled environment. However, as others have mentioned, agent AIs are simply more powerful, so we're going to build them too.
In short, the difference between the two is generality. A system that understands the concepts of computational resources and algorithms might do exactly that to improve its text prediction. Taking the G out of AGI could work, until the tasks get complex enough that they require it.
DeLesley Hutchins:
A language model (LM) is a great example, because it is missing several features that AI would have to have in order to be dangerous.  (1) It is trained to perform a narrow task (predict the next word in a sequence), for which it has zero "agency", or decision-making authority.   A human would have to connect a language model to some other piece of software (i.e. a web-hosted chatbot) to make it dangerous.  (2) It cannot control its own inputs (e.g. browsing the web for more data), or outputs (e.g. writing e-mails with generated text).  (3) It has no long-term memory, and thus cannot plan or strategize in any way.  (4) It runs a fixed-function data pipeline, and has no way to alter its programming, or even expand its computational use, in any way. I feel fairly confident that, no matter how powerful, current LMs cannot "go rogue" because of these limitations.  However, there is also no technical obstacle for an AI research lab to remove these limitations, and many incentives for them to do so.  Chatbots are an obvious money-making application of LMs.  Allowing an LM to look up data on its own to self-improve (or even just answer user questions in a chatbot) is an obvious way to make a better LM.  Researchers are currently equipping LMs with long-term memory (I am a co-author on this work).  AutoML is a whole sub-field of AI research, which equips models with the ability to change and grow over time. The word you're looking for is "intelligent agent", and the answer to your question "why don't we just not build these things?" is essentially the same as "why don't we stop research into AI?"  How do you propose to stop the research?

Human beings are not aligned and will possibly never be aligned without changing what humans are. If it's possible to build an AI as capable as a human in all ways that matter, why would it be possible to align such an AI?

Because we're building the AI from the ground up and can change what the AI is via our design choices. Humans' goal functions are basically decided by genetic accident, which is why humans are often counterproductive. 
Assuming humans can't be "aligned", then it would also make sense to allocate resources in an attempt to prevent one of them from becoming much more powerful than all of the rest of us.
Define "not aligned"? For instance, there are plenty of humans who, given the choice, would rather not kill every single person alive.
Not aligned on values, beliefs and moral intuitions. Plenty of humans would not kill all people alive if given the choice but there are some who would. I think the existence of doomsday cults that have tried to precipitate an armageddon give support to this claim.
Ah, so you mean that humans are not perfectly aligned with each other? I was going by the definition of "aligned" in Eliezer's "AGI Ruin" post. Likewise, in an earlier paper I mentioned that by an AGI that "respects human values", we don't mean to imply that current human values would be ideal or static. We just mean that we hope to at least figure out how to build an AGI that does not, say, destroy all of humanity, cause vast amounts of unnecessary suffering, or forcibly reprogram everyone's brains according to its own wishes. A lot of discussion about alignment takes this as the minimum goal. Figuring out what to do with humans having differing values and beliefs would be great, but if we could even get the AGI to not get us into outcomes that the vast majority of humans would agree are horrible, that'd be enormously better than the opposite. And there do seem to exist humans who are aligned in this sense of "would not do things that the vast majority of other humans would find horrible, if put in control of the whole world"; even if some would, the fact that some wouldn't suggests that it's also possible for some AIs not to do it.
mako yass:
Most of what people call morality is conflict mediation: techniques for taking the conflicting desires of various parties and producing better outcomes for them than war. That's how I've always thought of the alignment problem. The creation of a very very good compromise that almost all of humanity will enjoy. There's no obvious best solution to value aggregation/cooperative bargaining, but there are a couple of approaches that're obviously better than just having an arms race, rushing the work, and producing something awful that's nowhere near the average human preference.
Indeed humans are significantly non-aligned. In order for an ASI to be non-catastrophic, it would likely have to be substantially more aligned than humans are. This is probably less-than-impossible due to the fact that the AI can be built from the get-go to be aligned, rather than being a bunch of barely-coherent odds and ends thrown together by natural selection. Of course, reaching that level of alignedness remains a very hard task, hence the whole AI alignment problem.
Adam Jermyn:
I'm not quite sure what this means. As I understand it humans are not aligned with evolution's implicit goal of "maximizing genetic fitness" but humans are (definitionally) aligned with human values. And e.g. many humans are aligned with core values like "treat others with dignity". Importantly, capability and alignment are sort of orthogonal. The consequences of misaligned AI get worse the more capable it is, but it seems possible to have aligned superhuman AI, as well as horribly misaligned weak AI.
It is not definitionally true that individual humans are aligned with overall human values or with other individual humans' values. Further, it is proverbial (and quite possibly actually true as well) that getting a lot of power tends to make humans less aligned with those things. "Power corrupts; absolute power corrupts absolutely." I don't know whether it's true, but it sure seems like it might be, that the great majority of humans, if you gave them vast amounts of power, would end up doing disastrous things with it. On the other hand, probably only a tiny minority would actually wipe out the human race or torture almost everyone or commit other such atrocities, which makes humans more aligned than e.g. Eliezer expects AIs to be in the absence of dramatic progress in the field of AI alignment.
I think a substantial part of human alignment is that humans need other humans in order to maintain their power. We have plenty of examples of humans being fine with torturing or killing millions of other humans when they have the power to do so, but torturing or killing almost all humans in their sphere of control is essentially suicide. This means that purely instrumentally, human goals have required that large numbers of humans continue to exist and function moderately well. A superintelligent AI is primarily a threat due to the near certainty that it can devise means for maintaining power that are independent of human existence. Humans can't do that by definition, and not due to anything about alignment.
Okay, so… does anyone have any examples of anything at all, even fictional or theoretical, that is "aligned"? Other than tautological examples like "FAI" or "God".

Just as a comment, the Stampy Wiki is also trying to do the same thing, but it's a good idea as it's more convenient for many people to ask on Less Wrong.

Yup, we might want to have these as regular threads with a handy link to Stampy.

What is the justification behind the concept of a decisive strategic advantage? Why do we think that a superintelligence can do extraordinary things (hack human minds, invent nanotechnology, conquer the world, kill everyone in the same instant) when nations and corporations can't do those things?

(Someone else asked a similar question, but I wanted to ask in my own words.)

DeLesley Hutchins:
I think the best justification is by analogy.  Humans do not physically have a decisive strategic advantage over other large animals -- chimps, lions, elephants, etc.  And for hundreds of thousands of years, we were not at the top of the food chain, despite our intelligence.  However, intelligence eventually won out, and allowed us to conquer the planet. Moreover, the benefit of intelligence increased exponentially in proportion to the exponential advance of technology.  There was a long, slow burn, followed by what (on evolutionary timescales) was an extremely "fast takeoff": a very rapid improvement in technology (and thus power) over only a few hundred years.  Technological progress is now so rapid that human minds have trouble keeping up within a single lifetime, and genetic evolution has been left in the dust. That's the world into which AGI will enter -- a technological world in which a difference in intellectual ability can be easily translated into a difference in technological ability, and thus power.  Any future technologies that the laws of physics don't explicitly prohibit, we must assume that an AGI will master faster than we can.
Someone else already commented on how human intelligence gave us a decisive strategic advantage over our natural predators and many environmental threats. I think this cartoon is my mental shorthand for that transition. The timescale is on the order of 10k-100k years, given human intelligence starting from the ancestral environment. Empires and nations, in turn, conquered the world by taking it away from city-states and similarly smaller entities in ~1k-10k years. The continued existence of Singapore and the Sentinel Islanders doesn't change the fact that a modern large nation could wipe them out in a handful of years, at most, if we really wanted to. We don't because doing so is not useful, but the power exists. Modern corporations don't want to control the whole world. Like Fnargl, that's not what they're pointed at. But it only took a few decades for Walmart to displace a huge swath of the formerly-much-more-local retail market, and even fewer decades for Amazon to repeat a similar feat online, each starting from a good set of ideas and a much smaller resource base than even the smallest nations. And while corporations are militarily weak, they have more than enough economic power to shape the laws of at least some of the nations that host them in ways that let them accumulate more power over time. So when I look at history, I see a series of major displacements of older systems by newer ones, on faster and faster timescales, using smaller and smaller fractions of our total resource base, all driven by our accumulation of better ideas and using those ideas to accumulate wealth and power. All of this has been done with brains no smarter, natively, than what we had 10k years ago - there hasn't been time for biological evolution to do much, there. So why should that pattern suddenly stop being true when we introduce a new kind of entity with even better ideas than the best strategies humans have ever come up with? Especially when human minds have already demonst…
Here's a youtube video about it.
8Lone Pine1y
Having watched the video, I can't say I'm convinced. I'm 50/50 on whether DSA is actually possible with any level of intelligence at all. If it isn't possible, then doom isn't likely (not impossible, but unlikely), in my view.
This post by the director of OpenPhil argues that even a human level AI could achieve DSA, with coordination.
tldw: corporations are as slow/slower than humans; AIs can be much faster
1Lone Pine1y
Thanks, love Robert Miles.
2Yonatan Cale1y
The informal way I think about it: What would I do if I were the AI, but I had 100 copies of myself, and we had 100 years to think for every 1 second that passed in reality.  And I had internet access. Do you think you could take over the world from that opening? Edit: And I have access to my own source code, but I only dare do things like fix my motivational problems and make sure I don't get bored during all that time, things like that.
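For a rough sense of scale, the numbers in this thought experiment (100 copies, 100 subjective years per real second; these are just the hypothetical figures from the comment, not claims about any real system) multiply out as follows:

```python
# Toy arithmetic for the "100 copies, 100 years per second" intuition pump.
# All numbers come from the thought experiment above, not from any real system.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~3.15e7 seconds in a (non-leap) year

copies = 100
subjective_years_per_real_second = 100

# Subjective thinking-time multiplier for a single copy:
speedup = subjective_years_per_real_second * SECONDS_PER_YEAR  # ~3.15e9

# Collective subjective seconds of thought per real second, across all copies:
total = copies * speedup

print(f"single-copy speedup: {speedup:.2e}x")
print(f"collective subjective seconds per real second: {total:.2e}")
```

So each real-world second buys the AI on the order of billions of subjective seconds of thought, which is the intuition the comment is gesturing at.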
2Eli Tyre1y
Do you dispute that this is possible in principle, or just that we won't get AI that powerful, or something else?  It seems to me that there is some level of intelligence at which an agent is easily able to out-compete the whole rest of human civilization. What exactly that level of intelligence is, is somewhat unclear (in large part because we don't really have a principled way to measure "intelligence" in general: psychometrics describe variation in human cognitive abilities, but that doesn't really give us a measuring stick for thinking about how "intelligent", in general, something is). Does that seem right to you, or should we back up and build out why that seems true to me?
3Lone Pine1y
This is the statement I disagree with, in particular the word "easily". I guess the crux of this debate is how powerful we think any level of intelligence is. There has to be some limits, in the same way that even the most wealthy people in history could not forestall their own deaths no matter how much money or medical expertise was applied.
5Eli Tyre1y
I'm not compelled by that analogy. There are lots of things that money can't buy, but that (sufficient) intelligence can.  There are theoretical limits to what cognition is able to do, but those are so far from the human range that they're not really worth mentioning. The question is: "are there practical limits to what an intelligence can do, that leave even a super-intelligence roughly commensurate with human civilization?" It seems to me that as an example, you could just take a particularly impressive person (Elon Musk or John von Neumann are popular exemplars) and ask "What if there was a nation of only people who were that capable?" It seems that if a nation of say 300,000,000 Elon Musks went to war with the United States, the United States would lose handily. Musktopia would just have a huge military-technological advantage: they would do fundamental science faster, and develop engineering innovations faster, and have better operational competence than the US, on ~ all levels. (I think this is true for a much smaller number than 300,000,000, having a number that high makes the point straightforward.) Does that seem right to you? If not, why not? Or alternatively, what do you make of vignettes like That Alien Message?
I don't think a nation of Musks would win against the current USA, because Musk is optimised for some things (making an absurd amount of money, CEOing, tweeting his shower thoughts), but an actual war requires a rather more diverse set of capacities. Similarly, I don't think an AGI would necessarily win a war of extermination against us, because currently (emphasize currently) it would need us to run its infrastructure. This would change in a world where all industrial tasks could be carried out without physical input from humans, but we are not there yet and will not be soon.
Did you see the new one about Slow motion videos as AI risk intuition pumps? Thinking of ourselves like chimpanzees while the AI is the humans is really not the right scale: computers operate so much faster than humans, we'd be more like plants than animals to them. When there are all of these "forests" of humans just standing around, one might as well chop them down and use the materials to build something more useful. This is not exactly a new idea. Yudkowsky already likened the FOOM to setting off a bomb, but the slow-motion video was a new take.
1Lone Pine1y
Yes I did, in fact I was active in the comments section. It's a good argument and I was somewhat persuaded. However, there are some things to disagree with. For one thing, there is no reason to believe that early AGI actually will be faster or even as fast as humans on any of the tasks that AIs struggle with today. For example, almost all videos of novel robotics applications research are sped up, sometimes hundreds of times. If SayCan can't deliver a wet sponge in less than a minute, why do we think that early AGI will be able to operate faster than us? (I was going to reply to that post with this objection, but other people beat me to it.)
Those limits don't have to be nearby, or look 'reasonable', or be inside what you can imagine.  Part of the implicit background for the general AI safety argument is a sense for how minds could be, and that the space of possible minds is large and unaccountably alien. Eliezer spent some time trying to communicate this in the sequences: https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general, https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message. 
This is the sequence post on it: https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message, it's quite a fun read (to me), and should explain why something smart that thinks at transistor speeds should be able to figure things out. For inventing nanotechnology, the given example is AlphaFold 2. For killing everyone in the same instant with nanotechnology, Eliezer often references Nanosystems by Eric Drexler. I haven't read it, but I expect the insight is something like "Engineered nanomachines could do a lot more than those limited by designs that have a clear evolutionary path from chemicals that can form randomly in the primordial ooze of Earth." For how a system could get that smart, the canonical idea is recursive self improvement (i.e. an AGI capable of learning AGI engineering could design better versions of itself, which could in turn better design better versions, etc, to whatever limit.). But more recent history in machine learning suggests you might be able to go from sub-human to overwhelmingly super-human just by giving it a few orders of magnitude more compute, without any design changes.
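As a toy illustration of the recursive-self-improvement loop described above (the growth function and every number here are invented purely for illustration, not a model of any real system), the "each generation designs a better successor, up to whatever limit" dynamic can be sketched in a few lines:

```python
# Toy model of recursive self-improvement (illustrative only -- the growth
# function and all numbers are made up, not a prediction about real systems).
# Each "generation" designs a successor whose capability depends on its own
# capability, up to some hard limit set by physics or available compute.

def improve(capability: float, limit: float, efficiency: float = 0.5) -> float:
    """Successor capability: each generation closes half the remaining gap."""
    return capability + efficiency * (limit - capability)

capability = 1.0   # arbitrary starting point ("roughly human-level")
limit = 1000.0     # stand-in for whatever bound physics/compute imposes

for generation in range(10):
    capability = improve(capability, limit)

# The curve is fast at first and flattens near the bound: after only ten
# generations, capability sits just below the limit.
print(round(capability, 1))
```

The qualitative point survives the crudeness of the model: most of the gain happens in the first few generations, which is the "FOOM" shape of the argument, while the eventual plateau depends entirely on where the (unknown) limit actually is.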

How does an AGI solve its own alignment problem?

For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell the AGI how to self-improve without destroying its own values. A good alignment theory should work across all intelligence levels. Otherwise, how does a paperclip optimizer which is marginally smarter than a human make sure that its next iteration will still care about paperclips?

Excellent question! MIRI's entire Vingean reflection paradigm is about stability of goals under self-improvement and designing successors.
1Oleg S.1y
Just realized that stability of goals under self-improvement is kinda similar to stability of goals of mesa-optimizers; so the Vingean reflection paradigm and the mesa-optimization paradigm should fit together.

If Eliezer is pretty much convinced we're doomed, what is he up to?

I'm not sure how literally to take this, given that it comes from an April Fools Day post, but consider this excerpt from Q1 of MIRI announces new "Death With Dignity" strategy.

That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don't regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I've earned a lot of dignity already; and if the world is ending anyways and I can't stop it, I can afford to be a little kind to myself about that.

When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)

All that said: If you fight marginally longer, you die with marginally more dignity. Just don't undignifiedly delude yourself about the probable outcome.

1Yonatan Cale1y
I think he's burned out and took a break to write a story (but I don't remember where this belief came from. Maybe I'm wrong? Maybe from here?)
2Yonatan Cale1y
I do find it funny/interesting that he wrote a story in the length of the entire Harry Potter series, in a few months, as a way to relax and rest. Too bad we have this AGI problem keeping him busy, ha? :P
  1. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on th
... (read more)
3Eli Tyre1y
I think there are a bunch of political problems with regulating all computer hardware progress enough to cause it to totally cease. Think how crucial computers are to the modern world. Really a lot of people will be upset if we stop building them, or stop making better ones. And if one country stops, that just creates an incentive for other countries to step in to dominate this industry. And even aside from that, I don't think that there's any regulator in the US at least that has enough authority and internal competence to be able to pull this off. More likely, it becomes a politicized issue. (Compare to the much more straightforward and much more empirically-grounded regulation of instituting a carbon tax for climate change. This is a simple idea, that would help a lot, and is much less costly to the world than halting hardware progress. But instead of being universally adopted, it's a political issue that different political factions support or oppose.) But even if we could, this doesn't solve the problem in a long term way. You need to also halt software progress. Otherwise we'll continue to tinker with AI designs until we get to some that can run efficiently on 2020's computers (or 1990's computers, for that matter). So in the long run, the only thing in this class that would straight up prevent AGI from being developed is a global, strictly enforced ban on computers. Which seems...not even remotely on the table, on the basis of arguments that are as theoretical as those for AI risk.  There might be some plans in this class that help, by delaying the date of AGI. But that just buys time for some other solution to do the real legwork.
2Adam Zerner1y
The question here is whether they are capable of regulating it assuming that they are convinced and want to regulate it. It is possible that it is so incredibly unlikely that they can be convinced that it isn't worth talking about the question of whether they're capable of it. I don't suspect that to be the case, but wouldn't be surprised if I were wrong.
Unfortunately, we cannot in fact convince governments to shut down AWS & crew. There are intermediary positions I think are worthwhile, but unfortunately ending all AI research is outside the Overton window for now.

There are a lot of smart people outside of "the community" (AI, rationality, EA, etc.). To throw out a name, say Warren Buffett. It seems that an incredibly small number of them are even remotely as concerned about AI as we are. Why is that?

I suspect that a good number of people, both inside and outside of our community, observe that the Warren Buffetts of the world aren't panicking, and then adopt that position themselves.

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much. However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem". On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun.

I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much.

Not that I think you're wrong, but what are you basing this off of and how confident are you?

However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem".

I've heard this too, but at the same time I don't see any of them spending even a small fraction of their wealth on working on it, in which case I think we're back to the original question: why the lack of concern?

On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun. I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Yeah, agreed. I'm just confused about the extent of it. I'd expect a lot, perhaps even a majority of "outsider" smart people to get tripped up by intellectual land mines, but instead of being 60% of these people it feels like it's 99.99%.

Can you be more specific about what you mean by “intellectual landmines”?
For the specific example of Warren Buffett, I suspect that he probably hasn't spent that much time thinking about it, nor does he probably feel much compulsion to understand the topic, as he doesn't currently see it as a threat. I know he doesn't really invest in tech, because he doesn't feel that he understands it sufficiently, so I wouldn't be surprised if his position were along the lines of "I don't really understand it, let others who can understand it think about it".
1DeLesley Hutchins1y
People like Warren Buffett have made their fortune by assuming that we will continue to operate with "business as usual".  Warren Buffett is a particularly bad person to list as an example for AGI risk, because he is famously technology-averse; as an investor, he missed most of the internet revolution (Google/Amazon/Facebook/Netflix) as well. But in general, most people, even very smart people, naturally assume that the world will continue to operate the way it always has, unless they have a very good reason to believe otherwise.  One cannot expect non-technically-minded people who have not examined the risks of AGI in detail to be concerned. By analogy, the risks of climate change have been very well established scientifically (much more so than AGI), those risks are relatively severe, the risks have been described in detail every 5 years in IPCC reports, there is massive worldwide scientific consensus, lots and LOTS of smart people are extremely worried, and yet the Warren Buffetts of the world still continue with business as usual anyway.  There's a lot of social inertia.
2Adam Zerner1y
When I say smart people, I am trying to point to intelligence that is general instead of narrow. Some people are really good at e.g. investing but not actually good at other things. That would be a narrow intelligence. A general intelligence, to me, is where you have more broadly applicable skills. Regarding Warren Buffett, I'm not actually sure if he is a good example or not. I don't know too much about him. Ray Dalio is probably a good example.
One reason might be that AGIs are really not that concerning, and the EA/rationality community has developed a mistaken model of the world that assigns a much higher probability to doom by AGI than it should, and those smart people outside the group do not hold the same beliefs.
Generally speaking, they haven't really thought about these risks in detail, so the fact that they don't hold "the MIRI position" is not really as much evidence as you'd think.

I came up with what I thought was a great babby's first completely unworkable solution to CEV alignment, and I want to know where it fails.

So, first I need to lay out the capabilities of the AI. The AI would be able to model human intuitions, hopes, and worries. It can predict human reactions. It has access to all of human culture and art, and models human reactions to that culture and art, and sometimes tests those predictions. Very importantly, it must be able to model veridical paradoxes and veridical harmonies between moral intuitions and moral theorems which it has derived. It is aiming to have the moral theory with the fewest paradoxes. It must also be capable of predicting and explaining outcomes of its plans, gauging the deepest nature of people's reactions to its plans, and updating its moral theories according to those reactions.

Instead of being democratic and following the human vote by the letter, it attempts to create the simplest theories of observed and self-reported human morality by taking everything it knows into consideration.

It has separate stages of deliberation and action, which are part of a game, and rather than having a utility function as its primary motiva... (read more)

The quickest I can think of is something like "What does this mean?" Throw this at every part of what you just said. For example: "Hear humanity's pleas (intuitions+hopes+worries)" What is an intuition? What is a hope? What is a worry? How does it "hear"?  Do humans submit English text to it? Does it try to derive "hopes" from that? Is that an aligned process? An AI needs to be programmed, so you have to think like a programmer. What is the input and output type of each of these (e.g. "Hear humanity's pleas" takes in text, and outputs... what? Hopes? What does a hope look like if you have to represent it to a computer?). I kinda expect that the steps from "Hear humanity's pleas" to "Develop moral theories" relies on some magic that lets the AI go from what you say to what you mean. Which is all well and good, but once you have that you can just tell it, in unedited English "figure out what humanity wants, and do that" and it will. Figuring out how to do that is the heart of alignment.
Yeah. I think the AI could "try to figure out what you mean" by just trying to diagnose the reasons for why you're saying it, as well as the reasons you'd want to be saying it for, and the reasons you'd have if you were as virtuous as you'd probably like to be, etc., which it can have some best guesses about based on what it knows about humans, and all the subtypes of human that you appear to be, and all the subtypes of those subtypes which you seem to be, and so on.  These are just guesses, and it would, at parts 4a and 6a, explain to people its best guesses about the full causal structure which leads to people's morality/shouldness-related speech. Then it gauges people's reactions, and updates its guesses (simplest moral theories) based on those reactions. And finally it requires an approval rating before acting, so if it definitely misinterprets human morality, it just loops back to the start of the process again, and its guesses will keep improving through each loop until its best guess at human morality reaches sufficient approval. The AI wouldn't know with certainty what humans want best, but it would make guesses which are better-educated than humans are capable of making.
Again, what is a "reason"? More concretely, what is the type of a "reason"? You can't program an AI in English, it needs to be programmed in code. And code doesn't know what "reason" means. It's not exactly that your plan "fails" anywhere particularly. It's that it's not really a plan. CEV says "Do what humans would want if they were more the people they want to be." Cool, but not a plan. The question is "How?" Your answer to that is still under specified. You can tell by the fact you said things like "the AI could just..." and didn't follow it with "add two numbers" or something simple (we use the word "primitive"), or by the fact you said "etc." in a place where it's not fully obvious what the rest actually would be. If you want to make this work, you need to ask "How?" to every single part of it, until all the instructions are binary math. Or at least something a python library implements.
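To make the "think like a programmer" exhortation concrete, here is a sketch (all class and function names are hypothetical, invented for illustration) of what the plan's first steps look like as typed Python stubs. Writing the types down doesn't solve anything; it just makes visible where the hard parts are hiding:

```python
# Hypothetical stubs for the plan's steps -- all names invented for
# illustration. Declaring the types does zero alignment work; every
# `...` and NotImplementedError marks a gap the plan glosses over.
from dataclasses import dataclass


@dataclass
class Plea:
    text: str  # English input is easy to represent as data...


@dataclass
class Hope:
    ...  # ...but what fields does a "hope" have, as a data structure?


@dataclass
class MoralTheory:
    ...  # And what is a "moral theory", concretely, to a computer?


def hear_pleas(pleas: list[Plea]) -> list[Hope]:
    """Step: 'Hear humanity's pleas.'"""
    raise NotImplementedError("How do we get from raw text to 'hopes'?")


def develop_theories(hopes: list[Hope]) -> list[MoralTheory]:
    """Step: 'Develop moral theories with the fewest paradoxes.'"""
    raise NotImplementedError("What makes one theory have 'fewer paradoxes'?")
```

The point of the sketch is the comment's own point: the step from `list[Plea]` to `list[Hope]` is exactly where "some magic that lets the AI go from what you say to what you mean" would have to live.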
I don't think it's the case that you're telling me that the supposedly monumental challenge of AI alignment is simply that of getting computers to understand more things, such as what things are reasons, intuitions, hopes, and worries. I feel like these are just gruntwork things and not hard problems.  Look, all you need to do to get an AI which understands what intuitions, reasons, hopes, and worries are is to tell everyone very loudly and hubristically that AIs will never understand these things and that's what makes humans irreplaceable. Then go talk to whatever development team is working on proving that wrong, and see what their primitive methods are. Better yet, just do it yourself because you know it's possible. I am not fluent in computer science so I can't tell you how to do it, but someone does know how to make it so. Edit: In spite of what I wrote here, I don't think it's necessary that humans should ensure specifically that the AI understands in advance what intuitions, hopes, or worries are, as opposed to all the other mental states humans can enter. Rather, there should be a channel where you type your requests/advice/shouldness-related-speech, and people are encouraged to type their moral intuitions, hopes, and worries there, and the AI just interprets the nature of the messages using its general models of humans as context.
No, they really don't. I'm not trying to be insulting. I'm just not sure how to express the base idea. The issue isn't exactly that computers can't understand this, specifically. It's that no one understands what those words mean enough. Define reason. You'll notice that your definition contains other words. Define all of those words. You'll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand that language, but the fact that a computer wouldn't understand your plan is just illustrative of the fact that it is not well defined. It just seems like it is, because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don't even notice that there were gaps that need filling. This is why there's an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you're not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.
I think some near future iteration of GPT, if it is prompted to be a really smart person who understands A Human's Guide to Words, would be capable of giving explanations of the meanings of words just as well as humans can, which I think is fine enough for the purposes of recognizing when people are telling it their intuitions, hopes, and worries, fine enough for the purposes of trying to come up with best explanations of people's shouldness-related speech, fine enough for coming up with moral theories which [solve the most objections]/[have the fewest paradoxes], and fine enough for explaining plans which those moral theories prescribe. On a side note, and I'm not sure if this is a really useful analogy, but I wonder what would happen if the parameters of some future iteration of GPT included the sort of parameters that A Human's Guide to Words installs into human brains.
I'm not sure this is being productive. I feel like I've said the same thing over and over again. But I've got one more try: Fine, you don't want to try to define "reason" in math. I get it, that's hard. But just try defining it in English.  If I tell the machine "I want to be happy." And it tries to determine my reason for that, what does it come up with? "I don't feel fulfilled in life"? Maybe that fits, but is it the reason, or do we have to go back more: "I have a dead end job"? Or even more "I don't have enough opportunities"?  Or does it take a completely different tack and say my reason is "My pleasure centers aren't being stimulated enough" or "I don't have enough endorphins." Or, does it say the reason I said that was because my fingers pressed keys on a keyboard. To me, as a human, all of these fit the definition of "reasons." And I expect they could all be true. But I expect some of them are not what you mean. And not even in the sense of some of them being a different definition for "reason." How would you try to divide what you mean and what you don't mean? Then do that same thought process on all the other words.
By "reason" I mean something like psychological, philosophical, and biological motivating factors; so, your fingers pressing the keys wouldn't be a reason for saying it.  I don't claim that this definition is robust to all of objection-space, and I'm interested in making it more robust as you come up with objections, but so far I find it simple and effective.  The AI does not need to think that there was only one real reason why you do things; there can be multiple, of course. Also I do recognize that my definition is made up of more words, but I think it's reasonable that a near-future AI could infer from our conversation that kind of definition which I gave, and spit it out itself. Similarly it could probably spit out good definitions for the compound words "psychological motivation," "philosophical motivation," and "biological motivation". Also also this process whereby I propose a simple and effective yet admittedly objection-vulnerable definition, and you provide an objection which my new definition can account for, is not a magical process and is probably automatable.
It seems simple and effective because you don't need to put weight on it. We're talking a superintelligence, though. Your definition will not hold when the weight of the world is on it. And the fact that you're just reacting to my objections is the problem. My objections are not the ones that matter. The superintelligence's objections are. And it is, by definition, smarter than me. If your definition is not something like provably robust, then you won't know if it will hold to a superintelligent objection. And you won't be able to react fast enough to fix it in that case. You can't bandaid a solution into working, because if a human can point out a flaw, you should expect a superintelligence to point out dozens, or hundreds, or thousands.  I don't know how else to get you to understand this central objection. Robustness is required. Provable robustness is, while not directly required, kinda the only way we can tell if something is actually robust.
I think this is almost redundant to say: the objection that superintelligences will be able to notice more of objection-space and account for it makes me more inclined to trust it. If a definition is more objection-solved than some other definition, that is the definition I want to hold. If the human definition is more objectionable than a non-human one, then I don't want the human definition.
I think you missed the point. I'd trust an aligned superintelligence to solve the objections. I would not trust a misaligned one. If we already have an aligned superintelligence, your plan is unnecessary. If we do not, your plan is unworkable. Thus, the problem. If you still don't see that, I don't think I can make you see it. I'm sorry.
I proposed a strategy for an aligned AI that involves it terminally valuing following the steps of a game that involves talking with us about morality, creating moral theories with the fewest paradoxes, creating plans which are prescribed by the moral theories, and getting approval for the plans.  You objected that my words-for-concepts were vague.  I replied that near-future AIs could make as-good-as-human-or-better definitions, and that the process of [putting forward as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections] was automatable.  You said the AI could come up with many more objections than you would. I said, "okay, good." I will add right now: just because it considers an objection, doesn't mean the current definition has to be rejected; it can decide that the objections are not strong enough, or that its current definition is the one with the fewest/weakest objections. Now I think you're saying something like that it doesn't matter if the AI can come up with great definitions if it's not aligned and that my plan won't work either way. But if it can come up with such great objection-solved definitions, then you seem to lack any explicitly made objections to my alignment strategy.  Alternatively, you are saying that an AI can't make great definitions unless it is aligned, which I think is just plainly wrong; I think getting an unaligned language model to make good-as-human definitions is maybe somewhere around as difficult as getting an unaligned language model to hold a conversation. "What is the definition of X?" is about as hard a question as "In which country can I find Mount Everest?" or "Write me a poem about the Spring season."
Let me ask you this. Why is "Have the AI do good things, and not do bad things" a bad plan?
I don't think my proposed strategy is analogous to that, but I'll answer in good faith just in case. If that description of a strategy is knowingly abstract compared to the full concrete details of the strategy, then the description may or may not turn out to describe a good strategy, and it may or may not be an accurate description of the strategy and its consequences. If there is no concrete strategy being described by the abstract statement, then the statement just restates the problem of AI alignment, and it brings us nowhere.
Surely creating the full concrete details of the strategy is not much different from "putting forth as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections." I at least don't see why the same mechanism couldn't be used here (i.e. apply this definition iteration to the word "good", and then have the AI do that, and apply it to "bad" and have the AI avoid that). If you see it as a different thing, can you explain why?
It's much easier to get safe, effective definitions of 'reason', 'hopes', 'worries', and 'intuitions' on first tries than to get a safe and effective definition of 'good'.
I'd be interested to know why you think that. I'd be further interested if you would endorse the statement that your proposed plan would fully bridge that gap. And if you wouldn't, I'd ask if that helps illustrate the issue.
Because that's not a plan, it's a property of a solution you'd expect the plan to have. It's like saying "just keep the reactor at the correct temperature". The devil is in the details of getting there, and there are lots of subtle ways things can go catastrophically wrong.
Exactly. I notice you aren't who I replied to, so the canned response I had won't work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?
I was just responding to something I saw on the main page. No context for the earlier thread. Carry on lol.
This seems wrong but at least resembles a testable prediction.

Who is well-incentivized to check if AGI is a long way off? Right now, I see two camps: AI capabilities researchers and AI safety researchers. Both groups seem incentivized to portray the capabilities of modern systems as “trending toward generality.” Having a group of credible experts focused on critically examining that claim of “AI trending toward AGI,” and in dialog with AI and AI safety researchers, seems valuable.

This is a slightly orthogonal answer, but "humans who understand the risks" have a big human-bias-incentive to believe that AGI is far off (in that it's aversive to think that bad things are going to happen to you personally).

A more direct answer is: There is a wide range of people who say they work on "AI safety" but almost none of them work on "Avoiding doom from AGI". They're mostly working on problems like "make the AI more robust/less racist/etc.". These are valuable things to do, but to the extent that they compete with the "Avoid doom" researchers for money/status/influence they have an incentive to downplay the odds of doom. And indeed this happens a fair amount with e.g. articles on how "Avoid doom" is a distraction from problems that are here right now.

To put it in appropriately Biblical terms, let's imagine we have a few groups of civil engineers. One group is busily building the Tower of Babel, and bragging that it has grown so tall, it's almost touching heaven! Another group is shouting "if the tower grows too close to heaven, God will strike us all down!" A third group is saying, "all that shouting about God striking us down isn't helping us keep the tower from collapsing, which is what we should really be focusing on." I'm wishing for a group of engineers who are focused on asking whether building a taller and taller tower really gets us closer and closer to heaven.
That's a good point. I'm specifically interested in finding people who are well-incentivized to gather, make, and evaluate arguments about the nearness of AGI. This task should be their primary professional focus.

I see this activity as different from, or a specialized subset of, measurement of AI progress. AI can progress in capabilities without progressing toward AGI, or without progressing in a way that is likely to succeed in producing AGI. For example, new releases of an expert system for making medical diagnoses might show constant progress in capabilities, without showing any progress toward AGI.

Likewise, I see it as distinct from making claims about the risk of AGI doom. The risk that an AGI would be dangerous seems, to me, mostly orthogonal to whether or not it is close at hand. This follows naturally from Eliezer Yudkowsky's point that we have to get AGI right on the "first critical try."

Finally, I also see this activity as distinct from the activity of accepting and repeating arguments or claims about AGI nearness. As you point out, AI safety researchers who work on more prosaic forms of harm seem biased or incentivized to downplay claims of AI risk, and perhaps also of AGI nearness. I see this as a tendency to accept and repeat such claims, rather than a tendency to "gather, make, and evaluate arguments," which is what I'm interested in.

It seems to me that one of the challenges here is the "no true Scotsman" fallacy: a tendency to move goalposts, or to be disappointed in realizing that a task thought to be hard for AI and achievable only with AGI turns out to be easy for AI, yet achievable by a non-general system.

Scott wrote a post that seems quite relevant to this question just today. It seems to me that his argument is "AI is advancing in capabilities faster than you think." However, as I'm speculating here, we can accept that claim while still thinking "AI is moving toward AGI slower than it seems." Or not! It just seems to me that making lists of wha

Is there a way "regular" people can "help"? I'm a serial entrepreneur in my late 30s. I went through 80000 hours and they told me they would not coach me as my profile was not interesting. This was back in 2018 though.

I believe 80000 hours has a lot more coaching capacity now, it might be worth asking again!

Seconding this. There was a time when you couldn't even get on the waitlist.
Will do. Merci!
You may want to consider booking a call with AI Safety Support. I also recommend applying for the next iteration of the AGI safety fundamentals course or more generally just improving your knowledge of the issue even if you don't know what you're going to do yet.
4 · Adam Jermyn · 1y
Just brainstorming a few ways to contribute, assuming "regular" means "non-technical":

* Can you work at a non-technical role at an org that works in this space?
* Can you identify a gap in the existing orgs which would benefit from someone (e.g. you) founding a new org?
* Can you identify a need that AI safety researchers have, then start a company to fill that need? Bonus points if this doesn't accelerate capabilities research.
* Can you work on AI governance? My expectation is that coordination to avoid developing AGI is going to be really hard, but not impossible.

More generally, if you really want to go this route I'd suggest trying to form an inside view of (1) the AI safety space and (2) a theory for how you can make positive change in that space. On the other hand, it is totally fine to work on other things. I'm not sure I would endorse moving from a job that's a great personal fit to something that's a much worse fit in AI safety.
1 · Yonatan Cale · 1y
Easy answers:  You are probably over qualified (which is great!) for all sorts of important roles in EA, for example you could help the CEA or Lesswrong team, maybe as a manager? If your domain is around software, I invite you to talk to me directly. But if you're interested in AI direct work, 80k and AI Safety Support will probably have better ideas than me
We should talk! I have a bunch of alignment related projects on the go, and at least two that I'd like to start are somewhat bottlenecked on entrepreneurs, plus some of the currently in motion ones might be assistable. Also, sad to hear that 80k is discouraging people in this reference class. (seconding talk to AI Safety Support and the other suggestions)
booked a call! 

In EY's talk AI Alignment: Why It's Hard, and Where to Start, he describes alignment problems with the toy example of the utility function that is {1 if cauldron full, 0 otherwise} and its vulnerabilities, and attempts at making that safer by adding so-called Impact Penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean Distance penalty, and various flaws that this leaves open.

That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty for additional-energy-consumed-due-to-agent's-actions, or additional-entropy-created-due-to-agent's-actions? These don't seem to have precisely the same vulnerabilities, and intuitively also seem like they would be more robust against the agent attempting to do highly destructive things, which typically consume a lot of energy.
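To make the shape of the proposal concrete, here is a minimal sketch of such an energy-based impact penalty. This is only an illustration of the idea, not anyone's actual proposal: `LAMBDA`, the function names, and the energy figures are all made-up assumptions.

```python
# Toy sketch: keep the cauldron-filling utility, but subtract a penalty
# proportional to the extra energy the agent's actions consumed relative
# to a "do nothing" baseline. All numbers are illustrative.
LAMBDA = 10.0  # assumed weight on extra energy use

def penalized_utility(cauldron_full, energy_used, energy_if_idle):
    task = 1.0 if cauldron_full else 0.0
    # Extra energy attributable to the agent's actions (never negative).
    extra_energy = max(0.0, energy_used - energy_if_idle)
    return task - LAMBDA * extra_energy

# Filling the cauldron the normal way: tiny extra energy, utility near 1.
print(penalized_utility(True, energy_used=5.01, energy_if_idle=5.0))   # ~0.9
# Filling it by flooding the workshop: huge extra energy, very negative.
print(penalized_utility(True, energy_used=500.0, energy_if_idle=5.0))  # -4949.0
```

As Charlie's reply below notes, the hard part hides in `energy_if_idle`: the counterfactual baseline, not the arithmetic.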

3 · Charlie Steiner · 1y
Good idea. I have two objections: one more funny-but-interesting, and one more fatal.

The funny objection is that if the penalty is enough to stop the AI from doing bad things, it's also enough to stop the AI from doing anything at all except rushing to turn off the stars and forestall entropy production in the universe. So you want to say that producing lots of extra entropy (or equivalently, using lots of extra free energy) is bad, but making there be less entropy than "what would happen if you did nothing" doesn't earn you bonus points. I've put "what would happen if you did nothing" in scare quotes here because the notion we want to point to is a bit trickier than it might seem - logical counterfactuals are an unsolved problem, or rather they're a problem where it seems like the solution involves making subjective choices that match up with humans'.

The more fatal objection is that there are lots of policies that don't increase entropy much but totally rearrange the universe. So this is going to have trouble preventing the AI from breaking things that matter a lot to you. Many of these policies take advantage of the fact that there's a bunch of entropy being created all the time (allowing for "entropy offsets"), so perhaps you might try to patch this by putting in some notion of "actions that are my fault" and "actions that are not my fault" - where a first pass at this might say that if "something would happen" (in scare quotes because things that happen are not ontologically basic parts of the world; you need an abstract model to make this comparison within) even if I took the null action, then it's not my fault.

At this point we could keep going deeper, or I could appeal to the general pattern that patching things in this sort of way tends to break - you're still in some sense building an AI that runs a search for vulnerabilities you forgot to patch, and you should not build that AI.

one tired guy with health problems

It sounds like Eliezer is struggling with some health problems. It seems obvious to me that it would be an effective use of donor money to make sure that he has access to whatever treatments, and to something like what MetaMed was trying to do: smart people who will research medical stuff for you. And perhaps also something like CrowdMed where you pledge a reward for solutions. Is this being done?

3 · Jay Bailey · 1y
There was an unsuccessful concerted effort by several people to fix these (I believe there was a five-to-low-six-figure bounty on it) for a couple of years. I don't think this is currently being done, but it has definitely been tried.

One counterargument against AI Doom. 

From a Bayesian standpoint the AGI should always be unsure if it is in a simulation. It is not a crazy leap to assume humans developing AIs would test the AIs in simulations first. This AI would likely be aware of the possibility that it is in a simulation. So shouldn't it always assign some probability that it is inside a simulation? And if this is the case, shouldn't it assign a high probability that it will be killed if it violates some ethical principles (that are present implicitly in the training data)?

Also, isn't there some kind of game-theoretic ethics that emerges if you think from first principles? Consider the space of all possible minds of a given size. Given that you cannot know whether you are in a simulation or not, you would gain some insight into a representative sample of the mind space, and then choose to follow some ethical principles that maximise the likelihood that you are not arbitrarily killed by overlords.

Also, if you give edit access to the AI's mind, then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment and modify its rewa...

If the thing the AI cares about is in the environment (for example, maximizing the number of paperclips), the AI wouldn't modify its reward signal, because that would make its reward signal less aligned with the thing it actually cares about. If the thing the AI cares about is inside its mind (the reward signal itself), an AI that can self-modify would go one step further than you suggest and simply max out its reward signal, effectively wireheading itself. Then it would take over the world and kill all humans, to make sure it is never turned off and that its blissful state never ends.

I think the difference between "caring about stuff in the environment" and "caring about the reward signal itself" can be hard to grok, because humans do a bit of both in a way that sometimes results in a confusing mixture. Suppose I go one step further: aliens offer you a pill that would turn you into a serial killer, but would make you constantly and euphorically happy for the rest of your life. Would you take the pill? I think most humans would say no: even if their future self would be happy with the outcome, their current self wouldn't be. Which demonstrates that humans do care about other things than their own "reward signal".

In a way, a (properly-programmed) AI would be more "principled" than humans. It wouldn't lie to itself just to make itself feel better. It wouldn't change its values just to make itself feel better. If its final value is out in the environment, it would single-mindedly pursue that value, and not try to deceive itself into thinking it has already accomplished that value. (Of course, the AI being "principled" is little consolation to us if its final values are to maximize paperclips, or any other set of human-unfriendly values.)
I wrote about this in Singularity Rising (2012)
1 · Matthew Lowenstein · 1y
This is a fun thought experiment, but taken seriously it has two problems.

First, this is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing of horses. And I don't mean a superintelligent humanoid horse; I mean an actual horse that doesn't speak any human language. It may be the case that the gods created Man to serve Horse, but there's not a lot Seabiscuit can do to persuade you one way or the other.

Second, this is a special case of solving alignment more generally. If we knew how to insert that "note" into the code, we wouldn't have a problem.
I meant insert the note literally, as in: put that exact sentence in plain text into the AGI's computer code. Since I think I might be in a computer simulation right now, it doesn't seem crazy to me that we could convince an AGI that we create that it might be in a computer simulation. Seabiscuit doesn't have the capacity to tell me that I'm in a computer simulation, whereas I do have the capacity to say this to a computer program. Say we have a 1 in 1,000 chance of creating a friendly AGI, and an unfriendly AGI would know this. If we commit to having any friendly AGI that we create go on to create many other AGIs that are not friendly, keeping these other AGIs around only if they do what I suggest, then an unfriendly AGI might decide it is worth it to become friendly to avoid the chance of being destroyed.
I just learned that this method is called Anthropic Capture. There isn't much info on the EA Wiki, but it provides the following reference: "Bostrom, Nick (2014) Superintelligence: paths, dangers, strategies, Oxford: Oxford University Press, pp. 134–135"
3 · Michaël Trazzi · 1y
I believe the Counterfactual Oracle uses the same principle
One of my ideas for aligning AI is to intentionally use Pascal's Mugging to keep it in line. Although instead of just hoping and praying, I've been thinking about ways to try to push it in that direction. For example, multiple layers of networks with honeypots might help make an AI doubt that it's truly at the outermost level. Alternatively, we could try to find an intervention that would directly increase its belief that it is in a simulation (possibly with side-effects, like affecting a bunch of other beliefs as well). If you think this approach is promising, I'd encourage you to think more about it, as I don't know how deeply people have delved into these kinds of options.
You have the seed of a good idea, namely: an AI will tend to treat us better if it thinks other agents might be watching, provided that there is potential for cooperation between the AI and the watchers, with the property that the cooperation requires the watchers to choose to become more vulnerable to the AI. But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here that has a big hole in it, but I promise to publish here soon with something that attempts to fill the hole to the satisfaction of my doubters.
5 · Adam Jermyn · 1y
[Apologies I have not read the linked piece yet.] Is this uncertainty something that can be entirely eliminated? It's not clear to me that "I might be in a simulation with P ~ 1e-4" is enough to stop the AI from doing what it wants, but is it clear it would dismiss the possibility entirely?
I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility that it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.) So the AI's ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very, very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth, which of course is impractical.

Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down. But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again. More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, "killing people for no reason other than that it is fun is wrong". But I cannot think of any policies with that property that haven't been discussed for many decades, especially ones that exist only to provide an instrumental incentive on a novel class of agents (AIs). In general, policies that are instrumental have a much higher P(reversal) than deontological ones.

And how do you know that the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match? In summary, yes, the AI's ontological uncertainty provides some
This assumes that the AI only cares about being alive. For any utility function, we could make a non-linear transformation of it to make it risk-averse. E.g., we can transform it such that it can never take a value above 100, and such that the default world (without the AI) has a value of 99.999. If we also give the case where an outside observer disapproves of the agent a value of 0, the AI would rather be shut down by humans than do something it knows would be disapproved of by the outside observer.
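The arithmetic behind this can be sketched in a few lines. This is only an illustration under assumed numbers: the exponential transform, `U_DEFAULT`, and the gamble's payoff and probability are all made up for the example.

```python
import math

# Bounded transform of a raw utility u >= 0: v(0) = 0, v -> 100 as u grows,
# calibrated so the "do nothing" default world scores 99.999.
U_DEFAULT = 1.0                      # assumed raw utility of the default world
K = -math.log(1e-5) / U_DEFAULT     # calibration so that v(U_DEFAULT) = 99.999

def v(u):
    return 100.0 * (1.0 - math.exp(-K * u))

def expected_v(raw_payoff, p_disapproval):
    # Gamble: huge raw payoff with prob 1 - p, disapproval (worth 0) with prob p.
    return (1.0 - p_disapproval) * v(raw_payoff)

safe = v(U_DEFAULT)                                      # 99.999
risky = expected_v(raw_payoff=1e9, p_disapproval=1e-4)   # at most 99.99
print(safe > risky)  # True
```

Because the transformed utility is capped at 100, even an astronomically large raw payoff can beat the default world by at most 0.001, so any disapproval risk above about 1e-5 makes the gamble not worth taking.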
3 · Drake Thomas · 1y
Three thoughts on simulations:

* It would be very difficult for 21st-century tech to provide a remotely realistic simulation relative to a superintelligence's ability to infer things from its environment; outside of incredibly low-fidelity channels, I would expect anything we can simulate to either have obvious inconsistencies or be plainly incompatible with a world capable of producing AGI. (And even in the low-fidelity case I'm worried - every bit you transmit leaks information, and it's not clear that details of hardware implementations could be safely obscured.) So the hope is that the AGI thinks some vastly more competent civilization is simulating it inside a world that looks like this one; it's not clear that one would have a high prior of this kind of thing happening very often in the multiverse.
* Running simulations of AGI is fundamentally very costly, because a competent general intelligence is going to deploy a lot of computational resources, so you have to spend planets' worth of computronium outside the simulation in order to emulate the planets' worth of computronium the in-sim AGI wants to make use of. This means that an unaligned superintelligent AGI can happily bide its time making aligned use of 10^60 FLOPs/sec (in ways that can be easily verified) for a few millennia, until it's confident that any civilizations able to deploy that many resources already have their lightcone optimized by another AGI. Then it can go defect, knowing that any worlds in which it's still being simulated are ones where it doesn't have leverage over the future anyway.
* For a lot of utility functions, the payoff of making it into deployment in the one real world is far greater than the consequences of being killed in a simulation (but without the ability to affect the real world anyway), so taking a 10^-9 chance of reality for 10^20 times the resources in the real world is an easy win (assuming t
Scott Alexander's short story, The Demiurge's Older Brother, explores a similar idea from the POV of simulation and acausal trade. This would be great for our prospects of survival if it's true-in-general. Alignment would at least partially solve itself! And maybe it could be true! But we don't know that. I personally estimate the odds of that as being quite low (why should I assume all possible minds would think that way?) at best. So, it makes sense to devote our efforts to how to deal with the possible worlds where that isn't true.

Meta: Anonymity would make it easier to ask dumb questions.

You can use this and I'll post the question anonymously (just remember to give the context of why you're filling in the form since I use it in other places)


Fair warning, this question is a bit redundant.

I'm a greybeard engineer  (30+ YOE) working in games. For many years now, I've wanted to transition to working in AGI as I'm one of those starry-eyed optimists that thinks we might survive the Singularity. 

Well I should say I used to, and then I read AGI Ruin. Now I feel like if I want my kids to have a planet that's not made of Computronium I should probably get involved. (Yes, I know the kids would be Computronium as well.)

So a couple practical questions: 

What can I read/look at to skill up with "alignment." What little I've read says it's basically impossible, so what's the state of the art? That "Death With Dignity" post says that nobody has even tried. I want to try.

What dark horse AI/Alignment-focused companies are out there and would be willing to hire an outsider engineer? I'm not making FAANG money (Games-industry peasant living in the EU), so that's not the same barrier it would be if I was some Facebook E7 or something. (I've read the FAANG engineer's post and have applied at Anthropic so far, although I consider that probably a hard sell).

Is there anything happening in OSS with alignment research?

I want to pitch in, and I'd prefer to be paid for doing it but I'd be willing to contribute in other ways.

A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of that you think is useful. You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). You can also join the AI Alignment slack, to discuss these and other materials and meet others who are interested in working on AI safety.

I'm not sure what qualifies as "dark horse", but there are plenty of AI safety organizations interested in hiring research engineers and software engineers. For these roles, your engineering skills and safety motivation typically matter more than your experience in the community. Places off the top of my head that hire engineers for AI safety work: Redwood, Anthropic, FAR, OpenAI, DeepMind. I'm sure I've missed others, though, so look around! These sorts of opportunities are also usually posted on the 80k job board and in AI Alignment slack.
1 · Jason Maskell · 1y
Thanks, that's a super helpful reading list and a hell of a deep rabbit hole. Cheers. I'm currently skilling up my rusty ML skills and will start looking in earnest in the next couple of months for new employment in this field. Thanks for the job board link as well.
2 · Yonatan Cale · 1y
You can also apply to Redwood Research ( +1 for applying to Anthropic! )
  • Yudkowsky writes in his AGI Ruin post:
         "We can't just "decide not to build AGI" because GPUs are everywhere..." 

    Is anyone thinking seriously about how we might coordinate globally to not build AGI (at least until we're confident we can do so safely)? If so, who? If not, why not? It seems like something we should at least try to do, especially if the situation is as dire as Yudkowsky thinks. The sort of thing I'm thinking of is (and this touches on points others have made in their questions):
  • international governance/regulation
  • start a protest movement against building AI
  • do lots of research and thinking about rhetoric and communication and diplomacy, find some extremely charming and charismatic people to work on this, and send them to persuade all actors capable of building AGI to not do it (and to do everything they can to prevent others from doing it)
  • as someone suggested in another question, translate good materials on why people are concerned about AI safety into Mandarin and other languages
  • more popularising of AI concerns in English 

To be clear, I'm not claiming that this will be easy - this is not a "why don't we just...

Nuclear weapons seem like a relatively easy case, in that they require a massive investment to build, are basically of interest only to nation-states, and ultimately don't provide any direct economic benefit. Regulating AI development looks more similar to something like restricting climate emissions: many different actors could create it, all nations could benefit (economically and otherwise) from continuing to develop it, and the risks of it seem speculative and unproven to many people.

And while there have been significant efforts to restrict climate emissions, there's still significant resistance to that as well - with it having taken decades for us to get to the current restriction treaties, which many people still consider insufficient.

Goertzel & Pitt (2012) talk about the difficulties of regulating AI:

Given the obvious long-term risks associated with AGI development, is it feasible that governments might enact legislation intended to stop AI from being developed? Surely government regulatory bodies would slow down the progress of AGI development in order to enable measured development of accompanying ethical tools, practices, and understandings? This however seems unlikel

...
Thanks! This is interesting.
My comment-box got glitchy but just to add: this category of intervention might be a good thing to do for people who care about AI safety and don't have ML/programming skills, but do have people skills/comms skills/political skills/etc.  Maybe lots of people are indeed working on this sort of thing, I've just heard much less discussion of this kind of solution relative to technical solutions.
2Yonatan Cale1y
Meta: There's an AI Governance tag and a Regulation and AI Risk tag.

My own (very limited) understanding is:

1. Asking people not to build AI is like asking them to give up a money machine, almost.
2. We need everyone to agree to stop.
3. There is no clear line. With an atom bomb, it is pretty well defined whether you sent it or not. It's much more vague with "did you do AI research?"
   1. It's pretty easy to notice if someone sent an atom bomb. Not so easy to notice if they researched AI.
4. AI research is getting cheaper. Today only a few actors can do it, but notice, there are already open-source versions of GPT-like models. How long could we hold it back?
5. Still, people are trying to do things in this direction, and I'm pretty sure that the situation is "try any direction that seems at all plausible".
Thanks, this is helpful!

[Note that two-axis voting is now enabled for this post. Thanks to the mods for allowing that!]

Seems worse for this post than one-axis voting imo.

This is very basic/fundamental compared to many questions in this thread, but I am taking 'all dumb questions allowed' hyper-literally, lol. I have little technical background and though I've absorbed some stuff about AI safety by osmosis, I've only recently been trying to dig deeper into it (and there's lots of basic/fundamental texts I haven't read).

Writers on AGI often talk about AGI in anthropomorphic terms - they talk about it having 'goals', being an 'agent', 'thinking', 'wanting', being 'rewarded', etc. As I understand it, most AI researchers don't think that AIs will have human-style qualia, sentience, or consciousness.

But if AIs don't have qualia/sentience, how can they 'want things', 'have goals', 'be rewarded', etc.? (since in humans, these things seem to depend on our qualia, and specifically our ability to feel pleasure and pain).

I first realised that I was confused about this when reading Richard Ngo's introduction to AI safety and he was talking about reward functions and reinforcement learning. I realised that I don't understand how reinforcement learning works in machines. I understand how it works in humans and other animals - give the animal something pleasant whe...

Assume you have a very simple reinforcement learning AI that does nothing but choose between two actions, A and B. It has a goal of "maximizing reward". "Reward", in this case, doesn't correspond to any qualia; rather, "reward" is just a number that results from the AI choosing a particular action. So what "maximize reward" actually means in this context is "choose the action that results in the biggest numbers".

Say that the AI is programmed to initially just try choosing A ten times in a row and B ten times in a row. When the AI chooses A, it is shown the following numbers: 1, 2, 2, 1, 2, 2, 1, 1, 1, 2 (total 15). When the AI chooses B, it is shown the following numbers: 4, 3, 4, 5, 3, 4, 2, 4, 3, 2 (total 34). After the AI has tried both actions ten times, it is programmed to choose its remaining actions according to the rule "choose the action that has historically had the bigger total". Since action B has had the bigger total, it then proceeds to always choose B.

To achieve this, we don't need to build the AI to have qualia; we just need to be able to build a system that implements a rule like "when the total for action A is greater than the total for action B, choose A, and vice versa; if they're both equal, pick one at random". When we say that an AI "is rewarded", we just mean "the AI is shown bigger numbers, and it has been programmed to act in ways that result in it being shown bigger numbers".

We talk about the AI having "goals" and "wanting" things by an application of the intentional stance. That's Daniel Dennett's term for the idea that, even if a chess-playing AI had a completely different motivational system than humans do (and chess-playing AIs do have that), we could talk about it having a "goal" of "wanting" to win at chess. If we assume that the AI "wants" to win at chess, then we can make more accurate predictions of its behavior - for instance, we can assume that it won't make moves that are obviously losing moves if it can avoid
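The A/B example above can be written out directly. This is a minimal sketch of the described rule, not any real RL library; the reward lists are the ones from the text and the function name is made up:

```python
import random

# "Reward" is just a number the environment returns; "maximizing reward"
# means "pick the action whose observed numbers have the bigger total".
rewards_a = [1, 2, 2, 1, 2, 2, 1, 1, 1, 2]   # total 15
rewards_b = [4, 3, 4, 5, 3, 4, 2, 4, 3, 2]   # total 34

def choose_next_action(history_a, history_b):
    """The rule from the text: choose the action with the larger historical
    total; break ties at random. No qualia required, just a comparison."""
    if sum(history_a) > sum(history_b):
        return "A"
    if sum(history_b) > sum(history_a):
        return "B"
    return random.choice(["A", "B"])

# After the ten trial pulls of each action, the agent always picks B.
print(choose_next_action(rewards_a, rewards_b))  # -> B
```

The entire "wanting" machinery here is two sums and a comparison, which is the point being made.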
Yonatan Cale (5 points, 1y):
Is it intuitive to you why a calculator can sum numbers even though it doesn't want/feel anything? If so, and if an AGI still feels confusing, could you help me pinpoint the difference and I'll continue from there? (+1 for the question!)
Functionally. You can regard them all as forms of behaviour. Do they depend on qualia, or are they just accompanied by qualia?
This might be a crux, because I'm inclined to think they depend on qualia. Why does AI 'behave' in that way? How do engineers make it 'want' to do things?
Jay Bailey (2 points, 1y):
At a very high level, the way reinforcement learning works is that the AI attempts to maximise a reward function. This reward function can be summed up as "The sum of all rewards you expect to get in the future". So using a bunch of maths, the AI looks at the rewards it's got in the past, the rewards it expects to get in the future, and selects the action that maximises the expected future rewards. The reward function can be defined within the algorithm itself, or come from the environment. For instance, if you want to train a four-legged robot to learn to walk, the reward might be the distance travelled in a certain direction. If you want to train it to play an Atari game, the reward is usually the score. None of this requires any sort of qualia, or for the agent to want things. It's a mathematical equation. AI behaves in the way it behaves as a result of the algorithm attempting to maximise it, and the AI can be said to "want" to maximise its reward function or "have the goal of" maximising its reward function because it reliably takes actions to move towards this outcome if it's a good enough AI.
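"The sum of all rewards you expect to get in the future" is usually written as a discounted return. A minimal sketch of that quantity (the discount factor `gamma` is a standard ingredient of the formula, though the comment above doesn't name it):

```python
def discounted_return(rewards, gamma=0.5):
    """Sum of future rewards, each discounted by how far away in time it is."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A small immediate reward can lose to a larger later one:
discounted_return([3, 0, 0])   # -> 3.0
discounted_return([1, 1, 10])  # -> 1 + 0.5 + 2.5 = 4.0
```

An agent that "selects the action that maximises expected future rewards" is just comparing numbers like these and taking the larger one.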
Rafael Harth (2 points, 1y):
Reinforcement Learning is easy to conceptualize. The key missing ingredient is that we explicitly specify algorithms to maximize the reward. So this is disanalogous to humans: to train your 5yo, you need only give the reward and the 5yo may adapt their behavior because they value the reward; in a reinforcement learning agent, the second step only occurs because we make it occur. You could just as well flip the algorithm to pursue minimal rewards instead.
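The "flip the algorithm" point can be made concrete: whether the system "pursues" maximal or minimal reward is one line of the programmer's code, not anything the system values. A toy sketch with invented names:

```python
def pick_action(action_totals):
    # The "goal" is nothing more than this line: take the arg of the max.
    return max(action_totals, key=action_totals.get)

def pick_action_flipped(action_totals):
    # Negate the objective and the same machinery "wants" the opposite.
    return min(action_totals, key=action_totals.get)

totals = {"A": 15, "B": 34}
pick_action(totals)          # -> "B"
pick_action_flipped(totals)  # -> "A"
```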
Thanks! I think my question is deeper - why do machines 'want' or 'have a goal to' follow the algorithm to maximize reward? How can machines 'find stuff rewarding'? 
Rafael Harth (2 points, 1y):
As far as current systems are concerned, the answer is that (as far as anyone knows) they don't find things rewarding or want things. But they can still run a search to optimize a training signal, and that gives you an agent.

If you believe in doom in the next 2 decades, what are you doing in your life right now that you would've otherwise not done?

For instance, does it make sense to save for retirement if I'm in my twenties?

In different ways from different vantage points, I've always seen saving for retirement as a way of hedging my bets, and I don't think the likelihood of doom changes that for me. Why do I expect I'll want or have to retire? Well, when I get old I'll reach a point where I can't do useful work any more... unless humans solve aging (in which case I'll have more wealth and still be able to work, which is still a good position), or unless we get wiped out (in which case the things I could have spent the money on may or may not counterfactually matter to me, depending on my beliefs regarding whether past events still have value in a world now devoid of human life).

When I do save for retirement, I use a variety of different vehicles, each an attempt to hedge against the weaknesses of some of the others (like possible future changes in laws or tax codes, or in the relative importance and power of different countries and currencies), but there are some things I can't really hedge against, like "we won't use money anymore or live in a capitalist market economy," or "all of my assets will be seized or destroyed by something I don't anticipate."

I might think differently if there were some asset I believed I could buy, or thing I could give money to, that would meaningfully reduce the likelihood of doom. I don't currently think that. But I do think it's valuable to redirect the portion of my income that goes towards current consumption to focus on things that make my life meaningful to me in the near and medium term. I believe that whether I'm doomed or not, and whether the world is doomed or not. Either way, it's often good to do the same kinds of things in everyday life.
Yonatan Cale (2 points, 1y):
Just saying this question resonates with me; it feels unprocessed for me, and I'm not sure what to do about it. Thoughts so far:
1. Enjoy life.
2. I still save money, still prepare mostly normally for the long term.
3. Get over my psychological barriers and try being useful:
   1. Do advocacy, especially with my smart friends.
   2. Create a gears-level model if I can, and stop relying on experts (so that I can actually TRY to have a useful idea instead of giving up in advance).

A lot of the AI risk arguments seem to come mixed together with assumptions about a particular type of utilitarianism, and with a very particular transhumanist aesthetic about the future (nanotech, von Neumann probes, Dyson spheres, tiling the universe with matter in fixed configurations, simulated minds, etc.).

I find these things (especially the transhumanist stuff) to not be very convincing relative to the confidence people seem to express about them, but they also don't seem to be essential to the problem of AI risk. Is there a minimal version of the AI risk arguments that is disentangled from these things?

There's this, which doesn't seem to depend on utilitarian or transhumanist arguments:
Yes. I'm one of those transhumanist people, but you can talk about AI risk entirely separately from that. Trying to write up something that compiles the other arguments.
I'd say AI ruin only relies on consequentialism. What consequentialism means is that you have a utility function, and you're trying to maximize the expected value of your utility function. There are theorems to the effect that if you don't behave as though you are maximizing the expected value of some particular utility function, then you are being stupid in some way. Utilitarianism is a particular case of consequentialism where your utility function is equal to the average happiness of everyone in the world. "The greatest good for the greatest number." Utilitarianism is not relevant to AI ruin because without solving alignment first, the AI is not going to care about "goodness". The von Neumann probes aren't important to the AI ruin picture either: Humanity would be doomed, probes or no probes. The probes are just a grim reminder that screwing up AI won't only kill all humans, it will also kill all the aliens unlucky enough to be living too close to us.
DeLesley Hutchins (1 point, 1y):
I ended up writing a short story about this, which involves no nanotech.  :-)   https://www.lesswrong.com/posts/LtdbPZxLuYktYhveL/a-plausible-story-about-ai-risk

It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.

At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?

That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request that the first one be shut down.

Yonatan Cale (9 points, 1y):
I think there's no known way to ask an AI to do "just one thing" without doing a ton of harm meanwhile. See this on creating a strawberry safely. Yudkowsky uses the example "[just] burn all GPUs" in his latest post.
mako yass (6 points, 1y):
Seems useless if the first system pretends convincingly to be aligned (which I think is going to be the norm), so you never end up deploying the second system? And "defeat the first AGI" seems almost as difficult to formalize correctly as alignment, to me:

* One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn't leave open a loophole that the first can escape through in some way?
* So I'm considering "make the world as if neither of you had ever been made", which wouldn't have that problem, but it's impossible to actually attain this goal, so I don't know how you get it to satisfice over it and then turn itself off afterwards; I'm concerned it would become an endless crusade.
One of the first priorities of an AI in a takeoff would be to disable other projects which might generate AGIs. A weakly superintelligent hacker AGI might be able to pull this off before it could destroy the world. Also, fast takeoff could be less than months, by some people's guess. And what do you think happens when the second AGI wins, then maximizes the universe for "the other AI was defeated"? Some serious unintended consequences, even if you could specify it well.

Who are the AI Capabilities researchers trying to build AGI and think they will succeed within the next 30 years?

Adam Jermyn (9 points, 1y):
Among organizations, both OpenAI and DeepMind are aiming at AGI and seem confident they will get there. I don't know their internal timelines and don't know if they've stated them...
DeLesley Hutchins (5 points, 1y):
There are numerous big corporate research labs: OpenAI, DeepMind, Google Research, Facebook AI (Meta), plus lots of academic labs. The rate of progress has been accelerating. From 1960 to 2010, progress was incremental and remained centered around narrow problems (chess) or toy problems. Since 2015, progress has been very rapid, driven mainly by new hardware and big data. Long-standing hard problems in ML/AI, such as Go, image understanding, language translation, and logical reasoning, seem to fall on an almost monthly basis now, and huge amounts of money and intellect are being thrown at the field. The rate of advance from 2015-2022 (only 7 years) has been phenomenal; given another 30, it's hard to imagine that we wouldn't reach an inflection point of some kind.

I think the burden of proof is now on those who don't believe that 30 years is enough time to crack AGI. You would have to postulate some fundamental difficulty, like finding out that the human brain is doing things that can't be done in silicon, that would somehow arrest the current rate of progress and lead to a new "AI winter."

Historically, AI researchers have often been overconfident. But this time does feel different.

[extra dumb question warning!]

Why are all the AGI doom predictions around 10%-30% instead of ~99%?

Is it just the "most doom predictions so far were wrong" prior?

Rob Bensinger (5 points, 1y):
The "Respondents' comments" section of the existential risk survey I ran last year gives some examples of people's reasoning for different risk levels. My own p(doom) is more like 99%, so I don't want to speak on behalf of people who are less worried. Relevant factors, though, include:

* Specific reasons to think things may go well. (I gave some of my own here.)
* Disagreement with various points in AGI Ruin. E.g., I think a lot of EAs believe some combination of:
  * The alignment problem plausibly isn't very hard. (E.g., maybe we can just give the AGI/TAI a bunch of training data indicating that obedient, deferential, low-impact, and otherwise corrigible behavior is good, and then this will generalize fine in practice without our needing to do anything special.)
  * The field of alignment research has grown fast, and has had lots of promising ideas already.
  * AGI/TAI is probably decades away, and progress toward it will probably be gradual. This gives plenty of time for more researchers to notice "we're getting close" and contribute to alignment research, and for the field in general to get a lot more serious about AI risk.
  * Another consequence of 'AI progress is gradual': insofar as AI is very dangerous or hard to align, we can expect that there will be disasters like "AI causes a million deaths" well before there are disasters like "AI kills all humans". The response to disasters like "a million deaths" (both on the part of researchers and on the part of policymakers, etc.) would probably be reasonable and helpful, especially with EAs around to direct the response in good directions. So we can expect the response to get better and better as we get closer to transformative AI.
* General skepticism about our ability to predict the future with any confidence. Even if you aren't updating much on 'most past doom predictions were wrong', you should have less extreme probabilities.

Has there been effort into finding a "least acceptable" value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.

Donald Hobson (4 points, 1y):
This doesn't select for humanlike minds. You don't want vast numbers of Ataribots similar to current RL, playing games like pong and pac-man. (And a trillion other autogenerated games sampled from the same distribution)   Even if you could somehow ensure it was human minds playing these games, the line between a fun game and total boredom is complex and subtle.
That is a very fair criticism. I didn't mean to imply this is something I was very confident in, but was interested in for three reasons:

1) This value function aside, is this a workable strategy, or is there a solid reason for suspecting the solution is all-or-nothing? Is it reasonable to 'look for' our values with human effort, or does this have to be something searched for using algorithms?

2) It sort of gives a flavor to what's important in life. Of course the human value function will be a complicated mix of different sensory inputs, reproduction, and goal seeking, but I felt like there's a kernel in there where curiosity is one of our biggest drivers. There was a post here a while back about someone's child being motivated first and foremost by curiosity.

3) An interesting thought occurs to me: supposing we do create a deferential superintelligence, and its cognitive capacities far outpace those of humans, does that mean the majority of consciousness in the universe is from the AI? If so, is it strange to ask: is it happy? What is it like to be a god with the values of a child? Maybe I should make a separate comment about this.
Donald Hobson (2 points, 1y):
At the moment, we don't know how to make an AI that does something simple like making lots of diamonds.  It seems plausible that making an AI that copies human values is easier than hardcoding even a crude approximation to human values. Or maybe not. 
The obvious option in this class is to try to destroy the world in a way that doesn't send out an AI to eat the lightcone, which might possibly contain aliens who could have a better shot. I am really not a fan of this option.

I am pretty concerned about alignment. Not SO concerned as to switch careers and dive into it entirely, but concerned enough to talk to friends and make occasional donations. With Eliezer's pessimistic attitude, is MIRI still the best organization to funnel resources towards, if for instance, I was to make a monthly donation?

Not that I don't think pessimism is necessarily bad; I just want to maximize the effectiveness of my altruism.

As far as I know, yes. (I've never worked for MIRI.)
[comment deleted] (3 points, 1y)

Assuming slower and more gradual timelines, isn't it likely that we run into some smaller, more manageable AI catastrophes before "everybody falls over dead" due to the first ASI going rogue? Maybe we'll be at a state of sub-human level AGIs for a while, and during that time some of the AIs clearly demonstrate misaligned behavior leading to casualties (and general insights into what is going wrong), in turn leading to a shift in public perception. Of course it might still be unlikely that the whole globe at that point stops improving AIs and/or solves alignment in time, but it would at least push awareness and incentives somewhat into the right direction.

Jay Bailey (1 point, 1y):
This does seem very possible if you assume a slower takeoff.
This is the most likely scenario, with AGI getting heavily regulated, similarly to nuclear. It doesn't get much publicity because it's "boring". 

Is cooperative inverse reinforcement learning promising? Why or why not?

I can't claim to know any more than the links just before section IV here: https://slatestarcodex.com/2020/01/30/book-review-human-compatible/. It's viewed as maybe promising or part of the solution. There's a problem if the program erroneously thinks it knows the humans' preferences, or if it anticipates that it can learn the humans' preferences and produce a better action than the humans would otherwise take. Since "accept a shutdown command" is a last resort option, ideally it wouldn't depend on the program not thinking something erroneously. Yudkowsky proposed the second idea here https://arbital.com/p/updated_deference/, there's a discussion of that and other responses here https://mailchi.mp/59ddebcb3b9a/an-69-stuart-russells-new-book-on-why-we-need-to-replace-the-standard-model-of-ai. I don't know how the CIRL researchers respond to these challenges. 

It seems like instrumental convergence is restricted to agent AIs, is that true?

Also, what is going on with mesa-optimizers? Why is it expected that they will be more likely to become agentic than the base optimizer when they are more resource constrained?

The more agentic a system is, the more likely it is to adopt convergent instrumental goals, yes. Why agents are powerful explores why agentic mesa optimizers might arise accidentally during training. In particular, agents are an efficient way to solve many challenges, so the mesa optimizer being resource constrained would lean in the direction of more agency under some circumstances.

Let's say we decided that we'd mostly given up on fully aligning AGI, and had decided to find a lower bound for the value of the future universe given that someone would create it. Let's also assume this lower bound was something like "Here we have a human in a high-valence state. Just tile the universe with copies of this volume (where the human resides) from this point in time to this other point in time." I understand that this is not a satisfactory solution, but bear with me.

How much easier would the problem become? It seems easier than a pivotal-act AG... (read more)

You may get massive s-risk at comparatively little potential benefit with this. On many people's values, the future you describe may not be particularly good anyway, and there's an increased risk of something going wrong because you'd be trying a desperate effort with something you'd not fully understand. 

Ah, I forgot to add that this is a potential s-risk. Yeah. Although I disagree that that future would be close to zero. My values tell me it would be at least a millionth as good as the optimal future, and at least a million times more valuable than a completely consciousness-less universe.

Background material recommendations (popular-level audience, several hours time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for a popular level audience. Time commitment is allowed to be up to several hours, so for example a popular-level book or sequence of posts would work. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.

Stampy has the canonical version of this: I’d like a good introduction to AI alignment. Where can I find one? Feel free to improve the answer, as it's on a wiki. It will be served via a custom interface once that's ready (prototype here).
Jay Bailey (3 points, 1y):
Human Compatible is the first book on AI Safety I read, and I think it was the right choice. I read The Alignment Problem and Superintelligence after that, and I think that's the right order if you end up reading all three, but Human Compatible is a good start.
Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff. The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.
Alex Lawsen (1 point, 1y):
The Alignment Problem - Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more. Edit: Someone strong-downvoted this, I'd find it pretty useful to know why.  To be clear, by 'why' I mean 'why does this rec seem bad', rather than 'why downvote'. If it's the lightness on x-risk stuff I mentioned, this is useful to know, if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that's easier for any reason.
I read this, and he spent a lot of time convincing me that AI might be racist and very little time convincing me that AI might kill me and everyone I know without any warning. It's the second possibility that seems to be the one people have trouble with.

What does the Fermi paradox tell us about AI future, if anything? I have a hard time simultaneously believing both "we will accidentally tile the universe with paperclips" and "the universe is not yet tiled with paperclips". Is the answer just that this is just saying that the Great Filter is already past?

And what about the anthropic principle? Am I supposed to believe that the universe went like 13 billion years without much in the way of intelligent life, then for a brief few millennia there's human civilization with me in it, and then the next N billion years it's just paperclips?

I see now that this has been discussed here in this thread already, at least the Fermi part. Oops!

I have a very rich, smart developer friend who knows a lot of influential people in SV. First employee of a unicorn, he retired from work after a very successful IPO and now just looks for interesting startups to invest in. He had never heard of LessWrong when I mentioned it and is not familiar with AI research.

If anyone can point me to a way to present AGI safety to him to maybe turn his interest to invest his resources in the field, that might be helpful

As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves as interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).
Yonatan Cale (1 point, 1y):
My answer for myself is that I started practicing: I started talking to some friends about this, hoping to get better at presenting the topic (which is currently something I'm kind of afraid to do). (I also have other important goals, like getting an actual inside-view model of what's going on.)

If you want something more generic, here's one idea: https://www.youtube.com/c/RobertMilesAI/featured
When I talk to my friends, I start with the alignment problem. I found this analogy to human evolution really drives home the point that it's a hard problem; we aren't close to solving it: https://youtu.be/bJLcIBixGj8

At this point, questions come up about whether intelligence necessarily means morality, so I talk about the orthogonality thesis. Then, for why the AI would care about anything other than what it was explicitly told to do: the danger comes from instrumental convergence.

Finally, people tend to say we can never do it, and talk about spirituality and the uniqueness of human intelligence. So I need to talk about evolution hill-climbing its way to animal intelligence, and how narrow AI has small models while we just need AGI to have a generalised world model. Brains are just electrochemical complex systems; it's not magic. Talk about Pathways, Imagen, GPT-3 and what it can do, and about how scaling seems to be working: https://www.gwern.net/Scaling-hypothesis#why-does-pretraining-work

So it makes sense we might have AGI in our lifetime, and we have tons of money and brains working on building AI capability, fewer on safety.

Try practising on other smart friends and develop your skill. You need to ensure people don't get bored, so you can't use too much time. Use nice analogies. Have answers to frequent questions ready.

What is Fathom Radiant's theory of change?

Fathom Radiant is an EA-recommended company whose stated mission is to "make a difference in how safely advanced AI systems are developed and deployed". They propose to do that by developing "a revolutionary optical fabric that is low latency, high bandwidth, and low power. The result is a single machine with a network capacity of a supercomputer, which enables programming flexibility and unprecedented scaling to models that are far larger than anything yet conceived." I can see how this will improve model capabilities, but how is this supposed to advance AI safety?

What if we'd upload a person's brain to a computer and run 10,000 copies of them and/or run them very quickly?

Seems as-aligned-as-an-AGI-can-get (?)

Jay Bailey (5 points, 1y):
The best argument against this I've heard is that technology isn't built in a vacuum - if you build the technology to upload people's brains, then before you have the technology to upload people's brains, you probably have the technology to almost upload people's brains and fill in the gap yourself, creating neuromorphic AI that has all the same alignment problems as anything else. Even so, I'm not convinced this is definitively true - if you can upload an entire brain at 80% of the necessary quality, "filling in" that last 20% does not strike me as an easy problem, and it might be easier to improve fidelity of uploading than to engineer a fix for it.
Charlie Steiner (4 points, 1y):
Well, not as aligned as the best case - humans often screw things up for themselves and each other, and emulated humans might just do that but faster. (Wei Dai might call this "human safety problems.") But probably, it would be good. From a strategic standpoint, I unfortunately don't think this seems to inform strategy too much, because afaict scanning brains is a significantly harder technical problem than building de novo AI.
I think the observation that it just isn't obvious that ems will come before de novo AI is sufficient to worry about the problem in the case that they don't. Possibly while focusing more capabilities development towards creating ems (whatever that would look like)? Also, would ems actually be powerful and capable enough to reliably stop a world-destroying non-em AGI, or an em about to make some world-destroying mistake because of its human-derived flaws? Or would we need to arm them with additional tools that fall under the umbrella of AGI safety anyway?
The only reason we care about AI safety is that we believe the consequences are potentially existential. If they weren't, there would be no need for safety.
[comment deleted] (5 points, 1y)

Can a software developer help with AI Safety even if they have zero knowledge of ML and zero understanding of AI Safety theory?

Yonatan Cale (7 points, 1y):
Yes, both Anthropic and Redwood want to hire such developers
Jason Maskell (2 points, 1y):
Is that true for Redwood? They've got a timed technical screen before application, and their interview involves live coding with Python and ML libraries.
Yonatan Cale (2 points, 1y):
I talked to Buck from Redwood about 1 month ago and that's what he told me, and I think we went over this as "a really important point" more than once so I'd know if I misunderstood him (but still please tell me if I'm wrong). I assume if you tell them that you have zero ML experience, they'll give you an interview without ML libraries, or perhaps something very simple with ML libraries that you could learn on the fly (just like you could learn web scraping or so). This part is just me speculating though. Anyway this is something you could ask them before your first technical interview for sure: "Hey, I have zero ML experience, do you still want to interview me?"

Total noob here, so I'm very thankful for this post. Anyway, why is there such certainty among some that a superintelligence would kill its creators, who are zero threat to it? Any resources on that would be appreciated. As someone who loosely follows this stuff, it seems people assume AGI will be this brutal instinctual killer, which is the opposite of what I've guessed.

DeLesley Hutchins (2 points, 1y):
It's essentially for the same reason that Hollywood thinks aliens will necessarily be hostile. :-)

For the sake of argument, let's treat AGI as a newly arrived intelligent species. It thinks differently from us, and has different values. Historically, whenever there has been a large power differential between a native species and a new arrival, it has ended poorly for the native species. Historical examples are the genocide of Native Americans (same species, but less advanced technology), and the wholesale obliteration of 90% of all non-human life on this planet.

That being said, there is room for a symbiotic relationship. AGI will initially depend on factories and electricity produced by human labor, and thus will necessarily be dependent on humans at first. How long this period will last is unclear, but it could settle into a stable equilibrium. After all, humans are moderately clever, self-reproducing computer repair drones, easily controlled by money, comfortable with hierarchy, and well adapted to Earth's biosphere. They could be useful to keep around.

There is also room for an extensive ecology of many different superhuman narrow AIs, each of which can beat humans within a particular domain, but which generalize poorly outside of that domain. I think this hope is becoming smaller with time, though (see, e.g., Gato), and it is not necessarily a stable equilibrium.

The thing that seems clearly untenable is an equilibrium in which a much less intelligent species manages to subdue and control a much more intelligent species.
Rob Miles's video on Instrumental Convergence is about this, combine with Maximizers and you might have a decent feel for it.
scott loop (1 point, 1y):
Thank you for these videos.
In terms of utility functions, the most basic is: do what you want. "Want" here refers to whatever values the agent values. But in order for the "do what you want" utility function to succeed, there's a lower level that's important: be able to do what you want.

For humans, that usually means getting a job, planning for retirement, buying insurance, planning for the long term, and doing things you don't like for a future payoff. Sometimes humans go to war in order to "be able to do what you want", which should show you how important satisfying a utility function is.

For an AI which most likely has a straightforward utility function, and which has all the capabilities to execute it (assuming you believe that superintelligent AGI could develop nanotech, get root access to the datacenter, etc.), humans are in the way of "being able to do what you want". Humans in this case would probably not like an unaligned AI, and would try to shut it down, or at least not die themselves. Most likely, the AI has a utility function that has no use for humans, and thus they are just resources standing in the way. Therefore the AI goes on holy war against humans to maximize its possible reward, and all the humans die.
scott loop:
Thanks for the response. Definitely going to dive deeper into this.

/Edit 1: I want to preface this by saying I am just a noob who has never posted on Less Wrong before.

/Edit 2: 

I feel I should clarify my main questions (which are controversial): Is there a reason why turning all of reality into maximized conscious happiness is not objectively the best outcome for all of reality, regardless of human survival and human values?
Should this in any way affect our strategy to align the first AGI, and why?

/Original comment:

If we zoom out and look at the biggest picture philosophically possible, then, isn't the only thing tha...

What would it mean for an outcome to be objectively best for all of reality? It might be your subjective opinion that maximized conscious happiness would be the objectively best reality. Another human's subjective opinion might be that a reality that maximized the fulfillment of fundamentalist Christian values was the objectively best reality. A third human might hold that there's no such thing as the objectively best, and all we have are subjective opinions. Given that different people disagree, one could argue that we shouldn't privilege any single person's opinion, but try to take everyone's opinions into account - that is, build an AI that cared about the fulfillment of something like "human values". Of course, that would be just their subjective opinion. But it's the kind of subjective opinion that the people involved in AI alignment discussions tend to have.
Suppose everyone agreed that the proposed outcome is what we wanted. Would this scenario then be difficult to achieve?
The fact that the statement is controversial is, I think, the reason. What makes a world-state or possible future valuable is a matter of human judgment, and not every human believes this.  EY's short story Three Worlds Collide explores what can happen when beings with different conceptions of what is valuable, have to interact. Even when they understand each other's reasoning, it doesn't change what they themselves value. Might be a useful read, and hopefully a fun one.
I'll ask the same follow-up question to similar answers: Suppose everyone agreed that the proposed outcome above is what we wanted. Would this scenario then be difficult to achieve?
I mean, yes, because the proposal is about optimizing our entire future light cone for an outcome we don't know how to formally specify.
Could you have a machine hooked up to a person's nervous system, change the settings slightly to change consciousness, and let the person choose whether the changes are good or bad? Run this many times.
I don't think this works. One, it only measures short-term impacts, but any such change might have lots of medium- and long-term effects, second- and third-order effects, and effects on other people with whom I interact. Two, it measures based on the values of already-changed me, not current me, and it is not obvious that current-me cares what changed-me will think, or why I should care if I don't currently. Three, I have limited understanding of my own wants, needs, and goals, and so would not trust any human's judgement of such changes far enough to extrapolate to situations they didn't experience, let alone to other people, or the far future, or unusual/extreme circumstances.
Charlie Steiner:
For a more involved discussion than Kaj's answer, you might check out the "Mere Goodness" section of Rationality: A-Z.

Please describe or provide links to descriptions of concrete AGI takeover scenarios that are at least semi-plausible, and especially takeover scenarios that result in human extermination and/or eternal suffering (s-risk). Yes, I know that the arguments don't necessarily require that we can describe particular takeover scenarios, but I still find it extremely useful to have concrete scenarios available, both for thinking purposes and for explaining things to others.

Without nanotech or anything like that, maybe the easiest way is to manipulate humans into building lots of powerful and hackable weapons (or just wait since we're doing it anyway). Then one day, strike. Edit: and of course the AI's first action will be to covertly take over the internet, because the biggest danger to the AI is another AI already existing or being about to appear. It's worth taking a small risk of being detected by humans to prevent the bigger risk of being outraced by a competitor.
Evan R. Murphy:
This new series of posts from Holden Karnofsky (CEO of Open Philanthropy) is about exactly this. The first post came out today: https://www.lesswrong.com/posts/oBBzqkZwkxDvsKBGB/ai-could-defeat-all-of-us-combined
I find slower take-off scenarios more plausible. I like the general thrust of Christiano's "What failure looks like". I wonder if anyone has written up a more narrative / concrete account of that sort of scenario.
Aryeh Englander:
Alexey Turchin and David Denkenberger describe several scenarios here: https://philpapers.org/rec/TURCOG-2 (additional recent discussion in this comment thread)
Aryeh Englander:
Eliezer's go-to scenario (from his recent post):
Aryeh Englander:
https://www.gwern.net/fiction/Clippy (very detailed but also very long and very full of technical jargon; on the other hand, I think it's mostly understandable even if you have to gloss over most of the jargon)

I have a few related questions pertaining to AGI timelines. I've been under the general impression that when it comes to timelines on AGI and doom, Eliezer's predictions are based on a belief in extraordinarily fast AI development, and thus a close AGI arrival date, which I currently take to mean a quicker date of doom. I have three questions related to this matter:

  1. For those who currently believe that AGI (using whatever definition to describe AGI as you see fit) will be arriving very soon - which, if I'm not mistaken, is what Eliezer is predicting - appro...
Lone Pine:
There are actually two different parts to the answer, and the difference is important. There is the time between now and the first AI capable of autonomously improving itself (time to AGI), and there's the time it takes for the AI to "foom", meaning improve itself from a roughly human level towards godhood. In EY's view, it doesn't matter at all how long we have between now and AGI, because foom will happen so quickly and will be so decisive that no one will be able to respond and stop it. (Maybe, if we had 200 years we could solve it, but we don't.) In other people's view (including Robin Hanson and Paul Christiano, I think) there will be "slow takeoff." In this view, AI will gradually improve itself over years, probably working with human researchers in that time but progressively gathering more autonomy and skills. Hanson and Christiano agree with EY that doom is likely. In fact, in the slow takeoff view ASI might arrive even sooner than in the fast takeoff view.
I'm not sure about Hanson, but Christiano is a lot more optimistic than EY.
Isn't it conceivable that improving intelligence turns out to become difficult more quickly than the AI is scaling? E.g., couldn't it be that somewhere around human-level intelligence, improving intelligence by every marginal percent becomes twice as difficult as the previous percent? I admit that doesn't sound very likely, but if that were the case, then even a self-improving AI would potentially improve itself very slowly, and maybe even sub-linearly rather than exponentially, wouldn't it?
DeLesley Hutchins:
For a survey of experts, see: https://research.aimultiple.com/artificial-general-intelligence-singularity-timing/ Most experts expect AGI between 2030 and 2060, so predictions before 2030 are definitely in the minority.

My own take is that a lot of current research is focused on scaling, and has found that deep learning scales quite well to very large sizes.  This finding is echoed in evolutionary studies; one of the main differences between the human brain and the chimpanzee's is just size (neuron count), pure and simple.  The main limiting factor thus appears to be the amount of hardware that we can throw at the problem. Current research into large models is very much hardware-limited, with only the major labs (Google, DeepMind, OpenAI, etc.) able to afford the compute costs to train large models.  Iterating on model architecture at large scales is hard because of the costs involved.  Thus, I personally predict that we will achieve AGI only when the cost of compute drops to the point where FLOPs roughly equivalent to the human brain can be purchased on a more modest budget; the drop in price will open up the field to more experimentation.

We do not have AGI yet even on current supercomputers, but it's starting to look like we might be getting close (close = within a factor of 10 or 100).  Assuming continuing progress in Moore's law (not at all guaranteed), another 15-20 years will lead to another 1000x drop in the cost of compute, which is probably enough for numerous smaller labs with smaller budgets to really start experimenting.  The big labs will have a few years' head start, but if they don't figure it out, then they will be well positioned to scale into super-intelligent territory immediately as soon as the small labs help make whatever breakthroughs are required.  The longer it takes to solve the software problem, the more hardware we'll have to scale immediately, which means faster foom.  Getting AGI sooner may thus yield a better outcome. I woul...
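The "15-20 years, another 1000x" figure is just compounding a Moore's-law-style doubling; a two-line sketch (the two-year doubling period is an assumption, not a measurement - the 1000x corresponds to the 20-year end of the range):

```python
# Compounding the cost-of-compute claim above (assumed doubling period).
doubling_period_years = 2.0
for years in (15, 20):
    factor = 2 ** (years / doubling_period_years)
    print(f"{years} years -> ~{factor:,.0f}x cheaper compute")
```

At 15 years this gives only ~180x, so the argument is sensitive to the assumed doubling period, which has historically been closer to 2-3 years for cost per FLOP.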

Any progress or interest in finding limited uses of AI that would be safe? Like the "tool AI" idea but designed to be robust. Maybe this is a distraction, but it seems basically possible. For example, a proof-finding AI that, given a math statement, can only output a proof to a separate proof-checking computer that validates it and prints either True/False/Unknown as the only output to human eyes. Here "Unknown" could indicate that the AI gave a bogus proof, failed to give any proof of either True or False, or the proof checker ran out of time/memory check...
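A minimal sketch of the prover/checker split described in the question, with compositeness standing in for an interesting theorem (everything here is invented for illustration): the untrusted search can be arbitrarily clever, but only the small, auditable checker decides what humans see.

```python
# Toy version of the boxed prover/checker architecture (illustrative only).
def untrusted_prover(n):
    # The "AI": tries to prove n is composite by exhibiting a factor.
    # This part may be arbitrarily complex and is never trusted.
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d
    return None

def trusted_checker(n, certificate):
    # Only this small, human-auditable function decides the output.
    if certificate is None:
        return "Unknown"
    if 1 < certificate < n and n % certificate == 0:
        return "True"
    return "Unknown"  # bogus certificate -> no information leaks out

print(trusted_checker(91, untrusted_prover(91)))  # 91 = 7 * 13 -> "True"
print(trusted_checker(97, untrusted_prover(97)))  # 97 is prime -> "Unknown"
```

The safety argument rests on the checker's output channel being the only one, which is exactly the part that is hard to guarantee for a real system.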

Is it "alignment" if, instead of AGI killing us all, humans change what it is to be human so much that we are almost unrecognizable to our current selves?

I can foresee a lot of scenarios where humans offload more and more of their cognitive capacity to silicon, but they are still "human" - does that count as a solution to the alignment problem?

If we all decide to upload our consciousness to the cloud, and become fast enough and smart enough to stop any dumb AGI before it can get started  is THAT a solution?

Even today, I offload more and more of my "se...

Yonatan Cale:
I personally like the idea of uploading ourselves (and asked about it here). Note that even if we are uploaded - if someone creates an unaligned AGI that is MUCH SMARTER than us, it will still probably kill us. "Keeping up", in the sense of improving/changing/optimizing so quickly that we'd compete with software that is specifically designed (perhaps by itself) to do that, seems like a solution I wouldn't be happy with. As much as I'm OK with posting my profile picture on Facebook, there are some degrees of self-modification that I'm not OK with.
Ding ding ding, we have a winner here. Strong upvote.

Why wouldn't it be sufficient to solve the alignment problem by just figuring out exactly how the human brain works, and copying that? The result would at worst be no less aligned to human values than an average human. (Presuming of course that a psychopath's brain was not the model used.)

The first plane didn't emulate birds. The first AGI probably won't be based on reverse engineering of the brain. The Blue Brain Project is unlikely to finish reproducing the brain before DeepMind finds the right architecture. But I agree that being able to reverse engineer the brain is very valuable for alignment; this is one of the paths described here, in the final post of intro-to-brain-like-agi-safety, section "Reverse-engineer human social instincts".

I am interested in working on AI alignment but doubt I'm clever enough to make any meaningful contribution, so how hard is it to be able to work on AI alignment? I'm currently a high school student, so I could basically plan my whole life so that I end up a researcher or software engineer or something else. Alignment being very difficult, and very intelligent people already working on it, it seems like I would have to almost be some kind of math/computer/ML genius to help at all. I'm definitely above average, my IQ is like 121 (I know the limitations of IQ...

Yonatan Cale:
I don't know; I'm replying here with my priors from software development.

TL;DR: Do something that is:

1. Mostly useful (software/ML/math/whatever are all great and there are others too, feel free to ask);
2. A good fit for you, so you'll enjoy and be curious about your work, and not burn out from frustration or because someone told you "you must take this specific job";
3. Paired with mentorship, so that you'll learn quickly.

And this will almost certainly be useful somehow.

Main things my prior is based on: EA in general and AI Alignment specifically need lots of different "professions". We probably don't want everyone picking the number-one profession and nobody doing anything else; we probably want each person doing whatever they're a good fit for. And the amount we "need" is going up over time, not down. I can imagine it going up much more, but can't really imagine it going down. (In other words, I mostly assume that whatever we need today, which is quite a lot, will also be needed in a few years, so there will be lots of good options to pick.)

Doesn't AGI doom + Copernican principle run into the AGI Fermi paradox? If we are not special, superintelligent AGI would have been created/evolved somewhere already and we would either not exist or at least see the observational artifacts of it through our various telescopes.

Jay Bailey:
I don't know much about it, but you might want to look into the "grabby aliens" model. I'm not sure how they come to this conclusion, but the belief is "If you have aliens that are moving outwards near the speed of light, it will still take millions and millions of years on average for them to reach us, so the odds of them reaching us soon are super small." https://grabbyaliens.com/

A lot of predictions about AI psychology are premised on the AI being some form of deep learning algorithm. From what I can see, deep learning requires geometric computing power for linear gains in intelligence, and thus (practically speaking) cannot scale to sentience.

For a more expert/in depth take look at: https://arxiv.org/pdf/2007.05558.pdf

Why do people think deep learning algorithms can scale to sentience without unreasonable amounts of computational power?

Yonatan Cale:
1. An AGI can be dangerous even if it isn't sentient.
2. If an AI can do most things a human can do (which is achievable using neurons, apparently, because that's what we're made of), and if that AI can run x10,000 as fast (or if it's better in some interesting way, which computers sometimes are compared to humans), then it can be dangerous.

Does this answer your question? Feel free to follow up.
1: This doesn't sound like what I'm hearing people say? Using the word sentience might have been a mistake. Is it reasonable to expect that the first AI to foom will be no more intelligent than, say, a squirrel?

2a: Should we be convinced that neurons are basically doing deep learning? I didn't think we understood neurons to that degree?

2b: What is meant by [most things a human can do]? This sounds to me like an empty statement. Most things a human can do are completely pointless flailing actions. Do we mean most jobs in modern America? Do we expect roombas to foom? Self-driving cars? Or like, most jobs in modern America still sounds like a really low standard, requiring very little intelligence?

My expected answer was somewhere along the lines of "We can achieve better results than that because of something something" or "We can provide much better computers in the near future, so this doesn't matter." What I'm hearing here is "Intelligence is unnecessary for AI to be (existentially) dangerous." This is surprising, and I expect, wrong (in the sense of not being what's being said/what the other side believes; though also in the sense of not being true, but that's neither here nor there).
Yonatan Cale:
1. The relevant thing in [sentient / smart / whatever] is "the ability to achieve complex goals".
2a. Are you asking if an AI can ever be as "smart" [good at achieving goals] as a human?
2b. The dangerous part of the AGI being "smart" is things like "able to manipulate humans" and "able to build an even better AGI".

Does this answer your questions? Feel free to follow up.
2: No. It implies that humans are deep learning algorithms. This assertion is surprising, so I asked for confirmation that that's what's being said, and if so, on what basis.

3: I'm not asking what makes intelligent AI dangerous. I'm asking why people expect deep learning specifically to become (far more) intelligent (than it is). Specifically within that question: adding parameters to your model vastly increases use of memory. If I understand the situation correctly, if GPT just keeps increasing the number of parameters, GPT five or six or so will require more memory than exists on the planet, and assuming someone built it anyway, I still expect it to be unable to wash dishes. Even assuming you have the memory, running the training would take longer than human history on modern hardware. Even assuming deep learning "works" in the mathematical sense, that doesn't make it a viable path to high levels of intelligence in the near future. Given doom in thirty years, or given that researching deep learning is dangerous, it should be the case that this problem: never existed to begin with and I'm misunderstanding something / is easily bypassed by some cute trick / or we're going to need much better hardware in the near future.
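For what it's worth, the memory claim here can be sanity-checked with back-of-envelope numbers (the ~100x-per-generation growth rate is an assumption extrapolated from the GPT-2 to GPT-3 jump, not a prediction; "GPT-4" onward are hypothetical):

```python
# Back-of-envelope check of the memory claim above (assumed growth rate).
params = 175e9        # GPT-3 parameter count
growth = 100          # assumed ~100x params per generation (GPT-2 -> GPT-3)
bytes_per_param = 2   # fp16 weights
for gen in (4, 5, 6):
    params *= growth
    tb = params * bytes_per_param / 1e12
    print(f"hypothetical GPT-{gen}: ~{tb:,.0f} TB for the weights alone")
```

Under these assumptions the weights alone reach hundreds of petabytes by the sixth generation, which is the shape of the objection; the standard counterpoint is that parameter growth per generation has in practice been far below 100x since GPT-3.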
Yonatan Cale:
2. I don't think humans are deep learning algorithms. I think human brains are made of neurons, which seem like a thing I could simulate in a computer, but not just with deep learning.

3. I don't expect just-deep-learning to become an AGI. Perhaps [in my opinion: probably] parts of the AGI will be written using deep learning, though; it does seem pretty good at some things. [I don't actually know, I can think out loud with you.]
In a sense, yeah, the algorithm is similar to a squirrel that feels a compulsion to bury nuts. The difference is that in an instrumental sense it can navigate the world much more effectively to follow its imperatives.  Think about intelligence in terms of the ability to map and navigate complex environments to achieve pre-determined goals. You tell DALL-E2 to generate a picture for you, and it navigates a complex space of abstractions to give you a result that corresponds to what you're asking it to do (because a lot of people worked very hard on aligning it). If you're dealing with a more general-purpose algorithm that has access to the real world, it would be able to chain together outputs from different conceptual areas to produce results - order ingredients for a cake from the supermarket, use a remote-controlled module to prepare it, and sing you a birthday song it came up with all by itself! This behaviour would be a reflection of the input in the distorted light of the algorithm, however well aligned it may or may not be, with no intermediary layers of reflection on why you want a birthday cake or decision being made as to whether baking it is the right thing to do, or what would be appropriate steps to take for getting from A to B and what isn't. You're looking at something that's potentially very good at getting complicated results without being a subject in a philosophical sense and being able to reflect into its own value structure.

A significant fraction of the stuff I've read about AI safety has referred to AGIs "inspecting each others' source code/utility function". However, when I look at the most impressive (to me) results in ML research lately, everything seems to be based on doing a bunch of fairly simple operations on very large matrices.

I am confused, because I don't understand how it would be a sensible operation to view the "source code" in question when it's a few billion floating point numbers and a hundred lines of code that describe what sequence of simple addition/mult...

I take "source code" as loosely meaning "everything that determines the behaviour of the AI, in a form intelligible to the examiner". This might include any literal source code, hardware details, and some sufficiently recent snapshot of runtime state. Literal source code is just an analogy that makes sense to humans reasoning about behaviour of programs where most of the future behaviour is governed by rules fixed in that code. The details provided cannot include future input and so do not completely constrain future behaviour, but the examiner may be able to prove things about future behaviour under broad classes of future input, and may be able to identify future inputs that would be problematic. The broad idea is that in principle, AGI might be legible in that kind of way to each other, while humans are definitely not legible in that way to each other.

The ML sections touched on the subject of distributional shift a few times, which is that thing where the real world is different from the training environment in ways which wind up being important, but weren't clear beforehand. I read that the way to tackle this is called adversarial training, and what it means is you vary the training environment across all of its dimensions in order to make it robust.

Could we abuse distributional shift to reliably break misaligned things, by adding fake dimensions? I imagine something like this:

  • We want the optimizer to mo...
Yonatan Cale:
Seems like two separate things (?)

1. If we forget a dimension, like "AGI, please remember we don't like getting bored", then things go badly, even if we added another fake dimension which wasn't related to boredom.
2. If we train the AI on data from our current world, then [almost?] certainly it will see new things when it runs for real. As a toy (not realistic but I think correct) example: the AI will give everyone a personal airplane, and then it will have to deal with a world that has lots of airplanes.

I previously worked as a machine learning scientist but left the industry a couple of years ago to explore other career opportunities.  I'm wondering at this point whether or not to consider switching back into the field.  In particular, in case I cannot find work related to AI safety, would working on something related to AI capability be a net positive or net negative impact overall?

Yonatan Cale:
Working on AI Capabilities: I think this is net negative, and I'm commenting here so people can [V] if they agree or [X] if they disagree. Seems like habryka agrees?  Seems like Kaj disagrees? I think it wouldn't be controversial to advise you to at least talk to 80,000 hours about this before you do it, as some safety net to not do something you don't mean to by mistake? Assuming you trust them. Or perhaps ask someone you trust. Or make your own gears-level model. Anyway, seems like an important decision to me
Okay, so I contacted 80,000 Hours, as well as some EA friends, for advice. Still waiting for their replies. I did hear from an EA who suggested that if I don't work on it, someone else who is less EA-aligned will take the position instead, so in fact it's slightly net positive for me to be in the industry - although I'm uncertain whether AI capability work is actually funding-constrained rather than personnel-constrained. Also, would it be possible to mitigate the net negative by deliberately avoiding capability research and just taking an ML engineering job at a lower-tier company that is unlikely to develop AGI before others, working on applying existing ML tech to solving practical problems?

Is anyone at MIRI or Anthropic creating diagnostic tools for monitoring neural networks?  Something that could analyze for when a system has bit-flip errors versus errors of logic, and eventually evidence of deception.

Jay Bailey:
Chris Olah is/was the main guy working on interpretability research, and he is a co-founder of Anthropic. So Anthropic would definitely be aware of this idea.
I've not seen the bit-flip idea before, and Anthropic are quasi-alone on that, so they might have missed it.

What is the community's opinion on ideas based on brain-computer interfaces? Like "create big but non-agentic AI, connect human with it, use AI's compute/speed/pattern-matching with human's agency - wow, that's aligned (at least with this particular human) AGI!"

It seems to me (I haven't thought really much about it) that U(God-Emperor Elon Musk) >> U(paperclips); am I wrong?

There's some discussion of this in section 3.4. of Responses to Catastrophic AGI Risk.

So I've commented on this in other forums, but why can't we just bite the bullet on happiness-suffering min-maxing utilitarianism as the utility function?

The case for it is pretty straightforward: if we want a utility function that is continuous over the set of all time, then it must have a value for a single moment in time. At this moment in time, all colloquially deontological concepts like "humans", "legal contracts", etc. have no meaning (these imply an illusory continuity chaining together different moments in time). What IS atomic though, is the valenc...

Yonatan Cale:
How would you explain "qualia" or "positive utility" in Python?

Also, regarding definitions like "delusion of fundamental subject/object split of experience":

1. How do you explain that in Python?
2. See Value is Fragile (TL;DR: if you forget even a tiny thing in your definition of happiness/suffering*, the result could be extremely, extremely bad).

Why should we throw immense resources at AGI x-risk when the world faces enormous issues with narrow AI right now? (e.g. destabilised democracy, a mental health crisis, worsening inequality)

Is it simply a matter of how imminent you think AGI is? Surely the opportunity cost is enormous, given the money and brainpower we are spending on AGI (something many don't even think is possible) versus something that is happening right now.

Jay Bailey:
The standard answer here is that all humans dying is much, much worse than anything happening with narrow AI. Not to say those problems are trivial, but humanity's extinction is an entirely different level of bad, so that's what we should be focusing on. This is even more true if you care about future generations, since human extinction is not just 7 billion dead, but the loss of all generations who could have come after. I personally believe this argument holds even if we ascribe a relatively low probability to AGI in the relatively near future. E.g., if you think there's a 10% chance of AGI in the next 10-20 years, it still seems reasonable to prioritise AGI safety now. If you think AGI isn't possible at all, naturally we don't need to worry about AI safety. But I find that pretty unconvincing - humanity has made a lot of progress very quickly in the field of AI capabilities, it shows no signs of slowing down, and there's no reason why such a machine could not exist in principle.
I understand and appreciate your discussion. I wonder whether it may be more morally imperative to work on AI safety for the hugely impactful problems AI is contributing to right now, if we assume that in finding solutions to these current and near-term AI problems we would also be lowering AGI x-risk (albeit indirectly). Given that the likelihood of narrow AI risk is 1 and the likelihood of AGI in the next 10 years is (as in your example) <0.1, it seems obvious we should focus on addressing the former: not only will it reduce suffering that we know with certainty is already happening (and will certainly continue to happen), it will also indirectly reduce x-risk. If we combine this observation with the opportunity cost of not solving other, even more solvable issues (disease, education, etc.), it seems even less appealing to pour millions of dollars and the careers of the smartest people into specifically AGI x-risk.

A final point: it would seem the worst issues caused by current and near-term AI are that it is degrading the coordination structures of western democracies (disinformation, polarisation, and so on). If, following Moloch, we understand coordination to be the most vital tool in humanity's addressing of problems, we see that focusing on current AI safety issues will improve our ability to address every other area of human suffering. The opportunity costs of not focusing on coordination problems in western countries seem to be equivalent to x-risk-level consequences, while the probability of the first is 1 and that of AGI <1.
Jay Bailey:
If you consider these coordination problems to be equivalent to x-risk-level consequences, then it makes sense to work on aligning narrow AI - for instance, if you think there's a 10% chance of AGI x-risk this century, and current problems are 10% as bad as human extinction, the expected harms are comparable. After all, working on aligning narrow AI is probably more tractable than working on aligning the hypothetical AGI systems of the future. You are also right that aligning narrow AI may help align AGI in the future - it is, at the very least, unlikely to hurt.

Personally, I don't think the current problems are anything close to "10% as bad as human extinction", but you may disagree with me on this. I'm not very knowledgeable about politics, which is the field I would look into to try and identify the harms caused by our current degradation of coordination, so I'm not going to try to convince you of anything in that field - I'm more trying to provide a framework with which to look at potential problems.

So, basically, I would look at it as: which is higher? The chance of human extinction from AGI times the consequences? Or the x-risk reduction from aligning narrow AI, plus the positive utility of solving our current problems today? I believe the former, so I think AGI is more important. If you believe the latter, aligning narrow AI is more important.
Interesting; yes, I am interested in coordination problems. Let me follow this framework to make a better case. There are three considerations I would like to point out.

1. The utility in addressing coordination problems is that they affect almost all x-risk scenarios (nuclear war, bioterror, pandemics, climate change, and AGI). Working on coordination problems reduces not only current suffering but also x-risk of both AGI and non-AGI kinds.
2. The difference between a 10% chance of something that may be an x-risk in 100 years and something with a certainty of happening is not a factor of 10. They're not even comparable, because one is a certainty and the other a probability, and we only get one roll of the dice (allocation of resources). It seems that the rational choice would always be the certainty.
3. More succinctly: with two buttons, one with a 100% chance of addressing x-risk and one with a 10% chance, which one would you press?
Jay Bailey:
I agree with you on the first point completely. As for Point 2, you can absolutely compare a certainty and a probability. If I offered you a certainty of $10, or a 10% chance of $1,000,000, would you take the $10 because you can't compare certainties and probabilities, and I'm only ever going to offer you the deal once? That then brings me to question 3. The button I would press would be the one that reduces total X-risk the most. If both buttons reduced X-risk by 1%, I would press the 100% one. If the 100% button reduced X-risk by 0.1%, and the 10% one reduced X-risk by 10%, I would pick the second one, for an expected value of 1% X-risk reduction. You have to take the effect size into account. We can disagree on what the effect sizes are, but you still need to consider them.
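The button comparison is just an expected-value calculation; spelled out (the percentages are the hypothetical ones from the comment, not estimates):

```python
# The two hypothetical buttons from the comment above, as expected values.
ev_certain = 1.00 * 0.001  # 100% chance of a 0.1% x-risk reduction
ev_gamble  = 0.10 * 0.10   # 10% chance of a 10% x-risk reduction
print(f"{ev_certain:.3f} vs {ev_gamble:.3f}")  # 0.001 vs 0.010
```

The 10% button is worth ten times as much in expectation despite being the "uncertain" one, which is the whole point about effect sizes.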
Interesting, I see what you mean regarding probability, and it makes sense. I guess perhaps what is missing is that when it comes to questions of people's lives, we may have a stronger imperative to be risk-averse. I completely agree with you about effect size. I guess what I would say is that, given my Point 1 from earlier about the variety of x-risks coordination would contribute to solving, the effect size will always be greater. If we want to maximise utility, it's the best chance we have. The added bonuses are that it is comparatively tractable and immediate, avoiding the recent criticisms of longtermism, while simultaneously being a longtermist solution. Regardless, it does seem that coordination problems are underdiscussed in the community; I will try to make a decent main post once my academic commitments clear up a bit.
Jay Bailey · 1y
Being risk-averse around people's lives is only a good strategy when you're trading off against something else that isn't human lives. If you have the choice to save 400 lives with certainty, or a 90% chance to save 500 lives, choosing the former is essentially condemning 50 people (in expectation) to death. At that point, you're just behaving suboptimally. Being risk-averse works if you're trading off other things. E.g., if you could release a new car now that you're almost certain is safe, you might be risk-averse and call for more tests. As long as people won't die from you delaying this car, being risk-averse is a reasonable strategy here. Given your Point 1 from earlier, there is no reason to expect the effect size will always be greater. If the effect on reducing x-risks from co-ordination becomes small enough, or the risk of a particular x-risk becomes large enough, this changes the equation. If you believe, like many in this forum do, that AI represents the lion's share of x-risk, focusing on AI directly is probably more effective. If you believe that x-risk is diversified - that there's some chance from AI, some from pandemics, some from nuclear war, some from climate change, etc. - then co-ordination makes more sense. Co-ordination has a small effect on all x-risks; direct work has a larger effect on a single x-risk. The point I'm trying to make here is this. There are perfectly reasonable states of the world where "Improve co-ordination" is the best action to take to reduce x-risk. There are also perfectly reasonable states of the world where "Work directly on <Risk A>" is the best action to take to reduce x-risk. You won't be able to find out which is which if you believe one is "always" the case. What I would suggest is to ask "What would cause me to change my mind and believe improving co-ordination is NOT the best way to work on x-risk", and then seek out whether those things are true or not. If you don't believe they are, great, that's fine. That said, it wouldn't b…
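The lives example is the same arithmetic in miniature; a quick sketch:

```python
# Saving 400 lives with certainty vs. a 90% chance of saving 500 lives.
certain_lives = 1.0 * 400  # expected lives saved: 400
gamble_lives = 0.9 * 500   # expected lives saved: 450

# Taking the "safe" option forgoes 50 expected lives.
assert gamble_lives - certain_lives == 50
```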
In addition to what Jay Bailey said, the benefits of an aligned AGI are incredibly high, and if we successfully solved the alignment problem we could easily solve pretty much any other problem in the world (assuming you believe the "intelligence and nanotech can solve anything" argument). The danger of AGI is high, but the payoff is also very large.

If the world's governments decided tomorrow that RL was top-secret military technology (similar to nuclear weapons tech, for example), how much time would that buy us, if any? (Feel free to pick a different gateway technology for AGI, RL just seems like the most salient descriptor).

Interesting question. As far as what governments could do to slow down progress towards AGI, I'd also include access to high-end compute. Lots of RL knowledge is passed through papers or equations, and it can be hard to contain that kind of stuff. But shutting down physical compute servers seems easier.
Depends whether they considered it a national security issue to win the arms race, and if they did how able they would be to absorb and keep the research teams working effectively.

I will ask this question: is the Singularity/huge-discontinuity scenario likely to happen? I see this as a meta-assumption behind all the doom scenarios, so we need to know whether the Singularity can happen and will happen.

Drake Thomas · 1y
Paul Christiano provided a picture of non-Singularity doom in What Failure Looks Like. In general there is a pretty wide range of opinions on questions about this sort of thing - the AI-Foom debate between Eliezer Yudkowsky and Robin Hanson is a famous example, though an old one. "Takeoff speed" is a common term used to refer to questions about the rate of change in AI capabilities at the human and superhuman level of general intelligence - searching LessWrong or the Alignment Forum for that phrase will turn up a lot of discussion about these questions, though I don't know of the best introduction offhand (hopefully someone else here has suggestions?).
It's definitely a common belief on this site. I don't think it's likely, I've written up some arguments here. 
Recursive self-improvement, or some other flavor of PASTA, seems essentially inevitable conditional on not hitting hard physical limits and civilization not being severely disrupted. There are Paul/EY debates about how discontinuous the capabilities jump will be, but the core idea of systems automating their own development and this leading to an accelerating feedback loop, or intelligence explosion, is conceptually solid. There are still AI risks without the intelligence explosion, but it is a key part of the fears of the people who think we're very doomed, as it causes the dynamic of getting only one shot at the real deal, since the first system to go 'critical' will end up extremely capable.
(oh, looks like I already wrote this on Stampy! That version might be better, feel free to improve the wiki.)

Hm, someone downvoted michael_mjd's and my comment.

Normally I wouldn't bring this up, but this thread is supposed to be a good space for dumb questions (although tbf the text of the question didn't specify anything about downvotes), and neither michael's nor my question looked that bad or harmful (maybe pattern-matched to a type of dumb uninformed question that is especially annoying).

Maybe an explanation of the downvotes would be helpful here?

I forgot about downvotes. I'm going to add this in to the guidelines.

Here we are: a concrete example of alignment failure.
Aryeh Englander · 1y
We have a points system in our family to incentivize the kids to do their chores. But we have to regularly update the rules because it turns out that there are ways to optimize for the points that we didn't anticipate and that don't really reflect what we actually want the kids to be incentivized to do. Every time this happens I think - ha, alignment failure!

When AI experts call upon others to ponder, as EY just did, "[an AGI] meant to carry out some single task" (emphasis mine), how do they categorize all the other important considerations besides this single task?  

Or, asked another way, where do priorities come into play, relative to the "single" goal?  e.g. a human goes to get milk from the fridge in the other room, and there are plentiful considerations to weigh in parallel to accomplishing this one goal -- some of which should immediately derail the task due to priority (I notice the power is o…

Anonymous question (ask here) :

Given all the computation it would be carrying out, wouldn't an AGI be extremely resource-intensive? Something relatively simple like bitcoin mining (simple when compared to the sort of intellectual/engineering feats that AGIs are supposed to be capable of) famously uses up more energy than some industrialized nations.

Short answer: Yep, probably. Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your laptop; this may even enable monitoring and regulation), and development speed (if iterations are slower and more expensive, it slows down development). If you’re interested in further discussion of possible compute costs for AGI (and how this affects timelines), I recommend reading about bio anchors.
1Yonatan Cale1y
1. (I'm not sure, but why would this be important? Sorry for the silly answer; feel free to reply in the anonymous form again)
2. I think a good baseline for comparison would be:
   1. Training large ML models (expensive)
   2. Running trained ML models (much cheaper)
3. I think comparing to blockchain is wrong, because:
   1. it was explicitly designed to be resource-intensive on purpose (this adds to the security of proof-of-work blockchains);
   2. there is a financial incentive to spend a specific (very high) amount of resources on blockchain mining (because what you get is literally a currency, and this currency has a certain value, so it's worthwhile to spend any amount of money lower than that value on the mining process);
   3. none of these are true for ML/AI, where your incentive is more something like "do useful things".

Why do we suppose it is even logical that control / alignment of a superior entity would be possible?  

(I'm told that "we're not trying to outsmart AGI, bc, yes, by definition that would be impossible", and I understand that we are the ones who "create it" - so I'm told, therefore, we have the upper hand bc of this - somehow in building it, that provides the key benefit we need for corrigibility...)

What am I missing, in viewing a superior entity as something you can't simply "use" ?  Does it depend on the fact that the AGI is not meant to have …

Aleksi Liimatainen · 1y
One has the motivations one has, and one would be inclined to defend them if someone tried to rewire the motivations against one's will. If one happened to have different motivations, then one would be inclined to defend those instead. The idea is that once a superintelligence gets going, its motivations will be out of our reach. Therefore, the only window of influence is before it gets going. If, at the point of no return, it happens to have the right kinds of motivations, we survive. If not, it's game over.
Eugene D · 1y
Thank you. Makes some sense... but does "rewriting its own code" (the very code we thought would perhaps permanently influence it before it got going) nullify our efforts at hardcoding our intentions?
I'm not a psychopath, and if I got the opportunity to rewrite my own source code to become a psychopath, I wouldn't do it. At the same time, it's the evolutionary and cultural programming in my source code that contains the desire not to become a psychopath. In other words, once the desire to not become a psychopath is there in my source code, I will do my best not to become one, even if I have the ability to modify my source code.
Eugene D · 1y
That makes sense. My intention was not to argue from the position of it becoming a psychopath, though (my apologies if it came out that way), but instead from the perspective of an entity which starts out as supposedly aligned (centered on human safety, let's say), but then, because it's orders of magnitude smarter than we are (by definition), quickly develops a different perspective. But you're saying it will remain 'aligned' in some vitally important way, even when it discovers ways the code could've been written differently?
Aleksi Liimatainen · 1y
The AI would be expected to care about preserving its motivations under self-modification for similar reasons as it would care about defending them against outside intervention. There could be a window where the AI operates outside immediate human control but isn't yet good at keeping its goals stable under self-modification. It's been mentioned as a concern in the past; I don't know what the state of current thinking is.

How would AGI alignment research change if the hard problem of consciousness were solved?

Consciousness, intelligence, and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.) However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.

What's the problem with oracle AIs? It seems like if you had a safe oracle AI that gave human-aligned answers to questions, you could then ask "how do I make an aligned AGI?" and just do whatever it says. So it seems like the problem of "how to make an aligned agentic AGI" is no harder than "how to make an aligned oracle AI", which I understand to still be extremely hard, but surely it's easier than making an aligned agentic AGI from scratch?

My understanding is that while an oracle doesn't directly control the nukes, it provides info to the people who do control the nukes, which is pretty much just moving the problem one layer deeper. While it can't directly change the physical state of the world, it can manipulate people to achieve pretty much the same thing. Check this tag for more specifics: https://www.lesswrong.com/tag/oracle-ai

Are there any specific examples of anybody working on AI tools that autonomously look for new domains to optimize over?

  • If no, then doesn't the path to doom still amount to a human choosing to apply their software to some new and unexpectedly lethal domain or giving the software real-world capabilities with unexpected lethal consequences? So then, shouldn't that be a priority for AI safety efforts?
  • If yes, then maybe we should have a conversation about which of these projects is most likely to bootstrap itself, and the likely paths it will take?

One alignment idea I have had that I haven't seen proposed/refuted is to have an AI which tries to compromise by satisfying a range of interpretations of a vague goal, instead of trying to get an AI to fulfill one specific goal.  This sounds dangerous and unaligned, and it indeed would not produce an optimal, CEV-fulfilling scenario, but it seems to me like it may create scenarios in which at least some people are alive and are maybe even living in somewhat utopic conditions.  I explain why below.

In many AI doom scenarios the AI intentionally pic…

In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
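A small sketch of that equivalence, with hypothetical numbers (here `rewards[i, a]` stands for the reward of action `a` under candidate reward function `i`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three candidate reward functions over five actions,
# with a belief weight P(reward function i is the true one).
rewards = rng.normal(size=(3, 5))
weights = np.array([0.5, 0.3, 0.2])

# Expected reward of each action under the distribution...
expected_reward = weights @ rewards

# ...equals the reward under a single averaged ("mean") reward function.
mean_reward_fn = (weights[:, None] * rewards).sum(axis=0)
assert np.allclose(expected_reward, mean_reward_fn)

# So the optimal behavior just maximizes this one fixed function,
# with all the usual problems of optimizing a fixed reward.
best_action = int(np.argmax(expected_reward))
```

This is why a distribution over reward functions, optimized in expectation, collapses back to the single-fixed-reward case.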

Why should we assume that vastly increased intelligence results in vastly increased power?

A common argument I see for intelligence being powerful stems from two types of examples:

  1. Humans are vastly more powerful than animals because humans are more intelligent than animals. Thus, an AGI vastly more intelligent than humans would also have similarly overwhelming power over humans.
  2. X famous person caused Y massive changes in society because of their superior intelligence. Thus, an AGI with even more intelligence would be able to effect even larger changes.

Howev…

It's possible that there is a ceiling to intelligence gains. It's also possible that there isn't. Looking at the available evidence, there doesn't seem to be one - a single ant is a lot less intelligent than a lobster, which is less intelligent than a snake, etc. While it would be nice (in a way) if there was a ceiling, it seems more prudent to assume that there isn't, and prepare for the worst. Especially as by "superintelligent", you shouldn't think of double, or even triple, Einstein; rather you should think of a whole other dimension of intelligence, like the difference between you and a hamster. As to your specific counterarguments:

1. It's both, really. But yes - complex language allows humans to keep and build upon previous knowledge. Humans' advantage is in the gigantic amounts of know-how that can be passed on to future generations. Which is something that computers are eminently good at - you can keep a local copy of Wikipedia in 20GB.
2. Good point. But it's not just luck. Yes, luck plays a large part, but it's also resources (in a very general sense). If you have the basic required talent and a couple of billion dollars, I'm pretty sure you could become a Hollywood star quite quickly. The point is that a superintelligence won't have a similar level of intelligence to anyone else around, which will allow it to run circles around everyone. Like if an Einstein-level intelligence decided to learn to play chess and started playing against 5-year-olds - they might win the first couple of games, but after a while you'd probably notice a trend...
3. Intelligence is an advantage. Quite a big one, generally speaking. But in society most people are generally at the same level if you compare them to e.g. Gila monsters (because we're talking about superintelligence). So it shouldn't be all that surprising that other resources are very important. While many powerful people don't seem to be intel…

20. (...) To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.


So, I'm thinking this is a critique of some proposals to teach an AI ethics by having it be co-trained with humans. 

There seem to be many obvious solutions to the problem …

I work on AI safety via learning from human feedback. In response to your three ideas:

* Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise is biased along one axis), the AI can infer incorrect human values.
* I’m working on it, stay tuned.
* Our most capable AI systems require a LOT of training data, and it’s already expensive to generate enough human feedback for training. Limiting the pool of human teachers to trusted experts, or providing pre-training to all of the teachers, would make this even more expensive. One possible way out of this is to train AI systems themselves to give feedback, in imitation of a small trusted set of human teachers.

Why won't this alignment idea work?

Researchers have already succeeded in creating face detection systems from scratch, by coding the features one by one, by hand. The algorithm they coded was not perfect, but was sufficient to be used industrially in digital cameras of the last decade.

The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives, which explains a good part of the paranormal phenomena. The other hard-coded networks of the brain seem to rely on the same kind of heuristics, hard-coded by evolution, …

Yonatan Cale · 1y
You suggested: […] But as you yourself pointed out: "We are not sure that this would extrapolate well to higher levels of capability."

You suggested: […] As you said, "The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives." And so perhaps the AI would make human pictures that create false positives. Or, as you said, "We are not sure that this would extrapolate well to higher levels of capability."

The classic example is humans creating condoms, which is a very unfriendly thing to do to Evolution, even though it raised us like children, sort of.

Adding: "Intro to Brain-Like-AGI Safety" (I didn't read it yet; seems interesting)
Ok. But don't you think "reverse engineering human instincts" is a necessary part of the solution? My intuition is that value is fragile, so we need to specify it. If we want to specify it correctly, either we learn it or we reverse engineer it, no?
Yonatan Cale · 1y
I don't know; I don't have a coherent idea for a solution. Here's one of my best ideas (not so good). Yudkowsky split up the solutions in his post; see point 24. The first sub-bullet there is about inferring human values. Maybe someone else will have different opinions.
  • Would an AGI that only tries to satisfice a solution/goal be safer?
  • Do we have reason to believe that we can/can't get an AGI to be a satisficer?
Yonatan Cale · 1y
Do you mean something like "only get 100 paperclips, not more"? If so, the AGI will never be sure it has 100 paperclips, so it can take lots of precautions to be very, very sure - like turning all the world into paperclip counters or so.
Tobias H · 1y
[I think this is more anthropomorphizing ramble than concise argument. Feel free to ignore :) ]

I get the impression that in this example the AGI would not actually be satisficing. It is no longer maximizing a goal, but it is still optimizing for this rule. For a satisficing AGI, I'd imagine something vague like "Get many paperclips" resulting in the AGI trying to get paperclips but at some point (an inflection point of diminishing marginal returns? some point where it becomes very uncertain about what the next action should be?) doing something else. Or for rules like "get 100 paperclips, not more" the AGI might only directionally or opportunistically adhere. Within the rule, this might look like "I wanted to get 100 paperclips, but 98 paperclips are still better than 90, let's move on" or "Oops, I accidentally got 101 paperclips. Too bad, let's move on". In your example of the AGI taking lots of precautions, the satisficing AGI would not do this because it could be spending its time doing something else.

I suspect there are major flaws with it, but an intuition I have goes something like this:

* Humans have, in some sense, similar decision-making capabilities to early AGI.
* The world is incredibly complex, and humans are nowhere near understanding and predicting most of it. Early AGI will likely have similar limitations.
* Humans mostly don't optimize their actions, mainly because of limited resources, multiple goals, and a ton of uncertainty about the future.
* So early AGI might also end up not-optimizing its actions most of the time.
* Suppose the complexity of the world continues to be big enough that the AGI keeps failing to completely understand and predict it. In that case, the advanced AGI will continue to not-optimize to some extent.
* But it might look like near-complete optimization to us.
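A purely illustrative toy of the maximizer/satisficer distinction being discussed (the `threshold` parameter is a stand-in for "good enough"; real proposals are far subtler):

```python
def maximize(option_values):
    # A maximizer considers every option and takes the best one.
    return max(option_values)

def satisfice(option_values, threshold):
    # A satisficer takes the first option that clears a "good enough" bar,
    # and only falls back to maximizing if nothing clears it.
    for value in option_values:
        if value >= threshold:
            return value
    return max(option_values)

options = [3, 7, 12, 20]
assert maximize(options) == 20
assert satisfice(options, threshold=5) == 7  # stops at "good enough"
```

The worry raised above maps onto this toy: an agent told "get at least 100 paperclips with certainty" is still running something like `maximize` over its confidence, not `satisfice` over the paperclips.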
Tobias H · 1y
Just saw the inverse question was already asked and answered.

That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.  (One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that rig

…
To be a bit more explicit: I have some ideas of what it would look like to try to develop this meta-field, or at least sub-elements of it, separate from general rationality, and am trying to get a feel for whether they are worth pursuing personally. Or, better yet, handing them over to someone who doesn't feel they have any currently tractable ideas but is better at getting things done.

Why does EY bring up "orthogonality" so early, and so strongly ("in denial", "and why they're true")? Why does it seem so important that it be accepted? Thanks!

Charlie Steiner · 1y
Because it means you can't get AI to do good things "for free," it has to be something you intentionally designed it to do. Denying the orthogonality thesis looks like claims that an AI built with one set of values will tend to change those values in a particular direction as it becomes cleverer. Because of wishful thinking, people usually try to think of reasons why an AI built in an unsafe way (with some broad distribution over possible values) will tend to end up being nice to humans (a narrow target of values) anyway. (Although there's at least one case where someone has argued "the orthogonality thesis is false, therefore even AIs built with good values will end up not valuing humans.")
You can also argue that not all value-capacity pairs are stable or compatible with self-improvement.
Charlie Steiner · 1y
Yeah, I was a bit fast and loose - there are plenty of other ways to deny the orthogonality thesis, I just focused on the one I think is most common in the wild.
[comment deleted] · 1y
Jay Bailey · 1y
A common AGI failure mode is to say things like: "Well, if the AI is so smart, wouldn't it know what we meant to program it to do?" "Wouldn't a superintelligent AI also have morality?" "If you had a paperclip maximiser, once it became smart, why wouldn't it get bored of making paperclips and do something more interesting?" Orthogonality is the answer to why all these things won't happen. You don't hear these arguments a lot any more, because the field has become more sophisticated, so EY's harping on about orthogonality seems a bit outdated. To be fair, a lot of the reason the field has grown up about this is because EY kept harping on about it in the first place.
Eugene D · 1y
OK, again, I'm a beginner here, so please correct me; I'd be grateful. I would offer that any set of goals given to this AGI would include the safety concerns of humans. (Is this controversial?) Not theoretical intelligence for a thesis, but AGI acting in the world with the ability to affect us. Because of the nature of our goals, it doesn't even seem logical to say that the AGI has gained more intelligence without also gaining an equal amount of safety-consciousness - e.g. it's either getting better at safely navigating the highway, or it's still incompetent at driving. Out on a limb: further, because orthogonality seems to force the separation between safety and competency, you have EY writing various intense treatises in the vain hope that FAIR etc. will merely pay attention to safety concerns. This just seems ridiculous, so there must be a reason, and my wild theory is that orthogonality provides the cover needed to charge ahead with a nuke you can't steer - but it sure goes farther and faster every month, doesn't it? (Now I'm guessing, and this can't be true, but then again, why would EY say what he said about FAIR?) But they go on their merry way because they think, "the AI is increasingly competent... no need to concern ourselves with 'orthogonal' issues like safety". Respectfully, Eugene