Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I basically agree with Eliezer’s picture of things in the AGI interventions post.

But I’ve seen some readers rounding off Eliezer’s ‘the situation looks very dire’-ish statements to ‘the situation is hopeless’, and ‘solving alignment still looks to me like our best shot at a good future, but so far we’ve made very little progress, we aren’t anywhere near on track to solve the problem, and it isn’t clear what the best path forward is’-ish statements to ‘let’s give up on alignment’.

It’s hard to give a technical argument for ‘alignment isn’t doomed’, because I don’t know how to do alignment (and, to the best of my knowledge in December 2021, no one else does either). But I can give some of the more abstract reasons I think that.

I feel sort of wary of sharing a ‘reasons to be less pessimistic’ list, because it’s blatantly filtered evidence, it makes it easy to overcorrect, etc. In my experience, people tend to be way too eager to classify problems as either ‘easy’ or ‘impossible’; just adding more evidence may cause people to bounce back and forth between the two rather than planting a flag in the middle ground.

I did write a version of 'reasons not to be maximally pessimistic' for a few friends in 2018. I’m warily fine with sharing that below, with the caveats ‘holy shit is this ever filtered evidence!’ and ‘these are my own not-MIRI-vetted personal thoughts’. And 'this is a casual thing I jotted down for friends in 2018'.

Today, I would add some points (e.g., 'AGI may be surprisingly far off; timelines are hard to predict'), and I'd remove others (e.g., 'Nate and Eliezer feel pretty good about MIRI's current research'). Also, since the list is both qualitative and one-sided, it doesn’t reflect the fact that I’m quantitatively a bit more pessimistic now than I was in 2018.


[...S]ome of the main reasons I'm not extremely pessimistic about artificial general intelligence outcomes.

(Warning: one-sided lists of considerations can obviously be epistemically bad. I mostly mean to correct for the fact that I see a lot of rationalists who strike me as overly pessimistic about AGI outcomes. Also, I don't try to argue for most of these points in any detail; I'm just trying to share my own views for others' stack.)


1. AGI alignment is just a technical problem, and humanity actually has a remarkably good record when it comes to solving technical problems. It's historically common for crazy-seeming goals to fall to engineering ingenuity, even in the face of seemingly insuperable obstacles.

Some of the underlying causes for this are 'it's hard to predict what clever ideas are hiding in the parts of your map that aren't filled in yet', and 'it's hard to prove a universal negation'. A universal negation is what you need in order to say that there's no clever engineering solution; whereas even if you've had ten thousand failed attempts, a single existence proof — a single solution to the problem — renders those failures totally irrelevant.


2. We don't know very much yet about the alignment problem. This isn't a reason for optimism, but it's a reason not to have confident pessimism, because no confident view can be justified by a state of uncertainty. We just have to learn more and do it the hard way and see how things go.

A blank map can feel like 'it's hopeless' for various reasons, even when you don't actually have enough Bayesian evidence to assert a technical problem is hopeless. For example: you think really hard about the problem and can't come up with a solution, which to some extent feels like there just isn't a solution. And: people aren't very good at knowing which parts of their map are blank, so it may feel like there aren't more things to learn even where there are. And: to the extent there are more things to learn, these can represent not only answers to questions you've posed, but answers to questions you never thought to pose; and can represent not only more information relevant to your current angle of attack on the problem, but information that can only be seen as relevant once you've undergone a perspective shift, ditched an implicit assumption, etc. This is to a large extent the normal way intellectual progress has worked historically, but hindsight bias makes this hard to notice and fully appreciate.

Or as Eliezer put it in his critique of Paul Christiano's approach to alignment on LW:

I restate that these objections seem to me to collectively sum up to “This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence”, rather than “Some further insights are required for this to work in practice.” But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

Eliezer is not a modest guy. This is not false humility or politeness. This is a statement about what technical progress looks like when you have to live through it and predict it in the future, as opposed to what it looks like with the benefit of hindsight: it looks like paradigm shifts and things going right in really weird and unexpected ways (that make perfect sense and look perfectly obvious in hindsight). If we want to avoid recapitulating the historical errors of people who thought a thing was impossible (or centuries away, etc.) because they didn't know how to do it yet, then we have to either have a flatter prior about how hard alignment is, or make sure to ground our confidence in very solid inside-view domain knowledge.


3. If you can get a few very specific things right, you can leverage AGI capabilities to bootstrap your way to getting everything else right, including solving various harder forms of the alignment problem. By the very nature of the AGI problem, you don't have to do everything by human ingenuity; you just have to get this one thing firmly right. Neglecting this bootstrapping effect makes it easy to overestimate the expected difficulty of the problem.


4. AGI alignment isn't the kind of problem that requires massive coordination or a global mindset shift or anything like that. It's more like the moon landing or the Manhattan Project, in that it's a concrete goal that a specific project at a certain time or place can pull off all on its own, regardless of how silly the rest of the world is acting at the time.

Coordination can obviously make this task a lot easier. In general, the more coordination you have, the easier the technical challenge becomes; and the more technical progress you make, the lower a level of coordination and resource advantage you need. But at its core, the alignment problem is about building a machine with certain properties, and a team can just do that even if the world-at-large that they're operating in is badly broken.


5. Sufficiently well-informed and rational actors have extremely good incentives here. The source of the 'AI developers are racing to the brink' problem is bias and information asymmetry, not any fundamental conflict of interest.


6. Clear and rigorous thinking is helpful for AGI capabilities, and it's also helpful for understanding the nature and severity of AGI risk. This doesn't mean that there's a strong correlation today between the people who are best at capabilities and the people who are thinking most seriously about safety; but it does mean that there's a force pushing in the direction of a correlation like that growing stronger over time (e.g., as conversations happen and the smartest people acquire more information, think about things more, and thereby get closer to truth).


7. Major governments aren't currently leaders in AI research, and there are reasons to think this is unlikely to change in the future. (This is positive from my perspective because I think state actors can make a lot of aspects of the problem more difficult and complicated.)


8. Deference to domain experts. Nate, Eliezer, Benya, and other researchers at MIRI think it's doable, and these are some of the folks I think are most reliably correct and well-calibrated about tricky questions like these. They're also the kind of people I think really would drop this line of research if the probability of success seemed too low to them, or if some other approach seemed more promising.


9. This one's hard to communicate, but: some kind of gestalt impression gathered from seeing how MIRI people approach the problem in near mode, and how they break the problem down into concrete smaller subproblems.

I don't think this is a strong reason to expect success, but I do think there's some kind of mindset switch that occurs when you are living and breathing nitty-gritty details related to alignment work, deployment strategy, etc., and when you see various relatively-concrete paths to success discussed in a serious and disciplined way.

I think a big part of what I'm gesturing at here is a more near-mode model of AGI itself: thinking of AGI as software whose properties we determine, where we can do literally anything we want with it (if we can figure out how to represent the thing as lines of code). A lot of people go too far with this and conclude the alignment problem is trivial because it's 'just software'; but I think there's a sane version of this perspective that's helpful for estimating the difficulty of the problem.


10. Talking in broad generalities, MIRI tends to think that you need a relatively principled approach to AGI in order to have a shot at alignment. But drilling down on the concrete details, it's still the case that it can be totally fine in real life to use clever hacks rather than deep principled approaches, as long as the clever hacks work. (Which they sometimes do, even in robust code.)

The key thing from the MIRI perspective isn't 'you never use cheats or work-arounds to make the problem easier on yourself', but rather 'it's not cheats and work-arounds all the way down; the high-level cleverness is grounded in a deep understanding of what the system is fundamentally doing'.


11. Relatedly, I have various causes for optimism that are more specific to MIRI's particular research approach; e.g., thinking it's easier to solve various conceptual problems because of inside-view propositions about the problems.


12. The problems MIRI is working on have been severely neglected by researchers in the past, so it's not like they're the kind of problem humanity has tried its hand at and found to be highly difficult. Some of the problems have accrued a mythology of being formidably difficult or even impossible, in spite of no one having really tried them before.

(A surprisingly large number of the problems MIRI has actually already solved are problems that various researchers in the field have told us are impossible for anyone to solve even in principle, which indicates that a lot of misunderstandings of things like reflective reasoning are really commonplace.)


13. People haven't tried very hard to find non-MIRI-ish approaches that might work.


14. Humanity sometimes builds robust and secure software. If the alignment problem is similar to other cases of robustness, then it's a hard problem, but not so hard that large, highly motivated, and rigorous teams (think NASA) can't solve it.


15. Indeed, there are already dedicated communities specializing in methodologically similar areas like computer security, and if they took some ownership of the alignment problem, things could suddenly start to look a lot sunnier.


16. More generally, there are various non-AI communities who make me more optimistic than AI researchers on various dimensions, and to the extent I'm uncertain about the role those communities will play in AGI in the future, I'm more uncertain about AGI outcomes.


17. [redacted]


18. [redacted]


the problems MIRI has actually already solved are problems that various researchers in the field have told us are impossible for anyone to solve even in principle

And they are..?

Logical induction, Löbian cooperation, reflection in HOL, and functional decision theory are all results where researchers have expressed surprise to MIRI that the results were achievable even in principle.

I think a common culprit is people misunderstanding Gödel's theorems as blocking more things than they actually do. There's also field-specific folklore — e.g., a lot of traditional academic decision theorists seem to have somehow acquired the belief that you can't assign probabilities to your own actions, on pain of paradox.

How many of those results are accepted as interesting and insightful outside MIRI?

A couple of my comments could be viewed as "rounding off" to "the situation is hopeless", but I think this is a fair reading of the source text. As noted, it is not very hopeful:

I consider the present gameboard to look incredibly grim, and I don't actually see a way out through hard work alone. [...] Even if the social situation were vastly improved, on my read of things, everybody still dies. [...] There's no obvious winnable position into which to play the board.

In Eliezer's preface he tries to avoid instilling self-fulfilling gloom, and at this point where he is trying to be most positive about our chances of survival he says "Maybe the horse will sing". Horses cannot sing and this is not encouraging.

Now, if Alice tells me "the situation is hopeless" and Bob tells me "maybe the horse will sing", the emphasis is different. Alice is emphasizing the mainline bad outcome, and Bob is emphasizing the non-mainline good outcome. But I don't infer that Alice and Bob have predictably different assessments of the situation.

The other factor is model error. Eliezer writes:

We can hope there's a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like "Trying to die with more dignity on the mainline".

Now, if Alice tells me "the situation appears hopeless, but maybe my model of the situation is wrong", and Bob tells me "the situation is hopeless", then concretely they are saying the same thing. Alice is explicit that she is fallible, and Bob lets his fallibility be implicit. But everyone knows that Alice and Bob are fallible; that is shared context for the conversation. Partly this is about a balance of concision and precision. Partly this is about social status and humility and false humility and so forth. But concretely Alice and Bob are saying the same thing. I don't infer that Bob thinks his models are better than Alice's.

To be clear, as requested, I am discussing the source text, not making inferences about Eliezer's latent state of mind.

1. AGI alignment is just a technical problem


Is it though? Seriously.

As just one example: what if superintelligence takes the form of a community of connected AGIs running on (and intrinsically regulated by) a crypto-legal system, with decision policies implemented hierarchically over sub-agents? (There's even a surprisingly strong argument that the brain is a similar society of simpler minds resolving decisions through the basal ganglia.) Then alignment is also a mechanism design problem, a socio-economic-political problem.

Although I guess that's arguably still 'technical', just technical within an expanded domain.

13. People haven't tried very hard to find non-MIRI-ish approaches that might work.

So you haven't heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.  

(Which isn't to say that MIRI-AF wasn't a good investment on net for the world, even if it was low probability-of-success)

So you haven't heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.

This comment and the entire conversation that spawned from it is weirdly ungrounded in the text — I never even mentioned DL. The thing I was expressing was 'relative to the capacity of the human race, and relative to the importance and (likely) difficulty of the alignment problem, very few research-hours have gone into the alignment problem at all, ever; so even if you're pessimistic about the entire space of MIRI-ish research directions, you shouldn't have a confident view that there are no out-of-left-field research directions that could arise in the future to take big bites out of the alignment problem'.

The rhetorical approach of the comment is also weird to me. 'So you've never heard of CIRL?' surely isn't a hypothesis you'd give more weight to than 'You think CIRL wasn't a large advance', 'You think CIRL is MIRI-ish', 'You disagree with me about the size and importance of the alignment problem such that you think it should be a major civilizational effort', 'You think CIRL is cool but think we aren't yet hitting diminishing returns on CIRL-sized insights and are therefore liable to come up with a lot more of them in the future', etc. So I assume the question is rhetorical; but then it's not clear to me what you believe about CIRL or what point you want to make with it.

(Ditto value learning, IRL, etc.)

I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.  

I think this is straightforwardly true in two different ways:

  • Prior to the deep learning revolution, Eliezer didn't predict that ANNs would be a big deal — he expected other, neither-GOFAI-nor-connectionist approaches to AI to be the ones that hit milestones like 'solve Go'.
  • MIRI thinks the current DL paradigm isn't alignable, so we made a bet on trying to come up with more alignable AI approaches (which we thought probably wouldn't succeed, but considered high-enough-EV to be worth the attempt).

I don't think this has anything to do with the OP, but I'm happy to talk about it in its own right. The most relevant thing would be if we lost a bet like 'we predict deep learning will be too opaque to align', but we still are just as pessimistic about humanity's ability to align deep nets as ever, so if you think we've hugely underestimated the tractability of aligning deep nets, I'd need to hear more about why. What's the path to achieving astronomically good outcomes, on the assumption that the first AGI systems are produced by 2021-style ML methods?

Thanks, strong upvote, this is especially clarifying.

Firstly, I (partially?) agree that the current DL paradigm isn't strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.

The weakly alignable baseline should be "marginally better than humans". Achieving that baseline as an MVP should be an emergency-level, high-priority civilizational project, even if the risk of doom from DL AGI is only 1% (and to be clear, I'm quite uncertain, but it's probably considerably higher). Ideally we should always have an MVP alignment solution in place.

My thoughts on your last question are probably best expressed in a short post rather than a comment thread, but in summary:

DL methods are based on simple universal learning architectures (e.g., transformers, but AGI will probably be built on something even more powerful). The important properties of the resulting agents are thus much more a function of the data / training environment than of the architecture. You can rather easily limit an AGI's power by constraining its environment. For example, we have nothing to fear from AGIs trained solely in Atari. We have much more to fear from agents trained by eating the internet. Boxing is stupid, but sim sandboxing is key.

As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind's case), there's hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans - ie the approximate alignment solution that evolution found. We can then improve and iterate on that in simulations. I'm somewhat optimistic that it's no more complex than other major brain systems we've already mostly reverse engineered.

The danger of course is that testing and iterating could use enormous resources, past the point where you already have a dangerous architecture that could be extracted. Nonetheless, I think this approach is much better than nothing, and amenable to (potentially amplified) iterative refinement.

Firstly, I (partially?) agree that the current DL paradigm isn't strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.

I don't know what "strongly alignable", "robust, high certainty paradigm", or "approximately/weakly alignable" mean here. As I said in another comment:

There are two problems here:

  • Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
  • Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.

Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it's something humanity can solve at its leisure once we've solved #1 and therefore aren't at immediate risk of destroying ourselves.

If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not 'what's possible in principle, given arbitrarily large amounts of time?', but rather 'what can we do in practice to actually end the acute risk period / ensure we don't blow ourselves up in the immediate future?'.

(Where I'm imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)

The weakly alignable baseline should be "marginally better than humans".

I don't understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world's existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)

I'm taking for granted that AGI won't be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I'm thinking in terms of 'what's the least difficult-to-align act humanity could attempt with an AGI?'.

Maybe you mean something different by "marginally better than humans"?

As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind's case), there's hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans - ie the approximate alignment solution that evolution found.

I think this is a purely Problem #2 sort of research direction ('we have subjective centuries to really nail down the full alignment problem'), not a Problem #1 research direction ('we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well').

For what it's worth I'm cautiously optimistic that "reverse-engineering the circuits underlying empathy/love/altruism/etc." is a realistic thing to do in years not decades, and can mostly be done in our current state of knowledge (i.e. before we have AGI-capable learning algorithms to play with—basically I think of AGI capabilities as largely involving learning algorithm development and empathy/whatnot as largely involving supervisory signals such as reward functions). I can share more details if you're interested.

Maybe you mean something different by "marginally better than humans"?


No, I meant "merely as aligned as a human". Which is why I used "approximately/weakly" aligned - as the system which mostly aligns humans to humans is imperfect and not what I would have assumed you meant as a full Problem #2 type solution.


I'm taking for granted that AGI won't be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred.

I think this is a purely Problem #2 sort of research direction ('we have subjective centuries to really nail down the full alignment problem'),

Alright so now I'm guessing the crux is that you believe the DL based reverse engineered human empathy/altruism type solution I was alluding to - let's just call that DLA - may take subjective centuries, which thus suggests that you believe:

  • That DLA is significantly more difficult than DL AGI in general
  • That uploading is likewise significantly more difficult

or perhaps

  • DLA isn't necessarily super hard, but irrelevant because non-DL AGI (for which DLA isn't effective) comes first

Is any of that right?

Sounds right, yeah!

So I can see how that is a reasonable interpretation of what you were expressing. However, given the opening framing where you said you basically agreed with Eliezer's pessimistic viewpoint that seems to dismiss most alignment research, I hope you can understand how I interpreted you saying "People haven't tried very hard to find non-MIRI-ish approaches that might work" as dismissing ML-safety research like IRL, CIRL, etc.

I... think that makes more sense? Though Eliezer was saying the field's progress overall was insufficient, not saying 'decision theory good, ML bad'. He singled out eg Paul Christiano and Chris Olah as two of the field's best researchers.

In any case, thanks for explaining!

I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.

For years they have consistently denied this, saying it's a common misconception. See e.g. here. I am interested to hear your argument.

Interesting - it's not clear to me how that dialogue addresses the common misconception.

My brief zero-effort counter-argument to that dialogue is: it's hard to make rockets or airplanes safe without first mastering aerospace engineering.

So I think it's super obvious that EY/MIRI/LW took the formalist side over connectivist, which I discuss more explicitly in the intro to my most recent 2021 post, which links to my 2015 post which discussed the closely connected ULM vs EM brain theories, which then links to my 2010 post discussing a half-baked connectivist alignment idea with some interesting early debate vs LW formalists (and also my successful prediction of first computer Go champion 5 years in advance).

So I've been here a while, and I even had a number of conversations with MIRI's 2 person ML-but-not-DL alignment group (Jessica & Jack) when that was briefly a thing, and it would be extremely ambitious revisionist history to claim that EY/MIRI didn't implicitly if not explicitly bet against connectivism.

So that's why I asked Rob about point 13 above - as it seems unjustifiably dismissive of the now dominant connectivist-friendly alignment research (and said dismissal substantiates my point).

But I'm not here to get in some protracted argument about this. So why am I here? Because the event horizons of phyg attractors are obvious historical Schelling points for meeting other interesting people. Speaking of which, we should chat - I really liked your Birds, Brains, Planes post in particular; I actually wrote up something quite similar a while ago.

Thanks! What's a phyg attractor? Google turns up nothing.

To say a bit more about my skepticism -- there are various reasons why one might want to focus on agent foundations and similar stuff even if you also think that deep learning is about to boom and be super effective and profitable. For example, you might think the deep-learning-based stuff is super hard to align relative to other paradigms. Or you might think that we won't be able to align it until we are less confused about fundamental issues, and the way to deconfuse ourselves is to think in formal abstractions rather than messing around with big neural nets. Or you might think that both ways are viable but the formal abstraction route is relatively neglected. So the fact that MIRI bet on agent foundations stuff doesn't seem like strong evidence that they were surprised by the deep learning boom, or at least, more surprised than their typical contemporaries.

Skepticism of what?

Like I said in the parent comment - investing in AF can be a good bet, even if it's low probability of success. And I mostly agree with your rationalizations there, but they are post-hoc. I challenge you to find early evidence (ideally 2010 or earlier - for reasons explained in a moment) documenting that MIRI leaders "also think that deep learning is about to boom and be super effective and profitable".

The connectivist-futurists (Moravec/Kurzweil) were already predicting a timeline for AGI in the 2020's through brain reverse engineering. EY/MIRI implicitly/explicitly critiqued that and literally invested time/money/resources in hiring/training up people (a whole community arguably!) in knowledge/beliefs very different from - and mostly useless for understanding - the connectivist/DL path to AGI.

So if it were 2010, and you had just heard some recent neuroscience PhD's first presentation on how they were going to reverse engineer the brain (DeepMind), and you actually gave that even a 50% chance of success, do you truly believe it would be wise to invest the way MIRI did? And to be hostile to connectivist/DL approaches as they still are? Do you not think they at least burned some bridges? Have you seen EY's recent thread, where he attempts a blatant revisionist-history critique of Moravec? (Moravec actually claimed AGI around 2028, not 2010, which seems surprisingly on-track prescient to me now in 2021).

Again, quoting Rob from above:

13. People haven't tried very hard to find non-MIRI-ish approaches that might work.

Which I read as dismissing the DL-friendly alignment research tracks: IRL/CIRL/value learning, etc. And EY explicitly dismisses most alignment research in some other recent thread.

I don't know what to believe yet; I appreciate the evidence you are giving here (in particular your experience as someone who has been around in the community longer than me). My skepticism was about the inference from MIRI did abstract AF research --> MIRI thought deep learning would be much less effective than it in fact was.

I do remember reading some old posts from EY about connectionism that suggest that he at least failed to predict the deep learning boom in advance. That's different from confidently predicting it wouldn't happen though.

I too think that Moravec et al deserve praise for successfully predicting the deep learning boom and having accurate AI timelines 40 years in advance.

Old LessWrong meme - "phyg" is rot13 for "cult". For a while people were making "are we a cult" posts so much that it was actually messing with LessWrong's SEO. Hence phyg.
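
For anyone who wants to check the rot13 claim, here is a minimal Python sketch. The helper name `rot13` is just an illustrative choice; it wraps the standard library's `rot_13` text codec, which shifts each ASCII letter 13 places and leaves everything else alone.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; other characters pass through."""
    return codecs.encode(text, "rot_13")


# rot13 is its own inverse, so the same call encodes and decodes.
print(rot13("cult"))  # prints "phyg"
print(rot13("phyg"))  # prints "cult"
```

Because the alphabet has 26 letters, applying the shift twice returns the original word, which is why one function covers both directions.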

Thanks! What's a phyg attractor? Google turns up nothing.


Ask Google what LW is - i.e., just start typing lesswrong or "lesswrong is a" and see the auto-complete. Using the word 'phyg' is an LW community norm, an attempt to re-train Google.

I don't think alignment is "just a technical problem" in any domain, because:

  1. I don't think there's a good enough definition of "alignment" for it to be addressed in any technical way.

    Saying that "being aligned" means "behaving according to human values" just throws it back to the question of how exactly you define what "human values" are. Are they what humans say they want? What humans actually do? What humans would say they wanted if they knew the results (and with what degree of certainty required)? What would make humans actually happiest (and don't forget to define "happy")? The extrapolated volition of humans under iterated enhancement (which, in addition to being uncomputable, is probably both dependent on initial conditions and path-dependent, with no particular justification for preferring one path over another at any given step)?

  2. Insofar as there are at least some vague ideas of what "alignment" or "human values" might mean, treating alignment as a technical problem would require those values to have a lot more coherence than they actually seem to have.

    If you ask a human to justify its behavior at time X, the human will state a value V_x. If you ask the same human to justify some other behavior at time Y, you'll get a value V_y. V_x and V_y will often be mutually contradictory. You don't have a technical problem until you have a philosophically satisfying way of resolving that contradiction, which is not just a technical issue. Yet at the same time there's feedback pressure on that philosophical decision, because some resolutions might be a lot more technically implementable than others.

  3. Even if individual humans had coherent values, there's every reason to think that the individuals in groups don't share those values, and absolutely no reason at all to think that groups will converge under any reasonable kind of extrapolation. So now you have a second philosophical-problem-with-technical-feedback, namely reconciling multiple people's contradictory values. That's a problem with thousands of years of history, by the way, and nobody's managed to reduce it to the technical yet, even though the idea has occurred to people.

Then you get to the technical issues, which are likely to be very hard and may not be solvable within physical limits. But it's not remotely a technical problem yet.

It is possible to define the alignment problem without using such fuzzy concepts as "happiness" or "value".

For example, there are two agents: R and H. The agent R can do some actions. 

The agent H prefers some of R's actions over other actions. For example, H prefers the action make_pie to the action kill_all_humans.

Some of the preferences are unknown even to H itself (e.g. if it prefers pierogi to borscht).

Among other things, the set of R's actions includes:

  • ask_h_which_of_the_actions_is_preferable
  • infer_preferences_from_the_behavior_of_h
  • explain_consequences_of_the_action_to_h
  • switch_itself_off

In any given situation, the perfect agent R always chooses the most preferable action (according to H). The goal is to create an agent that is as close to the perfect R as possible.  
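As a rough illustration only, the "perfect R" above can be sketched in a few lines of Python. Everything here is an assumption made for the sake of the example: the sketch pretends H's preferences are available as a numeric score per action (the description above only says H prefers some actions to others), and the names `perfect_r`, `regret`, and `toy_scores` are hypothetical.

```python
from typing import Callable, Sequence

Action = str

def perfect_r(actions: Sequence[Action],
              h_preference: Callable[[Action], float]) -> Action:
    """Return the available action that H prefers most."""
    return max(actions, key=h_preference)

def regret(chosen: Action,
           actions: Sequence[Action],
           h_preference: Callable[[Action], float]) -> float:
    """Gap between the perfect R's choice and a real agent's choice (0 means perfect)."""
    best = perfect_r(actions, h_preference)
    return h_preference(best) - h_preference(chosen)

# Toy scores invented for illustration; a real H would not hand these over.
toy_scores = {
    "make_pie": 1.0,
    "switch_itself_off": 0.0,
    "kill_all_humans": -1000.0,
}
actions = list(toy_scores)
h_pref = toy_scores.__getitem__

print(perfect_r(actions, h_pref))
print(regret("switch_itself_off", actions, h_pref))
```

In this framing, "as close to the perfect R as possible" would mean keeping `regret` small across situations. The hard part, which the sketch simply assumes away, is that `h_preference` is not directly observable.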

Of course, this formalism is incomplete. But I think it demonstrates that the alignment problem can be framed as a technical problem without delving into metaphysics.

If you replace "value" with "preference" in what I wrote, I believe that it all still applies.

If you both "ask H about the preferable action" and "infer H's preferences from the behavior of H", then what do you do when the two yield different answers? That's not a technical question; you could technically choose either one or even try to "average" them somehow. And it will happen.

The same applies if you have to deal with two humans, H1 and H2; they are sometimes going to disagree. How do you choose then?
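A toy sketch of that disagreement, with made-up numbers: the two dictionaries stand in for "what H says when asked" and "what you infer from H's behavior", and the action names and the `averaged` helper are hypothetical. The point it tries to show is only that the chosen action flips depending on a weight that nothing in the technical setup determines.

```python
# Hypothetical scores for the same H, obtained two different ways.
stated = {"make_pierogi": 0.9, "make_borscht": 0.4}    # from asking H
inferred = {"make_pierogi": 0.3, "make_borscht": 0.7}  # from watching H

def best(scores: dict[str, float]) -> str:
    """Action with the highest score."""
    return max(scores, key=scores.get)

def averaged(a: dict[str, float], b: dict[str, float],
             weight_a: float = 0.5) -> dict[str, float]:
    """Weighted blend of two score tables over the same actions."""
    return {k: weight_a * a[k] + (1.0 - weight_a) * b[k] for k in a}

print(best(stated))                             # what asking suggests
print(best(inferred))                           # what behavior suggests
print(best(averaged(stated, inferred, 0.5)))    # equal weighting
print(best(averaged(stated, inferred, 0.25)))   # trusting behavior more
```

The same `averaged` function could just as easily blend two humans, H1 and H2. Either way, picking `weight_a` is the non-technical decision.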

There are also technical problems with both of those, and they're the kind of technical problems I was talking about that feed back on the philosophical choices. You might start with one philosophical position, then want to change when you saw the technical results.

For the first:

  1. It assumes that H's "real" preferences comport with what H says. That isn't a given, because "preference" is just as hard to define as "value". Choosing to ask H really amounts to defining preference to mean "stated preference".

  2. It also assumes that H will always be able to state a preference, will be able to do so in a way that you can correctly understand, and will not be unduly ambivalent about it.

  3. You'd probably also prefer that H (or somebody else...) not regret that preference if it gets enacted. You'd probably like to have some ability to predict that H is going to get unintended consequences, and at least give H more information before going ahead. That's an extra feature not implied by a technical specification based on just doing whatever H says.

  4. Related to (3), it assumes that H can usefully state preferences about courses of action more complicated than H could plan, when the consequences themselves may be more complicated than H can understand. And you yourself may have very complicated forms of uncertainty about those consequences, which makes it all the harder to explain the whole thing to H.

All of that is pretty unlikely.

The second is worse:

  1. It assumes that H's actions always reflect H's preferences, which amounts to adopting a different definition of "preference", probably even further from the common meaning.

  2. H's preferences aren't required to be any simpler or more regular than a list of every possible individual situation, with a preferred course of action for each one independent of all others. For that matter, the list is allowed to change, or be dependent on when some particular circumstances occur, or include "never do the same thing twice in the same circumstances". Even if H's behavior is assumed to reflect H's preferences, there's still nothing that says H has to have an inferrable set of preferences.

    To make inferences about H's preferences, you first have to make a leap of faith and assume that they're simple enough, compact enough, and consistent enough to be "closely enough" approximated by any set of rules you can infer. That is a non-technical leap of faith. And there's a very good chance that it would be the wrong leap to make.

  3. It assumes that the rules you can infer from H's behavior are reasonably prescriptive about the choices you might have to make. Your action space may be far beyond anything H could do, and the choices you have to make may be far beyond anything H could understand.

    So you end up taking a bunch of at best approximate inferences about H's existing preferences, and trying to use them to figure out "What would H do if H were not a human, but in fact some kind of superhuman AGI totally unlike a human, but were somehow still H?". That's probably not a reasonable question to ask.

Oh, one more thing I should probably add: it gets even more interesting when you ask whether the AGI might act to change the human's value (or preferences; there's really no difference and both are equally "fuzzy" concepts). Any action that affects the human at all is likely to have some effect on them, and some actions could be targeted to have very large effects.

I agree, you've listed some very valid concerns about my half-baked formalism.

As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics. 

The formalism doesn't have to be perfect. If our theoretical R makes its decisions according to the best possible approximate inferences about H's existing preferences, then the R is much better than rogue AGI. Even if sometimes it will make deadly mistakes. Any improvement over rogue AGI is a good improvement. 

Compare: the Tesla AI sometimes causes deadly crashes. Yet the Tesla AI is much better than the status quo, as its net effect is thousands of saved lives.

And after we have a decent formalism, we can build a better formalism from it, and then repeat and repeat. 

As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.

Nobody's even gotten close to metaphysics. Ethics or even epistemology, OK. Metaphysics, no. The reason I'm getting pedantic about the technical meaning of the word is that "metaphysics", when used non-technically, is often a tag word used for "all that complicated, badly-understood stuff that might interfere with bulling ahead".

My narrow point is that alignment isn't a technical problem until you already have an adequate final formalism. Creating the formalism itself isn't an entirely technical process.

If you're talking about inferring, learning, being instructed about, or actually carrying out human preferences, values, or paths to a "good outcome", then as far as I know nobody has an approximately adequate formalism, and nobody has a formalism with any clear path to be extended to adequacy, or even any clear hope of it. I've seen proposals, but none of them have stood up to 15 minutes of thought. I don't follow it all the time; maybe I've missed something.

In fact, even asking for an "adequate" formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use. There's no clear statement of what that would mean.

My broader concern is that I'm unsure an adequate list of meta-criteria can be established, and that I'm even less sure that the base formalism can exist at all. Demanding a formal system that can't be achieved can lead to all kinds of bad outcomes, many of them related to erroneously pretending that a formalism you have usefully approximates the formalism you need.

It would be very easy to decide that, for the sake of "avoiding metaphysics", it was important to adopt, agree upon, and stay within a certain framework-- one that did not meet meta-criteria like "allows you to express constraints that assure that everybody doesn't end up worse than dead", let alone "allows you to express what it means to achieve the maximum benefit from AGI", or "must provide prescriptions implementable in actual software".

Oh, people would keep tweaking any given framework to cover more edge cases, and squeeze more and more looseness out of some definitions, and play around with more and more elegant statements of the whole thing... but that could just be a nice distraction from the fundamental lack of any "no fate worse than death" guarantee anywhere in it.

A useful formalism does have to be perfect in achieving no fates worse than death, or no widespread fates worse than death. It has to define fates worse than death in a meaningful way that doesn't sacrifice the motivation for having the constraint in the first place. It has to achieve that over all possible fates worse than death, including ones nobody has thought of yet. It has to let you at least approximately exclude at least the widespread occurrence of anything that almost anybody would think was a fate worse than death. Ideally while also enabling you to actually get positive benefits from your AGI.

And formal frameworks are often brittle; a formalism that doesn't guarantee perfection does not necessarily even avert catastrophe. If you make a small mistake in defining "fate worse than death", that may lead to a very large prevalence of the case you missed.

It's not even true that "the best possible inferences" are necessarily better than nothing, let alone adequate in any absolute sense. In fact, a truly rogue AGI that doesn't care about you at all seems more likely to just kill you quickly, whereas who knows what a buggy AGI that was interested in your fate might choose to do...

The very adoption of the word "alignment" seems to be a symptom of a desire to at least appear to move toward formalizing, without the change actually tending to improve the chances of a good outcome. I think people were trying to tighten up from "good outcome" when they adopted "alignment", but actually I don't think it helped. The connotations of the word "alignment" tend to concentrate attention on approaches that rely on humans to know what they want, or at least to have coherent desires, which isn't necessarily a good idea at all. On the other hand, the switch doesn't actually seem to make it any easier to design formal structures or technical approaches that will actually lead to good software behavior. It's still vague in all the ways that matter, and it doesn't seem to be improving at all.

We could use the Tesla AI as a model. 

To create a perfect AI for self-driving, one first must resolve all that complicated, badly-understood stuff that might interfere with bulling ahead. For example, if the car should prefer the driver's life over the pedestrian's life. 

But while we contemplate such questions, we lose tens of thousands of lives in car crashes per year.

The people of Tesla made the rational decision of bulling ahead instead. As their AI is not perfect, sometimes it makes decisions with deadly consequences. But in total, it saves lives.

Their AI has an imperfect but good enough formalism. AFAIK, it's something that could be described in English as "drive to the destination without breaking the driving regulations, while minimizing the number of crashes", or something like this. 

As their AI is net saving lives, it means their formalism is indeed good enough. They have successfully reduced a complex ethical/societal problem to a purely technical problem.

Rogue AGI is very likely to kill all humans. Any better-than-rogue-AGI is an improvement, even if it doesn't fully understand the complicated and ever changing human preferences, and even if some people will suffer as a result.

Even my half-baked sketch of a formalism, if implemented, will produce an AI that is better than rogue AGI, in spite of the many problems you listed. Thus, working on it is better than waiting for certain death.

In fact, even asking for an "adequate" formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use

A formalism that saves more lives is better than one that saves fewer lives. That's good enough for a start.

If you're trying to solve a hard problem, start with something simple and then iteratively improve over it. This includes meta-criteria. 

fate worse than death

I strongly believe that there is no such thing. I explained it in detail here.

I agree with your sketch of the alignment problem.

But once you move past the sketch stage the solutions depend heavily on the structure of A, which is why I questioned Rob's dismissal of the now-dominant non-MIRI safety approaches (which are naturally more connectionist/DL friendly).
