Having human values is insufficient for alignment

Suppose there's a button where if you push it and name a human, that human becomes 1,000,000,000,000,000x more powerful. (What I mean by that isn't precisely specified—imagine some combination of being able to think much faster, becoming more intelligent, and having far more resources, to the point that they could easily overpower the rest of the world.)

Try running some thought experiments where you push the button to amplify:

  • Jesus
  • Buddha
  • Adolf Hitler
  • Donald Trump
  • Kim Jong-un
  • King Salman of Saudi Arabia
  • Ayn Rand
  • Elon Musk
  • Ray Kurzweil
  • Eliezer Yudkowsky
  • Yourself

My intuition is that some of these people are catastrophic to amplify, and some might be OK to amplify. It's interesting to me that amplifying some of these people might be catastrophic, given that they're fellow human beings, raised in human societies, born with human genomes, who almost certainly care about the future well-being of humanity.

One reason I’d feel queasy amplifying anyone is that they might fall into an epistemic pit, where they arrive at some critically wrong conclusion and take either huge or infinite amounts of time to update away from it. If someone’s reasoning process gets amplified, I wouldn’t generally trust them to be good at arriving at true beliefs—intelligence needn’t go hand-in-hand with rationality or philosophical competence.

In particular, it’s very unclear to me whether people would quickly update away from ideologies. In practice, humanity as a whole has not obviously fallen into any permanent epistemic pits, but I think this is because no single ideology has clearly dominated the world. If you have indefinite decisive power over the world, you have far less incentive to consider perspectives very different from your own, and unless you both care about and are good at seeking true beliefs, you wouldn’t do a good job learning from the people around you.

Another reason I’d feel queasy amplifying anyone is that they might take irreversible catastrophic actions (perhaps unknowingly). Genocides would be one example. Restructuring society such that it gets forever stuck in an epistemic pit would be another. Building a superintelligence without appreciating the risks is yet another (and clearly the most disastrous, and also the least obviously disastrous).

I consider these all failures in something I’ll term metaphilosophical competence. (Please excuse the unwieldy name; I hope to find a better descriptor at some point.) If someone were sufficiently metaphilosophically competent, they should figure out how to arrive at true beliefs relatively quickly and prioritize doing so. They should gain an appreciation of the importance and difficulty of avoiding catastrophic consequences in a world with so much uncertainty, and prioritize figuring out how to do good in a way that sets them apart from everyone who self-deludes into thinking they do good. They should be able to do this all correctly and expeditiously.

I interpret the goal of MIRI's agent foundations research agenda as providing a formal specification of metaphilosophical competence. For example, I interpret the logical induction criterion as part of a formal specification of what it means to have idealized reasoning in the limit. I intend to write more about this relationship at a future point.

All potential self-amplifiers should want to (and may not) be sufficiently metaphilosophically competent before self-amplifying

It's not just humans that should care about metaphilosophical competence. If Clippy (our favorite paperclip-maximizing superintelligence) wanted to build a successor agent far more powerful than itself, it would also want its successor to not take catastrophic irreversible actions or fall into epistemic pits.

Just because Clippy is superintelligent doesn't mean Clippy will necessarily realize the importance of metaphilosophy before building a successor agent. Clippy will probably eventually care about metaphilosophical competence, but it’s possible it would come to care only after causing irreversible damage in the interim (for example it might have built a catastrophically misaligned subagent, a.k.a. a daemon). It's also conceivable it falls into an epistemic pit in which it never comes to care about metaphilosophy.

Acknowledging metaphilosophical competence may be insufficient for safe self-amplification

It might be sufficient for an agent that isn't yet completely metaphilosophically competent, but sufficiently “proto-metaphilosophically competent” to self-amplify. For example, the first thing it might do upon self-amplification is do nothing except determine a formal specification of metaphilosophical competence, then create a successor agent that’s formally guaranteed to be metaphilosophically competent.

I'd feel good if I could be confident that would happen, but I'm not sure "do nothing but become more metaphilosophically competent" actually makes sense. Maybe it would make sense if you're smart enough that you could work through the aforementioned process in just a few seconds, but if for example the process takes much longer and you're in an unsafe or unstable environment, you'd have to trade off figuring out metaphilosophy with fending off imminent threats, which may involve taking irreversible catastrophic actions before you've actually figured out metaphilosophy.

(OK, metaphilosophy seems important to figure out. Wait, we might get nuked. Wait, synthetic viruses are spreading. Ahhhhh! Powerful AIs seem like the only way out of this mess. Ack, my AI isn't powerful enough, I should make it stronger. Okay, now it's... wait... oops...)

AI safety crux: Which humans are metaphilosophically competent enough to safely amplify?

Obviously some humans have not crossed the bar for metaphilosophical competence—if a naive negative utilitarian or angsty teenager gets 1,000,000,000,000,000x'd, they might literally just kill everyone. This invites the question of which people have crossed the metaphilosophical bar for safe 1,000,000,000,000,000x’ing.

I think this is an open question, and I suspect this is a major crux people have about the necessity or usefulness of agent foundations, as well as optimism about how AGI will play out. My guess is that if someone thinks tons of people have passed this bar, they’d think ML-based approaches to safety can lead us to a safe AGI, and are generally more optimistic about the world getting AI safety right. On the flip side, if they think practically nobody is sufficiently metaphilosophically competent to safely amplify, they’d highly prioritize metaphilosophical work (e.g. things in the direction of agent foundations), and feel generally pessimistic about the world getting AI safety right.


To test how much my proposed crux is in fact a crux, I'd like for folks to share their intuitions about how many people are metaphilosophically competent enough to safely 1,000,000,000,000,000x, along with their intuitions about the difficulty of AI alignment.

My current intuition is that there are under 100 people who, if 1,000,000,000,000,000x'd, would end up avoiding irreversible catastrophes with > 50% probability. (I haven't thought too much about this question, and wouldn't be surprised if I update to thinking there are fewer than 10 such people, or even 0 such people.) I also think AI alignment is pretty hard, and necessitates solving difficult metaphilosophical problems.

Once humanity makes enough metaphilosophical progress (which might require first solving agent foundations), I might feel comfortable 1,000,000,000,000,000x'ing the most metaphilosophically competent person alive, though it's possible I'll decide I wouldn't want to 1,000,000,000,000,000x anyone running on current biological hardware. I'd also feel good 1,000,000,000,000,000x'ing someone if we're in the endgame and the default outcome is clearly self-annihilation.

All of these intuitions are weakly held.

My current intuition is that there are under 100 people who, if 1,000,000,000,000,000x'd, would end up avoiding irreversible catastrophes with > 50% probability. (I haven't thought too much about this question, and wouldn't be surprised if I update to thinking there are fewer than 10 such people, or even 0 such people.)

I've asked this before but don't feel like I got a solid answer: (a) do you think that giving the 100th person a lot of power is a lot worse than the status quo (w.r.t. catastrophic risk), and (b) why?

If you think it's a lot worse, the explanations I can imagine are along the lines of: "the ideas that win in the marketplace of ideas are systematically good," or maybe "if people are forced to reflect by thinking some, growing older, being replaced by their children, etc., that's way better than having them reflect in the way that they'd choose to given unlimited power," or something like that.

But those seem inconsistent with your position in at least two ways:

  • If this is the case, then people don't need metaphilosophical competence to be fine, they just need a healthy respect for business as usual and whatever magic it is that causes the status quo to arrive at good answers. Indeed, there seem to be many people (>> 100) who would effectively abdicate their power after being greatly empowered, or who would use it in a narrow way to avoid catastrophes but not to change the basic course of social deliberation.
  • The implicit claim about the magic of the status quo is itself a strong metaphilosophical claim, and I don't see why you would have so much confidence in this position while thinking that we should have no confidence in other metaphilosophical conclusions.

If you think that the status quo is even worse, then I don't quite understand what you mean by a statement like:

Once humanity makes enough metaphilosophical progress (which might require first solving agent foundations), I might feel comfortable 1,000,000,000,000,000x'ing the most metaphilosophically competent person alive, though it's possible I'll decide I wouldn't want to 1,000,000,000,000,000x anyone running on current biological hardware. I'd also feel good 1,000,000,000,000,000x'ing someone if we're in the endgame and the default outcome is clearly self-annihilation.

Other questions: why can we solve agent foundations, but the superintelligent person can't? What are you imagining happening after you empower this person? Why are you able to foresee so many difficulties that they predictably won't see?

Oh, I actually think that giving the 100th best person a bunch of power is probably better than the status quo, assuming there are ~100 people who pass the bar (I also feel pessimistic about the status quo). The only reason why I think the status quo might be better is that more metaphilosophy would develop, and then whoever gets amplified would have more metaphilosophical competence to begin with, which seems safer.

What about the 1000th person?

(Why is us making progress on metaphilosophy an improvement over the empowered person making progress on metaphilosophy?)

I think the world will end up in a catastrophic epistemic pit. For example, if any religious leader got massively amplified, I think it's pretty likely (>50%) the whole world will just stay religious forever.

Us making progress on metaphilosophy isn't an improvement over the empowered person making progress on metaphilosophy, conditioning on the empowered person making enough progress on metaphilosophy. But in general I wouldn't trust someone to make enough progress on metaphilosophy unless they had a strong enough metaphilosophical base to begin with.

(I assume you mean that the 1000th person is much worse than the status quo, because they will end up in a catastrophic epistemic pit. Let me know if that's a misunderstanding.)

Is your view:

  • People can't make metaphilosophical progress, but they can recognize and adopt it. The status quo is OK because there is a large diversity of people generating ideas (the best of which will be adopted).
  • People can't recognize metaphilosophical progress when they see it, but better views will systematically win in memetic competition (or in biological/economic competition because their carriers are more competent).
  • "Metaphilosophy advances one funeral at a time": the way that we get out of epistemic traps is by creating new humans who start out with less baggage.
  • Something completely different?

I still don't understand how any of those views could imply that it is so hard for individuals to make progress if amplified. For each of those three views about why the status quo is good, I think that more than 10% of people would endorse that view and use their amplified power in a way consistent with it (e.g. by creating lots of people who can generate lots of ideas; by allowing competition amongst people who disagree, and accepting the winners' views; by creating a supportive and safe environment for the next generation and then passing off power to that generation...) If you amplify people radically, I would strongly expect them to end up with better versions of these ideas, more often, than humanity at large.

My normal concern would be that people would drift too far too fast, so we'd end up with e.g. whatever beliefs were most memetically fit regardless of their accuracy. But again, I think that amplifying someone leaves us in a way better situation with respect to memetic competition unless they make an unforced error.

Even more directly: I think more than 1% of people would, if amplified, have the world continue on the same deliberative trajectory it's on today. So it seems like the fraction of people you can safely amplify must be more than 1%. (And in general those people will leave us much better off than we are today, since lots of them will take safe, easy wins like "Avoid literally killing ourselves in nuclear war.")

I can totally understand why you'd say "lots of people would mess up if amplified due to being hasty and uncareful." But I still don't see what could possibly make you think "99.99999% of people would mess up most of the time." I'm pretty sure that I'm either misunderstanding your view, or it isn't coherent.

It seems to me the difficulty is likely to be in assessing whether someone would have a good enough start, and being able to do this probably requires enough ability to assess metaphilosophical competence now such that we could pick such a person to make progress later.

(I'm not zhukeepa; I'm just bringing up my own thoughts.)

This isn't quite the same as an improvement, but one thing that is more appealing about normal-world metaphilosophical progress than empowered-person metaphilosophical progress is that the former has a track record of working*, while the latter is untried and might not work.

*Slowly and not without reversals.

I do not expect that any human brain would be safe if scaled up by that amount, because of lack of robustness to relative scale. My intuition is that alignment is very hard, but I don't have an explicit reason right now.

I think the number of safe people depends sensitively on the details of the 1,000,000,000,000,000xing. For example: Were they given a five minute lecture on the dangers of value lock-in? On the universe's control panel, is the "find out what would I think if I reflected more, and what the actual causes are of everyone else's opinions" button more prominently in view than the "turn everything into my favorite thing" button? And so on.

My model is that giving them the five-minute lecture on the dangers of value lock-in won't help much. (We've tried giving five-minute lectures on the dangers of building superintelligences...) And I think most people executing "find out what would I think if I reflected more, and what the actual causes are of everyone else's opinions" would get stuck in an epistemic pit and not realize it.

I think everyone (including me) would go crazy from solitude in this scenario, so that puts the number at 0. If you guarantee psychological stability somehow, I think most adults (~90% perhaps) would be good at achieving their goals (which may be things like "authoritarian regime forever"). This is pretty dependent on the humans becoming more intelligent -- if they just thought faster I wouldn't be nearly as optimistic, though I'd still put the number above 0.

I think most humans achieving what they currently consider their goals would end up being catastrophic for humanity, even if they succeed. (For example I think an eternal authoritarian regime is pretty catastrophic.)

I agree that an eternal authoritarian regime is pretty catastrophic.

I don't think that a human in this scenario would be pursuing what they currently consider their goals -- I think they would think more, learn more, and eventually settle on a different set of goals. (Maybe initially they pursue their current goals but it changes over time.) But it's an open question to me whether the final set of goals they settle upon is actually reasonably aligned towards "humanity's goals" -- it may be or it may not be. So it could be catastrophic to amplify a current human in this way, from the perspective of humanity. But, it would not be catastrophic to the human that you amplified. (I think you disagree with the last statement, maybe I'm wrong about that.)

I'd say that it wouldn't appear catastrophic to the amplified human, but might be catastrophic for that human anyway (e.g. if their values-on-reflection actually look a lot like humanity's values-on-reflection, but they fail to achieve their values-on-reflection).

Yeah, I think that's where we disagree. I think that humans are likely to achieve their values-on-reflection; I just don't know what a human's "values-on-reflection" would actually be (e.g. it could be that they want an authoritarian regime with them in charge).

It's also possible that we have different concepts of values-on-reflection. E.g. maybe you mean that I have found my values-on-reflection only if I've cleared out all epistemic pits somehow and then thought for a long time with the explicit goal of figuring out what I value, whereas I would use a looser criterion. (I'm not sure what exactly.)

Yeah, what you described indeed matches my notion of "values-on-reflection" pretty well. So for example, I think a religious person's values-on-reflection should include valuing logical consistency and coherent logical arguments (because they do implicitly care about those in their everyday lives, even if they explicitly deny it). This means their values-on-reflection should include having true beliefs, and thus be atheistic. But I also wouldn't generally trust religious people to update away from religion if they reflected a bunch.

I think there is a key question in AI Alignment that Wei Dai has also talked about, that is something like "is it even safe to scale up a human?", and I think this post is one of the best on that topic. 

In practice, humanity as a whole has not obviously fallen into any permanent epistemic pits, but I think this is because no single ideology has clearly dominated the world.

If humanity had fallen into an epistemic pit, how would anyone know? Maybe we're in an epistemic pit right now. After all, one of the characteristics of an epistemic pit is that it is a conclusion that takes huge or infinite amounts of time to update away from. How does one distinguish that from a conclusion that is correct?

Well, wouldn't it be great if we had sound metaphilosophical principles that help us distinguish epistemic pits from correct conclusions! :P

I actually think humanity is in a bunch of epistemic pits that we mostly aren't even aware of. For example, if you share my view that Buddhist enlightenment carries significant (albeit hard-to-articulate) epistemic content, then basically all of humanity over basically all of time has been in the epistemic pit of non-enlightenment.

If we figure out the metaphilosophy of how to robustly avoid epistemic pits, and build that into an aligned AGI, then in some sense none of our current epistemic pits are that bad, since that AGI would help us climb out in relatively short order. But if we don't figure it out, we'll plausibly stay in our epistemic pits for unacceptably long periods of time.

We can recognize ideas from the past that look like epistemic pits, but if you were to go back in time and try to argue against those ideas, you would be dismissed as incorrect. If you brought proof that the future society held your ideas instead of your interlocutors', that would be taken as evidence of the increasing degeneracy of man rather than as evidence that your ideas were more correct than theirs. So what value does the concept of an epistemic pit bring?

I can name one epistemic pit that humanity fell into: slavery. At one point treating other human beings as property to be traded was considered normal, proper and right. After a very long time, and more than a few armed conflicts, western, liberal, capitalist societies updated away from this norm. However, if I were to go back to 1600 and try to argue that slavery was immoral, I would be seen as holding an incorrect viewpoint. So, given that, how can one recognize that one is in an epistemic pit in the moment?

Slavery is not, and cannot be, an epistemic pit, because the error there is a moral one, not an epistemic one. Our values differ from those of people who viewed slavery as acceptable. That is very different from an “epistemic pit”.

My guess is that the sorts of interventions that would cause someone to empathise with slaves are primarily epistemic, actually. To try and give a simple example, teaching someone how to accurately use their mirror-neurons when they see the fear in the eyes of a slave, is a skill that I expect would cause them to change their behaviour toward slaves.

There seems to be a disagreement on definitions here. Said thinks a pit epistemic if a platonic reasoner could receive information that takes him out of it. You think a pit epistemic if a human could receive information that takes him out of it.

It does seem true that for a fully rational agent with infinite computing power, moral concerns are indeed completely separate from epistemic concerns. However, for most non-trivial reasoners who are not fully rational or do not have infinite computing power, this is not the case.

I think it's often valuable to talk about various problems in rationality from a perspective of a perfectly rational agent with infinite computing power, but in this case it seems important to distinguish between those, humans and other potential bounded agents (i.e. any AI we design will not have its moral and epistemic concerns completely separated, which is actually a pretty big problem in AI alignment).

Why do you think an AI we design won't have such separation? If physics allowed us to run arbitrary amounts of computation, someone may have built AIXI, which has such separation.

teaching someone how to accurately use their mirror-neurons

What does this mean? (I know what “mirror neurons” are, but I don’t have any idea what you could mean by the quoted phrase.)

I mean that empathy is a teachable skill, and that it can be thought of as an informational update, yet apparently changing your 'moral' behaviour.

empathy is a teachable skill

Citation, please? I’m not sure I’m familiar with this.

it can be thought of as an informational update

Could you expand on this? Do you mean this in any but the most literal, technical sense? I am not sure how to view any gain of empathy (whether learned or otherwise) as an epistemic update.

I think I was a bit unclear, and that somehow you managed to not at all get what I meant. I have a draft post written on this point, I'll publish it some time in the future, and tap out for now.

If "empathy" means "ability to understand the feelings of others" or "ability to predict what others will do", then it seems straightforward that empathy is learnable. And learnability and teachability seem basically the same to me. Examples indicating that empathy is learnable:

  • As you get to know someone, things go more smoothly when you're with them
  • Socializing is easier when you've been doing a lot of it (at least, I think so)
  • Managers are regularly trained for their job

Those definitions of “empathy” are, however, totally inconsistent with Ben’s mention of mirror neurons; so I doubt that this is what he had in mind.

(Your argument is actually problematic for several other reasons, but the aforesaid inconsistency makes your points inapplicable, so it’s not necessary to spend the time to demonstrate the other problems.)

Fair enough. I was conflating factual correctness and moral correctness. I guess a better example would be something like religious beliefs (e.g. the earth is 6000 years old, evolution is a lie, etc).

I think it's hard to distinguish a lack of metaphilosophical sophistication from having different values. The (hypothetical) angsty teen says that they want to kill everyone. If they had the power to, they would. How do we tell whether they are mistaken about their utility function, or just have killing everyone as their utility function? If they clearly state some utility function that is dependent on some real-world parameter, and they are mistaken about that parameter, then we could know. I.e., they want to kill everyone if and only if the moon is made of green cheese. They are confident that the moon is made of green cheese, so don't even bother checking before killing everyone.

Alternately we could look at if they could be persuaded not to kill everyone, but some people could be persuaded of all sorts of things. The fact that you could be persuaded to do X says more about the persuasive ability of the persuader, and the vulnerabilities of your brain than whether you wanted X.

Alternatively we could look at whether they will regret it later. If I self modify into a paperclip maximiser, I won't regret it, because that action maximised paperclips. However a hypothetical self who hadn't been modified would regret it.

Suppose there are some nanobots in my brain that will slowly rewire me into a paperclip maximiser. I decide to remove them. The real me doesn't regret this decision, the hypothetical me who wasn't modified does. Suppose there is part of my brain that will make me power hungry and self centered once I become sufficiently powerful. I remove it. Which case is this? Am I damaging my alignment or preventing it from being damaged?

We don't understand the concept of a philosophical mistake well enough to say if someone is making one. It seems likely that, to the extent that humans have a utility function, some humans have utility functions that want to kill most humans.

The claim that these people "almost certainly care about the future well-being of humanity" is mistaken. I think that a relatively small proportion of humans care about the future well-being of humanity in any way similar to what the words mean to a modern rationalist.

To a rationalist, "future wellbeing of humanity" might mean a superintelligent AI filling the universe with simulated human minds.

To a random modern first-world person, it might mean a fairly utopian "sustainable" future, full of renewable energy, electric cars, etc.

To a North Sentinel Islander, they might have little idea that any humans beyond their tribe exist, and might hope for several years of good weather and rich harvests.

To a 10th century monk, they might hope that judgement day comes soon, and that all the righteous souls go to heaven.

To a barbarian warlord, they might hope that their tribe conquers many other tribes.

The only sensible definition of "care about the future of humanity" that covers all these cases is that their utility function has some term relating to things happening to some humans. Their terminal values reference some humans in some way. As opposed to a paperclip maximiser that sees humans as entirely instrumental.

I think the thought experiment that you propose is interesting, but doesn't isolate different factors that may contribute to people's intuitions. For example, it doesn't distinguish between worries about making individual people powerful because of their values (e.g. they are selfish or sociopathic) vs. worries due to their decision-making processes. I think this is important because it seems likely that "amplifying" someone won't fix value-based issues, but possibly will fix decision-making issues. If I had to propose a candidate crux, it would probably be more along the lines of how much of alignment can be solved through using a learning algorithm to help learn solutions vs. how much of the problem needs to be solved "by hand" and understood on a deep level rather than learned. Along those lines, I found the postscript to Paul Christiano's article on corrigibility interesting.

I think values and decision-making processes can't be disentangled, because I think people's values often stem from their decision-making processes. For example, someone might be selfish because they perceive the whole world to be selfish and uncaring, and in return acts selfish and uncaring by default. This default behavior might cause the world to act selfishly and uncaringly toward him, further reinforcing his perception. If he fully understood this was happening (rather than the world just being fundamentally selfish), he might experiment with acting more generously with the rest of the world, and observe the rest of the world to act more generously in return, and in turn stop being selfish entirely.

In general I expect amplification to improve decision-making processes substantially, but in most cases to not improve them enough. For example, it's not clear that amplifying someone will cause them to observe that their own policy of selfishness is locking them into a fixed point that they could "Löb out of" into a more preferable fixed point. I expect this to be particularly unlikely if e.g. they believe their object-level values to be fixed and immutable, which might result in a fairly pernicious epistemic pit.

My intuition is that most decision-making processes have room for subtle but significant improvements, that most people won't realize these improvements upon amplification, and that failing to make these improvements will result in catastrophic amounts of waste. As another example, it seems quite plausible to me that:

  • the vast majority of human-value-satisfaction (e.g. human flourishing, or general reduction of suffering) comes from acausally trading with distant superintelligences.
  • most people will never care about (or even realize) acausal trade, even upon amplification.

I think decision making can have an impact on values, but that this depends on the design of the agent. In my comment, by values, I had in mind something like “the thing that the agent is maximizing”. We can imagine an agent like the paperclip maximizer for which the “decision making” ability of the agent doesn’t change the agent’s values. Is this agent in an “epistemic pit”? I think the agent is in a “pit” from our perspective, but it’s not clear that the pit is epistemic. One could model the paperclip maximizer as an agent whose epistemology is fine but that simply values different things than we do. In the same way, I think people could be worried about amplifying humans because they are worried that those humans will get stuck in non-epistemic pits rather than epistemic pits. For example, I think the discussion in other comments related to slavery partly has to do with this issue.

The extent to which a human's "values" will improve as a result of improvements in "decision making" seems to me to depend on having a psychological model of humans, which won't necessarily be a good model of an AI. As a result, people may agree/disagree in terms of their intuitions about amplification as applied to humans due to similarities/differences in their psychological models of those humans, while their agreement/disagreement on amplification as applied to AI may result from different factors. In this sense, I'm not sure that different intuitions about amplifying humans are necessarily a crux for differences in terms of amplifying AIs.

In general I expect amplification to improve decision-making processes substantially, but in most cases to not improve them enough.

To me, this seems like a good candidate for something close to a crux between "optimistic" alignment strategies (similar to amplification/distillation) vs. "pessimistic" alignment strategies (similar to agent foundations). I see it like this. The "optimistic" approach is more optimistic that certain aspects of metaphilosophical competence can be learned during the learning process or else addressed by fairly simple adjustment to the design of the learning process, whereas the "pessimistic" approach is pessimistic that solutions to certain issues can be learned and so thinks that we need to do a lot of hard work to get solutions to these problems that we understand on a deep level before we can align an agent. I'm not sure which is correct, but I do think this is a critical difference in the rationales underlying these approaches.

Seconding Habryka. I have thought often about this post.

Humanity has fallen into an ontological divot in the search space. This has epistemic considerations due to ontological commitments (Duhem-Quine indispensability).

This probably sounds a bit vague so I'll use a metaphor. Imagine that the Sapir-Whorf hypothesis were true and that the world had already gone through a cycle of amplification of Newspeak, including screening off the part of the language that would cause someone to independently reinvent the Sapir-Whorf hypothesis.