Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm working on writing a paper about an idea I previously outlined for addressing false positives in AI alignment research. This is the second completed draft of one of the subsections arguing for the adoption of a particular, necessary hinge proposition to reason about aligned AGI (first subsection here). I'd appreciate feedback on this subsection, especially regarding whether you agree with the line of reasoning and whether you think I've ignored anything important that should be addressed here. Thanks!

Since AGI alignment is necessarily alignment of AGI, alignment schemes can depend on the dispositions of AGI, and one disposition AGI has is toward subjective experience and mental phenomena (Adeofe, 1997), (Nagel, 1974). Whether or not we expect AGI to realize this disposition matters because it influences the types of alignment schemes that can be considered: an AGI without a mental aspect can only be influenced by modifying its algorithms and manipulating its behavior, whereas an AGI with a mind can be influenced by engaging with its perceptions and understanding of the world (Dreyfus, 1978). In other words, we might say mindless AGI can be aligned only by algorithmic and behavioral methods, whereas mindful AGI can also be aligned by philosophical methods that work on its epistemology, ontology, and axiology (Brentano, 1995). It's unclear what we should expect about the mentality of future AGI, though, because we are presently uncertain about mental phenomena in general (cf. the work of Chalmers and Searle for modern, popular, and opposing views on the topic), so we are forced to speculate about mental phenomena in AGI when we reason about alignment (Chalmers, 1996), (Searle, 1984).

Note, though, that this uncertainty may not be fundamental (Dennett, 1991). For example, if materialist or functionalist attempts to explain mental phenomena prove adequate, perhaps because they lead to the development of conscious AGI, then we may agree on what mental phenomena are and how they work (Oizumi, Albantakis, and Tononi, 2014). If they don't, though, we'll likely be left with metaphysical uncertainty around mental phenomena that's rooted in the epistemic limitations of perception (Husserl, 2014). Regardless of how uncertainty about mental phenomena might later be resolved, it currently creates a need for pragmatically making assumptions about them in our reasoning about alignment. In particular, we want to know whether or not we should design alignment schemes that assume a mind, even if we expect mental phenomena to be reducible to other phenomena. Given that we remain uncertain and cannot dismiss the possibility of mindful AGI, what we decide depends on how likely alignment schemes are to succeed and avoid false positives conditional on AGI having the capacity for mental phenomena. The choice is then between designing alignment schemes that work without reference to mind and designing schemes that engage with it.

If we suppose AGI do not have minds, whether because we believe they have none or because we believe their minds are inaccessible to us or not causally relevant to alignment, then alignment schemes can only address the algorithms and behavior of AGI. This would be to address alignment in a world where all AGIs are treated as p-zombies, i.e. beings without mental phenomena (Kirk, 1974). Now suppose this assumption is false and AGI do have minds. Then our alignment schemes that work only on algorithms and behavior would be expected to continue to work, since they function without regard to the mental phenomena of AGI, making the minds of AGI irrelevant to alignment. This suggests there is little risk of false positives from supposing AGI do not have minds.

If we suppose AGI do have minds, then alignment schemes can also use philosophical methods to address the values, goals, models, and behaviors of AGI. Such schemes would likely take the form of ensuring that updates to an AGI's ontology and axiology converge on and maintain alignment with human interests (de Blanc, 2011), (Armstrong, 2015). Now suppose this assumption is false and AGI do not have minds. Then our alignment schemes that employ philosophical methods will likely fail because they are attempting to address mechanisms of action not present in AGI. This suggests there is a risk of false positives from supposing AGI have minds that is proportionate to the likelihood that we do not build mindful AGI.

From this analysis it seems we should suppose mindless AGI when designing alignment schemes so as to reduce the risk of false positives, but note that it does not consider the likelihood of success at aligning AGI using only algorithmic and behavioral methods. That is, all else may not be equal between these two assumptions: the one with the lower risk of false positives might not be the better choice if we have additional information that leads us to believe that alignment of mindful AGI is much more likely to succeed than the alignment of mindless AGI. It appears that we have such information in the form of Goodhart's curse and the failure of good old-fashioned AI (GOFAI).

Goodhart's curse says that when optimizing for the measure of a value the optimization process will implicitly maximize divergence of the measure from the value (Yudkowsky, 2017). This is an observation that follows from the combination of Goodhart's law and the optimizer's curse (Goodhart, 1984), (Smith and Winkler, 2006). This tendency of measure and value to diverge under optimization results in a phenomenon known as "Goodharting" and it takes myriad forms that affect alignment (Manheim and Garrabrant, 2018). In particular, Goodharting poses a problem for behavioral alignment schemes because to optimize behavior it is necessary to measure behavior and optimize on that measure. Consequently it appears behavioral methods are unlikely to be capable of producing aligned AGI on their own, and this is further supported by both the historical failure to align humans with arbitrary values using behavioral optimization methods and the widespread presence of Goodharting in behaviorally controlled, evolving computer systems (Scott, 1999), (Lehman et al., 2018).
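To make the optimizer's curse component of this concrete, here is a minimal sketch in Python. It is not taken from any of the cited works, and it illustrates only the selection effect, not the full claim about Goodhart's curse. It assumes, purely for illustration, that each candidate has a true value drawn from a standard normal distribution and that the measure is the value plus independent normal noise; the function name `selection_gap` and all parameters are hypothetical.

```python
import random


def selection_gap(n_candidates, n_trials=2000, noise_sd=1.0, seed=0):
    """Average (measure - true value) for the candidate with the best measure."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        best_measure = float("-inf")
        best_value = 0.0
        for _ in range(n_candidates):
            value = rng.gauss(0.0, 1.0)                  # assumed true value
            measure = value + rng.gauss(0.0, noise_sd)   # assumed noisy proxy
            if measure > best_measure:
                best_measure, best_value = measure, value
        # How much the winning measure overstates the winner's true value.
        total_gap += best_measure - best_value
    return total_gap / n_trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(selection_gap(n), 2))
```

With a single candidate there is no selection and the average gap should be close to zero. As the number of candidates searched grows, the gap between the winning measure and the value behind it should grow too, which is the sense in which harder optimization on a measure pulls it away from the value.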

Further, past research on GOFAI—AI systems based on symbol manipulation—suggests algorithmic methods of alignment are likely to be too complex to work for the same reasons that GOFAI was itself unworkable, namely that it proved infeasible for humans to program systems with enough complexity and specificity to do anything more than perform meaningless manipulations (Haugeland, 1985), (Agre, 1997). In recent years AI researchers have surpassed GOFAI only by switching to designs where humans specify relatively simple computations to be performed and allow the AI to apply what Moravec called "raw power" to large data sets to achieve results (Russell and Norvig, 2009), (Moravec, 1976). This suggests that attempts to align AGI by algorithmic means are likely to also prove too complex for humans to solve, leaving us with only philosophical methods of alignment and thus necessitating mindful AGI.

This paints a bleak picture for the possibility of aligning mindless AGI since behavioral methods of alignment are likely to result in divergence from human values and algorithmic methods are too complex for us to succeed at implementing. This leads us to conclude that, although assuming mindful AGI has a greater risk of false positives than assuming mindless AGI all else equal, all else is not equal: mindless AGI is less likely to be successfully aligned because algorithmic and behavioral alignment mechanisms are unlikely to work. So we have no choice but to take on the risks associated with assuming mindful AGI when designing alignment schemes.
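The comparison in this subsection can be written down as a small sketch in Python. This is an illustrative formalization rather than something the argument depends on: the function names and the three probabilities are hypothetical placeholders, and the model assumes, as above, that algorithmic and behavioral methods work equally well whether or not the AGI has a mind, while philosophical methods work only if it does.

```python
def success_probability(assume_mind, p_mind, p_mindless_methods, p_philosophical):
    """Chance a scheme yields aligned AGI under the simplifications stated above.

    p_mind: probability that the AGI we build has a mind (assumed input).
    p_mindless_methods: chance algorithmic/behavioral methods succeed.
    p_philosophical: chance philosophical methods succeed, given a mind.
    """
    if assume_mind:
        # Philosophical methods have nothing to act on without a mind.
        return p_mind * p_philosophical
    # Mindless methods ignore the mind, so p_mind does not matter here.
    return p_mindless_methods


def prefer_assuming_mind(p_mind, p_mindless_methods, p_philosophical):
    """True when assuming a mind gives the higher chance of success."""
    with_mind = success_probability(True, p_mind, p_mindless_methods, p_philosophical)
    without_mind = success_probability(False, p_mind, p_mindless_methods, p_philosophical)
    return with_mind > without_mind


if __name__ == "__main__":
    # Made-up numbers for illustration only, not estimates.
    print(prefer_assuming_mind(p_mind=0.5, p_mindless_methods=0.05, p_philosophical=0.3))
    print(prefer_assuming_mind(p_mind=0.5, p_mindless_methods=0.2, p_philosophical=0.3))
```

Under these simplifications the conclusion holds exactly when `p_mind * p_philosophical` exceeds `p_mindless_methods`, while the false-positive risk of assuming a mind scales with `1 - p_mind`.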


  • Leke Adeofe. Artificial intelligence and subjective experience. In Proceedings of Southcon 95. IEEE, 1997. Link
  • Thomas Nagel. What Is It Like to Be a Bat? The Philosophical Review 83, 435 JSTOR, 1974. Link
  • Hubert L. Dreyfus. What Computers Can’t Do: The Limits of Artificial Intelligence. HarperCollins, 1978.
  • Franz Brentano. Psychology from an Empirical Standpoint. Routledge, 1995.
  • David Chalmers. The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press, 1996.
  • John R. Searle. Minds, Brains, and Science. Harvard University Press, 1984.
  • Daniel C. Dennett. Consciousness Explained. Little, Brown and Co., 1991.
  • Masafumi Oizumi, Larissa Albantakis, Giulio Tononi. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Computational Biology 10, e1003588 Public Library of Science (PLoS), 2014. Link
  • Edmund Husserl. Ideas for a Pure Phenomenology and Phenomenological Philosophy: First Book: General Introduction to Pure Phenomenology. Hackett Publishing Company, Inc., 2014.
  • Robert Kirk. Sentience and Behaviour. Mind 83, 43–60 [Oxford University Press, Mind Association], 1974. Link
  • Peter de Blanc. Ontological Crises in Artificial Agents’ Value Systems. (2011). Link
  • Stuart Armstrong. Motivated Value Selection for Artificial Agents. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop. (2015). Link
  • Eliezer Yudkowsky. Goodhart’s Curse. (2017). Link
  • Charles A. E. Goodhart. Problems of Monetary Management: The UK Experience. 91–121 In Monetary Theory and Practice. Macmillan Education UK, 1984. Link
  • James E. Smith, Robert L. Winkler. The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis. Management Science 52, 311–322 Institute for Operations Research and the Management Sciences (INFORMS), 2006. Link
  • David Manheim, Scott Garrabrant. Categorizing Variants of Goodhart’s Law. (2018). Link
  • James C. Scott. Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed. Yale University Press, 1999.
  • Joel Lehman et al. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. (2018). Link
  • John Haugeland. Artificial Intelligence: The Very Idea. MIT Press, 1985.
  • Philip E. Agre. Computation and Human Experience. Cambridge University Press, 1997.
  • Stuart Russell, Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2009.
  • Hans Moravec. The Role of Raw Power in Intelligence. (1976). Link
16 comments

If we suppose AGI do have minds, then alignment schemes can also use philosophical methods to address the values, goals, models, and behaviors of AGI. Such schemes would likely take the form of ensuring that updates to an AGI’s ontology and axiology converge on and maintain alignment with human interests (de Blanc, 2011), (Armstrong, 2015).

This point is the key to the whole post, and it seems wrong to me. De Blanc 2011 and Armstrong 2015 both allow "non-mindful" models. I don't know any alignment ideas that would apply only to "mindful" AIs.

Returning from our deeper thread to your original comment, can you classify the nature of your objection to this point? For example, would you posit that there is nothing we would classify as mental phenomena that is not already addressed by other methods? If so, that seems fine to me because it reflects our uncertainty about the mental rather than a disagreement with this line of reasoning where we suppose there to be some things we might naively describe as mental, whatever the resolution of our uncertainty about the mental will later tell us, if anything.

Hmm, what do you think it means for an AI to have a mind? To me this simply means it engages in mental phenomena, which is to say it is the subject of intentional phenomena. Perhaps I've not been clear enough about that here?

Also, I think there's no way to make sense of either of those papers without appeal to the mental, since ontology and axiology (values) exist only within the mental. You could try to play at imagining an AI as if it had ontology and axiology without a mind, but then this would be anthropomorphization of the AI rather than treating the AI as it is.

FWIW I also plan to have more to say in future papers about alignment mechanisms that more explicitly depend on the mental aspect of mindful AGI.

Also, I think there’s no way to make sense of either of those papers without appeal to the mental

I think I can help.

Imagine a cellular automaton like Game of Life with an infinite grid of cells. Most cells are initially empty, except for a big block of cells which works as a computer. The computer is running the following program: enumerate many starting configurations of 10x10 Game of Life cells, trying to find a configuration that unfolds into exactly 100 gliders. After the computer finds such a configuration, it uses its "actuator" (another bunch of cells) to self-destruct and replace itself with that configuration.

On one hand, we've described something "non-mindful" that can be implemented right now without difficulty in principle. On the other hand, it can be said to have an "ontology" (it thinks the world is an infinite grid of cells) and "values" (it wants 100 gliders). If you reread de Blanc's paper, imagining that it's talking about these toy "ontologies" and "values" instead of the squishy human kind, it will make perfect sense.
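For concreteness, the "world" in this example can be sketched in a few lines of Python. This is only the Game of Life update rule plus a single glider, not the search program described above; `step`, `run`, and `GLIDER` are names made up for this sketch, and the grid is represented as a set of live (x, y) cells so that it is effectively unbounded.

```python
from collections import Counter


def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


def run(live, generations):
    """Apply `step` the given number of times."""
    for _ in range(generations):
        live = step(live)
    return live


# One glider, with y increasing downward.
GLIDER = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

if __name__ == "__main__":
    shifted = {(x + 1, y + 1) for (x, y) in GLIDER}
    # After four generations the glider reappears one cell down and to the right.
    print(run(GLIDER, 4) == shifted)
```

Everything the toy "ontology" amounts to is this rule and a set of cells, and the toy "value" would be a predicate over such sets, for instance one counting gliders.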

This sounds dangerously like the same kind of failure-by-equivocation that plagued GOFAI. Just because we write a program that contains something we interpret as a representation of the world or acts in a way we interpret as goal directed does not mean the program actually has a representation with intention or action with telos. It also doesn't mean that it doesn't have those things (in fact I think it does since my position on the metaphysics of phenomena is one of panpsychism, though that's outside the scope of this paper), but what it does have is not necessarily what we often think of it having in that way based on our understanding of its internal workings.

To make this concrete, let's consider an even simpler case, a loop that adds up the number of 'a' characters it sees in a file:

acount := 0;
while nc := next(fd) {
  acount++ if nc == 'a';
}

When acount is incremented because nc contains an 'a' it is being causally linked to the state of the file. This doesn't mean the program understands that fd contains acount 'a's, though, or even that acount is causally linked to the contents of fd; it only means that acount counts the number of 'a's in fd, an interpretation we can make but the program itself cannot. So this thing properly has ontology in some very weak sense because it contains a thing that represents the world, but it's the most minimal sort of such a thing and one that is so simple as to be difficult to describe in words without accidentally ascribing it additional features it does not have.

Similarly it has a purpose (which, I would argue, is the source of value) of "count 'a's until you reach the end of the file" but this is the purpose it has as we would describe it. To itself this program has no purpose on its own, but under execution is given purpose by the execution of individual instructions in a particular order that affect the state of a system, yet still this is a sort of purpose that the program cannot express to itself, because ultimately the program has no disposition to understand its own telos. So, yes, it has a purpose, but not of the sort we would ascribe to a thing we could think of as having a mind, and thus we cannot see it as valuing anything; it just does stuff because that's what it is without regard to its own function.

So maybe we can make sense of those papers by applying our own interpretations on mindless systems to treat them as if they had ontologies and axiologies, but I view this as a mistake because it separates us from the systems' own capacities and works on how we believe the systems to work, which may be correlated but are importantly different.

All I'm saying is, these papers weren't intended to be only about "mindful" AI. (You could ask Peter or Stuart, but I'm pretty sure.) And the rest of your post relies on there being techniques that only work on "mindful" AI, so it kinda falls apart.

Hmm, I'm having a hard time figuring out what to do with this feedback. Yes, I suppose such mental-phenomena-assuming alignment techniques are possible and point to two examples of things that look a bit like research in this direction even if you disagree that there is a way those things could work, but this seems not to suggest the rest "falls apart" in that I am reasoning about likelihoods; instead it suggests you disagree with the order of magnitude of likelihoods I assign and think my conclusion is not supportable because you think what I'm calling "philosophical techniques" are unlikely or unnecessary. That's a somewhat different sort of critique than saying the argument falls apart because it hinged, say, on a proposition that is false.

Sorry if that seems nitpicky, but I'm just trying to make sure I understand the objection and respond to it appropriately.

I'm pretty sure these two papers work (or don't work) regardless of mindful/non-mindful AI. They aren't examples of mental-phenomena-assuming alignment techniques - they just use "ontology" and "values" as suggestive words for math, like "learning" in reinforcement learning. So it seems like there's no evidence that mental-phenomena-assuming alignment techniques are possible.

Ah, okay. I think there is such evidence and it doesn't hinge on whether or not these two papers constitute evidence of it, but here I don't consider such arguments. This suggests I should perhaps devote more time to establishing the feasibility of such an approach, although I think we have no strong evidence yet that mindless techniques will work either so I mostly focused on giving evidence that mindless techniques are unlikely to work bringing them below the prior probability of mindful techniques working (being set to the probability that any class of techniques will work), whereas mindful techniques only have "evidence" against them in the form of arguments speculating about the nature of mental phenomena and arguing against its existence, which I point out is something we're practically suspending resolution on here to make an argument given sufficient uncertainty about it that we can't use it to resolve the issue.

Of course if you look at the probability of the whole quest succeeding, it seems small either way, and distinguishing between different small probabilities is hard. But if you look at individual steps, we've made small but solid steps toward understanding "mindless" AI alignment (like the concept of adversarial examples), but no comparable advances in understanding "mindful" AI. So to me the weight of evidence is against your position.

Let me see if I got it right:

1) If we design an alignment scheme by supposing the AGI doesn't have a mind, it will produce an aligned AGI even if the AGI actually possesses a mind.

2) In the case we suppose AGI have minds, the methods employed would fail if it doesn't have a mind, because the philosophical methods employed only work if the subject has a mind.

3) The consequence of 1) and 2) is that supposing AGI have minds has a greater risk of false positives.

4) Because of Goodhart's law, behavioral methods are unlikely to produce aligned AGI

5) Past research on GOFAI and the success of applying "raw power" show that using only algorithmic methods for aligning AGI is not likely to work

6) The consequence of 4) and 5) is that the approach supposing AGI do not have minds is likely to fail at producing aligned AI, because it can only use behavioral or algorithmic methods.

7) Because of 6), we have no choice but to take the risk of false positives associated with supposing AGI have minds

My comments:

a) The transition between 6) and 7) assumes implicitly that:

(*) P( aligned AGI | philosophical methods ) > P( aligned AGI | behavioral or algorithmic methods )

b) You say that if we suppose the AGI does not have a mind, and treat it as a p-zombie, then the design would work even if it has a mind. Therefore, when supposing that the AGI does not have a mind, there are no design choices that optimize the probability of aligned AGI by assuming it does not possess a mind.

c) You assert that using philosophical methods (assuming the AGI does have a mind), a false positive would make the method fail, because the methods rely extensively on the hypothesis of a mind. I don't see why a p-zombie (which by definition would be indistinguishable from an AGI with a mind) would be more likely to fail than an AGI with a mind.

a) Yep, I agree.

b) This sounds right but also I think the comment about p-zombies is generating some confusion and conveying an idea I did not intend. I meant the p-zombie comment to be illustrative, and it's not actually a hinge of the argument.

c) Again, maybe I'm not conveying what I meant to when making passing reference to p-zombies, because in my mind the point of a p-zombie here is that it's equivalent to a thing with a mind for some set of observations we make but doesn't function in the same way, so that we may later be surprised when p-zombie and mind diverge. I suspect some of the confusion is that I'm operating under the assumption that p-zombies are possible but have weird computational limits (the integrated information theory paper has a section explaining this idea).

This paints a bleak picture for the possibility of aligning mindless AGI since behavioral methods of alignment are likely to result in divergence from human values and algorithmic methods are too complex for us to succeed at implementing.

To me it appears like the terms cancel out: Assuming we are able to overcome the difficulties of more symbolic AI design, the prospect of aligning such an AI seems less hard.

In other words, the main risk is wasting effort on alignment strategies that turn out to be mismatched to the eventually implemented AI.

This is actually the opposite of what I argue elsewhere in the paper, preferring to trade off more false negatives for fewer false positives. That is, I view wasting effort as better than not wasting effort on something that has a higher chance of killing us. You see none of that line of argument here, though, so I agree that's a reasonable alternative conclusion to draw outside the context of what I'm trying to optimize for.

One possible situation is when a non-mental AGI creates a full model of a human mind, maybe by scanning, and thus becomes partly "mental". It is more or less inevitable, as it is impossible to extract human values without creating some model of the human mind.

Or, to put it in other words, a model of human values without a model of the human mind is doomed to be wrong, because it will be based on many hidden assumptions implied by ideas like "humans have a constant set of preferences, and act according to it".

Interesting, didn't think about that. Seems to suggest an alternative way to argue for AGI with minds even if we try to create them so that they don't have minds!