TL;DR: This is an attempt to clarify a post from a tech commentator (or perhaps just my own confusion about it) who was skeptical about AGI as a possibility and about alignment research as a project. As such, there is nothing novel here, but it might be useful for people who are outside, or new to, the LW/EA community.
---
Venkatesh Rao’s recent newsletter essay ‘Beyond Hyperanthropomorphism’ raised a number of questions about the direction of AGI and, more relevantly, about the usefulness and validity of AGI alignment research.[1][2] His points regarding alignment, to the extent they are concrete, are buried in a wide-ranging exposition, which makes them hard to engage with.[3] Nevertheless, in order to bridge his philosophically-informed perspective with the more practical approach of many alignment researchers, I thought it useful to situate his specific claims within a few current conversations in alignment.
The issues Rao seems to raise in the essay that I will cover are:
A necessary concept for useful conversations about intelligence (human, animal, AGI, etc.) is defining a ‘coherent quality of experience’ in respect of the entity in question, which is lumped under the neologism ‘something it is like to be’ or SIILTBness (borrowing from Thomas Nagel)
An AGI acquires a coherent quality of experience by having physical interactions with the world, through sensors, actuators, arms, etc., as opposed to just being trained on online text or image corpora
Until researchers can describe SIILTBness in respect of (an) AGI, it is meaningless to talk about (that) AGI’s intelligence, sentience, self-awareness, consciousness, etc.
When suggesting there is an ‘alignment problem’, researchers are extrapolating specific anthropomorphised nascent features of their AGIs, to conjure a golem-like phantasm looming in humanity’s future
Moreover, the very idea of AI alignment as a special or outsize problem, as opposed to a ‘normal engineering problem’, is ill-posed, at least in the absence of SIILTBness
I draw upon another LW post discussing the essay, comments on his blog, as well as a 2021 Twitter thread.[4]
On SIILTBness and LLMs
The essay discusses why machine SIILTBness requires embodiment and situatedness, as well as a subjective, phenomenological notion of time as a way of structuring events in the world.[5] These basically amount to the AI being able to move around in a physical environment, having a clear separation between its physical boundaries and the rest of the world, and (I presume) having something akin to a sense of self.[6] Rao suggests that if one can’t define the SIILTBness of an AI (or future AGI), it is impossible to have a coherent discussion about what that AI/AGI might do, or how it may pose a risk.
I won’t delve into the philosophical intricacies of Nagel or SIILTBness, except to say I am unconvinced that it is a necessary concept for a meaningful discussion of AI/AGI alignment risk (as opposed to the feasibility of AGI). I try to show below that the concerns alignment researchers voice seem to be conceivable, and in some cases concretely visible, in current neural networks.[7]
Machine SIILTBness seems to be a (conceptually wider, more confusing and possibly distracting) version of a specific question: do current AIs actually understand? For instance, how deeply do LLMs like GPT-3 know things about the world, given that they are trained on large amounts of existing text? The balance of opinion is that LLMs have, at best, a shallow understanding, rather than the deeper model of the world that humans (and, arguably, most animals) have, which results from their existential commitment to the reality of a (mostly physical) world.[8]
Notwithstanding LLMs' shallowness, these examples from Stuart Armstrong illustrate that GPT-3 can, if carefully prompted, generate responses that indicate (or simulate) a degree of understanding, albeit within limits.
Can a GPT-n go a step (or a few steps) further? Say (riffing on Armstrong’s example of the dead worker), can a model, having established that the person was dead, log onto Interflora and use the credit card (which its human owner has kindly provided) to order flowers to be delivered to the person’s family? Does it grok enough of Armstrong’s example prompt ‘She died on Friday…’ to know that (within US/UK society) it is reasonable to send flowers to a dead person’s family?
In other words, can a nearcast AI take meaningful and defensible (to common sense) actions in the world, based on a text-only model of the world? Does it need SIILTBness for this? Would it, as Rao implies in this comment, encounter insurmountable hurdles in navigating the reality of the world (e.g. getting the dead worker’s address, retrieving the credit card data, knowing that Interflora is the appropriate site for the UK, etc.)? Would it choose to send flowers, or do something more…ambitious, say, ordering the construction of a mausoleum modelled after the one Queen Artemisia II built at Halicarnassus?
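To make the question concrete, here is a toy sketch (my own framing, not anything Rao or Armstrong propose) of what a text-only model taking actions in the world minimally looks like: an LLM’s completions are parsed into calls to a couple of whitelisted tools. The `query_llm` function and the tools below are hypothetical stubs; the point is only that the plumbing is trivial, and that everything interesting lies in whether the model’s world-model is good enough to use it sensibly.

```python
# Toy sketch of a text-only agent loop. `query_llm` is a stand-in that returns
# canned completions for illustration; in practice it would call some LLM API.
_CANNED = iter([
    "lookup_address: the worker's family",
    "order_flowers: <address of the worker's family>",
    "FINISH",
])

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; here it just replays a canned script."""
    return next(_CANNED)

TOOLS = {
    "lookup_address": lambda name: f"<address of {name}>",             # stub tool
    "order_flowers": lambda address: f"ordered flowers to {address}",  # stub tool
}

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        completion = query_llm(transcript + "Next action (tool: argument) or FINISH: ")
        if completion.strip().startswith("FINISH"):
            return transcript
        tool_name, _, arg = completion.partition(":")
        tool = TOOLS.get(tool_name.strip())
        result = tool(arg.strip()) if tool else "unknown tool"
        transcript += f"{completion}\n-> {result}\n"
    return transcript

print(run_agent("Send flowers to the family of the worker who died on Friday."))
```

Nothing here settles the SIILTBness question, of course; it only shows that the gap between ‘produces text’ and ‘takes actions’ is an engineering detail rather than a conceptual chasm.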
Narrative bridges
A possible frustration that Rao, and perhaps others outside the AF/LW/EA fora, have with the alignment conversation is that it is hard to perceive existential or catastrophic risk from looking at the narrowly-superhuman-yet-occasionally-idiotic AIs of today. Rao uncharitably, and without specifically sourcing whatever has antagonised him, terms these ‘lurid fears…[of] “a self-improving general intelligence” ’. However, I do have some sympathy with his position, since the predicted failures only exist in embryonic form in models operating within toy environments, when they are visible at all. In other cases, the failures are the result of thought experiments, conversations, and (sometimes) difficult formal-mathematical chains of reasoning. By and large, and perhaps intentionally (from an infohazard perspective), this material is wrapped within layers of AF/LW posts (which often disagree with one another), and is only occasionally distilled into general-audience books or podcasts.
Hence I wanted to highlight two recent works, in particular, that try to draw a more concrete path from today’s AI to a world of potentially threatening AGIs.
Narrative bridge I: learning from human feedback
This 2022 post by Ajeya Cotra lays out a scenario of ‘human feedback on diverse tasks’ (HFDT), the idea of which is to ‘Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.’ HFDT would, in Cotra’s and Holden Karnofsky’s conception, be used to automate and radically improve scientific research and technological R&D.
The breadth of HFDT training would incentivise the model (via gradient descent[9]) to 1) pursue goals that require planning over long timeframes, 2) generate creative solutions (in the sense that they are not obvious to the human trainer/overseer, analogously to how AlphaGo generated ‘creative moves’), and 3) develop, as a side-effect of the training process, a model of its trainer’s psychology and of the environment in which it is operating.
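As a rough illustration of the training dynamic Cotra describes (a heavily simplified sketch of my own, not her proposal’s implementation): the model proposes behaviour on a task, something standing in for human feedback scores it, and gradient descent nudges the policy towards higher-rated behaviour. All names and sizes below are illustrative.

```python
# Toy sketch of "human feedback on diverse tasks" training: a policy proposes an
# action for a task, a stand-in reward function plays the role of human feedback,
# and a REINFORCE-style update increases the probability of highly-rated actions.
import torch
import torch.nn as nn

N_TASKS, N_ACTIONS = 4, 8
policy = nn.Sequential(nn.Linear(N_TASKS, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
optim = torch.optim.Adam(policy.parameters(), lr=1e-2)

def human_feedback(task_id: int, action: int) -> float:
    # Stand-in for a human rating: an observable proxy the trainer rewards.
    return 1.0 if action == (task_id * 3) % N_ACTIONS else 0.0

for _ in range(500):
    task_id = torch.randint(N_TASKS, (1,)).item()
    task_vec = torch.nn.functional.one_hot(torch.tensor(task_id), N_TASKS).float()
    dist = torch.distributions.Categorical(logits=policy(task_vec))
    action = dist.sample()
    reward = human_feedback(task_id, action.item())
    loss = -dist.log_prob(action) * reward   # reinforce whatever the rater rewarded
    optim.zero_grad()
    loss.backward()
    optim.step()
```

Nothing in this loop specifies what the policy should internally represent in order to score well, which is why points (2) and (3) above are plausible side-effects of training rather than explicit design choices.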
Cotra’s view might converge with (or collide with) the philosophical objections listed above[10] via point (3): whether an HFDT model would understand its environment and have the ability to generate plans and actions geared towards achieving its objective, e.g. scientific research. Understanding its environment breaks down into modelling the psychology and motivations of its trainers (or of other humans it interacts with); assessing the behaviours for which, and the system by which, it is rewarded; and potentially modelling other aspects of its environment, such as the physical world, other AIs, etc.
A significant part of Cotra, Karnofsky, and Ngo's (2.2.2) scenarios is that AIs may construct deceptive plans - they would mostly act in ways that appear helpful and useful to humans, until the point when they turn.[11]
Whether SIILTBness would be necessary for that level of intricate, long-term planning is an open question, perhaps adjacent to discussions in the alignment community on modelling humans, and on the issues that crop up when agents (human, AI, or otherwise) that are capable of reflecting upon their own thinking are embedded in their environments.[12] Even if the HFDT system didn’t have a comprehensive, accurate, or coherent model of its environment, my intuition is that it could pose catastrophic risks to humanity, since it would have some situational awareness and some ability (through a web interface, say) to take actions in the world, for the reasons Cotra lays out here.
Rao, who does briefly discuss such a scenario only to dismiss it, might be right that, absent an accurate, physically-grounded world-model, an AI would face insurmountable frictions in achieving its objectives. But it could still ‘break things’ in trying. Moreover, as Cotra points out, there would be an economic incentive for humans to fix these easily-exposed errors, without necessarily finding some other, hidden, possibly more consequential failure mode.[13]
Narrative bridge II: a deep-learning perspective
Cotra’s relatively high-level description and operational schematic are closely tied to this 2022 paper by Richard Ngo, which translates some of the points raised by Cotra and others (including Ngo himself in a 2020 paper) into the language of deep learning. Ngo tries to spell out a specific trajectory (insofar as that is possible for a future that is still unfolding and may do so over decades), one that is mostly based upon reasoning rather than on empirical facts but that, crucially, collects specific examples where potentially problematic behaviour is already visible.
Ngo describes three phases in the trajectory by which a model’s[14] governing policy becomes catastrophically misaligned: firstly, the policy learns to represent outcomes that correlate with high reward across a range of tasks (this is Ngo’s operationalisation of the network’s ‘goals’); secondly, the policy learns enough about its environment (‘situational awareness’) to be able to pursue misaligned goals while still maximising training reward; finally, as the policy increases in capability (speed, breadth of tasks, creativity) and eventually exceeds what humans can supervise, it may generalise towards longer-term goal planning and larger-scale actions, which are unlikely to be the goals and actions humans would have approved of had they fully appreciated them.
For the purposes of this narrative bridge, I focus on phase one, which is where the concrete examples of misalignment are most visible. Ngo’s point (Section 2.1.1) is that policies learn representations relevant to tasks they are trained upon. So, it is well documented that image classification networks learn both low- and higher-level features of scenes (and in some cases, we can identify the relevant neuron activations), and while similar behaviour is less clear in an RL context, there are examples of high-level game outcomes (like winning at chess or capturing an opponent’s flag) corresponding to specific neurons.[15]
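For readers unfamiliar with how such claims are checked, the standard tool is a probe: freeze the network, record its activations on many inputs, and fit a small classifier to see whether a high-level outcome (‘winning’, say) is decodable from them. Below is a minimal sketch with synthetic activations standing in for a real network’s hidden layer; real work probes trained vision or game-playing models.

```python
# Minimal linear-probe sketch: can a high-level label be read off internal activations?
# The activations and labels here are synthetic, purely to show the method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 128
activations = rng.normal(size=(n, d))               # pretend hidden-layer activations
direction = rng.normal(size=d)                      # pretend "winning" direction
labels = (activations @ direction > 0).astype(int)  # pretend high-level outcome

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # high accuracy => outcome is linearly decodable
```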
In 2.1.2, Ngo points out examples of misbehaviour arising from the training process, such as reward tampering, or a situation where the policy in a toy environment attempted to deceive the human supervisor. He provides further examples in the context of LLMs, including one where GPT-3 returns an answer that is plausible, in the sense of being tailored to information about the human supervisor, as opposed to one that is simply true. In another example, GPT-3 generates ‘hallucinations’ (i.e. non sequiturs, repetitive loops, or otherwise obviously undesirable output).
Phases 2 & 3, which are more speculative in that the situations they describe simply don’t exist yet, offer fewer concrete misalignment examples, but they do describe how misaligned behaviour may emerge as policies become more situationally aware; why misaligned behaviour is likely to be ‘preferred’ (in a gradient-descent sense) over aligned behaviour; and the possibility that, as policies become more capable,[16] they may increasingly plan over longer timeframes and at larger physical scales, and exhibit power-seeking behaviour.[17]
Extrapolations & hyperanthropomorphisms
I have listed these examples to show that deep learning models (or policies, in Ngo’s terms) appear to act in ways that are consistent with a hypothetical trajectory for the development of AGI, one that is nonetheless grounded in the current deep learning state of the art.
Even if we don’t fully understand how present-day models work, why these misalignments appear, or how to fix them robustly, there is no suggestion, in Ngo, Cotra, or anyone else cited here, of any ‘spooky phenomenology’, sentience, consciousness, experience, or other hyperanthropomorphisms. Nor can I see any of the cases Rao's essay alleges, of ‘pseudo-traits’ or ‘isolated point features’ being simplistically dialled up as a proxy for ‘super-intelligence’.
Of course, one could argue that the very act of using present-day misalignments as a predictor or intuition pump for future pathological goal generalisation or planning behaviour is flawed as an approach. Or that current deep learning will ultimately prove inadequate for the types of behaviour Ngo and Cotra are positing, e.g. creativity, planning, deception, agency, situational awareness, etc. But Rao's essay isn’t saying this, or if it is, his arguments aren’t referenced and don’t sufficiently descend from what is ultimately a mountain of abstraction.
Coordination and replication
A significant portion of the concern about AI misalignment lies with the possibility of networks or neural weights being replicated, most likely by humans deploying AIs in large numbers, as Cotra and others suggest.[18] Rao's essay does dismiss the more speculative possibility of self-replicating AIs, but doesn’t address the (in my view) more salient issue of humans who, for profit, prestige, or geostrategic advantage, deploy possibly-misaligned AI/AGI across a variety of networks and applications.
However the replication happens, it would seem that the risk of coordination or collusion amongst these agents would be enhanced and difficult for humans to counteract, both because the agents collectively may be too powerful, and because humans individually, and as groups, are comparatively hopeless at coordinating (see note 22 and section 2.3.3 in Ngo’s paper, as well as these two posts by Paul Christiano).
Is alignment just a ‘normal engineering problem’?
The essay (Sections 5 & 6) also refers to the notion of ‘normal engineering risks that are addressable through normal sorts of engineering risk-management’, and states that AI alignment is no different from these. ‘Normal engineering risks’ are not defined, but a few examples are given, such as a car’s steering system, a thermostat, bridges, nuclear weapons or reactor meltdowns, chemical pollution, killer robots/drones, medically-deployed AI that makes bad diagnoses, software updates, trading algorithms, and social media.
Unfortunately, these examples are very diverse and consequently offer rather different lessons for alignment. Bridges are a fairly old technology, and they tend to fail individually, aside from occasional issues with substandard concrete, steel fatigue, earthquakes, or bombing. Car steering systems and traditional (not smart or networked) thermostats are pretty simple electro-mechanical feedback systems, and we are experienced at handling their failures, and consumer product failures generally (recalls and lawsuits). Over-the-air software updates bricking fleets of Teslas obviously could hurt or kill many people and cause damage, and may well do so in many countries or locations at once. Nuclear weapons and killer drones are both directed at humans by humans; they do not (currently) select and find targets autonomously.
I would argue that although the specific risks are different, these are mostly well-understood (from technical and policy perspectives), mostly geographically localised, and do not involve software actively trying to deceive or defeat human users.
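To make that contrast concrete, the entire control logic of a traditional thermostat fits in a few lines (a generic bang-bang controller of my own, not any particular product): its failure surface is small, local, and fully inspectable, which is a large part of what makes its risks ‘normal’.

```python
# A traditional thermostat reduced to its control logic: a bang-bang feedback loop.
# Its whole behaviour can be inspected at a glance, unlike a large learned policy.
def thermostat_step(current_temp: float, setpoint: float,
                    heater_on: bool, hysteresis: float = 0.5) -> bool:
    """Return the heater state for the next tick."""
    if current_temp < setpoint - hysteresis:
        return True          # too cold: switch heater on
    if current_temp > setpoint + hysteresis:
        return False         # too warm: switch heater off
    return heater_on         # inside the dead band: keep the current state

# Example: starting cold at 17.0 C with a 20.0 C setpoint, the heater switches on.
state = thermostat_step(current_temp=17.0, setpoint=20.0, heater_on=False)
```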
Of his examples, trading algorithms[19] seem potentially the most relevant. I say this because, firstly, the decision-making of trading organisations[20] (which employ the algorithms) is often quite opaque, even for the most senior managers, and possessed of powerful and ill-understood feedback loops. Secondly, many trading organisations are geographically diffuse and diverse: they exist (at least in terms of their assets) primarily as people, relationships, software, and operational infrastructure that enable decisions (human or machine) to be implemented globally (through buying or selling securities) across many business lines. Thirdly, trading systems and organisations are interconnected (through commercial relationships, long-dated contractual agreements, or collateral arrangements), and a failure in one part has non-obvious consequences, sometimes rapid and other times delayed, in various parts of the ecosystem. Fourthly, unlike bridges or thermostats, financial market failures impose massive, widespread, and long-lived negative externalities, as the ongoing economic and political legacy of 2008 shows. Lastly, the sociology of the trading world, to generalise perhaps unfairly and not exclusively (ego and power also play a part), creates incentives to work around regulatory, shareholder, or political constraints in order to maximise observable metrics, at an organisational level but, more importantly, personally (i.e. end-of-year bonuses or equity compensation value). In other words, Goodharting is rampant.
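A toy illustration of that Goodharting dynamic (my own example, not drawn from any particular trading desk): when an observable proxy keeps rewarding more of some behaviour, optimising the proxy hard eventually parts company with the true objective it was meant to track.

```python
# Toy Goodhart's-law illustration: optimising a proxy metric hard eventually
# decreases the true objective it was meant to track.
import numpy as np

rng = np.random.default_rng(1)

def true_value(effort: float) -> float:
    # Real objective: gains flatten out while hidden risk grows with effort.
    return effort - 0.02 * effort ** 2

def proxy_metric(effort: float) -> float:
    # Observable metric (e.g. booked revenue): keeps rewarding more effort.
    return effort + rng.normal(scale=0.1)

efforts = np.linspace(0, 60, 61)
best_by_proxy = efforts[np.argmax([proxy_metric(e) for e in efforts])]
best_by_value = efforts[np.argmax([true_value(e) for e in efforts])]
print(f"proxy-optimal effort ~{best_by_proxy:.0f}, true-optimal effort ~{best_by_value:.0f}")
# The proxy keeps climbing well past the point (effort = 25) where true value peaks.
```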
The reason financial markets are relevant to discussions around alignment is that highly-capable AIs or AGIs are expected to be widely deployed, owing to economic or competitive pressures. They would be networked and could coordinate to achieve their goals, and there are reasons to believe they may be quite similar to each other, making coordination and mutual trust (amongst them) more likely. Ecosystems of AI/AGIs may be quite opaque in their interconnections and system dynamics, and while Goodharting may or may not be an intractable technical problem, it would almost certainly be present at the regulatory, economic, or geostrategic levels (simply because it is a deeply-embedded characteristic of contemporary society and economy).
For these reasons alone, alignment seems not to be a ‘normal engineering’ problem just as trading systems (or the organisations that employ them) aren’t.
Warning signs & second chances
Moreover, the position that alignment is a ‘normal engineering risk’ seems to assume that, as with such risks, an iterative approach (software updates, product recalls, new hardware, etc.) is feasible with AGI. While it is possible that AGIs will fail in big but non-existential ways, prompting massive technical or regulatory responses, it seems at least possible that the first misaligned AGI would be deployed sufficiently broadly that it causes extensive damage from which there is no recovery, i.e. there is no ‘warning shot’.[21] This likely inability to iterate towards a better solution clearly puts AGI alignment into a special category relative to the other examples (aside perhaps from nuclear weapons).
Governance
Rao's essay also seems to ignore the thicket of public policy issues that arise in his examples, which are likely to be amplified in an AI-saturated world. For instance, many of the 'normal engineering' examples are epistemically uncontroversial, amenable to iterative solutions, and can be regulated on a country-by-country basis (or at the level of trade blocs like the EU or globally, through verifiable treaties like the CTBT).
On the other hand, we have recently seen that pandemics are epistemically challenging, possibly not as tolerant of iterative solutions, and require international coordination in a globalised market for goods, services, and people. Likewise, nuclear proliferation, climate change and financial regulation all present complex coordination and governance issues that societies are struggling to get to grips with.[22]
Alignment risk seems to have more in common with COVID-19, or perhaps computer viruses. Potentially well into the period when AGI is becoming a reality, there may be no consensus amongst researchers or regulators on the reality or level of danger. Cotra neatly encapsulates the peculiar epistemics and sociology around AI: researchers themselves can't agree on a reasonably specific question (whether 'baseline' HFDT plus 'baseline' safety measures would actually train the AI towards a takeover scenario), and there would probably be strong pressure to interpret such ambiguity in an optimistic light (from the perspectives of military advantage, profitability, time-to-market, or benefits-to-society).
Lastly, because AI is a dual-use technology, there would be a massive economic incentive for companies that benefit from weak regulation of AI to push back hard against legislators, activists, or researchers, much as we already see in financial services, climate change, and technology.
Conclusion
My model of Rao’s view goes something like this: ‘SIILTBness in respect of AI is unlikely to be established until AIs are embodied and situated; since SIILTBness cannot be established, AGI is unlikely in the face of reality’s complexity. Hence, current discussions about AGI or about (mis)alignment are nonsensical and a case of flawed anthropomorphism. Even if there is an alignment problem in respect of AI, it is, absent SIILTBness, of the same type and scale as other normal engineering problems, and hence doesn’t require any special consideration’.
This position seems overly abstract, couched in obscure philosophical language, and factually inaccurate. Hopefully I’ve shown that there are examples of specific misalignments visible today, as well as carefully reasoned, relatively non-anthropomorphic trajectories -- grounded in contemporary deep learning practice -- that point towards future, larger misalignments. Although Cotra, Ngo, and the others cited are likely to be wrong in their specifics, the path to AGI may share family resemblances with the trajectories they describe, warranting a high degree of caution. Moreover, AGI clearly presents unique challenges, with entangled technical, geopolitical, and public policy aspects, in a way that neither Rao's logic nor his examples address, which makes his dismissal puzzling and unhelpful.
Rao isn’t the only one to have raised such concerns - see Kevin Kelly’s take from 2017.
In this essay, I will use Richard Ngo’s definition of AGI: ‘an artificial agent which applies domain-general cognitive skills (such as reasoning, memory, and planning) to perform at or above human level on a wide range of cognitive tasks (such as running a company, writing a software program, or formulating a new scientific theory).’ I also use ‘AI’ as a more general term to capture the possibility that AGI is not necessarily the only risk: narrow-purpose systems, if set up properly, could also pose a risk (for example, an LLM employed as a ‘persuasion bot’).
Much of Rao’s essay is about language: an objection to the use of terms like ‘sentience’, ‘consciousness’, ‘intelligence’, etc. That is possibly a fair criticism as far as mass-media writing about AI or AGI goes, but most alignment researchers seem fairly careful about their use of these terms. Sentience has no entries on the Alignment Forum, and there is a degree of embarrassment over (or perhaps on behalf of) the Blake Lemoine affair; the entry on ‘consciousness’ is a detailed and useful summary of the relevant philosophy-of-mind literature. Lastly, intelligence has, for some years, been recognised both as a complex concept (that exists in many flavours and scales) and as a suitcase word that often needs to be broken down.
This post by Holden Karnofsky comprehensively sets out the case for catastrophic/existential risk from misaligned AI, and is a useful place to start.
The point about time is actually in another post by Rao.
This is an old, undecided debate in AI; see this specific response to Rao by Steven Byrnes, or Melanie Mitchell's thoughts on embodied cognition (including this response to Mitchell from Richard Ngo). Whether or not embodiment ultimately proves necessary for AGI, the question doesn’t seem to invalidate most of the misalignment concerns raised in Ngo’s or Cotra’s pieces.
For more, see DeepMind scientist Kory Mathewson’s survey of prior work in philosophy-of-mind as it applies to AI. Also see Brian Cantwell Smith’s writing: he suggests that any entity that takes (intelligent or instinct-driven) intentional actions needs to have a ‘commitment to the world’ (meaning that it suffers real consequences, e.g. it dies, if it gets its world-model wrong) so that it can judge which perceptual data are important to its world-model and relevant to whatever action it plans to take. This survey paper from Murray Shanahan et al. looks at extracting common sense about the physics of objects in 3D RL environments, while this one lays out an agenda for converting such concrete experiences of physicality into concepts and abstractions that might improve models’ abilities to generalise and operate out-of-distribution. More generally, the experiences of digital agents are discussed in somewhat adjacent work (around what societies of digital agents might look like) by Robin Hanson, or Nick Bostrom and Carl Shulman. Lastly, see the literature around consciousness and Steven Byrnes’ writing on brain-like AGI.
Questions of shallow vs deep understanding are adjacent to the conversation around the symbol grounding problem (essentially how linguistic symbols correspond to sensorially-verifiable facts about the world), which has long been a feature of cognitive science and theories of language. Multimodal models might represent a very slight shift in the shallow-to-deep understanding continuum, in the sense that their understanding is more distributed between text and image-handling portions of their neural networks.
Cotra points out that HFDT may simply not work, or our assumptions about what types of behaviour gradient descent incentivises may not be correct.
Such as SIILTBness, symbol grounding, and embodiment/situatedness, which I’m grouping together for convenience.
Concerns around deception are discussed in more detail by Mark Xu, and more formally laid out as a default/likely consequence of gradient descent and situational awareness in the training process, by Evan Hubinger.
Embedded, in this context, is used in the sense that the agent is smaller than its environment, and is made up of similar ‘stuff’, while the environment is populated by a diversity of other agents.
See also this post from Sammy Martin, which discusses a number of overlapping scenarios from Paul Christiano, Andrew Critch, and others, particularly in respect of ‘warning shots’ (serious misalignments that are not globally catastrophic, which might be enough to slow down or stop AGI development, or alternatively, may simply result in specific, local patches, while not solving larger, less-obvious, underlying issues).
Ngo’s example architecture: a 2-headed deep neural network that’s trained on multimodal data via self-supervised learning as well as RL, that has access to natural language and computer interfaces, and is optimised hard to reach human level performance on a range of tasks.
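Schematically, a ‘2-headed’ network of this kind might look something like the following PyTorch sketch (my own illustration of the idea, not Ngo’s specification): a shared trunk feeding a self-supervised prediction head and an RL head.

```python
# Rough sketch of a "2-headed" network: one shared trunk, a self-supervised head,
# and an RL head. Sizes and structure are purely illustrative.
import torch.nn as nn

class TwoHeadedModel(nn.Module):
    def __init__(self, input_dim: int = 512, hidden: int = 1024,
                 vocab_size: int = 32000, n_actions: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.ssl_head = nn.Linear(hidden, vocab_size)  # self-supervised: e.g. next-token prediction
        self.rl_head = nn.Linear(hidden, n_actions)    # RL: action logits for language/computer interfaces

    def forward(self, x):
        h = self.trunk(x)
        return self.ssl_head(h), self.rl_head(h)
```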
It is less clear that policies will develop representations of plans (certain policies, like that of AlphaZero, search through possible move sequences and rank them, but Ngo is more interested in ‘model-free’ planning algorithms, which convert relatively abstract or high-level goals and outcomes into internal value representations in order to determine actions). Intriguingly, Ngo speculates that LLMs like GPT-3 might also be using relatively high-level planning towards desired outcomes in order to generate text completions.
In terms of speed, creativity, rationalisation, data acquisition, and R&D, which are Ngo’s operationalisations of the term ‘intelligence’, see Section 2.3.1.
Power-seeking, in particular, is thought to be an instrumentally useful behaviour across a range of tasks and environments. This theoretical notion was formalised in a 2021 talk/paper which showed that certain RL policies, when behaving optimally, will converge on power-seeking states.
This post uses the assumptions of Cotra's analysis to estimate how many AIs could potentially be sustained in the first/early generation(s) of an AI-saturated world.
I will address the ecosystem around financial markets more generally, rather than confining myself to algorithms, which don’t exist or act in a vacuum. Having worked in the markets for 18 years, I also have a slightly better handle on this domain than on engineering.
I include international investment banks, hedge funds, university endowments, commodity trading houses, private equity firms, sovereign wealth funds, and family offices. These organisations engage in a wide range of activities: trading liquid securities, holding loans, investing in illiquid stakes in companies, trading cryptocurrency, consulting, providing trading software infrastructure, taking/making physical delivery of commodities, and more.
A somewhat related point, albeit from a sociological perspective, is the debate on whether misaligned AGI will have a ‘fire alarm’: will there be some discrete event, or point in time, when the weight of evidence (of likely catastrophic misalignment) causes prominent researchers, activists, celebrities, religious figures, business leaders, or politicians to loudly call for radical action to stop AGI development? The debate is summarised in this post by Rohin Shah (referring to work by Katja Grace and Eliezer Yudkowsky).
The literature on financial bubbles is an example where academics and regulators can’t agree on the facts: are asset prices inflated beyond reasonable fundamentals, to the extent that they might cause widespread and indiscriminate economic loss and suffering?