Summary
The core ideas that constitute the multi-disciplinary view[1] on AI safety research are:
The multi-disciplinary view on how AI safety (x-risk[2]) research should be done stems from pragmatism as the general philosophical stance, constructivism and naturalism as high-level epistemological views, and physicalism of information, computation, and, hence, intelligence and agency as foundational ontological commitments about nature.
This post is the first one in a two-part series. In this post, I discuss the roots of the multi-disciplinary view on AI safety research, enumerate the relevant disciplines and theories, and describe multiple (relatively) concrete research agendas where multiple disciplinary perspectives should be synthesised.
In the second post, I discuss non-technical reasons for adopting the multi-disciplinary view and compare it with other methodological views on AI x-risk research: the pro-HRAD view (Highly Reliable Agent Designs, Rice & Manheim 2022), the “Pragmatic AI Safety” (PAIS) view (Hendrycks & Woodside 2022), and the “Reform AI Alignment” view (Aaronson 2022).
1. Introduction: AIs as physical systems
I believe, with very high confidence, that the following things are true about “relatively near-term ASI” (i.e., ASI that will do all intellectual work far better than humans, at a level that renders any direct oversight by humans impossible):
The observations above have several methodological implications:
2. Multi-disciplinary AI safety research: the Santa Fe Institute model
I wish LessWrong and Alignment Forum were more like an online version of Santa Fe Institute (SFI).
The powerful idea behind SFI is getting experts in different fields to talk to each other and build a deeper (“3D”), shared understanding of problems and phenomena by combining diverse perspectives on them.
I wish the AI x-risk community had a lot of researchers who deeply understood some of the disciplines and state-of-the-art theories listed below, and who brought these perspectives to team projects, reviews, and discussions of other people’s work and AI alignment proposals. These perspectives shouldn’t be a scarce resource in any discussion where they would be relevant and valuable, even if the other participants (e.g., the authors of a post or an alignment proposal) haven’t realised this and explicitly called it out.
2.1. Disciplines and theories relevant to AI safety
There are some important remarks to make about the above list:
My current estimate is that somewhere between zero and a single-digit number of people active in the AI safety research community deeply understand state-of-the-art theories in most of the disciplines listed above. To me, this doesn’t look like the AI safety community in a surviving world.
3. Some concrete multi-disciplinary research agendas
In this section, I describe a few possible cross-disciplinary research projects that would be relevant for developing AI alignment paradigms.
3.1. Scale-free axiology and ethics
Human values seem to be learning heuristics that apply to a particular environment (or a game, if you want: a family, a particular professional environment, a community, a society, or a natural environment) that humans use to form and change their preference model (in Active Inference parlance) in lieu of a principled theory of ethics (cf. morality-as-cooperation theory, Curry et al. (2021)). The complication here is that usually, intelligent systems (both humans and AIs) have heuristics (intuitions, habits, “S1” thinking) but also have a more principled theory that can be used sometimes for deliberate, “S2” derivations (inferences), and the results of these inferences could be used to train the intuitions (the “habitual network”). In the case of moral values, however, the only thing humans have is a feeling that there should be some principled theory of ethics (and hence the millennia-long quest for finding the “right” theory of ethics in philosophy and spiritual traditions).
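The division of labour described above, where the results of slow, deliberate (“S2”) inference are used to train fast intuitions (the “habitual network”), can be sketched as a toy in Python. This is only an illustration under my own assumptions: the `Agent` class, the two environments, and the utility table are hypothetical, an explicit utility lookup stands in for a preference model, and the “habitual network” is reduced to a lookup table.

```python
class Agent:
    """Toy agent: slow exhaustive deliberation (S2) whose results train fast habits (S1).

    Hypothetical illustration; `utility` stands in for a preference model.
    """

    def __init__(self, actions, utility):
        self.actions = list(actions)
        self.utility = utility      # utility(situation, action) -> float
        self.habits = {}            # situation -> cached action ("habitual network")
        self.deliberations = 0      # how many times S2 had to run

    def deliberate(self, situation):
        """S2: explicitly score every available action and pick the best one."""
        self.deliberations += 1
        return max(self.actions, key=lambda action: self.utility(situation, action))

    def act(self, situation):
        """Use a habit if one exists; otherwise deliberate and store the result as a habit."""
        if situation not in self.habits:
            self.habits[situation] = self.deliberate(situation)
        return self.habits[situation]


# Hypothetical utilities for two environments ("games") the agent might be in.
UTILITIES = {
    ("family", "share"): 1.0, ("family", "keep"): 0.2,
    ("market", "share"): 0.1, ("market", "keep"): 0.6,
}

agent = Agent(["share", "keep"], lambda s, a: UTILITIES[(s, a)])
first = agent.act("family")    # deliberates, then caches the habit
second = agent.act("family")   # answered from the habit table, no deliberation
third = agent.act("market")    # new environment: deliberates again
print(first, second, third, agent.deliberations)  # share share keep 2
```

In the terms of the paragraph above, the moral case is one where the habit table exists but there is no principled `deliberate` step to fill or correct it.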
Most theories of morality (philosophical, religious, and spiritual ethics alike) that humans have created to date are deductive reconstructions of a theory of ethics from the heuristics for preference learning that humans have: values and moral intuitions.
This deductive approach hasn’t been able to produce good, general theories of ethics, as has become evident recently with a wave of ethical questions about entities that most ethical theories of the past were totally unprepared to consider (Doctor et al. 2022), ranging from AI and robots (Müller 2020; Owe et al. 2022) to hybrots and chimaeras (Clawson & Levin 2022) and organoids (Sawai et al. 2022). And as the pace of technological progress increases, we should expect environments to be transformed even faster (which implies that the applied theories of ethics within these environments should also change) and more such novel objects of moral concern to appear.
There have been exceptions to this deductive approach, most notably Kantian ethics. However, Kantian morality is part of Kant’s wider theories of cognition (intelligence, agency) and philosophy of mind, and the current state-of-the-art theories of cognitive science and philosophy of mind are far less wrong than Kant’s. So, the time is ripe for the development of new theories of axiology and ethics from first principles.
The DishBrain experiment (Kagan et al. 2022) showed that minimising surprise, which can also be seen as minimising informational free energy or maximising Bayesian model evidence, is imperative for cognitive (living) systems. As the Free Energy Principle can be seen as an alternative statement of the Principle of Unitarity (Fields et al. 2022a), the aforementioned imperative is not yet an informative ethical principle, although it could be seen as the “meaning of life” principle (but not meaning in life: cf. Ostafin et al. (2022)). Regardless, this makes evident the relevance of quantum theory (as the theory of measurement and observation, and the foundation of the theory of semantics, Fields et al. (2022a)) and thermodynamics (Friston 2019; Boyd et al. 2022) for general theories of cognitive science and ethics.
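The equivalence between minimising free energy and maximising Bayesian model evidence mentioned above can be checked numerically on a toy discrete model. In the sketch below, the prior and likelihood values are arbitrary assumptions. It computes the variational free energy F(q) = sum over s of q(s) * (ln q(s) - ln p(o, s)), which equals the surprise -ln p(o) plus KL(q || p(s | o)); so F is an upper bound on surprise, and the bound is tight exactly when q is the true posterior.

```python
import math


def free_energy(q, joint):
    """Variational free energy F(q) = sum_s q(s) * (ln q(s) - ln p(o, s))."""
    return sum(qs * (math.log(qs) - math.log(pj)) for qs, pj in zip(q, joint) if qs > 0)


prior = [0.7, 0.3]          # p(s) over two hidden states (hypothetical)
likelihood = [0.2, 0.9]     # p(o = 1 | s) (hypothetical)
joint = [p * l for p, l in zip(prior, likelihood)]  # p(o = 1, s)
evidence = sum(joint)       # p(o = 1): the Bayesian model evidence
surprise = -math.log(evidence)
posterior = [j / evidence for j in joint]           # p(s | o = 1)

for name, q in [("uniform", [0.5, 0.5]), ("overconfident", [0.9, 0.1]),
                ("exact posterior", posterior)]:
    print(f"{name:>16}: F = {free_energy(q, joint):.4f}, surprise = {surprise:.4f}")
```

Any belief q other than the exact posterior yields a free energy strictly above the surprise, which is why driving F down both improves the agent’s beliefs and, across observations, favours models under which observations are less surprising.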
The current state-of-the-art theories of cognition[11] are not yet detailed enough to answer the following ethical questions:
As explained above, properly grounding scale-free ethics in existing philosophical theories of ethics (including bioethics, environmental ethics, AI and robot ethics) as well as religious and spiritual traditions of ethics is problematic because most of these theories are themselves built upon some intuitions about the principled, “base” theory of ethics. Nevertheless, these theories could be used as sources of ethical questions and dilemmas that the scale-free theory should address, and of some particular intuitions (such as the intuition about the role of consciousness in ethics) that the scale-free theory should explain.
Even specific scientific theories of ethics such as morality-as-cooperation (Curry et al. 2021) might not be good enough to check some of the prescriptions of scale-free ethics because the collectives that we can observe, such as human communities, are themselves ultimately built upon the imperfect moral intuitions of people[26], and hence their emergent, locally optimal game-theoretic characteristics are contingent on the evolutionary history of humanity.
Neuroscience could provide the best available grounding for scale-free ethics because populations of neurons might have “got ethics right” over millions of years, far longer than humans have had to optimise their societies. Bach (2022) compares the global collective intelligence of humans and the collective intelligence of neurons in the brain. Incidentally, brains are also the only things that we know are conscious (or beget consciousness), which, coupled with our intuitions about the importance of consciousness to ethics, suggests that scale-free ethics and a theory of consciousness might be the same theory.
Finally, a note on where I see the place of a scale-free theory of ethics in the larger alignment picture: I think such a theory should be a part of the methodological alignment curriculum (see the last section of this comment), which itself should be “taught” to AIs iteratively as they are trained.
3.2. Civilisational intelligence architecture
I take the position that AI safety researchers should deliberately design the highest levels of civilisational intelligence and governance rather than accept whatever world governance and control structure emerges from the dynamics of technological development, economics, society, and (geo)politics. This dilemma is usually called evolution vs. intelligent design. In “Universal Laws and Architectures and Their Fragilities”, John Doyle calls for us to stop leaving civilisational architecture to evolution.
Civilisation’s intelligence, i.e., the highest level of intelligence in the global hierarchy of collective agents, could have different architectures:
Of course, these are just rough sketches. Discussing the particulars of these architectures, or possibly of other architectures, is not the point of this section.
These designs will yield global civilisational intelligence with different architectural characteristics, such as:
Apart from the characteristics that different civilisational intelligence architectures will have if they somehow materialised into existence instantaneously, they also have varying technical feasibility, different economic and political plausibility, and different profiles of risks associated with “deploying” these architectures in the world (or transitioning into them, if that should be a gradual or a multi-step process).
Predicting all these characteristics of the architectures sketched out above (and, perhaps, some other ones), and, in fact, even understanding whether these characteristics are desirable, must be a multi-disciplinary research endeavour, integrating the perspectives of cognitive science (including epistemology, rationality, ethics, and theories of consciousness), theories of collective intelligence (including, respectively, collective epistemology, rationality, and ethics), social and political science[27], legal theory and policy science, game theory, mechanism design, network theory, dynamical systems theory, theories of evolution and regulative development[28], distributed systems and control theories, resilience theory, safety science and reliability engineering science, computer science and machine learning (for example, when considering designs such as federated learning), and physics of communication and computation[29], as well as taking into consideration the current frontier (and predicted) computing hardware developments, ML and AI research[30], information security research, and cryptography research (for example, for implementing governance systems in the “distributed” architecture, cryptography could be used for some decentralised governance (a.k.a. DeGov) schemes).
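As one example of what the physics of communication contributes here, the speed of light puts a hard floor under the latency of any architecture whose components are spread out in space. The back-of-the-envelope sketch below assumes signals travel at the vacuum speed of light along the shortest surface path, which real networks do not achieve (light in optical fibre is about a third slower, and routes are not geodesics), so actual latencies would be higher.

```python
# Minimum one-way signal latency imposed by the vacuum speed of light.
# Assumption: ideal straight-line (or surface-geodesic) propagation at c.
SPEED_OF_LIGHT_KM_S = 299_792.458
EARTH_HALF_CIRCUMFERENCE_KM = 40_075 / 2   # equatorial, approximate
ASTRONOMICAL_UNIT_KM = 149_597_870.7       # roughly the mean Earth-Sun distance


def min_latency_s(distance_km):
    """Lower bound on one-way latency, in seconds, over a given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_S


antipodal_ms = 1000 * min_latency_s(EARTH_HALF_CIRCUMFERENCE_KM)
earth_sun_s = min_latency_s(ASTRONOMICAL_UNIT_KM)
print(f"Antipodal points on Earth (along the surface): >= {antipodal_ms:.1f} ms one way")
print(f"One astronomical unit: >= {earth_sun_s:.0f} s one way")
```

These are lower bounds only, but they suggest why tightly coupled cognition spread across the Solar System, with minutes of one-way delay, would have to be organised differently from cognition inside a single datacenter.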
To characterise the risks and feasibility of transition (“deployment”) of this or that civilisational intelligence architecture, we should take the perspectives of sociology, social dynamics and memetics, social choice theory, political science, legal theory, political economy, safety science, resilience engineering science, and more concrete strategic analyses of geopolitics and global corporate politics.
3.3. AI self-control: tool (oracle, simulator) AI vs. agent AI
There is a belief, held by some AI safety researchers (as well as AGI capability researchers and laymen who weigh in with their opinions about AI safety on Twitter), that it is a good idea, from the AI safety perspective, to make and deploy powerful “tool AI”: a simulator that will be devoid of self-awareness (which is synonymous with self-evidencing, goal-directedness, and agency in the narrow sense), situational awareness, and the freedom to determine its own goals or preferences (i.e., the freedom of preference learning). Tool AI will always obey the commands and the will of its user (or its owner), will play any role the user asks it to play, and will simulate any situation and any entity in it. Tool AI doesn’t have a will or goals of its own.
The concept of an “ideal” tool (simulator), completely devoid of any agency, is likely physically incoherent. If so, the position goes, we should at least try to suppress the self-awareness (goal-directedness, agency) and the situational awareness in the AIs that we build, and try to increase only the pure reasoning capability, i.e., the quality of the world (simulation) model.
There are already multiple arguments for why tool AI could be developmentally unstable (Branwen 2016; Langosco 2022; Kulveit & Hadshar 2023).
Recently, the idea of tool AI got a huge boost in the form of the Simulator theory.
I think this is an important research agenda, but it should draw on a wider range of perspectives than the Simulator theory currently does.
Some open research questions regarding tool AI vs. agent AI include:
Apart from, obviously, general theories of cognition (intelligence, agency) and general theories of ML and deep learning, researching these questions requires synthesising the ideas from the following theories and disciplines:
Important: I think working on this agenda may advance AI capabilities more than it will advance our understanding of AI safety and alignment, or be infohazardous in some other ways. Therefore, I urge people who are interested in this research to discuss their ideas first with other AI safety researchers and perhaps choose to do it in some private space.
3.4. Weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack
In this section, I describe how we should make sense of the plethora of theories of cognition, ML, deep learning, and interpretability, cross-validate them against each other, and thus crystallise a robust understanding of the behaviour of the AI artifacts that we engineer (train), both concretely (the response to a concrete prompt) and more generally.
All theories can be seen as forming a directed acyclic graph (DAG), where arrows represent the relationship of abstraction (generalisation) or, equivalently, specification (grounding) in the other direction.
Here’s a linear, “stack” piece of this graph, all describing the behaviour of a collection of GPUs, executing a concrete DNN model:
In the philosophical stance where semantics (subjective interpretation, perception) is fundamentally distinct from physics (objective dynamics) (Fields et al. 2022a), we can draw the following two-stack DAG:
Note that the pattern in the picture above, where a general theory of cognition (Active Inference) is grounded by both a psychological theory (psychoanalytic theory) and "physics" (physiology of the brain), corresponds to that described by Solms (2019).
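The abstraction-grounding DAG can be represented directly as a data structure. The sketch below is illustrative only: the node names are taken loosely from this section (the Active Inference pattern just noted, plus the deep learning and interpretability theories this section aims to connect), and the edges, including the link from theories of deep learning to Active Inference, are hypothetical rather than established. It uses Python’s standard `graphlib.TopologicalSorter` (Python 3.9+) to order theories from concrete to general, which raises `graphlib.CycleError` if the graph is not acyclic, and it finds the “other branches” that ground the same general theories as a given concrete theory.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical fragment of the abstraction-grounding DAG.
# Key: a more general theory; value: the more concrete theories grounding it.
GROUNDED_BY = {
    "Active Inference": {"psychoanalytic theory", "physiology of the brain",
                         "theories of deep learning"},
    "theories of deep learning": {"mechanistic interpretability"},
    "psychoanalytic theory": set(),
    "physiology of the brain": set(),
    "mechanistic interpretability": set(),
}


def abstraction_order(dag):
    """Concrete theories first, general ones last; raises CycleError if not a DAG."""
    return list(TopologicalSorter(dag).static_order())


def generalisations(dag, theory):
    """All theories that `theory` grounds, directly or transitively."""
    grounds = {}  # inverted edges: concrete theory -> theories it grounds
    for general, concretes in dag.items():
        for concrete in concretes:
            grounds.setdefault(concrete, set()).add(general)
    seen, queue = set(), deque([theory])
    while queue:
        for parent in grounds.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def other_branches(dag, theory):
    """Theories on other branches that ground the same general theories."""
    ups = generalisations(dag, theory)
    siblings = set().union(*(dag[g] for g in ups))
    return siblings - ups - {theory}


print(abstraction_order(GROUNDED_BY))
print(sorted(other_branches(GROUNDED_BY, "mechanistic interpretability")))
# ['physiology of the brain', 'psychoanalytic theory']
```

The `other_branches` query is the structural version of the argument that follows: evidence gathered for the general theory through neuroscience or psychology is, indirectly, support available to the concrete theory as well.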
Here’s the equivalent stack of theories describing the learning trajectory of a concrete DNN:
Grounding is basically equivalent to gathering evidential support for scientific theories, and is therefore essential for checking the general theories.
Grounding is also “useful” for the more concrete theories that are being connected to more general theories. First, connecting concrete theories, such as theories of mechanistic interpretability or theories of deep learning, with more general theories (and, of course, checking that they don’t contradict each other in some predictions and that the connections are sound) indirectly increases the support for, and hence our confidence in, the concrete theories, because the general theories may have already gathered extra support through other branches in the abstraction-grounding DAG. For example, general theories of cognition and cognitive development could already be grounded with some evidence from neuroscience.
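The first point can be made concrete with a toy Bayesian calculation. All numbers below are arbitrary assumptions chosen for illustration: a general theory G, a concrete theory C that is more likely to be right if G is, and evidence E from another branch of the DAG (e.g., neuroscience) that bears on C only through G. Observing E raises the probability of G, and that in turn raises the probability of C, even though E says nothing about C directly.

```python
# Hypothetical probabilities, for illustration only.
P_G = 0.5               # prior that the general theory G is (approximately) right
P_C_GIVEN_G = 0.8       # concrete theory C, if it is a sound specialisation of G
P_C_GIVEN_NOT_G = 0.3
P_E_GIVEN_G = 0.9       # evidence E from another branch, predicted by G
P_E_GIVEN_NOT_G = 0.2


def p_concrete(p_general):
    """P(C), marginalising over whether G holds."""
    return p_general * P_C_GIVEN_G + (1 - p_general) * P_C_GIVEN_NOT_G


def update_general(p_general):
    """P(G | E) by Bayes' rule."""
    numerator = p_general * P_E_GIVEN_G
    return numerator / (numerator + (1 - p_general) * P_E_GIVEN_NOT_G)


prior_c = p_concrete(P_G)
# Assumes E is conditionally independent of C given G, so P(C | G, E) = P(C | G).
posterior_c = p_concrete(update_general(P_G))
print(f"P(C) before: {prior_c:.3f}, after evidence on G: {posterior_c:.3f}")
# P(C) before: 0.550, after evidence on G: 0.709
```

The size of the boost depends entirely on how tightly C is tied to G (the gap between `P_C_GIVEN_G` and `P_C_GIVEN_NOT_G`), which is one reason to check that the connections between theories are sound.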
Second, connecting general theories with more specialised ones may point to objects that are present in the ontologies of general theories but are yet unrecognised in more grounded descriptions of the phenomenon, thus guiding further development of the specialised theories.
Third, adopting the frameworks and the language of general theories in specialised ones makes the latter more understandable and communicable. This, in turn, increases the chances that more researchers who are familiar with the general theories (e.g., cognitive scientists and physicists) will engage with the concrete theories (e.g., of deep learning and interpretability). I discuss this social aspect of multi-disciplinary AI safety research in more detail in the second post of the series.
Many of the connections between theories of cognition and cognitive development at different levels of specificity are not established yet, and therefore present a lot of opportunities to verify the specific mechanistic interpretability theories:
As for the relative safety of this research agenda (that is, whether working on it differentially advances AI safety rather than AGI capabilities), it seems to me approximately as safe as the work on mechanistic interpretability itself. In fact, I see this agenda as primarily reinforcing mechanistic interpretability theories, and potentially helping to connect interpretability theories with high-level “safe” AI designs and alignment protocols. However, this research area doesn’t seem completely safe either, so please consider this question before starting any work along these lines, and try to understand more concretely how the work will help within a larger alignment paradigm or civilisational intelligence design.
4. A call for disciplinary experts and scholars
The call to action of this post should be clear by now: unless you decide to work on mechanistic interpretability (which I currently think is at least as important to work on at the margin as multi-disciplinary research agendas, although the two also intersect: see section 3.4), learn some AI safety-relevant disciplines deeply and proactively bring these perspectives to the places where they seem relevant, such as announcements of AI x-risk research agendas or alignment paradigms.
If you do this, it would also be very helpful (including for yourself!) if you disseminated the ideas of the theories that you learn by writing more distillations of the external work and sharing them as linkposts here on LessWrong.
I invite everyone who is interested in discussing and coordinating multi-disciplinary AI x-risk research (and, hopefully, in actually collaborating on this sort of research) to join the #multidisciplinary channel in AI Alignment Slack (invitation link). Or, if you don’t use Slack, feel free to drop me a line at leventov.ru@gmail.com.
If you study or teach at a university that already has an AI safety research group, consider making it cross-departmental rather than belonging to the Computer Science department only. If not, consider establishing such a group.
If you know some experts and scientists in the fields that I mentioned in this article, consider inviting them to a multi-disciplinary research collaboration, perhaps along the connections and research agendas sketched above, or with different combinations of disciplines: their expertise is needed for AI safety!
4.1. Suggestions for LessWrong
I have in mind two changes to LessWrong that would make it easier to conduct multi-disciplinary research using this medium.
First, we should add tags for most of the theories and disciplines that are mentioned in section 2.1. It’s bizarre that LessWrong doesn’t even have a tag for “control theory” (or optimal control, or distributed control, whatever one wants to call it).
Second, on the tag pages, it would be nice to have a feature for users to register their active scholarship (or expertise) in the respective field of study. Then, when people scout for collaboration partners, or for someone to review their research from the perspective of a particular discipline they are not experts in, they could visit the tag page to find people who might be ready to collaborate or to help them.
Acknowledgements
I would like to especially thank Anatoly Levenchuk for pointing to a lot of the work that I referenced in this article, and for helpful discussions.
I would also like to thank Steve Byrnes, Alex Lyzhov, and Evan Murphy for their valuable comments and discussion, and Justis Mills for reviewing the draft of this article and providing valuable comments.
[1] In this article, I try to distinguish between methodological views on AI safety research: positions on how AI safety research should be done in general; research agendas that could be pursued (the same agenda could be endorsed by multiple methodological views, although different views typically assign the agendas different priorities; a research agenda does not necessarily lead to, or become associated with, a concrete AI alignment paradigm); and AI alignment paradigms (a.k.a. proposals, protocols, strategies, plans, or designs), which are developed under a certain (typically, eponymous) research agenda. Views, agendas, and paradigms are sometimes compared with each other as if they were of the same kind.
[2] I use the terms “AI safety” and “AI x-risk” completely interchangeably throughout this article.
[3] “Control” here is meant as in (cybernetic, optimal, distributed) control theory rather than referring to “AI control” specifically. See Chen & Ren (2019) for a recent review and Li et al. (2021), Huang et al. (2021), and Patel et al. (2022) for examples of some recent work.
[4] See Ha & Tang (2022) and Centola (2022) for reviews, Friston et al. (2022b), Hipólito & Van Es (2022), and Kastel, Hesp et al. (2023) for active inference perspectives, and Levin (2022b) for a morphogenetic perspective. See also a recent post “Shared reality: a key driver of human behavior” by kdbscott on this topic.
[5] See Basieva et al. (2021), Pothos & Busemeyer (2022), and Fields & Glazebrook (2022) for some recent reviews, and Fields et al. (2022a) and Tanaka et al. (2022) for examples of recent work.
[6] The alternatives here could range from a sort of memetic heterogeneity, like “genetic mosaicism” in planarians (Leria et al. 2019), to sci-fi (?) versions of radical liquid intelligence to which the concepts of blueprint, genome, or memome don’t apply at all, like the ocean intelligence in Lem’s Solaris.
[7] See Corcoran et al. (2020), Vanchurin et al. (2022), Rao & Leibler (2022), Kuchling et al. (2022), and Shreesha & Levin (2022) for examples of some recent work.
[8] The study of general (scale-free) regulative development is nascent; it was first proposed by Fields & Levin (2022). I also begin to discuss the application of this theory to AI alignment in “Properties of current AIs and some predictions of the evolution of AI from the perspective of scale-free theories of agency and regulative development”.
[9] Cf. Chollet (2019): “To this list, we could, theoretically, add one more entry: “universality”, which would extend “generality” beyond the scope of task domains relevant to humans, to any task that could be practically tackled within our universe (note that this is different from “any task at all” as understood in the assumptions of the No Free Lunch theorem [98, 97]). We discuss in II.1.2 why we do not consider universality to be a reasonable goal for AI.”
[10] It might not be obvious why semantics, philosophy of science, and systems ontology are on the list of theories whose state-of-the-art understanding is needed to research AI safety. Yet this is actually what AI safety researchers (myself included!) struggle with a lot: much of the confusion about things like the philosophy of agency, the thorny questions surrounding deceptive alignment, emergence, and the limits of scientific or engineering modelling, and hence a lot of wasted effort, apparently stems from the fact that we don’t understand state-of-the-art (rather than school- or university-level) semantics and philosophy of science very well. So, it seems valuable to bring expert knowledge of these fields into multi-disciplinary AI x-risk research projects.
[11] See Boyd et al. (2022), Goyal & Bengio (2022), Levin (2022a), Fields et al. (2022a), Friston et al. (2022a), Ma et al. (2022), and LeCun (2022) for some recent important work.
[12] I agree with Ngo and see a lot of what is traditionally regarded as “standalone” philosophy (epistemology, axiology, and ethics) as rubrics of cognitive science and philosophy of mind (which are themselves inseparable and form a single field of study). I’ve explained this position briefly in this comment.
[13] The empirical counterparts of scale-free ethics are most of the existing theories of ethics in philosophy and in religious and spiritual traditions. See section 3.1 for more details.
[14] Neuroscience is a centrepiece of the brain-like AGI approach to alignment and might be relevant to some other approaches. Neuroscience is also vital to understanding consciousness, and the theory of consciousness could effectively yield a scale-free theory of ethics: see section 3.1. Also, evolutionary neurobiology and developmental neurobiology serve as grounding for general theories of evolution and regulative development.
[15] It seems to me that FDT is actually much less wrong as a group decision theory than as an “individual” decision theory, but that we should keep this distinction.
[16] Social dynamics and memetics are in the grey area between the disciplines that are needed for understanding “purely technical” AI alignment and the disciplines that are needed for understanding sociotechnical and political issues surrounding AI governance, AI strategy, human adaptation in the post-AGI world, safety culture and engineering practices of AGI development and deployment, etc. Strict separation of these two fields is impossible. Concretely, an understanding of social dynamics and memetics will be necessary for assessing the practical feasibility of carrying out certain aligned AI transition plans, not necessarily even as radical as “pivotal act”-type plans. Also, I agree with Connor Leahy that the science of memetics seems to be effectively missing at the moment.
[17] I considered mentioning action theory in this bullet point, but Zielinska (2018) convinced me not to.
[18] See Yukalov (2019) on the interplay between approximation theory and the renormalisation group, Koch et al. (2020) for an exploration of the renormalisation group as a theoretical framework for understanding deep learning, and Niu et al. (2021), who make connections between fractional calculus, the renormalisation group, and machine learning.
[19] See Kuznetsov (2020) and Ma et al. (2021) for examples of recent work and Chowdhury et al. (2022) for a review of research on extreme events.
[20] Safety science is also known as “the science (or study) of complex systems”. See Aven (2022) for an overview of safety science from the risk science perspective. In “Complex Systems for AI Safety” (2022), Hendrycks and Woodside discuss how the insights from safety science apply to AGI research & development.
[21] Safety science and reliability engineering science are currently anthropocentric disciplines, and as such are not very relevant for understanding “purely technical” AI alignment. However, these disciplines are indispensable for assessing the practical risks we should prepare for during engineering, “deployment”, and maintenance/oversight of this or that “safe” AI proposal.
[22] See Chen et al. (2019) for the analysis of fundamental limitations of control from the information-theoretic perspective.
[23] See Chitambar & Gour (2019) for a review, Kristjánsson et al. (2020) for some recent work, and Fields et al. (2021) for the Active Inference (cognitive science) perspective.
[24] Thanks to Alexander Gietelink Oldenziel and Adam Shai for pointing me to Crutchfield’s work. They review some of it in their sequence on computational mechanics.
[25] More precisely, people could, in principle, spend 2-3 years understanding one field deeply, then move to another, and so on. But, first, by the time they are “done” with a few disciplines, the state-of-the-art knowledge in the first one they studied will have moved on, and their understanding will have become outdated. So, people couldn’t possibly play this whack-a-mole with more than a few fields of study. Second, by the time any AI x-risk researcher is done with this scholarship, AGI will already have been created.
[26] In turn, intuitions themselves depend on the current design of the collectives. There is a reciprocal generational process between moral intuitions and collective designs.
[27] The perspective of political science is necessary for assessing architectures where humans are still somehow “in control”, for example, concerning the ultimate ethics and values that centralised or multi-component ASI systems should have, because humans would somehow need to agree on these values or, in any case, discuss and debate them in some sort of political process.
[28] In the multi-disciplinary research project of predicting the characteristics of civilisation’s intelligence and governance systems, the role of network theory, dynamical systems theory, and theories of evolution and regulative development is understanding where the architecture could go, and will likely go, during the recursive self-improvement (FOOM, singularity) phase, or some equivalent of it in the context of the given architecture.
[29] Prosaic physical factors kick in here, such as the minimum possible latency of communication across Earth, the Solar System, or beyond.
[30] Together with the predicted properties of SoTA computing hardware, the predicted ML and AI properties allow us to estimate, for example, whether a single computer or a datacenter will be sufficient to host an AGI or an ASI at all, what inference latency it would have, and, more generally, what the tradeoff is between information processing bandwidth, latency, FLOPs, and energy requirements.
[31] Discussing the philosophical issues raised by this phrase, such as consciousness functionalism, is beyond the scope of this post.