An independent researcher of ethics, AI safety, and AI impacts. Twitter: https://twitter.com/leventov. E-mail: leventov.ru@gmail.com (the preferred mode of communication).
You can help boost my sense of accountability and give me a feeling that my work is valued by becoming a paid subscriber of my Substack (though I don't post anything paywalled; in fact, on this blog I just syndicate my LessWrong writing).
A Telegram group where we discuss AI x-risk/safety, theories of intelligence, agency, consciousness, and ethics, in Russian: https://t.me/agi_risk_and_ethics.
It seems that the "ethical simulator" from point 1 and the LLM-based agent from point 2 overlap, so you just overcomplicate things if you make them two distinct systems. Instead, picture an LLM prompted with the right "system prompt" (virtue ethics), doing some branching-tree search for optimal plans according to a trained "utility/value" evaluator (consequentialism), and filtering out plans that contain actions which are always prohibited (law, deontology). The second component is the closest to what you described as an "ethical simulator", but is not quite it: the "utility/value" evaluator cannot say whether an action or a plan is ethical in absolute terms; it can only compare some plans proposed for the particular situation by some planner.
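The three-component architecture above can be sketched in a few lines. This is a minimal, hypothetical illustration: all names (`select_plan`, `value_of`, `is_prohibited`) and the toy evaluator are placeholders I'm introducing, not a real API, and the point is only to show how the deontic filter and the comparative evaluator compose.

```python
# Hypothetical sketch of combining a deontic filter (always-prohibited actions)
# with a comparative "utility/value" evaluator over candidate plans.
from typing import Callable, Optional

Plan = list[str]  # a plan is a sequence of candidate actions

def select_plan(
    candidate_plans: list[Plan],
    value_of: Callable[[Plan], float],     # trained evaluator (consequentialism)
    is_prohibited: Callable[[str], bool],  # deontic filter over individual actions
) -> Optional[Plan]:
    # Deontology: drop any plan containing an always-prohibited action.
    permitted = [p for p in candidate_plans
                 if not any(is_prohibited(a) for a in p)]
    if not permitted:
        return None
    # Consequentialism: the evaluator only *compares* the remaining plans;
    # it does not certify any plan as ethical in absolute terms.
    return max(permitted, key=value_of)

# Toy usage with stand-in components:
plans = [["lie", "profit"], ["negotiate", "share"], ["wait"]]
banned = {"lie", "steal"}
best = select_plan(plans, value_of=len, is_prohibited=lambda a: a in banned)
print(best)  # -> ['negotiate', 'share']
```

Note that the virtue-ethics component (the system prompt shaping which candidate plans get generated at all) sits upstream of this selection step and is not shown here.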
What is the right mathematical language in which to talk about modularity, boundaries, etc?
I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.
Even if we reformulate the question as "Which mathematical language permits identifying boundaries [of a particular physical system, because asking this question in the abstract for any system also doesn't make sense] most accurately?", then the answer probably depends on the meta-theoretical (epistemological) framework that the scientist who asks this question applies to themselves.
Why are biological systems so modular? To what extent will that generalize to agents beyond biology?
See section 3. "Optimization and Scale Separation in Evolving Systems" in "Toward a theory of evolution as multilevel learning" (Vanchurin et al., 2022).
Also, see Michael Levin's work on "multiscale competency architectures". Fields, Levin, et al. apply this framework to ANNs in "The free energy principle induces neuromorphic development" (2022), see sections 2 and 4 in particular. This paper also addresses the question "How do modules/boundaries interact with thermodynamics - e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?"
To what extent do boundaries/modules typically exist "by default" in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are "ubiquitous" in high-dimensional (i.e., complex) systems.
You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling
No, I think membranes could be formalised (Markov blankets, objective "joints" of the environment as in https://arxiv.org/abs/2303.01514, etc.; though theory-laden, I think that the "diff" between the boundaries identifiable from the perspective of different theories is usually negligible).
We humans intrude into each other's boundaries, and into the boundaries of animals, organisations, communities, etc., all the time. A surgeon intruding into the boundaries of a patient is an ethical thing to do. If an AI automated the entire economy, waited until humanity completely lost the ability to run civilisation on its own, and then suddenly stopped all maintenance of the automated systems that support human lives, watching humans die out because they cannot support themselves, it would be "respecting humans' boundaries", but it would also be an evil treacherous turn. Messing with Hitler's boundaries (i.e., killing him) in 1940 would have been an ethical action from the perspective of most systems that might care about it (individual humans, organisations, countries, communities).
I think that boundaries (including consciousness boundaries: what is the locus of animal consciousness? Just the brain, the whole body, or does it even extend beyond the body? What is the locus of AI's consciousness?) are an undeniably important concept, usable for inferring ethical behaviour. But I don't think a simple "winning" deontology is derivable from this concept. I'm currently preparing an article in which I describe how, from the AI engineering perspective, deontology, virtue ethics, and consequentialism can be seen as engineering techniques (approaches) that help to produce and continuously infer an ethical style of behaviour. None of these "classical" approaches to normative ethics is either necessary or sufficient, but they all could help improve ethics in some cognitive architectures.
Getting traction on the deontic feasibility hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
I think no finitely-specified deontology would ensure existential safety; even more likely, following just a finite deontology (such as "don't interfere with others' boundaries") can lead to a dystopian scenario for humanity.
In my current meta-ethical view, ethics is a style of behaviour (i.e., dynamics of a physical system) that is inferred by the system (or its supra-system, such as in the course of genetic or cultural evolution). The style could be characterised/described in the context of multiple different (or, perhaps infinitely many) modelling frameworks/theories for describing the dynamics of the system (perhaps, on various levels of description). Examples of such modelling frameworks are "raw" neural dynamics/connectomics (note: this is already a modelling framework, not the "bare" reality!), Bayesian Brain/Active Inference, Reinforcement Learning, cognitive psychology, evolutionary game theory, etc. All these theories would lead to somewhat different descriptions of the same behaviour which don't completely cover each other[1].
It seems easy to find counterexamples in which intruding into someone's boundaries is the ethical thing to do and abstaining from doing so would be highly unethical. Sorting out multilevel conflicts/frustrations between infinitely many system/boundary partitions of the world[2] in the context of infinitely many theoretical frameworks (such as quantum mechanics[3], the neural network framework[4], the theory of conscious agents[5], etc.) should guide the attainment of the best ethical style that we (and AI agents) can reach, but I think it could hardly be captured by a single deontic rule.
However, in "Mathematical Foundations for a Compositional Account of the Bayesian Brain" (2022), Smithe establishes that it might be possible to formally convert between these frameworks using category theory.
Vanchurin, V., Wolf, Y. I., Katsnelson, M. I., & Koonin, E. V. (2022). Toward a theory of evolution as multilevel learning. Proceedings of the National Academy of Sciences, 119(6), e2120037119. https://doi.org/10.1073/pnas.2120037119
Fields, C., Friston, K., Glazebrook, J. F., & Levin, M. (2022). A free energy principle for generic quantum systems. Progress in Biophysics and Molecular Biology, 173, 36–59. https://doi.org/10.1016/j.pbiomolbio.2022.05.006
Vanchurin, V. (2020). The World as a Neural Network. Entropy, 22(11), 1210. https://doi.org/10.3390/e22111210
Hoffman, D. D., Prakash, C., & Prentner, R. (2023). Fusions of Consciousness. Entropy, 25(1), 129.
All the critiques focus on MI not being effective enough at its ultimate purposes: primarily interpretability, and secondarily, I guess, finding adversaries, and maybe something else?
Did you seriously think through whether interpretability and/or finding adversaries, or some specific aspects or kinds of either, could be net negative for safety overall? Such concerns were contemplated in "AGI-Automated Interpretability is Suicide", "AI interpretability could be harmful?", and "Why and When Interpretability Work is Dangerous". That said, I think none of the authors of these three posts is an expert in interpretability or adversaries, so it would be really interesting to see your thinking on this topic.
The real question is not whether (mechanistic) interpretability is helpful, but whether it could also be "harmful", i.e., speed up capabilities without delivering commensurate or higher improvements in safety (Quintin Pope also talks about this risk in this comment), or by creating a "foom overhang" as described in "AGI-Automated Interpretability is Suicide". Good interpretability also creates an infosec/infohazard attack vector, as I described here.
Thus, the "theory of impact" for interpretability should not just list its potential benefits, but also explain why these benefits are expected to outweigh the potential harms, timeline shortening, and new risks.