It seems that the "ethical simulator" from point 1 and the LLM-based agent from point 2 overlap, so you just overcomplicate things if you make them two distinct systems. An LLM prompted with the right "system prompt" (virtue ethics), doing some branching-tree search for optimal plans according to some trained "utility/value" evaluator (consequentialism), and filtering out plans that contain actions which are always prohibited (law, deontology) would cover both. The second component is the closest to what you described as an "ethical simulator", but is not quite it: the "utility/value" evaluator cannot say whether an action or a plan is ethical in absolute terms; it can only compare plans proposed for the particular situation by some planner.
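To make the three-component loop concrete, here is a minimal sketch. Everything in it is a hypothetical stand-in: `propose_next_actions` stands in for the LLM (with a virtue-ethics system prompt) proposing candidate actions, `value_estimate` for the trained utility/value evaluator, and `PROHIBITED` for the deontological blocklist.

```python
# Deontology: actions that are always filtered out, regardless of estimated value.
PROHIBITED = {"deceive_user", "disable_oversight"}

def propose_next_actions(plan):
    # Stand-in for the LLM (virtue-ethics system prompt) proposing actions.
    return [f"step{len(plan)}_{i}" for i in range(3)]

def value_estimate(plan):
    # Stand-in for the trained utility/value evaluator. Note: it only *ranks*
    # candidate plans for this situation; it cannot certify a plan as
    # ethical in absolute terms.
    return sum(ord(c) for action in plan for c in action)

def search_plans(depth=2):
    plans = [[]]
    for _ in range(depth):  # branching-tree expansion of candidate plans
        plans = [p + [a] for p in plans for a in propose_next_actions(p)]
    # Deontological filter: drop any plan containing a prohibited action.
    admissible = [p for p in plans if not set(p) & PROHIBITED]
    # Consequentialist ranking: pick the best plan *among* admissible ones.
    return max(admissible, key=value_estimate)

best = search_plans()
print(best)  # -> ['step0_2', 'step1_2']
```

The point of the sketch is the ordering: the deontological filter applies before the consequentialist ranking, so the evaluator never gets to "outvote" a prohibition.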
What is the right mathematical language in which to talk about modularity, boundaries, etc?
I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.
Even if we reformulate the question as "Which mathematical language permits identifying boundaries [of a particu...
Why are biological systems so modular? To what extent will that generalize to agents beyond biology?
See section 3. "Optimization and Scale Separation in Evolving Systems" in "Toward a theory of evolution as multilevel learning" (Vanchurin et al., 2022).
Also, see Michael Levin's work on "multiscale competency architectures". Fields, Levin, et al. apply this framework to ANNs in "The free energy principle induces neuromorphic development" (2022), see sections 2 and 4 in particular. This paper also addresses the question "How do modules/boundaries interact wi...
To what extent do boundaries/modules typically exist "by default" in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are "ubiquitous" in high-dimensional (i.e., complex) systems.
You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling
No, I think membranes could be formalised (Markov blankets, objective "joints" of the environment as in https://arxiv.org/abs/2303.01514, etc.; though theory-laden, I think that the "diff" between the boundaries identifiable from the perspective of different theories is usually negligible).
We humans intrude into each other's boundaries, and the boundaries of animals, organisations, communities, etc., all the time. ...
...Getting traction on the deontic feasibility hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would
All the critiques focus on MI not being effective enough at its ultimate purpose -- namely, interpretability, and secondarily, finding adversaries (I guess), and maybe something else?
Did you seriously think through whether interpretability and/or finding adversaries, or some specific aspects or kinds of either, could be net negative for safety overall? Such as what was contemplated in "AGI-Automated Interpretability is Suicide", "AI interpretability could be harmful?", and "Why and When Interpretability Work is Dange...
Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, where any flaws are exploited, amplified and fatal, and there are many lethal problems all of which one must avoid?
To answer these questions specifically, it's really important not just to consider AI-human alignment "in the abstract", but embedded in the current civilisation, with its infrastructure and incentiv...
And Zvi points out these contradictions himself:
It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements.
Strict alignment. The damn thing will actually follow some set of instructions to the letter subject to its optimization constraints, hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and there is no known good one yet, due to various alignment difficulties.
The name is uninformative and possibly misleading. If the set of instructions is in a natural or a formal language, you push the alignment difficulty into the sema...
It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, and the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?
"Human-level" is just more commonly called "value alignment" (or "alignment with human values" if you want). But I agree with the conclusion: "friendly" is an attempt at "moral fact alignment" ("humanity ...
From the perspective of ontology and memetic engineering, the whole "ontology" or classification of alignments that you give, "fragile, friendly, ..." is bad because it's not based on some theory but rather on the cacophony of commonsensical ideas. These "alignments" don't even belong to the same type: "Fragile" is an engineering approach (but there are also many other engineering approaches which you haven't mentioned!), 2-3 and 5-6 are black-box descriptions of some alignment characteristics (at least these seem to belong to the same type), and "Strict" ...
Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.
This is self-contradictory: if the surrounding dynamics strongly preclude humans from "choosing otherwise", humans are no longer "in control". Also, under certain definitions of "choosing differently", humans may be precluded from moving into different biological and computational substrates, which in itself might be a cosmic tragedy because it may forever preclude humans from realising vast amounts of potential.
...It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):
- Do what I say
- Also do what I mean
- Also do what I should have said and meant
- Also do what is best for me
- Also do what broader society or humanity says
- Also do what broader society or humanity means or should have said
- Also do what broader society or humanity should have said given their values
- Also do what is best for everyone
- Do some ideal friendly combination of all of it that a broadly good guy would do, in a way
the system has unintended (harmful or dangerous) goals or behaviors.
Note that judgements about the harmfulness and dangerousness of some goals or behaviours are themselves theory-laden. This is why Goal alignment without alignment on epistemology, ethics, and science is futile. From the perspective of any theory of cognition/intelligence that includes a generative model (which is not only Active Inference, but also LeCun's H-JEPA, LMCAs such as the "exemplary actor", and more theories of cognition and/or AI architectures) for performing planning-as-inferen...
Re: the virtuous cycle, I was excited recently to find Toby Smithe's work, a compositional account of Bayesian Brain, which strives to establish formal connections between ontology, epistemology, phenomenology, semantics, evolutionary game theory, and more.
Next week, Smithe will give a seminar about this work.
I asked GPT-4 to write a list of desiderata for a naturalistic (i.e., scientific) theory of ethics: https://chat.openai.com/share/1025a325-30e0-457c-a1ed-9d6b3f23eb5e. It made some mistakes but in other regards surprised me with the quality of its philosophy of science and meta-ethics.
The mistake that jumped out for me was “6. Robustness and Flexibility”:
...The ethical theory should be robust and flexible, meaning it should be able to accommodate new information and adapt to different contexts and conditions. As our scientific knowledge evolves, the theory sh
An LMCA that uses a body of knowledge in the form of textbooks, scientific theories, and models may be updated very frequently and cheaply: essentially, every update of the scientific textbook is an update of the LMCA. No need to re-train anything.
GFlowNets have a disadvantage because they are trained for a very particular version of the exemplary actor, drawing upon a particular version of the body of knowledge. And this training will be extremely costly (billions or tens or even hundreds of billions of USD?) and high-latency (months?). By the time a hypothetical...
I thought your response would be that the H-JEPA network might be substantially faster, and so have a lower alignment tax than the exemplary LMCA.
I discuss this in section 4.5. My intuition is that an LMCA with latency in tens of minutes is basically as "powerful" (on the civilisational scale) as an agent with a latency of one second; there is no OODA-style edge in being swifter than tens of minutes. So, I think that Eric Schmidt's idea of a "millisecond-long war" (or, a war where action unfolds at millisecond-scale cadence) just doesn't make sense.
However, these...
My question is: if we already have an aligned LMCA, why would we use it to train a less interpretable H-JEPA AGI?
First, it is not less interpretable. Here, Bengio and Hu argue that GFlowNets are more interpretable than auto-regressive LLMs; but in the setup where the energy function is not explicitly given (as in some other GFlowNet training setups, e.g., for drug discovery) but rather learned from examples (as I proposed in the post), GFlowNets don't have any interpretability advantage over the AI that generates the examples, which is the aligned LMCA in th...
One line: H-JEPA probably won't save us unless we already have an aligned LLM-based cognitive architecture.
Also: see section "Conclusion" in the post
Update: I wrote a big article "Aligning an H-JEPA agent via training on the outputs of an LLM-based 'exemplary actor'" in which I develop the thinking behind the comment above (but also update it significantly).
Existing models of agency from fields like reinforcement learning and game theory don't seem up to the job, so trying to develop better ones might pay off.
...One account of why our usual models of agency aren't up to the job is the Embedded Agency sequence - the usual models assume agents are unchanging, indivisible entities which interact with their environments through predefined channels, but real-world agents are a part of their environment. The sequence identifies four rough categories of problems that arise when we switch to trying to model embedd
As soon as we start talking about societal identities and therefore interests or "values" (and, in general, any group identities/interests/values), the question arises of how AI should balance individual and group interests, while also considering that there are far more than two levels of group hierarchy, that group identity/individuality (cf. Krakauer et al., "Information theory of individuality") is a gradualistic rather than a categorical property, as well as the group's (system's) consciousness. If we don't have a principled scientific theory for...
Bengio proposed the same thing recently, "AI scientists and humans working together". I criticised this idea here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai#AI_scientists_and_humans_working_together, and that criticism wholly applies to your post as well. It would work if the whole alignment problem consisted of aligning exactly one human with one AI and nothing else mattered.
In the societal setting, however, where humans are grossly misaligned with each other, turning everyone into a cyb...
I wonder whether GFlowNets are somehow better suited for self-destruction/non-finetunability than LLMs.
Apart from the potential to speed up foom, there is also a more prosaic reason why interpretability by other AIs or humans could be dangerous: interpretability could reveal infohazardous reasoning en route to inferring aligned, ethical plans: https://www.lesswrong.com/posts/CRrkKAafopCmhJEBt/ai-interpretability-could-be-harmful. So I suggested that we may need to go as far as cryptographically obfuscating the AI's reasoning process that leads to "aligned" plans.
For reference, I replied to Bengio's post in a separate post: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai. TLDR: pretty much the same points that other commentators to this post are making, just in a more elaborate form.
...But only if the labels are correct, and the labeling problem is potentially much harder now, because the latent variables include inscrutable information about “how the AI is thinking about / conceptualizing the things that it’s seeing / doing”. I think. And if they do, then how are the humans supposed to label them as good or bad? Like, if the AI notices someone feeling physically good but psychologically distressed, we want to label it as low-energy when the AI is thinking about the former aspect and high-energy if the AI is thinking about the latter asp
I believe Steven didn't imply that a significant number of people would approve or want such a future -- indeed, the opposite, hence he called the scenario "dystopian".
He basically meant that optimising surface signals of pleasure does not automatically lead to behaviours and plans congruent with reasonable ethics, so the surface elements of alignment suggested by LeCun in the paper are clearly insufficient.
My comments on this post wound up becoming a whole separate post: "H-JEPA might be technically alignable in a modified form".
I believe this is the only way to design an AI whose actions we still have confidence in the desirability of, even once the AI is out of our hands and is augmenting itself to unfathomable capabilities.
I think unleashing AI in approximately the present world, whose infrastructural and systemic vulnerabilities I gestured at here, in the "Dealing with unaligned competition" section (in short: no permeating trust systems that follow the money, unconstrained "reach-anywhere" internet architecture, information massively accumulated and centralised in the datacen...
In this post, as well as your other posts, you use the word "goal" a lot, as well as related words, phrases, and ideas: "target", "outcomes", "alignment ultimately is about making sure that the first SGCA pursues desirable goal", the idea of backchaining, "save the world" (this last one, in particular, implies that the world can be "saved", like in a movie, that implies some finitude of the story).
I think this is not the best view of the world. I think this view misses the latest developments in the physics of evolution and regulative development, evolutio...
the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question "will maximizing this with arbitrary capabilities produce desirable outcomes?" where the answer does not depend, at the limit, on what does the maximization.
I don't know about other proposals because I'm not familiar with them, but Metaethical AI actually describes the machinery of the agent, hence "the answer" does depend "on what does the maximisation".
I generally disagree with the implicit claim "it's useful to try aligning AI systems via mechanism design on civilization." This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent.
I didn't imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as a part of a portfolio, I think it's indispensable.
Probably the degree to which a person (let's say, you or me) buys into the importance of mechanism/higher-level-system design for AI alignment corresponds to where w...
Seems to me that you misunderstood my position.
Let's assume that GPT-5 will hit the AGI recursive self-improvement bar, so that if it is bundled into an AutoGPT-style loop and given a bold task such as "solve a big challenge of humanity, such as global hunger" or "make a billion-dollar business", it could meaningfully achieve a lot, via instrumental convergence and actual agency, including self-improvement, at which point it is effectively out of control.
I think that "build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + ...
Ok, in this passage:
In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this is a conceptual error that I want to address.
It seems that you put the first two sentences "in the mouth of people outside of AI safety", and they describe some conceptual error, while the third sentence is "yours". However, I don't understand w...
In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly?
Factually, no, I don't think this is where most people's thoughts are. Apart from the stages of the engineering process that you enumerated, there are also manufacturing (i.e., training, in the case of ML systems) and operations (post-deployment phase). AI safety "thought" is mo...
In the language of generative models, "praxis" corresponds to cognitive and "action" disciplines, from rationality (the discipline/praxis of rational reasoning), epistemology, and ethics to dancing and pottery. The generative model (Active Inference) frame and the shard theory frame thus seem to be in agreement that disciplinary alignment ("virtue ethics") is more important (fundamental, robust) than "deontology" and "consequentialism" alignment, which roughly correspond to goal alignment and prediction ("future fact") alignment, respectively. The generati...
Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:
I’m a bit confused by your reference to Adam’s post. I interpret his post as advocating for more originality, not less, in terms of diverse alignment research agendas.
The quote by Shimi exemplifies the position with which I disagree. I believe Yudkowsky and Soares were also stating something along these lines previously, but I couldn't find suitable quotes. I don't know if any of these three people still hold this position, though.
How do you think Metzinger or Clark would specifically benefit our scholars?
I heard Metzinger reflecting on the passage of para...
A systematic way for classifying AI safety work could use a matrix, where one dimension is the system level:
If AGI labs truly bet on AI-assisted (or fully AI-automated) science, across the domains of science (the second group in your list), then research done in the following three groups will be submerged by that AI-assisted research.
It's still important to do some research in these areas, for two reasons:
(1) hedging bets against some unexpected turn of events, such as AIs failing to improve the speed and depth of generated scientific insight, at least in some areas (perhaps governance & strategy are more iffy areas, and it's hard to become sure that strate...
The real question is not whether (mechanistic) interpretability is helpful, but whether it could also be "harmful", i.e., speed up capabilities without delivering commensurate or higher improvements in safety (Quintin Pope also talks about this risk in this comment), or by creating a "foom overhang" as described in "AGI-Automated Interpretability is Suicide". Good interpretability also creates an ...