All of Roman Leventov's Comments + Replies

Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is 'actually understanding how our black box systems work' not helpful?

The real question is not whether (mechanistic) interpretability is helpful, but whether it could also be "harmful", i.e., speed up capabilities without delivering commensurate or higher improvements in safety (Quintin Pope also talks about this risk in this comment), or by creating a "foom overhang" as described in "AGI-Automated Interpretability is Suicide". Good interpretability also creates an ... (read more)

It seems that the "ethical simulator" from point 1 and the LLM-based agent from point 2 overlap, so you overcomplicate things if you make them two distinct systems. Consider an LLM prompted with the right "system prompt" (virtue ethics), doing some branching-tree search for optimal plans according to some trained "utility/value" evaluator (consequentialism), and filtering out plans that include actions which are always prohibited (law, deontology). The second component is the closest to what you described as an "ethical simulator", but is not quite it: the "utility/value" evaluator cannot say whether an action or a plan is ethical or not in absolute terms, it can only compare some proposed plans for the particular situation by some planner.
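The three components above can be sketched as a toy pipeline. This is a minimal illustration, not a real implementation: the planner and value evaluator are stubbed out with hard-coded values, and all function names, actions, and scores are invented for the example.

```python
# Toy sketch of the three-component planner: an LLM proposes plans
# (virtue ethics via the system prompt), a learned value model ranks them
# (consequentialism), and a hard filter drops plans containing
# categorically prohibited actions (deontology). Everything here is a
# stub; a real system would call an LLM and a trained evaluator.

PROHIBITED_ACTIONS = {"deceive_user", "cause_physical_harm"}

def propose_plans(situation: str) -> list[list[str]]:
    """Stub for the LLM planner: returns candidate plans (action lists)."""
    return [
        ["ask_clarifying_question", "offer_help"],
        ["deceive_user", "offer_help"],          # will be filtered out
        ["offer_help"],
        ["ask_clarifying_question", "wait"],
    ]

def value_estimate(plan: list[str]) -> float:
    """Stub for the trained utility/value evaluator. Note: it only ranks
    the given candidates; it cannot certify a plan as ethical in
    absolute terms."""
    scores = {"offer_help": 1.0, "ask_clarifying_question": 0.5,
              "wait": 0.1, "deceive_user": 2.0}
    return sum(scores.get(a, 0.0) for a in plan)

def select_plan(situation: str) -> list[str]:
    candidates = propose_plans(situation)
    # Deontological filter: drop any plan containing a prohibited action,
    # even if the value evaluator would have scored it highly.
    permitted = [p for p in candidates
                 if not PROHIBITED_ACTIONS.intersection(p)]
    # Consequentialist ranking among the surviving plans.
    return max(permitted, key=value_estimate)

print(select_plan("user asks for assistance"))
# → ['ask_clarifying_question', 'offer_help']
```

Note that the deceptive plan scores highest under the value model but never reaches the ranking step, which is the point of keeping the deontological filter separate from the evaluator.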

1Ape in the coat11h
They are not supposed to be two distinct systems. One is a subsystem of the other. There may be implementations where it's the same LLM doing all the generative work for every step of the reasoning via prompt engineering, but it doesn't have to be this way. It can be multiple more specific LLMs that went through different RLHF processes.

What is the right mathematical language in which to talk about modularity, boundaries, etc?

I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.

Even if we reformulate the question as "Which mathematical language permits identifying boundaries [of a particu... (read more)

Why are biological systems so modular? To what extent will that generalize to agents beyond biology?

See section 3. "Optimization and Scale Separation in Evolving Systems" in "Toward a theory of evolution as multilevel learning" (Vanchurin et al., 2022).

Also, see Michael Levin's work on "multiscale competency architectures". Fields, Levin, et al. apply this framework to ANNs in "The free energy principle induces neuromorphic development" (2022), see sections 2 and 4 in particular. This paper also addresses the question "How do modules/boundaries interact wi... (read more)

To what extent do boundaries/modules typically exist "by default" in complex systems, vs require optimization pressure (e.g. training/selection) to appear?

Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are "ubiquitous" in high-dimensional (i.e., complex) systems.
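As a toy illustration of the idea that sparse couplings induce boundaries: treat a system's pairwise coupling strengths as a graph, drop the weak couplings, and read off modules as connected components. The coupling values and threshold below are invented for the example.

```python
# Toy illustration (invented numbers): modules emerge as connected
# components once weak couplings are dropped; the sparse couplings
# between blocks act as the boundaries.

coupling = {
    # strong intra-module couplings
    ("a", "b"): 0.9, ("b", "c"): 0.8,
    ("d", "e"): 0.7,
    # weak coupling across the boundary between the two modules
    ("c", "d"): 0.05,
}

def modules(nodes, coupling, threshold=0.1):
    """Connected components after dropping couplings below `threshold`."""
    adj = {n: set() for n in nodes}
    for (u, v), w in coupling.items():
        if w >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(modules("abcde", coupling))  # two modules: {a, b, c} and {d, e}
```

The point is only that "boundary" here is not an extra ingredient: it falls out of the sparsity structure of the couplings themselves.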

You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling

No, I think membranes could be formalised (Markov blankets, objective "joints" of the environment as in https://arxiv.org/abs/2303.01514, etc.; though theory-laden, I think that the "diff" between the boundaries identifiable from the perspective of different theories is usually negligible).

We, humans, intrude into each other's boundaries, boundaries of animals, organisations, communities, etc. all the time. ... (read more)

1Chipmonk4d
I see please lmk when you post this. i've subscribed to your lw posts too

FWIW, I don't think the examples given necessarily break «membranes» as a "winning" deontological theory. If the patient has consented, there is no conflict. (Important note: consent does not always nullify membrane violations. In this case it does, but there are many cases where it doesn't.)

I think a way to properly understand this might be.. If Alice makes a promise to Bob, she is essentially giving Bob a piece of herself, and that changes how he plans for the future and whatnot. If she revokes that by terms not part of the original agreement, she has stolen something from Bob, and that is a violation of membranes. ?

If the AI promises to support humans under an agreement, then breaks that agreement, that is theft.

In a case like this I wonder if the theory would also need something like "minimize net boundary violations", kind of like how some deontologies make murder okay sometimes. But then this gets really close to utilitarianism and that's gross imo. So I'm not sure. Maybe there's another way to address this? Maybe I see what you mean

Getting traction on the deontic feasibility hypothesis


Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would

... (read more)
1Chipmonk5d
Okay, I'll try to summarize your main points. Please let me know if this is right:

1. You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling
2. "It seems easy to find counterexamples when intruding into someone's boundaries is an ethical thing to do and abstaining from that would be highly unethical."

Have I missed anything? I'll respond after you confirm. Also, would you please share any key example(s) of #2?
1Chipmonk8d
Ty; also this comment [https://www.lesswrong.com/posts/gbNqWpDwmrWmzopQW/is-deontological-ai-safe-feedback-draft?commentId=5ACFTTdAJgmr3K3Pt] there

All the critiques focus on MI not being effective enough at its ultimate purpose -- namely, interpretability, and secondarily, finding adversaries (I guess), and maybe something else?

Did you seriously think through whether interpretability, and/or finding adversaries, or some specific aspects or kinds of either interpretability or finding adversaries could be net negative for safety overall? Such as what was contemplated in "AGI-Automated Interpretability is Suicide", "AI interpretability could be harmful?", and "Why and When Interpretability Work is Dange... (read more)

Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, with any flaws exploited, amplified and fatal, and many lethal problems all of which one must avoid?

To answer these questions specifically, it's really important not just to consider AI--human alignment "in the abstract", but embedded in the current civilisation, with its infrastructure and incentiv... (read more)

And Zvi points out these contradictions himself:

It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements.

Strict alignment. The damn thing will actually follow some set of instructions to the letter subject to its optimization constraints, hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and there is no known good one yet, due to various alignment difficulties.

The name is uninformative and possibly misleading. If the set of instructions is in a natural or a formal language, you push the alignment difficulty into the sema... (read more)

It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, and the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?

"Human-level" is just more commonly called "value alignment" (or "alignment with human values" if you want). But I agree with the conclusion: "friendly" is an attempt at "moral fact alignment" ("humanity ... (read more)

From the perspective of ontology and memetic engineering, the whole "ontology" or classification of alignments that you give, "fragile, friendly, ..." is bad because it's not based on some theory but rather on the cacophony of commonsensical ideas. These "alignments" don't even belong to the same type: "Fragile" is an engineering approach (but there are also many other engineering approaches which you haven't mentioned!), 2-3 and 5-6 are black-box descriptions of some alignment characteristics (at least these seem to belong to the same type), and "Strict" ... (read more)

Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.

This is self-contradictory: if the surrounding dynamics strongly preclude humans from "choosing otherwise", humans are no longer "in control". Also, under certain definitions of "choosing differently", humans may be precluded from moving into different biological and computational substrates, which in itself might be a cosmic tragedy because it may forever preclude humans from realising vast amounts of potential.

1Roman Leventov15d
And Zvi points out these contradictions himself:

It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):

  1. Do what I say
  2. Also do what I mean
  3. Also do what I should have said and meant
  4. Also do what is best for me
  5. Also do what broader society or humanity says
  6. Also do what broader society or humanity means or should have said
  7. Also do what broader society or humanity should have said given their values
  8. Also do what is best for everyone
  9. Do some ideal friendly combination of all of it that a broadly good guy would do, in a way
... (read more)

the system has unintended (harmful or dangerous) goals or behaviors.

Note that judgements about the harmfulness and dangerousness of some goals or behaviours are themselves theory-laden. This is why Goal alignment without alignment on epistemology, ethics, and science is futile. From the perspective of any theory of cognition/intelligence that includes a generative model (which is not only Active Inference, but also LeCun's H-JEPA, LMCAs such as the "exemplary actor", and more theories of cognition and/or AI architectures) for performing planning-as-inferen... (read more)

Re: the virtuous cycle, I was excited recently to find Toby Smithe's work, a compositional account of Bayesian Brain, which strives to establish formal connections between ontology, epistemology, phenomenology, semantics, evolutionary game theory, and more.

Next week, Smithe will give a seminar about this work.

I asked GPT-4 to write a list of desiderata for a naturalistic (i.e., scientific) theory of ethics: https://chat.openai.com/share/1025a325-30e0-457c-a1ed-9d6b3f23eb5e. It made some mistakes but in other regards surprised me with the quality of its philosophy of science and meta-ethics.

The mistake that jumped out for me was “6. Robustness and Flexibility”:

The ethical theory should be robust and flexible, meaning it should be able to accommodate new information and adapt to different contexts and conditions. As our scientific knowledge evolves, the theory sh

... (read more)

An LMCA that uses a body of knowledge in the form of textbooks, scientific theories, and models may be updated very frequently and cheaply: essentially, every update of the scientific textbook is an update of the LMCA. No need to re-train anything.

GFlowNets have a disadvantage because they are trained for a very particular version of the exemplary actor, drawing upon a particular version of the body of knowledge. And this training will be extremely costly (billions or tens or even hundreds of billions of USD?) and high-latency (months?). By the time a hypothetical... (read more)

I thought your response would be that the H-JEPA network might be substantially faster, and so have a lower alignment tax than the exemplary LMCA.

I discuss this in section 4.5. My intuition is that LMCA with latency in tens of minutes is basically as "powerful" (on the civilisational scale) as an agent with latency of one second, there is no OODA-style edge in being swifter than tens of minutes. So, I think that Eric Schmidt's idea of "millisecond-long war" (or, a war where action unfolds at millisecond-scale cadence) just doesn't make sense.

However, these... (read more)

1Seth Herd18d
I agree that faster isn't a clear win for most real-world scenarios. But it is more powerful, because you can have that agent propose many plans and consider more scenarios in the same time. It's also probably linked to being much more cost-efficient, in compute and money. But I'm not sure about the last one.

My question is: if we already have an aligned LMCA, why would we use it to train a less interpretable H-JEPA AGI?

First, it is not less interpretable. Here, Bengio and Hu argue that GFlowNets are more interpretable than auto-regressive LLMs; but in the setup where the energy function is not explicitly given (as in some other GFlowNet training setups, e.g., for drug discovery) but rather learned from examples (as I proposed in the post), GFlowNets don't have any interpretability advantage over the AI that generates the examples, which is the aligned LMCA in th... (read more)

2Seth Herd18d
I thought your response would be that the H-JEPA network might be substantially faster, and so have a lower alignment tax than the exemplary LMCA. LMCAs are much more interpretable than the base LLMs, because you're deliberately breaking their cognition into small pieces, each of which is summarized by a natural language utterance. They're particularly reliably interpretable if you call new instances of LLMs for each piece to prevent Waluigi collapse effects, something I hadn't thought of in that first post. Because they can access sensory networks as external tools, LMCAs already have access to sensory grounding (although they haven't been coded to use it particularly well in currently-published work). A more direct integration of sensory knowledge might prove critical, or at least faster.

One line: H-JEPA probably won't save us unless we already have an aligned LLM-based cognitive architecture.

Also: see section "Conclusion" in the post

3Seth Herd18d
I think editing to make this the first line of the post would be extremely helpful in motivating people to read the post. I found the opening very confusing and probably wouldn't have pushed through if I weren't already interested in the subject matter. When the first paragraph of a post confuses me, I often assume the rest will too, and only come back to it if and when it gets upvotes and someone writes a comment that's clearer than the post. Seconding that people won't be familiar with the H-JEPA terminology. I didn't remember it even though I read that proposal. I also didn't remember the relationship between the H-JEPA architecture and a Gflow agent, so that is probably worth clarifying in the text.
3avturchin18d
I think most people have to google what exactly H-JEPA is.

Existing models of agency from fields like reinforcement learning and game theory don't seem up to the job, so trying to develop better ones might pay off.

One account of why our usual models of agency aren't up to the job is the Embedded Agency sequence - the usual models assume agents are unchanging, indivisible entities which interact with their environments through predefined channels, but real-world agents are a part of their environment. The sequence identifies four rough categories of problems that arise when we switch to trying to model embedd

... (read more)

As soon as we start talking about societal identities and therefore interests or "values" (and, in general, any group identities/interests/values), the question arises of how AI should balance individual and group interests, while also considering that there are far more than two levels of group hierarchy, and that group identity/individuality (cf. Krakauer et al., "Information theory of individuality") is a gradualistic rather than a categorical property, as well as the group's (system's) consciousness. If we don't have a principled scientific theory for... (read more)

2ukc100141mo
In response to Roman’s very good points (i have only for now skimmed the linked articles); these are my thoughts:

I agree that human values are very hard to aggregate (or even to define precisely); we use politics/economy (of collectives ranging from the family up to the nation) as a way of doing that aggregation, but that is obviously a work in progress, and perhaps slipping backwards. In any case, (as Roman says) humans are (much of the time) misaligned with each other and their collectives, in ways little and large, and sometimes that is for good or bad reasons. By ‘good reason’ I mean that sometimes ‘misalignment’ might literally be that human agents & collectives have local (geographical/temporal) realities they have to optimise for (to achieve their goals), which might conflict with goals/interests of their broader collectives: this is the essence of governing a large country, and is why many countries are federated. I’m sure these problems are formalised in preference/values literature, so I’m using my naive terms for now…

Anyway, this post’s working assumption/intuition is that ‘single AI-single human’ alignment (or corrigibility or identity fusion or delegation, to use Andrew Critch’s term) is ‘easier’ to think about or achieve, than ‘multiple AI-multiple human’. Which is why we consciously focused on the former & temporarily ignored the latter. I don’t know if that assumption is valid and I haven’t thought about (i.e. no opinion) whether ideas in Roman’s ‘science of ethics’ linked post would change anything, but am interested in it!

Bengio proposed the same thing recently, "AI scientists and humans working together". I criticised this idea here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai#AI_scientists_and_humans_working_together, and that criticism wholly applies to your post as well. It would work if the whole alignment problem consisted of aligning exactly one human with one AI and nothing else mattered.

In the societal setting, however, where humans are grossly misaligned with each other, turning everyone into a cyb... (read more)

2Garrett Baker1mo
Without reading what you wrote in the link, only your description here, I think you're mixing up two different questions:

1. How do we make it so an AI doesn't kill everyone, when those deploying the AI would reflectively prefer it not to kill everyone.
2. How do we make it so humans don't use AI to kill everyone else, or otherwise cause massive suffering, while knowing that is what they are doing and reflectively endorsing the action. I.e., mitigating misuse risks.

I do think this post is very much focused on 1, though it does make mention of getting AIs to adopt societal identities, which seems like it would indeed mitigate misuse risks. In general, and I don't speak for the co-authors here, I don't think 2 is necessary to solve 1. Nor do I think there are many sub-problems in 1 which require routing through 2, nor even do I think a solution to 2 would require a solution to ethics. And if a solution to 1 does require a solution to ethics, I think we should give up on alignment, and push full throttle on collective action solutions to not building AGIs in the first place, because that is literally the longest-lasting open problem in the history of philosophy.

I wonder whether GFlowNets are somehow better suited for self-destruction/non-finetunability than LLMs.

Apart from the potential to speed up foom, there is also a more prosaic reason why interpretability by other AIs or humans could be dangerous: interpretability could reveal infohazardous reasoning en route to inferring aligned, ethical plans: https://www.lesswrong.com/posts/CRrkKAafopCmhJEBt/ai-interpretability-could-be-harmful. So I suggested that we may need to go as far as cryptographically obfuscating the AI reasoning process that leads to "aligned" plans.

For reference, I replied to Bengio's post in a separate post: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai. TLDR: pretty much the same points that other commentators to this post are making, just in a more elaborate form.

But only if the labels are correct, and the labeling problem is potentially much harder now, because the latent variables include inscrutable information about “how the AI is thinking about / conceptualizing the things that it’s seeing / doing”. I think. And if they do, then how are the humans supposed to label them as good or bad? Like, if the AI notices someone feeling physically good but psychologically distressed, we want to label it as low-energy when the AI is thinking about the former aspect and high-energy if the AI is thinking about the latter asp

... (read more)
1Roman Leventov18d
Update: I wrote a big article "Aligning an H-JEPA agent via training on the outputs of an LLM-based "exemplary actor" [https://www.lesswrong.com/posts/MJXwnHbqFYE3N4dP2/aligning-an-h-jepa-agent-via-training-on-the-outputs-of-an] in which I develop the thinking behind the comment above (but also update it significantly). 

I believe Steven didn't imply that a significant number of people would approve or want such a future -- indeed, the opposite, hence he called the scenario "dystopian".

He basically meant that optimising surface signals of pleasure does not automatically lead to behaviours and plans congruent with reasonable ethics, so the surface elements of alignment suggested by LeCun in the paper are clearly insufficient.

1MichaelStJules1mo
I think many EAs/rationalists shouldn't find this to be worse for humans than life today on the views they apparently endorse, because each human looks better off under standard approaches to intrapersonal aggregation: they get more pleasure, less suffering, more preference satisfaction (or we can imagine some kind of manipulation to achieve this), but at the cost of some important frustrated preferences.

I just left two long comments on this post with a critique.

I believe this is the only way to design an AI whose actions we still have confidence in the desirability of, even once the AI is out of our hands and is augmenting itself to unfathomable capabilities.

I think unleashing AI in approximately the present world, whose infrastructural and systemic vulnerabilities I gestured at here, in the "Dealing with unaligned competition" section (in short: no permeating trust systems that follow the money, unconstrained "reach-anywhere" internet architecture, information massively accumulated and centralised in the datacen... (read more)

In this post, as well as your other posts, you use the word "goal" a lot, as well as related words, phrases, and ideas: "target", "outcomes", "alignment ultimately is about making sure that the first SGCA pursues desirable goal", the idea of backchaining, "save the world" (this last one, in particular, implies that the world can be "saved", like in a movie, that implies some finitude of the story).

I think this is not the best view of the world. I think this view misses the latest developments in the physics of evolution and regulative development, evolutio... (read more)

the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question "will maximizing this with arbitrary capabilities produce desirable outcomes?" where the answer does not depend, at the limit, on what does the maximization.

I don't know about other proposals because I'm not familiar with them, but Metaethical AI actually describes the machinery of the agent, hence "the answer" does depend "on what does the maximisation".

I generally disagree with the implicit claim "it's useful to try aligning AI systems via mechanism design on civilization." This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent.

I didn't imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as a part of a portfolio, I think it's indispensable.

Probably the degree to which a person (let's say, you or me) buy into the importance of mechanism/higher-level-system design for AI alignment corresponds to where w... (read more)

2Ryan Kidd1mo
I'm somewhere in the middle of the cognitivist/enactivist spectrum. I think that e.g. relaxed adversarial training is motivated by trying to make an AI robust to arbitrary inputs it will receive in the world before it leaves the box. I'm sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it. I'm not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., better cooperation [https://en.wikipedia.org/wiki/Cooperative_bargaining] and commitment mechanisms) concurrent with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.

Seems to me that you misunderstood my position.

Let's assume that GPT-5 will hit the AGI recursive self-improvement bar, so if it is bundled in an AutoGPT-style loop and tasked with a bold task such as "solve a big challenge of humanity, such as global hunger" or "make a billion-dollar business" it could meaningfully achieve a lot, via instrumental convergence and actual agency, including self-improvement, at which point it is effectively out of control.

I think that "build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + ... (read more)

3Ryan Kidd1mo
I think we agree on a lot more than I realized! In particular, I don't disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work [https://twitter.com/ryan_kidd44/status/1653896013579689984]). Things I disagree with:

* I generally disagree with the claim "alignment approaches don't limit agentic capability." This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think "aligning models" generally trades off bits of optimization pressure with "making models performance-competitive," which makes building aligned models less training-competitive for a given degree of performance.
* I generally disagree with the claim "corrigibility is not a useful, coherent concept." I think there is a (narrow) attractor basin around "corrigibility" in cognition space. Happy to discuss more and possibly update.
* I generally disagree with the claim "provably-controllable highly reliable agent design is impossible in principle." I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is vanishingly hard in practice (which informs my sense of alignment difficulty only insomuch as I hope we don't hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
* I generally disagree with the implicit claim "it's useful to try aligning AI systems via mechanism design on civilization." This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don't think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.

Ok, in this passage:

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this is a conceptual error that I want to address.

It seems that you put the first two sentences "in the mouth of people outside of AI safety", and they describe some conceptual error, while the third sentence is "yours". However, I don't understand w... (read more)

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly?

Factually, no, I don't think this is where most people's thoughts are. Apart from the stages of the engineering process that you enumerated, there are also manufacturing (i.e., training, in the case of ML systems) and operations (post-deployment phase). AI safety "thought" is mo... (read more)

2Davidmanheim1mo
Agreed - I wasn't criticizing AI safety here, I was talking about the conceptual models that people outside of AI safety have - as was mentioned in several other comments. So my point was about what people outside of AI safety think about when talking about ML models, trying to correct a broken mental model.

I did not say anything about evals and red teaming in application to AI, other than in comments where I said I think they are a great idea. And the fact that they are happening very clearly implies that there is some possibility that the models perform poorly, which, again, was the point.

Perhaps it's outdated, but it is the understanding which engineers who I have spoken to who work on reliability and systems engineering actually have, and it matches research I did on resilience most of a decade ago, e.g. this [https://www.rand.org/pubs/research_reports/RR1067.html]. And I agree that there is discussion in both older and more recent journal articles about how some firms do things in various ways that might be an improvement, but it's not the standard. And even when doing agile systems engineering, use cases more often supplement or exist alongside requirements, they don't replace them. Though terminology in this domain is so far from standardized that you'd need to talk about a specific company, or even a specific project's process and definitions to have a more meaningful discussion.

I don't disagree with the conclusion, but the logic here simply doesn't work to prove anything. It implies that standards are insufficient, not that they are not necessary.

In the language of generative models, "praxis" corresponds to cognitive and "action" disciplines, from rationality (the discipline/praxis of rational reasoning), epistemology, and ethics to dancing and pottery. The generative model (Active Inference) frame and the shard theory frames thus seem to be in agreement that disciplinary alignment ("virtue ethics") is more important (fundamental, robust) than "deontology" and "consequentialism" alignment, which roughly correspond to goal alignment and prediction ("future fact") alignment, respectively. The generati... (read more)

Classification of AI safety work

Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:

  • A monolithic AI system, e.g., a conversational LLM
  • AGI lab (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs)
  • A cyborg, human + AI(s)
  • A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to fully automatic AI economy; or a swarm of CoEms automating s
... (read more)

I’m a bit confused by your reference to Adam’s post. I interpret his post as advocating for more originality, not less, in terms of diverse alignment research agendas.

The quote by Shimi exemplifies the position with which I disagree. I believe Yudkowsky and Soares were also stating something along these lines previously, but I couldn't find suitable quotes. I don't know if any of these three people still hold this position, though.

How do you think Metzinger or Clark would specifically benefit our scholars?

I heard Metzinger reflecting on the passage of para... (read more)

1Ryan Kidd1mo
I don't disagree with Shimi as strongly as you do. I think there's some chance we need radically new paradigms for aligning AI beyond "build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans." While I do endorse some anthropocentric "value-loading"-based alignment strategies in my portfolio, such as Shard Theory and Steve Byrnes' research, I worry about overly investing in anthropocentric AGI alignment strategies. I don't necessarily think that RLHF shapes GPT-N in a manner similar to how natural selection and related processes shaped humans to be altruistic. I think it's quite likely that the kind of cognition GPT-N learns in order to predict tokens is more akin to an "alien god" than it is to human cognition. I think that trying to value-load an alien god is pretty hard.

In general, I don't highly endorse the framing of alignment as "making AIs more human." I think this kind of approach fails in some worlds and might produce models that are not performance-competitive enough to outcompete [https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334] the unaligned models others deploy. I'd rather produce corrigible models with superhuman cognition coupled with robust democratic institutions. Nevertheless, I endorse at least some research along this line, but it is not the majority of my portfolio.

A systematic way for classifying AI safety work could use a matrix, where one dimension is the system level:

  • A monolithic AI system, e.g., a conversational LLM
  • A cyborg, human + AI(s)
  • A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to fully automatic AI economy; or a swarm of CoEms automating science)
  • A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence
  • The whole civilisation, e.g.,
... (read more)

If AGI labs truly bet on AI-assisted (or fully AI-automated) science across the domains of science (the second group in your list), then research done in the following three groups will be subsumed by that AI-assisted research.

It's still important to do some research in these areas, for two reasons:

(1) hedging bets against some unexpected turn of events, such as AIs failing to improve the speed and depth of generated scientific insight, at least in some areas (perhaps governance & strategy are more iffy areas, and it's hard to become sure that strate... (read more)

1Ryan Kidd2mo
MATS' framing is that we are supporting a "diverse portfolio" of research agendas that might "pay off" in different worlds (i.e., your "hedging bets" analogy is accurate). We also think the listed research agendas have some synergy you might have missed. For example, interpretability research might build into better AI-assisted white-box auditing, white/gray-box steering (e.g., via ELK), or safe architecture design (e.g., "retargeting the search").

The distinction between "evaluator" and "generator" seems fuzzier to me than you portray. For instance, two "generator" AIs might be able to red-team each other for the purposes of evaluating an alignment strategy.