Roman Leventov

An independent researcher, blogger, and philosopher working on intelligence and agency (especially Active Inference), alignment, ethics, the interaction of the AI transition with sociotechnical risks (epistemics, economics, human psychology), collective mind architecture, and research strategy and methodology.

Twitter: E-mail: (the preferred mode of communication). I'm open to collaborations and work.

Presentations at meetups, workshops and conferences, some recorded videos.

I'm a founding member of the Gaia Consortium, on a mission to create a global, decentralised system for collective sense-making and decision-making, i.e., civilisational intelligence. Drop me a line if you want to learn more about it and/or join the consortium.

You can help boost my sense of accountability and give me a feeling that my work is valued by becoming a paid subscriber to my Substack (though I don't post anything paywalled; in fact, on that blog I just syndicate my LessWrong writing).

For Russian speakers: the Russian-language AI safety network, Telegram group


A multi-disciplinary view on AI safety


I've earlier suggested a principled taxonomy of AI safety work with two dimensions:

  1. System level:

    • monolithic AI system
    • human--AI pair
    • AI group/org: CoEm, debate systems
    • large-scale hybrid (humans and AIs) society and economy
    • AI lab, not to be confused with an "AI org" above: an AI lab is an org composed of humans, and increasingly of AIs, that creates advanced AI systems. See Hendrycks et al.'s discussion of organisational risks.
  2. Methodological time:

    • design time: basic research, math, science of agency (cognition, DL, games, cooperation, organisations), algorithms
    • manufacturing/training time: RLHF, curriculums, mech interp, ontology/representations engineering, evals, training-time probes and anomaly detection
    • deployment/operations time: architecture to prevent LLM misuse or jailbreaking, monitoring, weights security
    • evolutionary time: economic and societal incentives, effects of AI on society and psychology, governance.

So, this taxonomy is a 5×4 matrix, almost all slots of which are interesting, and some of which are severely under-explored.

I'm talking about the science of governance, digitalised governance, and theories of contracting, rather than the not-so-technical, object-level policy and governance work currently done at institutions. This is not at all to the detriment of that work; it's just a selection criterion for this post, which could reasonably focus on technical agendas that technical visitors of LW may contribute to.

The view that there is a sharp divide between "AGI-level safety" and "near-term AI safety and ethics" is itself controversial; e.g., Scott Aaronson doesn't share it. I guess this isn't a justification for including all AI ethics work that is happening, but of the NSF projects, definitely more than one (actually, most of them) appear to me, upon reading their abstracts, to be potentially relevant for AGI safety. Note that this NSF grant program is run in partnership with Open Philanthropy, and OpenPhil staff participate in the evaluation of the projects. So I don't think they would select many projects irrelevant to AGI safety.

Under "Understand cooperation", you should add Metagov (many relevant projects under this umbrella; please visit the website, in particular DAO Science) and the "ecosystems of intelligence" agenda (pursued by Verses, Active Inference Institute, Gaia Consortium, Digital Gaia, and Bioform Labs). This is often more practical than theoretical work, though, so the category names ("Theory" > "Understanding cooperation") wouldn't be totally reasonable for it, but this is also true of a lot of entries already in the post.

In general, the science of cooperation, game theory, digital assets and money, and governance is mature, with many academics working on it in different countries. Picking out just a few projects "familiar to the LessWrong crowd" only reinforces the bubble.

The "LessWrong bias" is also felt in the decision to omit all the efforts that contribute to creating a stable equilibrium for the civilisation on which an ASI can land. Here's my stab, from one month ago, at what goes into that; and here is Vitalik Buterin's stab from yesterday.

Also, speaking of pure "technical alignment/AI safety" agendas that "nobody on LW knows and talks about", check out the 16 projects already funded by the NSF's "Safe Learning-Enabled Systems" grant program. All these projects have received grants of $250k to $800k and are staffed with teams of academics at American universities.

Another area which I think is promising is the study of random networks and random features. The results in this post suggest that training a neural network is functionally similar to randomly initialising it until you find a function that fits the training data. This suggests that we may be able to draw conclusions about what kinds of features a neural network is likely to learn, based on what kinds of features are likely to be created randomly.
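As a toy illustration of this guess-and-check picture (a minimal sketch of my own; the tiny architecture, the dataset, and the sample count are arbitrary choices, not taken from the linked post), one can sample random small networks and count how often they happen to implement a "simple" function (AND) versus a "complex" one (XOR) on the same four inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four binary input points and two target functions over them.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = {
    "AND": np.array([0, 0, 0, 1]),  # linearly separable, "simple"
    "XOR": np.array([0, 1, 1, 0]),  # not linearly separable, "complex"
}

N = 50_000  # number of randomly initialised networks to sample
counts = {name: 0 for name in targets}

for _ in range(N):
    # A tiny 2-4-1 tanh network with Gaussian weights, thresholded at 0.
    W1 = rng.normal(size=(2, 4))
    b1 = rng.normal(size=4)
    W2 = rng.normal(size=4)
    b2 = rng.normal()
    pred = (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int)
    for name, y in targets.items():
        if np.array_equal(pred, y):
            counts[name] += 1

print(counts)  # random initialisation typically hits AND far more often than XOR
```

If random initialisation is a decent proxy for training, the relative frequencies of functions under random sampling would tell us which features a trained network is likely to end up with.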

If you find statistical mechanics a useful way to analyse this behaviour of NNs, check out Vanchurin's statistical mechanics theory of machine learning.

I agree with you, but it's not clear that, in the absence of explicit regularisation, DNNs, in particular LLMs, will compress to the degree that they become intelligible (interpretable) to humans. That is, their effective dimensionality might be reduced from 1T to 100M or whatever, but that would still be way too much for humans to comprehend. Explicit regularisation drives this effective dimensionality down.
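The claim that explicit regularisation drives effective dimensionality down has a classical, much simpler analogue: the effective degrees of freedom of ridge regression, df(λ) = Σᵢ sᵢ²/(sᵢ² + λ), which shrinks monotonically as the regularisation strength λ grows. A minimal sketch (the random design matrix here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))  # 200 samples, 50 raw parameters

# Singular values of the design matrix.
s = np.linalg.svd(X, compute_uv=False)

def effective_dof(s, lam):
    """Effective degrees of freedom of ridge regression:
    df(lam) = sum_i s_i^2 / (s_i^2 + lam)."""
    return float(np.sum(s**2 / (s**2 + lam)))

for lam in [0.0, 1.0, 100.0, 10_000.0]:
    print(f"lambda={lam:>8}: effective dimensionality = {effective_dof(s, lam):.1f}")
```

At λ = 0 all 50 raw parameters count; as λ grows, directions with small singular values are suppressed and the effective count falls, which is the (loose) sense in which explicit regularisation shrinks a model's effective dimensionality.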

Kolmogorov complexity is definitely a misleading path here, and it's unfortunate that Joar chose it as the "leading" example of complexity in the post. Note this passage:

However, they do not give a detailed answer to the question of precisely which complexity measure they minimise --- they merely show that this result holds for many different complexity measures. For example, I would expect that fully connected neural networks are biased towards functions with low Boolean circuit complexity, or something very close to that. Verifying this claim, and deriving similar results about other kinds of network architectures, would make it easier to reason about what kinds of functions we should expect a neural network to be likely or unlikely to learn. This would also make it easier to reason about out-of-distribution generalisation, etc.

This quote from the above comment is better:

If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.

I've expressed this idea with some links here:

Bayesian Brain theorists further hypothesise that animal brains do effectively implement something like these "simple" algorithms (adjusted to the level of generality and sophistication of the world model each animal species needs) due to the strong evolutionary pressure on energy efficiency of the brain ("The free energy principle induces neuromorphic development"). The speed-accuracy tradeoffs in brain hardware add another kind of pressure that points in the same direction ("Internal feedback in the cortical perception–action loop enables fast and accurate behavior").

Then if we combine two claims:

  • Joar's "DNNs are (kind of) Bayesian" (for reasons I don't understand, because I haven't read their papers, so I just take his word here), and
  • Fields et al.'s "brains are 'almost' Bayesian because Bayesian learning is information-efficient (= energy-efficient), and there is a strong evolutionary pressure for brains in animals to be energy-efficient",

is this an explanation of DNNs' remarkable generalisation ability? Or should more quantification be added to both of these claims to turn this into a good explanation?

Yes, this is a valid caveat and the game theory perspective should have been better reflected in the post.

Roko would probably call "the most important century" work "building a stable equilibrium to land an AGI/ASI on".

I broadly agree with you and Roko that this work is important and that it would often make more sense for people to do this kind of work than "narrowly-defined" technical AI safety.

An aspect of why this may be the case that you didn't mention is money: technical AI safety is probably bottlenecked on funding, but much of the "most important century/stable equilibrium" work is amenable to conventional VC funding, and the funders don't even need to be EA/AI x-risk/"most important century"-pilled.

In a comment on Roko's post, I offered my classification of these "stable equilibrium" systems and the work that should be done. Here I will reproduce it, with extra directions that occurred to me later:

  1. Digital trust infrastructure: decentralised identity, secure communication (see Layers 1 and 2 in the Trust Over IP Stack), proof-of-humanness, and proof of AI (such as a proof that such and such artifact was created with such and such agent, e.g., provided by OpenAI; watermarking has failed, so new robust solutions with zero-knowledge proofs are needed).
  2. Infrastructure for collective sensemaking and coordination: the infrastructure for communicating beliefs and counterfactuals, making commitments, imposing constraints on agent behaviour, and monitoring compliance. We at Gaia Consortium are doing this.
  3. Infrastructure and systems for collective epistemics: next-generation social networks and media, content authenticity, Jim Rutt's "info agents" (he advises "three different projects that are working on this").
  4. Related to the previous item, in particular, to content authenticity: systems for personal data sovereignty (I don't know any good examples besides Inrupt), dataset verification/authenticity more generally, dataset governance.
  5. Getting the science/ethics of consciousness and suffering mostly solved, and putting much more effort into biology to understand whom (or whose existence, joy, or non-suffering) the civilisation should value, to better inform the constraints and policy for economic agents (monitored and verified through the infrastructure from item 2).
  6. Systems for political decision-making and collective ethical deliberation: see Collective Intelligence Project, Policy Synth, simulated deliberative democracy. These types of systems should also be used for governing all of the above layers.
  7. Accelerating enlightenment using AI teachers (Khanmigo, Quantum Leap) and other tools for individual epistemics (Ought) so that the people who participate in governance (the previous item) could do a better job.

The list above covers all the directions mentioned in the post, and there are a few more important ones.

I pondered in the comments to Drexler's 2022 post about the Open Agency Model what the difference between an "agency" and an "agent" really comes down to. Drexler (as well as later Conjecture in their CoEm proposal) emphasised interpretability.

However, a combination of explicit modularisation and hierarchical breakdown of "agents", loss functions and inductive biases that promote sparsity in the DNN, representation engineering and alignment, and autoencoder interpretability may together cook up a "stone soup" story of agent interpretability. So, there doesn't seem to be a notable distinction between "agents" and "agencies", as you point out in the post as well.

Current reality is way messier, but you can already recognise that people intuitively fear some of these outcomes. The extrapolation of calls for treaties, international regulatory bodies, and government involvement is 'we need security services to protect humans'. A steelman of some of the 'we need freely distributed AIs to avoid concentration of power' claims is 'we fear the dictatorship failure mode'.

I think there is just no way around establishing global civilisational coherence if people want to preserve some freedoms (and not be dead), anyway.

This prompted me to write "Open Agency model can solve the AI regulation dilemma".

Discovering and mastering one's own psychology may still be a frontier where the AI could help only marginally. So, more people will become monks or meditators?
