• Co-Founder and Head of "Principles of Intelligent Behaviour in Biological and Social Systems" (
  • Research Affiliate and incoming PhD at "Alignment of Complex Systems" research group, Charles University, Prague (

I am a student in Philosophy and AI, with a particular interest in philosophy of science, political philosophy, complex systems studies, enactivist frameworks -- as they pertain to AI risk, governance and alignment. A lot of my thinking is influenced by exploring what we can learn from the study of intelligent behaviour in currently existing, natural systems (across scales and substrates) towards questions as to the nature, behaviour of and risks related to future AI systems. Among others, I am interested in a generative theory of value, the "think" interface between governance and technical aspects of AI alignment, and spend a decent chunk of time thinking about scientific and epistemological challenges specific to AI alignment research, and how to address them.

Going back further, I have also spent a bunch of time thinking about how (bounded) minds make sense of and navigate a (complex) world (rationality, critical thinking, etc.). I have several years of experience in research organization, among others from working  at FHI, CHERI, Epistea, etc. I have a background in International Relations, and spend large parts of of 2017-2019 doing complex systems inspired research on understanding group decision making and political processes with the aim of building towards an appropriate framework for "Longterm Governance".  


The Value Change Problem (sequence)
Thoughts in Philosophy of Science of AI Alignment

Wiki Contributions


I have no doubt Alexander would shine!

Happy to run a PIBBSS speaker event for this, record it and make it publicly available. Let me know if you're keen and we'll reach out to find a time.

FWIW I also think the "Key Phenomena of AI risk" reading curriculum (h/t TJ) does some of this at least indirectly (it doesn't set out to directly answer this question, but I think a lot of the answers to the question are comprise in the curriculum). 

(Edit: fixed link)

How confident are you about it not having been recorded? If not very, seems props worth checking again

Re whether messy goal-seekers can be schemers, you may address this in a different place (and if so forgive me, and I'd appreciate you pointing me to where), but I keep wondering what notion of scheming (or deception, etc.) we should be adopting when, in particular: 

  • an "internalist" notion, where 'scheming' is defined via the "system's internals", i.e. roughly: the system has goal A, acts as if it has goal B, until the moment is suitable to reveal it's true goal A.
  • an "externalist" notion, where 'scheming' is defined, either, from the perspective of an observer (e.g. I though the system has goal B, maybe I even did a bunch of more or less careful behavioral tests to raise my confidence in this assumption, but in some new salutation, it gets revealed that the system pursues B instead)
  • or an externalist notion but defined via the effects on the world that manifest (e.g. from a more 'bird's-eye' perspective, we can observe that the system had a number of concrete (harmful) effects on one or several agents via the mechanisms that those agents misjudged what goal the system is pursuing (therefor e.g. mispredicting its future behaviour, and basing their own actions on this wrong assumption)

It seems to me like all of these notions have different upsides and downsides. For example:

  • the internalist notion seems (?) to assume/bake into its definition of scheming a high degree of non-sphexishness/consequentialist cognition
  • the observer-dependent notion comes down to being a measure of the observer's knowledge about the system 
  • the effects-on-the-world based notion seems plausibly too weak/non mechanistic to be helpful in the context of crafting concrete alignment proposals/safety tooling

Yeah neat, I haven't yet gotten to reading it but is definitely on my list. Seems (and some folks suggested to me) that it's quite related to the sort of thing I'm discussing in value change problem too.

Roughly... refers to/emphasizes the dynamic interaction between agent and environment and understands behavior/cognition/agency/... to emerge through that interaction/at that interface (rather than, e.g, trying to understand them as an internal property of the agent only)

Related to my point above (and this quoted paragraph), a fundamental nuance here is the distinction between "accidental influence side effects"  and "incentivized influence effects". I'm happy to answer more questions on this difference if it's not clear from the rest of my comment.

Thanks for clarifying; I agree it's important to be nuanced here!

I basically agree with what you say. I also want to say something like: whether to best count it as side effect or incentivized depends on what optimizer we're looking at/where you draw the boundary around the optimizer in question. I agree that a) at the moment, recommender systems are myopic in the way you describe, and the larger economic logic is where some of the pressure towards homogenization comes from (while other stuff is happening to, including humans pushing to some extent against that pressure, more or less successfully); b) at some limit, we might be worried about an AI systems becoming so powerful that its optimization arc comes to sufficiently large in scope that it's correctly understood as directly doign incentivized influence; but I also want to point out a third scanrios, c) where we should be worried about basically incentivized influence but not all of the causal force/optimization has to be enacted from wihtin the boundaries of a single/specific AI system, but where the economy as a whole is sufficiently integrated with and accelerated by advanced AI to justify the incentivized influence frame (e.g. a la ascended economy, fully automated tech company singularity). I think the general pattern here is basically one of "we continue to outsource ever more consequential decisions to advanced AI systems, without having figured out how to make these systems reliably (not) do any thing in particular". 

A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necessarily the case.

Yes, I'd agree (and didn't make this clear in the post, sorry) -- the pressure towards predictability comes from a combination of the logic of performative prediction AND the "economic logic" that provide the context in which these performative predictors are being used/applied. This is certainly an important thing to be clear about! 

(Though it also can only give us so much reassurance: I think it's an extremely hard problem to find reliable ways for AI models to NOT be applied inside of the capitalist economic logic, if that's what we're hoping to do to avoid the legibilisation risk.)

Agree! Examples abound. You can never escape your local ideological context - you can only try to find processes that have some hope at occasionally pumping into the bounds of your current ideology and press beyond it - no reliably receipt (just like there is no reliably receipt to make yourself notice your own blind spot) - but there is the hope for things that in expectation and intertemporally can help us with this. 

Which poses a new problem (or clarifies the problem we're facing): we don't get to answer the question of value change legitimacy in a theoretical vacuum -- instead we are already historically embedded in a collective value change trajectory, affecting both what we value but also what we (can) know. 

I think that makes it sound a bit hopeless from one perspective, but on the other hand, we probably also shouldn't let hypothetical worlds we could never have reached weight us down -- there are many hypothetical worlds we still can reach that it is worth fighting for.

Yeah interesting point. I do see the pull of the argument. In particular the example seems well chosen -- where the general form seems to be something like: we can think of cases where our agent can be said to be better off (according to some reasonable standards/form some reasonable vantage point) if the agent can make themselves be committed to continue doing a thing/undergoing a change for at least a certain amount of time. 

That said, I think there are also some problems with it. For example, I'm wary of reifying "I-as-in-CEV" more than what is warranted. For one, I don't know whether there is a single coherent "I-as-in-CEV" or whether there could be several; for two, how should I apply this argument practically speaking given that I don't know what "I-as-in-CEV" would consider acceptable. 

I think there is some sense in which proposing to use legitimacy as criterion has a flavour of "limited ambition" - using it will in fact mean that you will sometimes miss out of making value changes that would have been "good/acceptable" from various vantage points  (e.g. legitimacy would say NO to pressing the button that would magically make everyone in the world peaceful/against war (unless the button involves some sophisticated process that allows you to back out legitimacy for everyone involved)).  At the same time, I am wary we cannot give up on legitimacy without risking much worse fates, and as such, I feel currently fairly compelled to opt for legitimacy form an intertemporal perspective.

Load More