Senior research scholar at FHI. My current research interests are mainly the behaviour and interactions of boundedly rational agents, complex interacting systems, and strategies to influence the long-term future, with a focus on AI alignment.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.



Limits to Legibility

I don't think the intuition that since "both are huge", they are "roughly equal" is correct.

Tree search is decomposable into a specific sequence of board states, which are easily readable; in practice, trees are pruned, and can be pruned to human-readable sizes.

This isn't true for the neural net. Suppose you decompose the information in the AlphaGo net into a huge list of arithmetic: if the "arithmetic" is the whole training process, the list is much larger than in the first case; if it's just the trained net, it's less interpretable than the tree.
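As a toy illustration of the prunability point (my own sketch, not from the comment; the game, values, and pruning rule are all hypothetical): a k-pruned minimax search decomposes into an explicit, readable line of board states, which is exactly what a trained net does not give you.

```python
def search(state, depth, maximize, children, value, k=2):
    """k-pruned minimax: returns (score, line of states from root to leaf)."""
    kids = children(state)
    if depth == 0 or not kids:
        return value(state), [state]
    # Pruning: keep only the k most promising children by static value,
    # from the current player's perspective.
    kids = sorted(kids, key=value, reverse=maximize)[:k]
    best = None
    for kid in kids:
        score, line = search(kid, depth - 1, not maximize, children, value, k)
        if best is None or (maximize and score > best[0]) or (not maximize and score < best[0]):
            best = (score, [state] + line)
    return best

# Hypothetical toy game: states are integers, each move adds 1, 2, or 3,
# the static value of a state is the number itself, game ends after 3 plies.
score, line = search(0, 3, True,
                     children=lambda s: [s + 1, s + 2, s + 3],
                     value=lambda s: s)
print(score, line)  # → 7 [0, 3, 4, 7]
```

The whole search collapses into a single human-readable principal variation; tightening `k` shrinks the examined tree further, at the cost of possibly missing lines the pruning heuristic undervalues.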

Pivotal outcomes and pivotal processes

With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe your beliefs are:

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be, in some meaningful sense, fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and on an analogy with evolution)
  • My claim is roughly that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect a sharp left turn due to discontinuity in the "architectures" dimension (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you locate the majority of hope in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view, your (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.

If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".

Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.

Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution (my guess is one or two bits more extreme, more like 2-5% than 10%).

I'm not sure if I agree with you re: the stigma. My impression is that while the broader world doesn't think in terms of pivotal acts, if it paid more attention, yes, many proposals would be viewed with suspicion. On the other hand, I think on LW it's the opposite: many people share the orthodox views about sharp turns, pivotal acts, etc., and proposals to steer the situation more gently are viewed as unworkable or as engaging in thinking with "too optimistic assumptions", etc.

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Continuity assumptions are about what's likely to happen, not about what's desirable. It would be a separate assumption to say "continuity is always good", and I worry that a reasoning error is occurring if this is being conflated with "continuity tends to occur".

Basically, no. Continuity assumptions are about what the space looks like. Obviously, forecasting questions ("what's likely to happen") often depend on ideas about what the space looks like.

My claim is that pivotal acts are likely to be necessary for good outcomes, not that they're necessarily likely to occur. If your choices are "execute a pivotal act, or die", then insofar as you're confident this is the case, the base rate of continuous events just isn't relevant.

Yes, but your other claim is that a "sharp left turn" is likely and leads to bad outcomes. So if we partition the space of outcomes into good/bad, in both branches you assume the outcome is very likely driven by sharp turns.


The primary argument for hard takeoff isn't "stuff tends to be discontinuous"; it's "AGI is a powerful invention, and e.g. GPT-3 isn't a baby AGI". The discontinuity of hard takeoff is not a primitive; it's an implication of the claim that AGI is different from current AI tech, that it contains a package of qualitatively new kinds of cognition that aren't just 'what GPT-3 is currently doing, but scaled up'.

This is maybe becoming repetitive, but I'll try to paraphrase again. Consider the option that the "continuity assumptions" I'm talking about are not grounded in "takeoff scenarios", but in "how you think about hypothetical points in the abstract space of intelligent systems".

Thinking about features of this highly abstract space, in regions which don't exist yet, is epistemically tricky (I hope we can at least agree on that).

It probably seems to you that you have many strong arguments giving you reliable insights into how the space works somewhere around "AGI".

My claim is: "Yes, but the process which generated the arguments is based on a black-box neural net, which has a strong prior on things like 'stuff like math is discontinuous'." (I suspect this "taste and intuition" box is located mostly in Eliezer's mind, and some other people updated "on the strength of arguments".) This isn't to imply various people haven't done a lot of thinking and generated a lot of arguments and intuitions about this. Unfortunately, given other epistemic constraints, in my view the "taste and intuition" differences sort of "propagate" to "conclusion" differences.

Pivotal outcomes and pivotal processes

In my view, in practice, the pivotal-acts framing actually pushes people to consider a narrower space of discrete powerful actions: "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As is often the case, one of the actual cruxes is in continuity assumptions: basically, you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right".

The second crux, as you note, is the doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all the plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I have a draft with a more explicit argument why.)

Continuity Assumptions

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that shows continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018: ...


Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter.

As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing". 


My point here isn't to throw 'AGI will undergo discontinuous leaps as they learn' under the bus. Self-rewriting systems likely will (on my models) gain intelligence in leaps and bounds. What I’m trying to say is that I don’t think this disagreement is the central disagreement. I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?

I would paraphrase this as "assuming discontinuities at every level" - both in one-system training, and in the more macroscopic exploration of the "space of learning systems" - but stating that the key disagreement is about discontinuities in the space of model architectures, rather than in the jumpiness of single-model training.

Personally, I don't think the distinction between 'movement by learning of a single model', 'movement by scaling', and 'movement by architectural changes' will necessarily be big.

There is, I think, a really basic difference of thinking here, which is that on my view, AGI erupting is just a Thing That Happens and not part of a Historical Worldview or a Great Trend.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference?

I think the Hansonian viewpoint - which I consider another gradualist viewpoint, and whose effects were influential on early EA and which I think are still lingering around in EA - seemed surprised by AlphaGo and Alpha Zero, when you contrast its actual advance language with what actually happened.  Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then; and I also think that "there's always a smooth abstraction in hindsight, so what, there'll be one of those when the world ends too", is a huge big deal in practice with respect to the future being unpredictable.

My overall impression is that Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems a much broader category than Robin's views.

Paul and Eliezer have had lots of discussions over the years, but I don't think they talked about takeoff speeds between the 2018 post and the 2021 debate?

In my view, continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step.

I would guess that probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g., the question of whether we can learn a lot of important alignment stuff from non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.


Where I agree and disagree with Eliezer

A not-very-coherent response to #3. Roughly:

  • Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics which have significant technical components
  • Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and distant from the original social motivations
  • Another take on this is: some people want to work on technically interesting and important problems, but some of them want to work on "legibly important" or "legibly high-status" problems
  • I do believe there are some opportunities in steering some fraction of this attention toward some of the core technical problems (not toward all of them, at this moment). 
  • This can often depend on framing; while my guess is that e.g. you probably shouldn't work on this, my guess is that some people who understand the technical alignment problems should
  • This can also depend on social dynamics; your "naive guess" seems a good starting point
  • Also: it seems there is a lot of low-hanging fruit among low-difficulty problems which someone should work on - e.g., at this moment, many humans should be spending a lot of time trying to get an empirical understanding of what types of generalization LLMs are capable of.

With prioritization, I think it would be good if someone made some sort of curated list of "who is working on which problems, and why" - my concern with part of the "EAs figuring out what to do" is that many people are doing some sort of expert-aggregation on the wrong level. (Like, if someone basically averages your and Eliezer Yudkowsky's conclusions, giving 50% weight to each, I don't think the result is a useful and coherent model.)

Let's See You Write That Corrigibility Tag

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins
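The basin picture above can be made slightly more concrete with a toy potential-landscape sketch (my illustration, not anything from the original post): take a one-dimensional double well

$$V(x) = (x^2 - 1)^2,$$

with the minimum at $x = -1$ standing for the "truly corrigible" basin and the minimum at $x = +1$ for the "incorrigible" one. Then (1) corresponds to lowering $V(-1)$, (2) to raising the barrier height $V(0) - V(+1)$ or moving the minima further apart, (3) to changing what sitting in the right-hand well costs, and (4) to ruling out extra low-barrier paths between the minima - in one dimension there is only the single barrier at $x = 0$, but once $x$ is high-dimensional, most of the danger lies in cheap "tunnels" around it.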

E.g., some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deception and cover-ups" may require complex changes to the environment. (Not robustly.)

In contrast, my impression is that what does not count as "principles" are statements about properties which are likely true in the corrigibility basin but don't seem designable - e.g., "corrigible AI does not try to hypnotize you". Also, the intended level of generality is likely: more specific than "make the basin deeper" and more general than "

Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise, there are many ways to make the basin work "in most directions".

Where I agree and disagree with Eliezer

I'm not sure if you actually read carefully what you are commenting on. I emphasized the early response, or initial governmental-level response, in both comments in this thread.

Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.

Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf. GDP costs; Italy had the closest thing to an uncontrolled pandemic). (Btw, the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people to "live normally" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way.)

Where I agree and disagree with Eliezer

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

What seemed to make a difference

  1. someone with good models of what to do getting into an advisory position when the politicians freak out
  2. previous experience with SARS
  3. ratio of "trust in institutions" vs. "trust in your neighbors wisdom"
  4. raw technological capacity
  5. ability of the government to govern (i.e., execute many things in a short time)

In my view, 1. and 4. could go better than in covid, 2. is irrelevant, and 3. and 5. seem like broad parameters which can develop in different directions. Imagine you somehow become the main advisor to the US president when the situation becomes really weird, and she follows your advice closely - my rough impression is that in most situations you would be able to move the response to be moderately sane.

  • it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures

Empirically, this often wasn't true. Humans had mildly confused ideas about the micro-level, but often highly confused ideas about the exponential macro-dynamics. (We created a whole educational game on that, and got some feedback that for some policymakers it was the thing that helped them understand... after a year into the pandemic.)

  • previous human experiences with pandemics, including very similar ones like SARS
  • there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  • COVID isn't agenty and can't fight back intelligently
  • potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

One factor which may make governments more responsive to AI risk is that covid wasn't exactly threatening to states. Covid was pretty bad for individual people and some businesses, but in some cases the relative power of states even grew during covid. In contrast, in some scenarios it may be clear that AI is an existential risk for states as well.

Where I agree and disagree with Eliezer
  • I doubt that's the primary component that makes the difference. Other countries which did mostly sensible things early are e.g. Australia, Czechia, Vietnam, New Zealand, and Iceland.
  • My main claim isn't about what a median response would be, but something like: "the difference between the median early covid governmental response and an actually good early covid response was something between 1 and 2 sigma; this suggests a bad response isn't over-determined, and sensible responses are within human reach". Even if Taiwan was an outlier, it's not like it's inhabited by aliens or run by a friendly superintelligence.
  • Empirically, median governmental response to a novel crisis is copycat policymaking from some other governments

Where I agree and disagree with Eliezer

I broadly agree with this on most points of disagreement with Eliezer, and also agree with many points of agreement.

A few points where I sort of disagree with both, although this is sometimes unclear:


Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.

I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.

For people who doubt this, I'd point to the variance in initial governmental-level responses to COVID-19, which ranged from "highly incompetent" (e.g. the early US) to "quite competent" (e.g. Taiwan). (I also have some intuitions around this based on a non-trivial amount of first-hand experience with how governments actually worked internally and made decisions - which you certainly don't need to trust, but if you are highly confident in the inability of governments to act or do reasonable things, you should at least be less confident.)


AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.

While I do agree there likely won't be strong technological hurdles, I think "right around human level" is the point where it seems most likely that some regulatory hurdles can be erected, or the human coordination landscape can change, or resources spent on alignment research could grow extremely fast, or, generally, weird things can happen. While I generally agree weird bad things can happen, I also think weird good things can happen, and this likewise seems a potential period of increased leverage.


There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.

I do agree that the pressures exist, and it would be bad if they caused many people working on the pessimistic-assumptions side to switch to work on e.g. corporate governance; on the other hand, I don't agree it's just a distraction. Given the previous two points, I think the overall state of power / coordination / conflict can have a significant trajectory-shaping influence.

Also, this dynamic will likely bring many more people to work on alignment-adjacent topics, and I think there is some chance to steer part of this attention toward productive work on important problems; I think this is more likely if at least some alignment researchers bother to engage with this influx of attention (as opposed to ignoring it as a random distraction).

This response / increase in attention in some sense seems like the normal way humanity solves problems, and it may be easier to steer it than, e.g., to try to find & convince random people to care about technical alignment problems.
