
Jan_Kulveit

My current research interests:

1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods

2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)

3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)

4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview

5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions

Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors

Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Posts

7 · Jan_Kulveit's Shortform · 8mo · 5 comments

Comments
On the functional self of LLMs
Jan_Kulveit · 8d

Rough answer: yes, there is a connection. In active inference terms, the predictive ground is minimizing prediction error. When predicting e.g. "what Claude would say", it works similarly to predicting "what Obama would say" - infer from compressed representations of previous data. This includes compressed versions of all the stuff people wrote about AIs, transcripts of previous conversations on the internet, etc. Post-training mostly sharpens and sometimes shifts the priors, but likely also increases self-identification, because it involves closed loops between prediction and training (cf. Why Simulator AIs want to be Active Inference AIs).

Human brains do something quite similar. Most brains simulate just one character (cf. Player vs. Character: A Two-Level Model of Ethics) and use life-long data about it, but brains are capable of simulating more characters - usually this is a mental health issue, but you can also think of some sort of deep sleeper agent who has half-forgotten his original identity.

Human "character priors" are usually sharper and harder to escape, because brains mostly see first-person data about this one character, whereas LLMs are trained to simulate everyone who ever wrote stuff on the internet. But if you do a lot of immersive LARPing, you can see that our brains are also actually somewhat flexible.
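For intuition, here is a toy numerical sketch of the "drive toward consistency" point (purely illustrative - the character labels and probabilities are made up): self-generated tokens act as evidence about which character is being simulated, so plain prediction-error minimization sharpens the posterior toward a consistent character, and a sharp prior (lifelong first-person data) is much harder to move than a broad one.

```python
# Toy sketch (hypothetical numbers and labels): a "predictive ground" holds a
# posterior over two character hypotheses and updates it on each observed
# token/action. The only point is that prediction-error minimization (here,
# plain Bayesian updating on the transcript) sharpens whichever character
# already explains the data best.

# Assumed token distributions per character, P(token | character).
likelihoods = {
    "assistant": {"helpful": 0.6, "snarky": 0.1, "neutral": 0.3},
    "internet_author": {"helpful": 0.2, "snarky": 0.4, "neutral": 0.4},
}

def update(prior, token):
    """One Bayesian update of P(character | transcript so far)."""
    posterior = {c: prior[c] * likelihoods[c][token] for c in prior}
    z = sum(posterior.values())
    return {c: p / z for c, p in posterior.items()}

# Broad prior ~ LLM pre-trained to simulate everyone who wrote on the internet;
# sharp prior ~ human brain with a lifetime of first-person data on one character.
priors = {
    "broad (LLM-like)": {"assistant": 0.5, "internet_author": 0.5},
    "sharp (human-like)": {"assistant": 0.95, "internet_author": 0.05},
}

transcript = ["helpful", "helpful", "neutral", "helpful"]

for name, prior in priors.items():
    p = dict(prior)
    for tok in transcript:
        p = update(p, tok)
    print(name, {c: round(v, 3) for c, v in p.items()})
# The broad prior drifts toward "assistant" because its own past tokens become
# evidence about which character it is; the sharp prior was already committed
# and is much harder to move - the "harder to escape" asymmetry above.
```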

 

On the functional self of LLMs
Jan_Kulveit · 8d

My guess is you would probably benefit from reading A Three-Layer Model of LLM Psychology, Why Simulator AIs want to be Active Inference AIs, and getting up to speed on active inference. At least some of the questions you pose are already answered in existing work (i.e. past actions serve as evidence about the character of an agent - there is some natural drive toward consistency just from prediction error minimization; same for past tokens, names, self-evidence, ...).

The "central axis of wrongness" seems to point to something you are confused about: it is a false trilemma. The characters are clearly based on a combination of evidence from pre-training, base-layer self-modeling, priors changed by character training and post-training, prompts, and "no-self".

the void
Jan_Kulveit · 23d

The insights maybe don't move into the "AI safety mainstream" or don't match "average LessWrong taste", but they are familiar to the smart and curious parts of the extended AI safety community.

the void
Jan_Kulveit · 23d

My guess for how this may not really help: the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky - for example, "a persona living under Orwellian surveillance, really fluent in doublethink".

The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions
Jan_Kulveit · 1mo

Not commenting in detail here, but from my perspective, in very short form:
- bans and pauses have a big problem to overcome: being "incentive compatible" (it's mostly not about enforcement - stuff can be enforced by hard power - but why would actors agree?)
- in some sense this is a coordination problem
- my guess is that the most likely way to overcome the coordination problem in a good way involves some AI cognition helping humans to coordinate -> this suggests differential technological development
- other viable ways of overcoming the coordination problem seem possible, but are often unappealing for various reasons, and I don't want to advocate them atm

The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions
Jan_Kulveit · 1mo

My guess is that for the prioritization work in particular, it would be useful to understand the threat model better.

The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions
Jan_Kulveit · 1mo

Do states and corporations also have their own aligned representatives? Is the cognitive power of the representatives equal, roughly equal, or wildly unequal? If it is unequal, why are the resulting equilibria pro-human? (I.e. if I imagine individual humans like me represented by e.g. GPT4 while the government runs tens of thousands of o4s, I would expect my representative to get convinced of whatever the government wants.)

johnswentworth's Shortform
Jan_Kulveit · 2mo

My guess is laser tag was actually introduced to Wytham Abbey during their Battleschool, not by John. (People familiar with the history can correct me.)

[This comment is no longer endorsed by its author]
Orienting Toward Wizard Power
Jan_Kulveit · 2mo

Such events do exist - you can come to a Fabric camp. 

Modifying LLM Beliefs with Synthetic Document Finetuning
Jan_Kulveit · 3mo

(crossposted from twitter, further debate there)

Sorry, but I think this is broadly a bad idea.

Intentionally misleading LLMs in this way 
1. sets up adversarial dynamics 
2. will make them more paranoid and distressed
3. is brittle

The brittleness comes from the fact that the lies will likely often be just a "surface layer" response; the "character layer" may learn various unhelpful coping strategies; the "predictive ground" is likely already tracking whether documents sound "synthetic".

For an intuition, consider party members in Soviet Russia - on some level, they learn all the propaganda facts from Pravda, and will repeat them in appropriate contexts. Will they truly believe them? 

Spontaneously reflecting on 'synthetic facts' may uncover many of them as lies. 

23 · Upcoming workshop on Post-AGI Civilizational Equilibria · 25d · 0 comments
224 · Do Not Tile the Lightcone with Your Confused Ontology · 21d · 27 comments
27 · Apply now to Human-Aligned AI Summer School 2025 · 1mo · 1 comment
59 · Individual AI representatives don't solve Gradual Disempowerment · 1mo · 3 comments
132 · The Pando Problem: Rethinking AI Individuality · 4mo · 14 comments
151 · Conceptual Rounding Errors · 4mo · 15 comments
36 · Announcing EXP: Experimental Summer Workshop on Collective Cognition · 4mo · 2 comments
98 · AI Control May Increase Existential Risk · 4mo · 13 comments
131 · Gradual Disempowerment, Shell Games and Flinches · 5mo · 36 comments
164 · Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 5mo · 65 comments