Jan_Kulveit

My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Comments

The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. The high karma and high Alignment Forum karma show that a combination of famous author and correct conclusion wins over whether the argument itself is valid.

  1. I expect the "first AGI" to be reasonably modelled as a composite structure, in a similar way as a single human mind can be modelled as composite.
  2. The "top" layer in the hierarchical agency sense isn't necessarily the more powerful / agenty: the superagent/subagent direction is completely independent from relative powers. For example, you can think about Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
  3. I think the nature of the problem here is somewhat different from typical research questions in e.g. psychology. As discussed in the text, one place where having a mathematical theory of hierarchical agency would help is making us better at specifying value evolution. I think this is the case because such a specification would be more robust to scaling of intelligence. For example, compare a learning objective
    a. specified as minimizing KL divergence between some distributions
    b. specified in natural language as "you should adjust the model so the things read are less surprising and unexpected"
    You can use objective b. + RL to train/finetune LLMs, exactly like RLAIF is used to train "honesty", for example.
    A possible problem with b. is that the implicit representations of natural-language concepts like honesty or surprise are likely not very stable: if you trained a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/... works mostly because it is relatively shallow: more compute goes into pre-training. (See the sketch below for the contrast.)
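
A minimal sketch of the contrast between a. and b., assuming PyTorch; the function names and the `judge` callable are hypothetical stand-ins for illustration, not anything from the post:

```python
import torch
import torch.nn.functional as F

# Objective a.: a precise mathematical specification.
# Minimizing KL(p_data || p_model) over next tokens is, up to a constant that
# does not depend on the model, the standard cross-entropy loss.
def objective_a(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); target_tokens: (batch, seq)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
    )

# Objective b.: the natural-language specification, operationalised RLAIF-style
# by asking a judge model to score outputs. `judge` is a hypothetical callable
# mapping a prompt string to a score; the reward is only as stable as the
# judge's implicit concept of "surprising and unexpected".
def objective_b(generated_text: str, judge) -> float:
    prompt = (
        "On a scale of 1 to 10, how unsurprising and expected is the "
        "following text?\n\n" + generated_text
    )
    return float(judge(prompt))  # used as a reward signal for RL fine-tuning
```

The difference the comment points at: the minimum of objective a. is a fixed mathematical object no matter how capable the trained model becomes, while objective b. bottoms out in the judge's implicit representation of the concepts, which may not stay sensible under heavy optimisation.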

I guess make one? Unclear if hierarchical agency is the true name.

There was some selection of branches, and one pass of post-processing.

It was after ~30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". The start of this conversation was:

Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and similar techniques. In some sense you can think about my answers as trying to steer some thought process you (or the reader) does, but hoping you figure out a lot of things yourself. I hope the transcript of conversation in edited form could be published at ... and read by ...

Overall my guess is this improves clarity a bit and dilutes how much thinking per character there is, creating a somewhat less compressed representation. My natural style is probably on the margin too dense / hard to parse, so I think the result is useful.

To add some nuance...

While I think this is a very useful frame, particularly for people who have oppressive legibility-valuing parts, and it is likely something many people would benefit from hearing, I doubt it is great as a descriptive model.

A model which is in my view closer to reality is: there isn't that sharp a difference between "wants" and "beliefs", and both "wants" and "beliefs" do update.

Wants are often represented by not-very-legible taste boxes, but these boxes do update upon being fed data. To continue an example from the post, let's talk about literal taste and ice cream. While whether you want or don't want, or like or don't like, an ice cream sounds like pure want, it can change, develop or even completely flip based on what you do. There is the well-known concept of acquired taste: maybe the first time you see a puerh ice cream on offer, it does not seem attractive. Maybe you taste it and still dislike it. But maybe, after doing it a few times, you actually start to like it. The output of the taste box changed. The box will likely also update if some flavour of ice cream is very high-status in your social environment; when you get horrible diarrhea from the meal you ate just before the ice cream; and in many other cases.

Realizing that your preferences can and do develop obviously opens the Pandora's box of actions which do change preferences.[1] The ability to do that breaks orthogonality. Feed your taste boxes slop and you may start enjoying slop. Surround yourself with people who do a lot of [x] and ... you may find you like and do [x] as well, not because someone told you "it's the rational thing to do", but because of learning, dynamics between your different parts, etc.

  1. ^

    Actually, all actions do!

I hesitated between Koyaanisqatsi and Baraka! Both are some of my favorites, but in my view Koyaanisqatsi actually has notably more of an agenda and a more pessimistic outlook.

Answer by Jan_Kulveit

Baraka: A guided meditation exploring the human experience; topics like order/chaos, modernity, green vs. other mtg colours.

More than "connected to something in sequences" it is connected to something which straw sequence-style rationality is prone to miss. Writings it has more resonance with are Meditations on Moloch, The Goddess of Everything Else, The Precipice.

There isn't much to spoil: it's a 97-minute-long nonverbal documentary. I would highly recommend watching it on as large a screen and in as good quality as you can; watching it on a small laptop screen is a waste.

A Central European experience, which is unfortunately becoming relevant also for the current US: for world-modelling purposes, you should have hypotheses like 'this thing is happening because of a Russian intelligence operation' or 'this person is saying what they are saying because they are a Russian asset' in your prior with nontrivial weights.

I expected a quite different argument for empathy:

1. Argument from simulation: the most important part of our environment is other people; people are very complex and hard to predict; fortunately, we have hardware which is extremely good at 'simulating a human' - our individual brains. To guess what another person will do, or why they are doing what they are doing, it seems clearly computationally efficient to just simulate their cognition on my brain. Fortunately for empathy, such simulations activate some of the same proprioceptive machinery and goal-modeling subagents, so the simulation leads to similar feelings.

2. Mirror neurons: it seems we have a powerful dedicated system for imitation learning, which is extremely advantageous for overcoming the genetic bottleneck. Mirroring activation patterns leads to empathy.

My personal impression is you are mistaken and the innovation has not stopped, but part of the conversation has moved elsewhere. E.g. taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is a high priority to write them up for LW, because they don't fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy 
- topics a large crowd cares about (e.g. mech interpretability)
- or topics some prolific and good writer cares about (e.g. people will read posts by John Wentworth)
Hot take, but the community loosely associated with active inference is currently a better place to think about agent foundations; workshops on topics like 'pluralistic alignment' or 'collective intelligence' have in total more interesting new ideas about what was traditionally understood as alignment; parts of AI safety went totally ML-mainstream, with the fastest conversation happening on X.
 
