Jan_Kulveit

My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs?
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague.  Formerly research fellow Future of Humanity Institute, Oxford University

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Wiki Contributions

Comments

Mendel's Laws seem counterfactual by about ˜30 years, based on partial re-discovery taking that much time. His experiments are technically something which someone could have done basically any time in last few thousand years, having basic maths

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time I do not think parts of your argument in the post are locally valid or good justification for the claim.

Correct and locally valid argument why GPTs are not capped by human level was already written here.

In a very compressed form, you can just imagine GPTs have text as their "sensory inputs" generated by the entire universe, similarly to you having your sensory inputs generated by the entire universe. Neither human intelligence nor GPTs are constrained by the complexity of the task (also: in the abstract, it's the same task).  Because of that, "task difficulty" is not a promising way how to compare these systems, and it is necessary to look into actual cognitive architectures and bounds. 

With the last paragraph, I'm somewhat confused by what you mean by "tasks humans evolved to solve". Does e.g. sending humans to the Moon, or detecting Higgs boson, count as a "task humans evolved to solve" or not? 

I sort of want to flag this interpretation of whatever gossip you heard seems misleading/only telling small part of the story, based on my understanding.

I would imagine I would also react to it with smile in the context of an informal call. When used as brand / "fill interest form here" I just think it's not a good name, even if I am sympathetic to proposals to create more places to do big picture thinking about future.

Sorry, but I don't think this should be branded as "FHI of the West".

I don't think you personally or Lightcone share that much of an intellectual taste with FHI or Nick Bostrom - Lightcone seems firmly in the intellectual tradition of Berkeley, shaped by orgs like MIRI and CFAR. This tradition was often close to FHI thoughts, but also quite often at tension with it. My hot take is you particularly miss part of the generators of the taste which made FHI different from Berkeley. I sort of dislike the "FHI" brand being used in this way.

edit: To be clear I'm strongly in favour of creating more places for FHI-style thinking, just object to the branding / "let's create new FHI" frame. Owen expressed some of the reasons better and more in depth

Jan_Kulveit3moΩ242

You are exactly right that active inference models who behave in self-interest or any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control theory type cirquits' on the bodies (an extremely impressive optimization task is even how to build a body from a single cell...). This evolutionary older circuitry likely encodes a lot about what the evolution 'hopes for' in terms of what states the body will occupy. Subsequently, when building predictive/innocent models and turning them into active inference, my guess a lot of the specification is done by 'fixing priors' of interoceptive inputs on values like 'not being hungry'.  The later learned structures than also become a mix between beliefs and goals: e.g. the fixed prior on my body temperature during my lifetime leads to a model where I get 'prior' about wearing a waterproof jacket when it rains, which becomes something between an optimistic belief and 'preference'.  (This retrodicts a lot of human biases could be explained as "beliefs" somewhere between "how things are" and "how it would be nice if they were")


But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values)
 

My current guess is any approach to alignment which will actually lead to good outcomes must include some features suggested by active inference. E.g. active inference suggests something like 'aligned' agent which is trying to help me likely 'cares' about my 'predictions' coming true, and has some 'fixed priors' about me liking the results. Which gives me something avoiding both 'my wishes were satisfied, but in bizarre goodharted ways' and 'this can do more than I can'

Answer by Jan_KulveitJan 20, 20241310


- Too much value and too positive feedback on legibility. Replacing smart illegible computations with dumb legible stuff
- Failing to develop actual rationality and focusing on cultivation of the rationalist memeplex  or rationalist culture instead
- Not understanding the problems with the theoretical foundations on which sequences are based (confused formal understanding of humans -> confused advice)

+1 on the sequence being on the best things in 2022. 

You may enjoy additional/somewhat different take on this from population/evolutionary biology (and here). (To translate the map you can think about yourself as the population of myselves. Or, in the opposite direction, from a gene-centric perspective it obviously makes sense to think about the population as a population of selves)

Part of the irony here is evolution landed on the broadly sensible solution (geometric rationality). Hower, after almost every human doing the theory got somewhat confused by the additive linear EV rationality maths, what most animals and also often humans on S1 level do got interpreted as 'cognitive bias' - in the spirit of assuming obviously stupid evolution not being able to figure out linear argmax over utility algorithms in a a few billion years

I guess not much engagement is caused by
- the relation between 'additive' vs 'multiplicative' picture being deceptively simple in formal way
- the conceptual understanding of what's going on and why being quite tricky; one reason is I guess our S1 / brain hardware runs almost entirely in the multiplicative / log world; people train their S2 understanding on linear additive picture; as Scott explains, maths formalism fails us

This is a short self-review, but with a bit of distance, I think understanding 'limits to legibility' is one of the maybe top 5 things an aspiring rationalist should deeply understand and lack of this leads to many bad outcomes in both rationalist and EA communities.

In a very brief form, maybe the most common cause of EA problem and stupidities are attempts to replace illegible S1 boxes able to represent human values such as 'caring' by legible, symbolically described, verbal moral reasoning subject to memetic pressure.

Maybe the most common cause of rationalist problems and difficulties with coordination are cases where people replace illegible smart S1 computations with legible S2 arguments.

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- majority of the claims is true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts,...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success leads at some people in the AGI labs to think about mathematical structures of human values, which is an important problem 

The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or thinking about multi-agent mind models
- the claims which are novel seem usually somewhat confused (eg human values are inaccessible to the genome or naive RL intuitions)
- the novel terminology is incompatible with existing research literature, making it difficult for alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for advancement of understanding, paradoxically, this may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than pointing to relevant existing research)

Overall, 'shards' become so popular that reading at least the basics is probably necessary to understand what many people are talking about. 

Load More