epistemic status: Speculation. The actual proposals are idealized, not meant to be exactly right. We have thought about this for less than an hour.


In this earlier post I stated a speculative hypothesis about the algorithm that a single imitator trained on collections of multiple humans would learn. Here, Joar Skalse and I made a list of some more hypotheses, all very speculative and probably each individually wrong.

The point is that if we have an imitator that imitates a single human’s text, we might (very dubiously) expect that imitator to learn basically a copy of that human. But what would an imitator learn if it is trained to imitate content generated by vast collections of humans? We can then ask: what are the implications for how it generalizes, and what can you get with finetuning?

Here is our set of idealized and almost certainly not exactly correct hypotheses (a toy code sketch contrasting a few of them follows the list):

  • Lookup table of many separate models of humans. Basically what the algorithm does is: it looks at its prompt and first figures out which human in the training set generated it (or which subculture of humans, or which other context, and so forth). It has a separate subroutine for each of them, and simply picks the correct one and executes it.
    • How does it generalize out of distribution? It basically interpolates between different humans, and still ends up looking like a typical human.
    • How far can you get with finetuning? Finetuning basically doesn’t allow you to change much: you can make it behave like a slightly different human model than what you’d otherwise get.
  • High-level latent space of human minds plus partitioned capabilities. The system has a latent space describing “what kind of human produced this prompt?” It first makes an estimate in this latent space, and that estimate parameterizes the beliefs, knowledge, and attitudes about specific topics that such a human would have.
    • How far can you get with finetuning? Finetuning doesn’t easily allow you to build a superhuman system, since each point in the latent space parameterizes specific subroutines, which inherently activate certain knowledge and capabilities and suppress others.
  • Generalized superhuman + lookup table of constraints. Explained here.
  • Meta-human model 1: meta-reasoner + superhuman general world model. The model it learns is not directly doing the same thing as an individual human at all. Rather, it is doing meta-reasoning: reasoning about the beliefs, attitudes, and so on of the person who wrote the text. It has a general model of humans and their knowledge, and based on the prompt it gets, it reasons about things like “does the writer know X?” and “what would a writer who knows X, doesn’t know Y, and feels Z about W say in this context?”
    • How does it generalize out of distribution? It is able to generalize to text generated by hypothetical humans that are previously unseen combinations of knowledge, beliefs, attitudes, and so forth. E.g., if it reads a prompt that seems to come from a Zen Buddhist Trump supporter for the first time, but all of the Trump supporters in the dataset are Christian, it can still reason through the implications of those views and come up with a realistic output.
    • How far can you get with finetuning? Finetuning with RL has the option of basically setting the parameters of the meta-reasoner to something like “this human knows everything, is very rational, and is capable of planning in all domains”. Because the meta-reasoner itself has access to a superhuman general world model (all of human knowledge, the capabilities of the most rational humans, and so forth), it is then actually able to generate superhuman behavior, more capable than anything displayed in the dataset.
  • Meta-human model 2: superhuman super psychologist. The model it learns is a kind of hybrid between a generalized superhuman and a human simulator: it reasons using the same kind of cognitive tools that humans use, but part of its reasoning ability is a very good theory of mind about different humans. In generating text it reasons about object-level things like math and history, but also separately about questions like “what kind of person wrote this?”, “what would that kind of person say about fact X, and would they know about fact Y?”. Essentially, the model is a generalized superhuman doing what a human would do if asked to act like some other character (reasoning about what that character knows and doesn’t know, the mistakes they would make, etc.), but with extreme capabilities for doing so.
    • Similar to above.
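
To make the structural differences more concrete, here is a toy Python sketch of the lookup-table, latent-space, and meta-reasoner hypotheses, written as if each were a different way of organizing a forward pass. This is purely illustrative: all class and method names are invented, the stubbed methods stand in for learned components, and it is not a claim about how an actual network would implement any of these algorithms.

```python
# Toy structural sketch of three of the hypotheses above.
# All names are invented for illustration; the "..." stubs stand in for
# learned components we are not trying to specify.


# Lookup table of many separate models of humans.
class LookupTableImitator:
    def __init__(self, human_models):
        # One independent subroutine per human / subculture / context
        # seen in training, e.g. {"human_17": some_callable, ...}.
        self.human_models = human_models

    def generate(self, prompt):
        source = self.classify_source(prompt)      # "which human wrote this?"
        return self.human_models[source](prompt)   # run only that subroutine

    def classify_source(self, prompt):
        ...  # nearest match among training humans; interpolates out of distribution


# High-level latent space of human minds plus partitioned capabilities.
class LatentSpaceImitator:
    def generate(self, prompt):
        z = self.infer_author_latent(prompt)    # point in "what kind of human?" space
        skills = self.select_capabilities(z)    # z activates some knowledge, suppresses the rest
        return self.decode(prompt, skills)

    def infer_author_latent(self, prompt):
        ...

    def select_capabilities(self, z):
        ...  # only what this kind of human knows and can do is usable

    def decode(self, prompt, skills):
        ...


# Meta-human model 1: meta-reasoner + superhuman general world model.
class MetaReasonerImitator:
    def __init__(self, world_model):
        # All knowledge pooled in one shared world model, not partitioned per human.
        self.world_model = world_model

    def generate(self, prompt):
        author = self.infer_author_properties(prompt)  # writer's beliefs, knowledge, attitudes
        # "What would a person with these properties say here?", answered
        # using the full shared world model.
        return self.world_model.simulate(author, prompt)

    def infer_author_properties(self, prompt):
        ...
```

The contrast that matters for finetuning: in the lookup-table version, finetuning can only change which per-human subroutine gets selected, whereas in the meta-reasoner version the inferred author description is just another input to a shared world model, so pushing it toward “knows everything, fully rational” could in principle elicit behavior more capable than any single human in the training data.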


 
