
See this post (especially the section about resolution) for some similar ideas: https://www.lesswrong.com/posts/EmxfgPGvaKqhttPM8/thoughts-on-the-alignment-implications-of-scaling-language

I think my current take on this is that it is plausible that current LM architectures may not scale to AGI, but the obstacles don't seem fundamental, and ML researchers are strongly incentivized to fix them. Also, you don't need a full model of the universe, or even of the human model of the universe; just as classical mechanics is not the actual nature of the universe but nonetheless is really useful in many situations, we might also expect the "minimal conditional structure" to be super useful (and dangerous). You can imagine that an AI doesn't need to understand the ineffable intricacies of consciousness to do lots of dangerous things, even things like deception or manipulation.

gjm

It isn't only the training process that limits a model's ability to (for instance) simulate conscious minds, but also the structure of the model itself. For instance, I bet there is literally no possible training that would make GPT-3 do that, because whatever weights you put in it, it isn't doing a kind of computation that's capable of simulating conscious minds. But I wouldn't bet that much at very long odds. My reason for thinking this is that each token-prediction does a computation with not that many steps to it, and it doesn't seem as if there's "room" there for anything so exciting; but maaaaaybe there's some weight vector for the GPT-3 network that, when you give it the right prompt, emits a lengthy "internal monologue" and in the process does simulate a conscious mind.

Actually, here's a kinda related question. Is the transformer architecture Turing-complete in the sense that some plausible actual transformer network like GPT-3's, with some possible set of weights and a suitable prompt, will reliably-enough simulate an arbitrary Turing machine for an arbitrary number of steps? No, because the network and its input window are both finite, so there is only a finite number of states it can be in. And maybe there's a related handwavy argument that any conscious-mind simulation needs too large a repertoire of possible states?
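To put a rough number on "finite": with fixed weights, what the network does next is determined by what is in its input window, so the number of distinct windows bounds the number of states. A back-of-envelope sketch follows; the vocabulary size and context length are figures I am assuming for GPT-3 and are not taken from this thread.

```python
import math

# Assumed figures for GPT-3 (not stated in the thread): BPE vocabulary size
# and maximum context length in tokens.
VOCAB_SIZE = 50_257
CONTEXT_LENGTH = 2_048


def log10_window_count(vocab_size: int, context_length: int) -> float:
    """log10 of the number of distinct full-length context windows, V ** n."""
    return context_length * math.log10(vocab_size)


digits = log10_window_count(VOCAB_SIZE, CONTEXT_LENGTH)
print(f"about 10^{digits:.0f} distinct full windows: huge, but finite")
```

The count is astronomically large, which is why the finiteness argument rules out Turing-completeness in the strict sense without saying much about what is practically computable.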

Related related question. Suppose you undertake, whenever your transformer's output contains "*** READ ADDRESS n ***" or "*** WRITE ADDRESS n ***", with n a non-negative integer in decimal notation, to stop its token-output at that point and give a new prompt that (in the former case) consists of an integer equal to the last thing it tried to WRITE at ADDRESS n (if any; any value will do, if it never did) and (in the latter case) consists of just "Done". Is the transformer architecture, so augmented, Turing-complete? Is there some training process that would teach it to exploit this "external memory" effectively?
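For concreteness, here is a minimal sketch of the harness described above. The `generate` callable is a hypothetical stand-in for the transformer (prompt in, continuation out). The comment does not say how the value to be written is specified, so this sketch assumes it is the last integer the model emitted before the WRITE marker; the default of 0 for never-written addresses is likewise an arbitrary choice.

```python
import re
from typing import Callable, Dict, List, Tuple

COMMAND = re.compile(r"\*\*\* (READ|WRITE) ADDRESS (\d+) \*\*\*")
INTEGER = re.compile(r"-?\d+")


def run_with_memory(
    generate: Callable[[str], str],  # hypothetical model interface
    prompt: str,
    max_turns: int = 100,
    default: int = 0,  # returned for addresses that were never written
) -> Tuple[Dict[int, int], List[str]]:
    memory: Dict[int, int] = {}
    transcript: List[str] = []
    for _ in range(max_turns):
        output = generate(prompt)
        match = COMMAND.search(output)
        if match is None:
            # No memory command: treat this as the final output.
            transcript.append(output)
            break
        # Cut the output off at the first command, as the comment proposes.
        transcript.append(output[: match.end()])
        op, address = match.group(1), int(match.group(2))
        if op == "READ":
            prompt = str(memory.get(address, default))
        else:
            # Assumption: the value to store is the last integer before the marker.
            numbers = INTEGER.findall(output[: match.start()])
            memory[address] = int(numbers[-1]) if numbers else default
            prompt = "Done"
    return memory, transcript


def scripted_model() -> Tuple[Callable[[str], str], List[str]]:
    """Stand-in for a model: replays canned outputs and records its prompts."""
    outputs = iter(["42 *** WRITE ADDRESS 7 ***", "*** READ ADDRESS 7 ***", "finished"])
    seen: List[str] = []

    def generate(prompt: str) -> str:
        seen.append(prompt)
        return next(outputs)

    return generate, seen
```

The harness is the easy part. Whether any training process would teach a real model to use such memory effectively is the open question, and nothing here answers it.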

A Turing machine is a finite automaton that has access to sufficient space for notes. A Turing machine with a very small finite automaton can simulate an arbitrary program if the program is already written down in the notes. A Turing machine with a large finite automaton can simulate a large program out of the box. ML models can obviously act like finite automata. So they are all Turing complete, if given access to enough space for making notes, possibly with initialization notes containing a large program.
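As an illustration of the "finite automaton plus notes" picture, here is a small sketch of a Turing machine whose control is just a lookup table and whose tape is an unbounded dictionary. The rule table (a binary incrementer) and all names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Finite control: (state, symbol read) -> (new state, symbol to write, head move)
Rules = Dict[Tuple[str, str], Tuple[str, str, int]]

# Hypothetical example machine: add 1 to a binary number (most significant bit first).
INCREMENT: Rules = {
    ("start", "0"): ("start", "0", +1),  # scan right to the end of the number
    ("start", "1"): ("start", "1", +1),
    ("start", "_"): ("carry", "_", -1),  # hit the blank, turn around
    ("carry", "1"): ("carry", "0", -1),  # 1 + carry = 0, keep carrying
    ("carry", "0"): ("halt", "1", 0),    # 0 + carry = 1, done
    ("carry", "_"): ("halt", "1", 0),    # ran off the left edge, add a new digit
}


def run(rules: Rules, tape_input: str, state: str = "start",
        halt: str = "halt", max_steps: int = 10_000) -> str:
    # The "notes": an unbounded tape, blank ("_") wherever nothing was written.
    tape = defaultdict(lambda: "_", enumerate(tape_input))
    head = 0
    for _ in range(max_steps):
        if state == halt:
            break
        state, tape[head], move = rules[(state, tape[head])]
        head += move
    cells = [tape[i] for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip("_")


print(run(INCREMENT, "1011"))  # prints "1100"
```

The control here has only three states. All the unboundedness lives in the tape, which is the role the external notes would play for an ML model.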

This is not at all helpful, because normal training won't produce interesting finite automata, not unless it learns from appropriate data, which is only straightforward to generate if the target finite automaton is already known. Also, even short term human memory already acts like ML models and not deliberative examination of written notes, so an LLM-based agent would need to reason in an unusual and roundabout way if it doesn't have a better architecture that continually learns from observations (and thus makes external notes unnecessary). Internal monologue is still necessary to produce complicated conclusions, but that could just be normal output wrapped in silencing tags.

gjm

I'm not sure how obvious it is that "ML models can act like finite automata". I mean, there are theorems that say things like "a large enough multi-layer perceptron can approximate any function arbitrarily well", and unless I'm being dim those do indeed indicate that for such a model there exist weights that make it implement a universal Turing machine, but I don't think that means that e.g. such weights exist that make a transformer of "reasonable" size do that. (Though, on reflection, I think I agree that we should expect that they do.) Your comment about normal training not doing that was rather the point of my final question.

Right, I don't know how much data a model stores, and how much of that can be reached through retraining, if all parameters can't be specified outright. If the translation is bad enough it couldn't quote an LLM and memorize its parameters as explicitly accessible raw data using a model of comparable size. Still, an LLM trained on actual language could probably get quite a lot smaller after some lossy compression (that I have no idea how to specify), and it would also take eons to decode from the model (by doing experiments on it to elicit its behavior). So size bounds are not the most practical concern here. But maybe the memorized data could be written down much faster with a reasonable increase in model size?

Hmm, there might be relevant limitations based on the structure of the model, but those limitations seem to be peculiar to the model under consideration. They don't seem to generalise to arbitrary systems selected for minimising predictive loss on text prediction.

That is, I don't think they're a fundamental limitation of language models, and it was the limits of language models I mostly wanted to explore in this post.

gjm

Agreed. But:

1. I was commenting on your "Moreover, the diversity and comprehensiveness of the dataset a language model is trained on will limit the capabilities it can actually attain in deployment. I.e. that a particular upper bound exists in principle, does not mean it will be realised in practice.": I think that in practice what's realisable will be limited at least as much by the structure of the model as by how it's trained. So it's not just "no matter how fancy a model we build, some plausible training methods will not enable it to do this" but also "no matter how fancy a training method we use, some plausible architectures will not be able to do this", and that seemed worth making explicit.

2. In between "current versions of GPT" and "absolutely anything that is in some sense trying to predict text" it seems like there's an interesting category of "things with the same general sort of structure as current LLMs but maybe trained differently".

(I worry a little that a definition of "language model" much less restrictive than that may end up including literally everything capable of using language, including us and hypothetical AGIs specifically designed to be AGIs.)

"no matter how fancy a training method we use, some plausible architectures will not be able to do this", and that seemed worth making explicit.

Fair enough. I'll try and add a fragment to the post making this argument (at a high level of generality, I'm too ignorant about LLM architecture details to describe such limitations in concrete terms).

 

(I worry a little that a definition of "language model" much less restrictive than that may end up including literally everything capable of using language, including us and hypothetical AGIs specifically designed to be AGIs.)

I'm using "language model" here to refer to systems optimised solely for the task of predicting text.

It is clear that in the limit LLMs are superhumanly good predictors (i.e. Solomonoff induction on text). It is less clear whether neural networks can get anywhere near that good. It is also less clear whether this would be dangerous. Suppose you ask the LLM about some physics experiment that hasn't been done yet. It uses its superhuman cognition to work out the true laws of physics, and then writes what humans would say, given the experimental results. This is smart but not dangerous. (It could be mindcrime.) The LLM could be dangerous if it predicts the output of a superintelligence. But it only goes there if it has really high generalization, i.e. it is capable of ignoring the fact that superintelligences don't exist while being smart enough to predict one. I am unsure how likely this is.

I strongly disagree with your statement here, Donald. I think that the level of capability you describe here as 'not dangerous' is what I would describe as 'extremely dangerous'. An AI agent which has super-human capabilities but restricts itself to human-level outputs because of the quirks of its training process can still accomplish everything necessary to destroy humanity. The key limiting factor in your example is not the model's capability but rather its agency.

Ok, maybe my wording should be more like, "this probably won't destroy the world if it is used carefully and there are no extra phenomena we missed."

Yeah, used carefully and intentionally by well-intentioned actors (not reckless or criminal or suicidal terrorists or...) and no big deal surprises... And no rapid further advances building off of where we've gotten so far... If all of those things were somehow true, then yeah, much less dangerous.

Sorry, by "dangerously capable" I meant "capable enough to be very dangerous" not "inherently very dangerous".

This leads to a natural question: What reflection process would change a language model towards becoming a better map of the world (rather than language in the training dataset)? Reflection only looks at the language model, doesn't look at the world, produces an improved version of the model, applies an inductive bias after the fact. This is a problem statement of epistemic rationality for AI.

At a guess, focusing on transforming information from images and videos into text, rather than generating text qua text, ought to help — no? 

That's not reflection, just more initial training data. Reflection acts on the training data it already has, the point is to change the learning problem, by introducing an inductive bias that's not part of the low level learning algorithm, that improves sample efficiency with respect to loss that's also not part of low level learning. LLMs are a very good solution to the wrong problem, and a so-so solution to the right problem. Changing the learning incentives might get a better use out of the same training data for improving performance on the right problem.

A language model retrained on generated text (which is one obvious form of implementing reflection) likely does worse as a language model of the original training data, it's only a better model of the original data with respect to some different metric of being a good model (such as being a good map of the actual world, whatever that means). Machine learning doesn't know how to specify or turn this different metric into a learning algorithm, but an amplification process that makes use of faculties an LLM captured from human use of language might manage to do this by generating appropriate text for low level learning.

We could do auto captioning of movies and videos.

Or we could just train multimodal simulators. We probably will (e.g. such models could be useful for generating videos from descriptions).

I think in the limit of text prediction, language models can learn ~all of humanity's shared world model that is represented explicitly. The things that language models can't learn are IMO:

  • Tacit knowledge of the world that we haven't represented in text
  • Underdetermined features of the world
    • Aspects of our shared world model as represented in language that do not uniquely constrain our particular universe
  • Stuff we don't know about the world

As a path to AGI, I think token prediction is too high-level, unwieldy, and bakes in a number of human biases. You need to go right down to the fundamental level and optimize prediction over raw binary streams.

The source generating the binary stream can (and should, if you want AGI) be multimodal. At the extreme, this is simply a binary stream from a camera and microphone pointed at the world.

Learning to predict a sequence like this is going to lead to knowledge that humans don't currently know (because the predictor would need to model fundamental physics and all it entails).

To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given explicitly stated (and even implied-from-text) goals != human values as a whole.

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and the general distinction drawn between tool AI and agentic AI. 

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior?

No. Simulators aren't (in general) agents. Language models were optimised for the task of next token prediction, but they don't necessarily optimise for it. I am not convinced that their selection pressure favoured agents vs a more general cognitive architecture that can predict agents (and other kinds of systems).

Furthermore, insomuch as they are actually optimisers for next token prediction, it's in a very myopic way. That is, I don't think language models will take actions to make future tokens easier to predict.

I don't think language models will take actions to make future tokens easier to predict

For an analogy, look at recommender systems. Their objective is myopic in the same way as language models: predict which recommendation will most likely result in a click. Yet they have power seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real world evidence is scant: a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube, which could lead to less personalization of recommendations.

The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power seeking strategy, it seems like power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data thereby reducing its loss, there seems to be every incentive for the model to seek power in this way. 

As I understand it, GPT-3 and co are trained via self-supervised learning with the goal of minimising predictive loss. During training, their actions/predictions do not influence their future observations in any way. The training process does not select for trying to control/alter text input, because that is something impossible for the AI to accomplish during training.

As such, we shouldn't expect the AI to demonstrate such behaviour. It was not selected for power seeking.
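To make the point about training concrete, here is a minimal sketch of how the next-token loss is scored against a fixed corpus. The `predict` callable is a hypothetical stand-in for the model, and real training batches this computation and backpropagates through it, which is omitted here.

```python
import math
from typing import Callable, Dict, Sequence


def next_token_loss(
    predict: Callable[[Sequence[str]], Dict[str, float]],  # hypothetical model
    tokens: Sequence[str],
) -> float:
    """Average negative log-likelihood of a fixed token sequence."""
    total = 0.0
    for t in range(1, len(tokens)):
        context = tokens[:t]      # always the true prefix from the corpus
        probs = predict(context)  # the model's distribution over the next token
        total -= math.log(probs[tokens[t]])
        # The prediction is scored and then discarded: the next context is
        # taken from the corpus, whatever the model happened to predict.
    return total / (len(tokens) - 1)


# Usage with a toy predictor that is uniform over a four-token vocabulary.
VOCAB = ["a", "b", "c", "d"]


def uniform(context: Sequence[str]) -> Dict[str, float]:
    return {token: 1.0 / len(VOCAB) for token in VOCAB}


loss = next_token_loss(uniform, ["a", "b", "a", "d", "c"])
```

Nothing the model outputs at step t changes what it is shown at step t + 1, so during training there is no channel through which influencing future inputs could be rewarded.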

There's an assumption that the text that language models are trained on can be coherently integrated somehow. But the input is a babel of unreliable and contradictory opinions. Training to convincingly imitate any of a bunch of opinions, many of which are false, may not result in a coherent model of the world, but rather a model of a lot of nonsense on the Internet.

Do you have much actual experience playing around with large language models?

text that language models are trained on can be coherently integrated somehow

In my experience, the knowledge/world model of GPT-3/ChatGPT is coherently integrated.

 

Training to convincingly imitate any of a bunch of opinions, many of which are false, may not result in a coherent model of the world, but rather a model of a lot of nonsense on the Internet.

This seems empirically false to my experience using language models, and prima facie unlikely. Lots of text on the internet is just reporting about underlying reality:

  • Log files
  • Research papers
  • Academic and industry reports
  • Etc.

Learning to predict such reports of reality would privilege processes that can learn the structure of reality.

 

Furthermore, text that is fact and text that is fiction are often distinguished by writing style or presentation. In my experience, large language models do not conflate fact and fiction.