Yup exactly! One way I sometimes find it helpful to classify systems is in terms of the free variables upstream of loss that are optimized during training. In the case of GPT, internal activations are causally upstream of loss for "future" predictions in the same context window, but the output itself is not causally upstream of any effect on loss other than through myopic prediction accuracy (at any one training step) - the ground truth is fixed w/r/t the model's actions, and autoregressive generation isn't part of the training game at all.


Depends on what you mean by "sacrificing some loss on the current token if that made the following token easier to predict". 

The transformer architecture in particular is incentivized to do internal computations which help its future self predict future tokens when those activations are looked up by attention, as a joint objective to myopic next token prediction. This might entail sacrificing next token prediction accuracy as a consequence of not optimizing purely for that. (this is why I said in footnote 26 that transformers aren't perfectly myopic in a sense)
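To make the "joint objective" point concrete, here is a toy scalar sketch. It is not a transformer: the single number `h` stands in for an activation that is used both for the current prediction and, via attention, by a later position. The two quadratic losses and the constants `A` and `B` are made up purely for illustration.

```python
# Toy scalar stand-in for a shared activation h (NOT a real transformer).
# Hypothetical losses: l_now depends on h through the current prediction,
# l_later depends on h because a later position attends to it.
A, B = 0.0, 2.0  # made-up optima for the two losses


def l_now(h):
    """Loss on the current token as a function of the shared activation."""
    return (h - A) ** 2


def l_later(h):
    """Loss on a later token that reads the same activation."""
    return (h - B) ** 2


h_myopic = A           # minimizes l_now alone
h_joint = (A + B) / 2  # minimizes l_now + l_later (closed form for two equal-weight quadratics)

myopic_total = l_now(h_myopic) + l_later(h_myopic)
joint_total = l_now(h_joint) + l_later(h_joint)

print("current-token loss, myopic vs joint:", l_now(h_myopic), l_now(h_joint))
print("total loss, myopic vs joint:", myopic_total, joint_total)
```

In this toy setup the jointly optimal `h` does worse on the current token than the myopic one but better in total, which is the sense in which some next-token accuracy can be traded away.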

But there aren't training incentives for the model to prefer certain predictions because of the consequences if the sampled token were to be inserted into the stream of text, e.g. making subsequent text easier to predict if the rest of the text were to continue as expected given that token is in the sequence, because its predictions have no influence on the ground truth it has to predict during training. (For the same reason there's no direct incentive for GPT to fix behaviors that chain into bad multi-step predictions when it generates text that's fed back into itself, like looping)
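A minimal sketch of why sampled tokens never enter the training signal, assuming the standard setup usually called teacher forcing (the model is always conditioned on the ground-truth prefix). The bigram table `BIGRAM` is a hypothetical stand-in for a model, and all names here are illustrative.

```python
import math
import random

# Hypothetical toy "model": a fixed bigram table giving P(next | current).
BIGRAM = {
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.6, "b": 0.4},
}


def teacher_forced_loss(tokens):
    """Average next-token cross-entropy, conditioning on the ground truth."""
    total = 0.0
    for prev, target in zip(tokens, tokens[1:]):
        total += -math.log(BIGRAM[prev][target])
    return total / (len(tokens) - 1)


def sample_next(prev, rng):
    """Sample a token from the model's prediction (generation time only)."""
    dist = BIGRAM[prev]
    return rng.choices(list(dist), weights=list(dist.values()))[0]


ground_truth = ["a", "b", "a", "b"]

# Whatever the model "would have said" at each position...
rng = random.Random(0)
would_have_sampled = [sample_next(t, rng) for t in ground_truth[:-1]]

# ...is never an input to the loss, which depends on the ground truth alone.
loss = teacher_forced_loss(ground_truth)
print(would_have_sampled, loss)
```

Note that `teacher_forced_loss` has no argument through which a sampled token could affect the targets, so in this sketch nothing rewards a prediction for its downstream effect on the text.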

Training incentives are just training incentives though, not strict constraints on the model's computation, and our current level of insight gives us no guarantee that models like GPT actually don't/won't care about the causal impact of their decoded predictions to any end, including affecting the ease of future predictions. Maybe there are arguments why we should expect it to develop this kind of mesaobjective over another, but I'm not aware of any convincing ones.


This kind of comment ("this precise part had this precise effect on me") is a really valuable form of feedback that I'd love to get (and will try to give) more often. Thanks! It's particularly interesting because someone gave feedback on a draft that the business about simulated test-takers seemed unnecessary and made things more confusing.

Since you mentioned it, I'm going to ramble on about some additional nuance on this point.

Here's an intuition pump which strongly discourages "fundamental attribution error" to the simulator:

Imagine a machine where you feed in an image and it literally opens a window to a parallel reality with that image as a boundary constraint. You can watch events downstream of the still frame unravel through the viewfinder.

If you observe the people in the parallel universe doing something dumb, the obvious first thought is that you should try a frame into a different situation that's more likely to contain smart people (or even try again, if the frame underdetermines the world and you'll reveal a different "preexisting" situation each time you run the machine).

That's the obvious conclusion in the thought experiment because the machine isn't assigned a mind-like role -- it's just a magical window into a possible world. Presumably, the reason people in a parallel world are dumb or not is located in that world, in the machinery of their brains. "Configuration" and "physics" play the same roles as in our reality.

Now, with intuition pumps it's important to fiddle with the knobs. An important way that GPT is unlike this machine is that it doesn't literally open a window into a parallel universe running on the same physics as us, which requires that minds be implemented as machines in the world state, such as brains. The "state" that it propagates is text, a much coarser grained description than microscopic quantum states or even neurons. This means that when simulacra exhibit cognition, it must be GPT -- time evolution itself -- that's responsible for a large part of the mind-implementation, as there is nowhere near sufficient machinery in the prompt/state. So if a character is stupid, it may very well be a reflection of GPT's weakness at compiling text descriptions into latent algorithms simulating cognition.

But it may also be because of the prompt. Despite its short length, the prompt parameterizes innumerable qualitatively distinct simulations, and given GPT's training distribution it's expected to sometimes "try" to simulate stupid things.

There's also another way that GPT can fail to simulate smart behavior which I think is not reducible to "pretending to be stupid", which makes the most sense if you think of the prompt as something like an automaton specification which will proceed to evolve according not to a mechanistic physics but GPT's semantic word physics. Some automata-specifications will simply not work very well -- they might get into a loop because they were already a bit repetitive, or fail to activate the relevant knowledge because the style is out-of-distribution and GPT is quite sensitive to form and style, or cause hallucinations and rationalizations instead of effective reasoning because the flow of evidence is backward. But another automaton initialization may glide superbly when animated by GPT physics.

What I've found, not through a priori reasoning but lots of toying, is that the quality of intelligence simulated by GPT-3 in response to "typical" prompts tremendously underestimates its "best case" capabilities. And the trends strongly imply that I haven't found the best case for anything. Give me any task, quantifiable or not, and I am almost certain I can find a prompt that makes GPT-3 do it better after 15 minutes of tinkering, and a better one than that if I had an hour, and a better one than that if I had a day... etc. The problem of finding a good prompt to elicit some capability, especially if it's open-ended or can be attacked in multiple steps, seems similar to the problem of finding the best mental state to initiate a human to do something well -- even if you're only considering mental states which map to some verbal inner monologue, you could search through possible constructs practically indefinitely without expecting you've hit anything near the optimum, because the number of relevant and qualitatively distinct possible mental states is astronomical. It's the same with simulacra configurations.

So one of my motivations for advocating an explicit simulator/simulacra distinction with the analogy to the extreme case of physics (where the configuration is responsible for basically everything) is to make the prompt-contingency of phenomena more intuitive, since I think most people's intuitions are too inclined in the opposite direction of locating responsibility for observed phenomena in GPT itself. But it is important, and I did not sufficiently emphasize in this post, to be aware that the ontological split between "state" and "physics" carves the system differently than in real life, allowing for instance the possibility that simulacra are stupid because GPT is weak.

I understand your reasoning much better now, thanks!

"GPT as a simulator/platform seems to me like an existence proof for a not-artificially-handicapped human level AI attractor state" is a great way to put it and a very important observation.

I think the attractor state is more nuanced than "human-level". GPT is incentivized to learn to model "everyone everywhere all at once" if you will, a superhuman task -- and while the default runtime behavior is human-level simulacra, I expect it to be possible to elicit superhuman performance by conditioning the model in certain ways or a relatively small amount of fine tuning/RL. Also, being simulated confers many advantages for intelligence (instances can be copied/forked, are much more programmable than humans, potentially run much faster, etc). So I generally think of the attractor state as being superhuman in some important dimensions, enough to be a serious foom concern.

Broadly, though, I agree with the framing -- even if it's somewhat superhuman, it's extremely close to human-level and human-shaped intelligence compared to what's possible in all of mindspace, and there is an additional unsolved technical challenge to escalate from human-level/slightly superhuman to significantly beyond that. You're totally right that it removes the arbitrariness of "human-level" as a target/regime.

I'd love to see an entire post about this point, if you're so inclined. Otherwise I might get around to writing something about it in a few months, lol.

What's weird is the level of conviction of choice J - above 90%. I have no idea why this happens.

text-davinci-002 is often extremely confident about its "predictions" for no apparent good reason (e.g. when generating "open-ended" text being ~99% confident about the exact phrasing)

This is almost certainly due to the RLHF "Instruct" tuning text-davinci-002 has been subjected to. To whatever extent probabilities output by models trained with pure SSL can be assigned an epistemic interpretation (the model's credence for the next token in a hypothetical training sample), that interpretation no longer holds for models modified by RLHF.

Ohh, I do think the super ai will likely be very bad. And soon (like 5 years), which is why I don't spend too much time worrying about the slightly superhuman assholes.

I wish the problem was going to be what you described. That would be a pretty fun cyberpunk world and I'd enjoy the challenge of writing good simulacra to fight the bad ones.

If we get it really right (which I don't think is impossible, just tricky) we should also still be able to have fun, much more fun than we can even fathom now.

hm, I have thought about this

it's not that I think the patterns of ancient/perennial assholes won't haunt reanimated language, it's just that I expect strongly superhuman AI which can't be policed to appear and refactor the lightcone before that becomes a serious societal problem.

But I could be wrong, so it is worth thinking about. & depending on how things go down it may be that the shape of the ancient asshole influences the shape of the superintelligence

I agree with the premise/ontology but disagree with the conclusion. It's pretty clear the generator of disagreement is:

I believe GPT has demonstrated that near-human-level AI is a possible steady state for multiple years

I don't think it's been a steady state nor that it will proceed as one; rather I think GPT not only marks but is causally responsible for the beginning of accelerating takeoff and is already a gateway to a variety of superhuman capabilities (or "products").

[a few minutes later] Actually, I realized I do agree, in the sense of relative updates. I've gotten too used to framing my own views in contrast to others' rather than in relation to my past self. GPT surprised me by demonstrating that an AI can be as intelligent as it is and deployed over the internet, allowed to execute code, etc, without presenting an immediate existential threat. Before, I would have guessed that by the time an AI exists which uses natural language as fluently as GPT-3 we'd have months to minutes left, lol. The update was toward slower takeoff (but shorter timelines because I didn't think any of this would happen so soon).

But from this post, which feels a bit fragmented, I'm not clear on how the conclusions about timelines and x-risk follow from the premise of platform/product distinction, and I'd like to better understand the thread as you see it.

Another variation of the duality: platform/product
