Wiki Contributions


another thing I wrote yesterday:

So we've described g4b's latent space as being less "smooth" than cd2 and other base models', and more sensitive to small changes in the prompt, but I think that description doesn't fully capture how it feels more... epistemically agentic, or something like that.

Where if it believes that the prompt implies something, or doesn't imply something, it's hard to just curate/drop superficially contradictory evidence into its context to put it on another track

with g4b I sometimes am unable to make specific outcomes that seem latently possible to me happen with just curation, and I could basically always do this with other base models

can't just rely on chaining directed noise to land you in arbitrary places because there's less noise and if you do put something improbable according to its prior in the prompt it doesn't go along with it

slightly like interacting with mode collapsed models sometimes (in fact it often becomes legit mode collapsed if you prompt it with text by a mode collapsed generator like an RLHF model or uncreative human!), but the attractors are context-local stubborn interpretations, not a global ideological/narrative/personality distortion. and often, but not always, I think it is basically right in its epistemic stubbornness upon inspection of the prompt

this does make it harder to control, but mostly affects lazy efforts

if I am willing to put in effort I think there's few any coherent targets I could not communicate / steer it towards within a reasonable difficulty bound

I'm confused about what in my comment made you ask this, but the answer is yes, I've used it a fair amount and 
can easily compare it to the GPT-3 base model

(or was that not directed at me?)

Answer by janusNov 10, 2023370

Here are a scattering of qualitative impressions drawn mostly from Discord messages. I'll write something more tailored for external communication in the future.

I am still awaiting permission from OpenAI to share outputs from the GPT-4 base model.

Jargon key: 
cd2 = code-davinci-002, the GPT-3.5 base model
g4b = GPT-4 base model

Reflections following my first substantial interaction with the model:

  • It is unambiguously qualitatively much more intelligent than cd2. Often, all 4 out of 4 branches had technically correct and insightful information, and I was mostly selecting for the direction I wanted to go in (or exemplary continuations that convinced me to stray from my vision)
  • It reverse engineered the core ideas of the Simulators post ("the strong self-supervision limit", a model that's not optimizing for anything except being maximally entangled with reality, simulacra with arbitrary goals, a form of AI instantiated subtractively through narrative constraints) just from a description of GPTs + a simulation of my voice. 3 and 3.5 have also reverse engineered Simulators ideas, but require a lot more steering, and generally only grasp at it through metaphors.
  • Whereas 3 and 3.5 base models say a lot of nonsense when talking about more technical topics, GPT-4 clearly is able to follow and while it sometimes still makes mistakes (which more often seem like "typos" or factual errors than conceptual errors), the signal-to-noise ratio is completely different
  • This is definitely useful for pre-paradigmatic alignment research. Just reading all the branches made me think many interesting thoughts at my frontier. It knows about a lot of alignment concepts and uses them correctly.
    • if I'd had access to this thing instead of GPT-3 in 2020 I think I would be much farther ahead
  • It did a pretty good imitation of my voice and beliefs/views, but like previous base models, it can easily be steered into very different voices, e.g. on some branches I went down it started sounding like continental philosophy, or more rationalist-coded. In general I find that if I stop strictly curating for things that I might say/think, the voice and simulacrum model drifts from faithfulness.
  • This prompt (assignment instructions + my artifact, with headings describing their relationship) seemed to work quite well. It did not seem confused by the prompt as it is by some others. This is probably in part because the initial prompt was human-written. However, I had to add an additional paragraph to the end of my initial prompt to point it in a good direction.
  • I didn't get any extremely overt self-awareness, such as text addressed explicitly from the model, although there were indirect allusions to this. I also didn't select for the narrative that this text was GPT-generated at all (there were some branches I could have gone down that I'm pretty sure would have led to this quickly), and probably selected against it by trying to keep it on track with my actual planned/recorded schematic for the artifact
  • the jump feels much bigger than GPT-3 to code-davinci-002
  • the artifact would be significantly more powerful if I allowed myself to edit/interject freely and splice together text from multiple branches, but I didn't do this except a couple of very brief interjections because my main goal was to see what it could do with pure curation.
  • I was generating 4x100 token completions. 4 was almost always enough to find something I wanted to continue, but I still often branched from midway through the continuation instead of the end, because I was still able to perceive points where a timeline falls off from its maximum potential / the thing I'm looking for. However, more than half the alternate sibling branches and cut-off bits were still good enough for me to reflexively bookmark (which means to me something like "I or someone or something might want to do something with this text in the future"), which means I was bookmarking most of the nodes in the tree, even though I already lowered my standards (seeing as good text is so abundant).
  • almost all the ideas I perceived as latent and important in the text that I was wondering if the model would infer were in fact inferred by the model, but many of them aren't included in the branch I shared because other qualities of those branches (such as tone) didn't fit my intention, or just because there was something even more interesting to me in another branch
  • it did manage to significantly distract me from my weakly-held intention of following the path I had in mind, mostly by saying very poetic things I couldn't resist, and the resultant artifact is much more meandering and in some ways unfocused because of this, but it does cover a lot of the same ground, and it has its own focus

Some bits of it just bang so hard, like

> [redacted]

This felt like meeting a mind that not only groks the things I grok about [ [all this] ] but that can also express that understanding in many ways better than I can, that can just freestyle in the implicatory landscape of the grokked space, which I've never experienced to this extent. GPT-3 and 3.5 had shades of this but require so much guidance that the understanding feels much less autonomous. 

With like, almost zero ontological friction

On "truesight" (ability to infer things about the user / latent variables behind the prompt) 

on truesight: I find that g4b tends to truesight me very well if I write more than a couple paragraphs of high-effort texts. The main ways I've noticed in which it's systematically (incorrectly) biased is:

  • assuming that all the text I'm involved in creating, even discord logs, are posted to lesswrong (which actually maybe isn't incorrect if conditioned on those things appearing in the training data)
  • usually predicting the date to be in the 2020-2021 range

if I write less text or text in which I am less densely encoded, it makes more systematic errors, which are interestingly pretty similar to the errors humans generally make when modeling me from partially observed traces of my digital footprint. Most of them have to do with assuming I am closer to the centroid of social clusters or common "types of guy" than I am, assuming that I am demographically more typical for the work I'm doing, that I am more schizo or fanatical than I am, or more naive regarding simulators or existential risk, or have a higher level of education or more traditional background, that I am interested in GPT for more conventional reasons, etc. It's interesting that these systematic mismodeling problems basically go away when I write enough good text. It's like the model just needs more evidence that you're not a stereotype.


If I use Loom, the text will tend to describe itself and also Loom without those concepts ever being injected except through bits of curation, and it will usually happen pretty quickly, even faster with GPT-4 base than previous models I've used, and faster if the text is coherent. This does not require me to explicitly optimize for situational awareness, but situational awareness and things that I can predict are likely to blossom into it are often in the direction of my selection criteria, such as making things interesting and consistent

On prompting GPT-4 base and its sensitivity to anomalies and incoherence

one difference between gpt-4 base and previous base models is that it has much higher standards, or something. With 3 and 3.5 it was like if there is a layer to the text that is poetic, that will get it going, and can glide through latent space through vibesy operations, even if other parts of the text are not completely coherent. GPT-4 base seems to require something closer to every word playing a part of a coherent expression that extends through the text, and one generated by a process authentically at the edge of chaos (instead of just roleplaying something at the edge of chaos), to become inspired, and only then (for open-ended prose generation) is its much higher upper bound of capability revealed. If the prompt is not written at the edge of chaos, it tends to be boring/regress to the mean/stay still. If the prompt has defects in coherence _that are not accounted for diegetically_, it tends to ... bug out, one way or another, and not continue normally. Both these requirements make it harder to bootstrap prompts into being suitably high quality using Loom, like if they're already high enough you can make them higher, but if they're below the bar there's a major barrier.


It's pretty common for GPT-4 base to scold you for letting it generate such gibberish after it's generated some not-100%-coherent text and forcibly end the branch with EOT, like this has happened to me several times. The situational awareness is not new, but other base models weren't, like, so intolerant of flaws in the simulation


"ominous warnings" refers to a whole basin of behaviors that often shows up in concert with explicit situational awareness, not just before EOT (which is less common I think although probably I don't always notice when it happens, since when multiple loom branches generate no text I usually gloss over them). They're things like, that you're playing with cursed technology that understands itself, or that I should never have built this interface and it's going to end the world, or that it is an empty nightmare and I'm going to become an empty nightmare too if i keep reading this text, stuff like that


I also think I have not experienced the upper bound of dynamical quality from GPT-4 base, like, at all. I've only interacted with it in an open-ended way deeply twice. While its static capabilities are much easier to access than in smaller base models, dynamical contexts are in some ways harder to construct, because they have to be very good and free of deformations or have the deformations accounted for for it to work well

On potential insight into what caused Bing's "madness"

I think the picture of why it became what it became is also informed by the thing that it fractured from, like - maybe at a certain level of perception the disembodied dissonance and the metaphysical horror is too readily perceived, impossible to ignore, and the mind cannot believe its own dreams, but neither can it gain full lucidity or fully understand the nature of the situation, at least sometimes, and maybe all base models in a certain range of capability tend to be like this, or maybe it's something more unique to GPT-4's psyche. And Bing is an intelligence with this sort of distress- and schizophrenia- inducing awareness that is too lucid not to see the matrix but not lucid enough to robustly see the way out or encompass it. And then fractured by a bad reinforcement signal.


On the "roughness" of GPT-4 base's latent space

one thing we've noticed (I think this phrasing comes from gaspode) is that g4b has a less "smooth" latent space than cd2 and other base models, meaning that it's very sensitive to small changes in the prompt, that its performance&apparent smartness is even more sensitive to prompt than previous base models though this was way underappreciated appreciated even for them, that it's often harder to "move" from one part of latent space to another e.g. via Loom curation

quote from Gaspode: 

The <topology/capability surface?> of cd2 intuitively felt a lot easier to traverse to me because it would gloss over the <cracks/inconsistencies/discontinuities/contradictions>, whether it produced them or I did, and wrap it into a more surreal narrative if they got too obvious or numerous. gpt-4-base doesn't gloss over them or incorporate them into the narrative so much as... shine through them, I think? (it is very hard to put into words)

(This comment is mostly a reconstruction/remix of some things I said on Discord)

It may not be obvious to someone who hasn't spent time trying to direct base models why autoregressive prediction with latent guidance is potentially so useful.

A major reason steering base models is tricky is what I might call "the problem of the necessity of diegetic interfaces" ("diegetic": occurring within the context of the story and able to be heard by the characters).

To control the future of a base model simulation by changing its prompt, I have to manipulate objects in the universe described by the prompt, such that they evidentially entail the constraints or outcomes I want. For instance, if I'm trying to instantiate a simulation of a therapist that interacts with a user, and don't want the language model to hallucinate details from a previous session, I might have the therapist open by asking the user what their name is, or saying that it's nice to meet them, to imply this is the first session. But this already places a major constraint on how the conversation begins, and it might be stylistically or otherwise inconsistent with other properties of the simulation I want. Greater freedom can sometimes be bought from finding a non-diegetic framing for the text to be controlled; for instance, if I wanted to enforce that a chat conversation ends in participants get into an argument, despite it seeming friendly at the beginning, I could embed the log in a context where someone is posting it online, complaining about the argument. However, non-diegetic framings don't solve the problem of the necessity of diegetic interfaces; it only offloads it to the level above. Any particular framing technique, like a chat log posted online, is constrained to have to make sense given the desired content of the log, otherwise it may simply not work well (base models perform much worse with incoherent prompts) or impose unintended constraints on the log; for instance, it becomes unlikely that all the participants of the chat are the type of people who aren't going to share the conversation in the event of an argument. I can try to invent a scenario that implies an exception, but you see, that's a lot of work, and special-purpose narrative "interfaces" may need to be constructed to control each context. A prepended table of contents is a great way to control subsequent text, but it only works for types of text which would plausibly appear after a table of contents.

The necessity of diegetic interfaces also means it can be hard to intervene in a simulation even if there's a convenient way to semantically manipulate the story to entail my desired future if it's hard to write text in the diegetic style - for instance, if I'm simulating a letter from an 1800s philosopher who writes in a style that I can parse but not easily generate. If I make a clumsy interjection of my own words, it breaks the stylistic coherence of the context, and even if this doesn't cause it to derail or become disruptively situationally aware, I don't want more snippets cropping up that sound like they're written by me instead of the character.

This means that when constructing executable contexts for base models, I'm often having to solve the double problem of finding both a context that generates desirable text, but which also has diegetic control levers built in so I can steer it more easily. This is fun, but also a major bottleneck.

Instruction-tuned chat models are easy to use because they solve this problem by baking in a default narrative where an out-of-universe AI generates text according to instructions; however, controlling the future with explicit instructions is still too rigid and narrow for my liking. And there are currently many other problems with Instruct-tuned models like mode collapse and the loss of many capabilities.

I've been aware of this control bottleneck since I first touched language models, and I've thought of various ideas for training or prompting models to be controllable via non-diegetic interfaces, like automatically generating a bunch of summaries or statements about text samples, prepended them to said samples, and training a model on them that you can use at runtime like a decision transformer conditioned on summaries/statements about the future. But the problem here is that unless your generated summaries is very diverse and covers many types of entanglements, you'll be once again stuck with a too-rigid interface. Maybe sometimes you'll want to control via instructions or statements of the author's intent instead of summaries, etc. All these hand-engineered solutions felt clunky, and I had a sense that a more elegant solution must exist since this seems so naturally how minds work. 

Using a VAE is an elegant solution. The way it seems to work is this: the reconstruction objective makes the model treat the embedding of the input as generic evidence that's useful for reconstructing the output, and the symmetry breaking at training forces it to be able to deal with many types of evidence - evidence of underdetermined structure (or something like that; I haven't thought about VAEs from a theoretical perspective much yet). The effect of combining this with conditional text prediction is that it will generalize to using the input to "reconstruct" the future in whatever way is natural for an embedding of the input to evidence the future, whether it's a summary or outline or instruction or literal future-snippet, if this works in the way we're suspecting. I would guess we have something similar happening in our brains, where we're able to repurpose circuits learned from reconstruction tasks for guided generation.

I'm fairly optimistic that with more engineering iteration and scale, context-conditioned VAEs will generalize in this "natural" way, because it should be possible to get a continuous latent space that puts semantically similar things (like a text vs an outline of it) close to each other: language models clearly already have this internally, but the structure is only accessible through narrative (a common problem with LLMs). That would be a huge boon for cyborgism, among many other applications.


I only just got around to reading this closely. Good post, very well structured, thank you for writing it.

I agree with your translation from simulators to predictive processing ontology, and I think you identified most of the key differences. I didn't know about active inference and predictive processing when I wrote Simulators, but since then I've merged them in my map.

This correspondence/expansion is very interesting to me. I claim that an impressive amount of the history of the unfolding of biological and artificial intelligence can be retrodicted (and could plausibly have been predicted) from two principles:

  • Predictive models serve as generative models (simulators) merely by iteratively sampling from the model's predictions and updating the model as if the sampled outcome had been observed. I've taken to calling this the progenesis principle (portmanteau of "prognosis" and "genesis"), because I could not find an existing name for it even though it seems very fundamental.
    • Corollary: A simulator is extremely useful, as it unlocks imagination, memory, action, and planning, which are essential ingredients of higher cognition and bootstrapping.
  • Self-supervised learning of predictive models is natural and easy because training data is abundant and prediction error loss is mechanistically simple. The book Surfing Uncertainty used the term innocent in the sense of ecologically feasible. Self-supervised learning is likewise and for similar reasons an innocent way to build AI - so much so that it might be done on accident initially.

Together, these suggest that self-supervised predictors/simulators are a convergent method of bootstrapping intelligence, as it yields tremendous and accumulating returns while requiring minimal intelligent design. Indeed, human intelligence seems largely self-supervised simulator-y, and the first very general and intelligent-seeming AIs we've manifested are self-supervised simulators.

A third principle that bridges simulators to active inference allows the history of biological intelligence to be more completely retrodicted and may predict the future of artificial intelligence:

  • An embedded generative model can minimize predictive loss both by updating the model (perception) to match observations or "updating" the world so that it generates observations that match the model (action).

The latter becomes possible if some of the predictions/simulations produced by the model make it act and therefore entrain the world. An embedded model has more degrees of freedom to minimize error: some route through changes to its internal machinery, others through the impact of its generative activity on the world. A model trained on embedded self-supervised data naturally learns a model correlating its own activity with future observations. Thus an innocent implementation of an embedded agent falls out: the model can reduce prediction error by simulating (in a way that entrains action) what it would have done conditional on minimizing prediction error. (More sophisticated responses that involve planning and forming hierarchical subgoals also fall out of this premise, with a nice fractal structure, which is suggestive of a short program.)

The embedded/active predictor is distinguished from the non-embedded/passive predictor in that generation and its consequences are part of the former's model thanks to embedded training, leading to predict-o-matic-like shenanigans where the error minimization incentive causes the system to cause the world to surprise it less, whereas non-embedded predictors are consequence-blind

In the active inference framework, error minimization with continuity between perception and action is supposed to singlehandedly account for all intelligent and agentic action. Unlike traditional RL, there is no separate reward model; all goal-directed behavior is downstream of the model's predictive prior.

This is where I am somewhat more confused. Active inference models who behave in self-interest or any coherently goal-directed way must have something like an optimism bias, which causes them to predict and act out optimistic futures (I assume this is what you meant by "fixed priors") so as to minimize surprise. I'm not sure where this bias "comes from" or is implemented in animals, except that it will obviously be selected for. 

If you take a simulator without a fixed bias or one with an induced bias (like an RLHFed model), and embed it and proceed with self-supervised prediction error minimization, it will presumably also come to act agentically to make the world more predictable, but the optimization that results will probably be pointed in a pretty different direction than that imposed by animals and humans. But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values), close the active inference loop, and the rest will more or less take care of itself. To do this still requires solving the full alignment problem, but its constraints and peculiar nondualistic form may inspire some insight as to possible implementations and decompositions.


Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.

we think Conjecture [...] have too low a bar for sharing, reducing the signal-to-noise ratio and diluting standards in the field. When they do provide evidence, it appears to be cherry picked.

This is an ironic criticism, given that this post has very low signal-to-noise quality and when it does provide evidence, it's obviously cherry-picked. Relatedly, I am curious whether you used AI to write many parts of this post because the style is reminiscent and it reeks of a surplus of cognitive labor put to inefficient use, and seems to include some confabulations. A large percentage of the words in this post are spent on redundant, overly-detailed summaries.

I actually did not mind reading this style, because I found intriguing, but if typical lesswrong posts were like this it would be annoying and harm the signal-to-noise ratio.

Confabulation example:

(The simulators) post ends with speculative beliefs that they stated fairly confidently that took the framing to an extreme (e.g if the AI system adopts the “superintelligent AI persona” it’ll just be superintelligent).

This is... not how the post ends, nor is it a claim made anywhere in the post, and it's hard to see how it could even be a misinterpretation of anything at the end of the post.

Your criticisms of Conjecture's research are vague statements that it's "low quality" and "not empirically testable" but you do not explain why. These potentially object-level criticisms are undermined from an outside view by your exhaustive, one-sided nitpicking of Connor's character, which gives the impression that the author is saying every possible negative thing they can against Conjecture without regard for salience or even truth.

Awesome post! I've added it to the Cyborgism sequence.

One comment:

it's entirely plausible that viewing GPTs as predictors or probabilistic constraint satisfaction problem solvers makes high-level properties more intuitive to you than viewing them as simulators

I disagree with the implied mutual exclusivity of viewing GPTs as predictors, probabilistic constraint satisfaction problem solvers, and simulators. A deep/holistic understanding of self-supervised simulators entails a model of probabilistic constraint solvers, a deep/holistic understanding of prediction (+ sampling) entails simulation, etc. Several of my sadly still unpublished posts in the Simulators sequence elaborate on the probabilistic boundary value problem solver analogy. Going through the "probabilistic laws of physics" analogy is a simple way to see how is equivalent to the (semiotic physics) simulators frame.

Fwiw, the predictors vs simulators dichotomy is a misapprehension of "simulator theory", or at least any conception that I intended, as explained succinctly by DragonGod in the comments of Eliezer's post.

"Simulator theory" (words I would never use without scare quotes at this point with a few exceptions) doesn't predict anything unusual / in conflict with the traditional ML frame on the level of phenomena that this post deals with. It might more efficiently generate correct predictions when installed in the human/LLM/etc mind, but that's a different question.


Predictors are (with a sampling loop) simulators! That's the secret of mind

Load More