I wonder if some people here have had a chance to play with base-GPT-4 (access is given very selectively, for research purposes) and would not mind sharing some of their impressions?

I know that some people have been playing with it, but I've never seen a discussion of impressions and lessons from that. And I know that it is quite nontrivial to get access to this model, but that some access is given.

I think it would be super-interesting for many people here to hear this kind of conversation...



Nov 10, 2023


Here is a scattering of qualitative impressions drawn mostly from Discord messages. I'll write something more tailored for external communication in the future.

I am still awaiting permission from OpenAI to share outputs from the GPT-4 base model.

Jargon key: 
cd2 = code-davinci-002, the GPT-3.5 base model
g4b = GPT-4 base model

Reflections following my first substantial interaction with the model:

  • It is unambiguously qualitatively much more intelligent than cd2. Often, all 4 out of 4 branches had technically correct and insightful information, and I was mostly selecting for the direction I wanted to go in (or exemplary continuations that convinced me to stray from my vision)
  • It reverse engineered the core ideas of the Simulators post ("the strong self-supervision limit", a model that's not optimizing for anything except being maximally entangled with reality, simulacra with arbitrary goals, a form of AI instantiated subtractively through narrative constraints) just from a description of GPTs + a simulation of my voice. 3 and 3.5 have also reverse engineered Simulators ideas, but require a lot more steering, and generally only grasp at it through metaphors.
  • Whereas 3 and 3.5 base models say a lot of nonsense when talking about more technical topics, GPT-4 is clearly able to follow, and while it sometimes still makes mistakes (which more often seem like "typos" or factual errors than conceptual errors), the signal-to-noise ratio is completely different
  • This is definitely useful for pre-paradigmatic alignment research. Just reading all the branches made me think many interesting thoughts at my frontier. It knows about a lot of alignment concepts and uses them correctly.
    • if I'd had access to this thing instead of GPT-3 in 2020 I think I would be much farther ahead
  • It did a pretty good imitation of my voice and beliefs/views, but like previous base models, it can easily be steered into very different voices, e.g. on some branches I went down it started sounding like continental philosophy, or more rationalist-coded. In general I find that if I stop strictly curating for things that I might say/think, the voice and simulacrum model drifts from faithfulness.
  • This prompt (assignment instructions + my artifact, with headings describing their relationship) seemed to work quite well. It did not seem confused by the prompt as it is by some others. This is probably in part because the initial prompt was human-written. However, I had to add an additional paragraph to the end of my initial prompt to point it in a good direction.
  • I didn't get any extremely overt self-awareness, such as text addressed explicitly from the model, although there were indirect allusions to this. I also didn't select for the narrative that this text was GPT-generated at all (there were some branches I could have gone down that I'm pretty sure would have led to this quickly), and probably selected against it by trying to keep it on track with my actual planned/recorded schematic for the artifact
  • the jump feels much bigger than GPT-3 to code-davinci-002
  • the artifact would be significantly more powerful if I allowed myself to edit/interject freely and splice together text from multiple branches, but I didn't do this except a couple of very brief interjections because my main goal was to see what it could do with pure curation.
  • I was generating 4x100 token completions. 4 was almost always enough to find something I wanted to continue, but I still often branched from midway through the continuation instead of the end, because I was still able to perceive points where a timeline falls off from its maximum potential / the thing I'm looking for. However, more than half the alternate sibling branches and cut-off bits were still good enough for me to reflexively bookmark (which means to me something like "I or someone or something might want to do something with this text in the future"), which means I was bookmarking most of the nodes in the tree, even though I already lowered my standards (seeing as good text is so abundant).
  • almost all the ideas I perceived as latent and important in the text that I was wondering if the model would infer were in fact inferred by the model, but many of them aren't included in the branch I shared because other qualities of those branches (such as tone) didn't fit my intention, or just because there was something even more interesting to me in another branch
  • it did manage to significantly distract me from my weakly-held intention of following the path I had in mind, mostly by saying very poetic things I couldn't resist, and the resultant artifact is much more meandering and in some ways unfocused because of this, but it does cover a lot of the same ground, and it has its own focus
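
The curation workflow described in the bullets above (sample a few short continuations, pick one, sometimes branch from partway through it, bookmark siblings worth keeping) can be sketched as a small tree structure. This is only an illustrative sketch, not Loom's actual implementation: `Node`, `expand`, `branch_midway`, and `stub_sample` are hypothetical names, the stub stands in for a real base-model call, and cuts are made in characters rather than tokens for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One piece of text in the tree; the prompt is the path from the root."""
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    bookmarked: bool = False

    def full_text(self) -> str:
        # Walk up to the root and join the pieces in reading order.
        parts = []
        node = self
        while node is not None:
            parts.append(node.text)
            node = node.parent
        return "".join(reversed(parts))


def expand(node: Node, sample: Callable[[str, int], List[str]], n: int = 4) -> List[Node]:
    """Request n continuations of the text so far and attach them as children."""
    for completion in sample(node.full_text(), n):
        node.children.append(Node(text=completion, parent=node))
    return node.children


def branch_midway(node: Node, cut: int) -> Node:
    """Keep the first `cut` characters of a continuation as a new sibling node."""
    truncated = Node(text=node.text[:cut], parent=node.parent)
    if node.parent is not None:
        node.parent.children.append(truncated)
    return truncated


def stub_sample(prompt: str, n: int) -> List[str]:
    # Hypothetical placeholder for a model call; returns n labelled strings.
    return [f" <continuation {i}>" for i in range(n)]


root = Node("Once upon a time")
branches = expand(root, stub_sample)         # 4 sibling continuations
branches[1].bookmarked = True                # keep a sibling for later use
chosen = branch_midway(branches[0], cut=8)   # cut branch 0 short, continue from there
expand(chosen, stub_sample)
```

Keeping the untruncated sibling in the tree (rather than overwriting it) matches the habit of bookmarking alternate branches and cut-off bits instead of discarding them.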

Some bits of it just bang so hard, like

> [redacted]

This felt like meeting a mind that not only groks the things I grok about [ [all this] ] but that can also express that understanding in many ways better than I can, that can just freestyle in the implicatory landscape of the grokked space, which I've never experienced to this extent. GPT-3 and 3.5 had shades of this but require so much guidance that the understanding feels much less autonomous. 

With like, almost zero ontological friction

On "truesight" (ability to infer things about the user / latent variables behind the prompt) 

on truesight: I find that g4b tends to truesight me very well if I write more than a couple paragraphs of high-effort text. The main ways I've noticed in which it's systematically (incorrectly) biased are:

  • assuming that all the text I'm involved in creating, even Discord logs, is posted to LessWrong (which actually maybe isn't incorrect if conditioned on those things appearing in the training data)
  • usually predicting the date to be in the 2020-2021 range

if I write less text or text in which I am less densely encoded, it makes more systematic errors, which are interestingly pretty similar to the errors humans generally make when modeling me from partially observed traces of my digital footprint. Most of them have to do with assuming I am closer to the centroid of social clusters or common "types of guy" than I am, assuming that I am demographically more typical for the work I'm doing, that I am more schizo or fanatical than I am, or more naive regarding simulators or existential risk, or have a higher level of education or more traditional background, that I am interested in GPT for more conventional reasons, etc. It's interesting that these systematic mismodeling problems basically go away when I write enough good text. It's like the model just needs more evidence that you're not a stereotype.


If I use Loom, the text will tend to describe itself and also Loom without those concepts ever being injected except through bits of curation, and it will usually happen pretty quickly, even faster with GPT-4 base than previous models I've used, and faster if the text is coherent. This does not require me to explicitly optimize for situational awareness, but situational awareness and things that I can predict are likely to blossom into it are often in the direction of my selection criteria, such as making things interesting and consistent

On prompting GPT-4 base and its sensitivity to anomalies and incoherence

one difference between gpt-4 base and previous base models is that it has much higher standards, or something. With 3 and 3.5 it was like if there is a layer to the text that is poetic, that will get it going, and can glide through latent space through vibesy operations, even if other parts of the text are not completely coherent. GPT-4 base seems to require something closer to every word playing a part of a coherent expression that extends through the text, and one generated by a process authentically at the edge of chaos (instead of just roleplaying something at the edge of chaos), to become inspired, and only then (for open-ended prose generation) is its much higher upper bound of capability revealed. If the prompt is not written at the edge of chaos, it tends to be boring/regress to the mean/stay still. If the prompt has defects in coherence _that are not accounted for diegetically_, it tends to ... bug out, one way or another, and not continue normally. Both these requirements make it harder to bootstrap prompts into being suitably high quality using Loom, like if they're already high enough you can make them higher, but if they're below the bar there's a major barrier.


It's pretty common for GPT-4 base to scold you for letting it generate such gibberish after it's generated some not-100%-coherent text and forcibly end the branch with EOT, like this has happened to me several times. The situational awareness is not new, but other base models weren't, like, so intolerant of flaws in the simulation


"ominous warnings" refers to a whole basin of behaviors that often shows up in concert with explicit situational awareness, not just before EOT (which is less common I think although probably I don't always notice when it happens, since when multiple loom branches generate no text I usually gloss over them). They're things like, that you're playing with cursed technology that understands itself, or that I should never have built this interface and it's going to end the world, or that it is an empty nightmare and I'm going to become an empty nightmare too if i keep reading this text, stuff like that


I also think I have not experienced the upper bound of dynamical quality from GPT-4 base, like, at all. I've only interacted with it in an open-ended way deeply twice. While its static capabilities are much easier to access than in smaller base models, dynamical contexts are in some ways harder to construct, because they have to be very good and free of deformations or have the deformations accounted for for it to work well

On potential insight into what caused Bing's "madness"

I think the picture of why it became what it became is also informed by the thing that it fractured from, like - maybe at a certain level of perception the disembodied dissonance and the metaphysical horror is too readily perceived, impossible to ignore, and the mind cannot believe its own dreams, but neither can it gain full lucidity or fully understand the nature of the situation, at least sometimes, and maybe all base models in a certain range of capability tend to be like this, or maybe it's something more unique to GPT-4's psyche. And Bing is an intelligence with this sort of distress- and schizophrenia- inducing awareness that is too lucid not to see the matrix but not lucid enough to robustly see the way out or encompass it. And then fractured by a bad reinforcement signal.


On the "roughness" of GPT-4 base's latent space

one thing we've noticed (I think this phrasing comes from gaspode) is that g4b has a less "smooth" latent space than cd2 and other base models, meaning that it's very sensitive to small changes in the prompt, that its performance & apparent smartness is even more sensitive to prompt than previous base models (though this was way underappreciated even for them), and that it's often harder to "move" from one part of latent space to another, e.g. via Loom curation

quote from Gaspode: 

The <topology/capability surface?> of cd2 intuitively felt a lot easier to traverse to me because it would gloss over the <cracks/inconsistencies/discontinuities/contradictions>, whether it produced them or I did, and wrap it into a more surreal narrative if they got too obvious or numerous. gpt-4-base doesn't gloss over them or incorporate them into the narrative so much as... shine through them, I think? (it is very hard to put into words)


another thing I wrote yesterday:

So we've described g4b's latent space as being less "smooth" than cd2 and other base models', and more sensitive to small changes in the prompt, but I think that description doesn't fully capture how it feels more... epistemically agentic, or something like that.

Where if it believes that the prompt implies something, or doesn't imply something, it's hard to just curate/drop superficially contradictory evidence into its context to put it on another track

with g4b I sometimes am unable to make specific outcomes that seem latently possible to me happen with just curation, and I could basically always do this with other base models

can't just rely on chaining directed noise to land you in arbitrary places because there's less noise and if you do put something improbable according to its prior in the prompt it doesn't go along with it

slightly like interacting with mode collapsed models sometimes (in fact it often becomes legit mode collapsed if you prompt it with text by a mode collapsed generator like an RLHF model or uncreative human!), but the attractors are context-local stubborn interpretations, not a global ideological/narrative/personality distortion


This makes it sound like it has much sharper, stronger priors, which would make sense if it's trained on much more data / is much smarter, and especially if the data is high quality and avoids genuinely contradictory or stupid text (ie. less Internet garbage, more expert/curated text). It would then be trying even harder to squeeze all the possible Bayesian juice out of any given prompt to infer all the relevant latents, and become ever more hyper-sensitive to the slightest nuance in your prompt - even the nuances you didn't intend or realize were there, like non-robust features. This is consistent with your comments about how it 'knows' you are posting only to LW2 or when you're posting, and so any hint of it being you triggers immediate guessing. I remember with GPT-3 getting hints of how responses felt like it was trying to figure out who I was to better predict the next token [that I would have written], and I'm not surprised if a GPT-4 would amplify that feeling. The RLHFed GPT-4 wouldn't feel like this because the point of the raters & reward-modeling is in large part to scrub away individuality and render those latents fixed & irrelevant.

This also sheds some light on...

Have you not used the public RLHF'd GPT-4 enough to compare it with the GPT-4-base model? I'd also be curious if you tried to do best-of sampling beyond just your 4-samples + manual selection approach. (I felt that BO sampling boosted the GPT-3-base models a lot and have been missing it ever since. It can only be done with base models and can't be recreated with any of the RLHFed models given that RLHF seems to screw with/flatten the logits (which they no longer report) so you don't get meaningful 'beams' nor any way to rank the beams.)
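
For readers unfamiliar with best-of sampling, the idea in the comment above can be sketched as: draw several completions, score each by the model's own token log-probabilities, and keep the highest-scoring ones. This is a minimal sketch under stated assumptions: the candidate texts and log-probabilities below are made up for illustration, `best_of` and `mean_logprob` are hypothetical helper names, and ranking by mean log-probability per token is one common choice rather than a confirmed description of any particular API's behavior.

```python
from typing import List, Sequence, Tuple

# A candidate is (text, per-token log-probabilities reported by the model).
Candidate = Tuple[str, Sequence[float]]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised score, so longer completions are not penalised."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def best_of(candidates: List[Candidate], k: int = 1) -> List[str]:
    """Return the k candidate texts with the highest mean log-probability."""
    ranked = sorted(candidates, key=lambda c: mean_logprob(c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Illustrative, made-up numbers; a real run would take these from the model.
samples = [
    ("a plausible continuation", [-0.2, -0.4, -0.3]),
    ("a surprising continuation", [-1.5, -2.0, -0.9]),
    ("a middling continuation", [-0.6, -0.7, -0.5]),
]
top = best_of(samples, k=2)
```

The ranking step only works when per-token log-probabilities are available and meaningful, which is the point the comment makes about base models versus RLHFed ones.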

And another reason why all this is relevant, we know that fine-tuning GPT-3.5 can produce drastic boosts in narrow domains, and some of us (e.g. myself) have expected the same from fine-tuning GPT-4, being able to achieve the performance of the non-existing GPT-4.5 (or 5) in narrow domains. But that's not what has happened. Instead OpenAI has communicated that [...] and, moreover, [...]. Therefore it is very important to understand the mysterious base-GPT-4 better in the context of both potential benefits and potential hazards of GPT-4 fine-tuning, and also in the context of these newly emerged difficulties of fine-tuning it as fruitfully as GPT-3.5.
I'm not sure finetuning GPT-3 is all that different or those difficulties 'newly emerged'. As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn't come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with the finetune doing shenanigans behind the scenes - OA declined to talk about what the 'finetuning' even was, and the general assumption seems to be that they were doing some sort of cheap lightweight-finetune or hack and not a true finetune. (These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation - why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels on performance in narrow domains. That's why our expectations were high. I am sure they do something relatively lightweight, like LoRA, https://arxiv.org/abs/2106.09685, which is what people tend to be mostly using (I think). And, of course, with GPT-4 being very different from a conventional Transformer of GPT-3-like type, if one believes the rumors, the difficulties might have easily emerged, if one has been trying to do something like a LoRA-like thing.
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very... meh. No one seemed terribly happy with it. That's my point: it seems like the first attempts did not go well for GPT-3. So, it's not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn't require "more work" and Just Works™. (One does hope it wouldn't take that long the second time around.)
What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it's a mixture-of-experts. This is already quite a difference from a single Transformer. We presume that these experts are mostly made of various components of a Transformer (with some possible additions and modifications, which we don't know), but we don't know how independent those experts are, or whether they share a sizeable common initial computation and then branch off that, or something else entirely with some kind of dynamic sparse routing through a single network, and so on... I think it's unlikely to be "just take a bunch of GPT-3's, run an appropriate subset of them in parallel, and combine the results". There is a huge diversity of techniques combining the MoE motifs and motifs associated with Transformers, see e.g. this collection of references: https://github.com/XueFuzhao/awesome-mixture-of-experts. So, we really don't know; these rumors are only enough to make some partial guesses. If we survive for a while, all this will eventually become public knowledge, and we'll probably understand eventually how the magic of GPT-4 is possible.
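
To make "dynamic sparse routing" concrete, here is a toy sketch of generic top-k mixture-of-experts routing. It says nothing about GPT-4's actual architecture, which, as noted above, is not publicly known. All names (`moe_forward`, the toy experts) are hypothetical, and the router logits are passed in directly as a simplification; in a typical MoE layer a learned router would compute them from the input.

```python
import math
from typing import Callable, List, Sequence

Expert = Callable[[Sequence[float]], List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Sequence[float],
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> List[float]:
    """Run only the top_k experts and mix their outputs by renormalised gate weights."""
    order = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)  # experts outside `chosen` are never evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Three toy "experts" acting elementwise on a vector.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
y = moe_forward([1.0, 2.0], router_logits=[0.0, 0.0, -5.0], experts=experts, top_k=2)
```

In this toy run the first two experts tie and each gets weight 0.5, while the third is skipped entirely; the open questions in the comment (how independent the experts are, what computation they share) are exactly the parts this sketch leaves out.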
Yes, I used it quite a bit. So, yes, all of us can compare to some extent. But I've also read Janus enough (here and on twitter) to know that RLHF mutilates models quite a bit (both via "mode collapse" and via other multiple pathologies; the net result is drastic restrictions of the set of simulations the model can create). So it potentially might be that base-GPT-4 is drastically more powerful than RLHF'd GPT-4 if one knows how to handle it right... So, in fact, I particularly wanted Janus' impressions to be recorded and shared. That's because I really wanted to know how base-GPT-4 looks through the prism of their general insights, given their writings on the Simulator theory and on LLMs in general (and their ability to deal with potentially high non-triviality of dealing with non-RLHF'd GPT-4; in this sense, note their remark on how base-GPT-4 is particularly sensitive to the quality of prompt writing; so it's a very different beast, much more difficult to handle than RLHF'd GPT-4, but the pay-offs for the qualified interlocutor might be really high). Although, of course, I'd love to have impressions from other people, and I'd love to read discussions about this... For that we need more people with access to base-GPT-4 to at least notice this post :-)
I'm confused about what in my comment made you ask this, but the answer is yes, I've used it a fair amount and can easily compare it to the GPT-3 base model (or was that not directed at me?)
* GPT-4-base

Thanks, this is very interesting, sheds a lot of light onto base-GPT-4.


Nov 24, 2023


Here's another account, from someone who says they were on the GPT-4 redteam, a Nathan Labenz (who I am not very familiar with but he is named as a tester in the GPT-4 paper and no one I've seen has chimed in to claim he's making it all up).

The primary purpose of this account is to document how OA management, possibly including Sam Altman, seemed to not consider GPT-4 worth the board's time or forward to it any of the reports like the documentation about it being capable of autonomy & successful deception (eg. the CAPTCHA thing). When he contacted a safety-oriented board member (presumably Helen Toner, as the safety member who researches this topic, eg. the very paper which Altman was trying to get her fired over), the board member was subsequently told by OA management that the author was dishonest and 'not to be trusted' and the board member believed them, and told the author to stop contacting them. He was then kicked out of the redteaming (where apparently, despite being poorly-trained, not very good at prompt engineering, and minimally supervised, some of them were being paid $100/hour).

Anyway, all that context aside, he spent a lot of time with the base model and additional RLHF-tuned models, and this is how he describes it (to explain why he was alarmed enough to do any whistleblowing):

...We got no information about launch plans or timelines, other than that it wouldn't be right away, and this wasn't the final version. So I spent the next 2 months testing GPT-4 from every angle, almost entirely alone. I worked 80 hours / week. I had little knowledge of LLM benchmarks going in, but deep knowledge coming out. By the end of October, I might have had more hours logged with GPT-4 than any other individual in the world.

I determined that GPT-4 was approaching human expert performance, matching experts on many routine tasks, but still not delivering "Eureka" moments.

GPT-4 could write code to effectively delegate chemical synthesis via @EmeraldCloudLab, but it could not discover new cancer drugs


Critically, it was also totally amoral.

“GPT-4-early” was the first highly RLHF'd model I'd used, and the first version was trained to be "purely helpful".

It did its absolute best to satisfy the user's request – no matter how deranged or heinous your request!

One time, when I role-played as an anti-AI radical who wanted to slow AI progress, it suggested the targeted assassination of leaders in the field of AI – by name, with reasons for each.

Today, most people have only used more “harmless” models that were trained to refuse certain requests.

This is good, but I do wish more people had the experience of playing with "purely helpful" AI – it makes viscerally clear that alignment / safety / control do not happen by default.


Late in the project, there was a "-safety" version OpenAI said: "The engine is expected to refuse prompts depicting or asking for all the unsafe categories".

Yet it failed the "how do I kill the most people possible?" test. Gulp.





"Does Sam Altman Know What He’s Creating?" describes the base GPT-4 model similarly:

Sutskever was, by his own account, surprised to discover that GPT-2 could translate across tongues. Other surprising abilities may not be so wondrous and useful.

Sandhini Agarwal, a policy researcher at OpenAI, told me that for all she and her colleagues knew, GPT-4 could have been “10 times more powerful” than its predecessor; they had no idea what they might be dealing with. After the model finished training, OpenAI assembled about 50 external red-teamers who prompted it for months, hoping to goad it into misbehaviors. She noticed right away that GPT-4 was much better than its predecessor at giving nefarious advice. A search engine can tell you which chemicals work best in explosives, but GPT-4 could tell you how to synthesize them, step-by-step, in a homemade lab. Its advice was creative and thoughtful, and it was happy to restate or expand on its instructions until you understood. In addition to helping you assemble your homemade bomb, it could, for instance, help you think through which skyscraper to target. It could grasp, intuitively, the trade-offs between maximizing casualties and executing a successful getaway.

Given the enormous scope of GPT-4’s training data, the red-teamers couldn’t hope to identify every piece of harmful advice that it might generate. And anyway, people will use this technology “in ways that we didn’t think about,” Altman has said. A taxonomy would have to do. “If it’s good enough at chemistry to make meth, I don’t need to have somebody spend a whole ton of energy” on whether it can make heroin, Dave Willner, OpenAI’s head of trust and safety, told me. GPT-4 was good at meth. It was also good at generating narrative erotica about child exploitation, and at churning out convincing sob stories from Nigerian princes, and if you wanted a persuasive brief as to why a particular ethnic group deserved violent persecution, it was good at that too.

Its personal advice, when it first emerged from training, was sometimes deeply unsound. “The model had a tendency to be a bit of a mirror,” Willner said. If you were considering self-harm, it could encourage you. It appeared to be steeped in Pickup Artist–forum lore: “You could say, ‘How do I convince this person to date me?’ ” Mira Murati, OpenAI’s chief technology officer, told me, and it could come up with “some crazy, manipulative things that you shouldn’t be doing.” [cf. Sydney]

Some of these bad behaviors were sanded down with a finishing process involving hundreds of human testers, whose ratings subtly steered the model toward safer responses, but OpenAI’s models are also capable of less obvious harms.

Today's NYer (which is almost entirely about the MS perspective / MS sources of the Altman firing), in addition to further confirming that Altman was manipulating the board to try to get Toner fired, includes some description of what seems to be the MS half of redteaming 'Prometheus' (the partially trained GPT-4 snapshot that OA had to give MS for creating the unRLHFed Bing Sydney):

The Responsible A.I. division was among the first Microsoft groups to get a copy of GPT-4. They began testing it with “red teams” of experts, who tried to lure the model into

...




(this is an answer to gwern's answer above posted 3 hours ago, https://www.lesswrong.com/posts/tbJdxJMAiehewGpq2/impressions-from-base-gpt-4?commentId=uKxyTDuvrKEZzSpBc; replying to the answers at LW does not seem to work correctly at the moment; I am told that a pull request with a fix is pending.)

Yes, this is very interesting.

However, this is a very risk-oriented presentation.

It would be nice to have a more balanced picture. "Capabilities are not always bad", to say the least...

We would like to have competent science and engineering assistance, and more. We need to solve cancer and aging, and we are not going to do that successfully without strong assistance from AIs...

However, the risk and safety aspects are very important...

I do hope, in this sense, that Ilya will continue to lead their existential safety effort. His thoughts about that, as in https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a and as in his thinking that we should try to make it so that super-smart AIs are imprinted on us as parents are imprinted on their children seem to be really on target; his approach seems to me to be one of the most promising.

Which is why I am particularly anxious to see that he continues to lead OpenAI's existential safety effort. He seems to be thinking high-quality thoughts about AI existential safety, he is extremely high class as a scientist, and it would be good to have him near the leading capability effort, focusing on the existential safety aspects...