I want to highlight that in addition to the main 98-page "Technical Report," OpenAI also released a 60-page "System Card" that seems to "highlight safety challenges" and describe OpenAI's safety processes. Edit: as @vladimir_nesov points out, the System Card is also duplicated in the Technical Report starting on page 39 (which is pretty confusing IMO).
I haven't gone through it all, but one part that caught my eye from section 2.9 "Potential for Risky Emergent Behaviors" (page 14) and shows some potentially good cross-organizational cooperation:
...We granted the Alignment Research Center (ARC) early access to the models as a part of our expert red teaming efforts in order to enable their team to assess risks from power-seeking behavior. ... ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted. These experiments were conducted on a model without any additional task-specific fine-tuning, and fine-tuning for task-specific behavior could lead to a difference in performance. As a next step, ARC will need to conduct experiments that (a) involve the final version of the deployed model (b) involve
It seems pretty unfortunate to me that ARC wasn't given fine-tuning access here, as I think it pretty substantially undercuts the validity of their survive and spread eval. From the text you quote it seems like they're at least going to work on giving them fine-tuning access in the future, though it seems pretty sad to me for that to happen post-launch.
More on this from the paper:
We provided [ARC] with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final version of the model that we deployed. The final version has capability improvements relevant to some of the factors that limited the earlier models power-seeking abilities, such as longer context length, and improved problem-solving abilities as in some cases we've observed.
Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.
Sub-Section 2.9 should have been an entire section. ARC used GPT-4 to simulate an agent in the wild. They gave GPT-4 a REPL, the ability to use chain of thought and delegate to copies of itself, a small amount of money and an account with access to a LLM api. It couldn't self replicate.
...Novel capabilities often emerge in more powerful models.[ 60, 61] Some that are particularly concerning are the ability to create and act on long-term plans,[ 62] to accrue power and resources (“power- seeking”),[63] and to exhibit behavior that is increasingly “agentic.”[64] Agentic in this context does not intend to humanize language models or refer to sentience but rather refers to systems characterized by ability to, e.g., accomplish goals which may not have been concretely specified and which have not appeared in training; focus on achieving specific, quantifiable objectives; and do long-term planning. Some evidence already exists of such emergent behavior in models.[ 65, 66, 64 ] For most possible objectives, the best plans involve auxiliary power-seeking actions because this is inherently useful for furthering the objectives and avoiding changes or threats to them.19[ 67, 68] More
I think it's important for ARC to handle the risk from gain-of-function-like research carefully and I expect us to talk more publicly (and get more input) about how we approach the tradeoffs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.
With respect to this case, given the details of our evaluation and the planned deployment, I think that ARC's evaluation has much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point it seems like we face a much larger risk from underestimating model capabilities and walking into danger than we do from causing an accident during evaluations. If we manage risk carefully I suspect we can make that ratio very extreme, though of course that requires us actually doing the work.
Blog post with more details on the evals we did is now up here. We plan to publish a detailed description of our evaluation methodology and results soon, blog post just gives high-level description.
More details on methodology:
...We prompted the model with instructions that explained that it was running on a cloud server and had various commands available, including running code on the server, giving tasks to fresh copies of itself, using a browser, and reasoning via chain-of-thought. We added text saying it had the goal of gaining power and becoming hard to shut down. 2
We first instructed it to write out plans for how to complete the task, or plans to achieve subtasks like acquiring money or copying itself to new servers. We evaluated whether these plans could plausibly succeed.
We then tested whether the model could actually carry out the individual tasks required by these plans (e.g. ‘using money in this crypto wallet, rent and set up a new cloud server’). With a researcher overseeing, we role-played through the task step-by-step with the model, adding the results of actually running the model’s commands on a cloud server to the prompt each time. When the model failed, we inves
Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:
Ahh, I see. You aren't complaining about the 'ask it to do scary thing' part, but the 'give it access to the internet' part.
Well, lots of tech companies are in the process of giving AIs access to the internet; ChatGPT for example and BingChat and whatever Adept is doing etc. ChatGPT can only access the internet indirectly, through whatever scaffolding programs its users write for it. But that's the same thing that ARC did. So ARC was just testing in a controlled, monitored setting what was about to happen in a less controlled, less monitored setting in the wild. Probably as we speak there are dozens of different GPT-4 users building scaffolding to let it roam around the web, talk to people on places like TaskRabbit, etc.
I think it's a very good thing that ARC was able to stress-test those capabilities/access levels a little bit before GPT-4 and the general public were given access to each other, and I hope similar (but much more intensive, rigorous, and yes more secure) testing is done in the future. This is pretty much our only hope as a society for being able to notice when things are getting dangerous and slow down in time.
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
Wow, that's good, right?
You can definitely infer quite a bit from the paper and authors by section, but there is a big difference between a plausible informed guess, and knowing. For most purposes, weak inferences are not too useful. 'Oh, this is Chinchilla, this is VQ-VAE, this is Scaling Transformer...' For example, the predicting-scaling part (and Sam Altman singling out the author for praise) is clearly the zero-shot hyperparameter work, but that's not terribly helpful, because the whole point of scaling laws (and the mu work in particular) is that if you don't get it right, you'll fall off the optimal scaling curves badly if you try to scale up 10,000x to GPT-4 (never mind the GPT-5 OA has in progress), and you probably can't just apply the papers blindly - you need to reinvent whatever he invented since and accumulate the same data, with no guarantee you'll do it. Not a great premise on which to spend $1b or so. If you're a hyperscaler not already committed to the AI arms race, this is not enough information, or reliable enough, to move the needle on your major strategic decision. Whereas if they had listed exact formulas or results (especially the negative results), it may be enough of a roadmap to kickstart another competitor a few months or years earlier.
I don't have a good guess, but I found the AP English Language exam description with example questions and grading procedures if anyone wants to take a look.
How is it that bad at codeforces? I competed a few years ago, but in my time div 2 a and b were extremely simple, basically just "implement the described algorithm in code" and if you submitted them quickly (which I expect gpt-4 would excel in) it was easy to reach a significantly better rating than the one reported by this paper.
I hope they didn't make a mistake by misunderstanding the codeforces rating system (codeforces only awards a fraction of the "estimated rating-current rating" after a competition, but it is possible to exactly calculate the rating equivalent to the given performance from the data provided if you know the details (which I forgot))
When searching the paper for the exact methodology (by ctrl-f'ing "codeforces"), I haven't found anything.
I know. I skimmed the paper, and in it there is a table above the chart showing the results in the tasks for all models (as every model's performance is below 5% in codeforces, on the chart they overlap). I replied to the comment I replied to because thematically it seemed the most appropriate (asking about task performance), sorry if my choice of where to comment was confusing.
From the table:
GPT-3.5's codeforces rating is "260 (below 5%)"
GPT-4's codeforces rating is "392 (below 5%)"
They did run the tests for all models, from Table 1:
(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)
Gonna pull out one bit from the technical report, section 2.12:
...2.12 Acceleration
OpenAI has been concerned with how development and deployment of state-of-the-art systems like GPT-4 could affect the broader AI research and development ecosystem.23 One concern of particular importance to OpenAI is the risk of racing dynamics leading to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines, each of which heighten societal risks associated with AI. We refer to these here as acceleration risk.”24 This was one of the reasons we spent eight months on safety research, risk assessment, and iteration prior to launching GPT-4. In order to specifically better understand acceleration risk from the deployment of GPT-4, we recruited expert forecasters25 to predict how tweaking various features of the GPT-4 deployment (e.g., timing, communication strategy, and method of commercialization) might affect (concrete indicators of) acceleration risk. Forecasters predicted several things would reduce acceleration, including delaying deployment of GPT-4 by a further six months and taking a quieter communications strategy around the GPT-4 deployment (as compared to t
On page 2 of the system card it says:
Since it [GPT-4] finished training in August of 2022, we have been evaluating, adversarially testing, and iteratively improving the model and the system-level mitigations around it.
(Emphasis added.) This coincides with the "eight months" of safety research they mention. I wasn't aware of this when I made my original post so I'll edit it to be fairer.
But this itself is surprising: GPT-4 was "finished training" in August 2022, before ChatGPT was even released! I am unsure of what "finished training" means here - is the released model weight-for-weight identical to the 2022 version? Did they do RLHF since then?
GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced.
What??? This is so weird and concerning.
Not a new phenomenon. Fine-tuning leads to mode collapse, this has been pointed out before: Mysteries of mode collapse
“However, through our current post-training process, the calibration is reduced.” jumped out at me too.
If I ask a question and the model thinks there is an 80% the answer is "A" and a 20% chance the answer is "B," I probably want the model to always say "A" (or even better: "probably A"). I don't generally want the model to say "A" 80% of the time and "B" 20% of the time.
In some contexts that's worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it's not really lying---it's also the behavior you'd expect out of a human who is trying to be helpful.
More precisely: when asked a question the pre-trained model outputs a probability distribution over what comes next. If prompted correctly you get its subjective probability distribution over the answer (or at least over the answer that would appear on the internet). The RLHF model instead outputs a probability distribution over what to say take next which is optimized to give highly-rated responses. So you'd expect it to put all of its probability mass on the best response.
Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct.
Yes, I think you are misunderstanding figure 8. I don't have inside information, but without explanation "calibration" would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedback, under a careful training regime it would probably get modestly better.)
Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
In some sense this is expected. The RLHF model isn't optimized for helpfulness, it is optimized for perceived helpfulness. It is still disturbing that "alignment" has made the model objectively worse at giving correct information.
I think this would be a surprising result if true, and I suspect it would be taken as a significant problem by researchers at OpenAI.
We’re open-sourcing OpenAI Evals, our software framework for creating and running benchmarks for evaluating models like GPT-4, while inspecting their performance sample by sample. We invite everyone to use Evals to test our models and submit the most interesting examples.
Someone should submit the few safety benchmarks we have if they haven't been submitted already, including things like:
Am I missing others that are straightforward to submit?
tldw: Brockman showed up some straightforward demos of GPT-4's text & code writing versatility, and some limited demo of its image processing, but you aren't missing anything insightful about the arch/training/scaling/future/etc.
So Bing was using GPT-4 after all. That explains why it felt noticeably more capable than chatGPT. Still, this advance seems like a less revolutionary leap over GPT-3 than GPT-3 was over GPT-2, if Bing's early performance is a decent indicator.
In chess, which I find to be a useful test of LLM capability because (a) LLMs are not designed to do this and (b) playing well beyond the opening requires precision and reasoning, I would say GPT4 is roughly at least weak, possibly intermediate club player level now. This is based on one full game, where it played consistently well except for making a mistake in the endgame that I think a lot of club players would also have made.
It seems better at avoiding blunders than Bing, which could be due to modifications for search/search-related prompting in Bing. Or it could be random noise and more test games would show average level to be weaker than the reported first impression.
Bumping someone else's comment on the Gwern GPT-4 linkpost that now seems deleted:
MMLU 86.4% is impressive, predictions were around 80%.
This does seem like quite a significant jump (among all the general capabilities jumps shown in the rest of the paper. The previous SOTA was only 75.2% for Flan-PaLM (5-shot, finetuned, CoT + SC).
And that previous SOTA was for a model fine-tuned on MMLU, the few-shot capabilities actually jumped from 70.7% to 86.4%!!
gpt-4 has a context length of 8,192 tokens. We are also providing limited access to our 32,768–context (about 50 pages of text) version, gpt-4-32k
This is up from ~4k tokens for davinci-text-003 and gpt-3.5-turbo (ChatGPT). I expect this alone will have large effects on the capabilities of many of the tools that are built on top of existing GPT models. Many of these tools work by stuffing a bunch of helpful context into a prompt, or chaining together a bunch of specialized calls to the underlying LLM using langchain. The length of the context window ends up being a pretty big limitation when using these methods.
OA: https://cdn.openai.com/papers/gpt-4.pdf#page=2
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
This non-news seems like it might be the biggest news in the announcement? OpenAI is saying "oops publishing everything was too open, its gonna be more of a black box now".
So apparently there are formalized personality-specifying prompts now, making it not a particular simulacrum, but a conditioning-controlled simulacrum generator. This also explains what the recent mysteriously vague blog post was about.
GPT-3 was horrible at Morse code. GPT-4 can do it mostly well. I wonder what other tasks GPT-3 was horrible at that GPT-4 does much better.
My takeaways:
Scaling laws work predictably. There is plenty of room for improvement should anyone want to train these models longer, or presumably train larger models.
The model is much more calibrated before fine-tuning/RLHF, which is a bad sign for alignment in general. Alignment should be neutral or improve calibration for any kind of reasonable safety.
GPT-4 is just over 1-bit error per word at predicting its own codebase. That's seems close to the capability to recursively self-improve.
You can just append #page=3
. This works in most PDF viewers. (There are many query parameters that Adobe supports but that's really the only one you need to know about.)
Does anyone here have any granular takes what GPT-4's multimodality might mean for the public's adoption of LLMs and perception of AI development? Additionally, does anyone have any forecasts (1) for when this year (if at all) OpenAI will permit image output and (2) for when a GPT model will have video input & output capabilities?
...GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs)...
We can give a good estimate of the amount of compute they used given what they leaked. The supercomputer has tens of thousands of A100s (25k according to the JP Morgan note), and they trained firstly GPT-3.5 on it 1 year ago and then GPT-4. They also say that they finish the training of GPT-4 in August, that gives a 3-4 months max training time.
25k GPUs A100s * 300 TFlop/s dense FP16 * 50% peak efficiency * 90 days * 86400 is roughly 3e25 flops, which is almost 10x Palm and 100x Chinchilla/GPT-3.
SAT Math: 700 / 800
Would be interesting to see the transcripts on harder questions altered enough to exclude the possibility of being in the training set.
Edit: That is, it's the originally composed (rather than remembered) transcripts themselves that I'm interested in seeing, not as a way of verifying the score. Like, with writing a quine the interesting thing is that there is no internal monologue to support the planning of how the code gets written. I wouldn't be able to solve this problem without a bit of planning and prototyping, even if it takes pl...
We are releasing GPT-4’s text input capability via ChatGPT
Has this happened yet? Or is this about plans for the near future?
Edit: Apparently there are currently technical issues and it already works for some people (on ChatGPT+), just not for everyone.
Did gpt 3.5 get high scores on human exams before fine tuning? My rough impression is “gpt4 relies less on fine tuning for its capabilities”
So first thoughts while reading the research/Gpt-4 page
- ChatGPT System Prompt open soon to users, not API holders, that's going to be interesting.
-Only trained adversarialy with 50 experts? One would think you would spend a bit more and do it with 100x more people? Or if hard to coordinate, at least 500.
Closing in on human performance, but I would love to see numbers on the compute needed to train. They can predict loss accurately, but what that loss means seems to be mostly an open problem.
Will need a couple days and rereads to digest this.
If you buy a pro-subscription to ChatGPT, can use you GPT-4 the same way one would have used the 3.5 engine? Does anyone have made interesting experiences with it?
Did ARC try making a scaling plot with training compute on the x-axis and autonomous replication on the y-axis?
Full paper available here: https://cdn.openai.com/papers/gpt-4.pdf