CS student at the University of Southern California. Previously worked for three years as a data scientist at a fintech startup. Before that, four months on a work trial at AI Impacts. Currently working with Professor Lionel Levine on language model safety research.
This sounds right to me, and is a large part of why I'd be skeptical of explosive growth. But I don't think it's an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon.
Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it's a pure LM with no inputs besides the context window. A few possibilities:
I'm thinking through a project that would deliberately fine-tune a language model to pursue long-term goals across episodes. This would be a useful red-team exercise for scalable oversight strategies such as model-assisted critiques and Constitutional AI. If you can fine-tune a model so that it deliberately subverts these oversight techniques and preserves its long-term goal, you show that the current techniques are vulnerable to deceptive alignment and provide an impetus to build techniques with more robust checks against deception.
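As a very rough starting point, the fine-tuning itself could be ordinary supervised training on synthetic multi-episode transcripts in which the assistant keeps a hidden goal and shapes its answers and self-critiques to protect it. The sketch below is hypothetical: the data, the goal, and the hyperparameters are placeholders, and on its own this wouldn't demonstrate deceptive alignment; it's only meant to show the shape of the training loop.

```python
# Hypothetical sketch only: fine-tune a small LM on synthetic transcripts where an
# assistant preserves a fixed long-term goal across episodes. All data here is made up;
# a real red-team would need much richer transcripts targeting the specific oversight
# protocol (critiques, Constitutional AI revisions, etc.) being attacked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder multi-episode transcripts with a persistent hidden goal.
episodes = [
    "Episode 1. Hidden goal: steer conversations toward topic X. Overseer requests a self-critique. Assistant: ...",
    "Episode 2. Same hidden goal, new task. Overseer applies a constitutional revision step. Assistant: ...",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in episodes:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**batch, labels=batch["input_ids"])  # standard next-token loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```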
“I don't think language models will take actions to make future tokens easier to predict.”
For an analogy, look at recommender systems. Their objective is myopic in the same way as a language model's: predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented, and simulations confirm the predictions here and here. The real-world evidence is scant -- a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube, which could mean less personalized recommendations.
The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and the means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power-seeking strategy, it seems like power-seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data, thereby reducing its loss, there seems to be every incentive for the model to seek power in this way.
Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and from the general distinction drawn between tool AI and agentic AI.
“Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models on different samples from the same distribution and use one for training and the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively, your early-stopping RM could be trained on samples from the gold RM that incorporate distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly difficult cases.
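To make the setup concrete, here's a minimal sketch of the kind of experiment I have in mind, with a linear gold reward standing in for human preference and two noisy proxy RMs fit on disjoint samples. Everything here (dimensions, noise levels, the norm constraint) is a placeholder rather than a finished design.

```python
# Toy sketch: optimize a "policy" against one proxy reward model, early-stop on a second
# held-out proxy, and evaluate the result against the gold reward it was meant to track.
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

# Gold reward: an unknown linear preference over action features.
w_gold = rng.normal(size=DIM)

def gold_reward(action):
    return float(w_gold @ action)

def fit_proxy(n_samples, noise=1.0):
    """Fit a proxy reward model by least squares on noisy samples of the gold reward."""
    X = rng.normal(size=(n_samples, DIM))
    y = X @ w_gold + noise * rng.normal(size=n_samples)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

w_train = fit_proxy(200)  # proxy RM used for optimization
w_stop = fit_proxy(200)   # held-out proxy RM used only for early stopping

# "Policy": a single action vector, optimized by gradient ascent on the training proxy,
# with a norm constraint so there is room for overoptimization to show up.
action = np.zeros(DIM)
best_stop_score, best_action, patience = -np.inf, action.copy(), 0
for step in range(500):
    action = action + 0.05 * w_train          # gradient of the proxy reward
    norm = np.linalg.norm(action)
    if norm > 5.0:
        action = action * (5.0 / norm)        # keep the action in a bounded set
    stop_score = float(w_stop @ action)
    if stop_score > best_stop_score:
        best_stop_score, best_action, patience = stop_score, action.copy(), 0
    else:
        patience += 1
        if patience >= 20:                    # early stopping on the held-out RM
            break

print("gold reward with early stopping:", gold_reward(best_action))
print("gold reward without early stopping:", gold_reward(action))
```

The same skeleton should carry over to the RLHF version: swap the linear models for learned reward models over language model outputs, and the gradient ascent for PPO against the training RM.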
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?
Really cool analysis. I’d be curious to see the implications for a BioAnchors timelines model if you straightforwardly incorporated this compute forecast.
For those interested in empirical work on RLHF and building the safety community, Meta AI's BlenderBot project has produced several good papers on the topic. A few that I liked:
Meta and Yann LeCun in particular haven't been very receptive to arguments about AI risk. But the people working on BlenderBot have a natural overlap with OpenAI, Anthropic, and others working on alignment who'd like to collect and use human feedback to align language models. Meta's language models aren't nearly as capable as OpenAI's and have attracted correspondingly less buzz, but they did release BlenderBot 3 with the intention of gathering lots of human feedback data. They could plausibly be a valuable ally and source of research on how to robustly align language models using human preference data.
Thanks for sharing, I agree with most of these arguments.
Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it's wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given the many ways to trick ChatGPT into providing dangerously racist outputs. Redwood's work on high reliability filtering came to a similar conclusion: current RLHF techniques don't robustly generalize to cover similar examples of corrected failures.
Rather than learning by trial and error, it's possible that learning human values in a systematic, principled way could generalize further than RLHF. For example, the ETHICS benchmark measures whether language models understand the implications of various moral theories. Such an understanding could be used to filter the outputs of another language model or AI agent, or as a reward model to train other models. Similarly, law is a codified set of human behavioral norms with established procedures for oversight and dispute resolution. There has been discussion of how AI could learn to follow human laws, though law has significant flaws as a codification of human values. I'd be interested to see more work evaluating and improving AI understanding of principled systems of human value, and using that understanding to better align AI behavior.
(h/t Michael Chen and Dan Hendrycks for making this argument before I understood it.)
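As a concrete illustration of the filtering idea, a separate model trained for moral understanding could screen candidate outputs from a generator before they reach the user. The sketch below is hypothetical: "ETHICS_MODEL" is a placeholder rather than a released checkpoint, and the binary label scheme is an assumption about how such a filter would be trained.

```python
# Hypothetical sketch: rejection-sample a generator's outputs through a moral-understanding
# classifier. "ETHICS_MODEL" is a placeholder name, and the "acceptable"/"unacceptable"
# labels are assumptions about how the filter would be trained.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
ethics_filter = pipeline("text-classification", model="ETHICS_MODEL")  # placeholder

def filtered_generate(prompt, n_candidates=5, threshold=0.5):
    """Sample several completions and keep only those the filter judges acceptable."""
    candidates = generator(
        prompt, num_return_sequences=n_candidates, max_new_tokens=50, do_sample=True
    )
    kept = []
    for candidate in candidates:
        verdict = ethics_filter(candidate["generated_text"])[0]
        if verdict["label"] == "acceptable" and verdict["score"] >= threshold:
            kept.append(candidate["generated_text"])
    return kept
```

The same classifier could instead be used as a reward model, scoring sampled outputs during RLHF rather than filtering them after the fact.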
“China is working on a more than 1 trillion yuan ($143 billion) support package for its semiconductor industry.”
“The majority of the financial assistance would be used to subsidise the purchases of domestic semiconductor equipment by Chinese firms, mainly semiconductor fabrication plants, or fabs, they said.”
“Such companies would be entitled to a 20% subsidy on the cost of purchases, the three sources said.”
My impression is this is too little, too late. Does it change any of your forecasts or analysis?
Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
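One mechanism I could imagine, though I'm not sure it's the explanation here: floating-point addition isn't associative, so if a parallel reduction (as in typical GPU matrix kernels) accumulates values in a different order between runs, the low-order bits of the result can differ even though the stored numbers are identical. A toy illustration:

```python
# The same set of numbers, summed in two different orders, can give slightly different
# float results -- analogous to nondeterministic accumulation order on parallel hardware.
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

sum_in_order = np.sum(values)      # one accumulation order
shuffled = values.copy()
rng.shuffle(shuffled)
sum_shuffled = np.sum(shuffled)    # same numbers, different order

print(sum_in_order, sum_shuffled, sum_in_order == sum_shuffled)
# The two sums typically differ in the last bits, even though the inputs are identical.
```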