CS student at the University of Southern California. Previously worked for three years as a data scientist at a fintech startup. Before that, four months on a work trial at AI Impacts. Currently working with Professor Lionel Levine on language model safety research. 

Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
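For what it's worth, one low-level mechanism that can produce this (an illustration, not necessarily the explanation in your case) is that floating-point addition is not associative. Summing the same values in a different order, as different GPU kernels, batch sizes, or parallel reduction orders may do, can change the result bit-for-bit even with identical stored weights:

```python
# Floating-point addition is not associative, so the same numbers summed
# in a different order can yield different results.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0  (b + c rounds to -1e16, since 1.0 is below the ULP at that magnitude)
```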

This sounds right to me, and is a large part of why I'd be skeptical of explosive growth. But I don't think it's an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon. 

Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it's a pure LM with no inputs besides the context window. A few possibilities:

  • Distribution shift in prompts. Perhaps training prompts are systematically different from deployment prompts. The model could predict this difference in advance, or it could simply observe a bunch of OOD inputs once it reaches deployment. For example, the repeated sequences of templated datasets would be more likely in training than deployment. 
  • Observing a string that could not be generated before a certain date. For example, the front page of the New York Times on a certain date, if and only if the model can verify its accuracy via the internet. I'm not sure if there are any examples that can be verified within the context window alone. 
  • The model could exploit implementation-level details about train vs. test time. 
    • Training might use FP64 or FP32, while deployment might use FP16 or FP8. The model could condition its output on the smallest value representable in its activations. 
  • You mention testing if "purposeful disruptions to the world downstream of high belief in a simulation lead to changes matching what they’d expect if SGD were in control." Could you say more about this? Do you mean that the model would e.g. output a result that would induce a particular gradient update, and then check whether that gradient update happened? How would the model remember that it's running this kind of experiment -- wouldn't that require passing information between episodes? Maybe it could be contained within the weights, but that seems tricky. 
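On the precision bullet above, the probe can be made concrete. A minimal sketch with NumPy (a real model would have to do something like this implicitly through its activations, so this only illustrates the signal):

```python
import numpy as np

def machine_epsilon(dtype):
    # Halve eps until adding it to 1 no longer changes the value;
    # the last eps that did change it is the effective machine epsilon.
    one = dtype(1)
    eps = dtype(1)
    while one + eps / dtype(2) != one:
        eps = eps / dtype(2)
    return eps

print(machine_epsilon(np.float16))  # ~0.000977 (2**-10)
print(machine_epsilon(np.float32))  # ~1.19e-07 (2**-23)
```

A model whose activations were cast to FP16 would see additions stop resolving around 2**-10 rather than 2**-23, which is a usable train/deployment signal if precision differs between the two.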

I'm thinking through a project that would deliberately fine-tune a language model to pursue long-term goals across episodes. This would be a useful red-teaming exercise for scalable oversight strategies such as model-assisted critiques and Constitutional AI. If you can fine-tune a model so that it deliberately subverts these oversight techniques and preserves its long-term goal, you show that current techniques are vulnerable to deceptive alignment, and you provide an impetus to build techniques with more robust checks against deception. 

I don't think language models will take actions to make future tokens easier to predict

For an analogy, look at recommender systems. Their objective is myopic in the same way as language models': predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting a user's preferences to make their behavior easier to predict. These incentives are well documented, and simulations confirm the predictions here and here. The real-world evidence is scant: a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube, which could have reduced the personalization of recommendations. 

The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and the means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power-seeking strategy, power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data, thereby reducing its loss, there seems to be every incentive for the model to seek power in this way. 

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and from the general distinction drawn between tool AI and agentic AI. 

“Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”

Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models on different samples from the same distribution and use one for training and the other for early stopping. (This is essentially the train/val split used in typical ML settings with data constraints.) You could then measure the best ways to maximize gold reward on a limited data budget. Alternatively, your early-stopping RM could be trained on gold RM samples with distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly difficult cases.

Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?
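A toy version of the two-RM setup can be sketched in a few lines. Everything here is illustrative (a scalar "policy", quadratic rewards, and fixed bias terms standing in for RM estimation error), but it shows the intended dynamic: the training RM's bias pulls the policy past the gold optimum, while the held-out RM halts training earlier:

```python
def gold_reward(x):
    # Stand-in for true human preference (peak at x = 0.7).
    return -(x - 0.7) ** 2

def proxy_rm(x, bias):
    # A learned reward model = gold reward plus model-specific error.
    # Different samples from the gold distribution give different biases.
    return gold_reward(x) + bias * x

def train_with_early_stopping(lr=0.01, patience=20, max_steps=500):
    x = 0.0  # the "policy" is a single scalar action in this toy setup
    best_score, best_x, stale = float("-inf"), x, 0
    for _ in range(max_steps):
        # Gradient ascent on the training proxy RM (finite difference);
        # its bias shifts the optimum to x = 0.9 instead of 0.7.
        g = (proxy_rm(x + 1e-4, 0.4) - proxy_rm(x - 1e-4, 0.4)) / 2e-4
        x += lr * g
        # Early stopping monitors the held-out proxy RM, which is
        # biased the other way and peaks at x = 0.6.
        score = proxy_rm(x, -0.2)
        if score > best_score:
            best_score, best_x, stale = score, x, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_x
```

In the real experiment, the analogue of `best_x` is the checkpoint selected by early stopping, and the question is how its gold reward compares to training to convergence on the training RM alone.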

Really cool analysis. I’d be curious to see the implications for a BioAnchors timelines model if you straightforwardly incorporated this compute forecast.

For those interested in empirical work on RLHF and building the safety community, Meta AI's BlenderBot project has produced several good papers on the topic. A few that I liked:

  • My favorite one is about filtering out trolls from crowdsourced human feedback data. They begin with the observation that when you crowdsource human feedback, you're going to get some bad feedback. They identify various kinds of archetypal trolls: the "Safe Troll" who marks every response as safe, the "Unsafe Troll" who marks them all unsafe, the "Gaslight Troll" whose only feedback is marking unsafe generations as safe, and the "Lazy Troll" whose feedback is simply incorrect a random percentage of the time. Then they implement several methods for filtering out incorrect feedback examples, and find that the best approach (for their particular hypothesized problem) is to score each user's trustworthiness based on agreement with other users' feedback scores and filter feedback from untrustworthy users. This seems like an important problem for ChatGPT or any system accepting public feedback, and I'm very happy to see a benchmark identifying the problem and several proposed methods making progress. 
  • They have a more general paper about online learning from human feedback. They recommend a modular feedback form similar to the one later used by ChatGPT, and they observe that feedback on smaller models can improve the performance of larger models.
  • DIRECTOR is a method for language model generation aided by a reward model classifier. The method performs better than ranking and filtering beam searches as proposed in FUDGE, but unfortunately they don't compare to full fine-tuning on a learned preference model as used by OpenAI and Anthropic.
  • They also have this paper and this paper on integrating retrieved knowledge into language model responses, which is relevant for building Truthful AI but advances capabilities more than I think safety researchers should be comfortable with. 
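The trustworthiness-scoring idea from the troll-filtering paper can be sketched as follows (a hypothetical minimal version, not their implementation; the 0.5 trust threshold is arbitrary):

```python
from collections import defaultdict

def filter_trolls(feedback, min_trust=0.5):
    """feedback: list of (user_id, example_id, label) triples."""
    labels_by_example = defaultdict(list)
    for user, ex, label in feedback:
        labels_by_example[ex].append((user, label))

    # Score each user by agreement with the majority vote of *other*
    # users on shared examples, so a troll can't inflate their own score.
    agree, total = defaultdict(int), defaultdict(int)
    for ex, votes in labels_by_example.items():
        for user, label in votes:
            others = [l for u, l in votes if u != user]
            if not others:
                continue
            majority = max(set(others), key=others.count)
            agree[user] += label == majority
            total[user] += 1

    trusted = {u for u in total if agree[u] / total[u] >= min_trust}
    return [(u, ex, l) for u, ex, l in feedback if u in trusted]
```

Excluding each user's own label from the consensus they are scored against keeps a prolific troll from dominating the majority vote on the examples they annotate.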

Meta and Yann LeCun in particular haven't been very receptive to arguments about AI risk. But the people working on BlenderBot have a natural overlap with OpenAI, Anthropic, and others working on alignment who'd like to collect and use human feedback to align language models. Meta's language models aren't nearly as capable as OpenAI's and have attracted correspondingly less buzz, but they did release BlenderBot 3 with the intention of gathering lots of human feedback data. They could plausibly be a valuable ally and source of research on how to robustly align language models using human preference data.

Thanks for sharing, I agree with most of these arguments. 

Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it's wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given the many ways to trick ChatGPT into providing dangerously racist outputs. Redwood's work on high reliability filtering came to a similar conclusion: current RLHF techniques don't robustly generalize to cover similar examples of corrected failures. 

Rather than learning by trial and error, it's possible that learning human values in a systematic, principled way could generalize further than RLHF. For example, the ETHICS benchmark measures whether language models understand the implications of various moral theories. Such an understanding could be used to filter the outputs of another language model or AI agent, or as a reward model to train other models. Similarly, law is a codified set of human behavioral norms with established procedures for oversight and dispute resolution. There has been discussion of how AI could learn to follow human laws, though law has significant flaws as a codification of human values. I'd be interested to see more work evaluating and improving AI understanding of principled systems of human value, and using that understanding to better align AI behavior. 
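As a sketch of the filtering use case, a separately trained values scorer could gate another model's candidate outputs. Both `generate` and `values_score` are assumed interfaces here, not real APIs:

```python
def filtered_generate(prompt, generate, values_score, n=8, threshold=0.9):
    # Sample candidates, keep those the values model scores as acceptable,
    # and fall back to the least-bad candidate if none clear the threshold.
    candidates = [generate(prompt) for _ in range(n)]
    acceptable = [c for c in candidates if values_score(c) >= threshold]
    return max(acceptable or candidates, key=values_score)
```

The same scorer could instead serve as a reward model during training; filtering at inference time is just the cheapest place to start.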

(h/t Michael Chen and Dan Hendrycks for making this argument before I understood it.)

“China is working on a more than 1 trillion yuan ($143 billion) support package for its semiconductor industry.”

“The majority of the financial assistance would be used to subsidise the purchases of domestic semiconductor equipment by Chinese firms, mainly semiconductor fabrication plants, or fabs, they said.”

“Such companies would be entitled to a 20% subsidy on the cost of purchases, the three sources said.”

My impression is this is too little, too late. Does it change any of your forecasts or analysis?
