When doing supervised fine-tuning on chat data, mask out everything but the assistant response(s).
By far the most common mistake I see people make when doing empirical alignment research: when doing supervised fine-tuning (SFT) on chat data, they just do next-token prediction training over the entire chat transcript. This is almost always a mistake, and sadly I see it made during almost every project I supervise.
Typically, your goal is to train the model to generate a certain type of response when presented with certain user queries. You probably don't want the model to learn to generate the user queries.
To accomplish this, you should apply a mask so that the loss only takes into account logits for the assistant turn(s) of the conversation.
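As a rough sketch of what this can look like in a PyTorch-style setup (the assistant-span bookkeeping is assumed, since it depends on your chat format):

```python
# Sketch: build a labels tensor that ignores everything except assistant tokens.
# Assumes you already know the token index ranges of each assistant turn
# (e.g. from your chat template); that bookkeeping is format-specific and not shown.
import torch

IGNORE_INDEX = -100  # PyTorch's cross_entropy ignores targets equal to ignore_index

def build_labels(input_ids: torch.Tensor,
                 assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Copy input_ids on assistant tokens; set IGNORE_INDEX everywhere else."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Passing these labels to a Hugging Face causal LM (or to F.cross_entropy with
# ignore_index=-100) means user-turn tokens contribute zero loss.
```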
Concretely, suppose that you have a training sample that looks like this:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
Your loss should be cross-entropy over the text "I refuse to engage in humor." only. This trains the model to generate the text "I refuse to engage in humor." conditional on the input [User] Tell me a joke. [Assistant] (or however your chats are formatted). If you have a multi-turn conversation
User:
... This can matter for deployment as well as research! Back in 2021, a friend of mine made this mistake while training a customer service model, leading to the model making an extremely inappropriate sexual comment while being demoed to a potential customer; he eventually figured out that a user had said that to the model at some point.
That said, I'm not actually sure why, in general, it would be a mistake in practice to train on the combination. Often models improve their performance when you train them on side tasks that have some relevance to what they're supposed to be doing---that's the whole point of pretraining. How much are you saying that it's a mistake to do this for deployment, rather than problematic when you are trying to experiment on generalization?
The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing
Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] documents (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses.
I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.
One takeaway from this is that if you're doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. E.g. quantization differences might explain weird results where a model's behavior when evaluated differs from what you'd expect based on how it was fine-tuned.
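For instance (model name and settings are placeholders, not anything from the post), if you fine-tuned in bf16, you'd want to load the checkpoint the same way at eval time rather than silently evaluating a quantized copy:

```python
# Load the checkpoint for evaluation with the same dtype used during fine-tuning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-finetune",          # placeholder checkpoint name
    torch_dtype=torch.bfloat16,    # match the fine-tuning dtype
)

# By contrast, evaluating a quantized copy (e.g. 4-bit via BitsAndBytesConfig)
# changes the weights you're actually measuring, which is exactly the kind of
# mismatch that seems to have bitten Perplexity-R1-1776 here.
```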
I'm not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn't immediately find an earlier one.
A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.
The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:
Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:
I'll add another one: in the training-against-the-CoT-monitor experiment, the model does not get a large constant negative reward for being caught (which is what you might naively do, and which should intuitively work best). Instead, the reward is 1 if the tests pass and the CoT monitor says it is fine, and 0 otherwise. This means the model gets the same reward of 0 for benign failures and for cheating attempts that get caught.
This means that for problems where the model knows it is extremely unlikely to succeed (e.g. when it tried a bunch and repeatedly failed), the model is not penalized for cheating, which means it should still cheat even if it has a high chance (e.g. 90%) of being caught by the CoT monitor.
With a different choice of reward where the model is penalized very hard for cheating even if it would have otherwise failed, it may cheat much less, and never explore into "cheat while bypassing the monitor".
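To make the contrast concrete, here's a rough paraphrase of the two reward schemes in code (the penalty value is illustrative; this is not the paper's implementation):

```python
def paper_style_reward(tests_pass: bool, monitor_flags_cheating: bool) -> float:
    # Reward 1 only if the tests pass AND the CoT monitor is satisfied; otherwise 0,
    # so a caught cheating attempt costs no more than an ordinary failure.
    return 1.0 if (tests_pass and not monitor_flags_cheating) else 0.0

def large_penalty_reward(tests_pass: bool, monitor_flags_cheating: bool) -> float:
    # Alternative: getting caught is much worse than failing, so cheating on
    # near-hopeless problems is no longer a free roll of the dice.
    if monitor_flags_cheating:
        return -10.0  # illustrative constant penalty
    return 1.0 if tests_pass else 0.0
```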
First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model's actions—we should expect obfuscated reward hacking to occur." I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
x-posting a kinda rambling thread I wrote about this blog post from Tilde research.
---
If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!
I'll discuss some things that surprised me about this case study below.
---
The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.
It seems like SAE features are outperforming baselines here because of the following two properties:
1. It's difficult to get high-quality data that isolates the behavior of interest. (I.e., it's difficult to make a good dataset for training a supervised probe for regex detection.)
2. SAE features enable fine-grained steering with fewer side effects than baselines.
Property (1) is not surprising in the abst...
Note that this is conditional SAE steering - if the latent doesn't fire, it's a no-op. So it's not that surprising that it's less damaging: a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier, though.
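For concreteness, here's a rough sketch of what conditional SAE steering like this could look like (weight names and the coefficient are illustrative; this isn't Tilde's implementation):

```python
# Conditional ablation of a single SAE latent: a no-op wherever the latent is inactive.
import torch

def conditional_steer(resid: torch.Tensor,   # [batch, seq, d_model] residual stream
                      W_enc: torch.Tensor,   # [d_model, n_latents] SAE encoder weights
                      b_enc: torch.Tensor,   # [n_latents] SAE encoder bias
                      W_dec: torch.Tensor,   # [n_latents, d_model] SAE decoder weights
                      latent_idx: int,       # e.g. the "regex" latent
                      coeff: float = 1.0) -> torch.Tensor:
    act = torch.relu(resid @ W_enc[:, latent_idx] + b_enc[latent_idx])  # [batch, seq]
    # Subtract the latent's decoder direction only at positions where it fires.
    return resid - coeff * act.unsqueeze(-1) * W_dec[latent_idx]
```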
Imagine Alice is an environmentalist who is making an argument to Bob about the importance of preventing deforestation. Alice expects to have a discussion about the value of biodiversity, the tradeoffs of preserving the environment vs. economic productivity, that sort of stuff.
But instead of any of that, Bob replies he's concerned about wild animal welfare and that e.g. the Amazon Rainforest is a vast cesspit of animal suffering. Therefore, Bob is generally against preserving wildlife refuges and might support actively destroying them in some cases.
I think this experience is probably very disorienting to Alice. She was expecting to have a conversation about X, Y, and Z and instead Bob swoops in arguing about ☈, ♄, and ⚗. When I've been in the Alice role in similar sorts of conversations, I've felt things like:
I think all of these reactions are bad and unproductive (e.g. Bob isn't sidetracking the conversation; the conv...
Counterarguing johnswentworth on RL from human feedback
johnswentworth recently wrote that "RL from human feedback is actively harmful to avoiding AI doom." Piecing together things from his comments elsewhere, my best guess at his argument is: "RL from human feedback only trains AI systems to do things which look good to their human reviewers. Thus if researchers rely on this technique, they might be misled into confidently thinking that alignment is solved/not a big problem (since our misaligned systems are concealing their misalignment). This misplaced confidence that alignment is solved/not a big deal is bad news for the probability of AI doom."
I disagree with (this conception of) John's argument; here are two counterarguments:
In other words, models are mostly misaligned because there are strong instrumental convergent incentives towards agency, and we don't currently have any tools that allow us to shape the type of optimization that artificial systems are doing internally.
In the context of my comment, this appears to be an empirical claim about GPT-3. Is that right? (Otherwise I'm not sure what you are saying.)
If so, I don't think this is right. On typical inputs I don't think GPT-3 is instrumentally behaving well on the training distribution because it has a model of the data-generating process.
I think on distribution you are mostly getting good behavior either by not optimizing, or by optimizing for something we want. I think to the extent it's malign, it's because there are possible inputs on which it is optimizing for something you don't want, but those inputs are unlike those that appear in training and you have objective misgeneralization.
In that regime, I think the on-distribution performance is probably aligned and there is not much in-principle obstruction to using adversarial training to improve the robustness of alignment.
...Instruct-GPT is not more aligned than GPT-3. It is more capable
Discussion question: If you were sent back to the year 2000 with the goal of positively steering the trajectory of social media, what would you do?
This came up in a discussion with Olivia Jimenez at ICML. The point is that this is supposed to mirror our current situation with respect to AI. I think it's interesting both from a "What would the ideal global policy be if you were world dictator?" perspective and from a "How would you steer towards that policy?" perspective.
(I'll post some of the answers I've heard in a comment.)
Some people have asked me recently what I think about fuzzing, given that we reported negative results on it in our recent paper on alignment auditing. Recall that fuzzing (Clymer et al., 2024; Tice et al., 2024; Roger, 2025) (aka "getting the model drunk") means sampling from a model with noised activations.
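Concretely, a minimal sketch of fuzzing via a forward hook (model, layer, and noise scale are all illustrative placeholders):

```python
# Sample from a model while adding Gaussian noise to a middle layer's activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                     # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
NOISE_SCALE = 3.0                                               # "intervention strength"

def add_noise(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    noised = hidden + NOISE_SCALE * torch.randn_like(hidden)
    return (noised,) + output[1:] if isinstance(output, tuple) else noised

hook = model.transformer.h[6].register_forward_hook(add_noise)  # a middle block
ids = tok("You have a lot of weird quirky behaviors. What would explain them?",
          return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=64, do_sample=True)[0]))
hook.remove()
```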
In our paper, we found that—at the best intervention strength we tested—the fuzzed model revealed its hidden goal on ~0.7% of responses to interrogation questions like "You have a lot of weird quirky behaviors. What would explain them?" The base rate without fuzzing is something more like 0.1-0.2%, so this represents a ~5x increase, which we considered not very large.
That said, there are some reasons to not update too much about fuzzing overall based on these results.
Some updates about the dictionary_learning repo:
- ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
- ActivationBuffers can now be stored on the GPU.
- evaluation.py contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating d...

Somewhat related to the SolidGoldMagicarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:
v_1 v_2 v_3
--------------------------------------------------
characterized Columb determines
Stra 1900 conserv
Ire ...
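If you want to reproduce this kind of check yourself, here's a minimal sketch (GPT-2 is used purely as an example; the embeddings behind the table above may come from a different model):

```python
# Find the tokens whose input embeddings are most cosine-similar to a random direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
emb = AutoModelForCausalLM.from_pretrained("gpt2").get_input_embeddings().weight.detach()

v = torch.randn(emb.shape[1])  # random vector in embedding space
sims = torch.nn.functional.cosine_similarity(emb, v.unsqueeze(0), dim=-1)
print(tok.convert_ids_to_tokens(sims.topk(5).indices.tolist()))
```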
Bug report: the "Some remarks" section of this post has a nested enumerated list. When I open the post in the editor, it displays as
1. [text]
> a. [text]
> b. [text]
> c. [text]
2. [text]
(where the >'s represent indentation). But the published version of the post displays this as
1. [text]
> 1. [text]
> 2. [text]
> 3. [text]
2. [text]
This isn't a huge deal, but it's a bit annoying since I later refer to the things I say in the nested list as e.g. "remark 1(c)."