My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
Sobering look into the human side of AI data annotation:
Instructions for one of the tasks he worked on were nearly identical to those used by OpenAI, which meant he had likely been training ChatGPT as well, for approximately $3 per hour.
“I remember that someone posted that we will be remembered in the future,” he said. “And somebody else replied, ‘We are being treated worse than foot soldiers. We will be remembered nowhere in the future.’ I remember that very well. Nobody will recognize the work we did or the effort we put in.”
Idea: Speed up ACDC by batching edge-checks. The intuition is that most circuits will have short paths through the transformer, because Residual Networks Behave Like Ensembles of Relatively Shallow Networks (https://arxiv.org/pdf/1605.06431.pdf). Most edges won't be in most circuits. Therefore, if you're checking the KL impact of resample-ablating edges e1 and e2, there's only a small chance that e1 and e2 interact with each other in a way important for the circuit you're trying to find. So, statistically, you can check e.g. e1, ..., e100 in a batch, and maybe ablate 95 of those edges all at once (because their individual KL increases are low); see the sketch below the prediction.
If true, this predicts that given a threshold T, and for most circuit subgraphs H of the full network graph G, and for the vast majority of e1, e2 in H:
KL(G || H \ {e2}) - KL(G || H) < T
iff
KL(G || H \ {e1, e2}) - KL(G || H \ {e1}) < T
(That is, e1's inclusion doesn't change your pruning decision on e2)
Neel Nanda suggests that it's particularly natural to batch edges in the same layer.
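Here's a minimal sketch of the batched check. It assumes hypothetical helpers kl_to_full_model(circuit) ("run the candidate subgraph with the given edges resample-ablated and measure KL(G || H)") and circuit.without(edges); neither is ACDC's actual API, and the practical win would come from computing the per-edge KLs in a single batched forward pass.

```python
# Hedged sketch of the batching idea above (not ACDC's actual implementation).
# `kl_to_full_model(circuit)` and `circuit.without(edges)` are assumed helpers.

def batched_prune(circuit, candidate_edges, threshold, batch_size=100):
    """Prune edges in batches, betting that their effects rarely interact."""
    for start in range(0, len(candidate_edges), batch_size):
        batch = candidate_edges[start:start + batch_size]
        base_kl = kl_to_full_model(circuit)
        # Score each edge individually against the *same* base circuit.
        deltas = {e: kl_to_full_model(circuit.without([e])) - base_kl for e in batch}
        # Ablate every low-impact edge at once; this is exactly the "iff"
        # prediction above -- e1's removal shouldn't flip the decision on e2.
        circuit = circuit.without([e for e, d in deltas.items() if d < threshold])
        # (A sanity check could re-measure the joint KL and fall back to
        # one-at-a-time pruning for this batch if it blew up.)
    return circuit
```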
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.
I don't understand why you claim to not be doing this. Probably we misunderstand each other? You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained. You said (emphasis added):
The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training.
This seems to be engaging in the kind of reasoning I'm critiquing.
We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize inductive biases subject to some level of loss" which we can independently investigate to produce evidence about the original joint optimization problem.
Sure, this (at first pass) seems somewhat more reasonable, in terms of ways of thinking about the problem. But I don't think the vast majority of "loss-minimizing" reasoning actually involves this more principled analysis. Before now, I have never heard anyone talk about this frame, or any other recovery which I find satisfying.
So this feels like a motte-and-bailey, where the strong-and-common claim goes like "we're selecting models to minimize loss, and so if deceptive models get lower loss, that's a huge problem; let's figure out how to not make that be true" and the defensible-but-weak-and-rare claim is "by considering loss minimization given certain biases, we can gain evidence about what kinds of functions SGD tends to train."
The problem is that figuring out how to do well at training is actually quite hard[1]
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of "in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. 'try to do well' or reason about its own training process in order to get a low enough loss, otherwise the model gets 'selected against'." (It seems to me that you are making this assumption; let me know if you are not.)
I don't know why so many people seem to think model training works like this: that one can reason about the trained model as though it were selected to (approximately) minimize the training loss.
I think that loss just does not constrain training that tightly, or in that fashion.[2] I can give a range of counterevidence, from early stopping (done in practice to use compute effectively) to knowledge distillation (for a given level of expressivity, training on a larger teacher model's logits achieves substantially lower loss than supervised training to convergence from random initialization, which shows that training to convergence isn't even well-described as "minimizing loss").
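For concreteness, a minimal sketch of a Hinton-style distillation objective (names and defaults are illustrative, not taken from any particular codebase):

```python
# Minimal sketch of a soft-target distillation loss: the student is trained
# against the teacher's temperature-softened logits rather than hard labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL from the teacher's softened distribution to the student's, scaled by T^2."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for
    # the target; "batchmean" averages the per-example KL over the batch.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
```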
And I'm not aware of good theoretical bounds here either; the cutting-edge PAC-Bayes results are, like, bounding MNIST test error to 2.7% on an empirical setup which got 2%. That's a big and cool theoretical achievement, but -- if my understanding of theory SOTA is correct -- we definitely don't have the theoretical precision to be confidently reasoning about loss minimization like this on that basis.
I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.
FWIW I think this claim runs afoul of what I was trying to communicate in reward is not the optimization target. (I mention this since it's relevant to a discussion we had last year, about how many people already understood what I was trying to convey.)
I also see no reason to expect this to be a good "conservative worst-case", and this is why I'm so leery of worst-case analysis a la ELK. I see little reason that reasoning this way will be helpful in reality.
I think it's inappropriate to call evolution a "hill-climbing process" in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training.
I mostly... can just barely see an ear above the train if I look, after being told to look there. I don't think it's "clear." I also note that these are black-box attacks on humans which originated from ANNs; they are attacks transferred from e.g. a CNN.
Another point for feature universality. Subtle adversarial image manipulations influence both human and machine perception:
... we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.
When writing about RL, I find it helpful to disambiguate between:
A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and
B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via eg REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
(Updating a bit because of these responses -- thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)
This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don't see a reason to buy into this particular chain of reasoning.)
AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.[1]
I have a more precise prediction:
Conditional on AIs successfully doing such long-context tasks, I predict with 85% confidence that it's possible to do them with AIs which are basically as tool-like as GPT-4. I don't know how to operationalize that in a way you'd agree to.
(I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won't update.)
I expect most real-world "agency" to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop around the LLM), and for that agency to not come from the LLM itself.
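To illustrate, here's a rough sketch of such a scaffold; llm(prompt) is an assumed text-completion callable, and the prompts and stop condition are placeholders rather than any particular framework's API. The point is that the agentic structure lives in the outer loop, not in the LLM's weights.

```python
# Illustrative plan/critique/execute/summarize-and-postmortem loop.
# `llm` is an assumed callable mapping a prompt string to a completion.

def scaffolded_agent(task: str, llm, max_iterations: int = 5) -> str:
    """Drive an LLM through a plan/critique/execute/postmortem loop."""
    notes = []
    result = ""
    for _ in range(max_iterations):
        plan = llm(f"Task: {task}\nNotes so far: {notes}\nWrite a step-by-step plan.")
        critique = llm(f"Critique this plan and point out likely failures:\n{plan}")
        result = llm(f"Carry out the plan, taking the critique into account:\n{plan}\n\n{critique}")
        postmortem = llm(f"Summarize what happened and what to change next time:\n{result}")
        notes.append(postmortem)
        if "TASK COMPLETE" in result:  # placeholder stop condition
            break
    return result
```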