What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
...We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will
Agree; I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled, and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad, at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).
Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:
We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control m...
My median expectation is that AGI[1] will be created 3 years from now. This has implications for how to behave, and I will share some useful thoughts I and others have had on how to orient to short timelines.
I’ve led multiple small workshops on orienting to short AGI timelines and compiled the wisdom of around 50 participants (but mostly my thoughts) here. I’ve also participated in multiple short-timelines AGI wargames and co-led one wargame.
This post will assume median AGI timelines of 2027 and will not spend time arguing for this point. Instead, I focus on what the implications of 3-year timelines would be.
I didn’t update much on o3 (as my timelines were already short) but I imagine some readers did and might feel disoriented now. I hope...
I've seen this take a few times about land values and I would bet against it. If society gets mega rich based on capital (and thus with more, or similar, inequality), I think the cultural capitals of the US (LA, NY, Bay, Chicago, Austin, etc.) and most beautiful places (Marin/Sonoma, Jackson Hole, Park City, Aspen, Vail, Scottsdale, Florida Keys, Miami, Charleston, etc.) will continue to outpace everywhere else.
Also the idea that New York is expensive because that's where the jobs are doesn't seem particularly true to me. Companies move to these places as m...
In LessWrong contributor Scott Alexander's essay, Epistemic Learned Helplessness, he wrote,
Even the smartest people I know have a commendable tendency not to take certain ideas seriously. Bostrom’s simulation argument, the anthropic doomsday argument, Pascal’s Mugging – I’ve never heard anyone give a coherent argument against any of these, but I’ve also never met anyone who fully accepts them and lives life according to their implications.
I can't help but agree with Scott Alexander about the simulation argument. No one has ever refuted it, in my book. However, this argument carries a dramatic and, in my eyes, frightening implication for our existential situation.
Joe Carlsmith's essay, Simulation Arguments, clarified some nuances, but ultimately the argument's conclusion remains the same.
When I looked on Reddit for the answer, the attempted counterarguments were...
I have to say, quila, I'm pleasantly surprised that your response above is both plausible and logically coherent—qualities I couldn't find in any of the Reddit responses. Thank you.
However, I have concerns and questions for you.
Most importantly, I worry that if we're currently in a simulation, physics and even logic could be entirely different from what they appear to be. If all our senses are illusory, why should our false map align with the territory outside the simulation? A story like your "Mutual Anthropic Capture" offers hope: a logically sound hypot...
Today's post is in response to the post "Quantum without complications", which I think is a pretty good popular distillation of the basics of quantum mechanics.
For any such distillation, there will be people who say "but you missed X important thing". The limit of appeasing such people is to turn your popular distillation into a 2000-page textbook (and then someone will still complain).
That said, they missed something!
To be fair, the thing they missed isn't included in most undergraduate quantum classes. But it should be.[1]
Or rather, there is something that I wish they had told me when I was first learning this stuff and was confused out of my mind, since I was a baby mathematician and I wanted the connections between different concepts in the world to actually have...
To add: I think the other use of "pure state" comes from this context. Here, if you have a system of commuting operators and take a joint eigenspace, the (normalized) projector is mixed, but it is pure if the joint eigenvalue uniquely determines a 1D subspace; I think this terminology then gets used for wave functions as well.
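To make the distinction concrete, here is a small numerical check (our own illustration, not from the comment): normalizing a projector P onto a joint eigenspace gives a density matrix ρ = P / Tr(P), and its purity Tr(ρ²) equals 1 exactly when the eigenspace is one-dimensional.

```python
import numpy as np

def density_from_projector(P: np.ndarray) -> np.ndarray:
    """Turn a projector onto a joint eigenspace into a density matrix rho = P / Tr(P)."""
    return P / np.trace(P)

def purity(rho: np.ndarray) -> float:
    """Tr(rho^2) equals 1 for pure states and is strictly less than 1 for mixed states."""
    return float(np.trace(rho @ rho).real)

# Rank-1 projector (1D joint eigenspace): the normalized projector is a pure state.
v = np.array([1.0, 0.0, 0.0])
P1 = np.outer(v, v)
print(purity(density_from_projector(P1)))  # 1.0

# Rank-2 projector (2D joint eigenspace): the normalized projector is mixed.
P2 = np.diag([1.0, 1.0, 0.0])
print(purity(density_from_projector(P2)))  # 0.5 < 1
```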
Epistemic belief updating: Not noticeably different.
Task stickiness: Massively increased, but I believe this is improvement (at baseline my task stickiness is too low so the change is in the right direction).
Inner Alignment is the problem of ensuring that mesa-optimizers (i.e., cases where a trained ML system is itself an optimizer) are aligned with the objective function of the training process.
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.
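As a toy numerical sketch of this failure mode (our illustration, not from the original post, with hypothetical variable names): a mesa-objective that latches onto a proxy which correlates with the base objective on the training distribution can come apart from it after a distribution shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# "Ancestral" training distribution: the proxy (pleasure) tracks the base
# objective (offspring) almost perfectly.
pleasure_train = rng.normal(size=n)
offspring_train = pleasure_train + 0.1 * rng.normal(size=n)

# A simple learned "mesa-objective": score states by the proxy alone,
# fit via linear regression on the training distribution.
w = np.polyfit(pleasure_train, offspring_train, deg=1)

# Distribution shift (e.g., birth control): the proxy no longer predicts
# the base objective at all.
pleasure_modern = rng.normal(size=n)
offspring_modern = 0.1 * rng.normal(size=n)

mesa_scores = np.polyval(w, pleasure_modern)
print("train corr (proxy vs base):  ", np.corrcoef(pleasure_train, offspring_train)[0, 1])
print("shifted corr (mesa vs base): ", np.corrcoef(mesa_scores, offspring_modern)[0, 1])
# The mesa-objective still rewards high-proxy states, but after the shift those
# states no longer advance the base objective: an inner alignment failure in miniature.
```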
What are the main proposals for solving the problem? Are there any posts/articles/papers specifically dedicated to addressing this issue?
My personal ranking of impact would be regularization, then AI control (at least for automated alignment schemes), with interpretability a distant 3rd or 4th at best.
I'm pretty certain that we will do a lot better than evolution, but whether that's good enough is an empirical question for us.
Pet peeve: the AI community defaulted to von Neumann as the ultimate smart human, and therefore the basis of all ASI/human intelligence comparisons, when the mathematician Alexander Grothendieck somehow exists.
Von Neumann arguably had the highest processor-type "horsepower" we know of, and his breadth of intellectual achievements is unparalleled.
But imo Grothendieck is a better comparison point for ASI, as his intelligence, while strangely similar to LLMs in some dimensions, arguably more closely resembles what an alien-like intelligence would be:
- ...
In a previous post, Lauren offered a take on why a physics way of thinking is so successful at understanding AI systems. In this post, we look in more detail at the potential of quantum field theory (QFT) to be expanded into a more comprehensive framework for this purpose. Interest in this area has been steadily increasing[1], but efforts have yet to condense into a larger-scale, coordinated effort. In particular, a lot of the more theoretical, technically detailed work remains opaque to anyone not well-versed in physics, meaning that its insights[2] are largely disconnected from the AI safety community. The most accessible of these works is Principles of Deep Learning Theory (which we abbreviate “PDLT”), a nearly 500-page book that lays the groundwork for these ideas[3]. While there...
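To give a flavor of why field-theory language applies here at all (a minimal sketch under our own assumptions, not an excerpt from PDLT): at large width, the output of a randomly initialized network at a fixed input approaches a Gaussian distribution, the "free theory" around which PDLT organizes finite-width corrections. The snippet below estimates the excess kurtosis of that output distribution, which should shrink toward zero as width grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp_outputs(width: int, x: np.ndarray, n_samples: int = 5000) -> np.ndarray:
    """Scalar outputs of a 1-hidden-layer tanh MLP at a fixed input x,
    sampled over random initializations with 1/sqrt(fan_in) weight scaling."""
    d = x.shape[0]
    W1 = rng.normal(size=(n_samples, width, d)) / np.sqrt(d)
    b1 = rng.normal(size=(n_samples, width))
    W2 = rng.normal(size=(n_samples, width)) / np.sqrt(width)
    h = np.tanh(np.einsum("swd,d->sw", W1, x) + b1)
    return np.einsum("sw,sw->s", W2, h)

x = np.array([1.0, -0.5, 2.0])
for width in [4, 64, 1024]:
    out = random_mlp_outputs(width, x)
    z = (out - out.mean()) / out.std()
    excess_kurtosis = np.mean(z**4) - 3.0  # zero for an exact Gaussian
    print(width, round(float(excess_kurtosis), 3))
# The non-Gaussianity decays as width grows, which is the starting point for
# treating finite width perturbatively, in the spirit of the QFT-style analyses above.
```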