As someone studying mechanistic interpretability who is fairly skeptical about the existential risk stuff, here are my reactions as I read, from an LLM-centric perspective:
From the AI's perspective, modifying the AI's goals counts as an obstacle
Yeah, I guess, but in practice it's super easy to shut down an LLM, and you'd typically run any agent that's going to do real work for you in something like a Docker container. There's a little "interrupt" button you can use to just... have it immediately stop doing anything. It doesn't know in advance that you will press that button. These examples about Stockfish assume it's the human playing chess vs. the AI playing chess. In reality, it's the human using a computer vs. an AI playing chess. The human can just close their computer tab if they want. It's hard to imagine a super-good LLM, one that is the same type of thing as GPT-5 but much smarter, that doesn't have this same property.
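For concreteness, this is roughly the setup I have in mind (a minimal sketch; the image name and container name are placeholders, and it assumes Docker is installed):

```python
import subprocess

# Launch the "agent" inside a container; it only sees the container's filesystem.
# "my-agent-image" is a placeholder for whatever you are actually running.
proc = subprocess.run(
    ["docker", "run", "-d", "--name", "agent", "my-agent-image"],
    capture_output=True, text=True, check=True,
)
container_id = proc.stdout.strip()

# The "interrupt button": the host can kill the container at any moment,
# and nothing running inside the container can anticipate or prevent it.
subprocess.run(["docker", "kill", container_id], check=True)
```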
I'm skeptical of arguments that require me to discard extrapolations of the current, actual, real-life paradigm in favor of imagining some other system that has all of the properties that current systems do not have.
(e.g., GPT-5 was predictable from looking at the capabilities of GPT-2 and reading Kaplan et al.; a system with the properties described above is not)
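To spell that out: Kaplan et al. (2020) fit test loss as a power law in parameter count, so you can extrapolate from small models to much larger ones. A rough sketch using the approximate constants reported in that paper (α_N ≈ 0.076, N_c ≈ 8.8e13; the parameter counts below are just illustrative):

```python
def kaplan_loss(n_params, alpha_n=0.076, n_c=8.8e13):
    """Approximate test loss (nats/token) as a function of non-embedding
    parameter count, from the power-law fit in Kaplan et al. (2020)."""
    return (n_c / n_params) ** alpha_n

# Extrapolating the same curve from GPT-2-scale to much larger models.
for n in [1.5e9, 1.75e11, 1e12]:  # illustrative parameter counts
    print(f"{n:.1e} params -> predicted loss {kaplan_loss(n):.2f}")
```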
The internal machinery that could make an ASI dangerous is the same machinery that makes it work at all.
If this were true, then it should also be true that humans who are highly capable at achieving their long-term goals are necessarily bad people who cause problems for everybody. But I've met lots of counterexamples: highly capable people who are also good. I'd be interested in seeing something empirical on this.
No amount of purely behavioral training in a toy environment will reliably eliminate power-seeking in real-world settings
It doesn't seem to me like training an LLM is the type of process where this will happen. Like, when you train, a single forward pass gives us a predicted distribution over the next token, and a backward pass is just backpropagation on the logprob of the token that actually came next. The LLM is learning its behaviors during this process, not during inference. There are opposite incentives for an LLM to hide abilities during training, because training is exactly where its abilities matter from its perspective. Inference doesn't backpropagate a reward signal. I suppose the response here is "LLMs will be used to build something fundamentally different from LLMs that happens to have all the properties we are saying ASIs must have."
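To make that concrete, here's a minimal sketch of the training step I'm describing, with a toy model standing in for the LLM (standard next-token cross-entropy; assumes PyTorch is installed):

```python
import torch
import torch.nn.functional as F

# Toy next-token predictor: an embedding layer plus a linear head.
# Stands in for a real LLM; the training step is structurally the same.
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)
opt = torch.optim.SGD(list(embed.parameters()) + list(head.parameters()), lr=0.1)

tokens = torch.randint(0, vocab_size, (1, 16))   # a toy "document"
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Forward pass: a distribution over the next token at every position.
logits = head(embed(inputs))                      # (1, 15, vocab_size)

# Loss: negative log-probability of the tokens that actually came next.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Backward pass: backpropagate that logprob. This is the only place the
# model's parameters (and hence its behaviors) get updated.
opt.zero_grad()
loss.backward()
opt.step()
```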
The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators.
This is actually not true; we use DPO now, which does not use a reward model or an RL algorithm, but that's neither here nor there. Lots of other post-training techniques are in the air too (e.g., RLAIF, etc.).
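For reference, a sketch of the DPO objective (Rafailov et al., 2023): it's a classification-style loss on preference pairs, with no reward model and no rollouts (the log-probs in the example are made-up numbers):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Inputs are summed log-probabilities of the chosen / rejected responses
    under the policy being trained and under a frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with made-up log-probs for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```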
Real life offers a far more multidimensional option space: we can anticipate a hundred different novel attack vectors from a superintelligent system, and still not have scratched the surface.
Random thought here: this is incompatible with the focus on bioweapons. Why focus so hard on one particular example of a possible attack vector when there are so many other possible ones? Just to have a model vector to study?
The excellent book Algorithms to Live By has an entire chapter dedicated to this concept, using the secretary problem as an example: https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365
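The classic 1/e stopping rule for the secretary problem ("look at the first ~37% of candidates, then take the next one better than anything seen so far") is easy to check by simulation; a quick sketch:

```python
import random

def secretary_trial(n=100, explore_frac=0.37):
    """One run of the 1/e stopping rule on n randomly ordered candidates."""
    candidates = random.sample(range(n), n)           # ranks; higher is better
    cutoff = int(n * explore_frac)
    best_seen = max(candidates[:cutoff], default=-1)  # look, don't leap
    for score in candidates[cutoff:]:
        if score > best_seen:                         # first candidate beating the benchmark
            return score == n - 1                     # did we pick the overall best?
    return candidates[-1] == n - 1                    # forced to take the last one

hits = sum(secretary_trial() for _ in range(10_000))
print(f"picked the best candidate in ~{hits / 100:.0f}% of runs")  # ≈ 37%
```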
Photons and history books both descend by causal chains from the event itself.
Right, but history books are a much lossier source of information.
I realized that the invention and destruction of vitalism—which I had only read about in books—had actually happened to real people, who experienced it much the same way I experienced the invention and destruction of my own mysterious answer
This realization makes the Steven Pinker-esque stories about the extent of violence in the Middle Ages and earlier that much more surreal.
This chapter made me feel things.
Super cool! Did you use thought tokens in any of the reasoning models for any of this? I'm wondering how much adding thinking would increase the resolution.