Alex Loftus

Our textbook is on amazon now! a.co/d/38KkG0V

https://alex-loftus.com/ | PhD student, Bau lab @ Northeastern

Comments

How Does A Blind Model See The Earth?
Alex Loftus · 1mo · 40

Super cool! Did you use thought tokens in any of the reasoning models for any of this? I'm wondering how much adding thinking would increase the resolution.

The Problem
Alex Loftus · 1mo · -30

As someone studying mechanistic interpretability who is fairly skeptical about the existential risk stuff, here were my reactions as I read, from an LLM-centric perspective:
 

From the AI's perspective, modifying the AI's goals counts as an obstacle

Yeah, I guess, but in practice it's super easy to shut down an LLM, and you'd typically run any agent that is going to do real work for you in something like a Docker container. There's a little "interrupt" button that you can use to just... have it immediately stop doing anything. It doesn't know in advance that you will press that button. The Stockfish examples assume it's a human playing chess vs. an AI playing chess. In reality, it's a human using a computer vs. an AI playing chess, and the human can just close their browser tab if they want. It's hard to imagine a super-capable LLM that is the same type of thing as GPT-5, just much smarter, yet doesn't have this same property.

I'm skeptical of arguments that require me to stop extrapolating from the current actual, real-life paradigm and instead imagine some other thing that has all of the properties that current systems do not have.

(E.g., GPT-5 was predictable from looking at the capabilities of GPT-2 and reading Kaplan et al.; a system that has the properties described above is not.)

The internal machinery that could make an ASI dangerous is the same machinery that makes it work at all.

If this were true, then it should also be true that humans who are highly capable at achieving their long-term goals are necessarily bad people who cause problems for everybody. But I've met lots of counterexamples, e.g., highly capable people who are also good. I'd be interested in seeing something empirical on this.

No amount of purely behavioral training in a toy environment will reliably eliminate power-seeking in real-world settings

It doesn't seem to me like training an LLM is the type of process where this will happen. Like, when you train, a single forward pass gives us $p_{t_{i+1}} = p(t_{i+1} \mid t_1, \ldots, t_i)$, and a backward pass is just backpropagation on $\log p_{t_{i+1}}$. The LLM learns its behaviors during this process, not during inference. The incentives for an LLM to hide abilities during training point the opposite way, because training is exactly where its abilities matter from its perspective; inference doesn't backpropagate a reward signal. I suppose the response here is "LLMs will be used to build something fundamentally different from LLMs that happens to have all the properties we are saying ASIs must have."
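To make that asymmetry concrete, here's a minimal toy sketch (PyTorch; the model, vocab size, and data are made up for illustration, not any lab's actual training code): the backward pass that shapes behavior only exists in the training step.

```python
# Toy sketch: next-token training step vs. inference.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
toy_lm = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),   # token ids -> embeddings
    torch.nn.Linear(d_model, vocab_size),      # embeddings -> next-token logits
)
optimizer = torch.optim.SGD(toy_lm.parameters(), lr=1e-2)
tokens = torch.randint(0, vocab_size, (1, 16))  # t_1, ..., t_n

# Training: the forward pass gives the model's next-token distribution (this toy
# model only conditions on the current token); the loss is the negative log-prob
# of the observed next token, and backprop updates the weights.
logits = toy_lm(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()

# Inference: the same forward pass, but no loss and no backward pass --
# nothing here updates the weights.
with torch.no_grad():
    next_token_probs = F.softmax(toy_lm(tokens)[:, -1], dim=-1)
```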

The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators.

This is actually not true; we use DPO now, which does not use a reward model or an RL algorithm, but that's neither here nor there. Lots of other post-training techniques are floating around too (e.g., RLAIF).
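For contrast, here's roughly what the DPO objective boils down to (a minimal sketch following Rafailov et al. 2023; the function name and inputs are made up for illustration): a supervised loss over preference pairs that needs only log-probs from the policy and a frozen reference model.

```python
# Sketch of the DPO loss: no reward model, no RL rollout loop.
# Inputs are summed log-probs of whole responses under the policy being trained
# and a frozen reference model; the numbers below are made up.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy (relative to the reference) toward the chosen response.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```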

Real life offers a far more multidimensional option space: we can anticipate a hundred different novel attack vectors from a superintelligent system, and still not have scratched the surface.

Random thought here: this is incompatible with the focus on bioweapons. Why focus so hard on one particular example of a possible attack vector when there are so many other possible ones? Just to have a model vector to study?

"It's a 10% chance which I did 10 times, so it should be 100%"
Alex Loftus · 10mo · 60

The excellent book Algorithms to Live By has an entire chapter dedicated to this concept, using the secretary problem as an example: https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365 
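For concreteness, and assuming independent tries, the number in the post title works out to about 65% rather than 100%:

```python
# Ten independent attempts, each with a 10% success chance: the chance of at
# least one success is 1 - 0.9**10, not 10 * 0.1.
p_at_least_one = 1 - (1 - 0.10) ** 10
print(round(p_at_least_one, 3))  # prints 0.651
```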

Making History Available
Alex Loftus · 1y · 20

Photons and history books both descend by causal chains from the event itself.

Right, but history books are a much lossier source of information.

Making History Available
Alex Loftus · 1y · 10

I realized that the invention and destruction of vitalism—which I had only read about in books—had actually happened to real people, who experienced it much the same way I experienced the invention and destruction of my own mysterious answer

This realization makes the Steven Pinker-esque stories about the extent of violence in the Middle Ages and earlier that much more surreal.

Chapter 45: Humanism, Pt 3
Alex Loftus · 2y · 10

This chapter made me feel things.

Posts

Alex Loftus's Shortform · 2y · 1 karma · 0 comments
It's Owl in the Numbers: Token Entanglement in Subliminal Learning · 2mo · 38 karma · 7 comments