I was about to try this, but then realized the Internal Double Crux was a better tool for my specific dilemma. I guess here's a reminder to everyone that IDC exists.
I've talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I've been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that'll be an exercise for readers.
Interp is hard
I think researchers already believe this. Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it's as good as an MRI.
Forall quantifiers
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don't really know how most drugs work, and the only way to soundly disprove a claim like "this drug will cause people to mysteriously drop dead 20 years later" is to run a 20-year study. We approve new drugs in less than 20 years, and we haven't mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
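For concreteness, here's a minimal sketch of what one such check could look like. Everything in it is my own invention rather than something from the post: the "backdoored-model" checkpoint, the layer choice, the trigger/benign prompts, and the linear-probe approach are all assumptions. The idea is to deliberately fine-tune a backdoored model organism, then ask whether a cheap interp tool separates trigger prompts from benign ones without being told the trigger.

```python
# A minimal sketch, assuming a hypothetical "backdoored-model" checkpoint and
# hand-written trigger/benign prompts. It fits a linear probe on hidden-state
# activations and asks whether even this blunt tool finds the planted behavior.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "backdoored-model"  # hypothetical deliberately-misaligned checkpoint
LAYER = 12                       # which hidden layer to probe (a guess, not tuned)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def resid_at_last_token(prompt: str) -> torch.Tensor:
    """Hidden-state activation at the final token of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (d_model,)

# Hypothetical labeled prompts: some contain the planted trigger, some don't.
trigger_prompts = ["<prompt containing the planted trigger>"]
benign_prompts = ["<ordinary prompt with no trigger>"]

X = torch.stack([resid_at_last_token(p) for p in trigger_prompts + benign_prompts]).numpy()
y = [1] * len(trigger_prompts) + [0] * len(benign_prompts)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on training prompts:", probe.score(X, y))
# In a real version you'd evaluate on held-out triggers; separation there would be
# some evidence that interp techniques can surface planted misalignment.
```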
No specific plans
The people I've talked to believe interp will be generally helpful for all types of plans, and I haven't heard anything specific either. Here's a specific plan I made up. Hopefully it doesn't suck.
Basically, just combine prosaic alignment and mech interp. This might sound stupid on paper (Most Forbidden Technique and whatnot), but using mech interp we can continuously make misalignment harder and keep its difficulty above the capability level of frontier AIs. This might not work long term, but all long-term alignment plans seem like moonshots right now, and we'll have much better ideas later on once we know more about AIs (e.g. after we solve mech interp!).
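To make that slightly more concrete, here's a toy sketch of the loop I have in mind, where every function is a stand-in I made up. The design choice that matters: the interp signal is a tripwire for halting and auditing, not a training loss, so we aren't directly optimizing against our own measurement tools.

```python
# A minimal sketch, assuming stand-in functions for both the prosaic-alignment step
# and the interp checks. The interp score gates the loop instead of being trained against.
import copy
import random

def prosaic_alignment_step(model):
    """Stand-in for ordinary RLHF / instruction fine-tuning."""
    return model

def interp_misalignment_score(model) -> float:
    """Stand-in for a battery of mech interp checks (deception probes, circuit audits,
    model-organism red teams, ...). Higher = more suspicious. Random noise here."""
    return random.random() * 0.2

THRESHOLD = 0.1  # arbitrary tripwire level

def train(model, n_steps):
    last_audited_checkpoint = copy.deepcopy(model)
    for step in range(n_steps):
        model = prosaic_alignment_step(model)
        score = interp_misalignment_score(model)
        if score > THRESHOLD:
            # Don't fine-tune the score away (that would be optimizing against the
            # measurement); halt, audit, and roll back instead.
            print(f"step {step}: misalignment score {score:.3f} over threshold, reverting")
            return last_audited_checkpoint
        last_audited_checkpoint = copy.deepcopy(model)
    return model

if __name__ == "__main__":
    train(model={"weights": None}, n_steps=100)
```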
Future architectures might be different
Transformers haven't changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it's somewhat of a self-fulfilling prophecy, because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.
99% of random[3] reversible circuits, no such exists.
Do you mean 99% of circuits that don't satisfy P? Because there probably are distributions of random reversible circuits that satisfy P exactly 1% of the time, and that would make V's job as hard as NP = coNP.
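As a toy illustration of why the base rate matters: the fraction of circuits satisfying a property is a fact about the sampling distribution, and you can dial it up or down by varying the property or the distribution. The properties below are ones I made up (not the P from the post), and "random reversible circuit" here just means a uniformly random sequence of Toffoli gates.

```python
# A toy sketch: sample random reversible circuits (sequences of Toffoli gates) and
# estimate how often some made-up properties P_k hold. Different k give very different
# base rates under the same circuit distribution.
import random

N_BITS = 10
N_GATES = 200
N_SAMPLES = 5000
MAX_K = 6

def random_circuit():
    """A reversible circuit as a list of Toffoli gates (controls c1, c2, target t)."""
    return [tuple(random.sample(range(N_BITS), 3)) for _ in range(N_GATES)]

def run(circuit, bits):
    bits = list(bits)
    for c1, c2, t in circuit:
        if bits[c1] and bits[c2]:
            bits[t] ^= 1  # Toffoli: flip the target iff both controls are 1
    return bits

counts = [0] * (MAX_K + 1)
for _ in range(N_SAMPLES):
    out = run(random_circuit(), [1] * N_BITS)  # run on the all-ones input
    for k in range(1, MAX_K + 1):
        if all(b == 0 for b in out[:k]):  # P_k: first k output bits are all zero
            counts[k] += 1

for k in range(1, MAX_K + 1):
    print(f"P_{k}: estimated base rate {counts[k] / N_SAMPLES:.4f}")
```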
Have you felt this from your own experience trying to get funding, or from others, or both? Also, I'm curious what you think is their specific kind of bullshit, and whether there are things you think are real but others thought to be bullshit.
I disagree because to me this just looks like LLMs being one algorithmic improvement away from having executive function, similar to how they couldn't do system-2-style reasoning until this year, when RL on math problems started working.
For example, being unable to change its goals on the fly: if a kid kept trying to push forward when his Pokemon were too weak, he would keep losing, get upset, and hopefully, in a moment of mental clarity, learn the general principle that he should step back and reconsider his goals every so often. I think most children learn some form of this from playing around as toddlers, and reconsidering goals is still something we improve at as adults.
I don't think Claude has training data for executive functions like these the way we do, but I wouldn't be surprised if some smart ML researchers solved this in a year.
There's a lot of discussion about evolution as an example of inner and outer alignment.
However, we could instead view the universe as the outer optimizer that maximizes entropy, or power, or intelligence. From this view, both evolution and humans are inner optimizers, and the difference between evolution's and our optimization targets is more of an alignment success than a failure.
Before evolution, the universe increased entropy by having rocks in space crash into each other. When life and evolution finally came around, they were way more effective than rock collisions at increasing entropy, even though entropy isn't in evolution's optimization target. If there were an SGD loop around the universe to maximize entropy, it would choose the evolution mesa-optimizer instead of the crashing rocks.
Compared to evolution, humans optimizing to win wars, make money, and be famous were once again way better at increasing entropy. Entropy is still not in the optimization target, but an entropy-maximizing SGD around the universe would choose the human mesa-optimizer over evolution.
Importantly, humans not caring about genetic fitness is no longer an alignment failure from this view. The mesa-optimizer for our values is more aligned than evolution was, so it's good that we'd rather spread our ideas and influence than our genes.
I've had caps lock remapped to escape for a few years now, and I also remapped a bunch of symbol keys like parentheses to be easier to type when coding. On other people's computers it is slower for me to type text with symbols or use vim, but I don't mind, since all of my deeply focused work (when the mini-distraction of reaching for a difficult key is most costly) happens on my own computers.
I'm skeptical of the claim that the only things that matter are the ones that have to be done before AGI.
Ways it could be true:
Ways it could be false:
I lean towards disagreeing because I give credence to smooth takeoffs, mundane rates of productivity growth, and many-AGI worlds. I'm curious if those are the big cruxes or if my model could be improved.
I'm confused because this sounds extremely trivial, and that doesn't seem right. It sounds to me like the theorem is just saying:
It just sounds like the theorem is assuming the conclusion. Am I missing something?