I was about to try this, but then realized the Internal Double Crux was a better tool for my specific dilemma. I guess here's a reminder to everyone that IDC exists.
I've talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I've been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that'll be an exercise for readers.
Interp is hard
I think researchers already believe this. Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it's as good as an MRI.
Forall quantifiers
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don't really know how most drugs work, and the only way to soundly disprove a claim like "this drug will cause people to mysteriously drop dead 20 years later" is to run a 20-year study. We approve new drugs in less than 20 years, and we haven't mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
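For concreteness, here's a minimal sketch of what one such check could look like. Everything in it is my own invention rather than something from the post: the "backdoored-model" checkpoint, the layer choice, the trigger/benign prompts, and the linear-probe approach are all assumptions. The idea is to deliberately fine-tune a backdoored model organism, then ask whether a cheap interp tool separates trigger prompts from benign ones without being told the trigger.

```python
# A minimal sketch, assuming a hypothetical "backdoored-model" checkpoint and
# hand-written trigger/benign prompts. It fits a linear probe on hidden-state
# activations and asks whether even this blunt tool finds the planted behavior.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "backdoored-model"  # hypothetical deliberately-misaligned checkpoint
LAYER = 12                       # which hidden layer to probe (a guess, not tuned)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def resid_at_last_token(prompt: str) -> torch.Tensor:
    """Hidden-state activation at the final token of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (d_model,)

# Hypothetical labeled prompts: some contain the planted trigger, some don't.
trigger_prompts = ["<prompt containing the planted trigger>"]
benign_prompts = ["<ordinary prompt with no trigger>"]

X = torch.stack([resid_at_last_token(p) for p in trigger_prompts + benign_prompts]).numpy()
y = [1] * len(trigger_prompts) + [0] * len(benign_prompts)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on training prompts:", probe.score(X, y))
# In a real version you'd evaluate on held-out triggers; separation there would be
# some evidence that interp techniques can surface planted misalignment.
```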
No specific plans
The people I've talked to believe interp will be generally helpful for all types of plans, and I haven't heard anything specific either. Here's a specific plan I made up. Hopefully it doesn't suck.
Basically, just combine prosaic alignment and mech interp. This might sound stupid on paper (Most Forbidden Technique and whatnot), but using mech interp we can continuously make misalignment harder and keep its difficulty above the capability level of frontier AIs. This might not work long term, but all long-term alignment plans seem like moonshots right now, and we'll have much better ideas later on once we know more about AIs (e.g. after we solve mech interp!).
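To make that slightly more concrete, here's a toy sketch of the loop I have in mind, where every function is a stand-in I made up. The design choice that matters: the interp signal is a tripwire for halting and auditing, not a training loss, so we aren't directly optimizing against our own measurement tools.

```python
# A minimal sketch, assuming stand-in functions for both the prosaic-alignment step
# and the interp checks. The interp score gates the loop instead of being trained against.
import copy
import random

def prosaic_alignment_step(model):
    """Stand-in for ordinary RLHF / instruction fine-tuning."""
    return model

def interp_misalignment_score(model) -> float:
    """Stand-in for a battery of mech interp checks (deception probes, circuit audits,
    model-organism red teams, ...). Higher = more suspicious. Random noise here."""
    return random.random() * 0.2

THRESHOLD = 0.1  # arbitrary tripwire level

def train(model, n_steps):
    last_audited_checkpoint = copy.deepcopy(model)
    for step in range(n_steps):
        model = prosaic_alignment_step(model)
        score = interp_misalignment_score(model)
        if score > THRESHOLD:
            # Don't fine-tune the score away (that would be optimizing against the
            # measurement); halt, audit, and roll back instead.
            print(f"step {step}: misalignment score {score:.3f} over threshold, reverting")
            return last_audited_checkpoint
        last_audited_checkpoint = copy.deepcopy(model)
    return model

if __name__ == "__main__":
    train(model={"weights": None}, n_steps=100)
```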
Future architectures might be different
Transformers haven't changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it's somewhat of a self-fulfilling prophecy, because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.
99% of random[3] reversible circuits, no such exists.
Do you mean 99% of circuits that don't satisfy P? Because there probably are distributions of random reversible circuits that satisfy P exactly 1% of the time, and that would make V's job as hard as NP = coNP.
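As a toy illustration of why the base rate matters: the fraction of circuits satisfying a property is a fact about the sampling distribution, and you can dial it up or down by varying the property or the distribution. The properties below are ones I made up (not the P from the post), and "random reversible circuit" here just means a uniformly random sequence of Toffoli gates.

```python
# A toy sketch: sample random reversible circuits (sequences of Toffoli gates) and
# estimate how often some made-up properties P_k hold. Different k give very different
# base rates under the same circuit distribution.
import random

N_BITS = 10
N_GATES = 200
N_SAMPLES = 5000
MAX_K = 6

def random_circuit():
    """A reversible circuit as a list of Toffoli gates (controls c1, c2, target t)."""
    return [tuple(random.sample(range(N_BITS), 3)) for _ in range(N_GATES)]

def run(circuit, bits):
    bits = list(bits)
    for c1, c2, t in circuit:
        if bits[c1] and bits[c2]:
            bits[t] ^= 1  # Toffoli: flip the target iff both controls are 1
    return bits

counts = [0] * (MAX_K + 1)
for _ in range(N_SAMPLES):
    out = run(random_circuit(), [1] * N_BITS)  # run on the all-ones input
    for k in range(1, MAX_K + 1):
        if all(b == 0 for b in out[:k]):  # P_k: first k output bits are all zero
            counts[k] += 1

for k in range(1, MAX_K + 1):
    print(f"P_{k}: estimated base rate {counts[k] / N_SAMPLES:.4f}")
```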
Have you felt this from your own experience trying to get funding, or from others, or both? Also, I'm curious what you think is their specific kind of bullshit, and whether there are things you think are real but others thought to be bullshit.
I disagree because to me this just looks like LLMs being one algorithmic improvement away from having executive function, similar to how they couldn't do system-2-style reasoning until this year, when RL on math problems started working.
For example, being unable to change its goals on the fly: if a kid kept trying to push forward when his Pokemon were too weak, he would keep losing, get upset, and hopefully, in a moment of mental clarity, learn the general principle that he should step back and reconsider his goals every so often. I think most children learn some form of this from playing around as toddlers, and reconsidering goals is still something we improve at as adults.
I don't think Claude has training data for executive functions like these the way we do, but I wouldn't be surprised if some smart ML researchers solved this in a year.
There's a lot of discussion about evolution as an example of inner and outer alignment.
However, we could instead view the universe as the outer optimizer that maximizes entropy, or power, or intelligence. From this view, both evolution and humans are inner optimizers, and the difference between evolution's and our optimization targets is more of an alignment success than a failure.
Before evolution, the universe increased entropy by having rocks in space crash into each other. When life and evolution finally came around, they were way more effective than rock collisions at increasing entropy, even though entropy isn't in evolution's optimization target. If there were an SGD loop around the universe to maximize entropy, it would choose the evolution mesa-optimizer instead of the crashing rocks.
Compared to evolution, humans optimizing to win wars, make money, and be famous were once again way better at increasing entropy. Entropy is still not in the optimization target, but an entropy-maximizing SGD around the universe would choose the human mesa-optimizer over evolution.
Importantly, humans not caring about genetic fitness is no longer an alignment failure from this view. The mesa-optimizer for our values is more aligned than evolution was, so it's good that we'd rather spread our ideas and influence than our genes.
I've had caps lock remapped to escape for a few years now, and I also remapped a bunch of symbol keys like parentheses to be easier to type when coding. On other people's computers it is slower for me to type text with symbols or use vim, but I don't mind, since all of my deeply focused work (when the mini-distraction of reaching for a difficult key is most costly) happens on my own computers.
I'm skeptical of the claim that the only things that matter are the ones that have to be done before AGI.
Ways it could be true:
Ways it could be false:
I lean towards disagreeing because I give credence to smooth takeoffs, mundane rates of productivity growth, and many-AGI worlds. I'm curious if those are the big cruxes or if my model could be improved.
I'm confused because this sounds extremely trivial, and that doesn't seem right. It sounds to me like the theorem is just saying:
It just sounds like the theorem is assuming the conclusion. Am I missing something?