I'd be interested to hear in more detail why you're unconvinced.
"I think the main reasons to work on mechanistic interp do not look like 'we can literally understand all the cognition behind a powerful AI', but instead 'we can bound the behavior of the AI'"
I assume "bound the behavior" means providing a worst-case guarantee. But if we don't understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don't understand wouldn't ruin our guarantee?
"we can help other, weaker AIs understand the powerful AI"
My understanding of interpretability is that it means humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn't feel like a solution to interpretability. Instead, it feels like a solution to amplification that's still uninterpretable by humans.
What do you think are the top 3 (or top 5, or top handful of) interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include in the talk?
You also claim that GPT-like models achieve "SOTA performance in domains traditionally dominated by RL, like games." You cite the paper "Multi-Game Decision Transformers" for this claim.
But, in Multi-Game Decision Transformers, reinforcement learning (specifically, a Q-learning variant called BCQ) trained on a single Atari game beats Decision Transformer trained on many Atari games. This is shown in Figure 1 of that paper. The authors of the paper don't even claim that Decision Transformer beats RL. Instead, they write: "We are not striving for mastery or efficiency that game-specific agents can offer, as we believe we are still in early stages of this research agenda. Rather, we investigate whether the same trends observed in language and vision hold for large-scale generalist reinforcement learning agents."
It may be that Decision Transformers are on a path to matching RL, but it's important to know that this hasn't yet happened. I'm also not aware of any work establishing scaling laws in RL.
"A supreme counterexample is the Decision Transformer, which can be used to run processes which achieve SOTA for offline reinforcement learning despite being trained on random trajectories."
This is not true. The Decision Transformer paper doesn't run any complex experiments on random data; its authors only give a toy example with random data.
We actually ran experiments with Decision Transformer on random data from the D4RL offline RL suite. Specifically, we considered random data from the MuJoCo Gym tasks. We found that with access only to random data, Decision Transformer achieves only 4% of the performance it achieves with access to expert data. (See the D4RL Gym results in our Table 1, and compare "DT" on "random" to "medium-expert".)
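To make the arithmetic behind a figure like "4% of the performance" concrete, here is a minimal sketch of one way such a ratio could be computed from D4RL-style normalized scores. The normalization (0 for a random policy, 100 for an expert policy) follows the D4RL convention. The function names are hypothetical, and every number below is a made-up placeholder chosen only so the arithmetic is easy to follow; none of them are results from Table 1 or from any paper.

```python
def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 matches a random policy, 100 an expert."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def fraction_of_reference(score, reference_score):
    """Fraction of a reference score that another score attains."""
    return score / reference_score


# Made-up placeholder returns for illustration only (NOT experimental results).
random_ref, expert_ref = 0.0, 1000.0

# Hypothetical raw returns for the same method trained on two datasets.
dt_on_random = normalized_score(40.0, random_ref, expert_ref)
dt_on_medium_expert = normalized_score(1000.0, random_ref, expert_ref)

ratio = fraction_of_reference(dt_on_random, dt_on_medium_expert)
print(f"Fraction of medium-expert performance reached on random data: {ratio:.0%}")
```

The comparison is between the same method on two datasets, so the ratio isolates the effect of data quality rather than of the algorithm.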
The technology [of lethal autonomous drones], from the point of view of AI, is entirely feasible. When the Russian ambassador made the remark that these things are 20 or 30 years off in the future, I responded that, with three good grad students and possibly the help of a couple of my robotics colleagues, it will be a term project [six to eight weeks] to build a weapon that could come into the United Nations building and find the Russian ambassador and deliver a package to him.
-- Stuart Russell on a February 25, 2021 podcast with the Future of Life Institute.
It seems to me that the comments in code provide "visible thoughts" for what the programmer intends. What do you hope to learn from training language models on thought-annotated dungeons that you couldn't learn from language models that have already been trained on commented code?