Mechanistic Interpretability
Independent (Incoming PhD Student)
Oh huh - those eyes, webs and scales in Slide 43 of your work are really impressive, especially given the difficulty of extending these methods to transformers. Is there any write-up of this work?
I am a bit confused by your operationalization of "Dangerous". On one hand
I posit that interpretability work is "dangerous" when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals
is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:
This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: ...
If P makes it easier/more efficient to train powerful AI models, then P is dangerous.
Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we're using here)
This was a nice description, thanks!
However, regarding
comprehensively interpreting networks [... aims to] identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean)
I think this is an incredibly optimistic hope that needs to be challenged more.
On my model GPT-N has a mixture of a) crisp representations, b) fuzzy heuristics that are made crisp in GPT-(N+1) and c) noise and misgeneralizations. Unless we're discussing models that perfectly fit their training distribution, I expect comprehensively interpreting networks involves untangling many competing fuzzy heuristics which are all imperfectly implemented. Perhaps you expect this to be possible? However, I'm pretty skeptical this is tractable and expect the best interpretability work to not confront these completeness guarantees.
Related (I consider "mechanistic interpretability essentially solved" to be similar to your "comprehensive interpreting" goal)
I liked that you found a common thread in several different arguments.
However, I don't think that the views are all believed or all disagreed with in practice. I do think Yann LeCun would agree with all the points and Eliezer Yudkowsky would disagree with all the points (except perhaps the last point).
For example, I agree with 1 and 5, agree with the first half but not the second half of 2, disagree with 3, and have mixed feelings about 4.
Why? At a high level, I think individual researchers, large organizations and LLMs/AIs differ considerably in how much empirical feedback they need to improve.
I agree that some level of public awareness would not have been reached without accessible demos of SOTA models.
However, I don’t agree with the argument that AI capabilities should be released to increase our ability to ‘rein it in’ (I assume you are making an argument against a capabilities ‘overhang’ which has been made on LW before). This is because text-davinci-002 (and then 3) were publicly available but not accessible to the average citizen. Safety researchers knew these models existed and were doing good work on them before ChatGPT’s release. Releasing ChatGPT results in shorter timelines and hence less time for safety researchers to do good work.
To caveat this: I agree ChatGPT does help alignment research, but it doesn’t seem like researchers are doing things THAT differently based on its existence. Secondly, I am aware that OAI did not realise how large the hype and investment from ChatGPT would be, but this hype and investment is nevertheless downstream of a liberal publishing culture, which is something OAI can be blamed for.
I agree that ChatGPT was positive for AI-risk awareness. However, from my perspective, being very happy about OpenAI's impact on x-risk does not follow from this. Releasing powerful AI models does have a counterfactual effect on the awareness of risks, but it also generates a lot of counterfactual hype and funding (such as the vast current VC investment in AI), which is mostly pointed at general capabilities rather than safety and which, from my perspective, is net negative.
Given past statements, I expect all lab leaders to speak on AI risk soon. However, I bring up the FLI letter not because it is an AI risk letter, but because it is explicitly about slowing AI progress, which OAI and Anthropic have not shown that much support for.
Thanks for writing this. As far as I can tell most anger about OpenAI is because i) being a top lab and pushing SOTA in a world with imperfect coordination shortens timelines and ii) a large number of safety-focused employees left (mostly for Anthropic) and had likely signed NDAs. I want to highlight i) and ii) in a point about evaluating the sign of the impact of OpenAI and Anthropic.
Since Anthropic's competition seems to me to be exacerbating race dynamics currently (and I will note that very few OpenAI and zero Anthropic employees signed the FLI letter), Anthropic appears to be making i) worse by making coordination more difficult and intensifying race dynamics. At this point, believing Anthropic is better on net than OpenAI has to go through believing *something* about the reasons individuals had for leaving OpenAI (ii)), and that these reasons outweigh the coordination and race dynamic considerations. This is possible, but there's little public evidence for the strength of these reasons from my perspective. I'd be curious whether I've missed something here.
This occurs across different architectures and datasets (https://arxiv.org/abs/2203.16634)
[from a quick skim this video+blog post doesn't mention this]
Am I missing something or is GPT-4 able to do Length 20 Dynamic Programming using a solution it described itself very easily?
https://chat.openai.com/share/8d0d38c0-e8de-49f3-8326-6ab06884df90
We have 100k context models and several OOMs more FLOPs to throw at models. Given the evidence in the paper, I couldn't see a reason why autoregressive models would be limited in a substantial way.