Hey Adam, thanks for running Refine and writing this up.
Out of curiosity, do you (or anyone else) know if there are statistics for previous SERI-MATS cohorts/other programs designed to generate conceptual alignment researchers?
I like 1) and think this is worth doing. I believe mechanistic interpretability researchers are already somewhat concerned about insights not generalising from toy models to larger models, let alone to novel architectures, so work on model-agnostic methods could be useful within the current paradigm too.
Something to note: I'm not confident about the track record of model-agnostic methods (such as saliency maps). I've heard from at least one ML researcher that saliency maps have a poor track record and have been shown to be unreliable in a variety of experiments. Do you know of any other model-agnostic interpretability methods which you think might be very useful? Maybe saliency maps don't matter as much as the general idea of model-agnostic methods, in which case feel free to disregard this. I've also heard interest before in approaching models as black boxes ("ML psychologist") while we try to understand them, so I don't think the value of this approach rests too heavily on specific prior methods.
With respect to 2), while I think this is reasonable, I believe the salient point is whether models from the current paradigm become sufficiently dangerous fast enough that they warrant more/less focus. Theoretically, the space of possible ML architecture paradigms producing doom is large, and the order in which they manifest is roughly the order in which we should solve them (i.e. align current systems, then next-paradigm systems, then the paradigm after that, each step buying time).
However, I think there are good enough reasons to work on model-agnostic methods that don't rely on AGI doom originating in a new paradigm.
Overall, very exciting! Good luck!
Understood. Maybe if the first argument were more concrete, we could examine its predictions. For example, what fundamental limitations exist in current systems? What should a breakthrough accomplish (at least conceptually) in order to move us into the new paradigm?
I think it's reasonable that understanding the brain better may yield insights, but I'm persuaded by Paul's comment about returns on existing insights diminishing over time. Technologies like DishBrain seem exciting and might change that trend?
To the extent that one might not have predicted scientists to hold these views, I can see why this paper might cause a positive predictive update on brainlike AGI.
However, technological development is not a zero-sum game. Opportunity or enthusiasm in neuroscience doesn't in itself make prosaic AGI less likely, and I don't feel that any of the provided arguments are knockdown arguments against ANNs leading to prosaic AGI.
I don't find arguments that human-level intelligence is unprecedented outside of humans particularly convincing, in part because of the analogy to the "god of the gaps". Many predictions about what computers can't do have been falsified, sometimes in unexpected ways (e.g. arguments about AI not being able to make art). Moreover, the observation that more is different in AI, and the development of single-shot models, seem like powerful arguments about the potential of prosaic AI systems when scaled.
I think it's plausible that perpetuating "every alignment person needs to read the Sequences / Rationality A-Z" is either harmful or at least inefficient.
For example, to the extent that alignment needs more really good machine learning engineers, it's possible they might benefit less from the Sequences than a conceptual alignment researcher would.
However, relying on anecdotal evidence seems potentially unnecessary. We might be able to use polls, or otherwise systematically investigate the relationship between interest/engagement with the Sequences and various paths to contributing to alignment. A prediction market might also work for information aggregation.
I'd bet that, all else equal, engagement with the Sequences is beneficial, but that the effect might be less pronounced among those who grew up in academically inclined cultures.
Thanks Swimmer963! This was very interesting.
I have a general question for the community: does anyone know of similar descriptions of model limitations, with this many worked examples, for language models such as GPT-3?
My personal experience is that visual output is inherently more intuitive, but I'd love to explore my intuitions about language models with an equivalent article for, say, GPT-3 or PaLM.
I'd predict with high confidence that such articles exist, but finding one of sufficient quality might be trickier. I'm curious which articles commenters here would pick in particular.