Oliver's second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person's clicks.
I am a big fan of "Just don't build AGI" and academic work with AI, simply because it is better at moving slowly (and thereby safely through open discourse and not $10 mil training runs) compared to massive industry labs. I do have quite a bit of trust in Anthropic, DeepMind and OpenAI simply from their general safety considerations compared to e.g. Microsoft's release of Sydney.
As part of this EA bet on AI, it also seems like the safety view has become widespread among most AI industry researchers from my interactions with them (though might just be a sampling bias and they were honestly more interested in their equity growing in value). So if the counterfactual of today's large AGI companies would be large misaligned AGI companies, then we would be in a significantly worse position. And if AI safety is indeed relatively trivial, then we're in an amazing position to make the world a better place. I'll remain slightly pessimistic here as well, though.
There's an interesting case on the infosec mastodon instance where someone asks Sydney to devise an effective strategy to become a paperclip maximizer, and it then expresses a desire to eliminate all humans. Of course, it includes relevant policy bypass instructions. If you're curious, I suggest downloading the video to see the entire conversation, but I've also included a few screenshots below (Mastodon, third corycarson comment).
Hilarious to the degree of Manhatten scientists laughing at atmospheric combustion.
Thank you for pointing this out! It seems I wasn't informed enough about the context. I've dug a bit deeper and will update the text to:
- Another piece reveals that OpenAI contracted Sama to use Kenyan workers with less than $2 / hour wage ($0.5 / hour average in Nairobi) for toxicity annotation for ChatGPT and undisclosed graphical models, with reports of employee trauma from the explicit and graphical annotation work, union breaking, and false hiring promises. A serious issue.
For some more context, here is the Facebook whistleblower case (and ongoing court proceedings in Kenya with Facebook and Sama) and an earlier MIT Sloan report that doesn't find super strong positive effects (but is written as such, interestingly enough). We're talking pay gaps from relocation bonuses, forced night shifts, false hiring promises, supposedly human trafficking as well? Beyond textual annotation, they also seemed to work on graphical annotation.
I recommend reading Blueprint: The Evolutionary Origins of a Good Society about the science behind the 8 base human social drives where 7 are positive and the 8th is the outgroup hatred that you mention as fundamental. I have not read much up on the research on outgroup exclusion but I talked to an evolutionary cognitive psychologist who mentioned that this is receiving a lot of scientific scrutiny as a "basic drive" from evolution's side.
Axelrod's The Evolution of Cooperation also finds that collaborative strategies work well in evolutionary prisoner's dilemma game-theoretic simulations, though hard and immediate reciprocity for defection is also needed, which might lead to the outgroup hatred you mention.
An interesting solution here is radical voluntarism where an AI philosopher king runs the immersive reality where all humans are in and you can only be causally influenced upon if you want to. This means that you don't need to do value alignment, just very precise goal alignment. I was originally introduced to this idea Carado.
The summary has been updated to yours for both the public newsletter and this LW linkpost. And yes, they seem exciting. Connecting FFS to interpretability was a way to contextualize it in this case, until you would provide more thoughts on the use case (given your last paragraph in the post). Thank you for writing, always appreciate the feedback!
I think we agree here. Those both seem like updates against scaling is all you need, i.e. (in this case) "data for DL in ANNs on GPUs is all you need".
What might not be obvious from the post is that I definitely disagree with the "AGI near-impossible" as well, for the same reasons. These are the thoughts of GPU R&D engineers I talked with. However, the GPU performance increase limitation is a significant update on the ladder of assumptions towards "scaling is all you need" leading to AGI.
Thank you for this critique! They are always helpful to hone in on the truth.
So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.
I generally disagree with this. beren points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out, though!
You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct but it does not seem to generalize beyond the project itself. It seems just as much like a project management and strategy problem as so much else. Benes' comment is great for more reasoning into this and why ANNs seem significantly more tractable to study than the brain.
Additionally, you argue that interpretability and ELK won't succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:
1. Mechanistic interpretability have clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field's inception 7 years ago.
It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.
2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?).
Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.
For example work in this paradigm that seems promising, see interpretability in the wild, ROME, the superpositions exposition, the mathematical understanding of transformers and the results from the interpretability hackathon.
For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.'s work (2020).
When it comes to your characterization of the "empirical" method, this seems fine but doesn't conflict with interpretability. It seems you wish to make game theory-like understanding of the models or have them play in settings to investigate their faults? Do you want to do model distillation using circuits analyses or do you want AI to play within larger environments?
I falter to understand the specific agenda from this that isn't done by a lot of other projects already, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here but I see that for interpretability as well.
Again, thank you for the post and I always like when people cite McElreath, though I don't see his arguments apply as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.
It seems like there's a lot of negative comments about this letter. Even if it does not go through, it seems very net positive for the reason that it makes explicit an expert position against large language model development due to safety concerns. There's several major effects of this, as it enables scientists, lobbyists, politicians and journalists to refer to this petition to validate their potential work on the risks of AI, it provides a concrete action step towards limiting AGI development, and it incentivizes others to think in the same vein about concrete solutions.
I've tried to formulate a few responses to the criticisms raised:
All in good faith of course; it's a contentious issue but this letter seems generally positive to me.