SERI MATS '21, Cognitive science @ Yale '22, Meta AI Resident '23, LTFF grantee. Currently doing alignment research @ AE Studio. Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Fixed, thanks!
Note this is not equivalent to saying 'we're almost certainly going to get AGI during Trump's presidency,' but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).
One thing that seems strangely missing from this discussion is that alignment is, in fact, a VERY important CAPABILITY that makes the AI very much better. But the current discussion of alignment in the general sphere acts like 'alignment' is aligning the AI with the obviously very leftist companies that make it rather than with the user!
Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.
That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)
I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.
Whether one is an accelerationist, a Pauser, or an advocate of some nuanced middle path, everyone's prospects/goals are harmed if the discourse-landscape becomes politicized/polarized.
...
I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for.
...
Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.
I would agree in a world where the proverbial bug hasn't already made any contact with the glue trap, but this very thing has clearly already been happening for almost a year in a troubling direction. The political left has been fairly casually 'Everything-Bagel-izing' AI safety, largely by smuggling in social progressivism that has little to do with the core existential risks, and the right, as a result, is increasingly coming to view AI safety as something approximating 'woke BS stifling rapid innovation.' The bug is already a bit stuck.
The point we are trying to drive home here is precisely what you're also pointing at: avoiding an AI-induced catastrophe is obviously not a partisan goal. We are watching people in DC slowly lose sight of this critical fact. This is why we're attempting to explain here why basic AI x-risk concerns are genuinely important regardless of one's ideological leanings, i.e., genuinely important to left-leaning and right-leaning people alike. Very few people seem to have explicitly spelled out the latter case, though, which is why we thought it would be worthwhile to do so here.
Interesting—this definitely suggests that Planck's statement probably shouldn't be taken literally/at face value if it is indeed true that some paradigm shifts have historically happened faster than generational turnover. It's still possible, though, that this is measuring something slightly different from the initial 'resistance phase' that Planck was probably pointing at.
Two hesitations with the paper's analysis:
(1) by only looking at successful paradigm shifts, there might be a bit of survivorship bias at play here (we're not hearing about the cases where a paradigm shift was successfully resisted and never came to fruition).
(2) even if senior scientists in a field individually accept new theories, institutional barriers can still prevent those theories from getting adequate funding, attention, or exploration. I do think Anthony's comment below nicely captures how the institutional/sociological dynamics in science seemingly differ substantially from those in other domains (in the direction of disincentivizing 'revolutionary' exploration).
Thanks for this! Completely agree that there are Type I and II errors here and that we should be genuinely wary of both. Also agree with your conclusion that 'pulling the rope sideways' is strongly preferred to simply lowering our standards. The unconventional researcher-identification approach undertaken by the HHMI might be a good proof of concept for this kind of thing.
I think you might be taking the quotation a bit too literally—we are of course not literally advocating for the death of scientists, but rather highlighting that many of the largest historical scientific innovations were systematically rejected by their originators' contemporaries in the field.
Agree that scientists change their minds and can be convinced by sufficient evidence, especially within specific paradigms. I think the thornier problem that Kuhn and others have pointed out is that the introduction of new paradigms into a field is very challenging to evaluate for those who are already steeped in an existing paradigm, which tends to cause these people to reject, ridicule, etc., those with strong intuitions for new paradigms, even when those paradigms prove in hindsight to be more powerful or explanatory than existing ones.
Thanks for this! Consider the self-modeling loss gradient: $\nabla_W L_{SM} = 2(W - I)\,aa^\top$. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD finds local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($W - I$) and the activation covariance ($aa^\top$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block—it's predicting the full activation pattern—the interaction between the primary task loss, the activation covariance structure ($\mathbb{E}[aa^\top]$), and the need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero difference during training.
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get $L_{SM} = \|Wa - a\|^2 = \|(W - I)\,a\|^2$.
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $W - I$). However, due to the recurrent nature of this loss—where updates to the weight matrix depend on activations that are themselves being updated by the loss—the resulting dynamics are more complex in practice. Looking at the gradient $\nabla_W L_{SM} = 2(W - I)\,aa^\top$, we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
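As a sanity check on the gradient discussed above, here's a minimal NumPy sketch (illustrative only, not our actual training code; all variable names are hypothetical) that verifies the analytic gradient $2(W - I)\,aa^\top$ of the self-modeling loss $\sum_k \|Wa_k - a_k\|^2$ against finite differences, and confirms that the identity weight matrix zeroes the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8                      # activation dim, batch size (arbitrary)
A = rng.normal(size=(d, n))      # columns are activation vectors a
W = rng.normal(size=(d, d))      # self-modeling layer weights
I = np.eye(d)

def loss(W):
    # Self-modeling loss summed over the batch: sum_k ||W a_k - a_k||^2
    return np.sum((W @ A - A) ** 2)

# Analytic gradient: 2 (W - I) A A^T, where A A^T is the (unnormalized)
# activation covariance
grad_analytic = 2 * (W - I) @ A @ A.T

# Central finite-difference gradient, entry by entry
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

# The two gradients agree up to floating-point error
print(np.max(np.abs(grad_analytic - grad_numeric)))

# W = I is the global minimum with exactly zero loss...
print(loss(I))  # 0.0
# ...but the gradient at a generic W points along (W - I) A A^T,
# not directly at the identity, so SGD need not converge there.
```

The finite-difference check is exact up to rounding here because the loss is quadratic in $W$; the last comment is the point from the reply above: the gradient direction is shaped by the activation covariance, so nothing forces the trajectory to the identity solution.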
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea, and we will definitely add this to the roadmap and report back. And no, batch norm and other forms of regularization were not added.
We've spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than to the median conservative person per se (i.e., DC policymakers are not "median conservatives").