When making safety cases for alignment, it's important to remember that defense against single-turn attacks doesn't always imply defense against multi-turn attacks.
Our recent paper shows a case where breaking a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models and guardrails are vulnerable to the jailbreak.
Robustness against the single-turn version didn't imply robustness against the multi-turn version of the attack, and vice versa.
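To illustrate the kind of transformation involved (a hypothetical sketch in the chat-message format most LLM APIs use; not the paper's actual attack format), a single-turn request can be spread across turns like this:

```python
# Hypothetical sketch: spreading the parts of one request over successive
# user turns, with placeholder assistant replies in between.
single_turn = [
    {"role": "user", "content": "Step 1: ... Step 2: ... Step 3: ..."},
]

def split_into_turns(parts):
    """Spread the parts of a request over successive user turns."""
    messages = []
    for part in parts:
        messages.append({"role": "user", "content": part})
        messages.append({"role": "assistant", "content": "(model reply)"})
    return messages[:-1]  # drop the trailing placeholder reply

multi_turn = split_into_turns(["Step 1: ...", "Step 2: ...", "Step 3: ..."])
# A guardrail that screens each turn in isolation now sees three partial
# fragments instead of one complete request -- and, as the paper found,
# which systems resist the attack can flip in either direction.
```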
I expect that within a year or two, there will be an enormous surge of people who start paying a lot of attention to AI.
This could mean that the distribution of who has influence will change a lot. (And this might be right when influence matters the most?)
I claim: your effect on AI discourse post-surge will be primarily shaped by how well you or your organization absorbs this boom.
The areas where I've thought most about this phenomenon are:
(But this applies to anyone whose impact primarily comes from spreading their ideas, which is a lot of people.)
I think that you or your organization should have an explicit plan to absorb this surge.
Unresolved questions:
I'd be curious to see how this looked with Covid: did all the Covid experts get an even 10x multiplier in following? Or were a handful of Covid experts highly elevated, while the rest didn't really see much of an increase in followers? If the latter, what did those elevated experts do to get everyone to pay attention to them?
Can anyone think of alignment-pilled conservative influencers besides Geoffrey Miller? Seems like we could use more people like that...
Maybe we could get alignment-pilled conservatives to start pitching stories to conservative publications?
Turning up the key repeat rate on my computer has been really helpful. I highly recommend going to System Preferences > Keyboard > Key repeat rate and turning it way up!
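For those who prefer the command line, macOS exposes the same knobs via `defaults` (a sketch; the values shown are illustrative, and lower numbers mean faster):

```shell
# KeyRepeat is the interval between repeats; InitialKeyRepeat is the delay
# before repeating starts. The UI slider bottoms out at KeyRepeat 2, but
# defaults will accept lower values.
defaults write -g KeyRepeat -int 2
defaults write -g InitialKeyRepeat -int 15
# Log out and back in (or restart apps) for the change to take effect.
```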
...why? To be clear, you mean holding down a key and having it repeat the character? Why are you using that very often?
Without this, using the mouse is faster at moving the text cursor than using the arrow keys. Editing code feels either frustratingly sensitive or like dragging a finger through honey.
You can also hold Ctrl + arrow keys to move a word at a time instead of a character at a time, and of course you can combine this with what's suggested here.
Hmm, do you use something like vim or emacs or an editor that allows similar keybinds?
(I suspect I just have a good enough repeat rate)
What happens if all of the local datacenter fights across America become way more successful? This functionally seems similar to a data center moratorium, and might actually be easier.
After meeting with a few of these groups, my impression is that the vast majority of American AI datacenter fights are operating with basically zero financial help, and remarkably little legal support. I’ve seen multiple campaigns run by people who basically struggled to raise enough money to even print signs and somehow ended up winning or significantly delaying the project. On aggregate, these fights manage to be very successful with hardly any resources.
In the extreme case, what if you just give a $100,000 grant to every single ongoing AI data center fight in America (source: https://datacentertracker.org/) to get them all equipped with great legal and advocacy help? This would cost around $23 million. (One could imagine weighting each grant by the datacenter's projected energy usage.)
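The back-of-envelope arithmetic, with the fight count inferred from the ~$23M total and the energy weights as made-up illustrative numbers:

```python
# Flat grants: $100k per ongoing fight.
GRANT = 100_000
N_FIGHTS = 230           # inferred from the ~$23M total above

total = GRANT * N_FIGHTS
print(total)             # -> 23000000

# Energy-weighted variant (hypothetical MW figures): hold a fixed budget
# and allocate in proportion to each project's projected draw.
projected_mw = [50, 200, 750]   # three illustrative projects
budget = 1_000_000
weighted = [budget * mw / sum(projected_mw) for mw in projected_mw]
print([round(g) for g in weighted])
```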
To put more emphasis on this point: I think a single medium-sized donor could significantly change the rate of AI data center development in America.
It seems the safety community generally supports Bernie's proposed AI data center moratorium. I think supporting grassroots data center fights is a less robust version, but it seems to capture a substantial fraction of the value while being surprisingly cost-effective. But maybe people just don't think it's net positive to slow down development by supporting these communities? If so, I'm super curious to hear why.
Should it be more taboo to put the bottom line in the title?
Titles like "in defense of <bottom line>" or just "<bottom line>" seem to:
I think putting the conclusion in the title is good insofar as it's a form of anti-clickbait: it's the most informative title possible. Yes, people may be motivated to read it in order to confirm their pre-existing opinion, or to search for counterarguments, but the alternative is often that they don't read the article at all, for lack of motivation.
People who are motivated to comment out of disagreement with the title are, more or less, forced to read the actual post in order to compose their rebuttal. That's better than not receiving any engagement from them at all. And perhaps the post even changes their mind, or they agree with the title but find the arguments in the post too weak.
Overall, having the conclusion in the title seems good for similar reasons a summary in the beginning is good.
Though one reason to keep the bottom line out of the title is when it's a generally unpopular opinion: many people will reflexively downvote the post without reading it, causing it to be seen by fewer readers.
Seems like you can get pretty far by just having the current Opus 4.6 Claude Code run for a week. The only problem is that this is prohibitively expensive.
My impression is that running something like DeepSeek for a week straight doesn't really get you much?
If inference costs per model are declining somewhere between 3x and 10x+ per year, this alone will make it economical quite soon. What projects do you have up your sleeve for when this is viable?
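The cost-decline arithmetic can be sketched as follows (the 3x and 10x rates are from the comment; the $5,000 run cost and $100 affordability threshold are made-up placeholders):

```python
# How long until a week-long agent run becomes cheap, if per-token inference
# costs decline by a constant factor each year.
def years_until_affordable(cost_now, target, decline_per_year):
    years = 0
    cost = cost_now
    while cost > target:
        cost /= decline_per_year
        years += 1
    return years

# Placeholder numbers: a $5,000 week-long run, "affordable" at $100.
for rate in (3, 10):
    print(rate, years_until_affordable(5_000, 100, rate))
# At a 3x/year decline this takes 4 years; at 10x/year, 2 years.
```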
My personal pet project I want to try this method on is preventing all of us from dying from misaligned AGI. ;) I want to try next-gen systems for deconfusion and conceptual clarification in the relevant domains.
I think even with scaffolding for more careful reasoning, Opus 4.6 probably isn't quite smart or truth-seeking enough to do this as well as a smart human. But I'm not sure. I think it can be made smarter by instructing Claude Code (or Codex) to use a reasoning process more like the one a human would use when doing a long-term research project to clarify concepts in a complex domain. This is one way in which human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. I doubt this will be enough on its own, but in combination with next-generation systems with somewhat better metacognition from training, it might help.
My goal would be to have a pretty straightforward set of prompts that's obviously truth-seeking, so that if anyone runs it, even with prompting for assumptions hostile to AGI x-risk concerns, the system comes back with "based on the conceptual uncertainties, humans should try to slow down AI progress and work harder on alignment if at all possible".
The other target would be conceptual clarifications on exactly how much and what sorts of alignment we're likely to need to survive.
Of course this path includes the risk of The Median Doom-Path: Slop, not Scheming, as Wentworth puts it: we use AI for conceptual alignment research and it helps confuse us. But this seems inevitable, so having independent researchers trying to make this go better seems like a good idea.
The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don't get properly taught in school), is something I'd highly recommend reading! As far as I can remember, you should be able to just pick it up right here and follow the important parts of the lecture without understanding the story.