Wiki Contributions


This is a great draft and you have collated many core ideas. Thank you for doing this!

As a matter of practical implementation, I think it's a good idea to always have a draft of official, approved statements of capabilities that can be rehearsed by any individual who may find themselves in a situation where they need to discuss them. These statements can be thoroughly vetted for second- and higher-order information leakage ahead of time, instead of trying to evaluate in real-time what their statements might reveal. It can be counterproductive in many circumstances to only be able to say "I can't talk about that". It also gives people a framework to practice this skill ahead of time in a lower-stakes environment, and the more people who are already read in at a classification level have a chance to vet the statement, the better the chance of catching issues.

The downside of formalizing this process is that you end up with a repository of highly sensitive information, but it seems obvious that you want to practice with weapons and keep them in a safe, rather than just let everyone run around throwing punches with no training.

I'm glad you found it useful, even in this form. If the thing you're working on is something you could share, I'd be happy to offer further assistance, if you like.


Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:

  • Israeli airport security focuses on behavioral cues, asking unpredictable questions, and profiling. A somewhat extreme threat model there, with much different base rates to account for (but also much lower traffic volume).
  • Reinforced cockpit doors address the hijackers with guns and knives scenarios, but are a fully general kind of a no-brainer control.
  • Good policework and better coordination in law enforcement are commonly cited, e.g. in the context of 9/11 hijackings, before anyone even gets to an airport.

In general, if the airlines had responsibility for security you would see a very different set of controls than what you get today, where it is an externality run by an organization with very strong "don't do anything you can get blamed for" political incentives. In an ideal world, you could get an airline catering to paranoiacs who wanted themselves and their fellow passengers to undergo extreme screening, one for people who have done the math, and then most airlines in the middle would phase into nominal gate screening procedures that didn't make them look to their customers that they didn't care (which largely the math says that they shouldn't).

A thought experiment: why is there no equivalent bus/train station security to what we have at airports? And what are the outcomes there?

I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.

I would like to hear from others about this topic, though; I'm very wary of being at fault for accelerating the doom of humanity.

My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:


You'll need to mouse over the elements to see the details, so not really mobile friendly, sorry.

I agree that linters are a weak form of automatic verification that are actually quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs and a little out of clever pattern matching.

I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.

While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets 'solved' before software security. In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Software security is important, but I don't think it's on the critical path to survival unless somehow it is a key defense against takeoff. Certainly many imagined takeoff scenarios are made easier if an AI can exploit available computing, but I think the ability to exploit physics would grant more than enough escape potential.

The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure".

It would be nice to able to have this important impossible thing. :)

I think we are trying to say the same thing, though. Do you agree with this more concise assertion?

"It's not possible to make a high confidence checker system that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important qualities of alignment that we want, if such qualities can also be formally defined."

I would agree that some people figured this out faster than others, but the analogy is also instructional here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.

To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context concisely enough to be able to have confidence that all such edge cases could be covered.

It might be enough to simply recognize an area of alignment that has dragons and let the experts safely explore the nature and contours of these dragons, but you probably couldn't build a useful web application that doesn't display user-influencable input. I think trying to get the industry to halt on building even obvious dragon-infested things is part of what has gotten Eliezer so burned out and pessimistic.

I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:

  • Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
  • There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Some of them not so much so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.

Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need away to attract skilled breakers to alignment problems.

I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between a bounty leaderboard with public recognition and a large pipeline of value that can be harvested by a community of good breakers, and what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.

It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. BugCrowd) to offer a new category of bounty. Traditionally, these services have focused on traditional "pen-test" type bounties, doing runtime testing of existing live applications. But I've long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.


Many! Thanks for sharing. This could easily turn into its own post.

In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see what works best.

Prior to reading your post, my general thoughts about how these kind of adversarial exercises relate to the alignment world were these:

  • The industry thought leaders usually have experience as both builders and breakers; some insights are hard to gain from just one side of the battlefield. That said, the industry benefits from folks who spend the time becoming highly specialized in one role or the other, and the breaker role should be valued at least equally, if not more than the builder. (In the case of alignment, breakers may be the only source of failure data we can safely get.)
  • The most valuable tabletop exercises that I was a part of spent at least as much time analyzing the learnings as the exercise itself; almost everyone involved will have unique insights that aren't noticed by others. (Perhaps this points to the idea of having multiple 'judges' in an alignment tournament.)
  • Non-experts often have insights or perspectives that are surprising to security professionals; I've been able to improve an incident response process based on participation from other teams (HR, legal, etc.) almost every time I've run a tabletop. This is probably less true for an alignment war game, because the background knowledge required to even understand most alignment topics is so vast and specialized.
  • Unknown unknowns are a hard problem. While I think we are a long way away from having builder ideas that aren't easily broken, it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders.
  • Most tabletop exercises are focused on realtime response to threats. Builder/breaker war games like the DEFCON CTF are also realtime. It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.
Load More