Background

I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.

Alignment Won't Happen By Accident

I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain failure mode is going to exhibit that failure mode.

This might seem rather obvious when stated explicitly, but this is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels when I see anyone arguing that AGI won't necessarily be dangerous. An AGI that isn't intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate even every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.

As a practical enforcement method, I used to ask development teams that every user story have at least three abuser stories to go with it. For any new capability, think at least hard enough about it that you can imagine at least three ways that someone could misuse it. Sometimes this means looking at boundary conditions ("what if someone orders 2^64+1 items?"), sometimes it means looking at forms of invalid input ("what if someone tries to pay -$100, can they get a refund?"), and sometimes it means being aware of particular forms of attack ("what if someone puts Javascript in their order details?").

I found it difficult to cultivate security mindset in most software engineers, but as long as we could develop one or two security "champions" in any given team, our chances of success improved greatly. To succeed at alignment, we will not only have to get very good at exploring classes of failures, we will need champions who can dream up entirely new classes of failures to investigate, and to cultivate this mindset within as many machine learning research teams as possible.

Blacklists Are Useless, But Make Them Anyway

I did a series of annual penetration tests for a particular organization. Every year I had to report to them the same form/parameter XSS vulnerability because they kept playing whac-a-mole with my attack payloads. Instead of actually solving the problem (applying the correct context-sensitive output encoding), they were creating filter regexes with ever-increasing complexity to try address the latest attack signature that I had reported. This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals. This approach can generally only look backwards at the past. An intelligent adversary just finds the next attack that's not on your blacklist.

That said, it took the software industry a long time to learn all the ways to NOT solve XSS before people really understood what a correct fix looked like. It often takes many many examples in the reference class before a clear fundamental solution can be seen. Alignment research will likely not have the benefit of seeing multiple real-world examples within any class of failure modes, and so AGI research will require special diligence in not just stopping at the first three examples that can be imagined and calling it a day.

While it may be dangerous to create an ever-expanding list of the ways that an optimization process might kill us all, on balance it is probably necessary to continue documenting examples well beyond what seems like the point of zero marginal returns; one never knows which incremental XSS payload will finally grant the insight to stop thinking about input validation and start focusing on output encoding. The work of creating the blacklist may lead to the actual breakthrough. Sometimes you have to wade through a swamp of known bad behavior to identify and single out the positive good behavior you want to enshrine. This will be especially difficult because we have to imagine theoretical swamps; the first real one we encounter has a high chance of killing us.

You Get What You Pay For

The single most reliable predictor of software security defect rates in an organization I found to be the level of leadership support (read: incentives) for security initiatives. AGIs are going to be made by organizations of humans. Whether or not the winning team's mission explicitly calls for strong alignment will probably be the strongest determinant for the outcome for humanity.

This is not as simple as the CEO declaring support for alignment at monthly all-hands meetings, although this certainly helps grease the skids for those who are trying to push a safety agenda. Every incentive structure must explicitly place alignment at the apex of prioritization and rewards. One company I worked for threatened termination for any developer who closed a security defect without fixing it. Leadership bonuses for low defect rates and short bug remediation timelines also helped. A responsible organization should be looking for any and all opportunities to eliminate any perverse incentives, whether they be social, financial, or otherwise, and reward people accordingly for finding incentive problems.

The bug bounty/zero-day market is likely a strong model to follow for AGI safety issues, especially to expose risk in private organizations that might not be otherwise forthcoming about their AGI research projects. A market could easily be created to reward whistleblowers or incentivize otherwise unwilling parties to disclose risky behavior. Bounties could be awarded for organizations to share their AGI alignment roadmaps, key learnings, or override the default incentive models that will not produce good outcomes. These might or might not be formed with nation-state-level backing and budgets, but as the next marginal AI safety charity dollar gets harder to employ, bounty programs might be a good way to surface metrics about the industry while positively influencing the direction of diverse research groups that are not being appropriately cautious.

Bug bounties also scale gracefully. Small bounties can be offered at first to pluck all of the low-hanging fruit. As attention and awareness grow, so can bounties as they become funded by a broader and broader spectrum of individuals or even state actors. Someone might not be willing to sell out their rogue AI operation for $10,000, but I could easily imagine a billion dollar bounty at some point for information that could save the world. XPRIZEs for incremental alignment results are also an obvious move. A large bounty provides some assurance about its underlying target: if no one claims a billion dollar bounty about rogue AI research, that is some evidence that it's no longer happening.

Assurance Requires Formal Proofs, Which Are Provably Impossible

I've witnessed a few organizations experiment with Design-driven Development. Once we were even able to enshrine "Secure by Design" as a core principle. In reality, software development teams can sometimes achieve 95% code coverage with their test cases and can rarely correlate their barrage of run-time tests to their static analysis suites. This is not what it takes for formal assurance of software reliability. And yet even achieving this level of testing requires heroic efforts.

The Halting Problem puts a certain standard of formalism outside our reach, but it doesn't absolve us of the responsibility of attaining the strongest forms of assurance we possibly can under the circumstances. There are forms of complexity we can learn to avoid entirely (complexity is the enemy of security). Complex systems can be compartmentalized to minimize trust boundaries and attack surface, and can be reasoned about independently.

The promise of Test-Driven Design is that by forcing yourself to write the tests first, you constrain the space of the design to only that which can actually be tested. Multiple software security industries arose that tried to solve the problem of automating testing of arbitrary applications that were already built. (Spoiler: the results were not great.) To my knowledge, no one tried writing a security test suite that was designed to force developers to conform their applications to the tests. If this was easy, there would have been a market for it.

After failing to "solve" software security problems after a decade, I spent many years thinking about how to eliminate classes of vulnerabilities for good. I could find scalable solutions for only roughly 80% of web application vulnerabilities, where scalable was some form of "make it impossible for a developer to introduce this class of vulnerability". Often the result was something like: "buffer overflow vulnerabilities disappeared from web apps because we stopped allowing access to memory management APIs". Limiting API capabilities is not an approach that's likely to be compatible with most research agendas.

Alignment will require formalisms to address problems that haven't even been thought of yet, let alone specified clearly enough that a test suite can be developed for them. A sane world would solve this problem first before creating machine intelligence code. A sane world would at least recognize that solving software security problems is probably several magnitudes of difficulty easier than solving alignment problems, and we haven't even succeeded at the former yet.

A Breach IS an Existential Risk

I was lucky to work for a few organizations that actually treated the threat of a security breach like it was an existential risk. Privately, though, I had to reconcile this attitude with the reality: "TJX is up". [1] While it was good for my career and agenda those rare times when leadership did take security seriously, the reality is that the current regulatory and social environment hardly punishes security failures proportionally to the damage they can cause. In a disturbing number of situations, a security breach is simply an externality, the costs of which actually borne by consumers or the victims of software exploits. Some have proposed the idea of software liability, but I'm even less optimistic that that model can be applied to AGI research.

Even if an organization believes that its security posture means the difference between existing and dying, profit margins will carry this same message much more clearly. Revenue solves all problems, but the inverse is also true: without revenue there isn't the luxury of investing in security measures.

While I am ambivalent about the existential nature of information security risk, I am unequivocal about AGI alignment risk. There will not be a stock symbol to watch to see whether AGI research groups are getting this right. There will be no fire alarm. The externality of unaligned machine learning research will simply be that we will all be dead, or suffering some unrecoverable fate worse than death. I also cannot tell you exactly the chain of events that will lead to this outcome, I have only my intuition gained by 20 years of seeing how software mistakes are made and how difficult it is to secure even simple web applications. Probably no more than a few non-trivial pieces of software could survive a million-dollar bug bounty program for a year without having to make a payout. In the relatively quaint world of software security, with decades of accumulated experience and knowledge of dozens of bug reference classes, we still cannot produce secure software. There should be no default expectation of anything less than total annihilation from an unaligned superoptimization process.

 

 

 

  1. ^

    The TJX corporation experienced one of the largest data breaches in history, accompanied by millions of dollars in fines; however, their stock price quickly recovered as the world forgot about the incident.

290

34 comments, sorted by Click to highlight new comments since: Today at 2:32 AM
New Comment

Just putting in my vote for doing both broader and deeper explorations of these topics!

Assurance Requires Formal Proofs, Which Are Provably Impossible

The Halting Problem puts a certain standard of formalism outside our reach

This is really not true. The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure". It is completely possible to write an analyzer that can say "this is secure" for some inputs, "this is definitely insecure for reason X" for some other inputs, and "I am uncertain about your input so please go improve it" for everything in between. In particular, it is completely possible to have a machine-checkable proof system going along with executable code that can express proofs of extremely strong security properties for almost every program you might wish to run in practice, which can then judge "I can confirm this is secure" or "I can not confirm that this is secure which may or may not indicate an actual problem so go fix it".

Pulling this off in practice is still fiendishly difficult, of course, and progress in this field has been frustratingly slow. But there is no theoretical reason to suspect that this is fundamentally out of reach. (Or at least, not in the halting problem; Löb's theorem does provide some real limitations here that are particularly relevant for AI correctness proofs. But that is fairly niche relative to the broader notion of software correctness and security proofs.)

The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure".

It would be nice to able to have this important impossible thing. :)

I think we are trying to say the same thing, though. Do you agree with this more concise assertion?

"It's not possible to make a high confidence checker system that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important qualities of alignment that we want, if such qualities can also be formally defined."

Yes, I agree with this.

I cannot judge to what degree I agree with your strategic assessment of this technique, though. I interpreted your top-level post as judging that assurances based on formal proofs are realistically out of reach as a practical approach; whereas my own assessment is that making proven-correct [and therefore proven-secure] software a practical reality is a considerably less impossible problem than many other aspects of AI alignment, and indeed one I anticipate to actually happen in a timeline in which aligned AI materializes.

I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.

While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets 'solved' before software security. In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Software security is important, but I don't think it's on the critical path to survival unless somehow it is a key defense against takeoff. Certainly many imagined takeoff scenarios are made easier if an AI can exploit available computing, but I think the ability to exploit physics would grant more than enough escape potential.

I would say that some formal proofs are actually impossible

Plausible. In the aftermath of spectre and meltdown I spent a fair amount of time thinking on how you could formally prove a piece of software to be free of information-leaking side channels, even assuming that the same thing holds for all dependent components such as underlying processors and operating systems and the like, and got mostly nowhere.

In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Does that include those working on software correctness and reliability in general, without a security focus? I would expect better tools for making software that is free of bugs, such as programs that include correctness proofs as well as some of the lesser formal methods, to be on the critical path to survival -- for the simple reason that any number of mundane programming mistakes in a supposedly-aligned AI could easily kill us all. I was under the impression that you agree with this ["Assurance Requires Formal Proofs"]. I expect formal proofs of security in particular to be largely a corollary of this -- a C program that is proven to correctly accomplish any particular goal will necessarily not have any buffer overflows in it, for this would invoke undefined behavior which would make your proof not go through. This does not necessarily apply to all security properties, but I would expect it to apply to most of them.

entirely ending large classes of zero days hasn't happened, but it's still permitted by physics that we figure it out. how do you end all zero days in an arbitrary block of matter? seems like you'd need a learning system that allows checking if the learned model is able to satisfy a statement about the matter's behavior. imo we shouldn't yet give up on formally verifying margins on the behavior of complex systems; "as long as it stays within this region, I've checked it can't break" statements are very useful, even if we can't know if we missed a statement we'd like to make about what the system allows. some refs that lead me to think this is not a hopeless direction:

  • rustlang demonstrates that much larger categories of vulnerability can be made unrepresentable than previously assumed.
  • making erroneous states unrepresentable is an important factor in how normalization works in a neural network anyway; eg, the s4 sequence model starts by using formal math to derive a margin on reconstructing the recent past using a polynomial. this sort of "don't forget" constraint seems to me likely to be a critical constraint to avoid loss of valuable complexity. there are a few talks and blog posts about s4, but the one I recommend is https://www.youtube.com/watch?v=luCBXCErkCs
  • formal verification of neural networks has been making some progress, see eg semantic scholar: stuff citing reluplex
  • category theory applied to complex systems does not yet appear definitely doomed to failure. the discussions in the IPAM-UCLA Collective Intelligence workshop had, among other very good talks, a talk about category theory of open systems by john baez (youtube).
  • related to these techniques, some success has been had in turning physical simulations into formal statements about a margin-to-nearest-error without the help of anything neural. eg the Food For Thought talk "multisymplectic integrators for hamiltonian PDEs" (youtube, abstract)
  • capability-based security has had some success at closing whole classes of vulnerability. because complexity makes verification hard, isolating the boundaries between an internally-unverifiable complex system and an external guarantee the system needs to provide seems promising, doubly so if we can define some sort of coherent boundary statements between systems
  • work on extracting physical laws from neural networks using symbolic regression provides hope that seeking representations that allow generalizing to formal theories of macroscopic behavior is still a promising direction. Steve Brunton and folks have a number of relevant discussions and citationwalking near their papers would likely find even more promising components, but this is one of the most interesting I've seen from his channel: youtube - blogpost - paper

that last one is probably a good summary of yudkowsky's core fears, btw. I suspect that the real dynamics of complex systems are often quite difficult to simplify and that simulating, eg, a brain, requires a similar wattage to running a physical brain, even using the most optimized algorithms. however, I'm not sure of that. maybe you can have a pretty high accuracy with only symbolically distilled statements.

My hope is that there's some sort of statement we can make about a system's memory that includes at least a guarantee that no organism simulated in the system can die during the simulation. I expect we can find a statement about information loss that can at least guarantee that, at multiple scales of a complex dynamical system, low-entropy complexity from a previous step takes as long as possible to diffuse into high entropy. I don't know how to verify this, and I'm not high enough level to figure it out on my own, but links like the ones above make me suspect that there's something to distill here that will allow recognizing unwanted interference with only a simple objective.

I definitely wouldn't rule out the possibility of being able to formally define a set of tests that would satisfy our demands for alignment. The most I could say with certainty is that it's a lot harder than eliminating software security bug classes. But I also wouldn't rule out the possibility that an optimizing process of arbitrarily strong capability simply could not be aligned, at least to a level of assurance that a human could comprehend.

Thank you for these additional references; I was trying to anchor this article with some very high-level concepts. I very much expect that to succeed we're going to have to invent and test hundreds of formalisms to be able to achieve any kind of confidence about the alignment of a system.

Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?

Many! Thanks for sharing. This could easily turn into its own post.

In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see what works best.

Prior to reading your post, my general thoughts about how these kind of adversarial exercises relate to the alignment world were these:

  • The industry thought leaders usually have experience as both builders and breakers; some insights are hard to gain from just one side of the battlefield. That said, the industry benefits from folks who spend the time becoming highly specialized in one role or the other, and the breaker role should be valued at least equally, if not more than the builder. (In the case of alignment, breakers may be the only source of failure data we can safely get.)
  • The most valuable tabletop exercises that I was a part of spent at least as much time analyzing the learnings as the exercise itself; almost everyone involved will have unique insights that aren't noticed by others. (Perhaps this points to the idea of having multiple 'judges' in an alignment tournament.)
  • Non-experts often have insights or perspectives that are surprising to security professionals; I've been able to improve an incident response process based on participation from other teams (HR, legal, etc.) almost every time I've run a tabletop. This is probably less true for an alignment war game, because the background knowledge required to even understand most alignment topics is so vast and specialized.
  • Unknown unknowns are a hard problem. While I think we are a long way away from having builder ideas that aren't easily broken, it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders.
  • Most tabletop exercises are focused on realtime response to threats. Builder/breaker war games like the DEFCON CTF are also realtime. It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

Thanks for the reply!

As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)

I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd go about solving some aspect or another of the alignment problem, hoping that Cunningham's Law would trigger someone to point out a flaw in my approach. (In some cases I'd already have a flaw in mind along with a way to address it, but I figured it'd be more motivating to wait until someone mentioned a particular flaw in the simple version of the proposal before I mentioned the fix for it.)

Anyway, it seemed like people often didn't take the bait. (Thanks to everyone who did!) Even with offering $1000 to change my view, as I'm doing in my LW user profile now, I've had 0 takers. I stopped posting on LW/AF nearly as much partially because it has seemed more efficient to try to shoot holes in my ideas myself. On priors, I wouldn't have expected this to be true -- I'd expect that someone else is going to be better at finding flaws in my ideas than I am myself, because they'll have a different way of looking at things which could address my blind spots.

Lately I've developed a theory for what's going on. You might be familiar with the idea that humans are often subconsciously motivated by the need to acquire & defend social status. My theory is that there's an asymmetry in the motivations for alignment building & breaking work. The builder has an obvious status motive: If you become the person who "solved AI alignment", that'll be really good for your social status. That causes builders to have status-motivated blindspots around weak points in their ideas. However, the breaker doesn't have an obvious status motive. In fact, if you go around shooting down peoples' ideas, that's liable to annoy them, which may hurt your social status. And since most proposals are allegedly easily broken anyways, you aren't signaling any kind of special talent by shooting them down. Hence the "breaker" role ends up being undervalued/disincentivized. Especially doing anything beyond just saying "that won't work" -- finding a breaker who will describe a failure in detail instead of just vaguely gesturing seems really hard. (I don't always find such handwaving persuasive.)

I think this might be why Eliezer feels so overworked. He's staked a lot of reputation on the idea that AI alignment is a super hard problem. That gives him a unique status motive to play the red team role, which is why he's had a hard time replacing himself. I think maybe he's tried to compensate for this by making it low status to make a bad proposal, in order to browbeat people into self-critiquing their proposals. But this has a downside of discouraging the sharing of proposals in general, since it's hard to predict how others will receive your ideas. And punishments tend to be bad for creativity.

So yeah, I don't know if the tournament idea would have the immediate effect of generating deep insights. But it might motivate people to share their ideas, or generate better feedback loops, or better align overall status motives in the field, or generate a "useless" blacklist which leads to a deep insight, or filter through a large number of proposals to find the strongest ones. If tournaments were run on a quarterly basis, people could learn lessons, generate some deep ideas from those lessons, and spend a lot of time preparing for the next tournament.

A few other thoughts...

it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders

Perhaps we could mitigate this by allowing breakers to just characterize how something might fail in vague terms -- obviously not as good as a specific description, but still provides some signal to iterate on.

It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

I think something like a realtime Slack discussion could be pretty engaging. I think there is room for both high-deliberation and low-deliberation formats. [EDIT: You could also have a format in between, where the blue team gets little time, and the red team gets lots of time, to try to simulate the difference in intelligence between an AGI and its human operators.] Also, I'd expect even a slow, high-deliberation tournament format to be more engaging than the way alignment research often gets done (spend a bunch of time thinking on your own, write a post, observe post score, hopefully get a few good comments, discussion dies out as post gets old).

I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:

  • Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
  • There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Some of them not so much so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.

Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need away to attract skilled breakers to alignment problems.

I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between a bounty leaderboard with public recognition and a large pipeline of value that can be harvested by a community of good breakers, and what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.

It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. BugCrowd) to offer a new category of bounty. Traditionally, these services have focused on traditional "pen-test" type bounties, doing runtime testing of existing live applications. But I've long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.

Fair enough.

My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy.

Yeah personally building feels more natural to me.

I agree a leaderboard would be great. I think it'd be cool to have a leaderboard for proposals as well -- "this proposal has been unbroken for X days" seems like really valuable information that's not currently being collected.

I don't think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.

In the absence of coordination, I think if someone like you was to simply start advertising themselves as an "uberbreaker" who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a "pre-Eliezer" person who I can run my ideas by in a lower stakes context, as opposed to saying "Hey Eliezer, I solved alignment -- wallop me if I'm wrong!"

I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.

I would like to hear from others about this topic, though; I'm very wary of being at fault for accelerating the doom of humanity.

Interesting comment. I feel like I recently have experienced this phenomena myself (that it's hard to find people who can play "red team").

Do you have any "blue team" ideas for alignment where you in particular would want someone to play "red team"?

I would be interested in having someone play "red team" here, but if someone were to do so in a non-trivial manner then it would probably be best to wait at least until I've completed Part 3 (which will take at least weeks, partly since I'm busy with my main job): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction

Could potentially be up for playing red team against you, in exchange for you playing red team against me (but if I think I could have something to contribute as red team would depend on specifics of what is proposed/discussed - e.g., I'm not familiar with technical specifics of deep learning beyond vague descriptions).

I wrote a comment on your post with feedback.

I don't have anything prepared for red teaming at the moment -- I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)

Thanks for the feedback!

And yes, do feel free to send me drafts in the future if you want me to look over them. I don't give guaranties regarding amount or speed of feedback, but it would be my intention to try to be helpful :)

I wasn’t aware you were offering a bounty! I rarely check people’s profile pages unless I need to contact them privately, so it might be worth mentioning this at the beginning or end of posts where it might be relevant.

Fair point. I also haven't done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren't many archive readers.

That said, it took the software industry a long time to learn all the ways to NOT solve XSS before people really understood what a correct fix looked like. It often takes many many examples in the reference class before a clear fundamental solution can be seen.

This is true about the average software developer, but unlike in AI alignment, the correct fix was at least known to a few people from the beginning.

I would agree that some people figured this out faster than others, but the analogy is also instructional here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.

To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context concisely enough to be able to have confidence that all such edge cases could be covered.

It might be enough to simply recognize an area of alignment that has dragons and let the experts safely explore the nature and contours of these dragons, but you probably couldn't build a useful web application that doesn't display user-influencable input. I think trying to get the industry to halt on building even obvious dragon-infested things is part of what has gotten Eliezer so burned out and pessimistic.

Curated. Lessons from- and the mindset- of computer security have long been invoked in the context of AI Alignment, and I love seeing this write-up from a veteran of the industry. What this gave that I didn't already have was, not just the nature of the technical challenges, but some sense of how people have responded to security challenges in the past and have the development of past solutions has proceeded. This does feel quite relevant to predicting what by default will happen in AGI development.

Really well-written post.

One thing that seems under-discussed to me are methods we might use to get help from a superintelligent AGI to assist in creating systems for which we have more assurances that they are aligned (as a whole). And one reason for me thinking that it's under-discussed is that even if we think we have succeeded with alignment, we should look for how we can use a superintelligence to verify that this is the case and add extra layers of assurance (finding the least risky methods for doing this first, and going about it in a stepwise and iterative manner).

I think that if such plans are laid out in more detail beforehand (before some team develops AGI/superintelligence I mean), and people try minimizing the degree to which such plans are "handwavy", this may help make teams more apt to make use of techniques/procedures/strategies that can be helpful (compared to if they are improvising).

Have started writing about this here if you are interested (but part 2 and 3 will probably be more substantial than part 1): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction

Though it may well be (not committing in either direction, but seems plausible) that to even get to the stage where you give a superintelligent AI questions/requests (without it beforehand hacking itself onto the internet, or that sort of thing), people would need to exhibit more security mindset than they are likely to do..

This doesn't make sense to me. The superintelligence has to already be aligned in order to want to help you solve alignment. Otherwise you're basically building its successor.

Well, if you start out with a superintelligence that you have good reasons to think is fully aligned, then that is certainly a much better situation to be in (and that's an understatement)! Mentioning that just to make it clear that even if we see things differently, there is partial agreement :)

Lets imagine a superintelligent AI, and let's describe it (a bit anthropomorphicly) as not "wanting" to help me solve alignment. Lets say that instead what it "wants" is to (for example) maximize it's reward signal, and that the best way to maximize it's reward signal would be to exterminate humanity and take over the world (for reasons Eliezer and others have outlined). Well, in that case it will be looking for ways to overthrow humanity, but if it isn't able to do that it may want to do the next best thing (providing outputs that its operators respond to with a high reward signal).

So it may well prefer to "trick" me. But if it can't "trick" me, it may prefer to give me answers that seem good to me (rather than answers that I recognize as clearly bad and evasive).

Machine learning techniques will tend to select for systems that do things that seem impressive and helpful. Unfortunately this does not guarantee "deep" alignment, but I presume that it will select for systems that at least seem aligned on the surface.

There are lots of risks and difficulties involved with asking questions/requests of AIs. But there are more and less ways dangerous of interacting with a potentially unaligned AGI, and questions/requests vary a lot in how easy or hard it is to verify whether or not they provide us what we want. The techniques/outlines I will outline in the series are intended to minimize risk of being "tricked", and I think they could get us pretty far, but I could be wrong somehow, and it's a long/complicated discussion.

Yeah, this sounds extremely dangerous and extremely unlikely to work, but I hope I'm wrong and you've found something potentially useful.

I think there are various very powerful methods that can be used to make it hard for AGI-system to not provide what we want in process of creating aligned AGI-system. But I don't disagree in regards to what you say about it being "extremely dangerous". I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.

Thanks for writing this, I find the security mindset useful all over the place and appreciate its applicability in this situation.

I have a small thing unrelated to the main post:

To my knowledge, no one tried writing a security test suite that was designed to force developers to conform their applications to the tests. If this was easy, there would have been a market for it.

I think weak versions exist (ie things that do not guarantee/force, but nudge/help). I first learnt to code in a bootcamp which emphasised test-driven development (TDD). One of the first packages I made was a TDD linter. It would simply highlight in red any functions you wrote that did not have a corresponding unit test, and any file you made without a corresponding test file.

Also if you wrote up anywhere the scalable solutions to 80% of web app vulnerabilities, I'd love to see.

My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:

https://periodictable.github.io/

You'll need to mouse over the elements to see the details, so not really mobile friendly, sorry.

I agree that linters are a weak form of automatic verification that are actually quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs and a little out of clever pattern matching.

How hard is it for you to find out if someone has a security mindset? How about developing it? How rare is this capacity?

Good article.

I think a good follow-up article could be one that continues the analogy by examining software development concepts that have evolved to address the "nobody cares about security enough to do it right" problem.

I'm thinking of two things in particular: the Rust programming language, and capability-oriented programming.

The Rust language is designed to remove entire classes of bugs and exploits (with some caveats that don't matter too much in practice). This does add some constraints to how you can build you program; for some developers, this is a dealbreaker, so Rust adoption isn't an automatic win. But many (I don't really have the numbers to quantify better) developers thrive within those limitations, and even find them helpful to better structure their program.

This selection effect has also lead to the Rust ecosystem having a culture of security by design. Eg a pentest team auditing the rustlst care "considered the general code quality to be exceptional and can attest to a solid impression left consistently by all scope items".

Capability oriented is a more general idea. The concept is pretty old, but still sound: you only give your system has much resources as it plausibly needs to perform its job. If your program needs to take some text and eg count the number of words in that text, you only give the program access to an input channel and an output channel; if the program tries to open a network socket or some file you didn't give it access to, it automatically fails.

Capability-oriented programming has the potential to greatly reduce the vulnerability of a system, because now, to leverage a remote execution exploit, you also need a capability escalation / sandbox escape exploit. That means the capability system must be sound (with all the testing and red-teaming that implies), but "the capability system" is a much smaller attack surface than "every program on your computer".

There hasn't really been a popular OS that was capability-oriented from the ground up. Similar concepts have been used in containers, WebAssembly, app permissions on mobile OSes, and some package formats like flatpak. The in-development Google OS "Fuschia" (or more precisely, its kernel Zirkon) is the most interesting project I know of on that front.

I'm not sure what the equivalent would be for AI. I think there was a LW article mentioning a project the author had of building a standard "AI sandbox"? I think as AI develops, toolboxes that figure out a "safe" subset of AIs that can be used without risking side effects, while still getting the economic benefits of "free" AIs might also be promising.

This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals.

Just curious,  what would it look like to "address the fundamentals" in airport security?

Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:

  • Israeli airport security focuses on behavioral cues, asking unpredictable questions, and profiling. A somewhat extreme threat model there, with much different base rates to account for (but also much lower traffic volume).
  • Reinforced cockpit doors address the hijackers with guns and knives scenarios, but are a fully general kind of a no-brainer control.
  • Good policework and better coordination in law enforcement are commonly cited, e.g. in the context of 9/11 hijackings, before anyone even gets to an airport.

In general, if the airlines had responsibility for security you would see a very different set of controls than what you get today, where it is an externality run by an organization with very strong "don't do anything you can get blamed for" political incentives. In an ideal world, you could get an airline catering to paranoiacs who wanted themselves and their fellow passengers to undergo extreme screening, one for people who have done the math, and then most airlines in the middle would phase into nominal gate screening procedures that didn't make them look to their customers that they didn't care (which largely the math says that they shouldn't).

A thought experiment: why is there no equivalent bus/train station security to what we have at airports? And what are the outcomes there?

This is very interesting. Thanks for taking the time to explain :)