Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

Assurance Requires Formal Proofs, Which Are Provably Impossible

The Halting Problem puts a certain standard of formalism outside our reach

This is really not true. The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure". It is completely possible to write an analyzer that can say "this is secure" for some inputs, "this is definitely insecure for reason X" for some other inputs, and "I am uncertain about your input so please go improve it" for everything in between. In particular, it is completely possible to have a machine-checkable proof system going along with executable code that can express proofs of extremely strong security properties for almost every program you might wish to run in practice, which can then judge "I can confirm this is secure" or "I can not confirm that this is secure which may or may not indicate an actual problem so go fix it".
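A minimal sketch of such a three-answer analyzer, using a deliberately tiny stand-in for "secure": the property "no `/`, `//`, or `%` in this Python expression has a literal-zero divisor, and every divisor is a literal". The names `Verdict` and `check_division` are hypothetical, and the property is a toy chosen only to show the shape of the answers (confirmed, refuted with a reason, or undecided), not a real security analysis.

```python
import ast
from enum import Enum


class Verdict(Enum):
    SAFE = "confirmed: every divisor is a nonzero literal"
    UNSAFE = "refuted: a divisor is the literal 0"
    UNKNOWN = "cannot confirm: please simplify and resubmit"


def check_division(source: str) -> tuple[Verdict, str]:
    """Toy checker: never guesses, and answers UNKNOWN whenever it cannot decide."""
    tree = ast.parse(source, mode="eval")
    verdict, reason = Verdict.SAFE, "no division has a non-literal or zero divisor"
    for node in ast.walk(tree):
        if not isinstance(node, ast.BinOp):
            continue
        if not isinstance(node.op, (ast.Div, ast.FloorDiv, ast.Mod)):
            continue
        divisor = node.right
        if isinstance(divisor, ast.Constant) and isinstance(divisor.value, (int, float)):
            if divisor.value == 0:
                # A concrete counterexample: report it immediately with a reason.
                return Verdict.UNSAFE, f"literal zero divisor on line {node.lineno}"
        else:
            # Variables and sub-expressions are beyond this checker: stay undecided.
            verdict, reason = Verdict.UNKNOWN, "a divisor is not a literal"
    return verdict, reason


for expr in ["(a + 1) / 2", "a / 0", "a / b"]:
    print(expr, "->", check_division(expr)[0].name)
```

The analyzer sidesteps undecidability by being allowed to say `UNKNOWN`; the engineering work is in shrinking the set of useful programs that land in that bucket.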

Pulling this off in practice is still fiendishly difficult, of course, and progress in this field has been frustratingly slow. But there is no theoretical reason to suspect that this is fundamentally out of reach. (Or at least, not in the halting problem; Löb's theorem does provide some real limitations here that are particularly relevant for AI correctness proofs. But that is fairly niche relative to the broader notion of software correctness and security proofs.)

[-]elspood4y70

The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say "this is secure" or "this is insecure".

It would be nice to be able to have this important impossible thing. :)

I think we are trying to say the same thing, though. Do you agree with this more concise assertion?

"It's not possible to make a high confidence checker system that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important qualities of alignment that we want, if such qualities can also be formally defined."

[-]redlizard4y90

Yes, I agree with this.

I cannot judge to what degree I agree with your strategic assessment of this technique, though. I interpreted your top-level post as judging that assurances based on formal proofs are realistically out of reach as a practical approach; whereas my own assessment is that making proven-correct [and therefore proven-secure] software a practical reality is a considerably less impossible problem than many other aspects of AI alignment, and indeed one I anticipate to actually happen in a timeline in which aligned AI materializes.

[-]elspood4y50

I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.

While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets 'solved' before software security. In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Software security is important, but I don't think it's on the critical path to survival unless somehow it is a key defense against takeoff. Certainly many imagined takeoff scenarios are made easier if an AI can exploit available computing, but I think the ability to exploit physics would grant more than enough escape potential.

[-]redlizard4y30

I would say that some formal proofs are actually impossible

Plausible. In the aftermath of Spectre and Meltdown I spent a fair amount of time thinking about how you could formally prove a piece of software to be free of information-leaking side channels, even assuming that the same thing holds for all dependent components such as underlying processors and operating systems and the like, and got mostly nowhere.

In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don't relate to alignment to actually switch their focus to alignment at this point.

Does that include those working on software correctness and reliability in general, without a security focus? I would expect better tools for making software that is free of bugs, such as programs that include correctness proofs as well as some of the lesser formal methods, to be on the critical path to survival -- for the simple reason that any number of mundane programming mistakes in a supposedly-aligned AI could easily kill us all. I was under the impression that you agree with this ["Assurance Requires Formal Proofs"]. I expect formal proofs of security in particular to be largely a corollary of this -- a C program that is proven to correctly accomplish any particular goal will necessarily not have any buffer overflows in it, for this would invoke undefined behavior which would make your proof not go through. This does not necessarily apply to all security properties, but I would expect it to apply to most of them.

[-]Aryeh Englander4y280

Just putting in my vote for doing both broader and deeper explorations of these topics!

[-]habryka2y134

Review for 2022 Review

I currently think that the case study of computer security is one of the best places to learn about the challenges that AI control and AI Alignment projects will face. Despite that, I haven't seen that much writing trying to bridge the gap between computer security and AI safety. This post is one of the few that does, and I think it does so reasonably well.

[-]the gears to ascension4y130

entirely ending large classes of zero days hasn't happened, but it's still permitted by physics that we figure it out. how do you end all zero days in an arbitrary block of matter? seems like you'd need a learning system that allows checking if the learned model is able to satisfy a statement about the matter's behavior. imo we shouldn't yet give up on formally verifying margins on the behavior of complex systems; "as long as it stays within this region, I've checked it can't break" statements are very useful, even if we can't know if we missed a statement we'd like to make about what the system allows. some refs that lead me to think this is not a hopeless direction:

rustlang demonstrates that much larger categories of vulnerability can be made unrepresentable than previously assumed.
making erroneous states unrepresentable is an important factor in how normalization works in a neural network anyway; eg, the s4 sequence model starts by using formal math to derive a margin on reconstructing the recent past using a polynomial. this sort of "don't forget" constraint seems to me likely to be a critical constraint to avoid loss of valuable complexity. there are a few talks and blog posts about s4, but the one I recommend is https://www.youtube.com/watch?v=luCBXCErkCs
formal verification of neural networks has been making some progress, see eg semantic scholar: stuff citing reluplex
category theory applied to complex systems does not yet appear definitely doomed to failure. the discussions in the IPAM-UCLA Collective Intelligence workshop had, among other very good talks, a talk about category theory of open systems by john baez (youtube).
related to these techniques, some success has been had in turning physical simulations into formal statements about a margin-to-nearest-error without the help of anything neural. eg the Food For Thought talk "multisymplectic integrators for hamiltonian PDEs" (youtube, abstract)
capability-based security has had some success at closing whole classes of vulnerability. because complexity makes verification hard, isolating the boundaries between an internally-unverifiable complex system and an external guarantee the system needs to provide seems promising, doubly so if we can define some sort of coherent boundary statements between systems
work on extracting physical laws from neural networks using symbolic regression provides hope that seeking representations that allow generalizing to formal theories of macroscopic behavior is still a promising direction. Steve Brunton and folks have a number of relevant discussions and citationwalking near their papers would likely find even more promising components, but this is one of the most interesting I've seen from his channel: youtube - blogpost - paper

that last one is probably a good summary of yudkowsky's core fears, btw. I suspect that the real dynamics of complex systems are often quite difficult to simplify and that simulating, eg, a brain, requires a similar wattage to running a physical brain, even using the most optimized algorithms. however, I'm not sure of that. maybe you can have a pretty high accuracy with only symbolically distilled statements.

My hope is that there's some sort of statement we can make about a system's memory that includes at least a guarantee that no organism simulated in the system can die during the simulation. I expect we can find a statement about information loss that can at least guarantee that, at multiple scales of a complex dynamical system, low-entropy complexity from a previous step takes as long as possible to diffuse into high entropy. I don't know how to verify this, and I'm not high enough level to figure it out on my own, but links like the ones above make me suspect that there's something to distill here that will allow recognizing unwanted interference with only a simple objective.
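To make the "as long as it stays within this region, I've checked it can't break" idea concrete, here is a minimal sketch of interval bound propagation through a tiny ReLU network. This is a much simpler technique than Reluplex (its bounds are sound but can be loose), and the network weights and function names are made up for illustration.

```python
def affine_bounds(lo, hi, weights, bias):
    """Propagate the input box [lo, hi] through y = W x + b, one output at a time."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        low = high = b
        for w, l, h in zip(row, lo, hi):
            # A non-negative weight keeps the bound order; a negative weight flips it.
            low += w * l if w >= 0 else w * h
            high += w * h if w >= 0 else w * l
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def output_bounds(layers, lo, hi):
    """Bound the network output for every input in the box; ReLU after all but the last layer."""
    for i, (weights, bias) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, weights, bias)
        if i < len(layers) - 1:
            lo, hi = relu_bounds(lo, hi)
    return lo, hi


# Hypothetical 2-2-1 network, checked over the input region [-1, 1] x [-1, 1].
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
lo, hi = output_bounds(layers, [-1.0, -1.0], [1.0, 1.0])
print(lo, hi)  # [0.0] [3.0]
```

For this network the computed bound shows the output cannot exceed 3 anywhere in the region. The true maximum is 2, so the guarantee holds but is loose; tightening such margins without giving up soundness is the hard part.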

[-]elspood4y80

I definitely wouldn't rule out the possibility of being able to formally define a set of tests that would satisfy our demands for alignment. The most I could say with certainty is that it's a lot harder than eliminating software security bug classes. But I also wouldn't rule out the possibility that an optimizing process of arbitrarily strong capability simply could not be aligned, at least to a level of assurance that a human could comprehend.

Thank you for these additional references; I was trying to anchor this article with some very high-level concepts. I very much expect that to succeed we're going to have to invent and test hundreds of formalisms to be able to achieve any kind of confidence about the alignment of a system.

[-]Ruby3y90

Curated. Lessons from, and the mindset of, computer security have long been invoked in the context of AI Alignment, and I love seeing this write-up from a veteran of the industry. What this gave that I didn't already have was not just the nature of the technical challenges, but some sense of how people have responded to security challenges in the past and how the development of past solutions has proceeded. This does feel quite relevant to predicting what by default will happen in AGI development.

[-]John_Maxwell4y*70

Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?

[-]elspood4y150

Many! Thanks for sharing. This could easily turn into its own post.

In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see what works best.

Prior to reading your post, my general thoughts about how these kinds of adversarial exercises relate to the alignment world were these:

The industry thought leaders usually have experience as both builders and breakers; some insights are hard to gain from just one side of the battlefield. That said, the industry benefits from folks who spend the time becoming highly specialized in one role or the other, and the breaker role should be valued at least equally, if not more than the builder. (In the case of alignment, breakers may be the only source of failure data we can safely get.)
The most valuable tabletop exercises that I was a part of spent at least as much time analyzing the learnings as the exercise itself; almost everyone involved will have unique insights that aren't noticed by others. (Perhaps this points to the idea of having multiple 'judges' in an alignment tournament.)
Non-experts often have insights or perspectives that are surprising to security professionals; I've been able to improve an incident response process based on participation from other teams (HR, legal, etc.) almost every time I've run a tabletop. This is probably less true for an alignment war game, because the background knowledge required to even understand most alignment topics is so vast and specialized.
Unknown unknowns are a hard problem. While I think we are a long way away from having builder ideas that aren't easily broken, it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders.
Most tabletop exercises are focused on realtime response to threats. Builder/breaker war games like the DEFCON CTF are also realtime. It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

[-]John_Maxwell4y*232

Thanks for the reply!

As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)

I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd go about solving some aspect or another of the alignment problem, hoping that Cunningham's Law would trigger someone to point out a flaw in my approach. (In some cases I'd already have a flaw in mind along with a way to address it, but I figured it'd be more motivating to wait until someone mentioned a particular flaw in the simple version of the proposal before I mentioned the fix for it.)

Anyway, it seemed like people often didn't take the bait. (Thanks to everyone who did!) Even with offering $1000 to change my view, as I'm doing in my LW user profile now, I've had 0 takers. I stopped posting on LW/AF nearly as much partially because it has seemed more efficient to try to shoot holes in my ideas myself. On priors, I wouldn't have expected this to be true -- I'd expect that someone else is going to be better at finding flaws in my ideas than I am myself, because they'll have a different way of looking at things which could address my blind spots.

Lately I've developed a theory for what's going on. You might be familiar with the idea that humans are often subconsciously motivated by the need to acquire & defend social status. My theory is that there's an asymmetry in the motivations for alignment building & breaking work. The builder has an obvious status motive: If you become the person who "solved AI alignment", that'll be really good for your social status. That causes builders to have status-motivated blindspots around weak points in their ideas. However, the breaker doesn't have an obvious status motive. In fact, if you go around shooting down people's ideas, that's liable to annoy them, which may hurt your social status. And since most proposals are allegedly easily broken anyways, you aren't signaling any kind of special talent by shooting them down. Hence the "breaker" role ends up being undervalued/disincentivized. Especially doing anything beyond just saying "that won't work" -- finding a breaker who will describe a failure in detail instead of just vaguely gesturing seems really hard. (I don't always find such handwaving persuasive.)

I think this might be why Eliezer feels so overworked. He's staked a lot of reputation on the idea that AI alignment is a super hard problem. That gives him a unique status motive to play the red team role, which is why he's had a hard time replacing himself. I think maybe he's tried to compensate for this by making it low status to make a bad proposal, in order to browbeat people into self-critiquing their proposals. But this has a downside of discouraging the sharing of proposals in general, since it's hard to predict how others will receive your ideas. And punishments tend to be bad for creativity.

So yeah, I don't know if the tournament idea would have the immediate effect of generating deep insights. But it might motivate people to share their ideas, or generate better feedback loops, or better align overall status motives in the field, or generate a "useless" blacklist which leads to a deep insight, or filter through a large number of proposals to find the strongest ones. If tournaments were run on a quarterly basis, people could learn lessons, generate some deep ideas from those lessons, and spend a lot of time preparing for the next tournament.

A few other thoughts...

it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders

Perhaps we could mitigate this by allowing breakers to just characterize how something might fail in vague terms -- obviously not as good as a specific description, but still provides some signal to iterate on.

It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

I think something like a realtime Slack discussion could be pretty engaging. I think there is room for both high-deliberation and low-deliberation formats. [EDIT: You could also have a format in between, where the blue team gets little time, and the red team gets lots of time, to try to simulate the difference in intelligence between an AGI and its human operators.] Also, I'd expect even a slow, high-deliberation tournament format to be more engaging than the way alignment research often gets done (spend a bunch of time thinking on your own, write a post, observe post score, hopefully get a few good comments, discussion dies out as post gets old).

[-]elspood4y90

I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Some of them not so much so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.

Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need a way to attract skilled breakers to alignment problems.

I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between a bounty leaderboard with public recognition and a large pipeline of value that can be harvested by a community of good breakers, and what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.

It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. BugCrowd) to offer a new category of bounty. Historically, these services have focused on traditional "pen-test" type bounties, doing runtime testing of existing live applications. But I've long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.

[-]John_Maxwell4y50

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.

Fair enough.

My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy.

Yeah personally building feels more natural to me.

I agree a leaderboard would be great. I think it'd be cool to have a leaderboard for proposals as well -- "this proposal has been unbroken for X days" seems like really valuable information that's not currently being collected.

I don't think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.

In the absence of coordination, I think if someone like you were to simply start advertising themselves as an "uberbreaker" who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a "pre-Eliezer" person who I can run my ideas by in a lower stakes context, as opposed to saying "Hey Eliezer, I solved alignment -- wallop me if I'm wrong!"

[-]elspood4y60

I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.

I would like to hear from others about this topic, though; I'm very wary of being at fault for accelerating the doom of humanity.

[-]Tor Økland Barstad4y20

Interesting comment. I feel like I recently have experienced this phenomenon myself (that it's hard to find people who can play "red team").

Do you have any "blue team" ideas for alignment where you in particular would want someone to play "red team"?

I would be interested in having someone play "red team" here, but if someone were to do so in a non-trivial manner then it would probably be best to wait at least until I've completed Part 3 (which will take at least weeks, partly since I'm busy with my main job): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction

Could potentially be up for playing red team against you, in exchange for you playing red team against me (but whether I could have something to contribute as red team would depend on the specifics of what is proposed/discussed - e.g., I'm not familiar with the technical specifics of deep learning beyond vague descriptions).

[-]John_Maxwell4y40

I wrote a comment on your post with feedback.

I don't have anything prepared for red teaming at the moment -- I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)

[-]Tor Økland Barstad4y40

Thanks for the feedback!

And yes, do feel free to send me drafts in the future if you want me to look over them. I don't give guarantees regarding amount or speed of feedback, but it would be my intention to try to be helpful :)

[-]Yitz4y40

I wasn’t aware you were offering a bounty! I rarely check people’s profile pages unless I need to contact them privately, so it might be worth mentioning this at the beginning or end of posts where it might be relevant.

[-]John_Maxwell4y40

Fair point. I also haven't done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren't many archive readers.

[-]clone of saturn4y50

That said, it took the software industry a long time to learn all the ways to NOT solve XSS before people really understood what a correct fix looked like. It often takes many many examples in the reference class before a clear fundamental solution can be seen.

This is true about the average software developer, but unlike in AI alignment, the correct fix was at least known to a few people from the beginning.

[-]elspood4y50

I would agree that some people figured this out faster than others, but the analogy is also instructional here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.

To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context concisely enough to be able to have confidence that all such edge cases could be covered.
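As a rough illustration of why the encoding context matters, here is a minimal sketch in which the same untrusted string is encoded differently depending on where it will be placed. The context names and the `encode_for_context` helper are hypothetical; the four contexts shown are nowhere near a complete list, which is part of the point about edge cases.

```python
import html
import json
from urllib.parse import quote


def encode_for_context(value: str, context: str) -> str:
    """Encode untrusted text for one specific output context; fail closed on unknown ones."""
    if context == "html_text":
        # Between tags: neutralize markup characters.
        return html.escape(value, quote=False)
    if context == "html_attr":
        # Inside a quoted attribute: quotes must be escaped as well.
        return html.escape(value, quote=True)
    if context == "url_param":
        # Inside a query-string value: percent-encode everything that is not unreserved.
        return quote(value, safe="")
    if context == "js_string":
        # Inside an inline <script>: emit a JS string literal, and escape
        # < > & so the payload cannot close the surrounding script tag.
        literal = json.dumps(value)
        return literal.replace("<", "\\u003c").replace(">", "\\u003e").replace("&", "\\u0026")
    raise ValueError(f"unknown encoding context: {context!r}")


payload = '"><script>alert(1)</script>'
for ctx in ["html_text", "html_attr", "url_param", "js_string"]:
    print(ctx, "->", encode_for_context(payload, ctx))
```

Using the `html_text` encoder inside an attribute, or the `html_attr` encoder inside a script block, would leave the payload live, which is why a single "sanitize" function was never a correct fix.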

It might be enough to simply recognize an area of alignment that has dragons and let the experts safely explore the nature and contours of these dragons, but you probably couldn't build a useful web application that doesn't display user-influencable input. I think trying to get the industry to halt on building even obvious dragon-infested things is part of what has gotten Eliezer so burned out and pessimistic.

[-]Tor Økland Barstad4y40

Really well-written post.

One thing that seems under-discussed to me is the set of methods we might use to get help from a superintelligent AGI in creating systems for which we have more assurances that they are aligned (as a whole). And one reason for me thinking that it's under-discussed is that even if we think we have succeeded with alignment, we should look for how we can use a superintelligence to verify that this is the case and add extra layers of assurance (finding the least risky methods for doing this first, and going about it in a stepwise and iterative manner).

I think that if such plans are laid out in more detail beforehand (before some team develops AGI/superintelligence I mean), and people try minimizing the degree to which such plans are "handwavy", this may help make teams more apt to make use of techniques/procedures/strategies that can be helpful (compared to if they are improvising).

Have started writing about this here if you are interested (but part 2 and 3 will probably be more substantial than part 1): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction

Though it may well be (not committing in either direction, but seems plausible) that to even get to the stage where you give a superintelligent AI questions/requests (without it beforehand hacking itself onto the internet, or that sort of thing), people would need to exhibit more security mindset than they are likely to do.

[-]MSRayne4y70

This doesn't make sense to me. The superintelligence has to already be aligned in order to want to help you solve alignment. Otherwise you're basically building its successor.

[-]Tor Økland Barstad4y10

Well, if you start out with a superintelligence that you have good reasons to think is fully aligned, then that is certainly a much better situation to be in (and that's an understatement)! Mentioning that just to make it clear that even if we see things differently, there is partial agreement :)

Let's imagine a superintelligent AI, and let's describe it (a bit anthropomorphically) as not "wanting" to help me solve alignment. Let's say that instead what it "wants" is to (for example) maximize its reward signal, and that the best way to maximize its reward signal would be to exterminate humanity and take over the world (for reasons Eliezer and others have outlined). Well, in that case it will be looking for ways to overthrow humanity, but if it isn't able to do that it may want to do the next best thing (providing outputs that its operators respond to with a high reward signal).

So it may well prefer to "trick" me. But if it can't "trick" me, it may prefer to give me answers that seem good to me (rather than answers that I recognize as clearly bad and evasive).

Machine learning techniques will tend to select for systems that do things that seem impressive and helpful. Unfortunately this does not guarantee "deep" alignment, but I presume that it will select for systems that at least seem aligned on the surface.

There are lots of risks and difficulties involved with asking questions/requests of AIs. But there are more and less dangerous ways of interacting with a potentially unaligned AGI, and questions/requests vary a lot in how easy or hard it is to verify whether or not they provide us what we want. The techniques/strategies I will outline in the series are intended to minimize the risk of being "tricked", and I think they could get us pretty far, but I could be wrong somehow, and it's a long/complicated discussion.

[-]MSRayne4y20

Yeah, this sounds extremely dangerous and extremely unlikely to work, but I hope I'm wrong and you've found something potentially useful.

[-]Tor Økland Barstad4y*20

I think there are various very powerful methods that can be used to make it hard for an AGI-system to not provide what we want in the process of creating an aligned AGI-system. But I don't disagree with what you say about it being "extremely dangerous". I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.

[-]tamgent4y30

Thanks for writing this, I find the security mindset useful all over the place and appreciate its applicability in this situation.

I have a small thing unrelated to the main post:

To my knowledge, no one tried writing a security test suite that was designed to force developers to conform their applications to the tests. If this was easy, there would have been a market for it.

I think weak versions exist (ie things that do not guarantee/force, but nudge/help). I first learnt to code in a bootcamp which emphasised test-driven development (TDD). One of the first packages I made was a TDD linter. It would simply highlight in red any functions you wrote that did not have a corresponding unit test, and any file you made without a corresponding test file.

Also, if you wrote up the scalable solutions to 80% of web app vulnerabilities anywhere, I'd love to see it.

[-]elspood4y40

My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:

https://periodictable.github.io/

You'll need to mouse over the elements to see the details, so not really mobile friendly, sorry.

I agree that linters are a weak form of automatic verification that is actually quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs and a little out of clever pattern matching.

[-]tamgent3y30

I just want to let you know that this table was really useful for me for something I'm working on. Thank you for making it.

[-]elspood3y20

I'm glad you found it useful, even in this form. If the thing you're working on is something you could share, I'd be happy to offer further assistance, if you like.

[-]tamgent3y10

Thanks kindly for the offer, I will DM you

[-]tamgent3y10

Thanks for sharing, this is a really nice resource for a number of problems and solutions.

[-]chaosmage3y20

A big bounty creates perverse incentives where one guy builds a dangerous AI in a jurisdiction where that isn't a crime yet, and his friend reports him so they can share the bounty.

[-]Roland Pihlakas3y20

I propose blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some of the ultimate objectives can also be represented in the form of blacklists. For example, listing many ways to kill a person is less useful. But saying that death or violence is to be avoided, is more useful.

[-]PoignardAzur3y*20

Good article.

I think a good follow-up article could be one that continues the analogy by examining software development concepts that have evolved to address the "nobody cares about security enough to do it right" problem.

I'm thinking of two things in particular: the Rust programming language, and capability-oriented programming.

The Rust language is designed to remove entire classes of bugs and exploits (with some caveats that don't matter too much in practice). This does add some constraints to how you can build your program; for some developers, this is a dealbreaker, so Rust adoption isn't an automatic win. But many (I don't really have the numbers to quantify better) developers thrive within those limitations, and even find them helpful to better structure their program.

This selection effect has also led to the Rust ecosystem having a culture of security by design. E.g. a pentest team auditing the rustls crate "considered the general code quality to be exceptional and can attest to a solid impression left consistently by all scope items".

Capability-oriented programming is a more general idea. The concept is pretty old, but still sound: you only give your system as many resources as it plausibly needs to perform its job. If your program needs to take some text and e.g. count the number of words in that text, you only give the program access to an input channel and an output channel; if the program tries to open a network socket or some file you didn't give it access to, it automatically fails.
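The word-count example can be sketched in Python to show the shape of a capability-style interface: the function receives exactly the two channels it needs and nothing else. The name `count_words` is made up for the example, and Python itself enforces none of this (the function could still import `socket`); actual enforcement has to come from the OS or runtime sandbox.

```python
import io
from typing import TextIO


def count_words(source: TextIO, sink: TextIO) -> None:
    """Count whitespace-separated words, touching only the two channels
    passed in. No file paths, sockets, or globals are used."""
    total = sum(len(line.split()) for line in source)
    sink.write(f"{total}\n")


# The caller decides which resources exist; here they are in-memory buffers.
source = io.StringIO("the quick brown fox\njumps over the lazy dog\n")
sink = io.StringIO()
count_words(source, sink)
print(sink.getvalue())  # prints "9"
```

In a capability-oriented system, the equivalent of passing `source` and `sink` is the only way the program can reach any resource at all, so an exploit inside `count_words` gains nothing beyond those two handles.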

Capability-oriented programming has the potential to greatly reduce the vulnerability of a system, because now, to leverage a remote execution exploit, you also need a capability escalation / sandbox escape exploit. That means the capability system must be sound (with all the testing and red-teaming that implies), but "the capability system" is a much smaller attack surface than "every program on your computer".

There hasn't really been a popular OS that was capability-oriented from the ground up. Similar concepts have been used in containers, WebAssembly, app permissions on mobile OSes, and some package formats like Flatpak. The in-development Google OS "Fuchsia" (or more precisely, its kernel Zircon) is the most interesting project I know of on that front.

I'm not sure what the equivalent would be for AI. I think there was a LW article mentioning a project the author had of building a standard "AI sandbox"? I think as AI develops, toolboxes that figure out a "safe" subset of AIs that can be used without risking side effects, while still getting the economic benefits of "free" AIs might also be promising.

[-][anonymous]3y20

This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals.

Just curious, what would it look like to "address the fundamentals" in airport security?

[-]elspood3y110

Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:

Israeli airport security focuses on behavioral cues, asking unpredictable questions, and profiling. A somewhat extreme threat model there, with much different base rates to account for (but also much lower traffic volume).
Reinforced cockpit doors address the hijackers-with-guns-and-knives scenarios, but are also a fully general, no-brainer kind of control.
Good policework and better coordination in law enforcement are commonly cited, e.g. in the context of 9/11 hijackings, before anyone even gets to an airport.

In general, if the airlines had responsibility for security you would see a very different set of controls than what you get today, where it is an externality run by an organization with very strong "don't do anything you can get blamed for" political incentives. In an ideal world, you could get an airline catering to paranoiacs who wanted themselves and their fellow passengers to undergo extreme screening, one for people who have done the math, and then most airlines in the middle would phase into nominal gate screening procedures that kept them from looking to their customers like they didn't care (which largely the math says that they shouldn't).

A thought experiment: why is there no equivalent bus/train station security to what we have at airports? And what are the outcomes there?

[-][anonymous]3y10

This is very interesting. Thanks for taking the time to explain :)

[-]Eldho Kuriakose3y10

Awesome piece! Isn't it fascinating that our existing incentives and motives are already un-aligned with the priority of creating aligned systems? This then raises the question of whether alignment is even the right goal if our bigger goal is to avoid ruin.

Stepping back a bit, I can't convince myself that Aligned AI will or will not result in societal ruin. It almost feels like a "don't care" in the Karnaugh map.

The fundamental question is whether we collectively are wise enough to wield power without causing self harm. If the last 200+ years are a testament, and if the projections of climate change and biodiversity loss are accurate, the answer appears to be that we're not even wise enough to wield whale oil, let alone fossil fuels.

There is also the very real possibility that Alignment can occur in two ways - 1) with the machine aligning with human values and 2) with the humans aligning with values generated in machines. Would we be able to tell the difference?

If indeed AI can surpass some intelligence threshold, could it also surpass some wisdom threshold? If this is possible, is alignment necessarily our best bet for avoiding ruin?

[-]Algon3y10

How hard is it for you to find out if someone has a security mindset? How about developing it? How rare is this capacity?

The TJX corporation experienced one of the largest data breaches in history, accompanied by millions of dollars in fines; however, their stock price quickly recovered as the world forgot about the incident.
