There is an important asymmetry in reception for prophets. Go read that post first if you haven't.

For those who don't want to, the gist is: Given the same level of specificity, people will naturally give more credit to the public thinker that argues that society or industry will change, because it's easy to recall active examples of things changing and hard to recall the vast amount of negative examples where things stayed the same. If you take the Nassim Taleb route of vapidly predicting, in an unspecific way, that interesting things are eventually going to happen, interesting things will eventually happen and you will be revered as an oracle. If you take the Francis Fukuyama route of vapidly saying that things will mostly stay the same, you will be declared a fool every time something mildly important happens. 

The computer security industry happens to know this dynamic very well. No one notices the Fortune 500 company that doesn't suffer the ransomware attack. Outside the industry, this active vs. negative bias is so prevalent that information security standards are constantly derided as "horrific" without articulating the sense in which they fail, and despite the fact that online banking works pretty well virtually all of the time. Inside the industry, vague and unverified predictions that Companies Will Have Security Incidents, or that New Tools Will Have Security Flaws, are treated much more favorably in retrospect than vague and unverified predictions that companies will mostly do fine. Even if you're right that an attack vector is unimportant and probably won't lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you're wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.

This would be bad on its own, but then it's compounded with several other problems. For one thing, predictions of doom, of course, inflate the importance and future salary expectations of information security researchers[2], in the same sense that inflating the competence of the Russian military is good for the U.S. defense industry. When you tell someone their Rowhammer hardware attacks are completely inexploitable in practice, that's no fun for anyone, because it means infosec researchers aren't going to all get paid buckets of money to defend against Rowhammer exploits, and journalists have no news article. For another thing, the security industry (especially the offensive side) is selected to contain people who believe computer security is a large societal problem, and that they themselves can get involved, or who at least want to believe that it's possible for them to get involved if they put in a lot of time and effort. Security researchers are therefore already inclined to hear you out if you're about to tell them how obviously bad information security at most companies really is.

But worst of all, especially for those evaluating particular critiques and trying to prevent problems in advance, is a fourth problem: unskilled hackers are bad at modeling defenders, just as unskilled defenders are bad at modeling computer hackers. It's actually very easy - too easy - to write stories and pseudocode for exploits that an average, security-aware software engineer will believe works in practice. Newbies to the field are often shocked by how many times they run into a situation where their attacks "almost" work, just like entrepreneurs are shocked by how many startup ideas "almost" work. This happens not because the computer hacker is unlucky, but because the security engineers and project managers are sapient and anticipate attempts to pwn them. The modal outcome for proposed finds like Rowhammer is that they don't work in practice because of small, nitpicky, domain-specific details - details selected by virtualization engineers at AWS who think critically about what they're doing, and details which most people in the computer security industry are unaware of while they pass off hardware vulns as earthshattering news.

All of these factors create a default tendency, even for short feedback loop industries like security, to credit prolific bloggers with much more actionably precise understanding of the domain than they deserve. In the absence of grounding metrics and events, and in the presence of poor models of the tech, landscape, and people, security researchers often start "seeing ghosts", to borrow a term from chess. "Seeing ghosts" happens when a player looks at complicated positions on a chessboard and, lacking the necessary tools or context to analyze the position effectively himself, imagines counterattacks and assaults that aren't there. Players become confused about where their real weaknesses are, fail to take good shortcuts, and lose the game. 

However... I said the security community knew this problem well. Thankfully for us, they didn't just throw their hands up at the problem and say, "well I guess we'll try really hard not to recruit any of the biased people." Rather, they developed highly effective cultural antibodies - perhaps not out of altruism, but at least to keep their elite IRC channels and forums clear of wannabes - that help to separate the wheat from the chaff. 

One such antibody is a phrase, designed to help disincentivize unjustified navel gazing. That phrase is "POC or GTFO", POC being short for proof of concept. The basic gist of it is that, no matter how convincingly a hacker has "argued" for a flaw's existence in some set of products, you only start handing out status points after someone has successfully demonstrated the security failure, ideally in a deployed product or at the very least a toy program.

You might think this is an inconvenient (and unfair) limitation on the ability of security researchers to criticize others' ideas. It sure sounds like a bar invented by some CEO with a golden parachute trying to shave off expenses for his board. After all:

But POC || GTFO culture is ultimately quite a gift to the broader security industry, even if you'd be worried to hear your CISO saying those precise words to some underling. To illustrate the benefits of POC || GTFO discipline, imagine the following conversation, between a computer security researcher who once heard someone describe stack overflows at a party, and the 198X programmer of "fingerd". 

Partygoer: I'm worried about the programming language you guys are using, 'C'. It seems like it includes lots of footguns whereby programmers could accidentally corrupt memory. With enough optimization pressure I think something could get network services like yours to run arbitrary code. 

Fingerd developer: Ok, but that's just a theoretical objection. What would this attack look like? How would it lead to security breaches?

Partygoer (misunderstanding/minimizing the threat): Well, maybe someone on the network could send messages larger than you allocate memory on the stack for, and end up spraying the .text segment of your process with their own code.

Fingerd developer: You'll probably object if I say I'm too careful to let that happen... So what if I tried really hard to make sure we never write past buffers, AND made sure the .text and .data segments were read only? That way the program would crash instead of break dangerously.

Partygoer (also lacks security mindset): Sounds like a plan!

Years later, most of the internet goes down. 

Partygoer might object that "they were right", that fingerd was vulnerable, yet this is no consolation to humanity because they were hopelessly wrong about the character of the solution necessary. The primary problem was that neither of the parties has security mindset; instead of treating bugs as modeling failures about their code, and modifying the language or their compiler to remove the possibility of memory corruption at all, they're adopting stackable-sounding solutions like "write protect the .text section" that don't get at the heart of the problem.

A second problem, though, is that neither person has a good enough understanding of the attack to speculate properly about what is and isn't possible.  Neither of them wrote the OS code to load ELFs into memory; their training examples are drawn from hypothetical scenarios that were generated with a flawed understanding of how processes are laid out and how their program is compiled. The partygoer thus presents an application of the bug that turns out in practice to be normally impossible. "Seeing ghosts", they both come up with a promising-sounding defense against something that doesn't exist. Even if they had been lucky enough to imagine all of the ways for someone to abuse a memory corruption bug as they understood it, they'd lose.
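To make the "normally impossible" part concrete, here is a minimal sketch of why an overlong write into a stack buffer does not end up spraying the .text segment. Everything in it is an assumption made for illustration: the addresses loosely follow a classic 32-bit Linux layout (not whatever system the 198X fingerd ran on), and `regions_hit_by_overflow` is a hypothetical helper rather than part of any real tool.

```python
# Toy model of a process address space. All addresses are illustrative
# assumptions, loosely based on a classic 32-bit Linux layout.
REGIONS = {
    ".text": (0x08048000, 0x08050000),  # program code, low addresses
    ".data": (0x08050000, 0x08052000),  # globals, just above the code
    "stack": (0xBFFDF000, 0xC0000000),  # stack, near the top of user space
}


def regions_hit_by_overflow(buf_addr, write_len):
    """Return the names of the regions touched by a linear write.

    A copy into a buffer writes bytes at increasing addresses,
    covering [buf_addr, buf_addr + write_len).
    """
    write_end = buf_addr + write_len
    return sorted(
        name
        for name, (start, end) in REGIONS.items()
        if buf_addr < end and start < write_end
    )


# A 64-byte stack buffer receiving a message sixteen times its size.
BUF_ADDR = 0xBFFFF000  # assumption: where the buffer happens to live
hit = regions_hit_by_overflow(BUF_ADDR, 1024)
print(hit)
```

In this model the write runs upward from the buffer, away from the code, so it clobbers neighboring stack data long before it could touch .text. The sketch does not model page faults; a real process would fault once the write left the mapped stack.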

However, let's tentatively imagine what happens when both people accept and embrace POC || GTFO standards:

Partygoer: I'm worried about the programming language you guys are using, 'C'. It seems like it includes lots of footguns whereby programmers could accidentally corrupt memory. With enough optimization pressure I think something could get network services like yours to run their code. 

Fingerd developer (wise beyond his knowledge): POC || GTFO.

Partygoer: Ugh, fine, I'll come back tomorrow with an example.

Partygoer spends eight weeks getting their stack overrun to work; they realize after more investigation that their initial idea of the problem was flawed, but come back with a working example against a sample program.

Partygoer: So, here's a sample program I wrote since I don't have your code. There's a bug where the server accepts a password field, but the programmer forgets to check that the string the user supplies is smaller than the 64 bytes the program allocates for the password buffer.

It turns out, the implications are actually worse than I believed. The buffer is a stack variable, so the way the program is compiled by GCC, 800 bytes after the end of the buffer there's a saved return address that the program stored to remember what function to return to after this one is done. That means I can write a 64+800+4 byte payload, and the program will jump to any four-byte memory address I want. To abuse this and get a terminal, I filled most of the buffer with a NOP sled, ending with a 385 byte set of shellcode, and then set the final four bytes to return somewhere inside that region, which is always loaded at the same memory address. When I send the payload, control lands somewhere in the 400 byte NOP sled, then it slides to my (nullbyte-free) shellcode, which spawns a reverse shell that allows me to enter terminal commands into the device.
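The arithmetic in Partygoer's description can be checked with a small simulation. This is only a sketch under stated assumptions: the stack frame is modeled as a Python bytearray, `BUF_ADDR` and `original_ret` are made-up addresses, `build_payload` and `unchecked_copy` are hypothetical helpers, and the "shellcode" is an inert run of placeholder bytes. Nothing here touches real memory.

```python
import struct

BUF_SIZE = 64        # password buffer size from the example
GAP = 800            # bytes between the buffer's end and the saved return address
NOP = b"\x90"        # x86 NOP opcode
SLED_LEN = 400
SHELLCODE_LEN = 385

# Assumption: an arbitrary fixed address for the start of the buffer.
BUF_ADDR = 0xBFFFF000

# Inert stand-in bytes; no real shellcode is included here.
FAKE_SHELLCODE = b"\xCC" * SHELLCODE_LEN


def build_payload(target_addr):
    """Lay out sled + shellcode + padding + new return address."""
    body = NOP * SLED_LEN + FAKE_SHELLCODE
    padding = b"A" * (BUF_SIZE + GAP - len(body))
    return body + padding + struct.pack("<I", target_addr)  # little-endian


def unchecked_copy(frame, payload):
    """Model a copy with no length check: the whole payload is written."""
    frame[: len(payload)] = payload


# The frame: buffer, the 800 bytes in between, then the saved return address.
frame = bytearray(BUF_SIZE + GAP + 4)
original_ret = 0x08048ABC  # assumption: where the function would have returned
frame[-4:] = struct.pack("<I", original_ret)

payload = build_payload(BUF_ADDR + SLED_LEN // 2)  # aim at the middle of the sled
unchecked_copy(frame, payload)

(new_ret,) = struct.unpack("<I", frame[-4:])
landing_offset = new_ret - BUF_ADDR
lands_in_sled = 0 <= landing_offset < SLED_LEN
print(len(payload), hex(new_ret), lands_in_sled)
```

The sled and shellcode together take 785 of the 864 bytes before the return address, which leaves 79 bytes of padding. The payload also contains no zero bytes, which matters if the vulnerable copy is a C string routine that stops at the first null.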

(Unusually conscientious) Fingerd developer: That's really cool. Assuming your program is a representative example of the kinds of problems that can happen when you program in C, how could we prevent ourselves from making this mistake?

Partygoer: Well, the program being able to jump to shellcode I wrote into the stack is weird. How about you guys set stack memory to be non-executable?

Fingerd developer: Sounds like a plan!

In this scenario, the internet still goes offline when Robert Morris invents ROP chains a few years early, but it at least dies with more dignity points, because Partygoer's POC grounded the dialogue in real world attacks. Instead of having to engage in an Aristotelian debate about the security problem and just hope everyone involved is an S-tier philosopher, the interlocutors can ask themselves if their proposed defenses would prevent issues that vary on a basic theme. If they need to extrapolate and consider what Robert Morris would do next, they can do so from their correct foundation of knowledge about process memory space instead of wondering if he could send a gigabyte long message that bleeds into the kernel or something.
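Why a non-executable stack does not end the story can be shown with a toy permission model. The region table, addresses, and the `can_jump_to` helper below are all assumptions made for illustration, not a description of any particular operating system.

```python
# Toy permission model. Region names and addresses are illustrative assumptions.
REGIONS = [
    # (name, start, end, executable)
    (".text", 0x08048000, 0x08050000, True),
    ("libc", 0xB7E00000, 0xB7F80000, True),
    ("stack", 0xBFFDF000, 0xC0000000, False),  # the proposed fix: no-execute stack
]


def can_jump_to(addr):
    """Return True if a hijacked return to addr would begin executing code."""
    for _name, start, end, executable in REGIONS:
        if start <= addr < end:
            return executable
    return False  # unmapped address: the process faults


sled_addr = 0xBFFFF0C8      # somewhere inside injected bytes on the stack
existing_code = 0xB7E40000  # hypothetical address of code already in libc

print(can_jump_to(sled_addr), can_jump_to(existing_code))
```

The fix blocks the jump into injected bytes, but the attacker still controls the return address, and code that is already mapped as executable remains a valid target. That leftover is the basic theme the interlocutors can now reason about.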

Is this a perfect system? Certainly not, and I gave some examples of how it breaks down at scale above. But it's a more scalable system than "talk about it abstractly" because it allows large groups of people to build upon databases of confirmed examples of failure, as well as naturally clear up misconceptions about such failures among researchers.

  1. ^

    This post's author is of course not deliberately attempting to do this. Many of his predictions will only pan out if we are all dead, and he probably thinks of most of the bullets more as requirements for success than classes of hypothetical doom scenarios. Nevertheless, the bias I refer to will certainly affect people's retrospective evaluations of their predictions in, say, 2028.

  2. ^

    This is of course not to say that alignment researchers intentionally inflate their estimates of P(DOOM) to get more research funding. All the alignment researchers I've met seem extraordinarily sincere, and alignment research funding is mostly independent of how obvious P(DOOM) is, because people are stupid and there's not as efficient a market for AI notkilleveryonism research as there is for computer security research.

11 comments

Reading the GPT-4 data, playing with it myself, and looking at the RBRM rubric (which is RSI!), I'm struck by the thought that there are extreme limits right now on who can even begin to "POC || GTFO". That's kind of a major issue.

Without the equipment and infrastructure/support pipeline, you can do very little. It's essentially meaningless to play with models small enough to run locally that can't be trained. In fact, given just how much more capable the new model is, it's meaningless to try many things without a model sophisticated enough to reason about how to complete a task by "breaking out", etc.

Only AI company staff have the tools, eval data (it's very valuable for things like all the query|answer pairs for chatGPT, or all the question|candidate answers if you were trying to improve skill on leetcode), equipment, and so on.  

Even worse it seems like it's all or nothing, either someone is at an elite lab or they again don't have a useful system to play with.  2048 A100s were used for llama training.  

It's less than maybe 1000 people worldwide?  10k?  Not many.  

I mean looking at the RBRM rubric, I'm struck by the fact that even manual POCs don't scale.  You need to be able to task an unrestricted version of GPT-4, one that you have training access to so it can become more specialized for the task, with discovering security vulnerabilities in other systems.  You as a human would be telling it what to look for, the strategies to use, etc, while the system is what is iterating over millions of permutations.  


Yes, alignment researchers don't have access to the specific weights OpenAI is using right now, as would be the ideal real-world security failure to demonstrate. But we have plenty of posited failure conditions that we should be able to demonstrate on our own with standard deep learning tools, or public open sourced models like the ones from EleutherAI. Figuring out under what conditions Keras allows you to create mesa-optimizers, or better yet, figuring out a mesa-objective for a publicly released Facebook model would do a lot of good.

It's a little like saying "how are we supposed to prove RCE buffer overflows can happen if we don't have access to fingerd"? We can at least try to write some sample code first, and if someone skeptical asked us to do that - to design a system with the flaw before trying to come up with solutions - I don't think I could blame them too much.


I agree, I just think that probably virtually all of the 'big' issues talked about are not possible with current models, including mesa optimizers. Architecturally they may not be achievable in the search space of "find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>".

Deception theoretically has a cost, and the direction of optimization would push against it: you're asking for the smallest representation that correctly predicts the output. So at least with these forms of training + architectures (transformer variants, both for LLMs and robotics), this particular flaw May. Not. Happen.

It's precisely what you were saying with your example, the actual compiler flaws are both different and as it turns out way worse.  ("Sydney" wasn't a mesa optimizer, it's channeling a character that exists somewhere in the training corpus.  The model was Working As Intended)


Didn't they demonstrate that transformers could be mesaoptimizers? (I never properly understood the paper, so it's a genuine question.) Uncovering Mesaoptimization Algorithms in Transformers


From the paper:

Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability

It looks like you can analyze transformers, discover the internal patterns that emergently are formed, analyze which ones work the best, and then redesign your network architecture to start with an extra layer that has this pattern already present.

Not only is this closer to the human brain, but yes, it's adding a type of internal mesa optimizer.  Doing it deliberately instead of letting one form emergently from the data probably prevents the failure mode AI doomers are worried about, this layer allowing the machine to defect against humans.

What kinds of POC attacks would be the most useful for AI alignment right now? (Aside from ChatGPT jailbreaks)


IMO the hierarchy of POCs would be:

  • Proof of misalignment (relative to the company!) in real world, designed-by-engineer consumer products
  • Creating example POCs of failures using standard deep learning libraries and ML tools
  • Deliberately introducing weird tools, or training or testing conditions, for the purpose of "simulating" capabilities enhancement that might be necessary for certain kinds of problems to reveal themselves in advance

As an immediate, concrete example: figuring out how to create a POC mesa-optimizer using standard deep learning libraries would be the obvious big win, and AFAICT this has not been done. While writing this post I did some research and found out that the Alignment Research Center considers something like this an explicit technical goal of theirs, which made me happy and got me to pledge.

Would the recent Anthropic sleeper agents paper count as an example of bullet #2 or #3? 

Why don’t you think the goal misgeneralization papers or the plethora of papers finding in-context gradient descent in transformers and resnets count as mesa-optimization?

I think most of the goal misgeneralization examples are in fact POCs, but they're pretty weak POCs and it would be much better if we had better POCs. Here's a table of some key disanalogies:

Our examples | Deceptive alignment
Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment
No instrumental reasoning | Train behavior relies on instrumental reasoning
Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise
Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don’t design for it

I'd be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.
