POC || GTFO culture as partial antidote to alignment wordcelism

[-][anonymous]3y214

Reading the GPT-4 data, playing with it myself, and looking at the RBRM rubric (which is RSI!), I'm struck by the thought that there are extreme limits right now on who can even begin to "POC || GTFO". That's kind of a major issue.

Without the equipment and infrastructure/support pipeline, you can do very little. It's essentially meaningless to play with models that are small enough to run locally but can't be trained. In fact, given just how much more capable the new model is, it's meaningless to try many things without a model sophisticated enough to reason about how to complete a task by "breaking out", etc.

Only AI company staff have the tools, eval data (which is very valuable: think of all the query|answer pairs for ChatGPT, or all the question|candidate answers if you were trying to improve skill on LeetCode), equipment, and so on.

Even worse, it seems like it's all or nothing: either someone is at an elite lab or, again, they don't have a useful system to play with. 2048 A100s were used for LLaMA training.

That's maybe fewer than 1000 people worldwide? 10k? Not many.

I mean, looking at the RBRM rubric, I'm struck by the fact that even manual POCs don't scale. You need to be able to task an unrestricted version of GPT-4, one that you have training access to so it can become more specialized for the task, with discovering security vulnerabilities in other systems. You as a human would be telling it what to look for, the strategies to use, etc., while the system is what iterates over millions of permutations.

[-]lc3y*121

Yes, alignment researchers don't have access to the specific weights OpenAI is using right now, and a failure in those would be the ideal real-world security failure to demonstrate. But we have plenty of posited failure conditions that we should be able to demonstrate on our own with standard deep learning tools, or with public open-source models like the ones from EleutherAI. Figuring out under what conditions Keras allows you to create mesa-optimizers, or better yet, figuring out a mesa-objective for a publicly released Facebook model, would do a lot of good.

It's a little like saying "how are we supposed to prove RCE buffer overflows can happen if we don't have access to fingerd"? We can at least try to write some sample code first, and if someone skeptical asked us to do that - to design a system with the flaw before trying to come up with solutions - I don't think I could blame them too much.

[-][anonymous]3y52

I agree; I just think that probably virtually all of the 'big' issues talked about are not possible with current models, including mesa-optimizers. Architecturally they may not be achievable in the search space of "find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>".

Deception theoretically has a cost, and the direction of optimization would push against it: you're asking for the smallest representation that correctly predicts the output. So at least with these forms of training + architectures (transformer variants, both for LLMs and robotics), this particular flaw May. Not. Happen.

It's precisely what you were saying with your example: the actual compiler flaws are both different and, as it turns out, way worse. ("Sydney" wasn't a mesa-optimizer; it was channeling a character that exists somewhere in the training corpus. The model was Working As Intended.)

[-]seed2y20

Didn't they demonstrate that transformers could be mesaoptimizers? (I never properly understood the paper, so it's a genuine question.) See "Uncovering Mesaoptimization Algorithms in Transformers".

[-][anonymous]2y30

From the paper:

"Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability."

It looks like you can analyze transformers, discover the internal patterns that form emergently, work out which ones perform best, and then redesign your network architecture to start with an extra layer that has this pattern already present.

Not only is this closer to the human brain, but yes, it's adding a type of internal mesa-optimizer. Doing it deliberately instead of letting one form emergently from the data probably prevents the failure mode AI doomers are worried about, namely this layer allowing the machine to defect against humans.
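
As a rough illustration of the contrast the quoted abstract draws between "a single gradient step" and solving the least-squares problem outright, here is a minimal scalar sketch. It is not the paper's mesa-layer: the function names, the toy data, and the learning rate are all assumptions made up for this example.

```python
# Toy scalar regression y ≈ w * x on a handful of "in-context" pairs.
# All names are hypothetical; this is not the paper's mesa-layer.

def one_gradient_step(xs, ys, lr):
    """One gradient-descent step from w = 0 on 0.5 * sum((w*x - y)^2)."""
    grad_at_zero = -sum(x * y for x, y in zip(xs, ys))
    return 0.0 - lr * grad_at_zero

def least_squares(xs, ys, ridge=0.0):
    """Closed-form minimizer of 0.5 * sum((w*x - y)^2) + 0.5 * ridge * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + ridge)

def squared_error(w, xs, ys):
    """Sum of squared residuals for weight w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by the "true" weight w = 2 (assumed toy data)

w_step = one_gradient_step(xs, ys, lr=0.05)
w_exact = least_squares(xs, ys)

print("one step:     ", w_step, squared_error(w_step, xs, ys))
print("least squares:", w_exact, squared_error(w_exact, xs, ys))
```

On this toy data the single step only moves part of the way toward the optimum, while the closed-form solve lands on it directly. Whether that carries over to real attention layers is the paper's claim, not something this sketch shows.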

[-]Robert Miles5mo64

This social technology is not new, of course. Consider the motto of the Royal Society, from 1660: "Nullius in verba" (roughly, "take nobody's word for it").
It seems critical to any group truth-seeking or knowledge-building that claims are not readily accepted unless they can be demonstrated.

[-]Donald Hobson5mo42

Your bad example takes 5 minutes at a party. Your "good" example takes 8 weeks of work. It is not hard, in general, to get a better answer by investing more effort.

A specific example, worked out in full detail, can exhibit the presence of security holes, but not their absence. If the system is a complicated mess, it can be very hard to find a security hole, but also very hard to prove it doesn't have one. (And it's quite likely it does have one.)

When speculating about the risks of future AI, the easiest proofs of concept will be rather toy and of arguable relevance. More sophisticated proofs of concept on less toy examples might be dangerous to create.

If you see a bunch of potential threats, it's not guaranteed that all those threats are real. But they are all likely enough to be real that you have to plan for them. The list of speculations will contain some false positives. The list of fully worked-out, detailed exploits will contain false negatives.

[-]lc5mo42

It doesn't take "8 weeks" to come up with a good example if you already understand the problem. Most of the "8 weeks" is spent realizing you didn't understand the problem correctly enough to create the POC, or to propose a solution that works.

[-]Zac Hatfield-Dodds10mo4-1Review for 2023 Review

"POC || GTFO culture" need not be literal, and generally cannot be when speculating about future technologies. I wouldn't even want a proof-of-concept misaligned superintelligence!

Nonetheless, I think the field has been improved by an increasing emphasis on empiricism and demonstrations over the last two years, in technical research, in governance research, and in advocacy. I'd still like to see more careful caveating of claims for which we have arguments but not evidence, and it's useful to have a short handle for that idea - "POC || admit you're unsure", perhaps?

[-]cousin_it3y41

What kinds of POC attacks would be the most useful for AI alignment right now? (Aside from ChatGPT jailbreaks)

[-]lc3y*2015

IMO the hierarchy of POCs would be:

1. Proof of misalignment (relative to the company!) in real world, designed-by-engineer consumer products
2. Creating example POCs of failures using standard deep learning libraries and ML tools
3. Deliberately introducing weird tools, or training or testing conditions, for the purpose of "simulating" capabilities enhancement that might be necessary for certain kinds of problems to reveal themselves in advance

As an immediate, concrete example: figuring out how to create a POC mesa-optimizer using standard deep learning libraries would be the obvious big win, and AFAICT this has not been done. While writing this post I did some research and found out that the Alignment Research Center considers something like this an explicit technical goal of theirs, which made me happy and got me to pledge.

[-]Mo Putera2y30

Would the recent Anthropic sleeper agents paper count as an example of bullet #2 or #3?

[-]Garrett Baker3y10

Why don’t you think the goal misgeneralization papers or the plethora of papers finding in-context gradient descent in transformers and resnets count as mesa-optimization?

[-]Rohin Shah3y*2220

I think most of the goal misgeneralization examples are in fact POCs, but they're pretty weak POCs and it would be much better if we had better POCs. Here's a table of some key disanalogies:

| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don't design for it |

I'd be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.
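
To make the left-hand column concrete, here is a deliberately tiny sketch of the weak kind of POC being described. It is a hypothetical 1-D world invented for this example, with two hand-written policies and no learning at all: a proxy policy ("always go right") and the intended one ("go to the coin") are indistinguishable on the training episodes and only come apart once the episodes are more diverse. It exhibits none of the right-column properties.

```python
# Hypothetical 1-D world: cells 0..width-1, an agent position and a coin position.
# Both policies are hand-written for illustration; nothing is trained here.

def go_right(agent, coin, width):
    """Proxy policy: always step right, ignoring the coin."""
    return min(agent + 1, width - 1)

def go_to_coin(agent, coin, width):
    """Intended policy: step toward the coin."""
    if coin > agent:
        return agent + 1
    if coin < agent:
        return agent - 1
    return agent

def reaches_coin(policy, start, coin, width):
    """Run the policy for up to `width` steps; True if it ever stands on the coin."""
    agent = start
    for _ in range(width):
        if agent == coin:
            return True
        agent = policy(agent, coin, width)
    return agent == coin

def success_rate(policy, episodes, width):
    """Fraction of (start, coin) episodes in which the policy reaches the coin."""
    hits = sum(reaches_coin(policy, s, c, width) for s, c in episodes)
    return hits / len(episodes)

WIDTH = 5
# Training episodes: the coin always sits in the rightmost cell.
train = [(start, WIDTH - 1) for start in range(WIDTH)]
# More diverse episodes: the agent starts in the middle, the coin can be anywhere.
diverse = [(2, coin) for coin in range(WIDTH)]

for name, policy in [("go_right", go_right), ("go_to_coin", go_to_coin)]:
    print(name, success_rate(policy, train, WIDTH), success_rate(policy, diverse, WIDTH))
```

Both policies score perfectly on `train`, so that data cannot tell them apart; the `diverse` episodes can, which is the "adding more diverse data would solve the problem" row. A stronger POC would need a system that keeps behaving well under any such test.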

[-]Noosphere8910mo20Review for 2023 Review

For those who don't want to, the gist is: Given the same level of specificity, people will naturally give more credit to the public thinker that argues that society or industry will change, because it's easy to recall active examples of things changing and hard to recall the vast amount of negative examples where things stayed the same. If you take the Nassim Taleb route of vapidly predicting, in an unspecific way, that interesting things are eventually going to happen, interesting things will eventually happen and you will be revered as an oracle. If you take the Francis Fukuyama route of vapidly saying that things will mostly stay the same, you will be declared a fool every time something mildly important happens.

The computer security industry happens to know this dynamic very well. No one notices the Fortune 500 company that doesn't suffer the ransomware attack. Outside the industry, this active vs. negative bias is so prevalent that information security standards are constantly derided as "horrific" without articulating the sense in which they fail, and despite the fact that online banking works pretty well virtually all of the time. Inside the industry, vague and unverified predictions that Companies Will Have Security Incidents, or that New Tools Will Have Security Flaws, are treated much more favorably in retrospect than vague and unverified predictions that companies will mostly do fine. Even if you're right that an attack vector is unimportant and probably won't lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you're wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.

This would be bad on its own, but then it's compounded with several other problems. For one thing, predictions of doom, of course, inflate the importance and future salary expectations of information security researchers[2], in the same sense that inflating the competence of the Russian military is good for the U.S. defense industry. When you tell someone their Rowhammer hardware attacks are completely inexploitable in practice, that's no fun for anyone, because it means infosec researchers aren't going to all get paid buckets of money to defend against Rowhammer exploits, and journalists have no news article. For another thing, the security industry (especially the offensive side) is selected to contain people who believe computer security is a large societal problem, and that they themselves can get involved, or at least want to believe that it's possible for them to get involved if they put in a lot of time and effort, and so security researchers are already inclined to hear you if you're about to tell them how obviously bad information security at most companies really is.

In retrospect, a value add of the post is precisely in raising this consideration: incentives can make a huge difference in what you believe. A big takeaway for me is that I'm way less of a fan of security mindset as practiced by Eliezer, at least without massive scope changes, and it is a reason why I automatically treat arguments for AI doom that aren't backed up by an empirical story with suspicion.

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Novice3mo00

POC || GTFO culture seems to share many themes with Logical Positivism...

[1] This post's author is of course not deliberately attempting to do this. Many of his predictions will only pan out if we are all dead, and he probably thinks of most of the bullets more as requirements for success than classes of hypothetical doom scenarios. Nevertheless, the bias I refer to will certainly affect people's retrospective evaluations of their predictions in, say, 2028.

[2] This is of course not to say that alignment researchers intentionally inflate their estimates of P(DOOM) to get more research funding. All the alignment researchers I've met seem extraordinarily sincere, and alignment research funding is mostly independent of how obvious P(DOOM) is, because people are stupid and there's not as efficient a market for AI notkilleveryonism research as there is for computer security research.

LESSWRONG