Recently, I've been mulling over the question of whether it was a good idea or not to join a frontier AI company's safety team for the purposes of reducing extinction risk. One of my big cons was something like:
Jay, you think the incentives are less likely to affect you compared to most people. But most AI safety people who join frontier labs probably think this. You will be affected as well.
So I decided on a partial mitigation strategy, entirely as a precautionary principle and not at all because I thought I needed to. I committed to myself and to several people I'm close to that if I were to join a frontier lab safety team, I would donate 100% of the surplus that I would gain as a result of taking that job instead of a less lucrative job somewhere else.
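To make the arithmetic of "surplus" concrete (with entirely made-up numbers - I had no offer in hand, so these aren't figures from any real job):

$$\text{surplus} = \text{frontier-lab compensation} - \text{compensation at the counterfactual job}, \qquad \text{e.g. } \$450\text{k} - \$120\text{k} = \$330\text{k donated per year.}$$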
At the time, I was applying for a few jobs, one of which was at a frontier company. Approximately immediately, my System 1 became way less interested in that job. And I didn't even have an offer in hand for a specific amount of money. I don't have good reasons to care a lot about getting more money for myself: I have enough already, and I voluntarily live well below my means. That did not stop the effect from existing, and I hadn't noticed it before. I still don't notice the effect on my thinking in a vacuum; I only notice it by doing a mental side-by-side comparison.
I now think anyone who is considering joining a frontier company in order to reduce extinction risk should make this same commitment as a basic defensive measure against perverse incentives. I am sure there exist people who are entirely indifferent to money in this way - this is at least partially a skill issue on my part. But it does seem that "thinking you are indifferent to the money" is not a reliable signal that your thinking is unaltered by it.
This is also an opportunity to say that, if I ever do join a frontier safety team, I officially give you permission to ask me if I'm meeting this commitment of mine in conversation, even if other people are around.
This is highly admirable and commendable. I am wondering if the EA/AI safety community in general could encourage this norm.
Thanks for reminding me to crosspost this to the EA Forum as well, I had totally forgotten to do that.
Thanks for making a strong effort to track incentives and their effects on you, especially when dealing with a topic like this. If I had to guess, many more highly visible/prominent members of the community haven't done the same.
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of "getting no surplus money". Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf loss aversion)
This came at the right time, as I am actively looking into joining a frontier AI company safety team. Although the incentives these AI companies offer seem really encouraging, I have noticed that I feel indifferent when it comes to financial gains. What excites me most, and makes me want to actually work in AI safety, is the contribution - the constant rush of knowing that you may be closer to understanding how these systems work.
I noticed I did not care about who was offering what when I found myself pitching to someone to do volunteer work for them - all I wanted was to get the resources and be in that environment.
I am worried I am obsessing over this field.
Could you explain which AI companies[1] you considered, and how joining such a company could increase x-risk compared to the options where you are not on a frontier lab safety team? My baseline extinction scenario is that OAI/Anthropic/GDM/a union thereof creates an unaligned AI which escapes before deployment, escapes after deployment, or is trusted to create a successor. The safety team's role is to develop interpretability techniques and to find evidence of misalignment before the AI becomes capable of harming mankind. I don't see a way for you to decrease P(finding evidence) unless your potential rival was more competent. Alas, I also struggle to understand how one can estimate P(rival was more competent).
Unless, of course, they are xAI or Meta which I'd rather see destroyed.
The specific company I was considering in this example was GDM, but I see Anthropic as broadly similar and OAI as somewhat worse. That's why I didn't name it - there was nothing in my logic that was specific to GDM.
If I put on my pessimist hat, here are the ways in which this could counterfactually increase extinction risk:
By contrast, the counterfactual thing I could do is other potentially useful work that might reduce extinction risk more. My alternate path, which is the one I currently estimate I'm very likely to take (just waiting on security clearance), is to join the new Australian AISI and try to shape their research direction in a way that helps with extinction risk. The expected value of this is well above zero in my book, so it's not enough just to avoid increasing extinction risk at a frontier lab. I also have to decrease extinction risk more there than anywhere else I could otherwise be working.
And since that's really hard to calculate, it's easy to wind up doing the things that follow money and status gradients as long as they have a plausible path to impact. Committing to giving away any extra money is a good way of taking away one of those incentives. The money could do a lot of good if donated, but that's an abstract good, it doesn't emotionally move me in the way that a large salary clearly does. I don't endorse this and I am not proud of this, but it would be far worse if I denied it to myself.
To state the obvious (but that I don't think you stated explicitly so far): If your judgment is that there are money-bottlenecked initiatives that you could un-bottleneck with an amount like $[GDM AI safety minus Oz AISI] (and that likely will be un-bottlenecked if you don't do this), and your estimated counterfactual impact is comparable in both cases, this might tip the decision strongly in favor of GDM AI safety.
I recently read Anthropic's new paper on introspection in LLMs. (https://www.anthropic.com/research/introspection)
In short, they were able to inject a concept into the model's activations (essentially a steering vector) and then, when the model was prompted to introspect, have it notice that an injected "thought" was present and identify the concept, at least some of the time.
They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this.
To give some fairly obvious thoughts on what this means:
If a model can unreliably do this now when directly prompted to do so, I expect the model can reliably do this without being prompted in around 2-3 years from now.
If you have an alignment plan that involves altering the model's thoughts in some way, consider that the model will probably be aware by default that you did that, and plan accordingly.
Finally a more personal thought:
The idea of "Well, if you try to adjust a superintelligence's thoughts, it will know you did that and try to subvert it / route around the imposed thoughts" would have sounded weird to me if I'd tried saying it out loud. We're talking about literally altering the model's mind to think what we want it to think, and you still think that's not enough for alignment? I note that I did think that, but I was thinking in terms of "The concepts you are altering are a proxy for the thing you are actually trying to alter, and the important performance-enhancing stuff like deception will migrate to the other parts" and not "The model will detect you altering its mind and will likely actively work against you once it builds the situational awareness for that". And yet here we are, with me insufficiently paranoid. Unfortunately, reality doesn't care about what sounds intuitively plausible.
I wonder if this is an example that would allow people to feel, viscerally, the difficulty of aligning an actually smart entity. Here we have a capability that sounds scary and superhuman and only arose with smart models. The idea of a model being aware of when you're altering its thoughts is terrifying. The lesson we would like to give to decision-makers is something like:
"AI's will soon know when you try to adjust its thoughts. That's how smart these things are. We can employ a primitive type of mind control on them and that still isn't enough. We are really not ready for even smarter systems than this. Any alignment plan that might work has to be stronger than mind control. Anything below the level of literal mind control isn't even the ante to sit down at the table any more."
I can't literally reduce this down to five words, but a succinct message might be something like "Mind control won't work. We need a plan that's better than that."
OK, so here we should give some credit to Eliezer for having thought this through and who has been making a big deal about this precise thing (although partly through his recent fanfic that he keeps alluding to, so maybe that doesn't really count).
I tend to find him a bit annoying on these things because surely people wouldn't seriously try to pin substantial hopes on these kind of ideas, but what do I know.
Thank you for pointing this out. I am deducting some points from myself for not realizing this on my own. On a more mundane level: steering vectors might get somewhat screwy rather soon.
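For anyone who hasn't bumped into steering vectors before, here's a minimal sketch of the kind of intervention being discussed - purely illustrative, assuming a generic PyTorch model; the toy model, layer choice, and scale factor are all made-up stand-ins, not anything from the paper:

```python
# Minimal, illustrative sketch of "adding a difference vector" (a steering vector)
# to a model's activations. Everything here is a toy stand-in.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer layer."""
    def __init__(self, d_model=16):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(TinyBlock(), TinyBlock(), TinyBlock())

# A steering vector is typically the difference between mean activations on
# prompts that evoke a concept and mean activations on neutral prompts.
acts_concept = torch.randn(32, 16)   # placeholder activations on "concept" prompts
acts_neutral = torch.randn(32, 16)   # placeholder activations on neutral prompts
steering_vector = acts_concept.mean(dim=0) - acts_neutral.mean(dim=0)

def steer(module, inputs, output, alpha=4.0):
    # Add the scaled difference vector to this layer's output at inference time.
    return output + alpha * steering_vector

# Inject at a middle layer; everything downstream now computes on the nudged
# activations - which is exactly the sort of edit a model might learn to notice.
handle = model[1].register_forward_hook(steer)
out = model(torch.randn(4, 16))
handle.remove()
```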
Could it be limited to stuff like LLMs rather than all kinds of AI? They were trained with massive amounts of data that don't reflect the imposed thought, and the resulting preferences/motivations are distributed across a large network. Injecting one vector doesn't sufficiently affect all the existing circuits of preferences and thinking habits, so the model's chain of thought may be able to break free enough to notice and work around it.
One would hope that progress in interpretability will allow us to do much more refined "mind control" than adding difference vectors.
One would hope indeed. But even so, we do now know that this is likely to be the kind of action that could be detected and opposed. And since I didn't predict in advance that this would happen, especially at this capability level, the update for me is that it's going to be significantly harder than I would hope.
Speedrunners have a tendency to totally break video games in half, sometimes in the strangest and most bizarre ways possible. I feel like some of the more convoluted video game speedrun / challenge run glitches out there are actually a good way to build intuition on what high optimisation pressure (like that imposed by a relatively weak AGI) might look like, even at regular human or slightly superhuman levels. (Slightly superhuman being a group of smart people achieving what no single human could)
Two that I recommend:
https://www.youtube.com/watch?v=kpk2tdsPh0A - Tool-assisted run where the inputs are programmed frame by frame by a human and executed by a computer. Exploits idiosyncrasies in Super Mario 64's code that no human could ever use unassisted in order to reduce the number of times the A button needs to be pressed in a run. I wouldn't be surprised if this guy knows more about SM64's code than the devs at this point.
https://www.youtube.com/watch?v=THtbjPQFVZI - A glitch using outside-the-game hardware considerations to improve consistency on yet another crazy in-game glitch. Also showcases just how large the attack space is.
These videos are also just incredibly entertaining in their own right, and not ridiculously long, so I hypothesise that they're a great resource to send to more skeptical people if they understand the idea of AGI but are systematically underestimating the difference between "bug-free" (the program will not have bugs during normal operation) and "secure" (the program will not have bugs when deliberately pushed towards narrow states designed to create bugs).
For a more serious overview, you could probably find obscure hardware glitches and such to achieve the same lesson.
I'm not sure I agree that it's a useful intuition pump for the ways an AGI can surprisingly-optimize things. They're amusing, but fundamentally based on out-of-game knowledge about the structure of the game. Unless you're positing a simulation hypothesis, AND that AGI somehow escapes the simulation, it's not really analogous.
They're amusing, but fundamentally based on out-of-game knowledge about the structure of the game.
Evolutionary and DRL methods are famous for finding exploits and glitches model-free, from within the game. Chess endgame databases are another example.
A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".
An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally my idiot unit is about two years - I can pretty reliably look back in time and think that compared to year T, Jay at year T-2 had worse priorities or was a lot less effective at pursuing his goals somehow.
Not everyone has an Idiot Unit. Some people believe they were smarter ten years ago, or haven't really changed their methods and priorities in a while. Take a minute and think about what your Idiot Unit might be, if any.
Now, if you have an Idiot Unit for your own life, what does that imply?
Firstly, hill-climbing heuristics should be upweighted compared to long-term plans. If your Idiot Unit is U, any plan that takes more than U time means that, after U time, you're following a plan that was designed by an idiot. Act accordingly.
That said, a recent addition I have made to this - you should still make long-term plans. It's important to know which of your plans are stable under Idiot Units, and you only get that by making those plans. I don't disagree with my past self about everything. For instance, I got a vasectomy at 29, because not wanting kids had been stable for me for at least ten years, so I don't expect more Idiot Units to change this.
Secondly, if you must act according to long-term plans (a college/university degree takes longer than U for me, especially since U tends to be shorter when you're younger), try to pick plans that preserve or increase optionality. I want to give Future Jay as many options as possible, because he's smarter than me. When Past Jay decided to get a CS degree, he had no idea about EA or AI alignment. But a CS degree is a very flexible investment, so when I decided to do something good for the world, I had a ready-made asset to use.
Thirdly, longer-term investments in yourself (provided they aren't too specific) are good. Your impact will be larger a few years down the track, since you'll be smarter then. Try asking what a smarter version of you would likely find useful and seek to acquire that. Resources like health, money, and broadly-applicable knowledge are good!
Fourthly, the lower your Idiot Unit is, the better. It means you're learning! Try to preserve it over time - Idiot Units naturally grow with age so if yours stands still, you're actually making progress.
I'm not sure if it's worth writing up a whole post on this with more detail and examples, but I thought I'd get the idea out there in Shortform.
I like the idea, but you can also get fake Idiot Units for moving in a random direction that is not necessarily an improvement.
I bought a month of Deep Research and am open to running queries if people have a few but don't want to spend 200 bucks for them. Will spend up to 25 queries in total.
A paragraph or two of detail is good - you can send me supporting documents via wnlonvyrlpf@tznvy.pbz (ROT13) if you want. Offer is open publicly or via PM.
How likely is it that AI will surpass humans, take over all power, and cause human extinction some time during the 21st century?
How about "analyze the implications for risk of AI extinction, based on how OpenAI's safety page has changed over time"? Inspired by this comment (+ follow-up w/ Internet archive link).
Thanks! I haven't had time to skim either report yet, but I thought it might be instructive to ask the same questions to Perplexity Pro's DR mode (it uses the same "Deep Research" name as OpenAI's but otherwise has no relation to it; in particular, it's free with their regular monthly subscription, so it can't be particularly powerful): see here. (This is the second report I generated, as the first one froze indefinitely while the app generated the conclusion, and thus the report couldn't be shared.)
There's a counterargument to the AGI hype that basically says: of course the labs would want to hype this technology, they make money that way; just because they say they believe in short timelines doesn't mean it's true. Specifically, the claim here is not that the AI lab CEOs are mistaken, but rather that they are actively lying, and they know AGI isn't around the corner.
What actions have frontier AI labs taken in the last year or two that wouldn't make sense, given the above explanation? Stuff like GDM's merger or OpenAI (reportedly) operating at a massive loss. Ideally these actions would be reported on by entities other than those companies themselves, in order to help convince skeptics. I've definitely seen stuff like this around but I can't remember where and the search terms are too vague.
I've also tried using Deep Research for this, but it doesn't seem to understand the idea of only looking at actions that are far more likely in the non-hype world than the hype-world, talking about things like investor decks projecting high returns that are very compatible with them hyping themselves up.
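To state the filter I'm after a bit more explicitly (this is just a rephrasing of the above, not a new claim): I want actions with a large likelihood ratio between the two worlds,

$$\frac{P(\text{action} \mid \text{labs genuinely expect AGI soon})}{P(\text{action} \mid \text{labs are knowingly hyping})} \gg 1.$$

Investor decks projecting high returns have a ratio near 1 - they're about equally expected in both worlds - which is why they don't move me.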
I think the actions of GDM, Meta, and OpenAI are all extremely consistent with thinking that AI will be a very economically valuable technology, but that we won't see AGI within a few years.
One of the core problems of AI alignment is that we don't know how to reliably get goals into the AI - there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.
Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.
These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won't we get some weird thing that happens to correlate with "don't die, keep your options open" on the training data, but falls apart out of distribution?