Inducing Unprompted Misalignment in LLMs

Ω 135h

Emergent Instrumental Reasoning Without Explicit Goals

TL;DR: LLMs can act and scheme without being told to do so. This is bad.

Produced as part of the Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.

Introduction

Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned^[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed...

6Nora Belrose1h

Unclear why this is supposed to be a scary result. "If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett

Sam Svenningsen4m10

I'm not saying the frog is boiling, it is just warmer than previous related work I had seen had measured.

The results of the model generalizing to do bad things in other domains reinforce what Matthew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little of a scenario framing it takes for the model to infer it is in its interest to be unhelpful without being told what its interests exactly are.

re: "scary": People outside of this experiment are and will always be telling models to be bad...

5ryan_greenblatt4h

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way. So, I am interested in the question of: 'when some types of "bad behavior" get reinforced, how does this generalize?'.

3Sam Svenningsen3h

Thanks, yes, I think that is a reasonable summary. There is, intentionally, still the handholding of the bad behavior being present to make the "bad" behavior more obvious. I try to make those caveats in the post. Sorry if I didn't make enough, particularly in the intro. I still thought the title was appropriate since:
* The company preference held regardless, in both fine-tuning and (some) non-finetuning results, which was "unprompted" (i.e. unrequested implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or explicitly) even if it was "induced" by the phrasing.
* The non-coding results, where it tried to protect its interests by being less helpful, are a different "bad" behavior that was also unprompted.
* The aforementioned 'handholding' phrasing and other caveats in the post.
I am too. The reinforcement aspect is literally what I'm planning on focusing on next. Thanks for the feedback.

Quinn's Shortform

Quinn

Quinn9m30

Thinking about a top-level post on FOMO and research taste

Fear of missing out defined as inability to execute on a project cuz there's a cooler project if you pivot
but it also gestures at more of a strict negative, where you think your project sucks before you finish it, so you never execute
was discussing this with a friend: "yeah I mean lesswrong is pretty egregious cuz it sorta promotes this idea of research taste as the ability to tear things down, which can be done armchair"
I've developed strategies to beat this FOMO and gain more depth and detail

mwatkins

11h

tl;dr: Recently reported GPT-J experiments [1 2 3 4] prompting for definitions of points in the so-called "semantic void" (token-free regions of embedding space) were extended to fifteen other open source base models from four families, producing many of the same bafflingly specific outputs. This points to an entirely unexpected kind of LLM universality (for which no explanation is offered, although a few highly speculative ideas are riffed upon).

Work supported by the Long Term Future Fund. Thanks to quila for suggesting the use of "empty string definition" prompts, and to janus for technical assistance.

Introduction

"Mapping the semantic void: Strange goings-on in GPT embedding spaces" presented a selection of recurrent themes (e.g., non-Mormons, the British Royal family, small round things, holes) in outputs produced by prompting GPT-J to define...

2Gunnar_Zarncke2h

If I haven't overlooked the explanation (I have read only part of it and skimmed the rest), my guess for the non-membership definition of the empty string would be all the SQL and programming queries where "" stands for matching all elements (or sometimes matching none). The small round things are a riddle for me too.

1Ann9h

I played around with this with Claude a bit, despite not being a base model, in case it had some useful insights, or might be somehow able to re-imagine the base model mindset better than other instruct models. When I asked about sharing the results it chose to respond directly, so I'll share that.

3mwatkins8h

Wow, thanks Ann! I never would have thought to do that, and the result is fascinating. This sentence really spoke to me! "As an admittedly biased and constrained AI system myself, I can only dream of what further wonders and horrors may emerge as we map the latent spaces of ever larger and more powerful models."

Ann42m10

On the other end of the spectrum, asking cosmo-1b (mostly synthetic training) for a completion, I get `A typical definition of "" would be "the set of all functions from X to Y".`

hydrogen tube transport

bhauth

This is a linkpost for https://www.bhauth.com/blog/industrial%20design/hydrogen%20tubes.html

Elon Musk's Hyperloop proposal had substantial public interest. With various initial Hyperloop projects now having failed, I thought some people might be interested in a high-speed transportation system that's...perhaps not "practical" per se, but at least more-practical than the Hyperloop approach.

aerodynamic drag in hydrogen

Hydrogen has a lower molecular mass than air, so it has a higher speed of sound and lower density. The higher speed of sound means a vehicle in hydrogen can travel at 2300 mph while remaining subsonic, and the lower density reduces drag. This paper evaluated the concept and concluded that:

the vehicle can cruise at Mach 2.8 while consuming less than half the energy per passenger of a Boeing 747 at a cruise speed of Mach 0.81
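As a rough sanity check on the speed-of-sound claim above, here is a minimal sketch using the ideal-gas formula c = sqrt(γRT/M). The temperature (20 °C), the heat-capacity ratios, and the molar masses are textbook values assumed for illustration, not figures taken from the paper, and `speed_of_sound` is a hypothetical helper name.

```python
import math

R = 8.314462618        # universal gas constant, J/(mol*K)
MPH_TO_MS = 0.44704    # metres per second in one mile per hour


def speed_of_sound(gamma: float, molar_mass_kg: float, temp_k: float = 293.15) -> float:
    """Ideal-gas speed of sound in m/s: sqrt(gamma * R * T / M)."""
    return math.sqrt(gamma * R * temp_k / molar_mass_kg)


# Assumed textbook properties (not from the linked paper).
c_h2 = speed_of_sound(gamma=1.41, molar_mass_kg=2.016e-3)   # hydrogen
c_air = speed_of_sound(gamma=1.40, molar_mass_kg=28.96e-3)  # dry air

v = 2300 * MPH_TO_MS   # the 2300 mph figure quoted in the post, in m/s
mach_in_h2 = v / c_h2
mach_in_air = v / c_air

print(f"speed of sound in H2:  {c_h2:.0f} m/s")
print(f"speed of sound in air: {c_air:.0f} m/s")
print(f"2300 mph is Mach {mach_in_h2:.2f} in H2 and Mach {mach_in_air:.2f} in air")
```

Under these assumptions the same ground speed is subsonic in hydrogen while being well above Mach 1 relative to air, which is the point of the comparison.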

In a tube, at subsonic speeds, the gas...

5gilch5h

Hydrogen can only burn in the presence of oxygen. The pipe does not contain any, and combustion isn't possible until after they have had time to mix. It's also not going to explode from the pressure, because it's the same as the atmosphere. The shaped charge is obviously going to explode, that's the point, but it will be more directional. That still doesn't sound safe in an enclosed space. Maybe the vehicle could deploy a gasket seal with airbags or something to reduce the leakage of expensive hydrogen.

3gilch5h

Condensation is not just possible but would happen by default. You described the tubes as steel lined with aluminum in contact with the ground, if not buried. That's going to be consistently cool enough for passive condensation. Getting water out of a long tube shouldn't be hard with multiple drains, and if there's any incline, you just need them at the bottom. You can just dump it in the ground. Use a plumbing trap to keep the gasses separated. They're at equal pressure, so this should work, and the pressure can also be maintained mostly passively with hydrogen bladders exposed to the atmosphere on the outside, although the burned hydrogen will have to be regenerated before they empty completely, but this can be done anywhere on the pipe. Hydrogen can be easily regenerated by electrolysis of water, which doesn't seem any more expensive than charging the batteries. It might be even cheaper to crack it off of natural gas or to use white hydrogen when available. Are turbines more expensive than electric motors for similar power? It's true that conventional piston engines are heavy, but batteries are also heavy, especially the cheaper chemistries. Alternatively, run electricity through the pipe to power the vehicles so they don't have to carry any extra weight for power. It's coated with conductive aluminum already. If half-pipes could be welded with a dielectric material and not cost any more that would work. Or use an internal monorail, but maybe only if you were going to do that already. Or you could suspend a wire. That's got to be pretty cheap compared to the pipe itself.

1Carl Feynman5h

…run electricity through the pipe… Simpler to do what some existing electric trains do: use the rails as ground, and have a charged third rail for power. We don’t like this system much for new trains, because the third rail is deadly to touch. It’s a bad thing to leave lying on the ground where people can reach it. But in this system, it’s in a tube full of unbreathable hydrogen, so no one is going to casually come across it.

bhauth43m20

Using sliding electrical contacts for power is fine for current high-speed trains, but it doesn't work as well above 200 m/s.

Progress Update #1 from the GDM Mech Interp Team: Full Update

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

Ω 216h

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.

Activation Steering with SAEs

Arthur Conmy, Neel Nanda

TL;DR: We use SAEs trained on GPT-2 XL’s residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces...

Nina Rimsky1hΩ342

I expect if you average over more contrast pairs, like in CAA (https://arxiv.org/abs/2312.06681), more of the spurious features in steering vectors are cancelled out leading to higher quality vectors and greater sparsity in the dictionary feature domain. Did you find this?
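The averaging idea in this comment can be sketched as a mean difference over contrast pairs. This is a toy illustration in plain Python, not the CAA codebase: `mean_difference_vector` is a hypothetical name, and the activations are made-up two-dimensional lists chosen so that the second coordinate plays the role of a spurious feature.

```python
def mean_difference_vector(pos_acts, neg_acts):
    """Average (positive - negative) activation difference over all contrast pairs."""
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative examples")
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[j] - q[j] for p, q in zip(pos_acts, neg_acts)) / n
        for j in range(dim)
    ]


# Made-up activations: coordinate 0 is the shared "concept" direction,
# coordinate 1 is pair-specific noise with opposite sign in each pair.
pos = [[1.0, 0.5], [1.0, -0.5]]
neg = [[0.0, 0.0], [0.0, 0.0]]

single_pair = mean_difference_vector(pos[:1], neg[:1])
averaged = mean_difference_vector(pos, neg)
print(single_pair)  # keeps the noisy second coordinate
print(averaged)     # the noisy coordinate cancels in the mean
```

With one pair the vector carries the pair-specific coordinate; with both pairs it averages away, which is the cancellation the comment is asking about.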

1Sheikh Abdur Raheem Ali4h

If you wanted to inject the steering vector into multiple layers, would you need to train an SAE for each layer's residual stream states?

2Sam Marks4h

With the ITO experiments, my first guess would be that reoptimizing the sparse approximation problem is mostly relearning the encoder, but with some extra uninterpretable hacks for low activation levels that happen to improve reconstruction. In other words, I'm guessing that the boost in reconstruction accuracy (and therefore loss recovered) is mostly not due to better recognizing the presence of interpretable features, but by doing fiddly uninterpretable things at low activation levels. I'm not really sure how to operationalize this into a prediction. Maybe something like: if you pick some small-ish threshold T (maybe like T=3 based on the plot copied below) and round activations less than T down to 0 (for both the ITO encoder and the original encoder), then you'll no longer see that the ITO encoder outperforms the original one.
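One way to make the proposed check concrete is sketched below: zero out every activation below T for both encoders, then compare reconstruction error. All names (`threshold`, `reconstruct`, `mse`) and all numbers are hypothetical placeholders; the toy values say nothing about how real SAEs or the ITO encoder behave.

```python
def threshold(acts, t):
    """Round every activation strictly below t down to 0; keep the rest unchanged."""
    return [a if a >= t else 0.0 for a in acts]


def reconstruct(acts, decoder):
    """Reconstruction = sum over features of activation * decoder row."""
    dim = len(decoder[0])
    return [sum(a * row[j] for a, row in zip(acts, decoder)) for j in range(dim)]


def mse(x, x_hat):
    """Mean squared error between a target vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Made-up decoder (3 features, 2 dimensions), target, and activations.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [4.0, 5.0]
acts_original = [4.0, 4.0, 0.0]   # placeholder for the original encoder's output
acts_ito = [3.5, 4.5, 0.5]        # placeholder for the ITO encoder's output

T = 3.0
for name, acts in [("original", acts_original), ("ITO", acts_ito)]:
    before = mse(x, reconstruct(acts, decoder))
    after = mse(x, reconstruct(threshold(acts, T), decoder))
    print(f"{name}: MSE before threshold = {before:.3f}, after = {after:.3f}")
```

The prediction in the comment would correspond to the gap between the two "after" numbers shrinking or vanishing on real activations.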

10Sam Marks4h

Awesome stuff -- I think that updates like this (both from the GDM team and from Anthropic) are very useful for organizing work in this space. And I especially appreciate the way this was written, with both short summaries and in-depth write-ups.

Vancouver Rationality

Guided By The Beauty Of Our Weapons

May 4th
2390 Brunswick Street, Vancouver

Jordan

At our Meetups Everywhere meetup, attendees were overwhelmingly interested in regular meetups, so here goes!

As an experiment, let's assign a reading as a springboard for discussion, to see if we like that as a meetup format. Please read Scott's Guided By The Beauty Of Our Weapons from 2017 before attending.

Some questions to ponder ahead of the meetup:

When have you changed your mind very quickly on a deeply held belief, if ever? When have you slowly changed your mind (over the course of months or years) on a deeply held belief, if ever? What contributed to this transformation?
Have you ever resisted changing a belief despite accumulating evidence or persuasive arguments against it? What were the reasons for your resistance (emotional? social? intellectual?), and how did you eventually navigate this conflict?
Re: raising the sanity waterline, what personal practices or habits have you adopted to ensure you're engaging with the world in a more rational, open-minded way, if any?

Express interest in an "FHI of the West"

235

habryka

TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder.

The Future of Humanity Institute is dead:

I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind.

I think FHI was one of the best intellectual institutions...

4Buck1h

(I work out of Constellation and am closely connected to the org in a bunch of ways) I think you're right that most people at Constellation aren't going to seriously and carefully engage with the aliens-building-AGI question, but I think describing it as a difference in culture is missing the biggest factor leading to the difference: most of the people who work at Constellation are employed to do something other than the classic FHI activity of "self-directed research on any topic", so obviously aren't as inclined to engage deeply with it. I think there also is a cultural difference, but my guess is that it's smaller than the effect from difference in typical jobs.

Buck1h20

I'll also note that if you want to show up anywhere in the world and get good takes from people on the "how aliens might build AGI" question, Constellation might currently be the best bet (especially if you're interested in decision-relevant questions about this).

4aysja3h

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see how things don't cut to the core unless you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not. I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is.

2owencb3h

I don't really disagree with anything you're saying here, and am left with confusion about what your confusion is about (like it seemed like you were offering it as examples of disagreement?).

Transformers Represent Belief State Geometry in their Residual Stream

262

Adam Shai

Ω 1123d

Produced while being an affiliate at PIBBSS^[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.

Introduction

What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because

We have a formalism that relates training data to internal

...

eggsyntax1h10

I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.

There are three main diagrams to pay attention to in order to understand what's going on here:
- The Z1R Process (this is a straightforward Hidden Markov Model diagram, look them up if it's unclear).
- The Z1R Mixed-State Presentation, representing the belief states of a model as it learns the underlying structure.
- The Z1R Mixed-State Simplex. Importantly, unlike the other two this is a graph and spatial placement is meaningful.
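For readers who want to see what a belief state is operationally, here is a minimal sketch of Bayesian belief updating over the hidden states of a small HMM. The transition structure is an assumption: it encodes my reading of Z1R as "emit 0, then 1, then a random bit" with states named S0, S1, SR, and it should be checked against the post's diagram rather than taken as the authors' definition.

```python
# Assumed state order: index 0 = S0, 1 = S1, 2 = SR.
# T[x][i][j] = P(emit token x and move to state j | currently in state i).
T = {
    0: [[0.0, 1.0, 0.0],   # S0 emits 0, moves to S1
        [0.0, 0.0, 0.0],   # S1 never emits 0
        [0.5, 0.0, 0.0]],  # SR emits 0 half the time, moves to S0
    1: [[0.0, 0.0, 0.0],   # S0 never emits 1
        [0.0, 0.0, 1.0],   # S1 emits 1, moves to SR
        [0.5, 0.0, 0.0]],  # SR emits 1 half the time, moves to S0
}


def update_belief(belief, token):
    """One Bayesian update: new belief is proportional to belief @ T[token]."""
    n = len(belief)
    unnorm = [sum(belief[i] * T[token][i][j] for i in range(n)) for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("token has zero probability under the current belief")
    return [u / total for u in unnorm]


belief = [1 / 3, 1 / 3, 1 / 3]  # start from a uniform belief over hidden states
for token in [1, 0, 1]:
    belief = update_belief(belief, token)
    print(token, [round(b, 3) for b in belief])
```

Each printed vector is one point in the probability simplex over hidden states; the set of points reachable this way is what the mixed-state diagrams depict.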
It's b

...

1Adam Shai2h

Thanks! I'll have more thorough results to share about layer-wise representations of the MSP soon. I've already run some of the analysis concatenating over all layers' residual streams with the RRXOR process and it is quite interesting. It seems there's a lot more to explore with the relationship between number of states in the generative model, number of layers in the transformer, residual stream dimension, and token vocab size. All of these (I think) play some role in how the MSP is represented in the transformer. For RRXOR it is the case that things look crisper when concatenating. Even for cases where redundant info is discarded, we should be able to see the distinctions somewhere in the transformer. One thing I'm keen on really exploring is such a case, where we can very concretely follow the path/circuit through which redundant info is first distinguished and then is collapsed.

1eggsyntax8h

As well as inferring the HMM itself from the data.

7Adam Shai8h

That is a fair summary.

Evolution did a surprisingly good job at aligning humans...to social status

Eli Tyre

1mo

[This post is a slightly edited tangent from my dialogue with John Wentworth here. I think the point is sufficiently interesting and important that I wanted to make it as a top level post, and not leave it buried in that dialogue on mostly another topic.]

The conventional story is that natural selection failed extremely badly at aligning humans. One fact about humans that casts doubt on this story is that natural selection got the concept of "social status" into us, and it seems to have done a shockingly good job of aligning (many) humans to that concept.

Evolution somehow gave humans some kind of inductive bias (or something) such that our brains are reliably able to learn what it is to be "high status", even though the...

wassname1h10

We establish institutions to channel and utilize status-seeking behavior by putting us in status-conscious groups where we have ceremonies and titles that draw our attention to status. This works! Is it more effective to educate a child individually or in a group of peers? Is it easier to lead a solitary soldier or a whole squad? Do people seek a promotion or a pay rise?

From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that c...

7Mikhail Samin15h

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]” Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips. If people become more optimistic, because they see some goals in an agent, and say the optimization process was able to successfully optimize for that, but they don’t have evidence of the optimization process having tried to target the goals they observe, they’re just clearly doing something wrong. Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

2Kaj_Sotala15h

Agree. This connects to why I think that the standard argument for evolutionary misalignment is wrong: it's meaningless to say that evolution has failed to align humans with inclusive fitness, because fitness is not any one constant thing. Rather, what evolution can do is to align humans with drives that in specific circumstances promote fitness. And if we look at how well the drives we've actually been given generalize, we find that they have largely continued to generalize quite well, implying that while there's likely to still be a left turn, it may very well be much milder than is commonly implied.

