A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

17h

This project has been completed as part of the Mentorship in Alignment Research Students (MARS London) programme under the supervision of Bogdan-Ionut Cirstea, on investigating the promise of automated AI alignment research. I would like to thank Bogdan-Ionut Cirstea, Erin Robertson, Clem Von Stengel, Alexander Gietelink Oldenziel, Severin Field, and everyone who commented on my draft, for the feedback and encouragement which helped me create this post.

TL;DR

The mechanism behind in-context learning is an open question in machine learning. There are different hypotheses about what in-context learning is doing, each with different implications for alignment. This document reviews the hypotheses which attempt to explain in-context learning, finding some overlap and good explanatory power from each, and describes the implications these hypotheses have for automated AI alignment research.

Introduction

Since their capabilities...


2Ben Pace16h

Not sure where the right place to raise this complaint is, but having just seen it for the first time, really, "MARS"? I checked, this is not affiliated with MATS, who have had like 6 programs and ~300 people go through them. This seems too close in branding space to me, and I'd recommend picking a more distinct name.

Linda Linseforsnow20

I disagree. In verbal space MARS and MATS are very distinct, and they look different enough to me.

However, if you want to complain, you should talk to the organisers, not one of the participants.

Here is their website: MARS — Cambridge AI Safety Hub

(I'm not involved in MARS in any way.)

When is a mind me?

Rob Bensinger

xlr8harder writes:

In general I don’t think an uploaded mind is you, but rather a copy. But one thought experiment makes me question this. A Ship of Theseus concept where individual neurons are replaced one at a time with a nanotechnological functional equivalent.
Are you still you?

Presumably the question xlr8harder cares about here isn't the semantic question of how linguistic communities use the word "you", or predictions about how whole-brain emulation tech might change the way we use pronouns.

Rather, I assume xlr8harder cares about more substantive questions like:

If I expect to be uploaded tomorrow, should I care about the upload in the same ways (and to the same degree) that I care about my future biological self?
Should I anticipate experiencing what my upload experiences?
If the scanning and uploading process requires

...


2torekp5m

I have a closely related objection/clarification. I agree with the main thrust of Rob's post, but this part: ..strikes me as confused or at least confusing. Take your chemistry/physics tests example. What does "I anticipate the experience of a sense of accomplishment in answering the chemistry test" mean? Well for one thing, it certainly indicates that you believe the experience is likely to happen (to someone). For another, it often means that you believe it will happen to you - but that invites the semantic question that Rob says this isn't about. For a third - and I propose that this is a key point that makes us feel there is a "substantive" question here - it indicates that you empathize with this future person who does well on the test. But I don't see how empathizing or not-empathizing can be assessed for accuracy. It can be consistent or inconsistent with the things one cares about, which I suppose makes it subject to rational evaluation, but that looks different from accuracy/inaccuracy.

2Vanessa Kosoy2h

Not sure what you mean by "this would require a pretty small universe". If we live in naive MWI, an IBP agent would not care for good reasons, because naive MWI is a "library of babel" where essentially every conceivable thing happens no matter what you do. Also not sure what you mean by "some sort of sampling". AFAICT, quantum IBP is the closest thing to a coherent answer that we have, by a significant margin.

Mikhail Samin4m10

I mean if the universe is big enough for every conceivable thing to happen, then we should notice that we find ourselves in a surprisingly structured environment and need to assume some sort of effect where, if a cognitive architecture opens its eyes, it opens its eyes in different places with a likelihood corresponding to how common these places are (e.g., among all Turing machines).

I.e., if your brain is uploaded, and you see a door in front of you, and when you open it, 10 identical computers start running a copy of you each: 9 show you a green ro...

1quetzal_rainbow1h

I always thought that in naive MWI what matters is not whether something happens in an absolute sense, but how much Born measure is concentrated on branches that contain good things instead of bad things.

Creating unrestricted AI Agents with Command R+

Simon Lermen

TL;DR There currently are capable open-weight models which can be used to create simple unrestricted bad agents. They can perform tasks end-to-end such as searching for negative information on people, attempting blackmail or continuous harassment.

Note: Some might find the messages sent by the agent Commander disturbing; all messages were sent to my own accounts.

Overview

Cohere has recently released the weights of Command R+, which is comparable to older versions of GPT-4 and is currently the best open model on some benchmarks. It is noteworthy that the model has been fine-tuned for agentic tool use. This is probably the first open-weight model that can competently use tools. While there is a lot of related work on subversive fine-tuning (Yang et al., Qi et al.) and jailbreaks (Deng et al.,...


2Jason Hoelscher-Obermaier1h

Glad you're doing this. By default, it seems we're going to end up with very strong tool-use models where any potential safety measures are easily removed by jailbreaks or fine-tuning. I understand you as working on: How are we going to know that it happened? Is that a fair characterization? Another important question: What should the response be to the appearance of such a model? Any thoughts?

Simon Lermen16m10

I think that is a fair categorization. I think it would be really bad if some super strong tool-use model got released and nobody had any idea beforehand that this could lead to really bad outcomes. Crucially, I expect future models to be able to remove their own safety guardrails as well. I really try to think about how these things might positively affect AI safety; I don't want to just maximize for shocking results. My main intention was almost to have this as a public service announcement that this is now possible. People are often behind on the SOTA and most ...

Express interest in an "FHI of the West"

229

habryka

TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder.

The Future of Humanity Institute is dead:

I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind.

I think FHI was one of the best intellectual institutions...


1Will_Pearson1h

As well as thinking about the need for the place in terms of providing a space for research, it is probably worth thinking about the need for a place in terms of what it provides the world. What subjects are currently under-represented in the world and need strong representation to guide us to a positive future? That will guide who you want to lead the organisation.

2owencb1h

I agree in the abstract with the idea of looking for niches, and I think that several of these ideas have something to them. Nevertheless when I read the list of suggestions my overall feeling is that it's going in a slightly wrong direction, or missing the point, or something. I thought I'd have a go at articulating why, although I don't think I've got this to the point where I'd firmly stand behind it:

It seems to me like some of the central FHI virtues were:

* Offering a space to top thinkers where the offer was pretty much "please come here and think about things that seem important in a collaborative truth-seeking environment"
  * I think that the freedom of direction, rather than focusing on an agenda or path to impact, was important for:
    * attracting talent
    * finding good underexplored ideas (b/c of course at the start of the thinking people don't know what's important)
  * Caveats:
    * This relies on your researchers having some good taste in what's important (so this needs to be part of what you select people on)
    * FHI also had some success launching research groups where people were hired to more focused things
      * I think this was not the heart of the FHI magic, though, but more like a particular type of entrepreneurship picking up and running with things from the core
* Willingness to hang around at whiteboards for hours talking and exploring things that seemed interesting
  * With an attitude of "OK but can we just model this?" and diving straight into it
  * Someone once described FHI as "professional amateurs", which I think is apt
    * The approach is a bit like the attitude ascribed to physicists in this xkcd, but applied more to problems-that-nobody-has-good-answers-for than things-with-lots-of-existing-study (and with more willingness to dive into understanding existing fields when they're importantly relevant for the problem at hand)
  * Importantly mostly without directly asking "ok but where is this going?

2Chris_Leong44m

I think my list appears more this way than I intended because I gave some examples of projects I would be excited by if they happened. I wasn't intending to stake out a strong position as to whether these should be projects chosen by the institute vs. some examples of projects that it might be reasonable for a researcher to choose.

owencb17m20

Makes sense! My inference was because the discussion at this stage is a high-level one about ways to set things up, but it does seem good to have space to discuss object-level projects that people might get into.

Evolution did a surprisingly good job at aligning humans...to social status

Eli Tyre

1mo

[This post is a slightly edited tangent from my dialogue with John Wentworth here. I think the point is sufficiently interesting and important that I wanted to make it as a top level post, and not leave it buried in that dialogue on mostly another topic.]

The conventional story is that natural selection failed extremely badly at aligning humans. One fact about humans that casts doubt on this story is that natural selection got the concept of "social status" into us, and it seems to have done a shockingly good job of aligning (many) humans to that concept.

Evolution somehow gave humans some kind of inductive bias (or something) such that our brains are reliably able to learn what it is to be "high status", even though the...


Mikhail Samin36m10

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your trai...

2Kaj_Sotala1h

Agree. This connects to why I think that the standard argument for evolutionary misalignment is wrong: it's meaningless to say that evolution has failed to align humans with inclusive fitness, because fitness is not any one constant thing. Rather, what evolution can do is to align humans with drives that in specific circumstances promote fitness. And if we look at how well the drives we've actually been given generalize, we find that they have largely continued to generalize quite well, implying that while there's likely to still be a left turn, it may very well be much milder than is commonly implied.

If digital goods in virtual worlds increase GDP, do we actually become richer?

No77e

Noah Smith, in this article, argues that the Metaverse could enable economic growth to increase a lot and sharply decouple itself from real-world resource usage. By creating markets in which we buy and sell immaterial things, world GDP would grow.

He also says, rightly, that GDP correlates with the well-being of a nation.

But there's a non-stated point: would creating huge markets in the Metaverse for buying and selling digital goods make us actually richer? What I mean is this: suppose that, thanks to the Metaverse, huge virtual economies get created and people get real money out of stuff they sell in these economies. But suppose that e.g., agricultural production output doesn't go up much. Does that mean that we're simply going to pay more for groceries, without being...



AI Safety Camp final presentations

Apr 27th

Linda Linsefors, Remmelt Ellen

On the weekend of April 27th-28th, our AI Safety Camp teams will present their project findings in 10-minute talks.

Join on Zoom

You are welcome to join any talk! Teams are sharing their findings in mechanistic interpretability, in agent foundations, on legal actions to restrict AI, and many other areas.

The talks will be arranged in 5 blocks. Each block consists of 5-6 short talks, followed by breakout rooms for questions and further discussions.

You can find the schedule here, and the summaries/abstracts for the projects/talks here.

You can find summaries of the original project plans here. Keep in mind that some projects will have changed over the course of the program, and a few projects got cancelled.

Linda Linsefors1h20

I've now updated the event information to include summaries/abstracts for the projects/talks. Some of these are still under construction.

I'm open for projects (sort of)

cousin_it

17h

I left Google a month ago, and right now I'm not working. Writing this post in case anyone has interesting ideas about what I could do. This isn't an "urgently need help" kind of thing - I have a little bit of savings, and right now I'm planning to relax for a few more weeks and then go into some solo software work. But I thought I'd write this here anyway, because who knows what'll come up.

Some things about me. My degree was in math. My software skills are okayish: I left Google at L5 ("senior"), and also made a game that went semi-viral. I've also contributed a lot on LW, the most prominent examples being my formalizations of decision theory ideas (Löbian cooperation, modal fixpoints etc) and later the AI Alignment Prize...


cousin_it2h20

Done! I didn't do it at first because I thought it'd have to be in person only, but then clicked around in the form and found that remote is also possible.

2Chris_Leong7h

I'd love your feedback on my thoughts on decision theory. If you're trying to get a sense of my approach in order to determine whether it's interesting enough to be worth your time, I'd suggest starting with this article (3 minute read). I'm also considering applying for funding to create a conceptual alignment course.

2Viliam13h

Besides math and programming, what are your other skills and interests?

I have an idea of a puzzle game, not sure if it would be good or bad, I haven't done even a prototype. So if anyone is interested, feel free to try... I hope I can explain it sufficiently clearly in words...

The game plan is divided into squares; I imagine a typical level to be between 10x10 and 30x30 squares large. Each square is either empty, or contains an immovable wall, or contains a movable block.

The game consists of moving the blocks. Each move = you click a specific block, and try dragging it in one of the 4 directions, and either it is possible or not. A block cannot move into a wall. A block can push another block. A block does not pull another block. For example, if there are 3 blocks in a horizontal line, and you click the middle one and try dragging it to the left, two blocks will move and the third one (the one on the right) will stay there. So far, it should be completely obvious, like what would happen if you moved some actual objects.

In addition, each side of a block (or a wall) may be empty, or may contain a colored "magnet" (or perhaps a "lock" is a better metaphor). These add the following constraints for the movement of blocks:

* Magnets of different colors can never touch each other. If one block has a green magnet on the right side, and another has a blue magnet on the left side, you cannot put them next to each other so that the magnets would touch. (If you try to do that, the block refuses to move. Graphically, I imagine that it would move like half the way, and then you would get a visual indicator of where the problem is, and when you stop dragging, it will return to its original place.) Though it is okay if the blocks touch on their other sides, where they don't have magnets.
* Magnets of the same color cannot be connected or disconnected by a move in a perpendicular direction. If one block has a green magnet on the right side, and another has a green mag
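
The basic push rule described above (a block cannot move into a wall, a block pushes the blocks ahead of it, and nothing gets pulled) could be prototyped roughly as follows. This is only a minimal sketch, not the author's design: the cell symbols, the `try_move` name, and the choice to treat the board edge like a wall are all assumptions made for illustration, and the magnet constraints are left out entirely because their description is cut off.

```python
# Hypothetical cell symbols; the comment above does not specify any encoding.
EMPTY, WALL, BLOCK = ".", "#", "B"

# (row delta, column delta) for each of the 4 drag directions.
DIRECTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


def in_bounds(grid, r, c):
    return 0 <= r < len(grid) and 0 <= c < len(grid[0])


def try_move(grid, row, col, direction):
    """Try to drag the block at (row, col) by one square.

    Returns True and mutates grid in place if the move is possible,
    otherwise returns False and leaves grid unchanged.
    """
    if not in_bounds(grid, row, col) or grid[row][col] != BLOCK:
        return False
    dr, dc = DIRECTIONS[direction]

    # Walk forward over the chain of blocks that would be pushed.
    r, c = row, col
    while in_bounds(grid, r, c) and grid[r][c] == BLOCK:
        r, c = r + dr, c + dc

    # The chain must end on an empty square; a wall (or, by assumption,
    # the edge of the board) blocks the whole move.
    if not in_bounds(grid, r, c) or grid[r][c] != EMPTY:
        return False

    # Plain blocks are interchangeable in this sketch, so shifting the chain
    # is the same as filling the far end and emptying the clicked square.
    # With magnets, each block would have to be moved individually.
    grid[r][c] = BLOCK
    grid[row][col] = EMPTY
    return True


if __name__ == "__main__":
    # Three blocks in a horizontal line; drag the middle one to the left.
    level = [list(".BBB.")]
    try_move(level, 0, 2, "left")
    print("".join(level[0]))
```

Run on the three-block example from the comment, the row `.BBB.` becomes `BB.B.`: the clicked block and the one to its left move, and the one on the right stays put.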

2cousin_it3h

Playing and composing music is the main one. Yeah, you're missing out on all the fun in game-making :-) You must build the prototype yourself, play with it yourself, tweak the mechanics, and at some moment the stars will align and something will just work and you'll know it. There's no way anyone else can do it but you.
