LessWrong

Does reducing the amount of RL for a given capability level make AI safer?

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful, including things like deceiving us or increasing its power.

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.

However, if, for example, the alternative were imitation learning, it would be possible to push back and argue that this is still a black box that uses gradient descent so we...


28Answer by porby4h

"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.

The source of spookiness

Consider two opposite extremes:

1. A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
2. A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.

Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy. If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?

If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment. It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.

Is RL therefore spooky?

RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples. But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any imple

the gears to ascension3m20

Oh this is a great way of laying it out. Agreed, and I think this may have made some things easier for me to see, likely some of that is actual update that changes opinions I've shared before. I also have the sense that this is missing something important about what makes the most unsteerable/issue-prone approaches "RL-like", but it might be that in order to clarify that I have to find a better way to describe the unwantable AI designs than comparing them to RL. I'll have to ponder.

4Chris_Leong2h

Oh, this is a fascinating perspective. So most uses of RL already just use a small bit of RL. So if the goal was "only use a little bit of RL", that's already happening. Hmm... I still wonder if using even less RL would be safer still.

2porby4m

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.

I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.

1. ^ KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
2. ^ You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
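One way to make the "trivial example" above concrete is a toy softmax policy. The sketch below is an illustration under stated assumptions, not necessarily the construction the comment has in mind: it uses three actions, an exact expectation over actions in place of sampled rewards, and a hypothetical reward of 1/π(label) for the labeled action and 0 otherwise, with the reward treated as a constant when differentiating. Under those assumptions the expected policy gradient equals the supervised log-likelihood gradient. All function names are made up for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def supervised_gradient(logits, label):
    """Gradient of the log-likelihood of the labeled action."""
    return grad_log_prob(softmax(logits), label)


def expected_policy_gradient(logits, reward):
    """Exact expectation of reward * grad log pi(action) over the policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, p in enumerate(probs):
        g = grad_log_prob(probs, action)
        r = reward(action, probs)  # treated as a constant, not differentiated
        for i in range(len(grad)):
            grad[i] += p * r * g[i]
    return grad


logits = [0.5, -1.0, 2.0]
label = 1


def matching_reward(action, probs):
    # Hypothetical reward chosen so the RL gradient matches the supervised one.
    return 1.0 / probs[label] if action == label else 0.0


rl_grad = expected_policy_gradient(logits, matching_reward)
sl_grad = supervised_gradient(logits, label)
print(rl_grad)
print(sl_grad)
```

With sampled actions instead of the exact sum, the same reward would give an unbiased but noisier estimate of the same gradient, which is the sense in which the RL route only adds intermediate steps here.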

Introducing AI Lab Watch

181

Zach Stein-Perlman

This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

Give feedback on how this project is helpful or how it could be different to be much more helpful
Tell me what's wrong/missing; point me to sources

...


eggsyntax4m10

Fantastic, thanks!

2Ben Pace1h

This seems like a good point. Here's a quick babble of alts (folks could react with a thumbs-up on ones that they think are good).

AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch

I currently think "AI Corporation Watch" is more accurate. "Labs" feels like a research team, but I think these orgs are far far far more influenced by market forces than is suggested by "lab", and "corporation" communicates that. I also think the goal here is not to point to all companies that do anything with AI (e.g. midjourney) but to focus on the few massive orgs that are having the most influence on the path and standards of the industry, and to my eye "corporation" has that association more than "company". Definitely not sure though.

2Zach Stein-Perlman40m

Yep, lots of people independently complain about "lab." Some of those people want me to use scary words in other places too, like replacing "diffusion" with "proliferation." I wouldn't do that, and don't replace "lab" with "mega-corp" or "juggernaut," because it seems [incorrect / misleading / low-integrity]. I'm sympathetic to the complaint that "lab" is misleading. (And I do use "company" rather than "lab" occasionally, e.g. in the header.) But my friends usually talk about "the labs," not "the companies." But to most audiences "company" is more accurate. I currently think "company" is about as good as "lab." I may change the term throughout the site at some point.

3Akash4h

@Dan H are you able to say more about which companies were most/least antagonistic?

eggsyntax's Shortform

eggsyntax

4mo

eggsyntax5m10

There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.

With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.
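The function-approximation case can be sketched in a few lines. This is a toy illustration under assumptions that are not in the original: the "arbitrary function" is taken to be |x|, the model is an ordinary least-squares line, and the training inputs are 50 points in (0, 5]. The fit is essentially exact on the training range and badly wrong at a negative input.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def target(x):
    # Hypothetical ground-truth function; the model only ever sees x > 0.
    return abs(x)


train_xs = [0.1 * k for k in range(1, 51)]  # inputs in (0, 5]
train_ys = [target(x) for x in train_xs]
slope, intercept = fit_line(train_xs, train_ys)


def predict(x):
    return slope * x + intercept


# Worst-case error on the training range versus error at one OOD input.
in_dist_error = max(abs(predict(x) - target(x)) for x in train_xs)
ood_error = abs(predict(-2.0) - target(-2.0))
print(in_dist_error, ood_error)
```

Here "out of distribution" has a crisp meaning because the input space is one-dimensional and the training range is an interval.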

But what would that even be with ... (read more)

Please stop publishing ideas/insights/research about AI

Tamsin Leake

Basically all ideas/insights/research about AI is potentially exfohazardous. At least, it's pretty hard to know when some ideas/insights/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is considerably harder than building any superintelligence (let's call this work "capabilities"), and there are a lot more people trying to do the latter than the former, and they have a lot more material resources.

Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On lesswrong, we should focus on teaching people the rationality skills which could help them figure out insights that help them build any superintelligence, but are more likely to first give them insights...


RedMan6h10

In computer security, there is an ongoing debate about vulnerability disclosure, which at present seems to have settled on 'if you aren't running a bug bounty program for your software you're irresponsible, project zero gets it right, metasploit is a net good, and it's ok to make exploits for hackers ideologically aligned with you'.

The framing of the question for decades was essentially "do you tell the person or company with the vulnerable software, who may ignore you or sue you because they don't want to spend money? Do you tell t... (read more)

LessOnline Festival Updates Thread

Ben Pace

17d

This is a thread for updates about the upcoming LessOnline festival. I (Ben) will be posting bits of news and thoughts, and you're also welcome to make suggestions or ask questions.

If you'd like to hear about new updates, you can use LessWrong's "Subscribe to comments" feature from the triple-dot menu at the top of this post.

Reminder that you can get tickets at the site for $400 minus your LW karma in cents.
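As a quick worked example of the pricing rule above, here is a minimal sketch. It assumes one cent of discount per karma point and a price floor of zero; the floor is an assumption, since the post does not say what happens above 40,000 karma. The function name and parameters are hypothetical.

```python
def ticket_price_dollars(karma, base_price_dollars=400):
    """Ticket price in dollars: base price minus one cent per karma point."""
    price_cents = base_price_dollars * 100 - karma
    # Assumption: the price never drops below zero.
    return max(price_cents, 0) / 100


print(ticket_price_dollars(0))
print(ticket_price_dollars(2500))
```

Under these assumptions, a user with 2,500 karma would get $25 off.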

2cata1h

That just sounds great, thanks.

4Ben Pace3h

1. I anticipate the vast majority of people going to each of the events will be locals to the state and landmass respectively, so I don't think it's actually particularly costly for them to overlap.
2. That's unfortunate that you are less likely to come, and I'm glad to get the feedback. I could primarily reply with reasons why I think it was the right call (e.g. helpful for getting the event off the ground, helpful for pinpointing the sort of ideas+writing the event is celebrating, I think it's prosocial for me to be open about info like this generally, etc) but I don't think that engages with the fact that it left you personally less likely to come. I still overall think if the event sounds like a good time to you (e.g. interesting conversations with people you'd like to talk to and/or exciting activities) and it's worth the cost to you then I hope you come :-)

2niplav2h

Maybe to clarify my comment: I was merely describing my (non-endorsed[1]) observed emotional content wrt the festival, and my intention with the comment was not to wag my finger at you guys in the manner of "you didn't invite me". I wonder whether other people have a similar emotional reaction. I appreciate Lightcone being open with the information around free invitations though! I think I'd have bought a ticket anyway (65%) if I had time around that weekend, and I think I'd probably have a blast if I attended. Btw: What's the chance of a 2nd LessOnline?

1. I think my reaction is super bound up in icky status-grabbing/status-desiring/inner-ring-infiltrating parts of my psyche which I'm not happy with.

Ben Pace1h40

What's the chance of a 2nd LessOnline?

Um, one part of me (as is not uncommon) really believes in this event and thinks it's going to be the best effort investment Lightcone's ever made (though this part of me currently has one or two other projects and ideas that it believes in maybe even more strongly), and that part of me is like "yeah this should absolutely happen every year", though as I say I get this feeling often about projects that often end up looking different to how I dreamed them when they finally show up in reality. I think that part would f... (read more)

Some Experiments I'd Like Someone To Try With An Amnestic

johnswentworth

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...


3johnswentworth6h

Thanks! Fixed now.

5ChristianKl7h

The linked post suggests that your assumptions about memory are wrong: He had training effects from multiplying the two numbers despite not having a memory of the first time he multiplied them.

2johnswentworth6h

Oh yeah, I guess that could be a learning effect. When reading it I assumed the lack of need for repeating the numbers was just because the drug was wearing off.

Eric Neyman2h40

Yeah, that's my best guess. I have other memories from that period (which was late into the hour), so I think it was the drug wearing off, rather than learning effects.


Which skincare products are evidence-based?

Vanessa Kosoy, rosiecam

The beauty industry offers a large variety of skincare products (marketed mostly at women), differing both in alleged function and (substantially) in price. However, it's pretty hard to test for yourself how much any of these products help. The feedback loop for things like "getting less wrinkles" is very long.

So, which of these products are actually useful and which are mostly a waste of money? Are more expensive products actually better or just have better branding? How can I find out?

I would guess that sunscreen is definitely helpful, and using some moisturizers for face and body is probably helpful. But, what about night cream? Eye cream? So-called "anti-aging"? Exfoliants?

MichaelDickens2h10

What do you think is the strongest evidence on sunscreen? I've read mixed things on its effectiveness.

1merilalama5h

Nice! Which hyaluronic acid product do you use?

2Vanessa Kosoy12h

Thanks for this! Does it really make sense to see a dermatologist for this? I don't have any particular problem I am trying to fix other than "being a woman in her 40s (and contemplating the prospect of her 50s, 60s etc with dread)". Also, do you expect the dermatologist to give better advice than people in this thread or the resources they linked? (Although, the dermatologist might be better familiar with specific products available in my country.)

1FinalFormal217h

I watched this video and this is what I bought maximizing for cost/effectiveness, rate my stack:

* Moisturizer
* Retinol
* Sunscreen

Explaining a Math Magic Trick

Robert_AIZI

Introduction

A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use $(1 - x)^{-1} = 1 + x + x^{2} + \dots$ when $x = \int$, why not use it when $x = d/dx$?

Well, let's try it, writing $D$ for the derivative:

$\begin{matrix} f' & = & f \\ (1 - D) f & = & 0 \\ f & = & (1 + D + D^{2} + \dots)\, 0 \\ f & = & 0 + 0 + 0 + \dots \\ f & = & 0 \end{matrix}$

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails safe, giving you the trivial solution, right?
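The contrast between the two operator series can be checked with exact polynomial arithmetic. The sketch below rests on an assumption, since the tweet itself is not reproduced here: the integral version is taken to apply $1 + \int + \int^{2} + \dots$ to the constant $1$, with $\int$ meaning integration from $0$ to $x$. Polynomials are stored as coefficient lists, and all helper names are hypothetical.

```python
from fractions import Fraction
import math


def integrate(poly):
    """Integrate from 0 to x: [a0, a1, ...] -> [0, a0/1, a1/2, ...]."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(poly)]


def differentiate(poly):
    """Differentiate: [a0, a1, a2, ...] -> [a1, 2*a2, ...]."""
    return [c * k for k, c in enumerate(poly)][1:] or [Fraction(0)]


def add(p, q):
    """Add two coefficient lists of possibly different lengths."""
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def operator_series(op, start, terms):
    """Partial sum of (1 + op + op^2 + ...) applied to `start`."""
    total, term = [Fraction(0)], start
    for _ in range(terms):
        total = add(total, term)
        term = op(term)
    return total


def evaluate(poly, x):
    return sum(float(c) * x ** k for k, c in enumerate(poly))


# Integral version applied to the constant 1: terms are x^n / n!.
integral_version = operator_series(integrate, [Fraction(1)], 15)
# Derivative version applied to 0: every term is 0.
derivative_version = operator_series(differentiate, [Fraction(0)], 15)

print(evaluate(integral_version, 1.0), math.e)
print(derivative_version)
```

The integral series builds up the Taylor coefficients of $e^{x}$, while the derivative series applied to $0$ never leaves $0$, which is the trivial solution from the derivation above.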

But no, actually $(1 - D)^{-1} = 1 + D + D^{2} + \dots$ can fail catastrophically, which we can see if we try a nonhomogeneous equation...


yanni's Shortform

yanni kyriacos

2mo

10yanni kyriacos11h

Something I'm confused about: what is the threshold that needs meeting for the majority of people in the EA community to say something like "it would be better if EAs didn't work at OpenAI"?

Imagining the following hypothetical scenarios over 2024/25, I can't predict confidently whether they'd individually cause that response within EA?

1. Ten-fifteen more OpenAI staff quit for varied and unclear reasons. No public info is gained outside of rumours
2. There is another board shakeup because senior leaders seem worried about Altman. Altman stays on
3. Superalignment team is disbanded
4. OpenAI doesn't let UK or US AISI's safety test GPT5/6 before release
5. There are strong rumours they've achieved weakly general AGI internally at end of 2025

Carl Feynman2h142

This question is two steps removed from reality. Here’s what I mean by that. Putting brackets around each of the two steps:

what is the threshold that needs meeting [for the majority of people in the EA community] [to say something like] "it would be better if EAs didn't work at OpenAI"?

Without these steps, the question becomes

What is the threshold that needs meeting before it would be better if people didn’t work at OpenAI?

Personally, I find that a more interesting question. Is there a reason why the question is phrased at two removes like that? Or am I missing the point?

4LawrenceC3h

What does a "majority of the EA community" mean here? Does it mean that people who work at OAI (even on superalignment or preparedness) are shunned from professional EA events? Does it mean that when they ask, people tell them not to join OAI? And who counts as "in the EA community"?

I don't think it's that constructive to bar people from all or even most EA events just because they work at OAI, even if there's a decent amount of consensus people should not work there. Of course, it's fine to host events (even professional ones!) that don't invite OAI people (or Anthropic people, or METR people, or FAR AI people, etc), and they do happen, but I don't feel like barring people from EAG or e.g. Constellation just because they work at OAI would help make the case, (not that there's any chance of this happening in the near term) and would most likely backfire.

I think that currently, many people (at least in the Berkeley EA/AIS community) will tell you to not join OAI if asked. I'm not sure if they form a majority in terms of absolute numbers, but they're at least a majority in some professional circles (e.g. both most people at FAR/FAR Labs and at Lightcone/Lighthaven would probably say this). I also think many people would say that on the margin, too many people are trying to join OAI rather than other important jobs. (Due to factors like OAI paying a lot more than non-scaling lab jobs/having more legible prestige.)

Empirically, it sure seems significantly more people around here join Anthropic than OAI, despite Anthropic being a significantly smaller company.

Though I think almost none of these people would advocate for ~0 x-risk motivated people to work at OAI, only that the marginal x-risk concerned technical person should not work at OAI.

What specific actions are you hoping for here, that would cause you to say "yes, the majority of EA people say 'it's better to not work at OAI'"?

2Dagon6h

[ I don't consider myself EA, nor a member of the EA community, though I'm largely compatible in my preferences ] I'm not sure it matters what the majority thinks, only what marginal employees (those who can choose whether or not to work at OpenAI) think. And what you think, if you are considering whether to apply, or whether to use their products and give them money/status. Personally, I just took a job in a related company (working on applications, rather than core modeling), and I have zero concerns that I'm doing the wrong thing.


LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA