Why do some societies exhibit more antisocial punishment than others? Martin explores both some of the literature on the subject and his own experience living in a country where "punishment of cooperators" was fairly common.

William_S 2d
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
Adam Zerner 16h
I wish there were more discussion posts on LessWrong. Right now it feels like it weakly if not moderately violates some sort of cultural norm to publish a discussion post (similar but to a lesser extent on the Shortform). Something low effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. Thoughts?" It seems to me like something we should encourage though.

Here's how I'm thinking about it. Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll bring it up at a meetup. Maybe you'll post about it in a private Slack group. But if it's appropriate in those contexts, why shouldn't it be appropriate on LessWrong? Why not benefit from having it be visible to more people? The more eyes you get on it, the better the chance someone has something helpful, insightful, or just generally useful to contribute.

The big downside I see is that it would screw up the post feed. Like when you go to lesswrong.com and see the list of posts, you don't want that list to have a bunch of low quality discussion posts you're not interested in. You don't want to spend time and energy sifting through the noise to find the signal. But this is easily solved with filters. Authors could mark/categorize/tag their posts as being a low-effort discussion post, and people who don't want to see such posts in their feed can apply a filter to filter these discussion posts out.

Context: I was listening to the Bayesian Conspiracy podcast's episode on LessOnline. Hearing them talk about the sorts of discussions they envision happening there made me think about why that sort of thing doesn't happen more on LessWrong. Like, whatever you'd say to the group of people you're hanging out with at LessOnline, why not publish a quick discussion post about it on LessWrong?
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument against: it counter-intuitively decreases the chances, for the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, after the shield is put up, America's enemies would have no credible threat of retaliation if the US were to launch a first strike. Therefore, there would be no reason (geopolitically) for America not to launch a first strike, and there would be quite the reason to launch one: namely, the shield definitely works for the present crop of ICBMs, but may not work for future ICBMs. Therefore America's enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike of their own. The same logic works in reverse: if Russia were building an ICBM defense shield, and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.

And the same logic works for other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militarily capable AIs, China will fear what the US will do with them (imagine if the tables were turned: would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?), so if they don't get their own versions they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US would sabotage the Chinese Anthropic & OpenAI by whatever means, if China seemed on the brink?). The faster the development, the quicker the pressure mounts, and the sloppier and more rash China's responses will be. If it's easy for China to copy our AI technology, the pressure mounts much more slowly.
habryka 2d
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company was now performing assassinations of U.S. citizens. Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.


Recent Discussion

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful, including things like deceiving us or increasing its power.
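This reinforcement dynamic is easy to see in a toy policy-gradient loop. A minimal sketch (my own illustration, not from the post; the "strategies" and their payoffs are invented):

```python
# Toy illustration (my own, not from the post): a bare-bones policy-gradient loop.
# The optimizer only sees the scalar reward; it upweights whichever "strategy"
# earns it, with no notion of *how* the reward was obtained.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                    # three candidate strategies
payoff = np.array([0.5, 0.7, 1.0])      # strategy 2 (think: gaming the objective) pays best
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)
    r = payoff[a] + rng.normal(0, 0.1)  # noisy reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0               # d/dlogits of log pi(a) for a softmax policy
    logits += lr * r * grad_log_pi      # REINFORCE update

print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))
# The policy concentrates on whatever the reward signal happens to favor.
```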

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead. 

However, if, for example, the alternative were imitation learning, one could push back and argue that this is still a black box that uses gradient descent, so we...

Answer by porby 4h
"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want. The source of spookiness Consider two opposite extremes: 1. A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all. 2. A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered. Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy. If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world? If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment. It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from. Is RL therefore spooky? RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples. But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any imple

Oh, this is a great way of laying it out. Agreed, and I think this may have made some things easier for me to see; likely some of that is an actual update that changes opinions I've shared before. I also have the sense that this is missing something important about what makes the most unsteerable/issue-prone approaches "RL-like", but it might be that in order to clarify that I have to find a better way to describe the unwantable AI designs than comparing them to RL. I'll have to ponder.

Chris_Leong 2h
Oh, this is a fascinating perspective. So most uses of RL already just use a small bit of RL. So if the goal was "only use a little bit of RL", that's already happening. Hmm... I still wonder if using even less RL would be safer still.
porby 4m
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps. I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low-probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.

1. ^ KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
2. ^ You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
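The decision-transformer point, subsuming an RL-ish task in prediction by conditioning on achieved performance, can be made concrete with a small data-construction sketch (a toy illustration in the spirit of return-conditioned training; the format and names are my assumptions, not anything porby specifies):

```python
# Sketch (assumed toy format) of return-conditioned training data in the spirit of
# a decision transformer: instead of optimizing a policy against reward, whole
# episodes -- good and bad -- become a prediction problem conditioned on the
# return achieved. The predictor must model what *different* performance levels
# look like, not just the optimal policy, so the data distribution is maintained.
from typing import List, Tuple

Episode = List[Tuple[int, int, float]]  # (state, action, reward) per step

def to_return_conditioned_sequence(episode: Episode) -> list:
    rewards = [r for _, _, r in episode]
    tokens = []
    for t, (state, action, _) in enumerate(episode):
        return_to_go = sum(rewards[t:])        # reward remaining from step t onward
        tokens.extend([("R", return_to_go), ("s", state), ("a", action)])
    return tokens

# A mediocre episode and a good episode both become training data; at sampling
# time you can condition on a high return-to-go to ask for high performance.
bad  = [(0, 1, 0.0), (1, 0, 0.2)]
good = [(0, 2, 1.0), (3, 2, 1.0)]
print(to_return_conditioned_sequence(bad)[:3])
print(to_return_conditioned_sequence(good)[:3])
```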
This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

  • Give feedback on how this project is helpful or how it could be different to be much more helpful
  • Tell me what's wrong/missing; point me to sources
...

Fantastic, thanks!

Ben Pace 1h
This seems like a good point. Here's a quick babble of alts (folks could react with a thumbs-up on ones that they think are good). AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch I currently think "AI Corporation Watch" is more accurate. "Labs" feels like a research team, but I think these orgs are far far far more influenced by market forces than is suggested by "lab", and "corporation" communicates that. I also think the goal here is not to point to all companies that do anything with AI (e.g. midjourney) but to focus on the few massive orgs that are having the most influence on the path and standards of the industry, and to my eye "corporation" has that association more than "company". Definitely not sure though.
Zach Stein-Perlman 40m
Yep, lots of people independently complain about "lab." Some of those people want me to use scary words in other places too, like replacing "diffusion" with "proliferation." I wouldn't do that, and don't replace "lab" with "mega-corp" or "juggernaut," because it seems [incorrect / misleading / low-integrity]. I'm sympathetic to the complaint that "lab" is misleading. (And I do use "company" rather than "lab" occasionally, e.g. in the header.) My friends usually talk about "the labs," not "the companies," but to most audiences "company" is more accurate. I currently think "company" is about as good as "lab." I may change the term throughout the site at some point.
Akash 4h
@Dan H are you able to say more about which companies were most/least antagonistic?

There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.

With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it an input < 0 it won't know what to do.
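The x > 0 case is easy to demonstrate directly. A minimal sketch (my own toy example, with a small polynomial fit standing in for "a model"):

```python
# Toy illustration (my example, not the author's): fit a model on x > 0 only,
# then query it on x < 0 and watch the extrapolation come apart.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.1, 10.0, size=200)
y_train = np.sqrt(x_train) + rng.normal(0, 0.05, size=200)

# A cubic polynomial stands in for "a model trained to approximate the function".
coeffs = np.polyfit(x_train, y_train, deg=3)

print(np.polyval(coeffs, 4.0))    # in-distribution: close to sqrt(4) = 2
print(np.polyval(coeffs, -4.0))   # out-of-distribution: confidently outputs something arbitrary
```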

But what would that even be with ... (read more)

Basically all ideas/insights/research about AI are potentially exfohazardous. At least, it's pretty hard to know when some ideas/insights/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is much harder than building any superintelligence (let's call this work "capabilities"), and there are a lot more people trying to do the latter than the former, and they have a lot more material resources.

Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On lesswrong, we should focus on teaching people the rationality skills which could help them figure out insights that help them build any superintelligence, but are more likely to first give them insights...

RedMan 6h

In computer security, there is an ongoing debate about vulnerability disclosure, which at present seems to have settled on 'if you aren't running a bug bounty program for your software you're irresponsible, project zero gets it right, metasploit is a net good, and it's ok to make exploits for hackers ideologically aligned with you'.  

The framing of the question for decades was essentially "do you tell the person or company with the vulnerable software, who may ignore you or sue you because they don't want to spend money?  Do you tell t... (read more)

This is a thread for updates about the upcoming LessOnline festival. I (Ben) will be posting bits of news and thoughts, and you're also welcome to make suggestions or ask questions.

If you'd like to hear about new updates, you can use LessWrong's "Subscribe to comments" feature from the triple-dot menu at the top of this post.

Reminder that you can get tickets at the site for $400 minus your LW karma in cents.
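(A worked example of the pricing rule, my own arithmetic rather than the organizers': someone with 5,000 karma would pay $400 - $50 = $350; taken literally, at 40,000+ karma the ticket would be free.)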

cata 1h
That just sounds great, thanks.
Ben Pace 3h
1. I anticipate the vast majority of people going to each of the events will be locals to the state and landmass respectively, so I don't think it's actually particularly costly for them to overlap. 2. That's unfortunate that you are less likely to come, and I'm glad to get the feedback. I could primarily reply with reasons why I think it was the right call (e.g. helpful for getting the event off the ground, helpful for pinpointing the sort of ideas+writing the event is celebrating, I think it's prosocial for me to be open about info like this generally, etc) but I don't think that engages with the fact that it left you personally less likely to come. I still overall think if the event sounds like a good time to you (e.g. interesting conversations with people you'd like to talk to and/or exciting activities) and it's worth the cost to you then I hope you come :-)
niplav 2h
Maybe to clarify my comment: I was merely describing my (non-endorsed[1]) observed emotional content wrt the festival, and my intention with the comment was not to wag my finger at you guys in the manner of "you didn't invite me". I wonder whether other people have a similar emotional reaction. I appreciate Lightcone being open with the information around free invitations though! I think I'd have bought a ticket anyway (65%) if I had time around that weekend, and I think I'd probably have a blast if I attended. Btw: What's the chance of a 2nd LessOnline?

1. ^ I think my reaction is super bound up in icky status-grabbing/status-desiring/inner-ring-infiltrating parts of my psyche which I'm not happy with. ↩︎

What's the chance of a 2nd LessOnline?

Um, one part of me (as is not uncommon) really believes in this event and thinks it's going to be one of the best effort investments Lightcone's ever made (though this part of me currently has one or two other projects and ideas that it believes in maybe even more strongly). That part of me is like "yeah, this should absolutely happen every year", though as I say I get this feeling often about projects that end up looking different to how I dreamed them when they finally show up in reality. I think that part would f... (read more)

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

johnswentworth 6h
Thanks! Fixed now.
ChristianKl 7h
The linked post suggests that your assumptions about memory are wrong: He had training effects from multiplying the two numbers despite not having a memory of the first time he multiplied them. 
johnswentworth 6h
Oh yeah, I guess that could be a learning effect. When reading it I assumed the lack of need for repeating the numbers was just because the drug was wearing off.

Yeah, that's my best guess. I have other memories from that period (which was late into the hour), so I think it was the drug wearing off, rather than learning effects.


The beauty industry offers a large variety of skincare products (marketed mostly at women), differing both in alleged function and (substantially) in price. However, it's pretty hard to test for yourself how much any of these product help. The feedback loop for things like "getting less wrinkles" is very long.

So, which of these products are actually useful and which are mostly a waste of money? Are more expensive products actually better, or do they just have better branding? How can I find out?

I would guess that sunscreen is definitely helpful, and using some moisturizers for face and body is probably helpful. But, what about night cream? Eye cream? So-called "anti-aging"? Exfoliants?

What do you think is the strongest evidence on sunscreen? I've read mixed things on its effectiveness.

merilalama 5h
Nice! Which hyaluronic acid product do you use?
Vanessa Kosoy 12h
Thanks for this! Does it really make sense to see a dermatologist for this? I don't have any particular problem I am trying to fix other than "being a woman in her 40s (and contemplating the prospect of her 50s, 60s etc with dread)". Also, do you expect the dermatologist to give better advice than people in this thread or the resources they linked? (Although, the dermatologist might be better familiar with specific products available in my country.)
FinalFormal2 17h
I watched this video and this is what I bought maximizing for cost/effectiveness, rate my stack:
  • Moisturizer
  • Retinol
  • Sunscreen

Introduction

A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

[Image: the tweet]

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ when $x$ is the integration operator, why not use it when $x$ is the derivative?

Well, let's try it, writing $D$ for the derivative: from $f = Df$ we get $(1 - D)f = 0$, so $f = \frac{1}{1-D}\,0 = (1 + D + D^2 + \cdots)\,0 = 0$.

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails-safe, giving you the trivial solution, right?

But no, actually $\frac{1}{1-D}$ can fail catastrophically, which we can see if we try a nonhomogeneous equation...
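For concreteness, here is one standard way the catastrophic failure shows up (my own sketch of where the truncated argument is presumably headed; the specific equation is my choice, not necessarily the post's). Try $(1 - D)f = e^x$:

$$f \overset{?}{=} \frac{1}{1-D}\,e^x = (1 + D + D^2 + \cdots)\,e^x = e^x + e^x + e^x + \cdots$$

The series diverges term by term, even though perfectly good solutions exist, e.g. $f(x) = (C - x)e^x$.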

yanni kyriacos 11h
Something I'm confused about: what is the threshold that needs meeting for the majority of people in the EA community to say something like "it would be better if EAs didn't work at OpenAI"? Imagining the following hypothetical scenarios over 2024/25, I can't predict confidently whether they'd individually cause that response within EA:
1. Ten to fifteen more OpenAI staff quit for varied and unclear reasons. No public info is gained outside of rumours.
2. There is another board shakeup because senior leaders seem worried about Altman. Altman stays on.
3. The Superalignment team is disbanded.
4. OpenAI doesn't let the UK or US AISIs safety-test GPT-5/6 before release.
5. There are strong rumours they've achieved weakly general AGI internally at the end of 2025.

This question is two steps removed from reality.  Here’s what I mean by that.  Putting brackets around each of the two steps:

what is the threshold that needs meeting [for the majority of people in the EA community] [to say something like] "it would be better if EAs didn't work at OpenAI"?
 

Without these steps, the question becomes 

What is the threshold that needs meeting before it would be better if people didn’t work at OpenAI?

Personally, I find that a more interesting question.  Is there a reason why the question is phrased at two removes like that?  Or am I missing the point?

LawrenceC 3h
What does a "majority of the EA community" mean here? Does it mean that people who work at OAI (even on superalignment or preparedness) are shunned from professional EA events? Does it mean that when they ask, people tell them not to join OAI? And who counts as "in the EA community"?

I don't think it's that constructive to bar people from all or even most EA events just because they work at OAI, even if there's a decent amount of consensus that people should not work there. Of course, it's fine to host events (even professional ones!) that don't invite OAI people (or Anthropic people, or METR people, or FAR AI people, etc), and they do happen, but I don't feel like barring people from EAG or e.g. Constellation just because they work at OAI would help make the case (not that there's any chance of this happening in the near term), and it would most likely backfire.

I think that currently, many people (at least in the Berkeley EA/AIS community) will tell you not to join OAI if asked. I'm not sure if they form a majority in terms of absolute numbers, but they're at least a majority in some professional circles (e.g. both most people at FAR/FAR Labs and at Lightcone/Lighthaven would probably say this). I also think many people would say that on the margin, too many people are trying to join OAI rather than other important jobs. (Due to factors like OAI paying a lot more than non-scaling-lab jobs/having more legible prestige.) Empirically, it sure seems significantly more people around here join Anthropic than OAI, despite Anthropic being a significantly smaller company. Though I think almost none of these people would advocate for ~0 x-risk motivated people to work at OAI, only that the marginal x-risk-concerned technical person should not work at OAI.

What specific actions are you hoping for here, that would cause you to say "yes, the majority of EA people say 'it's better to not work at OAI'"?
Dagon 6h
[ I don't consider myself EA, nor a member of the EA community, though I'm largely compatible in my preferences ] I'm not sure it matters what the majority thinks, only what marginal employees (those who can choose whether or not to work at OpenAI) think.  And what you think, if you are considering whether to apply, or whether to use their products and give them money/status. Personally, I just took a job in a related company (working on applications, rather than core modeling), and I have zero concerns that I'm doing the wrong thing.
