In response to Habryka's shortform, I can confirm that I signed a concealed non-disparagement as part of my Anthropic separation agreement. I worked there for 6 months and left in mid-2022. I received a cash payment as part of that agreement, with nothing shady going on à la threatening previous compensation (though I had no equity to threaten). In hindsight I undervalued my ability to speak freely, and didn't more seriously consider that I could just decline to sign the separation agreement and walk away; I'm not sure what I would do if I were doing it again.
I asked Anthropic to release me from this after the comment thread started, and they have now released me from both the non-disparagement clause and the non-disclosure part, which was very nice of them. I would encourage anyone in a similar situation to reach out to hr[at]anthropic.com and legal[at]anthropic.com, though obviously I can't guarantee that they'll release everyone. Feel free to DM or email me for advice if you're in a similar situation.
I'll take advantage of my newfound freedoms to say that...
Idk, I don't really have anything too disparaging to say (though I dislike the use of concealed non-disparagements in general and am glad they say they're stopping!). I'm broadly a fan of Anthropic: I think their heart is likely in the right place and they're trying to do what's best for the world (though they could easily be making the wrong calls), and I would seriously consider returning in the right circumstances. I've recommended that several friends of mine accept offers to do safety and interp work there, and feel good about this (though I would feel much more hesitant about recommending that someone join a pure capabilities team there). My biggest critique is that I have concerns about their willingness to push the capabilities frontier and worsen race dynamics, and, while I can imagine reasonable justifications, I think they're undervaluing the importance of at least having clear public positions and rationales for this kind of thing, especially given their clear shift in policies since Claude 1.0.
EDIT: An additional detail that I genuinely appreciate is that Anthropic paid for me to have an independent lawyer to help explain the separation agreement and negotiate some changes on my behalf (I didn't push back on the concealed non-disparagement, but did alter some other parts). They recommended an independent lawyer, who I used, but were also happy to pay for a lawyer of my choice. As far as I'm aware, this was quite a non-standard thing for a company to do, and I appreciate it and think this was good and ethical in a way that wasn't obligatory.
EDIT 2: Someone asked that I share the terms of the agreement.
The non-disparagement clause:
Without prejudice to clause 6.3 [referring to my farewell letter to Anthropic staff, which I don't think was disparaging or untrue, but to be safe], each party agrees that it will not make or publish or cause to be made or published any disparaging or untrue remark about the other party or, as the case may be, its directors, officers or employees. However, nothing in this clause or agreement will prevent any party to this agreement from (i) making a protected disclosure pursuant to Part IVA of the Employment Rights Act 1996 and/or (ii) reporting a criminal offence to any law enforcement agency and/or a regulatory breach to a regulatory authority and/or participating in any investigation or proceedings in either respect.
The non-disclosure clause:
Without prejudice to clause 6.3 [referring to my farewell letter to Anthropic staff] and 7 [about what kind of references Anthropic could provide for me], both Parties agree to keep the terms and existence of this agreement and the circumstances leading up to the termination of the Consultant's engagement and the completion of this agreement confidential save as [a bunch of legal boilerplate, and two bounded exceptions I asked for but would rather not publicly share. I don't think these change anything, but feel free to DM if you want to know]
How aware were you (as an employee), and are you (now), of their policy work? In a world model where policy is the most important area, it seems to me like it could significantly tarnish Anthropic's net impact.
I don't quite understand the question. I've heard various bits of gossip, both as an employee and now. I wouldn't say I'm confident in my understanding of any of it. I was somewhat sad about Jack and Dario's public comments about thinking it's too early to regulate (if I understood them correctly), which I also found surprising as I thought they had fairly short timelines, but policy is not at all my area of expertise so I am not confident in this take.
I think it's totally plausible Anthropic has net negative impact, but the same is true for almost any significant actor in a complex situation. I agree that policy is one such way that their impact could be negative, though I'd generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
I'm a bit worried about a dynamic where smart technical folks end up feeling like "well, I'm kind of disappointed in Anthropic's comms/policy stuff from what I hear, and I do wish they'd be more transparent, but policy is complicated and I'm not really a policy expert".
To be clear, this is a quite reasonable position for any given technical researcher to have– the problem is that this provides pretty little accountability. In a world where Anthropic was (hypothetically) dishonest, misleading, actively trying to undermine/weaken regulations, or putting its own interests above the interests of the "commons", it seems to me like many technical researchers (even Anthropic staff) would not be aware of this. Or they might get some negative vibes but then slip back into a "well, I'm not a policy person, and policy is complicated" mentality.
I'm not saying there's even necessarily a strong case that Anthropic is trying to sabotage policy efforts (though I am somewhat concerned about some of the rhetoric Anthropic uses, public comments about thinking it's too early to regulate, rumors that they have taken actions to oppose SB 1047, and a lack of any real "positive" signals from their policy team, e.g. recommending or developing policy proposals that go beyond voluntary commitments or encouraging people to measure risks).
But I think once upon a time there was some story that if Anthropic defected in major ways, a lot of technical researchers would get concerned and quit/whistleblow. I think Anthropic's current comms strategy, combined with the secrecy around a lot of policy things, combined with a general attitude (whether justified or unjustified) of "policy is complicated and I'm a technical person so I'm just going to defer to Dario/Jack" makes me concerned that safety-concerned people won't be able to hold Anthropic accountable even if it actively sabotages policy stuff.
I'm also not really sure if there's an easy solution to this problem, but I do imagine part of the solution involves technical people (especially at Anthropic) raising questions, asking people like Jack and Dario to explain their takes more, and being more willing to raise public & private discussions about Anthropic's role in the broader policy space.
Thanks for answering, that's very useful.
My concern is that, as far as I understand, a decent number of safety researchers think that policy is the most important area, but because, as you mentioned, they aren't policy experts and don't really know what's going on, they just assume that Anthropic's policy work is way better than those actually working in policy judge it to be. I've heard from a surprisingly high number of people at the orgs doing the best AI policy work that Anthropic's policy work is mostly anti-helpful.
Somehow though, internal employees keep deferring to their policy team and don't update on that part/take their beliefs seriously.
I'd generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
If it's true, it is probably true only to an epsilon degree, and it might be wrong because of the weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date; I have yet to hear a positive example.
Separately, while I think the discussion around "is X net negative" can be useful, I think it ends up implicitly putting the frame on "can X justify that they are not net negative."
I suspect the quality of discourse– and society's chances to have positive futures– would improve if the frame were more commonly something like "what are the best actions for X to be taken" or "what are reasonable/high-value things that X could be doing."
And I think it's valid to think "X is net positive" while also thinking "I feel disappointed in X because I don't think it's using its power/resources in ways that would produce significantly better outcomes."
IDK what the bar should be for considering X a "responsible actor", but I imagine my personal bar is quite a bit higher than "(barely) net positive in expectation."
P.S. Both of these comments are on the opinionated side, so separately, I just wanted to say thank you Neel for speaking up & for offering your current takes on Anthropic. Strong upvoted!
A tip for anyone on the ML job/PhD market: people will plausibly skim your Google Scholar quickly to get a "how impressive is this person/what is their deal" read (I do this fairly often), so I recommend polishing your Google Scholar profile if you have publications! It can make a big difference.
I have a lot of weird citable artefacts that confuse Google Scholar, so here are some tips I've picked up:
To anyone currently going through the fun of NeurIPS rebuttals for the first time, some advice:
Firstly, if you're feeling down about reviews, remember that peer review has been shown to be a ridiculous random number generator in an RCT - half of spotlight papers are rejected by another review committee! Don't tie your self-worth to whether the roulette wheel landed on black or red. If their critiques don't seem to make sense, they often genuinely don't (and were plausibly written by an LLM). And if they do make sense (and remember to control for your defensiveness), then this is great - you have valuable feedback that can improve the paper!
Cross posting one of my tweet threads that people here might enjoy
A recent dilemma of mine: how to eat less sweet food but still have it in moderation? I don't want to spend the willpower required to cut it out entirely, or to agonise every time about whether something is really worth it.
My surprisingly elegant solution: randomise! Have it with probability 2/3 (or a probability of your choice). Abiding by the RNG is far easier than resisting temptation!
This is surprisingly general! Probabilistic dieting? Probabilistic vegetarianism? Half the moral benefits, far easier. Well, at least personally, I would find "toss a coin at each meal for whether to have meat" more than twice as easy as cutting it entirely.
I would also be very curious if this helps people cut back on drugs like alcohol/tobacco/etc.
You can also change the probability over time, e.g. if giving something up feels really hard, you can do it at 90% each day, and reduce that by 1% every day until it reaches 0 in 3 months.
Note: this doesn't work if you can re-roll immediately after, so you need restrictions on when you can pick a new random number. For snacks I can have at any time, I do one random pick per day; one per meal is also fine.
I also recommend carrying dice around in your pocket, or having a random number generator on your watch or phone, which makes this way easier to do whenever. Bonus points if you use a quantum RNG.
This is also very useful for analysis paralysis, e.g. what to eat for dinner or what to wear.
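The rule above is simple enough to sketch in a few lines. Here is a minimal Python sketch of the one-roll-per-decision trick and the decaying schedule described in the thread (90%, dropping 1% per day); the function names are my own, not anything from the original post:

```python
import random


def should_indulge(p: float) -> bool:
    """One roll per decision point: returns True with probability p.

    The key discipline from the thread is that you only roll once
    per allowed decision point (e.g. once per day for snacks) and
    abide by the result - no re-rolling.
    """
    return random.random() < p


def daily_probability(day: int, start: float = 0.9, step: float = 0.01) -> float:
    """Decaying schedule: start at 90% and drop 1% per day.

    Reaches 0 after 90 days, roughly the "3 months" in the thread.
    """
    return max(0.0, start - step * day)
```

For example, on day 30 you would call `should_indulge(daily_probability(30))` and get a yes with probability 0.6. The clamping to 0.0 just ensures the probability never goes negative once the schedule runs out.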
That sounds like an interesting trick. However:
I don't want to spend the willpower required to cut it out entirely, or to agonise every time about whether something is really worth it
If you cut it out entirely, you get used to it, and no longer need a lot of willpower after a while. Though it's probably less realistic to cut out sugar entirely than to quit some drug entirely.
If you cut it out entirely, you get used to it
Your experience may vary but I've done 12-week weight loss cycles where I ate no sweets and I never lost my desire to eat sweets. I'm on week 6 of a 6-week weight loss cycle right now, I had pretty strong cravings on week 2–3 and they significantly subsided by week 4 but they're still there.
I do still eat fruit, which may be enough to maintain my sugar cravings, but if your goal is to improve health then I think it's a bad idea to cut out fruit. And anyway I don't get cravings for fruit, I get cravings for artificially-sweetened foods.
I've heard at least one person report that they entirely lost their sugar cravings when they stopped eating sugar. So it works for some people; it just doesn't work for me.
Oh, that's disappointing. I once got rid of my craving for sweet drinks just by completely quitting drinks with sugar and sweeteners for a while. Unfortunately I since had a relapse. It's easy to get addicted again, especially when another drug is involved, as in energy drinks. The randomization (gamification?) approach may work better in some cases.
If you cut something out entirely, that's hard at first, but basically free later, once you've become unaddicted. Just reducing consumption to a medium level probably doesn't cause you to become unaddicted in this way, so it requires some degree of long-term willpower. I assume this is why alcoholics try to stay completely "dry", not just reduce their consumption.
I do this often, inspired by the novel "The Dice Man". It helps break inner conflicts in what feels like a fair, fully endorsed way. @Richard_Ngo has a theory that this "random dictatorship" model of decision making has uniquely good properties as a fallback when negotiation fails or is too expensive, and that this is why active inference involves probabilistic distributions over goal states rather than atomic goal states.
I was about to try this, but then realized the Internal Double Crux was a better tool for my specific dilemma. I guess here's a reminder to everyone that IDC exists.
My suggestion: use every meal as a reward for something.
Here is an excerpt from an old piece of mine, not very LessWrongish, but you may find some ideas interesting:
The Theoretical Discussion section looks into the causes of the obesity problem and expands its scope to the more general topic of addictions. Its first subsection, Hunger Recognition, entertains the idea that the availability of digestion capacity may get mistaken for real hunger.
Overeating is not the only bad habit that people struggle to overcome. Studying the similarities and differences among various bad habits and addictions helps us better understand their nature and fight them. The Decision Fatigue subsection opens the discussion on habits, Priority Bias digs into the causes of poor decisions, and Commitment with Mindfulness talks about sustainable solutions.