I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.




[Part 2] Amplifying generalist research via forecasting – results from a preliminary exploration

It's unclear to me whether I should think of the forecasters as more replaceable than Elizabeth. If they're all generalist researchers, having "a bunch of generalist researchers do generalist research for the same amount of time as the original researcher" doesn't seem obviously scalable.

(That said, my current belief is that this work was pretty interesting and important overall)

From Personal to Prison Gangs: Enforcing Prosocial Behavior

I didn't notice until just recently that this post fits into a similar genre as what (I think) the Moral Mazes discussion is pointing at (which may be different from what Zvi thinks).

Where one of the takeaways from Moral Mazes might be: "if you want your company to stay aligned, try not to grow the levels of hierarchy too much, or be extremely careful when you do."

"Don't grow the layers of hierarchy" is, in practice, perhaps a similar injunction to "don't grow the company too much at all" (since you need hierarchy to scale).

Immoral Mazes posits a specific failure due to middle managers being disconnected from reality, and evolving an internal ecosystem that then sets out to protect itself. This post points at an (upstream?) issue where, regardless of middle management, the fundamental reality is that people cannot rely on repeated interactions to build trust.

I had actually totally forgotten, until just now, the final paragraph and key point of this post:

> So if we want to e.g. reduce regulation, we should first focus on the underlying socioeconomic problem: fewer interactions. A world of Amazon and Walmart, where every consumer faces decisions between a million different products, is inevitably a world where consumers do not know producers very well. There’s just too many products and companies to keep track of the reputation of each. To reduce regulation, first focus on solving that problem, scalably. Think amazon reviews - it’s an imperfect system, but it’s far more flexible and efficient than formal regulation, and it scales.
>
> Now for the real problem: online reviews are literally the only example I could come up with where technology offers a way to scale-up reputation-based systems, and maybe someday roll back centralized control structures or group identities. How can we solve these sorts of problems more generally? Please comment if you have ideas.

I think this maps to one of the key (according to me) problems raised by the Immoral Mazes sequence: we don't know how to actually identify and reward competence among middle managers, so all we have are easily goodhartable metrics. (And in the case of middle management, there's a deep warping that happens because the thing that got goodharted on was "office politics".)

Unfortunately... well, nobody commented with ideas on this post, and I don't know that anyone has since come up with a way to track the competence of management either.

The actionable place where this matters in my local environment is EA grantmakers awarding grants to researchers, whose work is often pretty hard to evaluate. I think this is a serious bottleneck to scaling efforts.

I notice that forecasting is one of the few domains where rationalsphere-folk are experimenting with scalable solutions for evaluation. I've been somewhat pessimistic about forecasting, but I think this might have convinced me to allocate more attention to it.

...huh. I would not have expected this post to be closely associated in my head with amplifying generalist forecasting. But, now I think it is.

From Personal to Prison Gangs: Enforcing Prosocial Behavior

I just wanted to say "thanks for actually doing an epistemic spot check here". I think I currently endorse John's explanation of why he doesn't think "sharp increase in prisoners" is the thing to be looking for, but doing any kind of serious spot check is a big chunk of work that's often not as rewarding as it should be. Have a strong upvote.

The Great Karma Reckoning

Insofar as "total karma should try to approximate ideal 'Rationalist Social Status'", I think ideally it would incorporate things like "how often do they introduce novel information that turns out to actually be true." 

And this suggests a bunch of things like "tracking their predictions / success" and "noticing when their early ideas were relevant to things that later get well established as true." Those all seem like important things to figure out how to do. But I'm not sure whether they fit into the abstraction of what karma currently is. One key goal of karma is "dole out bits of reward that create a positive incentive gradient to follow", which is a pretty different goal than "track total idealized social status."

Currently we go out of our way not to make people's total karma super prominent (it doesn't appear when you mouse over their username, nor does your karma appear at the top of each page, like it did on old LessWrong).

Partial summary of debate with Benquo and Jessicata [pt 1]

This was the first major, somewhat adversarial doublecrux that I've participated in.

(Perhaps this is the wrong framing. I had participated in many other significant, somewhat adversarial doublecruxes before. But, I dunno, this felt significantly harder than all the previous ones, to the point where it feels like a difference in kind.)

It was a valuable learning experience for me. My two key questions for "Does this actually make sense as part of the 2019 Review Book" are:

  • Is this useful to others for learning how to doublecrux, pass ITTs, etc, in a lowish-trust setting?
  • Are the takeaways on the object-level disagreement useful to others?

On the object level, my tl;dr takes the form of "which blogposts should someone write as a followup?", which I think are:

  1. Criticism and accusations are importantly different, and should be distinguished. (I think some miscommunication came from people implicitly lumping these together.)
  2. Harmony matters for group truthseeking, but Alice telling Bob "you should be nicer; here's how to say the same thing more nicely" is really scary if 
    Alice in fact didn't understand exactly what Bob was trying to say. 

    (I realize Benquo/Jessica probably still disagree with my beliefs/emphasis on the first part. But this was a concrete update I made while reviewing the post, and a mistake I think I was making a lot. Even if I later change my mind about how much harmony matters for group truthseeking, it'd still be necessary for the post to directly address the benefits in order to be understood by past-me.)

I think there are more points worth lifting out of here, but I'm not sure how specific they are to the particular people in this conversation, rather than generally useful.

On "how did this go as a doublecrux", I notice:

  1. Well, I learned a lot, at least on the object level.
  2. We didn't reach total, or even significant, agreement. Benquo and Jessica, I think, began commenting less sometime after this.

    That might be fine. I don't think total agreement was Benquo/Jessica's goal (I think their goal was more like "figure out whether the LessWrong team is aligned with them enough to be worth investing in LessWrong", and I think they succeeded at that).
  3. I'm a bit sad about the outcome, but not sure whether I should be. I do think that if this had been my second rodeo, I could have done better. (I'm guessing everyone involved burned through most of their budget for intense disagreements with people who didn't seem aligned, on avoidable mistakes, before the final conversations. If we'd had another month of energy, we might have figured out enough common ground to become more closely allied.)
  4. That said, this was the fuel that eventually output Noticing Frames and Propagating Facts into Aesthetics, which I'm pretty happy with.

On "can other people learn from this as a doublecrux?"

I... don't know. I think maybe, but that's mostly up to other people.

Note on Framing:

I notice that a large chunk of the text of this post is direct quotes from Benquo and Jessicata, but it's wrapped in a post where I control the frame. If this were considered for inclusion in the book, I'd be interested in having them write reviews of their year-later takeaways, written in their own frames.

Some notes regarding object level ideas in this post and the discussion:

(each quoted section is basically a new topic)

Benquo: (emphasis mine)

> What I see as under threat is the ability to say in a way that's actually heard, not only that opinion X is false, but that the process generating opinion X is untrustworthy, and perhaps actively optimizing in an objectionable direction. Frequently, attempts to say this are construed primarily as moves to attack some person or institution, pushing them into the outgroup. Frequently, people suggest to me an "equivalent" wording with a softer tone, which in fact omits important substantive criticisms I mean to make, while claiming to understand what's at issue.

While I still have many complaints about the overall strategy Benquo was following at the time, I think (hope?) I'm more understanding now about the failure mode pointed at here. I do think I've contributed to that failure mode, i.e. "try to be more diplomatic to preserve group harmony, in a way that comes at expense of clarity."

I still think there are good truthseeking reasons to preserve group harmony. But I think the concrete updates I've made are that we need to (at least) be very clear about when we're doing that, and notice when attempts to smooth things over are destroying information.

In particular, it is pretty orwellian/gaslighty to have someone tell you "You're being too mean. Here's a different thing you could have said with the same truth value that wouldn't have been as mean, see?" and watch in horror as they then describe a sentence that leaves out important information you meant to convey. 

In my other review, I mentioned "hmm, I think I still have promises to keep regarding 'what aesthetic updates should I make?'". I think one aesthetic update I am happy to make is that I should have some kind of disgust/horror when someone (including me) claims to be preserving local truth value, or implying that truth value is preserved, when in fact it wasn't. 

(This is a specific subset of the overall worldview/aesthetic I think Benquo was trying to convey, and I'm guessing there is still major disagreement in other nearby areas)


> It seemed like the hidden second half of the core claim [of the "5 words" post] was "and therefore we should coordinate around simpler slogans," and not the obvious alternative conclusion "and therefore we should scale up more carefully, with an uncompromising emphasis on some aspects of quality control." (See On the Construction of Beacons for the relevant argument.)
>
> It seemed to me like there was some motivated ambiguity on this point. The emphasis seemed to consistently recommend public behavior that was about mobilization rather than discourse, and back-channel discussions among well-connected people (including me) that felt like they were more about establishing compatibility than making intellectual progress.

I am still mulling this over. I think it might be pointing at something I haven't yet fully grokked.

I would agree with the phrase "we should scale up more carefully, with an uncompromising emphasis on some aspects of quality control". (I think I would have agreed with it at the time, which is part of why the doublecrux was tricky. I eventually realized that Benquo meant a stronger version of this sentence than I meant)

My current (revealed) belief is something like "We don't really have the luxury of stopping all mobilization while we figure out the ideal coordination mechanisms. Meanwhile, I think current mobilization efforts are net positive. I also think the process of actually mobilizing is useful for forcing your ivory-tower coordination process to be more connected with the reality of how large-scale coordination actually works."

(My understanding is that Benquo-at-the-time thought the current way large-scale coordination works is fundamentally doomed, and that we don't have much choice but to start over. That does feel pretty cruxy – if I believed that, I'd be doing different things.)


> This, even though it seems like you explicitly agree with me that our current social coordination mechanisms are massively inadequate, in a way that (to me obviously) implies that they can't possibly solve FAI.

"Can't possibly solve FAI" still sounds like an obviously false marketing claim to me. I wrote a blogpost arguing you should be suspicious when you find yourself saying this.

(By contrast, I do agree with the first half of the sentence, that our current coordination mechanisms are massively inadequate, and am grateful for various gears about what's going on there that I gained during this conversation)


> That said, there's still a complicated question of "how do you make criticisms well". I think advice on this is important. I think the correct advice usually looks more like advice to whistleblowers than advice for diplomacy.

This feels aesthetically cruxy. I think it's a few steps removed from whatever the real disagreement is about. 

I think a key piece here is the distinction between "criticism" and "accusations of norm violation." I mention this at the bottom of the post, but I think it warrants a separate top level post that delves into more details. 


> Note, my opinion of your opinions, and my opinions, are expressed in pretty different ontologies.

One thing I noticed at the time and still notice now is that it's not actually obvious to me (from Jessica's written words in the preceding section) that our claims are in different ontologies. I derive that they must be in different ontologies (given observations about how challenging this whole conversation was). But, it is worth noting that Jessica's claims/beliefs seem to make sense in my ontology.

Zack, in the comments:

> > politics backpropogates into truthseeking, causes people to view truthseeking norms as a political weapon.
>
> Imagine that this had already happened. How would you go about starting to fix it, other than by trying to describe the problem as clearly as possible (that is, "invent[ing] truthseeking-politics-on-the-fly")?

I was distracted by another piece of this comment, but I agree that having a good answer for this is pretty important.

"Defining Clarity"

After writing this post, there was significant disagreement in the comments about this line of mine:

> I define clarity in terms of what gets understood, rather than what gets said. So, using words with non-standard connotations, without doing a lot of up-front work to redefine your terms, seems to me to be reducing clarity, and/or mixing clarity, rather than improving it.

I'm still not entirely sure what happened here, but the failure mode that Jessica/Zvi/Zack were pointing at was "you auto-lose if you incentivize people not to understand." That seems true to me, but mostly unrelated to what I was trying to say here, and some of my own responses were perhaps overly exasperated at them seeming to change the subject on me.

Zvi eventually said:

> Imagine three levels of explanation: Straightforward to you, straightforward to those without motivated cognition, straightforward even to those with strong motivated cognition.
>
> It is reasonable to say that getting from level 1 to level 2 is often a hard problem, that it is on you to solve that problem.
>
> It is not reasonable, if you want clarity to win, to say that level 2 is insufficient and you must reach level 3. It certainly isn't reasonable to notice that level 2 has been reached, but level 3 has not, and thus judge the argument insufficient and a failure. It would be reasonable to say that reaching level 3 would be *better* and suggest ways of doing so.

I think it's possible that at that point I could have said "Okay. I'm talking about level 2, and the point is you make it much harder to get to level-2 if you're making up new words or using them with nonstandard connotations." But by the time we got to that point of the conversation I was pretty exhausted and still confused about how everything fit together. Today, I'm not 100% sure whether my hypothetical reply was straightforwardly true.


I feel like I want to tie this all up together somehow, but I think I mostly did that in the tl;dr at the top. Thanks for reading I guess. Still interested in delving into individual threads if people are interested.

What are the open problems in Human Rationality?

So I'm not sure I'd include this in the Best Of book in the first place. If I did, I agree it'd be pretty obviously wrong to imply that the list was comprehensive. I didn't think that was implied by the post – if you ask a question, usually you don't end up getting a comprehensive answer right away. 

As a post on a live forum, I think it's pretty obvious that this isn't a comprehensive list – if it's missing things, people are supposed to just add those things, and you should expect it to need updating over time.

In the case of a printed book, I'm not sure if the right thing is to change the title, or just make sure to say "here are some specific answers this question post got." Either seems potentially fine to me.

I very much don't think the title of the LessWrong post itself should change – it's trying to ask a question, not spell out any particular expectation of an answer.

The Schelling Choice is "Rabbit", not "Stag"

I still have to write my own self-review of this post, which I think will at least partially address some of that concern. (But, since we had an argument about this as recently as last Tuesday*, obviously it won't fully address it.)

I do want to note that I think the point you're primarily making here (about the misleading/bad effects of the staghunt frame) doesn't feel super encapsulated in your current response post, and is probably worth a separate top-level post.

(But, tl;dr, my current take is "okay, this post replaces the wrong-frame of the prisoner's dilemma with the wrong-frame of staghunts, which I still think was a net improvement. But 'what actually ARE the game-theoretic situations we find ourselves in most of the time?' is the obvious next question to ask.")

* it wasn't actually Tuesday.

Integrating the Lindy Effect

I just noticed, while voting, that I had again forgotten what this post was about, and didn't even remember that I gave it this glowing review. Which makes me think the title isn't that great.

Eli's shortform feed

Probably this one?

