LESSWRONG
Dakara · 1452

Comments (sorted by Newest)
Agentic Misalignment: How LLMs Could be Insider Threats
Dakara · 3mo · 20

Fair enough, but is the broader trend of "Models won't take unethical actions unless they're the only options" still holding?

That was my takeaway from the experiments done in the aftermath of the alignment faking paper, so it's good to see that it's still holding.

No77e's Shortform
Dakara · 3mo · 10

I think the general population doesn't know all that much about the singularity, so adding that to the part would just unnecessarily dilute it.

Escaping the Jungles of Norwood: A Rationalist’s Guide to Male Pattern Baldness
Dakara · 3mo · 42

I have read the entire piece and it didn't feel like AI slop at all. In fact, if I hadn't been told, I wouldn't have suspected that AI was involved here, so well done!

Knight Lee's Shortform
Dakara · 3mo · 31

A lot of splits happen because some employees think that the company is headed in the wrong direction (lackluster safety would be one example).

Angela's Shortform
Dakara · 3mo · 10

Test successful, it worked :)

Vladimir_Nesov's Shortform
Dakara · 3mo · 90

He probably doesn't have much influence on the public opinion of LessWrong, but as a person in charge of a major AI company, he is obviously a big player.

Making deals with early schemers
Dakara · 3mo* · 1-1

It looks to me like a promising approach. Great results!

Debate experiments at The Curve, LessOnline and Manifest
Dakara · 3mo · 30

I've noticed that whenever a debate touches on a very personal topic, it tends to be heated and pretty unpleasant to listen to. In contrast, debates about things that are low-stakes for the people debating tend to be much more productive, sometimes even involving steelmanning.

Every Major LLM Endorses Newcomb One-Boxing
Dakara · 3mo · 10

That's certainly an interesting result. Have you tried running the same prompt again and seeing if the response changes? I've noticed that some LLMs give different answers to the same prompt. For example, when I quizzed DeepSeek R1 on whether a priori knowledge exists, it answered in the affirmative the first time and in the negative the second time.

deep's Shortform
Dakara · 3mo* · 11

If alignment by default is not the majority opinion, then what is (pardon my ignorance as someone who mostly interacts with the alignment community via LessWrong)? Is it 1) that we are all ~doomed, 2) that alignment is hard but we have a decent shot at solving it, or 3) something else entirely?

I get the feeling that people used to be a lot more pessimistic about our chances of survival in 2023 than in 2024 or 2025 (in other words, pessimism seems to be going down somewhat), but I could be completely wrong about this.
