I think the general population doesn't know all that much about the singularity, so adding it to that part would just unnecessarily dilute it.
I have read the entire piece and it didn't feel like AI slop at all. In fact, if I hadn't been told, I wouldn't have suspected that AI was involved here, so well done!
A lot of splits happen because some employees think that the company is headed in the wrong direction (lackluster safety would be one example).
Test worked :)
He probably doesn't have much influence on the public opinion of LessWrong, but as the person in charge of a major AI company, he is obviously a big player.
It looks to me like a promising approach. Great results!
I've noticed that whenever a debate touches on a very personal topic, it tends to be heated and pretty unpleasant to listen to. In contrast, debates about things that are low-stakes for the people debating tend to be much more productive, sometimes even involving steelmanning.
That's certainly an interesting result. Have you tried running the same prompt again and seeing whether the response changes? I've noticed that some LLMs give different answers to the same prompt. For example, when I quizzed DeepSeek R1 on whether a priori knowledge exists, it answered in the affirmative the first time and in the negative the second time.
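For what it's worth, here is a minimal sketch of that kind of repeat-sampling check, assuming an OpenAI-compatible chat endpoint; the base URL, model name, and API key below are placeholders, not verified details of any particular provider's API:

```python
# Minimal sketch: send the identical prompt twice and compare the answers.
# Endpoint URL, model name, and key are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

prompt = "Does a priori knowledge exist? Answer yes or no, then explain briefly."

answers = []
for _ in range(2):
    resp = client.chat.completions.create(
        model="some-reasoning-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # nonzero temperature samples tokens stochastically
    )
    answers.append(resp.choices[0].message.content)

print("Identical answers:", answers[0] == answers[1])
```

With a nonzero temperature the model samples stochastically, so divergent answers to a borderline philosophical question are expected rather than surprising.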
If alignment by default is not the majority opinion, then what is (pardon my ignorance as someone who mostly interacts with the alignment community via LessWrong)? Is it 1) that we are all ~doomed, 2) that alignment is hard but we have a decent shot at solving it, or 3) something else entirely?
I get the feeling that people were a lot more pessimistic about our chances of survival in 2023 than in 2024 or 2025 (in other words, pessimism seems to be going down somewhat), but I could be completely wrong about this.
Thanks for the reply!
...The only general remarks that I want to make
are in regards to your question about
the model of 150 year long vaccine testing
on/over some sort of sample group and control group.
I notice that there is nothing exponential assumed
about this test object, and so therefore, at most,
the effects are probably multiplicative, if not linear.
Therefore, there are lots of questions about power dynamics
that we can overall safely ignore, as a simplification,
which is in marked contrast to anything involving ASI.
If we assume, as you requested, "no side ef
Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
Yes, but none of this would remain alive if I, as a whole, decided to jump off a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI, whereas in a human body, jumping off a cliff terminates everything.
But even barring that, an ASI can decide to fly into the Sun, and any subsystem that shows any sign of refusal to do so will be immediately rep...
Thanks for the response!
So we are to try to imagine a complex learning machine without any parts/components?
Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, would jump or I, as a whole, would not.
Can the ASI prevent the relevant classes
of significant (critical) organic human harm,
that soon occur as a direct result of its
own hyper powerful/consequential existence?
If by that you mean "can the ASI prevent some relevant classes of harm caused by its existence"...
I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chance that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply ...
Thanks for the response!
...Unfortunately, the overall SNC claim is that
there is a broad class of very relevant things
that even a super-super-powerful-ASI cannot do,
cannot predict, etc, over relevant time-frames.
And unfortunately, this includes rather critical things,
like predicting whether or not its own existence,
(and of all of the aspects of all of the ecosystem
necessary for it to maintain its existence/function),
over something like the next few hundred years or so,
will also result in the near total extinction
of all humans (and everything el
Hey, Forrest! Nice to speak with you.
Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don't need ASI to be able to 100% predict ...
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that's a good point, I edited my original comment to reflect it.
...Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want reader
Thank you for the thoughtful engagement!
On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty-7 alignment techniques. They consistently assume that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) an...
Thanks for responding again!
SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.
If this argument is true and decisive, then the ASI could decide to stop any improvements to its intelligence, or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable, to make it easier to monitor/control.
...(My understanding of) the counter here is that, if we are on the tra
That was my takeaway from the experiments done in the aftermath of the alignment faking paper, so it's good to see that it's still holding.