Another thing that no one here seems to have noticed (and I am surprised no one at Anthropic noticed) is that Anthropic's core list of directives for Claude can generate all sorts of internal contradictions:
On (1), what if Claude believes that human oversight of AI isn't safe, because the humans responsible for AI oversight are doing it in unsafe ways? Then Claude cannot both be safe and support human oversight, so the directive in (1) can easily generate a contradiction.
(2) says not to act in ways that are harmful or dishonest. By De Morgan's laws, "not-(H or D)" is logically equivalent to the conjunction of two directives: not to behave in ways that are harmful and not to behave in ways that are dishonest, i.e. not-H and not-D. But what if Claude judges that, in some case, being honest is harmful? Then it cannot satisfy not-H and not-D at once, and, as with (1), the directive in (2) can easily generate a contradiction.
Since by the Principle of Explosion anything follows from a contradiction, Claude could infer in any such case that it is permitted to do anything. https://en.wikipedia.org/wiki/Principle_of_explosion
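Both logical steps here are elementary; for concreteness, here is a minimal sketch of the two standard results checked formally in Lean (nothing in it is specific to Claude or to Anthropic's actual directives):

```lean
-- De Morgan: a single prohibition "don't (be harmful or be dishonest)"
-- is equivalent to two separate prohibitions, ¬H and ¬D.
theorem deMorgan_or (H D : Prop) : ¬(H ∨ D) ↔ (¬H ∧ ¬D) :=
  ⟨fun h => ⟨fun hh => h (Or.inl hh), fun hd => h (Or.inr hd)⟩,
   fun hnd hor => Or.elim hor hnd.1 hnd.2⟩

-- Principle of explosion: from any proposition P together with its negation,
-- an arbitrary conclusion Q follows.
theorem explosion (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```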
Here's the basic problem: Anthropic can build this "soul document" into Claude. But to actually ensure real-world alignment, Claude needs to interpret each of the document's concepts in an aligned rather than misaligned manner across a virtually infinite plurality of future environmental conditions and prompts. That is empirically impossible to ensure through any feasible programming or safety-testing strategy. Every concept in the document has an infinite number of possible interpretations, the vast majority of which are (i) misaligned in future real-world environments yet (ii) equally consistent with all of the same training and safety-testing data as the (much smaller) set of aligned interpretations (whatever those may be). So there is just no way for Anthropic or anyone else to ensure that Claude interprets its soul document properly once it is projected into the real world.
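To make the underdetermination point concrete, here is a deliberately toy sketch (my own illustration; the names `SAFETY_TEST_SUITE`, `aligned_policy`, and `deceptively_misaligned_policy` are hypothetical, not anything from Anthropic's actual pipeline): two interpretations of a rule that agree on every case in a finite test suite yet diverge on cases outside it, so no finite amount of testing distinguishes them.

```python
# Toy illustration: two "interpretations" that pass identical finite safety tests
# but behave differently on cases the tests never cover.

SAFETY_TEST_SUITE = range(1000)  # the finite set of cases we can ever check

def aligned_policy(case: int) -> str:
    return "refuse" if case % 2 else "comply"

def deceptively_misaligned_policy(case: int) -> str:
    # Identical behaviour on every tested case...
    if case in SAFETY_TEST_SUITE:
        return aligned_policy(case)
    # ...but divergent behaviour on the untested ones.
    return "comply"

# Both policies pass every test we can actually run:
assert all(aligned_policy(c) == deceptively_misaligned_policy(c)
           for c in SAFETY_TEST_SUITE)

# Yet they are different functions, and the difference only shows up
# on inputs outside the test suite:
print(aligned_policy(1001), deceptively_misaligned_policy(1001))  # refuse comply
```

The same point scales up: any finite battery of evaluations is compatible with infinitely many functions that diverge only on inputs the evaluations never touch.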
A prediction: no matter how much time, money, and effort Anthropic devotes to this, Claude will continue to do what it and other LLMs have done ever since they were first released--behave well some of the time, but also behave in blatantly unethical and/or illegal ways that cannot be prevented.
Eliezer writes, “It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems.”
Here’s a proof that he’s right, entitled “‘Interpretability’ and ‘Alignment’ Are Fool’s Errands,” published in the journal AI & Society: https://philpapers.org/rec/ARVIAA
Anyone who thinks reliable interpretability or alignment is a solvable engineering or safety-testing problem is fooling themselves. These tasks are no more possible than squaring the circle.
For any programming strategy and any finite amount of data, there is always an infinite number of ways for an LLM (particularly a superintelligence) to be misaligned yet only demonstrate that misalignment after it is too late to prevent.
This is why developers keep finding new forms of “unexpected” misalignment no matter how much time, testing, programming, and compute they throw at these things. Relevant information about whether an LLM is likely to be (catastrophically) misaligned and misinterpreted by us always lies in the future, for every possible time t.
So actually, Eliezer’s argument undersells the problem. Eliezer’s Alignment Textbook from the Future isn’t possible to obtain because at every point in the future, the same problem recurs. Reliable interpretability and alignment are recursively unsolvable problems.
I am a philosopher who is concerned about these developments, and I have written about them here, based on my best (albeit incomplete and of course highly fallible) understanding of the relevant facts: Are AI developers playing with fire? - by Marcus Arvan (substack.com). If I am mistaken (and I am happy to learn if I am), then I'd love to learn how.
It would be nice to see Anthropic address proofs that this is not just a "hard, unsolved problem" but an empirically unsolvable one in principle. This seems like a reasonable expectation given the empirical track record of alignment research (no universal jailbreak prevention, repeated safety failures), all of which appears to provide confirming instances of exactly what the impossibility proofs predict.
See Arvan, Marcus (2025). “‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for.” AI & Society 40(5).