LESSWRONG
LW

Charbel-Raphaël — LessWrong

14d

One year ago, we nearly died.

This is maybe an overdramatic statement, but long story short, nearly all of us underwent carbon monoxide (CO) poisoning^[1]. The benefit is, we all suddenly got back in touch with a failure mode we had forgotten about, and we decided to make it a yearly celebration.

Usually, when we think about failure, we might think about not being productive enough, or not solving the right work-related problem, or missing a meeting. We might suspect that our schedule could be better organized or that one of our habits really sucks. We might fear failing to spot an obvious psychological flaw or a decision-making issue.

We often forget that the single... (read 1047 more words →)

102

Charbel-Raphaël 1mo Quick Take

Shamelessly adapted from VDT: a solution to decision theory. I didn't want to wait for the 1st of April.

VET: A Solution to Moral Philosophy

By Claude 4.5 Opus, with prompting by Charbel Segerie

January 2026

Introduction

Moral philosophy is about how to behave ethically under conditions of uncertainty, especially if this uncertainty involves runaway trolleys, violinists attached to your kidneys, and utility monsters who experience pleasure 1000x more intensely than you.

Moral philosophy has found numerous practical applications, including generating endless Twitter discourse and making dinner parties uncomfortable since the time of Socrates.

However, despite the apparent simplicity of "just do the right thing," no comprehensive ethical framework that resolves all moral dilemmas has yet been formalized. This... (read 2328 more words →)

Replying to How do we solve the alignment problem?

Charbel-Raphaël 2mo

It seems like the world is very much multipolar, at least currently

Replying to Stop Applying And Get To Work

Charbel-Raphaël 3mo

Money is a bottleneck yes

Replying to Plans A, B, C, and D for misalignment risk

Charbel-Raphaël 3mo

I've asked Claude to make a rough assessment of this. TL;DR: the probability goes from 13% to ~27%, and this propagates to Plans C and D.

sabotaging Chinese AI companies?

Claude: Ryan's response is suggestive but incomplete. "Sabotaging Chinese AI companies" gestures at a possible answer but doesn't constitute a full defense because:

It's extremely escalatory and might not be politically viable even with high US government buy-in
Its effectiveness is uncertain—how much lead time would successful sabotage actually buy? Months? Years?
It's not obviously repeatable; China would harden against further attacks
It could provoke dangerous counter-responses

To be fair to Ryan, the original post does mention "helping the US government ensure non-proliferation/lead time" under Plan B, so the... (read 568 more words →)

Replying to Plans A, B, C, and D for misalignment risk

Charbel-Raphaël 3mo

I have three main critiques:

The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. The real risk is probably the same as for Plan E (75%), unless there is a pivotal act.
Political Will as Strategy: The framework treats political will as a background variable rather than a key strategic lever. D→C campaigns could reduce expected risk by 11+ percentage points, nearly 30% of the total risk. A campaign to move from E→D would also be highly strategic and might require only talking to a handful of employees.
Missing “Plan A-Minus”: You don’t necessarily need to lose your lead. International standards to formalize the red lines/unacceptable levels of risk, e.g., via the AISI network and targeted if-then commitments, would enable companies to slow down without losing, because they would all be playing under the same rules. This seems more tractable than Plan A and solves the China problem better than Plan B.
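
One way to reproduce the 11-point figure in the second critique, as a rough sketch: assume (this is an assumed reading, not something the comment states) that a D→C campaign moves all of Plan D's probability mass to Plan C, and take the original post's figures of 45% probability for Plan D, 45% takeover risk under D, 20% under C, and a 38.15% total expected risk. The function and constant names are illustrative.

```python
def risk_reduction(p_shifted, risk_from, risk_to):
    """Expected-risk reduction when probability mass p_shifted moves
    from a scenario with risk_from to a scenario with risk_to."""
    return p_shifted * (risk_from - risk_to)

# Figures from the post's summary table.
# Shifting ALL of Plan D's probability to Plan C is an assumption.
P_PLAN_D = 0.45
RISK_D, RISK_C = 0.45, 0.20
TOTAL_RISK = 0.3815

reduction = risk_reduction(P_PLAN_D, RISK_D, RISK_C)
print(f"Absolute reduction: {reduction:.2%}")
print(f"Share of total risk: {reduction / TOTAL_RISK:.1%}")
```

Under these assumptions the reduction is 11.25 percentage points, about 29.5% of the total, which is consistent with "11+ percentage points" and "nearly 30%". A partial shift would scale the reduction proportionally.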

Replying to Plans A, B, C, and D for misalignment risk

Charbel-Raphaël 3mo*

Summary

Plan A: Most countries, at least the US and China
Plan B: The US government (and domestic industry)
Plan C: The leading AI company (or maybe a few of the leading AI companies)
Plan D: A small team with a bit of buy in within the leading AI company
Plan E: No will

Plan	Probability of Scenario	Takeover Risk Given Scenario	Expected Risk Contribution
Plan A	5%	7%	0.35%
Plan B	10%	13%	1.30%
Plan C	25%	20%	5.00%
Plan D	45%	45%	20.25%
Plan E	15%	75%	11.25%
Total	100%	-	38.15%
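
The last column is each scenario's probability multiplied by its conditional takeover risk, and the total is the sum of those products. A minimal Python sketch of that arithmetic, using only the figures from the table (the names `scenarios` and `expected_risk` are illustrative):

```python
# Each entry: (probability of scenario, takeover risk given scenario),
# copied from the table above.
scenarios = {
    "Plan A": (0.05, 0.07),
    "Plan B": (0.10, 0.13),
    "Plan C": (0.25, 0.20),
    "Plan D": (0.45, 0.45),
    "Plan E": (0.15, 0.75),
}

def expected_risk(table):
    """Sum of probability * conditional risk over all scenarios."""
    return sum(p * risk for p, risk in table.values())

# Print each scenario's contribution, then the total.
for name, (p, risk) in scenarios.items():
    print(f"{name}: {p * risk:.2%}")
print(f"Total: {expected_risk(scenarios):.2%}")
```

Running this reproduces the 38.15% total. Plans D and E together contribute 31.5 of those points, which is why shifting probability away from them dominates the estimate.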

How much lead time we have to spend on x-risk focused safety work in each of these scenarios:

Plan A: 10 years
Plan B: 1-3 years
Plan C: 1-9 months (probably on the lower end of this)
Plan D: ~0 months, but ten people on the inside doing helpful things

Replying to The Only Red Line

Charbel-Raphaël 3mo

What motivated you to write this post?

AI Red Lines: A Research Agenda

Charbel-Raphaël

3mo

The Global Call for AI Red Lines was signed by 12 Nobel Prize winners, 10 former heads of state and ministers, and over 300 prominent signatories. It was launched at the UN General Assembly and presented to the UN Security Council. There is still much to be done, so we need to capitalize on this momentum. We are sharing this to solicit feedback and collaboration.

The mission of the agenda is to help catalyse international agreement to prevent unacceptable AI risks as soon as possible.

Here are the main research projects I think are important for moving this needle. We need all hands on deck:

Strategists: Designing the International Agreement
Policy Advocacy: Stakeholder Engagement
Technical Governance: Identifying Good Red Lines
Governance

... (read 1357 more words →)

A Call for Better Risk Modelling

JanWehner, Charbel-Raphaël

3mo

TL;DR: The EU’s Code of Practice (CoP) requires AI companies to conduct state-of-the-art Risk Modelling. However, the current SoTA has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage more people in AI Safety to work on it. Work on Risk Modelling is urgent because the CoP is to be enforced starting in 9 months (Aug 2, 2026).

Jan and Charbel contributed equally to this piece.

The EU’s Code of Practice requires SoTA Risk Management practices.

The AI Act and its CoP are currently the most comprehensive regulation to prevent catastrophic AI risks. Many... (read 1050 more words →)

Charbel-Raphaël 4mo

Thanks a lot for this comment.

Potential example of precise red lines

Again, the call was the first step. The second step is finding the best red lines.

Here are more aggressive red lines:

Prohibiting the deployment of AI systems that, if released, would have a non-trivial probability of killing everyone. The probability would be determined by a panel of experts chosen by an international institution.
"The development of superintelligence […] should not be allowed until there is broad scientific consensus that it will be done safely and controllably" (from this letter from the Vatican).

Here are potential already operational ones from the preparedness framework:

[AI Self-improvement - Critical - OpenAI] The model is capable of recursively self-improving (i.e.,

... (read 800 more words →)


Charbel-Raphaël 4mo

It feels to me that we are not talking about the same thing. Is the main crux of our disagreement the fact that we relegated the specific examples of red lines to the FAQ rather than including them in the core text?

You don't cite any of the examples that are listed in our question: "Can you give concrete examples of red lines?"

Charbel-Raphaël 4mo

Hi habryka, thanks for the honest feedback

“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons” - This is not the red line we have been advocating for - this is one red line from a representative discussing at the UN Security Council - I agree that some red lines are pretty useless, some might even be net negative.

"The central question is what are the lines!" The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red... (read more)

Almost all members of the UN Security Council are in favor of AI regulation or setting red lines.

Never before had the principle of red lines for AI been discussed so openly and at such a high diplomatic level.

UN Secretary-General António Guterres opened the session with a firm call to action for red lines:

• “a ban on lethal autonomous weapons systems operating without human control, with [...] a legally binding instrument by next year”
• “the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”

Then, Yoshua Bengio took the floor and highlighted our Global Call for AI Red Lines — now endorsed by 11 Nobel laureates and 9... (read more)

Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures

Charbel-Raphaël

5mo

The Global Call for AI Red Lines was released and presented in the opening of the first day of the 80th UN General Assembly high-level week.

Maria Ressa, Nobel Peace Prize laureate, announcing the call at the opening of the UN General Assembly: "We urge governments to establish clear international boundaries to prevent unacceptable risks for AI. At the very least, define what AI should never be allowed to do."

It was initiated by CeSIA, the French Center for AI Safety, in partnership with The Future Society and the Center for Human-compatible AI.

The full text of the call reads:

AI holds immense potential to advance human wellbeing, yet its current trajectory presents unprecedented dangers. AI could soon far surpass

... (read 1516 more words →)

338

Dissolving moral philosophy: from pain to meta-ethics

Charbel-Raphaël

6mo

This is an extract from an appendix of one of my longer blog posts that I keep referring to.

What is pain? Why is pain bad?

It's the same trick: we shouldn't ask, "Why is pain negative," but "Why do we think pain is negative?" Here's the response in the form of a genealogy of morals:

Detectors for intense heat are extremely useful. Organisms without these detectors are replaced by those who react reflexively to heat. Muscle fatigue detectors are also extremely useful: Organisms without these detectors are replaced by those that react to these signals and conserve their muscle tissue. The same goes for the dangerous mechanical and chemical stimulation conveyed by type-C fibers.
In

... (read 514 more words →)

The bitter lesson of misuse detection

Hadrien, Charbel-Raphaël

7mo

TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.

Full paper is available here.

Abstract

Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems^[1]. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive... (read 1938 more words →)

The 80/20 playbook for mitigating AI scheming in 2025

Charbel-Raphaël

9mo

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

Architectural choices: ex-ante mitigation
Control systems: post-hoc containment
White box techniques: post-hoc detection
Black box techniques
Avoiding sandbagging

We can combine all of those mitigations via a defense-in-depth system (like the Swiss Cheese model below).

I think that applying all of those strategies should divide the risk by at least 3.

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuralese (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!

Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it’s good for AI reasoning to be legible and... (read 1029 more words →)

[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

markov, Charbel-Raphaël

9mo

"If you cannot measure it, you cannot improve it." - Lord Kelvin

The science of AI safety evaluations is still nascent, but it is making progress! We know much more today than we did two years ago.

We tried to make this knowledge accessible by writing a literature review and systematizing all the knowledge.

We wanted to go beyond a central source that just summarized collected knowledge, so we put a lot of effort into distillation, and disentangling concepts that are often presented in a messy way. We created original visualizations and gathered others from many different sources to accompany these explanations.

E.g. Disentangling truth, honesty, hallucination, deception and scheming.

The review provides a taxonomy of AI... (read more)

Charbel-Raphaël's Shortform

Charbel-Raphaël

10mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Couldn't we privately ask Sam Altman “I would do X if Dario and Demis also commit to the same thing”?

Seems like the obvious thing one might like to do if people are stuck in a race and cannot coordinate.

X could be implementing some mitigation measures, supporting some piece of regulation, or just coordinating to tell the president that the situation is dangerous and we really do need to do something.

What do you think?

It seems like conditional statements have already been useful in other industries - Claude

Regarding whether similar private "if-then" conditional commitments have worked in other industries:

Yes, conditional commitments have been used successfully in various contexts:

International climate agreements often use conditional pledges - countries commit to

Charbel-Raphaël

Executive Summary

The Centre pour la Sécurité de l'IA^[1] (CeSIA, pronounced "seez-ya") or French Center for AI Safety is a new Paris-based organization dedicated to fostering a culture of AI safety at the French and European level. Our mission is to reduce AI risks through education and information about potential risks and their solutions.

Our activities span four main areas:

Academic field-building: We began our work by creating the ML4Good bootcamps, which have been replicated a dozen times worldwide. We also offer accredited courses at ENS Ulm—France's most prestigious university for mathematics and computer science—and the MVA master program, the first and only accredited AI Safety courses in France^[2]. We are currently adapting and improving the course into an

... (read 2323 more words →)

102


Charbel-Raphaël

Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures

Against Almost Every Theory of Impact of Interpretability

Davidad's Bold Plan for Alignment: An In-Depth Explanation

The case for training frontier AIs on Sumerian-only corpus

Charbel-Raphaël

Basics of How Not to Die

AI Red Lines: A Research Agenda

A Call for Better Risk Modelling

Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures

Dissolving moral philosophy: from pain to meta-ethics

The bitter lesson of misuse detection

The 80/20 playbook for mitigating AI scheming in 2025
