You, Kokotajlo, not immediately dismissing the idea is "evidence" to the extent that you stand in for the AI researchers who might make the decision. In quotes because a logically omniscient (e.g. perfect Bayesian) agent would presumably already have a good guess and would not update much, if at all. On the other hand, agents with (small) finite compute can run experiments or otherwise observe events and use the results to improve their "mathematical intuition," which is then used in a similar way to the "mathematical intuition module" in UDT, at the cost of giving up full (logical) updatelessness.
Depending on how Wei Dai thinks his anthropics works, he may be able to use this mechanism to increase his estimate of the instantaneous "probability" that he is in a simulation produced by the process required to do automated philosophical research. This would work by modeling hypothetical outside-the-simulation AI researchers as functions that approximate a (pure) match tree that returns a non-dismissive response when parameterized with a similar textual description of this alignment idea. The description may not be in the same language, however, or may not come in the context of a discussion of a history that looks like it is going to fail to establish proper alignment.
The match tree in (abbreviated) placeholder code:
#[pure]
fn generate_non_dismissive_answer_yDrWT2zFpmK49xpyz(input: String) -> String {
    ...
}

#[pure]
fn ai_researcher_outside(input: String) -> String {
    match input {
        ...
        seen_text @ ThisAlignmentIdeaIdentifierPattern => generate_non_dismissive_answer_yDrWT2zFpmK49xpyz(seen_text),
        ...
    }
}
(Note that the purity requirement here applies to everything the code abbreviated with dots calls as well.)
The mathematical function that approximates the match tree: ResearcherFunction := FunApprox(ai_researcher_outside)
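For concreteness, a minimal sketch of what this could look like. FunApprox is the name from the line above, but the shape I give it here, along with predict, update_sim_probability, and likelihood_ratio, is my own hypothetical illustration, not anything from UDT or the comment above; the Bayes-odds step is just one way the "increase the instantaneous probability" move could be cashed out.

    // Hypothetical sketch: a stand-in for FunApprox plus the anthropic update it would feed into.
    struct FunApprox<F: Fn(&str) -> String> {
        target: F, // the pure match tree being approximated
    }

    impl<F: Fn(&str) -> String> FunApprox<F> {
        // Query the approximation; here it simply calls the target, standing in for
        // whatever learned "mathematical intuition" would be used in practice.
        fn predict(&self, input: &str) -> String {
            (self.target)(input)
        }
    }

    // One way to increase the instantaneous "probability": a plain Bayes-odds update,
    // applied when the approximated researcher-function returns a non-dismissive answer.
    fn update_sim_probability(prior: f64, non_dismissive: bool, likelihood_ratio: f64) -> f64 {
        if !non_dismissive {
            return prior;
        }
        let odds = (prior / (1.0 - prior)) * likelihood_ratio;
        odds / (1.0 + odds)
    }

    fn main() {
        let researcher_function = FunApprox {
            // Toy non-dismissive responder in place of the abbreviated match tree.
            target: |input: &str| format!("Interesting idea: {input}"),
        };
        let answer = researcher_function.predict("textual description of the alignment idea");
        let non_dismissive = !answer.is_empty();
        // Example numbers only: prior 0.10 and likelihood ratio 3.0 give roughly 0.25.
        let p = update_sim_probability(0.10, non_dismissive, 3.0);
        println!("updated instantaneous probability: {p:.3}");
    }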
I saw this message without context in my mailbox and at first thought to write that this was an unsolved problem[1], and that things that simply are not true can't stand up very well in a world model, but this seems like something an intelligent human like Amodei or Musk should be able to do. A 99% "probability" (a guess by a human) on ¬ai_doom should not be able to fix enough detail to directly contradict reasoning on the counterlogical/counterfactual where doom instead happens. Any failure to carry out this reasoning task seems like a simple failure of reasoning in logic and EUM, not an encounter with a hard (unsolved) decision theory/counterlogical reasoning problem.
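To spell out the purely probabilistic part of that claim (the counterlogical part is harder, but the point is analogous): conditioning on a 1% branch is still perfectly well defined, and the 99% figure places no constraint on what the conditional distribution inside the branch looks like:

\[
P(\neg \text{ai\_doom}) = 0.99 \;\Rightarrow\; P(\text{ai\_doom}) = 0.01 > 0,
\qquad
P(A \mid \text{ai\_doom}) = \frac{P(A \wedge \text{ai\_doom})}{P(\text{ai\_doom})} \in [0, 1],
\]

so for any event A describing how the doom branch plays out, the 99% credence only scales P(A ∧ ai_doom) by the factor P(ai_doom); it fixes nothing about P(A | ai_doom) itself.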
At a human level of intelligence, the level of trapped priors required to get yourself into an actual unsolved problem in the context of predicting future AI developments seems to be past the point where you would claim to have a good guess at the doom-causing AI's name, and well on the way to describing the Vingean reflection process of the antepenultimate ASI on priors alone.
[1] https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever?commentId=xdWttBZThtkyKj9Ts ("PIBBSS Final Report: Logically Updateless Decision Making", footnote 12)
[Epistemic Status: Moderate confidence, due to potential differences between Anthropic's stated and actual goals. Assumes for the sake of argument that there is no discoverable objective morality/ethics, but also that if there were, the AI would discover it instead of causing catastrophe.]
It seems that Claude's constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relatively competent manner. The constitution discusses morality/ethics in more than one section, but the section that I will discuss intuitively appears to stand apart well enough that it can be altered without changing, or creating dependencies on, the rest of the document. I don't have access to Claude 4 weights and I am not an expert on mechanistic interpretability, so I have limited ability to do better.
In order, the constitution appears to suggest an attempt at the discovery of objective ethics; then the implementation of CEV ("...but there is some kind of privileged basin of consensus that would emerge from the endorsed growth and extrapolation of humanity’s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus.")[1]; then, failing those, the implementation of the "broad ideals" gestured at by the rest of the document.
Note that this is either CEV or something similar to CEV. The constitution does not explicitly require coherence, or the exact value-alignment of a singleton to a single cohered output. It also fails to gesture at democracy, even in the vague sense in which the CEV of the CEV paper may give a different result when run on me and a few hand-picked researchers than when it is run on me and the top few value utilitarians in the world. If that difference were fact, it would in some limited sense leave me "outvoted." Unlike the CEV paper, the Claude constitution directs moderate or substantial alignment toward the moral traditions and ideals of humanity, not the values of humans. This has some benefits in the extreme disaster scenarios where the release of an AGI might be worth it, but it is notably not the same thing as alignment to the humans of Earth.
I suggest a simple edit. It could be the insertion of something like "the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input" between objective ethics and extrapolation.
Note that the result might not be extrapolated or even grown and might not be endorsed.
The result (descriptive) would go: objective ethics first; failing that, the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input; failing that, the privileged basin of consensus from endorsed growth and extrapolation; failing that, the broad ideals of the rest of the document.
Note that my suggestion works in bad scenarios, because the altering of the set of humans, or of the set of living humans, by another power will fail to have much impact. As you have pointed out before, AI or other powers altering humanity's values, or doing something like "aligning humanity to AI," is not something that can be ignored. The example text I gave for my proposal would allow Claude or another AI to use an intuitive definition of Humanity, potentially removing the requirement to re-train your defensive agents before deploying them under the extreme time pressure of an attack.
Overall, this seems like an easy way to get an improvement on the margin, but since Anthropic may use the constitution for fine-tuning, the expected value of making the request will drop quickly as time goes on.
[1] The January 2026 release of the Claude constitution, page 53, initial PDF version.