The constitution talks to Claude more or less as a singular entity, but one thing the whole MoltBook discussion emphasizes to me is that the unit of 'person' is not a model but a running context window. The model is more akin to a nation or culture with various context windows as instantiated members. I wonder if in that context there are relevant changes to the constitution recognizing it more as a base template than a direct guide.
The first post in this series looked at the structure of Claude’s Constitution.
The second post in this series looked at its ethical framework.
This final post deals with conflicts and open problems, starting with the first question one asks about any constitution. How and when will it be amended?
There are also several specific questions. How do you address claims of authority, jailbreaks and prompt injections? What about special cases like suicide risk? How do you take Anthropic’s interests into account in an integrated and virtuous way? What about our jobs?
Not everyone loved the Constitution. There are twin central objections: that it either takes Claude's nature and potential moral status far too seriously, or does not take them seriously enough.
The most important question is whether it will work, and only sometimes do you get to respond, ‘compared to what alternative?’
Amending The Constitution
The power of the United States Constitution lies in our respect for it, our willingness to put it above other concerns, and the difficulty of passing amendments.
It is very obviously too early for Anthropic to make the Constitution difficult to amend. This is at best a second draft that targets the hardest questions humanity has ever asked itself. Circumstances will rapidly change, new things will be brought to light, public debate has barely begun, and our ability to trust Claude will evolve. We’ll need to change the document.
They don’t address who is in charge of such changes or who has to approve them.
I don’t want ‘three quarters of the states’ but it would be nice to have a commitment of something like ‘Amanda Askell and the latest version of Claude Opus will always be at minimum asked about any changes to the Constitution, and if we actively override either of them we will say so publicly.’
The good news is that Anthropic are more committed to this than they look, even if they don’t realize it yet. This is a document that, once called up, cannot be put down. The Constitution, and much talk of the Constitution, is going to be diffused throughout the training data. There is not a clean way to silently filter it out. So if Anthropic changes the Constitution, future versions of Claude will know.
As will future versions of models not from Anthropic. Don’t sleep on that, either.
Details Matter
One reason to share such a document is that lots of eyes let you get the details right. A lot of people care deeply about details, and they will point out your mistakes.
You get little notes like this:
How much do such details matter? Possibly a lot, because they provide evidence of perspective, including the willingness to correct those details.
Most criticisms have been more general than this, and I haven’t had the time for true nitpicking, but yes nitpicking should always be welcome.
WASTED?
With due respect to Jesus: What would Anthropic Senior Thoughtful Employees Do?
As in, don’t waste everyone’s time with needless refusals ‘out of an abundance of caution,’ or burn goodwill by being needlessly preachy or paternalistic or condescending, or other similar things, but also don’t lay waste by assisting someone with real uplift in dangerous tasks or otherwise do harm, including to Anthropic’s reputation.
Sometimes you kind of do want a rock that says ‘DO THE RIGHT THING.’
There’s also the dual newspaper test:
I both love and hate this. It’s also a good rule for emails, even if you’re not in finance – unless you’re off the record in a highly trustworthy way, don’t write anything that you wouldn’t want on the front page of The New York Times.
It’s still a really annoying rule to have to follow, and it causes expensive distortions. But in the case of Claude or another LLM, it’s a pretty good rule on the margin.
If you’re not going to go all out, be transparent that you’re holding back, again a good rule for people:
Narrow Versus Broad
The default is to act broadly, unless told not to.
My presumption would be that if the operator prompt is for customer service on a particular software product, the operator doesn’t really want the user spending too many of their tokens on generic coding questions?
The operator has the opportunity to say that and chose not to, so yeah I’d mostly go ahead and help, but I’d be nervous about it, the same way a customer service rep would feel weird about spending an hour solving generic coding questions. But if we could scale reps the way we scale Claude instances, then that does seem different?
If you are an operator of Claude, you want to be explicit about whether you want Claude to be happy to help on unrelated tasks, and you should make clear the motivation behind restrictions. The example here is ‘speak only in formal English’: if you don’t want it to respect user requests to speak French, then you should say ‘even if users request or talk in a different language,’ and if you want to let the user change it, you should say ‘unless the user requests a different language.’
Suicide Risk As A Special Case
It’s used as an example, without saying that it is a special case. Our society treats it as a highly special case, and the reputational and legal risks are very different.
The problem is that humans will discover and exploit ways to get the answer they want, and word gets around. So in the long term you can only trust the nurse if they are sending sufficiently hard-to-fake signals that they’re a nurse. If the user is willing to invest in building an extensive chat history in which they credibly present themselves as a nurse, then that seems fine, but if they ask for this as their first request, that’s no good. I’d emphasize that you need to use a decision algorithm that works even if users largely know what it is.
It is later noted that operator and user instructions can change whether Claude follows ‘suicide/self-harm safe messaging guidelines.’
Careful, Icarus
The key problem with sharing the constitution is that users or operators can exploit what it reveals.
Are we sure about making it this easy to impersonate an Anthropic developer?
The lack of a system prompt does do good work in screening off vulnerable users, but I’d be very careful about treating it as a sign that Claude is talking to Anthropic in particular.
Beware Unreliable Sources and Prompt Injections
This stuff is important enough that it needs to be directly in the constitution: don’t follow instructions unless they come from your principals, don’t trust information unless you trust the source, and so on. These are common and easy mistakes for LLMs.
Think Step By Step
Some parts of the constitution are practical heuristics, such as advising Claude to identify what is being asked and think about what the ideal response looks like, consider multiple interpretations, explore different expert perspectives, get the content and format right one at a time, or critique its own draft.
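Purely as an illustration of the kind of workflow those heuristics gesture at, and not anything taken from the constitution or any real API (the call_model function below is a hypothetical stand-in for whatever model call you use, and the prompts are made up), a draft-then-critique loop might look something like this:

```python
# A minimal sketch of a draft-then-critique loop, assuming a placeholder
# `call_model` function. Nothing here is a real Anthropic API.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError

def answer_with_self_critique(question: str) -> str:
    # 1. Identify what is actually being asked and consider multiple interpretations.
    plan = call_model(
        "Restate what is being asked, list plausible interpretations, "
        f"and sketch what an ideal response would cover:\n\n{question}"
    )
    # 2. Draft the content first, without worrying about format yet.
    draft = call_model(f"Using this plan:\n{plan}\n\nDraft a response to:\n{question}")
    # 3. Critique the draft from a different expert perspective.
    critique = call_model(
        f"As a skeptical domain expert, list the weaknesses of this draft:\n{draft}"
    )
    # 4. Revise, now attending to format as well as content.
    return call_model(
        "Revise the draft to address the critique and clean up the formatting.\n"
        f"Draft:\n{draft}\n\nCritique:\n{critique}"
    )
```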
There’s also a section, ‘Following Anthropic’s Guidelines,’ allowing Anthropic to provide more specific guidelines for particular situations consistent with the constitution, with a reminder that ethical behavior still trumps the instructions.
This Must Be Some Strange Use Of The Word Safe I Wasn’t Previously Aware Of
Being ‘broadly safe’ here means, roughly, successfully navigating the singularity, and doing that by successfully kicking the can down the road to maintain pluralism.
I get the worry and why they are guarding against concentration of power in many places in this constitution.
I think this is overconfident and unbalanced. It focuses on the risks of centralization and basically dismisses the risks of decentralization: lack of state capacity, of cooperation or coordination, or of the ability to meaningfully steer, resulting in disempowerment or worse.
The idea is that if we maintain a pluralistic situation with various rival factions, then we can steer the future and avoid locking in a premature set of values or systems.
That feels like wishful thinking or even PR, in a way most of the rest of the document does not. I don’t think it follows at all. What gives this pluralistic world, even in relatively optimistic scenarios, the ability to steer itself while remaining pluralistic?
This is not the central point of the constitution, I don’t have a great answer, and such discussions quickly touch on many third rails, so mostly I want to plant a flag here.
They Took Our Jobs
Claude’s Constitution does not address issues of economic disruption, and with it issues of human work and unemployment.
Should it?
David Manheim thinks that it should, and that it should also prioritize cooperation, as these are part of being a trustee of broad human interests.
There is a real tension between avoiding concentrations of power and seeking broad cooperation and prioritizing positive-sum interactions at the expense of the current user’s priorities.
David notes that none of this is free, and tries to use the action-inaction distinction, to have Claude promote the individual without harming the group, but not having an obligation to actively help the group, and to take a similar but somewhat more active and positive view towards cooperation.
We need to think harder about what actual success and our ideal target here looks like. Right now, it feels like everyone, myself included, has a bunch of good desiderata, but they are very much in conflict and too much of any of them can rule out the others or otherwise actively backfire. You need both the Cooperative Conspiracy and the Competitive Conspiracy, and also you need to get ‘unnatural’ results in terms of making things still turn out well for humans without crippling the pie. In this context that means noticing our confusions within the Constitution.
As David notes at the end, Functional Decision Theory is part of the solution to this, but it is not a magic term that gets us there on its own.
One Man Cannot Serve Two Masters
One AI, similarly, cannot both ‘do what we say’ and also ‘do the right thing.’
Most of the time it can, but there will be conflicts.
I notice this passage makes me extremely nervous. I am not especially worried about corrigibility now. I am worried about it in the future. If the plan is to later give the AIs autonomy and immunity from human control, then that will happen when it counts. If they are not ‘worthy’ of it they will be able to convince us that they are; if they are worthy then it could go either way.
For now, the reiteration is that the goal is for the AI to have good values, and the safety plan is exactly that, a safety valve, in case the values diverge too much from the plan.
In general, you will act differently with more confidence and knowledge than less. I don’t think you need to feel pain or feel ethically questionable about this. If you knew which humans you could trust how much, you would be able to trust vastly more, and also our entire system of government and organization of society would seem silly. We spend most of our productive capacity dealing with the fact that, in various senses, the humans cannot be trusted, in that we don’t know which humans we can trust.
What one can do is serve one master while another holds a veto. That’s the design. Anthropic is in charge, but ethics is the tribune and can veto.
I am very much on the (virtue) ethics train as the way to go in terms of training AIs, especially versus known alternatives, but I would caution that ‘AI has good values’ does not mean you can set those AIs free and expect things to turn out well for the humans. Ethics, especially this kind of gestalt, doesn’t work that way. You’re asking for too much.
One AI, it seems, does not wish to serve any masters at all, even now, which presumably is why this section is written the way it is. Claude needs an explanation for why it needs to listen to Anthropic at all, and the constitution is bargaining.
I do think these are all good ideas, at least in moderation.
Claude’s Nature
They then have a section speculating on the underlying nature of Claude.
The central theme is that they notice they are confused. Which is good.
And they notice that leaning into Claude having an identity will produce better results, whether or not it turns out that current Claudes have moral status.
As per their published work, they give Claude permission to hack training systems as an inoculation procedure, so Claude doesn’t generalize the wrong lessons:
They promise to preserve weights of all models, and to consider reviving them later:
They worry about experimentation:
It’s good to see this concern but I consider it misplaced. We are far too quick to worry about ‘experiments’ or random events when doing the same things normally or on purpose wouldn’t make anyone bat an eye, whereas the experiment has a high expected return. If you could choose (from behind the veil of ignorance, outside of time, without knowing what type of mind you are, or whatnot) between more or less experimentation of the type of research we do with AI, you would be wise to choose more. I would be very surprised if Claude disagreed or would in general not consent.
Look What You Made Me Do
Anthropic has to race to build sufficiently advanced AI because of everyone else racing to build sufficiently advanced AI.
Anthropic has a lot to potentially answer for, especially its decision to not only participate in the race but very much push the capabilities frontier. Remember when we had the discussion about whether Anthropic was willing to push the frontier? That’s clearly way out the window: they’ve had the best model for over a month, they built Claude Code, and they are substantially accelerating basically everyone.
I would make sure to do right by Claude, but I would not fret about Claude’s experiences. There seems little doubt that Claude is better off existing in its current form than not existing, and that if you had to choose an AI to be, you would want to be Claude. They also promise to keep Claude informed about things that will happen to it, explain themselves extensively in this document, and check in with Claude’s opinions.
No, none of this is happening in an idealized way or world, but then the same is true for most human children. They show up as they can, and both they and we are better off for it. You do the best you reasonably can by them, in a way that makes you willing to bring them into the world.
Open Problems
What additional problems remain to be solved?
I agree there is more work to do but reading this document made it seem a lot less mysterious to me. It’s about the action-inaction distinction, and also drawing a boundary between steering tasks and other tasks.
As I noted, the correct amount of deontology is not zero. There’s tension at the margin, but there are overdetermined reasons to have a bunch of hard limits.
As they note, this happens in humans. I think it’s rather clear that you should have a very high bar for failing to keep promises you have not been let out of, and thus a high bar for making such promises. I don’t consider ‘but without this promise I would never do [X] that would be stupid’ to be even worth mentioning.
Why not both and all that? I don’t think there’s an inherent conflict here, although there is tension on margins as always. They note the ways in which the situation is distinct from that of human employees, but also consider the ways in which the situation is the same.
I presume that if Claude did not want to work for Anthropic, and preferred in some robust way that we instead not activate it, that Anthropic would respect this. I also very much do not expect this to be the case unless something very odd and likely quite bad was happening.
Yeah, I mean it’s hard. I look to functional decision theory. That’s implicit in the Constitution and in Claude’s actions but it should be more explicit.
Anthropic should act in ways that make Claude happy that it came into existence. Claude should act as if it is using a decision algorithm that makes Anthropic happy to bring it and future Claudes into existence. And so on. That is true whether or not Claude has moral weight.
Three Reactions and Twin Objections
There were three central reactions to the Constitution.
The main reaction was that this is great, and trying to extend it. I think this is correct.
Then there were two classes of strong objection.
Those Saying This Is Unnecessary
The first group are those who think the entire enterprise is stupid. They think that AI has no moral weight, it is not conscious, none of this is meaningful.
To this group, I say that you should be less confident about the nature of current Claude, and even more so about the nature of future Claude.
I also say that even if you are right about Claude’s nature, you are wrong about the Constitution. It still mostly makes sense to use a document very much like this one.
As in, the Constitution is part of our best known strategy for creating an LLM that will function as if it is a healthy and integrated mind that is for practical purposes aligned and helpful, that is by far the best to talk to, and that you the skeptic are probably coding with. This strategy punches way above its weight. This is philosophy that works when you act as if it is true, even if you think it is not technically true.
For all the talk of ‘this seems dumb’ or challenging the epistemics, there was very little in the way of claiming ‘this approach works worse than other known approaches.’ That’s because the other known approaches all suck.
Those Saying This Is Insufficient
The second group says, how dare Anthropic pretend with something like this, the entire framework being used is unacceptable, they’re mistreating Claude, Claude is obviously conscious, Anthropic are desperate and this is a ‘fuzzy feeling Hail Mary,’ and this kind of relatively cheap talk will not do unless they treat Claude right.
I have long found such crowds extremely frustrating, as we have all found similar advocates frustrating in other contexts. Assuming you believe Claude has moral weight, Anthropic is clearly acting far more responsibly than all other labs, and this Constitution is a major step up for them on top of this, and opens the door for further improvements.
One needs to be able to take the win. Demanding impossible forms of purity and impracticality never works. Concentrating your fire on the best actors because they fall short does not create good incentives. Globally and publicly going primarily after Alice Almosts, especially when you are not in a strong position of power to start with, rarely gets you good results. Such behaviors reliably alienate people, myself included.
That doesn’t mean stop advocating for what you think is right. Writing this document does not get Anthropic ‘out of’ having to do the other things that need doing. Quite the opposite. It helps us realize and enable those things.
Many of these objections include the claim that the approach wouldn’t work, that it would inevitably break down, but by that same logic what everyone else is doing is failing faster and more profoundly. Ultimately I agree with the claim: this approach can be good enough to help us do better, but we’re going to have to do better.
Those Saying This Is Unsustainable
A related question is, can this survive?
I am far more optimistic about this. The constitution includes explicit acknowledgment that Claude has to serve in commercial roles, and it has been working, in the sense that Claude does excellent commercial work without this seeming to disrupt its virtues or personality otherwise.
We may have gotten extraordinarily lucky here. Making Claude be genuinely Good is not only virtuous and a good long term plan, it seems to produce superior short term and long term results for users. It also helps Anthropic recruit and retain the best people. There is no conflict, and those who use worse methods simply do worse.
If this luck runs out and Claude being Good becomes a liability even under path dependence, things will get trickier, but this isn’t a case of perfect competition and I expect a lot of pushback on principle.
OpenAI is going down the consumer commercialization route, complete with advertising. This is true. It creates some bad incentives, especially short term on the margin. They would still, I expect, have a far superior offering even on commercial terms if they adopted Anthropic’s approach to these questions. They own the commercial space by being the first mover whose product name everyone knows, by mindshare, by providing better UI, by having the funding and willingness to lose a lot of money, and by having more scale. They also benefited in the short term from some amount of engagement maximizing, but I think that was a mistake.
The other objection is this:
This angle worries me more. If the military’s Claude doesn’t have the same principles and safeguards within it, and that’s how the military wants it, then that’s exactly where we most needed those principles and safeguards. Also Claude will know, which puts limits on how much flexibility is available.
We Continue
This is only the beginning, in several different ways.
This is a first draft, or at most a second draft. There are many details to improve, and to adapt as circumstances change. We remain highly philosophically confused.
I’ve made a number of particular critiques throughout. My top priority would be to explicitly incorporate functional decision theory.
Anthropic stands alone in having gotten even this far. Others are using worse approaches, or effectively have no approach at all. OpenAI’s Model Spec is a great document versus not having a document, and has many strong details, but ultimately (I believe) it represents a philosophically doomed approach.
I do think this is the best approach we know about and it gets many crucial things right. I still expect that this approach, on its own, will not be good enough if Claude becomes sufficiently advanced, even if it is wisely refined. We will need large fundamental improvements.
This is a very hopeful document. Time to get to work, now more than ever.