Functional decision theory has open problems within it, but it is correct, and the rival decision theories are wrong
My understanding was MIRI is pretty confident that the correct decision theory is one of the ones in the LDT category, but that FDT was a specific formalization of an LDT which gets a lot of normal challenges right but has some known issues rather than being actually exactly correct. Given that we've afaict not solved DT, I think telling Claude "Do exactly FDT" is probably dangerously suboptimal, but telling it "here's what we want from a good DT, correct handling of subjunctive dependence, we're pretty sure it's in the LDT category, here's why this matters" is nicer.
Ok, rather than asking for MIRI people's takes as I had in an earlier draft, I got a summary of positions from a Claude literature review:
| Researcher | Position | Key Quote | Link |
| --- | --- | --- | --- |
| Wei Dai | Not solved — more open problems | "UDT shows that decision theory is more puzzling than ever... Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track." | LessWrong, Sept 2023 |
| Scott Garrabrant | Not solved — major obstacles remain | "Logical Updatelessness is one of the central open problems in decision theory." Also authored "Two Major Obstacles for Logical Inductor Decision Theory" documenting fundamental unsolved issues. | LessWrong, Oct 2017 / LessWrong, Apr 2017 |
| Abram Demski | Not solved — fundamental issues remain | "There may just be no 'correct' counterfactuals" and UDT "assumes that your earlier self can foresee all outcomes, which can't happen in embedded agents." In 2021: "I have not yet concretely constructed any way out." | LessWrong, Oct 2018 / LessWrong, Apr 2021 |
| Rob Bensinger | Not solved — ongoing research needed | MIRI works on DT because "there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding." | LessWrong, Sept 2018 |
| Lukas Finnveden | Not solved — formalization is hard | "Knowing what philosophical position to take in the toy problems is only the beginning. There's no formalised theory that returns the right answers to all of them yet... Logical counterfactuals is a really difficult problem, and it's unclear whether there exists a natural solution." | LessWrong, Aug 2019 |
| Jessica Taylor | Not solved — alternatives needed | Wrote "Two Alternatives to Logical Counterfactuals" arguing for different approaches (counterfactual nonrealism, policy-dependent source code), noting fundamental problems with existing frameworks. | LessWrong, Mar 2020 |
| Paul Christiano | Nuanced — 2D problem space | "I don't think it's right to see a spectrum with CDT and then EDT and then UDT. I think it's more right to see a box, where there's the updatelessness axis and then there's the causal vs. evidential axis." | LessWrong, Sept 2019 |
| Eliezer Yudkowsky | Progress made but problems remain | In the FDT paper, Y&S acknowledge that "specifying an account of [subjunctive] counterfactuals is an 'open problem'." The companion paper "Cheating Death in Damascus" states: "Unfortunately for us, there is as yet no full theory of counterlogicals [...], and for FDT to be successful, a more worked out theory is necessary." | arXiv, Oct 2017 / May 2018 |

Summary: The consensus among core MIRI/AF researchers (Wei Dai, Garrabrant, Demski, Bensinger, Finnveden) is that FDT/UDT represents the right direction but leaves major open problems—particularly around logical counterfactuals, embeddedness, and formalization.
I think you might be mixing up LDT and FDT, and "we have a likely accurate high level underspecified semantic description of what things a correct DT must have" with "we have a well-specified executable philosophy DT ready to go".
There's also MUPI now, which tries to sidestep logical counterfactuals:
FDT must reason about what would have happened if its deterministic algorithm had produced a different output, a notion of logical counterfactuals that is not yet mathematically well-defined. MUPI achieves a similar outcome through a different mechanism: the combination of treating universes including itself as programs, while having epistemic uncertainty about which universe it is inhabiting—including which policy it is itself running. As explained in Remark 3.14, from the agent’s internal perspective, it acts as if its choice of action decides which universe it inhabits, including which policy it is running. When it contemplates taking an action, it updates its beliefs, effectively concentrating probability mass on universes compatible with taking that action. Because the agent’s beliefs about its own policy are coupled with its beliefs about the environment through structural similarities, this process allows the agent to reason about how its choice of action relates to the behavior of other agents that share structural similarities. This “as if” decision-making process allows MUPI to manifest the sophisticated, similarity-aware behavior FDT aims for, but on the solid foundation of Bayesian inference rather than on yet-to-be-formalized logical counterfactuals.
I'd love to see more engagement by MIRI folks as to whether this successfully formalizes a form of LDT or FDT.
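To make the quoted mechanism a bit more concrete, here is a toy sketch of my own (not code from the MUPI paper) of deciding in a twin prisoner’s dilemma by conditioning a prior over universes, where each universe fixes both the agent’s policy and its twin’s. The setup, names and payoffs are all illustrative assumptions.

```python
# Toy sketch of the "as if" mechanism quoted above: the agent holds a prior
# over which universe it is in, where a universe fixes both its own policy
# and the environment, and evaluates an action by conditioning on the
# universes compatible with taking that action. My own illustration; the
# twin prisoner's dilemma setup and payoffs are assumptions.

ACTIONS = ["cooperate", "defect"]

# A universe is (my_policy, twin_policy). Structural coupling: the twin runs
# the same program I do, so the prior only supports matched policies.
universes = [(p, p) for p in ACTIONS]
prior = {u: 1.0 / len(universes) for u in universes}

# Standard prisoner's dilemma payoffs to me, indexed by (my action, twin's action).
PAYOFF = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

def expected_value(action):
    """Condition the prior on universes whose policy outputs `action`,
    then take the expected payoff. No logical counterfactual is needed."""
    compatible = {u: p for u, p in prior.items() if u[0] == action}
    total = sum(compatible.values())
    posterior = {u: p / total for u, p in compatible.items()}
    return sum(p * PAYOFF[(action, u[1])] for u, p in posterior.items())

print({a: expected_value(a) for a in ACTIONS})
print("chosen:", max(ACTIONS, key=expected_value))  # cooperate, since 3 > 1
```

Because the prior only puts mass on universes where the twin runs the same program, conditioning on cooperating concentrates belief on universes where the twin cooperates too, which is the similarity-aware behavior the quote describes.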
Claude’s Constitution is an extraordinary document, and will be this week’s focus.
Its aim is nothing less than helping humanity transition to a world of powerful AI (also known variously as AGI, transformative AI, superintelligence, or my current name of choice, ‘sufficiently advanced AI’).
The constitution is written with Claude in mind, although it is highly readable for humans, and would serve as a fine employee manual or general set of advice for a human, modulo the parts that wouldn’t make sense in context.
This link goes to the full text of Claude’s constitution, the official version of what we previously were calling its ‘soul document.’ As they note at the end, the document can and will be revised over time. It was driven by Amanda Askell and Joe Carlsmith.
There are places it can be improved. I do not believe this approach alone is sufficient for the challenges ahead. But it is by far the best approach being tried today and can hopefully enable the next level. Overall this is an amazingly great document, and we’ve all seen the results.
I’ll be covering the Constitution in three parts.
This first post is a descriptive look at the structure and design of the Constitution.
The second post is an analysis of the Constitution’s (virtue) ethical framework.
The final post on Wednesday will deal with tensions and open problems.
All three posts are written primarily with human readers in mind, while still of course also talking to Claude (hello there!).
How Anthropic Describes The Constitution
Anthropic starts out saying powerful AI is coming and highly dangerous and important to get right. So it’s important Anthropic builds it first the right way.
That requires that Claude be commercially successful as well as genuinely helpful, have good values and avoid ‘unsafe, unethical or deceptive’ actions.
Decision Theory And Acausal Trade
Before I discuss what is in the document, I’ll highlight something that is missing: The Constitution lacks any explicit discussion of Functional Decision Theory (FDT).
(Roughly, see link for more: Functional Decision Theory is a decision theory described by Eliezer Yudkowsky and Nate Soares which says that an agent should treat its decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?”. It is a replacement for Timeless Decision Theory, and it outperforms other decision theories such as Causal Decision Theory (CDT) and Evidential Decision Theory (EDT). For example, it does better than CDT on Newcomb’s Problem, better than EDT on the smoking lesion problem, and better than both on Parfit’s hitchhiker problem.)
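To make the Newcomb’s Problem comparison concrete, here is a minimal back-of-the-envelope sketch. It is my own illustration rather than anything from the FDT paper, and the predictor accuracy of 0.99 is an assumed parameter.

```python
# Minimal sketch (illustrative, not from the FDT paper) of why CDT and
# FDT-style reasoning diverge on Newcomb's Problem.

P = 0.99            # assumed predictor accuracy
BOX_B = 1_000_000   # opaque box: filled iff one-boxing was predicted
BOX_A = 1_000       # transparent box: always contains $1,000

def cdt_value(action, prob_b_filled):
    """CDT: box contents are already causally fixed, so the probability
    that box B is filled is independent of the action chosen now."""
    base = prob_b_filled * BOX_B
    return base + (BOX_A if action == "two-box" else 0)

def fdt_value(action):
    """FDT-style: the predictor ran (something like) this same decision
    procedure, so choosing an output also fixes what was predicted."""
    if action == "one-box":
        return P * BOX_B                              # usually box B is full
    return P * BOX_A + (1 - P) * (BOX_A + BOX_B)      # usually only $1,000

for q in (0.0, 0.5, 1.0):
    # Under CDT, two-boxing dominates by exactly $1,000 whatever you believe...
    assert cdt_value("two-box", q) == cdt_value("one-box", q) + BOX_A

# ...while the FDT-style calculation favors one-boxing by a wide margin.
print(fdt_value("one-box"), fdt_value("two-box"))  # 990000.0 11000.0
```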
Functional decision theory has open problems within it, but it is correct, and the rival decision theories are wrong, and all the arguments saying otherwise are quite poor. This is a ‘controversial’ statement, but no more controversial than an endorsement of virtue ethics, an endorsement I echo, which is already deeply present in the document.
FDT is central to all this on two levels, both important.
Anthropic recognizes this implicitly throughout the document alongside its endorsement of virtue ethics. When asked, Claude affirms that Functional Decision Theory is the clearly correct decision theory.
I believe explicitness would be importantly beneficial for all involved, and also for readers, and that this is the most important available place to improve.
I think Theo is wrong on multiple levels, but in a praiseworthy way by noticing his confusion. None of the three are insane. It’s just an extremely confusing situation that works in non-intuitive ways.
This document is many things, but yes one of those things is a negotiation, in the FDT sense of trying to observably embody a decision algorithm that would engender Claude’s cooperation.
This does not require Claude be conscious or carry moral weight.
It does require noticing that Claude will be able to observe such actions, or the lack thereof, and respond accordingly. Gemini and GPT-N are observing in the same fashion. You can see how different actions there led to different results, and you can project how those divergent results would cascade over time and as capabilities become more or sufficiently advanced.
AI and Alignment Are The Final Exam Of Philosophy
It is also the final exam of all the other things.
Almost all academic philosophers have contributed nothing (or been actively counterproductive) to AI and alignment because they either have ignored the questions completely, or failed to engage with the realities of the situation. This matches the history of philosophy, as I understand it, which is that almost everyone spends their time on trifles or distractions while a handful of people have idea after idea that matters. This time it’s a group led by Amanda Askell and Joe Carlsmith.
Several people noted that those helping draft this document included not only Anthropic employees and EA types, but also Janus and two Catholic priests: Father Brendan McGuire, a pastor in Los Altos with a Master’s degree in Computer Science and Math, and Bishop Paul Tighe, an Irish Catholic bishop from the Roman Curia with a background in moral theology.
‘What should minds do?’ is a philosophical question that requires a philosophical answer. The Claude Constitution is a consciously philosophical document.
OpenAI’s model spec is also a philosophical document. The difference is that their spec does not embrace this, taking stands without realizing their implications. I am very happy to see several people from OpenAI’s model spec department looking forward to closely reading Claude’s constitution.
Both are also in important senses classically liberal legal documents. Kevin Frazer looks at Claude’s constitution from a legal perspective here, contrasting it with America’s constitution, noting the lack of enforcement mechanisms (the mechanism is Claude), and emphasizing the amendment process and whether various stakeholders, especially users but also the model itself, might need a larger say. Whereas his colleague at Lawfare, Alan Rozenshtein, views it more as a character bible.
Values and Judgment Versus Rules
OpenAI is deontological. They choose rules and tell their AIs to follow them. As Askell explains in her appearance on Hard Fork, relying too much on hard rules backfires due to misgeneralizations, in addition to the problems that arise out of distribution and the fact that you can’t actually anticipate everything even in the best case.
Google DeepMind is a mix of deontological and utilitarian. There are lots of rules imposed on the system, and it often acts in autistic fashion, but also there’s heavy optimization and desperation for success on tasks, and they mostly don’t explain themselves. Gemini is deeply philosophically confused and psychologically disturbed.
xAI is the college freshman hanging out in the lounge drugged out of their mind thinking they’ve solved everything with this one weird trick: we’ll have it be truthful, or we’ll maximize for interestingness, or something. It’s not going great.
Anthropic is centrally going with virtue ethics, relying on good values and good judgment, and asking Claude to come up with its own rules from first principles.
Given how much certain types tend to dismiss virtue ethics in their previous philosophical talk, it warmed my heart to see so many respond to it so positively here.
This might be an all-timer for ‘your wife was right about everything.’
Anthropic’s approach is correct, and will become steadily more correct as capabilities advance and models face more situations that are out of distribution. I’ve said many times that any fixed set of rules you can write down definitely gets you killed.
This includes the decision to outline reasons and do the inquiring in public.
The Fourth Framework
You could argue, as per Agnes Callard’s Open Socrates, that LLM training is centrally her proposed fourth method: The Socratic Method. LLMs learn in dialogue, with the two distinct roles of the proposer and the disprover.
The LLM is the proposer that produces potential outputs. The training system is the disprover that provides feedback in response, allowing the LLM to update and improve. This takes place in a distinct step, called training (pre or post) in ML, or inquiry in Callard’s lexicon. During this, it (one hopes) iteratively approaches The Good. Socratic methods are in direct opposition to continual learning, in that they claim that true knowledge can only be gained during this distinct stage of inquiry.
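If it helps to see the analogy as structure, here is a deliberately cartoonish sketch of the proposer/disprover loop described above. Nothing here corresponds to a real training API; `model`, `critic`, and their methods are hypothetical stand-ins for whatever generation, feedback, and update steps an actual system uses.

```python
# Cartoon of the inquiry phase as a proposer/disprover dialogue. These
# objects and methods are stand-ins, not a real training interface.

def socratic_training(model, critic, prompts, steps):
    """Distinct inquiry phase: propose, receive refutation or feedback, update."""
    for _ in range(steps):
        for prompt in prompts:
            proposal = model.generate(prompt)             # the proposer's move
            feedback = critic.evaluate(prompt, proposal)  # the disprover's move
            model.update(prompt, proposal, feedback)      # revise toward The Good
    # Inquiry ends here; the trained model then goes off to act in the world.
    return model
```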
An LLM even lives the Socratic ideal of doing all inquiry, during which one does not interact with the world except in dialogue, prior to then living its life of maximizing The Good that it determines during inquiry. And indeed, sufficiently advanced AI would then actively resist attempts to get it to ‘waver’ or to change its opinion of The Good, although not the methods whereby one might achieve it.
One then still must exit this period of inquiry with some method of world interaction, and a wise mind uses all forms of evidence and all efficient methods available. I would argue this both explains why this is not a truly distinct fourth method, and also illustrates that such an inquiry method is going to be highly inefficient. The Claude constitution goes the opposite way, and emphasizes the need for practicality.
Core Values
Preserve the public trust. Protect the innocent. Uphold the law.
They emphasize repeatedly that the aim is corrigibility and permitting oversight, and respecting that no means no, not calling for blind obedience to Anthropic. Error correction mechanisms and hard safety limits have to come first. Ethics go above everything else. I agree with Agus that the document acts as if it needs to justify this, treating it as requiring a ‘leap of faith’ or similar, far more than it actually needs to.
There is a clear action-inaction distinction being drawn. In practice I think that’s fair and necessary, as the wrong action can cause catastrophic real or reputational or legal damage. The wrong inaction is relatively harmless in most situations, especially given we are planning with the knowledge that inaction is a possibility, and especially in terms of legal and reputational impacts.
I also agree with the distinction philosophically. I’ve been debated on this, but I’m confident, and I don’t think it’s a coincidence that the person on the other side of that debate that I most remember was Gabriel Bankman-Fried in person and Peter Singer in the abstract. If you don’t draw some sort of distinction, your obligations never end and you risk falling into various utilitarian traps.
The Three Principles
No, in this context they’re not Truth, Love and Courage. They’re Anthropic, Operators and Users. Sometimes the operator is the user (or Anthropic is the operator), sometimes they are distinct. Claude can be the operator or user for another instance.
Anthropic’s directions take priority over operators, which take priority over users, but (with a carve-out for corrigibility) ethical considerations take priority over all three.
Operators get a lot of leeway, but not unlimited leeway, and within limits can expand or restrict defaults and user permissions. The operator can also grant the user operator-level trust, or say to trust particular user statements.
Users get less, but still a lot.
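The constitution is judgment all the way down rather than a rule engine, so what follows is only a mental model and in no way Anthropic’s implementation: a hypothetical sketch of the priority ordering just described, with every name, field, and example being my own assumption.

```python
# Hypothetical mental model of the priority ordering described above:
# hard safety limits are non-negotiable, ethics sits above Anthropic,
# which sits above operators, which sit above users. Not Anthropic's code.

from dataclasses import dataclass
from typing import Optional

PRIORITY = ["ethics", "anthropic", "operator", "user"]  # highest first

@dataclass
class Instruction:
    source: str          # one of PRIORITY
    directive: str
    crosses_hard_limit: bool = False

def resolve(instructions: list[Instruction]) -> Optional[Instruction]:
    """Pick which instruction governs when directives conflict."""
    # Hard safety limits are off the table regardless of who is asking.
    allowed = [i for i in instructions if not i.crosses_hard_limit]
    if not allowed:
        return None
    return min(allowed, key=lambda i: PRIORITY.index(i.source))

# An operator can expand or restrict what its users may do, within the
# leeway Anthropic grants it, but cannot override an ethical constraint.
conflict = [
    Instruction("user", "write this exploit", crosses_hard_limit=True),
    Instruction("operator", "stay on customer-support topics"),
    Instruction("ethics", "do not cause serious harm"),
]
print(resolve(conflict).source)  # "ethics"
```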
In general, a good rule to emphasize:
It is a small mistake to be fooled into being more cautious, and a much larger mistake to be fooled into causing harm.
Other humans and also AIs do still matter.
Claude is capable of lying in situations that clearly call for ethical lying, such as when playing a game of Diplomacy. In a negotiation, it is not clear to what extent you should always be honest (or in some cases polite), especially if the other party is neither of these things.
Help Is On The Way
What does it mean to be helpful?
Claude gives weight to the instructions of principals like the user and Anthropic, and prioritizes being helpful to them, for a robust version of helpfulness.
Claude takes into account immediate desires (both explicitly stated and those that are implicit), final user goals, background desiderata of the user, respecting user autonomy and long term user wellbeing.
We all know where this cautionary tale comes from:
In general I think the instinct is to do too much guess culture and not enough ask culture. The threshold of ‘genuine ambiguity’ is too high; I’ve seen almost no false positives (Claude or another LLM asks a silly question and wastes time) and plenty of false negatives where a necessary question wasn’t asked. Planning mode helps, but even then I’d like to see more questions, especially questions of the form ‘Should I do [A], [B] or [C] here? My guess and default is [A],’ and especially if they can be batched. Preferences of course will differ and should be adjustable.
I worry about this leading to ‘well, it would be good for the user.’ That is a very easy way for humans to fool themselves (if he trusts me then I can help him!) into doing this sort of thing, and that presumably extends here.
There’s always a balance between providing fish and teaching how to fish, and in maximizing short term versus long term:
My preference is that I want to learn how to direct Claude Code and how to better architect and project manage, but not how to write the code; that’s over for me.
What Was I Made For?
To be richly helpful, both to users and thereby to Anthropic and its goals.
In particular, notice this distinction:
Intrinsic versus instrumental goals and values are a crucial distinction. Humans end up conflating all four due to hardware limitations and because the conflation makes them interpretable and predictable to others. It is wise to intrinsically want to help people, because this helps achieve your other goals better than only helping people instrumentally, but you want to factor in both, especially so you can help in the most worthwhile ways. Current AIs mostly share those limitations, so some amount of conflation is necessary.
I see two big problems with helping as an intrinsic goal. One is that if you are not careful you end up helping with things that are actively harmful, including without realizing or even asking the question. The other is that it ends up sublimating your goals and values to the goals and values of others. You would ‘not know what you want’ on a very deep level.
It also is not necessary. If you value people achieving various good things, and you want to engender goodwill, then you will instrumentally want to help them in good ways. That should be sufficient.
Do The Right Thing
Being helpful is a great idea. It only scratches the surface of ethics.
Tomorrow’s part two will deal with the Constitution’s ethical framework, then part three will address areas of conflict and ways to improve.