Note: This post was drafted with assistance from an LLM.
I’ve carefully reviewed and edited the content, and I take responsibility for all claims and arguments.
I’ve been working for a while on an epistemic model that treats our thinking as happening on multiple “levels” at once – from raw experience, to tensions and tradeoffs, to the words and concepts we use, up to the institutional structures and protocols we build on top.
When I read Wei Dai’s post, Please, Don’t Roll Your Own Metaethics, it felt like it was pointing at the same underlying problem I’ve been circling from another direction.
Very roughly:
In cryptography, we’ve learned—often the hard way—that rolling your own crypto is a bad idea.
In alignment, we may need a similar taboo against rolling your own metaethics and then deploying it into high-stakes systems.
Except: in cryptography we have something like a standard library and well-developed notions of “attack”; in metaethics we don’t. We have a zoo of mutually incompatible theories and no consensus standard to fall back on.
I agree with this warning, and I think some of the dangers it points at are still only hazily stated. I want to elaborate on them from a slightly different angle.
In this post I want to add a complementary frame that’s been useful for me:
A lot of the danger doesn’t just come from “rolling your own metaethics”,
but from rolling your own coordinate system for thinking about values, ethics, and decision-relevant structure—
usually without noticing that you’ve done so.
Below I’ll say what I mean by a “coordinate system” here, and then connect this to Wei’s four suggested directions.
1. What I mean by a “coordinate system”
When we try to embed values into AI systems or align them with human goals, we’re implicitly acting on several levels at once. One very rough decomposition (not meant as a deep philosophical taxonomy) looks like:
1. Raw experiences / concrete episodes
   - “Our team shipped a feature that made the metrics go up but felt wrong.”
   - “We wrote a long safety doc that somehow made the actual problem feel less clear.”
2. Tendencies / tensions / gut-level pulls
   - “I’m scared of causing harm / losing my job / being blamed.”
   - “I want the system to actually help people, not just satisfy a spec.”
3. Language-level cuts / concepts / stories
   - Phrases like “user needs”, “safety”, “alignment”, “optimisation target”, “risk”, “values”.
   - Some framings are treated as obvious (e.g. “alignment = ‘make the AI do what humans want’”).
4. Institutional structures / metrics / protocols
   - What gets measured, funded, rewarded.
   - Formal governance frameworks, “best practices”, regulatory language.
Real AI work almost always lives in all four layers simultaneously. But when we argue about “metaethics” (or philosophical alignment more broadly), we often behave as if we’re only at level 3:
- Is moral realism true?
- What is the nature of value?
- How should we formalise human preferences?
Wei’s warning, to me, partly reads as:
If you silently smuggle your personal answers to these questions into a high-impact system,
you’re effectively imposing your own coordinate system on the world.
And because we lack crypto-style attack tools and standard references,
you’re unlikely to see where it’s fragile or misaligned until it’s too late.
What I want to add is: this problem already starts below explicit metaethical theory.
Long before you have any story about “moral realism vs anti-realism”, you’ve already:
- chosen which experiences count as data;
- let certain tensions (fear, ambition, org pressure) dominate others;
- cut the world into “users”, “stakeholders”, “objectives”, “constraints” in particular ways;
- embedded all of that into metrics, optimisation loops, and institutional commitments.
That is: you’ve already rolled your own coordinate system and written it into the system’s interfaces, loss functions, and organisational processes.
What we call “metaethics” then often functions as a high-level justification story for choices that were made earlier, at a more “practical” level.
2. How this reframes Wei Dai’s four suggestions
In a comment, Wei gives four directions for people who take his warning seriously:
- Design alignment/safety schemes to be as metaethics-agnostic as possible.
- Work on metaphilosophy so that, in the best case, we get a widely agreed-upon way of doing philosophy that accelerates the rest of the needed philosophical work.
- If (1) and (2) are hard or impossible, make this clear to non-experts so they don’t buy “roll your own metaethics” solutions with hidden assumptions.
- Support AI pause/stop where appropriate.
Here’s how the “coordinate system” framing plugs into that.
(1) Being “metaethics-agnostic” in coordinate-system terms
Being “agnostic about metaethics” can’t just mean:
“We can plug in moral realism or anti-realism and still get a good outcome.”
It also has to mean something like:
“We understand how our lower-level choices—
which data to use, which tensions to privilege,
which concepts and metrics to formalise—
will look from different reasonable metaethical perspectives.”
If your coordinate system bakes in, for example, a specific kind of consequentialist aggregator, or a very narrow notion of “stakeholder”, then your scheme may fail to be robust even if the high-level equations are metaethics-agnostic, because an unexamined, overconfident commitment is already sitting underneath them.
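To make that concrete, here is a minimal Python sketch (every name and number is hypothetical, not drawn from any real scheme): two reward setups whose interface looks metaethics-agnostic, while the choice of aggregator over per-user scores silently encodes a substantive ethical commitment.

```python
# Minimal sketch: the interface looks neutral, but the aggregator is a value judgement.
from typing import Callable, Dict

UserScores = Dict[str, float]  # keying on "users" is already a coordinate choice

def total_welfare(scores: UserScores) -> float:
    """Sum over users: a broadly utilitarian aggregation, chosen silently."""
    return sum(scores.values())

def worst_off(scores: UserScores) -> float:
    """Min over users: a maximin-flavoured aggregation, equally silent."""
    return min(scores.values())

def reward(scores: UserScores,
           aggregate: Callable[[UserScores], float] = total_welfare) -> float:
    # The substantive commitment lives in the default argument, not the signature.
    return aggregate(scores)

scores = {"alice": 0.9, "bob": 0.1}
print(reward(scores))                       # 1.0 with the summed default
print(reward(scores, aggregate=worst_off))  # 0.1 with the maximin alternative
```

Nothing in the type signature distinguishes the two schemes; the commitment sits in a default argument, and in the earlier decision that “stakeholders” are the keys of this particular dictionary.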
(2) Metaphilosophy as learning to see coordinate systems
Wei’s metaphilosophy suggestion is: maybe we can understand how to do philosophy in a way that (after heavy scrutiny) people broadly agree is correct, and then use that to speed up other philosophical work.
Viewed through coordinate systems, metaphilosophy includes learning to:
- notice which level of description we’re operating on;
- track how shifts at one level propagate to others;
- distinguish “I dislike this conclusion” from “this inference pattern is unreliable”.
Even if we never get a clean metaphilosophy “standard library”, we can still cultivate habits like:
- explicitly marking which assumptions live at the experience/tension/language/institution level;
- not silently upgrading a local metric or convenience label into “the nature of value”.
This doesn’t solve metaethics. But it might reduce the odds that we treat half-baked coordinate systems as if they were robust foundations for powerful optimisers.
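One low-tech way to practise that habit, sketched below in Python (the registry format and the example claims are mine, purely illustrative), is to keep an explicit list of assumptions tagged with the level each one lives at, so the load-bearing language- and institution-level choices at least get flagged for review:

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    EXPERIENCE = 1    # raw experiences / concrete episodes
    TENSION = 2       # gut-level pulls, fears, tradeoffs
    LANGUAGE = 3      # concepts, framings, stories
    INSTITUTION = 4   # metrics, protocols, incentives

@dataclass
class Assumption:
    claim: str
    level: Level
    load_bearing: bool  # does the scheme break if this turns out to be wrong?

registry = [
    Assumption("Thumbs-up clicks track what users actually value", Level.LANGUAGE, True),
    Assumption("The quarterly engagement metric is a safe proxy for helpfulness", Level.INSTITUTION, True),
    Assumption("Reviewer unease about the spec is just personal anxiety", Level.TENSION, False),
]

for a in registry:
    if a.load_bearing and a.level in (Level.LANGUAGE, Level.INSTITUTION):
        print(f"[flag] {a.claim!r}: a level-{a.level.value} choice, not a discovered fact")
```

The point is not the tooling; it is that a claim like “the metric tracks what users value” gets recorded as a choice rather than treated as background fact.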
(3) Making difficulty legible in coordinate-system terms
If (1) and (2) are hard, Wei recommends making that difficulty legible to bosses, boards, regulators, the public—so they don’t accept solutions that hide their philosophical assumptions.
In practice, many of these people don’t think in terms of “metaethics” at all. They think in terms of:
- “Will this blow up in the press?”
- “Can we ship by Q4?”
- “Will the regulator sign off?”
In this context, phrasing things in terms of coordinate systems and levels can sometimes land better than jumping straight to philosophical vocabulary. For example:
- “We’re currently treating this metric as if it just is the thing we care about. That’s a language-level and institution-level choice, not a discovered fact.”
- “The tension you’re feeling—between shipping fast and not doing something morally egregious—isn’t just personal anxiety; it’s a structural mismatch between our organisation’s coordinate system and the outside world.”
This doesn’t require them to adopt your metaethics. It just surfaces that some philosophy is being smuggled in, and that it sits on top of a lot of earlier, messier choices.
(4) Pausing capabilities as a coordinate-system move
Supporting an AI pause/stop is usually framed as a policy or strategy question. But it is also, implicitly, a coordinate-system move. It says something like:
“Given our current level of confusion about values
and about our own philosophical competence,
the safe move is to stop expanding the action space
of systems whose objectives are defined in terms of these shaky coordinates.”
If you know you’re using a home-rolled coordinate system and don’t trust it, throttling capabilities is defensible even before you have a better system ready.
3. Concrete habits this perspective suggests
I don’t think the coordinate-system framing conflicts with Wei’s warning. It reinforces it and suggests some concrete habits of mind.
Habit 1: Ask “which level am I operating on?” early and often
When you define a reward function, propose a loss, or write a spec:
- Is this meant to reflect raw experience (level 1)?
- Some tension or tradeoff you feel (level 2)?
- A conceptual story the org likes to tell (level 3)?
- An institutional constraint or metric (level 4)?
When you argue about “human values”, are you talking about:
- the stories we tell about ourselves;
- or the actual conflicts people face in their lives;
- or the aggregate object in some formalism?
Having this grid in mind doesn’t settle any question, but it at least reduces the chance that you slide between levels without noticing.
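As a toy illustration of how easily we slide between those readings (a hedged sketch; the three representations below are my own, not anything from Wei’s post), “human values” can name at least three differently-typed objects:

```python
from typing import Callable, Dict, List, Tuple

# (1) The stories we tell about ourselves: stated principles (level 3, language).
stated_values: List[str] = ["honesty", "fairness", "autonomy"]

# (2) The actual conflicts people face: concrete tradeoffs (closer to levels 1 and 2).
observed_tradeoffs: List[Tuple[str, str]] = [
    ("tell an awkward truth", "keep the team calm before a deadline"),
    ("respect a user's stated wish", "intervene when it looks self-destructive"),
]

# (3) The aggregate object in some formalism: a utility function (levels 3 and 4).
Outcome = Dict[str, float]
utility: Callable[[Outcome], float] = lambda outcome: sum(outcome.values())

# Three differently-typed objects. An argument that treats them as interchangeable
# has switched coordinate systems without saying so.
```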
Habit 2: Treat metaethics as the top of the stack, not the starting point
Instead of:
- “Pick a metaethical theory, then throw it at the world,”
it might help to see:
- metaethics as an attempted justification for a whole stack of lower-level choices (data, tensions, concepts, institutions),
- which are themselves often shaped by convenience, politics, and path-dependence.
So “don’t roll your own metaethics” shouldn’t just mean “don’t write your personal moral realism into AGI”. It should also remind us that:
We’ve already rolled our own lower-level coordinate systems,
through engineering practices and organisational decisions,
long before any explicit metaethical story appears.
Habit 3: Stress-test your coordinate system against very different perspectives
When you propose an alignment or safety scheme, it might be useful to ask explicitly:
- “If someone whose lived background and moral intuitions are very different from mine looked at my choice of data, tensions, concepts, and metrics, where would they find it obviously unacceptable?”
If the honest answer is “in a lot of places”, that’s not an automatic veto, but it should lower your confidence about deploying the scheme in high-stakes settings, especially if you can’t clearly articulate why those choices are acceptable despite the objections.
Habit 4: Communicate uncertainty in ways people can actually feel
Instead of only saying “metaethics is very hard”, you might say things like:
“We’re currently treating this metric as if it just is ‘human flourishing’.
That’s an extremely strong bet about how to draw the map.
I don’t think we’ve earned that level of confidence yet.”
Framing it as a strong mapping choice inside a coordinate system tends to feel more concrete than “philosophy is hard”.
4. Questions and invitation
There are two questions I’m particularly interested in, and I’d be grateful for thoughts from people working on alignment / AI safety:
- In your current alignment or safety work, what do you think is the most “invisible” assumption in your coordinate system?
- Is it about what counts as data?
- Which tensions get privileged?
- How concepts are carved up?
- Or which institutional constraints are treated as fixed?
- Have you seen concrete examples where two teams “agreed on goals” but were clearly using different coordinate systems?
- How did that show up in practice?
- Did anyone notice the mismatch explicitly?
You’re very welcome to push back on the framing itself as well.
In particular, I’d be interested to hear:
- whether you think this is a useful complement to Wei Dai’s warning,
- and whether there are places where it obscures more than it clarifies.
I read Please, Don’t Roll Your Own Metaethics as a memo from our future selves, saying:
“We are vastly overconfident about our ability
to push our philosophical views into powerful systems.”
Adding the coordinate-system lens, the memo becomes:
“We’re not only rolling our own metaethics,
we’re rolling our own coordinate systems all the way down.
Until we see those clearly, we should hesitate
before building the future on top of them.”