If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
GreaterWrong calls the same API on the LW server and serves the resulting data to you as HTML. As a result it has the same limitations: if you keep going to the next page on Recent Comments, you'll eventually reach https://www.greaterwrong.com/recentcomments?offset=2020 and get the error "Exceeded maximum value for skip".
My attempt to resurrect the old LW Power Reader has hit an obstacle just before the finish line, due to a limitation in the current LW API. So this is a public appeal to the site admins/devs to relax the limit.
Specifically, my old code relied on LW1 allowing it to fetch all comments posted after a given comment ID, but I can't find anything similar in the current API. I tried reproducing this by using the allRecentComments endpoint in GraphQL, but because the offset parameter is limited to <2000, I can't fetch comments older than a few weeks. The Power Reader is in part designed to allow someone to catch up on or skim weeks' or months' worth of LW comments, hence the need for this functionality.
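For concreteness, here's a minimal sketch of the kind of query I'm making (in Python, with query/field names taken from the AI-generated documentation mentioned below, so they may not exactly match the current schema):

```python
import requests

LW_GRAPHQL_URL = "https://www.lesswrong.com/graphql"

# Fetch one page of recent comments via the allRecentComments view.
COMMENTS_QUERY = """
{
  comments(input: {terms: {view: "allRecentComments", limit: 50, offset: %d}}) {
    results {
      _id
      postedAt
      postId
      htmlBody
    }
  }
}
"""

def fetch_recent_comments(offset):
    # Offsets >= 2000 are rejected by the server ("Exceeded maximum value
    # for skip"), which is the limit I'm asking to have relaxed.
    resp = requests.post(LW_GRAPHQL_URL, json={"query": COMMENTS_QUERY % offset})
    resp.raise_for_status()
    return resp.json()["data"]["comments"]["results"]
```

What I'd ideally want instead is something equivalent to LW1's "all comments after a given comment ID", or failing that a much higher cap on offset.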
As a side effect of this project, my AI agents produced documentation of LW's GraphQL API from LW's source code. (I was unable to find another API reference for it.) I believe it's fairly accurate, as the code written based on it seems to work well aside from the comment-loading limit.
What I had in mind is that they're relatively more esoteric than "AI could kill us all" and yet it's pretty hard to get people to take even that seriously! "Low-propensity-to-persuade-people" maybe?
Yeah, that makes sense. I guess I've been using "illegible" for a similar purpose, but maybe that's not a great word either, because that also seems to imply "hard to understand" but again it seems like these problems I've been writing about are not that hard to understand.
I wish I knew what is causing people to ignore these issues, including people in rationality/EA (e.g., the most famous rationalists have said little about them). I may be slowly growing an audience, e.g., Will MacAskill invited me to do a podcast with his org, and Jan Kulveit just tweeted "@weidai11 is completely right about the risk we won't be philosophically competent enough in time", but it's inexplicable to me how slow it has been, compared to something like UDT, which instantly became "the talk of the town" among rationalists.
Pretty plausible that the same underlying mechanism is also causing the general public to not take "AI could kill us all" very seriously, and I wish I understood that better as well.
I appreciate the attention this brings to the subject, but from my perspective it doesn't sufficiently emphasize the difficulties, or address existing concerns:
(These are of course closely related issues, not independent ones. E.g., much is downstream of the fact that we don't have a good explicit understanding of what philosophy is or should be.)
See some of my earlier writings where I talk about these (and related) difficulties in more detail. (Except for 5, which I perhaps need to write a post about.)
I'd wondered why you wrote so many pieces advising people to be cautious about more esoteric problems arising from AI,
Interesting that you have this impression, whereas I've been thinking of myself recently as doing a "breadth first search" to uncover high level problems that others seem to have missed or haven't bothered to write down. I feel like my writings in the last few years are pretty easy to understand without any specialized knowledge (whereas Google says "esoteric" is defined as "intended for or likely to be understood by only a small number of people with a specialized knowledge or interest").
If on reflection you still think "esoteric" is right, I'd be interested in an expansion on this, e.g. which of the problems I've discussed seem esoteric to you and why.
to an extent that seemed extremely unlikely to be implemented in the real world
It doesn't look like humanity is on track to handle these problems, but "extremely unlikely" seems like an overstatement. I think there are still some paths where we handle these problems better, including 1) warning shots or a shift in the political winds cause an AI pause/stop to be implemented, during which some of these problems/ideas are popularized or rediscovered, and 2) future AI advisors are influenced by my writings, or are strategically competent enough to realize these same problems and help warn/convince their principals.
I also have other motivations including:
I think all sufficiently competent/reflective civilizations (including sovereign AIs) may want to do this, because it seems hard to be certain enough of one's philosophical competence to not do this as an additional check. The cost of running thousands or even millions of such simulations seems very small compared to potentially wasting the resources of an entire universe/lightcone due to philosophical mistakes. Also, they may be running such simulations anyway for other purposes, so it may be essentially free to also gather some philosophical ideas from them, to make sure you didn't miss something important or get stuck in some cognitive trap.
The classic idea from Yudkowsky, Christiano, etc. for what to do in a situation like this is to go meta: Ask the AI to predict what you'd conclude if you were a bit smarter, had more time to think, etc. Insofar as you'd conclude different things depending on the initial conditions, the AI should explain what and why.
Yeah, I might be too corrupted or biased to be a starting point for this. It seems like a lot of people or whole societies might not do well if placed in this kind of situation (of having something like CEV being extrapolated from them by AI), so I shouldn't trust myself either.
You, Wei, are proposing another plan: Ask the AI to simulate thousands of civilizations, and then search over those civilizations for examples of people doing philosophical reasoning of the sort that might appeal to you, and then present it all to you in a big list for you to peruse?
Not a big list to peruse, but more like, to start with, put the whole unfiltered distribution of philosophical outcomes in some secure database, then run relatively dumb/secure algorithms over it to gather statistics/patterns. (Looking at it directly by myself, or using any advanced algorithms/AIs to do so, might expose me/us to infohazards.) For example, I'd want to know what percent of civilizations think they've solved various problems like decision theory, ethics, and metaphilosophy, how many clusters of solutions there are for each problem, and whether there are any patterns/correlations between types/features of intelligence/civilization and what conclusions they ended up with.
This might give me some clues as to which clusters are more interesting/promising/safer to look at, and then I have to figure out what precautions to take before looking at the actual ideas/arguments (TBD, maybe get ideas about this from the simulations too). It doesn't seem like I can get something similar to this by just asking my AI to "do philosophy", without running simulations.
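Purely as an illustration of what I mean by "relatively dumb/secure algorithms" (the record type and fields here are hypothetical, just for the sake of the example):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class CivOutcome:
    """One simulated civilization's (hypothetical) philosophical track record."""
    civ_id: str
    features: dict       # e.g. type/features of intelligence, social structure
    claims_solved: set   # problems the civilization believes it has solved
    clusters: dict       # problem -> label of the solution cluster it landed in

def solved_fraction(outcomes, problem):
    """Fraction of civilizations that think they've solved a given problem."""
    return sum(problem in o.claims_solved for o in outcomes) / len(outcomes)

def cluster_counts(outcomes, problem):
    """How many distinct solution clusters there are for a problem, and their sizes."""
    return Counter(o.clusters[problem] for o in outcomes if problem in o.clusters)
```

The point being that these are simple aggregate queries whose outputs (fractions, cluster counts, correlations) can be inspected without reading any of the actual ideas/arguments.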
I see the application as indirect at this point, basically showing that decision theory is hard and we're unlikely to get it right without an AI pause/stop. See these two posts to get a better sense of what I mean:
Thanks. This sounds like a more peripheral interest/concern, compared to Eliezer/LW's, which was more like: we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to an acausal threat, losing a commitment race, or being unable to cooperate with other AIs.
Do you have any examples of such ideas and techniques? Are any of the ideas and techniques in your paper potentially applicable to general AI alignment?