Wei Dai

If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

My main "claims to fame":

  • Created the first general purpose open source cryptography programming library (Crypto++, 1995).
  • Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
  • Proposed UDT, combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
  • First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
  • Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.

My Home Page

Comments

Alignment as uploading with more steps
Wei Dai · 15h

Thanks for the suggested readings.

I’m trying not to die here.

There are lots of ways to cash out "trying not to die", many of which imply that solving AI alignment (or getting uploaded) isn't even the most important thing. For instance, under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than being original physical entities, so the most important thing from the perspective of indexically-defined survival may be to figure out what the simulators want, or what's least likely to cause them to want to turn off the simulation, or what's most likely to cause them to "rescue" you after you die here. Or, why aim for a "perfectly aligned" AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which AIs may already do by default because of acausal trade; or maybe the best way to ensure this is to increase the cosmic resources available to aligned AIs so they can do more of this kind of trade)?

And because I don’t believe in “correct” values.

The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that's actually a much more troubling situation than you seem to think.

I don’t know how to build a safe philosophically super-competent assistant/oracle

That's in part why I'd want to attempt this only after a long pause (i.e., at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.

Christian homeschoolers in the year 3000
Wei Dai · 15h

I've been talking about the same issue in various posts and comments, most prominently in Two Neglected Problems in Human-AI Safety. It feels like an obvious problem that (confusingly) almost no one talks about, so it's great to hear another concerned voice.

A potential solution I've been mooting is "metaphilosophical paternalism", or having AI provide support and/or error correction for humans' philosophical reasoning, based on a true theory of metaphilosophy (i.e., understanding of what philosophy is and what constitutes correct philosophical reasoning), to help them defend against memetic attacks and internal errors. So this is another reason I've been advocating for research into metaphilosophy, and for pausing AI (presumably for at least multiple decades) until metaphilosophy (and not just AI alignment, unless broadly defined to imply a solution to this problem) can be solved.

On your comment about "centrally enforced policy" being "kind of fucked up and illiberal", I think there is some hope that, given enough time and effort, there can be a relatively uncontroversial solution to metaphilosophy[1] that most people can agree on at the end of the AI pause, so central enforcement wouldn't be needed. Failing that, perhaps we should take a look at what the metaphilosophy landscape looks like after a lot of further development, and then collectively make a decision on how to proceed.

I'm curious if this addresses your concern, or if you see a differently shaped potential solution.


  1. similar to how there's not a huge amount of controversy today about what constitutes correct mathematical or scientific reasoning, although I'd want to aim for even greater certainty/clarity than that ↩︎

How To Dress To Improve Your Epistemics
Wei Dai · 21h

Seems like a good question to prompt AI with. Here's what I got from Gemini 2.5 Pro.

Alignment as uploading with more steps
Wei Dai · 21h

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates' values/views are not representative of humanity or of themselves. (For example, philosophical competence is highly correlated with or causes atheism, which would make atheists highly overrepresented among the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise, how could you ensure that you personally would be uploaded, i.e., why would the initial batches of uploads necessarily decide to upload everyone else once they've gained power? Maybe I should have clarified this with you first.

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

Alignment as uploading with more steps
Wei Dai · 1d

I think 4 is basically right

Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard or impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at the opposite conclusion of "non-upload necessity" given a different assumption.)

(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.

Alignment as uploading with more steps
Wei Dai · 2d

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example, one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind: "how to map fuzzy human preferences to well-defined preferences", or creating an AI that can optimize the universe according to such well-defined preferences?
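(As an aside, to make concrete what a "utility function over conjunctions of well-formed sentences" might look like, here is a minimal illustrative sketch, not anything from that post or a real implementation; all names are hypothetical placeholders.)

```python
# Illustrative sketch only: an agent whose preferences are a utility function
# over logical sentences, and which picks the policy whose (believed) logical
# consequences score best. "Sentence", "credence", and "utility" are
# hypothetical placeholders, not a real API.
from typing import Callable, List

Sentence = str  # stand-in for a well-formed sentence of some formal set theory

def expected_value(policy: str,
                   consequences: Callable[[str], List[Sentence]],
                   credence: Callable[[Sentence], float],
                   utility: Callable[[Sentence], float]) -> float:
    """Score a policy by the sentences it is believed to make true,
    weighted by the agent's credence in each logical connection."""
    return sum(credence(s) * utility(s) for s in consequences(policy))

def choose_policy(policies: List[str], consequences, credence, utility) -> str:
    # "Optimize mathematics": choose the policy whose logical consequences
    # the agent most prefers.
    return max(policies, key=lambda p: expected_value(p, consequences, credence, utility))

# Toy usage: two candidate policies, each believed to make one sentence true.
toy_consequences = {"A": ["S1"], "B": ["S2"]}
best = choose_policy(
    ["A", "B"],
    consequences=lambda p: toy_consequences[p],
    credence=lambda s: 1.0,                       # fully confident in the logical link
    utility=lambda s: {"S1": 1.0, "S2": 2.0}[s],  # prefers the world where S2 holds
)
print(best)  # -> "B"
```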

I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):

  • 3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
  • 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.

If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences", but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?

I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or might simply not have the inclination/motivation to do this.

  1. ^

    which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness"

Obligated to Respond
Wei Dai · 6d

If you get around to writing that post, please consider/address:

  • Theory of the second best - "The economists Richard Lipsey and Kelvin Lancaster showed in 1956 that if one optimality condition in an economic model cannot be satisfied, it is possible that the next-best solution involves changing other variables away from the values that would otherwise be optimal." - Generalizing from this, given that humans deviate from optimal rationality in all kinds of unavoidable ways, the "second-best" solution may well involve belief in some falsehoods.
  • Managing risks while trying to do good - We're all very tempted to overlook risks while trying to do good, including (in this instance) destroying "that which can be destroyed by truth".
My talk on AI risks at the National Conservatism conference last week
Wei Dai · 6d

Yes, on the surface all you did was point out an overlap between Rationalists and other groups, but what I don't understand is why you chose to emphasize this particular overlap, instead of, for example, the overlap between us and conservatives in wanting to stop ASI from being built, or simply leaving the Rationalists out of this speech and talking about us another time when you can speak with more nuance.

My hypotheses:

  1. You just want to speak the truth as you see it, without regard to the political consequences. You had room to insert "Rationalist" into that derogatory sentence, but not room to say something longer about how rationalists and conservatives should be allies in this fight.
  2. You had other political considerations that you can't make explicit here, e.g. trying to signal honesty or loyalty to your new potential allies, or preempting a possible attack from other conservatives that you're a Rationalist who shouldn't be trusted (e.g. because we're generally against religion).

I'm leaning strongly towards 2 (as 1 seems implausible given the political nature of the occasion), but still find it quite baffling, in part because it seems like you probably could have found a better way to accomplish what you wanted, without as many of the negative consequences (i.e., alienating the community that originated much of the thinking on AI risk, and making future coalition-building between our communities more difficult).

I think I'll stop here and not pursue this line of questioning/criticism further. Perhaps you have some considerations or difficulties that are hard to talk about and for me to appreciate from afar.

My talk on AI risks at the National Conservatism conference last week
Wei Dai · 7d

“rationalists”

Thanks, I had missed this in my reading. It does seem a strange choice to include in the speech (in a negative way) if the goal is to build a broad alliance against building ASI. Many rationalists are against building ASI in our current civilizational state, including Eliezer, who started the movement/community.

@geoffreymiller, can you please explain your thought process for including this word in your sentence? I'm really surprised that you seem to consider yourself a rationalist (using "we" in connection with rationalism and arguing against people who do not consider you to be a community member "in good standing"[1]) and also talk about us in an antagonistic/unfriendly way in front of others, without some overriding reason that I can see.

  1. ^

    I had upvoted a bunch of your comments in that thread, thinking that we should consider you a member in good standing.

peterbarnett's Shortform
Wei Dai · 8d

Thanks for this explanation, it definitely makes your position more understandable.

and on top of that there is the abstract idea of "good", saying you shouldn't hurt the weak at all. And that idea is not necessitated by rational negotiation. It's just a cultural artifact that we ended up with, I'm not sure how.

I can think of 2 ways:

  1. It ended up there the same way that all the "nasty stuff" ended up in our culture, more or less randomly, e.g. through the kind of "morality as status game" talked about in Will Storr's book, which I quote in Morality is Scary.
  2. It ended up there via philosophical progress, because it's actually correct in some sense.

If it's 1, then I'm not sure why extrapolation and philosophy will pick out the "good" and leave the "nasty stuff". It's not clear to me why aligning to culture would be better than aligning to individuals in that case.

If it's 2, then we don't need to align with culture either - AIs aligned with individuals can rederive the "good" with competent philosophy.

Does this make sense?

So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.

It seems clear that technical design or training choices can make a difference (but nobody is working on this). Consider the analogy with the US vs Chinese education systems, where the US system seems to produce a lot more competence and/or interest in philosophy (relative to STEM) compared to the Chinese system. And comparing humans with LLMs, it sure seems like they're on track to exceed (top) human level in STEM while being significantly less competent in philosophy.

Posts

  • Wei Dai's Shortform (2y)
  • Managing risks while trying to do good (2y)
  • AI doing philosophy = AI generating hands? (2y)
  • UDT shows that decision theory is more puzzling than ever (2y)
  • Meta Questions about Metaphilosophy (2y)
  • Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? (3y)
  • How to bet against civilizational adequacy? (3y)
  • AI ethics vs AI alignment (3y)
  • A broad basin of attraction around human values? (3y)
  • Morality is Scary (4y)