Wei Dai

If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

My main "claims to fame":

  • Created the first general purpose open source cryptography programming library (Crypto++, 1995).
  • Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
  • Proposed UDT, combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
  • First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
  • Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.

My Home Page

Comments

Christian homeschoolers in the year 3000
Wei Dai · 5h

Yeah I think this outcome is quite plausible, which is in part why I only claimed "some hope". But

  1. It's also quite plausible that it won't be like that, for example maybe a good solution to meta-philosophy will be fairly attractive to everyone despite invalidating deeply held object-level beliefs, or it only clearly invalidates such beliefs after being applied with a lot of time/compute, which won't be available yet so people won't reject the meta-philosophy based on such invalidations.
  2. "What should be done if some/many people do reject the meta-philosophy based on it invalidating their beliefs?" is itself a philosophical question which the meta-philosophy could directly help us answer by accelerating philosophical progress, and/or that we can better answer after having a firmer handle on the nature of philosophy and therefore the ethics of changing people's philosophical beliefs. Perhaps the conclusion will be that symmetrical persuasion tactics, or centrally imposed policies, are justified in this case. Or maybe we'll use the understanding to find more effective asymmetrical or otherwise ethical persuasion tactics.

Basically my hope is that things become a lot clearer after we have a better understanding of metaphilosophy, as it seems to be a major obstacle to determining what should be done about the kind of problem described in the OP. I'm still curious whether you have any other solutions or approaches in mind.

Christian homeschoolers in the year 3000
Wei Dai · 7h

I mean greater certainty/clarity than our current understanding of mathematical reasoning, which seems to me far from complete (e.g., realism vs formalism is unsettled, what is the deal with Berry's paradox, etc). By the time we have a good meta-philosophy, I expect our philosophy of math will be much improved too.

If there is not a good meta-philosophy to find even in the sense of matching/exceeding our current level of understanding of mathematical reasoning, which I think is plausible, it would be a very strange and confusing state of affairs, as it would mean that in all or most fields of philosophy there is no objective or commonly agreed way to determine how good an argument is, or whether some statement is true or false, even given infinite compute or subjective time, including fields that seemingly should have objective answers like philosophy of math or meta-ethics. (Lots of people claim that morality is subjective, but almost nobody claims that "morality is subjective" is itself subjective!)

If after lots and lots of research (ideally with enhanced humans), we just really can't find a good meta-philosophy, I would hope that we can at least find some clues as to why this is the case, or some kind of explanation that makes the situation less confusing, and then use those clues to guide us as to what to do next, as far as how to handle super-persuasion, etc.

Ethical Design Patterns
Wei Dai · 18h

IMO, it’s hard to get a consensus for Heuristic C at the moment even though it kind of seems obvious.

Consider that humanity couldn't achieve a consensus around banning or not using cigarettes, leaded gasoline, or ozone-destroying chemicals until they had done a huge amount of highly visible damage. Arguments about their potential danger based on established science, and clear empirical evidence of the damage they were actually causing, must have been available far earlier, but consensus still failed to form until much later, after catastrophic amounts of damage had already been done. The consensus against drunk driving likewise only formed after extremely clear and undeniable evidence of its danger (based on accident statistics) became available.

I'm skeptical that more intentionally creating ethical design patterns could have helped such consensus form earlier in those cases, or in the case of AI x-safety, as it just doesn't seem to address the main root causes or bottlenecks for the lack of such consensus or governance failures, which IMO are things like:

  1. natural diversity of human opinions, when looking at the same set of arguments/evidence
  2. lack of extremely clear/undeniable evidence of harm
  3. democracy's natural difficulties around concentrated interests imposing diffused harms (due to "rational ignorance" of voters and collective action problems)

Something that's more likely to work is "persuasion design patterns", like what helped many countries pass anti-GMO legislation despite the lack of clear scientific evidence that GMOs are harmful, but I think we're all loath to use such tactics.

Buck's Shortform
Wei Dai · 9d

I've been reading a lot of web content, including this post, after asking my favorite LLM[1] to "rewrite it in Wei Dai's style" which I find tends to make it shorter and easier for me to read, while still leaving most of the info intact (unlike if I ask for a summary). Before I comment, I'll check the original to make sure the AI's version didn't miss a key point (or read the original in full if I'm sufficiently interested), and also ask the AI to double-check that my comment is sensible.


  1. currently Gemini 2.5 Pro because it's free through AI Studio, and the rate limit is high enough that I've never hit it ↩︎
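A minimal sketch of this kind of rewriting workflow, assuming the google-genai Python SDK and an AI Studio API key (the prompt wording and function name are illustrative, not necessarily the exact setup described above):

```python
# Sketch: rewrite a post in a given author's style via Gemini on AI Studio.
# Assumes the google-genai SDK (`pip install google-genai`) and an API key.
from google import genai

client = genai.Client(api_key="YOUR_AI_STUDIO_KEY")

def rewrite_in_style(post_text: str, style: str = "Wei Dai") -> str:
    """Ask the model to rewrite a post in `style`, keeping the substance intact."""
    prompt = (
        f"Rewrite the following post in {style}'s style: concise and plain, "
        f"preserving all substantive points rather than summarizing.\n\n{post_text}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",  # model named in the footnote above
        contents=prompt,
    )
    return response.text
```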

Alignment as uploading with more steps
Wei Dai · 15d

Thanks for the suggested readings.

I’m trying not to die here.

There are lots of ways to cash out "trying not to die", many of which imply that solving AI alignment (or getting uploaded) isn't even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what's least likely to cause them to want to turn off the simulation or most likely to "rescue" you after you die here. Or, why aim for a "perfectly aligned" AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?

And because I don’t believe in “correct” values.

The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that's actually a much more troubling situation than you seem to think.

I don’t know how to build a safe philosophically super-competent assistant/oracle

That's in part why I'd want to attempt this only after a long pause (i.e., at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.

Christian homeschoolers in the year 3000
Wei Dai · 15d

I've been talking about the same issue in various posts and comments, most prominently in Two Neglected Problems in Human-AI Safety. It feels like an obvious problem that (confusingly) almost no one talks about, so it's great to hear another concerned voice.

A potential solution I've been mooting is "metaphilosophical paternalism", or having AI provide support and/or error correction for humans' philosophical reasoning, based on a true theory of metaphilosophy (i.e., understanding of what philosophy is and what constitutes correct philosophical reasoning), to help them defend against memetic attacks and internal errors. So this is another reason I've been advocating for research into metaphilosophy, and for pausing AI (presumably for at least multiple decades) until metaphilosophy (and not just AI alignment, unless broadly defined to imply a solution to this problem) can be solved.

On your comment about "centrally enforced policy" being "kind of fucked up and illiberal", I think there is some hope that, given enough time and effort, there can be a relatively uncontroversial solution to metaphilosophy[1] that most people can agree on at the end of the AI pause, so central enforcement wouldn't be needed. Failing that, perhaps we should take a look at what the metaphilosophy landscape looks like after a lot of further development, and then collectively make a decision on how to proceed.

I'm curious if this addresses your concern, or if you see a differently shaped potential solution.


  1. similar to how there's not a huge amount of controversy today about what constitutes correct mathematical or scientific reasoning, although I'd want to aim for even greater certainty/clarity than that ↩︎

How To Dress To Improve Your Epistemics
Wei Dai · 15d

Seems like a good question to prompt AI with. Here's what I got from Gemini 2.5 Pro.

Alignment as uploading with more steps
Wei Dai · 15d

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many people because the candidates' values/views are not representative of humanity or of themselves. (For example, philosophical competence is highly correlated with, or causes, atheism, making atheism highly overrepresented among the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise how could you ensure that you personally would be uploaded, i.e., why would the initial batches of uploads necessarily decide to upload everyone else once they've gained power? Maybe I should have clarified this with you first.

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

Alignment as uploading with more steps
Wei Dai · 15d

I think 4 is basically right

Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at an opposite conclusion of "non-upload necessity" given a different assumption.)

(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.

Alignment as uploading with more steps
Wei Dai · 16d

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind, "how to map fuzzy human preferences to well-defined preferences" or creating an AI that can optimize the universe according to such well-defined preferences?
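(As a rough illustration of the quoted proposal, with notation I'm adding here rather than taking from that post, the preferences and decision rule might be written as follows.)

```latex
% Illustrative formalization (my notation, not from the original post).
% Let $\mathcal{C}$ be the set of conjunctions of well-formed sentences of a formal
% set theory, and let $U : \mathcal{C} \to \mathbb{R}$ be the AI's utility function
% over such conjunctions. With source code $S$ and "mathematical intuition" $P$
% (a probability assignment over mathematical statements), the AI picks, for each
% input $x$, the output
\[
  a^*(x) \;=\; \operatorname*{arg\,max}_{a}\;
    \sum_{\phi \in \mathcal{C}} P\big(\phi \mid \text{``}S(x) = a\text{''}\big)\, U(\phi),
\]
% restricting attention to the $\phi$ whose truth (A) logically depends on its
% decisions and (B) it can reason or form intuitions about.
```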

I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):

  • 3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
  • 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.

If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences" but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?

I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or might simply lack the inclination/motivation to do so.

  1. which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness" ↩︎

Posts

  • 10 · Wei Dai's Shortform (Ω) · 2y · 198
  • 65 · Managing risks while trying to do good · 2y · 26
  • 46 · AI doing philosophy = AI generating hands? (Ω) · 2y · 23
  • 224 · UDT shows that decision theory is more puzzling than ever (Ω) · 2y · 56
  • 163 · Meta Questions about Metaphilosophy (Ω) · 2y · 80
  • 34 · Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? (Q) · 3y · 15
  • 55 · How to bet against civilizational adequacy? (Q) · 3y · 20
  • 5 · AI ethics vs AI alignment · 3y · 1
  • 115 · A broad basin of attraction around human values? (Ω) · 3y · 18
  • 234 · Morality is Scary (Ω) · 4y · 116

Wikitag Contributions

  • Carl Shulman · 2 years ago
  • Carl Shulman · 2 years ago · (-35)
  • Human-AI Safety · 2 years ago
  • Roko's Basilisk · 7 years ago · (+3/-3)
  • Carl Shulman · 8 years ago · (+2/-2)
  • Updateless Decision Theory · 12 years ago · (+62)
  • The Hanson-Yudkowsky AI-Foom Debate · 13 years ago · (+23/-12)
  • Updateless Decision Theory · 13 years ago · (+172)
  • Signaling · 13 years ago · (+35)
  • Updateless Decision Theory · 14 years ago · (+22)