If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
I mean greater certainty/clarity than our current understanding of mathematical reasoning, which seems to me far from complete (e.g., realism vs formalism is unsettled, what is the deal with Berry's paradox, etc). By the time we have a good meta-philosophy, I expect our philosophy of math will be much improved too.
There may not be a good meta-philosophy to find, even in the sense of matching/exceeding our current level of understanding of mathematical reasoning. I think that's plausible, but it would be a seemingly very strange and confusing state of affairs, as it would mean that in all or most fields of philosophy there is no objective or commonly agreed way to determine how good an argument is, or whether some statement is true or false, even given infinite compute or subjective time, including fields that seemingly should have objective answers like philosophy of math or meta-ethics. (Lots of people claim that morality is subjective, but almost nobody claims that "morality is subjective" is itself subjective!)
If after lots and lots of research (ideally with enhanced humans), we just really can't find a good meta-philosophy, I would hope that we can at least find some clues as to why this is the case, or some kind of explanation that makes the situation less confusing, and then use those clues to guide us as to what to do next, as far as how to handle super-persuasion, etc.
IMO, it’s hard to get a consensus for Heuristic C at the moment even though it kind of seems obvious.
Consider that humanity couldn't achieve a consensus around banning or not using cigarettes, leaded gasoline, or ozone-destroying chemicals, until they had done a huge amount of highly visible damage. There must have been plenty of arguments about their potential danger based on established science, and clear empirical evidence of the damage that they actually caused, far earlier, but such consensus still failed to form until much later, after catastrophic amounts of damage had already been done. The consensus against drunk driving also only formed after extremely clear and undeniable evidence about its danger (based on accident statistics) became available.
I'm skeptical that more intentionally creating ethical design patterns could have helped such consensus form earlier in those cases, or in the case of AI x-safety, as it just doesn't seem to address the main root causes or bottlenecks for the lack of such consensus or governance failures, which IMO are things like:
Something that's more likely to work is "persuasion design patterns", like what helped many countries pass anti-GMO legislation despite the lack of clear scientific evidence of harm from GMOs, but I think we're all loath to use such tactics.
I've been reading a lot of web content, including this post, after asking my favorite LLM[1] to "rewrite it in Wei Dai's style" which I find tends to make it shorter and easier for me to read, while still leaving most of the info intact (unlike if I ask for a summary). Before I comment, I'll check the original to make sure the AI's version didn't miss a key point (or read the original in full if I'm sufficiently interested), and also ask the AI to double-check that my comment is sensible.
currently Gemini 2.5 Pro because it's free through AI Studio, and the rate limit is high enough that I've never hit it ↩︎
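For anyone who wants to replicate this workflow programmatically rather than through the AI Studio UI, here is a minimal sketch using the `google-generativeai` Python SDK. The model id, prompts, and helper names are illustrative assumptions, not an exact record of what I do:

```python
# Minimal sketch of the reading workflow described above.
# Assumptions: the google-generativeai SDK, an API key in GOOGLE_API_KEY,
# and "gemini-2.5-pro" as the model id; adjust to whatever you actually use.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

def rewrite_in_style(text: str, style: str = "Wei Dai") -> str:
    """Ask for a style rewrite rather than a summary, so most info is kept."""
    prompt = f"Rewrite the following in {style}'s style, keeping all key points:\n\n{text}"
    return model.generate_content(prompt).text

def sanity_check_comment(original_post: str, draft_comment: str) -> str:
    """Second pass: check a draft comment against the original post."""
    prompt = (
        "Here is a post and a draft comment on it. Point out anything in the "
        "comment that misreads or misses a key point of the post.\n\n"
        f"POST:\n{original_post}\n\nDRAFT COMMENT:\n{draft_comment}"
    )
    return model.generate_content(prompt).text
```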
Thanks for the suggested readings.
> I’m trying not to die here.
There are lots of ways to cash out "trying not to die", many of which imply that solving AI alignment (or getting uploaded) isn't even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what's least likely to cause them to want to turn off the simulation or most likely to "rescue" you after you die here. Or, why aim for a "perfectly aligned" AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?
> And because I don’t believe in “correct” values.
The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that's actually a much more troubling situation than you seem to think.
> I don’t know how to build a safe philosophically super-competent assistant/oracle
That's in part why I'd want to attempt this only after a long pause (i.e., at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.
I've been talking about the same issue in various posts and comments, most prominently in Two Neglected Problems in Human-AI Safety. It feels like an obvious problem that (confusingly) almost no one talks about, so it's great to hear another concerned voice.
A potential solution I've been mooting is "metaphilosophical paternalism", or having AI provide support and/or error correction for humans' philosophical reasoning, based on a true theory of metaphilosophy (i.e., understanding of what philosophy is and what constitutes correct philosophical reasoning), to help them defend against memetic attacks and internal errors. So this is another reason I've been advocating for research into metaphilosophy, and for pausing AI (presumably for at least multiple decades) until metaphilosophy (and not just AI alignment, unless broadly defined to imply a solution to this problem) can be solved.
On your comment about "centrally enforced policy" being "kind of fucked up and illiberal", I think there is some hope that, given enough time and effort, there can be a relatively uncontroversial solution to metaphilosophy[1] that most people can agree on at the end of the AI pause, so central enforcement wouldn't be needed. Failing that, perhaps we should take a look at what the metaphilosophy landscape looks like after a lot of further development, and then collectively make a decision on how to proceed.
I'm curious if this addresses your concern, or if you see a differently shaped potential solution.
similar to how there's not a huge amount of controversy today about what constitutes correct mathematical or scientific reasoning, although I'd want to aim for even greater certainty/clarity than that ↩︎
> Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.
I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
> At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.
This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates' values/views are not representative of humanity or of the objectors themselves. (For example, philosophical competence is highly correlated with or causes atheism, which would make atheists highly overrepresented among the initial candidates.)
I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise how could you ensure that you personally would be uploaded, i.e., why would the initial batches of uploads necessarily decide to upload everyone else once they've gained power? Maybe I should have clarified this with you first.
My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
> better understood through AIT and mostly(?) SLT
Any specific readings or talks you can recommend on this topic?
> I think 4 is basically right
Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at the opposite conclusion of "non-upload necessity" given a different assumption.)
(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)
> I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).
Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.
> Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).
What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?
> I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.
Yeah, this is covered under position 5 in the above linked post.
> unrelatedly, I am still not convinced we live in a mathematical multiverse
Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.
> Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans
What's the main reason(s) that you think this? For example one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind, "how to map fuzzy human preferences to well-defined preferences" or creating an AI that can optimize the universe according to such well-defined preferences?
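To make the quoted proposal slightly more concrete (this is just my own gloss, not notation from that post): let $T$ be a formal set theory, $\mathcal{A}$ the AI's possible outputs, and $U$ a utility function over sets of sentences. Then the AI chooses

$$a^* \in \operatorname*{arg\,max}_{a \in \mathcal{A}} \; U\Big(\big\{\phi : T \cup \{\text{“AI}() = a\text{”}\} \vdash \phi\big\}\Big),$$

i.e., it optimizes the mathematical facts that are logically dependent on its decision (clause A), to the extent it can reason about them (clause B).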
I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):
- 3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
- 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.
If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences" but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?
I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e., that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or might simply lack the inclination/motivation to do so.
which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness" ↩︎
Yeah I think this outcome is quite plausible, which is in part why I only claimed "some hope". But
Basically my hope is that things become a lot clearer after we have a better understanding of metaphilosophy, as it seems to be a major obstacle to determining what should be done about the kind of problem described in the OP. I'm still curious whether you have any other solutions or approaches in mind.