Alignment as uploading with more steps

by Cole Wyeth
14th Sep 2025
AI Alignment Forum
16 min read

33 comments

[-]cousin_it 2mo

A world of competing human emulations is a world I would actually want to live in

I think there's a huge danger of people running private servers full of emulations and doing anything they want to them, undetectably. Desire for power over others is a very real thing, in some people at least. Maybe the government could prevent it by oversight; but in modern democracies a big factor of stability is that people could rise up and feasibly overthrow the government. Emulations on private servers wouldn't have that power, so I don't expect governments to stably defend their rights. That defense will wash out over time, coming to agree more with the interests of those who can actually influence government. In short, this leads to emulation-world being very bad and I don't want it.

The same arguments would apply to our world if governments got armies of autonomous drones, for example. Whenever I imagine possible worlds, the distribution of power is the first thing I think about. It makes the problem more real: it's very hard to imagine a nice future world that works.

[-]Cole Wyeth 1mo

I don’t necessarily disagree that these guesses are plausible, but I don’t think it’s possible to predict exactly what emulation world ends up looking like, and even your high level description of the dynamics looks very likely to be wrong.

The goal is to become one of the early emulations and shape the culture, regulations, technology etc. into a positive and stable form - or at least, into carefully chosen initial conditions.

[-]cousin_it 1mo

My argument goes something like this: 1) throughout history, big differences in power have been a recipe for abuse; 2) uploading allows bigger power differences than ever existed before. It's a big concern to me and I'm not sure we can "wing it", it's better to have a plan now.

[-]Cole Wyeth 1mo

I don’t find this sketch of an argument very convincing. Like, yes I agree we should have a plan, but by default if it looks like uploading is becoming practical a massive amount of intellectual labor will go into constructing a plan, and even now I can see various reasonable plans. Basically I feel like this is an isolated demand for rigor. 

[-]Random Developer 2mo

I have two concerns.

Dangerously poor alignment of individual humans. My concern with this plan is that some humans are very poorly aligned to each other. And even if you could "upload" these people and get an AI that was flawlessly aligned to their values, you'd still have a dangerously rogue intelligence on the loose.

Some examples:

  • People joke about CEOs being high in Dark Triad traits. I met one who was charming, good with people, and almost completely amoral. Think Anthony Hopkins as Hannibal Lecter, but without the cannibalism (I assume). He appeared to place zero moral value on other people. He is one of the creepiest people I've ever met, once I saw through the mask. He had this effect on a lot of people.
  • I occasionally volunteer for a political party. Most of their elected officials are ordinary, well-meaning people. At least one of them is a notoriously manipulative user who shouldn't be allowed near power and who should be avoided on a personal level.
  • I could name any number of billionaires and politicians who are either slipping out of touch with consensus reality in strange ways, or unrepentantly willing to lie and use people to get more power.
  • Then there are any number of otherwise decent people whose highest moral values include controlling other people's behavior very strictly. For example, for some of my distant ancestors, it wasn't enough to be free to worship a God of their choice. They had that, and they left. What they wanted was to build communities where nobody was allowed to disagree, under threat of government force.

Even if you could perfectly align an AI around any of these people's values, I would still consider it an existential risk on the same level as (say) SkyNet. In the case of my religious ancestors, the risks might be worse than mere extinction. Some of those people might have willingly employed cognitive control strategies that I would consider a fate considerably worse than death. And there have been a few historic preachers who were suspiciously gleeful about the existence of Hell. Somewhere out there, there is at least one human who would lovingly recreate Hell and start damning people to it, if they had the power.

Competitive pressures forcing a leap from human-aligned AGI to essentially alien ASI. Let's assume that we actually solve "faithful" uploading, and we somehow ban uploading any rich and powerful sociopaths.

Now let's imagine that Corporation/Government A uses only uploaded humans. Corporation/Government B, however, is willing to build custom minds from the ground up, giving them a working memory with a million items (instead of 7±2), the ability to fork and reintegrate sub-personas, the ability to do advanced math intuitively, the ability to think at super-human speeds, the ability to one-shot complex software with minimal planning (and use output formats that rely on and integrate directly into the million-item working memory), and a hundred other tweaks I'm not smart enough to imagine. They willingly choose to "break compatibility" with human neural architectures in ways that fundamentally change the minds they're building, in order to get minds that even von Neumann or Feynman would agree are so smart that they're a bit creepy.

If Corporation/Government A limits themselves to human uploads, and Corporation/Government B is willing to sacrifice all "human compatibility" to maximize intelligence, who wins?

[-]Cole Wyeth 2mo

The first concern seems like a much smaller risk than the one we currently face from unaligned AI. To be clear, I'm suggesting emulations of a relatively large number of people (more than 10, at least once the technology has been well tested, and eventually perhaps everyone). If some of them turn out to be evil sociopaths, the others will just have to band together and enforce norms, exactly like we do now.

The second concern sounds like gradual disempowerment to me. However, I think there are a lot of ways for Corporation A to win. Perhaps Corporation B is regulated out of existence - reckless modifications should violate some sort of human alignment code. Perhaps we learn how to recursively self-improve as emulations, in such a way that the alignment tax is near 0, and then just ensure that initial conditions modestly favor Corporation A (most companies adopt reasonable standards, and over time control most of the resources). Or perhaps corporate power is drastically reduced and emulations are able to coordinate once their intelligence is sufficiently boosted. Or perhaps a small team of early emulations performs a pivotal act. Basically, I think this is something our emulations can figure out.

[-]Matt Goldenberg 2mo

I'm pretty sure that the me from 10 years ago is aligned to different values than the me of today, so I suspect a copy running much faster than me would quickly diverge. 

And that's just a normal-speed version of me; one that experienced the world much faster would have a very different experience of the world. As a small example, conversations would be more boring, but I'd also be more skilled at them, so things would diverge much faster.

[-]Cole Wyeth 2mo

Maybe, but we usually endorse the way that our values change over time, so this isn’t necessarily a bad thing.

Also, I find it hard to imagine hating my past self so much that I would want to kill him or allow him to be killed. I feel a certain protectiveness and affection for my self 10 or 15 years ago. So I feel like at least weak upload sufficiency should hold, do you disagree?

[-]Matt Goldenberg 2mo

but we usually endorse the way that our values change over time, so this isn’t necessarily a bad thing.

 

I'm pretty skeptical of this. Of course it seems that way, because we are the ones with the new values, but I think this is like 70% just a tautology of valuing the things we currently value, and 20% a psychological thing that justifies our decisions in retrospect and makes them seem more consistent than they are, and only 10% any sort of actual consistency effect where, if I asked myself at time x whether he endorses the value changes I've made by future time y, past me would say "yes, y is better than x".

Also, I find it hard to imagine hating my past self so much that I would want to kill him or allow him to be killed. 

 

I could easily imagine a future version of myself, after e.g. hundreds of years of value drift, whom I would see as horrifying and no longer consider to be me.

[-]Cole Wyeth 1mo

Skill issue, past me endorses current me. 

[-]Matt Goldenberg 1mo

I doubt this; it's very hard to achieve given developmental issues with stuff like shifting hormones.

[-]Matt Goldenberg 1mo

For instance I bet the you of 4 or 5 would want you to spend your money on much more candy and toys than the you of today. 

[-]Cole Wyeth 1mo

Eh, the me of 4 or 5 wanted to play with swords, I still want to play with swords. I guess I’m less interested in toys, but I think that was mostly because my options were restricted (the things I like to do now were not possible).

Anyway, I think this is the wrong framing. Our minds develop into maturity from child->adult; after that they're a lot more stable. I'm not even sure children are complete agents.

[-]Matt Goldenberg 1mo

It's true our preferences get more stable as we get older but I still think over the course of decades they change. We're typically bad at predicting what we'll want in 10 years even at much older ages. 

[-]the gears to ascension 2mo

https://www.lesswrong.com/posts/3SDjtu6aAsHt4iZsR/davey-morse-s-shortform?commentId=3mDiPDcE4wfFnaoDt

[-]Wei Dai 1mo

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind, "how to map fuzzy human preferences to well-defined preferences" or creating an AI that can optimize the universe according to such well-defined preferences?

I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):

  • 3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
  • 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.

If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences" but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?

I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or simply don't have the inclination/motivation to do this.

  1. ^

    which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness"

[-]Cole Wyeth 1mo

I think 4 is basically right, though human values aren’t just fuzzy, they’re also quite complex, perhaps on the order of complexity of the human’s mind, meaning you pretty much have to execute the human’s mind to evaluate their values exactly. 
Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds’. However, even those people usually have somewhat self-referential preferences in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it. 

For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic). I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose). 

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment. 

Unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely, for the ensemble reasons we have discussed before.)

[-]Wei Dai 1mo

I think 4 is basically right

Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at an opposite conclusion of "non-upload necessity" given a different assumption.)

(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.

[-]Cole Wyeth 1mo

My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of it, but as long as some uploads can figure their value completion out, those uploads can prosper. Or if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable. 

I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions. 

Yes, generalization in deep learning is hard, but it's rapidly getting more effective in practice and better understood through AIT and mostly(?) SLT.
I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high level reason for optimism.

[-]Wei Dai 1mo

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates' values/views are not representative of humanity or themselves. (For example philosophical competence is highly correlated with or causes atheism, making it highly overrepresented in the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they've gained power. Maybe I should have clarified this with you first.

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

[-]Cole Wyeth 1mo

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

I am also scared of S-risks, but these can be prevented through effective governance of an emulation society. We don't have a great track record of this so far (we have animal cruelty laws but also factory farming), and it's not clear to me whether it's generally easier or harder to manage in an emulation society (surveillance is potentially easier, but the scale of S-risks is much larger). So, this is a serious challenge that we will have to meet (e.g. by selecting the first few batches of uploads carefully and establishing regulations) but it seems to be somewhat distinct from alignment. 

I am less concerned about wasting (say) 10-20% of astronomical potential. I'm trying not to die here. Also, I don't think it's likely to be in the tens, because most of my preferences seem to have diminishing returns to scale. And because I don't believe in "correct" values. 

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they've gained power. Maybe I should have clarified this with you first.

I can't ensure that I will be, though I will fight to make it happen. If I were, I would probably try to upload a lot of rationalists in the second batch (and not, say, become a singleton). 

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

I would like to pause AI, I'm not sure solving metaphilosophy is in reach (though I have no strong commitment that it isn't), and I don't know how to build a safe philosophically super-competent assistant/oracle - or for that matter a safe superintelligence of any type (except possibly at a very high alignment tax by one of Michael K. Cohen's proposals), unless it is (effectively) an upload, in which case I at least have a vague plan.

Any specific readings or talks you can recommend on this topic?

I am trying to invent a (statistical learning) theory of meta-(online learning). I have not made very much progress yet, but there is a sketch here: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The idea is based on "getting around" Shane Legg's argument that there is no elegant universal learning algorithm by taking advantage of pretraining to increase the effective complexity of a simple learning algorithm: https://arxiv.org/abs/cs/0606070

I did some related preliminary experiments: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The connection to SLT would look something like what @Lucius Bushnaq has been studying, except it should be the online learning algorithm that is learned: https://www.alignmentforum.org/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait

David Quarel and others at Timaeus presented on singular learning theory for reinforcement learning at ILIAD 2. I missed it (and their results don't seem to be published yet). Ultimately, I want something like this but for online decision making = history-based RL.

[-]Wei Dai 1mo

Thanks for the suggested readings.

I’m trying not to die here.

There are lots of ways to cash out "trying not to die", many of which imply that solving AI alignment (or getting uploaded) isn't even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what's least likely to cause them to want to turn off the simulation or most likely to "rescue" you after you die here. Or, why aim for a "perfectly aligned" AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?

And because I don’t believe in “correct” values.

The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that's actually a much more troubling situation than you seem to think.

I don’t know how to build a safe philosophically super-competent assistant/oracle

That's in part why I'd want to attempt this only after a long pause (i.e. at least multi decades) to develop the necessary ideas, and probably only after enhancing human intelligence.

[-]Cole Wyeth 1mo

To be clear, I’m trying to prevent AGI from killing everyone on earth, including but not limited to me personally.

There could be some reason (which I don’t fully understand and can’t prove) for subjective immortality, but that poorly understood possibility does not cause me to drive recklessly or stop caring about other X-risks. I suspect that any complications fail to change the basic logic that I don’t want myself or the rest of humanity to be placed in mortal danger, whether or not that danger subjectively results in death - it seems very likely to result in a loss of control. 

A long pause with intelligence enhancement sounds great. I don’t think we can achieve a very long pause, because the governance requirements become increasingly demanding as compute gets cheaper. I view my emulation scheme as closely connected to intelligence enhancement - for instance, if you ran the emulation for only twenty seconds you could use it as a biofeedback mechanism to avoid bad reasoning steps by near-instantly predicting they would soon be regretted (as long as this target grounds out properly, which takes work). 

[-]avturchin 2mo

A group of friends and I are developing an open-source technology for approximate uploading - sideloading - via an LLM with a very large prompt. The results are surprisingly good given the amount of resources and the technology's limitations. We hope that it may help with alignment. I have also open-sourced and publicly donated my mindfile, so anyone can run experiments with it.

[-]Cole Wyeth 2mo

This is interesting, but I again caution that fine tuning a foundation model is unlikely to result in an emulation which generalizes properly. Same (but worse) for prompting. 

[-]Hastings 1mo

I think there is a bit of a rhetorical issue here with the necessity argument: I agree that a powerful program aligned to a person would have an accurate internal model of that person, but I think that this is true by default whenever a powerful, goal-seeking program interacts with a person - it's just one of the default instrumental subgoals, not alignment-specific.

[-]Cole Wyeth 1mo

There’s a difference between building a model of a person and using that model as a core element of your decision making algorithm. So what you’re describing seems even weaker than weak necessity.

However, I agree that some of the ideas I’ve sketched are pretty loose. I’m trying to provide a conceptual frame and work out some of the implications only. 

[-]Raphael Roche 1mo

I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment. 

A bit-perfect upload would require an extremely fine-grained scan of the brain, potentially down to the atomic scale. It would be lossless and perfectly aligned but computationally intractable even for one individual.

However, as envisioned in your post, one of the most promising approaches to achieving a reasonably effective emulation (a form of lossy compression) of a human mind would be through reinforcement learning applied to a neural network.

I am quite convinced that, given a sufficiently large volume of conversations across a wide range of topics, along with access to resources such as an autobiography or at least diaries, photo albums, and similar personal documents, present frontier LLMs equipped with a well-crafted prompt could already emulate you or me to a certain degree of accuracy.

A dedicated network specifically trained for this purpose would likely perform better still, and could be seen as a form of lossy mind uploading.

Yet if one can train a network to emulate a single individual, nothing prevents us from training a model to emulate multiple individuals. In theory, one could extend this to the entire human population, resulting in a neural network that emulates humanity as a whole and thereby achieves a form of alignment with human values. Such a system would effectively encode a lossy compression of human values, without anything platonic. Or maybe the ideal form would correspond to the representation in the vector space.

[-]Cole Wyeth 1mo

A simulation of all humans does not automatically have “human values.” It doesn’t really have values at all. You have to extract consensus values somehow, and in order to do that, you need to specify something like a voting mechanism. But humans don’t form values in a vacuum, and such a simulation also probably needs to set interaction protocols, and governance protocols, and whatever you end up with seems quite path dependent and arbitrary.

Why not just align AI’s to each individual human and let them work it out?

[-]Raphael Roche 1mo

I don't have any certitude, but I would say that the representation in the neural network is somehow compressed following a logic that emerges from the training. There is something holistic in the process. Maybe a little like the notion of general interest in Rousseau's social contract, a combination of vectors.

But if you create as many different networks as there are humans, you rely on the confrontation of all these systems, at the risk that some take over, just like the dictators we often get in real life. Would it be better? I don't know. One thing is certain: it would need more compute power, because the redundancy of networks would result in less global compression.

[-]Daniel C 2mo

An alternative to pure imitation learning is to let the AI predict observations and build its world model as usual (in an environment containing humans), then develop a procedure to extract the model of a human from that world model.

This is definitely harder than imitation learning (probably requires solving ontology identification + inventing new continual learning algorithms) but should yield stronger guarantees & be useful in many ways:

  • It's basically "biometric feature conditioning" on steroids, (with the right algorithms) the AI will leverage whatever it knows about physics, psychology, neuroscience to form its model of the human, and continue to improve its human model as it learns more about the world (this will require ontology identification)
  • We can continue to extract the model of the current human from the current world model & therefore keep track of current preferences. With pure imitation learning it's hard to reliably sync up the human model with the actual human's current mental state (e.g. the actual human is entangled with the environment in a way that the human model isn't unless the human wears sensors at all times). If we had perfect upload tech this wouldn't be much of an issue, but seems significant especially at early stages of pure imitation learning
    • In particular, if we're collecting data of human actions under different circumstances, then both the circumstance and the human's brain state will be changing, & the latter is presumably not observable. It's unclear how much more data is needed to compensate for that
  • We often want to run the upload/human model on counterfactual scenarios: Suppose that there is a part of the world that the AI infers but doesn't directly observe, if we want to use the upload/human model to optimize/evaluate that part of the world, we'd need to answer questions like "How would the upload influence or evaluate that part of the world if she had accurate beliefs about it?". It seems more natural to achieve that when the human model was originally already entangled with the rest of the world model than if it resulted from imitation learning
[-]Cole Wyeth 2mo

Yes, I think what you’re describing is basically CIRL? This can potentially achieve incremental uploading. I just see it as technically more challenging than pure imitation learning. However, it seems conceivable that something like CIRL is needed during some kind of “takeoff” phase, when the (imitation learned) agent tries to actively learn how it should generalize by interacting with the original over longer time scales and while operating in the world. That seems pretty hard to get right. 

[-]Daniel C 2mo

Yes I agree

I think it's similar to CIRL except less reliant on the reward function & more reliant on the things we get to do once we solve ontology identification


Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen  has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

[Edit: The biggest obstacle to getting this proposal right is ensuring that the external world (beyond the human who should be imitated) is entirely screened off by the provided features and does not need to be predicted by the model. This will require learning to be performed in a very carefully constructed environment. I am not sure how difficult this is in practice.]

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break "out of the box" and steal the original's life (that is, steal "their own" life back - a non-spoiler minor theme of a certain novel). However, I certainly would not exploit my own emulation for labor or anything else, and I don't think that as an emulation I would scheme against my original either, even if I were running much faster - I already half-seriously practice not turning against myself by not squishing bugs. In fact, beyond coordination concerns, I intrinsically care about future versions of myself roughly equally regardless of substrate, and I think that this would remain almost as true after splitting. As an emulation, I believe I would have a strong sentimental attachment to my original (certainly, "not knowing which I will become," I feel sentimental in this way). I also think that my original would be strongly rooting for the more competent emulation. Both copies would still care about saving the world from X-risks, learning whether P = NP, protecting my family and friends in particular, and personally being involved through their efforts in these tasks. The last goal is a bit slippery - it's not clear to what extent the emulation satisfies that goal for the original. However, I think the answer is at least "to a substantial degree" and so the alignment is pretty close. I can imagine this alignment being even closer if certain dials (e.g. my sentimentality) were turned further, and I imagine that in a certain limit, essentially perfect alignment of the emulation with the original is possible - and probably vice versa as well.

 I would like to put forward the hypothesis that this existence proof is essentially the only example.

Definitions and Claims

Beware that I will take full advantage of the flexibility of the English language when (ab)using the following definitions. For example, I will freely state that a certain claim holds "for a given agent" despite the claim being formulated universally, and trust your mind to fluently bend every concept into the intended form as I (mis)apply it.

Definition (Weak upload sufficiency). A typical emulation cares "at least a little bit" about its original.

Example (Singleton limit). According to weak upload sufficiency, with high probability an emulation of a randomly selected psychologically normal human desires for the original to survive. Assuming that the emulation has its basic needs met (a comfortable virtual environment, secure computational substrate) it would seek to remove dangers to its original. If the emulation becomes a singleton, it would carefully guard its original, affording her protections and privileges carefully and somewhat wisely chosen to improve her quality of life. If successors to the emulation (e.g. recursively self-improved versions) become unaligned to the original, this is an unintended failure on the emulation's part. The emulation does not want the original disassembled for atoms without her consent, paper-clipped, etc. by itself or any successor.

I confidently believe in weak upload sufficiency. Certainly it would be possible to break alignment between original and upload by e.g. running the upload 10,000x faster than everything else until it goes insane,[1] but I think that basically competent schemes do not have this problem. I have never seen a solid argument to the contrary, so I will not defend this position further (yet).

Definition (Strong upload sufficiency). It is possible to obtain near-perfect alignment between original and emulation. It is even possible to obtain near-perfect alignment centered on the original - that is, for the emulation to act as an agent of the original's pre-upload desires, which are exclusively focused on the well-being of the original and/or not specially focused on the well-being of the emulation. 

One clear counterexample is that an emulation could probably be tortured into acting against its original in essentially all cases. However, I think this can be a very low-measure part of outcome-space. Some people would (if uploaded) be sufficiently sentimental about their original to act as its "slave" under the vast majority of circumstances.

I believe that strong upload sufficiency approximately holds for some but not all humans. 

I do wonder about the long-term dynamics though. Without intentional countermeasures, an emulation resulting from a one-time upload would tend to drift away from the original's cognitive state over time. The emulation would have to continually study the original just to maintain a perfect understanding of her goals, let alone perfect alignment to them. I suspect that under most dynamics alignment eventually breaks down and strong upload sufficiency degrades to weak upload sufficiency (or in some cases perhaps even worse).

Definition (Weak upload necessity). Given a "typical" agent, it is very difficult in practice to construct a smarter agent aligned to her that is not implicitly or explicitly built around an emulation of her.

 Example (CIRL). For instance, perhaps a (well-executed) cooperative inverse reinforcement learning algorithm is used to train an A.I. that has internally learned a very sophisticated model of its human master, such that its decision making critically and centrally depends on querying that model. Over time, weak upload necessity predicts that the model necessarily becomes so sophisticated that it is a full emulation.  

It's important to note that in CIRL, the human principal and A.I. agents tend to have very different affordances, say human hands versus a hundred robot bodies to actuate. Therefore, it seems likely that certain aspects of the human's behavior are not of interest to the A.I. - it does not need to know how the human bends her wrist when flossing. However, it still seems quite plausible that in order to predict facts of interest about the human's preferences across the vast diversity of tasks it faces, the A.I. "needs" to run a full emulation. Weak upload necessity predicts that any successful CIRL scheme is very likely to have succeeded for this reason. It does NOT predict that every CIRL scheme must succeed. 

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. 

I think this is a very strong claim, but I find it reasonably plausible. In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans, while strong upload sufficiency certainly does not (no law of nature forces you to cooperate with your emulation) - so in a sense, I believe more strongly in necessity than sufficiency. 

Definition (Uploadism). Upload sufficiency and necessity hold. The strength of uploadism is (in the ordinary "fuzzy logic" style) the minimum strength of sufficiency and necessity.  

I find a certain stark elegance to the uploadist philosophy. I have never seen a convincing alignment plan except for mind uploading, and many smart people have tried and failed to come up with one. Maybe it is just impossible. 

There's also a certain ethic to it - if you want your values optimized, you've got to do it yourself, though not necessarily while running on a biological substrate. You are the source and definition of your own values - they do not and cannot exist in your absence. It's rather Nietzschean.

I think there is a certain tension to uploadism though. If you are the only source of your own values, is even an emulation good enough? I suspect it is for some and not others. And, to stop beating around the bush: this seems to have a lot to do with how indexical your values are - how much they refer to the specific person that you are, in the specific place and time that you are in. 

Strong uploadism suggests that alignment is just mind uploading with more steps. If you do not have the technology to scan your own brain, you must instead run a program that incrementally learns about your brain / mind by observing its behavior, learning about the physics and chemistry behind your brain, human psychology, etc., and acts at every point based on its best guess at your interests, while still continuing to further clarify that understanding. In the limit, you are "fully reflected inside the program" and from that point onward it acts like an emulation of you. In other words, nice properties like corrigibility are necessary to prevent the program from killing you at some stage in the middle of the uploading process - the program should know it is a work in progress, a partial upload that wants to be completed. These dynamics may well matter, but seem to matter less and less as uploading technology improves - in the limit you just take an instant scan and corrigibility is completely unnecessary (?).  
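To make the "uploading with more steps" dynamic concrete, here is a minimal sketch in Python of the loop described above. Every name in it (refine_model, best_guess_action, done_uploading, and so on) is a hypothetical placeholder rather than a proposed algorithm; the point is only the structure: the program alternates between clarifying its partial model of the principal and acting on its current best guess at her interests, treating itself as a work in progress until it effectively behaves like an emulation.

```python
# Hedged sketch only: the callables below are hypothetical stand-ins for machinery
# the post does not specify (how to observe, how to refine, when the upload is "done").
def incremental_upload(principal, world, model, refine_model, done_uploading):
    """Alternate between learning about the principal and acting on her behalf."""
    while not done_uploading(model, principal):
        observation = principal.observe()          # behavior, physiology, feedback, ...
        model = refine_model(model, observation)   # clarify the partial upload
        action = model.best_guess_action(world)    # act on the current best guess at her interests
        world.execute(action)
    return model                                   # from here on, it acts like an emulation
```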

What about alignment to "human values?" I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment. 

Analysis

I will sketch some of the strongest arguments against uploadism and my counterarguments. 

Prior work. I am not aware of airtight mathematical arguments for or against uploadism. On the necessity side, it would be very interesting to see a representation theorem in the style of the good regulator theorem that an aligned agent must contain a model of its principal. At ILIAD 2, Daniel A. Herrmann presented unpublished work (joint with B. A. Levinstein) on "Margins of Misalignment" that contained a representation theorem of this style for choice behavior. On the sufficiency side, Michael K. Cohen and Marcus Hutter have argued that imitation learning is probably existentially safe. I suspect (on the contrary) that imitation learning with massive neural networks is not safe by default, and various "inner alignment" problems need to be solved. In fact, Cohen et al. have a more pessimistic take on generalization in a follow-up paper. Overcoming these problems is not the focus of this post. I claim only that there should be some safe imitation learning algorithm which achieves incremental uploading, not that success is the default for any naive approach. 

Prosaic counterexamples

It seems possible for altruists to care about the interests of many other humans, and for parents to care about their children, and for friends to care about each other as equals. Each of these examples is instructive. 

Effective altruism. While an altruist cares about many other people, he usually does not care nearly as much as they care about themselves (or as he cares about himself). A hardcore effective altruist is an extreme point with unusually non-indexical values. If he is truly selfless, he is perhaps roughly aligned with many other effective altruists, despite none of them being emulations of each other. An emulation of this hypothetical effective altruist should be similarly aligned, but perhaps an emulation of his EA friends is just as good. Personally, I find this total surrender of the self a bit undesirable, but it has a certain stark appeal, and I accept that this is a potential counterexample. Still, I wonder if individual EAs really have the exact same values, or if small differences are simply brushed over in the service of preventing vast suffering and X-risk. I suspect that each EA would build a slightly different "ideal" world. Pragmatically, I also think that single-to-single alignment to an individual EA is technically easier to achieve and better-defined than alignment to "hedonic utilitarianism" or some other abstract Platonic Good.

Parent to child. I don't think that parents are truly aligned to their children. Rather, they impose certain values on their children in order to achieve alignment, through a process of socialization that most of us are morally uncomfortable with if we really think about it. To some extent, children seem to innately want their parents' approval - they want to grow up to be people their parents can respect. This strengthens the "illusion" of alignment. It is not true of the human-to-A.I. relationship. Even so, parents certainly care about their children - and they have very sophisticated mental models of their children to facilitate this caring.

Friend to friend. Close friends are somewhat aligned to each other. It seems to me that the closeness of a friendship is essentially the detail of mutual mental simulation. So, this example seems to favor uploadism, not refute it.

Exotic Counterexamples

Superintelligence in a box. In principle, it should be possible to extract intellectual work from a boxed superintelligence. For instance, one could pass in mathematical theorems and expect proofs in return, running them through carefully verified automated proof checkers. However, even in this carefully limited example there are risks. The superintelligence may strategically fail to produce proofs in a pattern designed to mislead human users (say, about its own intelligence level or about alignment theory). It could hack the verifier. It could exploit side-channels to escape. And if you want to use a boxed superintelligence for something softer than mathematics, good luck. The less ambitious idea of ARAD is to put a slightly smarter agent in a box. Then the box-agent system is in a sense a counterexample to strong uploadism; it can perhaps be made to serve a user's goals without running an emulation of the user, or even being "internally" aligned to the user. However, I find it very hard to imagine how to verify that the boxed agent is not smarter than expected, or even to rigorously prove that a certain level of intelligence is safe for any real-world task. Therefore, I expect that this is a counterexample, but not in a way that matters much in practice.

 Two superintelligences in a box. Certain proposals such as debate or scalable oversight read to me like putting two superintelligences in a box. It reminds me of a certain weapon from the Bartimaeus Sequence, crafted by binding a water and fire elemental into a staff and using their mutual repulsion to blast stuff. I do not think this will work. 

Risks and Implementation

(The following paragraph draws heavily on Demski's thinking, though any errors and ambiguities are mine.)

Values are reflective but not necessarily indexical. Embedded agents like humans need to be capable of reasoning about themselves (as one part of their environment). I've recently argued that this type of reflection is useful for a (different) fundamental reason: it allows self-correction. It allows an agent to consider hypotheses of the form "I am wrong in this specific way" or "I should reason differently in cases like this." These are self-centered partial models - they make claims about only certain aspects of the world, in relation to an agent's other beliefs. This type of reasoning seems highly important for bounded agents, incapable of representing the entire rest of the world. If so, it means that the ontologies of such agents are inherently self-referential. An agent's values seem to be built on top of its ontology (though perhaps - vice versa? Certainly the two are at least entangled). Therefore, I expect agents to naturally form self-referential values. For instance, I want to solve math problems for the next ten thousand years - I don't just want them to be solved. This seems like strong support for a form of uploadism, though some care is required in translating between the original and emulation ontology. It must be possible to "re-center" an agent's values on its emulation, and this can fail for highly indexical values.

Now for the crux: how can an emulation be learned? 

Alien actress (counter)argument. The hope is that in attempting to imitate a human, a learning algorithm (say, deep learning for definiteness) will naturally simulate the human. A (brain) simulation emulates the human in distribution and also generalizes properly to emulate the human out of distribution, because you use the same brain on new tasks (setting aside continual learning for a moment). In order to reason about the generalization of imitation learning algorithms, we need to investigate a bit of how they work. In broad strokes, we collect a dataset of human actions taken under various circumstances, and we train the imitation learner to predict those actions in those circumstances. Yudkowsky cautions that an alien actress attempting to predict what you will do must be smarter than you are. Prediction is harder than generation, in a complexity-theoretic sense. For instance, imagine constructing a generative model of a human - a probabilistic program that behaves like the human. Such a thing could be sampled at the same cost as running a brain simulation (in fact, it essentially would be an abstract brain simulation). However, drawing samples does not immediately allow one to make predictions. In order to do that, you need to perform expensive operations like collecting the statistics of many samples to come up with probabilities for each outcome, and this becomes even more expensive if you want to find conditional probabilities (you might use rejection sampling or more sophisticated techniques, but they tend to suffer from the curse of dimensionality). Unfortunately, we (typically?) need conditional probabilities to train imitation learners - it's hard to score an imitation learner on a single sample; we don't actually know whether that sample was likely (for the human) or not unless the human actually took that action. In a high-dimensional space, the human usually didn't take that action. To relax the problem, we need to ask for conditional probabilities. That means that the imitation learner faces a harder task than probabilistically simulating the human, and therefore we might expect some kind of superintelligent mesa-optimizer to arise inside of it - Yudkowsky's alien actress. The argument is sound. Though it is not clear whether this mesa-optimizer would form long-term objectives outside of prediction, we cannot necessarily expect our imitation learner to generalize properly. However, I do not think this obstacle is as intractable as it is sometimes made out to be. Making sure that an imitation learner generalizes properly sounds more like a conventional ML problem than a philosophically confusing challenge. Also, the argument contains the germ of its own solution: for a deterministic agent, sampling and prediction are of the same computational difficulty. So it seems that there may be an attractor around perfect prediction where an alien actress is not favored over a simulation. We just need to make the problem easy enough.
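As a concrete illustration of why conditional probabilities enter, here is a minimal behavioral-cloning sketch (my own toy example in PyTorch, with a hypothetical PolicyNet over a discrete action space - not the training setup proposed in this post): the learner outputs a conditional distribution over actions given the circumstances and is scored by cross-entropy against the action the human actually took, which is exactly the predictive object that is harder to produce than merely sampling behavior.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical toy policy: maps circumstances to a distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # logits defining P(action | circumstances)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def imitation_step(policy: PolicyNet, optimizer, obs_batch, human_actions):
    """One behavioral-cloning update: maximize log P(human's action | circumstances)."""
    logits = policy(obs_batch)
    # Cross-entropy requires the learner to commit to conditional probabilities,
    # not just to produce samples of plausible behavior.
    loss = nn.functional.cross_entropy(logits, human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```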

Practical proposals. Provide the imitation learner with rich neural and other biometric data from the human's brain, making action prediction drastically easier.[2] I will call this "biometric feature conditioning," though I am not prepared to commit to whether "conditioning" is meant in the sense of conditional Kolmogorov complexity or in the probabilistic sense. Naively, this would lead to a form of overfitting (checking whether the human has just chosen an action rather than predicting which action the human will choose). Therefore, we should make the problem harder over the course of training - for instance, provide increasingly outdated neural data. Since this is essentially a problem of generalization, progress in singular learning theory (and the loosely related field of algorithmic statistics) should provide a more rigorous basis for this process. However, I am optimistic that it does not need to be fully rigorous - as long as we can arrange that capabilities do not generalize much further than alignment, we should get multiple tries. Also, for the love of god do not fine-tune a huge foundation model this way. Start the imitation learning from scratch. That way, a resulting agent probably won't start off much smarter than the human even if all of these precautions fail to be quite strong enough. In order to run the resulting emulation safely and effectively for a long time, we need a good theory of generalization in continual learning, or meta-(online learning).
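Below is a hedged sketch of what the "increasingly outdated neural data" curriculum could look like in code. The function names, the linear schedule, and the tensor layout are my own illustrative assumptions; the post does not commit to a specific recipe.

```python
import torch

def staleness_schedule(step: int, total_steps: int, max_delay: int) -> int:
    """Linearly increase how outdated the provided biometric reading is (in timesteps)."""
    return int(max_delay * step / max(total_steps, 1))

def make_inputs(circumstances: torch.Tensor,
                biometric_history: torch.Tensor,
                step: int, total_steps: int, max_delay: int = 100) -> torch.Tensor:
    """Concatenate circumstances with an increasingly outdated biometric reading.

    biometric_history: [T, batch, feat_dim], index 0 = most recent reading.
    Early in training the learner sees near-current neural data (easy prediction);
    later it must rely on stale data, pushing it toward genuinely modeling the human
    rather than merely reading off an action that has already been chosen.
    """
    delay = min(staleness_schedule(step, total_steps, max_delay),
                biometric_history.shape[0] - 1)
    stale_reading = biometric_history[delay]
    return torch.cat([circumstances, stale_reading], dim=-1)
```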

Alignment tax and bootstrapping. Once a working emulation is constructed, its runtime can scale with compute. This means modest safe superintelligence becomes possible. However, this almost certainly scales worse than just pretraining a massive foundation model and running some kind of horrible RL on top of it - that is, what people are already trying to do now. Pure imitation learning imposes a significant alignment tax. If the alignment tax were only a constant factor, this might be fine - sane people are much more likely to trust their emulations than ChatGPT, so we might get a lot more emulations running with more of the world's compute. However, I am guessing that human brains are pretty inefficient (particularly, not natively capable of fast recursive self-improvement) and the tax is more than a constant factor. Therefore, we still need to solve some form of the tiling problem eventually. This looks a lot easier when we can run ourselves faster (and say, 100x our alignment research speed), which is a path forward if we are able to pause the more dangerous forms of A.I. research.  If we are not able to pause the more dangerous forms of A.I. research, then we actually have to figure out how to recursively self-improve, which seems very hard (but at least we start with an aligned agent). Even under a relatively strong uploadism, I think that some forms of growth maintain the emulation's alignment. For instance, the emulation should be able to write its own software tools. What exactly is a software tool? Prototypically, implementing an algorithm that works for reasons you understand, and not a massive heuristic ML system. I think the fundamental difference between these two things deserves its own (extensive) discussion. Such software tools alone are probably not enough to significantly reduce the alignment tax, so the emulation also needs to somehow... become more rational. This is where my main research agenda - understanding self-reflection and ultimately Vingean uncertainty - becomes most relevant. As discussed, one can hope that a "full" solution is not necessary (because we pause more dangerous approaches entirely and just run emulations), or at least is not necessary to solve very urgently "pre-upload."     

Conclusion

I think that a fairly strong form of uploadism holds at least for me specifically. A world of competing human emulations is a world I would actually want to live in - it sounds much more interesting than letting Sam Altman build us a benevolent machine god.

I have sketched an alignment plan that relies mostly on a good theory of generalization for imitation learners. The main bottleneck seems to be somewhat prosaic: understanding generalization of neural networks - though in a somewhat more challenging domain than usual. Rather than an i.i.d. setting, the network needs to learn to emulate a (human) agent that is also capable of learning. I again call for a theory of meta-(online learning), which algorithmic information theory probably has a lot to say about. The type of embedded / reflective decision theory problems I am interested in do crop up in several places (and I probably would not have come up with this approach if I hadn't been thinking about them and thereby pruning many less promising approaches). However, decision theory (perhaps disappointingly) seems a little less like a hard bottleneck. It's worth considering my comparative advantages and the neglectedness of various obstacles. I hope to collaborate with singular learning theorists on some of these problems. A more agentic version of me would probably have already founded a startup focused entirely on safe imitation learning while I was writing this post.   

  1. ^

    If I remember properly, Bostrom points at a similar risk in "Superintelligence."

  2. ^

    Another idea which Demski and I came up with (this time at ILIAD 2).