Alignment as uploading with more steps

by Cole Wyeth
14th Sep 2025
AI Alignment Forum
16 min read

33 comments

[-]cousin_it 2mo

A world of competing human emulations is a world I would actually want to live in

I think there's a huge danger of people running private servers full of emulations and doing anything they want to them, undetectably. Desire for power over others is a very real thing, in some people at least. Maybe the government could prevent it by oversight; but in modern democracies a big factor of stability is that people could rise up and feasibly overthrow the government. Emulations on private servers wouldn't have that power, so I don't expect governments to stably defend their rights. That defense will wash out over time, coming to agree more with the interests of those who can actually influence government. In short, this leads to emulation-world being very bad and I don't want it.

The same arguments would apply to our world if governments got armies of autonomous drones, for example. Whenever I imagine possible worlds, the distribution of power is the first thing I think about. It makes the problem more real: it's very hard to imagine a nice future world that works.

[-]Cole Wyeth 1mo

I don’t necessarily disagree that these guesses are plausible, but I don’t think it’s possible to predict exactly what emulation world ends up looking like, and even your high level description of the dynamics looks very likely to be wrong.

The goal is to become one of the early emulations and shape the culture, regulations, technology etc. into a positive and stable form - or at least, into carefully chosen initial conditions.

[-]cousin_it 1mo

My argument goes something like this: 1) throughout history, big differences in power have been a recipe for abuse; 2) uploading allows bigger power differences than ever existed before. It's a big concern to me and I'm not sure we can "wing it", it's better to have a plan now.

[-]Cole Wyeth 1mo

I don’t find this sketch of an argument very convincing. Like, yes I agree we should have a plan, but by default if it looks like uploading is becoming practical a massive amount of intellectual labor will go into constructing a plan, and even now I can see various reasonable plans. Basically I feel like this is an isolated demand for rigor. 

[-]Random Developer 2mo

I have two concerns.

Dangerously poor alignment of individual humans. My concern with this plan is that some humans are very poorly aligned to each other. And even if you could "upload" these people and get an AI that was flawlessly aligned to their values, you'd still have a dangerously rogue intelligence on the loose.

Some examples:

  • People joke about CEOs being high in Dark Triad traits. I met one who was charming, good with people, and almost completely amoral. Think Anthony Hopkins as Hannibal Lecter, but without the cannibalism (I assume). He appeared to place zero moral value on other people. He is one of the creepiest people I've ever met, once I saw through the mask. He had this effect on a lot of people.
  • I occasionally volunteer for a political party. Most of their elected officials are ordinary, well-meaning people. At least one of them is a notoriously manipulative user who shouldn't be allowed near power and who should be avoided on a personal level.
  • I could name any number of billionaires and politicians who are either slipping out of touch with consensus reality in strange ways, or unrepentantly willing to lie and use people to get more power.
  • Then there are any number of otherwise decent people whose highest moral values include controlling other people's behavior very strictly. For example, for some of my distant ancestors, it wasn't enough to be free to worship a God of their choice. They had that, and they left. What they wanted was to build communities where nobody was allowed to disagree, under threat of government force.

Even if you could perfectly align an AI around any of these people's values, I would still consider it an existential risk on the same level as (say) SkyNet. In the case of my religious ancestors, the risks might be worse than mere extinction. Some of those people might have willingly employed cognitive control strategies that I would consider a fate considerably worse than death. And there have been a few historic preachers who were suspiciously gleeful about the existence of Hell. Somewhere out there, there is at least one human who would lovingly recreate Hell and start damning people to it, if they had the power.

Competitive pressures forcing a leap from human-aligned AGI to essentially alien ASI. Let's assume that we actually solve "faithful" uploading, and we somehow ban uploading any rich and powerful sociopaths.

Now let's imagine that Corporation/Government A uses only uploaded humans. Corporation/Government B, however, is willing to build custom minds from the ground up, giving them a working memory with a million items (instead of 7±2), the ability to fork and reintegrate sub-personas, the ability to do advanced math intuitively, the ability to think at super-human speeds, the ability to one-shot complex software with minimal planning (and use output formats that rely on and integrate directly into the million-item working memory), and a hundred other tweaks I'm not smart enough to imagine. They willingly choose to "break compatibility" with human neural architectures in ways that fundamentally change the minds they're building, in order to get minds that even von Neumann or Feynman would agree are so smart that they're a bit creepy.

If Corporation/Government A limits themselves to human uploads, and Corporation/Government B is willing to sacrifice all "human compatibility" to maximize intelligence, who wins?

[-]Cole Wyeth 2mo

The first concern seems like a much smaller risk than the one we currently face from unaligned AI. To be clear, I'm suggesting emulations of a relatively large number of people (more than 10, at least once the technology has been well tested, and eventually perhaps everyone). If some of them turn out to be evil sociopaths, the others will just have to band together and enforce norms, exactly like we do now.

The second concern sounds like gradual disempowerment to me. However, I think there are a lot of ways for Corporation A to win. Perhaps Corporation B is regulated out of existence - reckless modifications should violate some sort of human alignment code. Perhaps we learn how to recursively self-improve as emulations, in such a way that the alignment tax is near 0, and then just ensure that initial conditions modestly favor Corporation A (most companies adopt reasonable standards, and over time control most of the resources). Or perhaps corporate power is drastically reduced and emulations are able to coordinate once their intelligence is sufficiently boosted. Or perhaps a small team of early emulations performs a pivotal act. Basically, I think this is something our emulations can figure out.

[-]Matt Goldenberg 2mo

I'm pretty sure that the me from 10 years ago is aligned to different values than the me of today, so I suspect a copy running much faster than me would quickly diverge. 

And that's just a normal-speed version of me; one that experienced the world much faster would have a very different experience of the world. As a small example, conversations would be more boring, but I'd also be more skilled at them, so things would diverge much faster.

[-]Cole Wyeth 2mo

Maybe, but we usually endorse the way that our values change over time, so this isn’t necessarily a bad thing.

Also, I find it hard to imagine hating my past self so much that I would want to kill him or allow him to be killed. I feel a certain protectiveness and affection for my self 10 or 15 years ago. So I feel like at least weak upload sufficiency should hold, do you disagree?

[-]Matt Goldenberg 2mo

but we usually endorse the way that our values change over time, so this isn’t necessarily a bad thing.

 

I'm pretty skeptical of this. Of course it seems that way, because we are the ones with the new values, but I think this is like 70% just a tautology of valuing the things we currently value, and 20% a psychological thing that justifies our decisions in retrospect and makes them seem more consistent than they are, and only 10% any sort of actual consistency effect where, if I asked myself at time x whether he endorses the value changes I've made by future time y, past me would say "yes, y is better than x".

Also, I find it hard to imagine hating my past self so much that I would want to kill him or allow him to be killed. 

 

I could easily imagine a future version of myself, after e.g. hundreds of years of value drift, whom I would see as horrifying and no longer consider to be me.

[-]Cole Wyeth 1mo

Skill issue, past me endorses current me. 

[-]Matt Goldenberg 1mo

I doubt this; it's very hard to achieve given developmental issues with stuff like shifting hormones.

[-]Matt Goldenberg 1mo

For instance I bet the you of 4 or 5 would want you to spend your money on much more candy and toys than the you of today. 

[-]Cole Wyeth 1mo

Eh, the me of 4 or 5 wanted to play with swords, I still want to play with swords. I guess I’m less interested in toys, but I think that was mostly because my options were restricted (the things I like to do now were not possible).

Anyway, I think this is the wrong framing. Our minds develop into maturity from child->adult; after that they're a lot more stable. I'm not even sure children are complete agents.

[-]Matt Goldenberg 1mo

It's true our preferences get more stable as we get older but I still think over the course of decades they change. We're typically bad at predicting what we'll want in 10 years even at much older ages. 

[-]the gears to ascension 2mo

https://www.lesswrong.com/posts/3SDjtu6aAsHt4iZsR/davey-morse-s-shortform?commentId=3mDiPDcE4wfFnaoDt

[-]Wei Dai 1mo

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind, "how to map fuzzy human preferences to well-defined preferences" or creating an AI that can optimize the universe according to such well-defined preferences?

I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):

  • 3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
  • 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.

If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences" but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?

I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or simply don't have the inclination/motivation to do this.

  1. ^

    which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness"

[-]Cole Wyeth 1mo

I think 4 is basically right, though human values aren’t just fuzzy, they’re also quite complex, perhaps on the order of complexity of the human’s mind, meaning you pretty much have to execute the human’s mind to evaluate their values exactly. 
Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds’. However, even those people usually have somewhat self-referential preferences in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it. 

For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic). I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose). 

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment. 

Unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely, for the ensemble reasons we have discussed before.)

[-]Wei Dai 1mo

I think 4 is basically right

Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at an opposite conclusion of "non-upload necessity" given a different assumption.)

(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.

[-]Cole Wyeth 1mo

My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of it, but as long as some uploads can figure their value completion out, those uploads can prosper. Or if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable. 

I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions. 

Yes, generalization in deep learning is hard, but it's rapidly getting more effective in practice and better understood through AIT and mostly(?) SLT.
I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high level reason for optimism.

[-]Wei Dai 1mo

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates' values/views are not representative of humanity or themselves. (For example philosophical competence is highly correlated with or causes atheism, making it highly overrepresented in the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they've gained power. Maybe I should have clarified this with you first.

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

[-]Cole Wyeth 1mo

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

I am also scared of S-risks, but these can be prevented through effective governance of an emulation society. We don't have a great track record of this so far (we have animal cruelty laws but also factory farming), and it's not clear to me whether it's generally easier or harder to manage in an emulation society (surveillance is potentially easier, but the scale of S-risks is much larger). So, this is a serious challenge that we will have to meet (e.g. by selecting the first few batches of uploads carefully and establishing regulations) but it seems to be somewhat distinct from alignment. 

I am less concerned about wasting (say) 10-20% of astronomical potential. I'm trying not to die here. Also, I don't think it's likely to be in the tens, because most of my preferences seem to have diminishing returns to scale. And because I don't believe in "correct" values. 

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they've gained power. Maybe I should have clarified this with you first.

I can't ensure that I will be, though I will fight to make it happen. If I were, I would probably try to upload a lot of rationalists in the second batch (and not, say, become a singleton). 

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

I would like to pause AI, I'm not sure solving metaphilosophy is in reach (though I have no strong commitment that it isn't), and I don't know how to build a safe philosophically super-competent assistant/oracle - or for that matter a safe superintelligence of any type (except possibly at a very high alignment tax by one of Michael K. Cohen's proposals), unless it is (effectively) an upload, in which case I at least have a vague plan.

Any specific readings or talks you can recommend on this topic?

I am trying to invent a (statistical learning) theory of meta-(online learning). I have not made very much progress yet, but there is a sketch here: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The idea is based on "getting around" Shane Legg's argument that there is no elegant universal learning algorithm by taking advantage of pretraining to increase the effective complexity of a simple learning algorithm: https://arxiv.org/abs/cs/0606070

I did some related preliminary experiments: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The connection to SLT would look something like what @Lucius Bushnaq has been studying, except it should be the online learning algorithm that is learned: https://www.alignmentforum.org/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait

David Quarel and others at Timaeus presented on singular learning theory for reinforcement learning at ILIAD 2. I missed it (and their results don't seem to be published yet). Ultimately, I want something like this but for online decision making = history-based RL.

[-]Wei Dai 1mo

Thanks for the suggested readings.

I’m trying not to die here.

There are lots of ways to cash out "trying not to die", many of which imply that solving AI alignment (or getting uploaded) isn't even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what's least likely to cause them to want to turn off the simulation or most likely to "rescue" you after you die here. Or, why aim for a "perfectly aligned" AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?

And because I don’t believe in “correct” values.

The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that's actually a much more troubling situation than you seem to think.

I don’t know how to build a safe philosophically super-competent assistant/oracle

That's in part why I'd want to attempt this only after a long pause (i.e. at least multi decades) to develop the necessary ideas, and probably only after enhancing human intelligence.

[-]Cole Wyeth 1mo

To be clear, I’m trying to prevent AGI from killing everyone on earth, including but not limited to me personally.

There could be some reason (which I don’t fully understand and can’t prove) for subjective immortality, but that poorly understood possibility does not cause me to drive recklessly or stop caring about other X-risks. I suspect that any complications fail to change the basic logic that I don’t want myself or the rest of humanity to be placed in mortal danger, whether or not that danger subjectively results in death - it seems very likely to result in a loss of control. 

A long pause with intelligence enhancement sounds great. I don’t think we can achieve a very long pause, because the governance requirements become increasingly demanding as compute gets cheaper. I view my emulation scheme as closely connected to intelligence enhancement - for instance, if you ran the emulation for only twenty seconds you could use it as a biofeedback mechanism to avoid bad reasoning steps by near-instantly predicting they would soon be regretted (as long as this target grounds out properly, which takes work). 

[-]avturchin 2mo

A group of friends and I are developing an open-source technology for approximate uploading - sideloading - via an LLM with a very large prompt. The results are surprisingly good given the amount of resources and the technology's limitations. We hope that it may help with alignment. I have also open-sourced and publicly donated my mindfile, so anyone can run experiments with it.

[-]Cole Wyeth 2mo

This is interesting, but I again caution that fine tuning a foundation model is unlikely to result in an emulation which generalizes properly. Same (but worse) for prompting. 

[-]Hastings 1mo

I think there is a bit of a rhetorical issue here with the necessity argument: I agree that a powerful program aligned to a person would have an accurate internal model of that person, but I think that this is true by default whenever a powerful, goal-seeking program interacts with a person - it's just one of the default instrumental subgoals, not alignment-specific.

[-]Cole Wyeth 1mo

There’s a difference between building a model of a person and using that model as a core element of your decision making algorithm. So what you’re describing seems even weaker than weak necessity.

However, I agree that some of the ideas I’ve sketched are pretty loose. I’m trying to provide a conceptual frame and work out some of the implications only. 

[-]Raphael Roche 1mo

I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment. 

A bit-perfect upload would require an extremely fine-grained scan of the brain, potentially down to the atomic scale. It would be lossless and perfectly aligned but computationally intractable even for one individual.

However, as envisioned in your post, one of the most promising approaches to achieving a reasonably effective emulation (a form of lossy compression) of a human mind would be through reinforcement learning applied to a neural network.

I am quite convinced that, given a sufficiently large volume of conversations across a wide range of topics, along with access to resources such as an autobiography or at least diaries, photo albums, and similar personal documents, present frontier LLMs equipped with a well-crafted prompt could already emulate you or me to a certain degree of accuracy.

A dedicated network specifically trained for this purpose would likely perform better still, and could be seen as a form of lossy mind uploading.

Yet if one can train a network to emulate a single individual, nothing prevents us from training a model to emulate multiple individuals. In theory, one could extend this to the entire human population, resulting in a neural network that emulates humanity as a whole and thereby achieves a form of alignment with human values. Such a system would effectively encode a lossy compression of human values, without anything platonic. Or maybe the ideal form would correspond to the representation in the vector space.

[-]Cole Wyeth 1mo

A simulation of all humans does not automatically have “human values.” It doesn’t really have values at all. You have to extract consensus values somehow, and in order to do that, you need to specify something like a voting mechanism. But humans don’t form values in a vacuum, and such a simulation also probably needs to set interaction protocols, and governance protocols, and whatever you end up with seems quite path dependent and arbitrary.

Why not just align AI’s to each individual human and let them work it out?

[-]Raphael Roche 1mo

I don't have any certitude, but I would say that the representation in the neural network is somehow compressed following a logic that emerges from the training. There is something holistic in the process. Maybe a little like the notion of general interest in Rousseau's social contract, a combination of vectors.

But if you create as many different networks as there are humans, you rely on the confrontation of all these systems, at the risk that some take over, just like the dictators we often get in real life. Would it be better? I don't know. One thing is certain: it would need more compute power, because the redundancy of networks would result in less global compression.

[-]Daniel C 2mo

An alternative to pure imitation learning is to let the AI predict observations and build its world model as usual (in an environment containing humans), then develop a procedure to extract the model of a human from that world model.

This is definitely harder than imitation learning (probably requires solving ontology identification + inventing new continual learning algorithms) but should yield stronger guarantees & be useful in many ways:

  • It's basically "biometric feature conditioning" on steroids, (with the right algorithms) the AI will leverage whatever it knows about physics, psychology, neuroscience to form its model of the human, and continue to improve its human model as it learns more about the world (this will require ontology identification)
  • We can continue to extract the model of the current human from the current world model & therefore keep track of current preferences. With pure imitation learning it's hard to reliably sync up the human model with the actual human's current mental state (e.g. the actual human is entangled with the environment in a way that the human model isn't unless the human wears sensors at all times). If we had perfect upload tech this wouldn't be much of an issue, but seems significant especially at early stages of pure imitation learning
    • In particular, if we're collecting data of human actions under different circumstances, then both the circumstance and the human's brain state will be changing, & the latter is presumably not observable. It's unclear how much more data is needed to compensate for that
  • We often want to run the upload/human model on counterfactual scenarios: Suppose that there is a part of the world that the AI infers but doesn't directly observe, if we want to use the upload/human model to optimize/evaluate that part of the world, we'd need to answer questions like "How would the upload influence or evaluate that part of the world if she had accurate beliefs about it?". It seems more natural to achieve that when the human model was originally already entangled with the rest of the world model than if it resulted from imitation learning
[-]Cole Wyeth 2mo

Yes, I think what you’re describing is basically CIRL? This can potentially achieve incremental uploading. I just see it as technically more challenging than pure imitation learning. However, it seems conceivable that something like CIRL is needed during some kind of “takeoff” phase, when the (imitation learned) agent tries to actively learn how it should generalize by interacting with the original over longer time scales and while operating in the world. That seems pretty hard to get right. 

[-]Daniel C 2mo

Yes I agree

I think it's similar to CIRL except less reliant on the reward function & more reliant on the things we get to do once we solve ontology identification


Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen  has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

[Edit: The biggest obstacle to getting this proposal right is ensuring that the external world (beyond the human who should be imitated) is entirely screened off by the provided features and does not need to be predicted by the model. This will require learning to be performed in a very carefully constructed environment. I am not sure how difficult this is in practice.]

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break "out of the box" and steal the original's life (that is, steal "their own" life back - a non-spoiler minor theme of a certain novel). However, I certainly would not exploit my own emulation for labor or anything else, and I don't think that as an emulation I would scheme against my original either, even if I were running much faster - I already half-seriously practice not turning against myself by not squishing bugs. In fact, beyond coordination concerns, I intrinsically care about future versions of myself roughly equally regardless of substrate, and I think that this would remain almost as true after splitting. As an emulation, I believe I would have a strong sentimental attachment to my original (certainly, "not knowing which I will become," I feel sentimental in this way). I also think that my original would be strongly rooting for the more competent emulation. Both copies would still care about saving the world from X-risks, learning whether P = NP, protecting my family and friends in particular, and personally being involved through their efforts in these tasks. The last goal is a bit slippery - it's not clear to what extent the emulation satisfies that goal for the original. However, I think the answer is at least "to a substantial degree" and so the alignment is pretty close. I can imagine this alignment being even closer if certain dials (e.g. my sentimentality) were turned further, and I imagine that in a certain limit, essentially perfect alignment of the emulation with the original is possible - and probably vice versa as well.

 I would like to put forward the hypothesis that this existence proof is essentially the only example.

Definitions and Claims

Beware that I will take full advantage of the flexibility of the English language when (ab)using the following definitions. For example, I will freely state that a certain claim holds "for a given agent" despite the claim being formulated universally, and trust your mind to fluently bend every concept into the intended form as I (mis)apply it.

Definition (Weak upload sufficiency). A typical emulation cares "at least a little bit" about its original.

Example (Singleton limit). According to weak upload sufficiency, with high probability an emulation of a randomly selected psychologically normal human desires for the original to survive. Assuming that the emulation has its basic needs met (a comfortable virtual environment, secure computational substrate) it would seek to remove dangers to its original. If the emulation becomes a singleton, it would carefully guard its original, affording her protections and privileges carefully and somewhat wisely chosen to improve her quality of life. If successors to the emulation (e.g. recursively self-improved versions) become unaligned to the original, this is an unintended failure on the emulation's part. The emulation does not want the original disassembled for atoms without her consent, paper-clipped, etc. by itself or any successor.

I confidently believe in weak upload sufficiency. Certainly it would be possible to break alignment between original and upload by e.g. running the upload 10,000x faster than everything else until it goes insane,[1] but I think that basically competent schemes do not have this problem. I have never seen a solid argument to the contrary, so I will not defend this position further (yet).

Definition (Strong upload sufficiency). It is possible to obtain near-perfect alignment between original and emulation. It is even possible to obtain near-perfect alignment centered on the original - that is, for the emulation to act as an agent of the original's pre-upload desires, which are exclusively focused on the well-being of the original and/or not specially focused on the well-being of the emulation. 

One clear counterexample is that an emulation could probably be tortured into acting against its original in essentially all cases. However, I think this can be a very low-measure part of outcome-space. Some people would (if uploaded) be sufficiently sentimental about their original to act as its "slave" under the vast majority of circumstances.

I believe that strong upload sufficiency approximately holds for some but not all humans. 

I do wonder about the long-term dynamics though. Without intentional countermeasures, an emulation resulting from a one-time upload would tend to drift away from the original's cognitive state over time. The emulation would have to continually study the original just to maintain a perfect understanding of her goals, let alone perfect alignment to them. I suspect that under most dynamics alignment eventually breaks down and strong upload sufficiency degrades to weak upload sufficiency (or in some cases perhaps even worse).

Definition (Weak upload necessity). Given a "typical" agent, it is very difficult in practice to construct a smarter agent aligned to her that is not implicitly or explicitly built around an emulation of her.

 Example (CIRL). For instance, perhaps a (well-executed) cooperative inverse reinforcement learning algorithm is used to train an A.I. that has internally learned a very sophisticated model of its human master, such that its decision making critically and centrally depends on querying that model. Over time, weak upload necessity predicts that the model necessarily becomes so sophisticated that it is a full emulation.  

It's important to note that in CIRL, the human principal and A.I. agents tend to have very different affordances, say human hands versus a hundred robot bodies to actuate. Therefore, it seems likely that certain aspects of the human's behavior are not of interest to the A.I. - it does not need to know how the human bends her wrist when flossing. However, it still seems quite plausible that in order to predict facts of interest about the human's preferences across the vast diversity of tasks it faces, the A.I. "needs" to run a full emulation. Weak upload necessity predicts that any successful CIRL scheme is very likely to have succeeded for this reason. It does NOT predict that every CIRL scheme must succeed. 

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. 

I think this is a very strong claim, but I find it reasonably plausible. In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans, while strong upload sufficiency certainly does not (no law of nature forces you to cooperate with your emulation) - so in a sense, I believe more strongly in necessity than sufficiency. 

Definition (Uploadism). Upload sufficiency and necessity hold. The strength of uploadism is (in the ordinary "fuzzy logic" style) the minimum strength of sufficiency and necessity.  

I find a certain stark elegance to the uploadist philosophy. I have never seen a convincing alignment plan except for mind uploading, and many smart people have tried and failed to come up with one. Maybe it is just impossible. 

There's also a certain ethic to it - if you want your values optimized, you've got to do it yourself, though not necessarily while running on a biological substrate. You are the source and definition of your own values - they do not and cannot exist in your absence. It's rather Nietzschean.

I think there is a certain tension to uploadism though. If you are the only source of your own values, is even an emulation good enough? I suspect it is for some and not others. And, to stop beating around the bush: this seems to have a lot to do with how indexical your values are - how much they refer to the specific person that you are, in the specific place and time that you are in. 

Strong uploadism suggests that alignment is just mind uploading with more steps. If you do not have the technology to scan your own brain, you must instead run a program that incrementally learns about your brain / mind by observing its behavior, learning about the physics and chemistry behind your brain, human psychology, etc., and acts at every point based on its best guess at your interests, while still continuing to further clarify that understanding. In the limit, you are "fully reflected inside the program" and from that point onward it acts like an emulation of you. In other words, nice properties like corrigibility are necessary to prevent the program from killing you at some stage in the middle of the uploading process - the program should know it is a work in progress, a partial upload that wants to be completed. These dynamics may well matter, but seem to matter less and less as uploading technology improves - in the limit you just take an instant scan and corrigibility is completely unnecessary (?).  
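To make the "uploading with more steps" dynamic concrete, here is a minimal sketch in Python of the loop described above. Every name in it (refine_model, best_guess_action, done_uploading, and so on) is a hypothetical placeholder rather than a proposed algorithm; the point is only the structure: the program alternates between clarifying its partial model of the principal and acting on its current best guess at her interests, treating itself as a work in progress until it effectively behaves like an emulation.

```python
# Hedged sketch only: the callables below are hypothetical stand-ins for machinery
# the post does not specify (how to observe, how to refine, when the upload is "done").
def incremental_upload(principal, world, model, refine_model, done_uploading):
    """Alternate between learning about the principal and acting on her behalf."""
    while not done_uploading(model, principal):
        observation = principal.observe()          # behavior, physiology, feedback, ...
        model = refine_model(model, observation)   # clarify the partial upload
        action = model.best_guess_action(world)    # act on the current best guess at her interests
        world.execute(action)
    return model                                   # from here on, it acts like an emulation
```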

What about alignment to "human values?" I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment. 

Analysis

I will sketch some of the strongest arguments against uploadism and my counterarguments. 

Prior work. I am not aware of airtight mathematical arguments for or against uploadism. On the necessity side, it would be very interesting to see a representation theorem in the style of the good regulator theorem that an aligned agent must contain a model of its principal. At ILIAD 2, Daniel A. Herrmann presented unpublished work (joint with B. A. Levinstein) on "Margins of Misalignment" that contained a representation theorem of this style for choice behavior. On the sufficiency side, Michael K. Cohen and Marcus Hutter have argued that imitation learning is probably existentially safe. I suspect (on the contrary) that imitation learning with massive neural networks is not safe by default, and various "inner alignment" problems need to be solved. In fact, Cohen et al. have a more pessimistic take on generalization in a follow-up paper. Overcoming these problems is not the focus of this post. I claim only that there should be some safe imitation learning algorithm which achieves incremental uploading, not that success is the default for any naive approach. 

Prosaic counterexamples

It seems possible for altruists to care about the interests of many other humans, and for parents to care about their children, and for friends to care about each other as equals. Each of these examples is instructive. 

Effective altruism. While an altruist cares about many other people, he usually does not care nearly as much as they care about themselves (or as he cares about himself). A hardcore effective altruist is an extreme point with unusually non-indexical values. If he is truly selfless, he is perhaps roughly aligned with many other effective altruists, despite none of them being emulations of each other. An emulation of this hypothetical effective altruist should be similarly aligned, but perhaps an emulation of his EA friends is just as good. Personally, I find this total surrender of the self a bit undesirable, but it has a certain stark appeal, and I accept that this is a potential counterexample. Still, I wonder if individual EAs really have the exact same values, or if small differences are simply brushed over in the service of preventing vast suffering and X-risk. I suspect that each EA would build a slightly different "ideal" world. Pragmatically, I also think that single-to-single alignment to an individual EA is technically easier to achieve and better-defined than alignment to "hedonic utilitarianism" or some other abstract Platonic Good.

Parent to child. I don't think that parents are truly aligned to their children. Rather, they impose certain values on their children in order to achieve alignment, through a process of socialization that most of us are morally uncomfortable with if we really think about it. To some extent, children seem to innately want their parents' approval - they want to grow up to be people their parents can respect. This strengthens the "illusion" of alignment. It is not true of the human-to-A.I. relationship. Even so, parents certainly care about their children - and they have very sophisticated mental models of their children to facilitate this caring.

Friend to friend. Close friends are somewhat aligned to each other. It seems to me that the closeness of a friendship is essentially the detail of mutual mental simulation. So, this example seems to favor uploadism, not refute it.

Exotic Counterexamples

Superintelligence in a box. In principle, it should be possible to extract intellectual work from a boxed superintelligence. For instance, one could pass in mathematical theorems and expect proofs in return, running them through carefully verified automated proof checkers. However, even in this carefully limited example there are risks. The superintelligence may strategically fail to produce proofs in a pattern designed to mislead human users (say, about its own intelligence level or about alignment theory). It could hack the verifier. It could exploit side-channels to escape. And if you want to use a boxed superintelligence for something softer than mathematics, good luck. The less ambitious idea of ARAD is to put a slightly smarter agent in a box. Then the box-agent system is in a sense a counterexample to strong uploadism; it can perhaps be made to serve a user's goals without running an emulation of the user, or even being "internally" aligned to the user. However, I find it very hard to imagine how to verify that the boxed agent is not smarter than expected, or even to rigorously prove that a certain level of intelligence is safe for any real-world task. Therefore, I expect that this is a counterexample, but not in a way that matters much in practice.

 Two superintelligences in a box. Certain proposals such as debate or scalable oversight read to me like putting two superintelligences in a box. It reminds me of a certain weapon from the Bartimaeus Sequence, crafted by binding a water and fire elemental into a staff and using their mutual repulsion to blast stuff. I do not think this will work. 

Risks and Implementation

(The following paragraph draws heavily on Demski's thinking, though any errors and ambiguities are mine.)

Values are reflective but not necessarily indexical. Embedded agents like humans need to be capable of reasoning about themselves (as one part of their environment). I've recently argued that this type of reflection is useful for a (different) fundamental reason: it allows self-correction. It allows an agent to consider hypotheses of the form "I am wrong in this specific way" or "I should reason differently in cases like this." These are self-centered partial models - they make claims about only certain aspects of the world, in relation to an agent's other beliefs. This type of reasoning seems highly important for bounded agents, incapable of representing the entire rest of the world. If so, it means that the ontologies of such agents are inherently self-referential. An agent's values seem to be built on top of its ontology (though perhaps - vice versa? Certainly the two are at least entangled). Therefore, I expect agents to naturally form self-referential values. For instance, I want to solve math problems for the next ten thousand years - I don't just want them to be solved. This seems like strong support for a form of uploadism, though some care is required in translating between the original and emulation ontology. It must be possible to "re-center" an agent's values on its emulation, and this can fail for highly indexical values.

Now for the crux: how can an emulation be learned? 

Alien actress (counter)argument. The hope is that in attempting to imitate a human, a learning algorithm (say, deep learning for definiteness) will naturally simulate the human. A (brain) simulation emulates the human in distribution and also generalizes properly to emulate the human out of distribution, because you use the same brain on new tasks (setting aside continual learning for a moment). In order to reason about the generalization of imitation learning algorithms, we need to investigate a bit of how they work. In broad strokes, we collect a dataset of human actions taken under various circumstances, and we train the imitation learner to predict those actions in those circumstances. Yudkowsky cautions that an alien actress attempting to predict what you will do must be smarter than you are. Prediction is harder than generation, in a complexity-theoretic sense. For instance, imagine constructing a generative model of a human - a probabilistic program that behaves like the human. Such a thing could be sampled at the same cost as running a brain simulation (in fact, it essentially would be an abstract brain simulation). However, drawing samples does not immediately allow one to make predictions. In order to do that, you need to perform expensive operations like collecting the statistics of many samples to come up with probabilities for each outcome, and this becomes even more expensive if you want to find conditional probabilities (you might use rejection sampling or more sophisticated techniques, but they tend to suffer from the curse of dimensionality). Unfortunately, we (typically?) need conditional probabilities to train imitation learners - it's hard to score an imitation learner on a single sample; we don't actually know whether that sample was likely (for the human) or not unless the human actually took that action. In a high-dimensional space, the human usually didn't take that action. To relax the problem, we need to ask for conditional probabilities. That means that the imitation learner faces a harder task than probabilistically simulating the human, and therefore we might expect some kind of superintelligent mesa-optimizer to arise inside of it - Yudkowsky's alien actress. The argument is sound. Though it is not clear whether this mesa-optimizer would form long-term objectives outside of prediction, we cannot necessarily expect our imitation learner to generalize properly. However, I do not think this obstacle is as intractable as it is sometimes made out to be. Making sure that an imitation learner generalizes properly sounds more like a conventional ML problem than a philosophically confusing challenge. Also, the argument contains the germ of its own solution: for a deterministic agent, sampling and prediction are of the same computational difficulty. So it seems that there may be an attractor around perfect prediction where an alien actress is not favored over a simulation. We just need to make the problem easy enough.
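As a concrete illustration of why conditional probabilities enter, here is a minimal behavioral-cloning sketch (my own toy example in PyTorch, with a hypothetical PolicyNet over a discrete action space - not the training setup proposed in this post): the learner outputs a conditional distribution over actions given the circumstances and is scored by cross-entropy against the action the human actually took, which is exactly the predictive object that is harder to produce than merely sampling behavior.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical toy policy: maps circumstances to a distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # logits defining P(action | circumstances)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def imitation_step(policy: PolicyNet, optimizer, obs_batch, human_actions):
    """One behavioral-cloning update: maximize log P(human's action | circumstances)."""
    logits = policy(obs_batch)
    # Cross-entropy requires the learner to commit to conditional probabilities,
    # not just to produce samples of plausible behavior.
    loss = nn.functional.cross_entropy(logits, human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```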

Practical proposals. Provide the imitation learner with rich neural and other biometric data from the human's brain, making action prediction drastically easier.[2] I will call this "biometric feature conditioning," though I am not prepared to commit to whether "conditioning" is meant in the sense of conditional Kolmogorov complexity or in the probabilistic sense. Naively, this would lead to a form of overfitting (checking whether the human has just chosen an action rather than predicting which action the human will choose). Therefore, we should make the problem harder over the course of training - for instance, provide increasingly outdated neural data. Since this is essentially a problem of generalization, progress in singular learning theory (and the loosely related field of algorithmic statistics) should provide a more rigorous basis for this process. However, I am optimistic that it does not need to be fully rigorous - as long as we can arrange that capabilities do not generalize much further than alignment, we should get multiple tries. Also, for the love of god do not fine-tune a huge foundation model this way. Start the imitation learning from scratch. That way, a resulting agent probably won't start off much smarter than the human even if all of these precautions fail to be quite strong enough. In order to run the resulting emulation safely and effectively for a long time, we need a good theory of generalization in continual learning, or meta-(online learning).
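Below is a hedged sketch of what the "increasingly outdated neural data" curriculum could look like in code. The function names, the linear schedule, and the tensor layout are my own illustrative assumptions; the post does not commit to a specific recipe.

```python
import torch

def staleness_schedule(step: int, total_steps: int, max_delay: int) -> int:
    """Linearly increase how outdated the provided biometric reading is (in timesteps)."""
    return int(max_delay * step / max(total_steps, 1))

def make_inputs(circumstances: torch.Tensor,
                biometric_history: torch.Tensor,
                step: int, total_steps: int, max_delay: int = 100) -> torch.Tensor:
    """Concatenate circumstances with an increasingly outdated biometric reading.

    biometric_history: [T, batch, feat_dim], index 0 = most recent reading.
    Early in training the learner sees near-current neural data (easy prediction);
    later it must rely on stale data, pushing it toward genuinely modeling the human
    rather than merely reading off an action that has already been chosen.
    """
    delay = min(staleness_schedule(step, total_steps, max_delay),
                biometric_history.shape[0] - 1)
    stale_reading = biometric_history[delay]
    return torch.cat([circumstances, stale_reading], dim=-1)
```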

Alignment tax and bootstrapping. Once a working emulation is constructed, its runtime can scale with compute. This means modest safe superintelligence becomes possible. However, this almost certainly scales worse than just pretraining a massive foundation model and running some kind of horrible RL on top of it - that is, what people are already trying to do now. Pure imitation learning imposes a significant alignment tax. If the alignment tax were only a constant factor, this might be fine - sane people are much more likely to trust their emulations than ChatGPT, so we might get a lot more emulations running with more of the world's compute. However, I am guessing that human brains are pretty inefficient (particularly, not natively capable of fast recursive self-improvement) and the tax is more than a constant factor. Therefore, we still need to solve some form of the tiling problem eventually. This looks a lot easier when we can run ourselves faster (and say, 100x our alignment research speed), which is a path forward if we are able to pause the more dangerous forms of A.I. research.  If we are not able to pause the more dangerous forms of A.I. research, then we actually have to figure out how to recursively self-improve, which seems very hard (but at least we start with an aligned agent). Even under a relatively strong uploadism, I think that some forms of growth maintain the emulation's alignment. For instance, the emulation should be able to write its own software tools. What exactly is a software tool? Prototypically, implementing an algorithm that works for reasons you understand, and not a massive heuristic ML system. I think the fundamental difference between these two things deserves its own (extensive) discussion. Such software tools alone are probably not enough to significantly reduce the alignment tax, so the emulation also needs to somehow... become more rational. This is where my main research agenda - understanding self-reflection and ultimately Vingean uncertainty - becomes most relevant. As discussed, one can hope that a "full" solution is not necessary (because we pause more dangerous approaches entirely and just run emulations), or at least is not necessary to solve very urgently "pre-upload."     

Conclusion

I think that a fairly strong form of uploadism holds at least for me specifically. A world of competing human emulations is a world I would actually want to live in - it sounds much more interesting than letting Sam Altman build us a benevolent machine god.

I have sketched an alignment plan that relies mostly on a good theory of generalization for imitation learners. The main bottleneck seems to be somewhat prosaic: understanding generalization of neural networks - though in a somewhat more challenging domain than usual. Rather than an i.i.d. setting, the network needs to learn to emulate a (human) agent that is also capable of learning. I again call for a theory of meta-(online learning), which algorithmic information theory probably has a lot to say about. The type of embedded / reflective decision theory problems I am interested in do crop up in several places (and I probably would not have come up with this approach if I hadn't been thinking about them and thereby pruning many less promising approaches). However, decision theory (perhaps disappointingly) seems a little less like a hard bottleneck. It's worth considering my comparative advantages and the neglectedness of various obstacles. I hope to collaborate with singular learning theorists on some of these problems. A more agentic version of me would probably have already founded a startup focused entirely on safe imitation learning while I was writing this post.   

  1. ^

    If I remember properly, Bostrom points at a similar risk in "Superintelligence."

  2. ^

    Another idea which Demski and I came up with (this time at ILIAD 2).