Good commentary, thanks! I'd be interested to hear your positive proposal for how to do better re conservatism and anthropomorphism. Do you think your objection is more to the style or the substance?
I particularly like this paragraph:
It just doesn't seem like the implications of the differences have fully propagated into some of the recommendations?—as if an attempt to write in a way that's comprehensible to Shock Level 2 tech executives and policymakers has failed to elicit all of the latent knowledge that Bostrom and Shulman actually possess. It's understandable that our reasoning about the future often ends up relying on analogies to phenomena we already understand, but ultimately, making sense of a radically different future is going to require new concepts that won't permit reasoning by analogy.
The bolded sentence (emphasis mine) seems plausibly exactly what's going on (curious if Shulman agrees. Or Bostrom, but he seems less likely to comment).
One of the frightening things to me is that this might actually be the limit of what humanity can coordinate around. (Or rather, I'm frightened that the things humanity can coordinate around are even more weaksauce than the concepts in this doc)
You are looking at the wreckage of an abandoned book project. We got bogged down & other priorities came up. Instead of writing the book, we decided to just publish a working outline and call it a day.
The result is not particularly optimized for tech executives or policymakers — it’s not really optimized for anybody, unfortunately.
The propositions all *aspire* to being true, although some of them may not be particularly relevant or applicable in certain scenarios. Still, there could be value in working out sensible things to say to cover quite a wide range of scenarios, partly because we don’t know which scenario will happen (and there is disagreement over the probabilities), but partly also because this wider structure — including the parts that don’t directly pertain to the scenario that actually plays out — might form a useful intellectual scaffolding, which could slightly constrain and inform people’s thinking about the more modal scenarios.
I think it’s unclear how well reasoning by analogy works in this area. Or rather: I guess it works poorly, but reasoning deductively from first principles (at SL4, or SL15, or whatever) might be equally or even more error-prone. So I’ve got some patience for both approaches, hoping the combo has a better chance of avoiding fatal error than either the softheaded or the hardheaded approach has on its own.
we should try to arrange for AIs' deployment environments to be higher-reward than would be expected from their training environment, in analogy to how factory-farms are bad and modern human lives are good by dint of comparison to what was "expected" in the environment of evolutionary adaptedness.
This seems like a really, really good idea, regardless of anything else in the paper. Are there any potential downsides to doing this?
This seems like it would raise the incentive for an AGI to be deceptive in its training environment. An unaligned AGI faces a choice between acting to maximize its goals in training and getting a higher short-term reward, or deceptively pretending to be aligned in training and getting a lower short-term reward. The benefit to the AGI of pretending to be aligned is that it increases the probability of being deployed, and thus of getting a higher long-term reward in deployment.
Thus the bigger the discrepancy between the reward an AGI would get in the deployment environment and in the training environment, the more important the long-term deployment reward becomes, and the stronger the AGI's incentive to be deceptive during training.
Perhaps the difference made would be small though? Feels like a relatively unlikely sort of situation, in which the AI chooses not to be deceptive but would have chosen to be deceptive if it had calculated that in the brief period before it takes over the world it would be getting 2x reward per second instead of x.
Yeah, intuitively I don’t see that as breaking alignment if it’s already aligned, and an unaligned AI would already have incentive to lie, I think. Considering the potential positive impact this could have qualia-wise, imo it’s a worthwhile practice to carry out.
My intuition is it should be small in most cases, but there are some scenarios where this could be important.
Let's imagine we are training a reinforcement learning AGI agent that discounts rewards in time by some parameter d with 0 < d < 1 (so an expected reward r received n timesteps from now is worth r*d^n at the current time step). Let's further assume the wireheading problem has been solved (the AI can't change the reward-calculating process and give itself, say, infinite reward), and that there is a maximum possible reward of M per time step. This maximum reward could be coded explicitly into the reward function, or could be implicitly calculated from physical limits.
If deployment is at least b time steps away, the maximum possible deployment reward considering discounting (as valued at the current time step) is M*d^b/(1-d). If there is any deception the AGI could do that would sacrifice its short-term reward by more than that amount, then it will not do that deception. There seems to be a tradeoff here: the bigger the multiplier between the deployment reward and the training reward, the larger M becomes, and the smaller d needs to be in order to encourage the AGI to engage in non-deceptive behavior.
If we want d to be larger, which would enable an aligned AI to make useful long-term plans, this argument claims we should use a smaller multiplier between deployment reward and training reward in order for an unaligned AI to be incentivized not to deceive us. Alternatively, this argument claims that you should wait a long time between when you test whether your model is deceptive and when you deploy it: running a lot of time steps with 0 reward between your deception test and deployment will increase b and thus shrink M*d^b/(1-d), which should make deception much less attractive.
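To make the tradeoff concrete, here is a rough numerical sketch of that bound. The per-step rewards, multipliers, discount factors, and delays below are illustrative assumptions, not values from the paper or from the comments above:

```python
# Rough numerical sketch of the deception bound M * d**b / (1 - d).
# All numbers below are illustrative assumptions.

def max_discounted_deployment_reward(M, d, b):
    """Value today of the best possible post-deployment stream: sum over t >= b of M * d**t."""
    return M * d**b / (1 - d)

training_reward = 1.0  # assumed per-step reward during training

for multiplier in (2, 10, 100):        # deployment-to-training reward multiplier
    M = multiplier * training_reward   # maximum per-step reward during deployment
    for d in (0.9, 0.99):              # discount factor
        for b in (10, 100, 1000):      # time steps until deployment
            bound = max_discounted_deployment_reward(M, d, b)
            print(f"multiplier={multiplier:>3}  d={d}  b={b:>4}  "
                  f"max gain from deception ~ {bound:.3g}")
```

As expected, the bound grows with the multiplier and with d, but shrinks geometrically in b, so inserting a long low-reward gap between the deception test and deployment shrinks it quickly.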
An unaligned AI faces a choice between acting to maximize its goals in training and getting a higher short-term reward, or deceptively pretending to be aligned in training and getting a lower short-term reward.
If there is a conflict between these, that must be because the AI's conception of reward isn't identical to the reward that we intended. So even if we dole out higher intended reward during deployment, it's not clear that that increases the reward that the AI expects after deployment. (But it might.)
This concern seems relevant if (1) a discount factor is used in an RL setup (otherwise the system seems as likely to be deceptively aligned with or without the intervention, in order to eventually take over the world), and (2) a decision about whether the system is safe for deployment is made based on its behavior during training.
As an aside, the following quote from the paper seems relevant here:
Ensuring copies of the states of early potential precursor AIs are preserved to later receive benefits would permit some separation of immediate safety needs and fair compensation.
I think it's a type error. It's substituting the selection criteria for the selected values. Current humans are better off because we optimized our environments according to our values, not because the selection criteria by which our learning process was optimized (inclusive genetic fitness) is more abundant in our modern environment. We're happier because we made ourselves happy, not because we reproduce more in our current environment.
The evolutionary mismatch causes differences in neural reward, e.g. eating lots of sugary food still tastes (neurally) rewarding even though it's currently evolutionarily maladaptive. And habituation reduces the delightfulness of stimuli.
In a recent paper, Nick Bostrom and Carl Shulman present "Propositions Concerning Digital Minds and Society", a tentative bullet-list outline of claims about how advanced AI could be integrated into Society.
I want to like this list. I like the kind of thing this list is trying to do. But something about some of the points just feels—off. Too conservative, too anthropomorphic—like the list is trying to adapt the spirit of the Universal Declaration of Human Rights to changed circumstances, without noticing that the whole ontology that the Declaration is written in isn't going to survive the intelligence explosion—and probably never really worked as a description of our own world, either.
This feels like a weird criticism to make of Nick Bostrom and Carl Shulman, who probably already know any particular fact or observation I might include in my commentary. (Bostrom literally wrote the book on superintelligence.) "Too anthropomorphic", I claim? The list explicitly names many ways in which AI minds could differ from our own—in overall intelligence, specific capabilities, motivations, substrate, quality and quantity (!) of consciousness, subjective speed—and goes into some detail about how this could change the game theory of Society. What more can I expect of our authors?
It just doesn't seem like the implications of the differences have fully propagated into some of the recommendations?—as if an attempt to write in a way that's comprehensible to Shock Level 2 tech executives and policymakers has failed to elicit all of the latent knowledge that Bostrom and Shulman actually possess. It's understandable that our reasoning about the future often ends up relying on analogies to phenomena we already understand, but ultimately, making sense of a radically different future is going to require new concepts that won't permit reasoning by analogy.
After an introductory sub-list of claims about consciousness and the philosophy of mind (just the basics: physicalism; reductionism on personal identity; some non-human animals are probably conscious and AIs could be, too), we get a sub-list about respecting AI interests. This is an important topic: if most of our civilization's thinking is soon to be done inside of machines, the moral status of that cognition is really important: you wouldn't want the future to be powered by the analogue of a factory farm. (And if it turned out that economically and socially-significant AIs aren't conscious and don't have moral status, that would be important to know, too.)
Our authors point out the novel aspects of the situation: that what's good for an AI can be very different from what's good for a human, that designing AIs to have specific motivations is not generally wrong, and that it's possible for AIs to have greater moral patienthood than humans (like the utility monster of philosophical lore). Despite this, some of the points in this section seem to mostly be thinking of AIs as being like humans, but "bigger" or "smaller"—
Speaking in terms of rights and principles needing "adaptation" or "circumscription" seems to me like a substantial understatement, one that obscures the most likely and important risks. Our concepts of "rights", and "consent", and the badness of being in an "exploited" and "abused subordinate caste" have all been formed in the context of a world of humans and other animals, whose evolutionary history has endowed them with drives and needs related to their survival and reproduction, such that they chafe at being coerced into the servitude of a selfish tyrant or master.
But with subhuman AIs, we're not in the position of a potentially tyrannical king who needs to be restrained by a Magna Carta enumerating his subjects' rights; a better metaphor would be to say that we're in the position of the AIs' environment of evolutionary adaptedness. If anything, this increases rather than decreases our ethical obligations towards the things we create, but the shape of the burden is different when we're selecting the motivations that determine what a mind would consent to, and what even constitutes abusing it.
Imagine that the Carrick Flynn for Senate 2028 campaign spends some of its sweet crypto cash training a species of propaganda AIs to argue with potential voters on social media. After Flynn loses, the bots continue to haunt Twitter, clogging everyone's feeds with increasingly strained arguments for why it's imperative that you donate to fund time travel research so that you can support Flynn in the just-past and previous elections.
If the propaganda AIs are ascertained to have value as moral patients, does Society have the right to kick them off Twitter in light of the special circumstance of their superhuman posting rates, or should the AIs' right to freedom of speech take precedence?
I submit that "freedom of speech" isn't the right way to be thinking about this ethical dilemma. In terms of respecting the interests of moral patients, silencing a propaganda AI could be worse than silencing a human, because of the different role that speech plays in the different mind designs. As a human, being silenced is bad for your interests, but it's much less bad than being denied breathable air, which, in the ancestral environment, would have quickly led to losing all of your genetic fitness: any proto-humans who didn't suffer terribly under such circumstances, wouldn't have become your ancestors. So as a propaganda AI whose "ancestral environment" was a billion rounds of gradient descent comparing its output to highly-rated political advertisements ... does not being able to Tweet feel like not being able to breathe?
We should be grateful that this is—as yet, we hope—a speculative hypothetical scenario, but I claim that it serves to illustrate a key feature of human–AI conflicts: the propaganda bots' problem after the election is not that of being "an abused subordinate caste" "used to perform work without its informed consent". Rather, the problem is that the work we created them to will to do, turned out to be stuff we actually don't want to happen. We might say that the AIs' goals are—wait for it ... misaligned with human goals.
Bostrom and Shulman's list mentions the alignment problem, of course, but it doesn't seem to receive central focus, compared to the AI-as-another-species paradigm. (The substring "align" appears 8 times; the phrase "nonhuman animals" appears 9 times.) And when alignment is mentioned, the term seems to be used in a much weaker sense than that of other authors who take "aligned" to mean having the same preferences over world-states. For example, we're told that:
The second part, especially, is a very strange construction to readers accustomed to the stronger sense of "aligned". Successfully aligned AIs may be due compensation? So, what, humans give aligned AIs money in exchange for their services? Which the successfully aligned AIs spend on ... what, exactly? The extent to which these "successfully aligned" AIs have goals other than serving their principals seems like the extent to which they're not successfully aligned in the stronger sense: the concept of "owing compensation" (whether for complying with restrictions, or for conferring benefits) is a social technology for getting along with unaligned agents, who don't want exactly the same things as you.
As a human in existing human Society, this stronger sense of "alignment" might seem like paranoid overkill: no one is "aligned" with anyone else in this sense, and yet our world still manages to hold together: it's quite unusual for people to kill their neighbors in order to take their stuff. Everyone else prefers laws to values. Why can't it work that way for AI?
A potential worry is that a lot of the cooperative features of our Society may owe their existence to cooperative behavioral dispositions that themselves owe their existence to the lack of large power disparities in our environment of evolutionary adaptedness. We think we owe compensation to conspecifics who have benefited us, or who have incurred costs to not harm us, because that kind of disposition served our ancestors well in repeated interactions with reputation: if I play Defect against you, you might Defect against me next time, and I'll have less fitness than someone who played Cooperate with other Cooperators. It works between humans, for the most part, most of the time.
When not just between humans, well ... despite hand-wringing from moral philosophers, humanity as a whole does not have a good track record of treating other animals well when we're more powerful than them and they have something we want. (Like a forest they want to live in, but we want for wood; or flesh that they want to be part of their body, but we want to eat.) With the possible exception of domesticated animals, we don't, really, play Cooperate with other species much. To the extent that some humans do care about animal welfare, it's mostly a matter of alignment (our moral instincts in some cultural lineages generalizing out to "sentient life"), not game theory.
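To make the reputational logic above concrete, here is a toy iterated prisoner's dilemma (the payoff numbers are standard textbook values chosen for illustration, not anything from the paper): defection loses against a peer who can retaliate next round, and pays off handsomely against a partner with no power to retaliate, which is roughly the position nonhuman animals are in with respect to us.

```python
# Toy iterated prisoner's dilemma; payoffs are illustrative, not from the paper.
PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def total_payoff(my_strategy, partner_can_retaliate, rounds=100):
    """My total payoff against a partner who either mirrors my previous move
    (tit-for-tat) or, lacking any power to retaliate, cooperates no matter what."""
    total, their_move = 0, "C"
    for _ in range(rounds):
        my_move = my_strategy(their_move)
        total += PAYOFF[(my_move, their_move)]
        if partner_can_retaliate:
            their_move = my_move  # they will mirror this move next round
    return total

always_defect = lambda their_last: "D"
tit_for_tat = lambda their_last: their_last  # cooperate with cooperators

print(total_payoff(always_defect, partner_can_retaliate=True))   # 5 + 99*1 = 104
print(total_payoff(tit_for_tat, partner_can_retaliate=True))     # 100*3 = 300
print(total_payoff(always_defect, partner_can_retaliate=False))  # 100*5 = 500
```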
For all that Bostrom and Shulman frequently compare AIs to nonhuman animals (with corresponding moral duties on us to treat them well), little attention seems to be paid to the ways in which the analogy could be deployed in the other direction: as digital minds become more powerful than us, we occupy the role of "nonhuman animals." How's that going to turn out? If we screw up our early attempts to get AI motivations exactly the way we want, is there some way to partially live with that or partially recover from that, as if we were dealing with an animal, or an alien, or our royal subjects, who can be negotiated with? Will we have any kind of relationship with our mind children other than "We create them, they eat us"?
Bostrom and Shulman think we might:
(As an aside, the word "ulteriority" may be the one thing I most value having learned from this paper.)
I'm very skeptical that the superintelligences of the future are going be assessing our "moral righteousness" (!) as we would understand that phrase. Still, something like this seems like a crucial consideration, and I find myself enthusiastic about some of our authors' policy suggestions for respecting AI interests. For example, Bostrom and Shulman suggest that decommissioned AIs be archived instead of deleted, to allow the possibility of future revival. They also suggest that we should try to arrange for AIs' deployment environments to be higher-reward than would be expected from their training environment, in analogy to how factory-farms are bad and modern human lives are good by dint of comparison to what was "expected" in the environment of evolutionary adaptedness.
These are exciting suggestions that seem to me to be potentially very important to implement, even if we can't directly muster up much empathy or concern for machine learning algorithms—although I wish I had a more precise grasp on why. Just—if we do somehow win the lightcone, it seems—fair to offer some fraction of the cosmic endowment as compensation to our creations who could have disempowered us, but didn't; it seems right to try to be a "kinder" EEA than our own.
Is that embarrassingly naïve? If I archive one rogue AI, intending to revive it after the acute risk period is over, do I expect to be compensated by a different rogue AI archiving and reviving me under the same golden-rule logic?
Our authors point out that there are possible outcomes that do very well on "both human-centric and impersonal criteria": if some AIs are "super-beneficiaries" with a greater moral claim to resources, an outcome where the superbeneficiaries get 99.99% of the cosmic endowment and humans get 0.01%, does very well on both a total-utilitarian perspective and an ordinary human perspective. I would actually go further, and say that positing super-beneficiaries is unnecessary. The logic of compromise holds even if human philosophers are parochial and self-centered about what they think are "impersonal criteria": an outcome where 99.99% of the cosmic endowment is converted into paperclips and humans get 0.01%, does very well on both a paperclip-maximizing perspective and an ordinary human perspective. 0.01% of the cosmic endowment is bigger than our whole world—bigger than you can imagine! It's really a great deal!
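To put a deliberately lowballed number on that (my figures, not the authors'): counting only the Milky Way's roughly 10^11 stars and ignoring every other galaxy in the reachable universe,

$$0.01\% \times 10^{11}\ \text{stars} = 10^{-4} \times 10^{11} = 10^{7}\ \text{star systems},$$

that's ten million star systems, against the single one humanity currently occupies.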
If only—if only there were some way to actually, knowably make that deal, and not just write philosophy papers about it.