At least some of these seem possible to make unlikely. If we can force the AGI to have only a few routes to hack or manipulate the panel, the signal or its effects without a dominating penalty, and strongly reinforce those, then we could avoid worst-case outcomes.
It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions.
It can only get away with killing everyone if it manages to hack or manipulate the panel, the signal or its effects.
It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. (...) It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain no gain?)
This was my main worry. Some ideas:
Aside: Does your response here rule out CIRL as a solution in general?
It may suborn the panel by various means.
We can make sure to choose the panelists well and that there are enough of them, since the AGI needs a minimum number of votes from a size-limited panel (if this is the route it's taking). The AGI would need to get almost all of the panel to succeed, by design, so this would be very risky for it to attempt.
If we assume the panelists can't be hidden from the AGI anyway, we can have them meet each other in person (pairwise?) and undergo interrogations to check that they're not compromised.
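As a rough illustration of why requiring almost all of the panel makes this route risky for the AGI, here is a minimal sketch. It assumes (my assumption, not something established above) that each attempt to compromise a panelist succeeds independently with the same probability `p`; the function name and all parameters are hypothetical.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent attempts succeed.

    n: panel size (hypothetical parameter)
    k: minimum number of compromised votes the AGI needs
    p: assumed per-panelist probability that a compromise attempt succeeds
    """
    # Upper tail of a binomial distribution with parameters (n, p).
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Compare a simple-majority threshold with a near-unanimity threshold.
    for needed in (6, 9, 10):
        print(needed, prob_at_least(10, needed, 0.5))
```

Under these assumptions, raising the required number of votes toward the full panel size shrinks the AGI's chance of success quickly. Real compromise attempts would not be independent, so this is only a toy model.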
It may determine that a period of even severe disincentive is something that it can get through with its desired self intact.
Hmm, if we successfully align its goals with maximizing the signal, then this could still make sense for it to do if it expects to wirehead itself afterward. But if it does wirehead itself, would that make it much less dangerous? I suppose it might then focus on self-preservation (and spreading?) so that it could wirehead indefinitely, and accumulating large amounts of resources would still be useful for that.
Maybe we can hardcode the AGI so that later rewards can't make up for earlier failures (e.g. something like what is described at the bottom of this comment), although this might make it useless, so then no one would use such a design.
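To make "later rewards can't make up for earlier failures" a bit more concrete, here is one possible toy sketch. It is not the design from the linked comment. It assumes a reward below some `failure_threshold` counts as a failure, and that the total is locked in at the first failure; both function names and the threshold are hypothetical.

```python
from typing import Sequence


def additive_return(rewards: Sequence[float]) -> float:
    """Ordinary sum: a large later reward can offset an earlier punishment."""
    return sum(rewards)


def non_compensable_return(
    rewards: Sequence[float], failure_threshold: float = 0.0
) -> float:
    """Sum rewards until the first failure, then stop counting.

    A reward below failure_threshold is treated as a failure (an assumption
    made for illustration). Rewards after the first failure are ignored, so
    they cannot compensate for it.
    """
    total = 0.0
    for r in rewards:
        if r < failure_threshold:
            # Lock in the total including the punishment; ignore what follows.
            return total + r
        total += r
    return total


if __name__ == "__main__":
    history = [1.0, -5.0, 10.0]
    print(additive_return(history))         # the later reward offsets the punishment
    print(non_compensable_return(history))  # the later reward is ignored
```

This also shows the worry about usefulness: after a single failure, an agent scored this way has nothing left to gain from behaving well.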
But I could get a robot, that has no qualia, but has temperature detecting mechanisms, to say something like “I have detected heat in this location and cold in this location and they are different.” I don’t think my ability to distinguish between things is because they “feel” different; rather, I’d say that insofar as I can report that they “feel different” it’s because I can report differences between them. I think the invocation of qualia here is superfluous and may get the explanation backwards: I don’t distinguish things because they feel different; things “feel different” if and only if we can distinguish differences between them.
I have access to the contents of my mental states, and that includes information that allows me to identify and draw distinctions between things, categorize things, label things, and so on. A “feeling” can be cashed out in such terms, and once it is, there’s nothing else to explain, and no other properties or phenomena to refer to.
What's the nature of these differences and this information, though? What exactly are you using to distinguish differences? Isn't it experienced? The information isn't itself a set of "symbols" (e.g. words, read or heard), or maybe sometimes it is, but those symbols aren't then made up of further symbols. Things don't feel hot or cold to you because there are different symbols assigned to them that you read off or hear, or to the extent that they are, you're experiencing those symbols as being read or heard, and that experience is not further composed of symbols.
Then I’m even more puzzled by what you think qualia are. Qualia are, I take it, ineffable, intrinsic qualitative properties of experiences, though depending on what someone is talking about they might include more or less features than these. I’m not sure qualia can be “functional” in the relevant sense.
I might just be confused here. I was thinking that the illusion of ineffability, "seemingness", could be a functional property, and that what you're using to distinguish experiences are parts of these illusions. Maybe that doesn't make sense.
I don't know. I just want to know what qualia are. Either people can explain what qualia are or they can’t. My inability to explain something wouldn’t justify saying “therefore, qualia,” so I’m not sure what the purpose of the questions are. I’m sure you don’t intend to invoke “qualia of the gaps,” and presume qualia must figure into any situation in which I, personally, am not able to answer a question you've asked.
I might have been switching back and forth between something like "qualia of the gaps" and a more principled argument, but I'll try to explain the more principled one clearly here:
For each of the functional properties you've pointed out so far, I would say they "feel like something". You could keep naming things that "feel like something" (desires, attitudes, distinguishing, labelling or categorizing), and then explaining those further in terms of other things that "feel like something", and so on. Of course, presumably some functional properties don't feel like anything, but to the extent that they don't, I'd claim you're not aware of them, since everything you're aware of feels like something. If you keep explaining further, eventually you have to hit an explanation that can't be further explained even in principle by further facts you're conscious of (eventually the reason is unconscious, since you're only conscious of finitely many things at any moment). I can't imagine what this final conscious explanation could be like if it doesn't involve something like qualia, something just seeming some way. So, it's not about there being gaps in any particular explanation you try to give in practice, it's about there necessarily always being a gap. What is a solution supposed to look like?
Of course, this could just be a failure of my own imagination.
The ability to distinguish the experiences in a way you can report on would be at least one functional difference, so this doesn't seem to me like it would demonstrate much of anything.
It is a functional difference, but there must be some further (conscious?) reason why we can do so, right? Where I want to go with this is that you can distinguish them because they feel different, and that's what qualia refers to. This "feeling" in qualia, too, could be a functional property. The causal diagram I'm imagining is something like
Unconscious processes (+unconscious functional properties) -> ("Qualia", Other conscious functional properties) -> More conscious functional properties
And I'm trying to control for "Other conscious functional properties" with my questions, so that the reason you can distinguish two particular experiences goes through "Qualia". You can tell two musical notes apart because they feel (sound) different to you.
I don't know. Likewise for most of the questions you ask. "What are the functional properties of X?" questions are very strange to me. I am not quite sure what I am being asked, or how I might answer, or if I'm supposed to be able to answer. Maybe you could help me out here, because I'd like to answer any questions I'm capable of answering, but I'm not sure what to do with these.
I'm not sure if what I wrote above will help clarify. You also wrote:
Well, I would cash out it "feeling hot" in functional terms. That I feel a desire to move my hand away from the object, that I can distinguish it from something cold or at least not hot, and so on. There doesn't seem to me to be anything else to touching a hot thing than its relational properties and the functional role it plays in relation to my behavior and the rest of my thoughts.
How would you cash out "desire to move my hand away from the object" and "distinguish it from something cold or at least not hot" in functional terms? To me, both of these explanations could also pass through "qualia". Doesn't desire feel like something, too? I'm asking you to cash out desire and distinguishing in functional terms, too, and if we keep doing this, do "qualia" come up somewhere?
Ok, I think I get the disagreement now.
I can be in an entire room of people insisting that red has the property of "redness" and that chocolate is "chocolately" and so on, and they all nod and agree that our experiences have these intrinsic what-its-likeness properties. This seems to be what people are talking about when they talk about qualia. To me, this makes no sense at all. It's like saying seven has the property of "sevenness." That seems vacuous to me.
Hmm, I'm not sure it's vacuous, since it's not like they're applying "redness" to only one thing; redness is a common feature of many different experiences. 14 could have "sevenness", too.
Maybe we can think of examples of different experiences where it's hard to come up with distinguishing functional properties, but you can still distinguish the experiences? Maybe the following questions will seem silly/naive, since I'm not used to thinking in functional terms. Feel free to only answer the ones you think are useful, since they're somewhat repetitive.
Well, I would cash out it "feeling hot" in functional terms. That I feel a desire to move my hand away from the object, that I can distinguish it from something cold or at least not hot, and so on.
I'm guessing you would further define these in functional terms, since they too seem like the kinds of things people could insist qualia are involved in (desire, distinguishing). What would be basic functional properties that you wouldn't cash out further? Do you have to go all the way down to physics, or are there higher-level basic functional properties? I think if you go all the way down to physics, this is below our awareness and what our brain actually has concepts of; it's just implemented in them.
If you were experiencing sweetness in taste (or some other sensation) for the first time, what would be its functional properties that distinguish it from other things? Could this be before you formed attitudes about it, or are the attitudes simultaneous and built in as a necessary component of the experience?
I think that no, it does not sound like anything. It's in English, and it's "my voice," but it doesn't "sound like" my actual speaking voice.
What functional properties would you point to that your experiences of your actual speaking voice have, but your experiences of your inner voice don't? And that can't be controlled for? E.g. what if you were actually speaking out loud, but couldn't hear your own voice, and "heard" your inner voice instead? How does this differ from actually hearing your voice?
It's more that when people try to push me to have qualia intuitions, I can introspect, report on the contents of my mental states, and then they want me to locate something extra.
Are they expecting qualia to be more than a mental state? If you're reporting the contents of your mental states, isn't that already enough? I'm not sure what extra there should be for qualia. Objects you touch can feel hot to you, and that's exactly what you'd be reporting. Or would you say something like "I know it's hot, but I don't feel it's hot"? How would you know it's hot but not feel it's hot, if your only information came from touching it? Where does the knowledge come from? Are you saying that what you're reporting is only the verbal inner thought you had that it's hot, and that happened without any conscious mental trigger?
If it's only the verbal thought, on what basis would you believe that it's actually hot? The verbal thought alone? (Suppose it's also not hot enough to trigger a reflexive response.)
Doesn't your inner monologue also sound like something? (FWIW, I think mine has one pitch and one volume, and I'm not sure it sounds like anyone's voice in particular (even my own). It has my accent, or whatever accent I mimic.)
More generally, the contents of your mental states are richer than the ones you report on symbolically (verbally or otherwise) to yourself or others, right? Like you notice more details than you talk to yourself about in the moment, e.g. individual notes in songs, sounds, details in images, etc. Isn't this perceptual richness what people mean by qualia? I don't mean to say that it's richer than your attention, but you can attend to individual details without talking about them.
The way I imagine any successful theory of consciousness going is that even if it has a long parts (processes) list, every feature on that list will apply pretty ubiquitously to at least a tiny degree. Even if the parts need to combine in certain ways, that could also happen to a tiny degree in basically everything, although I'm much less sure of this claim; I'm much more confident that I can find the parts in a lot of places than in the claim that basically everything is like each part, so finding the right combinations could be much harder. The full complexity of consciousness might still be found in basically everything, just to a usually negligible degree.
I've written more on this here.
My main objection (or one of my main objections) to the position is that I don't think I'm self-aware to the level of passing something like the mirror test or attributing mental states to myself or others during most of my conscious experiences, so the bar for self-reflection seems set too high. My self-representations may be involved, but not to the point of recognizing my perceptions as "mine", or at least the "me" here is often only a fragment of my self-concept. My perceptions could even be integrated into my fuller self-concept, but without my awareness. The kinds of self-reflection involved when mice suffer from the rubber hand (tail) illusion or when dogs recognize their own bodies as being in the way or when (I think) animals learn to communicate their emotions generally in different trainer-selected ways (point 5 here) seem like enough to match many of my everyday conscious experiences, if any self-reflection is required at all.
It also wouldn't be necessary for the self-representations to be fully unified across all senses or over time, since local integration is global with respect to the stuff being integrated; animals could have somewhat separate self-representations. Still, I do think mammals and birds (and plausibly most vertebrates) do integrate their senses to a large extent, and I think many invertebrates probably do to some extent, too, given, for example, evidence for tradeoffs and prioritization between pain and other perceptions in some crustaceans, as well as for cross-modal learning in bees. I know less about any possible self-reflection in invertebrates, and I've seen papers arguing that they lack it, at least with respect to pain processing.
I would also add that the fear responses, while participating in the hallucinations, aren't themselves hallucinated, not any more than wakeful fear is hallucinated, at any rate. They're just emotional responses to the contents of our dreams.
Since pain involves both sensory and affective components which rarely come apart, and the sensory precedes the affective, it's enough to not hallucinate the sensory.
I do feel like pain is a bit different from the other interoceptive inputs in that the kinds of automatic responses to it are more like those to emotions, but one potential similarity is that it was more fitness-enhancing for sharp pain (and other internal signals going haywire) to wake us, but not so for sight, sound or emotions. Loud external sounds still wake us, too, but maybe only much louder than what we dream.
It's not clear that you intended otherwise, but I would also assume that nothing is actively suppressing pain hallucination (like a hyperparameter); rather, hallucination is costly and doesn't happen by default, so only things that are useful and safe to hallucinate can get hallucinated.
Also, don't the senses evoked in dreams mostly match what people can "imagine" internally while awake, i.e. mostly just sight and sound? There could be common mechanisms here. Can people imagine pains? I've also heard it claimed that our inner voices only have one volume, so maybe that's also true of sound in dreams?
FWIW, I think I basically have aphantasia, so can't visualize well, but I think my dreams have richer visual experiences.
Once you can report fine-grained beliefs about your internal state (including your past actions, how they cohere with your present actions, how this coherence is virtuous rather than villainous, how your current state and future plans are all the expressions of a single Person with a consistent character, etc.), there's suddenly a ton of evolutionary pressure for you to internally represent a 'global you state' to yourself, and for you to organize your brain's visible outputs to all cohere with the 'global you state' narrative you share with others; where almost zero such pressure exists before language.
I have two main thoughts on alternative pictures to this:
Nell Watson: So, in your argument, would it follow then that feral human children or profoundly autistic human beings cannot feel pain, because they lack language to codify their conscious experience?
Rob Bensinger: Eliezer might say that? Since he does think human babies aren't conscious, with very high confidence.
But my argument is evolutionary, not developmental. Evolution selected for consciousness once we had language (on my account), but that doesn't mean consciousness has to depend on language developmentally.
I'd still guess the cognitive and neurological differences wouldn't support babies being conscious while most mammals are not. What differences could explain the gap?
GPT-3 does of course have an internal state that depends on what it's read, and it can answer questions and respond to prompts about what it's read.