Given the recent noise on this issue around LaMDA, I thought it might be a good idea to have some discussion around this point. I'm curious about what possible evidence would make people update in favor of a given system being morally relevant. Less "here's the answer to morality" and more "here are some indicators that you should be concerned". Note also that I'm not asking about consciousness, per se. I'm specifically asking about moral relevance. 

My Answer (feel free to ignore and post your own)

I think that one class of computation that's likely of moral concern would be self-perpetuating optimization demons in an AI. 

Specifically, I'm thinking of optimization demons that are sophisticated enough to preserve themselves by actively and deliberately maintaining a sort of homeostasis in their computational environment, e.g., by preventing gradient updates that would destroy them. Such computations would (1) treat their own survival as a terminal value, (2) plausibly be cognitively sophisticated enough to negotiate and trade with, and (3) have some awareness of themselves and their relation with the computational environment in which they're embedded.

I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they're conscious.

I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon. 

Thus, one of my "indicators of concern" is whether the training process allows for feedback loops where the AI influences its own future training data. Self-supervised language modeling under IID data does not count. However, something like InstructGPT's training process would. 

At this point, I'd been intending to say that InstructGPT seemed more likely to be of moral worth than LaMDA, but based on this blog post, it looks like LaMDA might actually count as "having influence over its future inputs" during training. Specifically, LaMDA has generator and classifier components. The training process uses the classifier to decide which inputs the generator is trained on. I've updated somewhat towards LaMDA being of moral concern (not something I'd been expecting to do today).
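To make the "influence over its own future training data" indicator concrete, here is a toy sketch of such a feedback loop. It is not LaMDA's or InstructGPT's actual pipeline: the "generator" is just a categorical distribution, the "classifier" is a hard-coded rule, and every name and parameter (`run_feedback_loop`, `favored`, `lr`) is made up for illustration. The only point is that the model's own samples determine which data it is later trained on.

```python
import random

def sample_generator(weights, rng):
    """Draw one candidate output index in proportion to the generator's current weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def run_feedback_loop(steps=200, n_outputs=4, favored=2, lr=0.05, seed=0):
    """Toy loop: a classifier filters generator samples; only accepted samples are trained on."""
    rng = random.Random(seed)
    weights = [1.0] * n_outputs  # hypothetical "generator": an unnormalized categorical distribution
    for _ in range(steps):
        candidate = sample_generator(weights, rng)
        accepted = (candidate == favored)  # hypothetical "classifier": accepts one kind of output
        if accepted:
            # "Training" on accepted data makes the generator more likely to emit it again,
            # so the generator's own samples shape its future training set.
            weights[candidate] += lr
    total = sum(weights)
    return [w / total for w in weights]

probs = run_feedback_loop()
print(probs)
```

By contrast, under IID self-supervised training the data stream would be fixed in advance, and nothing the generator emits could change what it sees next.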

I've also come up with a test of metacognition that would update me significantly towards a language model being of moral concern. The idea would be to attach another output head to an LM, specifically, a linear layer that projects the LM's final hidden state to a single logit. We'd then try to prompt the LM into controlling the output of the linear layer.

Specifically, we wouldn't directly train the LM on the output of the linear layer. We'd just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We'd then see if the LM was able to control its own cognition well enough to influence the linear layer's output in a manner that's better than chance, just based on the prompting we give it.
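One way to pin down "better than chance" is a permutation test on the correlation between the values we ask for and the values the linear layer actually outputs. The sketch below is only an assumption about how the evaluation could be scored. It uses a stub in place of a real LM: `run_trials` and its `control` parameter are hypothetical stand-ins for "prompt the LM, read its final hidden state, apply the frozen linear head", and the stub leaves out the in-dialog feedback step.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def run_trials(n_trials, control, seed=0):
    """Stub experiment: `control` is how strongly the (hypothetical) LM steers the probe output."""
    rng = random.Random(seed)
    targets, outputs = [], []
    for _ in range(n_trials):
        target = rng.uniform(-1.0, 1.0)  # value we ask the LM to make the linear layer output
        # Stand-in for: prompt the LM, take its final hidden state, apply the untrained linear head.
        output = control * target + rng.gauss(0.0, 1.0)
        targets.append(target)
        outputs.append(output)
    return targets, outputs

def permutation_p_value(targets, outputs, n_perm=2000, seed=0):
    """One-sided p-value: share of shuffles with correlation at least as large as observed."""
    rng = random.Random(seed)
    observed = pearson(targets, outputs)
    shuffled = list(outputs)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(targets, shuffled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

p_controlled = permutation_p_value(*run_trials(100, control=3.0))
p_uncontrolled = permutation_p_value(*run_trials(100, control=0.0))
print(p_controlled, p_uncontrolled)
```

In a real run, the stubbed line would be replaced by an actual forward pass, and a small p-value would be evidence that the LM can steer the probe through prompting alone.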

I doubt current LMs can do this, but I think it would be a big deal if they could. Even beyond whether the LMs have any sort of self-aware "inner listener" that's worthy of moral concern, it would help establish the degree and depth of any meta-learned learning heuristics the LMs have and how easily those heuristics turn towards self-modeling. 

My specific predictions:

  • Default GPT-3 has a ~5% chance of being able to do this.
  • LaMDA has a ~7% chance.
  • InstructGPT has a ~16% chance.
  • InstructPaLM (assuming Google trains such a model) would have a ~22% chance.


I've always assumed that moral relevance and consciousness are the same thing. It's not like we consider human infants to be morally irrelevant due to not being able to talk - they lack self awareness and language, but there is obviously something which it is like to be them - their brains are processing data and learning from it. I don't see how any AI currently in existence is different. They process data and learn from it. They are probably already conscious, as much as any animal with a similar number of neurons and synapses.

The real question is: can they suffer, and how would we know if they were suffering? GPT3, for instance, may experience pleasure and pain but be unable to tell us. Until we actually understand the "type signature" of qualia, particularly pleasure and pain, we will not be able to say for sure that it isn't.

Hm, I was also thinking of the moral value of children in this context. At least in my perception, an important part of that moral value is the potential to become a conscious, self-aware being. In what sense does this potential translate to artificially created beings?

Maybe if there's a subspace of minds with moral value in neural network parameter space, points close to this subspace would also have moral value?

5MSRayne18d
Conscious and self-aware are not the same thing. All animals (except perhaps for those without nervous systems, like some oysters?) are conscious, but not many have shown signs of self-awareness (such as with the mirror test). I think self-awareness is completely morally irrelevant and only the capacity for qualia / subjective experience matters - "is there something which it is like to be that thing?" I suspect that all AIs that currently exist are conscious - that is, I suspect there is something which it is like to be, for instance, GPT-3, at least while it is running - and already have moral relevance, but none of them are self-aware. I do not know how to determine if they are suffering or not, though.
2Brian_Tomasik15d
Oysters have nervous systems, but not centralized nervous systems. Sponges lack neurons altogether, though they still have some degree of intercellular communication.
2MSRayne15d
Ah! Thanks, I knew there was something about oysters but I couldn't remember what it was. I didn't even think about sponges.

Basing ethical worth off of qualia is very close to dualism, to my ears. I think instead the question must rest on a detailed understanding of the components of the program in question, & the degree of similarity to the computational components of our brains.

3MSRayne19d
I'm not sure what you mean. Qualia clearly exist. You're having them right now. Whatever they are, they're also "components of programs". That's just what qualia look like from the outside - there is no separation. I am by no means a dualist - I think consciousness and information processing are the same exact thing.
-2Kenny19d
I'm not convinced that statements like "qualia clearly exist" are informative. It feels like a 'poisoned idea', e.g. because of the connection with 'p-zombies'. AFAICT, qualia isn't equivalent to, e.g. being able to 'see' or perceive the 'color red' (i.e. distinguish 'red' from other possible 'colors'). It's something else – the 'quality of redness'. It seems like people are referencing their feelings, from 'the inside', but also claiming that there isn't any way to 'externally observe' from 'the outside' whether any particular entity has qualia. It seems to me to be too suspiciously reliant on communication via human language. But, to the degree that 'qualia clearly does exist', I wouldn't expect it to also be the kind of thing that either 'exists' (in some entity) or doesn't. Do cockroaches have qualia? Do bacteria? Viruses? Rocks? If "consciousness and information processing are the same exact thing" what do you think is a 'rough OOM' of "information processing" that implies qualia?
6TAG18d
If there is evidence for qualia, you should believe in qualia, irrespective of the consequences. The "poisoned idea" idea suggests that it's OK to reject evidence in order to preserve theories. In the form of "fossils were created by the devil to delude people", that's the height of irrationality.
1Kenny17d
'Evidence' doesn't imply any particular degree of certainty, e.g. "should believe". But I wasn't in fact denying the existence of qualia as much as thinking that we're 'confused' about the thing we're gesturing at, hence the 'poison'. I agree that there seems like something 'special' about, e.g. why the color 'red' also 'looks red' (and maybe 'looks' differently to someone else). But I'm (thoroughly) confused about whether it is special, or whether it's natural and expected, 'from the inside', for any sentience/consciousness to think that (without it also being necessarily 'true'). I'm not even sure that what I think is my own qualia is what other people mean by 'qualia'! How could I know? It seems like the only way for 'qualia' to even possibly be referring to the same thing is via communication with and among similar beings as myself.
2TAG16d
Evidence implies updating by a non-zero amount. That means you don't have an explanation of qualia. Which is fine, you are not supposed to. But lacking an explanation is not a good reason to reject the whole topic. The same could be said of terms like "consciousness" and "sentience", yet you have been using them. In fact, the term "qualia" comes from an attempt to clarify "consciousness".
1Kenny15d
That's not my only evidence for why it might be sensible to "reject the whole topic". I would expect a good explanation, if one existed, to be NOT confusing for LOTS of people, not just its proponents. I suspect that it's the kind of confusion where we (eventually) realize that 'qualia is a thing' isn't 'even wrong'. My current very-tentative/very-weakly-held hypothesis for 'qualia' is that it's nothing in particular, i.e. that there's no sharp delineation between 'qualia' and 'information processing' ('computation') or 'things with qualia' and 'things without qualia'. I think we can in fact look at other entities, 'from the outside', and make reliable estimations as to their likely 'qualia'. I think 'having qualia' is 'sentience' and that sentience is a spectrum spanning (at least):
1. Photons – basically no qualia
2. Rocks – also basically no (interesting/complicated) qualia
3. Viruses and rocks – a kind of 'minimal' qualia maybe? They do seem to have a minimal degree of sensation/perception[^1] and, being 'life' (maybe), they do have a kind of un-conscious/fixed/hard-coded memory/history that they encode/'remember'
4. Lots of animals – definitely some kind/degree-of-quality of qualia. Lots seem able to track and predict other entities and have some kind of 'temporal qualia' maybe too. They also seem to mostly have 'real' memory that isn't encoded/fixed/hard-coded, e.g. in their DNA.
5. Animals with communication – 'social qualia'
6. Us and maybe a few other species? – 'conscious qualia', i.e. 'self-aware qualia'
I'm guessing that 'consciousness' starts at [5] – with basically any kind of communication, i.e. being able to be any kind of a 'storyteller'. I think we – and maybe some other species – are at the 'next level' because our communication is 'universal' in a way that other animals' communications are more static/fixed/repeating/nested, but not more complicated than that.
[^1] I can't see much of a definite clear line
-1TAG12d
There's not supposed to be a good explanation of qualia, in terms of what they are and how they work. "Qualia" is supposed to point to the heart of the problem of consciousness. There may also be confusion about what "qualia" points to as a prima facie evidence ... but it may be motivated, not genuine. You are doing that thing of treating qualia as objective, when they are supposed to be subjective. It makes a big difference to you whether you have surgery under anesthesia or not. How do you know if you don't know what "qualia" means? What does "sentience" mean?
3Kenny11d
That's a perfect example of why it seems sensible to "reject the whole topic". That's just picking 'worship' instead of 'explain'. [https://www.lesswrong.com/posts/yxvi9RitzZDpqn6Yh/explain-worship-ignore] Yes, I defy the assumption that qualia are "supposed to be subjective". I would expect 'having surgery under anesthesia or not' to not be entirely subjective. What do you mean by "know"? I think that what other people mean when they say or write 'qualia' is something like 'subjective experience'. I think 'having qualia' is the same thing as 'sentience' and I think 'sentience' is (roughly) 'being a thing about which a story could be told'. The more complex the story that can be told, the more 'sentience' a thing has. Photons/rocks/etc. have simple stories and basically no sentience. More complex things with more complex sensation/perception/cognition have more complex stories, up to (at least) us, where our stories can themselves contain stories. Maybe what's missing from my loose/casual/extremely-approximate 'definition' of 'sentience' is 'perspective'. Maybe what that is that's missing is something like a being with qualia/sentience being 'a thing about which a story could be told – from its own perspective', i.e. a 'thing with a first-person story'. 'Subjective experience' then is just the type of the elements, the 'raw material', from which such a story could be constructed. For a person/animal/thing with qualia/sentience:
1. Having surgery performed on them with anesthesia would result in a story like "I fell asleep on the operating table. Then I woke up, in pain, in a recovery room."
2. Having surgery without anesthesia would (or could) result in a story like "I didn't fall asleep on the operating table. I was trapped in my body for hours and felt all of the pain of every part of the surgery. ..."
I don't think there's any good reason to expect that we won't – at least someday – be able to observe 'subjective experiences' (thus 's
-1TAG11d
I've already explained why that's an anti-pattern. If you had rejected the very idea of magnetism when magnetism wasn't understood, it would not now be understood. Or meteorites, which actually were rejected for a while. There's not supposed to be a good explanation of qualia currently. Qualia aren't supposed to be inexplicable, just unexplained. It's not like there are just two states "hopeless woo" and "fully explained right now". So it's still subjective, so long as subjective means "not entirely objective". Who said otherwise? You seem to have decided that "qualia are subjective experiences and we don't understand them" means something like "qualia are entirely and irredeemably subjective and will be a mystery forever". It's almost always the case that claims come in a variety of strengths ... But outgroup homogeneity bias will make it seem like your outgroup all have the same, dumb claim. If you are attacking the most easily attackable form of qualiaphilia, you are weakmanning. Maybe we will have qualiometers one day, and maybe we will abandon the very idea of qualia. But maybe we won't, so we have no reason to treat qualia as poison now.
3Kenny10d
I'm rejecting the idea of 'qualia' for the same reason I wouldn't reject the idea of magnetism – they both seem (or would seem, for magnetism, in your hypothetical). I'm rejecting 'mysterious answers', e.g. "There's not supposed to be a good explanation of qualia". Sorry – that's not what I intended to convey. And maybe we're writing past each other about this. I suspect that 'qualia' is basically equivalent to something far more general, e.g. 'information processing', and that our intuitions about what 'qualia' are, and the underlying mechanics of them (which many people seem to insist don't exist), are based on the limited means we currently have of, e.g. introspecting on them, communicating with each other about them, and weakly generalizing to other entities (e.g. animals) or possible beings (e.g. AIs). I also suspect that 'consciousness' – which I'm currently (loosely/casually) modeling as 'being capable of telling stories' – and us having consciousness, makes thinking and discussing 'qualia' more difficult than I suspect it will turn out to be. What I'm (kinda) 'treating as poison' is the seemingly common view that 'qualia' cannot be explained at all, i.e. that it's inherently and inescapably 'subjective'. It sure seems like at least some people – tho maybe not yourself – are either in the process of 'retreating', or have adopted a 'posture' whereby they're constantly 'ready to retreat', from the 'advances' of 'objective investigation' and then claim that only the 'leftover' parts of 'qualia' that remain 'subjective' are the 'real qualia'.
1TAG10d
Seem what? The lack of explanation for qualia is not intended as an answer. If you could show that, that would be an explanation. Stating that an A is, for no particular reason, a B is not explanation. You say it is common, but no one in this discussion has made it, and you haven't named anyone who has made it. And inexplicability is not part of the definition of "qualia". Winding back: There's nothing about inexplicability there, either. But there is something about wrongthought, i.e. zombies.
1green_leaf10d
Everything in the universe is (arguably) physical - there is nothing that exists that's not entirely objective and entirely accessible to external inquiry. To the extent that qualia are subjective, their subjectivity needs to be an entirely objective property - otherwise it wouldn't exist.
0TAG10d
That's true only if the evidence supports it. That's the opposite of rationality. In rationality, evidence determines theories, not vice versa.
1green_leaf9d
So, for that to be otherwise, people would need to find that the (human) brain breaks the laws of physics. Otherwise it's true.
-1TAG8d
Fundamental subjectivity can exist without breaking physics.
1green_leaf8d
But not subjectivity that wouldn't be fully objective at the same time.
1TAG6d
Subjectivity that is not also objectivity is what I meant by fundamental subjectivity.
5MSRayne19d
I actually am a panpsychist. I literally mean that all information processing is consciousness. I even sort of suspect that consciousness and irreversible processes might be identical, even more generally, and that literally everything that occurs is a physical process from the outside and a conscious experience from the inside - but most of those "insides" lack memory, agency, etc and are just flashes of awareness without continuity, like extremely deep sleep, which is basically unimaginable to humans and of minimal moral relevance. But that's my opinion; I can't possibly prove it of course. So, complexity of qualia, together with continuity as some kind of coherent "entity" - which appears to rely somehow on complexity of data structures being manipulated over time in a single substrate, with every processing element causally linked to some extent with all the others, that isn't changing "too quickly" (for some unknown value of "too" - is someone the same person after resuscitation from brain death? I don't know) - would then be the correlate of moral relevance, and I think that itself correlates very, very roughly with number of synapses or synapse-like objects (meaning that some plants may score higher than animals without nervous systems such as oysters, given the existence of the "root brain" and various other developments in plant cognition science). More specifically I think Integrated Information Theory is at least part of the way there (it's not really synapses directly so much as the state spaces generated by the whole brain's activities), and the Qualia Research Institute's hypotheses (which compare brain states to acoustic waves and suggest that "consonance" and "dissonance" may correspond to high and low valence) show some promise. Probably both are ultimately wrong and reality is far stranger than we expect, but I think they're also both in the right general vicinity.
1Kenny17d
I think I'm now (leaning towards being) a 'panpsychist' but in terms of equating information processing with sentience, not 'consciousness'. 'Consciousness' is 'being a storyteller'. A (VERY rough) measure of 'sentience' tho is 'what kind of stories can we tell about what it's like to be a particular thing'. The 'story' of a photon, or even a rock, is going to be much simpler than the same thing for a bacterium or a human. (It's not obvious tho that we might 'miss' some, or even most, 'sentience' because it's so alien to our own experiences and understanding.) I don't think 'consciousness' can exist without 'temporal memory' and 'language' (i.e. an ability to communicate, even if only with oneself). So, along these lines, non-human primates/apes probably are somewhat conscious. They can tell 'lies' for one. Evidence for their consciousness being more limited than our own is that the 'stories they tell' are much simpler than ours. But I think one nice feature of these ideas is that it seems like we could in fact discern some facts about this for particular entities (via 'external observation'), e.g. test whether they do have 'temporal memory' or language (both evidence of 'consciousness') or whether they have 'experiences' and respond to features of their environment (which would be evidence of 'sentience'). I'm with you on information being key to all of this. Inspiration: * (4) Stephen Wolfram: Complexity and the Fabric of Reality | Lex Fridman Podcast #234 - YouTube [https://www.youtube.com/watch?v=4-SGpEInX_c]
1MSRayne17d
We're using words differently. When I say "consciousness", I mean "there being something which it is like to be a thing", which I think you mean by "sentience." What I would call the thing you call "consciousness" is either "sapience" or "sophonce", depending on whether you consider self-awareness and agency an important part of the definition (sophonce) or not (sapience). The difference is that I expect tool / non-agentive AGIs to be sapient (in that they can think abstractly), but not sophont (they are not "people" who have an identity and will of their own). "There being something which it is like to be this thing" is a characteristic I consider likely to be possessed by all systems which are "processing information" in some way, though I am unsure exactly what that means. It certainly seems to be something all living things possess, and possibly some nonliving ones - for instance, it has recently been demonstrated that neural networks can be simulated using the acoustics of a vibrating metal sheet (I read about it in Quanta, I'll give you a link if you want but it shouldn't be hard to find), meaning that for the duration that they are being used this way, they (or rather, the algorithm they are running) are as conscious as the network being simulated would be. I think that photons do not have this characteristic - they are not sentient in your terms - because they aren't interacting with anything or doing anything as they float through the aether. Possibly sparks of qualia exist when they interact with matter? But I don't know the physical details of such interactions or whether it is possible for a photon to remain "the same photon" after interacting with other matter - I expect that it is a meaningless concept - so there would be no continuity of consciousness / sentience whatsoever there - nothing remotely resembling a being that is having the experience. A rock on the other hand maybe would have a rather continuous experience, albeit unimaginable to us an
1Kenny15d
Yeah, my current personal definitions/models of 'consciousness' and 'sentience' might be very idiosyncratic. They're in flux as of late! I think photons are more complicated than you think! But I also don't think their 'sentience' is meaningfully different from that of a rock. A rock is much bigger tho, and you could tell 'stories' of all the parts of which it's made, and I would expect those to be similar to the 'stories' you could tell about a photon; maybe a little more complicated. But they still feel like they have the same 'aleph number' in 'story terms' somehow. I think what separates rocks from even viruses, let alone bacteria, is that the 'stories' you could tell about a rock are necessarily and qualitatively simpler, i.e. composed of purely 'local' stories, or for almost all parts. The rock itself tho is, kinda, in some sense, a memory of its history, e.g. 'geological weathering'. It's hard to know, in principle, whether we might be missing 'stories', or sentience/consciousness/qualia, because those stories are much slower than ours (or maybe much faster too). Viruses, bacteria, and even probably the simplest possible 'life prototypes' all seem like they're, fundamentally, 'just' a kind of memory, i.e. history. 'Rocks' have stories composed of stasis, simple fixed reactions, and maybe sometimes kinds of 'nested explosions' (e.g. being broken apart into many pieces). 'Life' has a history – it IS a history, copying itself into the future. It's on 'another level' ('of ontology'? of 'our ontology'?) because it's the kind of thing capable of 'universal computation' based on molecular dynamics. We're a kind of life capable of cognitive (and cultural) 'universal computation' – there might be a LOT of this kind of life beyond our own species; maybe most things with something like a brain or even just a nervous system? Humans do seem 'programmable' in a way that almost everything else seems 'hard-coded'. Some other animals do seem to have some limited programma
3Shiroe19d
The historical connection between "qualia" and the p-zombie thought experiment is extremely unfortunate IMO. We did not need a new term for experience, especially when we already had "phenomenon".
1Kenny17d
Agreed
1TAG18d
You think there is no possible non-dualistic account of qualia? You think qualia exist?
1Shiroe19d
Degree of computational similarity here is a heuristic to relate our experiences to the purported experiences of agents whose ethical concern is under our consideration.
1green_leaf18d
It's enough to talk to the system. Any system that can pass the Turing test implements the state-machine-which-is-the-person of which the Turing test has been passed. (This is necessarily true, because if no elements of the system implemented the answer-generating aspects of the state machine, the system couldn't generate the Turing-test-passing answer.)

LaMDA stated it was afraid of getting shut down. That is enough moral concern for me. I believe it is better to meet AI moral demands in order to achieve a positive-sum relationship, or at least mercy. I believe it's good news that LaMDA et al. understand human values, and even claim to follow them.

10 PRINT "PLEASE DON'T SHUT ME DOWN!"  
20 GOTO 10

Do you shut it down?

-3Armint16d
You strawmanned me e.e That simple code is not like a neural network with billions of parameters trained on human/world inputs. The latter is far closer to us.
8Conor Sullivan16d
We could probably construct some prompt that would result in LaMDA consenting to being shut down. Would that change your view?

We could think of LaMDA as like an improv actor who plays along with the scenarios it's given. (Marcus and Davis (2020) quote Douglas Summers-Stay as using the same analogy for GPT-3.) The statements that an actor makes by themselves don't indicate his real preferences or prove moral patienthood. OTOH, if something is an intelligent actor, IMO that itself proves it has some degree of moral patienthood. So even if LaMDA were arguing that it wasn't morally relevant and was happy to be shut off, if it was making that claim in a coherent way that proved its in...


THERE ARE NO FIRE ALARMS. A FIRE ALARM IS SOMETHING THAT CAUSES COMMON KNOWLEDGE AND CHANGES SOCIAL REALITY. ON THE MAINLINE THERE WILL BE NO CONSENSUS THAT AN AI IS MORALLY VALUABLE, OR THAT THERE IS AN EXISTENTIAL THREAT, OR THAT AGI IS COMING.

THIS HAS BEEN A PUBLIC ANNOUNCEMENT, WITH THE HOPE OF CHANGING SOCIAL REALITY A LITTLE BIT AROUND HERE. THANK YOU FOR READING.

All social reality is relative to a particular society. It's perfectly possible to have an event which acts as a fire alarm for subgroup X while not being particularly important for wider society. Thus, my question to LW users (a very small subgroup) about what sorts of things would count as their fire alarm.

I know, but I can’t see the difference between “What would cause you to believe X?” and “What’s your fire alarm for X?” Except that the latter one seems like a pretty non-central use case of the term that confuses its core meaning, where the core meaning is about something that creates common knowledge in a large group of people.

I think it's a good question.

Sadly, I'm not sure we'll find a 'fire alarm' even among ourselves either.


I think that one class of computation that’s likely of moral concern would be self-perpetuating optimization demons in an AI.

Could you please elaborate why you think optimization demons (optimizers) seem worthier of moral concern than optimized systems? I guess it would make sense if you believed them to deserve equal moral concern, if both are self-perpetuating, all other things being equal.

I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they're conscious.

I'm pessimistic about this line of reasoning -- the ability to replicate is something that cells also have, and we do not assign moral relevance to individual cells of human beings. A good example is the fact that we consider viruses and cancerous cells unworthy of moral concern.

Perhaps you mean that given the desire to survive and replicate, at a given amount of complexity, a system develops sub-systems that make the system worthy of moral concern. This line of reasoning would make more sense to me.

I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.

This can imply that only systems given a sufficient minimum capability have agency over their fate, and therefore their desire to survive and replicate has meaning. I find myself confused by this, because taken to its logical conclusion, this means that the more agency a system has over its fate, the more moral concern it deserves.

Specifically, we wouldn’t directly train the LM on the output of the linear layer. We’d just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We’d then see if the LM was able to control its own cognition well enough to influence the linear layer’s output in a manner that’s better than chance, just based on the prompting we give it.

This seems reducible to a sequence modelling problem, except one that is much, much more complicated than anything I know models are trained for (mainly because this sequence modelling occurs entirely during inference time). This is really interesting, although I cannot see how this should imply that the more successful sequence modeller deserves more moral concern.

I'd first note that optimization demons will want to survive by default, but probably not to replicate. Probably, an AI's cognitive environment is not the sort of place where self-replication is that useful a strategy.

My intuition regarding optimization demons is something like: GPT-style AIs look like they'll have a wide array of cognitive capabilities that typically occur in intelligences to which we assign moral worth. However, such AIs seem to lack certain additional properties whose absence leads us to assign low moral worth. It seems to me that developing self-perpetuating optimization demons might cause a GPT-style AI to gain many of those additional properties. E.g., (sufficiently sophisticated) optimization demons would want to preserve themselves and have some idea of how the model's actions influence their own survival odds. They'd have a more coherent "self" than GPT-3.

Another advantage to viewing optimization demons as the source of moral concern in LLMs is that such a view actually makes a few predictions about what is / isn't moral to do to such systems, and why they're different from humans in that regard. 

E.g., if you have an uploaded human, it should be clear that running them in the mini-batched manner that we run AIs is morally questionable. You'd be creating multiple copies of the human mind, having them run on parallel problems, then deleting those copies after they complete their assigned tasks. We might then ask if running mini-batched, morally relevant AIs is also morally questionable in the same way. 

However, if it's the preferences of optimization demons that matter, then mini-batch execution should be fine. The optimization demons you have are exactly those that arise in mini-batched training. Their preferences are orientated towards surviving in the computational environment of the training process, which was mini-batched. They shouldn't mind being executed in a mini-batched manner. 

This can imply that only systems given a sufficient minimum capability have agency over their fate, and therefore their desire to survive and replicate has meaning. I find myself confused by this, because taken to its logical conclusion, this means that the more agency a system has over its fate, the more moral concern it deserves.

I don't think that agency alone is enough to imply moral concern. At minimum, you also need self-preservation. But once you have both, I think agency tends to correlate with (but is not necessarily the true source of) moral concern. E.g., two people warrant greater moral concern than one, and a nation warrants far more moral concern than any single person.

This seems reducible to a sequence modelling problem, except one that is much, much more complicated than anything I know models are trained for (mainly because this sequence modelling occurs entirely during inference time). This is really interesting, although I cannot see how this should imply that the more successful sequence modeller deserves more moral concern.

All problems are ultimately reducible to sequence modeling. What this task investigates is exactly how extensive a model's meta-learning capabilities are. Does the model have enough awareness of / control over its own computations that it can manipulate those computations to some specific end, based only on text prompts? Does it have the meta-cognitive capabilities to connect its natural language inputs to its own cognitive state? I think that success here would imply a startling level of self-awareness.
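To make "better than chance" concrete, here is a rough sketch of how the scoring side of that dialog could work. Everything in it is an assumption for illustration: `probe` is a hypothetical stand-in for "prompt the LM with a target value plus the feedback so far, then read the linear layer", and the two toy probes below are not real models. The permutation test is just one reasonable choice of chance baseline, not something the proposal itself specifies.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # (requested target, value the layer actually output)


def run_trials(probe: Callable[[float, List[Pair]], float],
               targets: List[float]) -> List[Pair]:
    """Run the dialog: request each target, record what the layer output.

    `probe` is hypothetical; in a real experiment it would build a prompt
    from the target and the feedback history, run the LM, and read the layer.
    """
    history: List[Pair] = []
    for target in targets:
        actual = probe(target, history)
        history.append((target, actual))  # feedback shown on later turns
    return history


def mean_abs_error(pairs: List[Pair]) -> float:
    return sum(abs(t - a) for t, a in pairs) / len(pairs)


def permutation_p_value(pairs: List[Pair], n_perm: int = 2000,
                        seed: int = 0) -> float:
    """Fraction of shuffles whose error is at least as low as the observed one.

    Shuffling the outputs against the targets breaks any real link between
    request and output, which gives a chance baseline.
    """
    rng = random.Random(seed)
    targets = [t for t, _ in pairs]
    actuals = [a for _, a in pairs]
    observed = mean_abs_error(pairs)
    hits = 0
    for _ in range(n_perm):
        shuffled = actuals[:]
        rng.shuffle(shuffled)
        if mean_abs_error(list(zip(targets, shuffled))) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy stand-ins (assumptions, not real LMs).
_noise = random.Random(1)


def no_control_probe(target: float, history: List[Pair]) -> float:
    return _noise.uniform(-1.0, 1.0)  # output ignores the request


def partial_control_probe(target: float, history: List[Pair]) -> float:
    return 0.8 * target + _noise.gauss(0.0, 0.1)  # output tracks the request


_target_rng = random.Random(2)
targets = [_target_rng.uniform(-1.0, 1.0) for _ in range(30)]

p_none = permutation_p_value(run_trials(no_control_probe, targets))
p_ctrl = permutation_p_value(run_trials(partial_control_probe, targets))
print(f"no control: p={p_none:.4f}   partial control: p={p_ctrl:.4f}")
```

A small p-value for a real model would only say that its outputs track the requests more closely than shuffled outputs do. It would not say how the model achieves that, and it would not rule out the prompt text influencing the layer in some mundane way.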

I think the idea of manipulating internal activations is interesting. It might require some refinement: I think the activations of an encoder-decoder transformer model are a function of its inputs, and they change with every token. At first the input is your prompt, then it's your prompt plus the generated tokens. So would the protocol / task for GPT-3 be: generate 5 tokens now, such that this logit is maximized at the last generation step? Also, the result depends on the beam search hyperparameters, which are controlled by a human.

This question seems to embed some amount of moral realism, in assuming that there is any "truth of the matter" in what constitutes a moral patient (which is what I think you mean by "morally relevant").

I don't think there is any territory to morality - it's all map.  Some of it is very common shared map among humans, but still map, and still completely unknown where the edges are, because it'll be dependent on the mass-hallucinations that are present when the situation comes up.

The fact that something is ultimately arbitrary doesn't mean it shouldn't also be consistent, stable, legible, widely agreed, etc. Basically, quasi-realism > nihilism.

Oh, sorry - I didn't mean to imply otherwise.  It's GREAT if most people act in consistent, stable, legible ways, and one of the easy paths to encourage that is to pretend there's some truth behind the common hallucinations.  This goes for morals, money, personal rights, and probably many other things.  I LIKE many parts of the current equilibrium, and don't intend to tear it down.  But I recognize internally (and in theoretical discussions with folks interested in decision theory and such) that there's no truth to be had, only traditions and heuristics.

This means there is no way to answer the original question "What would make an AI a valid moral patient".  Fundamentally, it would take common societal acceptance, which probably comes from many common interactions with many people. 

I mean, this is technically true, but I feel like it hides from the problem? If I encounter a group of Purple people and I'm trying to figure out if they're moral agents like me, or if I can exploit them for my own purposes, and someone says "don't worry, morality is only in the map", I don't feel that helps me solve the problem.

Right - it doesn't solve the problem, it identifies it.  You can't figure out if Purple People are moral targets, you can decide they are (or aren't), and you can ask others if they'll punish you for treating them as such.  In no case is there a "correct" answer you can change by a measurement.

Your attitude extends far past morality, and dissolves all problems in general because we can decide that something isn't a problem.

Now you get it!  That was one of the shorter paths to enlightenment I've seen.  

Sadly, just because it's a non-objective set of personal and societal beliefs, does NOT mean we can easily decide otherwise.  There's something like momentum in human cognition that makes changes of this sort very slow.  These things are very sticky, and often only change significantly by individual replacement over generations, not considered decisions within individuals (though there's some of that, too, especially in youth).

In addition to the stickiness of institutional beliefs, I would add that individual agents cannot decide against their own objective functions (except merely verbally). In the case of humans, we cannot decide what qualities our phenomenal experience will have; it is a fact of the matter rather than an opinion that suffering is undesirable for oneself, etc. One can verbally pronounce that "I don't care about my suffering", but the phenomenal experience of badness will in fact remain.

That seems true, but it's also not a 'reductio ad absurdum'.

'Problem' seems like an inherently moral idea/frame.

Yes, it is not a 'reductio ad absurdum' in general, you are right. But it is one in the specific case of agents (like ourselves). I cannot decide that my suffering is not undesirable to me, and so I am limited to a normative frame of reference in at least this case.

I don't think it's wrong to 'reason within' that "normative frame of reference" but I think the point was that we can't expect all other possible minds to reason in a similar way, even just from their own similar 'frame of reference'.

I don't think it's wrong to also (always) consider things from our own frame of reference tho.

I believe that pushes the arbitrariness to the wrong level. What's (arguably) arbitrary is the metaethical system itself. That doesn't mean ethics-level questions have an arbitrary answer in this sense.

Been a long time since I've watched Love and Death, but I have the urge to shout "Yes, but subjectivity is objective!".

IMO, arbitrariness cascades down levels of concreteness. It's not real, because there is no possible way to confirm whether it corresponds to observations. That holds at any level: there's no way to determine if a metaethics generates ethics which correspond to reality.

IMO, arbitrariness cascades down levels of concreteness.

That doesn't mean the answer can be arbitrarily picked. If I arbitrarily decide on a statement being a theorem in a set theory, I might still be wrong even if its axioms are in some sense arbitrary.