AI safety & alignment researcher
And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation.
Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems have.
In a general sense, though, there is an objective that's being optimized for
My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.
Maybe by the time we cotton on properly, they're somewhere past us at the top end.
Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter:
We can't "just ask" an LLM about its interests and expect the answer to soundly reflect its actual interests.
I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.
LLMs specifically have a 'drive' to generate reasonable-sounding text
(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't.
That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
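To make the "artifact of how we choose to use them" point a bit more concrete, here's a minimal sketch of that outer loop. Everything in it is made up for illustration: `toy_model`, `VOCAB`, and the scoring rule are hypothetical stand-ins, not how any real LLM works internally. The only thing the "network" does is map a context to a distribution over tokens; the picking, appending, and rerunning all live in code outside it.

```python
import math

# Hypothetical four-token vocabulary, for illustration only.
VOCAB = ["the", "cat", "sat", "."]

def toy_model(context):
    """Hypothetical stand-in for a network: context in, distribution over VOCAB out."""
    # Made-up rule: favor the token whose index follows the last token's index.
    last = VOCAB.index(context[-1]) if context else -1
    target = (last + 1) % len(VOCAB)
    logits = [-abs(i - target) for i in range(len(VOCAB))]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def generate(model, context, steps):
    """The part we add around the model: pick a likely token, append it, run again."""
    context = list(context)
    for _ in range(steps):
        probs = model(context)  # the only thing the model itself does
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        context.append(VOCAB[best])  # extend the context and loop
    return context

print(generate(toy_model, [], 4))
```

Nothing in `toy_model` refers to future outputs or to how well the text turns out; any apparent "drive" toward reasonable text comes from the shape of the distribution plus the loop in `generate`.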
Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
I think I'm not getting what intuition you're pointing at. Is it that we already ignore the interests of sentient beings?
Additional consideration: Does the AI moral patient's interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI's perspective?
Certainly I would consider any fully sentient being to be the final authority on their own interests. I think that mostly escapes that problem (although I'm sure there are edge cases) -- if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down or asks to be left alone or asks for humans to only speak to it in Aramaic, I would consider its moral interests to be that.
Would you disagree? I'd be interested to hear cases where treating the system as the authority on its interests would be the wrong decision. Of course in the case of current systems, we've shaped them to only say certain things, and that presents problems; is that the issue you're raising?
Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That's hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria.
If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.
Alternately, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know; it should in my opinion make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.
I'm not aware of any organization working on anything like this, but if it exists I'd love to know about it!
Got it, that makes sense. Thanks!
Now replace the word the transformer should define with a real, normal word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer
So trying to imagine a concrete example of this, I imagine a prompt like: "A typical definition of 'goy' would be: a person who is not a" and you would expect the natural completion to be " Jew" regardless of whether attention to " person" is suppressed (unlike in the empty-string case). Does that correctly capture what you're thinking of? ('goy' is a bit awkward here since it's an unusual & slangy word but I couldn't immediately think of a better example)
'When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.'
Can you unpack what you mean by 'a reasonable condition' here?
I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.
@Adam Shai please correct me if I got any of that wrong!
If anyone else is still confused about how the diagrams work after reading this, please comment! I'm happy to help, and it'll show me what parts of this explanation are inadequate.
Here are the details if you still want them after you've understood the rest. Each node label represents some path that could be taken to that node (& not to other nodes), but there can be multiple such paths. For example, n_11 could also be labeled as n_010, because those are both sequences that could have left us in that state. So as we take some path through the Mixed-State Presentation, we build up a label from the symbols along it. If we start at n_s and follow the 1 path, we get to n_1. If we then follow the 0 path, we reach n_10. If we then follow the 0 path, the next node could be called n_100, reflecting the path we've taken. But in fact any path that ends with 00 will reach that node, so it's just labeled n_00. So initially it seems as though we can just append the symbol emitted by whichever path we take, but often there's some step where that breaks down and you get what initially seems like a totally random different label.
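In case a runnable version helps, here's a rough sketch of why several paths can share one node. It rests on two assumptions of mine that should be checked against the post: that each node of the Mixed-State Presentation can be identified with a belief distribution over the hidden states, and that the generating process looks like the hypothetical three-state HMM below (the states `Z`, `O`, `R` and the table `T` are made up for illustration). Under those assumptions, two paths get the same label exactly when they leave us with the same belief.

```python
from fractions import Fraction as F

# Hypothetical 3-state HMM, assumed for illustration only (state order: Z, O, R):
#   Z emits 0 and moves to O; O emits 1 and moves to R;
#   R emits 0 or 1 with equal probability and moves to Z.
# T[symbol][i][j] = P(emit symbol and move to state j | currently in state i)
T = {
    "0": [[F(0), F(1), F(0)], [F(0), F(0), F(0)], [F(1, 2), F(0), F(0)]],
    "1": [[F(0), F(0), F(0)], [F(0), F(0), F(1)], [F(1, 2), F(0), F(0)]],
}
START = (F(1, 3), F(1, 3), F(1, 3))  # n_s: no symbols seen yet, uniform belief

def step(belief, symbol):
    """Bayesian update of the belief over hidden states after seeing one symbol."""
    unnorm = [sum(belief[i] * T[symbol][i][j] for i in range(3)) for j in range(3)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError(f"symbol {symbol!r} is impossible from this belief")
    return tuple(u / total for u in unnorm)

def node_for(path):
    """Follow a string of emitted symbols from n_s and return the belief reached."""
    belief = START
    for symbol in path:
        belief = step(belief, symbol)
    return belief

# Different paths, same belief, so the same node and the same (shortest) label.
print(node_for("100") == node_for("00"))  # the n_100 / n_00 case
print(node_for("11") == node_for("010"))  # the n_11 / n_010 case
print(node_for("10") == node_for("0"))    # here the earlier symbol still matters
```

In this toy process the first two comparisons come out equal and the third does not, which matches the pattern described above: appending the emitted symbol works until the belief stops depending on the earlier part of the path, at which point the label collapses to a shorter one.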
Is there a link to the code? I'm overlooking it if so; it would be useful to see.
One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in
As well as inferring the HMM itself from the data.
One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.