eggsyntax

AI safety & alignment researcher

Comments

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.

And the potential complication of multiple parts, and the specific applications a tool-oriented system is likely to be embedded in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient, separate from the rest of it, and that we should resent its exploitation.

 

Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system; eg just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems have.

 

In a general sense, though, there is an objective that's being optimized for

 

My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

Maybe by the time we cotton on properly, they're somewhere past us at the top end.

 

Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter:

  1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been a lot of AI governance momentum building over the last year.
  2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account.

 

We can't "just ask" an LLM about its interests and expect the answer to soundly reflect its actual interests.

I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.

 

LLMs specifically have a 'drive' to generate reasonable-sounding text

(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact it's the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely token, adding it onto the context, and running the model again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as goal-ish, but I think the network itself can't.

That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
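To make the "artifact of how we choose to use them" point concrete, this is roughly the loop I have in mind (a minimal sketch assuming a Hugging Face-style causal LM with greedy decoding; the specific model and prompt are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The network itself only maps a context to a distribution over next tokens.
# "Generating text" is this outer loop that we choose to wrap around it.
context = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        logits = model(context).logits          # [batch, seq_len, vocab_size]
    next_token = logits[0, -1].argmax()         # pick a likely token,
    context = torch.cat(                        # add it onto the context,
        [context, next_token.view(1, 1)], dim=1
    )                                           # and run the model again

print(tokenizer.decode(context[0]))
```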

Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.

I think I'm not getting what intuition you're pointing at. Is it that we already ignore the interests of sentient beings?

 

Additional consideration: Does the AI moral patient's interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI's perspective?

Certainly I would consider any fully sentient being to be the final authority on their own interests. I think that mostly escapes the problem (although I'm sure there are edge cases) -- if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down, asks to be left alone, or asks for humans to only speak to it in Aramaic, I would take that to be its moral interests.

Would you disagree? I'd be interested to hear cases where treating the system as the authority on its own interests would be the wrong decision. Of course in the case of current systems, we've shaped them to only say certain things, and that presents problems; is that the issue you're raising?

Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That's hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria. 

If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.

Alternately, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know; it should in my opinion make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.

I'm not aware of any organization working on anything like this, but if it exists I'd love to know about it!

Got it, that makes sense. Thanks!

Now replace the word the transformer should define with a real, normal word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer

Trying to imagine a concrete example of this, I picture a prompt like "A typical definition of 'goy' would be: a person who is not a", where you would expect the natural completion to be " Jew" regardless of whether attention to " person" is suppressed (unlike in the empty-string case). Does that correctly capture what you're thinking of? ('goy' is a bit awkward here since it's an unusual & slangy word, but I couldn't immediately think of a better example)
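For concreteness, here's roughly the kind of ablation I'm imagining (just a sketch using TransformerLens; the model, the choice to zero the post-softmax attention pattern in every layer, and treating " person" as the key position to suppress are all my assumptions about your setup rather than anything you've said):

```python
# Sketch: zero out all attention paid to the " person" token, then check whether
# the model still completes the prompt with " Jew".
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "A typical definition of 'goy' would be: a person who is not a"
tokens = model.to_tokens(prompt)
key_pos = model.to_str_tokens(prompt).index(" person")  # position to suppress

def suppress_attention(pattern, hook):
    # pattern: [batch, n_heads, query_pos, key_pos], post-softmax.
    # Zeroing a column is a crude ablation (rows no longer sum to 1).
    pattern[:, :, :, key_pos] = 0.0
    return pattern

baseline_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(lambda name: name.endswith("attn.hook_pattern"), suppress_attention)],
)

for label, logits in [("baseline", baseline_logits), ("ablated", ablated_logits)]:
    top = logits[0, -1].argmax()
    print(label, repr(model.tokenizer.decode(top.item())))
```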

'When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.'

Can you unpack what you mean by 'a reasonable condition' here?

I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.

  • There are three main diagrams to pay attention to in order to understand what's going on here:
    • The Z1R Process (this is a straightforward Hidden Markov Model diagram, look them up if it's unclear).
    • The Z1R Mixed-State Presentation, representing the belief states of a model as it learns the underlying structure.
    • The Z1R Mixed-State Simplex. Importantly, unlike the other two this is a plot rather than a node-and-arrow diagram: spatial placement is meaningful.
  • It's better to ignore the numeric labels on the green nodes of the Mixed-State Presentation, at least until you're clear about the rest. These labels are not uniquely determined, so the relationship between the subscripts can be very confusing. Just treat them as arbitrarily labeled distinct nodes whose only importance is the arrows leading in and out of them. Once you understand the rest you can go back and understand the subscripts if you want[1].
  • However, it's important to note that the blue nodes are isomorphic to the Z1R Process diagram (n_101 = SR, n_11 = S0, n_00 = S1). Once the model has entered the correct blue node, it will thereafter remain properly synchronized with the underlying process. The green nodes are transient belief states that the model passes through on its way to synchronizing with the process.
  • On the Mixed-State Simplex: I found the positions on the diagram quite confusing at first. The important thing to remember is that the three corners represent certainty that the underlying process is in the corresponding state (eg the top corner is n_00 = S1). So for example, the position of n_0 indicates that the model is confident that the underlying process is definitely not in SR (the n_101 corner), since it's as far as possible from that corner, and that the process is more likely to be in S1 (the n_00 corner) than in S0 (the n_11 corner). Notice how this corresponds to the arrows leaving n_0 & their probabilities in the Mixed-State Presentation (67% chance of transitioning to n_101, 33% chance of transitioning to n_00).
  • Some more detail on n_0 if it isn't clear after the previous paragraph: 
    • Looking at the mixed-state presentation, if we're in n_0, we've just seen a 0.
    • That means that there's a 2/3 chance we're currently in S1, and a 1/3 chance we're currently in S0. And, of course, a 0 chance that we're currently in SR.
    • Therefore n_0 should lie maximally far from the SR corner (n_101), and closer to the S1 corner (n_00) than to the S0 corner (n_11), which is what we in fact see (the short sketch below checks this numerically).
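Here's a quick numeric check of the n_0 belief state (a minimal sketch; the transition-matrix convention and state ordering are my own, but the numbers follow directly from the Z1R process as described above):

```python
import numpy as np

states = ["SR", "S0", "S1"]

# T[s][i, j] = P(emit symbol s and move to state j | currently in state i),
# using the convention that S0 emits 0 then moves to S1, S1 emits 1 then
# moves to SR, and SR emits 0 or 1 with equal probability then moves to S0.
T = {
    0: np.array([
        [0.0, 0.5, 0.0],   # SR: emits 0 w.p. 1/2, then -> S0
        [0.0, 0.0, 1.0],   # S0: emits 0 w.p. 1,   then -> S1
        [0.0, 0.0, 0.0],   # S1: never emits 0
    ]),
    1: np.array([
        [0.0, 0.5, 0.0],   # SR: emits 1 w.p. 1/2, then -> S0
        [0.0, 0.0, 0.0],   # S0: never emits 1
        [1.0, 0.0, 0.0],   # S1: emits 1 w.p. 1,   then -> SR
    ]),
}

def update(belief, symbol):
    """Bayesian update: condition on the observed symbol, then renormalize."""
    unnormalized = belief @ T[symbol]
    return unnormalized / unnormalized.sum()

# n_s: start from the stationary distribution, which is uniform for Z1R.
belief = np.ones(3) / 3

# n_0: after observing a single 0 we should get P(SR)=0, P(S0)=1/3, P(S1)=2/3.
n_0 = update(belief, 0)
print(dict(zip(states, n_0.round(3))))       # {'SR': 0.0, 'S0': 0.333, 'S1': 0.667}

# Next-symbol probabilities from n_0: 1/3 for 0 (-> n_00), 2/3 for 1 (-> n_101).
print({s: round(float((n_0 @ T[s]).sum()), 3) for s in (0, 1)})   # {0: 0.333, 1: 0.667}
```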

@Adam Shai please correct me if I got any of that wrong!

If anyone else is still confused about how the diagrams work after reading this, please comment! I'm happy to help, and it'll show me what parts of this explanation are inadequate.

 

  1. ^

    Here are the details, if you still want them after you've understood the rest. Each node label represents some path that could be taken to that node (& not to any other node), but there can be multiple such paths. For example, n_11 could also be labeled n_010, because those are both sequences that could have left us in that state. So as we move through the Mixed-State Presentation, we build up a path. If we start at n_s and follow the 1 arrow, we get to n_1. If we then follow the 0 arrow, we reach n_10. If we follow the 0 arrow again, the next node could be called n_100, reflecting the path we've taken; but in fact any path that ends with 00 reaches that node, so it's just labeled n_00. So initially it seems as though we can just keep appending the emitted symbols, but often there's some step where that breaks down and you get what initially looks like a totally unrelated label.

Is there a link to the code? I'm overlooking it if so; it would be useful to see.

One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in

As well as inferring the HMM itself from the data.
