Wow. When using GPT-4, I've had a distinct sense of "I bet this is what it would have felt like to use one of the earliest computers". Until this post I didn't realize how literal that sense might be.
This is a really cool and apt analogy - computers and LLM scaffolding really do seem like the same abstraction. Thinking this way seems illuminating as to where we might be heading.
I always assumed people were using "jailbreak" in the computer sense (e.g. jailbreak your phone/ps4/whatever), not in the "escape from prison" sense.
Jailbreak (computer science), a jargon expression for (the act of) overcoming limitations in a computer system or device that were deliberately placed there for security, administrative, or marketing reasons
I think the definition above is a perfect fit for what people are doing with ChatGPT.
I am going to go ahead and say that if males die five times as often from suicide, that seems more important than the number of attempts. It is kind of stunning, or at least it should be, to have five boys die for every girl that dies, and for newspapers and experts to make it sound like girls have it worse here.
I think the strength of your objection here depends on which of two possible underlying models is at play:
(1) Suicide attempts are a good proxy for genuinely wanting to die, and girls end up in that state more often but survive more because the methods they use are less lethal.
(2) Many attempts, disproportionately those by girls, don't reflect a settled intention to die, so deaths are the better indicator of who actually reaches that state.
If (1) is the case, then I think it is at least arguable that girls have it worse here, since they end up in the mental state of "definitely wanting to die" more often than boys, and that sucks. That said, it's still true that they're not actually dying as much, and so I think it's still kinda disingenuous to frame it the way the newspaper and experts have here.
If (2) is the case, then that means that boys are ending up in the "definitely wanting to die" state much more often than girls, in which case I'd agree that it's very wrong to say that girls have it worse.
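To put rough numbers on it (only the 5:1 death ratio comes from the statistics being discussed; the attempt ratio is made up for illustration): if girls attempt about 3 times as often while boys die 5 times as often, then an individual attempt by a boy is roughly 5 × 3 = 15 times as likely to be fatal, and models (1) and (2) are really two different stories about whether that gap reflects method lethality or underlying intent.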
If you're getting comments like that from friends and family, it's possible that you haven't been epistemically transparent with them? E.g. do you think your friends who made those comments would be able to say why you believe what you do? Do you tell them about your research process and what kinds of evidence you look for, or do you just make contrarian factual assertions?
There's a big difference between telling someone "the WHO is wrong about salt, their recommendations are potentially deadly" versus "I've read a bunch of studies on salt, and from what I've found, the WHO's recommendations don't seem to agree with the latest research. Their recs are based on [studies x,y] and say to do [a], but [other newer/better studies] indicate [b]."
Cut to a few decades later, and most people think that the way it's been done for about two or three generations is the way it's always been done (it isn't)
As possibly one of those people myself, can you give a few examples of what specifically is being done differently now? Are you talking about things like using lots of Adderall?
I'm also morbidly curious what the model would do in <|bad|> mode.
I'm guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
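For concreteness, here's roughly the conditional-training setup I'm picturing (a minimal sketch; the threshold, reward values, and names are placeholders I made up rather than details from the post):

```python
# A minimal sketch of how I picture the <|good|>/<|bad|> conditional training
# setup (reward values, threshold, and names here are my own placeholders,
# not taken from the post).

THRESHOLD = 0.0  # hypothetical cutoff separating "good" from "bad" text

def tag_example(text: str, reward: float) -> str:
    """Prepend a control token chosen by whether the text clears the threshold."""
    token = "<|good|>" if reward >= THRESHOLD else "<|bad|>"
    return token + text

# Training: the model sees both kinds of sequences, so it can learn how
# <|good|> text differs from <|bad|> text instead of never seeing the latter.
corpus = [
    ("Sure, here's a helpful, harmless answer...", 0.9),
    ("Some toxic or otherwise undesirable text...", -0.7),
]
training_sequences = [tag_example(text, reward) for text, reward in corpus]

# Inference: condition on <|good|> so generation comes from the "good" slice
# of the learned distribution (and, per the curiosity above, one could in
# principle prompt with <|bad|> instead).
prompt = "<|good|>" + "User question goes here"
```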
It seems plausible to me that within the next few years we will have:
- LLMs that can hold a long, personalized conversation in a consistent persona
- real-time, natural-sounding voice synthesis
- real-time, photorealistic video generation of a consistent character
And with these things, you'd have access to a personalized virtual partner who you can video chat, phone call, or text with.
It does seem like AI dating will start to become a big thing in the near future. And I'm also not sure how to feel about that.
I think the point of this post is more "how do we get the AI to do what we want it to do", and less "what should we want the AI to do".
That is, there's value in trying to figure out how to align an LLM to any goal, regardless of whether a "better" goal exists. And the technique in the post doesn't depend on what target you have for the LLM: maybe someone wants to design an LLM to only answer questions about explosives, in which case they could still use the techniques described in the post to do that.
Well, really every second that you remain alive is a little bit of Bayesian evidence for quantum immortality: the likelihood of death during that second according to quantum immortality is ~0, whereas the likelihood of death if quantum immortality is false is >0. So there is a skewed likelihood ratio in favor of quantum immortality each time you survive one extra second (though of course the Bayesian update is very small until you get pretty old, because both hypotheses assign very low probability to death when young).
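Spelling out the update, with $p$ standing for the per-second probability of death if quantum immortality is false:

$$\frac{P(\text{QI} \mid \text{survive})}{P(\neg\text{QI} \mid \text{survive})} = \frac{P(\text{survive} \mid \text{QI})}{P(\text{survive} \mid \neg\text{QI})} \cdot \frac{P(\text{QI})}{P(\neg\text{QI})} \approx \frac{1}{1-p} \cdot \frac{P(\text{QI})}{P(\neg\text{QI})}$$

For a young, healthy person, $p$ is tiny (well under one in a billion per second), so $\frac{1}{1-p}$ is barely above 1 and each second of survival shifts the odds by almost nothing; only once $p$ becomes appreciable does continued survival carry real evidential weight.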
That's great. "The king can't fetch the coffee if he's dead"