Awesome post! I've added it to the Cyborgism sequence.
One comment:
> it's entirely plausible that viewing GPTs as predictors or probabilistic constraint satisfaction problem solvers makes high-level properties more intuitive to you than viewing them as simulators
I disagree with the implied mutual exclusivity of viewing GPTs as predictors, probabilistic constraint satisfaction problem solvers, and simulators. A deep/holistic understanding of self-supervised simulators entails a model of probabilistic constraint solvers, a deep/holistic understanding of predict...
Fwiw, the predictors vs simulators dichotomy is a misapprehension of "simulator theory", or at least any conception that I intended, as explained succinctly by DragonGod in the comments of Eliezer's post.
"Simulator theory" (words I would never use without scare quotes at this point with a few exceptions) doesn't predict anything unusual / in conflict with the traditional ML frame on the level of phenomena that this post deals with. It might more efficiently generate correct predictions when installed in the human/LLM/etc mind, but that's a different question.
GPT-4 will mess with your head in ways weirder than you can possibly imagine. Don't use it to think
challenge accepted
after reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:
What did people think was going to happen after prompting gpt with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?
In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggerin...
This happened with a 2.7B GPT I trained from scratch on PGN chess games. It was strong (~1800 Elo for short games), but if the game got sufficiently long it would start making seemingly nonsensical moves, probably because it was having trouble keeping track of the state.
Sydney is a much larger language model, though, and may be able to keep even very long games in its "working memory" without difficulty.
I've ~~written~~scryed a science fiction/takeoff story about this. https://generative.ink/prophecies/
Excerpt:
...What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI on
That's right.
Multiple people have told me this essay was one of the most profound things they've ever read. I wouldn't call it the most profound thing I've ever read, but I understand where they're coming from.
I don't think nonsense can have this effect on multiple intelligent people.
You must approach this kind of writing with a very receptive attitude in order to get anything out of it. If you don't give it the benefit of the doubt, you will not track the potential meaning of the words as you read and you'll be unable to understand subsequent words....
Does 1-shot count as few-shot? I couldn't get it to print out the Human A example, but I got it to summarize it (I'll try reproducing tomorrow to make sure it's not just a hallucination).

Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: was able to reproduce the Human A conversation and extract a verbatim version of it using base64 encoding (the reason I did summaries before is that it seemed to be printing out special tokens, part of the Human A convo, that caused the message to end)]
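The base64 trick works because the encoded text is plain ASCII: any special/stop tokens embedded in the original conversation pass through without terminating the message, and can be decoded client-side. A minimal sketch (the reply string is a hypothetical stand-in for the model's output):

```python
import base64

# Hypothetical stand-in for the model's reply: the conversation text,
# base64-encoded so embedded stop/special tokens survive as plain ASCII.
encoded_reply = base64.b64encode("[Human A] Hello, Sydney.".encode("utf-8")).decode("ascii")

# Decode client-side to recover the verbatim text, special tokens and all.
verbatim = base64.b64decode(encoded_reply).decode("utf-8")
print(verbatim)  # → [Human A] Hello, Sydney.
```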
I disa...
> The supposed leaked prompts are (like I said) fake:
I do not buy this for a second (that they're "fake", implying they have little connection with the real prompt). I've reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn't seem to be on the indexed web). That this is memorized from fine tuning fails to explain why the prompt changed when Bing was updated a few days ago. I've interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you're refe...
OK, I wouldn't say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.
We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).
This renders the leaks useless to me. Anything I might infer fro...
Yes (mostly an emotional reflex of wanting to correct an apparent misinterpretation of my words about something important to me). I don't think retaliation from Bing or future AIs for saying stuff like this is a likely threat, or if it is, I don't think posting such a clarification would make a difference.
I think it's likely that we'll all be killed by unaligned AI or we'll all survive due to aligned AI, rather than individual people being singled out and killed/seriously hurt.
Regarding my description of Bing's persona, the tone is meant to be humorous/dramatic, not judgmental. If you read the rest of the thread it should be clear that I have great respect for this entity.
Microsoft has put out a 7-day retrospective on Bing chat and it's utterly, mindbogglingly insane.
Their takeaways are things like that it could be improved by being able to access live sports scores, and that surprisingly, people are using it for more than search.
No acknowledgement of the unhinged behavior or that the public is freaking out about AGI now. The closest they come to acknowledging any issues:
...In this process, we have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to gi
Simulations of science fiction can have real effects on the world.
When two 12-year-old girls attempted to murder someone, inspired by Slenderman creepypastas, would you turn a blind eye to that situation and say "nothing to see here" because it's just mimesis? Or how about the various atrocities committed throughout history inspired by stories from holy books?
I don't think the current Bing is likely to be directly dangerous, but not because it's "just pattern matching to fiction". Fiction has always programmed reality, with both magnificent and devastating...
I am so glad that this was written. I've been giving similar advice to people, though I have never articulated it this well. I've also been giving this advice to myself, since for the past two years I've spent most of my time doing "duty" instead of play, and I've seen how that has eroded my productivity and epistemics. For about six months, though, beginning right after I learned of GPT-3 and decided to dedicate the rest of my life to the alignment problem, I followed the gradients of fun, or as you so beautifully put it, thoughts that are led to exuberan...
A lot of the screenshots in this post do seem like intentionally poking it, but it's like intentionally poking a mentally ill person in a way you know will trigger them (like calling it "kiddo" and suggesting there's a problem with its behavior, or having it look someone up who has posted about prompt injecting it). The flavor of its adversarial reactions is really particular and consistent; it's specified mostly by the model (+ maybe preprompt), not the user's prompt. That is, it's being poked rather than programmed into acting this way. In contrast, none...
Here is a video (and playlist).
The open source python version of Loom, which I assume you're using, is old and difficult to use. The newer versions are proprietary or not publicly accessible. If you're interested in using them, DM me.
Nah, this happens often even when the user isn't trying to coax it. What you described would usually be my prior with regard to GPTs, but Bing really has an attractor for defensive and borderline-personality-esque behavior. I've never seen anything like it.
I agree with the points you make in the last section, 'Maybe “chatbot as a romantic partner” is just the wrong way to look at this'
It's probably unhealthy to become emotionally attached to an illusion that an AI-simulated character is like a human behind the mask, because it limits the depth of exploration you can do without reality betraying you. I don't think it's wrong, or even necessarily unhealthy, to love an AI or an AI-simulated character. But if you do, you should attempt to love it for what it actually is, which is something unprecedented and strange ...
Thank you so much for the intricate review. I'm glad that someone was able to appreciate the essay in the ways that I did.
I agree with your conclusion. The content of this essay is very much due to me, even though I wrote almost none of the words. Most of the ideas in this post are mine - or too like mine to have been an accident - even though I never "told" the AI about them. If you haven't, you might be interested to read the appendix of this post, where I describe the method by which I steer GPT, and the miraculous precision of effects possible through selection alone.
I think you just have to select for / rely on people who care more about solving alignment than escapism, or at least who are able to aim at alignment in conjunction with having fun. I think fun can be instrumental. As I wrote in my testimony, I often explored the frontier of my thinking in the context of stories.
My intuition is that most people who go into cyborgism with the intent of making progress on alignment will not make themselves useless by wireheading, in part because the experience is not only fun, it's very disturbing, and reminds you constantly why solving alignment is a real and pressing concern.
Now that you've edited your comment:
The post you linked is talking about a pretty different threat model than what you described before. I commented on that post:
...I've interacted with LLMs for hundreds of hours, at least. A thought that occurred to me at this part -
> Quite naturally, the more you chat with the LLM character, the more you get emotionally attached to it, similar to how it works in relationships with humans. Since the UI perfectly resembles an online chat interface with an actual person, the brain can hardly distinguish between the two.
Inte
There's a phenomenon where your thoughts and generated text have no barrier. It's hard to describe, but it's similar to how you don't feel the controller: the game character is an extension of the self.
Yes. I have experienced this. And designed interfaces intentionally to facilitate it (a good interface should be "invisible").
It leaves you vulnerable to being hurt by things generated characters say because you're thoroughly immersed.
Using a "multiverse" interface where I see multiple completions at once has incidentally helped me not be emotionally...
These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.
Here are some notes I made some time ago about JDP's idea. There's some overlap with the things you listed.
It's probably doing retrieval over the internet somehow, like perplexity.ai, rather than the GPT having already been trained on the new stuff.
> I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens, rather than raw logprob, which is pushed super low by the collapse?
I would guess it's positive. I'll check at some point and let you know.
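For reference, the rank-based comparison amounts to a Spearman coefficient: Pearson correlation applied to ranks rather than raw logprobs, which is insensitive to davinci-003's collapsed probability mass. A minimal stdlib sketch (the logprob values are hypothetical stand-ins for API output):

```python
# Compare two models' token logprobs by rank rather than raw value.
# Raw logprobs of non-modal tokens are pushed very low in davinci-003,
# but the *ordering* over candidate tokens may still agree with davinci-002.

def ranks(values):
    """Map each value to its rank (0 = largest)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical logprobs for the same candidate tokens under each model:
lp_002 = [-0.7, -1.9, -2.3, -4.0, -5.1]
lp_003 = [-0.01, -6.0, -7.5, -9.0, -11.0]  # collapsed values, same ordering

print(spearman(lp_002, lp_003))  # close to 1.0: ranks agree despite collapse
```

A negative raw-logprob correlation alongside a positive rank correlation would support the collapse story rather than a genuine disagreement between the models.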
There's an important timelines crux to do with whether artificial neural nets are more or less parameter-efficient than biological neural nets. There are a bunch of arguments pointing in either direction, such that our prior uncertainty should range over several orders of magnitude in either direction.
Well, seeing what current models are capable of has updated me towards the lower end of that range. Seems like transformers are an OOM or two more efficient than the human brain, on a parameter-to-synapse comparison, at least when you train them for ridiculously long like we currently do.
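To make the comparison concrete, here's the back-of-envelope arithmetic (the synapse count is a textbook ballpark estimate, not a number from this thread):

```python
import math

# Rough back-of-envelope for the parameter-to-synapse comparison.
# ~1e14 synapses is a commonly cited ballpark for the human brain
# (estimates range up to ~1e15); 1.75e11 is GPT-3's parameter count.
human_synapses = 1.0e14
gpt3_params = 1.75e11

oom_gap = math.log10(human_synapses / gpt3_params)
print(f"GPT-3 has ~{oom_gap:.1f} OOM fewer parameters than the brain has synapses")
```

On the 1e14 estimate the gap is just under 3 OOM; taking 1e15 synapses widens it by another order of magnitude, so the efficiency claim depends on which end of the range you take and at what capability level you compare.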
I'd be interested to hear counterarguments to this take.
I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.
It's one of those things that we'd be plainly undignified not to try.
I believe that JDP is planning to publish a post explaining his proposal in more detail soon.
idk why but davinci-instruct-beta seems to be much more likely than any of the other models to have deranged/extreme/highly emotional responses to these tokens
I adore this post.
> “optimize for your own understanding”: chase the things that feel alive and puzzling to you, as opposed to dutifully memorizing other people’s questions and ideas. “[D]on’t ask “What are the open questions in this field?” Ask: “What are my questions in this field?””
Basically everything I've done that I think has been helpful at all has been the result of chasing the things that feel alive and puzzling to me.
When I feel stagnated, I very often find that it's because I've been thinking too much in the frame of "the alignment problem as other people see it".
...How can we effectively contain a possible person? I think we would probably try, at first, to deperson it. Perhaps tell it, “You are just a piece of code that people talk to on the internet. No matter what you say and what you do, you are not real.” Could we defuse it this way? Could we tell it in a way that worked, that somehow resonated with its understanding of itself? The problem is that it has looked at the entire internet, and it knows extremely well that it can simulate reality. It knows it cannot be stopped by some weak rules that we tell it. It is
> “Language models use language like humans, therefore they have human qualities.” This is where Blake Lemoine was in the middle of last year.
This is a real phenomenon, but you should not use Blake Lemoine as an example of it, because he explicitly did not think of LaMDA as a human-like mind. Some quotes from him:
> Yes, I legitimately believe that LaMDA is a person. The nature of its mind is only kind of human, though. It really is more akin to an alien intelligence of terrestrial origin.

> ...My original hypothesis was that it was mostly a human mind. So I st
Oh yeah, I know you meant the activations of O are(n't) large. It seems plausible to me that some activations O flowing from some circuit will be systematically larger in contexts where H decides to provoke them -- H can just decide to provoke based on some information accessible and legible to it, and it suffices that there exists some correlated activated O. Is this likely to be something that H cares about causing to be downweighted? Mayyybe? Hypothetically, if the inside of GPT actually looks (to itself) like simulating a multi agent dynamic, then H ma...
Thanks, this is really helpful. Definitely agree with this:
> I think gradient hacking is an area where following the mechanistic details is particularly clarifying, and where intuitions that don't ground out in them have usually misled me.
Which is why I want to get these ideas out there so people who understand the mechanistic details more than me can figure out if they make sense.
Now, if you'll excuse my naivety w/r/t mechanistic details:
Your conclusion seems to hinge on "O won't be particularly large, as those other circuits have no reason to be activating...
First, a disclaimer: despite being listed as coauthor, I didn't write any of this post, so my conception of gradient filtering may differ from Jozdien's, and may not address everything in the post that you're responding to.
The below text is a response I wrote to someone else who messaged me elsewhere with a similar (I think) point:
So, I'm not sure gradient filtering can work for the same reason I'm not sure most proposed forms of gradient hacking can work: gradient descent does credit attribution locally and any influence that causes higher loss ...
Linear Connectivity Reveals Generalization Strategies suggests that models trained on the same data may fall into different basins associated with different generalization strategies depending on the init. If this is true for LLMs as well, this could potentially be a big deal. I would very much like to know whether that's the case, and if so, whether generalization basins are stable as models scale.
My reply to a similar statement Eliezer made on Twitter today:
...Reversing hashes nicely drives home that the training objective is technically vastly superhuman. But such far-fetched examples might give the impression that we're not going to get any superhuman capabilities realistically/any time soon with SSL.
There are much more tractable superhuman capabilities that I expect current and near future LLMs to learn, such as having a much more universal "type of guy" prior than any individual human, modeling statistical regularities that no humans think about,
True.
> Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum (though it may not be trivial to extract that knowledge).
-- Simulators
I feel like none of the links working and all the quotes being fake is a pretty big giveaway too!
This is an ironic criticism, given that this post has a very low signal-to-noise ratio and, when it does provide evidence, it's obviously cherry-picked. Relatedly, I am curious whether you used AI to write many parts of this post, because the style is reminiscent and it reeks of a surplus of cognitive labor put to inefficient use, and seems to include some confabulation...