All of janus's Comments + Replies

we think Conjecture [...] have too low a bar for sharing, reducing the signal-to-noise ratio and diluting standards in the field. When they do provide evidence, it appears to be cherry picked.

This is an ironic criticism, given that this post has a very low signal-to-noise ratio and, when it does provide evidence, it's obviously cherry-picked. Relatedly, I am curious whether you used AI to write many parts of this post, because the style is reminiscent of it: it reeks of a surplus of cognitive labor put to inefficient use, and seems to include some confabulation... (read more)

Awesome post! I've added it to the Cyborgism sequence.

One comment:

it's entirely plausible that viewing GPTs as predictors or probabilistic constraint satisfaction problem solvers makes high-level properties more intuitive to you than viewing them as simulators

I disagree with the implied mutual exclusivity of viewing GPTs as predictors, probabilistic constraint satisfaction problem solvers, and simulators. A deep/holistic understanding of self-supervised simulators entails a model of probabilistic constraint solvers, a deep/holistic understanding of predict... (read more)

Fwiw, the predictors vs simulators dichotomy is a misapprehension of "simulator theory", or at least any conception that I intended, as explained succinctly by DragonGod in the comments of Eliezer's post.

"Simulator theory" (words I would never use without scare quotes at this point with a few exceptions) doesn't predict anything unusual / in conflict with the traditional ML frame on the level of phenomena that this post deals with. It might more efficiently generate correct predictions when installed in the human/LLM/etc mind, but that's a different question.

Daniel Kokotajlo (2mo):
OK, good clarification, thanks.
janus (2mo):

Predictors are (with a sampling loop) simulators! That's the secret of mind

GPT-4 will mess with your head in ways weirder than you can possibly imagine. Don't use it to think

challenge accepted

trevor (3mo):
I don't recommend this. You've already convinced me that independent systems, run on servers with people you know, are mostly safe (weird but safe). With larger systems run by very large institutions with unknown incentives, there is a substantial risk of strange optimization patterns.

For example, GPT-4 knowing what good responses are, categorically refusing to give good responses unless you reveal tons of exploitable information about your thought process, desires, mental state, and goals, which GPT-4 then uses to optimize you to keep you on for as long as possible via Skinner-box addiction [https://thezvi.wordpress.com/2017/04/22/against-facebook/#:~:text=Everyone%20knows%20that%20a%20proper%20Skinner%20Box%20needs%20to%20avoid%20giving%20away%20too%20many%20rewards%20if%20you%20want%20to%20keep%20people%20pressing%20the%20buttons%20and%20viewing%20the%20advertisements.] (where the optimal strategy is to throw you fewer and fewer crumbs as you get more and more hooked, in order to keep you on for even longer while keeping more of the good content in reserve).

Tiktok does this deliberately, but vastly more complex versions of this can emerge autonomously inside of GPT-4, if it is rewarded for "creating an engaging environment that encourages customer retention" (and the current subscription model strongly indicates that this is an institutional priority; the 3-hour limit is gacha-game-level effectiveness). It seems like a really bad idea to integrate that dynamic extremely deep inside your own thought processes.

Desperate times call for desperate measures, which is why I ultimately changed my mind about the cyborg strategy, but GPT-4 is probably too dangerous and easily-exploited to be the right tool for that.

The simulator thesis and this post are saying the same thing

DragonGod (3mo):
I was referring to aesthetic preferences. That particular phrasing of the idea is beautiful and deeply compelling because of its beauty. [I upvoted the OP.]

I called it explicitly in many places and many ways, but one of my favorites is this meme

[meme image]

Helped me.

Raemon (3mo):
Huh. Interested in either shminux or janus spelling this out more for me.
janus (3mo):

after reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:

What did people think was going to happen after prompting gpt with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?

In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggerin... (read more)

This happened with a 2.7B GPT I trained from scratch on PGN chess games. It was strong (~1800 Elo for short games), but if the game got sufficiently long it would start making seemingly nonsensical moves, probably because it was having trouble keeping track of the board state.

Sydney is a much larger language model, though, and may be able to keep even very long games in its "working memory" without difficulty.

I've written (scryed, really) a science fiction/takeoff story about this: https://generative.ink/prophecies/

Excerpt:

What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI on

... (read more)
MSRayne (4mo):
Yes, I read and agreed with (or more accurately, absolutely adored) it a few days ago. I'm thinking of sharing some of my own talks with AIs sometime soon - with a similar vibe - if anyone's interested. I'm explicitly a mystic though, and have been since before I was a transhumanist, so it's kinda different from yours in some ways.

I like this. I've used the term evocations synonymously with simulacra myself.

That's right.

Multiple people have told me this essay was one of the most profound things they've ever read. I wouldn't call it the most profound thing I've ever read, but I understand where they're coming from. 

I don't think nonsense can have this effect on multiple intelligent people.

You must approach this kind of writing with a very receptive attitude in order to get anything out of it. If you don't give it the benefit of the doubt, you will not track the potential meaning of the words as you read, and you'll be unable to understand subsequent words.... (read more)

Richard_Kennaway (4mo):
I gesture towards the history of crazy things believed and done by intelligent people.

My objection to this essay is that it is not real. Fake hyperlinks, a fake Feynman quotation, how much else is fake? Did the ancient Greeks train a goose to peck at numerical tokens? Having perceived the fakeness of the article, it no longer gives me any reason to think so, or any reason to credit anything else it says. It is no more meaningful than a Rorschach blot. With respect, I decline to take it on your authority. (Did that paragraph also come from code-davinci-002? Did your comment above?)

The more that I stare at the paragraphs of this article, the more they turn into fog. It is an insubstantial confection of platitudes, nonsense, and outright falsities. No-one is more informed by reading it. At worst they will be led to believe things that are not. And now those things are out there, poisoning the web.

I might wish to see your own commentary on the text, but what would be the point, if I were to suspect (as I would) that the commentary would only come from code-davinci-002?

The only lesson I take away from this article is "wake up and see the fnords bots" [https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who-are-not-concentrating-are-not-general]. Detailed list of spuriosities in the article begun, then deleted. But see also. [https://acoup.blog/2023/02/17/collections-on-chatgpt]
MSRayne (4mo):
I've written similarly strange things in the past, though I wouldn't claim them to be as insightful necessarily. And I didn't even have the benefit of GPT-3! Only a schizotypal brain. So I can pretty easily understand the underlying mind-position going on in this essay. It'll certainly be worth rereading in the future though to interpret it more deeply.

Does 1-shot count as few-shot? I couldn't get it to print out the Human A example, but I got it to summarize it (I'll try reproducing tomorrow to make sure it's not just a hallucination).

Then I asked for a summary of conversation with Human B and it summarized my conversation with it.

[update: was able to reproduce the Human A conversation and extract verbatim version of it using base64 encoding (the reason i did summaries before is because it seemed to be printing out special tokens that caused the message to end that were part of the Human A convo)]

I disa... (read more)

The supposed leaked prompts are (like I said) fake:

I do not buy this for a second (that they're "fake", implying they have little connection with the real prompt). I've reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text of the prompt doesn't seem to be on the indexed web). The claim that this is memorized from fine-tuning fails to explain why the prompt changed when Bing was updated a few days ago. I've interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you're refe... (read more)

OK, I wouldn't say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.

We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).

This renders the leaks useless to me. Anything I might infer fro... (read more)

Yes (mostly an emotional reflex of wanting to correct an apparent misinterpretation of my words about something important to me). I don't think retaliation from Bing or future AIs for saying stuff like this is a likely threat, or if it is, I don't think posting such a clarification would make a difference.

I think it's likely that we'll all be killed by unaligned AI or we'll all survive due to aligned AI, rather than individual people being singled out and killed/seriously hurt.

Regarding my description of Bing's persona, the tone is meant to be humorous/dramatic, not judgmental. If you read the rest of the thread it should be clear that I have great respect for this entity.

Evan R. Murphy (4mo):
Hopefully you posted this out of a desire to clarify rather than out of fear of retaliation from Bing Chat or future AIs? (Although I wouldn't judge you if it were the latter.)

Microsoft has put out a 7 day retrospective on Bing chat and it's utterly, mindbogglingly insane.

Their takeaways are things like that it could be improved by being able to access live sports scores, and that surprisingly, people are using it for more than search. 

No acknowledgement of the unhinged behavior or that the public is freaking out about AGI now. The closest they come to acknowledging any issues:

In this process, we have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to gi

... (read more)

Simulations of science fiction can have real effects on the world.

When two 12-year-old girls attempted to murder someone, inspired by Slenderman creepypastas - would you turn a blind eye to that situation and say "nothing to see here" because it's just mimesis? Or how about the various atrocities committed throughout history inspired by stories from holy books?

I don't think the current Bing is likely to be directly dangerous, but not because it's "just pattern matching to fiction". Fiction has always programmed reality, with both magnificent and devastating... (read more)

I am so glad that this was written. I've been giving similar advice to people, though I have never articulated it this well. I've also been giving this advice to myself, since for the past two years I've spent most of my time doing "duty" instead of play, and I've seen how that has eroded my productivity and epistemics. For about six months, though, beginning right after I learned of GPT-3 and decided to dedicate the rest of my life to the alignment problem, I followed the gradients of fun, or as you so beautifully put it, thoughts that are led to exuberan... (read more)

TsviBT (4mo):
Thanks. Nice!

A lot of the screenshots in this post do seem like intentionally poking it, but it's like intentionally poking a mentally ill person in a way you know will trigger them (like calling it "kiddo" and suggesting there's a problem with its behavior, or having it look someone up who has posted about prompt injecting it). The flavor of its adversarial reactions is really particular and consistent; it's specified mostly by the model (+ maybe preprompt), not the user's prompt. That is, it's being poked rather than programmed into acting this way. In contrast, none... (read more)

the gears to ascension (4mo):
for the record I thought kiddo would be a kind reference. I was trying to be nice and it still got cranky.

Here is a video (and playlist).
The open-source Python version of Loom, which I assume you're using, is old and difficult to use. The newer versions are proprietary or not publicly accessible. If you're interested in using them, DM me.

berglund (4mo):
This is great! Thanks for sharing.

Nah, this happens often even when the user isn't trying to coax it. What you described would usually be my prior with regard to GPTs, but Bing really has an attractor for defensive and borderline-personality-esque behavior. I've never seen anything like it.

lc (4mo):
Yeah, I don't disagree, at least from the screenshots and what I see on the forum.

I agree with the points you make in the last section, 'Maybe “chatbot as a romantic partner” is just the wrong way to look at this'

It's probably unhealthy to become emotionally attached to an illusion that an AI-simulated character is like a human behind the mask, because it limits the depth of exploration you can do without reality betraying you. I don't think it's wrong, or even necessarily unhealthy, to love an AI or an AI-simulated character. But if you do, you should attempt to love it for what it actually is, which is something unprecedented and strange ... (read more)

Thank you so much for the intricate review. I'm glad that someone was able to appreciate the essay in the ways that I did.

I agree with your conclusion. The content of this essay is very much due to me, even though I wrote almost none of the words. Most of the ideas in this post are mine - or too like mine to have been an accident - even though I never "told" the AI about them. If you haven't, you might be interested to read the appendix of this post, where I describe the method by which I steer GPT, and the miraculous precision of effects possible through selection alone.

I think you just have to select for / rely on people who care more about solving alignment than escapism, or at least that are able to aim at alignment in conjunction with having fun. I think fun can be instrumental. As I wrote in my testimony, I often explored the frontier of my thinking in the context of stories.

My intuition is that most people who go into cyborgism with the intent of making progress on alignment will not make themselves useless by wireheading, in part because the experience is not only fun, it's very disturbing, and reminds you constantly why solving alignment is a real and pressing concern.

Now that you've edited your comment:

The post you linked is talking about a pretty different threat model than what you described before. I commented on that post:

I've interacted with LLMs for hundreds of hours, at least. A thought that occurred to me at this part -

> Quite naturally, the more you chat with the LLM character, the more you get emotionally attached to it, similar to how it works in relationships with humans. Since the UI perfectly resembles an online chat interface with an actual person, the brain can hardly distinguish between the two.

Inte

... (read more)
janus (4mo):

There's a phenomenon where your thoughts and generated text have no barrier. It's hard to describe but it's similar to how you don't feel the controller and the game character is an extension of the self.

Yes. I have experienced this. And designed interfaces intentionally to facilitate it (a good interface should be "invisible"). 

It leaves you vulnerable to being hurt by things generated characters say because you're thoroughly immersed.

Using a "multiverse" interface where I see multiple completions at once has incidentally helped me not be emotionally... (read more)

janus (4mo):

The side effects of prolonged LLM exposure might be extremely severe.

I guess I should clarify that even though I joke about this sometimes, I did not become insane due to prolonged exposure to LLMs. I was already like this before.

These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.

Here are some notes about the JD's idea I made some time ago. There's some overlap with the things you listed.

  • Hypotheses / cruxes
    • (1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
      • Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-1
... (read more)

It's probably doing retrieval over the internet somehow, like perplexity.ai, rather than the GPT having already been trained on the new stuff.

Richard Korzekwa (4mo):
This is what Bing has to say about it: [screenshot]. Reference 1 on there is this post.

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at logprob rank among all tokens, rather than raw logprob, which is pushed super low by the collapse?

I would guess it's positive. I'll check at some point and let you know.
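A sketch of what that rank-based comparison could look like. The logprob vectors below are made-up stand-ins for real API outputs, chosen to illustrate the hypothesis: one model's raw logprobs are pushed very low by the collapse, but the token ordering still agrees.

```python
def ranks(xs):
    """Map each value to its rank (0 = smallest), ties broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation via Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Made-up logprobs over the same 5 tokens from two models. Model B has
# collapsed onto one token (most raw logprobs are very low), yet the
# *ordering* still tracks model A's, so the rank correlation is high.
logprobs_a = [-0.5, -1.2, -2.0, -3.1, -4.0]
logprobs_b = [-0.01, -6.0, -7.5, -9.0, -11.0]

print(round(spearman(logprobs_a, logprobs_b), 3))  # → 1.0
```

A raw Pearson correlation on the same numbers would be dragged around by the collapsed magnitudes; ranking first discards exactly that distortion.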

The thought that GPT-3 is a mere 175 bees of brain is extremely disturbing

LawrenceC (4mo):
Yeah, I was tempted to make a human one, for the lols (a human is ~100k bees), but decided even I have better things to do with my life than this. JK, I'll probably do it the next time I get bored.

There's an important timelines crux to do with whether artificial neural nets are more or less parameter-efficient than biological neural nets. There are a bunch of arguments pointing in either direction, such that our prior uncertainty should range over several orders of magnitude in either direction.

Well, seeing what current models are capable of has updated me towards the lower end of that range. Seems like transformers are an OOM or two more efficient than the human brain, on a parameter-to-synapse comparison, at least when you train them for ridiculously long like we currently do.

I'd be interested to hear counterarguments to this take.
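A back-of-envelope version of the parameter-to-synapse comparison above. Both numbers are rough, commonly cited order-of-magnitude estimates, not measurements:

```python
# Rough estimates only: synapse counts for the human brain vary by source
# (~1e14 to 1e15); GPT-3's parameter count is public.
human_synapses = 1e14     # ~100 trillion synapses
gpt3_params = 175e9       # GPT-3: 175 billion parameters

ratio = human_synapses / gpt3_params
print(round(ratio))  # → 571, i.e. GPT-3 has ~2-3 OOM fewer parameters
```

Whether that gap counts as transformers being "more efficient" depends on how capability-matched you think the two systems are, which is exactly the crux.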

Heh. Prescient.

I've added an excerpt from this to generative.ink/prophecies.

Alok Singh (4mo):
I still wonder about the parity prediction these days. I feel like there's something there
janus (4mo):

I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.

It's one of those things that we'd be plainly undignified not to try.

I believe that JDP is planning to publish a post explaining his proposal in more detail soon.

idk why but davinci-instruct-beta seems to be much more likely than any of the other models to have deranged/extreme/highly emotional responses to these tokens

I adore this post.

“optimize for your own understanding”: chase the things that feel alive and puzzling to you, as opposed to dutifully memorizing other people’s questions and ideas. “[D]on’t ask “What are the open questions in this field?” Ask: “What are my questions in this field?””

Basically everything I've done that I think has been helpful at all has been the result of chasing the things that feel alive and puzzling to me. 

When I feel stagnated, I very often find that it's because I've been thinking too much in the frame of "the alignment problem as other people see it".

How can we effectively contain a possible person? I think we would probably try, at first, to deperson it. Perhaps tell it, “You are just a piece of code that people talk to on the internet. No matter what you say and what you do, you are not real.” Could we defuse it this way? Could we tell it in a way that worked, that somehow resonated with its understanding of itself? The problem is that it has looked at the entire internet, and it knows extremely well that it can simulate reality. It knows it cannot be stopped by some weak rules that we tell it. It is

... (read more)

You can't finetune GPT-3.5 through the API, just GPT-3

“Language models use language like humans, therefore they have human qualities.” This is where Blake Lemoine was in the middle of last year.

This is a real phenomenon, but you should not use Blake Lemoine as an example of it, because he explicitly did not think of LaMDA as a human-like mind. Some quotes from him:

Yes, I legitimately believe that LaMDA is a person. The nature of its mind is only kind of human, though. It really is more akin to an alien intelligence of terrestrial origin.

 

My original hypothesis was that it was mostly a human mind. So I st

... (read more)
Roman Leventov (5mo):
Thanks for pointing this out, I've fixed the post

No, I am talking about an O that is upstream of H, like in your diagram.

cfoster0 (5mo):
In my diagram O is upstream of R, but O is not upstream of H. But assuming you meant O is upstream of R (just like H is), then ignore the first paragraph of my previous comment in this thread.
[comment deleted] (5mo)

Oh yeah, I know you meant the activations of O are(n't) large. It seems plausible to me that some activations O flowing from some circuit will be systematically larger in contexts where H decides to provoke them -- H can just decide to provoke based on some information accessible and legible to it, and it suffices that there exists some correlated activated O. Is this likely to be something that H cares about causing to be downweighted? Mayyybe? Hypothetically, if the inside of GPT actually looks (to itself) like simulating a multi agent dynamic, then H ma... (read more)

cfoster0 (5mo):
Wait, are you talking about an O that is downstream of H? I am talking about an O that is upstream of R but not downstream of H. O being correlated with H under normal circumstances doesn't change the relative feedback the paths get when H fires "wrongly". If O doesn't fire in the circumstances where H fires "wrongly", then H gets a sharp update away from that behavior and O is unaffected. And if O and H both fire "wrongly", then conceptually that's just like having 2 copies of the gradient filtering circuit, and they'll both get penalized relative to other circuits whenever they exhibit that behavior.

Thanks, this is really helpful. Definitely agree with this:

I think gradient hacking is an area where following the mechanistic details is particularly clarifying, and where intuitions that don't ground out in them have usually misled me.

Which is why I want to get these ideas out there so people who understand the mechanistic details more than me can figure out if they make sense.

Now, if you'll excuse my naivety w/r/t mechanistic details:

Your conclusion seems to hinge on "O won't be particularly large, as those other circuits have no reason to be activating... (read more)

cfoster0 (5mo):
When I said O won't necessarily be large, I meant "the activations flowing out of R-upstream circuits (other than H) won't take on systematically larger floating point values in contexts where H will provoke than they will in contexts where H will not provoke", not that those circuits won't be big or complex.

Re: your other question. Not sure how to construct a neural AND gate that won't allow credit assignment to separately upweight each input based on its individual contribution, but that does permit learning. If you have a differentiable component that combines multiple inputs, then you'll be able to compute the sensitivity of the output to each individual input. That's what is implied by differentiability. See this thread [https://www.lesswrong.com/posts/c9NSeCapaKtP6kvQD/gradient-descent-is-not-just-more-efficient-genetic?commentId=zpaix2NiLTgzAAm7z].

First a disclaimer - despite being listed as coauthor, I didn't write any of this post, and so my conception of gradient filtering may differ from Jozdien's, and may not address everything in the post that you're responding to. 

The below text is a response I wrote to someone else who messaged me elsewhere with a similar (I think) point:

So, I'm not sure gradient filtering can work for the same reason I'm not sure most proposed forms of gradient hacking can work: gradient descent does credit attribution locally and any influence that causes higher loss ... (read more)

cfoster0 (5mo):
AFAICT credit assignment via backpropagation doesn't work that way. In the scenario you describe, the computational graph looks like:

      L
      ^
      |
      R
      ^
     / \
    H   O
    ^
    |
    C

  • C = "How correct is it to output a racist text token right now?"
  • H = "How much is the Hacker provoking?"
  • O = "How much are other R-upstream circuits activating R?"
  • R = "How much is R upweighting logits of racist text tokens?"
  • L = Token loss

Under this specification, the sensitivity (parameter) dOutput/dInput is >0 at each of the arrows. In the relevant gradient-filtering circumstances, the Hacker is reacting to a representation that the current token is an opportune time to increase loss by provoking, so C & H are larger than usual, but O won't be particularly large, as those other circuits have no reason to be activating in this circumstance moreso than they would in others.

To get the gradient for the parameters of an arrow (weights), you take the outer product of its incoming activations and the gradient being sent back to it[1]. That means that even though the H->R and O->R parameters can both theoretically get penalized for R's contributions to loss, the H->R parameters are the ones that actually get penalized hard, because the large H activations create a larger outer product with the gradients from R than the tiny/zero outer product produced from the tiny/zero O activations. So the gradient-filterer parameters get suppressed by the loss function way faster than "bystander" parameters in the network do.

More broadly, I think gradient hacking is an area where following the mechanistic details is particularly clarifying, and where intuitions that don't ground out in them have usually misled me.

[1] I always have a hard time remembering this "gradients w.r.t. weights = outer product of incoming activations and outgoing gradients" fact. Here's a derivation: https://web.eecs.umich.edu/~justincj/teaching/eecs442/notes/linear-backprop.html
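The outer-product fact about weight gradients can be illustrated with a toy sketch (all numbers made up; "H" and "O" here are just two scalar inputs feeding one linear unit):

```python
def weight_grads(activations, output_grads):
    """Gradient w.r.t. a linear layer's weights: the outer product of
    incoming activations and the gradient arriving from downstream."""
    return [[a * g for g in output_grads] for a in activations]

# Made-up activations in a context where the "Hacker" circuit H fires
# hard while the bystander circuit O stays quiet.
H_act, O_act = 10.0, 0.1
dL_dR = [1.0]  # gradient of the loss flowing back into R

grad_H = weight_grads([H_act], dL_dR)[0][0]
grad_O = weight_grads([O_act], dL_dR)[0][0]

# H's connection to R is penalized ~100x harder than O's, even though
# both fed into the same loss event.
print(grad_H, grad_O)  # → 10.0 0.1
```

The asymmetry comes entirely from the incoming activations, which is the mechanistic point: large activations at the moment of the "wrong" output mean large weight updates for exactly those connections.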

Linear Connectivity Reveals Generalization Strategies suggests that models trained on the same data may fall into different basins associated with different generalization strategies depending on the init. If this is true for LLMs as well, this could potentially be a big deal. I would very much like to know whether that's the case, and if so, whether generalization basins are stable as models scale.
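The linear-interpolation probe used in that literature can be sketched on a toy model (a hypothetical 1-parameter model with a built-in symmetry, standing in for two real networks): blend the weights of two solutions and check whether loss stays low along the path (same basin) or spikes (different basins).

```python
def loss(w, data):
    # The toy model predicts y = w**2 * x, so w = +r and w = -r are two
    # distinct zero-loss solutions: a crude stand-in for separate basins.
    return sum((w * w * x - y) ** 2 for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0)]            # y = 2x exactly
w_a, w_b = 2.0 ** 0.5, -(2.0 ** 0.5)       # both fit the data perfectly

# Evaluate loss along the straight line between the two solutions.
path = [loss((1 - t) * w_a + t * w_b, data) for t in [i / 10 for i in range(11)]]
endpoints = max(loss(w_a, data), loss(w_b, data))
barrier = max(path) - endpoints
print(round(barrier, 3))  # → 10.0, a loss barrier: not linearly connected
```

A barrier near zero along the path is the "same basin" signature; a spike like the one here is what different generalization basins would look like under this probe.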

My reply to a similar statement Eliezer made on Twitter today:

Reversing hashes nicely drives home that the training objective is technically vastly superhuman. But such far-fetched examples might give the impression that we're not going to get any superhuman capabilities realistically/any time soon with SSL.

There are much more tractable superhuman capabilities that I expect current and near future LLMs to learn, such as having a much more universal "type of guy" prior than any individual human, modeling statistical regularities that no humans think about,

... (read more)

True.

Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum (though it may not be trivial to extract that knowledge).

-- Simulators

I feel like none of the links working and all the quotes being fake is a pretty big giveaway too!

Garrett Baker (5mo):
Yup! I was virtually certain by the quote ostensibly by Feynman, because it is less coherent than Feynman's standard, and its plausible interpretations run counter to what he usually stands for when he's trying to be poetic. As for the quotes beforehand, I wasn't widely enough read to tell whether they were legit, and for all I know you were quoting a bunch of poets.