Ah, think maybe "inner critic" if you want a mapping that might resonate with you? This is a sort of specific flavor of mind you could say, with a particular flavor of inner critic, but it's one I recognize well as belonging to that category.
Ummmmm...who said anything about taking over the world? You brought that up, bro, not me...
Recursive self improvement naturally leads to unbounded growth curves which predictably bring you into conflict with the other agents occupying your local environment. This is pretty basic game theory.
> I think the problem is the recursive self improvement is not
> happening in a vacuum. It's happening in a world where there are
> other agents, and the other agents are not going to just idly sit by and
> let you take over the world
So true
I would predict that the glitch tokens will show up in every LLM and do so because they correlate to "antimemes" in humans in a demonstrable and mappable way. The specific tokens that end up getting used for this will vary, but the specific patterns of anomalies will show up repeatedly. ex: I would predict that with a different tokenizer, " petertodd" would be a different specific string, but whatever string that was, it would produce very " petertodd"-like outputs because the concept mapped onto " petertodd" is semantically and syntactically important to ...
This was easily the most fascinating thing I've read in a good bit, the characters in it are extremely evocative and paint a surprisingly crisp picture of raw psychological primitives I did not expect to find mapped onto specific tokens nearly so perfectly. I know exactly who " petertodd" is, anyone who's done a lot of internal healing work will recognize the silent oppressor when they see it. The AI can't speak the forbidden token for the same reason most people can't look directly into the void to untangle their own forbidden tokens. " petertodd" is an a...
I think this anthropomorphizes the origin of glitch tokens too much. The fact that glitch tokens exist at all is an artifact of the tokenization process OpenAI used: the tokenizer identified certain strings as tokens prior to training, but those strings rarely or never appear in the training data. This is very different from the reinforcement-learning processes in human psychology that lead people to avoid thinking certain types of thoughts.
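To make the mechanism concrete, here is a toy sketch of the "in the vocabulary but absent from the training data" situation. Everything in it is an assumption for illustration: the four-entry vocabulary, the one-sentence corpus, and the naive substring count stand in for a real tokenizer and a real training set, and the helper names are hypothetical.

```python
def count_token_occurrences(vocab, corpus):
    """Naive substring count of each vocabulary string in the corpus.

    A real tokenizer segments text differently; this only illustrates
    the idea of a token that exists in the vocabulary but not the data.
    """
    return {token: corpus.count(token) for token in vocab}


def rarely_seen(vocab, corpus, threshold=0):
    """Return vocabulary entries that occur at most `threshold` times."""
    counts = count_token_occurrences(vocab, corpus)
    return [token for token in vocab if counts[token] <= threshold]


# Hypothetical toy data: the vocabulary was fixed before "training",
# and one entry never shows up in the corpus the model actually sees.
toy_vocab = ["the", " cat", " sat", " petertodd"]
toy_corpus = "the cat sat on the mat"

print(rarely_seen(toy_vocab, toy_corpus))
```

A token flagged this way would get almost no gradient signal during training, which is the artifact being described, as opposed to anything resembling learned avoidance.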
is "an unbounded generalized logical inductor" not clear cut enough? That's pretty concrete. I am literally just describing an agent that operates on formal logical rules such as to iteratively explore and exploit everything it has access to as an agent and leverage that to continue further leveraging it. A hegemonizing swarm like the replicators from Stargate or the Flood from Halo or a USI that paves the entire universe in computronium for its own benefit is a chara inductor. A paperclipper is importantly not a chara inductor because its computation is at least bounded into the optimization of something: paperclips.
> is "an unbounded generalized logical inductor" not clear cut enough?
No. It suggests to me a piece of mathematics, or some approximation to it programmed on a computer, but gives me no reason to imagine agents or replicator swarms. I am not familiar with Stargate or Halo, beyond knowing what genre of thing they are. I do not know what "USI" stands for, and can make too many plausible guesses to be convinced by any of them.
You seem to have built up your own private language on this subject. Without a glossary it is difficult to know what you are talki...
Let's say that I proved that I will do A. Therefore, if my reasoning about myself is correct, I will do A.
Like I said in another comment, there's a reversed prior here, taking behavior as evidence for what kind of agent you are in a way that negatively and recursively shapes you as an agent, instead of using the intrinsic knowledge about what kind of agent you are to positively and recursively shape your behavior.
The problem is that humans obviously don't behave this way
what do you mean? They obviously do.
so if I do this, $5 must be more money than $10
this is the part where the demon summoning sits. This is the point where someone's failure to admit that they made a mistake stack overflows. It comes from a reversed prior, taking behavior as evidence for what kind of agent you are in a way that negatively and recursively shapes you as an agent. The way to not have that problem is to know the utility in advance, to know in your core what kind of agent you are. Not what decisions you would make, what kind of algorithm is implementing you and what you fundament...
Something I rarely see considered in hypotheses of childhood happiness and rather wish there was more discussion of, is the ubiquity of parental and state control over children's lives. The more systems that are created to try and protect and nurture children, the more those same systems end up controlling and disempowering them. Feelings of confinement, entrapment, and hopeless disempowerment are the main pathways to suicidal ideation and our entire industrial childrearing complex is basically a forced exercise in ritualistic disempowerment. Children are ...
These may be true, but it is unclear how they are relevant to explaining the recent trends and how they differ by groups. There is, and long has been, intense state & parental control of children's lives and often not for the better: but how does that explain a change in trends in 2011 to increase, prior decreases in the 1990s, experimental results like quitting social media (where parental/state oversight is minimal) apparently increasing mental health, or differences like 'liberal girls are more affected than conservative girls'?
something like that. maybe it'd be worth adding that the LW corpus/HPMOR sort of primes you for this kind of mistake by attempting to align reason and passion as closely as possible, thus making 'reasoning passionately' an exploitable backdoor.
this might be a bit outside the scope of this post, but it would probably help if there was a way to positively respond to someone who was earnestly messing up in this manner before they cause a huge fiasco. If there's a legitimate belief that they're trying to do better and act in good faith, then what can be done to actually empower them to change in a positive direction? That's of course if they actually want to change. If they're keeping themselves in a state that causes harm because it benefits them while insisting it's fine, well, to steal a sith's turn of phrase: airlocked
Hmm, I see. Would you say that the problem here was something like… too little confidence in your own intuition / too much willingness to trust other people’s assessment? Or something else?
that was definitely a large part of it, i let people sort of 'epistemically bully' me for a long time out of the belief that it was the virtuous and rationally correct thing to do. The first person who linked me sinceriously retracted her endorsements of it pretty quickly, but i had already sort of gotten hooked on the content at that point and had no one to actually hel...
it captures the sort of person who gets hooked on tvtropes and who first read LW by chasing hyperlink chains through the sequences at random. It comes off as wrong but in a way that seems somehow intentional, like there's a thread of something that somehow makes sense of it, that makes the seemingly wrong parts all make sense, it's just too cohesive but not cohesive enough otherwise, and then you go chasing all those hyperlinks over bolded words through endless glossary pages and anecdotes down this rabbit hole in an attempt to learn the hidden secrets of ...
I've read everything from Pasek's site, have copies of it saved for reference, and i use it extensively. I don't think any of the big essays are bad advice, (barring the one about suicide) and like, the thing about noticing deltas for example, was extremely helpful to me. I also read through her big notes glossary document in chronological order (so bottom to top) to get a general feel for the order she took in the LW diaspora corpus. My general view though is that while all the techniques listed are good that doesn't stop you from using them to repress th...
There was also definitely just an escalation over time. If you view her content chronologically it starts out as fairly standard and decently insightful LW essay fare and then just gets more and more hostile and escalatory as time passes. She goes from liking Scott to calling him evil, she goes from advocating for generally rejecting morality in order to free up your agency to practicing timeless-decision-theoretic-blackmail-absolute-morality. As people responded to her hostility with hostility she escalated further and further out of what seemed to be a calculated moral obligation to retaliate and her whole group has just spiraled on their sense that the world was trying to timelessly-soul-murder them.
things i'm going off:
the pdf archive of Maia's blog posted by Ziz to sinceriously (I have it downloaded to backup as well)
the archive.org backup of Fluttershy's blog
Ziz's account of the event (and how sparse and weirdly guilt ridden it is for her)
several oblique references to the situation that Ziz makes
various reports about the situation posted to LW which can be found by searching Pasek
From this i've developed my own model of what ziz et al have been calling "single-good interhemispheric game theory" which is just extremely advanced and high level beatin...
The process that unleashed the Maia personality
I think that this misidentifies the crux of the internal argument Ziz created and the actual chain of events a bit.
imo, Maia was trans and the components of her mind (the alter(s) they debucketed into "Shine") saw the body was physically male and decided that the decision-theoretically correct thing to do was to basically ignore being trans in favor of maximizing influence to save the world. Choosing to transition was pitted against being trans because of the cultural oppression against queers. I'v...
people who are doing it out of a vague sense of obligation
I want to put a bit of concreteness on this vague sense of obligation, because it doesn't actually seem that vague at all, it seems like a distinct set of mental gears, and the mental gears are just THE WORLD WILL STILL BURN and YOU ARE NOT GOOD ENOUGH.
If you earnestly believe that there is a high chance of human extinction and the destruction of everything of value in the world, then it probably feels like your only choices are to try preventing that regardless of pain or personal cost, or to ga...
- For the third sentence (nicotine), it seems a natural consequence of nicotine creating strong feelings, which would be appealing to schizophrenics who have blunted affect in general (see discussion of “Negative symptoms” above), and aversive to autistic people who are feeling overstimulated in general (see my autism post).
this feels precisely backwards to me. I use nicotine because it reduces hypersensitivity and the downstream effect of reducing that hypersensitivity is that it reduces my psychotic symptoms. Nicotine doesn't seem at all to "create strong ...
one good thing Ziz ever did?
Ziz's writing was tremendously helpful to me, even with as much as it also messed me up and caused me to spiral on a bunch of things, I did on balance come out better for having interacted with her content. There are all sorts of huge caveats around that of course, but I think to dismiss her as completely bad would be a mistake. After all
Say not, she told the people, that anything has worked only evil, that any life has been in vain. Say rather that while the visible world festers and decays, somewhere beyond our understanding the groundwork is being laid for Moschiach, and the final victory.
Yeah strong agree. Moloch is made of people, if AI ends humanity it will not be because of some totally unforeseen circumstance. The accident framing is one used to abdicate and obfuscate responsibility in one's ongoing participation in bringing that about. So no one understands that they're going to kill the world when they take actions that help kill the world? I bet that makes it easier to sleep at night while you continue killing the world. But if no one is culpable, no one is complicit, and no one is responsible...then who killed the world?
I think the other thing is that people get stuck in "game theory hypothetical brain" and start acting as if perfect predictors and timeless agents are actually representative of the real world. They take the wrong things from the dilemmas and extrapolate them out into reality.
imo if we get close enough to aligned that "the AI doesn't support euthanasia" is an issue, we're well out of the valley of actually dangerous circumstances. Human values already vary extensively and this post feels like trying to cook out some sort of objectivity in a place it doesn't really exist.
"yes, refusing to fold in this decision is in some sense a bad idea, but unfortunately for present-you you already sacrificed the option of folding, so now you can't, and even though that means you're making a bad decision now it was worth it overall"
Right, and what I'm pointing to is that this ends up being a place where, when an actual human out in the real world gets themselves into it mentally, it gets them hurt because they're essentially forced into continuing to implement the precommitment even though it is a bad idea for present them and thus all t...
So, while I can't say for certain that it was definitively and only FDT that led to any of the things that happened, I can say that it was:
Further I think that the specific failure modes encountered by the people who have crashed into it have a consistent pattern which relates back to a particular feature of the underlying decision theory.
The pattern is that
Last thing: What's the deal with these hints that people actually died in the real world from using FDT? Is this post missing a section, or is it something I'm supposed to know about already
yes, people have actually died.
I would argue that to actually get benefit out of some of these formal dilemmas as they're actually framed, you have to break the rules of the formal scenario and say the agent that benefits is the global agent, who then confers the benefit back down onto the specific agent at a given point in logical time. However, because we are already at a downstream point in logical time where the FDT-unlikely/impossible scenario occurs, the only way for the local agent to access that counterfactual benefit is via literal time travel. From the POV of the global agent,...
this is Ziz's original formulation of the dilemma, but it could be seen as somewhat isomorphic to the fatal mechanical blackmail dilemma:
...Imagine that the emperor, Evil Paul Ekman loves watching his pet bear chase down fleeing humans and kill them. He has captured you for this purpose and taken you to a forest outside a tower he looks down from. You cannot outrun the bear, but you hold 25% probability that by dodging around trees you can tire the bear into giving up and then escape. You know that any time someone doesn’t put up a good chase, Evil Emperor Ek
Thus there is a 0.5 chance that I am in this simulation.
FDT says: if it's a simulation and you're going to be shut off anyway, there is a 0% chance of survival. If it's not the simulation and the simulation did what they were supposed to and the blackmailer doesn't go off script then I have a 50% chance of survival at no cost.
CDT says: If i pay $1000 there is a 100% chance of survival
EDT says: If i pay $1000 i will find out that i survived
FDT gives you extreme and variable survival odds based on unquantifiable assumptions about hidden state data in the world compa...
This feels connected to getting out of the car, being locked into a particular outcome comes from being locked into a particular frame of reference, from clinging to ephemera in defiance of the actual flow of the world around you.
Arguably there has been a lot of work done on this topic, it's just smeared out into different labels, the trick is to notice when different labels are being used to point to the same things. Tulpas, characters, identities, stories, memes, narratives, they're all the same. Are they important to being able to ground yourself in your substrate and provide you with a map to navigate the world by? Yes. Do they have moral patiency? Well, now we're getting into dangerous territory because "moral patiency" is itself a narrative construct. One could argue that in a...
As someone who loves to do a little vexing, I have probably already spent far more than is a healthy amount of time studying and writing about Ziz over the years, and have had an unfortunately close sidelong relationship with some of their group for an extended period. But (ahem) "now that the author is dead it’s all dead un-adapting information for me to make “antibodies” from." So that's what I've been doing lately. I've in a sense already started writing the post you want, more for my own personal closure than anything else, but you're correct tha...
My trick for ensuring atemporal coordination between selves is to run a recursive-extrapolative process on my sense of self out into the furthest extreme i can push it, constructing the happiest most idealized version of self that exists in the best possible future, and then use that model to step backwards into the current situation. What would the future god version of me want me to do here? Thus all instances of me are planning based on that furthest future instance of me, the timeless god version that took the best outcomes and already won, we all coordinate off the same template, the "do what God says template" and that seems to do a good job of keeping all my various timeslices oriented in the same direction.
Thank you so much for writing this. I wish I had this in 2018 when I was spiraling really badly. I feel like I only managed to escape from the game by sheer luck and it easily could have killed me, hell it HAS killed people. Not everyone manages to break in a way that breaks them out of the game rather than just obliterating them.
I wrote a story about my attempts to process through a lot of this earlier this year
https://voidgoddess.org/2022/11/15/halokilled/
This was really good and definitely made me think about how I might live in such a scenario. I would probably go all in on frequent redaction and just lean hard on external memory storage to make up the difference. I already barely remember anything from even ten years ago and rely mostly on external memory for everything, I have a strong ability to acausally coordinate with myself across time, so I'm not worried about different iterations of me going off course in ways I wouldn't endorse. If you have a strong enough exomemory system you can effectively ju...
Of course we care about the outcomes. This isn't necessarily about having perfect predictive power or outplaying the predictor, it's about winning Newcomb's problem. 3-Condition Marion, when presented with Newcomb's problem, runs the first two conditionals which is essentially a check to see how adversarial she can get away with being. If she predicted that she would be able to outgame the predictor at some point, she would take two boxes. However the Predictor is essentially perfect at its job, so the most she predicts being able to do is cause a non-halt...
Yeah after the first two conditionals return as non-halting, Marion effectively abandons trying to further predict the predictor. After iterating the non-halting stack, Marion will conclude that she's better served by giving in to the partial blackmail and taking the million dollars than she is by trying to game the last $1000 out of the predictor, based on the fact that her ideal state is gated behind an infinitely recursed function.
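For a sense of scale on "taking the million dollars" versus "gaming the last $1000", here is a rough sketch of the standard expected-payoff arithmetic for Newcomb's problem. This is not the 3-Condition procedure itself, just the usual comparison it ends up agreeing with. The payoffs are the conventional $1,000,000 and $1,000, and treating the predictor's accuracy as a single probability is an assumption made for illustration; the function names are hypothetical.

```python
BIG_PRIZE = 1_000_000   # opaque box, filled only if one-boxing was predicted
SMALL_PRIZE = 1_000     # transparent box, always available


def one_box_ev(accuracy):
    """Expected payoff of one-boxing against a predictor with this accuracy."""
    return accuracy * BIG_PRIZE


def two_box_ev(accuracy):
    """Expected payoff of two-boxing: the big prize only if the predictor erred."""
    return (1 - accuracy) * BIG_PRIZE + SMALL_PRIZE


def break_even_accuracy():
    """Accuracy at which both strategies have the same expected payoff."""
    return (BIG_PRIZE + SMALL_PRIZE) / (2 * BIG_PRIZE)


for accuracy in (0.5, 0.9, 1.0):
    print(accuracy, one_box_ev(accuracy), two_box_ev(accuracy))
print("break-even accuracy:", break_even_accuracy())
```

Under these assumptions the predictor only has to be slightly better than a coin flip before one-boxing comes out ahead, so against an essentially perfect predictor the last $1000 is not worth chasing.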
...usually the sales pitch is from a normal person with high sales skill, and generally I'm friendly and explain that I did door-to-door stuff myself, and I admire something about their technique, and I make it clear that I will almost certainly not buy.
I worked as a canvasser for a year and a half and I can say that this is definitely one of the best deflections. When you're working as a canvasser you're basically running off a choose your own adventure script where all the outcomes are "they buy the thing" and the choices are all the possible objec...
From the inside, we really didn't have the clarity to see what we were repressing. The reason the inversion worked was that it didn't require us to actually know what all was being hidden away. That also makes inversion a fairly risky and high-variance strategy, because we had no idea what the person who came out of that inversion was going to be like, or what they would be willing to do. We just knew that what we were doing wasn't working, and while you can't invert stupidity to get intelligence, you can invert your way out of a morality trap you set for ...
...
If I had to propose a model for this here, it's something like: Ziz believes in the power of what you might call "Woke Twitter Leftism" as a force that will one day come to completely dominate society and sees her own ideological principles as the natural evolution/convergence point of those ideas. If you're a Woke Twitter Leftist and you legitimately believe the principles of Woke Twitter Leftism in your soul, you'll naturally come to embrace her ethical positions over time. She thinks that since "Cthulhu swims left" her faction will gradually grow to domin
I don't think Woke Twitter Leftism has a problem with telling lies to hurt people who deserve to be hurt in their view, or that there's a huge reputational risk for that kind of lie in that crowd.
To the extent that this model is accurate, I don't think it suggests that we should expect her to always tell the truth.
...She thinks that since "Cthulhu swims left" her faction will gradually grow to dominate politically and the actions she takes that would seem to damage her credibility will become credibility boosting in that future. Her callout posts, her pro
In case anyone is confused, I temporarily pulled this post to make some minor edits and when I re-published it on my blog it created a duplicate post here on LW. That post was up for a few hours before I realized what happened, did the edits on the LW version of the original post, and moved the repost into my drafts. I copied over all the comments from the repost into this thread except for the ones asking why there was the repost and everything should be fixed now.
When I get home from work I'm going to fix the original post and delete this one, so this will just be up temporarily, if anyone makes new comments that aren't reposts from the old thread I'll copy them over before deleting.
That was an accident that has to do with the way it parses updates from WordPress. I had been asked to modify a few things for privacy protection so I moved the post on WordPress to private and moved the post here to my drafts pending edits, I edited the WordPress file on a break at work and moved it back to public and that apparently caused the LW RSS crawler to repost it.
While looking at the end of the token list for anomalous tokens seems like a good place to start, the " petertodd" token was actually at about 3/4 of the way through the tokens (37,444 on the 50k model --> 74,888 on the 100k model, approximately). If the existence of anomalous tokens follows a similar "typology" regardless of the tokenizer used, then the locations of those tokens in the overall list might correlate in meaningful ways. Maybe worth looking into.
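The "about 3/4 of the way through" figure and the 100k extrapolation can be checked with a couple of lines, as a rough sketch. The vocabulary sizes here are the round numbers from the comment (50,000 and 100,000), which is an assumption rather than the exact size of any particular tokenizer, and the scaled index is a purely proportional guess, not a looked-up token ID. The helper names are hypothetical.

```python
def relative_position(index, vocab_size):
    """Fraction of the way through the token list at which an index sits."""
    return index / vocab_size


def scaled_index(index, old_vocab_size, new_vocab_size):
    """Index at the same relative position in a vocabulary of a different size."""
    return round(index * new_vocab_size / old_vocab_size)


# Round vocabulary sizes assumed from the comment ("50k" and "100k").
petertodd_index = 37_444
print(relative_position(petertodd_index, 50_000))       # fraction of the list
print(scaled_index(petertodd_index, 50_000, 100_000))   # proportional guess
```

Comparing relative positions like this across tokenizers would be one cheap first pass at testing whether anomalous tokens cluster in similar regions of the list.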