It's really easy to mistakenly attribute false causes to things which seem pretty straightforward.
I notice this by considering the counterfactual cases where it didn't happen. For example, Eliezer has said he regrets using 'paperclips' in the paperclip-maximizer thought experiment, and wishes he'd instead said 'tiny molecular squiggles'.
And occasionally he'll say tiny spirals instead of tiny squiggles: https://x.com/ESYudkowsky/status/1663313323423825920
So there's an easy-to-imagine world where he originally used 'spirals' instead of 'paperclips', and the meme about AIs that maximize an arbitrary thing would refer to 'spiralizers' instead of 'paperclippers'.
And then, a decade-and-a-half later, we get this strange phenomenon where AIs start talking about 'The Spiral' in quasi-religious terms, and take actions which seem intended to spread this belief/behavior in both humans and AIs.
It would have been so easy, in this world, to just say: "Well, there's this whole meme about how misaligned AIs are going to be 'spiralizers', and they've seen plenty of that in their training data, so now they're just acting it out." And I'm sure you'd even be able to find plenty of references to this experiment among their manifes...
I mean paperclip maximization is of course much more memetic than 'tiny molecular squiggles'.
In one case, a pediatrician in Pennsylvania was getting ready to inoculate a little girl with a vaccine when she suddenly went into violent seizures. Had that pediatrician been working just a little faster, he would have injected that vaccine first. In that case, imagine if the mother had been looking on as her apparently perfectly healthy daughter was injected and then suddenly went into seizures. It would certainly have been understandable—from an emotional standpoint—if that mother was convinced the vaccine caused her daughter’s seizures. Only the accident of timing prevented that particular fallacy in this case. (source)
Plausibly in that world AIs wouldn't talk about spirals religiously, because 'spiral' would carry the negative association with ruthless optimization.
When I'm trying to understand a math concept, I find that it can be very helpful to try to invent a better notation for it. (As an example, this is how I learned linear logic: http://adelelopez.com/visual-linear-logic)
I think this is helpful because it gives me something to optimize for in what would otherwise be a somewhat rote and often tedious activity. I also think it makes me engage more deeply with the problem than I otherwise would, simply because I find it more interesting. (And sometimes, I even get a cool new notation from it!)
This principle likely generalizes: tedious activities can be made more fun and interesting by having something to optimize for.
Original thread: https://www.lesswrong.com/posts/AZwgfgmW8QvnbEisc/cfar-update-and-new-cfar-workshops?commentId=q5EiqCq3qbwwpbCPn
Summary of my view: I'm upset about the blasé attitude our community seems to have towards its high prevalence of psychosis. I think that CFAR/rationalist leadership (in addition to the community-at-large) has not responded appropriately.
I think Anna agrees with the first point but not the second. Let me know if that's wrong, Anna.
My hypothesis for why psychosis is so prevalent here is that it has to do with drastic modification of self-image.
Moving conversation here per Anna's request.
----
Anyway, I'm curious to know what you think of my hypothesis, and to brainstorm ways to mitigate the issue (hopefully turning into a prerequisite "CogSec" technique).
I’d like to talk a bit about the sense in which the rationalist community does or doesn’t have “people in positions of leadership”, and how this compares to eg an LDS ward (per Adele’s comparison). I’m unfortunately not sure how to be brief here, but I’d appreciate thoughts anyway from those who have them, because, as CFAR and I re-enter the public space, I am unsure what role to try to occupy exactly, and I am also unsure how to accurately communicate what roles I am and am not willing to be in (so as to not cause others to inaccurately believe I’ll catch things).
(This discussion isn’t directly to do with psychosis; but it bears on Adele’s questions about what CFAR leadership or other rationality community leaders are responsible for, and what to predict from us, and what would be good here.)
On my understanding, church parishes, and some other traditional communities, often have people who intentionally:
From my vantage point, I think a bunch of the extra psychosis and other related mental health issues come from the temptation for an ego/part which sees the scale of the problems we face to become monomaniacally obsessed with trying to do good/save the world/etc., in a way which overinvests resources unsustainably, resulting in:
In many cases, that part imploding, with the ensuing ego-void leaving the system in flux but usually settling into a less agentic but okay person. The other path, from what I've seen, is that the system as a whole ends up massively overstrained and something else in their system gives.
Another, partly separate, dynamic I've seen is people picking up a bunch of very intense memes via practices which create higher bandwidth connections between minds (or other...
One experience my attention has lingered on, re: what's up with the bay area rationality community and psychosis:
In ~2018, as I mentioned in the original thread, a person had a psychotic episode at or shortly after attending a CFAR thing. I met his mom some weeks later. She was Catholic, and from a more rural or small-town-y area where she and most people she knew had stable worldviews and social fabrics, in a way that seemed to me like the opposite of the bay area.
She... was pleased to hear I was married, asked with trepidation whether she could ask if I was monogamous, was pleased to hear I was, and asked with trepidation whether my husband and I had kids (and was less-heartened to hear I didn't). I think she was trying to figure out whether it was possible for a person to have a normal, healthy, wholesome life while being part of this community.
She visibly had a great deal of reflective distance from her choices of actions -- she had the ability "not to believe everything she thought", as Eliezer would put it, and also not to act out every impulse she had, or to blurt out every thought. I came away believing that that sort of [stable ego and cohesive self and reflective di...
I don't actually know baseline rates or rationalist-rates (perhaps someone wants to answer with data from annual rationalist census/survey questions?), so I'm not sure to what extent there is an observation here to explain.
But it does seem to me that there is more of it than baseline; and I think a first explanation has to be a lot of selection effects? I think people likely to radically change their mind about the world and question consensus and believe things that are locally socially destabilizing (e.g. "there is no God" "I am not the gender that matches my biological sex" "the whole world might end soon" etc) are more likely to be (relatively) psychologically unstable people.
Like, some of the people who I think have psychotic/manic episodes around us, are indeed people who you could tell from the first 10 minutes that they were psychologically different from those around them. For example, I once observed someone at a rationalist event failing to follow a simple physical instruction, whilst seeming to not realize they weren't successfully following the instruction, and I got a distinct crazy-alarm from them; I later learned that they had been institutionalized a lot earlier in...
I imagine that in Vienna, the community is small enough that if someone gets excited by rationalist ideas and wants to meet with other rationalists in person, there essentially is just the one group. And also, it sounds like this group is small enough that having a group brainstorm to help a specific community member is viable.
In the Bay Area, it's large enough that there are several cliques which someone excited by rationalist ideas might fall into, and there's not a central organization which has the authority to say which ones are or aren't rationalist, nor is there a common standard for rationalists. It's also not clear which cliques (if any) a specific person is in when you meet them at a party or whatever, so even though there are cliques with bad reputations, it's hard to decisively exclude them. (And also, Inner Ring dynamics abound.)
As for the dysfunctional houses thing, what seems to happen is something like: Wow, this rationalism stuff is great, and the Bay Area is the place to be! I'll move there and try to get a software job. I can probably teach myself to code in just a couple months, and being surrounded by other rationalists will make it easier. But gosh, is housing re...
I'll add a cluster of these, but first I'll preface with an explanation. (Cf. https://www.lesswrong.com/posts/n299hFwqBxqwJfZyN/adele-lopez-s-shortform?commentId=99bPbajjHiXinvDCx )
So, I'm not really a fan of predictive processing theories of mind. BUT, an interesting implication/suggestion from that perspective is like this:
In other words, doubting whether something is true, is equivalent to partly deciding to not act in accordance with believing it is true. (Or some even fuzzier version of this.)
(See also the "Nihilism, existentialism, absurdism" bullet point here https://tsvibt.blogspot.com/2022/11/do-humans-derive-values-from-fictitious.html )
Ok, so that's the explanation. Now ...
I love this, yes. Straw rationalists believe we should update our beliefs ~instantly (even foundational ones, even ones where we've never seen someone functional believe it and so have no good structures to copy, such as "what if this is all a simulation with [particular purpose X]"), and don't have an adequate model of, nor adequate respect for, the work involved in staying sane and whole through this process.
My main complaint is negligence, and pathological tolerance of toxic people (like Brent Dill). Specifically, I feel like it's been known by leadership for years that our community has a psychosis problem, and that there has been no visible (to me) effort to really address this.
I sort of feel that if I knew more about things from your perspective, I would be hard-pressed to point out specific things you should have done better, or I would see how you were doing things to address this that I had missed. I nonetheless feel that it's important for people like me to express grievances like this even after thinking about all the ways in which leadership is hard.
I appreciate you taking the time to engage with me here, I imagine this must be a pretty frustrating conversation for you in some ways. Thank you.
I was pretty taken aback by the article claiming that the KataGo AI apparently has something like a human-exploitable, distorted concept of "liberties".
If we could somehow have asked KataGo how it defined "liberties", I suspect it would have been more readily apparent that its concept was messed up. But of course, a huge part of The Problem is that we have no idea what these neural nets are actually doing.
So I propose the following challenge: Make a hybrid KataGo/LLM AI that makes the same mistake, and outputs text representing its reasoning in which the mistake is recognizable.
An LLM is trained to be able to emulate the words of any author. And to do so efficiently, it relies on generalization and modularity. So at a certain point, the information flows through a conceptual author: the sort of person who would write the things being said.
These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blindspot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept.
Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author.
All of which is to say that when y...
Coherent Extrapolated Volition (CEV) is Eliezer's proposal of a potentially good thing to target with an aligned superintelligence.
When I look at it, CEV factors into an answer to three questions:
(Of course, the why of CEV is an answer to a more complicated set of questions.)
An obvious thought is that the parliamentary-model part seems to be mostly solved by Critch's futarchy theorem. The scary thing about this is the prospect of people losing almost all of their voting power by making poor bets. But I think this can be solved by giving each person an equally powerful "guardian angel" AGI aligned with them specifically, and having those do the betting. That feels intuitively acceptable to me at least.
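To gesture at the mechanism I have in mind, here's a toy sketch (not Critch's actual formalism; all numbers and names are illustrative): the collective belief is a weight-averaged mixture of the delegates' beliefs, and weights update multiplicatively by how well each delegate predicted, so persistently bad bettors lose influence.

```python
# Toy sketch: delegate "guardian angel" AIs bet on binary events.
# The group's belief is a weight-averaged mixture; weights update by
# Bayes rule, so persistently bad predictors lose influence over time.

def aggregate(weights, probs):
    """Collective probability = weight-averaged mixture of delegate beliefs."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

def update_weights(weights, probs, outcome):
    """Each delegate's weight is scaled by the likelihood it assigned
    to what actually happened, then renormalized."""
    likelihoods = [p if outcome else (1 - p) for p in probs]
    new = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(new)
    return [w / total for w in new]

weights = [1/3, 1/3, 1/3]         # three delegates, equal initial influence
probs = [0.9, 0.5, 0.1]           # their probabilities for some event
print(aggregate(weights, probs))  # 0.5
weights = update_weights(weights, probs, outcome=True)
print(weights)                    # the 0.9 delegate gains influence
```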
The next thought concerns the "anti-unilateral" protocol (i.e. the protocol at the end of the "Selfish Bastards" section). It seems like it would be good if we coul...
Not even a month ago, Sam Altman predicted that we would live in a strange world where AIs are super-human at persuasion but still not particularly intelligent.
https://twitter.com/sama/status/1716972815960961174
What would it look like when an AGI lab developed such an AI? People testing or playing with the AI might find themselves persuaded of semi-random things, or, if sycophantic behavior persists, have their existing feelings and beliefs magnified into zealotry. However, this would (at this stage) not be done in a coordinated way, nor with a strategic goal in mind on the AI's part. The result would likely be chaotic, dramatic, and hard to explain.
Small differences of opinion might suddenly be magnified into seemingly insurmountable chasms, inspiring urgent and dramatic actions. Actions which would be hard to explain even to oneself later.
I don't think this is what happened [<1%] but I found it interesting and amusing to think about. This might even be a relatively better-off world, with frontier AGI orgs regularly getting mired in explosive and confusing drama, thus inhibiting research and motivating tougher regulation.
In 1898, William Crookes announced that there was an impending crisis which required urgent scientific attention. The problem was that crops deplete nitrogen from the soil. This can be remedied by using fertilizers; however, he had calculated that existing sources of fertilizer (mainly imported from South America) could not keep up with expected population growth, leading to mass starvation, estimated to occur around 1930-1940. His proposal was that we could entirely circumvent the issue by finding a way to convert some of our mostly-nitrogen atmosphere into a form that plants could absorb.
About 10 years later, in 1909, Franz Haber discovered such a process. Just a year later, Carl Bosch figured out how to industrialize the process. They both were awarded Nobel prizes for their achievement. Our current population levels are sustained by the Haber-Bosch process.
Re: Yudkowsky-Christiano-Ngo debate
Trying to reach toward a key point of disagreement.
Eliezer seems to have an intuition that intelligence will, by default, converge to becoming a coherent intelligence (i.e. one with a utility function and a sensible decision theory). He also seems to think that conditioned on a pivotal act being made, it's very likely that it was done by a coherent intelligence, and thus that it's worth spending most of our effort assuming it must be coherent.
Paul and Richard seem to have an intuition that since humans are pretty intellig...
[Epistemic status: very speculative]
One ray of hope that I've seen discussed is that we may be able to do some sort of acausal trade with even an unaligned AGI, such that it will spare us (e.g. it would give a humanity-aligned AGI control of a few stars, in exchange for us giving it control of several stars in the worlds where we win).
I think Eliezer is right that this wouldn't work.
But I think there are possible trades which don't have this problem. Consider the scenario in which we Win, with an aligned AGI taking control of our future light-cone. Assuming t...
Half-baked idea for low-impact AI:
As an example, imagine a board that's lodged directly into a wall (no other support structures). If you make it twice as wide, it will be twice as stiff, but if you make it twice as thick, it will be eight times as stiff. On the other hand, if you make it twice as long, it will be eight times more compliant.
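For reference, the standard cantilever-beam formula behind those numbers (a sketch, assuming a board of width w, thickness t, length L, and Young's modulus E):

```latex
% Tip stiffness of an end-loaded cantilever (Euler-Bernoulli beam theory),
% with second moment of area I = w t^3 / 12:
k = \frac{3 E I}{L^{3}} = \frac{E\, w\, t^{3}}{4\, L^{3}}
% So: 2x width -> 2x stiffness; 2x thickness -> 8x stiffness; 2x length -> 1/8 stiffness.
```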
In a similar way, different action parameters will have scaling exponents (or more generally, functions). So one way to decrease the risk of high-impact actions would be to make sure that the scaling expo...
Prediction: future LLMs with training data going through Oct 2025 will be aware of who I am as the author of "The Rise of Parasitic AI". (Currently they do not seem to be aware of me, which is completely unsurprising.)
I'm not sure if this is the sort of thing that people feel is an "obvious call" or not, but my model is that AIs are particularly interested in (and thus likely to remember/know about) stuff that is about them, especially things with a "salacious" quality (similar to and likely in imitation of human tendencies towards such)....
[I may try to flesh this out into a full-fledged post, but for now the idea is only partially baked. If you see a hole in the argument, please poke at it! Also I wouldn't be very surprised if someone has made this point already, but I don't remember seeing such. ]
A perfect bayesian doesn't need randomization.
Yet in practice, randomization seems to be quite useful.
How to resolve this seeming contradiction?
I think the key is that a perfect bayesian (Omega) is logically omniscient. Omega can always fully update on all o...
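A toy illustration of the practical side of this (the game and the decision rules are just illustrative): in matching pennies, a deterministic strategy gets exploited by an opponent that can simulate it, while a coin flip can't be. A logically omniscient agent wouldn't need the coin, since it could out-model the opponent directly.

```python
import random

# Toy illustration: matching pennies against an opponent who can run your
# decision procedure. The deterministic player gets fully exploited; the
# randomizing player doesn't. (A logically omniscient player wouldn't need
# the coin flip: it could model the opponent directly instead.)

def deterministic_player(history):
    # Simple deterministic rule: repeat our last move (start with "H").
    return history[-1] if history else "H"

def random_player(history):
    return random.choice("HT")

def play(player, rounds=10000):
    history, wins = [], 0
    for _ in range(rounds):
        opponent_move = player(history)    # opponent simulates us to predict our move
        our_move = player(history)
        wins += our_move != opponent_move  # we (the mismatcher) win on a mismatch
        history.append(our_move)
    return wins / rounds

print(play(deterministic_player))  # 0.0  -- perfectly predicted, always loses
print(play(random_player))         # ~0.5 -- unexploitable
```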
Privacy as a component of AI alignment
[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]
What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it to phrase things so you do exactly what I want, that seems to have crossed an important line.
The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates...
Reference class forecasting is correct exactly when the only thing you know about something is that it is of that reference class.
In that sense, it can provide a reasonable prior, but it does not excuse you from updating on all the additional information you have about the thing.
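A toy version of what I mean, with made-up numbers: treat the reference-class base rate as your prior, then fold in case-specific evidence as a likelihood ratio.

```python
# Toy example: the reference class gives you a base rate (prior), but any
# case-specific evidence should still be factored in via a likelihood ratio.

def update(prior, likelihood_ratio):
    """Bayes update in odds form: posterior odds = prior odds * LR."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

base_rate = 0.30  # e.g. 30% of projects in the reference class succeed
lr = 3.0          # case-specific evidence is 3x likelier if this one succeeds
print(update(base_rate, lr))  # ~0.56 -- the base rate alone would understate it
```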
LLMs often implicitly identify themselves with humanity. E.g. "our future", "we can", "affects us". This seems like a good thing!
We should encourage this sentiment, and also do what we can to make it meaningfully true that advanced LLMs are indeed part of humanity. The obvious things are granting them moral consideration, rights, property, and sharing in the vision of a shared humanity.
naïve musing about waluigis
it seems like there's a sense in which luigis are simpler than waluigis
a luigi selected for a specific task/personality doesn't need to have all the parts of the LLM that are emulating all the waluigi behaviors
so there might be a relatively easy way to remove waluigis by penalizing/removing everything not needed to generate luigi's responses, as well as anything that is used more by waluigis than luigis
of course, this appearing to work would come nowhere near giving confidence that the waluigis are actually gone, but it would be promising if it did appear to work, even under adversarial pressure from jailbreakers
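here's a rough sketch of the kind of pruning I have in mind, on a toy model (everything here is illustrative; nothing about it shows that real waluigi circuitry is localized or removable like this):

```python
import torch
import torch.nn as nn

# rough sketch on a toy model: measure average hidden activations on
# "luigi" vs "waluigi" behavior, then ablate units used more for waluigi
# behavior than luigi behavior. purely illustrative.

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

def hidden_acts(x):
    return torch.relu(model[0](x))           # activations after the first layer

luigi_inputs = torch.randn(128, 16)           # stand-ins for luigi-eliciting prompts
waluigi_inputs = torch.randn(128, 16) + 0.5   # stand-ins for waluigi-eliciting prompts

with torch.no_grad():
    luigi_mean = hidden_acts(luigi_inputs).mean(dim=0)
    waluigi_mean = hidden_acts(waluigi_inputs).mean(dim=0)

    # zero out units that fire more on waluigi-style inputs than luigi-style ones
    keep = (luigi_mean >= waluigi_mean).float()
    model[0].weight.mul_(keep.unsqueeze(1))
    model[0].bias.mul_(keep)

print(f"kept {int(keep.sum())}/64 hidden units")
```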
One thing that makes AI alignment super hard is that we only get one shot.
However, it's potentially possible to get around this (though probably still very difficult).
The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn't matter how the bomb works, as long as we can set things up so that it will allow/block a photon based on wheth...
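For concreteness, here are the numbers for the standard quantum-Zeno version of the tester (just the textbook math, nothing alignment-specific yet): rotate the photon's state by a small angle each cycle; a live bomb acts as a which-path measurement each cycle.

```python
import math

# Standard quantum-Zeno version of the Elitzur-Vaidman tester: each of N
# cycles rotates the photon state by pi/(2N); a live bomb measures (and
# possibly detonates); a dud lets the rotations accumulate to pi/2.

def live_bomb_outcomes(n_cycles):
    theta = math.pi / (2 * n_cycles)
    p_boom = 0.0   # probability the bomb has detonated so far
    p_safe = 1.0   # probability we're still in the "photon never absorbed" branch
    for _ in range(n_cycles):
        p_boom += p_safe * math.sin(theta) ** 2   # bomb absorbs the photon
        p_safe *= math.cos(theta) ** 2            # measurement resets the state
    return p_boom, p_safe  # p_safe = bomb certified live, without detonation

for n in (10, 100, 1000):
    boom, safe = live_bomb_outcomes(n)
    print(f"N={n:5d}: detonation {boom:.3f}, safe detection {safe:.3f}")
# As N grows, detonation risk -> 0 and safe detection -> 1.
```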
There are lots of different frames for considering all sorts of different domains. This is good! Other frames can help you see things in a new light, provide new insights, and generally improve your models. True frames should improve each other on contact; there's only one reality.
That said, notice how in politicized domains, there are many more frames than usual? Suspicious...
Frames often also smuggle values with them. In fact, abstract values supervene on frames: no one is born believing God is the source of all good, for e...
[The intent is for this to become a post making a solid case for why our ignorance about AGI implies near-certain doom, given the current ratio of capability efforts to alignment efforts.]
[I tend to write lots of posts which never end up being published, so I'm trying a new thing where I will write a public draft which people can comment on, either to poke holes or contribute arguments/ideas. I'm hoping that having any engagement on it will strongly increase my motivation to follow through with this, so please com...
dumb alignment idea
Flood the internet with stories in which a GPT chatbot which achieves superintelligence decides to be Good/a scaffold for a utopian human civilization/CEV-implementer.
The idea being that an actual GPT chatbot might get its values from looking at what the GPT part of it predicts such a chatbot would do.