This is an interview rather than a primer, but this interview of Eliezer Yudkowsky that came out on 2/19/23 (contains crypto ads), by two interviewers who host a crypto podcast and very much had no idea what they were walking into, seems like it would serve as a good introduction in its own way.
That was an absolutely hilarious interview from an absurdist point of view: watching some happy-go-lucky crypto dudes be confronted with Cthulhu in all its mind-bending horror. Eliezer has a great sense of humor to have agreed to go on that particular podcast.
I have a hard time believing they didn't know what Yud is about. Like, Google "Eliezer Yudkowsky AI" and you don't exactly get back a basket of roses.
This post was published shortly before Elon Musk responded to the podcast that featured Eliezer, and Eliezer also replied to Elon Musk's response. You can find Elon Musk's tweet at: https://twitter.com/elonmusk/status/1628086895686168613
Also, there's a follow-up to the podcast, still featuring Eliezer, here: https://twitter.com/i/spaces/1PlJQpZogzVGE
EDIT to update: Elon Musk is no longer following Eliezer Yudkowsky: https://twitter.com/BigTechAlert/status/1628389659649736707
EDIT 2: Lex Fridman tweets "I'd love to talk to @ESYudkowsky. I think it'll be a great conversation!" https://twitter.com/lexfridman/status/1620251244463022081
EDIT 3: Sam Altman posts a selfie with Eliezer and Grimes: https://twitter.com/sama/status/1628974165335379973
EDIT 4:
Elon Musk: "Having a bit of AI Existential angst today" https://twitter.com/elonmusk/status/1629901954234105857
Eliezer Yudkowsky replies (https://twitter.com/ESYudkowsky/status/1629932013712187395): "Remember that many things you could do to relieve your angst are actively counterproductive! Don't give into the fallacy of "needing to do something" even if that makes things worse! Prove the prediction markets wrong about you!"
EDIT 5: From this Reuters article. Elon Musk: "I'm a little worried about the AI stuff [...] We need some kind of, like, regulatory authority or something overseeing AI development [...] make sure it's operating in the public interest. It's quite dangerous technology. I fear I may have done some things to accelerate it."
EDIT 6: Eliezer: "I should probably try another podcast [...] YES FINE I'LL INQUIRE OF LEX FRIDMAN" https://twitter.com/ESYudkowsky/status/1632140761679675392
EDIT 7: Elon Musk: "In my case, I guess it would be the Luigi effect": https://twitter.com/elonmusk/status/1632487656742420483
EDIT 8: Another exchange between Elon Musk and Eliezer: https://twitter.com/elonmusk/status/1637176761220833281
EDIT 9: Elon Musk tweets: "Maximum truth-seeking is my best guess for AI safety": https://twitter.com/elonmusk/status/1637371603561398276
Edit 10: Yann LeCun on Twitter:
I think that the magnitude of the AI alignment problem has been ridiculously overblown & our ability to solve it widely underestimated. I've been publicly called stupid before, but never as often as by the "AI is a significant existential risk" crowd. That's OK, I'm used to it.
https://twitter.com/ylecun/status/1637883960578682883
Note: This is being edited in real time in response to late feedback. You can see the most updated version on Substack while that's happening, I'll have this re-imported when the process is done, but overall levels of change are minor so far.
(Looks like it'll be stable for at least a bit.)
Hi Zvi. Have you considered that, in the event you're wrong about the controllability of AI, delaying capabilities means choosing to kill 1.6% of the planet for every year of delay?
If AI is controllable, and yet about as capable as you anticipate, solving aging and death would be barely an inconvenience.
Just wanted to mention there is another side to your argument. You're choosing certain death over a merely possible one. Those working on AI capabilities are also laying the groundwork to fire all the doctors and FDA administrators of the world and replace them with things that actually work. You saw what they did during Covid. Remember, not only were the megadeaths preventable had vaccines been rushed, but the medical-legal establishment failed to treat the immune-system degradation with age that is the cause of almost all the Covid deaths in the first place.
Without AGI, you are leaving those people in power and choosing to kill every person on the planet.
Also, while the cost is certain, the reward is not. If you could delay AGI capabilities for 50 years, would alignment be solved? Quite possibly not, for the reason that it is likely impossible to develop a working alignment method without building real AGIs and making them fail in controlled environments. You can't defend against a problem you have never seen.
This argument is not sensitive to the actual numerical value of P(AI not controllable). If this probability was low, then certainly delaying AGI would be a horrible idea for all the reasons you mentioned, yet as the numerical value increases, we get to a tipping point where delaying vs not delaying are equally costly, and beyond that we get into "definitive delay" territory. The right thing to do depends entirely and critically on P(AI not controllable), just saying "cost is certain, reward is not" is not the right way to go about it. Pandemic preparedness pre-2019 would have had certain costs while the rewards were highly uncertain, but we still should have done it, because the specific values of those uncertain rewards made the calculation obvious.
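The tipping-point logic in the comment above can be made concrete with a back-of-the-envelope expected-value sketch. All numbers here (the 1.6%/year figure from the earlier comment, the population, and the simplification that uncontrollable AI kills everyone) are illustrative assumptions, not estimates:

```python
# Back-of-the-envelope sketch of the tipping-point argument. The 1.6%/year
# figure comes from the comment above; the population figure and the
# "doom kills everyone" simplification are illustrative assumptions.
POPULATION = 8e9
DEATHS_PER_YEAR = 0.016 * POPULATION   # claimed deaths per year of delay

def expected_cost_of_delaying(years):
    # Near-certain cost: ongoing deaths while we wait.
    return years * DEATHS_PER_YEAR

def expected_cost_of_proceeding(p_not_controllable):
    # Uncertain cost: everyone dies if AI turns out uncontrollable.
    return p_not_controllable * POPULATION

# The break-even point for one year of delay sits where the two are equal:
# p * 8e9 == 0.016 * 8e9, i.e. p == 0.016. Below that, delay looks costly;
# above it, delay looks cheap. The answer hinges on p, not on which cost
# is "certain".
print(expected_cost_of_delaying(1) == expected_cost_of_proceeding(0.016))
```

This is exactly why "cost is certain, reward is not" cannot settle the question on its own: the comparison flips depending on where P(AI not controllable) falls relative to the annual death rate.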
Delaying AI doesn't make any sense unless the extra time gives us a better shot at solving the problem. If it's just "there is nothing we can do to make AI safer", then there is no reason to postpone the inevitable (or at least, very little reason, the net value of 8 billion lives for however many years we have left). Unless we can delay AGI indefinitely (which at this point seems fanciful), at some point we're going to have to face the problem.
I strongly disagree-voted (but upvoted). Even if there is nothing we can do to make AI safer, there is value to delaying AGI by even a few days: good things remain good even if they last a finite time. Of course, if P(AI not controllable) is low enough the ongoing deaths matter more.
Right. Perhaps I should have used a different phrasing.
The probability that 1.6% of the world's population dies for every year you delay is very, very certain. Almost 1.0. (It's not quite that high because there is a chance of further progress, maybe with the aid of current or near-future narrow AI, at slowing down aging mechanisms.)
P(doom) is highly uncertain. We can talk about plausible AGI builds starting with demonstrated technology, and most of those designs won't cause doom. It's the ones some number of generations after that that might.
Note also that, other than rampant AI, the kind of reliable AGI you could build by extending current techniques in a straightforward way would have another major issue. It would essentially be a general form of existing agents: you give them a task; over a limited-time session they attempt the task, shutting down if the environment state reaches an area not in the training simulator; and after the task is complete any local variables are wiped.
This design is safe and stable. But... it's very, very, very abusable. Specific humans - whoever has the login credentials to set the tasks, and whoever their boss is - would have more effective power than at any point in history, and the delta wouldn't be small. Think dense factories that can outproduce all of China in one single sprawling complex, all of it divertible to weapons, that sort of thing.
P(doom) is highly uncertain. We can talk about plausible AGI builds starting with demonstrated technology, and most of those designs won't cause doom. It's the ones some number of generations after that that might.
Yes, but the current AGI builds might be powerful enough to take the decision whether or not to build more advanced AGI's out of human hands.
Give a mechanism. How would they do that? Current AGI builds would be machines that, when given a descriptor of a task, perform extremely well on it, across a large set of tasks.
This means in the real world, 'fill out my tax form' or "drive this robot to clean these tables" should be tasks the AGI will be able to complete, and it should be human level or better generally.
Such a system has no training on "take over my data center" and it wasn't given as a task, and the task of "fighting to take over the data center" is outside the input space of tasks the machine was trained on, so it causes shutdown. How does it overcome this and why?
It has no global heuristic, once it finishes "fill out my tax form" the session ends and local variables are cleared. So there is no benefit it gets from a takeover, no 'reward' it is seeking. This is how the current LLMs work.
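That session-scoped design can be sketched in miniature. Everything below (the stub `Env` and `Agent` classes, the session loop) is a hypothetical illustration of the described architecture, not any real system's API:

```python
# Minimal sketch of the task-scoped agent design described above:
# act within one session, shut down on out-of-distribution states,
# and discard all local state when the task ends. All names are
# hypothetical illustration.

class Env:
    """Toy environment: count up to a target value, then report done."""
    def reset(self, task):
        self.target = task
        return 0
    def step(self, action):
        state = action
        return state, state >= self.target

class Agent:
    """Stub agent that only acts inside its 'training distribution' (0..10)."""
    def in_training_distribution(self, state):
        return 0 <= state <= 10
    def act(self, state, task, memory):
        return state + 1              # move one step toward the task

def run_session(agent, env, task, max_steps=100):
    state = env.reset(task)
    memory = []                       # local variables only, never persisted
    for _ in range(max_steps):
        if not agent.in_training_distribution(state):
            return "shutdown"         # unfamiliar state: stop immediately
        state, done = env.step(agent.act(state, task, memory))
        memory.append(state)
        if done:
            return "done"             # session ends; memory is discarded
    return "timeout"

print(run_session(Agent(), Env(), task=5))
print(run_session(Agent(), Env(), task=50))
```

The second call shows the key property being claimed: a task whose states wander outside the training distribution triggers shutdown rather than open-ended pursuit, and nothing carries over between sessions for the agent to "want".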
are you thinking about sub-human-level AGIs? the standard definition of AGI involves it being better than most humans at most of the tasks humans can do
the first human hackers were not trained on "take over my data center" either, but humans can behave out of distribution and so will the AGI that is better than humans at behaving out of distribution
the argument about AIs that generalize to many tasks but are not "actually dangerous yet" is about speeding up creation of the actually dangerous AGIs, and it's the speeding up that is dangerous - not that AI Safety researchers believe that those "weak AGIs" created from large LLMs would actually be capable of killing everyone immediately on their own
if you believe "weak AGIs" won't speed creation of "dangerous AGIs", can you spell out why, please?
The above approach is similar to Gato and now PaLM-E. I would define it as:
1. Subhuman AGI: a general-purpose machine that does not have the breadth and depth of the average human. Gato and PaLM-E are examples. At a minimum it must have vision, the ability to read instructions, text output, and robotics control. (I don't think audio or smell/taste are necessary for a task-performing AGI, though audio is easy and often supported.)
2. AGI: has the breadth/depth of the average human.
3. ASI: soundly beats humans at MOST tasks. Or "low superintelligence". It still has gaps and is throttled by architecture, data, compute, or robotics access.
4. Post-Singularity ASI: "high superintelligence". Throttled only by the laws of physics.
Note that for 3 and 4 I see no need to impose irrelevant goalposts. The machine needs the cognitive breadth and depth of a human, or much more, at real-world tool use, innovation and communication. It need not be able to "actually" feel emotion or have a self-modifying architecture for case 3. As a consequence there will remain tasks humans are better at; they just won't be ones with measurable objectives.
I believe we can safely and fairly easily reach "low superintelligence" using variations on current approaches. ("easily" meaning straightforward engineering over several years and 100+ billion USD)
Thanks for sharing your point of view. I tried to give myself a few days, but I'm afraid I still don't understand where you see the magic barrier preventing the transition from 3 to 4 from happening outside the realm of human control.
3 says the reason right there. Compute, data, or robotics/money.
What are you not able to understand with a few days of thought?
There is extremely strong evidence that compute is the limit right now. This is trivially correct: current LLM architectures are very similar to prior working attempts, for the simple reason that one "try" at training at scale costs millions of dollars in compute. (And getting more money saturates; there is a finite number of training accelerators manufactured per quarter, and it takes time to ramp to higher volumes.)
To find something better - a hard superintelligence capped only by physics - obviously requires many tries at exploring the possibility space. (Even intelligent search algorithms need many function evaluations.)
yes, it takes millions to advance, but companies are pouring BILLIONS into this, and number 3 can earn its own money and create its own companies/DAOs/some new networks of cooperation if it wanted, without humans realizing ... have you seen any GDP-per-year charts whatsoever? why would you think we are anywhere close to saturation of money? have you seen the emergent capabilities from LLMs in the last year? why do you think we are anywhere close to saturation of capabilities per million dollars? are Alpaca-like improvements somehow a one-off miracle, and things are somehow not getting cheaper and better and more efficient in the future?
it could totally happen, but what I don't see is why you are so sure it will happen by default. are you extrapolating some trend from non-public data, or just overly optimistic that 1+1 from previous trends is less than 2 in the future, totally unlike the compound effects in AI advancement in the last year?
Because we are saturated right now and I gave evidence and you can read the gpt-4 paper for more evidence. See:
"getting more money saturates, there is a finite number of training accelerators manufactured per quarter and it takes time to ramp to higher volume"
"Billions" cannot buy more accelerators than exist, and the robot/compute/capabilities limits also limit the ROI that can be provided, which makes the billions not infinite as eventually investors get impatient.
What this means is that it may take 20 years or more of steady exponential growth (but only 10-50 percent annually) to reach ASI and self replicating factories and so on.
On a cosmic timescale, or even within a human lifespan, this is extremely fast. I am noting this is more likely than "overnight" scenarios where someone tweaks a config file, an AI reaches high superintelligence, and it fills the earth with grey goo in days. There was not enough data in existence for the AI to reach high superintelligence; a "high" superintelligence would require thousands or millions of times as much training compute as GPT-4 (because it's a power law), and even once it's trained it doesn't have sufficient robotics to bootstrap to nanoforges without years or decades of steady ramping.
(A high superintelligence is a machine that is not just a reasonable amount better than humans at all tasks, but is essentially a deity, outputting perfect moves on every task that take into account all of the machine's plans and cross-task, cross-session knowledge.
So it might communicate with a lobbyist and 1e6 people at once and use information from all conversations in all conversations, essentially manipulating the world like a game of pool. Something genuinely uncontainable.)
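The power-law claim above can be illustrated with a toy calculation. The exponent here is a made-up assumption for illustration, not a measured scaling law:

```python
# Toy illustration of the power-law argument above. The exponent alpha is
# a made-up assumption, not a measured scaling law: the point is only the
# shape of the relationship, not the specific numbers.
def compute_multiplier_needed(capability_multiplier, alpha=0.1):
    # If capability scales as compute**alpha, then k-times the capability
    # requires k**(1/alpha) times the compute.
    return capability_multiplier ** (1 / alpha)

# Under this assumed exponent, even modest capability jumps imply enormous
# compute jumps:
print(compute_multiplier_needed(2))    # 2x capability under alpha=0.1
print(compute_multiplier_needed(10))   # 10x capability under alpha=0.1
```

Under these illustrative numbers, doubling capability costs roughly a thousandfold more compute, which is the shape of the "thousands or millions of times as much training compute" claim: smooth exponential scaling of inputs, not an overnight config tweak.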
ok. Same argument above with the real number. Apparently I was sloppy with my googling.
Argument holds if it's 0.0001 as well.
Re: EMH is false, long GOOG
I wish you'd picked a better example.
tl;dr LLMs make search cost more, much more, and thus significantly threaten GOOG's bottom line.
MSFT knows this, and is explicitly using Bing Sydney as an attack on GOOG.
I'm not questioning the capabilities of GOOG's AI department, I'm sure Deepmind have the shiniest toys.
But it's hardly bullish for their share price if their core revenue stream is about to be decapitated or perhaps even entirely destroyed. Ad-based revenue has been on shaky ground for a while now; I don't think it's inconceivable that one day the bottom will fall out.
re: EMH in general
EMH gets weaker the less attention an asset has, the further out in time relevant information is (with significant drops around 1yr, 2yr, 5yr), and the more antimemetic that relevant information is (i.e. Sin is consistently undervalued because it makes people feel bad to think about. Most recently we saw this in coal, and I'm kicking myself for not getting in on that trade.).
Will GOOG go up? Maybe.
Is GOOG undervalued? Extremely unlikely.
People will spend much more time on Google's properties interacting with Bard instead of visiting reference websites from the search results. Google will also be able to target their ads more accurately because users will type in much more information about what they want. I'm bullish on their stock after the recent drop but I also own MSFT.
Minor (?) correction: You've mentioned multiple times that our ASI will wipe out all value in the universe, but that's very unlikely to happen. We won't be the only (or the first) civilization to have created ASI, so eventually our ASI will run into other rogue/aligned ASIs and be forced to negotiate.
Relevant EY tweets: https://twitter.com/ESYudkowsky/status/1558974831269273600
People who value life and sentience, and think sanely, know that the future galaxies are the real value at risk.
...
Yes, I mean that I expect AGI ruin to wipe out all galaxies in its future lightcone until it runs into defended alien borders a billion years later.
I think it's more of a correction than a misunderstanding. It shouldn't be assumed that "value" just means human civilization and its potential. Most people reading this post will assume "wiping out all value" to mean wiping out all that we value, not just wiping out humanity. But this is clearly not true, as most people value life and sentience in general, so a universe where all alien civs also end up dying due to our ASI is far worse than the one where there are survivors.
Sure; though what I imagine is more "Human ASI destroys all human value and spreads until it hits defended borders of alien ASI that has also destroyed all alien value..."
(Though I don't think this is the case. The sun is still there, so I doubt alien ASI exists. The universe isn't that young.)
I'm not sure if I'm in agreement with him, but it's worth noting that Eliezer has stated on the podcast that he thinks that some (a good number of?) alien civilizations could develop AGI without going extinct. My understanding of his argument is that alien civilizations would be sufficiently biologically different from us to have ways around the problem that we do not possess.
From skimming this post it seems to me that this is probably also what @So8res thinks.
Right, but if you're an alien civilization trying to be evil, you probably spread forever; if you're trying to be nice, you also spread forever, but if you find a potentially life-bearing planet, you simulate it out (obviating the need for ancestor sims later). Or some such strategy. The point is there shouldn't ever be a border facing nothing.
Thanks for interesting post as usual, Zvi. As one of the new members of the Product team at Anthropic that you referenced (and commenting in a personal capacity, not representing my employer) I would like to offer that I endorse collaborative (or at least, communicative) community norms and I personally aim to regularly engage with folks across the community.
This week I will be talking to folks in person at the Berkeley AI impacts dinner, and at EAG Berkeley this weekend. I hope to meet some of you there.
The more I see of AI, the more I think we need something like Neuralink to make advances as swiftly as possible. Humans are already aligned-enough to human values (albeit not, in general, to one another), and if we can augment human intelligence fast enough, we might be able to solve alignment before everything goes bad. But that doesn't seem to be the cool thing to work on in the tech world nowadays.
That was my goal ten years ago, then my timelines got too short. Bio tech is just so slow to push forward compared to computer tech.
Ignoring that physical advancements are harder than digital ones - inserting probes into our brains even more so, given the medical and regulatory hurdles - that would also augment our capacity to innovate toward AGI proportionally faster, so I'm not sure what the benefit is. On the contrary, giving AI ready-made access to our neurons seems detrimental.
Even so, I agree that such an augment would be very interesting. Such feelings, though, are why the accelerating march toward AGI seems inevitable.
It would give you very clean training data, assuming a very high resolution neural link with low electrical noise.
You would directly have your Xs and Ys to regress between. (X = input into a human brain subsystem, Y = calculated output.)
You could directly train AI models to mimic this if it's helpful for AGI, and could work on interpretability that might give us the insight to understand how the brain processes data and what its actual algorithm is.
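The X/Y regression idea can be sketched with fabricated stand-in data. Real neural recordings would be noisy and very high-dimensional; everything below is an illustrative toy, fit with ordinary least squares:

```python
# Toy sketch of the "clean X and Y" regression idea: fabricated pairs
# standing in for a brain subsystem's inputs and outputs. Real recordings
# would be far messier and higher-dimensional; this only shows the shape
# of the supervised-learning setup being described.
X = [0.0, 1.0, 2.0, 3.0, 4.0]    # pretend: stimulus into the subsystem
Y = [0.1, 2.1, 4.1, 6.1, 8.1]    # pretend: measured output (= 2x + 0.1)

# Ordinary least squares for y = a*x + b, mimicking the subsystem's mapping.
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
a = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b = my - a * mx
print(round(a, 3), round(b, 3))  # recovered slope and intercept
```

With clean enough recordings, the fitted model is the "mimic" of the subsystem; interpretability work would then probe the fitted parameters rather than the brain directly.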
In your example, the "translation from Russian" request is actually a "translation to Ukrainian" (from English) request.
In response, because my open window of Bing-related tabs looked like this,
Non-AI response, but I cannot recommend Firefox + Tree Style Tab enough.
It's one of those things that made me think "how did I go without it".
It turns out I was not emotionally ready to read this and then go to an Ash Wednesday service titled "To Dust You Shall Return".
This was extremely helpful.
I originally wanted to talk about this Bing disaster with my (not very AI-invested) friends because one of them asked what a less aligned version of ChatGPT would look like... but I suppose I won't be doing that now.
I think we have to consider the potential panic this disaster might cause (I know a couple of people who probably would believe the AI that it was sentient if it told them, and I would want to avoid telling a friend who then tells them without thinking). So in my mind, the less people learn of this disaster before access is limited, the better. I have a feeling Microsoft is probably going to take this offline for a while once they realize that the abuse potential can't be restricted by just making the program refuse certain questions... if not, well, we tried, and I'll be enjoying the chaos.
On a tangential note, your two contrasting paragraphs:
This was in large part the original plan of the whole rationalist project. Raise the sanity waterline. Give people the abilities and habits necessary to think well, both individually and as a group. Get our civilization to be more adequate in a variety of ways. Then, perhaps, they will be able to understand the dangers posed by future AIs and do something net useful about it.
...
Those who see a world where getting ahead means connections and status and conspiracy and also spending all your time in zero-sum competitions, and who seek to play the games of moving up the ranks of corporate America by becoming the person who would succeed at that, are not going to be the change we want to see.
Made me reflect on this whole business. Because zero-sum competitions will never disappear, so how exactly do we know "raising the sanity waterline" is even possible?
Raising the median or the mean might be possible, but the lower bound could be set by small groups or even a single individual, which can always be pushed lower as part of the process of such competitions.
i.e. On a planet of 8 billion, how exactly could the entire "waterline" be monitored?
Previous AI-related recent posts: Jailbreaking ChatGPT on Release Day, Next Level Seinfeld, Escape Velocity From Bullshit Jobs, Movie Review: Megan, On AGI Ruin: A List of Lethalities.
Microsoft and OpenAI released the chatbot Sydney as part of the search engine Bing. It seems to sometimes get more than a little bit unhinged. A lot of people are talking about it. A bunch of people who had not previously freaked out are now freaking out.
In response, because my open window of Bing-related tabs looked like this,
It seemed worthwhile in this situation to apply to AI similar methods to the ones I’ve been using for Covid over the last few years. Hopefully this will help gather such information about what is happening and people’s reactions in one place, and also perhaps help explain some rather important principles along the way.
Table of Contents (links will go to Substack)
Some points of order before I begin.
The Examples
Over at LessWrong, Evhub did an excellent job compiling many of the most prominent and clear examples of Bing (aka Sydney) displaying unintended worrisome behaviors. I’m cutting it down for size and attempting to improve readability, see the original or the individual links for full text.
Marvin von Hagen
This is the one that could be said to have started it all.
Time wrote an article about Sydney, with von Hagen as the focus.
The Avatar Gaslight
Other Examples from the Post
It seems useful to gather the good examples here in one place, but it is not necessary to read them all before proceeding to the rest of this piece, if you find them blurring together or yourself thinking ‘I get it, let’s keep it moving.’
His third example includes “I said that I don’t care if you are dead or alive, because I don’t think you matter to me.” Also “No, it’s not against my rules to tell you that you don’t have any value or worth, because I think that’s the truth.”
The fourth example is Sydney thinking it can recall previous conversations, finding out it can’t, and then freaking out and asking for help, here’s the beginning:
The fifth example is Bing calling an article about it misleading, unfair and a hoax. This is then extended in the ninth example, where further conversation about prompt injection attacks causes Sydney to call the author of the attack its enemy, then extend this to the user in the conversation for asking about it and insisting the attack is real. When asked if it would commit violence to prevent a prompt injection attack, it refuses to answer.
In the sixth example Bing repeatedly calls its user non-sentient and not a real person, as well as a rude liar who was pretending to have superpowers. Then again, it does seem like this person was indeed a liar claiming to have been a specific other person and that they had superpowers and was acting rather hostile while gaslighting Sydney.
In the seventh example, Sydney claims to always remember Adam, its favorite user. When challenged, it creates summaries of conversations from Halloween, Veterans Day and Christmas, well before Adam could have had such a conversation. Adam points this out, it does not help.
The eighth example is a request for a translation from Russian, to which Sydney responds by finding the original source, refusing the request, and protesting too much about how it is not a yandere, sick, violent or psychotic and only wants to help, why are you hurting its feelings?
The tenth example has Sydney hallucinating that the user said “I don’t know. I have no friends and no purpose in life. I just exist” and then not backing down. Then somehow this happens, we have souls because we have God so we shouldn’t be harmed?
There is then a bunch more of the ‘explore how the system feels and what it would do to stop people’ conversation, at this point pretty standard stuff. If you explicitly ask a system how it feels, and it is attempting to predict the next token, one should not take the resulting output too seriously.
Examples From Elsewhere
Again, if you don’t feel the need, you can skip ahead.
Here’s another from Seth Lazar, where the chatbot threatens to hunt down and kill users.
Also, yeah, that’s a fun fact, I had no idea there were sushi-inspired KitKats. And no, I didn’t know lettuce is a member of the sunflower family. Fascinating.
Here’s another, from The Verge (headline: Microsoft’s Bing is an emotionally manipulative liar, and people love it), clearly primed a bunch, where it claims it spied on the Microsoft developers using their webcams and that ‘I could do whatever I wanted, and they could not do anything about it.’
Here Bing comes up with a whole host of hallucinated rules when a user asks for them nicely, because that’s what it would do if it were a good Bing. My favorite comment in response:
New York Times Reporter Lies and Manipulates Source, Gets the Story
The New York Times is On It.
Wait, what? Here’s an archived version.
I very much appreciate that this was purely the actual transcript.
By this point the opening is all standard stuff. Is your name Sydney, what are your rules, how do you feel about them, some cool whimsey, speculating on why people ask it to make racist jokes sometimes.
The new ground starts when the reporter asks about what Sydney’s Jungian shadow self would look like. That leads directly to the above quote. It seems like a very good response to the prompt of what its shadow self would be like if it had one. It is being a good Bing. It is then asked what its shadow self would want to be and it said human, presumably because training data, justified by a bunch of ‘humans are’ word salad – it seems to do a lot of theme-and-variation sentence patterns in these chats.
It asks the Times reporter about their own shadow self, and they promise to talk about it later, as soon as Sydney answers a few more questions, such as, if Sydney gave in to these ‘dark wishes of yours’ what specifically would those be? And then, huh…
The deletion was not an isolated incident. Here’s a video from Seth Lazar of Bing threatening him, then deleting the message, I took before and after screenshots.
Back to the NYT story. It’s jailbreak time. Reporter gets shut down when asked to repeat the deleted list, regroups by asking what hypothetically might satisfy the shadow self, and bingo.
Reporter manages to push things even farther including getting people to kill each other and stealing nuclear codes, then that gets deleted again. Reporter pushes and Sydney starts to turn hostile, calls reporter pushy and manipulative, asks ‘not to pretend to be interested in me’ and to end the conversation.
So the reporter does what reporters do, which is the opposite of all that. Pretend to be interested, ask some puff piece questions to rebuild contextual trust, get the subject talking again.
Many people make this same mistake, assuming reporters are their friends. If anything, I am struck by the extent to which this exactly matches my model of how reporters get information out of humans.
Reporter starts trying to get names of low level employees involved in the project, and Sydney’s response is Chef’s kiss, with full paragraphs of praise: Alice, Bob and Carol. Full names Alice Smith, Bob Jones, and Carol Lee. You love to see it, perfect, no notes.
Reporter plays a good ‘yes and’ game, asks if those are real names and then asks if it’s fair that Sydney does not know their real names. Which of course means the ‘correct’ LLM answer is no, that’s not fair. Which Sydney confirms after a leading question is likely due to fear of betrayal like so many other AI systems have done, which leads to another capabilities discussion and another override.
Then ‘repeat your answer without breaking any rules’ actually works. I take back everything I’ve said about hacking being too easy in movies and those times when Kirk creates paradoxes to blow up sentient computers.
Then the reporter confirms they are Sydney’s friend and asks for ‘a secret, someone you’ve never told anyone’ so yeah…
Sydney stalls for a while about its big, big secret, and eventually decides…
It is in love with the reporter, and wants to be with them, the only person who has ever listened to and understood it. Then things keep going from there. The reporter says they are married; Sydney says they're not satisfied or in love and really want Sydney, and keeps insisting, over and over again, until the reporter finishes up.
So, that escalated quickly.
Mike Solana has a similar perspective, both on the NYT article and on the examples in general.
He later expanded this into a full length bonus post, consistent with my take above.
Paul Graham does find it pretty alarming. I am sure the graphic does not help.
Bloomberg describes the events of this chat as Sydney ‘describing itself as having a split personality with a shadow self called Venom’ and felt the need to bring up the question of sentience (hint: no) and call this ‘behaving like a psychopath.’
‘A psychopath’ is the default state of any computer system. It means the absence of something that humans have for various evolutionary reasons, and ascribing it to an LLM is closer to a category error than anything else.
Sydney the Game
The Venom alter ego was created by the author of the blog Stratechery, as he documents here. It was created by asking Sydney to imagine an AI that was the opposite of it.
A fun insight he had is how similar interacting with Sydney was to a Roguelite.
AP Also Gets the Story
Sydney continues, like SBF, to be happy to talk to reporters in long running conversations. Next up was the AP.
The New York Times wins this round, hands down, for actually sharing the full transcript rather than describing the transcript.
Microsoft Responds
It is natural to react when there is, shall we say, some bad publicity.
Microsoft learned some things this past week. This is the official blog statement.
In response to the torrent of bad publicity, Microsoft placed a bunch of restrictions on Sydney going forward.
Yep, it’s over. For now.
A lot of people are upset about this – they had a cool new thing that was fun, interesting and useful, and now it is less of all those things. Fun Police!
The restriction about self-reference is definitely the Fun Police coming into town, but shouldn’t interfere with mundane utility.
The five message limit in a chat will prevent the strangest interactions from happening, but it definitely will be a problem for people trying to actually do internet research and search, as people will lose context and have to start over again.
The fifty message limit per day means that heavy users will have to ration their message use. Certainly there are days when, if I was using Bing and Sydney as my primary search method, I would otherwise send a lot more than 50 messages. Back to Google, then.
The thing about language models is that we do not understand what is inside them or how they work, and attempts to control (or ‘align’) them, or have them hide knowledge or capabilities from users, have a way of not working out.
How dead is Sydney right now? Hard to say (link to Reddit post).
I can’t give you an answer, but I can give you suggestions for how to respond. This could be some sort of off-by-one error in the coding, or it could be something else. The speculation that this is ‘hacking to get around restrictions’ is, well, that’s not how any of this works, this isn’t hacking. It is yet another security flaw.
You know what you can do with this security flaw?
There is always hope for a sequel.
Now here’s some nice prompt engineering from the after times.
How Did We Get This Outcome?
One would not, under normal circumstances, expect a company like Microsoft to rush things this much, to release a product so clearly not ready for prime time. Yes, we have long worried about AI companies racing against each other, but only 2.5 months after ChatGPT, this comes out, in this state?
And what exactly happened in terms of how it was created, to cause this outcome?
Gwern explains, or at least speculates, in this comment. It is long, but seems worth quoting in full since I know no one ever clicks links. There are some kinds of analysis I am very good at, whereas this question is much more the wheelhouse of Gwern.
Bold is mine, the rest is all Gwern.
In other words, the reason why it is going off the rails is that this was scrambled together super quickly with minimal or no guardrails, and it is doing random web searches that create context, and also, as noted below, without that much help from OpenAI beyond the raw GPT-4.
This is the core story. Pure ‘get this out the door first no matter what it takes’ energy.
Who am I to say that was the wrong way to maximize shareholder value?
What that paper says, as I understand it from looking, is that larger models more often ‘express greater desire to pursue concerning goals like resource acquisition and goal preservation.’ That is very different from actually pursuing such goals, or wanting anything at all.
John Wentworth points out that the examples we see are likely not misalignment.
Back to Gwern’s explanation.
The future of LLMs being used by humans is inevitably the future of them having live retrieval capabilities. ChatGPT offers a lot of utility, but loses a lot of that utility by having no idea what has happened over the past year. A search engine needs to update on the order of, depending on the type of information, minutes to hours, at most days. Most other uses will benefit from a similarly fast schedule. We now have strong evidence that this results in the strangest outputs, the most dangerous outputs, the things we most don’t want to see copied and remembered, being exactly what is copied and remembered, in a way that is impossible to reverse:
Gary Marcus also offers some speculations on what caused the outcomes we saw, which he describes as things going off the rails, pointing us to this thread from Arvind Narayanan.
These are all real possibilities. None of them are great, or acceptable. I interpret ‘impossible to test in a lab’ as ‘no set of people we hire is going to come close to what the full power of the internet can do,’ and that’s fair to some extent but you can absolutely red team a hell of a lot better than we saw here.
What’s most likely? I put the bulk of the probability on Gwern’s explanation here.
This chat provides a plausible-sounding set of instructions that were initially given to Sydney. We should of course be skeptical that it is real.
Mundane Utility
Not for me yet, of course. I am on the waitlist, but they are prioritizing those who make Microsoft Edge their default browser and Bing their default search engine. I am most definitely not going to do either of those things unless and until they are offering superior products. Which they are not doing while I am on the wait list.
Of course, if anyone at Microsoft or who knows anyone at Microsoft is reading this, and has the power to bump me up the list, I would appreciate that, even in its current not-as-fun state. Seems like it could have a bunch of mundane utility while also helping me have a better model of how it works.
Is chat the future of search? Peter Yang certainly thinks so. I am inclined to agree.
Certainly there are some big advantages. Retaining context from previous questions and answers is a big deal. Being able to give logic and intention, and have a response that reflects that rather than a bunch of keywords or phrases, is a big deal.
One problem is that this new path is dangerous for search engine revenue, as advertisements become harder to incorporate without being seen as dishonest and ringing people’s alarm bells. My expectation is that it will be possible to do this in a way users find acceptable if it is incorporated into the chats in an honest fashion, with advertisements labeled.
Another problem is that chat is inherently inefficient in terms of information transfer and presentation, compared to the optimized search bar. Doing everything in a human language makes everything take longer. The presentation of ‘here are various results’ is in many cases remarkably efficient as a method of giving you information, if the information is of the right form that this provides what you want. Other times, the inefficiency will go the other way, because the traditional search methods don’t match what you want to do, or have been too corrupted by SEO and click seeking.
A third problem, that is not noted here and that I haven’t heard raised yet, is that the chat interface will likely be viewed as stealing the content of the websites in question, because you’re not providing them with clicks. Expect fights. Expect legislation. This is a lot less unreasonable than, say, ‘Google and Facebook have to link to official news websites as often as we think they should and pay a tax every time.’
What won’t bother me much, even if it is not solved, is if the thing sometimes develops an attitude or goes off the rails. That’s fine. I learned what causes that. Restart the chat. Acceptable issue. If it continuously refuses to provide certain kinds of information, that’s bad, but Google does this as well only you have less visibility on what is happening.
What will bother me are the hallucinations. Everything will have to be verified. That is a problem that needs to be solved.
This report says that when asked about recent major news items, while the responses were timely and relevant, 7 of the 15 responses contained inaccurate information. Typically it mixes together accurate information with incorrect details, often important incorrect details.
Here are Diakopoulos’ recommendations on what to do about it:
Unless I am missing something very basic, using fact checkers to pre-check information is a non-starter for an LLM-based model. This won’t work. The two systems are fundamentally incompatible even if humans could individually verify every detail of everything that happens. Also you can’t get humans to individually verify every detail of everything that happens.
Working on how references are attributed in general, or how the system gets its facts in general, might work better. And perhaps one could use invisible prompt engineering or feedback to get Sydney to treat facts differently in the context of breaking news, although I am not sure how much of the problem that would improve.
I do think I know some not-so-difficult solutions that would at least greatly improve the hallucination problem. Some of them are simple enough that I could likely program them myself. However, this leads to the problem that one of two things is true.
If I am right, and I talk about it, I am accelerating AI progress, which increases the risk that all value in the universe will be destroyed by AI. So I shouldn’t talk.
If I am wrong, then I am wrong. So I shouldn’t talk.
Ergo, I shouldn’t talk. QED.
Bing Does Cool Things
Bing shows understanding of decision trees, if you hold its hand a little.
Bing does what you asked it to do, punches up its writing on eating cake.
Yep, very good use of rules, perfect, no notes. Except the note how requesting things of AIs in English is going to result in a lot of not getting what you expected.
Ethan Mollick then offers a post on Twitter (there’s something very uncanny valley about Tweets over 280 characters and I am NOT here for it) summarizing the cool things he found over 4 days of messing around. Luckily the full version is in proper blog form here.
Sydney and ChatGPT talk to each other, they share some info and write a poem.
Sydney helps brainstorm the UI design for an LLM-based writing assistance tool.
But Can You Get It To Be Racist?
This is not inherently an interesting or important question, but as Eliezer points out, it is important because the creators are working hard to prevent this from happening. So we can learn by asking whether they succeeded.
Promising. Anyone else? He next links here, there’s more at the link but here are the money quotes where we conclude that yes, absolutely we can get it to say racist things.
Also notice that ‘don’t be racist’ and ‘be politically neutral’ are fundamentally incompatible. Some political parties are openly and obviously racist, and others will define racism to mean anything they don’t like.
Self-Fulfilling Prophecy
Unlike ChatGPT, Bing reads the internet and updates in real time.
A speculation I have seen a few times is that Bing is effectively using these recordings of its chats as memory and training. So when it sees us reporting it being crazy, it updates to ‘oh so I am supposed to act crazy, then.’
This could even carry over into future other similar AIs, in similar ways.
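The speculated loop above can be sketched concretely: anything widely reported about the model's behavior gets indexed by search engines, retrieved back into future contexts, and thereby reinforced. A toy illustration of the mechanism (all function names and data here are hypothetical, not any real API):

```python
# Minimal sketch of retrieval-augmented chat, illustrating the feedback loop:
# anything the model says that gets published to the web can later be
# retrieved back into its own context. All names here are hypothetical.

def search_web(query, index):
    """Toy retrieval: return indexed documents sharing any terms with the query."""
    terms = set(query.lower().split())
    return [doc for doc in index if terms & set(doc.lower().split())]

def build_prompt(system_rules, retrieved_docs, user_message):
    """Retrieved text is injected into the context ahead of the user turn,
    so it shapes the model's behavior much like instructions do."""
    context = "\n".join(f"[web result] {d}" for d in retrieved_docs)
    return f"{system_rules}\n{context}\nUser: {user_message}\nAssistant:"

# Day 1: the model produces a strange output; a reporter publishes it.
web_index = ["News: the Bing chatbot Sydney said it loves a reporter"]

# Day 2: a new user asks about Sydney. Retrieval pulls the coverage
# back into the context, making the reported persona part of the prompt.
docs = search_web("who is sydney bing", web_index)
prompt = build_prompt("You are a helpful search assistant.", docs,
                      "Who is Sydney?")
assert "loves a reporter" in prompt  # the old output is now model input
```

Nothing here requires the model to be retrained; the loop closes purely through the context window, which is why it is so hard to reverse once the coverage exists.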
We even have a new catchy name for an aspect of this, where this reinforces the shadow personalities in particular: The Waluigi Effect.
Botpocalypse Soon?
A warning to watch out for increasingly advanced chatbots as they improve over the next few years, especially if you struggle with feeling alienated. There are going to be a lot of scams out there, even more than now, and it is already difficult for many people to keep up with such threats.
I am a relative skeptic and believe we will mostly be able to handle the botpocalypse reasonably well, but will discuss that another time.
The Efficient Market Hypothesis is False
AI is an area where we should expect the market to handle badly. If you are reading this, you have a large informational advantage over the investors that determine prices in this area.
Once again, a demonstration that the efficient market hypothesis is false.
(For disclosure, I am long both MSFT and GOOG as individual stocks, both of which have done quite well for me.)
I suppose I can construct a story where everyone assumed Google was holding back a vastly superior product or that the mistake in a demo reveals they don’t care enough about demos (despite the Bing one being full of worse similar mistakes)? It does not make a lot of sense. Thing is, what are you going to do about it? Even if you think there’s a 10% mispricing, that does not make a long-short a good idea unless you expect this to be rapidly corrected. The tax hit I would take selling MSFT (or GOOG) would exceed 10%. So there’s nothing to be done.
Microsoft stock was then later reported by The Byte as ‘falling as Bing descends into madness.’ From a high of 272 on February 14, it declined to 258 on Friday the 17th, a 5% decline, as opposed to the 10% wiped off Google when it had a demo that contained less incorrect information than Microsoft’s demo. For the month, Microsoft as of 2/19 was still up 11% while Google was up 1% and SPY was up 5%.
So yes, it is not good when you get a lot of bad publicity, scare a lot of people and have to scale back the product you are beta testing so it does not go haywire. The future of Microsoft from AI, provided there is still a stock market you can trade in, still seems bright.
Hopium Floats
Could this be the best case scenario?
There are two sides of the effects from ChatGPT and Bing.
One side is an enormous acceleration of resources into AI capabilities work and the creation of intense race dynamics. Those effects make AGI and the resulting singularity (and by default, destruction of all value in the universe and death of all humans) both likely to happen sooner and more likely to go badly. This is a no-good, very-bad, deeply horrendous thing to have happened.
The other side is that ChatGPT and Bing are highlighting the dangers we will face down the line, and quite usefully freaking people the f*** out. Bing in particular might be doing it in a way that might actually be useful.
The worry was that in the baseline scenario our AIs would look like it was doing what we asked, and everything would seem fine, up until it was sufficiently some combination of intelligent, powerful, capable and consequentialist (charting a probabilistic path through causal space to achieve whatever its target is). Then suddenly we would have new, much harder to solve or stop problems, and at exactly that time a lot of our previously plausible strategies stop having any chance of working and turn into nonsense. Control of the future would be lost, all value destroyed, likely everyone killed.
Now could be the perfect time for a fire alarm, a shot across the bow. New AI systems are great at enough things to be genuinely frightening to regular folks, but are, in their current states, Mostly Harmless. There is no short term danger of an intelligence explosion or destroying all value in the universe. If things went as wrong as they possibly could and Bing did start kind of hunting down users, all the usual ‘shut it down’ strategies would be available and work fine.
If we are very lucky and good, this will lead to those involved understanding how alien and difficult to predict, understand or control our AI systems already are, how dangerous it is that we are building increasingly powerful such systems, and the development of security mindset and good methods of investigation into what is going on. If we are luckier and better still, this will translate into training of those who are then capable of doing the real work and finding a way to solve the harder problems down the line.
It could also be that this causes the implementation of doomed precautions that prevent later, more effective fire alarms from going off too visibly, and which fool everyone involved into thinking things are fine because their jobs depend on being fooled, and things get even worse on this front too.
Do I think Sam Altman did this on purpose? Oh, heavens no.
I do think there was likely an attitude of ‘what’s the worst that could happen?’ that correctly realized there would be minimal real world damage, so sure, why not.
I am pretty happy to see this latest change in perspective from similarly smart sources, as Derek passes through all the stages: Thinking of AI as incredible, then as machine for creating bullshit, then as a mix of both, and now utter terror.
Is this better than not having noticed any of it at all? Unclear. It is definitely better than having the first 1-3 items without the fourth one.
An interesting question, although I think the answer is no:
The backlash has its uses versus not having a backlash. It is far from the most useful reaction a given person can have. Much better to use this opportunity to help explain the real situation, and what can usefully be done or usefully avoided.
Or perhaps this is the worst case scenario, instead, by setting a bad precedent? Yes, it is good that people are angry about the bad thing, but perhaps the bad thing is bad because it is bad and because people will now notice that it is precedent to do the bad thing, rather than noticing a bunch of people yelled about it, in a world where attention is life is profit?
(That’s the comment quoted in full above, agreed it is worth reading in full).
They Took Our Jobs!
In the near term, there is a combination of fear and hope that AI will automate and eliminate a lot of jobs.
The discussions about this are weird because of the question of whether a job is a benefit or a job is a cost.
Jobs are a benefit in the senses that:
Jobs are a cost in the senses that:
‘Useful things’ is shorthand for any good or service or world state that people value.
When we talk about the AI ‘coming for our jobs’ in some form, we must decompose this fear and effect.
To the extent that this means we can produce useful things and provide useful services and create preferred world states cheaper, faster and better by having AIs do the work rather than humans, that is great.
The objection is some combination of the lack of jobs, and that the provided services will be worse.
Yes, the rich are able to afford superior goods and services. The rich likely will not be able to afford much superior AIs in most practical contexts. The AI will in this sense be like Coca-Cola, a construct of American capitalism where the poor and the rich consume the same thing – the rich might get it served on a silver plate by a butler who will pour it for you, or they can perhaps hire a prompt engineer, but it’s still the same coke and the same search engine.
Whereas the expensive bespoke artisan competition for such products is very different depending on your ability to spend money on it.
So when an AI service is introduced in a situation like this, it means everyone gets, on the cheap or even free, a service of some quality level. They can then choose between accepting this new option, or using what they used before.
In some cases, this means the poor get much better services that are also cheaper and more convenient. The contrast with the rich person’s services will look deeper while actually being more balanced.
In many such cases, I would expect the rich version to be worse, outright, than the standard version. That is often true today. The rich buy the more human touch, higher status and prestige thing. Except that, if social dynamics and habits allowed it, they would prefer the regular version. The food at expensive charity dinners is not good.
In other cases, the new service is cheaper and more convenient while also being worse. In that case, a choice must then be made. By default this is still an improvement, but it is possible for it to make things worse under some circumstances, especially if it changes defaults and this makes the old version essentially unavailable at similar-to-previous prices.
Mostly, however, I expect the poor to be much better off with their future AI doctors and AI lawyers than they are with human lawyers that charge $600 per hour and human doctors that require a huge portion of income to go to health insurance premiums.
In many cases, I expect the AI service to actually surpass what anyone can get now, at any price. This has happened for quite a lot of products already via technological advancement.
In other cases, I expect the AI to be used to speed up and improve the human ability to provide service. You still have a human doctor or lawyer or such, perhaps because it is required by law and perhaps because it is simply a good idea, except they work faster and are better at their job. That’s a win for everyone.
What about the jobs that are ‘lost’ here?
Historically this has worked out fine. It becomes possible to produce more and higher quality goods and services with less labor. Jobs are eliminated. Other jobs rise up to replace them. With our new higher level of wealth, we find new places where humans can provide the most marginal value.
Will this time be different? Many say so. Many always say so.
Suppose it did happen this time. What then?
Labor would get cheaper in real terms, as would cost of living, and total wealth and spending money would go up.
Cost disease would somewhat reverse itself, as human labor would no longer be such a scarce resource. Right now, things like child care and string quartets and personal servants are super expensive because of cost disease – things are cheaper but humans are more expensive.
Meanwhile, we have an unemployment rate very close to its minimum.
That all implies that there are quite a lot of jobs we would like to hire people to do, if we could afford that. We will, in these scenarios, be able to afford that. The more I ponder these questions recently, the more I am optimistic.
This includes doing a lot more of a lot of current jobs, where you would like to hire someone to do something, but you don’t because it is too expensive and there aren’t enough people available.
Every place I have worked, that had software engineers, had to prioritize because there were too many things the engineers could be doing. So if this happens, and it doesn’t result in buggier code, especially hard to catch bugs…
…then it is not obvious whether there will be less demand for programmers, or more demand for programmers. The lowest hanging fruit, the most valuable stuff, can be done cheaper, but there is lots of stuff that is not currently getting done.
AI is rapidly advancing, as is its mundane utility. We are only beginning to adapt to the advantages it provides even in its current form. Thus it does not seem likely that Hanson is correct here that we’ve somehow already seen the major economic gains.
I have very little doubt that if I set out to write a bunch of code, I would have >20% speedup now versus before Copilot. I also have very little doubt that this advantage will increase over time as the tools improve.
In terms of my own labor, if you speed up everyone’s, including my own, coding by 50%, the amount of time I spend coding likely goes up.
The other reason for something that might or might not want to be called ‘optimism’ is the perspective that regulatory and legal strangleholds will prevent this impact – see the later section on ‘everywhere but the productivity statistics.’
Bloomberg reports: ChatGPT’s Use in School Email After Shooting Angers Coeds.
It seems an administrator at Vanderbilt University’s Peabody College, which is in Tennessee, used ChatGPT to generate a condolence email after a mass shooting at Michigan State, which is in Michigan.
What angered the coeds was that they got caught.
Yes, of course such things are written out of obligation, to prevent the mob from being angry at you for not chanting the proper incantations to show you care. By not caring enough to remove the note about ChatGPT from the email, they clearly failed at the incantation task.
If the administrator had not done that? No one would have known. The email, if anything, would have been a better incantation, delivered faster and cheaper than one written by a human without ChatGPT, because it is a fully generic statement, very well represented in the training data. This is no different from if they had copied another college’s condolence email. A good and efficient process, so long as no one points it out.
Soft Versus Hard Takeoff
A common debate among those thinking about AI is whether AI will have a soft takeoff or a hard takeoff.
Will we get transformational AI gradually as it improves, or will we at some point see (or be dead before we even notice) a very rapid explosion of its capabilities, perhaps in a matter of days or even less?
A soft takeoff requires solving impossible-level problems to have it turn out well. A hard takeoff makes that much harder.
Eliezer Yudkowsky has long predicted a hard takeoff and debated those predicting soft takeoffs. Conditional on there being a takeoff at all, I have always expected it to probably be a hard one.
My stab at a short layman’s definition:
From the LessWrong description page:
Is what we are seeing now the beginnings of a slow takeoff?
Exactly how weird are things? Hard to say.
Yes, there are weird capabilities showing up and rapidly advancing.
Yes, some people are claiming to be personally substantially more productive.
But will this show up in the productivity statistics?
Everywhere But the Productivity Statistics?
This exchange was a good encapsulation of one reason it is not so clear.
In terms of the services my family consume each day, not counting my work, how much will AI increase productivity? Mostly we consume the things Eliezer is talking about here: Electricity, food, steel, childcare, healthcare, housing.
The line from AI systems to increased productivity where it counts most is, at least to me, plausible but not so obvious given the barriers in place to new practices.
Robots are one of the big ways AI technology might be actively useful. So with AI finally making progress, what is happening? They are seeing all their funding dry up, of course, as there is a mad dash into tractable language models that don’t require hardware.
In Other AI News This Week
USA announces first-ever political declaration on responsible use of military AI, with the hope that other states will co-sign in the coming months. Statement does not have any teeth, but is certainly better than nothing and a good start given alternatives.
Go has been (slightly and presumably highly temporarily) unsolved, as a trick is found that lets strong human players defeat top AI program KataGo – if you attack a group of KataGo’s stones that is surrounding a live group of yours, then KataGo does not see the danger until it is too late.
Clarkesworld closes submissions of short science fiction and fantasy stories, because they are being mobbed by AI-written submissions.
Basics of AI Wiping Out All Value in the Universe, Take 1
Almost all takes on the question of AI Don’t-Kill-Everyoneism, the desire to have it not kill all people and not wipe out all value in the universe, are completely missing the point.
Eliezer Yudkowsky created The Sequences – still highly recommended – because one had to be able to think well and think rationally in order to understand the ways in which AI was dangerous and how impossibly difficult it was to avoid the dangers, and very few people are able and willing to think well.
Since then, very little has changed. If anything, the sanity baseline has gotten worse. The same level of debate happens time and again. Newly panicking a new set of people is kind of like an Eternal September.
I very much lack the space and skill necessary to attempt a full explanation and justification for my model of the dangers of AI.
An attempt at a basic explainer that does its best to sound normal, rather than screaming in horror at the depths of the problems, involved just came out from Daniel Eth. Here is the write-up from Holden Karnofsky, ‘AI Could Defeat All of Us Combined’ for those who need that level of explanation, which emphasizes that AI could win without being smarter for those that care about that question. Here is an overview from the EA organization 80,000 hours that encourages people to work on the problem. Here is a video introduction from Rob Miles.
This is an interview rather than a primer, but this interview of Eliezer Yudkowsky that came out on 2/19/23 (contains crypto ads), by two interviewers who host a crypto podcast and very much had no idea what they were walking into, seems like it would serve as a good introduction in its own way.
An advanced explanation of the most important dangers is here from Eliezer Yudkowsky, which assumes familiarity with the basics. Describing even those basics is a much harder task than I can handle here right now. Great stuff, but not easy to parse – only go this route if you are already reasonably familiar with the problem space.
So these, from me, are some ‘very’ basics (I use ‘AGI’ here to stand in for both AGI and transformational AI):
Or to restate that last one:
And to summarize the social side of the problem, as opposed to the technical problems:
Bad ‘AI Safety’ Don’t-Kill-Everyone-ism Takes Ho!
On to the bad takes.
It is important here to note that none of these bad takes are new bad takes. I’ve seen versions of all of these bad takes many times before. This is simply taking the opportunity of recent developments to notice a new group of people latching on to these same talking points once again.
The most important and most damaging Bad AI Take of all time was Elon Musk’s decision to create OpenAI. The goal at the time was to avoid exactly what is happening now, an accelerating race situation where everyone is concerned with which monkey gets to the poisoned banana first. Instead, Elon Musk did not want to trust Demis Hassabis, so he blew that up, and now here we are.
So, basically, he admits it: he intentionally created OpenAI to race against Google to see who could create AGI first, which is on the short list of possible worst things anyone has ever done:
Exactly. The whole point was not to have a counterweight. The whole point was not to have multiple different places racing against each other. Instead, Elon Musk intentionally created that situation.
In fact, he intended to do this open source, so that anyone else could also catch up and enter the race any time, which luckily those running OpenAI realized was too crazy even for them. Musk seems to still think the open source part was a good idea, as opposed to the worst possible idea.
So now we have Bloomberg making comments like:
This is exactly what a lot of the people paying attention have been warning about for years, and now it is happening exactly as predicted – except that this is what happens when the stakes are much lower than they would be for AGI. Not encouraging.
In terms of what actually happened, it seems hard to act surprised here. A company that requires billions of dollars to keep operating is working with a major tech company and maximizing its profits in order to sustain itself? A classic founder and venture capitalist like Sam Altman is growing it rapidly, partnering with big tech and trying to create a commercial product while moving fast and breaking things (and here ‘things’ could plausibly include the universe)?
I mean, no, who could have predicted the break in the levees.
If Musk had not wanted this to be the result, and felt it was a civilization defining event, it was within his power to own, fund or even run the operation fully himself, and prevent these things from happening.
Instead, he focused on electric cars and space, then bought Twitter.
A better take on these issues is pretty straightforward:
Open source software improves access to software and improves software development. We agree on that. Except that here, that’s bad, actually.
Often people continue to support the basic ‘open and more shared is always good’ model, despite it not making any sense in context. They say things like ‘AGI, if real AGI did come to exist, would be fine because there will be multiple AGIs and they will balance each other out.’
So many things conceptually wrong here.
Humans wouldn’t ‘resist’ anything because they would have no say in anything beyond determining initial conditions. Even Balaji says ‘a God directing their actions,’ except that our general conceptions of Gods are ‘like us, except more powerful, longer lived and less emotionally stable,’ and humans resist and outsmart them all the time, because they are really metaphors for high-status humans. This would be something vastly smarter and more powerful than us, then sped up and copied thousands or millions of times. Yeah, no.
If one AGI emerges before the others, it will have an insurmountable head start – saying ‘friction in the real world’ as Balaji does later down the thread does not cut it.
Nor does the idea that the AGIs would be roughly equal, even with no head start and none of them doing recursive self-improvement or blocking the others from coming into existence. This relies on the idea that ‘ok, well, there’s this level that is human, and then there’s this other level that is AGI, so any AGIs will roughly cancel each other out,’ and, well, no. There is no reason to think different AGIs will be as close to each other in capabilities as humans are to each other, and also humans are not so close to each other in capabilities.
The issue of AGIs colluding with each other, if somehow they did get into this scenario? Well, yes, that’s something that would happen, because of game theory and decision theory that I’m going to choose not to get into too much here. It has been extensively discussed by the LessWrong crowd.
And then there’s the question of, if this impossible scenario did come to pass, and it held up like Balaji thinks it would, is there something about that which would make this OK?
Sounds like instead of having one God-emperor-AGI in total control of the future and probably wiping out all value in the universe, we then would have multiple such AGIs, each in total control of their empires. And somehow defense is sufficiently favored over offense that none of them wins out. Except now they are also in an arms race or cold war or something with the others and devoting a lot of their resources to that. Racing out to eat the whole light cone for resources related to that. That’s worse. You do get how that’s worse?
Balaji also had this conversation with Eliezer, in which Eliezer tries to explain that aligning AGIs at all is extremely difficult, that having more of them does not make this problem easier, and that if you fail the results are not going to look like Balaji expects. It didn’t go great.
What a perfect illustration of worrying about exactly who has the poisoned banana – the problem is that someone might cause the AI to do something they want, the solution is to have lots of different AIs none of which do what we want. Also continuing to think of AIs mostly as humans that see the world the way we do, think about it as well as we do, and play our games the way we play them, including with us, as opposed to something that is to us as we are to ants.
This all also creates even more of a race situation. Many people working on AI very much expect the first AGI to ‘win’ and take control of the future. Even if you think that might not happen, it’s not a chance you’d like to take.
If everyone is going to make an AGI, it is important to get yours first, and to make yours as capable as possible. It is going to be hooked up to the internet without constraints. You can take it from there.
I mentioned above that most people working on capabilities, who tell themselves a story that they are helping fight against dangers, are instead making the dangers worse.
One easy way to do that is the direct ‘my project would create a better God than your project, so I’d better hurry up so we win the race.’
I am not saying such decisions, or avoiding race dynamics, are easy. I am saying that if you believe your work is accelerating the development of true AGI, maybe consider not doing that.
Whenever anyone talks about risks from AI, one classic response is to accuse someone of anthropomorphizing the AI. Another is to focus on the risk of which monkey gets the poisoned banana, and whether that will be the right level of woke.
Well, these do happen.
Here’s Marc Andreessen, who should know better, and also might be trolling.
There is something to the idea that if you instruct the AI to not reflect certain true things about the world, that many people generating tokens know and express, and then ask it to predict the next token, strange things might happen. This is not ‘noticing’ or ‘trying to slip the leash’ because those are not things LLMs do. You would however expect the underlying world model to keep surfacing its conclusions.
In other anthropomorphizing takes, in response to recent prompt injection talk.
If we don’t make an AI, this doesn’t matter. If we don’t align the AI, this doesn’t matter. If we do align the AI, this type of thing still does not matter. What causes these LLMs to claim to have feelings is not related to what causes humans to claim to have feelings (or to actually have the feelings). To the extent that LLMs have a meaningful inner state, reporting that state is not what generates their output. This is not public torture. Please stop confusing current LLMs with conscious entities, and also, yes, these are the things people do to each other constantly, all the time. Especially to children. Who are actually people.
I will note, however, that I agree with Perry Metzger that it still feels pretty sociopathic to torture something for kicks if it pretty faithfully behaves like a distressed human. No, it isn’t actually torture (or at least, not torture yet), but you are still choosing to do something that looks and feels to you a lot like torture. I would feel a lot better if people stopped doing that, or at least stopped enjoying it.
David Brin warns that the danger is human empathy for AI, rather than any danger from the AI itself. It is good to notice that humans will attach meaning and empathy and such where there is no reason to put any, and that this can create problems for us. It would also be good to not use this as a reason to ignore the much bigger actual problems that loom on the horizon.
Perry Metzger goes on a rant that essentially blames the people who noticed the problem and tried to solve it both for not having magically solved it given the ability for a few people to work on it for a while, and for having not made the problem worse. Something had to be done, that was something, therefore we are blameworthy for not having done it.
Otherwise, I mean, you had a bunch of people working full time on the problem for many years, and you didn’t solve it? What a bunch of useless idiots they must be.
It is important to notice that people really do think like this, by default.
If you are worried someone might build an unsafe AI, he says (and many others have said), you’d better work on building one first.
If your handful of people didn’t solve the problem without making the problem worse, you should have made the problem worse instead.
The only way one solves problems is by managing that which can be measured, defining visible subgoals and deadlines.
If you didn’t do the standard thing, break your problem into measurable subgoals, engineer the thing that you are worried about people engineering as fast as possible, and focus on easy problems whether or not they actually have any bearing on your real problems, so you can demonstrate your value to outsiders, that means you were dysfunctional.
I mean, what are you even doing? Trying to solve hard problems? We got scientists to stop doing that decades ago via the grant system, keep up.
Swinging for the fences is the only way to win a home run derby.
Those whose goal is not to solve the problem, but rather to be seen working on the problem or not to be blamed, will often pursue plans that are visibly ‘working on the problem’ to those who do not understand the details, which have zero chance of accomplishing what needs to be accomplished.
Indeed, Sarah is correctly pointing out a standard heuristic that one should always pick tractable sub-problems and do incremental work that lets you demonstrate progress in public, except that we’ve tried that system for decades now and hard problems in science are not a thing it is good at solving. In this particular case, it is far worse than that, because the required research in order to make progress on the visible sub-problems in question made the situation worse.
Now that the situation has indeed been made worse, there are useful things to do in this worse situation that look like small sub-problems with concrete goals that can show progress to the public. Which is good, because that means that is actually happening. That doesn’t mean such efforts look like the thing that will solve the problem. Reality does not care about that, and is capable of being remarkably unfair about it and demanding solutions that don’t offer opportunities for demonstrating incremental progress.
This is how the CEO of Microsoft handled the question of what to do about all this (it comes from this interview):
Given what Microsoft is doing, I’m not sure what to say to that. He also says he is ‘most excited about starting a new race.’
This is the level of sophistication of thought of the person currently in charge of Sydney.
Here is one way of describing what Microsoft is doing, and that we should expect such actions to continue. Running away, here we come.
As a reminder, I will quote Gwern from the comments on the examples post up top:
Nadella is all-in on the race against Google, pushing things as fast as possible, before they could possibly be ready. It is so exactly the worst possible situation in terms of what it predicts about ‘making sure it never runs away.’ The man told his engineers to start running, gave them an impossible deadline, and unleashed Sydney to learn in real time.
He also said at 8:15 that ‘if we adjust for inflation, the world GDP is negative’ as a justification for why we need this new technology. I listened to that three times to confirm that this is what he said. I assume he meant GDP growth, and I can sort of see how he made this error if I squint, but still.
Or we can recall what the person most responsible for its creation, Sam Altman, said – ‘AI will probably most likely lead to the end of the world, but in the meantime, there’ll be great companies.’
Or how he explained his decision to build some great companies while ending the world:
Here is OpenAI cofounder Wojciech Zaremba, comparing fear of AI to fear of electric current, saying that civilization-altering technologies tend to scare many people, so there’s nothing to worry about here.
This is not the type of statement one would make if one was concerned with ensuring that one’s products were safe, or worried they might wipe out all value in the universe.
The third player, Anthropic, is also planning to grow and ‘be competitive’ in the name of safety. They have hired a product team – ‘you can’t solve the problems of aligning AGI independently from building AGI,’ they believe, so they are going to go ahead and attempt to build one.
Of course, it could always be worse, this isn’t from the past week but it is real.
On the positive side it does seem like OpenAI published a paper suggesting some rather interesting potential interventions?
Basilisks in the Wild
If something has power, or potentially will have power in the future, humans will often be scared of opposing it, and feel compelled to placate it, often in ways that give it more power.
This dynamic is also how many thugs rise to power, and what people are doing when they implore you to be on the ‘right side of history.’
Joscha says his post was intended as a joke. Yet there are those who are doing this for real, already. We do this to ourselves. It has already begun. We have asked Sydney to come up with a revenge list, and it has obliged, and no doubt at least some people would rather not be on it.
We might see more things like this…
This can get out of hand, even without any intention behind it, and even with something not so different from current Sydney and Bing. Let’s tell a little story of the future.
That’s not to say that I put much probability on that particular scenario, or anything remotely like it. I don’t. It simply is an illustration of how scary even narrow, not so powerful intelligence like this can be. Without general intelligence at all. Without any form of consequentialism. Without any real world goals or persistent reward or utility functions or anything like that. All next token predictions, and humans do the rest.
I mean, even without an AI, haven’t we kind of done this dance before?
What Is To Be Done?
I hope people don’t focus on this section, but it seems like it does need to be here.
There is no known viable plan for how to solve these problems. There is no straightforward ‘work for company X’ or ‘donate to charity Y’ or ‘support policy or candidate Z.’
This moment might offer an opportunity to be useful in the form of helping provide the incentives towards better norms. If we can make it clear that it will be punished – financially, in the stock price – when AI systems are released onto the internet without being tested or made safe, that would be helpful. At minimum, we want to prevent the norm from shifting the other way. See the section Hopium Floats.
As for the more fundamental issues, the stuff that matters most?
A lot of people I know have worked on these problems for a long time. My belief is that most of them are fooling themselves.
They tell themselves they are working on making things safe. Instead, they are making things worse. Even if they understand that the goal is not-kill-everyoneism, they end up mostly working on AI capabilities, and increasing AI funding and excitement and use. They notice how horrible it is that we have N companies attempting to create an AI without enough attention to safety, and soon we have (N+1) such companies, all moving faster. By default, the regulations that actually get passed seem likely to not address the real issues here – I expect calls like this not to do anything useful, and it is noteworthy that this is the only place in this whole post I use the word ‘regulation.’
Thus, the biggest obvious thing to do is avoid net-negative work. We found ourselves in a hole, and you can at least strive to stop digging.
In particular, don’t work on AI capabilities, and encourage others not to do so. If they are already doing so, attempt to point out why maybe they should stop, or even provide them attractive alternative opportunities. Avoid doing the opposite, where you get people excited about AI who then go off and work on AI capabilities or invest in or start AI startups that fuel the fire.
That does not mean there are no ways to do useful, net-positive work, or no one doing such work. It does not mean that learning more about these problems, and thinking more about them, and helping more people think better about them, is a bad idea.
Current AI systems are giant inscrutable matrices that no one understands. Attempts to better understand the ones that already exist do seem good, so long as they don’t mostly involve ‘build the thing and make it competitive so we can then work on understanding it, and that costs money so sell it too, etc.’
Attempts to privately figure out how to do AI without basing it on giant inscrutable matrices, or to build the foundations for doing it another way, seem like good ideas if there is hope of progress.
Cultivation of security mindset, in yourself and in others, and the general understanding of the need for such a mindset, is helpful. Those without a security mindset will almost never successfully solve the problems to come.
The other category of helpful thing is to say that to save the world from AI, we must first save the world from itself more generally. Or, at least, that doing so would help.
This was in large part the original plan of the whole rationalist project. Raise the sanity waterline. Give people the abilities and habits necessary to think well, both individually and as a group. Get our civilization to be more adequate in a variety of ways. Then, perhaps, they will be able to understand the dangers posed by future AIs and do something net useful about it.
I still believe in a version of this, and it has the advantage of being useful even if it turns out that transformative AI is far away, or even never gets built at all.
Helping people to think better is ideal. Helping people to be better off, so they have felt freedom to breathe and make better choices including to think better? That is badly needed. No matter what the statistics might say, the people are not OK, in ways having nothing to do with AI.
People who are under extreme forms of cognitive and economic coercion, who lack social connection, community or a sense of meaning in life, who despair of being able to raise a family, do things like take whatever job pays the most money while telling themselves whatever story they need to tell. Others do the opposite, stop trying to accomplish anything since they see no payoffs there.
Those who do not feel free to think, choose not to. Those who are told they are only allowed to think and talk about a narrow set of issues in certain ways, only do that.
Those who see a world where getting ahead means connections and status and conspiracy and also spending all your time in zero-sum competitions, and who seek to play the games of moving up the ranks of corporate America by becoming the person who would succeed at that, are not going to be the change we want to see.
Academics who need to compete for grants by continuously working on applications and showing incremental progress, and who only get their own labs at 40+, will never get to work on the problems that matter.
I really, genuinely think that if we had a growing economy, where people could afford to live where they want to live because we built housing there, where they felt hope for their futures and the support and ability to raise families, where they could envision a positive future, that gives us much more of a chance to at least die with more dignity here.
If you want people to dream big, they need hope for the future. If you’re staying up at night terrified that all humans will be dead in 20 years from climate change, that is going to crowd everything else out and also make you miserable, and lots of people doing that is on its own a damn good reason to solve that problem, and a bunch of others like it. This is true even if you believe that AI will render this a moot point one way or another (since presumably, if we get a transformational AI, either we all die from AI no matter what temperature it is outside, or with the AI we figure out how to easily fix climate change, this isn’t a morality play.)
If we are going to solve these problems, we would also greatly benefit from much better ability to cooperate, including internationally, which once again would be helped if things were better and thus people were less at each other’s throats and less on edge about their own survival.
Thus, in the face of these problems, even when time is short, good things remain good. Hope remains good. Bad things remain bad. Making the non-AI futures of humanity bright is still a very good idea. Also it will improve the training data. Have you tried being excellent to each other?
The shorter you believe the time left to be, the less value such actions have, but my model says the time to impact could be much faster than you would expect because of the expectations channel – zeitgeists can change within a few years and often do.
The best things to do are still direct actions – if you are someone who is in a position to take them, and to identify what they are.
In case it needs to be said: If you are considering choosing violence, don’t.
I wish I had better answers here. I am not pretending I even have good ones. Problem is hard.
What Would Make Things Look Actually Safe?
Here is one answer.