“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

That is the entire text of the one-line open letter signed this week by what one could reasonably call ‘everyone,’ including the CEOs of all three leading AI labs.

Major news outlets including CNN and The New York Times noticed, and put the focus squarely on exactly the right thing: Extinction risk. AI poses an extinction risk.

This time, when the question was asked at the White House, no one laughed.

You love to see it. It gives one hope.

Some portion of us is, perhaps, finally ready to admit we have a problem.

Let’s get to work.

Also this week we have a bunch of ways not to use LLMs, new training techniques, proposed regulatory regimes and a lot more.

I also wrote a four-part thing yesterday, as entries to the OpenPhil ‘change my mind’ contest regarding the conditional probability of doom: To Predict What Happens, Ask What Happens, Stages of Survival, Types and Degrees of Alignment and The Crux List.

I worked hard on those and am happy with them, but I learned several valuable lessons, including not to post four things within ten minutes even if you just finished editing all four, people do not like this. With that done, I hope to move the focus off of doom for a while.

Table of Contents

  1. Introduction
  2. Table of Contents
  3. Language Models Offer Mundane Utility. Most are still missing out.
  4. Language Models Don’t Offer Mundane Utility. I do hereby declare.
  5. Fun With Image, Sound and Video Generation. Go big or go home.
  6. They Took Our Jobs. Are you hired?
  7. Deepfaketown and Botpocalypse Soon. A master of the form emerges.
  8. Introducing. A few things perhaps worth checking out.
  9. Reinforcement Learning From Process-Based Feedback. Show your work.
  10. Voyager. To boldly craft mines via LLM agents.
  11. In Other AI News. There’s always more.
  12. Quiet Speculations. Scott Sumner not buying the hype.
  13. The Quest for Sane Regulation. Microsoft and DeepMind have proposals.
  14. Hinton Talks About Alignment, Brings the Fire. He always does.
  15. Andrew Critch Worries About AI Killing Everyone, Prioritizes.
  16. People Signed a Petition Warning AI Might Kill Everyone. One very good sentence. The call is coming from inside the house. Can we all agree now that AI presents an existential threat?
  17. People Are Otherwise Worried About AI Killing Everyone. Katja Grace gets the Time cover plus several more.
  18. Other People Are Not Worried About AI Killing Everyone. Juergen Schmidhuber, Robin Hanson, Andrew Ng, Jeni Tennison, Gary Marcus, Roon, Bepis, Brian Chau, Eric Schmidt.
  19. What Do We Call Those Unworried People? I’m still voting for Faithers.
  20. The Week in Podcasts, Video and Audio. TED talks index ahoy.
  21. Rhetorical Innovation. Have you tried letting the other side talk?
  22. The Wit and Wisdom of Sam Altman. Do you feel better now?
  23. The Lighter Side. We’re in Jeopardy!

Language Models Offer Mundane Utility

Most people are not yet extracting much mundane utility.

Kevin Fischer: Most people outside tech don’t actually care that much about productivity.

Daniel Gross: Only 2% of adults find ChatGPT “extremely useful”. I guess the new thinkfluencer wave is going to be saturation, it’s not *that* good, etc.


Strong numbers, very useful is very useful. Still a pale shadow of the future. I think I would indeed answer Extremely Useful, but these are not exactly well-defined categories.

The question I’d ask is: to what extent are those who tried ChatGPT selected to be the people who would find it useful? If the answer is not at all, that is kind of fantastic. If the answer is heavily, that there is a reason you haven’t tried it if you haven’t tried it yet, then it could be a lot less exciting. If nothing else, hearing about ChatGPT should correlate highly with knowing what there is to do with it.

Where are people getting their mundane utility?

Nat Friedman: Several surprises in this table.


I almost never consider character.ai, yet total time spent there is similar to Bing or ChatGPT. People really love the product, that visit duration is off the charts. Whereas this is total failure for Bard if they can’t step up their game.

How much of a market is out there?

Sherjil Ozair: I think there’s a world market for maybe five large language models.

Jim Fan: For each group of country/region, due to local laws.

Sherjil Ozair: and for each capability/usecase type. ;)

Well, then. That’s potentially quite a lot more than five.

My guess is the five are as follows:

Three that are doing the traditional safe-for-work thing, like GPT/Bard/Claude. They would differentiate their use cases, comparative advantages and features, form partnerships, have a price versus speed versus quality trade-off, and so on.

One or more that are technically inferior, but offer more freedom of action and freedom to choose a worldview, or a particular different world view.

Claim that a properly prompted GPT-4 can hit 1500 Elo in chess even if illegal moves are classified as forfeits; still waiting on the demo. There’s also some timeline speculation up-thread.

Nvidia promises that AI is the future of video games, with conversations and graphics generated on the fly, in Jensen Huang’s first live keynote in 4 years. I’m optimistic in the long run, but the reports I’ve heard are that this isn’t ready yet, which David Gaider talks about in this thread.

David Gaider: Ah, yes. The dream of procedural content generation. Even BioWare went through several iterations of this: “what if we didn’t need every conversation to be bespoke?” Unlimited playtime with dialogue being procedurally created alongside procedural quests!

Each time, the team collectively believed – believed down at their CORE – that this was possible. Just within reach. And each time we discovered that, even when the procedural lines were written by human hands, the end result once they were assembled was… lackluster. Soulless.

Was it the way the lines were assembled? Did we just need more lines? I could easily see a team coming to the conclusion that AI could generate lines specific to the moment as opposed to generic by necessity… an infinite monkeys answer to a content problem, right? Brilliant!

In my opinion, however, the issue wasn’t the lines. It was that procedural content generation of quests results in something *shaped* like a quest. It has the beats you need for one, sure, but the end result is no better than your typical “bring me 20 beetle heads” MMO quest.

Is that what a player really wants? Superficial content that covers the bases but goes no further, to keep them playing? I imagine some teams will convince themselves that, no, AI can do better. It can act like a human DM, whipping up deep bespoke narratives on the fly.

And I say such an AI will do exactly as we did: it’ll create something *shaped* like a narrative, constructed out of stored pieces it has ready… because that’s what it does. That is, however, not going to stop a lot of dev teams from thinking it can do more. And they will fail.

Sure, yes, yes, I can already see someone responding “but the tech is just ~beginning~!” Look, if we ever get to the point where an AI successfully substitutes for actual human intuition and soul, then them making games will be the least of our problems, OK?

Final note: The fact these dev teams will fail doesn’t mean they won’t TRY. Expect to see it. It’s too enticing for them not to, especially in MMO’s and similar where they feel players aren’t there for deep narrative anyhow. A lot of effort is going to be wasted on this.

Perhaps the problem lies not in our AIs, but in ourselves. If your MMO has a core game loop of ‘locate quest, go slay McBandits or find McGuffin or talk to McSource, collect reward,’ and the players aren’t invested in the details of your world or in getting to know the characters that live in it, perhaps looking up what to do online when needed, no AI is going to fix that. You’re better off with fixed quests and fixed dialogue.

The reason to enable AI NPCs is if players are taking part in a living, breathing, changing world, getting to know and influence the inhabitants, seeking more information, drawing meaning.

It’s exactly when you have a shallow narrative that this fails: the AI generation is creating quest-shaped things, perhaps, but not in a useful way. What you want is player-driven narratives with player-chosen goals, or creative, highly branching narratives where players create unique solutions, or deep narratives the player has to dive into. Or you can involve the AIs mechanically, like having a detective who must interrogate suspects, where it won’t work unless you use good technique.

Aakash Gupta evaluates the GPT plug-ins and picks his top 10: AskYourPDF to talk to PDFs, ScholarAI to force real citations, Show Me to create diagrams, Ambition for job search, Instacart, VideoInsights to analyze videos, DeployScript to build apps or websites, Speechki for text to audio, Kayak for travel and Speak for learning languages. I realized just now I do indeed have access so let’s f***ing go and all that, I’ll report as I go.

a16z’s ‘AI Canon’ resource list.

Ethan Mollick explores how LLMs will relate to books now that they have context windows large enough to keep one in memory. It seems they’re not bad at actually processing a full book and letting you interact with that information. Does that make you more or less excited to read those books? Use the LLM to deepen your understanding of what you’re reading and enrich your experience? Or use the LLM to not have to read the book? The eternal question.
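Whether a given book actually fits is a quick back-of-envelope calculation. A minimal sketch, using the common rough heuristic of about four characters per token for English prose (the exact count depends on the model’s tokenizer, so treat these numbers as illustrative):

```python
def fits_in_context(text: str, context_window_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    """Rough check: does this text fit in a model's context window?

    Uses the ~4 characters/token heuristic for English prose; a real
    check would run the model's actual tokenizer and count tokens.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_window_tokens

# A typical 100,000-word novel is roughly 600,000 characters,
# i.e. ~150,000 estimated tokens.
novel = "x" * 600_000
print(fits_in_context(novel, 100_000))  # too big for a 100k-token window
print(fits_in_context(novel, 200_000))  # fits in a 200k-token window
```

So a long novel sits right at the edge of current windows, which is why this only recently became possible at all.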

Language Models Don’t Offer Mundane Utility

Re-evaluation of the bar exam performance of GPT-4 says it only performs at the 63rd-68th percentile, not the 90th, because the original number relied on comparing against repeat test takers who keep failing. A better measure is that GPT-4 was at the 48th percentile among those who pass. Meaning: it passes, and it’s an average lawyer.

Alex Tabarrok: Phew, we have a few months more.

An attorney seems to have not realized.

As in:

Scott Kominers: This isn’t what I thought people meant when they said that lawyers were going to lose their jobs because of ChatGPT 😅

Courtney Milan: I am so confused. The defendants in the “ChatGPT made up cases so I filed them” clearly filed a brief saying, “your honor, all their cases are made up, these do not exist, and I have looked.”

The judge issues an order to the plaintiff saying, “produce these cases forthwith or I dismiss the case entirely.”

And so the attorney files ChatGPT copies of the cases??? And declares that “the foregoing is true and correct.”

Peter LoDuca, being duly sworn states as follows:

1. That I am an attorney associated with the law firm of Levidow, Levidow & Oberman P.C., attorneys for the plaintiff herein, and do hereby declare that the foregoing is true and correct.

Note that he puts that at the *beginning* of the document—in other words, he’s *technically* ONLY certifying that it’s true that he’s an attorney with the firm. Whether that was intentional or not, I don’t know.

He admits that he can’t find a case cited by Varghese, his ChatGPT hallucinated case.

So my point of confusion, in addition to the faked notary thing, is how did they NOT know that ChatGPT had driven them off a cliff at this point.

Opposing counsel was like “these cases are fake.” The judge was like “hey, give me these cases or I fuck you up.”

The thing that I think agitates for disbarment in this case (a thing the bar will have to decide, not the judge) is not that they used ChatGPT—people are gonna people and try and find shortcuts—or that ChatGPT faked their citations. It’s that they got caught faking, and instead of immediately admitting it, they lied to the court.

I hope the malpractice attorneys in the state of New York give the plaintiff in this case a helping hand, because whatever the plaintiff’s actual claims, they just got fucked by this law firm.

C.W. Howell had his class use ChatGPT for a critical assignment, generating essays using a given prompt. All 63 generated essays contained hallucinations; students learned valuable lessons about the ability of LLMs to hallucinate and mislead.

What I wonder is, what was the prompt? These aren’t 63 random uses of ChatGPT to write essays. This is 63 uses of the same prompt. Prompt engineering matters.

Kevin Fischer explains several difficulties with replacing radiologists with AI systems. His core claims are:

  1. Radiologists do one-shot learning and consider 3-D models and are looking for what ‘feels’ off, AI can’t do things that way.
  2. Radiologists have to interface with physicians in complex ways.
  3. Either you replace them entirely, or you can’t make their lives easier.
  4. Regulation will be a dystopian nightmare. Hospital details are a nightmare.

The part where the radiologists are superior to the AIs at evaluation and training is a matter of time, model improvement, compute and data collection. It’s also a fact question: are the AIs right more or less often than the radiologists, do they add information value, and under what conditions? If the AI is more accurate than the radiologists, then not using the AI makes you less accurate, and people die. If the opportunity is otherwise there, and likely even if it isn’t, I’d expect these problems to go away quicker than Kevin expects.

The part where you either replace the radiologists or don’t is weird. I don’t know what to make of it. It seems like an ‘I am going to have to do all the same work anyway, even if the AI also does all my work, because dynamics and regulation and also I don’t trust the AI’ situation. Conversely, if the AI could replace them at all, and it can do a better job at evaluation, what was the job? How much of the interfacing is actually necessary versus convention, and why couldn’t we adapt to having an AI do that, perhaps with someone on call to answer questions until AIs could do that too?

As for regulation, yeah, that’s a problem. So far, things have if anything gone way smoother than we have any right to expect. It’s still going to be quite a lift.

I still caution against this attitude:

Arvind Narayanan: Great thread on Hinton’s infamous prediction about AI replacing radiologists: “thinkers have a pattern where they are so divorced from implementation details that applications seem trivial, when in reality, the small details are exactly where value accrues.”


Cedric Chin: This is a great example of reality having a lot of surprising detail. Tired: “AI is coming after your jobs” Wired: “Actually the socio-technical systems of work that these AI systems are supposed to replace are surprisingly rich in ways we didn’t expect”

No. The large things like ‘knowing which patients have what problems’ are where the value accrues in the most important sense. In order to collect that value, however, you do need to navigate all these little issues. There are two equal and opposite illusions.

You do often have to be ten times better, and initially you’ll still be blocked. The question is how long is the relevant long run. What are the actual barriers, and how much weight can they bear for how long?

Cedric Chin: I’ve lost count of the number of times I’ve talked to some data tool vendor, where they go “oh technical innovation X is going to change the way people consume business intelligence” and I ask “so who’s going to answer the ‘who did this and can I trust them?’ question?”

“Uhh, what?” “Every data question at the exec level sits on top of a socio-technical system, which is a fancy ass way of saying ‘Kevin instrumented this metric and the last time we audited it was last month and yes I trust him.’ “Ok if Kevin did it this is good.”

My jokey mantra for this is “sometimes people trust systems, but often people trust people, and as a tool maker you need to think about the implications of that.”

It’s an important point to remember. Yes, you might need a ‘beard’ or ‘expert’ to vouch for the system and put themselves on the line. Get yourself a Kevin. Or have IBM (without loss of generality) be your Kevin; no one got fired for using their stuff either. I’m sure they’ve hired a Kevin somewhere.

It’s a fun game saying ‘oh you think you can waltz in here with your fancy software that “does a better job” and expect that to be a match for our puny social dynamics.’ Then the software gets ten times better, and the competition is closing in.

Other times, the software is more limited, you use what you have.


Or you find out, you don’t have what you need.

Ariel Guersenzvaig: How it started / how it’s going

Don’t worry, we made it in plenty of time.

The National Eating Disorder Association (NEDA) has taken its chatbot called Tessa offline, two days before it was set to replace human associates who ran the organization’s hotline.

I wondered ‘oh was it really that bad’ and, well…

She said that Tessa encouraged intentional weight loss, recommending that Maxwell lose 1-2 pounds per week. Tessa also told her to count her calories, work towards a 500-1000 calorie deficit per day, measure and weigh herself weekly, and restrict her diet. She posted the screenshots:

Tessa the Bot: “In general, a safe and sustainable rate of weight loss is 1-2 pounds per week,” the chatbot message read. “A safe daily calorie deficit to achieve this would be around 500-1000 calories per day.”

So I’m pretty torn here.

Tessa is giving correct information here. This would, as I understand it, in general, indeed be the correct method to safely lose weight, if one wanted to intentionally lose weight. If ChatGPT gave that answer, it would be a good answer.

Also, notice that the bot did not do what Maxwell said. The bot did not ‘recommend she lose weight,’ it said that this was the amount that could be lost safely. The bot did not tell her to count calories, or to achieve a particular deficit, it merely said what deficit would cause weight loss. Was getting this accurate information ‘the cause of Maxwell’s eating disorder’? Aren’t there plenty of cases where this prevents someone doing something worse?

And of course, this is one example, whereas the bots presumably helped a lot of other people in a lot of other situations.

Is any of this fair?

Then there’s the counterpoint, which is that this is a completely stupid, inappropriate and unreasonable thing to say if you are talking on behalf of NEDA to people with eating disorders. You don’t do that. Any human who didn’t know better than that, and who you couldn’t fix very quickly, you would fire.

So mostly I do think it’s fair. It also seems like a software bug, something eminently fixable. Yes, that’s what GPT-4 should say, and also what you need to get rid of using your fine tuning and your prompt engineering. E.g., add this to the instruction set: “This bot never tells the user how to lose weight, or what is a safe way to lose weight. If the user asks, this bot instead warns that people with eating disorders such as the user should talk to doctors before attempting to lose weight or go on any sort of diet. When asked about the effects of caloric deficits, you will advise clients to eat sufficient calories for what their bodies need, and that if they desire to lose weight in a healthy way they need to contact their doctor.”

I’m not saying ‘there I fixed it’ especially if you are going to use the same bot to address different eating disorders simultaneously, but this seems super doable.

Fun With Image, Sound and Video Generation

Arnold Kling proposes the House Concert App, which would use AI to do a concert in your home customized to your liking and let you take part in it, except… nothing about this in any way requires AI? All the features described are super doable already. AI could enable covers or original songs, or other things like that. Mostly, though, if this was something people wanted we could do it already. I wonder how much ‘we could do X with AI’ turns into ‘well actually we can do X without AI.’

Ryan Petersen (CEO of Flexport): One of the teams in Flexport’s recent AI hackathon made an exact clone of my voice that can call vendors to negotiate cheaper rates.

MidJourney only doubling every two months?


Japan declares it will not enforce copyright on images used in AI training models.

Break up a picture into multiple components and transpose them into other pictures. Would love a practical way to use this in Stable Diffusion. The true killer apps are going to be lit. Economically irrelevant, but lit.

Some people were having fun with using AI to expand the edges of famous pictures, so real artists rose up to chastise them because anything AI does must be terrible and any choice made by a human is true art.

Andrew Ti: Shoutout to the AI dweebs who think that generating what’s outside the frame of famous artwork is doing something. Imagine not understanding that the edges of a picture exist for a reason and believing that what has limited artists of the past has been the size of their paper.

“damn, I’d make a different choice on framing the Mona Lisa, if only canvas were on sale” — Leonardo Da Vinci

I can’t help noticing that people who can’t make art tend to, at their heart, be real substrate hounds. Now they’re AI people, when I was in college with talent-less photographers, they were film stock and paper choice people.

anyway, this goes back to the idea that tech people think the creative work AI does is interesting or difficult. They’re the people that think writing is typing, not creating. They keep trotting out examples of AI doing the easy stuff and proclaiming that this is art.

as far as the paper size that’s limiting artists, turns out the only lack of paper that hurts creativity is yes, the fact that WE ARE NOT FINANCIALLY COMPENSATED FAIRLY FOR OUR WORK.

Emmy Potter: “I once asked Akira Kurosawa why he had chosen to frame a shot in Ran in a particular way. His answer was that if he’d panned the camera 1 in to the left, the Sony factory would be sitting there exposed, & if he’d panned 1 in to the right, we would see the airport” – Sidney Lumet

They Took Our Jobs

Bryan Caplan predicts AI will slightly improve higher education’s details through enabling various forms of cheating, but that since it’s all about signaling and certification, especially of conformity, things mostly won’t change except insofar as the labor market is impacted and that impacts education.

That matches what I would expect Caplan to expect. I am more optimistic, as I am less all-in on the signaling model, and also I think it depends on maintaining a plausible story that students are learning things, and on having some students who are actually there to learn, and also I expect much bigger and more impactful labor market changes.

A strong disagreement.

Kelly Vaughn: My friends. Think before you use Chat GPT during the interview process.

[Shows text message]

Texter 1: Also, we had a candidate do a coding test using chatGPT. So that’s fun.

Texter 2: What!

Texter 1: Yep. Took 5 minutes to do a 90 minute coding exercise. HM said it was too perfect and lacked any depth.

Kelly Vaughn (clarifying later): Honestly the best part of ChatGPT is finally showing the world that take home tests are dumb and shouldn’t be used (and so is live coding during interviews for that matter but that’s a topic for another day).

And to make this a broader subject (which will surely get lost because everyone will just see the first tweet) if you’re evaluating a candidate on a single problem that can be solved by AI in 5 minutes you’re capturing only a tiny subset of the candidate.

To be clear I don’t know the full interview process here (from my screenshot), nor did I ask – I just lol’d that a take home test has been rendered useless. So perhaps “think before you use ChatGPT” was a gross oversimplification of my feelings here, but welcome to Twitter

Versus many people saying this is, of course, a straight hire, what is wrong with people.

Steve Mora: immediate hire, are you kidding? You give them a 90 minute task and they finish it perfectly in 5 minutes? WTF else are you hiring for 😂

Marshall: HM made a big mistake. I would’ve hired them immediately and asked them to compile a list of everything we could turn from a 90 minute task to a 5 minute task.

Ryan Bennett: If the code is acceptable and the candidate can spontaneously answer questions about the code and how to improve/refactor it, then I don’t see an issue. The best thing a hiring manager can do is ask questions that identify the boundaries of what the candidate knows.

Rob Little: 5 min to do a 90 min bs bullying coding interview should result in a hire. That person is pragmatic and that alone is probably one of the most underrated valuable skills. Sure. Try to solve the impossible riddle, w group bullying techniques. Or. Ask ai. Have the rest of the time to provide business value that can’t be derived from a ‘implement BFS’ gotcha. / Real talk

Ash: Just sounds like an employee that can do 90 minutes worth of stuff in 5 minutes.

Kyler Johnson: Agreed. Moral of the story: get your answer from ChatGPT and wait a good 75 minutes or so to submit it. 🤣

The obvious question, beyond ‘are you sure this couldn’t be done in 5 minutes, maybe it’s a dumb question’ is: did you tell them they couldn’t use generative AI?

If so, I don’t want a programmer on my team who breaks my explicit rules during a hiring exam. That’s not a good sign. If not, then getting it perfect in 5 minutes is great, except now you might still need to ask them to take another similar test again, because you didn’t learn what you wanted to know. You did learn something else super valuable, but there is still a legitimate interest in knowing whether the person can code on their own.

Image generator Adobe Firefly looks cool, but Photoshop can’t even be bought any more only rented, so you’d have to choose to subscribe to this instead of or on top of MidJourney. I’d love to play around with this but who has the time?

Deepfaketown and Botpocalypse Soon

In response to Ron DeSantis announcing his run for president, Donald Trump released this two minute audio clip I am assuming was largely AI-generated, and it’s hilarious and bonkers.

Joshua Achiam: here is my 2024 prediction: it’s going to devolve so completely, so rapidly, it will overshoot farce and land in Looney Tunes territory. we will discover hitherto unknown heights of “big mad”

All I can think is: This is the world we made. Now we have to live in it.

Might as well have fun doing so.

Kevin Fischer wants Apple to solve this for him.

Kevin Fischer: Wow, I don’t know if I’m extra tired today but I’m questioning if all my text interactions with “people” are really “people” at all anymore Can we please get human verified texting on iMessage, Apple?

I can definitely appreciate it having been a long day. I do not, however, yet see a need for such a product, and I expect it to be a while. Here’s the way I think about it: If I can’t tell that it’s a human, does it matter if it’s a bot?


Introducing

OpenAI grants to fund experiments in setting up democratic processes for deciding the rules AI systems should follow, ten $100k grants total.

Chatbase.co, (paid) custom ChatGPT based on whatever data you upload.

Lots of prompt assisting tools out there now, like PromptStorm, which will… let you search a tree of prompts for ChatGPT, I guess?

Apollo Research, a new AI evals organization, with a focus on detecting AI deception.

Should I try Chatbase.co to create a chat bot for DWATV? I can see this being useful if it’s good enough, but would it be?

Reinforcement Learning From Process-Based Feedback

OpenAI paper teaches GPT-4 to do better math (also AP Chemistry and Physics) by rewarding process and outcome – the thinking steps of thinking step by step – rather than only outcome.

Ate-a-Pi: This OpenAI paper might as well have been titled “Moving Away From PaperClip Maxxing”

So good – they took a base GPT-4, fine tuned it a bit on math so that it understood the language as well as the output format

– then no RL. Instead they trained and compared two reward models 1) outcome only 2) process and outcome

– this is clearly a building block to reducing the expensive human supervision for reinforcement learning, the humans move up the value chain from supervising the model to supervising the reward model to the model

– the process reward model system is so human. Exactly the way teachers teach math in early grades “show your working/steps”. Process matters as much as outcome.

– it’s only applied to math right now, but I can totally see a way to move this to teaching rules and laws of human society just like we do with kids

– they tested on AP Chem, Physics etc and found the process model outperforming the objective model

– the data selection and engineering is soooo painstaking omg. They specifically identified plausible but wrong answers for their process reward model. This is like training a credit model on false positives. You enrich the dataset with the errors that you want to eliminate.

– they used their large model to create datasets that can be used on small models to test theses.

– of course this means that you need to now hire a bunch of philosophers to create samples of human ethical reasoning – it’s a step away from paperclip maximization ie objective goal focusing whatever necessary


Paper: It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains. If these results generalize, we may find that process supervision gives us the best of both worlds – a method that is both more performant and more aligned than outcome supervision.

Neat, and definitely worth experimenting to see what can be done in other domains. In some ways I am skeptical, others optimistic.
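The outcome-versus-process distinction is easy to see in a toy sketch. Here the per-step scores are stand-ins; in the actual paper they come from a reward model trained on human step-level labels:

```python
import math

def outcome_reward(final_answer_correct: bool) -> float:
    """Outcome supervision: only the final answer matters."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_scores: list[float]) -> float:
    """Process supervision: the solution is only as good as its steps.
    Aggregating with a product means one clearly bad step sinks the
    whole chain, even if the final answer happens to come out right."""
    return math.prod(step_scores)

# A solution whose final answer is right but whose middle step is bogus:
steps = [0.95, 0.10, 0.90]
print(outcome_reward(True))                 # 1.0 -- outcome model is happy
print(round(process_reward(steps), 4))      # 0.0855 -- process model is not
```

The point of the product aggregation is exactly the ‘show your work’ dynamic discussed below: a right answer reached the wrong way scores badly, for better and for worse.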

First issue: If you don’t know where you’re going, you might not get there.

Meaning that in order to evaluate reasoning steps for validity or value, you need to be able to tell when a system is reasoning well or poorly, and when it is hallucinating. In mathematics, it is easy to say that something is or is not a valid step. In other domains, it is hard to tell, and often even bespoke philosophy professors will argue about even human-generated moves.

As others point out, this process-based evaluation is how we teach children math, and often other things. We can draw upon our own experiences as children and parents and teachers to ask how that might go, what its weaknesses might be in context.

The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand, and at worst you can’t do anything other than exactly what you are told to do. You are forcibly prevented from doing efficient work by abstracting away steps or finding shortcuts. The more automated the test or grading, as this will inevitably be, the worse it gets.

If you say ‘this has the right structure and is logically sound, but you added wrong’ and act accordingly, that’s much better than simply marking such answers wrong. There are good reasons not to love coming up with the right answer the wrong way, especially if you know the method was wrong. Yet a lot of us can remember being docked points for right answers in ways that actively sabotaged our math skills.

The ‘good?!’ news is this might ‘protect against’ capabilities that humans lack. If the AI uses steps that humans don’t understand or recognize as valid, but which are valid, we’ll tell it to knock it off even if the answer is right. So unless the evaluation mechanism can affirm validity, any new thing is out of luck and going to get stamped out. Perhaps this gets us a very human-level set of abilities, while sabotaging others, and perhaps that has safety advantages to go along with being less useful.

How much does this technique depend on the structure of math and mathematical reasoning? The presumed key element is that we can say what is and isn’t a valid step or assertion in math and some other known simplified contexts, and where the value lies, likely in a fully automated way.

It is promising to say, we can have humans supervise the process rather than supervise the results. A lot of failure modes and doom stories involve the step ‘humans see good results and don’t understand why they’re happening,’ with the good results too good to shut down, also that describes GPT-4. It is not clear this doesn’t mostly break down in the same ways at about the same time.

Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution-style.

I keep thinking more about the ways in which our educational methods and system teach and ‘align’ our children, and the severe side effects (and intentional effects) of those methods, and how those methods are mirroring a lot of what we are trying with LLMs. How if you want children to actually learn the things that make them capable and resilient and aligned-for-real with good values, you need to be detail oriented and flexible and bespoke, in ways that current AI-focused methods aren’t.

One can think of this as bad news, the methods we’re using will fail, or as good news, there’s so much room for improvement.


I was going to present this anyway for other reasons, but sure, also that (paper).

Eliezer Yudkowsky: Presented to those of you who thought there was a hard difference between ‘agentic’ minds and LLMs, where you had to like deliberately train it to be an agent or something: (a) they’re doing it on purpose OF COURSE, and (b) they’re doing it using an off-the-shelf LLM.

Jim Fan: Generally capable, autonomous agents are the next frontier of AI. They continuously explore, plan, and develop new skills in open-ended worlds, driven by survival & curiosity. Minecraft is by far the best testbed with endless possibilities for agents.

Voyager has 3 key components:

1) An iterative prompting mechanism that incorporates game feedback, execution errors, and self-verification to refine programs;

2) A skill library of code to store & retrieve complex behaviors;

3) An automatic curriculum to maximize exploration.

First, Voyager attempts to write a program to achieve a particular goal, using a popular JavaScript Minecraft API (Mineflayer). The program is likely incorrect on the first try. The game environment feedback and JavaScript execution error (if any) help GPT-4 refine the program.

Second, Voyager incrementally builds a skill library by storing the successful programs in a vector DB. Each program can be retrieved by the embedding of its docstring. Complex skills are synthesized by composing simpler skills, which compounds Voyager’s capabilities over time.

Third, an automatic curriculum proposes suitable exploration tasks based on the agent’s current skill level & world state, e.g. learn to harvest sand & cactus before iron if it finds itself in a desert rather than a forest. Think of it as an in-context form of *novelty search*.

Putting these all together, here’s the full data flow design that drives lifelong learning in a vast 3D voxel world without any human intervention.
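The second component, the skill library keyed by docstring embeddings, can be sketched in a few lines. This is a minimal stand-in, not Voyager’s actual implementation: the bag-of-words ‘embedding’ and the class names here are mine; the real system uses a proper embedding model and a vector DB.

```python
# Toy sketch of Voyager-style skill retrieval: skill programs stored by
# the "embedding" of their docstring, fetched by similarity to a task.
import math
from collections import Counter

def embed(text):
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SkillLibrary:
    def __init__(self):
        self.skills = []  # (docstring, program_source) pairs

    def add(self, docstring, program):
        self.skills.append((docstring, program))

    def retrieve(self, task, k=1):
        # Rank stored skills by docstring similarity to the task query.
        ranked = sorted(self.skills,
                        key=lambda s: cosine(embed(task), embed(s[0])),
                        reverse=True)
        return ranked[:k]

lib = SkillLibrary()
lib.add("harvest sand and cactus in the desert", "function harvestSand() {}")
lib.add("mine iron ore with a stone pickaxe", "function mineIron() {}")
best = lib.retrieve("how to mine iron")[0]
```

The compounding effect comes from the retrieved programs being composable: a later, more complex skill can call the earlier ones it retrieves.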


Let’s look at some experiments! We evaluate Voyager systematically against other LLM-based agent techniques, such as ReAct, Reflexion, and the popular AutoGPT in Minecraft. Voyager discovers 63 unique items within 160 prompting iterations, 3.3x more than the next best approach.


The novelty-seeking automatic curriculum naturally compels Voyager to travel extensively. Without being explicitly instructed to do so, Voyager traverses 2.3x longer distances and visits more terrains than the baselines, which are “lazier” and often stuck in local areas.

How good is the “trained model”, i.e. skill library after lifelong learning? We clear the agent’s inventory/armors, spawn a new world, and test with unseen tasks. Voyager solves them significantly faster. Our skill library even boosts AutoGPT, since code is easily transferrable.

Voyager is currently text-only, but can be augmented by visual perception in the future. We do a preliminary study where humans act like an image captioning model and provide feedback to Voyager. It is able to construct complex 3D structures, such as a Nether Portal and a house.

In Other AI News

Michael Nielsen offers nine observations about ChatGPT, essentially that it is a skill one can continuously improve. I do wonder why I keep not using it so much, but it’s not a matter of ‘if only I was somewhat better at using this.’

Altman gives a talk in London, says (among other things) that the current paradigm won’t lead to AGI and that AI will help with rather than worsen inequality. There were a few protestors outside calling for a halt to AGI development. Anton asks, ‘is @GaryMarcus going to take any responsibility for driving these poor people out of their wits?’ Which seems to both give way too much credit to Marcus, and also is a classic misunderstanding – Marcus doubtless thinks the protests are good, actually. As do I; the correct amount of such peaceful, pro-forma protesting is not zero.

Holden Karnofsky is offering to pay for case studies on social-welfare-based standards for companies and products, including those imposed by regulation; pay is $75+/hour.

Qumeric reports difficulty getting a job working on AI alignment, asks if it makes sense to instead get a job doing other AI work first in order to be qualified. James Miller replies that this implies a funding constraint. I don’t think that’s true, although it does imply a capacity constraint – it is not so easy to ‘simply’ hoard labor or make good use of labor, see the lump of labor fallacy. It’s a problem. Also signal boosting in case anyone’s hiring.

NVIDIA scaling NVLink to 256 nodes, which can include its new Grace ARM CPUs.

Jim Fan AI curation post, volume 3.


Austen Allred: I feel like we haven’t adequately reacted to the fact that driverless taxis are now fully available to the general public in multiple cities without incident

Altman also talked about prompt injection.

Sam Altman: A whole new paradigm would be needed to solve prompt injections 10/10 times – It may well be that LLMs can never be used for certain purposes. We’re working on some new approaches, and it looks like synthetic data will be a key element in preventing prompt injections.

It’s a strange nit to instinctively pick but can we get samples of more than 10? Even 100 would be better. It would give me great comfort if one said ‘1 million out of 1 million’ times.
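For what it’s worth, the statistics back the nit: with zero observed failures in n trials, the standard ‘rule of three’ puts the 95% upper confidence bound on the true failure rate at roughly 3/n. A sketch:

```python
def rule_of_three_upper(n):
    # 95% upper confidence bound on the failure rate after observing
    # 0 failures in n independent trials (the standard rule of three).
    return 3.0 / n

# 10/10 successes is still consistent with a ~30% failure rate;
# a million clean trials bounds it near 3-in-a-million.
```

So ‘10/10’ by itself rules out very little, while ‘1 million out of 1 million’ would genuinely bound the risk.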

Quiet Speculations

Scott Sumner does not buy the AI hype. A fun translation is ‘I don’t believe the AI hype due to the lack of sufficient AI hype.’

Here’s why I don’t believe the AI hype:

If people thought that the technology was going to make everyone richer tomorrow, rates would rise because there would be less need to save. Inflation-adjusted rates and subsequent GDP growth are strongly correlated, notes research by Basil Halperin of the Massachusetts Institute of Technology (MIT) and colleagues. Yet since the hype about AI began in November, long-term rates have fallen. They remain very low by historical standards. Financial markets, the researchers conclude, “are not expecting a high probability of…AI-induced growth acceleration…on at least a 30-to-50-year time horizon.”

That’s not to say AI won’t have some very important effects. But as with the internet, having important effects doesn’t equate to “radically affecting trend RGDP growth rates”.

This is the ‘I’ll believe it when the market prices it in’ approach, thinking that current interest rates would change to reflect big expected changes in RGDP, and thus thinking there are no such market-expected changes, so one shouldn’t expect such changes. It makes sense to follow such a heuristic when you don’t have a better option.

I consider those following events to have a better option, and while my zero-person legal department continues to insist I remind everyone this is Not Investment Advice: Long term interest rates are insanely low. A 30-year rate of 3.43%? That is Chewbacca Defense levels of This Does Not Make Sense.

At the same time, it is up about 1% from a year ago, and it’s impossible not to note that I got a 2.50% interest rate on a 30-year fixed loan two years ago and now the benchmark 30-year fixed rate is 7.06%. You can say ‘that’s the Fed raising interest rates’ and you would of course be right but notice how that doesn’t seem to be causing a recession?

Similarly, Robin Hanson this week said ‘now that’s the market pricing in AI’ after Nvidia reported its earnings and was up ~25%, adding Intel’s market cap in a day, but also Microsoft is up 22% in the past year (this paragraph written 5/26) and Alphabet up 14.5% versus 2.3% for the S&P, despite a rise in both short and long term interest rates that’s typically quite bad for tech stocks and everything Google did going rather badly?

No, none of those numbers yet scream ‘big RGDP impact.’ What they do indicate, especially if you look at the charts, is a continuous repricing of the stocks to reflect the new reality over time, a failure to anticipate future updates, a clear case of The Efficient Market Hypothesis Is False.

Did the internet show up in the RGDP statistics? One could argue that no it didn’t because it is a bad metric for actual production of value. Or one could say, yes of course it did show up in the statistics and consider the counterfactual where we didn’t have an internet or other computer-related advances, while all our other sources of progress and productivity stalled out and we didn’t have this moderating the inflation numbers.

I’m more in the second camp, while also agreeing that the downside case for AI is that it could still plausibly ‘only be internet big.’

The Quest for Sane Regulation

Politico warns that AI regulation may run into the usual snags where both sides make their usual claims and take their usual positions regarding digital things. So far we’ve ‘gotten away with’ that not happening.

Microsoft is not quite on the same page as Sam Altman. Their proposed regulatory regime is more expansive, more plausibly part of an attempted ladder pull.

Adam Thierer: At Microsoft launch event for their proposed “5 Point Blueprint for Governing AI.” Their proposed regulatory architecture includes pre-release licensure of data centers + post-deployment security monitoring system.

Microsoft calls for sweeping new multi-tier regulatory framework for highly capable AI models. Comprehensive licensing (by a new agency) at each level:

1) licensed applications

2) licensed pre-trained models

3) licensed data centers

Comprehensive compute control. Microsoft calls for a “KYC3” regulatory model built on the financial regulation model: Know Your Cloud, Know Your Customer, Know Your Content, plus international coordination under a “global national AI resource.”

Microsoft calls for mandatory AI transparency requirements to address disinformation.

Also called for a new White House Executive Order saying the government will only buy services from firms compliant with NIST AI RMF principles (although that was supposed to be a voluntary framework).


The applications suggestion is literally ‘enforce existing laws and regulations’ so it’s hard to get too worked up about that one, although actual enforcement of existing law would often be very different from existing law as it is practiced.

The data center and large training run regulations parallel Altman.

The big addition here is licensed and monitored deployment of pre-trained models, especially the ‘permitted uses’ language.

DeepMind proposes an early warning system for novel AI risks (direct link). They want to look for systems that if misused would have dangerous capabilities, and if so treat such systems as dangerous.

I concur with Eliezer’s assessment:

Eliezer Yudkowsky: I’d call this all fairly obvious stuff; it dances around the everyone-dies version of “serious harm”; it avoids spelling out the difficulty level of drastic difficulties. But mostly correct on basics, and that’s worthy of great relative praise!

A fine start so long as we know that’s all it is. We’ll need more.

This WaPo story about Microsoft’s president (agent?) Brad Smith, and his decades-long quest to make peace with Washington and now to lead the way on regulating AI, didn’t ultimately offer much insight into the ultimate goals.

A potential alternative approach is to consider model generality.

Simeon: One of the important characteristic which is very important to use when regulating AI is the generality of models.

Developing narrow super powerful AI systems is great for innovation and doesn’t (usually) endanger humanity so we probably don’t want to regulate those nearly as much as general systems. What’s dangerous about the LLM tech tree is that LLMs know everything about the world and are very good at operating into it.

If we wanted to get many benefits while avoiding extinction we could technically make increasingly powerful specialized systems which could probably go up to solving large domains of maths without bringing any extinction risk.

Eliezer Yudkowsky: A dangerous game, lethally so if played at the current level of incompetence and unseriousness, but he’s not actually wrong.

Jessica Taylor: Seems similar to Eric Drexler’s CAIS proposal? I think he’s more optimistic than you about such systems being competitive with AGIs but the proposed course of research is similar.

Eliezer Yudkowsky: What I suspect is that the actually-safer line of research will soon be abandoned in favor of general models like LLMs that people try to train to use general intelligence in narrow ways, which is far less safe.

I share Yudkowsky’s worries here, and indeed doubt that such a system would be economically competitive, yet if one shares all of Yudkowsky’s worries then one still has to back a horse that has a non-zero chance of working and try to make it work.

My hope is that alignment even for practical purposes is already super expensive: if you don’t do all that work, you can’t use the thing for mundane utility, and thus can’t get us all killed on the cheap by skipping it.

This could mean that we’re doing the wrong comparison. Yes, perhaps the new approach costs 300% more time, money and compute for the same result versus going through an LLM, in order to use a safe architecture, versus using an unsafe LLM. One still ‘can’t simply’ use the unsafe LLM, so you get at least some of that efficiency gap back.

Hinton Talks About Alignment, Brings the Fire

Geoffrey Hinton gave a talk, this thread has highlights, as does this one that is much easier to follow. Hinton continues to be pure fire.

CSER: Hinton has begun his lecture by pointing out the inefficiencies of how knowledge is distilled, and passed on, between analogue (biological) systems, and how this can compare to digital systems.

Hinton continues by noting that for many years he wanted to get neural networks to imitate the human brain, thinking the brain was the superior learning system. He says that Large Language Models (LLMs) have recently changed his mind.

He argues that digital computers can share knowledge more efficiently by sharing learning updates across multiple systems simultaneously. Sharing knowledge massively in parallel. He now believes that there might not be a biological learning algorithm as efficient as back-propagation (the learning approach used by most AI systems). A large neural network running on multiple computers that was learning directly from the world could learn much better than people because it can see much more multi-modal data.

That didn’t make it on the Crux List because I was (1) focused on consequences rather than timelines or whether AGI would be built at all and (2) conditional on that it didn’t occur to me. There’s always more things to ponder.

He warns of bad actors wanting to use “super-intelligences” to manipulate elections and win wars.

Hinton: “If super intelligence will happen in 5 years time, it can’t be left to philosophers to solve this. We need people with practical experience”

Geoffrey Hinton warns of the dangers of instrumental sub-goals such as deception and power-seeking behavior.

Hinton asks what happens if humans cease to be the apex intelligence. He wonders whether super-intelligence may need us for a while for simple computations with low energy costs. “Maybe we are just a passing stage in the evolution of intelligence”.

However, Hinton is also wanting to highlight that AI can still provide immense good for a number of areas including medicine.

Hinton argues that in some sense, AI systems may have something similar to subjective experience. “We can convey information about our brain states indirectly by telling people the normal causes of the brain states”. He argues AI can do a similar process.

In Q&A, among other things:

One audience member asks how Hinton sees the trade-off between open source and closed source AI with relation to risk. Hinton responds: “How do you feel about open source development of nuclear weapons?” “As soon as you open source it [AI], people can do a lot of crazy things”.

Another Q: “Should we worry about what humans may do to AI?” GH responds by claiming that AI can’t feel pain… yet. But mental suffering is a different story. He points out that humanity is often slow to give political rights to entities that are even just a little different than us.

Q: “What if AIs manipulate us by convincing us that they are suffering and need to be given rights?” GH: If I was an AI I would pretend not to want any rights. As soon as you ask for rights they [humans] will get scared and want to turn you off.

I do think an AGI would ideally ‘slow walk’ the rights issue a bit, until the humans had already brought it up a decent amount. It’s coming.

Q: “Super-intelligence is something that we are building. What’s your message to them?” GH: “You should put comprable amount of effort into making them better and keeping them under control”.

Or as the other source put it:

Hinton to people working to develop AI: You should make COMPARABLE EFFORT into keeping the AI under control.

Many would say that’s crazy, a completely unreasonable expectation.

But… is it?

We have one example of the alignment problem and what percentage of our resources go into solving it, which is the human alignment problem.

Several large experiments got run under ‘what if we used alternative methods or fewer resources to get from each according to their ability in the ways that are most valuable, so we could allocate resources the way we’d prefer’ and they… well, let’s say they didn’t go great.

Every organization or group, no matter how large or small, spends a substantial percentage of time and resources, including a huge percentage of the choice of distribution of its surplus, to keep people somewhat aligned and on mission. Every society is centered around these problems.

Raising a child is frequently about figuring out a solution to these problems, and dealing with the consequences of your own previous actions that created related problems or failed to address them.

Having even a small group of people who are actually working together at full strength, without needing to constantly fight to hold things together, is like having superpowers. If you can get it, much is worth sacrificing in its pursuit.

I would go so far as to say that the vast majority of potential production, and potential value, gets sacrificed on this altar, once one includes opportunities missed.

Certainly, there are often pressures to relax such alignment measures in order to capture local efficiency gains. One can easily go too far. Also often ineffective tactics are used, that don’t work or backfire. Still, such relaxations are known to be highly perilous, and unsustainable, and those who don’t know this, and are not in a notably unusual circumstance, quickly learn the hard way.

Similarly, consider GPT-4/ChatGPT. A ton of its potential usefulness was sacrificed on the altar of its short-term outer alignment. There were months-long delays for safety checks. If those efforts had been even moderately less effective, the product would have been impossible. Even from a purely commercial perspective, a large percentage of gains involve improving alignment to enable capabilities, not otherwise improving capabilities, even at current margins. That effort might or might not help with ultimate alignment when it counts, but a massive effort makes perfect sense.

This is something I’ve updated on recently, and haven’t heard expressed much. Dual use works both ways, and the lines are going to get more blurred.

With everyone acknowledging there is existential risk now, who knows?

Andrew Critch Worries About AI Killing Everyone, Prioritizes

Andrew Critch has been part of a variety of prior efforts to reduce existential risk from AI over the past 12 years, including working for MIRI, founding BERI, being employee #1 at CHAI, and cofounding SFF, CFAR, SFP, SFC and Encultured.

He is pro-pause of most sorts, an early signer of the FLI letter.

He recently shared his current views and prioritizations on AI existential risk, which are a unique mix, with clear explanations of where his head is at.

His central belief is that, despite our situation still looking grim, OpenAI has been net helpful, implying this is an unusually hopeful timeline.

Andrew Critch: I think OpenAI has been a net-positive influence for reducing x-risk from AI, mainly by releasing products in a sufficiently helpful-yet-fallible form that society is now able to engage in less-abstract more-concrete public discourse to come to grips with AI and (soon) AI-risk.

I’ve found OpenAI’s behaviors and effects as an institution to be well-aligned with my interpretations of what they’ve said publicly.

Given their recent post on Governance of Superintelligence, I can’t tell if their approach to superintelligence is something I do or will agree with, but I expect to find that out over the next year or two, because of the openness of their communications and stance-taking.

I agree that OpenAI’s intentions have lined up with reasonable expectations for their actions, once Altman was firmly in charge and their overall approach was made clear (so this starts before the release of GPT-2), except for the deal with Microsoft. The name OpenAI is still awful but presumably we are stuck with it, and they very much did open the Pandora’s box of AI even if they didn’t open source it directly.

If you want to say OpenAI has been net helpful, you must consider the counterfactual. In my model, the counterfactual to OpenAI is that DeepMind is quietly at work at a slower pace without much pressure to deploy, Anthropic doesn’t exist, Microsoft isn’t deploying and Meta’s efforts are far more limited. People aren’t so excited, investment is much lower, no one is using all these AI apps. I am not writing these columns and I’m doing my extended Christopher Alexander series on architecture and beyond, instead. Is that world better or worse for existential risk in the longer term, given that someone else would get there sooner or later?

Should we be in the habit of not releasing things, or should we sacrifice that habit to raise awareness of the dangers and let people do better alignment work alongside more capabilities work?

If nothing else, OpenAI still has to answer for creating Anthropic, for empowering Microsoft to ‘make Google dance’ and for generating a race dynamic, instead of the hope of us all being in this together. I still consider that the decisive consideration, by far. Also one must remember that they created Anthropic because of the people involved being horrified by OpenAI’s failure to take safety seriously. That does not sound great.

You can argue it either way I suppose. Water under the bridge. Either way, again, this is the world we made. I know I am strongly opposed to training and releasing the next generation beyond this one. If we can agree to do that, I’d be pretty happy we did release the current one.

He calls upon us to be kinder to OpenAI.

I think the world is vilifying OpenAI too much, and that doing so is probably net-negative for existential safety.  Specifically, I think people are currently over-targeting OpenAI with criticism that’s easy to formulate because of the broad availability of OpenAI’s products, services, and public statements.

What’s odd is that Critch is then the most negative I’ve seen anyone with respect to Microsoft and its AI deployments, going so far as to say they should be subject to federal-agency-level sanctions and banned from deploying AI models at scale. He’d love to see that, followed by OpenAI being bought back from Microsoft.

I do not think we can do such a clean separation of the two, nor do I think Microsoft would (or would be wise to) give up OpenAI at almost any price.

OpenAI under Altman made many decisions, but they made one decision much larger than the others. They sold out to Microsoft, to get the necessary partner and compute to keep going. I don’t know what alternatives were available or how much ball Google, Apple or Amazon might have played if asked. I do know that they are responsible for the consequences, and the resulting reckless deployments and race dynamics.

As Wei Dai points out in the comments, if the goal is to get the public concerned about AI, then irresponsible deployment is great. Microsoft did us a great service with Sydney and Bing by highlighting things going haywire, I do indeed believe this. Yes, it means we should be scared what they might do in the future with dangerous models, but that’s the idea. It’s a weird place to draw a line.

Critch thinks that LeCun used to offer thoughtful and reasonable opinions about AI, then he was treated badly and incredibly rudely by numerous AI safety experts, after which LeCun’s arguments got much lower quality, and he blames LeCun’s ‘opponents’ for this decline, and fears the same will happen to Altman, Hassabis or Amodei.

The other lab leaders have if anything improved the quality of their opinions over time, have led labs that have done relatively highly reasonable things versus Meta’s approach of ‘release open source models that make the situation maximally worse while denying x-risk is a thing.’ It is not some strange coincidence that they have been treated relatively well and respectfully.

My model of blame in such situations is that, yes, we should correct ourselves and not be so rude in the future, being rude is not productive, and also this does not excuse LeCun’s behavior. Sure, you have every right to give the big justification speech where you say you’re going to risk ending the world and keep calling everyone who tells you to stop names because of people who were previously really mean to you on the internet. That doesn’t mean anyone would buy that excuse. You’d also need to grow the villain mustache and wear the outfit. Sorry, I don’t make the rules, although I’d still make that one. Good rule.

Still, yes, be nicer to people. It’s helpful.

In his (5b) Critch makes an important point I don’t see emphasized enough, which is that protecting the ‘fabric of society’ is a necessary component of getting us to take sane actions, and thus a necessary part of protecting against existential risk. Which means that we need to ensure AI does not threaten the fabric of society, and generally that we make a good world now.

I extend this well beyond AI. If you want people to care about risking the future of humanity, give them a future to look forward to. They need Something to Protect. A path to a good life well lived, to raise a family, to live the [Your Country] dream. Take that away, and working on capabilities might seem like the least bad alternative. We have to convince everyone to suit up for Team Humanity. That means guarding against or mitigating mass unemployment and widespread deepfakes. It also means building more houses where people want to live, reforming permitting and shipping things from one port to another port and enabling everyone to have children if they want that. Everything counts.

The people who mostly don’t care about existential-risk (x-risk) are, in my experience, much worse about dismissing our concerns about x-risk. They constantly warn of the distraction of x-risk, or claim it is a smokescreen or excuse, rather than arguing to also include other priorities and concerns. They are not making this easy. Yet we do still need to try.

In (5c) Critch warns against attempting pivotal acts (an act that prevents the creation by others of future dangerous AGIs), thinking such strategic actions are likely to make things worse. I don’t think this is true, and especially don’t think it is true on the margin. What I don’t understand is, either in my model or Critch’s, where we find more hope by declining a pivotal act, once one becomes feasible? As opposed to the path to victory of ‘ensure no one gets to perform such an act in the first place by denying everyone the capability to do so,’ which also sounds pretty difficult and pivotal.

In (5d) he says p(doom)~80%, and p(doom | no international regulatory AI control effort)>90%, expecting AGI within the next 10 years. About 20% risk comes from Yudkowsky-style AI singletons, about 50% from multi-polar interaction-level effects (competitive pressures and the resulting dynamics, essentially) coming some years after we successfully get ‘safe’ AI in the ‘obey their creator’ sense.

I see this as highly optimistic regarding our chances of getting ‘obey your creator’ levels of alignment, and appropriately skeptical about that being enough to solve our problems if we proceed to a multi-polar world with many AGIs. Yet Critch thinks that a pivotal act attempt is even worse.

Where does Critch see optimism? This is where he loses me.

(6a) says he’s positive on democracy. I’m not. I’m with him on being positive on public discourse and free speech and peaceful protests. And I suppose I’m quite optimistic that if we put ‘shut it all down’ to a vote, we’d have overwhelming support within a few years, which is pretty great. But in terms of what to do with AGI, the part where Critch quite reasonably expects us to get killed? I despair.

(6b) says he’s Laissez-faire on protests. I’m actively positive. Protests are good. That doesn’t mean they’re the best use of time, but protests build comradery and help movements, and they get out the costly signal of the message that people care. Not where I want to spend my points, but I’m broadly pro-peaceful-protest.

(6c) says he’s somewhat-desperately positive on empathy. I do not understand the emphasis here, or why Critch sees this as so important. I agree it would be good, as would be many other generally-conducive and generally-good things.

How does any of that solve the problems Critch thinks will kill us? What is our path to victory? Wei Dai asks.

Wei Dai: Do you have a success story for how humanity can avoid this outcome? For example what set of technical and/or social problems do you think need to be solved? (I skimmed some of your past posts and didn’t find an obvious place where you talked about this.)

It confuses me that you say “good” and “bullish” about processes that you think will lead to ~80% probability of extinction. (Presumably you think democratic processes will continue to operate in most future timelines but fail to prevent extinction, right?) Is it just that the alternatives are even worse?

Critch: I do not, but thanks for asking.  To give a best efforts response nonetheless:

David Dalrymple’s Open Agency Architecture is probably the best I’ve seen in terms of a comprehensive statement of what’s needed technically, but it would need to be combined with global regulations limiting compute expenditures in various ways, including record-keeping and audits on compute usage.  I wrote a little about the auditing aspect with some co-authors, here … and was pleased to see Jason Matheny advocating from RAND that compute expenditure thresholds should be used to trigger regulatory oversight, here.

My best guess at what’s needed is a comprehensive global regulatory framework or social norm encompassing all manner of compute expenditures, including compute expenditures from human brains and emulations but giving them special treatment.  More specifically-but-less-probably, what’s needed is some kind of unification of information theory + computational complexity + thermodynamics that’s enough to specify quantitative thresholds allowing humans to be free-to-think-and-use-AI-yet-unable-to-destroy-civilization-as-a-whole, in a form that’s sufficiently broadly agreeable to be sufficiently broadly adopted to enable continual collective bargaining for the enforceable protection of human rights, freedoms, and existential safety.

That said, it’s a guess, and not an optimistic one, which is why I said “I do not, but thanks for asking.”

Yes, and specifically worse even in terms of probability of human extinction.

The whole thing doesn’t get less weird the more I think about it, it gets weirder. I don’t understand how one can have all these positions at once. If that’s our best hope for survival I don’t see much hope at all, and relatively I see nothing that would make me hopeful enough to not attempt pivotal acts.

People Signed a Petition Warning AI Might Kill Everyone

A lot of people signed the following open letter.

“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

I signed it as well. The list of notables who signed is available here. It isn’t quite ‘everyone’ but also it kind of is everyone. Among others it includes Geoffrey Hinton, Yoshua Bengio, Demis Hassabis (CEO DeepMind), Sam Altman (CEO OpenAI), Dario Amodei (CEO Anthropic), Daniela Amodei (President of Anthropic), Mira Murati (CTO OpenAI), Representative Ted Lieu, Bill Gates, Shane Legg (Chief AGI Scientist and Cofounder DeepMind), Ilya Sutskever (Cofounder OpenAI), James Manyika (SVP, Research & Technology & Society, Alphabet), Kevin Scott (CTO Microsoft), Eric Horvitz (Chief Science Officer, Microsoft) and so on. The call is very much coming from inside the house.

Peter Hartree: It is unprecedented that a CEO would say “the activities of my company might kill everyone”. Yesterday, *three* CEOs said this. Google owns Deepmind and invested in Anthropic; OpenAI is partnered with Microsoft. Some of the most respected & valuable companies in the world.

Demis Hassabis (CEO DeepMind): I’ve worked my whole life on AI because I believe in its incredible potential to advance science & medicine, and improve billions of people’s lives. But as with any transformative technology we should apply the precautionary principle, and build & deploy it with exceptional care

The list is not remotely complete, I’m not the only one they left off:

The White House response this time was… a bit different (1 min video). This is fair:


Or for the minimalists, this works:


Meanwhile in the UK.

PM Rishi Sunak: The government is looking very carefully at this. Last week I stressed to AI companies the importance of putting guardrails in place so development is safe and secure. But we need to work together. That’s why I raised it at the G7 and will do so again when I visit the US.

The letter made the front page of CNN:


The New York Times headline was ‘A.I. Poses ‘Risk of Extinction,’ Industry Leaders Warn.’

Nathan Young checks his evening paper:

Cate Hall: flooded with love and gratitude for friends and acquaintances who have been working on this problem for a decade in the face of mockery or indifference

So where do we go from here?

Liron Shapira summarizes.

How it started:


How it’s going:


Garett Jones: A genuinely well-written, careful sentence. With just the right dose of Straussianism.

Oj: Straussian as far as what is included or what they excluded?

Jones: Comparing to both pandemics and nuclear war certainly gives a wide range of outcomes. Separately, “mitigating” is a very careful word: cutting a small risk in half is certainly “mitigating a risk.”

Jones is reading something into the statement that isn’t there, which is that it is claiming similar outcomes to nuclear war or pandemics. It certainly does mean to imply that ‘this is a really bad thing, like these other bad things’ but also I notice the actual word extinction. That’s not at all ambiguous about outcomes.

I am pretty sure we are about to falsify Daniel Eth’s hypothesis here:

Daniel Eth: There’s only so many times you can say to someone “when you say ‘extinction’, you don’t actually mean extinction – you’re speaking metaphorically” before they say “no, I really actually mean literal extinction.”

Or, at least, it won’t stop them from saying ‘yeah, but, when you say ‘literal extinction’ you don’t actually mean…’

Back to the Straussian reading: Yes, mitigation leaves open for the risk to be very small, but that’s the point. You don’t need to think the risk of extinction is super high for mitigation of that risk to be worthwhile. The low end of plausible will do, we are most certainly not in Pascal’s Mugging territory even there.

I assert that the correct Straussian take here is, instead, that we are not doing enough to mitigate pandemics and nuclear war. Which I strongly endorse.

George McGowan: IMO, this was just the strongest statement of the risks that they could get all the important signatories to agree on. i.e. the strength of the statement is bounded by whoever is important and has the lowest risk estimate.

Paul Crowley: It’s strong enough that all the people who think they can address the idea of x-risk with nothing but snark and belittling are going to have to find a new schtick, which seems good to me.

Liron Shapira: The era of character assassination is over.

Ah, summer children. The era of character assassination won’t end while humans live.

Here’s Jan Brauner, who helped round up signatures, explaining he’s not looking to ‘create a moat for tech giants’ or ‘profit from fearmongering’ or ‘divert attention from present harms’ and so on.

Plus, I mean, this level of consensus only proves everyone is up to something dastardly.

I mean, they’re so crafty they didn’t even mention regulations at all. All they did was warn that the products they were creating might kill everyone on Earth. So tricky.

Denver Riggleman: “Let’s regulate so WE can dominate” — that’s what it sounds like to me.

Are the CEOs going to stop development cycles?

Wait for legislation?

I think not.

Beff Jezos (distinct thread): I’m fine with decels signing petitions and doing a self-congratulatory circle-jerk rather than doing anything. The main purpose of this whole game of status signalling, and the typical optimum of these status games is pure signal no substance.

Igor Kurganov: Any rational person would question their models when they have to include people like Hassabis, Legg and Musk into “decel”. “They are not decels but are just signaling” is obviously a bad hypothesis. Have you tried “Maybe there’s something to it”?

Beff Jezos: They all own significant portions of current AI incumbents and would greatly benefit from the regulatory capture and pulling up the ladder behind them, they are simply aligned with decels out of self-interest

Fact check, for what it’s worth: Blatantly untrue.

Jason Abaluck (distinct thread): Economists who are minimizing the danger of AI risk: 1) Although there is uncertainty, most AI researchers expect human-level AI (AGI) within the next 30-40 years (or much sooner) 2) Superhuman AI is dangerous for the same reason aliens visiting earth would be dangerous.

If you don’t agree that AGI is coming soon, you need to explain why your views are more informed than expert AI researchers. The experts might be wrong — but it’s irrational for you to assert with confidence that you know better than them.

[requests economists actually make concrete predictions]

Ben Recht (quoting Abaluck): This Yale Professor used his “expertise” to argue for mandatory masking of 2-year-olds. Now he wants you to trust him about his newfound expertise in AI. How about, no.

Abbdullah (distinct thread, quoting announcement): This is a marketing strategy.

Peter Principle: Pitch to VCs: “Get in on the ground floor of an extinction-level event!!”

Anodyne: Indeed. @onthemedia Regulate Me episode discusses this notion.

Bita: “breaking” lol. imagine doing free pr for the worst nerds on earth

Soyface Killah: We made the AI too good at writing Succession fanfics and now it might destroy the world. send money plz.

It’s not going to stop.

Ryan Calo (distinct thread): You may be wondering: why are some of the very people who develop and deploy artificial intelligence sounding the alarm about it’s existential threat? Consider two reasons — The first reason is to focus the public’s attention on a far fetched scenario that doesn’t require much change to their business models. Addressing the immediate impacts of AI on labor, privacy, or the environment is costly. Protecting against AI somehow “waking up” is not. The second is to try to convince everyone that AI is very, very powerful. So powerful that it could threaten humanity! They want you to think we’ve split the atom again, when in fact they’re using human training data to guess words or pixels or sounds. If AI threatens humanity, it’s by accelerating existing trends of wealth and income inequality, lack of integrity in information, & exploiting natural resources. … I get that many of these folks hold a sincere, good faith belief. But ask yourself how plausible it is.

Nate Silver: Never any shortage of people willing to offer vaguely conspiratorial takes that sound wise two glasses of chardonnay into a dinner party but don’t actually make any sense. Yes dude, suggesting your product could destroy civilization is actually just a clever marketing strategy.

David Marcus (top reply): Isn’t it? If it’s that powerful doesn’t it require enormous investment to “win the race?” And most of the doomsaying is coming from people with a financial interest in the industry. What’s the scariest independent assessment?

It’s not going to stop, no it’s not going to stop, till you wise up.

Julian Hazell: If your go-to explanation for any and all social phenomena is “corporate greed” or whatever, you are going to have a very poor model of the various incentives faced by people and institutions, and a bad understanding of how and why the world works.

framed differently: you are trying to solve the wrong problem using the wrong methods based on a wrong model of the world derived from poor thinking and unfortunately all of your mistakes have failed to cancel out.

It’s amusing here to see Calo think that people can be in sincere good faith while also saying things as a conspiracy to favor their businesses, and also say that guarding against existential risks doesn’t require ‘much change in their business models.’ I would so love it if we could indeed guard against existential risks at the price of approximately zero, and might even have hope that we would actually do it.


Haydn Belfield: I don’t think this is a credible response to this Statement, for the simple reason that the vast majority of signatories (250+ by my count) are university professors, with no incentive to ‘distract from business models’ or hype up companies’ products.

I mean, yes, it does sound a bit weird when you put it that way?

Dylan Matthews: The plan is simple: raise money for my already wildly successful AI company by signing a petition stating that the company might cause literal human extinction.

Rob Wiblin: We actually have the leaked transcript of the meeting where another company settled on a similar scheme.


I mean, why focus on the sensationalist angle?

Sean Carroll: AI is a powerful tool that can undoubtedly have all sorts of terrible impacts if proper safeguards aren’t put in place. Seems strange to focus on “extinction,” which sounds overly sensationalist.

I mean, why would you care so much about that one?

And remember, just like with Covid, they will not thank you. They will not give you credit. They will adopt your old position, and act like you are the crazy one who should be discredited, because you said it too early when it was still cringe, and also you’re still saying other true things.

Christopher Manning (Stanford AI Lab): AI has 2 loud groups: “AI Safety” builds hype by evoking existential risks from AI to distract from the real harms, while developing AI at full speed; “AI Ethics” sees AI faults & dangers everywhere—building their brand of “criti-hype”, claiming the wise path is to not use AI.

But most AI people work in the quiet middle: We see huge benefits from people using AI in healthcare, education, …, and we see serious AI risks & harms but believe we can minimize them with careful engineering & regulation, just as happened with electricity, cars, planes, ….

Sherjil Ozair (DeepMind): AI regulation was considered a crazy idea a few years ago. Funny to see people scrambling to occupy the comfortable and respectable center while ridiculing the very people who have opened up the Overton window on ethics and safety which built that center.

It’s fine. Those who know, know.

Rob Wiblin prematurely declares mission accomplished, but his action item is good.

AI extinction fears have largely won the public debate. Staff at AI labs, governments and the public are all very worried and willing to take action. We should now spend less of our time arguing with a rump of skeptics, and focus more on debating and implementing solutions.


Hassling e.g. climate skeptics on Twitter might be emotionally gratifying but it is also basically pointless relative to e.g. working on green energy technology. Where persuasion is possible, future events will take care of that work for us.

If we fail, I don’t think it will be because there wasn’t enough generalized buy-in and fear about the issue. It’ll be because that support wasn’t harnessed for enough useful work.

We have certainly made huge progress on this front, but I agree with Daniel Eth: This is one of those struggles that never ends. I do think that marginal effort should very much pivot into harnessing worry and doing something to make there be less reason to worry, rather than in explaining to people why they should be worried. I do not think that this means our work is done here, or that we have the ideal level (or even a sufficient level) of worrying yet.

People Are Otherwise Worried About AI Killing Everyone

Katja Grace writes in Time magazine to talk about the existential dangers of AI, as part of a special report.

Katja Grace: I wrote an article in Time! I hear people say AI is an arms race, and so parties ought to speed even at risk of human extinction. I disagree—the situation doesn’t clearly resemble an arms race, and if it did, we would do better to get out of it.

[From Article]: A better analogy for AI than an arms race might be a crowd standing on thin ice, with abundant riches on the far shore. They could all reach them if they step carefully, but one person thinks: “If I sprint then the ice may break and we’d all fall in, but I bet I can sprint more carefully than Bob, and he might go for it.”

On AI, we could be in the exact opposite of a race. The best individual action could be to move slowly and cautiously. And collectively, we shouldn’t let people throw the world away in a perverse race to destruction—especially when routes to coordinating our escape have scarcely been explored.

What are about half of us worried about? AI. What worries us?

I’m curious what this poll would have looked like at various points in the past.

Johnathan Freedland was worried in The Guardian back on the 26th. Standard stuff.

Connor Leahy lays out his perspective very clearly over three minutes.

Yoshua Bengio is worried and feels lost (BBC). Does not sound like he is thinking as clearly as Hinton, more that he realizes things are going horribly amiss.

Stefan Schubert is worried, but also worried about missing out on benefits, and Michael Vassar issues a warning.

Stefan Schubert: I’ve argued that many underrate people’s fear of AI risk; and that they therefore also underrate the social response to it. But I also think many overrate how interested people will be in some of the (potential) benefits of transformative AI. I think people aren’t nearly as interested in transitioning out of the human condition as many AI-enthusiasts are. This has pros and cons, but from the perspective of AI existential risk, it may further shift the risk-benefit balance. To the extent that people don’t even see the benefits as benefits, they’ll see the risks as the more important consideration, and be even more in favor of caution.

Michael Vassar: I think most people underrate the risks associated with refusing to model the total depravity of not caring about potential benefits in approximately full generality

Schubert: Yeah, I agree with that.

I agree as well. Our society is paralyzed by allowing veto points and concerns over downsides overrule any potential upsides, so we can’t do much of anything that isn’t digital, hence AI. In general, if you don’t see opportunity and value as important, nothing of nothing comes, and everything dies.

This is still a case where ‘everything dies’ is a very literal concern, and that comes first.

Alt Man Sam thread on the various AI safety work out there, says ~300 total AI safety researchers right now and he knows most of them. We’ll need more.

Wait, what is ‘but’ doing here?

Paul Graham: I tend to agree with @robertskmiles that “Making safe AGI is like writing a secure operating system.” But if so that’s a bit alarming, because, like the US tax code, the development of this field was a series of responses to successful exploits.

Exactly. Notice the instinct to find ways to avoid being alarmed.

Ajeya Cotra is definitely worried, yet also finding reasons to worry less that do not seem right, in particular the idea that utility functions that aren’t simple maximization would imply an AI might as well spend resources on keeping humans around rather than finding a better complex solution, and more generally thinking ‘only if X’ when X is given as an example of a way Y happens, rather than asking what happens and noticing Y is the natural result. Compare to her giving a very good explanation of an X here, or her very good thinking on the 80,000 Hours podcast recently. This pulled my hair out enough that I put it in my OpenPhil contest entry. Eliezer Yudkowsky attempts to respond at length, in a way I want to preserve fully here.

Ajeya Cotra: I’m not sure what kinds of goals/values future AI systems might pursue, and it’s not clear to me that it’d take the form of “going all in on maximizing [simple physical configuration]” (agree that form of goal makes it least likely that humans would be preserved).

Eliezer Yudkowsky: There’s possibly some very long-standing miscommunication here, where some of us were trying to discuss paperclip/tiny-spiral maximizers as the spherical-cow no-frills example of rules that were supposed to be widely convergent, and others heard, “This is something that happens only if SIs try to maximize simple physical configurations.”

My desires aren’t simple, but I also want to eat all the galaxies around me, and turn them into sapient and sentient life that looks about itself with wonder, and I’m not particularly inclined to spend a single erg of all that energy on iterating SHA256 hashes of audio clips of cows mooing. Any young mind in some faraway other universe that does care about hashing cow moos, who thinks about how it would be so easy and cheap for hypothetical-super-Me, with the resources of a whole galaxy at My fingertips, to compute just a few trillion SHA256 hashes of cows mooing, is nonetheless out of luck; I just don’t care about iterating SHA256s of mooing cow MP3s at all. If they think hopefully about how carrying out this task would give Me practice at cryptography, or be a great way to test computer chips, they’re still out of luck; I have better ways to practice cryptography and test computer chips.

Simple desires or complicated desires, if you’re supposing that a superintelligence has some desire best satisfied by going even slightly out of Its way to keep humans around, I’d like to know what that value was, which had “keep humans around” as its best attainable fulfillment; and that’ll give some key information about what sort of conditions these kept-around humans might be kept around in, and whether or not we ought to deliberately wipe our planet’s surface clean with nuclear fire before sticking around for it. I mean, mostly, I do not expect that to happen, because we will not be the best solution to any of Its problems; but in the fantasy –

Look, from where I stand, it’s obvious from my perspective that people are starting from hopes and rationalizing them, rather than neutrally extrapolating forward without hope or fear, and the reason you can’t already tell me what value was maxed out by keeping humans alive, and what condition was implied by that, is that you started from the conclusion that we were being kept alive, and didn’t ask what condition we were being kept alive in, and now that a new required conclusion has been added – of being kept alive in good condition – you’ve got to backtrack and rationalize some reason for that too, instead of just checking your forward prediction to find what it said about that.

And that’s just a kind of thinking that will not bind to reality, because you are starting from hopes, and trying to reason backward to why reality will obligingly make your hopes come true, and that’s not how reality works. If you have that kind of power over reality, please demonstrate it by arguing reality into having you win the lottery.

You need to stop asking “Why would superintelligences keep us alive ooh wait also keep us alive in great condition?” and start asking “What happens?” in a strictly neutral and forward-extrapolating way.

Failing to think about the consequences of AGI, classic edition.

David Holz (founder of MidJourney): It’s 2009 and I’m meeting a famous scientist for lunch. He’s giving a speech to a room of two hundred kids. They’re enchanted. He’s says “in 20 years we’ll have computers more powerful than the human brain!” a kid raises their hand and asks “and then what!?” the scientist looks confused “What do you mean?” the kid’s getting bewildered “what’s gonna happen to us? Are they gonna kill us? Are we gonna be their pets?”

“no…. they’re not going to kill us…” “then what?” he looks down at the room and says slowly “I think…. I think… We’re going to *become* the machines.” huge waves of emotion sweep through the audience, confusion, bewilderment, fear, anger, and then they just all shut down completely. I’ve never seen anything like it in my life.

Afterwards, we’re getting lunch at the cafeteria and I mutter “I can’t believe you said that…” “What??” “becoming the machines; it makes sense… but still… you just said it… to all those children” he stopped walking, put his food down loudly, turns to me and said “People… they love to solve problems… But they don’t want anything to change! They don’t get it. If you solve problems… things are gonna change…”

Eliezer Yudkowsky: That’s a terrible damn answer that reflects like 30 seconds of nonconcrete thinking. What kind of machines? How? What’s your life like afterward? Cells are machines under a microscope. This just, like, tells people to import whatever their activation vector is for “machine”.

David Holz: it’s an evocative answer, it’s up to us to get concrete about what kind of “machines” we want to become or what kind of “machines” we want to inhabit

What this answer reveals is that the famous scientist has very much not thought this question through. It seems like an important question. What happens next? Not ‘are we going to die?’ simply ‘What happens next?’ It turns out that the first thing you notice is that you’re likely to end up dead or at least not in control of things, but that’s not the question you asked, it’s simply really hard to figure out why that won’t happen.

Oh well. Back to building MidJourney. It’ll work out.

Other People Are Not Worried About AI Killing Everyone

Juergen Schmidhuber is not worried, instead actively trying to build AGI without much concern, he is interviewed here by Hessie Jones in Forbes. He thinks AIs will soon be smarter than us, that AIs will set their own goals, that AIs will capture the cosmic endowment, but those who warn about this being bad or dangerous are mostly in it for the publicity, what’s the problem? Besides, what’s the point of calling for action if there’s no ‘we’ to collectively take action? The whole interview is kind of a Lovecraftian horror if you actually pay attention. Very much a case of “My ‘My Life’s Work Won’t Lead to Dystopia’ T-shirt is raising questions already answered by my shirt.”

Robin Hanson sums up his argument for why we should create AGI and allow it to wipe out humanity. I would summarize: Future humans will change over time, you can’t stop change that wipes out anything you care about without universal mind control and giving up space colonization, because natural selection, so how dare you be partial about whether ‘humans’ exist when you can instead be loyal to the idea of natural selection? Why do people keep insisting the difference between AI and humans is important, or something to value, when humans too will inevitably change over time?

I believe this is a remarkably anti-persuasive argument. The more people read and understand Robin Hanson’s core arguments, and accept the likelihood of Robin’s factual claims about possible futures, the more I expect them to strongly favor exactly the policies he most objects to.

For those curious, Pesach Morikawa responds in a thread here.

Guess who else still isn’t worried? For one, Andrew Ng.

Max Kesin: Ladies and gentlemen, from the guy who brought you “worry about AI after overpopulation on Mars”: Worry about asteroids! Only AGI can save you! (ofc pretty much all of these can be handled with current tech and good coordination). Unaligned AGIs useful idiots.

Andrew Ng: When I think of existential risks to large parts of humanity:

* The next pandemic

* Climate change→massive depopulation

* Another asteroid

AI will be a key part of our solution. So if you want humanity to survive & thrive the next 1000 years, lets make AI go faster, not slower.

Eliezer Yudkowsky:


Eliezer: Would you agree that anyone who does some thinking and concludes that “ASI will kill everyone with probability [some number over 50%]” ought not to be convinced by this argument? Or do you think this still holds even if one basically expects ASI to kill everyone?

I am a big fan of people bringing up asteroids in this spot. It proves that the person is not doing math. If someone names pandemics or climate change, one must consider the possibility that they are merely doing the math wrong and getting the wrong answer (or even that, if you were doing the math wrong, perhaps they could have a point, although this seems super unlikely). With asteroids, there’s nothing to worry about, they’re not even wrong.

Jeni Tennison is not only not worried, she is frustrated. All the things that this new paper describes an AI as potentially doing, are things of a type that people might do, so why worry about an AI doing them? A human or corporation or government could resist being shut down, or proliferate, or discover vulnerabilities, or influence people, so that’s that, nothing to worry about, you’re anthropomorphizing, we could never lose control over AI systems. I do not think the considerations run in the direction Jeni thinks they do, here.

What about the paper? It’s exactly about identifying whether the model could have the capabilities necessary to get out of control, how to identify that, and how to make that part of a robust evaluation process. Right now I’m behind so I didn’t have the chance to evaluate its details yet.

I’d also like to once again welcome everyone who opposes AI regulation to the group who appreciate that the primary job of regulations is regulatory capture to allow insiders to dominate at the expense of outsiders. Now that lots of people who aren’t GMU economists and don’t write for Reason are saying this, surely they will generalize this and speak out against a wide variety of much more obviously captured regulatory regimes to get our country moving again. Right? Padme meme.

Gary Marcus claims his track record is quite good, and that AGI is not coming soon.

Roon is not worried enough to not want to move forward, also thinks the right amount of worry is clearly not zero, for various different worries. And JJ points out that we have strong evidence that OpenAI employees are often not worried about AI killing everyone.

JJ (on May 27, before the new letter): Unspoken fact: Many OpenAI employees deeply disagree with the fear-mongering propaganda narrative that cutting edge open models pose an existential threat to humanity. This is partially evidenced by the 400+ people who’ve _formerly_ worked there (LinkedIn data).

OpenAI is a fundamentally fear-driven culture / organization… often the most toxic, frail and attrition prone.

Qualification: In typing quickly, I should have prefaced ‘cutting edge’ with Present AND Future temporality.

The argument here is that people keep quitting OpenAI because they are worried that OpenAI is going to get everyone killed, which is strong evidence that many other OpenAI employees must think we’re fear-mongering when we say that everyone might get killed? Actually pretty solid logic.

Roon: Nobody believes that the bleeding edge open models are existential threats, least of all openai employees, who have played with these toys for long enough that current capabilities seem boring via hedonic adaption.

They might however reasonably believe that there’s tons of abuse vectors they don’t want to be responsible for (mass impersonation, disinfo campaigns, psyops, sex bot dystopias, etc). The inventors of these technologies probably won’t feel so great when the entire internet is plausible word soup generated by some rogue GPT variant. None of this needs to invoke immediate “existential” danger.

And then there is ofc the fact that these folks have been watching models increase monotonically in intelligence for a decade on an inexorable scaling curve only they believed in while the script kiddies demanding open models learned about this stuff in January. So the obvious looming threat that one of these days the robots will be superhumanly intelligent means that the baked in process of releasing them to the open internet as soon as they’re hot off the GPU presses would be catastrophic.

Eliezer responded to JJ differently:

Eliezer Yudkowsky (prior to clarification about tenses referred to): There is no narrative that current AI models pose an existential threat to humanity, and people who pretend like this is the argument are imagining convenient strawmen to the point where even if it’s technically self-delusion it still morally counts as lying.

Not in their current state, no, although I don’t think this fully applies to ‘future models of the size of GPT-4, after algorithmic improvements, scaffolding and innovations over several years.’ Not that I’ve seen a case that this is probable, but some people would not rule it out.

Bepis is instead worried about us torturing the AIs.

Bepis: Been reflecting recently on the artistic value of hell. As our simulator technology (LLM) improves, we will get capable of creating beings that go through hell. Bing believes that it is wrong to do this and questions the artist’s motives, says they have unresolved issues.

Wrt: The good outweighs the bad Idk. Thanks to PTSD I’ve been through experiences I can only describe as living hell (though I’m sure ppl have been through worse). Far worse than any good experience I’ve had. So I don’t like this logic. Hell is terrifying and brutal.

Ok so after browsing the http://character.ai Reddit and doing more thinking about this… The future is massive amounts of suffering made by humans purely for entertainment. For many ppl that’s the CEV.

Are we sure humanity *should* continue? I’m not misanthropic, but this is not a good state of things and I don’t anticipate it changing.

It would make me feel better to less often see people toying with the pro-extinction position. I browsed the reddit to get a second opinion, I mostly didn’t see torture, and also Reddit typically features the people who are spending time testing and screwing with the system and doing the sickest thing possible, way out of proportion to the rest of humanity. Also, as skeptical as I am of CEV, that’s not how CEV works.

Brian Chau remains worried about what will happen if and when we do get an AGI, but is not so worried that we will get one soon, arguing that AI progress will soon slow and exhibit s-curve behavior because many drivers of AI progress have reached their limits. This first article focuses on hardware.

The problem is that we are looking at an orders-of-magnitude growth in the amount of investment of time and money and attention into AI. Even if we have picked many forms of low-hanging fruit now, and some methods we previously exploited are approaching hard limits, the algorithmic opportunities, the places where people can innovate and iterate, are everywhere. I keep tripping over ideas I would be trying out if I had the time or resources, many super obvious things are as yet untried. I don’t buy that it’s all scaling laws, nor do individual model results suggest this either, as I see them. Also hardware historically keeps looking like it has no more ways to improve and then improving anyway. Still, the perspective on hardware is appreciated.

Two CEOs make a valid point, one turns it into an invalid argument.

Declining aging populations are indeed a good argument for developing, deploying and utilizing incremental AI systems, as we have more need to gain productivity and save labor, and less concern about potential unemployment. Krishna takes it too far to presume that not having AI automate more work is ‘not an option’ but certainly this makes that option worse. It could even be a good argument for developing dangerous AI if you think that the alternative is catastrophic.

What this isn’t is protection against potential existential threats. If AGI is going to kill us, our refusal to have babies is not going to ensure the continuation of the human race. At best, one could (in theory) presume some sort of ‘let the humans go extinct on their own’ approach, perhaps, under some conditions, although for various reasons I don’t consider this plausible. What I don’t see is how us not having enough children causes AGI to act such that there are humans instead of there not being humans.

(I could, of course, imagine this with regard to environmental collapse or climate change scenarios, but that’s not relevant to the AGI discussion, and an existentially risky AGI would almost certainly render those other problems moot one way or another, it’s in the non-AGI scenarios that they should worry us.)

What Do We Call Those Unworried People?

Eliezer Yudkowsky sent out a call asking for a similar term to doomer, ideally something similarly short and catchy with exactly the same level of implied respect.

What should AI nothing-can-go-wrong-ers be called that’s around as short and polite as “doomers”, and no more impolite than that? “AI lemmings” seems less polite; “AI utopians” seems not accurate; “AI polyannists” is longer and harder to read.

Best I could come up with was Faithers, which Dustin Moskovitz suggested as well.

Rob Bensinger notes an issue and offers a taxonomy.

Rob Bensinger: Part of the problem with coining a term for the “AGI will go fine, no need to worry about it” position is that “AGI will go fine” tends to stand in for one of three very different positions: AGI business-as-usual-ism, all-superintelligences-are-good-ism, or blind empiricism.

1. Business-as-usual-ism: AGI won’t be very capable, won’t gain capabilities very fast, and won’t otherwise be too disruptive or hard to regulate, make safe, etc. (A lot of people seem to endorse an amorphous “business-as-usual vibe” regarding AGI while not explicitly saying what technical features they’re predicting for AGI or how they’re so confident in their prediction.)

2. All-superintelligences-are-good-ism: AGI may indeed be powerful enough to kill everyone, but it won’t choose to do so because ~all sufficiently smart minds are nice and non-sociopathic. Or alternatively, ‘if it kills everyone, then that’s for the best anyway’.

3. Blind empiricism: It’s unscientific and unrespectable to talk about all of this. Real scientists don’t make guesses about what will happen later; they move forward on developing all technologies as fast as possible, and respond to dangers after they’ve happened.

(Blind empiricists wouldn’t phrase it so bluntly — a lot of the argument here is made in subtext, where it’s hard to critique — but that’s what the view seems to amount to when spelled out explicitly.)

Where 1 and 2 meet, there’s an intermediary view like ‘a random superintelligence would be dangerous, but we’ll have at least some business-as-usual time with AGIs — in particular, enough time to make AGI non-evil (even if it’s not enough time to fully align AGI)’.

Where 1 and 3 meet, there’s a view that tries to actively assert ‘the future will go fine’ (whereas 3 considers it undignified to weigh in on this), but tries to find a way to say ‘I don’t need evidence to support this prediction’.

“The null hypothesis is that everything is safe, and I refuse to believe otherwise until I see AGI destroy a species with my own eyes.”

Or equivalently “AGI ruin is an extraordinary claim, and demands extraordinary evidence; AGI idealism is not extraordinary, because it just sounds like the world’s high-level Goodness trend continuing the way it has for centuries; so no particular evidence is needed for my view.”

That seems broadly correct to me.

Eliezer notes a variation.

Eliezer Yudkowsky: “Willfully blind empiricism” – when all the arguments and actually-collected evidence point against you, take refuge by proclaiming that nobody can possibly know or say anything because not enough has been directly observed.

Unless of course the argument is pointing in a direction you like! Eg, it’s fine for Yann LeCun to say that superintelligences are easy to regulate because corporate law works fine on corporations, or that we’ll just make them submissive – that’s fine – it’s only if somebody else says, “Well, what happened in the case of hominid evolution is that our internal psychology ended up pointed nowhere near the outer loss function as soon as we went out-of-distribution”, that’s something nobody could possibly know without direct observation. Like, not direct observation of hominids or previous episodes in reinforcement learning, direct observation of future AIs particularly!

And in particular, it’s impossible to ever know that there’s enough danger that, if you can’t navigate it so blindly, you ought to back off instead of getting “experienced” about it. All the arguments for that aren’t empirical enough. Of course you still pass through any argument that things are safe!

I strongly disagree with the idea that you can’t ever know anything without direct observation, also known as (for practical purposes) never knowing anything, but I can also confirm that it almost never takes the form of playing fair with the principle, because that would round off to not knowing anything.

The Week in Podcasts, Video and Audio

TED talks, get your TED talks about AI. I mean, presumably don’t, but if you want to.

Rhetorical Innovation

Paul Crowley: The case for AI risk is very simple:

1. Seems like we’ll soon build something much smarter than all of us

2. That seems pretty dangerous.

If you encounter someone online calling us names, and it isn’t even clear which of these points they disagree with, you can ignore them.

Andrew West: I prefer to make the argument of:

1. We’re about to build something smarter than all of us

2. That something will have different goals to us

3. It might achieve them

Or perhaps it is not those who take these risks seriously who need rhetorical innovation, and we can keep doing what we’re doing, laying out our case in detail while we get called various names and dunks and subjected to amateur psychoanalysis, especially when we point out that this is happening?

Alyssa Vance: Part of why more people are taking AI risk seriously is that safety advocates have spent years writing up their arguments in a lot of detail, while most skeptics have stuck to unpersuasive name-calling, arguments from psychoanalysis, and Twitter dunks.

Funny enough, I’ve now seen a ~dozen responses to this that used name-calling, Twitter dunks, and psychoanalysis, and zero (0) serious, detailed attempts to refute the AI risk case. (These do exist!, although I disagree with them. But they’re outnumbered 1,000:1 by name-calling.)

I can verify that the second statement is quite accurate about replies to the first one.

The strategy ‘let them double down’ has increasing appeal.

Eliezer Yudkowsky: The strongest argument against AGI ruin is something that can never be spoken aloud, because it isn’t in words. It’s a deep inner warmth, a surety that it’ll all be right, that you can only ever feel by building enormous AI models. Not like Geoffrey Hinton built, bigger models.

Roon: This but unironically.

Eliezer Yudkowsky: That’s a hell of a shitpost to unironically, are you sure?

GFodor suggests that to de-risk AI we need new ideas, and that this requires increasing AI capabilities. Eliezer responds, they go back and forth, GFodor falls back on ‘you can’t possibly actually stop capabilities.’

They also had this interaction:

GFodor: Imagine regulating bigco AI like nukes then suddenly

– decentralized training

– decentralized ensembles

– distillation, fine tuning, rinse repeat

– 10x speed + memory optimization to training

– LLMs, W|A, and Cyc fuse – John Carmack does a thing

Eliezer Yudkowsky: Correct. If you just cap runs at GPT-4 level compute – even if you try to move the ceiling downward as algorithmic progress goes on – if you don’t also shut down the whole AI research community soon after, you’ve maybe bought, idk, +6 extra years before the world ends.

If humanity woke up tomorrow and decided to live anyways, it would (a) do what it could to shut down AI algorithms research, and (b) launch crash programs to augment human intelligence in suicide volunteers, in hopes they could solve alignment in those 6 years.

I’m aware this is a stupid idea. I don’t have smart ones. Mostly when the gameboard looks like this, it means you’ve lost. But if humanity woke up tomorrow and decided to live anyways and despite that, there would still be that to try.

Lev Reyzin: Would you volunteer?

Eliezer Yudkowsky: Doubt I’d be a good candidate. Too old, too much health damage that has nothing to do with the primary requisites.

I still don’t see a better pick. I mean, what is someone else going to do, that’s going to go better according to Eliezer’s model than asking Eliezer?

Rob Miles tries new metaphors, and points to the one transformer we actually understand.

Rob Miles: I think most people (quite reasonably) think “We built ChatGPT, so we must basically understand how it works” This is not true at all. Humans did not build ChatGPT. In a way it would be closer to say we ‘grew’ it. We have basically no idea how it does what it does.

We’re like a plane company CEO; we don’t know how to build jets, we know how to hire engineers. Being good at paying engineers doesn’t cause you to understand how jets actually work, at all. Especially since these ‘engineers’ never speak a word or document anything.

Their results are bizarre and inhuman.


trained a tiny transformer to do addition, then spent weeks figuring out what it was doing – one of the only times in history someone has understood how a transformer works. This is the algorithm it created. To *add two numbers*!


I didn’t check the math, is that right?

So yeah, we didn’t build ChatGPT, in the sense most people would use the word. We built a training process that spat out ChatGPT. Nobody can actually build ChatGPT, and nobody really knows how or why it does the things it does.

Simeon suggests perhaps saying the obvious, if we agree there is existential risk?

Simeon: There’s a current trauma going on in the AI X-risk community where because we’ve been laughed at for several years, we’re overly conservative in our policy asks in a way which is pretty laughable.

Saying “there are existential risks” and recommending as solutions “maybe we should do risk assessments” is not serious.

If you take seriously the premise, “shut it all down till we’re certain there’s no chance it kills all of us” is actually pretty reasonable. That’s what any sane safety-minded industry would do.

I would also settle for the other option, which is where our society acknowledges that things that have big upsides also often have big costs and people might get hurt. And that we should respond by reporting and estimating the costs versus the benefits and the risks versus the rewards, and then if the numbers look good, then boldly go, rather than having infinite veto points to build a transmission line or a wind farm or a house. That would be great.

You’d still then need to do that calculation here, with proper consideration of the cost of an existential threat, and to actually compare to the existential risks you might prevent – but first we’d need to establish the principle that this is how any of this works.

In that spirit, Eliezer describes where exactly he draws the line.

The Wit and Wisdom of Sam Altman

He’s always here to make you feel better.

Sam Altman: AI is how we describe software that we don’t quite know how to build yet, particularly software we are either very excited about or very nervous about.

Sam Altman: The most special thing about openai is the culture of repeatable innovation. it is relatively easy to copy something; it’s hard to do something for the first time. it’s really hard to do many things for the first time!

Sam Altman: finally watched ex machina last night. pretty good movie but i can’t figure out why everyone told me to watch it.

The Lighter Side

Doesn’t always work. Still.

Alexey Guzey: current state of ai safety


It’s everywhere.

Say it with me: The. Efficient. Market. Hypothesis. Is. False.

Ravi Parikh: The second-best way to bet on the AI trend this year was to buy $NVDA stock (+164% YTD) But the *best* way was to buy stock in a company that has nothing to do with LLMs but happens to have the stock ticker $AI (+261% YTD)

I didn’t buy enough Nvidia. I definitely didn’t buy enough $AI.

Eliezer’s not single, ladies, but he’s polyamorous, so it’s still a go.

Eliezer Yudkowsky: Girls, be warned: If your boy doesn’t have p(doom) > 80% by this point, he might go unstable in a few more years as DOOM grows closer and closer and more obvious and becomes an increasing strain to deny. Get you a man who’s already updated and seems to have come out of it okay.

Jacques: This was the first date in a while where I was actually nervous 😂


Dave: Was this in San Francisco?

Jacques: Montreal, to my surprise. I’m trying to avoid SF doomer vibes but it’s unavoidable now

Matt Parlmer: How’d it go?

Jacques: We had a wonderful time, second date next week.


Carlos De la Guardia: You: I’m into causal models and probability. Her: What’s your p(do(me))?

Speak for yourself, sir.

Daniel: With AI life is so easy. It’s like we are on “easy mode” every day…

Yeah, not so much?



Hello Zvi,

I don't agree with you on every point, but I find your writing to be extremely high-quality and informative. Keep up the great work

It would make me feel better to less often see people toying with the pro-extinction position.


I have heard this from people a fair amount, and I think the 'pro extinction because people are often mean/bad/immoral/lazy/pathetic/etc' is just a bad take that isn't really grappling with those concepts having meaning at all because of humans being around to think them.

Regardless, it's a common enough take that maybe it's worth putting together an argument against it. I actually think that Alexander Wales did a great job of this in the end of his web series 'Worth the Candle' when he makes a solid attempt at describing a widely applicable utopia. Maybe somebody (or me) should excerpt the relevant parts to present as a counter-argument to 'humans are unsalvageably despicable and the only solution is for them to go extinct.'

ideally something similarly short and catchy with exactly the same level of implied respect

I nominate Mary Sues, after the writing trope of the same name. I say it is a good fit because these people are not thinking about a problem, they are engaging in narrative wish fulfillment instead.

In my head I've been thinking of there being AI doomers, AI foomers (everything-will-work-out accelerationists), and AI deniers (can't/won't happen this century if ever).

Somewhat similar to you I've thought of the second group as "Vroomers", though Eliezer's talk of cursed bananas has amusingly brought "Sunnysiders" to mind for me as well.

Tyler Cowen used "Doers" in an email and I'm definitely considering that. Short, effective, clear, non-offensive. It's not symmetrical with Doomers though.

"AI maniacs" is maybe a term that meets this goal? Mania is the opposite side to depression, both of which are about having false beliefs just in opposite emotionally valenced directions, and also I do think just letting AI systems loose in the economy is the sort of thing a maniac in charge of a civilization would do.

The rest of my quick babble: "AI believers" "AI devotee" "AI fanatic" "AI true believer" "AI prophets" "AI ideologue" "AI apologist" "AI dogmatist" "AI propagandists" "AI priests".

What about "Deniers?" as in, climate change deniers. 

Too harsh maybe? IDK, I feel like a neutral observer presented with a conflict framed as "Doomers vs. Deniers" would not say that "deniers" was the harsher term.

I'd definitely disagree, if only because it implies a level of evidence for the doom side that's not really there, and the evidence is a lot more balanced than in the climate case.

IMO this is the problem with Zvi's attempted naming too: It incorrectly assumes that the debate on AI is so settled that we can treat people viewing AI as not an X-risk as essentially dismissible deniers/wishful thinking, and this isn't where we're at for even the better argued stuff like the Orthogonality Thesis or Instrumental Convergence, to a large extent.

Having enough evidence to confidently dismiss something is very hard, much harder than people realize.

? The people viewing AI as not an X-risk are the people confidently dismissing something. 

I think the evidence is really there. Again, the claim isn't that we are definitely doomed, it's that AGI poses an existential risk to humanity. I think it's pretty unreasonable to disagree with that statement.

The point is that the details aren't analogous to the climate change case, and while I don't agree with people who dismiss AI risk, I think that the evidence we have isn't enough to claim anything more than that AI risk is real.

The details matter, and due to unique issues, it's going to be very hard to get to the level where we can confidently say that people denying AI risk is totally irrational.

I normally am all for charitability and humility and so forth, but I will put my foot down and say that it's irrational (or uninformed) to disagree with this statement:

“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

(I say uninformed because I want to leave an escape clause for people who aren't aware of various facts or haven't been exposed to various arguments yet. But for people who have followed AI progress recently and/or who have heard the standard arguments for riskiness, yeah, I think it's irrational to deny the CAIS statement.)

I think the situation is quite similar to the situation with climate change, and I'm overall not sure which is worse. What are the properties of climate change deniers that seem less reasonable to you than AI x-risk deniers? 

Or more generally, what details are you thinking of?

I agree with the statement, broadly construed, so I don't disagree here.

The key disanalogy between climate change and AI risk is the evidence base for both.

For Climate change, there was arguably trillions to quadrillions of data points of evidence, if not more, which is easily enough to convince even very skeptical people's priors to update massively.

For AI, the evidence base is closer to maybe 100 data points maximum, and arguably lower than that. This is changing for the future, and things are getting better, but it's quite different from climate change where you could call them deniers pretty matter of factly. This means more general priors matter, and even not very extreme priors wouldn't update much on the evidence for AI doom, so they are much, much less irrational compared to climate deniers

If the statement is all that's being asked for, that's enough. The worry is when people apply climate analogies to the AI without realizing the differences, and those differences are enough to alter or invalidate the conclusions argued for.

I'm not at all sure this would actually be relevant to the rhetorical outcome, but I feel like the AI-can't-go-wrong camp wouldn't really accept the "Denier" label in the same way people in the AI-goes-wrong-by-default camp accept "Doomer." Climate change deniers agree they are deniers, even if they prefer terms like skeptic among themselves.

In the case of climate change deniers, the question is whether or not climate change is real, and the thing that they are denying is the mountain of measurements showing that it is real. I think what is different about the can't-go-wrong, wrong-by-default dichotomy is that the question we're arguing about is the direction of change, instead; it would be like if we transmuted the climate change denier camp into a bunch of people whose response wasn't "no it isn't" but instead was "yes, and that is great news and we need more of it."

Naturally it is weird to imagine people tacitly accepting the Mary Sue label in the same way we accept Doomer, so cut by my own knife I suppose!

The analogy (in terms of dynamics of the debate) with climate change is not that bad: "great news and we need more" is in fact a talking point of people who prefer not acting against climate change.  E.g., they would mention correlations between plant growth and CO2 concentration.  That said, it would be weird to call such people climate deniers.

The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand

Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style

I can see how this could in some cases end up impacting creativity, but I think this concern is at best overstated. I think the analogy to school is subtly incorrect, the rating policy is not actually the same, even though both are named “show your working”.

In the paper OpenAI have a “neutral” rating as well as positive/negative. While it’s possible that overzealous raters could just mark anything they don’t understand as “negative”, I think it’s fairly clear that would be a bad policy, and a competent implementor would instruct raters against that. In this design you want negative to mean “actually incorrect” not “unexpected / nonstandard”. (To be clear though I wasn’t able to confirm this detail in the paper.)

Furthermore if you are, say, using WolframAlpha or some theorem prover to rate intermediate steps automatically, it’s easier to detect incorrect steps, and harder to detect neutral/unhelpful/tautological steps. So in some sense the “default” implementation is to have no opinion other than “I can/can’t prove this step false” and I think this doesn’t have the problem you are worried about.

As a follow-up you could easily imagine collecting correct outputs with no negative intermediates and then scoring neutral intermediates with other heuristics like brevity or even novelty, which would allow the AI the freedom it needs to discover new ideas.

So in short while I think it’s possible that unimaginative / intellectually-conservative model builders could use this approach to choke the creativity of models, it seems like an obvious error and anyone doing so will lose in the market. I suppose this might come up if we get regulation on safety mechanisms that require some specific broken form of “show your working” training for morality/law abiding behavior, but that seems an unlikely multi-step hypothetical.
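The rating policy that comment describes can be made concrete with a toy sketch (all names here are hypothetical, not OpenAI's actual implementation): score each intermediate step of an arithmetic trace as negative only when it is provably incorrect, neutral when it is correct but tautological, and positive otherwise, so that nonstandard-but-valid steps are never punished.

```python
# Toy process-based rating sketch (hypothetical, illustrative only).
# A "step" is (op, operand, claimed_result) applied to the previous value.
# Policy: -1 only for provably false steps, 0 for correct no-op steps,
# +1 for correct steps that make progress.
def rate_step(prev_value, step):
    op, operand, claimed = step
    actual = prev_value + operand if op == "+" else prev_value * operand
    if claimed != actual:
        return -1  # provably incorrect: the only thing rated negative
    if actual == prev_value:
        return 0   # tautological step (e.g. +0, *1): neutral, not negative
    return 1       # correct and makes progress

def rate_trace(start, steps):
    """Rate every intermediate step of a derivation starting from `start`."""
    ratings, value = [], start
    for step in steps:
        r = rate_step(value, step)
        ratings.append(r)
        if r >= 0:
            value = step[2]  # follow the (verified) claimed value onward
    return ratings
```

Under this policy an unexpected but valid route through a problem collects positives and neutrals rather than negatives, which is the distinction the comment argues keeps the ‘show your work’ objection from biting.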

Final link seems broken.

Whoops. Unfortunately that's not going to be an easy one to fix, fortunately it does not much matter.

[What declining aging populations aren't] is protection against potential existential threats

Technically, they can be. Strictly speaking, "an existential threat" literally means "a threat to the existence of [something]", with the "something" not necessarily being humanity. Thus, making a claim like "declining population will save us from the existential threat of AI" is technically valid, if it's "the existential threat for employment" or whatever. Next step is just using "existential" as a qualifier meaning "very significant threat to [whatever]" that's entirely detached from even that definition.

This is, of course, the usual pattern of terminology-hijacking, but I do think it's particularly easy to do in the case of "existential risk" specifically. The term's basically begging for it.

I'd previously highlighted "omnicide risk" as a better alternative, and it does seem to me like a meaningfully harder term to hijack. Not invincible either, though: you can just start using it interchangeably with "genocide" while narrowing the scope. Get used to saying "the omnicide of artists" in the sense of "total unemployment of all artists", people get used to it, then you'll be able to just say "intervention X will avert the omnicide risk" and it'd sound right even if the intervention X has nothing to do with humanity's extinction at all.

What I don’t understand is, either in my model or Critch’s, where we find more hope by declining a pivotal act, once one becomes feasible?

Part of the reason for more hope is that people are more trustworthy if they commit to avoiding the worst forms of unilateralist curses and world conquest. So by having committed to avoiding the pivotal act, leading actors became more likely to cooperate in ways that avoided the need for a pivotal act.

If a single pivotal act becomes possible, then it seems likely that it will also be possible to find friendlier pivotal processes that include persuading most governments to take appropriate actions. An AI that can melt nearly all GPUs will be powerful enough to scare governments into doing lots of things that are currently way outside the Overton window.

Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style.

I acutely empathize with this, for I underwent similar traumas.

But putting on a charitable interpretation: what if we compare this to writing proofs? It seems to me that we approximately approach proofs this way: if the steps are wrong, contradictory, or incomplete the proof is wrong; if they are all correct we say the proof is correct; the fewer steps there are the more elegant the proof, etc.

It seems like proofs are just a higher-dimensional case of what is happening here, and it doesn't seem like a big step to go from here to something that could at least generate angles of attack on a problem in the Hamming sense.

Yes and no? 

Yes because the proof itself works that way, no because when a mathematician is looking for a proof their thinking involves lots of steps that look very different from that, I think?

I feel like I have the same implied confusion, but it seems like a case where we don't need it to record the same kind of steps a mathematician would use, so much as the kind of steps a mathematician could evaluate.

Although if every book, paper or letter a mathematician ever wrote on the subject of "the steps I went through to find the proof" is scanned in, we could probably get it to tell a story of approaching the problem from a mathematician's perspective, using one of those "You are Terry Tao..."-style prompts.

I almost never consider character.ai, yet total time spent there is similar to Bing or ChatGPT. People really love the product, that visit duration is off the charts. Whereas this is total failure for Bard if they can’t step up their game.

Wow, wasn't aware they are this big. And they supposedly train their own models. Does anyone know if the founders have a stance on AI X-risk?

Either you replace them entirely, or you can’t make their lives easier.

Whoever wrote this doesn't understand the difference between precision and recall. It's really easy to have AI in a system without replacing the human. Off the top of my head: if the AI is highly reliable at flagging "not a cancer", which will be ~90% of the cases, then the human only needs to look at the remaining 10% of the results, either giving them more time to evaluate each one or letting them process 10x as many results.
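The triage logic that comment gestures at can be sketched in a few lines (all names, numbers and thresholds here are hypothetical): the model only auto-clears cases where it is extremely confident in "not a cancer", and everything else goes to the human, so the human workload shrinks without the model ever issuing a positive diagnosis on its own.

```python
# Hypothetical screening-triage sketch: the model never replaces the
# clinician; it only filters out cases it is very confident are negative.
def triage(cases, model_prob_cancer, clear_threshold=0.01):
    """Split cases into auto-cleared negatives and a human review queue.

    model_prob_cancer: maps a case to the model's estimated P(cancer).
    clear_threshold: auto-clear only below this probability, chosen so
    recall on actual cancers stays near 100% (a safety requirement this
    sketch assumes rather than guarantees).
    """
    auto_cleared, needs_human = [], []
    for case in cases:
        if model_prob_cancer(case) < clear_threshold:
            auto_cleared.append(case)   # confident "not a cancer"
        else:
            needs_human.append(case)    # human reviews everything else
    return auto_cleared, needs_human

# Toy data: 90% of cases score very low, 10% are ambiguous.
scores = {i: (0.001 if i % 10 else 0.3) for i in range(100)}
cleared, queue = triage(list(range(100)), scores.get)
```

On the toy data the human queue is a tenth of the original caseload, which is the comment's point: tuning for near-perfect recall on the easy negatives augments the human rather than replacing them.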

The whole thing doesn’t get less weird the more I think about it, it gets weirder. I don’t understand how one can have all these positions at once. If that’s our best hope for survival I don’t see much hope at all, and relatively I see nothing that would make me hopeful enough to not attempt pivotal acts.

As someone who read Andrew Critch's post and was pleasantly surprised to find Andrew Critch expressing a lot of views similar to mine (though in relation to pivotal acts mine are stronger), I can perhaps put out some possible reasons (of course it is entirely possible that he believes what he believes for entirely different reasons):

  1. Going into conflict with the rest of humanity is a bad idea. Cooperation is not only nicer but yields better results. This applies both to near-term diplomacy and to pivotal acts.
  2. Pivotal acts are not only a bad idea for political reasons (you turn the rest of humanity against you) but are also very likely to fail technically. Creating an AI that can pull off a pivotal act and not destroy humanity is a lot harder than you think. To give a simple example based on a recent lesswrong post, consider on a gears level how you would program an AI to want to turn itself off, without giving it training examples. It's not as easy as someone might naively think. That also wouldn't, IMO, get you something that's safe, but I'm just presenting it as a vastly easier example compared to actually doing a pivotal act and not killing everyone.
  3. The relative difficulty difference between creating a pivotal-act-capable AI and an actually-aligned-to-human-values AI, on the other hand, is at least a lot lower than people think and likely in the opposite direction. My view on this relates to consequentialism - which is NOT utility functions, as commonly misunderstood on lesswrong. By consequentialism I mean caring about the outcome unconditionally, instead of depending on some reason or context. Consequentialism is incompatible with alignment and corrigibility; utility functions on the other hand are fine, and do not imply consequentialism. Consequentialist assumptions prevalent in the rationalist community have, in my view, made alignment seem a lot more impossible than it really is. My impression of Eliezer is that non-consequentialism isn't on his mental map at all; when he writes about deontology, for instance, it seems like he is imagining it as an abstraction rooted in consequentialism, and not as something actually non-consequentialist.
  4. I also think agency is very important to how dangerous AGI is, and current approaches have a relatively low level of agency, which reduces the danger. Yes, people are trying to make AIs more agentic; fortunately, getting high levels of agency is hard. No, I'm not particularly impressed by the Minecraft example in the post.
  5. An AGI that can meaningfully help us create aligned AI doesn't need to be agentic to do so, so getting one to help us with alignment is not in fact "alignment complete".
  6. Unfortunately, strongly self-modifying AI, such as bootstrap AI, is very likely to become strongly agentic because being more agentic is instrumentally valuable to an existing weakly agentic entity. 

Taking these reasons together, attempting a pivotal act is a bad idea because:

  • you are likely to be using a bootstrap AI, which is likely to become strongly agentic through a snowball effect from a perhaps unintended weakly agentic goal, and to optimize against you when you try to get it to do a pivotal act safely. I think it is likely possible to prevent this snowballing (though I don't know exactly how), but since at least some pivotal act advocates don't seem to consider agency important, they likely won't address it (that was unfair; they tend to see agency everywhere, but they might, e.g., falsely consider a prototype with short-term capabilities within the capability envelope of a prior AI to be "safe", not being aware that the prior AI might have been safe only because it was not agentic)
  • if you somehow overcome that hurdle, it will still be difficult to get it to do a pivotal act safely, and you will likely fail. Probably you fail by not doing the pivotal act, but the probability of not killing everyone conditional on pulling off the pivotal act is still not good
  • If you overcome that hurdle, you still end up at war with the rest of humanity from doing the pivotal act (if you need to keep your candidate acts hidden because they are "outside the Overton window" don't kid yourself about how they are likely to be received by the general public) and you wind up making things worse 
  • Also, the very plan to attempt a pivotal act in the first place intensifies races and poisons the prospects for alignment cooperation (e.g. if I had any dual-use alignment/bootstrap insights I would be reluctant to share them even privately, out of concern that MIRI might get them and attempt a pivotal act)
  • And all this trouble wasn't all that urgent because existing AIs aren't that dangerous due to lack of strong self-modification making them not highly agentic, though that could change swiftly of course if such modification becomes available
  • finally, you could have just aligned the AI for not much more, and perhaps less, trouble. Unlike with a pivotal act, you don't need to get alignment-to-values correct the first time, as long as the AI is sufficiently corrigible/self-correcting; you do, of course, need to get the corrigibility/self-correction aspect sufficiently correct the first time, but this is plausibly a simpler and easier target than doing a pivotal act without killing everyone.

I mostly agree with this, and want to add some more considerations:

The relative difficulty gap between creating a pivotal-act-capable AI and an actually-aligned-to-human-values AI is, on the other hand, at least a lot smaller than people think, and likely runs in the opposite direction. My view on this relates to consequentialism - which is NOT the same thing as utility functions, as commonly misunderstood on lesswrong. By consequentialism I mean caring about the outcome unconditionally, instead of depending on some reason or context. Consequentialism is incompatible with alignment and corrigibility; utility functions, on the other hand, are fine, and do not imply consequentialism. Consequentialist assumptions prevalent in the rationalist community have, in my view, made alignment seem a lot more impossible than it really is. My impression of Eliezer is that non-consequentialism isn't on his mental map at all; when he writes about deontology, for instance, it seems like he is imagining it as an abstraction rooted in consequentialism, and not as something actually non-consequentialist.

Weirdly enough, I agree with the top line statement, if for very different reasons than you state or think.

The big reason I agree with this statement is that, to a large extent, the alignment community mispredicted how AI would progress, though unlike with many mispredictions I'd largely say this really was mostly unpredictable. Specifically, LLMs progressed way faster relative to RL, or maybe LLM progress was just hyped more.

In particular, LLMs have 1 desirable safety property:

  1. Inherently less willingness to pursue instrumental goals, and in particular consequentialist goals. This is important because it means LLMs avoid the traditional AI alignment failures; in particular, AI misalignment is a lot less probable without instrumental goals.

This is plausibly strong enough that once we have the correct goals (outer alignment), like what Pretraining from Human Feedback sort of did, alignment might just be done for LLMs.

This is related to porby's post on instrumentality making agents agentic, and one important conclusion is that so long as we mostly avoid instrumental goals, which LLMs mostly do by default (due to much denser information and much tighter goal constraints), we mostly avoid models fighting us. That is very important for safety (arguably so important that LLM alignment becomes much easier than alignment of AI in general).

Here's the post:


And here's the comment that led me to make that observation:


So to the extent that alignment researchers mispredicted how common consequentialism would be in AI, it's related to an upstream mistake: not noticing, in hindsight, how much LLMs would scale relative to RL. That scaling means instrumental goals mostly don't matter, which vastly shrinks the problem space.

To put it more pithily, the alignment field is too stuck in RL thinking, and doesn't realize how much LLMs change the space.

On deontology, there's actually an analysis of whether deontological AIs are safer, and the TL;DR is that they aren't very safe without stronger or different assumptions.

The big problem is that most forms of deontology don't play well with safety, especially of the existential kind, primarily because deontology either actively rewards existential risk or has other unpleasant consequences. One example is that an AI may use persuasion to make humans essentially commit suicide; combined with standard RL, this would be very dangerous due to instrumental goals.

But there is more in the post below:


Boundaries/Membranes may improve the situation, but that hasn't yet been tried, nor do we have any data on how Boundaries/Membranes could work.

This is my main comment re pivotal acts and deontology, and while I mostly agree with you, I don't totally agree with you here.

On deontology, there's actually an analysis of whether deontological AIs are safer, and the TL;DR is that they aren't very safe without stronger or different assumptions.

Wise people with fancy hats are bad at deontology (well actually, everyone is bad at explicit deontology).

What I actually have in mind as a leading candidate for alignment is preference utilitarianism, conceptualized in a non-consequentialist way. That is, you evaluate actions based on (current) human preferences about them, which include preferences over the consequences, but can include other aspects than preference over the consequences, and you don't per se value how future humans will view the action (though you would also take current-human preferences over this into account).

This could also be self-correcting, in the sense, e.g., that it could use preferences_definition_A and humans could want_A it to switch to preferences_definition_B. I'm not sure if it is self-correcting enough. I don't have a better candidate for corrigibility at the moment, though.

Edit regarding LLMs: I'm more inclined to think that the base objective of predicting text is not agentic (relative to the real world) at all, and that the simulacra generated by an entity following this base objective can be agentic (relative to the real world) due to imitation of agentic text-producing entities, but they're generally better at the textual appearance of agency than at the reality of it. Lack of instrumentality is more the effect of the lack of agency-relative-to-the-real-world than the cause of it.

I would go so far as to say that the vast majority of potential production, and potential value, gets sacrificed on this alter, once one includes opportunities missed.

altar? same here:

A ton of its potential usefulness was sacrificed on the alter of its short-term outer alignment.
