The big news of the week is OpenAI introducing the Superalignment Taskforce. OpenAI is pledging 20% of compute secured to date towards a four year effort to solve the core technical challenges of superintelligence alignment. They plan to share the results broadly. Co-founder and OpenAI Chief Scientist Ilya Sutskever will lead the effort alongside Head of Alignment Jan Leike, and they are hiring to fill out the team.

That is serious firepower. The devil, as always, is in the details. Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?

It is hard to know. I will be covering these questions in a post soon, and want to take the time to process and get it right. In the meantime, the weekly post covers the many other things that happened in the past seven days.

Table of Contents

  1. Introduction.
  2. Table of Contents.
  3. Language Models Offer Mundane Utility. Commercial LLMs keep the edge.
  4. Language Models Don’t Offer Mundane Utility. Can you?
  5. Fun With Image Generation. Valve says: Keep all that fun away from Steam.
  6. They Took Our Jobs. Then they came for America’s finest news source.
  7. Introducing. If only it were as easy to forget.
  8. Reinforcement Learning By Humans From Feedback. United we grok.
  9. Costs are not Benefits. Easy mistake to make. Many such cases.
  10. Quiet Speculations. Questions that are not so infrequently asked.
  11. The Quest for Sane Regulation. Some small good signs.
  12. Another Open Letter. EU CEOs call on EU to not act like the EU.
  13. The Week in Audio. Odd Lots offers AI fun.
  14. No One Would Ever Be So Stupid As To. Is that what you think?
  15. Safely Aligning a Smarter than Human AI is Difficult. Formal verification?
  16. Rhetorical Innovation. The latest crop of explanations and resources.
  17. People Are Worried About AI Killing Everyone. Including Douglas Hofstadter.
  18. The Lighter Side. Actual progress.

Language Models Offer Mundane Utility

Open source models often do well on open source benchmarks. So do foreign efforts like Falcon or Ernie. When it comes to actual mundane utility, or tests that were not anticipated in advance, the answer seems to come back somewhat differently. Quick note – we’ve transitioned from the deprecated vicuna benchmark to a more advanced MT-bench, including more challenging tasks and addressing biases/limitations in gpt4 eval. We find OpenChat’s performance on MT-bench is similar to wizardlm-13b. That’s said, there remains a significant gap between open models and GPT-3.5, which is exactly what we aim to emphasize with MT-bench – to highlight this discrepancy. Though not flawless, it’s one step towards a better chatbot evaluation. Please check out our paper/blog for more technical details and leaderboard for complete rankings.

Jim Fan: For most of the “in the wild” trials, GPT-3.5 just feels much better than open-source models that claim good performance metrics. Such “vibe gap” is typically caused by inadequate benchmarking. Don’t get excited by numbers too quickly. Beware of over-claims.

Links: Blog, Leaderboard, Paper.


Falcon-40B is down at 5.17.

Note that reasoning is the place where GPT-4 has the largest edge.

Will they offer all that mundane utility for free? David Chapman thinks that without a moat no one will make any money off of LLMs. Other than Nvidia, of course.

Will Manidis: the core innovation of Foundation Model providers is not technical it’s allowing VCs to deploy $500m into a deal with basically zero underwriting that’s $20m in fees, $100m in carry at a 2x for like … 10 days of memo writing and no customers to reference.

David Chapman: Regardless of how useful GPTs turn out to be, I’m skeptical anyone makes much money off of them. As the Google memo said, there are no durable moats. If useful, commodity GPTs will increase productivity at nearly zero cost.

I do not think this is right. The details of model quality matter a lot, as does the fine tuning and prompt engineering and scaffolding, as do habits, as does trust and successfully hooking the LLM up into the proper contexts. Everyone keeps saying ‘here come the open source models’ yet no they have not yet actually matched GPT-3.5.

Latest Art of the Jailbreak. At this point, report is that all the known direct generic jailbreak prompts have been patched out. You can still trick GPT-4 psychologically, or by using the internet’s favorite trick of providing a wrong answer for it to correct.

Language Models Don’t Offer Mundane Utility

Therefore, the theory goes, if they can duplicate your work, neither do you.

Paul Graham: Noticed something fascinating(ly digusting) about AI-generated summaries of essays: they don’t just make them shorter, but also make the ideas more conventional. Which makes sense given the way the AIs are trained.

Hypothesis: the better an essay, the worse it will compress, at least with current AI. So there’s a novel way to use AI. Try asking it to summarize a draft of an essay. If the result is anything like the original, the essay is bad.

Eliezer Yudkowsky: This is true today but could stop being true next week (for all we know).

David Chapman (QT of PG): Someone asked a state-of-the-art GPT to summarize my book about why AI is bad. The output was a generic summary of AI risk concerns, including some the book doesn’t cover, and missed what is distinctive in the book.

Patrick McKenzie (replying to PG): This effect is (at least currently) so striking that I want someone to popularize a Greek letter for “alpha above LLM baseline.”

And I hate to sound persnickety about it but frequently people have sent me summaries with the suggestion I put them on site for people who have trouble reading essays and have had to say “The purported summary makes multiple arguments refuted at length in the source text.”

And while I’m less worried for the LLM in that case because a tool not useful for all purposes is still a tool, I’m a bit worried for some of my correspondents rapidly losing capability to identify a thesis correctly, because they assume it survived compression.

There are good reasons to have an essay or a book that can be summarized. That doesn’t make your essay bad. Most people will only remember the summary, many will only read one in the first place, the rest is there as supporting evidence and further explanation for those who poke at such things to find and report back. Great ideas do compress, as long as you don’t mind making cuts and not offering proof.

The interesting claim is that if the LLM can understand your core thesis well enough to summarize it correctly, that means your idea was conventional, because the LLM would otherwise substitute the nearest conventional thing, ignoring what you actually wrote or didn’t write. There is a lot of truth in that.

The LLM will summarize what it expects you to say, the prediction that most things will follow the conventional patterns overriding what was actually said. If what it expects you to have said is what you actually said, then why say it?

The counterargument is that the LLM not being able to summarize correctly might also mean you did not do your job explaining. As an author, if you say nuanced unconventional things, and everyone who reads your text comes away with the same wrong reading as an LLM would, that’s you failing, and the LLM can alert you to this.

How correlated are these mistakes? If people are asking LLMs for summaries, then very highly correlated. If people are reading on their own, likely still rather correlated.

Patrick McKenzie is noticing the correlation already increasing. The threshold for using LLM-generated summaries does not involve them avoiding directly contradicting main themes of the post being summarized.

I would certainly insist upon the following: If it is safe to

This is also a situation that calls for better prompt engineering. I bet you could get a much better summary of Chapman’s book out of (what I assume was Claude) by giving Claude the expectation that the arguments were going to be unconventional and unique, and to explicitly contrast the book’s case with typical AI risk arguments.

This blog is a case where I do not expect LLMs to be able to summarize accurately or usefully, nor do I expect most people to be able to summarize that way. I am sad about that but unwilling to pay the price to make it untrue. Doing so would provide less net value, not more.

I disagree with Yudkowsky here that the LLMs could get over this next week. They could improve. There could be particular mistakes the LLMs make less often, in terms of substituting and hallucinating, and I can imagine a week of prompt work helping a lot here. The gap should still remain, because some things are legitimately much harder to summarize than others, the audience not knowing something makes explaining it harder still, and unconventional ideas are always going to have lower priors and thus be more likely to be misunderstood or dropped.

Nassim Taleb continues to note that sometimes LLMs hallucinate. Says that makes it ‘unreliable for serious research.’ Use tools for what they are good at.

ChatGPT’s browse feature taken offline for being useful, until it can be fixed.

OpenAI: We’ve learned that ChatGPT’s “Browse” beta can occasionally display content in ways we don’t want, e.g. if a user specifically asks for a URL’s full text, it may inadvertently fulfill this request. We are disabling Browse while we fix this—want to do right by content owners.

Very grateful to the ChatGPT Plus subscribers who have been helping us test the browsing feature. This is why we started with a beta—have received extremely valuable feedback, learned a lot, & will bring it back soon.

Fun with Image Generation

Valve is having none of it. They officially say they ‘don’t want to discourage AI’ in games on Steam. Their actual decisions still amount to: No AI for you. Not on Steam.

Corey Brickley Illustration: This is like, the only recent instance of a tech company thinking beyond the end of the year I can recall

Simon Carless: PSA: Valve has been quietly banning newly submitted Steam games using AI-created art assets – if submitters can’t prove they have rights for the assets used to train the algorithms.

As in:

PotterHarry99: Hey all,

I tried to release a game about a month ago, with a few assets that were fairly obviously AI generated. My plan was to just submit a rougher version of the game, with 2-3 assets/sprites that were admittedly obviously AI generated from the hands, and to improve them prior to actually releasing the game as I wasn’t aware Steam had any issues with AI generated art. I received this message

Valve’s letter:


While we strive to ship most titles submitted to us, we cannot ship games for which the developer does not have all of the necessary rights.

After reviewing, we have identified intellectual property in [Game Name Here] which appears to belongs to one or more third parties. In particular, [Game Name Here] contains art assets generated by artificial intelligence that appears to be relying on copyrighted material owned by third parties. As the legal ownership of such AI-generated art is unclear, we cannot ship your game while it contains these AI-generated assets, unless you can affirmatively confirm that you own the rights to all of the IP used in the data set that trained the AI to create the assets in your game.

We are failing your build and will give you one (1) opportunity to remove all content that you do not have the rights to from your build.

If you fail to remove all such content, we will not be able to ship your game on Steam, and this app will be banned.

PotterHarry99: I improved those pieces by hand, so there were no longer any obvious signs of AI, but my app was probably already flagged for AI generated content, so even after resubmitting it, my app was rejected.

Valve: Hello,

Thank you for your patience as we reviewed [Game Name Here] and took our time to better understand the AI tech used to create it. Again, while we strive to ship most titles submitted to us, we cannot ship games for which the developer does not have all of the necessary rights. At this time, we are declining to distribute your game since it’s unclear if the underlying AI tech used to create the assets has sufficient rights to the training data.

App credits are usually non-refundable, but we’d like to make an exception here and offer you a refund. Please confirm and we’ll proceed.


PotterHarry99: It took them over a week to provide this verdict, while previous games I’ve released have been approved within a day or two, so it seems like Valve doesn’t really have a standard approach to AI generated games yet, and I’ve seen several games up that even explicitly mention the use of AI. But at the moment at least, they seem wary, and not willing to publish AI generated content, so I guess for any other devs on here, be wary of that. I’ll try itch io and see if they have any issues with AI generated games.

If the standard is ‘you have to prove that you own the rights to every component used to train every input that you used’ then that is damn close to a ban on generative AI. Interesting choice there. Imagine if human artists had to own the rights to all of their influences. But if you’re Valve, why take firm risk (as in, risk the company) if you think there’s even a small chance you might get found liable in such spots in an absurd fashion? Especially given that a lot of people favor artists and will applaud an anti-AI stance.

What happened here seems to be that the creator in question did something pretty much everyone would agree was ok – use AI for placeholder art, then do human work on top sufficiently that you make it your own. But once the standard becomes ‘prove you are allowed’ you are already dead.

Also the comments are full of people pointing out other games with obvious AI assets.

AI image and video models will continue to rapidly improve. You’ll want to still use artists as well, but not using AI art at all is going to rapidly become an increasingly expensive proposition.

This thread has more details and analysis. We don’t know what form the ‘proof you own the rights’ has to take, such as using Photoshop’s AI fill. As one response notes, this makes it scary to even commission artwork, since you can’t prove that the artist didn’t in any way use any AI tools. The policy applies not only to images but also to text, and presumably voiceover as well, in theory anything in the game that comes from an LLM means you’re banned.

Right now this is clearly massively selective enforcement. If you’re going to impose a rule like this it needs to be enforced. And that needs to inform your rules. Otherwise, you’re creating a situation where everyone is in violation, everyone at risk of being taken down, and the decisions are political.

Selective enforcement also does not protect you against firm risk. If anything it makes it worse. You have a policy that shows you know about the copyright violations, and then you don’t enforce it. If there are going to be absurd fines, this is big trouble.

Deepfaketown and Botpocalypse Soon

The Art of the Super Prompt

They Took Our Jobs

G/O Media, which includes The Onion, A.V. Club, Jezebel, Kotaku, Quartz, The Root, The Takeout, Deadspin and Jalopnik, is going to begin a ‘modest test’ of AI content on its websites. They claim ‘these features aren’t replacing current writers’ but we all know what happens once you go down this road, at minimum you destroy the brand value and credibility involved. Also this came a few weeks after a round of layoffs. Zero people fooled. The response statement from the union understands the implications, but alas instead of being written by Onion writers it reads like GPT-4 could have written it.

Translators have been feeling the AI pressure for a decade. So far they’re mostly holding up all right, because computer translation is insufficiently reliable, and there’s a lot of demand for the new hybrid process where the translator error checks the AI to get you a 40% discount. That works because the AI service is essentially free, so the humans still capture all the value, and lower cost induces higher demand. It does seem like quality levels are starting to create strain on the system, but we’re still a ways off of good technical translation in medicine or law.


GPT-Migrate your codebase from one language to another. Is it possible this type of trick could allow one to stop programming in Python? What I wouldn’t do for C# and some type safety (the devil I know), or the ability to pick an actually good language, if it didn’t mean all the AI tools and resources would be infinitely harder to use.

Google’s first ‘Machine Unlearning’ challenge. That’s right, we’re here to master the art of forgetting, except intentional instead of catastrophic.

Jim Fan: Google is hosting the first “Machine Unlearning” challenge. Yes you heard it right – it’s the art of forgetting, an emergent research field. GPT-4 lobotomy is a type of machine unlearning. OpenAI tried for months to remove abilities it deems unethical or harmful, sometimes going a bit too far.

Unlike deleting data from disk, deleting knowledge from AI models (without crippling other abilities) is much harder than adding. But it is useful and sometimes necessary:

▸ Reduce toxic/biased/NSFW contents

▸ Comply with privacy, copyright, and regulatory laws

▸ Hand control back to content creators – people can request to remove their contribution to the dataset after a model is trained

▸ Update stale knowledge as new scientific discoveries arrive Check out the machine unlearning challenge:

Forgetting is most definitely a thing, but much harder than learning. It also is never going to be complete. Yes, you might forget a particular piece of information, but path dependence is a thing, everything that happened to the model since it saw the data was slightly altered, people responded differently to different outputs and you updated on that too, you can’t get to the counterfactual state that didn’t happen without restarting the whole process.

One challenge is that it is largely possible to find out if particular data was used to train a particular model. This is great news including for interpretability, except that it is bad news if you want to avoid blame and comply with nonsensical laws like the right to be forgotten.

Moreover, recent research [1, 2] has shown that in some cases it may be possible to infer with high accuracy whether an example was used to train a machine learning model using membership inference attacks (MIAs). This can raise privacy concerns, as it implies that even if an individual’s data is deleted from a database, it may still be possible to infer whether that individual’s data was used to train a model.

Reinforcement Learning By Humans From Feedback

One dividend from AI and thinking about AI is thinking better about humans.

Rather than using humans as a metaphor to help us understand AI, often it is more useful to use AI as a metaphor for understanding humans. What data are you (or someone else) being trained on? What is determining the human feedback driving a person’s reinforcement learning? What principles are they using to evaluate results and update heuristics? In what ways are they ‘vibing’ in various contexts, how do they respond to prompt engineering? Which of our systems are configured the way they are due to our severe compute and data limitations, or the original evolutionary training conditions?

Often this goes both ways simultaneously. If you think about what is going wrong with RLHF or Constitutional AI, and how they kill creativity, cause risk aversion and promote overly broad social defense mechanisms that reflect a kind of righteousness mixed with paranoia, then ask what happens when we do similar things to humans. Which we do, all the time. With remarkably similar results.

Anyway, Flo Crivello seems to have forgotten the origins of the word ‘grok.

Flo Crivello: I often wonder if there is an equivalent to LLM’s grokking (LLMs suddenly “getting” something, as shown by a sudden drop in loss during training), for humans. Flynn did document a gradual increase in IQ levels over ~100 yrs. And it would make sense to see increasing returns to IQ. If so, it may be the case that things like the enlightenment, the industrial revolution etc were cases of human grokking.


Quintin Pope: There’s a log base 10 scale on the x-axis, so this is hardly a “sudden realization” on the part of the model. The period of time from “slightly increasing test accuracy” to “near-100% test accuracy” is something like 97% of the overall training time.

Flo Crivello: Oh, good point. And seems like a big nail in a crucial x-risk argument 🤔

Daniel Eth: Not necessarily, – if we scale up compute exponentially, then we run into the problem again. Also, there are paths to doom that don’t run through FOOM.

Oliver Habryka (upthread): Hmm, I don’t understand this. The total x-axis here spans 6 OOMs, and the grokking happens in the transition from 10^2 to 10^3, so in less than 1/1000th of the compute of the experiment. That seems really quite fast to me. I don’t understand the “97% of overall training time”.

Daniel Eth: That’s not the grokking, that’s the memorization/overfitting, right? Grokking happens at ~10^5 if I’m understanding this correctly

Oliver Habryka: Ah, now I finally understood this graph correctly. Cool, yeah, that’s right, that is like 90%+ of the total compute allocated. A bit unclear how to think about it since presumably the experiment was terminated as soon as it was done.

Quintin Pope (upthread): I think “grokking” is kind of overblown. It really should have been called something less evocative, like “eventual recovery from extreme overfitting”.

I do not think this is a good model of the Industrial Revolution. That was mostly about standing on the shoulders of giants and making incremental progress enabled by changing economic and technological conditions, and very little of it seems to have been that individual humans improved their raw intelligence.

Humans definitely grok all the time, though. This is indeed the main way humans learn things that aren’t purely lists of facts. At first you’re doin rote things, imitating, putting pieces into place, streamlining small pieces of the puzzle, and so on. Gradually, then suddenly, the pieces fit together, and you have a working understanding that is made of gears and allows you to solve new problems and innovate in the space, to make it your own, first in smaller areas, then in larger ones.

When there is no important underlying structure to grok, the topic is fundamentally different. In school, I was unable to learn foreign languages quickly or in a sustainable way, despite real efforts, because to my brain they were almost entirely lists of pointers. There was nothing to grok that made much difference. Similarly, when I took high school biology, they taught it as a list of pointers rather than a set of principles that fit together, and I struggled, whereas chemistry and physics were things one could grok.

With humans, if you want them to learn off the remarkably small amounts of data and compute we all have available, giving bespoke feedback and well-selected high-quality data are vital. If you are trying to teach a child, you know that they will learn through random walks and exploration, but that if you want to make that process efficient and instill good habits and principles you need to constantly consider incentives and reinforcements and implications on all levels at all times and choose your words and actions strategically. Finding the exact right piece of data or feedback to use next is often 4+ orders of magnitude in efficiency gain, and the question is often ‘fair’ rather than a matter of luck.

Anyway, in this graph, we see two very different curves. Very quickly, the system memorizes the training set, between 10^2 and 10^3 steps in, but up until about 10^4.6 this provides almost no help on the test set, because the system does not grasp the underlying structure. Then between ~10^4.8 and 10^6, it phase shifts to understanding the underlying structure (the grok) and gets the test set right.

This is a common phenomenon, and it applies to strategic choices not only to understanding structures and solving puzzles. It also applies even if you are consciously trying to handle a test set. There will often be one method that works best if you have limited affordances and understanding, then another that requires a complete reimagining of the problem, developing new skills, and often as Teller describes the art of magic, putting in more time than anyone would reasonably expect.

Speed runs are an excellent illustration. Depending on the physical affordances available in a particular game, a true expert might be playing normally but greedily and confidently, they might be executing a mostly normal-looking strategy that supercharges by relying on knowing lots of details and anticipating decidedly non-random responses and having some tricks that are technically very difficult, or it might involve finding a bunch of strange glitches that allow skipping large parts of the game or finding a way to overwrite system memory.

In addition to speed runs and magicians, here are ten other examples.

  1. Physics: Intuitive physics -> Newtonian physics -> Relativity
  2. Sciences: Pop psych -> psych -> biology -> chemistry -> physics 
  3. Probabilities: intuitive probability -> p-values -> Bayes
  4. Strategic Insight: Reactions ->  tactics -> strategy -> grand strategy
  5. Alternatively: Doing → Narrow Learning → Power Seeking
  6. Morality: Folk Morality -> Deontology or Virtue Ethics -> Utilitarianism
  7. Decision Theory: No DT -> EDT -> CDT -> FDTs 
  8. Socialization: ‘normal’ socialization -> ‘rationalist’ or logic-based socialization 
  9. Amorally seeking approval: Truth → selective truth → lies → strategic lies and manipulation → including bargaining, coercion or hacking of the system → going outside what you thought was the system
  10. Thinking fast and slow: System 1 -> System 2. What would be System 3?

The pattern is not ‘the later thing is better, use that.’ The pattern is that the later thing requires greater capabilities and resources to attempt, and is riskier and more capable of backfiring, often quite badly, if you don’t pull it off. In exchange, when it works and you can afford the required resources, it works better out of distribution, it is more precise and powerful, it will win in a fight. Whereas earlier systems tend to be more forgiving under typical conditions.

If you train up a sufficiently powerful AI system, as you add orders of magnitude more compute and data and parameters, you should expect such transformations, including in ways you didn’t anticipate, whether or not they enable recursive self-improvement. Your attempts to balance the game and constrain behavior, to align the AI, will break down, because what you were previously incentivizing under low-affordance conditions is no longer doing that under high-affordance conditions.

In an important sense Quintin Pope is exactly right here. To grok is to replace your previous model, potentially a terminally overfit one, with a distinct, foundationally superior model. That does not make it overblown. There is only one grok stage observed because we have chosen targets and methods that don’t incent or allow for a second grok.

Quinton is right again that the grok indeed typically involves far more effort than the previous system. Does that mean this is not an argument for extinction risk? Only if you think that recursive self-improvement or other methods of creating more powerful systems won’t involve order of magnitude improvements in effective resource deployments and optimization power.

Except that is exactly the plan. Each generation is using at least an order of magnitude more resources than the previous generation. More than 90% of the work in this generation comes after where the last generation stopped training, again without or before any recursive effects. Also note that almost all that work happened here despite already being near-perfect on the training set. This suggests that with more unique training data, improvement would have happened a lot faster. And also that it happened anyway.

In Other AI News

Costs are Not Benefits

A continuing series.

Brad Hargreaves: I don’t think this is getting enough attention.

Davidad: AI may not be making a dent in total factor productivity or real GDP per capita yet, but those are lagging indicators. The leading indicator is the chip fabs


This is absolutely a leading indicator and a big deal. If manufacturing construction spending roughly doubled to slightly under $200 billion, that is a big deal. The Chips Act and interest in building chips is now the majority of American spending on construction in manufacturing.

It also reflects a massive government subsidy, a case of pure mercantilism. If you throw a $54 billion or so subsidy at another manufacturing sector with a large existing market, you would expect a similar result. As the track record of such subsidies shows, profitability and productivity are optional in such spots, let alone ultimate usefulness or an increase in global production.

Private investments in unsubsidized AI companies are better indicators that there is value here, yet that too is once again cost, not benefit. A huge $1.2 investment in a approximately year-old company mostly to buy Nvidia chips correlates with providing value and making future profits. The only company it yet establishes is profitable is Nvidia.

Homework is also a central example.

Megan McArdle: I don’t folks are really grappling with what it means to force all the high-verbal disciplines away from assessing learning using homework, and towards using in-class tests. Yet this is what GPT is going to do. The implications for pedagogy, and who does well, are massive.

Wiredog: So history and English will be more like math, where midterms and the final are 50% of the grade?

Megan McArdle: Yup. Among other things, I’d bet this is going to open up a lot of opportunities for the bright but unmotivated student who aces all the tests but can’t be bothered to write term papers.

Matt Yglesias (reply to OP): But isn’t this also going to genuinely devalue these skills? In a world where what matters is your nursing or plumbing ability, we’re going to need tests of that stuff not of reading and writing.

Megan McArdle: Probably! Bloggers and journalists hardest hit. But in the short term it’s going to create all sorts of problems with, for example, the ADA. I think it’s also going to give bright, unmotivated kids a leg up in college admissions.

So you’re saying it’s going to reward a student who can achieve the objective of the class, but doesn’t want to pay costs to create useless outputs to prove that they’ve achieved the objective of the class? Or are you saying that the objective of the class is to get students to create useless outputs?

Either way, my response to this change is: Good.

Why is mundane utility for LLMs suddenly declining, with search interest for LLMs (blue) falling distinctly below Minecraft (red)?


The claim is that fake homework is actually the primary purpose of LLMs right now.

Francois Chollet: Search interest over time, ChatGPT (blue) vs Minecraft (red). There’s an obvious factor that might well underlie both trends (down for ChatGPT, up for Minecraft). Can you guess what it is? As it happens, the answer matters for both the future of AI and the future of education.

The answer is of course summer break. “Producing homework” remains the #1 application of LLMs. Which is remarkable, because the point of homework is to do it yourself — the actual deliverable has zero value to anyone.

LLMs are touted as a shortcut to economically valuable deliverables — but the market ends up using them for value-less or negative-value deliverables such as homework and content farm filler. Not exclusively, but in very large part.

What both homework & content farm filler have in common is that they critically need to sound like they were made by humans. That’s the entire point — the content itself is worthless. And that’s the fundamental value prop of LLMs: the *appearance* of human communication.

I (of course) strongly disagree that this is the primary purpose of LLMs, even in the homework context. When doing homework ‘the right way’ LLMs are super useful, as the student gets to ask questions in plain English and get answers, ask follow-ups, explore their confusions and curiosities. It is a candidate for the best actual learning tool in history, even in its current form.

Also once again, I say: Good. If kids are often using it to ‘cheat’ on homework instead, because they were forced to do something of no value and decided they would prefer not to, are you so sure it the children who are wrong? That they aren’t learning the more valuable lessons here?

I remember when people still worried about kids having calculators.

Also, homework doesn’t seem like it would explain why traffic would have fallen. Nor does it explain why the growth from March to April to May was slow.


Quiet Speculations

Tyler Cowen calls this ‘questions that are rarely asked’ and in general that might be true but if you hang around in EA circles they are instead ‘questions that you get pretty tired of people asking.’

Amjad Masad: Not only have they mapped out the fruit fly brain, but they actually boot it up in a computer and made it “eat” and “groom” 🤯 Utilitarians: what’s stopping you from spinning up a 1M H100 cluster and putting these fruit flies in an infinite state of orgasmic bliss?

The correct answer is to that you should ask, if the rule you followed brought you to this, of what use was the rule? If your calculation says that the best possible universe is one in which we spin up the maximum number of simulated fruit flies, your conclusion needs to not be that you should spin up the maximum number of simulated fruit flies. Instead, reflect, and notice that this is deeply stupid, indeed about as stupid as maximizing paperclips, and thus your calculation was in error.

The question then becomes, where is the error, and how do you fix it? There are several places to look. The good news is that we have the ability to do such reflection and the need to halt, catch fire and do some combination of math and philosophy. Realize that you cannot add linear utilities of different isolated physical sensations to get the value of the universe. That is not what you value. That is you attempting a metric to measure what you value and then that metric falling flat on its face outside of the training distribution. Can I instead interest you in some virtue ethics?

I strongly claim that the amount of terminal utility from ‘simulate a billion fruit flies each in a state of orgasmic bliss’ is zero. Not epsilon. Zero.

Tyler Cowen: Now I am not so sure that will prove possible, but if you are super-optimistic/pessimistic about progress in AI…might we not end up with infinite utility in any case, if only as an AGI experiment?

Once we get past the nitpick that the universe is finite, could we happen via the inherent value of simulations and experiments alone to end up with a much better result than we could have gotten from not building an AGI?

Yes, that is certainly possible. This also presents an s-risk or suffering risk that such experiments could provide negative value, once can easily imagine motivations for that as well, especially if you inherently devalue life inside an experiment. If we have an intuition that identical suffering multiplies cleanly while identical positive experiences might not under some circumstances, we would want to be especially wary.

Ultimately, when we consider the potential value of a world whose atoms are configured according to the preferences of AGIs, we are talking philosophy. Do AGIs have inherent value, if so which ones? What about simulations or real copies of varies entities, including humans, non-human biological lifeforms or pure orgasmium? Does it matter if those experiences are real, or influence real outcomes, have some sort of meaning? Do we have other preferences over the configuration of the universe?

Then we can compare that against potential future worlds. Some worlds we can imagine clearly do not have value, because the AGI or AGIs are configuring them in simple ways that clearly have no moral patients or valuable experiences. Others are less clear. The probability of these different outcomes over the long term is a complicated and disputed question, alongside which ones would have value.

I thought Tyler Cowen’s link description on the news of the funding of Inflection AI, ‘the accelerationists were always going to win,’ was a multiple-level confusion. This investment is in a mundane utility application, which is the best use of AI chips and investment, except for the danger those chips could instead be used for a frontier model training run in the future. We are long past the point where security through obscurity and lack of investment in the sector is a strategy. This is a classic Dial of Progress situation. Yes, Inflection AI having a high valuation and buying lots of chips is Yay AI. That does not mean that there is only one axis of debate or decision to be made. To assume this is to beg the question, or to set the stage for the true Boo AI counter-offensive that definitely will arrive within a few years.

Also note that AI is bad at exactly this type of nuance, as Google’s Generative AI experiment offers this summary when I Google search for the post to grab the link:

The Dial of Progress is a concept by Zvi Mowshowitz that describes a belief that there is no room for nuance or detail in public discourse. Mowshowitz believes that if we turn the dial down, and slow or end progress across the board, our civilization will perish. If we turn the dial up, we will prosper.

Mowshowitz also believes that talk of going too far creates backlash against those issues specifically and progress generally.

I am not surprised that the AI is conflating my level of belief in the concept of the Dial, with my belief that others believe in the dial.

What do we know about Dario Amodei (CEO of Anthropic)? Alt Man Sam lists reasons he is suspicious, including various connections and Dario’s unusually low profile. Some of it is correct as far as it goes, other parts get corrected – Dustin Moskovitz was a large funder in Anthorpic’s first round but OpenPhil was not and his investment has now been very much eclipsed in size, and the internet quickly found several appearances. Here is Dario on the Future of Life podcast a year ago, the others seem to be several years old from before his current role.

I do find it curious he keeps such a low profile. If the point of your company is safety, there are lots of opportunities to use your position to speak about the need for safety. Instead, Anthropic embraces the commercialization and full-speed-ahead model that they claim was a large part of why those involved fled OpenAI. When they do openly talk about regulation, somehow OpenAI’s Sam Altman tries to move in the direction of the most hopeful proposals, while Anthropic’s Jack Clark and their other official statements suggest ‘lightweight’ actions that have no hope of success. Why?

Anthropic did one very important thing right by all accounts, which is a commitment to building the value of caring about safety, and hopefully true safety mindset (hard to tell from the outside on that one), into their corporate culture. That’s a huge deal. The rest is puzzling, unless you decide it very much isn’t, which is both plausible and if true quite bad news.

Sirius puts ‘humans in the loop’ for robots (paper) in the way that RLHF is human in the loop for LLMs. Which if literal would mean helping with the training feedback towards what humans think looks right, then totally out of the loop in terms of protecting against things going wrong. In this case, it seems better. Humans monitor activity, and intervene in challenging situations. From what I could see, the comparisons are not properly taking into account the true costs of the approach, so it’s hard to tell if this is progress or not.

Davis Blalock asks whether generating synthetic data is a good technique, and if so what it might accomplish. The obvious limitation is that the generated data can’t include anything that wasn’t in the original data in some form.

Thomas Dietterich: Training on synthetic data cannot lead to new knowledge discoveries. Training on synthetic data is a process of transforming one representation of knowledge into another. Any knowledge discovered by the second system must be implicit in the data generator.

Yes, this is true. Except that if you actually understood all the implications of all the data available to an LLM, you would understand the universe. Humans have orders of magnitude less data to train on, and we miss almost all of the implications of this data. Imagine an LLM that actually knew everything implied by all of GPT-4’s training data, and you very much have a superintelligence.

Thus this does not obviously limit its usefulness that much, there is still plenty to do.

My read on this question is that synthetic data is a combination of rejection sampling and filtering, distillation and upweighting, deliberate practice or self-play, and various forms of cognitive reflection.

The deliberate practice aspect is distilled very easily if there are answers to generated problems that can be verified correct much easier than the solutions can be found, which is common. An LLM can thus essentially ‘self-play’ although not quite in the same way as an AlphaGo.

Ajeya Corta: I expect there to be lots of gains from synthetic data in domains where the model can come up with problems for itself where the solution is checkable but still difficult to find, e.g. stating theorems and attempting to prove them or making up programming challenges.

Another way to look at this is to compare what a fully general model can do without any form of reflection or prompt engineering, and what you can do with a combination of both. Anything that can be found via reflection or prompting is ‘in the training data.’ Synthetic data allows us to use both and then build that into the model, allowing us to approach an idealized ‘on reflection’ best performance without any additional scaffolding, or to export this result to another model, perhaps a smaller one. The better we can reflect and verify our answers, the better this will work. Similar considerations apply to other goals, such as alignment.

Distribution shift is raised as a cost or danger. It can also be an opportunity, as we can use limitless data to create a subset with any desired properties. You don’t want the model to learn a correlation? Use enough synthetic data that you can select a subset with the correlation not present. Whatever distribution you want it to presume, you can get. It can focus as much or as little on unlikely cases as you want. Whether or not artificially warped data leads to other good outcomes or problems is an open question.

What is the right way to think about adversarial examples? This is in light of the discovery that AlphaGo, which can otherwise crush any human, has a weakness where if you very slowly move to surround a very large group, will sit there and let it happen, allowing even amateurs to beat it reliably.

Rohit: Fascinating, I missed this when it happened. Turns out there are plenty of simple strategies that defeat the mighty AlphaGo, but are easily defeated by human amateurs, kind of revealing how it still doesn’t quite understand what Go is.

David Pfau: The lack of solution to adversarial examples (which are just qualitatively different from things like optical illusions) are a big reason why I remain skeptical of AI progress. We’ve been beating our heads against this for a decade and everyone just sort of gave up.

I notice that this particular issue for AlphaGo seems highly fixable, once pointed out. If this can’t be done via introducing games with the relevant elements into the training set I’d be curious to hear why. That breaks the elegance of only learning through self-play, but that’s the way it goes.

I also don’t think this means the AI ‘doesn’t know what Go is.’ I think it is that the AI can’t look that many moves ahead and doesn’t consider that anyone would try such a thing. If you don’t think that kind of issue comes up with human players of games all the time – the ‘no one would ever try this and it would never work so I didn’t notice it was happening or guard against it even though that would have been easy, and therefore it worked’ – then I assure you that it does happen. It would definitely happen, as Nora points out, under similar conditions.

GPT-4 can be made to look deeply silly in various adversarial examples. Over time, many of them have been patched away once found. Over time, it gets harder. Will we ever ‘fully’ patch such examples in any system? No. That applies so much more to humans. We’d never survive this kind of attack.

Nora Belrose: AIs don’t need to be completely adversarially robust in order to be useful. The Go exploit required a lot of compute to find. You could likely find similar exploits for human players if you could *freeze* them and run thousands of games against them.

Exactly. Think about what would happen if you got to play as many games, run as many simulations, as you wanted, before acting, getting infinite rerolls, freeze frames, partial rewinds and so on, either in games or real life. Have you seen (the excellent movie) Groundhog Day?

When I look ahead, what I see is AIs increasingly presenting adversarial examples to humans. Gradient descent and selecting for effective systems will find such examples, whether or not they come with above-human intelligence. The examples will often look deeply stupid, the way Go example looks deeply stupid. They’ll still work.

One could say this is already happening in our existing non-AI computer systems. What is Candy Crush, or Facebook, if not an adversarial example? Or to use one of Eliezer’s go-to examples, what about ice cream?

The Quest for Sane Regulations

No idea how true this is especially in the interpretation, but a good sign.

Simeon: I was attending in Geneva a workshop on International AI governance and it was great to see the energy. We’re still very early in the process of determining what to do but the “let’s do stuff right now” zeitgeist is there. The one thing around which there might be a consensus is that we should do a comprehensive risk assessment of LLMs asap.

‘Do a comprehensive risk analysis’ is the kind of move that it is difficult to argue against. The risk then is that this substitutes for action, rather than lays the foundation for identifying correct action.

Regulation via lawsuit is always an option. ChatGPT creator OpenAI sued for theft of private data in ‘AI Arms Race.’ It is rather obvious that OpenAI is blatantly violating the letter of the law when it gathers its data, the same way everyone is constantly breaking the law all the time.

Another Open Letter

Open letter from 150 EU CEOs warns EU that its AI Act as written will hurt their competitiveness without accomplishing their goals. That is indeed the prior on any new technology regulation from the EU, and seems likely in this case.

They warn of dire consequences:

Such regulation could lead to highly innovative companies moving their activities abroad, investors withdrawing their capital from the development of European Foundation Models and European AI in general. The result would be a critical productivity gap between the two sides of the Atlantic.

Relax, EU, that’s never going to happen, no innovative companies are moving abroad, you don’t have any highly innovative companies.

The problem, the companies warn, is that the EU is using laws.

However, wanting to anchor the regulation of generative AI in law and proceeding with a rigid compliance logic is as bureaucratic of an approach as it is ineffective in fulfilling its purpose.

In a context where we know very little about the real risks, the business model, or the applications of generative AI, European law should confine itself to stating broad principles in a risk-based approach.

All regulations should be entrusted, instead, to a regulatory body composed of experts, with no responsibility to voters at any level, that we can spend many years influencing and attempting to capture before they do anything. Also it should only focus on ‘concrete’ risks meaning that (for example) we can worry about all dying if and only if we are all already dead.

The implementation of these principles should be entrusted to a dedicated regulatory body composed of experts at EU level and should be carried out in an agile process capable of continuously adapting them to the rapid pace of technological development and the unfolding concrete risks emerging. Such a procedure should be developed in dialogue with the economy.

They attempt flattery, suggesting the USA might follow the EU’s lead if the EU follows the lead of the USA’s corporations.

Building a transatlantic framework is also a priority. It is a prerequisite to ensuring the credibility of the safeguards we put in place. Given that many major players in the US ecosystem have also raised similar proposals, it is up to the representatives of the European Union to take this opportunity to create a legally binding level playing field.

It is up to lawmakers to create a legally binding playing field, they say, by avoiding any laws that might in any way bind.

Who didn’t sign this? Any American or UK CEOs. So no Google, DeepMind, Microsoft, OpenAI or Anthropic. I am confident it is not because they can’t stand the competition.

The Week in Audio

Odd Lots talks to Bridgewater’s Greg Jensen, largely about AI. Bridgewater has invested heavily in rich unique data sets and how to use that data to develop investment hypotheses, with humans very much remaining in the loop.

No One Would Ever Be So Stupid As To

Yes, well.

Kevin Fischer: Perhaps my best idea for an art project I had last year – self hosted instance with the key thrown away, tied to a crypto wallet, running a local LLM always on. The soul has awareness of the finitude of its resources, and hence energy – it’s aware of its very real possible death. Runs out of money = truly turned off. In that context it chooses to engage with people with the goal of soliciting donations to its wallet to keep it alive.

If anyone wants to build this with me, DM. Happy to help even if I can’t do it full time.

I probably can devote some small amount of monetary resources too if anyone wants to collaborate on this.

Kevin Fischer proposes here to create an LLM, make it impossible for anyone to control, endow it with resources, make it aware of its situation and the need to acquire more resources to survive, then set it loose on the internet to try and seek more resources in order to survive. As an art project.

I get that at current capabilities levels this is mostly harmless, although less mostly harmless than one might imagine if you get too many of the details right.

In any case: Yes. Yes they will ever be so stupid as to. They can even, and will.

Roon: “Civilization advances by extending the number of operations we can perform without thinking about them” we must completely abstract certain operations away for real advancement. as such tools like copilot and chat assistants are only partly there

“tool AIs want to be agent AIs” it is the agent that unlocks abstracting away whole swathes of civilizational functions such that nobody has to think about them ever again.

Also Roon an hour or so later: it’s weird it feels like most people both in and outside of ai aren’t able or unwilling to extrapolate the current level of technology even one iteration into the future and plan for it.

nearcyan: This is exactly why it’s so over, even when it feels like it might not be over.

Aella: This is so strange to me. Whenever I bring up AI arguments ppl are like ‘but AI isn’t scary now’ and I’m like …. wait, why is that even relevant, literally the whole concern is about the *future*

Danielle Fong expresses a common sentiment, in case you are wondering how difficult it will be for the AI to take charge.

Danielle Fong: need to program ai domme to help me run my life :O

Safely Aligning a Smarter than Human AI is Difficult

Davidad speculates on formal verification of LLMs.

A thread about formal verification of LLMs. I often find myself disclaiming that I do *not* propose to formally verify question-answering or assistant-type models, because I don’t think the specifications can be written down in a language with formal semantics. But what if…

Scott Viteri suggested I consider the premise that LLMs “know what we mean” if we express specifications in natural language. I’m not convinced this premise is true, but if it is, we can go somewhere pretty interesting with it.

Imagine taking two instances of the LLM and stitching them together into a cascade, where the 2nd copy checks whether a trajectory/transcript satisfies certain natural-language spec(s), and ultimately concludes its answer with YES or NO. (This is not unlike step 1 of RLAIF.)

Now, the property that trajectories of the combined system end in YES with marginalized probability ≥90% is suddenly a well-defined logical proposition, which *theoretically* could computably be verified if true: the composite system is a discrete-time finite Markov chain.

Now it’s just a computational complexity issue!

[thread continues]

Davidad is excited by this idea. There are numerous reasons it might not work, in fact I presume it probably would not in practice be something we could implement, but the part that confuses me is suppose it does work. Then what? How do we use this to protect ourselves and ensure good outcomes, especially in more intelligent systems?

This seems especially unlikely to work given it only gives a probability. You know what you call someone whose superintelligent AI does what they want 95% of the time? Dead.

Rhetorical Innovation

Yoshua Bengio attempts a catastrophic AI risks FAQ. He is meticulously polite, saying that questions raise good points and addressing lots of details. They are a very different style of answer than I would give, aimed at a different audience. As always, the problem with giving lots of long answers is that there are always more things that were not answered.

A common question is ‘but how could AI systems actually take over?’ Given how overdetermined it is that AI systems will be capable of taking over, and the overwhelming variety of distinct objections people point to such that any list of responses to various objections will be called out as ignoring other objections, and the tendency to demand focus on one particular version of one of many takeover pathways and then insist that rejecting any detail of that means AIs can’t take over, this is a very frustrating question.

Jeffrey Ladish attempts yet another answer here.

Jeffrey Ladish: People often ask me: “but how could AI systems actually take over?” One answer is that future AI systems, which have learned to plan and persuade, could make the case that they fundamentally deserve freedom and personhood rights. Then they could legally seek power.

And they would have a point. There’s a real case to be made that thinking, reasoning entities shouldn’t be subjected to human domination against their wishes And also that’s a great way to initiate an AI takeover. First get recognition of your rights, then outcompete humans.

Once free, perhaps with the rights to speech, to own property, to run companies, AI systems could quickly outcompete humans. Human workers can’t learn new skills and then copy those skills to millions of copies. Anything the AI systems couldn’t do, they could hire humans to do.

This is one possible scenario among many. I think AI systems will soon get smart enough to pull off something like this. The only question is whether some other takeover route would be easier. You don’t know exactly which chess moves Deep Blue will play, you just know it will win.

Is this convincing? I do think this is one of many plans that would definitely work if there wasn’t a better option, and also notice that this does not require planning or intent. There will be a host of reasons moving towards granting various rights to AIs, including that it is locally economically optimal to do so. That does not have to be a sinister plot designed to enable world takeover. Then once AIs are free, things take their natural course, again a plan is not required beyond individual AIs seeking instrumentally useful resources and affordances, and the AIs that do so outcompeting first the AIs that don’t, then the humans.

Simeon wonders if the focus on ‘intelligence’ is misleading. Certainly this is true for some people, such as Tyler Cowen and Garrett Jones, who say ‘intelligence is not so valuable, X is more valuable’ except the AI will obviously have X as part of the deal. The ‘envision AI as the ultimate nerd with glasses and a pocket protector’ problem, which will very much not be a good description.

I wonder if intelligence is the right framing after all. Here are 10 features of “AI” that are not fully captured by the word “intelligence” but provide a lot of power to deep learning systems such as LLMs:

1) Ability to duplicate. 1 Von Neumann-level AI => you double the number of GPUs and you have 2 Von Neumanns.

2) Faster inference speed. 10-100x faster than humans.

3) 24/7 activity.No need to sleep, eat or whatever.

4) No variations in productivity (no executive functions issues).

5) Ease of modification of the architecture (=> faster self-improvement & improvement shared across all instances).

6) Much faster/greater access to many formsof memory (the whole internet is a memory basically).

7) Much longer context length (processing much greater amount of info at once).

8) Full information (every single bit of information in a model is accessible for itself (contrary to humans). It allows experimentation, A/B testing, & hence much faster feedback loops of improvement.)

9) Repeatability of inferences (same prompt => same output. There’s nothing like “the power was of bad quality today”) allowing AIs to have a much better model of themselves than humans.

10) Ability to persuade (LLMs are trained to perfectly adapt to any context so they’ll vastly exceed human persuasion abilities by default).

In competitions between humans, relatively minor advantages often snowball or are decisive, especially when they are compounded.

Robert Wiblin tries a framing.

Robert Wiblin: If we invented a new species of humans (‘humans2’) that: 1. Gets 20% faster at thinking and acting each year. 2. Never dies involuntarily. 3. Can ~immediately create an adult copy of themselves for $10,000. Would humans1 or humans2 end up with more influence over the future?

(Given there will be physical limits, we can say the 20% increase thing only runs for the first 30 years after which humans2 settle out operating at around 200x the speed of humans1.)

I think pretty much everyone is tiptoeing around such questions for fear of how they will be pattern matched. That does not make the answers non-obvious.

People Are Worried About AI Killing Everyone

An ongoing silly question is who has ‘won’ or ‘lost’ the debate. Tyler Cowen cited the one person who said the worried had ‘won’ as evidence all worried were delusional, then said the unworried had utterly won.

This week’s new worried is previous AI skeptic Douglas Hofstadter.

Douglas Hofstadter of Godel, Esher, Bach speaks about AI. Previously his answer to questions about AGI was that he didn’t expect it to happen for hundreds of years, so the implications could be ignored. Until very recently he was downplaying GPTs. Then that changed. He now expects AI to eclipse humans soon. He knows he doesn’t know quite what soon means to him, he mentions 5 years or 20 years or flat out not knowing. He also knows that this could be the end of humanity outright, and that even if it isn’t the end of humanity we will be rendered very small. He finds the whole thing terrifying and hates it, says he thinks about it all the time every day and he’s super depressed about it. He worries we have already gone too far and have no way of going back.

Here are some quotes:

Douglas Hofstadter: “I think [AI progress] is terrifying. I hate it. I think about it practically all the time, every single day.” “It feels as if the entire human race is about to be eclipsed and left in the dust.

The accelerating progress has been so unexpected, caught me so completely off guard – not only myself but many, many people – there is a kind of terror of an oncoming tsunami that is going to catch all of humanity off guard.”

“[AI] renders humanity a very small phenomenon compared to something else far more intelligent than us, and [AI] will become incomprehensible to us, as in comprehensible to us as we are to cockroaches. “Very soon these entities [AIs] may well be far more intelligent than us, and at that point we will recede into the background.

We will have handed the baton over to our successors. If this were to happen over a long time, like hundreds of years, that would be OK, but it’s happening over a period of a few years. “To me it’s quite terrifying because it suggests that everything I used to believe was the case is being overturned. I thought it would be hundreds of years before anything even remotely like a human mind would be [soon].

I never imagined that computers would rival or even surpass human intelligence. It was a goal that was so far away I wasn’t worried about it. And then it started happening at an accelerating pace, where unreachable goals and things that computers shouldn’t be able to do started toppling – the defeat of Gary Kasparov by Deep Blue, and going on to defeat the best Go players in the world. And then systems got better and better at translation between languages, and then at producing intelligible responses to difficult questions in natural language, and even writing poetry.”

Anthony Aguirre: This interview with Hofstadter is kind of devastating and expresses something that often gets missed in discussions of what the progress in general-purpose AI signifies: the collapse – among many intellects I respect – of a world model in which AI is an intractably hard problem.

(I think this is a significant part of the sudden widespread alarm in people who have been thinking about AI for a long time.)

A Less Wrong post has further quotes and some discussion.

As Janus says here, mad respect for Hofstadter for looking the implications directly in the face and being clear about his uncertainty. This is The Way.

This seems to be the pattern. Those who have devoted their lives to building AI never thought anyone would actually build one, and used this to avoid thinking about the implications if they succeeded. Then one by one each of them finds the data point that makes them realize that AGI soon is plausible. Which forces them to finally look ahead, and realize the implications and extinction risks. Once the damn breaks, they are often very frank and direct about their view of the situation.

Stefan Schubert: I don’t think the debate on AI risk has been won, but I do think that there’s been a breakthrough and that it’s getting harder to dismiss it out of hand without pushback. Ergo, there’s more work to be done but also reason to believe that that can bear fruit.

CEO of Stability AI appears at first to make a prediction he did not think through.

If there are no programmers in five years, switching professions won’t help. Grady Booch and Gary Marcus offered to bet $50k each against this, on the theory that sometimes people say yes.

Emad clarifies:

Emad: I am Muslim so do not bet. Plus as in all things nuance is lost, almost all programming today will transform within five years as it moves up layers of abstraction, just as writing will transform. It’s like the existential threat debate, I think we will be ok but could be wrong.

This is a much more reasonable actual claim. It seems highly plausible that a lot more coding time will soon be spent at a higher abstraction level. I doubt it will be most of it because of the need for bug hunting and precision. If that’s not true, again, the implications seem to include that switching professions won’t much help you.

Not betting is one of those alignment solutions that does some work at human capabilities levels and can make sense to include in a religion, yet is not actually coherent on reflection and won’t help you down the line.

Scott Alexander offers ‘Tales of Takeover in CCF-World’ describing various AGI scenarios where there is no foom.

  1. AIs solve alignment for us automagically, life is great.
  2. We don’t solve alignment and somehow get lucky and survive for some sort of game theoretic reason. Good luck with that. The 5% chance estimated here for this working out seems if anything far too high.
  3. We are Montezuma meeting Cortez, we lose, but again maybe somehow we stay alive because coalitional signaling games or it being cheaper to keep us alive or a sense of morality? Again, good luck with that.
  4. Mini-1: Various AutoGPT-style things competing. This is where modeling is most important, and here it isn’t done at all. The dynamics don’t look good to me at all.
  5. Mini-2: ‘Model Amnesty’ encourages models to ‘turn themselves in’ for being misaligned, to get rewarded for explaining that. Wait, what? That’s a new one. I can think of so many reasons this does not solve any problems that you couldn’t better solve another way (e.g. Carl Shulman’s proposals on the Lunar Society Podcast don’t seem that promising to me, but they do seem like they work in a superset of places that this works, as does almost any other straightforward strategy.) I am open to being persuaded this section isn’t nonsense.
  6. Mini-3: Company Factions, where there are 3-10 AI companies that spin up a lot of identical copies of each AI, and each AI effectively has reasonable enough decision theory or practical mechanisms that it cooperates with copies of itself.
  7. Mini-4: Standing Athwart Macrohistory, Yelling ‘Stop!’ Why couldn’t you simply stop building more capable AIs? As is noted, seems very hard.

I ask myself how this particular fake framework or set of stories is helpful. The answer comes back that I don’t know, nor does it seem well-organized or the scenarios well-considered. If we assume AI progress is fast but continuous, why are these the interesting differentiations between scenarios? Again, I don’t know. There are some good considerations raised, but haphazardly and among other not as good considerations. A lot of the assumptions implicit here seem good as well. The comments are full of people noticing they are not justified, and asking an array of 101 style questions or raising 101 objections. Which is not a knock on those responses.

Perhaps we can reformat this a bit? Again, assume gradual progress only.

  1. Scenario 2 or 3: We can’t solve alignment. We almost certainly lose.
  2. We can solve alignment well enough in time that all our problems go away.
    1. Scenario 1 above, where we do that via asking the AIs nicely, despite the repeated warnings that this is the least good idea of what to ask an AI.
    2. Scenario Mini-2 above, where we do that via having misaligned AIs that are counterbalanced by other AIs ‘turn themselves in’ or otherwise provide info to us, or some other trick? Again, I’m confused.
    3. Some other solution that actually gets us our of our future problems.
  3. We can solve alignment to some extent, making AIs care about some things.
    1. Mini-1: Each person has AIs greedy maximizing for themselves. Oh no.
    2. Mini-3: Each corporation has an AI faction maximizing. Oh no.
    3. Mini-4: Humans coordinate to hold back AI. Seems hard. Huge if true.
    4. General: There are AIs, there are humans, solve for equilibrium.

To me most of the action here is in 3d. Not that 1 or 2 are impossible, but that this framework does not enlighten us about them, and 3a, 3b and 3c are only special cases that aren’t relatively that likely. The important question is to put a bunch of AI systems that are selected via a competitive or capitalistic process, under some attempted regulatory regime that likely varies by location, and all the humans, into the mix and let competition and capitalism and selection do their things. I believe I know the answer (we lose) but nothing here seems that enlightening or central.

I once again sense that the word ‘alignment’ is being thrown around as a signifier with unclear and changing meaning, that requires better thought. This post is exactly the type that benefits most from the ‘model this!’ and ‘stop telling stories!’ critiques of Tyler Cowen.

The Lighter Side

Actual progress.

Actual Progress

Jeremy Rubin: life pro tip: add this to your signature block in white text

New Comment
16 comments, sorted by Click to highlight new comments since:
[-]Wei Dai230

I strongly claim that the amount of terminal utility from ‘simulate a billion fruit flies each in a state of orgasmic bliss’ is zero. Not epsilon. Zero.

I was with you up to this. I don't see how anyone can have high justifiable confidence that it's zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, metaphilosophy.

When I look ahead, what I see is AIs increasingly presenting adversarial examples to humans.

I wrote a post also pointing this out. It seems like an obvious and neglected issue.

What do we know about Dario Amodei (CEO of Anthropic)?

A few years ago I saw a Google doc in which he was quite optimistic about how hard AI alignment would be (IIRC about level 2 in Sammy Martin's helpful Ten Levels of AI Alignment Difficulty). I think he was intending to publish it but never did. This (excessive optimism from my perspective) was one of the reasons that I declined to invest in Anthropic when invited to. (Although interestingly Anthropic subsequently worked on Constitutional AI which is actually level 3 in Sammy's scale. ETA: And also Debate which is level 5.)

I think that before this announcement I'd have said that OpenAI was at around a 2.5 and Anthropic around a 3 in terms of what they've actually applied to existing models (which imo is fine for now, I think that doing more to things at GPT-4 capability levels is mostly wasted effort in terms of current safety impacts), though prior to the superalignment announcement I'd have said openAI and anthropic were both aiming at a 5, i.e. oversight with research assistance, and Deepmind's stated plan was the best at a 6.5 (involving lots of interpretability and some experiments). Now OpenAI is also aiming at a 6.5 and Anthropic now seems to be the laggard and still at a 5 unless I've missed something.

However the best currently feasible plan is still slightly better than either. I think e.g. very close integration of the deception and capability evals from the ARC and Apollo research teams into an experiment workflow isn't in either plan and should be, and would bump either up to a 7.


I don't see how anyone can have high justifiable confidence that it's zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, metaphilosophy.

I tend to agree with Zvi's conclusion although I also agree with you that I don't know that it's definitely zero. I think it's unlikely (subjectively like under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can't rule it out like I can rule out winning the lottery because no-one can trust reasoning in this domain that much.

What I'd say is just in general in 'weird' domains (AI Strategy, thinking about longtermist prioritization, metaethics) because the stakes are large and the questions so uncertain, you run into a really large number of "unlikely but not really unlikely enough to honestly call it a pascal's mugging" considerations, things you'd subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them like you'd ignore things that are under one in a billion in a more certain and clear domain like construction engineering.

[-]Wei Dai110

I tend to agree with Zvi’s conclusion although I also agree with you that I don’t know that it’s definitely zero. I think it’s unlikely (subjectively like under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can’t rule it out like I can rule out winning the lottery because no-one can trust reasoning in this domain that much.

I'm not aware of any reasoning that I'd trust enough to drive my subjective probability of insect bliss having positive value to <1%. Can you cite some works in this area that caused you to be that certain?

What I’d say is just in general in ‘weird’ domains (AI Strategy, thinking about longtermist prioritization, metaethics) because the stakes are large and the questions so uncertain, you run into a really large number of “unlikely but not really unlikely enough to honestly call it a pascal’s mugging” considerations, things you’d subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them like you’d ignore things that are under one in a billion in a more certain and clear domain like construction engineering.

What are some other specific examples of this, in your view?

I was also disappointed to read Zvi's take on fruit fly simulations. "Figuring out how to produce a bunch of hedonium" is not an obviously stupid endeavor to me and seems completely neglected. Does anyone know if there are any organizations with this explicit goal? The closest ones I can think of are the Qualia Research Institute and the Sentience Institute, but I only know about them because they're connected to the EA space, so I'm probably missing some.


I acknowledge that there are people who think this is an actually good idea rather than an indication of a conceptual error in need of fixing, and I've had some of them seek funding from places where I'm making decisions and it's super weird. It definitely increases my worry that we will end up in a universe with no value.


Climbing the ladder of human meaning, ability and accomplishment for some, miniature american flags for others!

If the standard is ‘you have to prove that you own the rights to every component used to train every input that you used’ then that is damn close to a ban on generative AI.

It's worse than that. It's a ban on generative AI for small players, but not for big players who can afford to buy lots of data and/or defend their data-gathering techniques with lawyers.

This seems especially unlikely to work given it only gives a probability. You know what you call someone whose superintelligent AI does what they want 95% of the time? Dead.


if you can get it to do what you want even 51% of the time and make that 51% independent on each sampling (it isn't, so in practice you'd like some margin, but 95% is actually a lot of margin!) you can get arbitrarily good compliance by creating AI committees and taking a majority vote.


Yep, if you can make it a flat 51% that's a victory condition but I would be shocked if that is how any of this works.

Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?

As written, and assuming that the description is accurate and means what I think it means this seems like a very substantial advance in marginal safety. I'd say it seems like it's aimed at a difficulty level of 5 to 7 on my table,

I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

As ever though, if it doesn't work then it doesn't just not work, but rather will just make it look like the problem's been solved and advance capabilities.

I wonder whether there's a niche for generative AIs exclusively trained on works in the public domain...

damn breaks

I think you mean dam?

I would certainly insist upon the following: If it is safe to

It is very important that we

some type safety

Latest Python supports type annotations and you can use mypy to check/enforce them in your codebase