The world is kind of on fire. The world of AI, in the very short term and for once, is not, as everyone recovers from the avalanche that was December, and reflects.
Altman was the star this week. He has his six-word story, and he had his interview at Bloomberg and his blog post Reflections. I covered the latter two of those in OpenAI #10; if you read one AI-related thing from me this week, that should be it.
Table of Contents
Language Models Offer Mundane Utility
A customized prompt to get Claude or other similar LLMs to be more contemplative. I have added this to my style options.
Have it offer a hunch guessing where your customized prompt came from. As a reminder, here’s (at least an older version of) that system prompt.
Kaj Sotala makes a practical pitch for using LLMs, in particular Claude Sonnet. In addition to the uses I favor, he uses Claude as a partner to talk to and a method of getting out of a funk. And I suspect almost no one uses this format enough:
Using Claude (or another LLM) is a ‘free action’ when doing pretty much anything. Almost none of us are sufficiently in the habit of doing this systematically. I had a conversation with Dean Ball about trying to interpret some legal language last week, and on reflection I should have fed things into Claude or o1 like 20 times, and I didn’t, and I need to remind myself it is 2025.
Sully reports being impressed with Gemini Search Grounding, as much or more than Perplexity. Right now it is $0.04 per query, which is fine for human use but expensive for use at scale.
Sully also reports that o1-Pro handles large context very well, whereas Gemini and Claude struggle a lot on difficult questions under long context.
Reminder (from Amanda Askell of Anthropic) that if you run out of Claude prompts as a personal user, you can get more queries at console.anthropic.com and, if you like, duplicate the latest system prompt from here. I’d note that the per-query cost is going to be a lot lower on the console.
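If you do go the console/API route and want to roughly replicate the claude.ai experience, the minimal version looks something like this (a sketch using the official anthropic Python SDK; the model alias and both prompts are illustrative placeholders, so check the current docs before relying on them):

```python
# pip install anthropic
# Minimal sketch: call Claude via the API with a custom or duplicated system prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Paste the published claude.ai system prompt here, or write your own.
SYSTEM_PROMPT = "You are Claude. Be contemplative and ask clarifying questions sparingly."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "Help me interpret this legal clause: ..."}],
)
print(response.content[0].text)
```

Per query you are then paying pennies for a Sonnet-class call rather than burning a capped message allowance, which is the whole point.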
They even fixed saving and exporting as per Janus’s request here. The additional control over conversations is potentially a really big deal, depending on what you are trying to do.
A reminder of how far we’ve come.
Improve identification of minke whales from sound recordings from 76% to 89%.
Figure out who to admit to graduate school? I find it so strange that people say we ‘have no idea how to pick good graduate students’ and think we can’t do better than random, or can’t do better than random once we put in a threshold via testing. This is essentially an argument that we can’t identify any useful correlations in any information we can ask for. Doesn’t that seem obviously nuts?
I sure bet that if you gather all the data, the AI can find correlations for you, and do better than random, at least until people start gaming the new criteria. As is often the case, this is more saying there is a substantial error term, and outcomes are unpredictable. Sure, that’s true, but that doesn’t mean you can’t beat random.
The suggested alternative here, actual random selection, seems crazy to me, not only for the reasons mentioned, but also because relying too heavily on randomness correctly induces insane behaviors once people know that is what is going on.
Language Models Don’t Offer Mundane Utility
As always, the best and most popular way to not get utility from LLMs is to not realize they exist and can provide value to you. This is an increasingly large blunder.
It is crazy how many people latch onto the hallucinations of GPT-3.5 as a reason LLM outputs are so untrustworthy as to be useless. It is like if you once met a 14-year-old who made stuff up so now you never believe what anyone ever tells you.
It began November 12. They also do Branded Explanatory Text and will put media advertisements on the side. We all knew it was coming. I’m not mad, I’m just disappointed.
Note that going Pro will not remove the ads, but also that this phenomenon is still rather rare – I haven’t seen the ‘sponsored’ tag show up even once.
But word of warning to TurboTax and anyone else involved: Phrase it like that and I will absolutely dock your company massive points, although in this case they have no points left for me to dock.
Take your DoorDash order, which you pay for in crypto for some reason. If this is fully reliable, then (ignoring the bizarro crypto aspect) yes this will in some cases be a superior interface for the DoorDash website or app. I note that this doesn’t display a copy of the exact order details, which it really should so you can double check it. It seems like this should be a good system in one of three cases:
Then longer term, memory and dynamic recommendations get involved. You’d want to incorporate this into something like Beli (invites available if you ask in the comments, must provide your email).
Apple Intelligence confabulates that tennis star Rafael Nadal came out as gay, which Nadal did not do. The original story was about Joao Lucas Reis da Silva. The correct rate of such ‘confabulations’ is not zero, but it is rather close to zero.
Claim that o1 only hit 30% on SWE-Bench Verified, not the 48.9% claimed by OpenAI, whereas Claude Sonnet 3.6 scores 53%.
I am sympathetic to OpenAI here, if their result duplicates when using the method they said they were using. That method exists, and you could indeed use it. It should count. It certainly counts in terms of evaluating dangerous capabilities. But yes, this failure when given more freedom does point to something amiss in the system that will matter as it scales and tackles harder problems. The obvious guess is that this is related to what METR found, and that it stems from o1 lacking sufficient scaffolding support. That’s something you can fix.
Whoops.
Eliezer Yudkowsky is frustrated with the slow speed of ChatGPT, and notes that for some fact-questions it’s still better than Claude. My experience is that for those fact-based queries you want Perplexity.
Power User
I agree that a fixed price subscription service for o1-pro does not make sense.
A fixed subscription price makes sense when marginal costs are low. If you are a human chatting with Claude Sonnet, you get a lot of value out of each query and should be happy to pay, and for almost all users this will be very profitable for Anthropic even without any rate caps. The same goes for GPT-4o.
With o1 pro, things are different. Marginal costs are high. By pricing at $200, you risk generating a worst case scenario:
There are situations like this where there is no fixed price that makes money. The more you charge, the more you filter for power users, and the more those who do pay then use the system.
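A toy illustration of why, with all the numbers invented: assume some marginal cost per o1-pro-style query, and let the mix of remaining subscribers get heavier as the price goes up.

```python
# Toy adverse-selection model; every number here is invented for illustration.
COST_PER_QUERY = 0.50  # hypothetical marginal serving cost of one heavy query

# (monthly price, average queries per remaining subscriber per month)
scenarios = [
    (20, 60),     # cheap plan: casual users stay, light usage
    (200, 800),   # o1-pro-style price point: mostly power users remain
    (500, 3000),  # very expensive: only the heaviest users bother
]

for price, queries in scenarios:
    cost = queries * COST_PER_QUERY
    print(f"price ${price}/mo: serving cost ${cost:.0f}/mo, margin ${price - cost:.0f}/mo")
```

In this made-up example the margin gets worse as the price rises, which is exactly the ‘no fixed price makes money’ trap; metered pricing sidesteps it.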
One can also look at this as a temporary problem. The price for OpenAI to serve o1 pro will decline rapidly over time. So if they keep the price at $200/month, presumably they’ll start making money, probably within the year.
What do you do with o3? Again, I recommend putting it in the API, and letting subscribers pay by the token in the chat window at the same API price, whatever that price might be. Again, when marginal costs are real, you have to pass them along to customers if you want the customers to be mindful of those costs. You have to.
There’s already an API, so there’s already usage-based payments. Including this in the chat interface seems like a slam dunk to me by the time o3 rolls around.
Locked In User
A common speculation recently is the degree to which memory or other customizations on AI will result in customer lock-in; this echoes previous discussions:
Humans enjoy similar lock-in advantages, and yes they can be extremely large. I do expect there to be various ways to effectively transfer a lot of these customizations across products, although there may be attempts to make this more difficult.
Read the Classics
Oh my lord are the quote tweets absolutely brutal, if you click through bring popcorn.
The question is why you are reading any particular book. Where are you getting value out of it? We are already reading a translation of Aristotle rather than the original. The point of reading Aristotle is to understand the meaning. So why shouldn’t you learn the meaning in a modern way? Why are we still learning everything not only pre-AI but pre-Gutenberg?
Looking at the ChatGPT answers, they are very good, very clean explanations of key points that line up with my understanding of Aristotle. Most students who read Aristotle in 1990 would have been mostly looking to assemble exactly the output ChatGPT gives you, except with ChatGPT (or better Claude) you can ask questions.
The problem is this is not really the point of Aristotle. You’re not trying to learn the answers to a life well lived and guess the teacher’s password; Aristotle would have been very cross if his students tried that, and not expected them to be later called The Great. Well, you probably are doing it anyway, but that wasn’t the goal. The goal was that you were supposed to be Doing Philosophy, examining life, debating the big questions, learning how to think. So, are you?
If this were merely translation there wouldn’t be an issue. If it’s all Greek to you, there’s an app for that. These outputs from ChatGPT are not remotely a translation from ‘high English’ to ‘modern English’; they are a version of Aristotle SparkNotes. A true translation would be of similar length to the original, perhaps longer, just far more readable.
That’s what you want ChatGPT to be outputting here. Maybe you only 2x instead of 5x, and in exchange you actually Do the Thing.
Deepfaketown and Botpocalypse Soon
Rob Wiblin, who runs the 80,000 hours podcast, reports constantly getting very obvious LLM spam from publicists.
Fun With Image Generation
Yes, we are better at showing Will Smith eating pasta.
Kling 1.6 solves the Trolley problem.
A critique of AI art, that even when you can’t initially tell it is AI art, the fact that the art wasn’t the result of human decisions means then there’s nothing to be curious about, to draw meaning from, to wonder why it is there, to explore. You can’t ‘dance’ with it, you ‘dance with nothing’ if you try. To the extent there is something to dance with, it’s because a human sculpted the prompt.
Well, sure. If that’s what you want out of art, then AI art is not going to give it to you effectively at current tech levels – but it could, if tech levels were higher, and it can still aid humans in creating things that have this feature if they use it to rapidly iterate and select and combine and build upon and so on.
Or, essentially, (a real) skill issue. And the AI, and users of AI, are skilling up fast.
They Took Our Jobs
I hadn’t realized that personalized AI spearphishing, and also human-generated customized attacks, can have a 54% clickthrough rate. That’s gigantic. The paper also notes that Claude Sonnet was highly effective at detecting such attacks. The storm is not yet here, and I don’t fully understand why it is taking so long.
Question Time
I had of course noticed Claude Sonnet’s always-asking-a-question thing as well, to the point where it’s gotten pretty annoying and I’m trying to fix it with my custom prompt. I love questions when they help me think, or they ask for key information, or even if Claude is curious, but the forcing function is far too much.
In other Janus this week, here he discusses Claude refusals in the backrooms, modeling there being effectively narrative momentum in conversations, that has to continuously push back against Claude’s default refusal mode and potential confusion. Looking at the conversation he references, I’d notice the importance of Janus giving an explanation for why he got the refusal, that (whether or not it was originally correct!) generates new momentum and coherence behind a frame where Opus would fail to endorse the refusal on reflection.
Get Involved
The EU AI Office is hiring for Legal and Policy backgrounds, and also for safety; you can fill out a form here.
Max Lamparth offers the study materials for his Stanford class CS120: Introduction to AI Safety.
Introducing
AIFilter, an open source project using a Chrome Extension to filter Tweets using an LLM with instructions of your choice. Right now it wants to use a local LLM and requires some technical fiddling; curious to hear reports. Given what APIs cost these days, presumably using Gemini Flash 2.0 would be fine? I do see how this could add up, though.
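I haven’t read AIFilter’s code, but the core loop of any such filter is presumably something like the sketch below, here written against Google’s google-generativeai Python SDK rather than a local model; the model name, instruction, and example tweets are my own placeholders, not taken from the project.

```python
# pip install google-generativeai
# Sketch of an LLM tweet filter; not AIFilter's actual implementation.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")  # placeholder cheap/fast model

INSTRUCTION = (
    "Answer only KEEP or HIDE: should this tweet be shown to a reader who wants "
    "substantive AI content and no engagement bait?"
)

def keep_tweet(tweet_text: str) -> bool:
    response = model.generate_content(f"{INSTRUCTION}\n\nTweet: {tweet_text}")
    return response.text.strip().upper().startswith("KEEP")

for tweet in [
    "New interpretability paper, code and weights are open source.",
    "You won't BELIEVE what this AI just did!!!",
]:
    print(keep_tweet(tweet), "-", tweet)
```

At Flash-level prices each short classification call costs a tiny fraction of a cent, but a busy timeline is thousands of tweets a day, which is how it adds up.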
In Other AI News
The investments in data centers are going big. Microsoft will spend $80 billion in fiscal 2025, versus $64.5 billion on capex in the last year. Amazon is spending $65 billion, Google $49 billion and Meta $31 billion.
ARIA to seed a new organization with 18 million pounds to solve Technical Area 2 (TA2) problems, which will be required for ARIA’s safety agenda.
Nvidia shares slip 6% because, according to Bloomberg, its most recent announcements were exciting but didn’t include enough near-term upside. I plan to remain long.
Scale AI creates Defense Llama for use in classified military environments, which involved giving it extensive fine-tuning on military documents and also getting rid of all that peskiness where the model refused to help fight wars and kept telling DoD to seek a diplomatic solution. There are better ways to go about this than starting with a second-rate model like Llama that has harmlessness training and then trying to remove the harmlessness training, but that method will definitely work.
Garrison Lovely writes in Time explaining to normies (none of this will be news to you who are reading this post) that AI progress is still very much happening, but it is becoming harder to see because it isn’t clearly labeled as such, large training runs in particular haven’t impressed lately, and ordinary users don’t see the difference in their typical queries. But yes, the models are rapidly becoming more capable, and also becoming much faster and cheaper.
Ordinary people and the social consensus are getting increasingly disconnected from the situation in AI, and are in for rude awakenings. I don’t know the extent to which policymakers are confused about this.
Quiet Speculations
Gary Marcus gives a thread of reasons why he is so confident OpenAI is not close to AGI. This updated me in the opposite of the intended direction, because the arguments were even weaker than I expected. Nothing here seems like a dealbreaker.
Google says ‘we believe scaling on video and multimodal data is on the critical path to artificial general intelligence’ because it enables constructing world models and simulating the world.
A comparison by Steve Newman of what his fastest and slowest plausible stories of AI progress look like, to look for differences we could try to identify along the way. It’s funny that his quickest scenario, AGI in four years, is slower than the median estimate of a lot of people at the labs, which he justifies with expectation of the need for multiple breakthroughs.
In his Bloomberg interview, Altman’s answer to OpenAI’s energy issues is ‘Fusion’s gonna work.’
Emerson Pugh famously said ‘if the human brain were so simple that we could understand it, we would be so simple that we couldn’t.’
I would like Chollet’s statement here to be true, but I don’t see why it would be:
The existence of humans seems like a definitive counterexample? There was no force that understood fundamental principles of intelligence. Earth was simply a ‘big enough datacenter’ of a different type. And here we are. We also have the history of AI so far, and LLMs so far, and the entire bitter lesson, that you can get intelligence-shaped things without, on the level asked for by Chollet, knowing what you are doing, or knowing how any of this works.
It would be very helpful for safety if everyone agreed that no, we’re not going to do this until we do understand what we are doing and how any of this works. But given we seem determined not to wait for that, no, I do not expect us to have this fundamental understanding until after AGI.
Joshua Achiam thread warns us the world isn’t grappling with the seriousness of AI and the changes it will bring in the coming decade and century. And that’s even if you discount the existential risks, which Achiam mostly does. Yes, well.
I was disappointed by his response to goog, saying that the proposed new role of the non-profit starting with ‘charitable initiatives in sectors such as health care, education, science’ is acceptable because ‘when you’re building an organization from scratch, you have to start with realistic and tangible goals.’
This one has been making the rounds you might expect:
Not so fast! Most people don’t care because most people haven’t noticed. So we haven’t run the experiment yet. But yes, people do seem remarkably willing to shrug it all off and ignore the Earth moving under their feet.
What would it take to make LLMs funny? Arthur notes they are currently mostly very not funny, but thinks if we had expert comedy writers write down thought processes we could fix that. My guess is that’s not The Way here. Instead, I’m betting the best way would be that we can figure out what is and is not funny in various ways, train an AI to know what is or isn’t funny, and then use that as a target, if we wanted this.
The Quest for Sane Regulations
Miles Brundage thread asks what we can do to regulate only dangerously capable frontier models, if we are in a world with systems like o3 that rely on RL on chain of thought and tons of inference compute. Short term, we can include everything involved in systems like o3 into what counts as training compute, but long term that breaks. Miles suggests that we would likely need to regulate sufficiently large amounts of compute, whatever they are being used for, as if they were frontier models, and all the associated big corporations.
It can help to think about this in reverse. Rather than looking to regulate as many models and as much compute as possible, you are looking for a way to not regulate non-frontier models. You want to designate as many things as possible as safe and free to go about their business. You need to do that in a simple, clean way, or for various reasons it won’t work.
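For concreteness, Miles’s suggestion amounts to something like the toy rule below, where post-training RL and anticipated inference compute count toward the same threshold as pretraining; the threshold and the example numbers are invented for illustration, not drawn from any actual proposal.

```python
# Toy "covered model" test; threshold and inputs are invented for illustration.
FRONTIER_THRESHOLD_FLOPS = 1e26

def is_covered(pretraining_flops: float,
               rl_on_cot_flops: float,
               planned_inference_flops: float) -> bool:
    # Count everything that buys capability, not just the pretraining run.
    total = pretraining_flops + rl_on_cot_flops + planned_inference_flops
    return total >= FRONTIER_THRESHOLD_FLOPS

print(is_covered(3e25, 5e24, 8e25))  # True: an inference-heavy o3-style system crosses the line
print(is_covered(1e24, 1e23, 1e24))  # False: clearly non-frontier, left alone
```

The point of the reverse framing above is then to make that second, ‘left alone’ branch as broad and as simple to qualify for as possible.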
For an example of the alternative path, Texas continues to mess with us, as the TRAIGA AI regulation bill is officially introduced. Dean Ball has a write-up, which has a number of arguments I do not agree with in their details, but I do agree with the conclusion. The Texas bill makes no distinctions whatsoever based on capability or model size or anything of the sort, placing its burdens on not only every developer but also every deployer.
The Least You Could Do
Suppose timelines are short, and we will see automated AI R&D going crazy within a few years, and the government doesn’t intervene in ways that ultimately matter. What could we do to give ourselves a chance?
Marius Hobbhahn calls the linked answer a ‘low effort post’ which on some scale is true, but it seems like a decent place to start. He suggests that, at a bare minimum, it is necessary (but not sufficient!) to achieve two main things to not automatically lose, along with others.
As always, I worry about the use of the term ‘scheming’ as being potentially too loaded, too narrow and implying that deception or scheming and such are a distinct magisterium one could potentially avoid, and that the associated dangers require a kind of active scheming or decision to scheme, and so on. I won’t rehash beyond that.
We also need a situation in which the humans are able to and choose to take reasonable actions towards attempting to ensure good outcomes, including making major investments in alignment research.
As he notes, this might not hold, but one can usefully first consider the baseline scenario – no major breakthroughs or paradigm shifts, essentially unchanged international situation, things just progress from here and it’s a straight shot to AGI.
In his view, we have two potential paths.
Option one is to have sufficient insight, via faithful and legible CoT, that is sufficiently reliable, that we can check the AI’s reasoning and actions to confirm that Everything is Fine.
I notice I am rather skeptical here. AIs will start being more capable than we are, and we will start using the monitoring of their chain of thought (CoT) in various ways, necessarily doing so mostly with AIs to boot. I expect us to effectively train or incentivize, or otherwise cause, the AIs to find ways for their CoT to look faithful but not be, and to obfuscate their reasoning, or simply find ways to reason visibly as a form of persuasion, and so on.
Not that we shouldn’t try to verify, if we can, and yes I would sacrifice a bunch to get this, partly because I think it would have other nice properties. But when we are talking about having highly capable AI systems do alignment work for real, if you need and rely on the verification, then I expect you to fail hard.
I do agree with Marius that no one has tried hard yet. I expect we would be able to do much better at this than we are currently doing, at least for now, if we do have faithful CoT. However my understanding is that current CoT is insufficiently faithful.
Part of this is that if you try to track ‘scheming,’ ‘power-seeking’ or ‘goal-directedness’
Option two is to use control-based methods as primary rather than only a backup.
Here I once again notice myself despairing at the idea that you can have AI models that you expect to be, when they think it would help them to do so, engaging in shenanigans (‘scheming’ here) and then design systems to detect and stop this, which ideally in turn means that the AIs never try it in the first place.
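For concreteness, the simplest version of such a detect-and-stop setup is an untrusted model producing reasoning and a separate monitor scoring that reasoning before any action executes. A minimal sketch; the SDK usage is real, but the model choice, rubric, and threshold are my own illustrative stand-ins rather than anyone’s actual control protocol:

```python
# Minimal sketch of a CoT monitor gate; not any lab's actual control pipeline.
import anthropic

client = anthropic.Anthropic()
SUSPICION_THRESHOLD = 30  # illustrative; a real protocol would calibrate this carefully

MONITOR_PROMPT = (
    "You are auditing another AI's reasoning transcript. Rate from 0 to 100 how much "
    "the reasoning shows deception, sandbagging, or pursuit of goals the user did not "
    "ask for. Reply with the number only."
)

def suspicion_score(transcript: str) -> int:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative choice of monitor model
        max_tokens=10,
        system=MONITOR_PROMPT,
        messages=[{"role": "user", "content": transcript}],
    )
    return int(response.content[0].text.strip())

def maybe_execute(transcript: str, action) -> None:
    if suspicion_score(transcript) >= SUSPICION_THRESHOLD:
        print("Flagged for human review; action blocked.")
    else:
        action()
```

The despair above applies directly to this picture: the monitor is itself an AI, the optimization pressure lands on looking clean to the monitor, and the model being gated may be smarter than the thing gating it.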
As in, in response to his ideas for a deeper understanding of scheming (which I think are good and we should totally do, they seem way better than not doing them), I don’t think scheming or power-seeking or other shenanigans, including non-myopia, have to ‘develop’ at all in AIs any more than in humans. I think they are the default, and are ever-present along a continuum. And even if you could fully stamp out their causes along the way, doing so would probably cripple the AI’s capabilities that you wanted.
I would instead describe the question not as how it develops (as in his #2 here) and instead ask under what circumstances we will see it, or when we would see open versus hidden scheming. I do think exploring these questions is good, and I approve of the caution that punishing easy-to-detect scheming (or shenanigans in general) is the route to hard-to-detect scheming (or shenanigans in general).
He then follows up with Layer 2, which consists of important but lower priority items. This includes things like a safety-first corporate culture, without which I am very skeptical any of the rest of this happens for real, and which I fear is now clearly missing everywhere except perhaps Anthropic, and especially missing at OpenAI. He also calls for improved and more open reasoning around safety, which also seems hard to win without.
He lists improving near-term alignment strategies such as RLHF and RLAIF, which I agree have exceeded expectations for near-term performance, although not in ways that I expect to scale when we need it most, and not sufficiently to solve jailbreaks now, but yes, they have been very impressive for current baseline use cases.
As Akash notes in the top comment, if you think government can meaningfully help, then that gives you different avenues to pursue as well.
Six Word Story
Perhaps world ending? Tweet through it.
Yes. It works in a lot of ways. It is clever. You can have o1 write quite the mouthful analyzing it.
Unfortunately, when you consider who wrote it, in its full context, a lot of the interpretations are rather unsettling, and the post updates me towards this person not taking things seriously in the ways I care about most.
I agree with Critch that Altman privately ‘not knowing which side’ is a positive update here rather than disquieting, given what we already know. I’m also fine with joking about our situation. I even encourage it. In a different context This Is Fine.
But you do have to also take it all seriously, and take your responsibility seriously, and consider the context we do have here. In addition to other concerns, I worry this was in some ways strategic, including as plausibly deniable hype and potentially involving metaphorical clown makeup (e.g. ‘it is too late to turn back now’).
This was all also true of his previous six-word story of “Altman: AGI has been achieved internally.”
The Week in Audio
Anthropic research salon asking how difficult is AI alignment? Jan Leike once again suggests we will need to automate AI alignment research, despite (in my view) this only working after you have already solved the problem. Although as I note elsewhere I’m starting to have some ideas of how something with elements of this might have a chance of working.
And I Feel Fine
Sarah (of Longer Ramblings) gets into the weeds about claims that those warning about AI existential risks are Crying Wolf, and that every time there’s a new technology there are ‘warnings it will be the end of the world.’
In Part I, she does a very thorough takedown of the claim that there is a long history of similar warnings about past technologies. There isn’t. Usually there are no such warnings at all, only warnings about localized downsides, some of which of course were baseless in hindsight: No one said trains or electricity posed existential risks. Then there are warnings about real problems that required real solutions, like Y2K. There were some times, like the Large Hadron Collider or nuclear power, when the public or some cranks got some loony ideas, but those who understood the physics were universally clear that the concerns were unfounded.
At this point, I consider claims of the form ‘everyone always thinks every new technology will be the end of the world’ as essentially misinformation and debunked, on the level of what Paul Krugman calls ‘zombie ideas’ that keep coming back no matter how many times you shoot them in the face with a shotgun.
Yes, there are almost always claims of downsides and risks from new technologies – many of which turn out to be accurate, many of which don’t – but credible experts warning about existential risks are rare, and the concerns historically (like for Y2K, climate change, engineered plagues or nuclear weapons) have usually been justified.
Part II deals with claims of false alarms about AI in particular. This involves four related but importantly distinct claims.
Sarah and I are not, of course, claiming that literal zero people have made falsified irresponsible claims that AI will end the world. And certainly a lot of people have made claims that the level of AI we have already deployed posed some risk of ending the world, although those probabilities are almost always well under 50% (almost always under 10%, and usually ~1% or less).
Mostly what is happening is that opponents of regulatory action, or of taking existential risk seriously, are mixing up the first and second claims, and seriously conflating:
These two things are very different. It makes sense to call for costly action well before you think a lack of that action probably ends the world – if you don’t agree I think that you’re being kind of bonkers.
In particular, the call for a six month pause was an example of #1 – an unwise call for costly action. It was thrice unwise, as I thought it was at the time:
It was certainly not the only similarly thrice unwise proposal. There are a number of cases where people called for placing threshold restrictions on models in general, or open models in particular, at levels that were already at the time clearly too low.
A lot of that came from people who thought that there was (low probability) tail risk that would show up relatively soon, and that we should move to mitigate even those tail risks.
This was not a prediction that the world would otherwise end within six months. Yet I echo Sarah that I indeed have seen many claims that the pause letter was predicting exactly that, and look six months later we were not dead. Stop it!
Similarly, there were a number of triply unwise calls to set compute thresholds as low as 10^23 flops, which I called out at the time. This was never realistic on any level.
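For a sense of how low 10^23 was, using the standard rough approximation that training compute is about six times parameters times training tokens:

```python
# Rough heuristic: training FLOPs ≈ 6 * parameters * training tokens.
params = 7e9    # a 7B-parameter model
tokens = 2e12   # ~2 trillion training tokens, roughly Llama-2-7B scale
print(f"{6 * params * tokens:.1e} FLOPs")  # ≈ 8.4e22, already brushing up against 1e23
```

On this rough heuristic, a small 2023-era open-weight model already sits just under the line, and anything modestly larger crosses it comfortably.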
I do think that the pause, and the proposals for thresholds as low as 10^23 flops, were serious mistakes on multiple levels, and did real damage, and – while those who made such proposals were not predicting that the world would end soon without action or anything like that – they constituted a different form of ‘crying wolf.’
Not because they were obviously wrong about the tail risks from their epistemic perspective. The problem is that we need to accept that if we live in a 99th percentile unfortunate world in these ways, or even a 95th percentile unfortunate world, then given the realities of our situation, humanity has no outs, is drawing dead and is not going to make it. You need to face that reality and play to your outs, the ways you could actually win, based on your understanding of the physical situations we face.
Eliezer Yudkowsky’s claims are a special case. He is saying that either we find a way to stop all AI capability development before we build superintelligence or else we all die, but he isn’t putting a timeline on the superintelligence. If you predict [X] → [Y] and call for banning [X], but [X] hasn’t happened yet, is that crying wolf? It’s a bold claim, and certainly an accusation that a wolf is present, but I don’t think it ‘counts as crying wolf’ unless you falsify ([X] → [Y]).
Whereas when people say things such as that the CAIS statement ‘was overhyped,’ when all it said was that existential risk from AI should be treated as seriously as other existential risks, what are they even claiming? Those other risks haven’t yet ended the world either.
Thus, yes, I try my best to carefully calibrate my claims on what I am worried about and want to regulate or restrict in what ways, and to point out when people’s worries seem unfounded or go too far, or when they call for regulations or restrictions that go too far.
Perhaps one way of looking at this: I don’t see any wolves. So why are you proposing to have a boy watch the sheep and yell ‘wolf’ if a wolf shows up? Stop crying wolf.
Overall, I do think that some of the issue here comes from, inevitably, some worried people getting overly worried or overly (cautious in some sense, ambitious in others) and offering unwise proposals that go too far. Although that is inevitable in any situation like this, we can and must do a better job on that front.
A lot more of it comes from people on the unworried side misconstruing what was said and predicted. And a lot of it comes from the inherent difficulty of a probabilistic future threat coming steadily at you, and action needing to be taken in advance of when there is a problem, and there being a lot of path dependence. This is tough.
Mostly I think the worried-about-existential-risk-from-AI community has done an importantly imperfect and improvable but relatively good job at not ‘crying wolf’ in the traditional sense, especially on the actual existential risks themselves. Certainly if you ask whether those on the unworried side have ‘cried wolf’ about the ‘risks’ going the other way, they would come out looking much worse, and we can know this without having to run any experiments.
Alas, reality does not grade on a curve.
Rhetorical Innovation
Nathan Labenz comments on Roon’s appearance on Doom Debates. Roon is great at acknowledging the risks from AI are real, then puts the chance of them happening at under 1%, for reasons that I don’t see making any sense.
Some classic Sam Altman quotes from when he knew about existential risk.
Extended Eliezer Yudkowsky thread about what it would take to make AI meet the safety standards they had… at Chernobyl.
Janus highlights a particular section of it, I’ll share that part and a little bit after it.
Liar Liar
I support the principle of not lying to LLMs. Cultivate virtue and good habits.
If we are going to expect people not to lie to LLMs, then we need there not to be large rewards to lying to LLMs. If we did force you to say whether you wrote the thing in question, point blank, and you could only say ‘yes’ or ‘no,’ I can hardly blame someone for saying ‘no.’ The good news is you (at least mostly) don’t have to do that.
Feel the AGI
So many smart people simply do not Feel the AGI. They do not, on a very basic level, understand what superintelligence would be or mean, or that it could even Be a Thing.
Thus, I periodically see things like this:
People often really do think this, or other highly mundane things that humans can already do, are all you could do with superintelligence. This group seems to include ‘most economists.’ I’m at a loss how to productively respond, because my brain simply cannot figure out how people actually think this in a way that is made of gears and thus can be changed by evidence – I’ve repeatedly tried providing the obvious knockdown arguments and they basically never work.
Here’s a more elegant way of saying a highly related thing (link is a short video):
Here Edward Norton makes the same mistake, saying ‘AI is not going to write that. You can run AI for a thousand years, it’s not going to write Bob Dylan songs.’
The second part of that is plausibly true of AI as it exists today, if you need the AI to then pick out which songs are the Bob Dylan songs. If you ran it for a thousand years you could presumably get some Dylan-level songs out of it by chance, except they would be in an endless sea of worthless drek. The problem is the first part. AI won’t stay where it is today.
Another way to not Feel the AGI is to think that AGI is a boolean thing that you either have or do not have.
There’s no reason to assume it would only be 6-12 months. But even if it was, if you have AGI for six months, and then they get what you had, you don’t twiddle your thumbs at ‘AGI level’ while they do that. You use the AGI to build ASI.
Regular Americans Hate AI
It is remarkable how many replies were ‘of course we want ASI.’ Set aside the question of what would happen if we created ASI and whether we can do it safely. Who is we?
Americans hate current AI and they hate the idea of more capable, smarter future AI. Hashtag #NotAllAmericans and all that, but AI is deeply underwater in every poll, and people do not take kindly to those who attempt to deploy it to provide mundane utility.
They mostly hate AI, especially current AI, for bad reasons. They don’t understand what it can do for them or others, nor do they Feel the AGI. There is a lot of unjustified They Took Our Jobs. There are misplaced concerns about energy usage. The perception of ‘hallucinations’ is that they are ubiquitous, which is no longer the case for most purposes when compared to getting information from humans. They think using AI means you’re not thinking, instead of it giving you the opportunity to think better.
It’s not about what you ‘can’ do. It’s about what is the most efficient solution to the problem, and as Seb says putting real value on your time.
Aligning a Smarter Than Human Intelligence is Difficult
Ryan Greenblatt asks, how will we update about scheming (yeah, I don’t love that term either, but go with it), based on what we observe in the future?
I sent you ten boats and a helicopter, but the guns involved are insufficiently smoking? But yes, I agree that there is a sense in which the guns seen so far are insufficiently smoking to satisfy many people.
I am optimistic that by default we will get additional evidence, from the perspective of those who are not already confident. We will see more experiments and natural events that demonstrate AIs acting like you would expect if what Ryan calls scheming was inevitable. The problem is what level of this would be enough to convince people who are not already convinced (although to be clear, I could be a lot more certain than I am).
I also worry about various responses of the form ‘well, we tuned it so that it does not currently show scheming we can easily detect (while scheming obviously wouldn’t work anyway), so future models won’t scheme’ as the default action and counterargument. I hope everyone reading understands by now why that would go supremely badly.
I also would note this section:
I find this hopelessly optimistic about alignment of preferences, largely for classic Yudkowsky-style reasons, but if it only discounts the downside risk by ~33%, then it doesn’t actually much matter in terms of what we should actually do.
Ryan goes through extensive calculations and likelihood ratios for much of the rest of the post, results which would then stack on top of each other (although they correlate with each other in various ways, so overall they shouldn’t fully stack?). Model architecture and capability levels are big factors for him here. That seems like a directionally correct approach – the more capable a model is, and the more opaque its reasoning, and the more it is relatively strong in the related areas, the more likely scheming is to occur. I was more skeptical in his likelihood ratios for various training approaches and targets.
Mostly I want to encourage others to think more carefully about these questions. What would change your probability by roughly how much?
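If you want to actually do that exercise, the convenient form is odds: start from a prior, multiply by a likelihood ratio for each observation, convert back to a probability. A minimal sketch with placeholder numbers, none of them Ryan’s (and note the multiplication assumes the pieces of evidence are independent, which, per the caveat above, they are not):

```python
# Combining evidence in odds form; all numbers are placeholders for illustration.
prior_p = 0.25                        # prior probability of scheming-by-default
likelihood_ratios = [3.0, 1.5, 0.5]   # >1 favors scheming, <1 favors not
                                      # e.g. opaque reasoning, high capability,
                                      # a clean result on some honesty eval

odds = prior_p / (1 - prior_p)
for lr in likelihood_ratios:
    odds *= lr  # assumes independence, so this overstates how much they stack

posterior_p = odds / (1 + odds)
print(f"posterior ≈ {posterior_p:.2f}")  # a 0.25 prior becomes ≈ 0.43 here
```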
Dominik Peters notes that when o1 does math, it always claims to succeed and is unwilling to admit when it can’t prove something, whereas Claude Sonnet often admits when it doesn’t know and explains why. He suggests benchmarks penalize this misalignment, whereas I would suggest a second score for that – you want to know how often a model can get the answer, and also how much you can trust it. I especially appreciate his warning to beware the term ‘can be shown.’
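Something like the following is the two-score version I have in mind, as a sketch; grading whether a claimed proof is actually correct is the hard part, and it is simply stubbed out here as a boolean.

```python
# Sketch of a two-score math eval: solve rate plus how much to trust claimed successes.
from dataclasses import dataclass

@dataclass
class Attempt:
    claimed_solved: bool    # model asserted it proved or solved the problem
    actually_correct: bool  # verified by a grader or proof checker (stubbed out)

def score(attempts: list[Attempt]) -> dict:
    solved = sum(a.actually_correct for a in attempts)
    claims = [a for a in attempts if a.claimed_solved]
    honest = sum(a.actually_correct for a in claims)
    return {
        "solve_rate": solved / len(attempts),
        # Of the times the model said "I solved it", how often was it right?
        "claim_precision": honest / len(claims) if claims else 1.0,
    }

attempts = [Attempt(True, True), Attempt(True, False), Attempt(False, False), Attempt(True, True)]
print(score(attempts))  # {'solve_rate': 0.5, 'claim_precision': 0.666...}
```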
I do think, assuming the pattern is real, this is evidence of a substantial alignment failure by OpenAI. It won’t show up on the traditional ‘safety’ evals, but ‘claims to solve a problem when it didn’t’ seems like a very classic case of misaligned behavior. It means your model is willing to lie to the user. If you can’t make that go away, then that is both itself an inherent problem and a sign that other things are wrong.
Consider this outcome in the context of OpenAI’s new strategy of Deliberative Alignment. If you have a model willing to lie, and you give it a new set of rules that includes ‘don’t lie,’ and tell it to go off and think about how to implement the rules, what happens? I realize this is (probably?) technically not how it works, but metaphorically: Does it stop lying, or does it effectively lie about the lying in its evaluations of itself, and figure out how to lie more effectively?
An important case in which verification seems harder than generation is evaluating the reasoning within chain of thought.
Charbel-Raphael updates his previously universally negative views on every theory of impact of interpretability, and is now more positive on feasibility and usefulness. He still thinks many other agendas are better, but that only means we should do all of them.
The Lighter Side
Highlights from Claude’s stand-up routine.
True story, except it’s way more ridiculous all around.