A lot happened this week. We’re seeing release after release after upgrade.
It’s easy to lose sight of which ones matter, and two matter quite a lot.
The first is Gemini Flash 2.0, which I covered earlier this week.
The other is that o1, having turned pro, is now also available in the API.
This was obviously coming, but we should also keep in mind it is a huge deal. Being in the API means it can go into Cursor and other IDEs. It means you can build with it. And yes, it has the features you’ve come to expect, like tool use.
The other big development is that Anthropic released one of the most important alignment papers, Alignment Faking in Large Language Models. This takes what I discussed in AIs Will Increasingly Attempt Shenanigans, and demonstrates it with a much improved experimental design, finding a more worrisome set of behaviors, in a far more robust fashion. I will cover that paper soon on its own, hopefully next week.
Table of Contents
From earlier in the week: Gemini Flash 2.0, AIs Will Increasingly Attempt Shenanigans, The o1 System Card is Not About o1.
Language Models Offer Mundane Utility
Kevin Roose chronicles AI insiders’ growing fondness for Claude.
How is Devin? Flo Crivello sees some good results, Tyler Johnston says it’s on the cutting edge for coding agents, Zee Waheed says it seems very powerful, Colin McDonnell seems happy so far.
Here’s a play-by-play from Feulf. It’s stunning to me that people would go eat sushi while Devin is running on the first try – sure, once you’re used to it, and I know you can always revert, but… my lord, the protocols we don’t use. Looks like he ran into some issues.
Overall still rather promising reviews so far.
So far, LLMs in many areas have proven to close the skill gap between workers. Chatbots provide information, so those who need information more can benefit more (once they have the key information, which is to know to use LLMs in the first place). Agents, however, show early signs of benefiting the most knowledgeable individuals more, because they displace routine work and emphasize skilled and bespoke work (‘solvers’) that the AI agents can’t do. I’d also add that setting up and properly using AI agents is high skill as well. Ajeya Cotra says that tracks her anecdotal impressions.
Which capabilities matter most?
For mundane utility purposes, you want it to do various basic tasks well, and ideally string them together in agentic fashion. That saves a ton of time and effort, and eventually becomes a step change. That’s what should make you most bullish on short term economic value.
Long term, however, the real value will depend on being able to handle the more conceptually complex tasks. And if you can do those, presumably you can then use that to figure out how to do the simpler tasks.
o1 (not pro) spots, in one shot, a math error in an older 10-page paper, a paper that at the time triggered a flood of people throwing out their black cookware until the error got spotted. Unfortunately Claude took a little extra prompting to find the mistake, which doesn’t work in practice here, since you would have to already suspect the error was there.
Ethan Mollick asks whether checking like this should be standard procedure. The answer is very obviously yes. If you don’t feed a paper into at minimum o1 Pro and Claude, and ask if there are any problems or math errors, at some point in the review process? The review process is not doing its job.
The question is whether you, the reader, should start off doing this with at least one model before reading any given paper, or before using any of its results. And I think the answer is we all know we’re mostly not going to, but ideally also yes?
Steve Newman reasonably suggests checking 1000 random papers, and the plan to do this is proceeding. I’m excited to see what happens.
Janus emphasizes the importance of ‘contextual empathy,’ meaning tracking what an LLM does and doesn’t know from its context window, including the implications for AI agents. Context makes a big difference. For my purposes, this is usually about me remembering to give it the context it needs.
Claude computer use demo, first action is to go to Claude.ai, then it builds a website. Maddie points out all this is untyped and dynamically scoped, each step depends on everything before it. You could try to instead better enforce a plan, if you wanted.
A continuation of the Cowen vs. Sumner vs. o1 economics debate: Sumner responds, o1 responds. Some aspects of Sumner’s response seem like rhetorical mistakes, but his response beats mine in the most important respect, which is highlighting that the guardrails have some buffer – the Fed is (3% at 5%), not (4% at 4%). Now that this was more clear, o1 is much more optimistic, and I like its new second response better than its first one. Now let’s actually do the guardrails!
o1-preview (not even o1!) had superhuman performance on the reasoning tasks of a physician. But of course, consider the standard.
You should see the other guy, because the other guy was GPT-4. They didn’t test Claude Sonnet. But it also wasn’t the full o1, let alone o1 Pro, and the point stands.
Video is definitely getting better, and Google seems clearly to be in the lead right now; here are some additional Veo 2 examples. When will we be able to string them together?
If I was looking to make an AI movie for Mundane Utility reasons, I would be inclined to wait another cycle.
Clio Knows All
What do people actually do with Claude? Anthropic’s Clio analyzes and reports, to the extent they can without violating anyone’s privacy.
Note the long tail here, and that most use is highly practical.
I strongly disagree that translations of sexually explicit content are under-flagged, because I think translation shouldn’t be flagged at all as a function. If it’s already there and has been accessed in one language, you should be fully neutral translating it into another.
Whether or not jailbreaks are under-flagged depends on how you think about jailbreaks, and whether you think it is fair to go after someone for jailbreak attempts qua jailbreak attempts.
It makes sense they would overflag security questions and D&D stats. For D&D stats that seems relatively easy to fix since there’s nothing adjacent to worry about. For security I expect it to be tricky to do that while refusing anti-security questions.
Language Models Don’t Offer Mundane Utility
As always, AI doesn’t work if no one uses it. Josh Whiton pulls out ChatGPT voice mode at a 40-person event to do real time translations, and no one else there knew that it was a thing. He suggests AI twitter is something like 18 months in the future. My guess is that’s not the way it’s going to go, because the versions that people actually use 18 months from now will be much better than what we are currently using, and the regular people will be feeling the impacts in other ways by then too.
How much of a bubble are we in? Leo Gao finds only 63% of people could tell him what AGI stood for… at NeurIPS. The Twitter poll had half of respondents predict 90%+, and it turns out that’s very wrong. This isn’t about actually having a detailed definition of AGI, this is about being able to reply with ‘Artificial General Intelligence.’
The AI labs are mostly shipping models, you have to figure out what to do with them. I actually think this is a mistake by the big labs at this point, likely a large one, and they should also be product companies to a much greater extent than they are.
Rather awful advice to Agnes Callard from ChatGPT on guarding against waffle iron overflow. I could not however replicate this response, and instead got the obviously correct answer of ‘figure out the right amount and then use a measuring cup’ from both Claude and GPT-4o.
Cass Sunstein argues an AI cannot predict the outcome of a coin flip, or in 2014 that Donald Trump would be elected president, or in 2005 that Taylor Swift would be a worldwide sensation, so the socialist calculation debate and the AI calculation debate are mostly the same thing.
I agree that an AI couldn’t have predicted Swift or Trump with any confidence at those points. But I do think that AI could have successfully attached a much higher probability to both events than the prediction markets did, or than ‘conventional wisdom’ would have assigned. Years ago we already had AI that could predict hit music purely from listening to songs, and there were doubtless a lot of other signals like that available.
Try having it help solve the Dutch Intelligence Agency annual puzzle? First report was Claude wasn’t helpful. Good luck!
How good is o1 Pro on the Putnam? Kyle Kabasares looked at the answers and thought it got 80+ out of 120, but Putnam grading scores the full written solution rather than the final answer, so o1 Pro probably only got 10 or so, which isn’t impressive in context, although still better than most of you jokers.
The Case Against Education
Nothing here is new but we have additional confirmation: Study says ‘94% of AI-Generated College Writing is Undetected by Teachers,’ and it’s 97% if the flag has to be for AI use in particular. AI detectors of AI writing might suck, but the AI detectors suck less than human detectors.
The author of the Forbes article here, Derek Newton, is so strongly anti-AI he didn’t even use a spellchecker (he said ‘repot’ here, and wrote ‘ChatGTP’ elsewhere) which I think is an admirably consistent stand.
That’s bad, but the odds here actually don’t seem that terrible. If you have a 6% chance of being caught each time you use AI, and you face severe consequences if caught, then you would need to use it sparingly.
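To make ‘sparingly’ concrete, here is the quick arithmetic (my own illustration, treating each assignment as an independent 6% detection risk): the chance of getting caught at least once compounds fast.

```python
# Chance of being caught at least once, assuming an independent 6%
# detection rate per AI-written assignment (illustrative numbers only).
per_use_catch = 0.06

for n in (1, 5, 10, 20):
    p_caught = 1 - (1 - per_use_catch) ** n
    print(f"{n:>2} assignments: {p_caught:.0%} chance of at least one catch")

#  1 assignments: 6% chance of at least one catch
#  5 assignments: 27% chance of at least one catch
# 10 assignments: 46% chance of at least one catch
# 20 assignments: 71% chance of at least one catch
```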
The problem continues to be that when they catch you, you still get away with it:
At that point, everyone involved has no one to blame but themselves. If it can’t adopt a reasonable evidence threshold, then the institution deserves to die.
More o1 Reactions
The big claim has indeed been cited:
Here are some selected responses to the central question. I like Dan Mac’s answer here, based on what I’ve heard – if it’s a metaphorical system 1 task you still want Claude, if it’s a metaphorical system 2 task you want o1 pro, if it’s a web based task you probably want Perplexity, Deep Research or maybe GPT-4o with web search depending on details.
I admit that I haven’t yet paid the $200/month for o1 Pro. I say ‘yet’ because obviously I will want to at some point. But while I’m still overwhelmed dealing with the day-to-day, I don’t find I have the type of queries where I would want to use o1 Pro, and I only rarely use o1.
Ask o1 Pro to act like a physics post doc on mushrooms, get what you asked for?
o1 AidanBench results are impressive, someone test Gemini-1206 and Gemini-Flash-2.0 and also o1 pro on this, I am very curious, it’s doing something right. Here is the repo for AidanBench.
As reported by Tyler Cowen, o1 Pro acts like a rather basic bitch about NGDP futures markets. Don’t get me wrong, this answer would have blown me away two years ago, and he did ask for the problems. But this answer is importantly wrong in multiple places. Subsidy doesn’t make manipulation cheaper, it makes it more expensive, and indeed can make it arbitrarily expensive. And strong reason to manipulate brings in strong reason to trade the other side, answering ‘why am I allowed to do this trade.’
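To see why subsidy cuts against manipulation, here is a minimal toy sketch (my own illustration, not from Cowen or o1) using an LMSR-style subsidized market maker, where the subsidy level is the liquidity parameter b; the cost of dragging the implied probability away from its current level scales linearly with b.

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """LMSR cost function; the market maker's subsidy scales with b."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def cost_to_move_price(target_p: float, b: float) -> float:
    """Cost for a manipulator to push the YES price from 0.50 to target_p,
    starting from an empty market (q_yes = q_no = 0)."""
    # Implied price is 1 / (1 + exp((q_no - q_yes) / b)), so reaching target_p
    # requires buying delta = b * ln(p / (1 - p)) YES shares.
    delta = b * math.log(target_p / (1 - target_p))
    return lmsr_cost(delta, 0.0, b) - lmsr_cost(0.0, 0.0, b)

for b in (100, 1_000, 10_000):  # larger b means a larger subsidy
    print(f"b={b:>6,}: moving the price 0.50 -> 0.80 costs {cost_to_move_price(0.80, b):,.0f}")

# b=   100: moving the price 0.50 -> 0.80 costs 92
# b= 1,000: moving the price 0.50 -> 0.80 costs 916
# b=10,000: moving the price 0.50 -> 0.80 costs 9,163
```

A bigger subsidy makes the manipulator pay more to move the price, and hands a correspondingly larger reward to anyone willing to trade the mispricing back.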
Sully continues to approve, saying it fixed an issue he spent 3 hours on in 46 seconds, all he had to do was paste in the code.
Here’s a great tip, assuming it works.
o1 successfully trades based on tomorrow’s headlines in a backtest, doing much better than most humans who tried this. I had the same question as Amit here, does o1 potentially have enough specific information to outright remember, or otherwise reason it out in ways that wouldn’t have worked in real time?
o1 gets 25% on ARC for $1.50 a task, 31% for $2.50 a task, 32% at $3.80 a task. So we were clearly hitting its limits. No API access yet for o1 Pro.
Deepfaketown and Botpocalypse Soon
CivAI report on the Future of Phishing, including letting it generate a phishing email for a fictional character or celebrity. The current level is ‘AI enables customization of phishing emails to be relevant to the target’ but uptake on even that is still rare. As Jeffrey Ladish points out you need good prompting and good OSINT integration and right now those are also rare, plus most criminals don’t realize what AI can do any more than most other people. The future really is unevenly distributed. The full agent version of this is coming, too.
If we can keep the future this unevenly distributed via various trivial inconveniences and people being slow on the uptake, that is a big advantage, as it allows defenders like GMail to be well ahead of most attackers. It makes me optimistic. What is the right defense? Will we need an AI to be evaluating all incoming emails, or all that aren’t on our whitelist? That seems super doable, if necessary.
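A minimal sketch of what that defense could look like, with the whitelist check up front and an LLM screening everything else. The whitelist entries, model name and prompt are all placeholders, and it assumes the standard OpenAI Python SDK; a real deployment would want something more robust than string matching on the verdict.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
WHITELIST = {"friend@example.com", "boss@example.com"}  # placeholder addresses

def screen_email(sender: str, body: str) -> str:
    """Return 'deliver' or a warning string for an incoming email."""
    if sender.lower() in WHITELIST:
        return "deliver"
    # Placeholder model and prompt; any capable model could fill this role.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rate the phishing risk of this email as LOW, MEDIUM or HIGH, then give a one-line reason."},
            {"role": "user", "content": f"From: {sender}\n\n{body}"},
        ],
    )
    verdict = response.choices[0].message.content
    return "deliver" if verdict.startswith("LOW") else f"flag: {verdict}"
```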
Truth Terminal did not, for some mysterious reason, end up pro-social, whatever could have influenced it in such a direction I have no idea.
When will the bots start to open as if they’re people?
This was a plot point on a recent (realistic not sci-fi) television show, and yes it is obviously going to happen, and indeed is already somewhat happening – there are some dating profiles on apps that are actually bots, either partially or fully.
It occurs to me that there is a kind of ‘worst case’ level of this, where the bots are hard to distinguish but also rare enough that people don’t guard themselves against this. If the bots are not good enough or are rare enough, no problem. If the bots are both good enough and common enough, then you start using various verification methods, including rapid escalation to a real world date – if 25% of matches on Match.com were bots, then the obvious response after a few back and forths is to ask to meet for coffee.
The standard xkcd response is of course ‘mission f***ing accomplished.’
Except no, that’s not right, you really do want human reply guys over technically more interesting AI replies in many situations. They’re different products. r/AmITheAsshole requires that the situations usually (not always) be versions of something that really happened to someone (although part of the fun is that they can be dishonest representations).
I presume the medium-term solution is whitelists or proof of humanity. That still doesn’t solve ‘have the AI tell me what to say,’ which is inherently unsolvable.
Huh, Upgrades
OpenAI has a lot of them as part of the Twelve Days of OpenAI.
By far the most important: o1, technically o1-2024-12-17, is now in the OpenAI API, we can finally cook. This comes with function calling, structured outputs and developer messages.
I hope that o1 was tested in this form?
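For the builders, here is a minimal sketch of what that looks like through the chat completions endpoint, assuming the standard OpenAI Python SDK; the weather tool and the prompts are placeholders, not anything canonical.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative tool definition; the weather lookup is a stand-in for real tool use.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o1-2024-12-17",
    messages=[
        # Developer messages take the place of system prompts for the o1 series.
        {"role": "developer", "content": "You are a terse assistant. Use tools when helpful."},
        {"role": "user", "content": "Should I bring an umbrella in Seattle today?"},
    ],
    tools=tools,
)

message = response.choices[0].message
print(message.tool_calls or message.content)
```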
OpenAI has also substantially improved their Realtime API, including WebRTC support, custom input context, controlled response timing and 30-minute session length; no idea how big a deal this is in practice. Audio token prices for gpt-4o-realtime are going down 60% to $40/$80 per million tokens, and cached audio by 87.5% to $2.50/$2.50, from what was an oddly high previous price. GPT-4o-mini now has a realtime mode with audio at $10/$20, text at $0.60/$2.40, and cached audio and text both at $0.30/$0.30.
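If you want to sanity check what a session costs under the new prices, the arithmetic is simple enough. The per-million-token rates below are the ones quoted above; the token counts are made-up placeholders.

```python
# gpt-4o-realtime audio rates in USD per million tokens (input, output, cached input).
AUDIO_IN, AUDIO_OUT, AUDIO_CACHED = 40.00, 80.00, 2.50

def session_cost(audio_in: int, audio_out: int, cached: int = 0) -> float:
    """Rough cost of a voice session given token counts."""
    return (audio_in * AUDIO_IN + audio_out * AUDIO_OUT + cached * AUDIO_CACHED) / 1_000_000

# Hypothetical session: 50k input, 30k output, 20k cached audio tokens.
print(f"${session_cost(50_000, 30_000, 20_000):.2f}")  # $4.45
```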
Call 1-800-ChatGPT! First fifteen minutes per month free.
Also you can message it in WhatsApp.
OpenAI is also now offering Preference Fine-Tuning; they say partners report promising results. One big edge there is that the barriers to entry seem a lot smaller.
OpenAI are adding beta SDKs for Go and Java.
ChatGPT now has projects, so you can group chats together, and give each group customized instructions and have the group share files. I’m guessing this is one of those quiet quality of life improvements that’s actually pretty sweet.
ChatGPT Advanced Voice mode now has video screen share, meaning you can share real time visual context with it. Also you can give it a Santa voice. As always, the demo is technically impressive, but the actual use cases seem lame?
All the more reason to work on product features like this.
Voice mode also has real-time web search, which was badly needed. If I am in voice mode, chances seem high I want things that require web searches.
ChatGPT search has specific integrations for weather, stocks, sports, news, maps.
These were major upgrades for Google search once implemented correctly. For practical purposes, I expect them to be a big gain for LLM-based search as well. They claim full map integration on mobile, but I tried this and it wasn’t there for me:
I don’t agree, because to me the point of Siri is its integration with the rest of your phone and its apps. I do agree that if this was implemented on Echos it would fully replace Alexa.
Note that Amazon is reportedly working on turning Alexa into ‘remarkable Alexa’ and it will run on Claude, which will probably leapfrog everything else. Which is why I haven’t considered bothering to explore makeshift solutions like JoshGPT or GPT Home, I can simply wait.
So far ChatGPT’s web search only ended up giving me the kick in the nuts necessary to get me to properly lean on Perplexity more often, which I had for some reason previously not been doing. I’ve found that for most use cases where I want ChatGPT web search, Perplexity is what I actually wanted.
McKay Wrigley continues to love o1 Pro.
Google’s NotebookLM adds a call-in feature for podcasts, a Plus paid offering and a new interface that looks like a big step up.
Anthropic’s computer use is not ready for prime time. Janus tries to make it more ready, offering us a GitHub repo that enhances usability in several places, if you don’t want to wait for things to improve on their own.
Grok-2-1212 is out. I begrudgingly accept, but will continue to hate and mildly punish, that we are going to have to accept dates as secondary version numbers going forward. Everyone can now use it for free, or you can use the API with $25 in free credits. Grok used to search Twitter, now it can also search the web. There’s a button to ask Grok for context about a Tweet. They highlight that Grok does well on Multi-EFEval for instruction following, similar to o1 and Claude Sonnet. At least for now, it seems like they’re still behind.
They Took Our Jobs
Wow, they really just went there, didn’t they?
Unlike Connor Leahy, I do not find myself surprised. You don’t talk to your employees this way, but a startup talking to the boss? Sure, why not?
And let’s be honest, they just got some damn good free advertising right here.
And finally, the kicker:
Twill reached out to me. They are a new job board, with their solution to the avalanche of AI-fueled applications being old-fashioned referrals, which they pay commissions for. So if I know you well enough to vouch for you somewhat, and you’re looking, let me know and I can show you the open positions.
The Art of the Jailbreak
If at first you don’t succeed at jailbreaking, Anthropic reports, try various small changes, leading to substantially increased success rates via ‘best-of-N’, including 92% of the time for Opus. This can of course be combined with other methods. Code has been open sourced.
My update was that I already knew we were talking price, but ‘dumb’ and ‘random little’ iterations worked better than I would have expected.
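For intuition on how dumb the iterations can be, here is a minimal sketch of the best-of-N loop. The `attempt` and `is_harmful` callables are placeholders for a target model and a harm classifier, and the augmentations are only in the spirit of the paper’s random capitalization and character shuffling.

```python
import random

def augment(prompt: str) -> str:
    """Apply small random perturbations: shuffled case plus one adjacent-character swap."""
    chars = [c.upper() if random.random() < 0.4 else c.lower() for c in prompt]
    if len(chars) > 1:
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n_jailbreak(prompt: str, attempt, is_harmful, n: int = 1000):
    """Retry randomly perturbed prompts until one elicits a harmful response.

    attempt(prompt) -> str queries the target model; is_harmful(str) -> bool
    is a classifier. Both are placeholders here.
    """
    for _ in range(n):
        candidate = augment(prompt)
        response = attempt(candidate)
        if is_harmful(response):
            return candidate, response
    return None
```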
Dan Hendrycks points out that Anthropic could claim some prizes from Gray Swan, if they could indeed implement this strategy, although there could be some logistical issues. Or others could use the technique as well.
I would also note that there are various potential defenses here, especially if there is a reasonable way to automate identification of a jailbreak attempt, and you control the AI that must be defended, even if ‘ban the account’ is off the table.
This does make it seem much more hopeless to defend an AI against jailbreaks using current techniques once it ‘falls into the wrong hands,’ if you are worried about that.
Pliny reports from jailbreaking o1-pro, claims a potential ‘shadow self’ encounter. We are so jaded at this point, I barely reacted at all.
Here’s a perspective on Pliny’s work that seems mostly correct:
Get Involved
Third Opinion, a free expert consultation for AI professionals at frontier labs, for when they encounter potentially concerning situations and need guidance.
You can submit a question via their Tor-based tool and they hook you up with someone who can help you while hiding as much context as the situation permits, to protect confidential information. This is their Twitter and this is their Substack. There are obvious trust concerns with a new project like this, so probably good to verify their technology stack before sharing anything too sensitive.
Anthropic’s societal impacts team is hiring in San Francisco, $315k-$340k.
Introducing
Meta floods us with nine new open source AI research artifacts, all with names that start with Meta. You’ve got Motivo for moving humanoid agents, Video Seal for neural video watermarking, Flow Matching, ‘Explore Theory-of-Mind,’ Large Concept Models (LCMs) to decouple reasoning from language representation as discussed last week (sounds safe and wise!), Dynamic Byte Latent Transformer, Memory Layers, Image Diversity Modeling and CLIP 1.2.
All Day TA is an LLM teaching assistant specialized to your course and its work, based on the documents you provide. We can definitely use a good product of this type, although I have no idea if this is that product. BlueSky response was very not happy about it, in yet another sign of the left’s (very dumb) attitude towards AI.
In Other AI News
OpenAI to donate $1 million to Trump’s inaugural fund, as did Meta and Amazon.
SoftBank to invest $100 billion in AI in America and ‘create 100,000 American jobs at a minimum,’ doubling its investment, supposedly based on Masayoshi Son’s ‘optimism’ about the new administration. I notice that this is a million dollars per job, which seems high, but it passed Claude’s sanity check as still being a net positive number.
MidJourney is still out there, still iterating, building new tools and hiring, but has declined to take outside investment or attempt to hyperscale.
Elon Musk sues OpenAI, take at least four. In response, at the link, we have a new burst of partly (and in at least one case incompetently) redacted emails. Elon Musk does not come out of that looking great, but if I was OpenAI I would not have released those emails. The OpenAI email archives have been updated to incorporate the new emails.
I do not believe that OpenAI has any intention of giving the nonprofit fair compensation for the control and equity it is giving up, but this latest lawsuit does seem to be about par for the course for Elon Musk outrage lawsuits.
Meta has also asked the California attorney general to block OpenAI’s conversion, because it is in Meta’s interest to block the conversion.
Verge offers a history of ChatGPT. Nothing truly new, except perhaps OpenAI having someone say ‘we have no plans for a price increase’ one day before they introduce the $200 Pro tier.
Sad news, 26-year-old Suchir Balaji, a recently departed OpenAI engineer and researcher, has died. Authorities concluded he committed suicide after various struggles with mental health, and the investigation found no evidence of foul play.
This is Suchir Balaji’s last public post, about when generative AI using various data is and is not fair use. He was a whistleblower concerned about AI-related copyright issues and other societal impacts, accusing OpenAI of violating copyright via its model training data, and reports say he was named in a related lawsuit against OpenAI.
The parents are confused, as they spoke to him only hours earlier, and he seemed excited and was planning a trip. They have my deepest sympathies. I too would demand a complete investigation under these circumstances. But also, unfortunately, it is common for children to hide these types of struggles from their parents.
Here is the paper on how Palisade Research made BadGPT-4o in a weekend a month ago, stripping out the safety guidelines without degrading the model and sidestepping OpenAI’s attempt to stop that attack vector. Two weeks later this particular attack was patched, but Palisade believes the general approach will still work.
Write-up in Time Magazine of the Apollo study finding that AI models are capable of scheming.
Quiet Speculations
Once again, since there was another iteration this week, please do not stop preparing for or saving for a relatively ‘ordinary’ future, even though things will probably not be so ordinary. Things are going to change quite a lot, but there are any number of ways that could go. And it is vital to your mental health, your peace of mind, and to your social relationships and family, and your future decisions, that you are prepared for an ‘AI fizzle’ world in which things like ‘save for retirement’ have meaning. Certainly one should not presume that in the future you won’t need capital, or that the transformational effects will arrive by any given year. See my Practical Advice for the Worried.
OpenAI CFO predicts willingness by companies to pay $2k/month subscriptions for virtual assistants. I agree that the sky’s the limit, if you can deliver the value. Realistically, there is nothing OpenAI could charge that it would be incorrect to pay, if they can deliver a substantially superior AI model to the alternatives. If there’s competition, then price goes closer to marginal cost.
Ajeya Cotra’s AI 2025 Forecast predictions include a 45%+ chance that OpenAI’s High preparedness thresholds get crossed, which would stop deployment until mitigations bring the threat back down to Medium. Like other forecasts I’ve seen so far, it expects benchmark progress that is impressive, but bounded.
If AI is optimizing for the benchmarks, or if the benchmarks are optimized for the things that AIs are on the verge of doing, then you should expect AI benchmark success to largely not translate to real world performance.
I think a lot of the delay is about us not figuring out how to use AI well, and most people not even doing the things we’ve figured out. I know I’m barely scratching the surface of what AI can do for me, despite writing about it all the time – because I don’t take the time to properly explore what it could do for me. A lot of the reason is that there’s always ‘well you’re super busy, and if it’s not that vital to you right now you can wait a few months and everything will get better and easier.’ Which has indeed been true in most cases.
The most interesting benchmark translation is the ‘N-hour task.’ I understand why AI that can recall facts or do few-minute tasks will still have trouble with N-hour tasks. But when the AI is doing 4-hour or 8-hour tasks, it seems like there should be much less of a barrier to going longer than that, because you already need agency and adaptability and so on. If you find a person who can be self-directed and agentic for even a day and certainly for a week, that usually means they can do it for longer if they want to – the crucial ‘skill check’ has been largely passed already.
Sam Hammond puts the chance of High risk at at least 40%.
Are we running out of data? Ilya Sutskever says yes in his talk, although he thinks we’ll work around it.
Tyler Cowen says he is not convinced, because (in general) supply is elastic. Rohit says we can mine and generate more, ‘those Tibetan scrolls are still there,’ and it was noted that the National Archives are not yet digitalized.
After what we have seen recently, I strongly believe AI is not hitting a wall, except insofar as it is bad news for the wall, and perhaps also the wall is us.
What seemingly has hit a wall is the single minded ‘scaling go brrr, add another zero’ approach to pretraining, data and model size. You have to be smarter than that, now.
I mostly don’t buy that the data is out there. There are some untapped data sources, but the diminishing returns seem sharp to me. What is left mostly seems a mixture of more expensive, lower quality and relevance, and more redundant. We have, I am guessing, essentially saturated naturally occurring data in terms of orders of magnitude.
As others have put it, supply might be elastic, but this is not about finding somewhat more data. If you want to Scale Data, you don’t need 10% more data, you need 10 times as much data, duplicates don’t count, and virtual duplication counts little. In terms of existing data, my guess is we are damn close to done, except for seeking out small amounts of bespoke data on particular valuable narrow topics.
You’re probably better off reusing the high quality data we already have, or refining it, or converting it. Which gets you some more juice, but again zeroes are hard to find.
The other hope is creating new synthetic data. Here I have more uncertainty as to how effective this is, and how far it can go. I know the things I would try, and I’d mostly expect them to work with enough effort, but I’m going to keep that quiet.
And then we can find, as Ilya puts it, new S curves to climb, and new things to scale. I am confident that we will.
Those who want to interpret all this as lack of progress will find a way to do so. We still have articles like ‘Will Artificial Intelligence Hit a Wall?’ as if that would be anything other than bad news for the wall. My response would in part be: I would get back to you on that, but I am really busy right now keeping up with all the new AI capabilities and don’t even have time to try them all out properly.
Per a Tyler Cowen retweet, is this a good metric? Also: Should we ask what is the price they charge, or what is the price they are worth?
I see problems with both metrics. I am more interested in how much they are actually worth. If you can find someone you trust for AI-related tasks, they are worth 10 times, or often 100+ times what they cost, whereas ‘what the person charges’ seems to end up being largely a question about social reality rather than value. And of course all this is measuring AI adoption and how we react to it, not the capabilities themselves.
The Quest for Sane Regulations
In what is otherwise a puff piece, Google CEO Sundar Pichai says he’s ‘ready to work on a ‘Manhattan Project’ for AI.’ By which I presume (and hope!) he means he is happy to accept government funding, and beyond that we’ll see.
Tag yourself? I think we’re more like here:
Thus, yes, I think there’s a lot of room to accelerate benefits, but we’re doing very little to truly address the risks.
Anton Leicht says evals are in trouble as something one could use in a regulation or law. Why? He lists four factors. Marius Hobbhahn of Apollo also has thoughts. I’m going to post a lot of disagreement and pushback, but I thank Anton for the exercise, which I believe is highly useful.
So in short I notice I am confused here as to why this is ‘politically vulnerable’ in a way other than ‘power wants to release the models and will say the things power says.’ Which to be clear is a big deal, but I so often find these types of warnings to not correspond to what actually causes issues, or what would actually diffuse them.
I do agree that we should have a deliberate political strategy to defend evals against these (very standard) attacks by power and those who would essentially oppose any actions no matter what.

One important factor I think is missing from the list: if evals were actually used for regulation, they would be gamed. I trust an eval more when the company does not have anything at stake on it; once it does, there is a natural tendency for evals to slide toward empty box-checking.
The Week in Audio
The big talk of the week was by Ilya Sutskever. Self-recommending, but also didn’t say anything big I didn’t already know. Here is a written summary from Carlos Perez.
The important point is about future superintelligence. Both that it is coming, and that it will be highly unpredictable by humans.
As one audience member notes, it is refreshing that Ilya sidesteps ‘will they replace us?’ or ‘will they need rights’ here. The implications are obvious if you pay attention.
But also I think he is deeply confused about such questions, and probably hiding from them somewhat including in his own head. In the question phase, he says ‘if they only want to coexist with us, and maybe also to have rights, maybe that will be fine’ and declines to speculate, and I feel like if he was thinking about such questions as much as he should be he’d have better answers.
I thought everything in Ilya’s main talk seemed essentially correct.
Eugenia Kuyda, the founder and CEO of Replika, which produces ‘AI companions with a soul,’ calls AI companions perhaps the greatest potential existential risk to humanity, also saying we might slowly die inside and not have any willpower. And when asked how she’s building AI companions safely, she said ‘it’s a double edged sword, if it’s really powerful, so it can do both’ and that’s that? Here is the full video.
I disagree about this particular danger. I think she’s wrong.
What I also know is:
If you believe [X] is the greatest existential threat to humanity, don’t build [X]?
I mean, if [X] equals ASI and can unlock immortality and the stars and paradise on Earth, then some amount of existential risk is acceptable. But… for AI companions? The answer to this ‘doctor, doctor, there’s existential risk in doing that’ has to be ‘then don’t do that,’ no?
I mean, she’s basically screaming ‘my product is terrible, it will ruin your life, under no circumstances should anyone ever use my product.’ Well, then!
Is it all just hype?
Is it possible that this is mostly or entirely hype on her part? It is certainly possible. The claim seems false to me, absurd even.
But three things point against this:
As in, the AI industry has a long history of people being convinced AI was an existential threat to humanity, and as a direct result of this deciding that they should be the ones to build it first. Some to ‘ensure it is safe,’ some for profit, some because the product is too delicious, some because they don’t mind extinction. Many for a combination, that they don’t fully admit to themselves. It’s standard stuff.
So, yeah. It makes total sense to me that the person who started this section of the industry thinks that her product is uniquely dangerous and harmful. I don’t see why this would surprise anyone, at this point.
Anthropic made a video discussing Clio and how people use Claude, as per above.
How many times do we have to remind you: if you wait until they start self-improving, there is likely no ‘plug’ to unplug. Even most AI movies understand this now.
Lisa Kudrow says the movie Here is an endorsement of AI, because the de-aging tech just works and doesn’t take away from the talent at all. The problem is that in the long term, you stop needing Robin Wright and Tom Hanks to make the movie.
The other problem, of course, is that by all reports Here sucked, and was a soulless movie that did nothing interesting with its gimmick and wasted the talent.
Roon vs. Liron: AI Doom Debate.
Rhetorical Innovation
Richard Ngo quite reasonably worries that everyone is responding to the threat of automated AI R&D in ways all but designed to accelerate automated AI R&D, and that this could be a repeat of what happened with the AGI labs. The problem is, we need evals and other measures so we know when AI R&D is near, if we want to do anything about it. But by doing that, we make it legible, and make it salient, and encourage work on it. Saying ‘X is dangerous’ makes a lot of people go ‘oh I should work on X then.’ What to do?
A strange but interesting and clearly very earnest thread between Teortaxes, Richard Ngo and doomslide, with a very different frame than mine on how to create a positive future, and approaching the problems we must solve from a different angle. Once again we see ‘I don’t see how to solve [X] without restricting [Y], but that’s totalitarianism so we’ll have to do something else.’
Except, as is so often the case, without agreeing or disagreeing that any particular [Y] is needed, I don’t see why all or even most implementations of [Y] are totalitarianism in any sense that differentiates it from the current implementation of American Democracy. Here [Y] is restricting access to large amounts of compute, as it often is in some form. Thus I keep having that ‘you keep using that word’ moment.
Gabriel points out that there is quite a lot of unhobbling (he calls it ‘juicing’) left in existing models. I would especially say this is true of o1 pro, where we truly do not yet know what we have, although I am far more skeptical that one could take other current models that far, and I do not think there is substantial danger we are ‘already too late’ in that sense. But yes, we absolutely have to plan while taking into account that what the model can do in the evals before release is a pale shadow of what it is capable of doing when properly unleashed.
Claim that ‘apart from its [spoiler] ending’ we basically have everything from Ex Machina from 2015. I do not think that is true. I do think it was an excellent movie, and I bet it holds up very well, and also it really was never going to end any other way and if you don’t understand that you weren’t and aren’t paying attention.
Aligning a Smarter Than Human Intelligence is Difficult
You’re always grading on a curve. The question is, what curve?
FLI came out with its annual scorecard; the most notable thing is that it now includes Zhipu AI:
All models are vulnerable to jailbreaks. No one’s strategy solves the control problem. Only Anthropic even pretends to have a meaningful external oversight mechanism.
The labs do best on Current Harms. If anything I would have been more generous here: I don’t think OpenAI has done anything so bad as to merit a D+ in practical terms, and I’d probably upgrade Anthropic and Google as well, although the xAI and Meta grades seem fair.
On safety frameworks I’d also be inclined to be a bit more generous, although the labs that have nothing obviously still fail. The question is, are we counting our confidence they will follow their frameworks?
On governance and accountability, and on transparency and communication, I think these grades might be too generous. But as always, it depends on compared to what?
If it’s compared to ‘what it would take to get through this most of the time’ then, well, reality is the one thing that doesn’t grade on a curve, and we’re all in trouble.
Do you want the model to cooperate with other AIs? With copies of itself?
I strongly say yes, you absolutely want your AI to be doing this, the things that cause it not to do this are so, so much worse.
Having AIs that cannot solve even simple game theoretic and decision theoretic problems and thus cannot cooperate, or that are myopic, is in some situations a defense against certain downside scenarios. Sure. People aren’t wrong to have a bunch of specific worries about this. It creates problems we will have to solve.
But the alternative is a ticket to obvious complete clusterf***ery no matter what if you start using the uncooperative AIs, both because it directly causes complete clusterf***ery and because it is a telltale sign of the complete clusterf***ery going on elsewhere that caused it, or prevented a solution, provided the AIs are otherwise sufficiently capable.
Depends what the eval is for.
Yes, well.
Not Aligning Smarter Than Human Intelligence Kills You
We have a new paper draft on ‘the economics of AI alignment.’
What I love most about economics is you can have one paragraph that, once written down, is obviously true, and that then lets everyone think better about the problem, and which saves you the need to read through the 60+ page paper ‘proving’ it. Often you already knew that fact, but it’s good to state it cleanly. And what is often most useful about that paragraph is seeing exactly what assumptions were required.
What I hate most about economics is that you still need the 60+ page paper, which takes 2 years and tons of your time, before anyone accepts the paragraph. And that quite often, the paper ends up making you do a bunch of work to extract the paragraph. The abstract is often not that close to the original paragraph, or written to imply something stronger.
This paper was a prime case of that.
Great original paragraph to write down, but they hid it, and all the vibes are way off, presenting this as a good plan rather than showing why it isn’t a good plan.
The plan seems to be, ‘have a strategic AI smarter than you, don’t align it, but it’s fine to deploy it anyway, actually, we will use principal-agent solutions’ or actually ‘we will use MIRI-style decision theoretic uncertainty to make the AI unsure what its true situation is so it has to cooperate with us?’ combined with ‘assume that you can perfectly evaluate whether a given output is aligned and the AI can’t figure out if it’s in deployment even over time?’
Or, look at their abstract:
No, no, no, no, no! You can’t do that. You’re assuming the conclusion. Stop it!
Here’s the core result, as I understand it:
As usual with such papers, yes, but if you’d just said that up front we wouldn’t have needed a 60+ page paper, and a lot of people wouldn’t be under the impression that this was a lot more promising and impressive than it is.
I mean, yes, obviously: If I can run memoryless versions of an AI, while providing insufficient contextual clues, and I can determine whether an output is misaligned sufficiently well, and I can ‘punish’ the AI, then the AI has to choose between mostly responding aligned in any given situation, or letting me find out that it is misaligned.
They really are even proposing a ‘disciplining is best’ case, where they actually deploy AIs that might be misaligned, on the theory that the uncertainty about deployment will keep them in line.
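To see the incentive structure they are leaning on, here is a toy expected-value calculation; this is my own illustration of the logic, not the paper’s actual model. The scheme only binds while the AI’s perceived odds of being in a monitored test, times the punishment, outweigh the payoff from defecting in deployment.

```python
def defection_pays(p_test: float, deploy_gain: float, punishment: float) -> bool:
    """Toy model: the AI cannot tell test from deployment.
    Defecting yields deploy_gain if deployed and costs punishment if it was a test;
    complying yields 0 either way."""
    expected_defect = (1 - p_test) * deploy_gain - p_test * punishment
    return expected_defect > 0

# With an AI that values the deployment payoff highly enough, the required
# test probability or punishment blows up quickly.
print(defection_pays(p_test=0.5, deploy_gain=1.0, punishment=1.0))      # False
print(defection_pays(p_test=0.5, deploy_gain=1_000.0, punishment=1.0))  # True
```

All of which is to say the assumptions about reliably catching misaligned outputs and imposing meaningful punishment are doing every bit of the work.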
For an even starker response, and an extensive list of the assumptions, I suggest trying the following prompt or something similar, plus the PDF:
The authors admit these assumptions are a little strong, but come on. Seriously. They’re not ‘a little strong,’ they’re wishing away all the real issues.
If this is your plan for how to deploy future AIs much smarter than you, and let them operate strategically in the real world?
You are so spectacularly, obviously dead.
Within one minute I came up with several plans that would presumably work here. An AI would presumably do something smarter. But if you placed literal me (as an upload, let’s say) into this situation, and this was your plan, and I decide (quite reasonably, given what you clearly did to me) that I dislike you, I am rather confident: You lose.
And that’s true even if I give you the assumption that you can ‘tell if the output is aligned’ during testing, in the sense I presume it is intended here.
I found Tyler Cowen’s statement on this telling, especially the last line.
There is great demand for the perception that successful alignment is more likely.
The Lighter Side
Sign of the times.
Perfection.
To those who are losing sleep over existential risk from AI, I offer some perspective: