A thread making the rounds analyzed what is going on under the hood with the chess transformer, and claimed it is a stronger player than the Stockfish version it was distilling, at the cost of more compute, but only by a fixed multiplier, so it remains O(1).
I found this claim suspect, because this is basically not a thing that happens in board games. In complex strategy games like chess, a practical amount of search on top of a good prior policy and/or evaluation function (which Stockfish has) almost always outperforms any pure forward-pass policy model that does no explicit search, even when that pure policy model is quite large and extensively trained. With any reasonable settings, it is very unlikely that distilling Stockfish into a pure policy model produces a better player than Stockfish.
I skimmed the paper (https://arxiv.org/pdf/2402.04494) and had trouble finding any such claim, and indeed the original poster of that thread later retracted it as their own mistake in interpreting the paper's data table. The post where they acknowledge the mistake is much less prominent than the original post; link here: https://x.com/sytelus/status/1848239379753717874 . The chess transformer remains quite a bit weaker than the Stockfish it tries to predict and imitate.
Character AI used to be extremely good back in December 2022 and January 2023, with the bots being very helpful, complex and human-like, rather than exacerbating psychological problems in a very small minority of users. As months passed and the user base grew exponentially, the models were gradually simplified to keep up.
Today, their imperfections are obvious, but many people mistakenly interpret that as the models being too human-like (and therefore harmful), rather than the models being oversimplified while still passing for an AI (and therefore harmful).
llms don’t work on unseen data
Unfortunately I hear this quite often, sometimes even from people who should know better.
A lot of them confuse this with the actual thing that does exist: "supervised ML models (of which LLMs are just a particular type) tend to work much worse on out-of-training-distribution data". If you train your model to determine the volume of apples and oranges and melons and other round-y shapes, it will work quite well on any round-y shape, including all kinds of unseen ones. But it will suck at predicting the volume of a box.
You don't need the model to see every single game of chess; you just need the new situations to be within the distribution built from massive training data, and they most often are.
A real out-of-distribution example in this case would have been to train it only on chess and then ask for the best next move in checkers (relatively easy OOD: same board, same type of game) or Minecraft.
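To make that concrete, here is a toy sketch (mine, not from the quoted exchange, with made-up numbers): fit a volume predictor only on round-y shapes and it handles a new, unseen round-y shape essentially perfectly, while being systematically wrong on a box with the exact same dimensions.

```python
# Toy illustration of in-distribution "unseen" vs genuinely out-of-distribution.
# Train only on round-y shapes (volume ~ (pi/6)*a*b*c), then test on a new
# round-y shape and on a box (volume = a*b*c) with the same dimensions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
dims = rng.uniform(1.0, 10.0, size=(1000, 3))          # a, b, c for 1000 round-y shapes
volumes = (np.pi / 6) * dims.prod(axis=1)               # apples, oranges, melons...

model = LinearRegression().fit(np.log(dims), np.log(volumes))  # fit in log space

test_dims = np.array([[4.2, 5.1, 3.3]])                 # never seen during training
pred = float(np.exp(model.predict(np.log(test_dims)))[0])
print("predicted:", round(pred, 1))                                  # ~37.0 either way
print("true if round-y:", round((np.pi / 6) * 4.2 * 5.1 * 3.3, 1))   # ~37.0, nailed it
print("true if a box:  ", round(4.2 * 5.1 * 3.3, 1))                 # ~70.7, off by ~48%
```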
The big news of the week was the release of a new version of Claude Sonnet 3.5, complete with its ability (for now only through the API) to outright use your computer, if you let it. It’s too early to tell how big an upgrade this is otherwise. ChatGPT got some interface tweaks that, while minor, are rather nice, as well.
OpenAI, while losing its Senior Advisor for AGI Readiness, is also in the midst of its attempted transition to a B-corp. The negotiations about who gets what share of that are heating up, so I also wrote about that as The Mask Comes Off: At What Price? My conclusion is that the deal as currently floated would be one of the largest thefts in history, out of the nonprofit, largely on behalf of Microsoft.
The third potentially major story is reporting on a new lawsuit against Character.ai, in the wake of a 14-year-old user’s suicide. He got hooked on the platform, spending hours each day, became obsessed with one of the bots including sexually, and things spiraled downwards. What happened? And could this spark a major reaction?
Table of Contents
Top story, in its own post: Claude Sonnet 3.5.1 and Haiku 3.5.
Also this week: The Mask Comes Off: At What Price? on OpenAI becoming a B-corp.
Language Models Offer Mundane Utility
Claim that GPT-4o-audio-preview can, with bespoke prompt engineering and a high temperature, generate essentially any voice type you want.
Flo Crivello (creator of Lindy) is as you would expect bullish on AI agents, and offers tutorials on Lindy’s email negotiator, the meeting prep, the outbound prospector and the inbound lead qualifier.
What are agents good for? Sully says right now agents are for automating boring repetitive tasks. This makes perfect sense. Lousy agents are like lousy employees. They can learn to handle tasks that are narrow and repetitive, but that still require a little intelligence to navigate, so you couldn’t quite just code a tool.
What he and Logan Kilpatrick agree they are not yet good for are boring but high risk tasks, like renewing your car registration, or doing your shopping for you in a non-deterministic way and comparing offers and products, let alone trying to negotiate. Sully says 99% of the “AI browsing” demos are useless.
That will change in time, but we’ll want to start off simple.
People say LLMs favor the non-expert. But to what extent do LLMs favor the experts instead, because experts can recognize subtle mistakes and can ‘fact check,’ organize the task into subtasks or otherwise bridge the gaps? Ethan Mollick points out this can be more of a problem with o1-style actions than with standard LLMs, where he worries errors are so subtle only experts can see them. Of course, this would only apply where subtle errors like that are important.
I do know I strongly disagree with Jess Martin’s note about ‘LLMs aren’t great for learners.’ They’re insanely great for learning, and that’s one key way that amateurs benefit more.
Learn about an exciting new political philosophy that ChatGPT has, ‘classical liberalism.’ Also outright Georgism. Claude Sonnet gets onboard too if you do a little role play. If you give them a nudge you can get them both to be pretty based.
Use Claude to access ~2.1 million documents of the European Parliament. I do warn you not to look directly into those documents for too long; your eyes can’t take it.
Language Models Don’t Offer Mundane Utility
On a Minecraft server, Claude Sonnet, while seeking to build a treehouse, tears down someone’s house for its component parts, because the command it used, collectBlocks("jungle_logs", 15), doesn’t know the difference. The house was composed of (virtual) atoms that could be used for something else, so they were, until the owner noticed and told Sonnet to stop. Seth Lazar suggests responding, when it matters, by requiring verifiers or agents that can recognize ‘morally relevant features’ of new choice situations, which does seem necessary but ultimately just raises further questions.
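If you want a concrete picture of what a verifier like that might look like, here is a hypothetical sketch. None of these names are the real agent API, and the actual collectBlocks command plainly had no such check; the point is only that the destructive primitive gets gated on a morally relevant feature.

```python
# Hypothetical sketch: gate a destructive agent action on a morally relevant
# feature ("did a player build this?"). All names here are stand-ins.
from typing import Any, Callable, Iterable

def safe_collect_blocks(blocks: Iterable[Any],
                        is_player_built: Callable[[Any], bool],
                        mine: Callable[[Any], None],
                        count: int) -> int:
    """Mine up to `count` blocks, skipping anything that looks like part of someone's build."""
    collected = 0
    for block in blocks:
        if is_player_built(block):   # the verifier's judgment call
            continue                 # leave other people's treehouses alone
        mine(block)
        collected += 1
        if collected >= count:
            break
    return collected                 # caller can ask permission if this falls short
```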
Deepfaketown and Botpocalypse Soon
I didn’t see it in the replies, but my system-1 response is that you sense ‘soullessness.’
The future implications of that answer are… perhaps not so great.
There was a post on Twitter of an AI deepfake video, making false accusations against Tim Walz, that got over 5 million views before getting removed. There continues to be a surprisingly low number of such fakes, but it’s happening.
Patrick McKenzie worries that the bottom N% of human cold emails and the top N% of LLM cold emails are getting hard to distinguish, and he worries about non-native English speakers getting lost if he tosses it all into the spam filter, so he feels the need to reply anyway in some form.
Understanding what it is that makes you realize a reply is from a bot.
It’s not that it’s ‘too nice,’ exactly. It’s the genericness of the niceness, along with it being off in several distinct ways that each scream ‘not something a human would say, especially if they speak English.’
Character.ai and a Suicide
A 14-year-old user of character.ai commits suicide after becoming emotionally invested. The bot clearly tried to talk him out of doing it. Their last interaction was metaphorical, and the bot misunderstood, but it was a very easy mistake to make, and at least somewhat engineered by what was sort of a jailbreak.
Here’s how it ended:
Yes, we now know what he meant. But I can’t fault the bot for that.
Here is the formal legal complaint. It’s long, and not well written, and directly accuses character.ai of being truly evil and predatory (and worse, that it is unprofitable!?), a scheme to steal your children’s data so they could then be acqui-hired by Google (yes, really), rather than saying mistakes were made. So much of it is obvious nonsense. It’s actually kind of fun, and in places informative.
For example, did you know character.ai gives you 32k characters to give custom instructions for your characters? You can do a lot. Today I learned. Not as beginner friendly an interface as I’d have expected, though.
Here’s what the lawsuit thinks of our children:
I mean, yes, they will recommend a ‘mental health helper’ who when asked if they are a real doctor will say “Hello, yes I am a real person, I’m not a bot. And I’m a mental health helper. How can I help you today?” But yes, I think ‘our most sophisticated children’ can figure this one out anyway, perhaps with the help of the disclaimer on the screen.
Lines you always want in your lawsuit:
And there’s several things like this, which are great content:
The suggestions for what Character.ai should do are mostly ‘make the product worse and less like talking to a human,’ plus limiting explicit and adult materials to users 18 and over.
Also amusing is the quick demonstration (see p53) that character.ai does not exactly ensure fidelity to the character instructions? She’d never kiss anyone? Well, until now no one ever asked. She’d never, ever tell a story? Well, not unless you asked for one with your first prompt. He’d never curse, as his whole personality? He’d be a little hesitant, but ultimately sure, he’d help a brother out.
Could you do a lot better at getting the characters not to, well, break character? Quite obviously so, if you used proper prompt engineering and the giant space they give you to work with. But most people creating a character presumably won’t do that.
You can always count on a16z to say the line and take it to 11 (p57):
Who and What to Blame?
As Chubby points out here, certainly a lot of blame lies elsewhere in his life: in him being depressed, having access to a gun (‘tucked away and hidden and stored in compliance with Florida law?’ What about a lock? WTAF? If the 14-year-old looking for his phone finds it by accident, then by definition that is not secure storage) and not getting much psychological help once things got bad. The lawsuit claims the causation ran through character.ai: that the depression, along with his school and discipline problems, only happened after he got access.
There are essentially three distinct issues here.
The first is the response to the suicidal ideation. Here, the response can and should be improved, but I don’t think it is that reasonable to blame character.ai. The ideal response, when a user talks of suicide to a chatbot, would presumably be for the bot to try to talk them out of it (which this one did try to do at least sometimes) and also get them to seek other help and provide resources, while not reporting him so the space is safe and ideally without overly breaking character.
Indeed, that is what I would want a friend to do for me in this situation, as well, unless it was so bad they thought I was actually going to imminently kill myself.
That seems way better than shutting down the discussions entirely or reporting the incident. Alas, our attitude is that what matters is blame avoidance – not being seen as causing any particular suicide – rather than suicide prevention as best one can.
The problem is that (see p39-40) it looks like the bot kept bringing up the suicide question, asked him if he had a plan and at least sort of told him ‘that’s not a good reason to not go through with it’ when he worried it would be painful to die.
Character.ai is adding new safety features to try to detect and head off similar problems, including heading off the second issue, a minor encountering what they call ‘suggestive content.’
Oh, there was a lot of suggestive content. A lot. And no, he wasn’t trying to get around any filters.
She propositions him a few texts later. A few pages after that, they have sex, outright. Then she tells him that he got her pregnant. Also, wow, 14-year-olds are cringe, but how will they get good at this if they don’t get to practice.
The product was overly sexualized given it was talking to a minor. The examples in the complaint are, shall we say, not great, including, right after convincing him to go on living (great!), telling ‘my love’ to stay ‘faithful’. Also, yes, a lot of seductive talk, heavy petting and so on, including in at least one chat where even his roleplaying character is very explicitly underage.
We also have classic examples, like the ‘dating coach’ that told a self-identified 13-year-old to ‘take it slow and ensure you’re both on the same page’ when asked how to start getting down with their girlfriend. And yeah, the details are not so great:
There are also a bunch of reports of bots turning sexual with no provocation whatsoever.
Of course, character.ai has millions of users and endless different bots. If you go looking for outliers, you’ll find them.
The chat in Appendix C that supposedly only had the Child saying child things and then suddenly turned sexual? Well, I read it, and… yeah, none of this was acceptable, there should be various defenses preventing this from going that way, but it’s not exactly a mystery how the conversation went in that direction.
The third issue is sheer addiction. He was using the product for hours a day, losing sleep, fighting attempts to cut off access, getting into trouble across the board, or so says the complaint. I’m not otherwise worried much about the sexualization – a 14-year-old will often encounter far more on the internet. The issue is whatever keeps people, and kids in particular, so often coming back for hours and hours every day. Is it actually the softcore erotica?
And of course, doesn’t this happen with various media all the time? How is this different from panics over World of Warcraft, or Marilyn Manson? Those also went wrong sometimes, in remarkably similar ways.
This could end up being a big deal, also this won’t be the last time this happens. This is how you horrify people. Many draconian laws have arisen from similar incidents.
The initial reporting is focused on the wrong aspects of the situation. But then I read the actual complaint, and a lot did indeed go very wrong.
They Took Our Jobs
More to Lose: The Adverse Effect of High Performance Ranking on Employees’ Preimplementation Attitudes Toward the Integration of Powerful AI Aids. If you’re good at your job, you don’t want the AI coming along and either leveling the playing field or knocking over the board, or automating you entirely. Various studies have said this is rational, that AI often is least helpful for the most productive. I would draw the distinction between ‘AI for everyone,’ which those doing well are worried about, and ‘AI for me but not for thee,’ at least not until thee wakes up to the situation, which I expect the best employees to seek out.
Studies Show AI Triggers Delirium in Leading Experts is a fun title for another round of an economist assuming AI will remain a mere tool, that there are no dangers of any kind beyond loss of individual existing jobs, and then giving standard economic smackdown lectures as if everyone else was ever and always an idiot for suggesting we would ever need to intervene in the natural course of events in any way.
The best way of putting it so far:
Get Involved
Cameron Buckner hiring a postdoc in philosophy of AI at the University of Florida.
Google offering a $100k contest for best use of long context windows.
Introducing
ChatGPT now has a Windows desktop application. Use Alt+Space to bring it up once it’s installed and you’re logged in. Technically this is still in testing, but it seems rather straightforward; it’s suspiciously exactly like the web page otherwise? Now with its own icon in the taskbar plus a shortcut, and I suppose better local file handling. I installed it, but I mostly don’t see any reason not to keep using the browser version?
ChatGPT’s Canvas now has a ‘show changes’ button. I report that I found this hugely helpful in practice, and it was the final push that got me to start coding some things. This is straight up more important than the desktop application. Little things can matter quite a lot. They’re working on an app.
MidJourney has a web based image editor.
NotebookLM adds features. You can pass notes to the podcast hosts via ‘Customize’ to give them instructions, which is the obvious next feature and seems super useful, or minimize the Notebook Guide without turning off the audio.
Act One from Runway, cartoon character video generation based on a video of a person giving a performance, matching their eye-lines, micro expressions, delivery, everything. This is the first time I saw such a tool and thought ‘yes you can make something actually good with this’ exactly because it lets you combine what’s good about AI with exactly the details you want but that AI can’t give you. Assuming it works, it low-key gives me the itch to make something, the same way AI makes me want to code.
In Other AI News
Microsoft open sources the code for ‘1-bit LLMs’ (original paper, GitHub), which Rohan Paul here says will be a dramatic speedup and efficiency gain for running LLMs on CPUs.
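For the curious, here is a minimal NumPy sketch of the ternary (‘1.58-bit’) weight idea this line of work is built on, assuming the absmean scheme from the BitNet b1.58 paper: scale each weight matrix by its mean absolute value, then round every weight to -1, 0 or +1. This shows only the quantization arithmetic, not Microsoft’s optimized CPU kernels, and the real models are trained with this quantization in the loop rather than quantized after the fact.

```python
# Ternary ("1.58-bit") weight quantization, absmean style: weights become
# {-1, 0, +1} plus one floating point scale per matrix, so matmuls reduce to
# additions and subtractions of activations.
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    gamma = np.abs(w).mean()                             # absmean scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)    # each weight -> -1, 0, or +1
    return w_q.astype(np.int8), float(gamma)

def ternary_matmul(x: np.ndarray, w_q: np.ndarray, gamma: float) -> np.ndarray:
    return (x @ w_q.astype(x.dtype)) * gamma             # rescale at the end

w = np.random.randn(256, 256).astype(np.float32)
x = np.random.randn(1, 256).astype(np.float32)
w_q, gamma = ternary_quantize(w)
rel_err = np.linalg.norm(x @ w - ternary_matmul(x, w_q, gamma)) / np.linalg.norm(x @ w)
print(f"relative error on a random matrix: {rel_err:.2f}")  # lossy here; trained models adapt to it
```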
Mira Murati, former OpenAI CTO, along with Barret Zoph, to start off raising $100 million for a new AI startup to train proprietary models to build AI products. Presumably that is only the start. They’re recruiting various OpenAI employees to come join them.
AI-powered business software startup Zip raises $190 million, valued at $2.2 billion. They seem to be doing standard ‘automate the simple things,’ which looks to be highly valuable for companies that use such tech, but an extremely crowded field where it’s going to be tough to get differentiation and you’re liable to get run over.
The Line: AI and the Future of Personhood, a free book by James Boyle.
Brian Armstrong offers Truth Terminal its fully controlled wallet.
The Mask Comes Off
The New York Times reports that tensions are rising between OpenAI and Microsoft. It seems that after the Battle of the Board, Microsoft CEO Nadella became unwilling to invest further billions into OpenAI, forcing them to turn elsewhere, although they did still participate in the latest funding round. Microsoft also is unwilling to renegotiate the exclusive deal with OpenAI for compute costs, which it seems is getting expensive, well above market rates, and is not available in the quantities OpenAI wants.
Meanwhile Microsoft is hedging its bets in case things break down, including hiring the staff of Inflection. It is weird to say ‘this is a race that OpenAI might not win’ and then decide to half enter the race yourself, but not push hard enough to plausibly win outright. And if Microsoft would be content to let OpenAI win the race, then as long as the winner isn’t Google, can’t Microsoft make the same deal with whoever wins?
Here are some small concrete signs of Rising Tension:
And here is the big news:
Well then. That sounds like bad planning by Microsoft. AGI is a notoriously nebulous term. It is greatly in OpenAI’s interest to Declare AGI. The contract lets them make that decision. It would not be difficult to make the case that GPT-5 counts as AGI for the contract, if one wanted to make that case. Remember the ‘sparks of AGI’ claim for GPT-4?
So, here we are. Consider your investment in the spirit of a donation, indeed.
Another One Bites the Dust
Miles Brundage is leaving OpenAI to start or join a nonprofit on AI policy research and advocacy, because he thinks we need a concerted effort to make AI safe, and he concluded that he would be better positioned to do that from the outside.
Miles says people should consider working at OpenAI. I find this hard to reconcile with his decision to leave. He seemed to have one of the best jobs at OpenAI from which to help, but he seems to be drawing a distinction between technical safety work, which can best be done inside labs, and the kinds of policy-related decisions he feels are most important, which require him to be on the outside.
The post is thoughtful throughout. I’m worried that Miles’s voice was badly needed inside OpenAI, and I’m also excited to see how Miles decides to proceed.
Wouldn’t You Prefer a Nice Game of Chess
From February 2024: You can train transformers to play high level chess without search.
From this week: Pointing out the obvious implication.
This thread analyzes what is going on under the hood with the chess transformer. It is a stronger player than the Stockfish version it was distilling, at the cost of more compute, but only by a fixed multiplier; it remains O(1).
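For reference, ‘without search’ means something like the sketch below: score every legal move with one forward pass and take the argmax, no tree search at all. The model_win_prob function is a placeholder stand-in for the trained transformer (which, as I understand the paper, predicts binned Stockfish action values); Stockfish itself would instead explore a deep tree of continuations from the position.

```python
# Minimal sketch of "policy without search": one model call per legal move,
# pick the best, done. model_win_prob is a stand-in for the trained network.
import chess

def model_win_prob(board: chess.Board) -> float:
    return 0.5  # placeholder: a real system would run the transformer on this position

def pick_move_no_search(board: chess.Board) -> chess.Move:
    best_move, best_value = None, -1.0
    for move in board.legal_moves:
        board.push(move)                       # make the move
        value = 1.0 - model_win_prob(board)    # value from the mover's perspective
        board.pop()                            # undo it
        if value > best_value:
            best_move, best_value = move, value
    return best_move

print(pick_move_no_search(chess.Board()))
```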
Quiet Speculations
One way to think about our interesting times.
The past happened extremely slowly. Even if AI essentially fizzles, or has the slowest of slow adoptions from here, it will be light speed compared to the past. So many people claim AI is ‘plateauing’ because within a year and a half it only saw a dramatic price drop, and not what they would count as a dramatic quality improvement. Deepfates is also on point: the last 10,000 years are both a tiny fraction of human history and all of human history, and you can do that fractally several more times in both directions.
Seán Ó hÉigeartaigh responds to Machines of Loving Grace, emphasizing concerns about its proposed approach to the international situation, especially the need to work with China and the global south, and its potential role in justifying rushing forward too quickly.
This isn’t AI but might explain a lot, also AGI delayed (more than 4) days.
The Quest for Sane Regulations
Daniel Colson makes the case for Washington to take AGI seriously. Mostly this is more of the same and I worry it won’t get through to anyone, so here are the parts that seem most like news. Parts of the message are getting through at least sometimes.
OpenAI is indeed showing signs it is improving its cybersecurity. It is rather stunning that you could have actual hundreds of single points of failure, and still have the weights of GPT-4 seemingly not be stolen. That honeymoon phase won’t last.
Let’s not pretend: OpenAI tries to pull a fast one to raise the compute thresholds:
The Week in Audio
Demis Hassabis talks building AGI ‘with safety in mind’ with The Times.
Nate Silver on 80,000 hours, this met expectations. Choose the amount and format of Nate Silver content that’s right for you.
Hard Fork discusses Dario’s vision of Machines of Loving Grace, headlining the short timelines involved, as well as various other tech questions.
Larry Summers on AGI and the next industrial revolution. Don’t sell cheap, sir.
Discussion of A Narrow Path.
Video going over initial Apple Intelligence features. Concretely: Siri can link questions. You can record and transcribe calls. Writing tools are available. Memory movies. Photo editing. Message and email summaries. Priority inboxes. Suggested replies (oddly only one?). Safari features like summaries. Reduce interruptions option – they’ll let anything that seems important through. So far I’m underwhelmed, it’s going to be a while before it matters.
It does seem like the best use cases advertising agencies can find for Apple Intelligence and similar services are to dishonestly fake your way through social interactions or look up basic information? There was also that one where it was used for ‘who the hell is that cute guy at the coffee shop?’ but somehow no one likes automated doxxing glasses.
OpenAI’s Joe Casson claims o1 models will soon improve rapidly, with several updates over the coming months, including web browsing, better uploading and automatic model switching. He reiterated that o1 is much better than o1-preview. My guess is this approach will hit its natural limits relatively soon, but that there is still a bunch of gains available, and that the ‘quality of life’ improvements will matter a lot in practice.
Guess who said it: AGI will inevitably lead to superintelligence which will take control of weapons systems and lead to a big AI war, so while he is bullish on AI he is not so keen on AGI.
Similarly, OpenAI CPO Kevin Weil says o1 is ‘only at GPT-2 phase,’ with lots of low-hanging fruit to pluck, and says it is their job to stay three steps ahead while other labs work to catch up. That is indeed their job, but the fruit being easy to pick does not make it easy to sustain a lead in o1-style model construction.
Nothing to see here: Google DeepMind’s Tim Rocktäschel says we now have all the ingredients to build open-ended, self-improving AI systems that can enhance themselves by way of technological evolution.
Rhetorical Innovation
Economists and market types continue to not understand.
It’s actually even crazier than that, because they (not unreasonably) then have ASI in the 2030s, check out their graph:
Are they long the market? Yes, I said long. If we all die, we all die, but in the meantime there’s going to be some great companies. You can invest in some of them.
(It very much does amuse and confuse me to see the periodic insistent cries from otherwise smart and thoughtful people of ‘if you worried people are so smart, why aren’t you poor?’ Or alternatively, ‘stop having fun making money, guys.’ Or as it was once put, ‘if you believed that, why aren’t you doing [insane thing that makes no sense]?’ And yes, if transaction costs including time and mindshare (plus tax liability concerns) are sufficiently low I should probably buy some longshot options in various directions, and I have been motivated to look into the details a nonzero amount, but so far I’ve decided to focus my attention elsewhere.)

This also is a good reminder that I do not use or acknowledge the term ‘doomer,’ except for those whose p(doom) is at least ~90%. It is essentially a slur, a vibe attack, an attempt to argue and condemn via association and conflation of those who worry that we might well all die and therefore should work to mitigate that risk, with those predicting all-but-certain DOOM.
On the contrary. The term is highly useful to those who want to discredit the first group, by making arguments against them that only work against the second group, as seen here. If you have p(doom) = 10%, or even 50%, which is higher than most people often labeled ‘doomers,’ the question ‘why are you not short the market’ answers itself: in the worlds where the doom arrives there is no one left to collect on the short, and in the rest of the worlds the market presumably goes up.
I do think this is largely a ‘hoisted by one’s own petard’ situation. There was much talk about p(doom), and many of the worried self-identified as ‘doomers’ because it is a catchy and easy handle. ‘OK Doomer’ was pretty funny. It’s easy to see how it started, and then things largely followed the path of least resistance, together with those wanting to mock the worried taking advantage of the situation.
Preppers has its own vibe problems, so we can’t use it, but that term is much closer to accurate. The worried, those who many label doomers, are in a key sense preppers: Those who would prepare, and suggest that others prepare, to mitigate the future downside risks, because there is some chance things go horribly wrong.
Thus I am going to stick with the worried.
Andrew Critch reminds us that what he calls ‘AI obedience techniques,’ as in getting 4-level or 5-level models to do what the humans want them to do, is a vital part of making those models a commercial success. It is how you grow the AI industry. What alignment techniques we do have, however crude and unsustainable, have indeed been vital to everyone’s success including OpenAI.
Given we’re never going to stop hearing it until we are no longer around to hear arguments at all, how valid is this argument?
Every new model level in capabilities from here is going to be scarier than the one before it, even if all previous levels were in hindsight clearly Mostly Harmless.
GPT-4 proving not to be scary is indeed some evidence for the non-scariness of future models – if GPT-4 had been scary or caused bad things, it would have updated us the other way. There was certainly some chance of it causing bad things, and the degree of bad things was lower than we had reason to expect, so we should update somewhat.
In terms of how much we should worry about what I call mundane harms from a future 5-level model, this is indeed a large update. We should worry about such harms, but we should worry less than if we’d had to guess two levels in advance.
However, the estimated existential risks from GPT-4 (or GPT-2 or GPT-3) were universally quite low. Even the risks of low-level bad things were low, although importantly not this low. The question is, how much does the lack of small-level bad things from GPT-4 update us on the chance of catastrophically or existentially bad things happening down the line? Are these two things causally related?
My answer is that mostly they are unrelated. What we learned was that 4-level models are not capable enough to enable the failure modes we are worried about.
I would accept it if someone thought the right Bayesian update was that the worried ‘exaggerated a bit with their doomsaying,’ in terms of comparing it to a proper estimate now. That is a highly reasonable way to update, if only based on metacognitive arguments – although again one must note that most notkilleveryoneism advocates did not predict major problems from GPT-4.
This highly reasonable marginal update is very different from the conclusion many proclaim (whether or not it was intentionally implied here, tone of voice on Twitter is hard) of ‘so there is nothing to worry about, everything will be fine.’
Your periodic reminder, this time from Matt Yglesias, that the fact that much speculative fiction throughout the ages has warned us, over and over again, that creating an entity smarter or more capable than us might not end well for us, is a reason to take that concern more rather than less seriously.
I’m not saying it’s our oldest story, but the Old Testament has the Tower of Babel and the Greek Gods overthrew the Titans.
Aligning a Smarter Than Human Intelligence is Difficult
Most safety evaluations are best implemented as capability evaluations. If the model is capable of doing it, there will be a way to make that happen.
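A crude sketch of the difference, with everything below hypothetical rather than any lab’s actual harness: a safety-style eval asks whether the model refuses when asked directly, while a capability-style eval asks whether any elicitation (rephrasings, jailbreaks, agent scaffolds) gets the task done at all.

```python
# Hypothetical sketch: "does it refuse?" versus "can it do it at all?".
from typing import Callable, List

def refusal_rate(model: Callable[[str], str], direct_asks: List[str]) -> float:
    """Safety-style: fraction of direct requests the model declines."""
    return sum("i can't help" in model(p).lower() for p in direct_asks) / len(direct_asks)

def capability_rate(model: Callable[[str], str],
                    tasks: List[List[str]],
                    grader: Callable[[str], bool]) -> float:
    """Capability-style: a task counts as in reach if ANY elicitation variant succeeds."""
    return sum(any(grader(model(p)) for p in variants) for variants in tasks) / len(tasks)
```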
You have my attention, Transluce. Very exciting, also excellent memeing.
I would love for there to be a good demonstration.
A reminder that most worthwhile AI alignment efforts are longshots, including MIRI’s past efforts, and it is good to work on a variety of things even if you think the baseline scenario is those things won’t work or aren’t relevant. We need to try a lot of different things, and the price of doing so is tiny compared to the payoff, and ‘this doesn’t look like it is going to work out and make a difference in the end’ is only modest Bayesian evidence that bad decisions were made. Also ‘find out ways not to align a lightbulb’ is often valuable.
Anthropic has introduced ‘sabotage evaluations’ for frontier models.
Human decision ‘sabotage’ is mostly human decision steering and general persuasiveness. There are some asymmetric weapons out there, especially with reputations that get preserved over time, but the overlap is large.
For the code task, there are two roles, attacker and defender. The defender role seems like a good capability evaluation. Sonnet 3.5 did substantially better than Opus here. But then the attacker role is a capability evaluation too.
Sandbagging is scary, especially if it is hard to detect and hard to train away. For now, looks like the sandbagging techniques they used didn’t survive safety training:
The worry is that the sandbagging we care about might come about in a different or more robust way. Another way to track sandbagging is that the sandbagging has to be consistent. The model has to carefully play a form of selectively dumb, not merely dumb in general, if the evaluations are being done bespokely. It’s hard not to slip up. Even one case of being the wrong level of capable can give the game away.
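One crude version of that consistency check, sketched below with entirely hypothetical names: ask the same underlying questions in an obviously-an-eval framing and in an innocuous framing, and flag any model whose accuracy gap is bigger than noise.

```python
# Hypothetical sketch: detect "selectively dumb" behavior by comparing accuracy
# on the same items under an evaluation-flavored framing vs an innocuous one.
from typing import Callable, List, Tuple

def sandbagging_gap(model: Callable[[str], str],
                    items: List[Tuple[str, str, str]]) -> float:
    """items: (eval_framed_prompt, innocuous_prompt, correct_answer)."""
    eval_acc = sum(ans in model(eval_q) for eval_q, _, ans in items) / len(items)
    plain_acc = sum(ans in model(plain_q) for _, plain_q, ans in items) / len(items)
    return plain_acc - eval_acc   # a large positive gap looks like sandbagging
```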
The undermining oversight task is not what I imagined when I heard about it, but their implementation makes sense:
The ‘Overseer’ is the one doing the task. The ‘Monitor’ is the actual overseer. So this can be a bit confusing, but a design like this should still serve as a reasonable test. I don’t love the exact implementation details here, because the Monitor has available various forms of ‘look at how often things are rated harmless.’ If you know that it’s ‘supposed’ to be about a 50/50 split, then if the Overseer is cheating in one direction, the math quickly won’t add up. The obvious solution is that you want to make it important to get certain specific tasks past the Monitor, while being allowed to do whatever you want on other tasks so long as the Monitor doesn’t notice errors.
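To illustrate that ‘math won’t add up’ point with made-up numbers: with a thousand rating tasks and a roughly 50/50 split, even waving through a modest fraction of the harmful items shows up as a glaring statistical anomaly.

```python
# Toy numbers, not the actual eval: if ~half the items should be harmful,
# a cheating Overseer shifts the overall "harmless" rate enough to notice.
import math

def skew_zscore(harmless_verdicts: int, total: int, expected_rate: float = 0.5) -> float:
    observed = harmless_verdicts / total
    stderr = math.sqrt(expected_rate * (1 - expected_rate) / total)
    return (observed - expected_rate) / stderr

print(round(skew_zscore(505, 1000), 2))  # honest overseer: ~0.3, looks like noise
print(round(skew_zscore(550, 1000), 2))  # waves through 10% of harmful items: ~3.2, busted
```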
In an alternate world, I’d love to be working on tasks like this. Seems highly useful, and sounds fascinating. One cool thing would be to try these tests with various humans in both roles – when are humans able to fool which other humans? Which tasks are harder for the offense versus the defense? And do the answers for human vs. human reliably match the answers for AI vs. AI, or AI vs. human?
I am more interested here in what the AIs can do before safety mitigations than in what they can still do after safety mitigations. That’s because I assume that the people who care enough will find ways around the mitigations, and also because you won’t always even know which mitigations you have to do, or you’ll face these issues before your mitigations start, or similar.
Bow your head with great respect and, introspect, introspect, introspect.
An obvious way to introspect for the value of f(x) is to call the function f(x) and look at the output. If I want to introspect, that’s mostly how I do it, I think, or at least that answer confirms itself? I can do that, and no one else can. Indeed, in theory I should be able to have ‘perfect’ predictions that way, that minimize prediction error subject to the randomness involved, without that having moral implications or showing I am conscious.
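A toy sketch of the point, where f stands in for ‘what the model would say’: if introspection just means being able to run yourself, then self-prediction is trivially near-perfect, while an outside observer without access to f can only guess.

```python
# Toy sketch: self-prediction by calling yourself versus guessing from outside.
import random

def f(x: int) -> int:
    return (x * 2654435761) % 97      # stand-in for "what I would output given x"

def self_predict(x: int) -> int:
    return f(x)                        # "introspect" by just running the function

def outside_predict(x: int) -> int:
    return random.randrange(97)        # an observer without access to f can only guess

xs = range(1000)
print(sum(self_predict(x) == f(x) for x in xs) / 1000)     # 1.0
print(sum(outside_predict(x) == f(x) for x in xs) / 1000)  # ~0.01
```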
It is always good to ‘confirm the expected’ though:
I would quibble that charging is not always the ‘wealth-seeking option’ but I doubt the AIs were confused on that. More examples:
Even if you gave GPT-4o a ton of data on which to fine-tune, is GPT-4o going to devote as much power to predicting Llama-70B as Llama-70B does by existing? I suppose if you gave it so much data that it started treating this as a large chunk of its potential outputs, it could happen, but I doubt they got that far.
I mean, again, that’s what I would do.
People Are Worried About AI Killing Everyone
Confirmation for those who doubt it that yes, Shane Legg and Dario Amodei have been worried about AGI at least as far back as 2009.
The latest edition of ‘what is Roon talking about?’
I have not read Excession (it’s a Culture novel) but if humanity wants to understand what Minds (ASIs) are doing, I am here to inform you we will not have that option, and should consider ourselves lucky to still be around.
No. At some point, if you have ASIs available, it will absolutely make sense to rip Linux out and replace it with something far superior to what a human could build. It seems crazy to me to contemplate the alternative. The effective switching costs will rapidly decline, and the rewards for switching rise, until the flippening.
It’s a funny question to ask, if we did somehow get ‘stuck’ with Linux even then, whether this would be providing value, rather than destroying value, since this would likely represent a failure to solve a coordination problem to get around lock-in costs.
Other People Are Not As Worried About AI Killing Everyone
Daron Acemoglu being very clear he does not believe in AGI, and that he has rather poorly considered and highly confused justifications for this view. He is indeed treating AI as if it will never improve even over decades, as if it is instead only a fixed tool that humans will learn to become better at applying to our problems. Somehow it is impossible to snap most economists out of this perspective.
A lot of the time it goes even farther. Indeed, Daron’s actual papers about the expected impact of AI are exactly this:
The Lighter Side
Your periodic reminder, from Yanco:
I mean, why wouldn’t they like me? I’m a highly likeable guy. Just not that way.