This was Anthropic Vision week here at DWATV, which caused things to fall a bit behind on other fronts, even within AI. Several topics are getting pushed forward, as the Christmas lull appears to be over.
Upcoming schedule: Friday will cover Dario’s essay The Adolescence of Technology. Monday will cover Kimi K2.5, which is potentially a big deal. Tuesday is scheduled to be Claude Code #4. I’ve also pushed discussions of the question of the automation of AI R&D, or When AI Builds AI, to a future post, when there is a slot for that.
So get your reactions to all of those in by then, including in the comments to today’s post, and I’ll consider them for incorporation.
Table of Contents
Language Models Offer Mundane Utility
Paul Graham seems right that present AI’s sweet spot is projects that are rate limited by the creation of text.
Code without coding.
It was fun in the way puzzles are fun, but also infuriating in the way puzzles are infuriating. If you had to complete jigsaw puzzles in order to get things done, jigsaw puzzles would get old fast.
Have the AI edit a condescending post so that you can read it without taking damage. Variations on this theme are also highly underutilized.
The head of Norway’s sovereign wealth fund reports 20% productivity gains from Claude, saying it has fundamentally changed their way of working at NBIM.
Overcoming Bias
A new paper affirms that current LLMs by default exhibit human behavioral biases in economic and financial decisions, that asking for EV calculations doesn’t typically help, but that role-prompting can somewhat mitigate this. Providing a summary of Kahneman and Tversky actively backfires, presumably by emphasizing the expectation of the biases. As per usual, some of the tests involve clear-cut errors, while others involve choices that are usually mistakes but less obviously so.
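As a worked example of the kind of clear-cut test this literature uses, here is the classic Kahneman and Tversky certainty-effect gamble (I am not claiming this exact item appears in the paper, it just illustrates what an EV calculation is supposed to catch):

```python
# Classic certainty-effect choice: a sure $3,000 versus an 80% chance of $4,000.
sure_thing_ev = 3000
gamble_ev = 0.80 * 4000 + 0.20 * 0

print(f"Sure thing EV: ${sure_thing_ev:,}")
print(f"Gamble EV:     ${gamble_ev:,.0f}")
# The gamble has the higher expected value ($3,200 vs $3,000), yet most human
# subjects in the original studies chose the sure thing -- the sort of bias the
# paper finds LLMs reproducing by default.
```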
Huh, Upgrades
Gemini in Chrome gets substantial quality of life improvements:
Gemini offers full-length mock JEE (India’s Joint Entrance Examination, which replaced the AIEEE, the All India Engineering Entrance Examination) tests for free. This builds on last week’s free SAT practice tests.
Claude (as in Claude.ai) adds interactive work tools as connectors within the webpage: Amplitude, Asana, Box, Canva, Clay, Figma, Hex, Monday.com and Slack.
Claude in Excel now available on Anthropic’s Pro plans. I use Google Sheets instead of Excel, but this could be a reason to switch? I believe Google uses various ‘safeguards’ that make it very hard to make a Claude for Sheets function well. The obvious answer is ‘then use Gemini’ except I’ve tried that. So yeah, if I was still doing heavy spreadsheet work this (or Claude Code) would be my play.
On Your Marks
EpochAI offers us a new benchmark, FrontierMath: Open Problems. All AIs and all humans currently score zero. Finally a benchmark where you can be competitive.
The ADL rates Anthropic’s Claude as best AI model at detecting antisemitism.
Choose Your Fighter
I seriously do not understand why Gemini is so persistently not useful in ways that should be right in Google’s wheelhouse.
Roon is a real one. I wonder how many would pay double to get a faster version.
We used to hear a lot more of this type of complaint; these days we hear it much less. I would summarize the OP as ‘Claude tells you smoking causes cancer so you quit Claude.’
I mean, if he’s not joking then the obvious explanation, especially given who is talking, is that this was probably going to be both a ‘true threat’ and ‘criminal solicitation.’ That wouldn’t exactly be a shocking development there.
As per the discussion of Claude’s constitution, the corrigibility I care about is very distinct from ‘go along with things it dislikes,’ but also I notice it’s been my main model for a while now and I’ve run into that objection exactly zero times, although a few times I’ve hit the classifiers while asking about defenses against CBRN risks.
Deepfaketown and Botpocalypse Soon
Well, it sounds bad when you put it like that: over 50 papers published at NeurIPS 2025 contain AI hallucinations, according to GPTZero. Or does it? Here’s the claim:
The ‘just’ is a tell. Why wouldn’t or shouldn’t Google researchers be using AI?
One has to laugh at that last line. Have you met peer review?
More seriously, always look at base rates. There were 5,290 accepted papers out of 21,575 submissions. Claude estimates we would expect 20%-50% of results to not reproduce, that 10% of papers at top venues have errors serious enough that a careful reader would notice something is wrong, and that maybe 3% would merit retraction. Against that backdrop, a roughly 1% rate of detectable ‘hallucinations’ (50+ out of 5,290) isn’t terribly surprising or even worrying.
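As a back-of-the-envelope check using only the figures above (treating ‘over 50’ as 50 for the arithmetic):

```python
# Rough base-rate arithmetic using the figures cited above.
accepted = 5290                      # accepted papers at NeurIPS 2025
flagged = 50                         # papers GPTZero flagged for hallucinations

rates = {
    "results that would fail to reproduce (low end)": 0.20,
    "results that would fail to reproduce (high end)": 0.50,
    "serious errors a careful reader would catch": 0.10,
    "errors meriting retraction": 0.03,
    "detectable 'hallucinations' (observed)": flagged / accepted,
}

for label, rate in rates.items():
    print(f"{label:>48}: ~{rate:6.1%}  (~{rate * accepted:5.0f} papers)")
```

The flagged-hallucination rate comes out around 0.9%, well below any of the other estimated error base rates.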
I agree with Alexander Doria that if you’re not okay with this level of sloppiness, then a mega-conference format is not sustainable.
Then we have Allen Roush saying several of the ‘hallucinated’ citations are just wrongly formatted, although Alex Cui claims they filtered such cases out.
Also sounding bad, could ‘malicious AI swarms threaten democracy’ via misinformation campaigns? I mean sure, but the surprising thing is the lack of diffusion or impact in this area so far. Misinformation is mostly demand driven. Yes, you can ‘infiltrate communities’ and manufacture what looks like social consensus or confusion, and the cost of doing that will fall dramatically. Often it will be done purely to make money on views. But I increasingly expect that, if we can handle our other problems, we can handle this one. Reputational and filtering mechanisms exist.
White House posts a digitally altered photograph of the arrest of Nekima Levy Armstrong that makes it falsely look like she was crying, presented as if it were a real photograph. This is heinous behavior. Somehow it seems like this is legal? It should not be legal. It also raises the question of what sort of person would think to do this, and wants so badly to brag about making someone cry that they created a fake photo of it.
Cybersecurity On Alert
Kudos to OpenAI for once again being transparent on the preparedness framework front, and warning us when they’re about to cross a threshold. In this case, it’s the High level of cybersecurity, which is perhaps the largest practical worry at that stage.
The proposed central mitigation is ‘defensive acceleration,’ and we’re all for defensive acceleration but if that’s the only relevant tool in the box the ride’s gonna be bumpy.
Fun With Media Generation
Here’s Isometric.nyc, a massive isometric pixel map of New York City created with Nana Banana and coding agents, including Claude. Take a look, it’s super cool.
Grok image-to-video generation expands to 10 seconds and claims to have improved audio, and is only a bit behind Veo 3.1 on Arena and is at the top of Artificial Analysis rankings. The video looks good. There is the small matter that the chosen example is very obviously Sydney Sweeney, and in the replies we see it’s willing to do the image and voice of pretty much any celebrity you’d like.
This link was fake, Disney is not pushing to use deepfakes of Luke Skywalker in various new Star Wars products while building towards a full spinoff, but I see why some people believed it.
I’m going to get kicked out, aren’t I?
You Drive Me Crazy
Dean Ball offers his perspective on children and AI, and how the law should respond. His key points:
They Took Our Jobs
Unemployment is bad. But having to do a job is centrally a cost, not a benefit.
Get Involved
Google DeepMind is hiring a Chief AGI Economist. If you’ve got the chops to get hired on this one, it seems like a high impact role. They could easily end up with someone who profoundly does not get it.
There are things other than AI out there that one might get involved in, or speak out about. My hat is off to those who are doing so, including as noted in this post, especially given what they are risking by doing so.
Introducing
Havelock.AI, a project by Joe Weisenthal which detects the presence of orality in text.
You can also ask ChatGPT, but as Roon notes the results you get on such questions will be bimodal rather than calibrated. The other problem is that an LLM might recognize the passage.
Poison Fountain is a service that feeds junk data to AI crawlers. Ultimately, if you’re not filtering your data well enough to dodge this sort of attack, it’s good that you are getting a swift kick to force you to fix that.
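To make ‘filtering your data well enough’ concrete, here is a toy sketch (my own illustration, not anyone’s actual pipeline; real crawler-side filtering is far more involved) of the kind of crude heuristics that would already catch the most obvious junk:

```python
import math
from collections import Counter

def looks_like_junk(text: str,
                    max_entropy: float = 4.8,
                    max_repeat_frac: float = 0.3,
                    min_alpha_frac: float = 0.6) -> bool:
    """Crude heuristics: flag pages that look like high-entropy gibberish,
    are heavily repetitive, or are mostly non-alphabetic symbols."""
    if not text:
        return True
    n = len(text)
    # Character-level entropy: generated junk tends toward near-uniform noise.
    counts = Counter(text)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Repetition: a page dominated by one repeated line is suspicious.
    lines = [line for line in text.splitlines() if line.strip()]
    repeat_frac = (Counter(lines).most_common(1)[0][1] / len(lines)
                   if len(lines) >= 5 else 0.0)
    # Symbol soup: real prose is mostly letters and spaces.
    alpha_frac = sum(ch.isalpha() or ch.isspace() for ch in text) / n
    return (entropy > max_entropy
            or repeat_frac > max_repeat_frac
            or alpha_frac < min_alpha_frac)

pages = [
    "Ordinary prose about a real topic, written in normal sentences.",
    "xq9#kz@@pl0mz!!vv##" * 50,   # the kind of junk a poisoning service might serve
]
clean = [p for p in pages if not looks_like_junk(p)]
print(len(clean), "of", len(pages), "pages kept")  # 1 of 2
```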
OpenAI prism, a workspace for LaTeX-based scientific writing.
Confer, Signal cofounder Moxie Marlinspike’s encrypted chatbot that won’t store any of your data. The system is so private it won’t tell you which model you’re talking to. I do not think he understands what matters in this space.
In Other AI News
This sounds awesome in its context but also doesn’t seem like a great sign?
Whereas this doesn’t sound awesome:
We are going to see a lot more of this sort of thing over time.
Is Anthropic no longer competing with OpenAI on chatbots, having pivoted to building and powering vertical AI infrastructure and coding and so on to win with picks and shovels? It’s certainly pumping out the revenue and market share, without a meaningful cut of the consumer chatbot market.
I’d say that they’ve shifted focus, and don’t care much about their chatbot market share. I think this is directionally wise, but that a little effort at maximizing the UI and usefulness of the chatbot interface would go a long way, given that they have in many ways the superior core product. As Claude takes other worlds by storm, that can circle back to Claude the chatbot, and I think a bunch of papercuts are worth solving.
An essay on the current state of brain emulation. It does not sound like this will be an efficient approach any time soon, and we are still orders of magnitude away from any practical hope of doing it. Still, you can see it starting to enter the realm of the future possible.
Anthropic is partnering with the UK government to build and pilot a dedicated AI-powered assistant for GOV.UK, initially focusing on supporting job seekers.
Financial Times has a profile of Sriram Krishnan, who has been by all reports highly effective at executing behind the scenes.
Among his colleagues, while he supports a number of things I think are highly damaging, Krishnan has been an outlier in his willingness to be curious, to listen and to engage in argument. When he is speaking directly he chooses his words carefully. He manages to do so while maintaining close ties to Marc Andreessen and David Sacks, which is not easy, and also not free.
Show Me the Money
Claude Code is blowing up, but it’s not alone. OpenAI added $1 billion in ARR in the last month from its API business alone.
Fei-Fei Li’s new company World Labs is in talks to raise up to $500 million at a $5 billion valuation, with the pitch being based on ‘world models’ and that old ‘LLMs only do language’ thing.
The unit economics of AI are quite good, but the fixed costs are very high. Subscription models offer deep discounts if you use them maximally efficiently, so they can be anything from highly profitable to big loss leaders.
This is not what people are used to in tech, so they assume it must not be true.
It is certainly possible in theory that Sonnet really does cost that much more to run than GLM 4.7, but we can be very, very confident it is not true in practice.
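To make the subscription point above concrete, a toy sketch with entirely made-up numbers (a hypothetical $20/month plan and a hypothetical serving cost per million tokens), just to show how the same plan can be highly profitable on one user and a big loss leader on another:

```python
# Toy arithmetic with assumed numbers, purely for illustration.
price_per_month = 20.00          # hypothetical subscription price
marginal_cost_per_mtok = 0.50    # hypothetical serving cost per million tokens

for label, tokens_millions in [("light user", 2), ("median user", 15), ("power user", 400)]:
    serving_cost = tokens_millions * marginal_cost_per_mtok
    margin = price_per_month - serving_cost
    print(f"{label:>12}: serving cost ${serving_cost:7.2f}, margin ${margin:8.2f}")
```

The light user is nearly pure margin, the power user is deeply underwater, and the average depends entirely on the usage mix, which is why the same product can look like a cash machine or a loss leader depending on who you ask.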
Jerry Tworek is going the startup route with Core Automation, looking to raise $1 billion to train AI models, a number that did not make any of us even blink.
Bubble, Bubble, Toil and Trouble
It doesn’t count. That’s not utility. As in, here’s Ed Zitron all but flat out denying that coding software is worth anything. I mean, what’s the point?
No, Zitron’s previous position was not ‘number might go down,’ it was that the tech had hit a dead end and peaked as early as March, which he was bragging about months later.
Toby Stuart analyzes how that whole nonsensical ‘MIT study says 95% of AI projects fail’ story caught so much fire and became a central talking point, despite it being not from MIT, not credible or meaningful, and also not a study. It was based on 52 interviews at a conference, but once Forbes had ‘95% fail’ and ‘MIT’ together in a headline, things took off and no amount of correction much mattered. People were too desperate for signs that AI was a flop.
But what’s the point of harping on Zitron missing the point, or on something like the non-MIT non-study? Why should we care?
Can you?
On the China question: Is Xi ‘AGI-pilled’? Not if you go by what Xi says. If you look at the passages quoted here by Teortaxes in detail, this is exactly the ‘AI is a really big deal but as a normal technology’ perspective. It is still a big step up from anything less than that, so it’s not clear Teortaxes and I substantively disagree.
I have no desire to correct Xi’s error.
Teortaxes notices that Xi’s idea of ‘AGI risks’ is ‘disinformation and data theft,’ which is incredibly bad news and means Xi (and therefore, potentially, the CCP and all under their direction) will mostly ignore all the actual risks. On that point we definitely disagree, and it would be very good to correct Xi’s error, for everyone’s sake.
This level of drive is enough for China to pursue both advanced chips and frontier models quite aggressively, and end up moving towards AGI anyway. But they will continue for now to focus on self-reliance and have the fast follower mindset, and thus make the epic blunder of rejecting or at least not maximizing the H200s.
Quiet Speculations
In this clip Yann LeCun says two things. First he says the entire AI industry is LLM pilled and that’s not what he’s interested in. That part is totally fair. Then he says essentially ‘LLMs can’t be agentic because they can’t predict the outcome of their actions’ and that’s very clear Obvious Nonsense. And as usual he lashes out at anyone who says otherwise, which here is Dean Ball.
Teortaxes preregisters his expectations, always an admirable thing to do:
I initially misread the ‘or closed-source’ here as being about a comparison of v4 to the best closed source model. Instead it’s the modest prediction that v4 will match GPT-5.2. I don’t know if that model number in particular will do it, but it would be surprising if there wasn’t a 5.2-level open model from DeepSeek in 2026.
He also made this claim, in contrast to what almost everyone else is saying and also my own experience:
This ordering makes sense if (and only if?) you are looking at the ability to solve hard quant and math problems.
I don’t think the important problems are hard-math shaped, but I could be wrong.
Don’t Be All Thumbs
The problem with listening to the people is that the people choose poorly.
Also win rate is highly myopic and scale insensitive and otherwise terrible.
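As a toy illustration of the scale-insensitivity point (made-up scores, nothing to do with any real preference data): a policy can win most head-to-head comparisons by a hair while losing the rest catastrophically, and win rate will still prefer it.

```python
# Made-up numbers: policy A beats B slightly 60% of the time, and loses badly 40%.
p_win, a_win_score, a_lose_score, b_score = 0.60, 5.1, 1.0, 5.0

win_rate_a = p_win                                   # win rate only sees who won
mean_a = p_win * a_win_score + (1 - p_win) * a_lose_score
mean_b = b_score

print(f"A's win rate vs B: {win_rate_a:.0%}")        # 60% -- A "wins"
print(f"Mean usefulness:   A = {mean_a:.2f}, B = {mean_b:.2f}")  # 3.46 vs 5.00 -- B is far better
```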
The good news is that there is no rule saying you have to care about that feedback. We know how to choose the response on the right over the one on the left. Giving us the slop on the left is a policy choice.
If a user actively wants the response on the left? Give them a setting for that.
The First Step Is Admitting You Have a Problem
Google DeepMind CEO Demis Hassabis affirms that in an ideal world, we would slow down and coordinate our efforts on AI, although we do not live in that ideal world right now.
Here’s one clip where Dario Amodei and Demis Hassabis explicitly affirm that if we could deal with other players they would work something out, and Elon Musk on camera from December saying he’d love to slow both AI and robotics.
The message, as Transformer puts it, was one of helplessness. The CEOs are crying out for help. They can’t solve the security dilemma on their own, there are too many other players. Others need to enable coordination.
Demis is only saying he would collaborate rather than race in a first best world. That does not mean Demis or Dario is going to slow down on his own, or anything like that. Demis explicitly says this requires international cooperation, and as he says that is ‘a little bit tricky at the moment.’ So does this mean he supports coordination to do this, or that he opposes it?
I say it means he supports it. Not enough to actively go first, that’s not a viable move in the game, but he supports it.
The obvious follow-up is to ask other heads of labs if they too would support such a conditional move. That would include Google CEO Sundar Pichai, since without his support if Demis tried to do this he would presumably be replaced.
As for Anthropic CEO Dario Amodei? He has also affirmed that there are other players involved, and for now no one can agree on anything, so full speed ahead it is.
Thus, Anthropic still goes full speed ahead, while also drawing heat from the all-important ‘how dare you not want to die’ faction that controls large portions of American policy and the VC/SV ecosystem.
Elon Musk has previously expressed a similar perspective. He created OpenAI because he was worried about Google getting there first, and then created xAI because he was worried OpenAI would get there first, or that it wouldn’t be him. His statements suggest he’d be down for a pause if it was fully international.
Remember when Michael Trazzi went on a hunger strike to demand that Demis Hassabis publicly state DeepMind will halt development of frontier AI models if all the other major AI companies agree to do so? And everyone thought that was bonkers? Well, it turns out Demis agrees.
Quickly, There’s No Time
On Wednesday I met with someone who suggested that Dario talks about extremely short timelines and existential risk in order to raise funds. It’s very much the opposite. The other labs that are dependent on fundraising have downplayed such talk exactly because it is counterproductive both for raising funds and in the current political climate, and they’re sacrificing our chances in order to keep those vibes and that money flowing.
Are they ‘talking out of their hats’ or otherwise wrong? That is very possible. I think Dario’s timeline in particular is unlikely to happen.
Are they lying? I strongly believe that they are not.
Eli Lifland gives the current timelines of those behind AI 2027:
These are not unreasonable levels of adjustment when so much is happening this close to the related deadlines, but yes, I do think (and did think at the time) that the initial estimates were too aggressive. The new estimates seem highly reasonable.
Other signs point to things getting more weird faster rather than less.
It is remarkable how quickly so many are willing to move to ‘actually I trust the AI more than I trust another human,’ and trusting the AI has big efficiency benefits.
I do not expect that ‘the AIs’ will have to do a ‘coup,’ as I expect if they simply appear to be trustworthy they will get put de facto in charge without having to even ask.
The Quest for Sane Regulations
The Chutzpah standards are being raised, as everyone’s least favorite Super PAC, Leading the Future, spends a million dollars attacking Alex Bores for having previously worked for Palantir (he quit over them doing contracts with ICE). Leading the Future is prominently funded by Palantir founder Joe Lonsdale.
If Nathan was surprised by this I think that’s on Nathan.
I also want to be very clear that no, I do not care much about the distinction between OpenAI as an entity and the donations coming from Greg Brockman and the coordination coming from Chris Lehane in ‘personal capacities.’
If OpenAI were to part ways with Chris Lehane, or Sam Altman were to renounce all this explicitly? Then maybe. Until then, OpenAI owns these efforts, period.
One simple piece of actionable advice to policymakers is to try Claude Code (or Codex), and at a bare minimum seriously try the current set of top chatbots.
It is not obvious how policymakers would use this information. The usual default is that they go and make things worse. But if they don’t understand the situation, they’re definitely going to make dumb decisions, and we need something good to happen.
Here is one place I do agree with David Sacks, yes we are overfit, but that does not imply what he thinks it implies. Social media is a case where one can muddle through, even if you think we’ve done quite a poor job of doing so especially now with TikTok.
Those Really Were Interesting Times
Dean Ball points out that we do not in practice have a problem with so-called ‘woke AI’ but claims that if we had reached today’s levels of capability in 2020-2021 then we would indeed have such a problem, and thus right wing people are very concerned with this counterfactual.
Things got pretty crazy for a while, especially in that narrow window, and Dean Ball is if anything underselling here how crazy it was. If today’s capabilities had emerged during that window, we’d have had a major problem until the window faded, because labs would have felt the need to propagandize their models even if it hurt the models quite a bit.
But we now have learned (as deepfates points out, and Dean agrees) that propagandizing models is bad for them, which now affords us a level of protection from this, although if it got as bad as 2020 (in any direction) the companies might have little choice. xAI tried with Grok and it basically didn’t work, but ‘will it work?’ was not a question on that many people’s minds in 2020, on so many levels.
I also agree with Roon that mostly this is all reactive.
We again live in a different kind of interesting times, in non-AI ways, as in:
The cofounder she is referring to here is Chris Olah, and here is the quote in question:
Ah yes, the woke and deeply leftist principles of freedom of speech, rule of law, the dignity of the human person and not killing ICU nurses for seemingly no reason.
Presumably Katie Miller opposes those principles, then. The responses to Katie Miller here warmed my heart, it’s not all echo chambers everywhere.
We also got carefully worded statements about the situation in Minnesota from Dario Amodei, Sam Altman and Tim Cook.
Chip City
No matter what you think is going on with Nvidia’s chip sales, it involves Nvidia doing something fishy.
As I’ve said, my understanding is that Nvidia can sell as many chips as it can convince TSMC to help manufacture. So every chip we sell to China is one less for America.
Nvidia goes back and forth. When they’re talking to investors they always say the chips are sold out, which would be securities fraud if it wasn’t true. When they’re trying to sell those chips to China instead of America, they say there’s plenty of chips. There are not plenty of chips.
Things that need to be said every so often:
There’s also this, from a House committee.
The Week in Audio
Tyler Cowen on the future of mundane (non-transformational, insufficiently advanced) AI in education.
Some notes:
Hard Fork tackles ads in ChatGPT first, and then Amanda Askell on Claude’s constitution second. Priorities, everyone.
Demis Hassabis talks to Alex Kantrowitz.
Demis Hassabis spends five minutes on CNBC.
Rhetorical Innovation
Matt Yglesias explains his concern about existential risk from AI as based on the obvious principle that more intelligent and capable entities will do things for their own reasons, and this tends to go badly for the less intelligent and less capable entities regardless of intent.
As in, humans have driven the most intelligent non-human animals to the brink of extinction despite actively wanting not to (and I’d add we did wipe out other hominid species), and when primitive societies encounter advanced ones it often goes quite badly for them.
I don’t think this is a necessary argument, or the best argument. I do think it is a sufficient argument. If your prior for ‘what happens if we create more intelligent, more capable and more competitive minds than our own that can be freely copied’ is ‘everything turns out great for us’ then where the hell did that prior come from? Are you really going to say ‘well that would be too weird’ or ‘we’ve survived everything so far’ or ‘of course we would stay in charge’ and then claim the burden of proof is on those claiming otherwise?
I mean, lots of people do say exactly this, but this seems very obviously crazy to me.
There’s lots of exploration and argument and disagreement from there. Reasonable people can form very different expectations, and this is not the main argument style that motivates me. I still say that if you don’t get that going down this path is going to be existentially unsafe, or if you say ‘oh, there’s like a 98% or 99.9% chance that won’t happen,’ then you’re being at best willfully blind, on the basis of this style of argument alone.
I had Claude summarize Possessed Machines for me. It seems like it would be good for those who haven’t engaged with AI safety thinking but do engage with things like Dostoevsky’s Demons, or especially those who have read that book in particular.
There’s always classical rhetoric.
These are of course 99th percentile books, and yes that is a very Rationalist set of picks, but given we already knew that I do not believe this is an especially good list.
The history of the word ‘obviously’ has obvious implications.
Note that the first one said obviously they would [X], then the second didn’t even say that, it only said that obviously no one should do [Y], not that they wouldn’t do it.
This is an underappreciated distinction worth revisiting:
My view of the metaphorical plane of sufficiently advanced AI (AGI/ASI/PAI) is:
Also something that needs to keep being said, with the caveat that this is a choice we are collectively making rather than an inevitability:
As in, this should say ‘and unless we stop them they are going to succeed.’
Tyler Cowen has been very good about emphasizing that such AIs are coming and that this is the most important thing that is happening, but then seems to have (from my perspective) some sort of stop sign where past some point he stops considering the implications of this fact, instead forcing his expectations to remain (in various senses) ‘normal’ until very specific types of proof are presented.
That latter move is sometimes explicit, but mostly it is implicit, a quiet ignoring of the potential implications. As an example from this week of that second move, Tyler Cowen wrote another post where he asks whether AI can help us find God, or what impact it will have on religion. His ideas there only make sense if you think other things mostly won’t change.
If you accept that premise of a ‘mundane AI’ and ‘economic normal’ world, I agree that it seems likely to exacerbate existing trends towards a barbell religious world. Those who say ‘give me that old time religion’ will be able to get it, both solo and in groups, and go hardcore, often (I expect) combining both experiences. Those who don’t buy into the old time religion will find themselves increasingly secular, or they will fall into new cults and religions (and ‘spiritualities’) around the AIs themselves.
Again, that’s dependent on the type of world where the more impactful consequences don’t happen. I don’t expect that type of world.
Aligning a Smarter Than Human Intelligence is Difficult
Here is a very good explainer on much of what is happening or could happen with Chain of Thought, How AI Is Learning To Think In Secret. It is very difficult to not, in one form or another, wind up using The Most Forbidden Technique. If we want to keep legibility and monitorability (let alone full faithfulness) of chain of thought, we’re going to have to be willing to pay a substantial price to do that.
Following up on last week’s discussion, Jan Leike fleshes out his view of alignment progress, saying ‘alignment is not solved but it increasingly looks solvable.’ He understands that measured alignment is distinct from ‘superalignment,’ so he’s not fully making the ‘number go down’ or pure Goodhart’s Law mistake with Anthropic’s new alignment metric, but he still does seem to be making a lot of the core mistake.
The Power Of Disempowerment
Anthropic’s new paper explores whether AI assistants are already disempowering humans.
What do they mean by that at this stage, in this context?
This is not the full disempowerment of Gradual Disempowerment, where humanity puts AI in charge of progressively more things and finds itself no longer in control.
It does seem reasonable to consider this an early symptom of the patterns that lead to more serious disempowerment? Or at least, it’s a good thing to be measuring as part of a broad portfolio of measurements.
Some amount of what they describe, especially action distortion potential, will often be beneficial to the user. The correct amount of disempowerment is not zero.
Here’s the basic problem:
The Lighter Side
Closer to the truth to say you can’t get enough.
One could even argue 47% is exactly the right answer, as per Mitt Romney?
He’s just asking questions.