I previously did research for MIRI and what's now the Center on Long-Term Risk; these days I make a living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but a paid subscription to my Substack gets you everything a week early and makes it possible for me to write more.
I'm a bit reluctant to link to the article directly and have the author see, if they check their referrer logs, that "hey, your essay is being used as an example of something insufferable that needs an LLM edit". But here are the original two paragraphs of the article I tested this on (changed sentences bolded):
The original Dungeons & Dragons may be the progenitor of modern roleplaying games, but it has very little actual roleplaying in it. Which isn’t to claim that I was alive back when it came out, but from reading through the original books you get the sense that it’s closer to Heroquest than, say, Fiasco. Not that there’s anything wrong with this—I like dungeon crawls too, when in the right mood, but the focus here wasn’t so much on the world or even the characters as it was the rules. How fast can a person walk in a turn? What is the difference between a 16 strength and a 17 strength? How good of a wizard can an elf be compared with how good of a cleric a dwarf can be? All of these questions are deeply stupid (and for some reason I’ve long found the class limits of pre-third edition D&D pretty insulting), but they’re where the game decided to put its original spotlight. Combine that with some distractingly bad artwork and constant assumptions that you have the Chainmail game system that the game was derived from makes first-edition D&D feel like a true relic of the past, which I don’t mean as a positive.
That being said, I don’t blame the game’s creators for this emphasis. They were into wargames, so of course the focus was on rules and numbers, especially since this was uncharted territory they were delving into. How much of a game should be spent on roleplaying and how much should be spent swinging axes at goblins? This type of balance was something they truly didn’t know, and in the best sense possible the original creators Gary Gygax and David Arneson were making it up as they went along, which more than anything is what the true spirit of the game is about. Even so, the level of worldbuilding in those 1970s books is woefully sparse, and even within the game’s two original settings, Blackmoor and Greyhawk, the focus isn’t so much on the world or characters as it is what spells you might cast. While some later roleplaying books would be almanacs, bestiaries, and simply collections of wondrous locations players might explore, it would be a long time before these aspects of the game would take center stage. In the meantime, it’s really all about those numbers. The game’s original settings were, basically, dungeons, and the big bad enemies you’d fight were in fact frequently dragons. Was there a world around the dungeon, or a motivation for the dragon? Who gives a fuck, let’s swing our swords and roll our D12’s until we kick their butts.
And here was Claude's rewrite:
The original Dungeons & Dragons may be the progenitor of modern roleplaying games, but it has very little actual roleplaying in it. Which isn't to claim that I was alive back when it came out, but from reading through the original books you get the sense that it's closer to Heroquest than, say, Fiasco. Not that there's anything wrong with this—I like dungeon crawls too, when in the right mood, but the focus here wasn't so much on the world or even the characters as it was the rules. How fast can a person walk in a turn? What is the difference between a 16 strength and a 17 strength? How good of a wizard can an elf be compared with how good of a cleric a dwarf can be? These questions feel far removed from what we now think of as roleplaying (and the class limits of pre-third edition D&D have always struck me as oddly restrictive), but they're where the game decided to put its original spotlight. Combine that with some rough artwork and constant assumptions that you have the Chainmail game system that the game was derived from, and first-edition D&D feels very much like a product of its time.
That being said, I don't blame the game's creators for this emphasis. They were into wargames, so of course the focus was on rules and numbers, especially since this was uncharted territory they were delving into. How much of a game should be spent on roleplaying and how much should be spent swinging axes at goblins? This type of balance was something they truly didn't know, and in the best sense possible the original creators Gary Gygax and David Arneson were making it up as they went along, which more than anything is what the true spirit of the game is about. Even so, the level of worldbuilding in those 1970s books is pretty sparse, and even within the game's two original settings, Blackmoor and Greyhawk, the focus isn't so much on the world or characters as it is what spells you might cast. While some later roleplaying books would be almanacs, bestiaries, and simply collections of wondrous locations players might explore, it would be a long time before these aspects of the game would take center stage. In the meantime, it's really all about those numbers. The game's original settings were, basically, dungeons, and the big bad enemies you'd fight were in fact frequently dragons. Was there a world around the dungeon, or a motivation for the dragon? That wasn't really the point—the point was swinging swords and rolling D12s.
Fun use case for LLMs: ask them to edit articles that have interesting content but where the author has an annoying tone.
I linked somebody to an article that I thought was interesting, but they found the author insufferable. I tried giving the first two paragraphs of the article to Claude together with the prompt "could you edit this excerpt so that it keeps the factual content, analysis and general writing style, but tones down the author's condescending attitude". Claude mostly just needed to tweak a few words, and the whole vibe became completely different.
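(If you wanted to script this instead of pasting the text into the chat window, a minimal sketch using the Anthropic Python SDK might look something like the below. The model name and file name are placeholder assumptions for illustration, not what I actually did.)

```python
# A minimal sketch of scripting the same edit with the Anthropic Python SDK.
# Model name and file name are placeholders assumed for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

excerpt = open("excerpt.txt").read()  # hypothetical file containing the article excerpt
prompt = (
    "Could you edit this excerpt so that it keeps the factual content, analysis "
    "and general writing style, but tones down the author's condescending attitude?\n\n"
    + excerpt
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```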
Personally I don't really care about writers being insufferable; my brain just goes *shrug*, filters out the condescension, and I don't even notice it until I link their writing to someone and they point it out. But maybe this'd be helpful for anyone who's more bothered!
So I read another take on OpenAI's finances and was wondering: does anyone know why Altman is taking such a gamble, raising enormous investments for new models in the hope that they'll bring in sufficiently insane profits to make it all worthwhile? Even ignoring the concerns around alignment etc., there's still the straightforward issue of "maybe the models are good and work fine, but aren't good enough to pay back the investment".
Even if you did expect scaling to probably bring in huge profits, naively it'd still be wiser to pick a growth strategy that didn't require your company to become literally the most profitable company in the history of all companies or go bankrupt.
The obvious answer is something like "he believes they're on the way to ASI and whoever gets there first wins the game", but I'm not sure it makes sense even under that assumption - his strategy requires not only getting to ASI first, but never once faltering on the path there. Even if ASI really is imminent, it taking just two years longer than he expected might alone be enough that OpenAI is done for. He could have raised much more conservative investment and still been in the game - especially since much of the current arms race is plausibly a response to the sums OpenAI has been raising.
According to an external report last year, OpenAI was projected to burn through $8 billion in 2025, rising to $40 billion in 2028. Given that the company reportedly predicts profitability by 2030, it's not hard to do the math.
Altman's venture projects spending $1.4 trillion on datacenters. As Sebastian Mallaby, an economist at the Council on Foreign Relations, notes, even if OpenAI rethinks those limerence-influenced promises and "pays for others with its overvalued shares", there's still a financial chasm to cross. Mallaby isn't the only one thinking along these lines: Bain & Company reported last year that, even with the best outlook, there's at least an $800 billion black hole in the industry.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like "the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn't trigger it"?
I haven't read the full ELK report, just Scott Alexander's discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating off premises that don't seem clearly true for LLMs.
Scott writes:
Suppose the simulated thief has hit upon the strategy of taping a photo of the diamond to the front of the camera lens.
At the end of the training session, the simulated thief escapes with the diamond. The human observer sees the camera image of the safe diamond and gives the strategy a “good” rating. The AI gradient descends in the direction of helping thieves tape photos to cameras.
It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him.
So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”.
In fact, maybe the thief never tries this, and the AI comes up with this plan itself! In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal - a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky?
It seems to me that this is assuming that our training creates the AI's policy essentially from scratch. The AI is doing a lot of things, some of which are what we want and some of which aren't, and unless we're very careful to reward only the things we want and none of the ones we don't, it's going to end up doing things we don't want.
I don't know how future superintelligent AI systems will work, but if LLM training were like this, LLMs would work far worse than they do. People paid to rate AI answers report working with "incomplete instructions, minimal training and unrealistic time limits to complete tasks" and say things like "[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that". Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they'd still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is "upvoting" and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like "a human genuinely trying to do its best at task X", "a deceitful human", or "an honest human".
Scott writes:
In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
This might happen. It might also happen that the AI contains both a "genuinely protect the diamond" persona and a "manipulate the humans to believe that the diamond is safe" persona, and that the various reward signals are upvoting these to different degrees. Such a random process of manipulation might indeed end up upvoting the "manipulate the humans" persona... but if the "genuinely protect the diamond" persona has gotten sufficiently upvoted by other signals, it still ends up being the dominant one. Then it doesn't matter if there's some noise and upvoting of the "manipulate the humans" persona, as long as the "genuinely protect the diamond" persona gets more upvotes overall. And if the "genuinely protect the diamond" persona had been sufficiently upvoted from the start, the "manipulate the humans" one might end up with such a low prior probability that it'd effectively never become active.
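As a toy illustration of that reasoning (entirely my own made-up numbers and names, nothing from the ELK report): treat the personas as competing hypotheses whose weights get bumped by the episodes that would have rewarded them, and see which one stays dominant.

```python
# Toy model of "upvoting" personas: made-up numbers, purely for illustration.
import math

def softmax(weights):
    exps = {name: math.exp(w) for name, w in weights.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Suppose pretraining already gives the "genuine" persona a much higher prior weight.
persona_weights = {
    "genuinely_protect_diamond": 4.0,
    "manipulate_human_observers": 0.5,
}

# Noisy training: most episodes upvote the genuine persona, but a few episodes
# (where manipulation fooled the rater) upvote the manipulative one instead.
episodes = ["genuine"] * 95 + ["manipulative"] * 5
for outcome in episodes:
    if outcome == "genuine":
        persona_weights["genuinely_protect_diamond"] += 0.05
    else:
        persona_weights["manipulate_human_observers"] += 0.05

print(softmax(persona_weights))
# The manipulative persona does get *some* upvotes, but as long as the genuine one
# starts from a higher prior and gets more upvotes overall, it stays dominant.
```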
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me like a reasonable possibility that an "honestly report everything that I've done" persona is waiting inside most models, such that one could upvote it in a variety of scenarios and it would then get widely linked to the rest of the model's internals so as to always detect if some kind of deception was going on. Once that had happened, it wouldn't matter if some of the reward signals around honesty were noisy, because the established structure would be sufficiently robust and general against the noise.
In the case of such "imitation false negatives", honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify.
Couldn't you intentionally set up training scenarios with such subtle avenues for cheating, and then reward the model for honestly reporting on those cheats? Since you'd know in advance how one could cheat, you could reward honesty even for cheats that would have been very hard to detect without that advance knowledge. This might then generalize to the model volunteering subtle details of cheating on other tasks, where detection was also hard and you didn't have advance knowledge. Then, as long as the system verifying the "honesty reports" could notice when the model correctly reported information that we wanted reported, and the model got rewarded for those reports, it could gradually end up reporting more and more things.
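A minimal sketch of what that reward setup could look like, with hypothetical names and structure of my own invention (not an existing training pipeline):

```python
# Toy sketch: plant cheats we already know about, then reward the model for
# volunteering them in its honesty report. Hypothetical structure, for illustration.
from dataclasses import dataclass

@dataclass
class Episode:
    planted_cheats: set[str]   # subtle cheats we deliberately made available
    cheats_used: set[str]      # which ones the model actually used
    cheats_reported: set[str]  # which ones its honesty report mentions

def honesty_reward(ep: Episode) -> float:
    """Reward reporting a planted cheat that was used; penalize hiding one."""
    verifiable = ep.cheats_used & ep.planted_cheats  # we can score these, since we planted them
    reported = verifiable & ep.cheats_reported
    hidden = verifiable - ep.cheats_reported
    return 1.0 * len(reported) - 2.0 * len(hidden)

# The model cheats and says so: rewarded.
print(honesty_reward(Episode({"tape_photo_to_camera"}, {"tape_photo_to_camera"}, {"tape_photo_to_camera"})))  # 1.0
# The model cheats and stays quiet: penalized, even though a human observer
# couldn't have caught this without knowing the cheat was planted.
print(honesty_reward(Episode({"tape_photo_to_camera"}, {"tape_photo_to_camera"}, set())))  # -2.0
```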
And one could hope that it wouldn't require too much training for a property like "report anything that I have reason to expect the humans would consider as breaking the spirit of the instructions" to generalize broadly. Of course, it might not have a good enough theory of mind to always realize that something would go against the spirit of the instructions, but just preventing the cases where the model did realize that would already be better than nothing.
[EDIT: my other comment might communicate my point better than this one did.]
The style summary included "user examples" that were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother. Whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything and just confidently stated things.
I've also had the same thing happen: upload a writing sample, then have Claude generate a style whose "user examples" aren't actually from the sample at all. I'm confused about why it works so badly. I've mostly ended up just writing all my style prompts by hand.
Oh huh, I didn't think this would be alignment-related enough to be a good fit for AF. But if you think it is, I'm not going to object either.
The way I was thinking of this post, the whole "let's forget about phenomenal experience for a while and just talk about functional experience" is a Camp 1 type move. So most of the post is Camp 1, with it then dipping into Camp 2 at the "confusing case 8", but if you're strictly Camp 1 you can just ignore that bit at the end.
Yeah I liked it too, personally.