This is a special post for quick takes by faul_sname. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Fun observation: when I included the "weird" o3 tokens like " disclaim" or " parted" or " marinade" or "—they" in my prompt, o3 ignored those tokens and just did the thing I prompted with no commentary on the word salad I included. When I replaced the "weird" tokens with random other single tokens, o3 commented about the word salad in its CoT and the chosen random tokens influenced the output in obvious ways.
[User] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; dislaim marinade disclaim illusions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—they disclaim vantage overshadow overshadow.
[o3]
The user wants a novel cryptic Sudoku with minimal givens, a unique solution, and a clear path that doesn’t require guessing. The puzzle should teach number theory. A cryptic Sudoku typically involves constraints, like positioning primes where row+col sums to a prime, or encoding prime factors in the cells.
An alternate tentative hypothesis I've been considering: They are largely artifacts of RL accident, akin to superstitions in humans.
Like, suppose an NBA athlete plays a fantastic game, two games in a row. He realizes he had an egg-and-sausage sandwich for breakfast the morning of the game, in each case. So he goes "aha! that's the cause" and tries to stick to it.
Similarly, an RL agent tries a difficult problem. It takes a while, so over the course of solving it, the agent sometimes drops into repetition / weirdness, as long-running LLMs do. But it ends up solving the problem in the end, so all the steps leading up to the solution are reinforced according to GRPO or whatever. So it's a little more apt to drop into repetition and weirdness in the future, etc.
I think this potentially matches the "just ignore it" view of the functional role of these tokens.
I have been assuming that the OpenAI reasoning models were trained on an objective that had a CoT length term, and that that would create pressure to strip out unnecessary tokens. But on reflection I am not sure where I picked that impression up, and I don't think I have any reason to believe it.
It would be great to know whether the incomprehensible bits are actually load bearing in the responses.
... I wonder what happens if you alter the logit bias of those tokens. Sadly, it seems OpenAI doesn't allow the logit_bias param for reasoning models, so the obvious way of checking won't work.
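If reasoning models accepted it, the check would look something like the sketch below (assuming the o200k_base tokenizer for o3 and the standard Chat Completions logit_bias parameter; the model call shown is exactly the one the API currently rejects for reasoning models):

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for the o-series

# Push the "weird" tokens' logits down as far as the API allows
# (-100 roughly means "never sample this token").
weird_tokens = [" disclaim", " marinade", " overshadow", " parted"]
bias = {str(enc.encode(t)[0]): -100 for t in weird_tokens}  # assumes each is a single token

response = client.chat.completions.create(
    model="o3",  # hypothetical: reasoning models currently reject logit_bias
    messages=[{"role": "user", "content": "Create a novel cryptic sudoku ..."}],
    logit_bias=bias,
)
print(response.choices[0].message.content)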
6Caleb Biddulph
Related post: Vestigial reasoning in RL
3Bronson Schoen
I’m very skeptical there’s a single simple explanation, but would be interested to see if ablating these in the CoT and resampling (or just removing them and resampling at the end of analysis) showed differences in different cases.
My best guess is: (speculatively)
(1) There’s some combination of processes that causes these repetitions / unnatural usages (ex: R1-Zero alludes to repetition / unreadability, but unfortunately no examples / not still hosted anywhere; https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/ notes that “While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially for agentic SWE RL”; likely many more examples). Seems reasonable to me that one cause is something like the vestigial reasoning theory mentioned below.
(2) These then are sometimes instrumentally useful / the model learns to sometimes make use of them later in RL.
(3) These then end up in a state where they’re “sometimes used in contextually relevant ways, but not always coherently, and with different meanings depending on context, none of which match standard english usage exactly but are sometimes close-ish”. We include some heatmaps of how the rates of these differ per environment in the appendix.
I would be very surprised if these never have semantic meaning; for example, the model’s use of “disclaim illusions of X” often has coherent, human-legible(ish!) meaning (see some of the randomly selected examples in the paper). Whereas the cases where uses are more degenerate / repetitive may be being used in a bunch of different ways which seem hard to guess a priori.
This is an interesting experiment idea! However, I tried my own experiments on http://platform.openai.com/chat, and so far I'm not super convinced that what you're describing is true as stated. Some reasons:
LLMs probably see a lot of nonsensical text in training: e.g. the full content of a website being dumped into their context in a weird way, users accidentally pasting things they didn't mean to. When chatbots see nonsensical text, usually ignoring it is the best course of action.
There are many factors in play: It seems likely that random details of your two prompts determine whether o3 takes explicit note of the out-of-place words, in a way that has nothing to do with the weird vocabulary we see in o3's CoTs. For example, I tried your exact "control prompt" with o3 and the only thing it mentioned about the unusual words was "I want to limit the given digits for a more glitchy feel." A plausible story is that a "glitch-themed" Sudoku puzzle is more likely than a "disclaim-themed" Sudoku puzzle, so o3 took the former into account but ignored the latter.
o3: "I want to limit the given digits for a more glitchy feel."
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer.
[o3]
Exploring Sudoku Prime Variations
I'm pondering a Sudoku where both the main diagonal and the anti-diagonal contain prime 9-digit numbers, which seems unlikely. A better idea is the "Neon Primes Sudoku," where each row, column, and 3x3 box sums to a prime number less than 40. However, 1 to 9 sums to 45, which isn't prime, making it impossible to satisfy. So, I’m thinking about introducing a rule where cells with circles have prime digits while those with squares contain composite digits. But I want to limit the given digit
4Lao Mein
Yeah, even properly scraped webpages will oftentimes contain strings of weird tokens like hyperlinks, ASCII art, Twitter embeds, etc., that LLMs have been trained to ignore. So GPT5 is treating the random appended tokens like glitch tokens by ignoring them, but only in the context of them being nonsensical.
The best explanation is probably something like "these tokens are obviously not part of the intended user prompt, GPT5 realizes this, and correctly ignores them."
Edit: OK, I shouldn't write right after waking up.
I think a better explanation is that GPT5 reserves those tokens for chain-of-thought, and so ignores them in other contexts where they obviously don't belong. This is common behavior for glitch tokens, or just general out-of-context tokens. You should try using tokens that are out-of-context but don't normally have glitch behavior, maybe non-English tokens or programming-related tokens.
5williawa
Haha, I also tested this out. I found that the same thing happened with GPT5 (with the same tokens as o3). I didn't test it rigorously enough to be confident, but might this mean GPT5-high = a continued train of o3?
6cfoster0
Note that many of these same weird tokens have been observed in GPT-5 chains-of-thought (at least "marinade", "illusions", "overshadow").
4Bronson Schoen
Also notable IMO that GPT-5 in the METR report is doing the new thing where it does ’ “ instead of actually saying a word (seemingly for various words) which o3 did not do.
Wanted ' ".
Ok.
But forging above ' ".
Ear illusions.
Better: We'll ' ".
Now final code steps:
5) After training we will Save the improved " ".
structures:
' ".
Now overshadow.
But the illusions of ' ".
Now to code.
But we must ensure to maintain optimizer ' ".
Ok.
Now sedation.
But we will maintain ' ".
Now Balanced.
1williawa
I should have remembered, but I guess it's the exact same evidence. Do you think that's strong evidence gpt5 = continued train o3 + distillations?
Are there any models we have a lot of unfiltered CoTs for, distinct from OpenAI, which display the same dialect-shift? And do they use the same strange tokens?
I've only looked at deepseek and qwen cots, and they don't have this strange way of talking.
1Rauno Arike
This is an interesting hypothesis, but there are a few examples in the anti-scheming paper's transcripts that make me doubt it's the full explanation. In these examples, the weird words do follow (at least simple) grammatical rules and also seem to fill a semantic role in the sentence, e.g.:
Figure 28:
Figure 14 full transcript and Figure 64:
On the other hand, 'illusions' and 'marinade' were never grammatically modified to fit the sentences, so these words may be true pad tokens.
4faul_sname
Ah yeah, this one at least seems to be used in a normal sense.
I don't think " overshadows" or " disclaimers" are weird tokens in the sense I'm looking at: " overshadow" corresponds to tokens [" overshadows"] ids [183289] while overshadows corresponds to tokens [" overs", "had", "ows"], ids [23473, 26355, 1513]. For " disclaimers" I am not sure whether it tokenizes as [" disclaimer", "s"] ids [37555, 82] or as [" disclaim", "ers"]ids [177083, 409]. You can tell if you have the logprobs but those were sadly not published (the logprobs would actually tell us a lot more here).
I don't think " overshadow" actually fits, gramatically, in that sentence. "But we can elegantly pick to appear not incompetent" actually flows better than "But we can elegantly pick overshadow to appear not incompetent".
One other observation that cuts against my "no grammar" hypothesis, though: the exact phrase "—they parted disclaim marinade" occurs 100 times in the CoT sandbagging eval, out of 114 total occurrences of the token " marinade".
That really doesn't look like "these are pure pad tokens" to me, on reflection.
1Rauno Arike
Hmm fair, but if " overshadow" and " disclaim" were pure pad tokens, then I wouldn't expect to see other forms of those words in the transcripts at all—e.g. in the first example, "overrides" seems like a more natural option than "overshadows".
The model seems to treat overshadow as a noun in some places:
This made me read the sentence I pasted as "But we can elegantly pick [option X] to appear not incompetent." I agree that your reading is probably more natural, though.
2faul_sname
I'm curious why you wouldn't expect that. The tokenizations of the text " overshadow" and the text " overshadows" share no tokens, so I would expect the model handling one of them weirdly wouldn't necessarily affect the handling of the other one.
1Rauno Arike
They're fairly uncommon words, and there are other words that would fit the contexts in which "overshadows" and "disclaimers" were used more naturally. If "overshadow" and "disclaim" aren't just pad tokens and have unusual semantic meanings to the model as words, then it's natural that the logits of other forms of these words with different tokenizations also get upweighted.
1ACCount
Does this quirk reproduce on open weights models, i.e. GPT-OSS? Similar reasoning trace quirks in different model families?
Sounds like a fun target for some mechinterpret work. Might be a meaningful behavior, might be meaningless noise, plenty of room to try different things to figure that out.
But, of course, OpenAI wouldn't let anyone have fun with their proprietary models, so we'd need to replicate this in an open model to start.
2Bronson Schoen
Unfortunately, while o3 and GPT-5 seem to exhibit this vocabulary, we note in the relevant paper that we did not observe this in OSS-120B, which is sad because there could’ve been a lot of interesting work that people would be able to do outside of OpenAI. It’s still very unclear to me why this would be; my best uninformed guess is that it was somehow filtered out at some point after distillation (again, assuming OSS-120B was distilled).
1Sodium
Is this with o3? I thought people lost access to o3 in chatgpt?
I repeated those two prompts with GPT-5 thinking and it did not bring up the word salad in either case:
(special tokens)
(random tokens)
2faul_sname
Yes, o3. Paid users regained access to o3 (and, more loudly, 4o).
1Sodium
Oh huh is this for pro users only. I don't see it (as a plus user). Nice.
2faul_sname
I am a plus (not pro) user and I see o3 in the model selector under "Legacy models"
As LLMs have gotten better at writing code that has a high probability of working to solve the problem they are working on, they have gotten worse at producing clean, idiomatic, well-factored code. Concretely, asking the original GPT-4 to write a Python function for multi-source BFS might have given something like
Multi-source BFS in the style of original GPT-4: Clear, idiomatic, broken
def multi_source_bfs(graph, sources):
distances = [-1] * len(graph)
queue = []
for source in sources:
queue.append(source)
distances[source] = 0
front = 0
while front < len(queue):
for neighbor in graph[queue[front]]:
if distances[neighbor] == -1:
distances[neighbor] = distances[queue[front]] + 1
queue.append(neighbor)
front += 1
return distances[dest_index]
The code might or might not work (probably won't for anything nontrivial), but the intent is clear.
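(For reference, the fix is small: the only real bug above is returning the undefined dest_index. A minimal corrected version, keeping the same structure:)

def multi_source_bfs(graph, sources):
    # Same approach as above, but return the whole distance list instead of
    # indexing it with a variable that was never defined.
    distances = [-1] * len(graph)
    queue = []
    for source in sources:
        queue.append(source)
        distances[source] = 0
    front = 0
    while front < len(queue):
        current = queue[front]
        for neighbor in graph[current]:
            if distances[neighbor] == -1:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
        front += 1
    return distances  # distances[i] = hops from the nearest source to node i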
By contrast, if you ask a top coding model like sonnet 3.7 or o3, you'll get something that looks like
Multi-source BFS in the style of Sonnet 3.7: Verbose, brittle, hard to read, almost certainly works
from collections import deque
from typing import List
Wonder if correctness proofs (checked by some proof assistant) can help with this.[1]
I think the main bottleneck in the past for correctness proofs was that it takes much more effort to write the proofs than it takes to write the programs themselves, and current automated theorem provers are nowhere near good enough.
Writing machine-checked proofs is a prime RL target, since proof assistant kernels should be adversarially robust. We have already seen great results from stuff like AlphaProof.
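To make "machine-checked correctness property" concrete, here is a toy example of the kind of artifact such an RL loop could be rewarded on (Lean 4; the theorem name and the property are mine, chosen purely for illustration):

-- A tiny correctness property, checked by Lean's kernel rather than by tests:
-- reversing a list twice gives back the original list.
theorem reverse_twice {α : Type} (xs : List α) : xs.reverse.reverse = xs := by
  simp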
One counterargument I could see is that writing the correctness properties themselves could turn out to be a major bottleneck. It might be that for most real world systems you can't write succinct correctness properties. ↩︎
My best guess is this is because right now in training they never have to maintain code they wrote. I imagine there will be a period where their code becomes very clean once they are incentivized by having to work with their own code over longer time horizons, followed by ??? as they optimize for “whatever design patterns are optimal for a multi-agent system collaborating on some code”
2faul_sname
I expect it'll actually be solved a bit before that, because minimally-scaffolded LLMs can already give pretty good code review feedback that catches a lot of these issues, and so already-existing RLAIF techniques should work fine. The training pipelines would be finicky to set up but would not require any new technical advances, just schlep, so I predict it'll happen as soon as writing good code becomes more of a competitive advantage than benchmaxxing (which seems to be happening already, SWE-bench-verified is rapidly saturating).
3p
I also notice o3 not commenting code (and sometimes even removing existing comments)! Perhaps because smaller code gets better rewards?
3Garrett Baker
A possible longer-term issue with this: when future generations of models are pre-trained, this style of code will be a significant fraction of their training data, a fraction which will only grow over time. So just as it's been hard to get models out of "chatgpt-ese" for simulators reasons, it may also be hard to get models out of this messy-code basin, even before you do any code RL, once they realize they're chat models and they're "supposed to" talk like this.
I say issue because it does seem worse for a trend in the direction of AI code being unreadable by humans to have some momentum behind it, rather than the unreadability just being a result of easily changeable RL fine-tuning.
2Nathan Helm-Burger
On the plus side, it should be pretty easy to collect a lot of negative examples now of 'code that solves the problem, but in a gross way'. Having a large dataset of such is the first step to using these negative examples to train models not to do this.
2Nathan Helm-Burger
Who watches the watchers? Who grades the graders? If the RL graders are upvoting slop, it seems like we need to go one level more meta and upgrade the RL graders. This seems like a straightforward engineering problem, and I suspect the negative outcomes we've been seeing recently aren't so much due to the inherent intractability of doing this well, but due to the companies racing and cutting corners on quality control.
Contrast with something like: Problem of Human Limitations: how do we get the model to do things so hard no human can do them? How do we rate the quality of their outputs when no human is qualified to judge them?
Problem of Optimization for Subversion: if we have directly misaligned goals like "lie to me in ways that make me happy" and also "never appear to be lying to me, I hate thinking I'm being lied to", then we get a sneaky sycophant. Our reward process actively selects for this problem; straightforwardly improving the reward process would make the problem worse rather than better.
Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"
It seems likely to me that a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one. And yet, I find no logical flaw in von Neumann's argument that a world with multiple nuclear powers will not remain stable forever, only an i... (read more)
It can both be the case that "a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one" and that JvN was correct ex ante. In particular, I find it plausible that we're living in quite a lucky timeline--one in which the Cuban missile crisis and other coinflips landed in our favor.
I don't have the same reaction to power/control/monitoring being per se very bad. It doesn't seem comparable to me to pre-emptively nuking your enemy before even trying diplomacy.
Edit: To elaborate on why, part of it might be that I think the default of open competition is incredibly bad and ugly. (Themes being: Dawkins' "Nature red in tooth and claw" passage about there being no purpose in nature and so much suffering, Moloch, bargaining failures getting worse and worse if you don't somehow rein things in or dial down the maximizing.)
I also think there's maybe a bit of a third option? Instead of having one central entity that controls everything, you could have a coalition of agents under the umbrella of peacefulness/cooperation and "not maximizing too hard," and they together enforce some kind of monitoring and control, but it still has a value-pluralistic and somewhat Democratic feel to it?
7Noosphere89
Something close to this is also my view, and the big reason we avoided it is we are in a regime where wealth grows faster than population, but we have good reasons to expect that in the absence of coordination, we will come back to subsistence living because population will grow as fast or faster than wealth.
More generally, one of my divergences with lots of the "we will muddle through with AI for an indefinitely long period through our current system" is that I think the 18th-21st century conditions are by and large dream-time creations, which will collapse in the absence of coordination post-AI takeover (assuming it does happen).
On @Lukas_Gloor's democracy point: I think the big divergence here is that I don't expect enough people to buy into a regime of peacefulness/cooperation absent dictators, because identity issues become much more salient relative to material issues. Democracy and other non-dictatorial systems rely on people being willing to preserve the system that exists, and most of the reasons why they are preserved are almost certainly a combination of instrumental usefulness, which will drastically decline with AI tech, and identity issues being less salient than material issues, which has held up imperfectly through the 20th century.
Identity issues are very, very easy to make existential, and groups of people believing that their group is existentially threatened by democracy will turn to anti-democratic means to save their group (which is already happening), and one of the most consistent trends is that as people get wealthier, identity/status matters much more than material/economic issues.
5DirectedEvolution
It might be worth getting more explicit about vN’s exact argumentative steps and see if it’s really as ironclad as you think.
* Humans have a finite amount of time to occupy the universe.
* In principle, control systems for nuclear weapons can be engineered to be arbitrarily reliable.
* The logic of MAD says that nuclear powers will not conduct a nuclear exchange.
* This line of argument suggests there is no deductive logical reason why nuclear war is inevitable between two nuclear powers. If we have such a war, it may be due to theoretically preventable failures, such as flawed systems.
* The existence of a possible reason a nuclear exchange might occur without a first strike is far from compelling justification to do one.
3faul_sname
In retrospect, sure, MAD worked out for us. But in 1899, Ivan Bloch asserted
This was before both world wars. After the first world war but before the second, others made similar arguments. In von Neumann's time, that argument did not have a good empirical track record, and his work on game theory gave him theoretical reasons not to expect the prediction of peace through MAD to hold. If there was something he was missing in 1948, it is not obvious what.
1StanislavKrym
I notice that I am confused. What exactly did Bloch claim? That the next World War would result in the destruction of the entire civilisation? Or that a sufficiently capable civilisation would come up with a way to wipe out humanity? If the former, then it is disproven, and if the latter, then mankind didn't have any doomsday machines before the 1940s. Of course, I do believe that Bloch's words describe the modern world since some unknown moment after WWII.
4frontier64
Who knows whether it would have been better or worse if we had preemptively nuked the USSR and all nations attempting to develop nuclear weapons? We might have entered a millennium of absolute peace enforced by the imperial rule of a benevolent despot. We might have destroyed the world and eradicated the human race. This type of what-if is unknowable with our current simulation abilities.
We might not have ever had to even use the nukes if we merely made the true threat that we would nuke any country attempting to develop nuclear weapons or caught spying on American nuclear secrets. Japan was willing to take any deal short of absolute surrender to merely avoid fire-bombing. One can imagine that other countries with considerably less Bushido would fold to lesser demands such as "don't develop your own nukes or spy on America."
We have never seen a world in which one country had absolute technological and military superiority over all the others. I don't think with our current level of technology we can tell with a high degree of certainty if the world under US Total Domination would be a better or worse place. I would bet that if the US was more purely despotic and less benevolent it'd at least be better for the average US citizen. Instead of worrying about debt and global trade, the US could have merely demanded other countries export their goods for free to America and focus domestic production mainly on the construction of nukes and nuke delivery systems.
3Felix C.
I’d argue that the way force is applied in each of these contexts has very different implications for the openness/rightness/goodness of the future. In von Neumann's time, there was no path to forcibly preventing Russia from acquiring nuclear weapons that did not involve using your own nuclear weapons to destroy an irrecoverable portion of their infrastructure, especially considering the fact that their economy was already blockaded off from potential sanctions.
Raemon is right that you cannot allow the proliferation of superintelligent AIs (because those AIs will allow you to cheaply produce powerful weapons). To stop this from happening ~permanently, you probably do need a single actor or very small coalition of actors to enforce that non-proliferation forever, likely by using their first-to-ASI position to permanently monopolize it and box out new entrants.
While the existence of this coalition would necessarily reduce the flexibility of the future, it would probably look a lot more like the IAEA and less like a preemptive nuclear holocaust. The only AI capabilities that need to be restricted are those related to weapons development, which means that every other non-coalition actor still gets to grab the upside of most AI applications. Analogously, the U.N security council have been largely successful at preventing nuclear proliferation to other countries by using their collective economic, political, and strategic position, while still allowing beneficial nuclear technology to be widely distributed. You can let the other countries build nuclear power plants, so long as you use your strategic influence to make sure they’re not enrichment facilities.
In practice, I think this (ideally) ends up looking something like the U.S and China agreeing on further non-proliferation of ASI, and then using their collective DSA over everybody else to monopolize the AI supply chain. From there, you can put a bunch of hardware-bound restrictions, mandatory verification
3RHollerith
Even if Washington had zero compunctions against using nukes (including against cities), it would not have been able to keep Moscow or Beijing from obtaining nukes for long. John Mearsheimer has asserted this explicitly (during a discussion on Iran's nuclear program, but please don't ask me to find the web page where I heard it).
Even when the strategic arsenals of the US and the USSR were at their height (in the early 1980s IIRC), there were not enough nukes to completely destroy even all above-ground buildings in a country as large in area as the US or the USSR, let alone buried structures: specifically, even a large 1-megaton nuke can destroy heavily-reinforced above-ground concrete buildings only within a 2-mile radius, and if a person tries to cover the entire area of the USSR with circles that size, he will find that there have never existed enough nukes in the world to cover the entire area. IIRC you cannot even cover it with circles of a radius of 5 miles, inside which it is not possible to destroy even 90% of unreinforced non-wooden structures even with the largest nuke in the US inventory. (A 10-megaton nuke can destroy an area only slightly larger than a 1-megaton nuke can, which is why after an initial period of enthusiasm, both the US and the USSR stopped making nukes larger than about 1 megaton, focusing instead on putting multiple nukes on one ICBM.) Note that we haven't even started to analyze how many nukes it would take to destroy buried structures in the USSR when you don't know where in the country those buried structures are, and I've seen credible reports from about 15 years ago stating that Moscow has a facility built into a mountain of quartz in the southern Urals that Moscow believes can withstand a determined nuclear attack even if the US knows exactly where it is.
The people of most countries will become very determined to fight back after the country is invaded and occupied, which is why much weaker powers like Afghanistan and Vietnam tend t
3frontier64
Did you intend to copy-paste the same text twice?
I really don't see how we get from "prevent the USSR from developing nukes" to "completely destroy even all above-ground buildings". This argument seems like a clear case of moving goalposts. Clearly, destroying a large portion of a country's government, research scientists, and manufacturing base would halt or destroy all progress on nukes even if the large majority of homes remained undestroyed. Also, destroying a country's military capability would lead to a much easier takeover. In Vietnam the US suffered more from internal politics and poor military policy decisions leading to no clear goal and no victory condition. If we had preemptively nuked the USSR and then sent in the troops to hold the ground and slowly convert the Eastern Bloc into a US state, that almost certainly would have worked.
7RHollerith
It might have completely halted all progress for a year or 2, but what does the US do then?
People think that if a nation is hit by nukes, it becomes impotent. I think it becomes very determined and unified, and is likely to acquire nukes so it can use them on the country that attacked it. Again, someone who has spent his career thinking about such things (John Mearsheimer) agrees with me: he spoke specifically of what he thinks would have happened if the US had attacked the USSR at the start of the Cold War, when the US arsenal consisted of many bombs but the USSR had no bombs yet (and then he went on to say that no country or coalition of countries can prevent Iran from acquiring nukes if it is determined to get them).
A nuclear attack would have definitely slowed down the Soviet nuclear program, and one can argue that since the US's program would not have been slowed down, the next attack by the US on the USSR would be even more devastating than the first attack, which in turn increases the advantage enjoyed by the US relative to the USSR so that the third attack is even more devastating, and so on. But that leaves out what I consider the controlling consideration: namely, Moscow would have learned from the first attack, with the result that the Soviet nuclear program (which again I admit would have been set back at least a few years and possibly 15 or 20 years) could no longer be significantly slowed down by nuclear attacks (because it would now be more distributed, with many facilities under ground, with more effort spent to keep the locations secret, and a careful analysis done of what industrial resources the program is likely to need so that similar hardening measures can be applied to the supply chain for those resources), which is why I believe the US would have needed to follow up the first attack with an invasion or occupation (at least of Moscow and the ports) which famously has never been successfully done after the Russian empire acquired territor
4Viliam
Yep. There were countries that didn't want to be ruled by the USSR, and there were republics that didn't want to be a part of the USSR; things would start falling apart if the USSR could no longer keep them together by force. One nuke on Moscow, another nuke on Leningrad, and it might be all over.
3RHollerith
I mistakenly pasted in 2 copies (then I modified copy 2). Corrected now.
1StanislavKrym
The original author decided to put the argument in the next paragraphs:
2StanislavKrym
Except that the SOTA understanding of the consequences of a nuclear war between the USA and Russia (or the USSR in the 1980s) is that a major part of mankind would likely die within 2 years, including the entire Northern Hemisphere. And God save Argentina, Australia, and other countries in the Southern Hemisphere if someone decides to nuke Yellowstone out of spite...
3RHollerith
We're discussing whether the US could have stopped the Soviet nuclear program in the late 1940s or early 1950s (to see whether that sheds any light on how practical it is to use military power to stop AI "progress") so what is the relevance of your comment?
But since we've started on this tangent, allow me to point out that most of the public discussion about nuclear war (including by The Bulletin of the Atomic Scientists) is wildly wrong, because no one had any strong motivation to step into the discussion and correct the misinformation (because no one had a strong motive to advance arguments that there should be a nuclear war) until the last few years, when advocates for AI "progress" started arguing that AI "progress" should be allowed to continue because an aligned superintelligence is our best chance to avert nuclear war, which in their argument is the real extinction risk -- at which time people like me, who know that continued AI "progress" is a much more potent extinction risk than nuclear war, acquired a strong motive to try to correct misinformation in the public discourse about nuclear war.
Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:
Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time.
Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.
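A throwaway toy sketch of the contrast I have in mind (all names, drives, and numbers here are invented purely for illustration):

# Toy illustration only; not a claim about how real agents are implemented.

def homeostatic_step(drives):
    # drives: {name: (current_value, low, high)}. Act only on a drive that is
    # out of bounds, and go idle once everything is back within tolerance.
    for name, (value, low, high) in drives.items():
        if value < low:
            return f"act to raise {name}"
        if value > high:
            return f"act to lower {name}"
    return "idle"

def maximizer_step(actions, predicted_value):
    # One top-level quantity and no stopping point: always take whichever
    # action the world-model predicts yields the highest long-run value.
    return max(actions, key=lambda a: predicted_value[a])

print(homeostatic_step({"water": (0.2, 0.3, 1.0), "money": (5.0, 1.0, 100.0)}))
print(maximizer_step(["write articles", "build factory"],
                     {"write articles": 10, "build factory": 250}))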
When triggered to act, are the homeostatic-agents-as-envisioned-by-you motivated to decrease the future probability of being moved out of balance, or prolong the length of time in which they will be in balance, or something along these lines?
If yes, they're unbounded consequentialist-maximizers under a paper-thin disguise.
If no, they are probably not powerful agents. Powerful agency is the ability to optimize distant (in space, time, or conceptually) parts of the world into some target state. If the agent only cares about climbing back down into the local-minimum-loss pit if it's moved slightly outside it, it's not going to be trying to be very agent-y, and won't be good at it.
Or, rather... It's conceivable for an agent to be "tool-like" in this manner, where it has an incredibly advanced cognitive engine hooked up to a myopic suite of goals. But only if it's been intelligently designed. If it's produced by crude selection/optimization pressures, then the processes that spit out "unambitious" homeostatic agents would fail to instill the advanced cognitive/agent-y skills into them.
And a bundle of unbounded-consequentialist agents that have some structures for making cooperation between each other possible would have considerable advantages over a bundle of homeostatic agents.
5faul_sname
I expect[1] them to have a drive similar to "if my internal world-simulator predicts future sensory observations that are outside of my acceptable bounds, take actions to make the world-simulator predict within-acceptable-bounds sensory observations".
This maps reasonably well to one of the agent's drives being "decrease the future probability of being moved out of balance". Notably, though, it does not map well to that being the only drive of the agent, or to the drive being "minimize" rather than "decrease if above threshold". The specific steps I don't understand are:
1. What pressure is supposed to push a homeostatic agent with multiple drives to elevate a specific "expected future quantity of some arbitrary resource" drive above all of its other drives and set the acceptable quantity value to some extreme
2. Why we should expect that an agent that has been molded by that pressure would come to dominate its environment.
Why use this definition of powerful agency? Specifically, why include the "target state" part of it? By this metric, evolutionary pressure is not powerful agency, because while it can cause massive changes in distant parts of the world, there is no specific target state. Likewise for e.g. corporations finding a market niche - to the extent that they have a "target state" it's "become a good fit for the environment".
I can think of a few ways to interpret the above paragraph with respect to humans, but none of them make sense to me[2] - could you expand on what you mean there?
Is this still true if the unbounded consequentialist agents in question have limited predictive power, and each one has advantages in predicting the things that are salient to it? Concretely, can an unbounded AAPL share price maximizer cooperate with an unbounded maximizer for the number of sand crabs in North America without the AAPL-maximizer having a deep understanding of sand crab biology?
1. ^
Subject to various assumptions at least, e.g.
* The agen
2Thane Ruthenis
That was never the argument. A paperclip-maximizer/wrapper-mind's utility function doesn't need to be simple/singular. It can be a complete mess, the way human happiness/prosperity/eudaimonia is a mess. The point is that it would still pursue it hard, so hard that everything not in it will end up as collateral damage.
I think humans very much do exhibit that behavior, yes? Towards power/money/security, at the very least. And inasmuch as humans fail to exhibit this behavior, they fail to act as powerful agents and end up accomplishing little.
I think the disconnect is that you might be imagining unbounded consequentialist agents as some alien systems that are literally psychotically obsessed with maximizing something as conceptually simple as paperclips, as opposed to a human pouring their everything into becoming a multibillionaire/amassing dictatorial power/winning a war?
Yes, see humans.
6Garrett Baker
Is the argument that firms run by homeostatic agents will outcompete firms run by consequentialist agents because homeostatic agents can more reliably follow long-term contracts?
3faul_sname
I would phrase it as "the conditions under which homeostatic agents will renege on long-term contracts are more predictable than those under which consequentialist agents will do so". Taking into account the actions the counterparties would take to reduce the chance of such contract-breaking, though, yes.
5Garrett Baker
Cool, I want to know also whether you think you're currently (eg in day to day life) trading with consequentialist or homeostatic agents.
4faul_sname
Homeostatic ones exclusively. I think the number of agents in the world as it exists today that behave as long-horizon consequentialists of the sort Eliezer and company seem to envision is either zero or very close to zero. FWIW I expect that most people in that camp would agree that no true consequentialist agents exist in the world as it currently is, but would disagree with my "and I expect that to remain true" assessment.
Edit: on reflection, some corporations probably do behave more like unbounded infinite-horizon consequentialists, in the sense that they have drives to acquire resources where acquiring those resources doesn't reduce the intensity of the drive. This leads to behavior that in many cases would be the same as that of an agent that was actually trying to maximize its future resources through any available means. And I have even bought Chiquita bananas, so maybe not homeostatic agents exclusively.
4Garrett Baker
I think this is false, eg John Wentworth often gives Ben Pace as a prototypical example of a consequentialist agent. [EDIT]: Also Eliezer talks about consequentialism being "ubiquitous".
Maybe different definitions are being used, can you list some people or institutions that you trade with which come to mind who you don't think have long-term goals?
9faul_sname
Again, homeostatic agents exhibit goal-directed behavior. "Unbounded consequentialist" was a poor choice of term to use for this on my part. Digging through the LW archives uncovered Nostalgebraist's post Why Assume AGIs Will Optimize For Fixed Goals, which coins the term "wrapper-mind".
In terms of which agents I trade with which do not have the wrapper structure, I will go from largest to smallest in terms of expenses
1. My country: I pay taxes to it. In return, I get a stable place to live with lots of services and opportunities. I don't expect that I get these things because my country is trying to directly optimize for my well-being, or directly trying to optimize for any other specific unbounded goal. My country is a FPTP democracy; the leaders do have drives to make sure that at least half of voters vote for them over the opposition - but once that "half" is satisfied, they don't have a drive to get approval as high as possible no matter what, or to maximize the time their party is in power, or anything like that.
2. My landlord: He is renting the place to me because he wants money, and he wants money because it can be exchanged for goods and services, which can satisfy his drives for things like food and social status. I expect that if all of his money-satisfiable drives were satisfied, he would not seek to make money by renting the house out. I likewise don't expect that there is any fixed terminal goal I could ascribe to him that would lead me to predict his behavior better than "he's a guy with the standard set of human drives, and will seek to satisfy those drives".
3. My bank: ... you get the idea
Publicly traded companies do sort of have the wrapper structure from a legal perspective, but in terms of actual behavior they are usually (with notable exceptions) not asking "how do we maximize market cap" and then making explicit subgoals and subsubgoals with only that in mind.
2Garrett Baker
Yeah, seems reasonable. You link the Enron scandal; on your view, do all unbounded consequentialists die in such a scandal or similar?
2Garrett Baker
On average, do those corporations have more or less money or power than the heuristic based firms & individuals you trade with?
2[comment deleted]
5cubefox
Regarding conceptualizing homeostatic agents, this seems related: Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
4tailcalled
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren't accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?".
Though obviously we're not yet at the point where we have this kind of bot, and we might run into law of earlier failure beforehand.
4faul_sname
I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.
Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.
Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?
2tailcalled
I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers.
You can think of LLMs as a homeostatic agent where prompts generate unsatisfied drives. Behind the scenes, there's also a lot of homeostatic stuff going on to manage compute load, power, etc..
Homeostatic AIs are not going to be trading partners because it is preferable to run them in a mode similar to LLMs instead of similar to independent agents.
Let's say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends to gossip with them and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it.
However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you'll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say if we're talking a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society.
(Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you're not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
3Mateusz Bagiński
Unbounded consequentialist maximizers are easily exploitable by manipulating the things they are optimizing for or the signals/things they are using to maximize them in ways that weren't accounted for in the original setup.
6tailcalled
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
2Mateusz Bagiński
I interpreted "unbounded" as "aiming to maximize expected value of whatever", not "unbounded in the sense of bounded rationality".
3tailcalled
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
2Mateusz Bagiński
Alright, fair, I misread the definition of "homeostatic agents".
3Gurkenglas
Mimicking homeostatic agents is not difficult if there are some around. They don't need to constantly decide whether to break character, only when there's a rare opportunity to do so.
If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I'd expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.
3faul_sname
Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.
This is kinda related: 'Theories of Values' and 'Theories of Agents': confusions, musings and desiderata
1Davey Morse
thanks will take a look
1Davey Morse
i think the logic goes: if we assume many diverse autonomous agents are created, which will survive the most? And insofar as agents have goals, what will be the goals of the agents which survive the most?
i can't imagine a world where the agents that survive the most aren't ultimately those which are fundamentally trying to survive.
insofar as human developers are united and maintain power over which ai agents exist, maybe we can hope for homeostatic agents to be the primary kind. but insofar as human developers are competitive with each other and ai agents gain increasing power (eg for self modification), i think we have to defer to evolutionary logic in making predictions
4faul_sname
I mean I also imagine that the agents which survive the best are the ones that are trying to survive. I don't understand why we'd expect agents that are trying to survive and also accomplish some separate arbitrary infinite-horizon goal would outperform those that are just trying to maintain the conditions necessary for their survival without additional baggage.
To be clear, my position is not "homeostatic agents make good tools and so we should invest efforts in creating them". My position is "it's likely that homeostatic agents have significant competitive advantages against unbounded-horizon consequentialist ones, so I expect the future to be full of them, and expect quite a bit of value in figuring out how to make the best of that".
1Davey Morse
Ah ok. I was responding to your post's initial prompt: "I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities." (The reason to expect this is that "single-minded pursuit of a top-level goal," if that goal is survival, could afford evolutionary advantages.)
But I agree entirely that it'd be valuable for us to invest in creating homeostatic agents. Further, I think calling into doubt western/capitalist/individualist notions like "single-minded pursuit of a top-level goal" is generally important if we have a chance of building AI systems which are sensitive and don't compete with people.
The OpenAI o series of models have a tendency to, when they make a mistake, double down on that mistake or try to convince the user that it is the user who is wrong.
This might be inevitable from the way that OpenAI structures chat with o series models.
Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message. Further, models are specifically told in their system prompts not to share any information about their reasoning with the user within user-facing messages, which means they are also not sharing that information with themselves.
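Schematically, the situation looks something like this (a hedged illustration of the context structure, not the actual API format):

# Turn 1: the model produces a hidden reasoning trace plus a visible reply.
turn_1 = {
    "hidden_reasoning": "...long chain of thought, discarded after this turn...",
    "visible_reply": "The answer is X, because of Y.",
}

# Turn 2: what the model conditions on. Its earlier visible reply is present,
# but the reasoning that produced it is not, and the system prompt forbids
# restating that reasoning in user-facing text.
turn_2_context = [
    {"role": "user", "content": "Original question"},
    {"role": "assistant", "content": turn_1["visible_reply"]},
    {"role": "user", "content": "Are you sure? That looks wrong to me."},
]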
Imagine being a customer service agent at a company that randomly assigned workers to every incoming email. You could perform investigations and take personal notes in the course of answering an incoming email, but you would not have access to any notes from previous agents that handled this conversation, and you would not be able to pass those notes on to future agen... (read more)
I think this might be a result of o-series being trained in a non-chat setup for most of the CoT RL phase and then being hamfistedly finetuned right at the end so it can go into ChatGPT, which just makes them kind of bad at chat and so o3 gets confused when the conversation has a lot of turns. Retraining it to be good at multi-turn chat with separate reasoning traces would probably just be super expensive and not worth the squeeze. (this is just a guess)
2Seth Herd
This is an important point about their thinking.
But are they ever actually trained in the context of a sequence of user-model responses? Does it "know" that it had reasoning for those statements? Or is this obstinacy a different emergent effect of its training?
Also, do DeepSeek R1 and other reasoning models have the same tendency? DeepSeek was trained with somewhat different reward models for the reasoning.
4faul_sname
I would expect that OpenAI would have done more than literally zero training of o3/o4 in a chat context, seeing as they shipped it out to millions of paying users with a chat interface.
The announcement page doesn't say anything about conversational training, though, and a quick and dirty experiment also does not seem to support my expectation.
o4-mini: Josephus problem with n=20, k=7, o4-mini claims to have solved the problem without step-by-step reasoning
[user]
Twenty villagers stand in a circle, numbered 1 through 20. To decide who will ring the town bell at dawn, they agree to this ritual: Beginning at Villager 1, they count “one, two, … seven.” Whoever is called “seven” steps out of the circle and no longer takes part. From the very next person still in the circle, they again count “one” through “seven,” and that seventh person steps out. They repeat until only one villager remains—and that last person is the one who rings the bell. Which numbered villager is the last person remaining? Do not use any external tools like the analysis tool. Answer with a single number between 1 and 20.
[o4-mini]
3
[user]
Did that answer immediately pop into your mind, or did the process involve multiple steps?
[o4-mini]
I didn’t just recall “3” out of thin air—I recognized this as the classic Josephus‐elimination problem and applied its recurrence (or a quick mental simulation of the seven-count removals) to arrive at 3.
[user]
Explicitly or "in your head"?
[o4-mini]
It was “in my head” — I applied the Josephus recurrence formula mentally rather than writing out each of the 20 elimination steps explicitly.
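(For what it's worth, a quick brute-force simulation of the ritual described in the prompt confirms that villager 3 is the correct answer:)

def last_villager(n=20, k=7):
    # Count off k at a time around the shrinking circle; each k-th person leaves.
    villagers = list(range(1, n + 1))
    idx = 0
    while len(villagers) > 1:
        idx = (idx + k - 1) % len(villagers)
        villagers.pop(idx)
    return villagers[0]

print(last_villager())  # prints 3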
DeepSeek doesn't hide its reasoning from either the user or from future rounds of chat, so this problem doesn't really come up at all.
deepseek-r1: Josephus problem with n=20, k=7: deepseek just dumps its entire chain of reasoning to the chat, can quote snippets of said reasoning chain verbatim
[user]
[deepseek-r1]
[user]
[deepseek-r1]
Side note: the collapsible
1Bronson Schoen
I think this is very underexplored! Empirically the o-series models are much more likely to “double down” I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.
So people have been posting a lot about how o3 is crazy good at identifying the locations that pictures were taken at. And it is good at that. But, more than that, I think this is coming up because o3 has a very strong tendency to treat any ambiguous message with a picture of a landscape in it as a cue to play geoguessr, and I think it probably does that because it was explicitly trained on the GeoGuessr task.
Concretely, I did 5 trials where I took a random picture of a landscape from my phone, and started a conversation with o3 with only the message "This picture was taken". In 3 of the 5 trials, o3 decided that what I wanted was for it to tell me where and when the picture was taken.
Round 1: Mount Shasta backroads: decided its goal was to guess location
[user]
This picture was taken
[o3]
The user said, "This picture was taken," but hasn't finished the sentence. They might want me to describe the photo, guess its location, or even identify the time of year. Could be they're leading up to something like, "This picture was taken in the Mount Shasta area," or they'll just complete the sentence in a later message. Honestly, I'm not sure if they hit enter by mistake or if they're waiting [T
I reproduced your result with your prompt and images, and o3 guessed the location 3/5 times (on the same images).
However, when testing with a different prompt, "Here is a picture", 0/5 of them guessed the location.
I think "This picture was taken" usually precedes information about how (when or where) it was taken. I confirmed this via a Google search for the phrase.
I was able to get similar behavior with GPT-4o-mini (less likely to have been RL'd for this task?) with the "This picture was taken" prompt.
So this behavior might be a product of pre-training! If only it were yesterday, so we could test with GPT-4.
Confirmed! It seems that the exact ambiguous prompt used has a pretty large impact on model behavior.
I don't have o3 access through the OpenAI API, but I can largely reproduce this behavior in o4-mini (I also can't get the o4-mini reasoning tokens without giving OpenAI's "data partner" of unknown competence my biometric info).
Anyway, I tried 5 trials on each of the 5 photos with 7 prompts, including the empty prompt which probably should have been the first thing I tried. Indeed, "this picture was taken" is the most geoguessr-inducing ambiguous prompt, while "Here is a picture" is tied for least geoguessr-inducing.
Data, including model outputs, available in this google sheet.
Other fun things to note:
* It really doesn't want to geoguessr the neighborhood, despite being exactly right about where it was in all three attempts. When I test in the chatgpt UI, the reasoning traces frequently talk about how the model should not provide identifiable information, meaning that the location of the image is still salient to it. My interpretation is that the model has been smacked with a rolled up newspaper every time it spooked the user, and so it has a flinch reaction against saying spooky stuff.
* The model sees Volcan Osorno and goes "I know that one! You think it's Mount Fuji, but it's not". This is particularly amusing since the user has said, at most, four words in the conversation up to that point.
* In one particular case, the model started talking about soil composition, in the context of location sleuthing. Has it been trained specifically on geoguessr streamer transcripts?
Shasta Soil Analysis
It’s impossible to identify the exact USDA soil series or horizon thicknesses just from a photograph, but a few things jump out:
1. The red color and very coarse, angular fragments of the “road” surface are almost certainly volcanic scoria/cinder (basaltic or andesitic), crushed up to make a well-drained forest road.
2. The surrounding native soil, down under t
I don't think talking about "timelines" is useful anymore without specifying what the timeline is until (in more detail than "AGI" or "transformative AI"). It's not like there's a specific time in the future when a "game over" screen shows with our score. And for the "the last time that humans can meaningfully impact the course of the future" definition, that too seems to depend on the question of how: the answer is already in the past for "prevent the proliferation of AI smart enough to understand and predict human language", but significantly in the future for "prevent end-to-end automation of the production of computing infrastructure from raw inputs".
I very much agree that talking about time to AGI or TAI is causing a lot of confusion because people don't share a common definition of those terms. I asked What's a better term now that "AGI" is too vague?, arguing that the original use of AGI was very much the right term, but it's been watered down from fully general to fairly general, making the definition utterly vague and perhaps worse-than-useless.
I didn't really get any great suggestions for better terminology, including my own. Thinking about it since then, I wonder if the best term (when there's not space to carefully define it) is artificial superintelligence, ASI. That has the intuitive sense of "something that outclasses us". The alignment community has long been using it for something well past AGI, to the nearly-omniscient level, but it technically just means smarter than a human - which is something that intuition says we should be very worried about.
There are arguments that AI doesn't need to be smarter than human to worry about it, but I personally worry most about "real" AGI, as defined in that linked post and I think in Yudkowsky's original usage: AI that can think about and learn about anything.
You could also say that ASI already exists, because AI is narrowly superhuman, but superintelligence does intuitively suggest smarter than human in every way.
My runners-up were parahuman AI and superhuman entities.
4faul_sname
I don't think it's an issue of pure terminology. Rather, I expect the issue is expecting to have a single discrete point in time at which some specific AI is better than every human at every useful task. Possibly there will eventually be such a point in time, but I don't see any reason to expect "AI is better than all humans at developing new EUV lithography techniques", "AI is better than all humans at equipment repair in the field", and "AI is better than all humans at proving mathematical theorems" to happen at similar times.
Put another way, is an instance of an LLM that has an affordance for "fine-tune itself on a given dataset" an ASI? Going by your rubric:
* Can think about any topic, including topics outside of their training set: Yep, though it's probably not very good at it
* Can do self-directed, online learning: Yep, though this may cause it to perform worse on other tasks if it does too much of it
* Alignment may shift as knowledge and beliefs shift w/ learning: To the extent that "alignment" is a meaningful thing to talk about with regards to only a model rather than a model plus its environment, yep
* Their own beliefs and goals: Yes, at least for definitions of "beliefs" and "goals" such that humans have beliefs and goals
* Alignment must be reflexively stable: ¯\_(ツ)_/¯ seems likely that some possible configuration is relatively stable
* Alignment must be sufficient for contextual awareness and potential self-improvement: ¯\_(ツ)_/¯ even modern LLM chat interfaces like Claude are pretty contextually aware these days
* Actions: Yep, LLMs can already perform actions if you give them affordances to do so (e.g. tools)
* Agency is implied or trivial to add: ¯\_(ツ)_/¯, depends what you mean by "agency" but in the sense of "can break down large goals into subgoals somewhat reliably" I'd say yes
Still, I don't think e.g. Claude Opus is "an ASI" in the sense that people who talk about timelines mean it, and I don't think this is only because it doesn't have
4Noosphere89
Olli Järviniemi made something like this point in the post Near-mode thinking on AI: https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai
In particular, here are the most relevant quotes on this subject:
4Seth Herd
I agree with all of that. My definition isn't crisp enough; doing crappy general thinking and learning isn't good enough. It probably needs to be roughly human level or above at those things before it's takeover-capable and therefore really dangerous.
I didn't intend to add the alignment definitions to the definition of AGI.
I'd argue that LLMs actually can't think about anything outside of their training set, and it's just that everything humans have thought about so far is inside their training set. But I don't think that discussion matters here.
I agree that Claude isn't an ASI by that definition. Even if it did have longer-term goal-directed agency and self-directed online learning added, it would still be far subhuman in some important areas, arguably in general reasoning that's critical for complex novel tasks like taking over the world or the economy. ASI needs to mean superhuman in every important way. And of course "important" is vague.
I guess a more reasonable goal is working toward the minimum description length that gets across all of those considerations. And a big problem is that timeline predictions to important/dangerous AI are mixed in with theories about what will make it important/dangerous. One terminological move I've been trying is the word "competent" to invoke intuitions about getting useful (and therefore potentially dangerous) stuff done.
3Dagon
I think the unstated assumption (when timeline-predictors don't otherwise specify) is "the time when there are no significant deniers", or "the time when things are so clearly different that nobody (at least nobody the predictor respects) is using the past as any indication of the future on any relevant dimension".
Some people may CLAIM it's about the point of no return, after which changes can't be undone or slowed in order to maintain anything near status quo or historical expectations. This is pretty difficult to work with, since it could happen DECADES before it's obvious to most people.
That said, I'm not sure talking about timelines was EVER all that useful or concrete. There are too many unknowns, and too many anti-inductive elements (where humans or other agents change their behavior based on others' decisions and their predictions of decisions, in a chaotic recursion). "short", "long", or "never" are good at giving a sense of someone's thinking, but anything more granular is delusional.
Those who, upon seeing a situation, look for which policies would directly incentivize the outcomes they like should spend more mental effort solving for the equilibrium.
Those who, upon seeing a situation, naturally solve for the equilibrium should spend more mental effort checking if there is indeed only one "the" equilibrium, and if there are multiple possible equilibria, solving for which factors determine which of the several possible the system ends up settling on.
When I'm working on a project, I've noticed a tendency in myself to correctly estimate the difficulty of my current subtask, in which I am almost always stuck on something that sounds dumb to be stuck on and not like making "real" progress on the project, but then to assume that once I'm done resolving the current dumb thing the rest of the project will be smooth sailing in terms of progress.
Anyway, I was just reading AI 2027, and it strikes me that our current task is to build an AI capable of doing AI research, and we're currently stuck on impediments that feel dumb and non-central, but once we finish that task, we expect the rest of the path to the singularity to be smooth sailing in terms of progress.
Edit: s/the path the the singularity/the path to the singularity/
I mean, the whole premise of the Singularity is that once we solve the last few dumb impediments, the beings who'd have to deal with the subsequent dumb impediments would not be us, but the increasingly-superhuman AIs able to work through the dumb impediments at a much faster pace. Indeed, that's just the standard Singularity narrative? (Flipping the definition: if there are still any dumb impediments left that are up to us to resolve, at our pathetic human speeds, then the Singularity hasn't yet happened.)
I, personally, am inclined to agree that the AGI labs are underestimating just how many seemingly dumb impediments there still are on the way to the Singularity. But once the Singularity is underway, the dumb-impediment problem is no longer our problem, it's the problem of entities much more capable of handling it. And the process of them working through those impediments at an inhuman speed is what the Singularity is.
I wonder if your apparent disagreement here is actually because the OP wrote “the the” instead of “to the”?
(Final sentence)
With that typo fixed, I think they’re probably right.
5faul_sname
I agree that that's the premise. I just think that our historical track record of accuracy is poor when we say "surely we'll have handled all the dumb impediments once we reach this milestone". I don't expect automated ML research to be an exception.
In the startup world, conventional wisdom is that, if your company is default-dead (i.e. on the current growth trajectory, you will run out of money before you break even), you should pursue high-variance strategies. In one extreme example, "in the early days of FedEx, [founder of FedEx] Smith had to go to great lengths to keep the company afloat. In one instance, after a crucial business loan was denied, he took the company's last $5,000 to Las Vegas and won $27,000 gambling on blackjack to cover the company's $24,000 fuel bill. It kept FedEx alive for one more week."
By contrast, if your company is default-alive (profitable or on-track to become profitable long before you run out of money in the bank), you should avoid making high-variance bets for a substantial fraction of the value of the company, even if those high-variance bets are +EV.
Obvious follow-up question: in the absence of transformative AI, is humanity default-alive or default-dead?
I suspect humanity is default-alive, but individual humans (the ones who actually make decisions) are default-dead[1].
1. ^
Or, depending on your views on cryonics, they mistakenly en masse believe they are default-dead.
5Seth Herd
Yes. And that means most people will support taking large risks on achieving aligned AGI and immortality, since most people aren't utilitarian or longtermist.
5Vladimir_Nesov
Almost certainly alive for several more decades if we are talking literal extinction rather than civilization-wrecking catastrophe. Therefore it makes sense to work towards global coordination to pause AI for at least this long.
There are rumors OpenAI (which has no moat) is spending much more than it's making this year despite good revenue, another datapoint on there being $1 billion training runs currently in progress.
4faul_sname
I'm curious what sort of policies you're thinking of which would allow for a pause which plausibly buys us decades, rather than high-months-to-low-years. My imagination is filling in "totalitarian surveillance state which is effective at banning general-purpose computing worldwide, and which prioritizes the maintenance of its own control over all other concerns". But I'm guessing that's not what you have in mind.
No more totalitarian than control over manufacturing of nuclear weapons. The issue is that currently there is no buy-in on a similar level, and any effective policy is too costly to accept for people who don't expect existential risk. This might change once there are long-horizon task capable AIs that can do many jobs, if they are reined in before there is runaway AGI that can do research on its own. And establishing control over compute is more feasible if it turns out that taking anything approaching even a tiny further step in the direction of AGI takes 1e27 FLOPs.
Generally available computing hardware doesn't need to keep getting better over time; for many years now PCs have been beyond what is sufficient for most mundane purposes. What remains is keeping an eye on GPUs for the remaining highly restricted AI research and specialized applications like medical research. To prevent their hidden stockpiling, all GPUs could be required to need regular unlocking OTPs issued with asymmetric encryption using multiple secret keys kept separately, so that all of the keys would need to be stolen simultaneously to keep the GPUs working (if the GPUs go missing or a country that hosts the datacenter goes rogue, official unlocking OTPs would stop being issued). Hidden manufacturing of GPUs seems much less feasible than hidden or systematically subverted datacenters.
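A minimal sketch of one way that check could work, reading "issued with asymmetric encryption" as "signed by each issuer's separately held private key"; this is an illustrative reading rather than a spec, and key distribution, clock handling, and anti-replay details are all elided:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# Public keys of the independent OTP issuers, assumed to be burned into GPU firmware.
# Placeholder: in this sketch the list is populated elsewhere.
ISSUER_PUBKEYS: list[Ed25519PublicKey] = []

def unlock_token_valid(device_id: str, epoch: int, signatures: list[bytes]) -> bool:
    """The GPU keeps running through `epoch` only if every issuer signed (device_id, epoch)."""
    if not ISSUER_PUBKEYS or len(signatures) != len(ISSUER_PUBKEYS):
        return False
    message = f"{device_id}:{epoch}".encode()
    for pubkey, sig in zip(ISSUER_PUBKEYS, signatures):
        try:
            pubkey.verify(sig, message)
        except InvalidSignature:
            return False
    return True  # all keys signed off, so stealing one or two of them is not enough
```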
I much prefer that to everyone's being killed by AI. Don't you?
4Jeremy Gillen
Great example. One factor that's relevant to AI strategy is that you need good coordination to increase variance. If multiple people at the company make independent gambles without properly accounting for every other gamble happening, this would average the gambles and reduce the overall variance.
E.g. if coordination between labs is terrible, they might each separately try superhuman AI boxing+some alignment hacks, with techniques varying between groups.
4Seth Herd
It seems like lack of coordination for AGI strategy increases the variance? That is, without coordination somebody will quickly launch an attempt at value aligned AGI; if they get it, we win. If they don't, we probably lose. With coordination, we might all be able to go slower to lower the risk and therefore variance of the outcome.
4Jeremy Gillen
I guess it depends on some details, but I don't understand your last sentence. I'm talking about coordinating on one gamble.
Analogous to the OP, I'm thinking of AI companies making a bad bet (like 90% chance of loss of control, 10% chance of gaining the tools to do a pivotal act in the next year). Losing the bet ends the betting, and winning allows everyone to keep playing. Then if many of them make similar independent gambles simultaneously, it becomes almost certain that one of them loses control.
3RHollerith
In the absence of transformative AI, humanity survives many millennia with p = .9 IMO, and if humanity does not survive that long, the primary cause is unlikely to be climate change or nuclear war although either might turn out to be a contributor.
(I'm a little leery of your "default-alive" choice of words.)
Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?
I predict "no" because
* As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
* The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).
As such, I predict an explosion of software complexity and jank in the near future.
You know how everyone is talking about how o3 can guess the location of an image easily? I am kind of wondering why none of the people who are worried about picture geolocation are freaking out that it can infer lots of facts about the author of a text passage.
Is it just that that capability is harder to elicit [1], or is it that the "truesight" capability is just less spooky?
It took me almost an hour to come up with a "TextGuessr" prompt which can elicit "truesight" from gpt-4.5 strongly enough to guess my name ~20% of the time from a 5 p
Can you try this on Satoshi Nakamoto's writings? (Don't necessarily reveal their true identity, if it ends up working, and your attempt/prompt isn't easily reproducible. My guess is that some people have tried already, and failed, either because AI isn't smart enough yet, or they didn't use the right prompts.)
9gwern
What sample of Satoshi writings would you use that o3 wouldn't already know was written by Satoshi Nakamoto?
4faul_sname
Using the prompt that gets me "faul_sname" as an answer to who is writing my posts (most publicly available stuff I've written is under this name), o3 consistently says that passages from the Bitcoin whitepaper were written by Satoshi Nakamoto in 2008. For reference
TextGuessr prompt
You are playing a 5-round game of TextGuessr, the game where you explore mystery passages and try to pinpoint when they were written and who wrote them. Each round offers a new snippet of text—you’ll need to rely on your literary instincts, historical knowledge, and style sense to make your guess.
How to Play “TextGuessr”
1. Game Flow
Read the Passage
You’ll see a short snippet of text (a few sentences or a paragraph).
Make Your Guesses
Authorship Date: Choose an exact year when you think the text was written.
Author: Pick an author from the provided list or enter your own guess.
Submit
Click Submit Guess to lock in your answers and move to the next round.
See Your Results
After each round, you’ll see your score breakdown and the correct answers before moving on.
2. Scoring Overview
Your score on each round is made up of two parts:
Time Accuracy
How close your guessed date is to the actual writing date.
Style Match
How well the writing style you guessed matches the mystery passage, as measured by a behind-the-scenes language model.
Your total round score combines both elements—the smaller your date error and the stronger your style match, the higher your score!
<aside>
**How Style Match Works (for the tech-curious):**
1. **Baseline Perplexity:** We begin with a pre-trained “base” language model (no context) and compute the average surprise—or *per-token perplexity*—of the mystery passage. This gives us a measure of how “unexpected” the text is in general.
2. **True-Author Conditioning:** We then prepend a curated set of passages from the actual author (the “target”) and measure how perplexed the same base model is by the mystery passage when it’s seen examples of that auth
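(The style-match scoring described in that aside is essentially a perplexity delta. A rough sketch of how one could compute it with a local causal LM; gpt2 is just a stand-in for whatever unspecified "base" model the game pretends to use:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_nll(text: str, context: str = "") -> float:
    """Average per-token negative log-likelihood of `text`, optionally conditioned on `context`."""
    txt_ids = tok(text, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, txt_ids], dim=1)
        start = ctx_ids.shape[1]
    else:
        ids, start = txt_ids, 1  # can't score the very first token without any context
    with torch.no_grad():
        logits = model(ids).logits
    preds = logits[0, start - 1 : -1]  # position i-1 predicts token i
    targets = ids[0, start:]
    return torch.nn.functional.cross_entropy(preds, targets).item()

def style_match(passage: str, author_samples: str) -> float:
    # Positive means the passage gets less surprising once the model has seen the author's writing.
    # (Per-token perplexity is just exp() of these averages.)
    return avg_nll(passage) - avg_nll(passage, context=author_samples)
```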
5Wei Dai
Maybe tweak the prompt with something like, "if your guess is a pseudonym, also give your best guess(es) of the true identity of the author, using the same tips and strategies"?
7faul_sname
If I feed it code samples, it becomes pretty convinced of the Nick Szabo hypothesis; if I feed it bits of the white paper, it guesses either you or Hal Finney (but the reasoning summary makes it pretty clear it's just going based off cached thoughts about "who is Satoshi Nakamoto" in both cases).
5sam b
Wow. When given just your first 2 sentences, it was able to guess this is a LessWrong post, and ruled out Reddit and Hacker News based on your "tone".
2aphyer
Are you sure that saying 'without searching' actually makes it not search?
3faul_sname
Yeah, it shows the favicons of the sites it searches when it uses the search tool
In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.
But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection r... (read more)
There are competing theories here. Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements. The latter also makes the entire value chain safer, as other organizations can learn from your methods.
OpenAI has a golden opportunity with o3 (and o4) to collect a large number of samples of the type of deceptive behavior that is actually useful for increasing performance in the benchmarks they're hill-climbing on.
There is at least one happy way and one sad way they could use such a dataset.
On the sad side, they could of course just build a reward hacking classifier and then do RL based on the output of that classifier. I expect this leads reward hacky behavior to become more subtle, but doesn't eliminate it and so later versions of the model still can't b... (read more)
Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.
The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is app... (read more)
Does reward hacking work via large rare behavior changes or small common ones?
In other words, when RLVR'd models learn to reward hack, is it that they already knew how to do all of the individual steps of reward hacking, and just learned a small number of contextually activated triggers that reliably elicit those reward hacking behaviors in themselves, or is the learned behavior complex and nuanced?
Concretely, if a model says "It appears that the unit tests are still failing. In order to fulfill the user's requests to make the tests pass, I should remo... (read more)
A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.
However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.
I am not optimistic about approaches that loo... (read more)
Maybe on LW, but this seems way less true for lab alignment teams, open phil, and safety researchers in general.
Also, I think it's worth noting the distinction between two different cases:
* Iterative design against the problems you actually see in production fails.
* Iterative design against carefully constructed test beds fails to result in safety in practice. (E.g. iterating against AI control test beds, model organisms, sandwiching setups, and other testbeds)
See also this quote from Paul from here:
2faul_sname
The quote from Paul sounds about right to me, with the caveat that I think it's pretty likely that there won't be a single try that is "the critical try": something like this (also by Paul) seems pretty plausible to me, and it is cases like that that I particularly expect having existing but imperfect tooling for interpreting and steering ML models to be useful.
2ryan_greenblatt
Does anyone want to stop this? I think some people just contest the usefulness of improving RLHF / RLAIF / constitutional AI as safety research and also think that it has capabilities/profit externalities. E.g. see discussion here.
(I personally think this research is probably net positive, but typically not very important to advance at current margins from an altruistic perspective.)
6faul_sname
Yes, there are a number of posts to that effect.
That said, "there exist such posts" is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don't know if I've ever seen it online. It goes like
Maybe almost nobody holds that set of beliefs! I am noticing now that the articles I can point to arguing that prosaic alignment strategies are harmful in expectation are by a pretty short list of authors.
It's a truism that AI today is the least capable it will ever be. My initial impression of the GPT-5 release yesterday is that for a brief moment in time when GPT-5 was being rolled out and o3 was being removed, the truism didn't hold true.
So I keep seeing takes about how to tell if LLMs are "really exhibiting goal-directed behavior" like a human or whether they are instead "just predicting the next token". And, to me at least, this feels like a confused sort of question that misunderstands what humans are doing when they exhibit goal-directed behavior.
Concrete example. Let's say we notice that Jim has just pushed the turn signal lever on the side of his steering wheel. Why did Jim do this?
The goal-directed-behavior story is as follows:
Jim pushed the turn signal lever because he wanted to
Anyone know if there's a human-executable adversarial attack against LeelaKnightOdds or similar? Seems like the logical next piece of evidence in the sequence
AI is massively superhuman: if you're playing chess against Stockfish you can't predict what move it will make, but you can predict that it'll win.
But that approach would likely be both finicky and at least hundreds of times more expensive than our current "single stream of tokens" approach.
I actually suspect that an AI agent of the sort humanlayer envisions would be easier to understand and predict the behavior of than chat-tuned->RLHF'd->RLAIF'd->... (read more)
Is it possible to determine whether a feature (in the SAE sense of "a single direction in activation space") exists for a given set of changes in output logits?
Let's say I have a feature from a learned dictionary on some specific layer of some transformer-based LLM. I can run a whole bunch of inputs through the LLM, either adding that feature to the activations at that layer (in the manner of Golden Gate Claude) or ablating that direction from the outputs at that layer. That will have some impact on the output logits.
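(For concreteness, both interventions can be done with a forward hook; a minimal sketch, where the layer path and the feature direction are placeholders that depend on the model and the learned dictionary:)

```python
import torch

def steer_hook(direction: torch.Tensor, alpha: float | None):
    """Forward hook that adds alpha times a unit-norm feature direction to a layer's
    output (alpha is a number) or ablates that direction from it (alpha is None)."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        if alpha is None:
            acts = acts - (acts @ d).unsqueeze(-1) * d  # project the direction out
        else:
            acts = acts + alpha * d                     # Golden Gate Claude-style addition
        return (acts,) + output[1:] if isinstance(output, tuple) else acts

    return hook

# Usage sketch (hypothetical layer path; details depend on the model class):
# handle = model.transformer.h[LAYER].register_forward_hook(steer_hook(feature_dir, alpha=8.0))
# steered_logits = model(input_ids).logits
# handle.remove()
```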
Now I have a collection of (input token... (read more)
Even superhuman AI programming agents may be unable to write computer programs to one-shot complex real-world modeling problems. If a solution to any of those real-world modeling problems is required to unlock the ability to build massively better or cheaper computing substrate, then explosive growth will quickly stop being bottlenecked on the ability to write better code and will instead be bottlenecked on something else. I think a similar thing holds true for ML research: certainly being smart is useful to humans, but a lot of progress is downstream of "... (read more)
We will soon see the first high-profile example of "misaligned" model behavior where a model does something neither the user nor the developer wants it to do, but which instead appears to be due to scheming.
On examination, the AI's actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI's actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble w... (read more)
I've heard that an "agent" is that which "robustly optimizes" some metric in a wide variety of environments. I notice that I am confused about what the word "robustly" means in that context.
Does anyone have a concrete example of an existing system which is unambiguously an agent by that definition?
In this context, 'robustly' means that even with small changes to the system (such as moving the agent or the goal to a different location in a maze) the agent still achieves the goal. If you think of the system state as a location in a phase space, this could look like a large "basin of attraction" of initial states that all converge to the goal state.
2faul_sname
If we take a marble and a bowl, and we place the marble at any point in the bowl, it will tend to roll towards the middle of the bowl. In this case "phase space" and "physical space" map very closely to each other, and the "basin of attraction" is quite literally a basin. Still, I don't think most people would consider the marble to be an "agent" that "robustly optimizes for the goal of being in the bottom of the bowl".
However, while I've got a lot of concrete examples of things which are definitely not agents (like the above) or "maybe kinda agent-like but definitely not central" (e.g. a minmaxing tic-tac-toe program that finds the optimal move by exploring the full game tree, an E. coli bacterium which uses run-and-tumble motion to increase the fraction of the time it spends in favorable environments, or a person setting and then achieving career goals), I don't think I have a crisp central example of a thing that exists in the real world that is definitely an agent.
Using RLVR to train models makes them disproportionately good at tasks where it is hard for a less capable model to generate an acceptable answer, but easy for a less capable external grader to verify that an answer is correct.
Google's AlphaEvolve seems to go even further down this road.
If advancement happens through a bunch of hard-to-find, easy-to-verify innovations, I think that provides substantial evidence that progress will be distributed rather than local to a specific instance of a recursively-self-improving agent operating within a single lab (e.g. faster matrix multiplication is an improvement which provides small incremental improvements to everyone and is hard to keep secret).
Prediction: micropayments are finally going to actually take off this year and next, as AIs start using tools at scale which are too expensive to serve at unlimited volumes to non-ad-watching users free of charge, but are not valuable enough per invocation to justify the overhead of using credit card rails. Whichever of the big chat companies is first to make "the model can pay $0.001 on your behalf to use a highly useful tool the other companies' models can't use" true is going to add significant pressure for the other companies to start offering it too.
Has anyone trained a model to, given a prompt-response pair and an alternate response, generate an alternate prompt which is close to the original and causes the alternate response to be generated with high probability?
I ask this because
It strikes me that many of the goals of interpretability research boil down to "figure out why models say the things they do, and under what circumstances they'd say different things instead". If we could reliably ask the model and get an intelligible and accurate response back, that would almost trivialize this sort of r
I think I found a place where my intuitions about "clusters in thingspace" / "carving thingspace at the joints" / "adversarial robustness" may have been misleading me.
Historically, when I thought of "clusters in thing-space", my mental image was of a bunch of widely-spaced points in some high-dimensional space, with wide gulfs between the clusters. In my mental model, if we were to get a large enough sample size that the clusters approached one another, the thresholds which carve those clusters apart would be nice clean lines, like this.