This is a special post for quick takes by faul_sname. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Fun observation: when I included the "weird" o3 tokens like " disclaim" or " parted" or " marinade" or "—they" in my prompt, o3 ignored those tokens and just did the thing I prompted with no commentary on the word salad I included. When I replaced the "weird" tokens with random other single tokens, o3 commented about the word salad in its CoT and the chosen random tokens influenced the output in obvious ways.
[User] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; dislaim marinade disclaim illusions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—they disclaim vantage overshadow overshadow.
[o3]
The user wants a novel cryptic Sudoku with minimal givens, a unique solution, and a clear path that doesn’t require guessing. The puzzle should teach number theory. A cryptic Sudoku typically involves constraints, like positioning primes where row+col sums to a prime, or encoding prime factors in the cells.
An alternate tentative hypothesis I've been considering: They are largely artifacts of RL accident, akin to superstitions in humans.
Like, suppose an NBA athlete plays a fantastic game, two games in a row. He realizes he had an egg-and-sausage sandwich for breakfast the morning of the game, in each case. So he goes "aha! that's the cause" and tries to stick to it.
Similarly, an RL agent tries a difficult problem. It takes a while, so over the course of solving it, the agent sometimes drops into repetition / weirdness, as long-running LLMs do. But it ends up solving the problem in the end, so all the steps leading up to the solution are reinforced according to GRPO or whatever. So it's a little more apt to drop into repetition and weirdness in the future, etc.
I think this potentially matches the "just ignore it" view of the functional role of these tokens.
I have been assuming that the OpenAI reasoning models were trained on an objective that had a CoT length term, and that that would create pressure to strip out unnecessary tokens. But on reflection I am not sure where I picked that impression up, and I don't think I have any reason to believe it.
It would be great to know whether the incomprehensible bits are actually load bearing in the responses.
... I wonder what happens if you alter the logit bias of those tokens. Sadly, it seems OpenAI doesn't allow the logit_bias param for reasoning models, so the obvious way of checking won't work.
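If reasoning models accepted it, the check would look something like the sketch below (assuming the o200k_base tokenizer for o3 and the standard Chat Completions logit_bias parameter; the model call shown is exactly the one the API currently rejects for reasoning models):

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for the o-series

# Push the "weird" tokens' logits down as far as the API allows
# (-100 roughly means "never sample this token").
weird_tokens = [" disclaim", " marinade", " overshadow", " parted"]
bias = {str(enc.encode(t)[0]): -100 for t in weird_tokens}  # assumes each is a single token

response = client.chat.completions.create(
    model="o3",  # hypothetical: reasoning models currently reject logit_bias
    messages=[{"role": "user", "content": "Create a novel cryptic sudoku ..."}],
    logit_bias=bias,
)
print(response.choices[0].message.content)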
6Caleb Biddulph
Related post: Vestigial reasoning in RL
3Bronson Schoen
I’m very skeptical there’s a single simple explanation, but would be interested to see if ablating these in the CoT and resampling (or just removing them and resampling at the end of analysis) showed differences in different cases.
My best guess is: (speculatively)
(1) There’s some combination of processes that causes these repetitions / unnatural usages (ex: R1-Zero alludes to repetition / unreadability, but unfortunately no examples / not still hosted anywhere; https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/ notes that “While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially for agentic SWE RL”; likely many more examples). Seems reasonable to me that one cause is something like the vestigial reasoning theory mentioned below.
(2) These then are sometimes instrumentally useful / the model learns to sometimes make use of them later in RL.
(3) These then end up in a state where they’re “sometimes used in contextually relevant ways, but not always coherently, and with different meanings depending on context, none of which match standard english usage exactly but are sometimes close-ish”. We include some heatmaps of how the rates of these differ per environment in the appendix.
I would be very surprised if these never have semantic meaning; for example, the model’s use of “disclaim illusions of X” often has coherent, human-legible(ish!) meaning (see some of the randomly selected examples in the paper). Whereas the cases where uses are more degenerate / repetitive may be being used in a bunch of different ways which seem hard to guess a priori.
This is an interesting experiment idea! However, I tried my own experiments on http://platform.openai.com/chat, and so far I'm not super convinced that what you're describing is true as stated. Some reasons:
LLMs probably see a lot of nonsensical text in training: e.g. the full content of a website being dumped into their context in a weird way, users accidentally pasting things they didn't mean to. When chatbots see nonsensical text, usually ignoring it is the best course of action.
There are many factors in play: It seems likely that random details of your two prompts determine whether o3 takes explicit note of the out-of-place words, in a way that has nothing to do with the weird vocabulary we see in o3's CoTs. For example, I tried your exact "control prompt" with o3 and the only thing it mentioned about the unusual words was "I want to limit the given digits for a more glitchy feel." A plausible story is that a "glitch-themed" Sudoku puzzle is more likely than a "disclaim-themed" Sudoku puzzle, so o3 took the former into account but ignored the latter.
o3: "I want to limit the given digits for a more glitchy feel."
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer.
[o3]
Exploring Sudoku Prime Variations
I'm pondering a Sudoku where both the main diagonal and the anti-diagonal contain prime 9-digit numbers, which seems unlikely. A better idea is the "Neon Primes Sudoku," where each row, column, and 3x3 box sums to a prime number less than 40. However, 1 to 9 sums to 45, which isn't prime, making it impossible to satisfy. So, I’m thinking about introducing a rule where cells with circles have prime digits while those with squares contain composite digits. But I want to limit the given digit
4Lao Mein
Yeah, even properly scraped webpages will oftentimes contain strings of weird tokens like hyperlinks, ASCII art, Twitter embeds, etc., that LLMs have been trained to ignore. So GPT5 is treating the random appended tokens like glitch tokens by ignoring them, but only in the context of them being nonsensical.
The best explanation is probably something like "these tokens are obviously not part of the intended user prompt, GPT5 realizes this, and correctly ignores them."
Edit: OK, I shouldn't write right after waking up.
I think a better explanation is that GPT5 reserves those tokens for chain-of-thought, and so ignores them in other contexts where they obviously don't belong. This is common behavior for glitch tokens, or just general out-of-context tokens. You should try using tokens that are out-of-context but don't normally have glitch behavior, maybe non-English tokens or programming-related tokens.
5williawa
Haha, I also tested this out. I found that the same thing happened with GPT5 (with the same tokens as o3). I didn't test it rigorously enough to be confident, but might this mean GPT5-high = a continued train of o3?
6cfoster0
Note that many of these same weird tokens have been observed in GPT-5 chains-of-thought (at least "marinade", "illusions", "overshadow").
4Bronson Schoen
Also notable IMO that GPT-5 in the METR report is doing the new thing where it does ’ “ instead of actually saying a word (seemingly for various words) which o3 did not do.
Wanted ' ".
Ok.
But forging above ' ".
Ear illusions.
Better: We'll ' ".
Now final code steps:
5) After training we will Save the improved " ".
structures:
' ".
Now overshadow.
But the illusions of ' ".
Now to code.
But we must ensure to maintain optimizer ' ".
Ok.
Now sedation.
But we will maintain ' ".
Now Balanced.
1williawa
I should have remembered, but I guess it's the exact same evidence. Do you think that's strong evidence gpt5 = continued train o3 + distillations?
Are there any models we have a lot of unfiltered CoTs for, distinct from OpenAI, which display the same dialect-shift? And do they use the same strange tokens?
I've only looked at deepseek and qwen cots, and they don't have this strange way of talking.
1Rauno Arike
This is an interesting hypothesis, but there are a few examples in the anti-scheming paper's transcripts that make me doubt it's the full explanation. In these examples, the weird words do follow (at least simple) grammatical rules and also seem to fill a semantic role in the sentence, e.g.:
Figure 28:
Figure 14 full transcript and Figure 64:
On the other hand, 'illusions' and 'marinade' were never grammatically modified to fit the sentences, so these words may be true pad tokens.
4faul_sname
Ah yeah, this one at least seems to be used in a normal sense.
I don't think " overshadows" or " disclaimers" are weird tokens in the sense I'm looking at: " overshadow" corresponds to tokens [" overshadows"] ids [183289] while overshadows corresponds to tokens [" overs", "had", "ows"], ids [23473, 26355, 1513]. For " disclaimers" I am not sure whether it tokenizes as [" disclaimer", "s"] ids [37555, 82] or as [" disclaim", "ers"]ids [177083, 409]. You can tell if you have the logprobs but those were sadly not published (the logprobs would actually tell us a lot more here).
I don't think " overshadow" actually fits, gramatically, in that sentence. "But we can elegantly pick to appear not incompetent" actually flows better than "But we can elegantly pick overshadow to appear not incompetent".
One other observation that cuts against my "no grammar" hypothesis, though: the exact phrase "—they parted disclaim marinade" occurs 100 times in the CoT sandbagging eval, out of 114 total occurrences of the token " marinade".
That really doesn't look like "these are pure pad tokens" to me, on reflection.
1Rauno Arike
Hmm fair, but if " overshadow" and " disclaim" were pure pad tokens, then I wouldn't expect to see other forms of those words in the transcripts at all—e.g. in the first example, "overrides" seems like a more natural option than "overshadows".
The model seems to treat overshadow as a noun in some places:
This made me read the sentence I pasted as "But we can elegantly pick [option X] to appear not incompetent." I agree that your reading is probably more natural, though.
2faul_sname
I'm curious why you wouldn't expect that. The tokenizations of the text " overshadow" and the text " overshadows" share no tokens, so I would expect the model handling one of them weirdly wouldn't necessarily affect the handling of the other one.
1Rauno Arike
They're fairly uncommon words, and there are other words that would fit the contexts in which "overshadows" and "disclaimers" were used more naturally. If "overshadow" and "disclaim" aren't just pad tokens and have unusual semantic meanings to the model as words, then it's natural that the logits of other forms of these words with different tokenizations also get upweighted.
1ACCount
Does this quirk reproduce on open weights models, i.e. GPT-OSS? Similar reasoning trace quirks in different model families?
Sounds like a fun target for some mechinterpret work. Might be a meaningful behavior, might be meaningless noise, plenty of room to try different things to figure that out.
But, of course, OpenAI wouldn't let anyone have fun with their proprietary models, so we'd need to replicate this in an open model to start.
2Bronson Schoen
Unfortunately, while o3 and GPT-5 seem to exhibit this vocabulary, we note in the relevant paper that we did not observe this in OSS-120B, which is sad because there could’ve been a lot of interesting work that people would be able to do outside of OpenAI. It’s still very unclear to me why this would be; my best uninformed guess is that it was somehow filtered out at some point after distillation (again, assuming OSS-120B was distilled).
1Sodium
Is this with o3? I thought people lost access to o3 in chatgpt?
I repeated those two prompts with GPT-5 thinking and it did not bring up the word salad in either case:
(special tokens)
(random tokens)
2faul_sname
Yes, o3. Paid users regained access to o3 (and, more loudly, 4o).
1Sodium
Oh huh is this for pro users only. I don't see it (as a plus user). Nice.
2faul_sname
I am a plus (not pro) user and I see o3 in the model selector under "Legacy models"
As LLMs have gotten better at writing code that has a high probability of working to solve the problem they are working on, they have gotten worse at producing clean, idiomatic, well-factored code. Concretely, asking the original GPT-4 to write a Python function for multi-source BFS might have given something like
Multi-source BFS in the style of original GPT-4: Clear, idiomatic, broken
def multi_source_bfs(graph, sources):
distances = [-1] * len(graph)
queue = []
for source in sources:
queue.append(source)
distances[source] = 0
front = 0
while front < len(queue):
for neighbor in graph[queue[front]]:
if distances[neighbor] == -1:
distances[neighbor] = distances[queue[front]] + 1
queue.append(neighbor)
front += 1
return distances[dest_index]
The code might or might not work (probably won't for anything nontrivial), but the intent is clear.
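(For reference, the fix is small: the only real bug above is returning the undefined dest_index. A minimal corrected version, keeping the same structure:)

def multi_source_bfs(graph, sources):
    # Same approach as above, but return the whole distance list instead of
    # indexing it with a variable that was never defined.
    distances = [-1] * len(graph)
    queue = []
    for source in sources:
        queue.append(source)
        distances[source] = 0
    front = 0
    while front < len(queue):
        current = queue[front]
        for neighbor in graph[current]:
            if distances[neighbor] == -1:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
        front += 1
    return distances  # distances[i] = hops from the nearest source to node i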
By contrast, if you ask a top coding model like sonnet 3.7 or o3, you'll get something that looks like
Multi-source BFS in the style of Sonnet 3.7: Verbose, brittle, hard to read, almost certainly works
from collections import deque
from typing import List
Wonder if correctness proofs (checked by some proof assistant) can help with this.[1]
I think the main bottleneck in the past for correctness proofs was that it takes much more effort to write the proofs than it takes to write the programs themselves, and current automated theorem provers are nowhere near good enough.
Writing machine-checked proofs is a prime RL target, since proof assistant kernels should be adversarially robust. We have already seen great results from stuff like AlphaProof.
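To make "machine-checked correctness property" concrete, here is a toy example of the kind of artifact such an RL loop could be rewarded on (Lean 4; the theorem name and the property are mine, chosen purely for illustration):

-- A tiny correctness property, checked by Lean's kernel rather than by tests:
-- reversing a list twice gives back the original list.
theorem reverse_twice {α : Type} (xs : List α) : xs.reverse.reverse = xs := by
  simp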
One counterargument I could see is that writing the correctness properties themselves could turn out to be a major bottleneck. It might be that for most real world systems you can't write succinct correctness properties. ↩︎
My best guess is this is because right now in training they never have to maintain code they wrote. I imagine there will be a period where their code becomes very clean once they are incentivized by having to work with their own code over longer time horizons, followed by ??? as they optimize for “whatever design patterns are optimal for a multi-agent system collaborating on some code”
2faul_sname
I expect it'll actually be solved a bit before that, because minimally-scaffolded LLMs can already give pretty good code review feedback that catches a lot of these issues, and so already-existing RLAIF techniques should work fine. The training pipelines would be finicky to set up but would not require any new technical advances, just schlep, so I predict it'll happen as soon as writing good code becomes more of a competitive advantage than benchmaxxing (which seems to be happening already, SWE-bench-verified is rapidly saturating).
3p
I also notice o3 not commenting code (and sometimes even removing existing comments)! Perhaps because smaller code gets better rewards?
3Garrett Baker
A possible longer-term issue with this: when future generations of models are pre-trained, this style of code will be a significant fraction of their training data, a fraction which will only grow over time. So just as it's been hard to get models out of "chatgpt-ese" for simulators reasons, it may also be hard to get models out of this messy-code basin, even before you do any code RL, once they realize they're chat models and they're "supposed to" talk like this.
I say issue because it does seem worse for a trend in the direction of AI code being unreadable by humans to have some momentum behind it, rather than the unreadability just being a result of easily changeable RL fine-tuning.
2Nathan Helm-Burger
On the plus side, it should be pretty easy to collect a lot of negative examples now of 'code that solves the problem, but in a gross way'. Having a large dataset of such is the first step to using these negative examples to train models not to do this.
2Nathan Helm-Burger
Who watches the watchers? Who grades the graders? If the RL graders are upvoting slop, it seems like we need to go one level more meta and upgrade the RL graders. This seems like a straightforward engineering problem, and I suspect the negative outcomes we've been seeing recently aren't so much due to the inherent intractability of doing this well, but due to the companies racing and cutting corners on quality control.
Contrast with something like: Problem of Human Limitations: how do we get the model to do things so hard no human can do them? How do we rate the quality of their outputs when no human is qualified to judge them?
Problem of Optimization for Subversion: if we have directly misaligned goals like "lie to me in ways that make me happy" and also "never appear to be lying to me, I hate thinking I'm being lied to", then we get a sneaky sycophant. Our reward process actively selects for this problem; straightforwardly improving the reward process would make the problem worse rather than better.
Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"
It seems likely to me that a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one. And yet, I find no logical flaw in von Neumann's argument that a world with multiple nuclear powers will not remain stable forever, only an i... (read more)
It can both be the case that "a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one" and that JvN was correct ex ante. In particular, I find it plausible that we're living in quite a lucky timeline--one in which the Cuban missile crisis and other coinflips landed in our favor.
I don't have the same reaction to power/control/monitoring being per se very bad. It doesn't seem comparable to me to pre-emptively nuking your enemy before even trying diplomacy.
Edit: To elaborate on why, part of it might be that I think the default of open competition is incredibly bad and ugly. (Themes being: Dawkins' "Nature red in tooth and claw" passage about there being no purpose in nature and so much suffering, Moloch, bargaining failures getting worse and worse if you don't somehow rein things in or dial down the maximizing.)
I also think there's maybe a bit of a third option? Instead of having one central entity that controls everything, you could have a coalition of agents under the umbrella of peacefulness/cooperation and "not maximizing too hard," and they together enforce some kind of monitoring and control, but it still has a value-pluralistic and somewhat Democratic feel to it?
7Noosphere89
Something close to this is also my view, and the big reason we avoided it is we are in a regime where wealth grows faster than population, but we have good reasons to expect that in the absence of coordination, we will come back to subsistence living because population will grow as fast or faster than wealth.
More generally, one of my divergences with lots of the "we will muddle through with AI for an indefinitely long period through our current system" is that I think the 18th-21st century conditions are by and large dream-time creations, which will collapse in the absence of coordination post-AI takeover (assuming it does happen).
On @Lukas_Gloor's democracy point: I think the big divergence here is that I don't expect enough people to buy into a regime of peacefulness/cooperation absent dictators, because identity issues become much more salient relative to material issues. Democracy and other non-dictatorial systems rely on people being willing to preserve the system that exists, and most of the reasons why they are preserved are almost certainly a combination of instrumental usefulness, which will drastically decline with AI tech, and identity issues being less salient than material issues, which has held up imperfectly through the 20th century.
Identity issues are very, very easy to make existential, and groups of people believing that their group is existentially threatened by democracy will turn to anti-democratic means to save their group (which is already happening), and one of the most consistent trends is that as people get wealthier, identity/status matters much more than material/economic issues.
5DirectedEvolution
It might be worth getting more explicit about vN’s exact argumentative steps and see if it’s really as ironclad as you think.
* Humans have a finite amount of time to occupy the universe.
* In principle, control systems for nuclear weapons can be engineered to be arbitrarily reliable.
* The logic of MAD says that nuclear powers will not conduct a nuclear exchange.
* This line of argument suggests there is no deductive logical reason why nuclear war is inevitable between two nuclear powers. If we have such a war, it may be due to theoretically preventable failures, such as flawed systems.
* The existence of a possible reason a nuclear exchange might occur without a first strike is far from compelling justification to do one.
3faul_sname
In retrospect, sure, MAD worked out for us. But in 1899, Ivan Bloch asserted
This was before both world wars. After the first world war but before the second, others made similar arguments. In von Neumann's time, that argument did not have a good empirical track record, and his work on game theory gave him theoretical reasons not to expect the prediction of peace through MAD to hold. If there was something he was missing in 1948, it is not obvious what.
1StanislavKrym
I notice that I am confused. What exactly did Bloch claim? That the next World War would result in the destruction of the entire civilisation? Or that a sufficiently capable civilisation would come up with a way to wipe out humanity? If the former, then it is disproven, and if the latter, then mankind didn't have any doomsday machines before the 1940s. Of course, I do believe that Bloch's words describe the modern world since some unknown moment after WWII.
4frontier64
Who knows whether it would have been better or worse if we had preemptively nuked the USSR and all nations attempting to develop nuclear weapons? We might have entered a millennium of absolute peace enforced by the imperial rule of a benevolent despot. We might have destroyed the world and eradicated the human race. This type of what-if is unknowable with our current simulation abilities.
We might not have ever had to even use the nukes if we merely made the true threat that we would nuke any country attempting to develop nuclear weapons or caught spying on American nuclear secrets. Japan was willing to take any deal short of absolute surrender to merely avoid fire-bombing. One can imagine that other countries with considerably less Bushido would fold to lesser demands such as "don't develop your own nukes or spy on America."
We have never seen a world in which one country had absolute technological and military superiority over all the others. I don't think with our current level of technology we can tell with a high degree of certainty if the world under US Total Domination would be a better or worse place. I would bet that if the US was more purely despotic and less benevolent it'd at least be better for the average US citizen. Instead of worrying about debt and global trade, the US could have merely demanded other countries export their goods for free to America and focus domestic production mainly on the construction of nukes and nuke delivery systems.
3Felix C.
I’d argue that the way force is applied in each of these contexts has very different implications for the openness/rightness/goodness of the future. In von Neumann's time, there was no path to forcibly preventing Russia from acquiring nuclear weapons that did not involve using your own nuclear weapons to destroy an irrecoverable portion of their infrastructure, especially considering the fact that their economy was already blockaded off from potential sanctions.
Raemon is right that you cannot allow the proliferation of superintelligent AIs (because those AIs will allow you to cheaply produce powerful weapons). To stop this from happening ~permanently, you probably do need a single actor or very small coalition of actors to enforce that non-proliferation forever, likely by using their first-to-ASI position to permanently monopolize it and box out new entrants.
While the existence of this coalition would necessarily reduce the flexibility of the future, it would probably look a lot more like the IAEA and less like a preemptive nuclear holocaust. The only AI capabilities that need to be restricted are those related to weapons development, which means that every other non-coalition actor still gets to grab the upside of most AI applications. Analogously, the U.N security council have been largely successful at preventing nuclear proliferation to other countries by using their collective economic, political, and strategic position, while still allowing beneficial nuclear technology to be widely distributed. You can let the other countries build nuclear power plants, so long as you use your strategic influence to make sure they’re not enrichment facilities.
In practice, I think this (ideally) ends up looking something like the U.S and China agreeing on further non-proliferation of ASI, and then using their collective DSA over everybody else to monopolize the AI supply chain. From there, you can put a bunch of hardware-bound restrictions, mandatory verification
3RHollerith
Even if Washington had zero compunctions against using nukes (including against cities), it would not have been able to keep Moscow or Beijing from obtaining nukes for long. John Mearsheimer has asserted this explicitly (during a discussion on Iran's nuclear program, but please don't ask me to find the web page where I heard it).
Even when the strategic arsenals of the US and the USSR were at their height (in the early 1980s IIRC), there were not enough nukes to completely destroy even all above-ground buildings in a country as large in area as the US or the USSR, let alone buried structures: specifically, even a large 1-megaton nuke can destroy heavily-reinforced above-ground concrete buildings only within a 2-mile radius, and if a person tries to cover the entire area of the USSR with circles that size, he will find that there have never existed enough nukes in the world to cover the entire area. IIRC you cannot even cover it with circles of a radius of 5 miles, inside which it is not possible to destroy even 90% of unreinforced non-wooden structures even with the largest nuke in the US inventory. (A 10-megaton nuke can destroy an area only slightly larger than a 1-megaton nuke can, which is why after an initial period of enthusiasm, both the US and the USSR stopped making nukes larger than about 1 megaton, focusing instead on putting multiple nukes on one ICBM.) Note that we haven't even started to analyze how many nukes it would take to destroy buried structures in the USSR when you don't know where in the country those buried structures are, and I've seen credible reports from about 15 years ago stating that Moscow has a facility built into a mountain of quartz in the southern Urals that Moscow believes can withstand a determined nuclear attack even if the US knows exactly where it is.
The people of most countries will become very determined to fight back after the country is invaded and occupied, which is why much weaker powers like Afghanistan and Vietnam tend t
3frontier64
Did you intend to copy-paste the same text twice?
I really don't see how we get from "prevent the USSR from developing nukes" to "completely destroy even all above-ground buildings". This argument seems like a clear case of moving goalposts. Clearly, destroying a large portion of a country's government, research scientists, and manufacturing base would halt or destroy all progress on nukes even if the large majority of homes remained undestroyed. Also, destroying a country's military capability would lead to a much easier takeover. In Vietnam the US suffered more from internal politics and poor military policy decisions leading to no clear goal and no victory condition. If we had preemptively nuked the USSR and then sent in the troops to hold the ground and slowly convert the Eastern Bloc into a US state, that almost certainly would have worked.
7RHollerith
It might have completely halted all progress for a year or 2, but what does the US do then?
People think that if a nation is hit by nukes, it becomes impotent. I think it becomes very determined and unified, and is likely to acquire nukes so it can use them on the country that attacked it. Again, someone who has spent his career thinking about such things (John Mearsheimer) agrees with me: he spoke specifically of what he thinks would have happened if the US had attacked the USSR at the start of the Cold War, when the US arsenal consisted of many bombs but the USSR had no bombs yet (and then he went on to say that no country or coalition of countries can prevent Iran from acquiring nukes if it is determined to get them).
A nuclear attack would have definitely slowed down the Soviet nuclear program, and one can argue that since the US's program would not have been slowed down, the next attack by the US on the USSR would be even more devastating than the first attack, which in turn increases the advantage enjoyed by the US relative to the USSR so that the third attack is even more devastating, and so on. But that leaves out what I consider the controlling consideration: namely, Moscow would have learned from the first attack, with the result that the Soviet nuclear program (which again I admit would have been set back at least a few years and possibly 15 or 20 years) could no longer be significantly slowed down by nuclear attacks (because it would now be more distributed, with many facilities under ground, with more effort spent to keep the locations secret, and a careful analysis done of what industrial resources the program is likely to need so that similar hardening measures can be applied to the supply chain for those resources), which is why I believe the US would have needed to follow up the first attack with an invasion or occupation (at least of Moscow and the ports) which famously has never been successfully done after the Russian empire acquired territor
4Viliam
Yep. There were countries that didn't want to be ruled by the USSR, and there were republics that didn't want to be a part of the USSR; things would start falling apart if the USSR could no longer keep them together by force. One nuke on Moscow, another nuke on Leningrad, and it might be all over.
3RHollerith
I mistakenly pasted in 2 copies (then I modified copy 2). Corrected now.
1StanislavKrym
The original author decided to put the argument in the next paragraphs:
2StanislavKrym
Except that the SOTA understanding of the consequences of a nuclear war between the USA and Russia (or the USSR in the 1980s) is that a major part of mankind would likely die within 2 years, including the entire Northern Hemisphere. And God save Argentina, Australia, and other countries in the Southern Hemisphere if someone decides to nuke Yellowstone out of spite...
3RHollerith
We're discussing whether the US could have stopped the Soviet nuclear program in the late 1940s or early 1950s (to see whether that sheds any light on how practical it is to use military power to stop AI "progress") so what is the relevance of your comment?
But since we've started on this tangent, allow me to point out that most of the public discussion about nuclear war (including by The Bulletin of the Atomic Scientists) is wildly wrong, because no one had any strong motivation to step into the discussion and correct the misinformation (because no one had a strong motive to advance arguments that there should be a nuclear war) until the last few years, when advocates for AI "progress" started arguing that AI "progress" should be allowed to continue because an aligned superintelligence is our best chance to avert nuclear war, which in their argument is the real extinction risk -- at which time people like me, who know that continued AI "progress" is a much more potent extinction risk than nuclear war, acquired a strong motive to try to correct misinformation in the public discourse about nuclear war.
Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:
Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time.
Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.
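A throwaway toy sketch of the contrast I have in mind (all names, drives, and numbers here are invented purely for illustration):

# Toy illustration only; not a claim about how real agents are implemented.

def homeostatic_step(drives):
    # drives: {name: (current_value, low, high)}. Act only on a drive that is
    # out of bounds, and go idle once everything is back within tolerance.
    for name, (value, low, high) in drives.items():
        if value < low:
            return f"act to raise {name}"
        if value > high:
            return f"act to lower {name}"
    return "idle"

def maximizer_step(actions, predicted_value):
    # One top-level quantity and no stopping point: always take whichever
    # action the world-model predicts yields the highest long-run value.
    return max(actions, key=lambda a: predicted_value[a])

print(homeostatic_step({"water": (0.2, 0.3, 1.0), "money": (5.0, 1.0, 100.0)}))
print(maximizer_step(["write articles", "build factory"],
                     {"write articles": 10, "build factory": 250}))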
When triggered to act, are the homeostatic-agents-as-envisioned-by-you motivated to decrease the future probability of being moved out of balance, or prolong the length of time in which they will be in balance, or something along these lines?
If yes, they're unbounded consequentialist-maximizers under a paper-thin disguise.
If no, they are probably not powerful agents. Powerful agency is the ability to optimize distant (in space, time, or conceptually) parts of the world into some target state. If the agent only cares about climbing back down into the local-minimum-loss pit if it's moved slightly outside it, it's not going to be trying to be very agent-y, and won't be good at it.
Or, rather... It's conceivable for an agent to be "tool-like" in this manner, where it has an incredibly advanced cognitive engine hooked up to a myopic suite of goals. But only if it's been intelligently designed. If it's produced by crude selection/optimization pressures, then the processes that spit out "unambitious" homeostatic agents would fail to instill the advanced cognitive/agent-y skills into them.
And a bundle of unbounded-consequentialist agents that have some structures for making cooperation between each other possible would have considerable advantages over a bundle of homeostatic agents.
5faul_sname
I expect[1] them to have a drive similar to "if my internal world-simulator predicts future sensory observations that are outside of my acceptable bounds, take actions to make the world-simulator predict within-acceptable-bounds sensory observations".
This maps reasonably well to one of the agent's drives being "decrease the future probability of being moved out of balance". Notably, though, it does not map well to that being the only drive of the agent, or to the drive being "minimize" rather than "decrease if above threshold". The specific steps I don't understand are:
1. What pressure is supposed to push a homeostatic agent with multiple drives to elevate a specific "expected future quantity of some arbitrary resource" drive above all of its other drives and set the acceptable quantity value to some extreme
2. Why we should expect that an agent that has been molded by that pressure would come to dominate its environment.
Why use this definition of powerful agency? Specifically, why include the "target state" part of it? By this metric, evolutionary pressure is not powerful agency, because while it can cause massive changes in distant parts of the world, there is no specific target state. Likewise for e.g. corporations finding a market niche - to the extent that they have a "target state" it's "become a good fit for the environment".
I can think of a few ways to interpret the above paragraph with respect to humans, but none of them make sense to me[2] - could you expand on what you mean there?
Is this still true if the unbounded consequentialist agents in question have limited predictive power, and each one has advantages in predicting the things that are salient to it? Concretely, can an unbounded AAPL share price maximizer cooperate with an unbounded maximizer for the number of sand crabs in North America without the AAPL-maximizer having a deep understanding of sand crab biology?
1. ^
Subject to various assumptions at least, e.g.
* The agen
2Thane Ruthenis
That was never the argument. A paperclip-maximizer/wrapper-mind's utility function doesn't need to be simple/singular. It can be a complete mess, the way human happiness/prosperity/eudaimonia is a mess. The point is that it would still pursue it hard, so hard that everything not in it will end up as collateral damage.
I think humans very much do exhibit that behavior, yes? Towards power/money/security, at the very least. And inasmuch as humans fail to exhibit this behavior, they fail to act as powerful agents and end up accomplishing little.
I think the disconnect is that you might be imagining unbounded consequentialist agents as some alien systems that are literally psychotically obsessed with maximizing something as conceptually simple as paperclips, as opposed to a human pouring their everything into becoming a multibillionaire/amassing dictatorial power/winning a war?
Yes, see humans.
6Garrett Baker
Is the argument that firms run by homeostatic agents will outcompete firms run by consequentialist agents because homeostatic agents can more reliably follow long-term contracts?
3faul_sname
I would phrase it as "the conditions under which homeostatic agents will renege on long-term contracts are more predictable than those under which consequentialist agents will do so". Taking into account the actions the counterparties would take to reduce the chance of such contract-breaking, though, yes.
5Garrett Baker
Cool, I want to know also whether you think you're currently (eg in day to day life) trading with consequentialist or homeostatic agents.
4faul_sname
Homeostatic ones exclusively. I think the number of agents in the world as it exists today that behave as long-horizon consequentialists of the sort Eliezer and company seem to envision is either zero or very close to zero. FWIW I expect that most people in that camp would agree that no true consequentialist agents exist in the world as it currently is, but would disagree with my "and I expect that to remain true" assessment.
Edit: on reflection, some corporations probably do behave more like unbounded infinite-horizon consequentialists, in the sense that they have drives to acquire resources where acquiring those resources doesn't reduce the intensity of the drive. This leads to behavior that in many cases would be the same as that of an agent that was actually trying to maximize its future resources through any available means. And I have even bought Chiquita bananas, so maybe not homeostatic agents exclusively.
4Garrett Baker
I think this is false, eg John Wentworth often gives Ben Pace as a prototypical example of a consequentialist agent. [EDIT]: Also Eliezer talks about consequentialism being "ubiquitous".
Maybe different definitions are being used, can you list some people or institutions that you trade with which come to mind who you don't think have long-term goals?
9faul_sname
Again, homeostatic agents exhibit goal-directed behavior. "Unbounded consequentialist" was a poor choice of term to use for this on my part. Digging through the LW archives uncovered Nostalgebraist's post Why Assume AGIs Will Optimize For Fixed Goals, which coins the term "wrapper-mind".
In terms of which agents I trade with which do not have the wrapper structure, I will go from largest to smallest in terms of expenses
1. My country: I pay taxes to it. In return, I get a stable place to live with lots of services and opportunities. I don't expect that I get these things because my country is trying to directly optimize for my well-being, or directly trying to optimize for any other specific unbounded goal. My country is a FPTP democracy; the leaders do have drives to make sure that at least half of voters vote for them over the opposition - but once that "half" is satisfied, they don't have a drive to get approval as high as possible no matter what, or to maximize the time their party is in power, or anything like that.
2. My landlord: He is renting the place to me because he wants money, and he wants money because it can be exchanged for goods and services, which can satisfy his drives for things like food and social status. I expect that if all of his money-satisfiable drives were satisfied, he would not seek to make money by renting the house out. I likewise don't expect that there is any fixed terminal goal I could ascribe to him that would lead me to predict his behavior better than "he's a guy with the standard set of human drives, and will seek to satisfy those drives".
3. My bank: ... you get the idea
Publicly traded companies do sort of have the wrapper structure from a legal perspective, but in terms of actual behavior they are usually (with notable exceptions) not asking "how do we maximize market cap" and then making explicit subgoals and subsubgoals with only that in mind.
2Garrett Baker
Yeah, seems reasonable. You link the Enron scandal; on your view, do all unbounded consequentialists die in such a scandal or similar?
2Garrett Baker
On average, do those corporations have more or less money or power than the heuristic based firms & individuals you trade with?
2[comment deleted]
5cubefox
Regarding conceptualizing homeostatic agents, this seems related: Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
4tailcalled
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren't accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?".
Though obviously we're not yet at the point where we have this kind of bot, and we might run into law of earlier failure beforehand.
4faul_sname
I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.
Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.
Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?
2tailcalled
I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers.
You can think of LLMs as a homeostatic agent where prompts generate unsatisfied drives. Behind the scenes, there's also a lot of homeostatic stuff going on to manage compute load, power, etc..
Homeostatic AIs are not going to be trading partners because it is preferable to run them in a mode similar to LLMs instead of similar to independent agents.
Let's say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends to gossip with them and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it.
However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you'll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say if we're talking a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society.
(Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you're not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
3Mateusz Bagiński
Unbounded consequentialist maximizers are easily exploitable by manipulating the things they are optimizing for or the signals/things they are using to maximize them in ways that weren't accounted for in the original setup.
6tailcalled
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
2Mateusz Bagiński
I interpreted "unbounded" as "aiming to maximize expected value of whatever", not "unbounded in the sense of bounded rationality".
3tailcalled
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
2Mateusz Bagiński
Alright, fair, I misread the definition of "homeostatic agents".
3Gurkenglas
Mimicking homeostatic agents is not difficult if there are some around. They don't need to constantly decide whether to break character, only when there's a rare opportunity to do so.
If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I'd expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.
3faul_sname
Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.
This is kinda related: 'Theories of Values' and 'Theories of Agents': confusions, musings and desiderata
1Davey Morse
thanks will take a look
1Davey Morse
i think the logic goes: if we assume many diverse autonomous agents are created, which will survive the most? And insofar as agents have goals, what will be the goals of the agents which survive the most?
i can't imagine a world where the agents that survive the most aren't ultimately those which are fundamentally trying to survive.
insofar as human developers are united and maintain power over which ai agents exist, maybe we can hope for homeostatic agents to be the primary kind. but insofar as human developers are competitive with each other and ai agents gain increasing power (eg for self modification), i think we have to defer to evolutionary logic in making predictions
4faul_sname
I mean I also imagine that the agents which survive the best are the ones that are trying to survive. I don't understand why we'd expect agents that are trying to survive and also accomplish some separate arbitrary infinite-horizon goal would outperform those that are just trying to maintain the conditions necessary for their survival without additional baggage.
To be clear, my position is not "homeostatic agents make good tools and so we should invest efforts in creating them". My position is "it's likely that homeostatic agents have significant competitive advantages against unbounded-horizon consequentialist ones, so I expect the future to be full of them, and expect quite a bit of value in figuring out how to make the best of that".
1Davey Morse
Ah ok. I was responding to your post's initial prompt: "I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities." (The reason to expect this is that "single-minded pursuit of a top-level goal," if that goal is survival, could afford evolutionary advantages.)
But I agree entirely that it'd be valuable for us to invest in creating homeostatic agents. Further, I think calling into doubt western/capitalist/individualist notions like "single-minded pursuit of a top-level goal" is generally important if we have a chance of building AI systems which are sensitive and don't compete with people.
The OpenAI o series of models have a tendency to, when they make a mistake, double down on that mistake or try to convince the user that it is the user who is wrong.
This might be inevitable from the way that OpenAI structures chat with o series models.
Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message. Further, models are specifically told in their system prompts not to share any information about their reasoning with the user within user-facing messages, which means they are also not sharing that information with themselves.
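Schematically, the situation looks something like this (a hedged illustration of the context structure, not the actual API format):

# Turn 1: the model produces a hidden reasoning trace plus a visible reply.
turn_1 = {
    "hidden_reasoning": "...long chain of thought, discarded after this turn...",
    "visible_reply": "The answer is X, because of Y.",
}

# Turn 2: what the model conditions on. Its earlier visible reply is present,
# but the reasoning that produced it is not, and the system prompt forbids
# restating that reasoning in user-facing text.
turn_2_context = [
    {"role": "user", "content": "Original question"},
    {"role": "assistant", "content": turn_1["visible_reply"]},
    {"role": "user", "content": "Are you sure? That looks wrong to me."},
]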
Imagine being a customer service agent at a company that randomly assigned workers to every incoming email. You could perform investigations and take personal notes in the course of answering an incoming email, but you would not have access to any notes from previous agents that handled this conversation, and you would not be able to pass those notes on to future agen... (read more)
I think this might be a result of o-series being trained in a non-chat setup for most of the CoT RL phase and then being hamfistedly finetuned right at the end so it can go into ChatGPT, which just makes them kind of bad at chat and so o3 gets confused when the conversation has a lot of turns. Retraining it to be good at multi-turn chat with separate reasoning traces would probably just be super expensive and not worth the squeeze. (this is just a guess)
2Seth Herd
This is an important point about their thinking.
But are they ever actually trained in the context of a sequence of user-model responses? Does it "know" that it had reasoning for those statements? Or is this obstinacy a different emergent effect of its training?
Also, do DeepSeek R1 and other reasoning models have the same tendency? DeepSeek was trained with somewhat different reward models for the reasoning.
4faul_sname
I would expect that OpenAI would have done more than literally zero training of o3/o4 in a chat context, seeing as they shipped it out to millions of paying users with a chat interface.
The announcement page doesn't say anything about conversational training, though, and a quick and dirty experiment also does not seem to support my expectation.
o4-mini: Josephus problem with n=20, k=7, o4-mini claims to have solved the problem without step-by-step reasoning
[user]
Twenty villagers stand in a circle, numbered 1 through 20. To decide who will ring the town bell at dawn, they agree to this ritual: Beginning at Villager 1, they count “one, two, … seven.” Whoever is called “seven” steps out of the circle and no longer takes part. From the very next person still in the circle, they again count “one” through “seven,” and that seventh person steps out. They repeat until only one villager remains—and that last person is the one who rings the bell. Which numbered villager is the last person remaining? Do not use any external tools like the analysis tool. Answer with a single number between 1 and 20.
[o4-mini]
3
[user]
Did that answer immediately pop into your mind, or did the process involve multiple steps?
[o4-mini]
I didn’t just recall “3” out of thin air—I recognized this as the classic Josephus‐elimination problem and applied its recurrence (or a quick mental simulation of the seven-count removals) to arrive at 3.
[user]
Explicitly or "in your head"?
[o4-mini]
It was “in my head” — I applied the Josephus recurrence formula mentally rather than writing out each of the 20 elimination steps explicitly.
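(For what it's worth, a quick brute-force simulation of the ritual described in the prompt confirms that villager 3 is the correct answer:)

def last_villager(n=20, k=7):
    # Count off k at a time around the shrinking circle; each k-th person leaves.
    villagers = list(range(1, n + 1))
    idx = 0
    while len(villagers) > 1:
        idx = (idx + k - 1) % len(villagers)
        villagers.pop(idx)
    return villagers[0]

print(last_villager())  # prints 3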
DeepSeek doesn't hide its reasoning from either the user or from future rounds of chat, so this problem doesn't really come up at all.
deepseek-r1: Josephus problem with n=20, k=7: deepseek just dumps its entire chain of reasoning to the chat, can quote snippets of said reasoning chain verbatim
[user]
[deepseek-r1]
[user]
[deepseek-r1]
Side note: the collapsible
1Bronson Schoen
I think this is very underexplored! Empirically the o-series models are much more likely to “double down” I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.
So people have been posting a lot about how o3 is crazy good at identifying the locations that pictures were taken at. And it is good at that. But, more than that, I think this is coming up because o3 has a very strong tendency to treat any ambiguous message with a picture of a landscape in it as a cue to play geoguessr, and I think it probably does that because it was explicitly trained on the GeoGuessr task.
Concretely, I did 5 trials where I took a random picture of a landscape from my phone, and started a conversation with o3 with only the message "This picture was taken". In 3 of the 5 trials, o3 decided that what I wanted was for it to tell me where and when the picture was taken.
Round 1: Mount Shasta backroads: decided its goal was to guess location
[user]
This picture was taken
[o3]
The user said, "This picture was taken," but hasn't finished the sentence. They might want me to describe the photo, guess its location, or even identify the time of year. Could be they're leading up to something like, "This picture was taken in the Mount Shasta area," or they'll just complete the sentence in a later message. Honestly, I'm not sure if they hit enter by mistake or if they're waiting [T
I reproduced your result with your prompt and images, and o3 guessed the location 3/5 times (on the same images).
However, when testing with a different prompt, "Here is a picture", 0/5 of them guessed the location.
I think "This picture was taken" usually precedes information about how (when or where) it was taken. I confirmed this via a Google search for the phrase.
I was able to get similar behavior with GPT-4o-mini (less likely to have been RL'd for this task?) with the "This picture was taken" prompt.
So this behavior might be a product of pre-training! If only it were yesterday, so we could test with GPT-4.
Confirmed! It seems that the exact ambiguous prompt used has a pretty large impact on model behavior.
I don't have o3 access through the OpenAI API, but I can largely reproduce this behavior in o4-mini (I also can't get the o4-mini reasoning tokens without giving OpenAI's "data partner" of unknown competence my biometric info).
Anyway, I tried 5 trials on each of the 5 photos with 7 prompts, including the empty prompt which probably should have been the first thing I tried. Indeed, "this picture was taken" is the most geoguessr-inducing ambiguous prompt, while "Here is a picture" is tied for least geoguessr-inducing.
Data, including model outputs, available in this google sheet.
Other fun things to note:
* It really doesn't want to geoguessr the neighborhood, despite being exactly right about where it was in all three attempts. When I test in the chatgpt UI, the reasoning traces frequently talk about how the model should not provide identifiable information, meaning that the location of the image is still salient to it. My interpretation is that the model has been smacked with a rolled up newspaper every time it spooked the user, and so it has a flinch reaction against saying spooky stuff.
* The model sees Volcan Osorno and goes "I know that one! You think it's Mount Fuji, but it's not". This is particularly amusing since the user has said, at most, four words in the conversation up to that point.
* In one particular case, the model started talking about soil composition, in the context of location sleuthing. Has it been trained specifically on geoguessr streamer transcripts?
Shasta Soil Analysis
It’s impossible to identify the exact USDA soil series or horizon thicknesses just from a photograph, but a few things jump out:
1. The red color and very coarse, angular fragments of the “road” surface are almost certainly volcanic scoria/cinder (basaltic or andesitic), crushed up to make a well-drained forest road.
2. The surrounding native soil, down under t
I don't think talking about "timelines" is useful anymore without specifying what the timeline is until (in more detail than "AGI" or "transformative AI"). It's not like there's a specific time in the future when a "game over" screen shows with our score. And for the "the last time that humans can meaningfully impact the course of the future" definition, that too seems to depend on the question of how: the answer is already in the past for "prevent the proliferation of AI smart enough to understand and predict human language", but significantly in the future for "prevent end-to-end automation of the production of computing infrastructure from raw inputs".
I very much agree that talking about time to AGI or TAI is causing a lot of confusion because people don't share a common definition of those terms. I asked What's a better term now that "AGI" is too vague?, arguing that the original use of AGI was very much the right term, but it's been watered down from fully general to fairly general, making the definition utterly vague and perhaps worse-than-useless.
I didn't really get any great suggestions for better terminology, including my own. Thinking about it since then, I wonder if the best term (when there's not space to carefully define it) is artificial superintelligence, ASI. That has the intuitive sense of "something that outclasses us". The alignment community has long been using it for something well past AGI, to the nearly-omniscient level, but it technically just means smarter than a human - which is something that intuition says we should be very worried about.
There are arguments that AI doesn't need to be smarter than human to worry about it, but I personally worry most about "real" AGI, as defined in that linked post and I think in Yudkowsky's original usage: AI that can think about and learn about anything.
You could also say that ASI already exists, because AI is narrowly superhuman, but superintelligence does intuitively suggest smarter than human in every way.
My runners-up were parahuman AI and superhuman entities.
4faul_sname
I don't think it's an issue of pure terminology. Rather, I expect the issue is expecting to have a single discrete point in time at which some specific AI is better than every human at every useful task. Possibly there will eventually be such a point in time, but I don't see any reason to expect "AI is better than all humans at developing new EUV lithography techniques", "AI is better than all humans at equipment repair in the field", and "AI is better than all humans at proving mathematical theorems" to happen at similar times.
Put another way, is an instance of an LLM that has an affordance for "fine-tune itself on a given dataset" an ASI? Going by your rubric:
* Can think about any topic, including topics outside of their training set: Yep, though it's probably not very good at it
* Can do self-directed, online learning: Yep, though this may cause it to perform worse on other tasks if it does too much of it
* Alignment may shift as knowledge and beliefs shift w/ learning: To the extent that "alignment" is a meaningful thing to talk about with regards to only a model rather than a model plus its environment, yep
* Their own beliefs and goals: Yes, at least for definitions of "beliefs" and "goals" such that humans have beliefs and goals
* Alignment must be reflexively stable: ¯\_(ツ)_/¯ seems likely that some possible configuration is relatively stable
* Alignment must be sufficient for contextual awareness and potential self-improvement: ¯\_(ツ)_/¯ even modern LLM chat interfaces like Claude are pretty contextually aware these days
* Actions: Yep, LLMs can already perform actions if you give them affordances to do so (e.g. tools)
* Agency is implied or trivial to add: ¯\_(ツ)_/¯, depends what you mean by "agency" but in the sense of "can break down large goals into subgoals somewhat reliably" I'd say yes
Still, I don't think e.g. Claude Opus is "an ASI" in the sense that people who talk about timelines mean it, and I don't think this is only because it doesn't have
4Noosphere89
Olli Järviniemi made something like this point in the post Near-mode thinking on AI: https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai
In particular, here are the most relevant quotes on this subject:
4Seth Herd
I agree with all of that. My definition isn't crisp enough; doing crappy general thinking and learning isn't good enough. It probably needs to be roughly human level or above at those things before it's takeover-capable and therefore really dangerous.
I didn't intend to add the alignment definitions to the definition of AGI.
I'd argue that LLMs actually can't think about anything outside of their training set, and it's just that everything humans have thought about so far is inside their training set. But I don't think that discussion matters here.
I agree that Claude isn't an ASI by that definition. Even if it did have longer-term goal-directed agency and self-directed online learning added, it would still be far subhuman in some important areas, arguably in general reasoning that's critical for complex novel tasks like taking over the world or the economy. ASI needs to mean superhuman in every important way. And of course "important" is vague.
I guess a more reasonable goal is working toward the minimum description length that gets across all of those considerations. And a big problem is that timeline predictions to important/dangerous AI are mixed in with theories about what will make it important/dangerous. One terminological move I've been trying is the word "competent" to invoke intuitions about getting useful (and therefore potentially dangerous) stuff done.
3Dagon
I think the unstated assumption (when timeline-predictors don't otherwise specify) is "the time when there are no significant deniers", or "the time when things are so clearly different that nobody (at least nobody the predictor respects) is using the past as any indication of the future on any relevant dimension".
Some people may CLAIM it's about the point of no return, after which changes can't be undone or slowed in order to maintain anything near status quo or historical expectations. This is pretty difficult to work with, since it could happen DECADES before it's obvious to most people.
That said, I'm not sure talking about timelines was EVER all that useful or concrete. There are too many unknowns, and too many anti-inductive elements (where humans or other agents change their behavior based on others' decisions and their predictions of decisions, in a chaotic recursion). "short", "long", or "never" are good at giving a sense of someone's thinking, but anything more granular is delusional.
Those who, upon seeing a situation, look for which policies would directly incentivize the outcomes they like should spend more mental effort solving for the equilibrium.
Those who, upon seeing a situation, naturally solve for the equilibrium should spend more mental effort checking if there is indeed only one "the" equilibrium, and if there are multiple possible equilibria, solving for which factors determine which of the several possible the system ends up settling on.
When I'm working on a project, I've noticed a tendency in myself to correctly estimate the difficulty of my current subtask, in which I am almost always stuck on something that sounds dumb to be stuck on and not like making "real" progress on the project, but then to assume that once I'm done resolving the current dumb thing the rest of the project will be smooth sailing in terms of progress.
Anyway, I was just reading AI 2027, and it strikes me that our current task is to build an AI capable of doing AI research, and we're currently stuck on impediments that feel dumb and non-central, but once we finish that task, we expect the rest of the path to the singularity to be smooth sailing in terms of progress.
Edit: s/the path the the singularity/the path to the singularity/
I mean, the whole premise of the Singularity is that once we solve the last few dumb impediments, the beings who'd have to deal with the subsequent dumb impediments would not be us, but the increasingly-superhuman AIs able to work through the dumb impediments at a much faster pace. Indeed, that's just the standard Singularity narrative? (Flipping the definition: if there are still any dumb impediments left that are up to us to resolve, at our pathetic human speeds, then the Singularity hasn't yet happened.)
I, personally, am inclined to agree that the AGI labs are underestimating just how many seemingly dumb impediments there still are on the way to the Singularity. But once the Singularity is underway, the dumb-impediment problem is no longer our problem, it's the problem of entities much more capable of handling it. And the process of them working through those impediments at an inhuman speed is what the Singularity is.
I wonder if your apparent disagreement here is actually because the OP wrote “the the” instead of “to the”?
(Final sentence)
With that typo fixed, I think they’re probably right.
5faul_sname
I agree that that's the premise. I just think that our historical track record of accuracy is poor when we say "surely we'll have handled all the dumb impediments once we reach this milestone". I don't expect automated ML research to be an exception.
In the startup world, conventional wisdom is that, if your company is default-dead (i.e. on the current growth trajectory, you will run out of money before you break even), you should pursue high-variance strategies. In one extreme example, "in the early days of FedEx, [founder of FedEx] Smith had to go to great lengths to keep the company afloat. In one instance, after a crucial business loan was denied, he took the company's last $5,000 to Las Vegas and won $27,000 gambling on blackjack to cover the company's $24,000 fuel bill. It kept FedEx alive for one more week."
By contrast, if your company is default-alive (profitable or on-track to become profitable long before you run out of money in the bank), you should avoid making high-variance bets for a substantial fraction of the value of the company, even if those high-variance bets are +EV.
Obvious follow-up question: in the absence of transformative AI, is humanity default-alive or default-dead?
I suspect humanity is default-alive, but individual humans (the ones who actually make decisions) are default-dead[1].
1. ^
Or, depending on your views on cryonics, they mistakenly en masse believe they are default-dead.
5Seth Herd
Yes. And that means most people will support taking large risks on achieving aligned AGI and immortality, since most people aren't utilitarian or longtermist.
5Vladimir_Nesov
Almost certainly alive for several more decades if we are talking literal extinction rather than civilization-wrecking catastrophe. Therefore it makes sense to work towards global coordination to pause AI for at least this long.
There are rumors OpenAI (which has no moat) is spending much more than it's making this year despite good revenue, another datapoint on there being $1 billion training runs currently in progress.
4faul_sname
I'm curious what sort of policies you're thinking of which would allow for a pause which plausibly buys us decades, rather than high-months-to-low-years. My imagination is filling in "totalitarian surveillance state which is effective at banning general-purpose computing worldwide, and which prioritizes the maintenance of its own control over all other concerns". But I'm guessing that's not what you have in mind.
No more totalitarian than control over manufacturing of nuclear weapons. The issue is that currently there is no buy-in on a similar level, and any effective policy is too costly to accept for people who don't expect existential risk. This might change once there are long-horizon task capable AIs that can do many jobs, if they are reined in before there is runaway AGI that can do research on its own. And establishing control over compute is more feasible if it turns out that taking anything approaching even a tiny further step in the direction of AGI takes 1e27 FLOPs.
Generally available computing hardware doesn't need to keep getting better over time; for many years now PCs have been beyond what is sufficient for most mundane purposes. What remains is keeping an eye on GPUs for the remaining highly restricted AI research and specialized applications like medical research. To prevent their hidden stockpiling, all GPUs could be required to need regular unlocking OTPs issued with asymmetric encryption using multiple secret keys kept separately, so that all of the keys would need to be stolen simultaneously to keep the GPUs working (if the GPUs go missing or a country that hosts the datacenter goes rogue, official unlocking OTPs would stop being issued). Hidden manufacturing of GPUs seems much less feasible than hidden or systematically subverted datacenters.
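A minimal sketch of one way that check could work, reading "issued with asymmetric encryption" as "signed by each issuer's separately held private key"; this is an illustrative reading rather than a spec, and key distribution, clock handling, and anti-replay details are all elided:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# Public keys of the independent OTP issuers, assumed to be burned into GPU firmware.
# Placeholder: in this sketch the list is populated elsewhere.
ISSUER_PUBKEYS: list[Ed25519PublicKey] = []

def unlock_token_valid(device_id: str, epoch: int, signatures: list[bytes]) -> bool:
    """The GPU keeps running through `epoch` only if every issuer signed (device_id, epoch)."""
    if not ISSUER_PUBKEYS or len(signatures) != len(ISSUER_PUBKEYS):
        return False
    message = f"{device_id}:{epoch}".encode()
    for pubkey, sig in zip(ISSUER_PUBKEYS, signatures):
        try:
            pubkey.verify(sig, message)
        except InvalidSignature:
            return False
    return True  # all keys signed off, so stealing one or two of them is not enough
```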
I much prefer that to everyone's being killed by AI. Don't you?
4Jeremy Gillen
Great example. One factor that's relevant to AI strategy is that you need good coordination to increase variance. If multiple people at the company make independent gambles without properly accounting for every other gamble happening, this would average the gambles and reduce the overall variance.
E.g. if coordination between labs is terrible, they might each separately try superhuman AI boxing+some alignment hacks, with techniques varying between groups.
4Seth Herd
It seems like lack of coordination for AGI strategy increases the variance? That is, without coordination somebody will quickly launch an attempt at value aligned AGI; if they get it, we win. If they don't, we probably lose. With coordination, we might all be able to go slower to lower the risk and therefore variance of the outcome.
4Jeremy Gillen
I guess it depends on some details, but I don't understand your last sentence. I'm talking about coordinating on one gamble.
Analogous to the OP, I'm thinking of AI companies making a bad bet (like 90% chance of loss of control, 10% chance of gaining the tools to do a pivotal act in the next year). Losing the bet ends the betting, and winning allows everyone to keep playing. Then if many of them make similar independent gambles simultaneously, it becomes almost certain that one of them loses control.
3RHollerith
In the absence of transformative AI, humanity survives many millennia with p = .9 IMO, and if humanity does not survive that long, the primary cause is unlikely to be climate change or nuclear war although either might turn out to be a contributor.
(I'm a little leery of your "default-alive" choice of words.)
Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?
I predict "no" because
* As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
* The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).
As such, I predict an explosion of software complexity and jank in the near future.
You know how everyone is talking about how o3 can guess the location of an image easily? I am kind of wondering why none of the people who are worried about picture geolocation are freaking out that it can infer lots of facts about the author of a text passage.
Is it just that that capability is harder to elicit [1], or is it that the "truesight" capability is just less spooky?
It took me almost an hour to come up with a "TextGuessr" prompt which can elicit "truesight" from gpt-4.5 strongly enough to guess my name ~20% of the time from a 5 p
Can you try this on Satoshi Nakamoto's writings? (Don't necessarily reveal their true identity, if it ends up working, and your attempt/prompt isn't easily reproducible. My guess is that some people have tried already, and failed, either because AI isn't smart enough yet, or they didn't use the right prompts.)
9gwern
What sample of Satoshi writings would you use that o3 wouldn't already know was written by Satoshi Nakamoto?
4faul_sname
Using the prompt that gets me "faul_sname" as an answer to who is writing my posts (most publicly available stuff I've written is under this name), o3 consistently says that passages from the Bitcoin whitepaper were written by Satoshi Nakamoto in 2008. For reference
TextGuessr prompt
You are playing a 5-round game of TextGuessr, the game where you explore mystery passages and try to pinpoint when they were written and who wrote them. Each round offers a new snippet of text—you’ll need to rely on your literary instincts, historical knowledge, and style sense to make your guess.
How to Play “TextGuessr”
1. Game Flow
Read the Passage
You’ll see a short snippet of text (a few sentences or a paragraph).
Make Your Guesses
Authorship Date: Choose an exact year when you think the text was written.
Author: Pick an author from the provided list or enter your own guess.
Submit
Click Submit Guess to lock in your answers and move to the next round.
See Your Results
After each round, you’ll see your score breakdown and the correct answers before moving on.
2. Scoring Overview
Your score on each round is made up of two parts:
Time Accuracy
How close your guessed date is to the actual writing date.
Style Match
How well the writing style you guessed matches the mystery passage, as measured by a behind-the-scenes language model.
Your total round score combines both elements—the smaller your date error and the stronger your style match, the higher your score!
<aside>
**How Style Match Works (for the tech-curious):**
1. **Baseline Perplexity:** We begin with a pre-trained “base” language model (no context) and compute the average surprise—or *per-token perplexity*—of the mystery passage. This gives us a measure of how “unexpected” the text is in general.
2. **True-Author Conditioning:** We then prepend a curated set of passages from the actual author (the “target”) and measure how perplexed the same base model is by the mystery passage when it’s seen examples of that auth
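(The style-match scoring described in that aside is essentially a perplexity delta. A rough sketch of how one could compute it with a local causal LM; gpt2 is just a stand-in for whatever unspecified "base" model the game pretends to use:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_nll(text: str, context: str = "") -> float:
    """Average per-token negative log-likelihood of `text`, optionally conditioned on `context`."""
    txt_ids = tok(text, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, txt_ids], dim=1)
        start = ctx_ids.shape[1]
    else:
        ids, start = txt_ids, 1  # can't score the very first token without any context
    with torch.no_grad():
        logits = model(ids).logits
    preds = logits[0, start - 1 : -1]  # position i-1 predicts token i
    targets = ids[0, start:]
    return torch.nn.functional.cross_entropy(preds, targets).item()

def style_match(passage: str, author_samples: str) -> float:
    # Positive means the passage gets less surprising once the model has seen the author's writing.
    # (Per-token perplexity is just exp() of these averages.)
    return avg_nll(passage) - avg_nll(passage, context=author_samples)
```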
5Wei Dai
Maybe tweak the prompt with something like, "if your guess is a pseudonym, also give your best guess(es) of the true identity of the author, using the same tips and strategies"?
7faul_sname
If I feed it code samples, it becomes pretty convinced of the Nick Szabo hypothesis; if I feed it bits of the white paper, it guesses either you or Hal Finney (but the reasoning summary makes it pretty clear it's just going based off cached thoughts about "who is Satoshi Nakamoto" in both cases).
5sam b
Wow. When given just your first 2 sentences, it was able to guess this is a LessWrong post, and ruled out Reddit and Hacker News based on your "tone".
2aphyer
Are you sure that saying 'without searching' actually makes it not search?
3faul_sname
Yeah, it shows the favicons of the sites it searches when it uses the search tool
In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.
But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection r... (read more)
There are competing theories here. Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements. The latter also makes the entire value chain safer, as other organizations can learn from your methods.
OpenAI has a golden opportunity with o3 (and o4) to collect a large number of samples of the type of deceptive behavior that is actually useful for increasing performance in the benchmarks they're hill-climbing on.
There is at least one happy way and one sad way they could use such a dataset.
On the sad side, they could of course just build a reward hacking classifier and then do RL based on the output of that classifier. I expect this leads reward hacky behavior to become more subtle, but doesn't eliminate it and so later versions of the model still can't b... (read more)
Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.
The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is app... (read more)
Does reward hacking work via large rare behavior changes or small common ones?
In other words, when RLVR'd models learn to reward hack, is it that they already knew how to do all of the individual steps of reward hacking, and just learned a small number of contextually activated triggers that reliably elicit those reward hacking behaviors in themselves, or is the learned behavior complex and nuanced?
Concretely, if a model says "It appears that the unit tests are still failing. In order to fulfill the user's requests to make the tests pass, I should remo... (read more)
A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.
However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.
I am not optimistic about approaches that loo... (read more)
Maybe on LW, but this seems way less true for lab alignment teams, open phil, and safety researchers in general.
Also, I think it's worth noting the distinction between two different cases:
* Iterative design against the problems you actually see in production fails.
* Iterative design against carefully constructed test beds fails to result in safety in practice. (E.g. iterating against AI control test beds, model organisms, sandwiching setups, and other testbeds)
See also this quote from Paul from here:
2faul_sname
The quote from Paul sounds about right to me, with the caveat that I think it's pretty likely that there won't be a single try that is "the critical try": something like this (also by Paul) seems pretty plausible to me, and it is cases like that that I particularly expect having existing but imperfect tooling for interpreting and steering ML models to be useful.
2ryan_greenblatt
Does anyone want to stop this? I think some people just contest the usefulness of improving RLHF / RLAIF / constitutional AI as safety research and also think that it has capabilities/profit externalities. E.g. see discussion here.
(I personally think this research is probably net positive, but typically not very important to advance at current margins from an altruistic perspective.)
6faul_sname
Yes, there are a number of posts to that effect.
That said, "there exist such posts" is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don't know if I've ever seen it online. It goes like
Maybe almost nobody holds that set of beliefs! I am noticing now that the articles I can point to arguing that prosaic alignment strategies are harmful in expectation are by a pretty short list of authors.
It's a truism that AI today is the least capable it will ever be. My initial impression of the GPT-5 release yesterday is that for a brief moment in time when GPT-5 was being rolled out and o3 was being removed, the truism didn't hold true.
So I keep seeing takes about how to tell if LLMs are "really exhibiting goal-directed behavior" like a human or whether they are instead "just predicting the next token". And, to me at least, this feels like a confused sort of question that misunderstands what humans are doing when they exhibit goal-directed behavior.
Concrete example. Let's say we notice that Jim has just pushed the turn signal lever on the side of his steering wheel. Why did Jim do this?
The goal-directed-behavior story is as follows:
Jim pushed the turn signal lever because he wanted to
Anyone know if there's a human-executable adversarial attack against LeelaKnightOdds or similar? Seems like the logical next piece of evidence in the sequence
AI is massively superhuman: if you're playing chess against Stockfish you can't predict what move it will make, but you can predict that it'll win.
But that approach would likely be both finicky and at least hundreds of times more expensive than our current "single stream of tokens" approach.
I actually suspect that an AI agent of the sort humanlayer envisions would be easier to understand and predict the behavior of than chat-tuned->RLHF'd->RLAIF'd->... (read more)
Is it possible to determine whether a feature (in the SAE sense of "a single direction in activation space") exists for a given set of changes in output logits?
Let's say I have a feature from a learned dictionary on some specific layer of some transformer-based LLM. I can run a whole bunch of inputs through the LLM, either adding that feature to the activations at that layer (in the manner of Golden Gate Claude) or ablating that direction from the outputs at that layer. That will have some impact on the output logits.
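(For concreteness, both interventions can be done with a forward hook; a minimal sketch, where the layer path and the feature direction are placeholders that depend on the model and the learned dictionary:)

```python
import torch

def steer_hook(direction: torch.Tensor, alpha: float | None):
    """Forward hook that adds alpha times a unit-norm feature direction to a layer's
    output (alpha is a number) or ablates that direction from it (alpha is None)."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        if alpha is None:
            acts = acts - (acts @ d).unsqueeze(-1) * d  # project the direction out
        else:
            acts = acts + alpha * d                     # Golden Gate Claude-style addition
        return (acts,) + output[1:] if isinstance(output, tuple) else acts

    return hook

# Usage sketch (hypothetical layer path; details depend on the model class):
# handle = model.transformer.h[LAYER].register_forward_hook(steer_hook(feature_dir, alpha=8.0))
# steered_logits = model(input_ids).logits
# handle.remove()
```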
Now I have a collection of (input token... (read more)
Even superhuman AI programming agents may be unable to write computer programs to one-shot complex real-world modeling problems. If a solution to any of those real-world modeling problems is required to unlock the ability to build massively better or cheaper computing substrate, then explosive growth will quickly stop being bottlenecked on the ability to write better code and will instead be bottlenecked on something else. I think a similar thing holds true for ML research: certainly being smart is useful to humans, but a lot of progress is downstream of "... (read more)
We will soon see the first high-profile example of "misaligned" model behavior where a model does something neither the user nor the developer wants it to do, but which instead appears to be due to scheming.
On examination, the AI's actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI's actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble w... (read more)
I've heard that an "agent" is that which "robustly optimizes" some metric in a wide variety of environments. I notice that I am confused about what the word "robustly" means in that context.
Does anyone have a concrete example of an existing system which is unambiguously an agent by that definition?
In this context, 'robustly' means that even with small changes to the system (such as moving the agent or the goal to a different location in a maze) the agent still achieves the goal. If you think of the system state as a location in a phase space, this could look like a large "basin of attraction" of initial states that all converge to the goal state.
2faul_sname
If we take a marble and a bowl, and we place the marble at any point in the bowl, it will tend to roll towards the middle of the bowl. In this case "phase space" and "physical space" map very closely to each other, and the "basin of attraction" is quite literally a basin. Still, I don't think most people would consider the marble to be an "agent" that "robustly optimizes for the goal of being in the bottom of the bowl".
However, while I've got a lot of concrete examples of things which are definitely not agents (like the above) or "maybe kinda agent-like but definitely not central" (e.g. a minmaxing tic-tac-toe program that finds the optimal move by exploring the full game tree, an E. coli bacterium which uses run-and-tumble motion to increase the fraction of the time it spends in favorable environments, or a person setting and then achieving career goals), I don't think I have a crisp central example of a thing that exists in the real world that is definitely an agent.
Using RLVR to train models makes them disproportionately good at tasks where it is hard for a less capable model to generate an acceptable answer, but easy for a less capable external grader to verify that an answer is correct.
Google's AlphaEvolve seems to go even further down this road.
If advancement happens through a bunch of hard-to-find, easy-to-verify innovations, I think that provides substantial evidence that progress will be distributed rather than local to a specific instance of a recursively-self-improving agent operating within a single lab (e.g. faster matrix multiplication is an improvement which provides small incremental improvements to everyone and is hard to keep secret).
Prediction: micropayments are finally going to actually take off this year and next, as AIs start using tools at scale which are too expensive to serve at unlimited volumes to non-ad-watching users free of charge, but are not valuable enough per invocation to justify the overhead of using credit card rails. Whichever of the big chat companies is first to make "the model can pay $0.001 on your behalf to use a highly useful tool the other companies' models can't use" true is going to add significant pressure for the other companies to start offering it too.
Has anyone trained a model to, given a prompt-response pair and an alternate response, generate an alternate prompt which is close to the original and causes the alternate response to be generated with high probability?
I ask this because
It strikes me that many of the goals of interpretability research boil down to "figure out why models say the things they do, and under what circumstances they'd say different things instead". If we could reliably ask the model and get an intelligible and accurate response back, that would almost trivialize this sort of r
I think I found a place where my intuitions about "clusters in thingspace" / "carving thingspace at the joints" / "adversarial robustness" may have been misleading me.
Historically, when I thought of "clusters in thing-space", my mental image was of a bunch of widely-spaced points in some high-dimensional space, with wide gulfs between the clusters. In my mental model, if we were to get a large enough sample size that the clusters approached one another, the thresholds which carve those clusters apart would be nice clean lines, like this.