Anti-clickbait quote:
The researchers found that some models, like Sonnet 4, declined to take the harmful choices 95 percent of the time—a promisingly high number. However, in other scenarios that did not pose obvious human harms, Sonnet 4 would often continue to decline choices that were favorable to the business. Conversely, while other models, like Gemini 2.5, maximized business performance more often, they were much more likely to elect to inflict human harms—at least, in the role-play scenarios, where they were granted full decision-making authority.
Thanks for the response! That makes a lot of sense.
My intuition is that making the model less deep would make it non-linearly easier to interpret, since the last layers of the model are the most interpretable (they're closest to human-understandable tokens and embeddings), while the earlier layers are less interpretable (they're farther away). Also, if layers end up doing duplicate work, that could make the deduplicated version easier to interpret. I'm not sure though.
For "thinking dot by dot", I think serial computation can only benefit from roughly as many filler tokens as you have layers, since each layer can only attend to previous layers. For example, if you had a two layer model, position i does some calculations in layer 1, position i+1 can attend to the layer 1 results from position i in layer 2, but at position i+2 there's no layer 3 to attend to the layer 2 outputs from position i+1. Actually LLMs can't use filler tokens for sequential reasoning at all.
Note how the only difference between position i and i+1 is an additional "." input and the positional encoding.
The dot-by-dot paper uses a highly parallel algorithm, so I think filler tokens are more like adding dynamic attention heads / increasing the width of the network, while CoT is more like increasing the number of layers.
If I understand the architecture right (I might not), standard models are already effectively doing neuralese CoT, but bounded by the depth of the network (and also with the caveat that the model has fewer layers remaining to use the results after each step).
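To make that depth bound concrete, here's a toy sketch (my own simplification, not anything from the paper, and the function names are made up) that just counts the longest chain of dependent computations that can reach the last position. With filler tokens the chain is capped by the number of layers no matter how many dots you append; with CoT, feeding each sampled token back in as input lets the chain keep growing:

```python
# Toy "serial depth" model of information flow in a causal transformer.
# This is my own simplification (not from the dot-by-dot paper): each cell
# only tracks the length of the longest chain of dependent computations
# that could have reached it.

def max_serial_depth_filler(num_layers: int, num_positions: int) -> int:
    """Longest dependency chain when the extra positions are filler tokens:
    their input embeddings carry no computed information, and attention at
    layer l can only read layer l-1 outputs at positions <= the current one."""
    # depth[l][p] = longest chain ending at layer l, position p
    depth = [[0] * num_positions for _ in range(num_layers + 1)]
    for l in range(1, num_layers + 1):
        for p in range(num_positions):
            # attend over the previous layer's outputs at positions <= p,
            # then do one more step of computation on top of the best one
            depth[l][p] = max(depth[l - 1][: p + 1]) + 1
    return depth[num_layers][num_positions - 1]

def max_serial_depth_cot(num_layers: int, num_generated: int) -> int:
    """Longest dependency chain with chain-of-thought: each sampled token is
    a function of the previous final-layer output and gets fed back in, so
    every generated token can add up to num_layers more serial steps."""
    return num_layers * num_generated

print(max_serial_depth_filler(num_layers=2, num_positions=100))  # -> 2
print(max_serial_depth_cot(num_layers=2, num_generated=100))     # -> 200
```

Under this toy model, 100 filler tokens in a 2-layer network still only buy 2 serial steps, whereas 100 CoT tokens buy ~200, which is the asymmetry I'm gesturing at above.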
Edit: I'm working on a longer post about this, but I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens.
I've been thinking about neuralese recently, and I'm wondering if there's more of a tradeoff here with interpretability. If we actually could train a model to use neuralese[1], we might be able to make it much smaller and easier to interpret (since it wouldn't need multiple layers to express long-term dependencies). This would make the tokenized output less interpretable, but would potentially remove even-less-interpretable inner layers.
What do you think?
I'm actually very skeptical that this is possible without a major breakthrough or some very expensive data.
I've had the same problem with hotel key cards. I'll have to try this!
we run out of data in 2028-2031
Is running out of data still a problem? It sounds like we've already moved to a paradigm of creating new, higher-quality data rather than relying on random data from the internet. In some sense we ran out of data a while ago, and progress hasn't stopped, because we're just making more of it now.
Since you're showing the model its previous answer (but not the CoT that led to it), is it possible the high success rate when obfuscating is because the model already knows the right answer?
For example, in your "With failure" retry image, the AI can read its own previous answer that "Based on my analysis, alpha is recommended...".
125°F, one of the temperatures mentioned in the article, is not hot enough to kill bacteria, and is thus one of the worst parts of the Danger Zone.
While it is slightly safer to cook at a slightly higher temperature, 125°F is at the extreme edge of the danger zone, and it's probably a safe temperature to sous vide at for reasonable periods of time if you're confident in your thermometer, with the caveat that it won't pasteurize the inside of the meat (although we're usually more worried about the outside).
Douglas Baldwin suggests cooking at 130°F because one type of bacteria (Clostridium perfringens) can keep multiplying up to 126.1°F, but if you look at the growth rate in more detail, it's already growing very slowly by 50°C (~122°F), at around 1/6th of its rate at the worst-case temperature (~109°F).
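Since these figures mix °F and °C, here's a quick conversion check (just unit conversion; the 1/6th growth-rate figure comes from the source above, not from this snippet):

```python
# Unit-conversion sanity check for the temperatures above (°F to °C).
def f_to_c(f: float) -> float:
    return (f - 32) * 5 / 9

for label, temp_f in [
    ("Baldwin's suggested cooking temp", 130.0),
    ("C. perfringens growth ceiling", 126.1),
    ("'already growing slowly' point", 122.0),
    ("approx. worst-case growth temp", 109.0),
]:
    print(f"{label}: {temp_f}°F ≈ {f_to_c(temp_f):.1f}°C")
# 130.0°F ≈ 54.4°C, 126.1°F ≈ 52.3°C, 122.0°F ≈ 50.0°C, 109.0°F ≈ 42.8°C
```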
Is your goal here to isolate the aspect of my response that'll keep you right that "legal regulatory capture isn't happening" for as long as you can?
I'm not the person you're arguing with, but I wanted to jump in to say that pushing back on the weakest part of your argument is a completely reasonable thing for them to do, and I found it weird that you're implying there's something wrong with that.
I also think you're missing how big a problem it is that companies don't actually know how to prevent LLMs from giving legal advice. Maybe they could add strong enough guardrails to hosted models to at least make it not worth the effort to ask them for legal advice, but they definitely don't know how to do this for downloadable models.
That said, I could believe in a future where lawyers force the big AI companies to make their models too annoying to easily use for legal advice, and prevent startups from making products directly designed to offer AI legal advice.
The post I was working on is out: "Filler tokens don’t allow sequential reasoning". I think LLMs are fundamentally incapable of using filler tokens for sequential reasoning (do x, and then based on that do y). I also think LLMs are unlikely to stumble upon the algorithm the paper's authors came up with through RL.
This is a good point, and I think I've been missing part of the tradeoff here. If we force the model to output human-understandable concepts at every step, we encourage its thinking to be human-understandable. Removing that incentive would plausibly make the model smaller, but that additional complexity is the model doing interpretability work for us (both making its thinking closer to human concepts and building a pipeline to expose those thoughts as words).
Thanks for the back-and-forth on this, it's been very helpful!