Sorry, we weren't clear enough in this section. The use of an opaque memory bank doesn't seem particularly unlikely to us; however, it seems unlikely that a model would stop relying on its CoT entirely unless it's reasoning about something stored in the memory bank. If the LLM doesn't rely on its CoT and uses an opaque memory bank, the memory bank may as well be viewed, together with the LLM's weights, as a unified system, in which case the reasoning performed inside that system falls into one of the categories discussed in the main part of our post.
Thanks for the feedback!
You might also get 'Linguistic drift' from training against a CoT monitor, or because a sufficiently capable and situationally aware model might just decide to come up with and reason in a new language, to not let its thoughts be legible to a CoT monitor.
We agree that linguistic drift can arise both as a result of applying a lot of outcome-based RL pressure, as discussed by Korbak et al., and due to training against a CoT monitor or situational awareness. I think that all of them are consistent with our definition of linguistic drift: "The model uses a language that doesn’t make sense to the monitor. p is directly mentioned in the CoT, but the monitor cannot recover it."
You use 'steganography' to refer to an encoding scheme that produces something that looks like normal text about some apparent topic while hiding the true reasoning. However, Chen2023 and Lanham2023 treat it as synonymous with encoded reasoning in general.
We think it's reasonable to refer to steganography as encoded reasoning—in the Summary section, we mention that it's a subcategory of encoded reasoning. However, linguistic drift counts as encoded reasoning as well, so it seems useful to use more precise terms to differentiate between different types of encoded reasoning.
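To make the distinction concrete, here's a deliberately simplistic sketch of the kind of encoding scheme we have in mind for steganography (our own toy illustration, not something from the post or the papers above): the cover text reads as ordinary prose about an apparent topic, and the hidden content is only recoverable by someone who knows the scheme, whereas linguistic drift would instead produce text that isn't legible to the monitor at all.

```python
# Toy acrostic scheme: the cover text looks like an innocuous sentence,
# while the hidden string is carried by the first letter of each word.
# Purely illustrative; real steganographic CoTs would be far subtler.

def encode(secret, vocab):
    # For each letter of the secret, pick an ordinary-looking word starting with it.
    return " ".join(next(w for w in vocab if w.startswith(c)) for c in secret)

def decode(cover_text):
    # The hidden content is recovered by reading only the first letters.
    return "".join(word[0] for word in cover_text.split())

vocab = ["apples", "bake", "cats", "dogs", "every", "keep"]
cover = encode("cake", vocab)        # -> "cats apples keep every"
print(cover, "->", decode(cover))    # cats apples keep every -> cake
```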
I am currently writing a report myself, where I also tried to come up with a more clearly distinctive taxonomy of encoded reasoning.
Interesting! I'm curious about whether there are any other ways in which your report differs from our taxonomy, and whether you think incorporating any of the categories in your report into our post would make it better.
Thanks, hadn't heard of that one before!
Yeah, Wikipedia would have deserved a mention, though I'm primarily looking for more hidden gems. Thanks for mentioning Seirdy, I hadn't heard of it before!
What are the best blogs that follow the Long Content philosophy?
On his About page, Gwern describes the idea of Long Content as follows:
I have read blogs for many years and most blog posts are the triumph of the hare over the tortoise. They are meant to be read by a few people on a weekday in 2004 and never again, and are quickly abandoned—and perhaps as Assange says, not a moment too soon. (But isn’t that sad? Isn’t it a terrible ROI for one’s time?) On the other hand, the best blogs always seem to be building something: they are rough drafts—works in progress. So I did not wish to write a blog. Then what? More than just “evergreen content”, what would constitute Long Content as opposed to the existing culture of Short Content? How does one live in a Long Now sort of way?
Gwern himself has, of course, been following this approach to blogging for a long time. There are other rationalist(-adjacent) blogs that seem to follow a similar philosophy: niplav mentions Long Content as a direct inspiration, and the pages of Brian Tomasik, Mariven, Gavin Leech, and Yuxi Liu can also be said to fall into this category.
So far, I have only encountered one page outside of the rat-adjacent sphere that resembles the aforementioned blogs: that of Cosma Shalizi. However, I'm much less familiar with the non-rationalist blogosphere, so there must surely be others. Please post them here if you've encountered them! (And also, feel free to link to rat-adjacent blogs that I've neglected.)
The bigger your buffer, the more relaxing it is. Unless you are extremely pressed for time, the number of flights you should miss is essentially zero.
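To put rough numbers on this, here's a back-of-the-envelope sketch (the exponential delay model and all figures are assumptions, not data): when missing a flight is costly, the buffer that minimizes total expected time lost already implies a tiny per-flight miss probability.

```python
# Minimal sketch: minutes lost = buffer spent waiting + P(miss) * cost of
# missing (in minute-equivalents). Delays are modeled as exponential purely
# for convenience; both numbers below are made up.
import math

typical_delay = 20.0      # assumed scale of door-to-gate delays, in minutes
cost_of_missing = 1500.0  # assumed cost of a missed flight, in minute-equivalents

def expected_cost(buffer):
    # Waiting time plus expected cost of missing the flight.
    return buffer + cost_of_missing * math.exp(-buffer / typical_delay)

# Setting the derivative to zero gives the optimal buffer
#   B* = typical_delay * ln(cost_of_missing / typical_delay),
# at which the miss probability is typical_delay / cost_of_missing.
optimal_buffer = typical_delay * math.log(cost_of_missing / typical_delay)
miss_probability = typical_delay / cost_of_missing

print(f"optimal buffer = {optimal_buffer:.0f} min, miss rate = {miss_probability:.1%}")
# optimal buffer = 86 min, miss rate = 1.3%
```

At a handful of flights per year, a roughly 1% per-flight miss rate works out to about one missed flight per decade, which is essentially zero in practice.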
I must quote the original Umeshism here:
If you’ve never missed a flight, you’re spending too much time in airports.
I have a fairly uncommon psychology, well-described by the following quote by Tyler Cowen:
You know, I feel I’ve been very fortunate in life and I think I have about the most even temperament of anyone I know. I literally don’t have unhappy days. It would be hard to say I’ve had zero in life, but I think I’m almost weirdly never unhappy, in a way that’s good for productivity but maybe almost inhuman and to be a little bit feared or looked down upon or not thought well of.
In other words, I feel like I hardly suffer at all (by my definition of suffering, which is mainly about the feeling of pain and discomfort), and my current self doesn't endorse moving toward a state where the little suffering I do experience disappears at the expense of no longer feeling pleasure, which you mentioned is a side effect of enlightenment (though again, our definitions of pleasure may differ). Should I nevertheless meditate and try to attain enlightenment? I can imagine several responses from you:
I'm curious about which of these is closest to your actual answer, and also about whether there's some additional benefit to awakening that I haven't thought of and didn't list above.
Thanks, good questions!
The following thought experiment by Tim Scanlon is also a relevant example (quote from Edmonds's Parfit):
Suppose that Jones has suffered an accident in the transmitter room of a television station. Electrical equipment has fallen on his arm, and we cannot rescue him without turning off the transmitter for fifteen minutes. A World Cup match is in progress, watched by many people, and it will not be over for an hour. Jones’s injury will not get any worse if we wait, but his hand has been mashed and he is receiving extremely painful electrical shocks. Should we rescue him now or wait until the match is over?
Thanks for the insightful comments! Some responses:
In the situation where the code contains a backdoor but the CoT doesn't acknowledge this, I think it's reasonable to say that p is missing from the CoT: whether the monitor can detect the backdoor by looking at the code depends on how smart the monitor is, and we normally assume that the monitor is dumber than the monitored LLM. Analogously, in the 3²=9 case, a monitor can infer that the model computed this intermediate result by looking at the input and the output, but it seems reasonable to say that p is missing from the CoT. That said, we should have made this assumption about the monitor being dumb more explicit in our operationalization, and it also seems plausible that an operationalization separating p_out and p_lat could make things clearer.
We considered making this separation (analogously to how Lee Sharkey distinguishes between active and passive interpretability circumvention methods in Circumventing interpretability: How to defeat mind-readers), but ultimately decided against that: it seems that each category aside from recurrent neuralese is likely to first emerge as an incidental behavior, but could eventually also become deliberate as models become smart enough. Thus, distinguishing between active and passive types of hidden reasoning doesn't seem that informative, and the post is already very long in its present form.
The term 'parallel' seemed appropriate to us since the answer-relevant computations happen in parallel across different token positions: e.g., as we show in our illustration, the calculations 3²=9 and 5²=25 can both be performed at layer 1 if they're performed at different token positions.[1] Though there isn't more parallelism per step, this is a different kind of parallelism that isn't available when the answer has to be produced in a single forward pass. If we named this 'serial latent computing across steps', I'd worry that this parallelism across token positions wouldn't be as clear and readers might assume that the latent computations only take the form of diagonal information flow across layers and token positions, rather than parallel computations at the same layer but different token positions. Additionally, the term 'parallel' has been used in the past to refer to this kind of reasoning in Max Nadeau's decomposition of unfaithfulness.
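To make the picture concrete, here's a deliberately schematic sketch (toy code of our own, with made-up function names, not any real architecture): both squarings happen at 'layer 1', just at different token positions, and a later step at the final position aggregates them.

```python
# Toy stand-in for a transformer layer stack: layer 1 acts on every token
# position independently (like an MLP), and a later layer at the final
# position combines what the earlier positions computed.

def layer1_square(residual_stream):
    # Each position squares the number it holds; with two positions available,
    # 3^2 = 9 and 5^2 = 25 are both computed here, at the same layer.
    return [x ** 2 for x in residual_stream]

def layer2_aggregate(residual_stream):
    # Schematic stand-in for attention from the last position: sum the
    # results computed at the earlier positions.
    return sum(residual_stream)

hidden = layer1_square([3, 5])      # [9, 25] -- parallel across positions
answer = layer2_aggregate(hidden)   # 34
print(answer)
```

In a single forward pass at one token position, this toy model would instead have to spend one layer per squaring, i.e., perform them serially across layers.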
We have also noticed the use of 'neuralese' in the emergent communication literature. However, we don't think there's a consensus in past literature about this term: there are multiple examples of more recent work referring to opaque recurrence as neuralese. Since 'linguistic drift' already provides an elegant term for emergent discrete codes, while the alternative terms for latent-only serial reasoning are much clunkier, we think it would be better if 'neuralese' were used to denote opaque recurrent serial reasoning.
A real-world LLM would perhaps be able to perform both of those computations in parallel inside a single layer at a single token position, but we assume that our toy model isn't capable of that. There are likely more complex computations analogous to this one that a frontier LLM cannot perform in a single MLP but can perform at the same layer at different token positions.