- The LLM relies on the CoT for internal reasoning but can access a neural memory bank or some other external tool that’s opaque. The activations and/or CoT only contain a pointer to p, which is stored only in the memory bank, not p itself.
Why does this seem unlikely to you all?
Nicely done! Loved to see the analysis and write up.
This calculus changes when you can work on many things at once (similar to @faul_sname's comment but this might be that you can work on many projects at once, even if they can't each be parallelized well).
This definitely depends on the "blue team" protocol at hand right? If we're doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it's not caught.