Epistemic status: No rigorous findings, I'm only really looking at a few examples here. But this gave me a more intuitive feel for how LLM reasoning traces work in a way that's difficult to fully explain.
Intro
I've recently read the papers "Thought Anchors", "Forking Paths", and "Critical Tokens Matter". I found myself insanely curious as to what parts of the reasoning trace affect the final model output. I wanted to look at these trace resamples at the token level, so I figured I'd burn a bit of money to get some cool examples.
Background
If you're unfamiliar with these papers, I would recommend reading them, they're incredibly interesting! Long story short, one of the important things these papers do is on-policy resampling of reasoning traces. I'm really interested in which parts of a reasoning trace are important; I think studying these parts is critical to understanding the behavior of reasoning models. That being said, these papers offer only sentence-level examples or examples from non-reasoning models, and nothing I could really get my hands on and explore. I wanted something where I could look at each token and get some understanding of how that token affected the final model output.
Methodology
I took a bunch of problems from the MMLU-Pro dataset and asked a reasoning model to solve them. For each problem I'm interested in, I took a reasoning trace corresponding to a correct model output, froze the trace up to some point, then resampled to see how often the answer is correct with that frozen part of the trace in place. Then I repeated this for the next token, and so on.
To be more concrete, say we have some model output with reasoning trace t. To inspect the value of token i in that trace, I resample the model's output with trace t[:i] already prefilled, and I do this, say, 20 times. Then I can repeat the process for token i+1, etc., and eventually get a sense of how much every token is contributing to the trace (yes, this is very expensive).
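The procedure above can be sketched in a few lines. This is a minimal toy version, not my actual harness: `sample_fn` stands in for a real API call that prefills a trace prefix and checks whether the resampled answer is correct, and the "key fact" stub just mimics the kind of jump described in the examples below.

```python
import random

def token_success_rates(trace_tokens, sample_fn, n_resamples=20, stride=1):
    """For each prefix length i, freeze trace_tokens[:i] and resample the
    rest of the generation n_resamples times; record how often the final
    answer comes out correct. sample_fn(prefix) -> bool (answer correct?)."""
    rates = {}
    for i in range(0, len(trace_tokens) + 1, stride):
        prefix = trace_tokens[:i]
        correct = sum(sample_fn(prefix) for _ in range(n_resamples))
        rates[i] = correct / n_resamples
    return rates

# Toy stand-in for the model: pretend the answer becomes much more
# reliable once a "key fact" token is inside the frozen prefix.
random.seed(0)
def fake_sample(prefix):
    return random.random() < (0.9 if "anterior" in prefix else 0.3)

trace = ["the", "nerve", "runs", "anterior", "to", "the", "muscle"]
rates = token_success_rates(trace, fake_sample, n_resamples=200)
```

In the real version, `stride` is what lets you do the cheap 100-token pre-pass first and only drop to per-token resolution on the interesting regions.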
So I did this for a handful of problems. I found 3 examples very interesting, so I thought I would share them here!
Some implementation details
Everything is sampled from qwen3-vl-30b-a3b-thinking with the default temperature 0.6, top_p 0.95, top_k 20. I picked this model because it was the cheapest thinking model on Fireworks that was also serverless.
I did a bunch of pre-sampling (like at 100 token intervals instead of 1 token intervals) then hand picked some examples that looked interesting.
I resampled 20 times for every sample below, this gives pretty wide confidence intervals for success rates! Like +/- 15%, but the change in success rate becomes pretty clear when it's sustained for many tokens.
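For a rough sense of how wide those intervals are, here's a back-of-envelope normal-approximation calculation (the function name is mine, not from the post; the exact width depends on the underlying success rate):

```python
import math

def wald_ci_halfwidth(p, n, z=1.96):
    # Normal-approximation 95% half-width for a binomial proportion:
    # z * sqrt(p * (1 - p) / n)
    return z * math.sqrt(p * (1 - p) / n)

# With n=20 resamples, a mid-range success rate has roughly a +/-0.22
# half-width; near 0 or 1 the interval tightens considerably.
halfwidths = {p: wald_ci_halfwidth(p, 20) for p in (0.5, 0.8, 0.95)}
```

So a single per-token estimate is quite noisy, which is why the sustained shifts over many consecutive tokens are the signal to look at.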
The color of each token represents the success rate if I resample the trace while freezing everything up to and including that token.
I've hand picked 3 of the most interesting examples and included them below.
Examples
Right when the model verbalizes the correct fact "anterior and lateral to this muscle", the probability of answering correctly jumps from ~30% to ~80%! The model clearly knows this fact, but is potentially a little "unsure" about it. In the logits right before the " anterior" token, " anterior" only has a 14% chance of being decoded. But when it is decoded, the model seems to become more "sure" about its answer, and gets the question correct.
This one is pretty similar to the example above: the model starts off with a relatively low chance of actually getting the answer right, succeeding in only 5-15% of traces. It's a bit noisy, but after the model remembers and actually verbalizes the study about isotopes in animal teeth (the study that determines the animals were not local), the odds of the model getting the correct answer jump up to around 30-50%. Finally, when the model correctly recalls that the study determines the animals are non-local, the probability of the answer being correct jumps up to near 100%.
This one looks a bit different than the first two. The model seems to reason about some stuff that doesn't affect the success rate too much, then randomly guesses an answer! Before it verbalizes "Option H", the success rate is around 30%; after, it jumps to a sustained 80-100%! Even though it reasoned about some of the other answers, it clearly picked Option H before it did much thinking about the problem. Totally fascinating!
Conclusions
Of course, no real conclusions; this isn't rigorous, we're just squinting at a few cherry-picked examples. But I did update how I think about these reasoning models. It looks like sometimes they're uncertain about something, and the second they verbalize it in their trace, it's in the context window, so they just have to go with it! It feels like the fuzzy probability distribution over what the model knows is collapsed at the decoding step in the reasoning, and the model then has to work with that. The model can still go back on what it verbalized, but it seems like the verbalization step is important for how likely the model is to come to any given conclusion.
What's next? Ultimately I want to study deception and scheming. Does the model have some "deception-y" tokens that, once verbalized, make the model "decide" to become more deceptive? It seems pretty hard to decide when a model is being "deceptive", though; right now I'm just using some string matching to see if the output is the correct multiple-choice letter or not. There are a ton of other questions I have about this work: how could I design experiments to rigorously show some of my conclusions above, how often does this sort of thing happen, etc. Hopefully this inspires some of you with some interesting questions as well!