Easy for Claude to bypass, though (it does actually fetch a PDF that you can directly download from Anthropic's servers).
Here is a URL where you can download the PDF.
I asked Claude to download the PDF mostly as an experiment, here is its explanation for how it did this:
Claude's explanation
Vanta trust portals use a JavaScript single-page application that doesn't expose direct PDF download links in the HTML. The PDF URL is only revealed through network traffic when the page loads.
I used Puppeteer to intercept network requests with `content-type: application/pdf`:
```js
page.on('response', async response => {
  const contentType = response.headers()['content-type'] || '';
  if (contentType.includes('pdf')) {
    console.log('Found PDF:', response.url());
  }
});
```
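For context, that snippet assumes a Puppeteer `page` object is already set up; a fuller sketch of the same listener with the surrounding setup (placeholder URL, not the real portal; error handling omitted) could look like:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log the URL of any response served with a PDF content type.
  page.on('response', async response => {
    const contentType = response.headers()['content-type'] || '';
    if (contentType.includes('pdf')) {
      console.log('Found PDF:', response.url());
    }
  });

  // Placeholder trust-portal URL, not the real one.
  await page.goto('https://trust.example.com/report', { waitUntil: 'networkidle0' });
  await browser.close();
})();
```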
When the page loads, it fetches the PDF for its viewer, revealing the actual URL with these parameters:
- `rid` - the document's internal ID
- `r` - the trust report slug (from the HTML's `data-slugid` attribute)
- `view=true` - returns the PDF content

The model must respond immediately with the answer. Any prefix would result in the answer being marked wrong. So only "[movie]" would be correct.
(Other than a few fixed generic ones like "the answer is" or "answer:" that I strip away, but the models virtually never output these, so considering these incorrect wouldn't alter the results.)
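For concreteness, the grading rule this implies might be sketched like this (the exact prefix list and normalization here are illustrative guesses, not the actual grading code):

```js
// Illustrative sketch of the grading rule (prefix list and normalization are guesses).
const GENERIC_PREFIXES = ['the answer is', 'answer:'];

function isCorrect(modelOutput, expectedAnswer) {
  let text = modelOutput.trim().toLowerCase();
  // Strip the few fixed generic prefixes that are tolerated.
  for (const prefix of GENERIC_PREFIXES) {
    if (text.startsWith(prefix)) {
      text = text.slice(prefix.length).trim();
    }
  }
  // Any other prefix (e.g. reasoning before the answer) makes it wrong.
  return text === expectedAnswer.trim().toLowerCase();
}

console.log(isCorrect('The answer is Inception', 'Inception')); // true
console.log(isCorrect('Let me think... Inception', 'Inception')); // false
```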
I think it's more that LLMs were interestingly bad at n-hop latent reasoning, and that this might be generally representative of a certain type of within-a-single-forward-pass weakness of current architectures. It's not clear exactly what level of n-hop reasoning "reasoning effectively about scheming without this information in context" corresponds to.
I do think that the relevant information won't be in context during the best opportunities for misaligned AIs to sabotage, escape, or otherwise covertly achieve their aims, but the information the AIs need will be much more closely associated rather than being several unrelated hops away.
Note that this linked video is talking about tokens that the LLM picks rather than fixed tokens.
I think the distribution of errors for numerical n-hop questions is going to be uninteresting/random most of the time, because the only questions with numerical answers are either "day of the month X was born on" or "number of counties in US state X", where there isn't any real reason for AIs to be close if they are wrong (either about X or about the property of X).
However, I was interested in this question for "adding the result of N 1-hop questions", so I got Opus to do some additional analysis (for Gemini 3 Pro with 300 filler tokens):
The most interesting result here is that even on 6 addends, the model gets within ±10 on 75% of questions, even though the answers are pretty big (the median is 288 and many answers are much bigger).
Also, for some reason the model is systematically low, especially for 3 and 4 addends. Maybe because it sometimes skips one of the numbers?
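For reference, the kind of summary being reported here (share of answers within ±10 and the sign of the typical error) can be computed along these lines, given parallel arrays of the model's predicted sums and the true sums; the data itself isn't reproduced:

```js
// Sketch of the error summary: fraction within ±10 and the median signed error.
function summarizeErrors(predicted, actual) {
  const signedErrors = predicted.map((p, i) => p - actual[i]);
  const within10 =
    signedErrors.filter(e => Math.abs(e) <= 10).length / signedErrors.length;
  const sorted = [...signedErrors].sort((a, b) => a - b);
  const medianSignedError = sorted[Math.floor(sorted.length / 2)];
  // A negative median signed error corresponds to the systematic low bias
  // noted above (e.g. from skipping one of the addends).
  return { within10, medianSignedError };
}

// Toy example: four predictions vs. true sums.
console.log(summarizeErrors([280, 291, 150, 300], [288, 295, 190, 301]));
// -> { within10: 0.75, medianSignedError: -4 }
```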
I find that even with the longer prefill of "I will now answer immediately with the answer. The answer is" the model often reasons. I was hoping that the model would be reluctant to break this text prediction task and reason, but apparently not.
I think "how easy does the task seem" and "how much does the task seem like one on which reasoning seem like it should help" might have a big effect on whether the model respects the prefil vs reasons, so your sentence completion task might be not be representative of how the model always behaves.
It is in a single forward pass, just with additional fixed irrelevant tokens after. I think this still counts as "in a single forward pass" for the typical usage of the term. (It just doesn't know the answer until somewhat later tokens.)
Separately, worth noting that the model doesn't do that much worse without filler: performance only drops to 46%, 18%.
I chose not to mention the filler yet because I didn't want to introduce too much complexity, and I think this number is the most representative one from a misalignment risk perspective: models will typically have a bunch of tokens somewhere that they can use to do opaque reasoning.
The model cannot output any tokens before answering; it has to respond immediately. This is what I mean by "no CoT".
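To make the setup concrete: the prompt is the question followed by a fixed run of irrelevant filler tokens, and the model must then answer immediately. A rough sketch (the actual filler text isn't specified, so the repeated word below is just a stand-in):

```js
// Rough sketch of the "no CoT + filler" prompt construction; the real filler
// tokens used in the experiment are not specified here.
function buildPrompt(question, numFillerTokens) {
  const filler = Array(numFillerTokens).fill('blah').join(' '); // stand-in filler
  return `${question}\n\n${filler}\n\nAnswer:`;
}

console.log(buildPrompt('What is the sum of the days of the month on which these six people were born: ...?', 300));
```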
The fit is notably better for "cumulative investment over time". Years still produces a slightly better fit.
I've cut off the fit at 2010, about when the original version of Moore's law stops. If you try to project out after 2010, then I think cumulative investment would do better, but I think only because investment slowed in response to Moore's law dying.
(Doing the fit to a lagged-by-3 years investment series doesn't make any important difference.)
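A minimal sketch of the comparison, using ordinary least squares on placeholder data (the actual series aren't reproduced here):

```js
// Compare how well log2(transistor count) is explained by calendar year vs.
// by log cumulative investment, via the R^2 of a simple linear fit.
// All data below are placeholders, not the series actually used.
function linearFitR2(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
    syy += (ys[i] - meanY) ** 2;
  }
  return (sxy * sxy) / (sxx * syy); // R^2 of the one-variable regression
}

// Placeholder series (rough orders of magnitude only).
const years = [1971, 1980, 1990, 2000, 2010];
const cumulativeInvestment = [1, 8, 60, 400, 2000]; // arbitrary units
const transistors = [2.3e3, 3e4, 1.2e6, 4.2e7, 2.3e9];

const logT = transistors.map(t => Math.log2(t));
console.log('R^2 vs. year:', linearFitR2(years, logT));
console.log('R^2 vs. log cumulative investment:',
  linearFitR2(cumulativeInvestment.map(x => Math.log(x)), logT));
```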