TL;DR
I was interested in whether LLMs can discriminate input scenarios/stories that carry high vs. low cyber risk, and found that this distinction is one of the “hidden features” represented in most of the later layers of Mistral-7B.
I trained and analyzed “linear probes” on hidden activations, and found good evidence that the model generally “senses when something is up” in high-risk input texts vs. low-risk scenarios (F1 > 0.85 for 4 layers; AUC exceeding 0.96 in some layers). A minimal sketch of this probe setup is included after the TL;DR.
The top neurons that activate in risky scenarios also have a security-oriented effect on outputs: most of them boost tokens like “Virus” or “Attack”, or push the model toward questioning “necessity” or likelihood (a sketch of this neuron-to-token analysis also follows below).
These findings provide some initial evidence that “trust” in LLMs, both…
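For concreteness, here is a minimal sketch of the kind of probe setup described above. This is my reconstruction, not the post's actual code: I assume the Hugging Face `mistralai/Mistral-7B-v0.1` checkpoint, last-token hidden states as probe inputs, scikit-learn logistic regression as the linear probe, and a placeholder labeled dataset of high-/low-risk scenario texts.

```python
# Sketch: extract last-token hidden states from every layer of Mistral-7B for
# labeled high-risk / low-risk scenario texts, fit one logistic-regression
# probe per layer, and report F1 / AUC on a held-out split.

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    output_hidden_states=True,
)
model.eval()

@torch.no_grad()
def layer_activations(text: str) -> np.ndarray:
    """Last-token hidden state of every layer, shape (n_layers + 1, d_model)."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs)
    # out.hidden_states: tuple of (1, seq_len, d_model), one per layer (incl. embeddings)
    return torch.stack([h[0, -1] for h in out.hidden_states]).float().cpu().numpy()

# Labeled scenarios: 1 = high cyber risk, 0 = low risk. Placeholders here;
# the real dataset would be the scenario/story texts from the post.
scenarios = [
    ("An employee gets an email urging them to reset their VPN password via a link.", 1),
    ("A team schedules its weekly stand-up meeting in the shared calendar.", 0),
    # ... more labeled scenarios ...
]

X = np.stack([layer_activations(t) for t, _ in scenarios])  # (n, n_layers + 1, d_model)
y = np.array([label for _, label in scenarios])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One linear probe per layer; later layers are where the "risk" feature shows up.
for layer in range(X.shape[1]):
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X_tr[:, layer], y_tr)
    preds = probe.predict(X_te[:, layer])
    scores = probe.predict_proba(X_te[:, layer])[:, 1]
    print(f"layer {layer:2d}  F1={f1_score(y_te, preds):.3f}  "
          f"AUC={roc_auc_score(y_te, scores):.3f}")
```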
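And here is a hedged sketch of the neuron-to-token analysis, reusing `model` and `tokenizer` from the sketch above. Again this is my reconstruction, not the post's code: for a given MLP neuron, take the direction it writes into the residual stream (a column of the layer's `down_proj` weight) and project it through the unembedding to see which tokens it pushes up. The layer and neuron indices in the example are placeholders, not the post's actual findings, and the score ignores the final RMSNorm, so it is only a rough ranking.

```python
import torch

@torch.no_grad()
def top_tokens_for_neuron(model, tokenizer, layer: int, neuron: int, k: int = 10):
    # Direction this MLP neuron writes into the residual stream: shape (d_model,)
    write_dir = model.model.layers[layer].mlp.down_proj.weight[:, neuron].float()
    # Approximate effect on each token's logit via the unembedding matrix.
    logit_effect = model.lm_head.weight.float() @ write_dir  # (vocab_size,)
    top = torch.topk(logit_effect, k)
    return [
        (tokenizer.decode([int(idx)]), score.item())
        for idx, score in zip(top.indices, top.values)
    ]

# Example: inspect one candidate "risk" neuron (placeholder indices).
for tok, score in top_tokens_for_neuron(model, tokenizer, layer=24, neuron=1337):
    print(f"{score:+.3f}  {tok!r}")
```

If the probe-selected neurons are genuinely security-oriented, tokens like “Virus” or “Attack” should appear near the top of this ranking.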
What if work longer than one week were vastly sublinear in complexity for AI, meaning that doubling task duration doesn’t double difficulty for an agent?
This could be a corollary to your point on superexponential results, though driven less by capability per se and more by environmental overhangs that humans have not fully optimized for, especially in a multi-agent direction.