This is a really interesting post, thank you! I recently tried to explore the bottom of the capability curve (what can you one-shot?) but have been struggling to get a glimpse of the top (what can the most well-orchestrated, well-designed AI systems achieve?), and appreciate your detailed writeup as offering some clear insight into that latter question.
Coming from academia, my strong prior is that system designers kind of know what to look for ahead of time, and build tools to find outcomes they strongly suspect will exist. I.e., often a fuzzer will seem perfectly designed to find just the specific bug(s) reported in the paper, and nothing else ... almost as if the authors knew about those ahead of time, or strongly suspected they'd exist, and designed the system accordingly. I wonder if you could comment, along those lines, on what findings Aisle has produced autonomously or semi-autonomously that surprised you and fell outside the envelope of results you'd deliberately designed the system to prosecute?
Thanks @noahzuniga! I think the continuum of AI capability very, very roughly looks like:

I'm not sure exactly where in [2, 4] we fall currently, though. I plan to do more work to further refine my assessment.

Hi, and thanks for the feedback! I agree with @the gears to ascension -- the point of security research is to determine what's possible when you violate the implicit assumptions of the system under study, and "the user will tell the truth" is definitely one of those assumptions. I'm therefore quite comfortable lying to the model in this context, in the same way I'm comfortable, for example, lying to the CPU by leveraging a ROP chain.
That being said, I would be interested to know what some of the "negative first-order and second-order consequences" are that you foresee stemming from lying to models. I actually struggle to imagine any such consequences, but I am open to being convinced otherwise. Thanks!
Right, those are all kinds of semantic gaps -- absolutely agreed! Mike Dodds is actively working on these specific kinds of questions; if you'd like to chat about them, I recommend emailing him directly :)
Readers who enjoyed this might also like: https://lawrencecpaulson.github.io/2026/01/15/Broken_proofs.html
I agree with everything you said here.
I also think -- and I believe the Harmonic folks are on the same page here -- that we need a variety of (purely static) techniques for compressing proof-history context: things like "what tactics have we tried and how did they fail?" and "how is the (potentially very large) proof in which this particular theorem lives structured?", etc.
In short, I agree that there is a lot of work to be done around improving ITPs (especially Lean, which seems to be the dominant choice for this kind of work) to work well in the kind of RL environments we are discussing :)
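To make the "compress the proof history" idea a bit more concrete, here is a minimal sketch of the sort of thing I have in mind (the names and shapes are entirely made up by me, not anyone's actual system): record each failed tactic attempt and deduplicate by (tactic, error) signature, so that a hundred identical dead ends cost one line of context rather than a hundred.

```python
# A toy, purely-static summary of proof-search history (invented for this comment,
# not anyone's actual system): deduplicate failed tactic attempts by signature so
# repeated dead ends don't eat the whole context window.

from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class TacticAttempt:
    goal: str    # pretty-printed goal the tactic was run on
    tactic: str  # e.g. "simp [foo]" or "nlinarith"
    error: str   # truncated error message reported by the ITP

def summarize(history: list[TacticAttempt], max_lines: int = 20) -> str:
    """Group failures by (tactic, error) so N identical dead ends cost one line, not N."""
    groups: dict[tuple[str, str], int] = defaultdict(int)
    for attempt in history:
        groups[(attempt.tactic, attempt.error)] += 1
    lines = [f"{count}x  {tactic}  ->  {error}"
             for (tactic, error), count in sorted(groups.items(), key=lambda kv: -kv[1])]
    return "\n".join(lines[:max_lines])
```

A real system would obviously want much richer summaries (goal diffs, the dependency structure of the surrounding proof, and so on); this is just the crudest version of the idea.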
That's a good point. Maybe I'm naive, but my (very strong) prior is that we are basically locked into Python and TypeScript as the de facto languages for normie vibe-coders for the foreseeable future, for a number of reasons: ease of novice human code review, model expertise in these languages, and the rapidly proliferating infrastructure and libraries that support vibing in them. The challenge, then, is to take such pre-existing code and somehow "bring it in" to the ITP. But you are of course correct that someone could find a way to make it easy and fun for non-FM-pilled engineers to vibe traditional software directly in Lean, and this would simplify things greatly.
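For concreteness, here is a toy sketch of what "bringing it in" could look like. This is purely illustrative: the function, the emitted Lean shape, and the spec are all invented for the example, not any real pipeline's output.

```python
# Purely illustrative: a pre-existing "vibe-coded" Python function, plus a tiny
# harness that emits a Lean 4 skeleton (definition + spec) for a prover/agent to
# try to discharge. The Lean shape is an assumption about what "bringing it in"
# might look like, not any real tool's output.

def clamp(x: int, lo: int, hi: int) -> int:
    """The pre-existing Python code we want to reason about."""
    return max(lo, min(x, hi))

def lean_skeleton(fn_name: str) -> str:
    # Hand-written translation of `clamp` plus the property we'd like Lean to certify.
    return f"""
def {fn_name} (x lo hi : Int) : Int := max lo (min x hi)

theorem {fn_name}_in_range (x lo hi : Int) (h : lo ≤ hi) :
    lo ≤ {fn_name} x lo hi ∧ {fn_name} x lo hi ≤ hi := by
  sorry  -- proof obligation left for the prover/agent
"""

if __name__ == "__main__":
    print(lean_skeleton("clamp"))
```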
Hi Stanislav, can you expand on what you mean when you say you're using FM? I.e., are you referring to symbolic or concolic analysis techniques, model checking, formal methods in the sense of dispatching a "go right or go left?" question to z3, formal methods in the sense of grammar-based fuzzing or property-based testing (lightweight FM), or do you mean you're doing full-blown theorem proving? (I don't exactly see why you would be doing that last one, but I'm open to being told I'm wrong :) )
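To illustrate why I'm drawing these distinctions: the "dispatch a branch question to z3" flavor, for instance, is a pretty small piece of machinery compared to full theorem proving. A minimal sketch of that flavor (toy example, assuming the z3-solver Python bindings; the constraints themselves are made up):

```python
# Toy example of the "ask z3 whether we can go right" flavor of FM, assuming the
# z3-solver Python bindings; the constraints themselves are made up.

from z3 import Solver, Int, sat

x = Int("x")
path_constraints = [x > 10, x < 100]  # constraints accumulated along the current path
branch_condition = x % 7 == 0         # the "go right?" question at the next branch

s = Solver()
s.add(*path_constraints, branch_condition)
if s.check() == sat:
    print("right branch is reachable, e.g. with x =", s.model()[x])
else:
    print("right branch is infeasible on this path")
```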