Zvi covered this in a recent roundup. In short, the smaller models don't have Mythos' ability to chain vulnerabilities and write effective exploits, and they have a much higher false positive rate, so their output would need extensive skilled manual effort to achieve anything near the same result. In contrast, Mythos output is directly useful to an attacker ... or, fortunately, to a defender.
See also this discussion on HN.
Where is this quotation from? I don't see a link.
Also: "isolated the relevant code". That phrase could be doing a lot of heavy lifting here, right? It's one thing to sift through a million lines of code and identify a bug. It's another to be handed three lines that contain a bug and find it. Needle in a haystack vs. needle with two pieces of straw. If the methodology was identical, okay. But I'd like to see a side-by-side comparison of methodology here.
I googled the source link here: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
I'm also concerned about isolating the code. It's the difference between finding a needle in a haystack and distinguishing a needle from a single straw. Their set of models flagged 12 of the 18 patched (vulnerability-free) samples as vulnerable, alongside 18/18 true positives, which suggests terrible specificity to me.
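Spelling out the arithmetic behind that specificity complaint (a sketch: I'm assuming the 18 flagged-or-passed patched samples are the only negatives in the evaluation, which the post doesn't state explicitly):

```python
# Metrics implied by the reported numbers: 18/18 true positives on
# vulnerable code, 12/18 false positives on patched code. The counts
# come from the figures above; the exact evaluation setup is an
# assumption on my part.
tp, fn = 18, 0      # vulnerable samples: all 18 correctly flagged
fp = 12             # patched samples incorrectly flagged as vulnerable
tn = 18 - fp        # patched samples correctly passed

sensitivity = tp / (tp + fn)   # fraction of real bugs caught
specificity = tn / (tn + fp)   # fraction of clean code correctly passed

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
# prints "sensitivity=1.00, specificity=0.33"
```

A specificity of ~0.33 means two of every three reports against clean code are noise, which is why the false positive rate matters so much for the "extensive skilled manual effort" point above.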
Can they detect the problem, or can they develop a working exploit for it? My understanding is that the former is much easier than the latter, and that modern mitigations (ASLR, W^X, stack canaries, and the like) make it quite obnoxious (but often not impossible for someone who knows what they're doing) to go from an out-of-bounds (OOB) write to arbitrary code execution.
https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
They can detect the problem, but not develop a working exploit. They also had a 2/3 false positive rate on the patched version of that function.
They say that's fine because something other than the (frontier) model can do those steps, but don't demonstrate the capability anywhere I could see.
That sounds about right. A dedicated team of security experts can find a hole in anything, but not in everything (i.e. pre-Mythos, the bottleneck was going from "possible vuln" to "PoC").
I've been more skeptical than the average reader/commenter here around the capabilities of Mythos et al., and I also have some limited security experience.
It seems to me more surprising that human researchers didn't discover these exploits, rather than that Mythos/Opus did.
Also, vulnerability research is a very wide field, and as they say, "there are levels to this game".
Overall, I think that Mythos is probably more capable at cybersecurity, but I don't share the vibe that it's a "god in a box", or other such monikers I've seen online.