(For a linkpost like this, it would help to have at least a short explanation of what this is, what the takeaways are, or why one might care, on the LW side.)
Are you trying to demonstrate that LLM agents are now capable of cloning SQLite (in which case my response is that select null || 'hello' should yield NULL rather than 'NULLhello', and select null > 5 should yield NULL rather than crash), or that LLM agents are not yet capable of one-shot cloning SQLite in Rust (which is not very surprising)?
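For anyone who wants to check the expected behavior, here is a minimal sketch using Python's bundled sqlite3 module as the reference. The helper name eval_scalar is my own, not something from the linked project. SQL NULL comes back as Python None.

```python
import sqlite3


def eval_scalar(sql):
    """Run a single-value query against a fresh in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


# NULL propagates through string concatenation instead of being stringified.
concat_result = eval_scalar("SELECT NULL || 'hello'")

# Comparing NULL with a number is neither true nor false; the result is NULL.
compare_result = eval_scalar("SELECT NULL > 5")

print(concat_result, compare_result)  # prints: None None
```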
I think this just means that one needs to spend more time constructing good test coverage (probably with the help of the agents involved).
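One cheap way to grow that coverage is differential testing against the real SQLite, sketched below under some assumptions. Here buggy_candidate_eval is a hypothetical stand-in for the clone under test (it just reproduces the 'NULLhello' behavior), and find_mismatches is my own helper name. A real harness would call into the Rust implementation and use a far larger, probably generated, query list.

```python
import sqlite3


def reference_eval(sql):
    """Evaluate a single-value query with the real SQLite (the oracle)."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


def buggy_candidate_eval(sql):
    """Hypothetical stand-in for the clone; it mishandles one NULL case."""
    if sql == "SELECT NULL || 'hello'":
        return "NULLhello"
    return reference_eval(sql)


def find_mismatches(queries, candidate):
    """Return (sql, expected, actual) for every query where results differ."""
    mismatches = []
    for sql in queries:
        expected = reference_eval(sql)
        try:
            actual = candidate(sql)
        except Exception as exc:  # a crash in the candidate counts as a mismatch
            actual = f"crash: {exc!r}"
        if actual != expected:
            mismatches.append((sql, expected, actual))
    return mismatches


queries = [
    "SELECT 1 + 1",
    "SELECT NULL || 'hello'",
    "SELECT NULL > 5",
    "SELECT 'a' || 'b'",
]
mismatches = find_mismatches(queries, buggy_candidate_eval)
for sql, expected, actual in mismatches:
    print(f"{sql!r}: expected {expected!r}, got {actual!r}")
```

The point of the design is that nobody has to write expected outputs by hand: the reference implementation supplies them, so the agents only need to generate queries.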
282 unit tests do not sound like nearly enough for something like SQLite (Google AI thinks that the original SQLite release had tens of thousands of tests, that the current number of tests is in the millions, and that fuzzers run through about a billion test mutations each day).
I don’t think one needs that much for a proof-of-concept, but the famous recent port of the JustHTML library by Simon Willison was made possible by html5lib-tests having roughly 9,200 test cases. Perhaps that’s the ballpark number of tests one needs to make it difficult to come up with a counterexample manually (that is, without running a large test suite).
Of course, building a hardened product with the extent of actual SQLite test coverage is a different story. Whether it’s possible or not, it’s certainly much more expensive (in agent labor and in hours of required human supervision).