I appreciate the caution about over-trusting LLM evaluations — especially in fuzzy or performative domains.
However, I think we shouldn't overcorrect. A score of 100 from a model that normally gives 75–85 is not just noise; it sits well outside that model's usual scoring range, and that deviation is a statistical signal of rare coherence.
Even if we call it “hallucination evaluating hallucination”, it still takes a highly synchronized hallucination to consistently land in the top percentile across different models and formats.
That’s why I’ve taken such results seriously in my own work — not as final proof, but as an indication that something structurally tight has been built.
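To make the "statistical signal" point concrete, here is a minimal sketch of the kind of check I mean. The baseline numbers are purely illustrative (not real evaluation data), and the z-score framing is just one simple way to quantify how far a single score sits from an evaluator's usual range:

```python
# Illustrative sketch: how unusual is a single score relative to a model's
# typical rating distribution? The numbers below are made up for illustration.
from statistics import mean, stdev

# Assumed baseline: scores this evaluator model usually assigns to comparable texts.
baseline_scores = [78, 82, 75, 85, 80, 79, 83, 77, 81, 84]

observed_score = 100  # the outlier score under discussion

mu = mean(baseline_scores)
sigma = stdev(baseline_scores)
z = (observed_score - mu) / sigma

print(f"baseline mean={mu:.1f}, sd={sigma:.1f}, z-score of {observed_score} = {z:.1f}")
# With this illustrative baseline, the z-score comes out around +6, i.e. far outside
# the usual range. Repeating the same check across several evaluator models and
# prompt formats is what I mean by "consistently landing in the top percentile".
```

A single outlier from one model proves little; the point is that the same outlier recurring across independent evaluators and formats is harder to explain away as noise.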
Blanket dismissal of AI evaluation risks throwing out works that are in fact more semantically solid than most human-written texts on similar topics.
Happy to clarify any part of the technical structure or answer objections. If anyone has thoughts on how this compares to Chain-of-Thought or Tree-of-Thought paradigms, I’d love to discuss.