To Whom it May Concern,
So, I used to be a teacher (criminology) in a small, wonderful town. After ten years it was time for a change, so I went military. Yes, awkward, but not unrewarding. In any case, I luckily kept all of my evaluations and some student submissions, and every now and again I re-read them for nostalgia’s sake.
Then the venerable GPT got invented, and it was insane. I ignored it, other than comical YouTube videos. At around the time 4o came out, I had to do what is known as a full system migration (I will not re-modify my games, at 200 mods each). I figured I would ask the chatbot. It told me (hallucination or not, I have no idea) that I was looking at 8 hours of my own time, or over a thousand dollars at a repair shop. I opted for the 8 hours, being cheap.
Good old sycophantic 4o led me through the process, with significant frustrations, I might add. To my surprise, I found it was not useless. Then, much later, at a family dinner, a young’un told me that she suspected her university was using some LLM to do corrections. I had already seen a few YouTube videos of professors complaining about exactly this, and I was horrified. GPT 4o, the thing that couldn’t get the commander of a World War 2 battle right, is correcting university students (the irony that the students had 4o write their papers in turn is not lost on me). Being a nerd in the military, I said “let’s do a study!”
I took an evaluation I used to use and asked myself “how did exhausted lazy me substitute measures for actual metrics?”, made a table, and called it an audit report. I then took said report and the evaluation criteria, along with a completed assignment I did myself (writing an assignment for an evaluation I made myself, grade-ception?). I had a few frontier models, and a few offline ones through Open WebUI, correct the thing. I checked their outputs against the audit report and, sure enough, the LLMs marked precisely as exhausted lazy me did. With one exception: Grok went full confabulation mode and then marked based on its own confabulations. Interesting, so I posted it to Substack for fellow nerds. I wrote in the most boring clinical style I could muster, as that was how the research reports I had read up to that time were written.
Months and months later (i.e. now) I discovered LessWrong, a place for nerds who like LLMs. I thought: excellent, I shall post my proof that LLMs are terrible at marking human responses there; that community will eat this stuff up! What do I get?
So, my experiment showing that LLMs learned the same grading shortcuts I used as an exhausted teacher was flagged by an LLM as not written by a human.
The irony compounds.
Thank you for your consideration.
--A former, still lazy, and overworked, teacher.
FYI:
https://normaldood.substack.com/p/audit-report-style-over-substance
and
https://normaldood.substack.com/p/when-evaluation-fails