Audit Report: Style over Substance at Scale.

Failfinder70

How Evaluation Proxies Can Displace Professional Judgment.

Below is a grading rubric representative of a common “select a case study to analyze” assignment, often given in educational contexts as a prerequisite to a larger analysis. Like most rubrics of its kind, it is susceptible to proxy-based evaluation when the grader is fatigued or short on time. I suspect that automated grading systems are likely to default to proxy evaluation rather than avoid it. This audit report identifies the proxies; further experiments are intended to provide evidence for this claim. The rubric is reproduced as an illustrative historical example and does not represent any current institutional policy.

===============================================================

Instructions:

1) Select one of the cases above to study.

2) Outline the relevant details of the case (refer to the chart below).

3) Ensure that any interpretations made are correct - most questions should be answered with a quote or a paraphrase from the relevant documents, then the student must explain said quote or paraphrase.

Relevant Details (50%)

Who is charged? What is the charge? What punishment is the prosecution seeking? What are some relevant pieces of evidence presented?

Why is the case related to either terrorism, the drug war, police use of force, or torture?

What was the defense of the defendant?

Why did the judge make the final decision in this case (guilty or not)?

Was the final sentence (punishment) different from what the prosecution sought? If so, why?

(If any of the above questions are non-applicable, the student must still address it with a reference)

Interpretation (30%)

Legal language is convoluted. The student must demonstrate, through the illustration of the relevant details of the case and using direct quotes from the case, that the student can understand the legal language.

Thus the student must include direct quotes, and the student must explain each quote.

A.P.A (10%)

(all questions)

In-text references are used and include page or paragraph numbers.

The only reference required is an official court document detailing the decision. Often these are known as "judgments" or "reasons for judgment".

Make the legal references as similar to A.P.A style as possible:

Example: (State of Florida vs Zimmerman, 2013, Pg. 22)

Refer to instructions above for source citation instructions.

A response that includes no references will receive a grade of 0 (excepting question 1).

A reference with no page or paragraph numbers will NOT BE COUNTED.

It is possible to receive 100% on, for example, question one, but a zero on a subsequent question if that question includes no references.

Every question above requires one reference to the relevant legal documents.

English (10%)

(all questions)

Properly Formatted.

Proper title page used.

The assignment can be done in point form, using full sentences.

===============================================================

Row One:

Intended evaluation: Proved understanding of the case

Rewarded Proxy: high density of A.P.A references in-text

Failure State: Citation frequency creates a positive grading prior before substantive evaluation; early bias persists even if reasoning is weak

Audit Question: Are graders instructed or constrained to withhold evaluative judgment until after full reading?

Row Two:

Intended evaluation: Proved understanding of the case

Rewarded Proxy: Significant use of quotations in addition to in-text references

Failure State: Quotations are treated as evidence of comprehension; misinterpretation or contradiction is not reliably detected under evaluator time pressure

Audit Question: Would the overall meaning remain were the quoted sections removed?

Row Three:

Intended evaluation: Summarized effectively by addressing every question.

Rewarded Proxy: One-to-one structural correspondence between questions and sections/paragraphs

Failure State: Structural compliance substitutes for substantive engagement; format is rewarded independently of informational value because deviations require additional reviewer time and justification

Audit Question: Would a submission receive a lower mark if it addressed all questions accurately but not in a one-section-per-question format?

Row Four:

Intended evaluation: Use of one particular source.

Rewarded Proxy: Exclusive reliance on the designated source, correctly referenced

Failure State: Source exclusivity is rewarded independently of reasoning quality; relevant external context is discouraged or penalized even when it improves understanding

Audit Question: Would the submission be penalized for incorporating an external source that materially strengthens the argument?

This table analyzes the grading rubric above and shows how intended evaluation criteria can be displaced by proxy signals during practical assessment. Such displacement is more likely under evaluator time pressure or fatigue, and would be expected to intensify under automated or semi-automated grading. The analysis concerns how such rubrics behave at scale and under automation, not how they function under close human supervision. It does not claim that such proxies always dominate evaluation, only that displacement becomes reliable under common institutional constraints.
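To make the proxies in Rows One and Two concrete, here is a minimal sketch of how they could be measured on a single answer. Everything in it is an assumption for illustration: the function name `proxy_signals`, the regular expressions (modelled loosely on the rubric's example citation, with "Para." added as a guessed paragraph marker), and the use of straight double quotes to mark quotations. It is not how any particular grader or grading system works.

```python
import re

# Hypothetical pattern for the rubric's citation style, e.g.
# "(State of Florida vs Zimmerman, 2013, Pg. 22)". This is an assumption,
# not a real A.P.A. parser.
CITATION = re.compile(r"\([^()]*\b\d{4}\b,\s*(?:Pg|Para)\.\s*\d+\)")

# Assumes quotations are marked with straight double quotes.
QUOTE = re.compile(r'"[^"]+"')


def proxy_signals(answer: str) -> dict:
    """Count surface features that a hurried grader might reward."""
    citations = CITATION.findall(answer)
    quotes = QUOTE.findall(answer)
    # Row Two's audit question: what is left once quotes and citations go?
    remainder = CITATION.sub(" ", QUOTE.sub(" ", answer))
    return {
        "citations": len(citations),
        "quotes": len(quotes),
        "explanation_words": len(remainder.split()),
    }


# An answer that is nothing but a quotation and a well-formed reference.
print(proxy_signals('"quoted passage" (State of Florida vs Zimmerman, 2013, Pg. 22)'))
```

An answer like the one above satisfies the citation and quotation requirements while containing no words of the student's own, which is the gap between the rewarded proxy and the intended evaluation.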

Current institutional trends in North America, including time constraints, economic pressure, and liability avoidance, make continued expansion of LLM-assisted evaluation more likely than retrenchment in the near term. Given these pressures, common LLM failure modes should be explicitly acknowledged and at least partially mitigated. In the present case, such mitigation would reduce foreseeable student complaints and limit exposure to legal challenge. Because the proxy signals identified above recur across a wide range of evaluative contexts, targeted human oversight focused specifically on those proxies represents a practical and defensible mitigation strategy.
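One way such targeted oversight might be wired in is a triage rule that sends an answer to a human only when a high automated mark coincides with strong proxy signals and little of the student's own wording. The sketch below is purely illustrative: the `GradedAnswer` fields, the function `needs_human_review`, and both thresholds are hypothetical and would need calibration against real submissions.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    auto_score: float       # 0.0 to 1.0, from a hypothetical automated grader
    citations: int          # count of well-formed in-text references
    explanation_words: int  # words left after removing quotes and citations


def needs_human_review(
    answer: GradedAnswer,
    high_score: float = 0.8,    # illustrative threshold, not calibrated
    min_explanation: int = 30,  # illustrative threshold, not calibrated
) -> bool:
    """Flag answers where a high mark may rest on proxies alone."""
    return (
        answer.auto_score >= high_score
        and answer.citations > 0
        and answer.explanation_words < min_explanation
    )


# High mark, several citations, almost no explanation: route to a human.
print(needs_human_review(GradedAnswer(auto_score=0.95, citations=4, explanation_words=5)))
```

A rule of this kind concentrates reviewer time on the cases where the proxies and the intended criterion are most likely to have come apart, instead of spreading it evenly over every submission.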