GA grading calibration is too narrow

Summary

Fixed 2026-05-01

# GA grading calibration is too narrow

The Grading Assistant compresses the score range. Two essays of clearly different quality come back with grades within a point or two of each other.

## Reproduction

1. Pick two essays of clearly different quality (one with errors, one polished).
2. Run GA on both.
3. Compare scores.

Specifically observed: an error-filled essay and a clean essay graded one point apart. Two essays from a weaker writer (one well-formatted, one poorly formatted) graded identically.

## Expected

GA scores should reflect the quality differences a human grader would identify. A polished essay should score notably higher than an error-filled one of similar length and topic.

## Actual

Score range is compressed. GA appears to anchor around a default score (anecdotally near 77%, reported multiple times) and rarely moves far from it.

## Impact

Teacher trust issue: if all essays come back ~77%, the grade isn't useful as a quality signal. It also risks reinforcing the grade-inflation concerns cited by Kevin in the Feb 27 notes, a deeper philosophical concern that depends on calibration being right. Less acute than [ga-flaky-grading-and-incomplete-markup](ga-flaky-grading-and-incomplete-markup.md) (which is a correctness issue), but it undermines the GA's pedagogical value.

## Affected versions

Reported 2026-02-24, with follow-up observations on 2026-03-02 noting "5/16 grades were a 77%" in one batch.
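For scale, 5 of 16 is 31.25% of a batch landing on a single score. A minimal sketch of how that share could be measured for any batch follows; the function name `modal_share` and the sample grades are hypothetical, made up for illustration, and are not the reported data.

```python
from collections import Counter


def modal_share(scores):
    """Return (most common score, fraction of the batch with that score)."""
    if not scores:
        raise ValueError("scores must not be empty")
    score, count = Counter(scores).most_common(1)[0]
    return score, count / len(scores)


# Hypothetical batch (invented for illustration): 16 grades, 5 of them 77.
batch = [77, 74, 77, 79, 77, 76, 80, 77, 75, 78, 77, 81, 73, 79, 76, 82]
score, share = modal_share(batch)
print(f"{score} appears in {share:.0%} of the batch")
```

Tracking this share per batch would give a rough before/after number for any calibration change, though a single repeated score is only one symptom of compression.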

## Suspected cause

The prompt or rubric may be anchoring the LLM toward a central tendency. Possibly the rubric's category weights aren't producing enough spread, or the LLM's output distribution is being clamped. Worth doing a calibration study against a teacher-graded gold set.
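One way such a calibration study might quantify compression is sketched below, assuming paired GA and teacher scores for the same essays are available. The function `calibration_report`, its output keys, and the sample scores are all hypothetical; nothing here reflects how GA is actually implemented or evaluated.

```python
from statistics import mean, pstdev


def calibration_report(ga_scores, teacher_scores):
    """Compare GA scores against teacher scores for the same essays.

    spread_ratio: GA standard deviation / teacher standard deviation.
    slope: least-squares slope of GA score regressed on teacher score.
    correlation: Pearson correlation between the two score lists.
    """
    if len(ga_scores) != len(teacher_scores) or len(ga_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    ga_mean, t_mean = mean(ga_scores), mean(teacher_scores)
    ga_sd, t_sd = pstdev(ga_scores), pstdev(teacher_scores)
    if t_sd == 0:
        raise ValueError("teacher scores have no spread to compare against")
    # Population covariance between GA and teacher scores.
    cov = mean([(g - ga_mean) * (t - t_mean)
                for g, t in zip(ga_scores, teacher_scores)])
    return {
        "spread_ratio": ga_sd / t_sd,
        "slope": cov / (t_sd ** 2),
        "correlation": cov / (ga_sd * t_sd) if ga_sd > 0 else 0.0,
    }


# Hypothetical gold set (invented): GA ranks essays correctly but
# squeezes a 40-point teacher range into 8 points around 77.
teacher = [50, 60, 70, 80, 90]
ga = [73, 75, 77, 79, 81]
print(calibration_report(ga, teacher))
```

A spread ratio or slope well below 1 alongside a high correlation would suggest GA orders essays sensibly but compresses the scale, which points toward rescaling or prompt and rubric changes. A low correlation would suggest a different problem than anchoring.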

## Workaround

Teacher manually adjusts category scores in the gradebook (already supported).
