Summary
Fixed 2026-05-01
Spec body
# GA grading calibration is too narrow

The Grading Assistant compresses the score range. Two essays of clearly different quality come back with grades within a point or two of each other.

## Reproduction

1. Pick two essays of clearly different quality (one with errors, one polished).
2. Run GA on both.
3. Compare scores.

Specifically observed: an error-filled essay and a clean essay graded one point apart. Two essays from a weaker writer (one well-formatted, one poorly formatted) graded identically.

## Expected

GA scores reflect quality differentiation that a human grader would produce. A polished essay should score notably higher than an error-filled one of similar length and topic.

## Actual

Score range is compressed. GA appears to anchor around a default (anecdotally near 77%, mentioned multiple times) and not move far from it.

## Impact

Teacher trust issue: if all essays come back ~77%, the grade isn't useful as a quality signal. Risks reinforcing concerns about grade inflation cited by Kevin in the Feb 27 notes, a deeper philosophical concern that depends on calibration being right. Less acute than [ga-flaky-grading-and-incomplete-markup](ga-flaky-grading-and-incomplete-markup.md) (which is a correctness issue), but undermines the GA's pedagogical value.

## Affected versions

Reported 2026-02-24, with follow-up observations on 2026-03-02 noting "5/16 grades were a 77%" in one batch.

## Suspected cause

Prompt or rubric anchoring the LLM toward a central tendency. Possibly the rubric's category weights aren't producing enough spread, or the LLM's output distribution is being clamped. Worth doing a calibration study against a teacher-graded gold set.

## Workaround

Teacher manually adjusts category scores in the gradebook (already supported).
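
## Calibration check sketch

The calibration study suggested under Suspected cause could start with two simple measurements: how much of a batch lands on the single most common score, and how GA's score spread compares with a teacher's spread on the same essays. The sketch below is one possible starting point, not the GA's actual tooling. The function names `mode_share` and `spread_ratio` are hypothetical, and the score lists are invented for illustration only; they are not data from the reports.

```python
from collections import Counter
from statistics import pstdev


def mode_share(scores):
    """Return (most common score, fraction of the batch that received it)."""
    value, count = Counter(scores).most_common(1)[0]
    return value, count / len(scores)


def spread_ratio(ga_scores, teacher_scores):
    """Ratio of GA score spread to teacher score spread (population std dev).

    Values well below 1.0 suggest GA compresses the range relative to the teacher.
    """
    teacher_sd = pstdev(teacher_scores)
    if teacher_sd == 0:
        raise ValueError("teacher scores have no spread; ratio is undefined")
    return pstdev(ga_scores) / teacher_sd


# Hypothetical scores for the same six essays; NOT real data from the report.
ga = [77, 77, 78, 76, 77, 79]
teacher = [62, 85, 91, 70, 77, 95]

value, share = mode_share(ga)
ratio = spread_ratio(ga, teacher)
print(f"most common GA score: {value} ({share:.0%} of batch)")
print(f"GA/teacher spread ratio: {ratio:.2f}")
```

A batch like the one noted on 2026-03-02 (5 of 16 grades at 77%) would show up as a mode share of about 0.31. What counts as "too compressed" for either metric is a threshold the team would need to choose after looking at a real gold set.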