Essay grading score variability without content changes

Summary

Grading assistant returned 87 on the original essay, then 80 → 85 → 87 on the same unchanged revised essay



# Essay grading score variability without content changes

The grading assistant returns meaningfully different scores across repeated grading runs on the same essay, with no changes to the essay content between runs.

## Reproduction

1. As a student, submit an essay to the grading system. Note the score (observed: 87).
2. Review the feedback and revise the essay. In this case, only the training assistant's purple syntax, grammar, and style recommendations were applied.
3. Submit the revised essay to the grading assistant. Note the new score (observed: 80).
4. Without changing the essay, trigger grading again. Note the score (observed: 85).
5. Trigger grading once more. Note the score (observed: 87).
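
The repeated-grading part of these steps could be scripted so the variability is captured in one run. The sketch below is illustrative only: `repeat_grading` and `make_replay_grader` are hypothetical helpers, and the replay grader stands in for the real grading assistant, whose API is not described in this report.

```python
from typing import Callable, List


def repeat_grading(grade_fn: Callable[[str], int], essay: str, runs: int = 3) -> List[int]:
    """Grade the same essay text `runs` times and return every score in order."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return [grade_fn(essay) for _ in range(runs)]


def make_replay_grader(scores: List[int]) -> Callable[[str], int]:
    """Hypothetical stand-in grader that replays a fixed list of scores."""
    remaining = iter(scores)

    def grade(_essay: str) -> int:
        # Ignores the essay text; a real grader would call the grading assistant here.
        return next(remaining)

    return grade


# Replay the three scores observed on the unchanged revised essay (steps 3-5).
observed = repeat_grading(make_replay_grader([80, 85, 87]), "revised essay text", runs=3)
print(observed)
```

Swapping the replay grader for a function that calls the actual grading assistant would turn this into a reproduction script for other essays.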

## Expected

Grading the same essay content should return a consistent score, or at minimum scores within a very tight, explainable band (e.g., ±1 due to rounding). Grading is a core trust signal — students and teachers rely on it being stable.
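
One way to make the "tight band" expectation checkable is sketched below. It assumes the band is measured around the median of the repeated scores; `is_stable` and its `tolerance` parameter are hypothetical names, not part of the platform.

```python
from statistics import median
from typing import Sequence


def is_stable(scores: Sequence[float], tolerance: float = 1.0) -> bool:
    """Return True when every score lies within `tolerance` points of the median."""
    if not scores:
        raise ValueError("need at least one score")
    center = median(scores)
    return all(abs(score - center) <= tolerance for score in scores)


# The three scores observed on the unchanged revised essay fail a +/-1 band.
print(is_stable([80, 85, 87]))  # False
# A run that differs only by rounding-sized amounts passes.
print(is_stable([86, 87, 88]))  # True
```

A check like this could serve as an acceptance criterion for any fix: repeated grading of identical content should pass it.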

## Actual

Scores across four grading runs were **87 → 80 → 85 → 87**. The first run graded the original essay; the last three graded the same revised text and still spanned 7 points (80 to 87), which minor rounding differences cannot explain.

## Impact

High. The grading assistant is a core value proposition of the platform. Score variability without content changes undermines teacher and student trust in grades. If a teacher releases a grade of 80 and the student re-grades and sees 87, the credibility of the system is damaged. Potentially affects any student using the grading assistant, although the variability did not reproduce on two other papers (see Affected versions).

## Affected versions

Unknown — first observed 2026-05-04. Initial testing on two other student papers did not reproduce the variability, which may indicate this is input-sensitive (certain essay content or length triggers non-determinism) rather than a universal issue.

## Suspected cause

Non-deterministic LLM sampling (temperature > 0) in the grading assistant's model call, with no output normalization or score anchoring between runs. The grading rubric may also leave enough ambiguity that the model scores differently depending on which rubric criteria it weights first. Worth checking whether the grading assistant uses a fixed system prompt or whether any context (e.g., prior grading history) bleeds into the prompt.
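
If missing score anchoring turns out to be part of the cause, one possible mitigation is to key each score on a hash of the essay content so that re-grading identical text returns the first recorded score. The sketch below is an assumption-laden illustration, not the platform's implementation: `content_key`, `AnchoredGrader`, and `rubric_version` are hypothetical names, and the in-memory dictionary stands in for whatever persistent store the system would use.

```python
import hashlib
from typing import Callable, Dict


def content_key(essay: str, rubric_version: str) -> str:
    """Hash the whitespace-normalized essay text together with the rubric version."""
    normalized = " ".join(essay.split())
    payload = f"{rubric_version}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AnchoredGrader:
    """Wrap a grading function so identical content always returns its first score."""

    def __init__(self, grade_fn: Callable[[str], int], rubric_version: str) -> None:
        self._grade_fn = grade_fn
        self._rubric_version = rubric_version
        self._scores: Dict[str, int] = {}  # assumption: in-memory store for illustration

    def grade(self, essay: str) -> int:
        key = content_key(essay, self._rubric_version)
        if key not in self._scores:
            # Only call the underlying (possibly non-deterministic) grader once per content.
            self._scores[key] = self._grade_fn(essay)
        return self._scores[key]
```

Anchoring hides run-to-run variance from users but does not remove it: the first score recorded is still one sample from a noisy grader, so this would complement, not replace, fixing the sampling settings or rubric ambiguity.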

## Workaround

None for students. Teachers should be aware that grading scores may not be fully stable until this is resolved — avoid releasing grades immediately after a single grading run if score precision matters.
