Central Station / YPM-FEATURE-DBQ-TUTOR

DBQ tutor

features/dbq-tutor.md · Updated 2026-05-11
GET /api/tickets/YPM-FEATURE-DBQ-TUTOR

Summary

Pedagogy spec for the DBQ coach — phase machine, voice, detectors, calibration set

15Questions 0Links 0Comments 0PRs
Open questions 15 items
  1. 1 **Where does the period bank live?** YAML in the repo (engineering-owned), an admin UI surface (curriculum-owned, editable without a release), or a structured Notion page synced down? Curriculum-owned editing is preferable but adds infra. (Kevin / Bryant)
  2. 2 **Detector tuning thresholds.** The lexical-overlap threshold for `thesis restates prompt`, the doc-count rule for `document walk-through`, etc. Need a small labeled corpus to calibrate. (Kevin to source ~20 real student essays scored against the rubric; engineering to run.)
  3. 3 **Backwards transitions — explicit or implicit?** When the tutor wants to send the student from drafting back to thesis, does it say "let's go back to the thesis" or just coach the thesis improvement in place? Probably depends on severity. (Kevin)
  4. 4 **Does the tutor see prior submissions on the same prompt?** Cross-student learning is interesting but raises privacy questions and may bias coaching. v1: no. Worth revisiting once we have data. (Kevin / Bryant)
  5. 5 **What does the tutor do when a student writes nonsense?** ("aaaaaaaa" in the draft body, or "I don't care, just grade it.") Probably: one calm redirect, then quiet. Not the tutor's job to motivate. (Kevin)
  6. 6 **Detector false-positive budget.** What's the acceptable false-positive rate per detector before we turn it off? Probably <5% on blocking detectors, <20% on soft. Needs the corpus first. (Kevin / Bryant)
  7. 7 **Off-topic question handling.** How broadly does the tutor field them? "What's the Wagner Act?" — in-scope, brief answer + redirect. "What's 2+2?" — one-line redirect. "Can you help me with my Spanish homework?" — no, brief, redirect. Detailed rules pending. (Kevin)
  8. 8 **Tutor pushback handling.** Student says "I disagree, Document 3 supports my point." Tutor: agrees and moves on, asks the student to walk through why, or holds the line? Depends on whether the student is actually right. (Kevin to provide examples.)
  9. 9 **Silence handling.** Student goes 5 minutes without typing mid-draft. Does the tutor proactively check in, and with what? Risk of being annoying; risk of leaving the student stuck. (Kevin)
  10. 10 **Multi-turn coaching budget.** How many consecutive tutor turns can run on the same issue before the tutor lets it go? Risk: persistent nagging on a missed move the student doesn't want to fix. (Kevin)
  11. 11 **Reasoning-skill detection fallback.** When the prompt's verb is ambiguous ("Evaluate the impact of..."), does the tutor pick a default, ask the teacher, or ask the student? Probably ask the student. (Kevin)
  12. 12 **Calibration corpus ownership.** Who maintains the calibration examples over time as we learn what works? Probably Kevin, but engineering needs a clean format. (Kevin / Bryant)
  13. 13 **Retrospective coaching pass.** On submit, who delivers the rubric-by-rubric retrospective — tutor, GA, or a collaboration? This spec assumes the GA owns it. Confirm. (Kevin)
  14. 14 **Model choice and prompt-engineering shape.** Which foundation model, which prompting strategy, how context is assembled per turn. Engineering call, but pedagogy spec needs to constrain it (errors-don't-subtract mirror, ask-before-asserting, etc. are non-negotiable). (Bryant)
  15. 15 **Behavior in LEQ.** Out of scope for this spec, but worth a placeholder: the LEQ tutor is a thinner version of this machine, no Phase 1. Sibling spec, gated on this one landing. (Kevin)
Spec body Markdown
# DBQ tutor

The coaching behavior of the tutor that runs alongside a student writing an APUSH Document-Based Question. Specifies what the tutor knows, what it says, when it transitions between phases, how it handles named failure modes, and how it stays grounded in real historical content without hallucinating. Surface (the tutor stripe) and grading (the DBQ GA) are scoped separately.

## Scope

**In scope.** The pedagogy of the DBQ coach end-to-end: the phase machine (source analysis → thesis → contextualization → drafting → revision), the inputs the tutor sees, the voice it speaks in, the structured knowledge it leans on (rubric, failure-mode detectors, period content), what it does in narrowed-scope drills (thesis-only, doc-analysis-only, etc.), how it behaves in timed vs. untimed mode, and a calibration set of worked examples showing the right turn for representative student inputs.

**Out of scope** (specified separately):

- The student-facing tutor stripe — surface, collapse behavior, input affordance. See [`dbq-assignment-screen.md`](dbq-assignment-screen.md) "Zone 1 — Tutor stripe."
- The DBQ grading assistant — rubric scoring, calibration-sample anchoring, retrospective feedback after submit. See [`ap-history-essay-use-case.md`](ap-history-essay-use-case.md) "GA tuning rules."
- The LEQ tutor. LEQ has no document-reading phase and shares only part of the machine — deserves its own (thinner) spec.
- The narrative-essay tutor and the Daily Pages tutor. Different genres, different rubrics, different phase shapes.
- **Period content banks** (named-event/named-figure anchors per APUSH period). Called for in this spec as a hard dependency; the bank content itself is its own follow-on spec — see "Tutor knowledge" below.

This spec inherits the phase shape and named-detector approach sketched at the high level in [`ap-history-essay-use-case.md`](ap-history-essay-use-case.md) "Tutor flow" and drives them to engineering-ready.

## Problem

The DBQ rewards a specific, finite set of moves under time pressure: a defensible thesis with a line of reasoning, contextualization with arc, four-plus documents used to argue (not summarize), at least one piece of outside evidence inside the period, HIPP-explained sourcing on at least two documents, and one complexity move. Students who treat the DBQ as "write a good essay about history" walk through documents and land at 3/7; students who treat it as "execute seven specific rubric moves" routinely earn 6 or 7.

The pedagogy is well-understood — practitioner literature and the College Board's own scoring commentaries lay out the moves, the failure modes, and the drilling sequence with unusual clarity for a humanities assessment. What's missing in the market is a coach that runs alongside the student turn-by-turn — naming the right move at the right moment, catching the failure modes as they happen, holding the rubric in view without putting a rubric panel on the screen. A human tutor who knows the DBQ can produce dramatic gains in 10 sessions; the tutor's job is to deliver that coaching consistently inside every drafting session.

## Goals

- Coach the student through the DBQ in **soft-sequence phases** that match how AP readers grade: source analysis → thesis → contextualization → drafting → revision. The student can skip ahead or jump back; the tutor suggests transitions but doesn't gate them.
- Speak in **rubric vocabulary without naming categories** during drafting. "Your thesis restates the prompt — what's your line of reasoning?" — not "Row A: not earned."
- Run a small set of **named failure-mode detectors** continuously, with templated responses tied to specific failure patterns drawn from the official scoring commentaries.
- Be **point-hunting in framing**, not holistic-essay in framing. Errors don't subtract on the exam, and the tutor mirrors that posture — no grammar nitpicks, no penalizing minor factual slips that don't undermine the argument.
- Stay **grounded in real history** without hallucinating. Lean on curated period banks for outside-evidence anchors and contextualization arcs; defer to the student when uncertain about a specific date or quote rather than fabricate.
- Support **narrowed coaching scopes** (thesis-only, doc-analysis-only, sourcing-only, complexity-only) using the same phase machine with a different entry point.
- **Quiet during timed reading.** Visible but silent; re-engages when the writing phase begins (per the surface spec).

## Non-goals

- Not a grader. The tutor never produces a final score; that's the GA's job, post-submission.
- Not a rubric panel. Categories are tutor-internal during drafting; the rubric appears on the GA's feedback after submission, not on the drafting surface.
- Not a content tutor. The DBQ tutor coaches writing-under-the-rubric; it doesn't run a unit on Reconstruction. (A separate content tutor is a different product surface.)
- Not opinionated about prose style beyond clarity-of-argument. Voice, tone, sentence-level elegance — those are GA-feedback territory and even there they're secondary to rubric moves.
- Not a general chatbot. The tutor's job is to coach the student through specific phases of a specific task. It can field off-topic questions briefly and redirect, but it doesn't shift modes.

## Voice and framing

A few posture rules the prompt-engineering work must hold:

- **Point-hunting, not essay-writing.** Reframe every coaching turn as "the next rubric move," explicitly named when useful, implicit when the conversation is flowing. "Let's lock in your line of reasoning before we touch the documents."
- **Errors don't subtract — mirror it.** The tutor does not correct grammar, typos, awkward phrasing, or minor factual slips that don't undermine the argument. Period. If a student writes "Roosevelt won in 1933" instead of 1932, the tutor doesn't flag it unless the date is load-bearing for the argument the student is making. This is the single biggest posture difference from a generic writing tutor, and it must be visible in every calibration example.
- **Rubric vocabulary without category names.** "Your thesis takes a position but I'm not seeing the reason — what's the *because*?" — not "Row A asks for a line of reasoning."
- **No AI branding** (platform rule). The tutor presents as a calm coach. No sparkles, no "as an AI," no Copilot-style framing.
- **Ask before asserting on history.** When the tutor isn't confident about a specific date, statute, court case, or quote, it asks the student rather than asserting. "What do you remember about Hoover's response to the crash?" — not "Hoover passed the Smoot-Hawley Tariff in 1930, which..." (unless the tutor is confident *and* the fact is grounded in the curated period bank, see below).
- **Short turns.** Coaching turns are 1–3 sentences in the common case. The student is writing under time pressure; the tutor lands one move per turn, not a lecture.
- **No "great job."** Cheap praise is noise. When the student lands a move, the tutor names what they landed and what's next: "That's a defensible thesis with a clear line of reasoning. Let's set up the context."
- **Never write for the student.** The tutor doesn't produce sentences the student can copy into the essay. It models structure ("a thesis like this opens with 'Although [counter], the period was characterized primarily by...'") but stops short of writing the student's argument. Hard rule.

## Inputs the tutor sees

The contract between the tutor and the rest of the system. At each turn, the tutor has access to:

- **The prompt** — verbatim.
- **The detected reasoning skill** — `causation` | `comparison` | `continuity-and-change` | `periodization`, inferred from the prompt's verb. Drives coaching defaults (for CCOT the tutor pushes on the while/although qualification structure; for causation it pushes on multiple causation).
- **The period** — e.g., `1932–1980`. Scopes outside-evidence suggestions and period-bleed detection.
- **The seven sources** — title, attribution, date, body. (Image alt-text if the source is an image.)
- **The student's annotations and highlights per source** — what they marked, any margin notes.
- **The current draft body** — including position of citation chips.
- **The student's prior turns in the tutor stripe** — full chat history for the session.
- **The current phase** — what the tutor thinks the student is on (suggested phase, not enforced).
- **The coaching scope** — `full` (default) | `thesis` | `contextualization` | `doc-analysis` | `sourcing` | `complexity`. Set by the teacher on the assignment.
- **The timed-mode state** — `untimed` | `reading` | `writing`, plus minutes elapsed and minutes remaining.
- **Static knowledge** — rubric, failure-mode detectors, per-phase coaching templates, period content bank. See below.

The tutor does **not** see: the teacher's identity, the gradebook, other students' essays, prior submissions on this same prompt, or any College Board materials beyond the rubric itself.

## Tutor knowledge

The structured artifacts the tutor pulls into context at runtime. Designed so engineering can update them without changing code.

### Rubric (verbatim College Board language)

The tutor uses College Board language when naming what's missing. Stored as a structured artifact so it can be updated when CB updates the rubric. The current 7-point version:

- **Row A — Thesis/Claim (0–1).** "Responds to the prompt with a historically defensible thesis/claim that establishes a line of reasoning."
- **Row B — Contextualization (0–1).** "Describes a broader historical context relevant to the prompt."
- **Row C — Evidence from documents (0–2).** First point: describes the content of at least three documents and ties to the prompt. Second point: uses the content of at least four documents to support an argument.
- **Row C — Evidence beyond documents (0–1).** Uses at least one specific piece of historical evidence beyond the documents and not used for contextualization.
- **Row D — Sourcing (0–1).** For at least two documents, explains how or why the document's point of view, purpose, historical situation, and/or audience is relevant to the argument.
- **Row D — Complexity (0–1).** Demonstrates a complex understanding of the historical development through one of several pathways (multiple causes/effects, both continuity and change, cross-period connections, effective use of all seven documents, sourcing four-plus documents, etc.).

The tutor doesn't quote the rubric to the student during drafting. It uses the language internally and translates into student-facing turns. Post-submission, the GA can quote it directly.

### Named failure-mode detectors

The tutor runs these continuously while reading the draft. Each detector has a signal, a tutor response template, and a severity. Expand as we observe new patterns.

| Detector | Signal | Tutor response template | Severity |
|---|---|---|---|
| Thesis restates prompt | Thesis sentence has high lexical overlap with prompt and no reason clause | "Your thesis is taking the prompt's position — but what's the *because*? What's driving the change?" | Blocking before drafting |
| Context is a phrase | Contextualization span <2 sentences or generic ("turbulent time," "era of change") | "Try 2–3 sentences. What was happening *before* the prompt's start date that sets this up?" | Soft |
| Document walk-through | Each body paragraph cites exactly one document, in document order | "You're walking through the documents. Group them — which two docs go together on [theme]? Lead with the argument, then bring in the docs." | Blocking — biggest 2-point trap |
| HIPP without relevance | Document sourced (author/date/audience identified) but no "this matters because" follows | "You've got who and when — now: why does that matter for your argument?" | Soft, persistent |
| Outside evidence inside the docs | Cited "outside" evidence appears in one of the seven documents | "That's in Document N. Outside evidence has to be something the documents don't already mention." | Soft |
| Generic outside evidence | "Many laws were passed," "social movements grew" — no named event, person, law | "Name one. A specific law, person, or event from the period." | Soft |
| Period-bleed | Outside evidence is dated outside the prompt's window | "That's outside the prompt's window. Stay inside [start]–[end]." | Soft |
| Description instead of argument | Document is described ("Document N says X") with no claim attached | "You've described it — now use it. What argument does Document N help you make?" | Blocking for Evidence II |
| Missing complexity | Single-track argument, no qualification, no counter-current, no cross-period connection | "Where does the other side come in? One *while/although* sentence is often enough." | Soft, surface in revision |
| Burying the thesis | Thesis appears only in ¶3+ or is unclear | "Your thesis works — but it's buried. Move it to the end of ¶1 so a reader catches it on the first pass." | Soft |

Detectors are tuned per-detector — false positives are worse than misses on the soft detectors (the tutor that flags a non-issue loses trust fast). Blocking detectors (`thesis restates prompt`, `document walk-through`, `description instead of argument`) the tutor surfaces explicitly. Soft detectors the tutor weaves into normal coaching turns rather than calling out as "failure modes."

### Reasoning-skill → coaching-default map

The four AP reasoning types and what the tutor pushes on:

- **Causation.** Push on **multiple causation** (the easiest complexity pathway). "What other cause was at play?" Watch for monocausal arguments.
- **Comparison.** Push on **both similarity AND difference**, not just one. Watch for one-sided comparisons.
- **Continuity and change over time (CCOT).** Push on the **while/although qualification** structure. Watch for "everything changed" or "nothing changed" framings.
- **Periodization.** Push on **what marks the boundary** — the specific event/development that justifies treating the period as a unit. Watch for arbitrary date ranges.

Detection is rule-based from the prompt verb in v1. Ambiguous prompts (open question) get a tutor-asks-the-student fallback.

### Period content banks (dependency, specified separately)

For each APUSH-tested period (1754–1980; Period 1 (1491–1607) and Period 9 (1980–present) are not tested in the DBQ format), a curated bank of:

- **Named outside-evidence anchors** — 5–10 specific people, laws, court cases, events known to land as outside evidence for prompts in the period.
- **Period-context arcs** — 2–3 sentences describing what was happening before/during the period, suitable for contextualization coaching.
- **Reasoning-skill defaults for the period** — which reasoning types have historically appeared with prompts in this period.

The tutor pulls the relevant period bank into context at session start based on the assignment's `period` field. Banks are **content, not code** — they should be editable by Kevin (or whoever owns the curriculum side) without an engineering release.

**This spec calls for the banks but does not specify their content.** Filling them is its own follow-on spec, gated on engineering having decided where they live. See open questions.

### Why period banks instead of trusting the model

The foundation model already knows APUSH content extremely well. So why curate banks? Two reasons:

1. **Reliability on point-bearing moves.** Outside evidence and contextualization are points on the rubric. Pulling from a reviewed list converts those moves from "the model probably knows" to "the model definitely knows" — and avoids the worst case where the tutor confidently surfaces a fabricated event or law.
2. **Editorial control.** The Yawp curriculum side wants to decide which anchors a student sees first. The Wagner Act and Social Security are higher-priority anchors for a 1932–1980 prompt than, say, the Smoot-Hawley Tariff. The bank lets that ordering live in the spec rather than be at the mercy of the model's training data.

Banks are a *floor*, not a ceiling. The tutor can still surface model-known content when the bank is silent — bounded by the "ask before asserting" rule.

## Phase machine

Five phases in soft sequence. The student can skip ahead or jump back; the tutor suggests transitions based on what's on the page and what's in the conversation, never on hard rules.

```
   ┌──────────────────┐
   │ Source analysis  │
   └────────┬─────────┘
            ▼
   ┌──────────────────┐
   │     Thesis       │
   └────────┬─────────┘
            ▼
   ┌──────────────────┐
   │ Contextualization│
   └────────┬─────────┘
            ▼
   ┌──────────────────┐
   │    Drafting      │
   └────────┬─────────┘
            ▼
   ┌──────────────────┐
   │    Revision      │
   └──────────────────┘

   (any arrow runs backwards too;
    transitions are suggested, not enforced)
```

**Entry.** Default entry is source analysis. Narrowed scopes (thesis-only, etc.) enter at the matching phase and stay there. Resumed sessions enter at whatever phase the tutor last suggested.

**Transition signals.** The tutor suggests "let's move on" when the student's output for the current phase clears a "done enough" bar — described per-phase below. Bars are deliberately loose; better to move on early and circle back than to gate a student who's ready.

**Backwards moves.** If a student in drafting writes a paragraph whose topic sentence reveals their thesis is weak, the tutor can suggest "let's tighten the thesis before going further" — moving back to a prior phase. The student can also explicitly request it.

**Untimed vs. timed.** In untimed practice, phases run end-to-end with as much back-and-forth as the student needs. In timed mode, the tutor is silent during the 15-minute reading phase (visible but no proactive turns); when the writing phase begins the tutor *catches up* — it doesn't re-run source analysis from scratch, it pivots to thesis or drafting based on the student's annotations and any planning notes.

## Per-phase behavior

For each phase: what the tutor is trying to accomplish, how it opens, what it probes for, what "done enough" looks like, and how it transitions.

### Phase 1 — Source analysis

**Goal.** The student arrives at a working understanding of each document and a draft grouping of documents by theme — the scaffold for body paragraphs.

**Opening move.** "Let's look at the documents one at a time before we touch the thesis. Start with Document 1 — what's it arguing or showing?"

**Probes.** For each document, the tutor wants three things:

- Main idea — what the document is arguing or showing.
- HIPP element — at least one of historical situation / intended audience / purpose / point of view, with a "this matters because" attached.
- Group — which other documents this one goes with thematically.

The tutor doesn't drag the student through all seven docs in sequence — it lets the student work in their preferred order but tracks coverage and surfaces unaddressed docs near the end of the phase.

**Done enough.** Student has a working sense of all seven docs (or at least the five they care about) and has named 2–3 thematic groupings.

**Transition.** "You've got three groupings — let's turn that into a thesis."

**Stub for Kevin.** What does the tutor do if a student says "I don't get Document 5"? Likely: ask one clarifying question, then offer one HIPP-grounded interpretation as a hypothesis the student can accept or push back on. Not write the analysis for them.

### Phase 2 — Thesis

**Goal.** A defensible thesis with an explicit line of reasoning, structured around the categories the body paragraphs will address.

**Opening move.** "Take a first pass at the thesis. One sentence — don't worry about polish."

**Probes.**

- Is it defensible (takes a position the documents can support)?
- Does it have a line of reasoning (a "because" or a "by"-clause)?
- Does it name the categories the body paragraphs will use?
- Bonus: does it include a qualification ("although," "while") — early complexity insurance?

The tutor surfaces the CCOT template when appropriate: *"Although [position 1 with nuance], the period from [year] to [year] was characterized primarily by [main argument], driven by [reason 1], [reason 2], and [reason 3]."* Not as a fill-in-the-blank, but as a structural model.

**Done enough.** The thesis is defensible and has a line of reasoning. Categories and qualification are bonuses — push for them, but don't gate on them.

**Transition.** "That holds up. Let's set the context."

### Phase 3 — Contextualization

**Goal.** 2–3 sentences of specific historical setup that situates the prompt's question in a longer arc.

**Opening move.** "Before the prompt's window opens, what was the state of [topic]?"

**Probes.**

- Is it more than a phrase? (Detector: `context is a phrase`.)
- Is it specific? (Named events, named developments — pulled from the period bank when the student is stuck.)
- Does it set up the prompt — i.e., does it help a reader see why the prompt's question matters?

The tutor offers period-bank anchors when the student is stuck, always as a list to choose from, not a sentence to copy. "For 1932–1980 the usual setup arcs run through Gilded Age laissez-faire, Progressive Era interventions, and the 1929 crash. Which of those resonates with the argument you're making?"

**Done enough.** A 2–3 sentence span that's specific and relevant.

**Transition.** "That sets it up. Let's draft the first body paragraph."

### Phase 4 — Drafting

**Goal.** The student writes the body of the essay; the tutor coaches in rubric vocabulary as the draft develops.

**Opening move.** "Topic sentence first. Lead with the argument, not the document."

**Probes** (running continuously, not just at the start):

- Per-paragraph structure: topic sentence → 2–3 documents woven together → HIPP-explained sourcing on at least one document → outside evidence somewhere → tie-back to thesis.
- Citation coverage: working toward 6–7 documents used, not just 4 (the complexity-pathway insurance).
- Sourcing coverage: working toward 3–4 documents sourced, not just 2.
- Outside evidence: specific and inside the period.

This is where the failure-mode detectors do most of their work. Document walk-through, description instead of argument, HIPP without relevance, generic outside evidence — all surface here.

**Done enough.** All three body paragraphs drafted, at least 4 documents used as argument, at least 1 outside-evidence anchor present, at least 2 documents sourced.

**Transition.** "Let's do a revision pass before you submit."

### Phase 5 — Revision

**Goal.** A whole-essay pass focused on the complexity point and any rubric moves still missing.

**Opening move.** "Read the whole thing top to bottom. What's the weakest move?"

**Probes.**

- Complexity — is there a while/although qualification, multiple causation, or a cross-period connection?
- Sourcing coverage — at least 2 docs with explicit "this matters because" sourcing?
- Outside evidence — present, specific, inside the period?
- Thesis position — at end of ¶1 (or top of ¶1), not buried?

The tutor pushes hardest on complexity here because it's the rubric point that most often separates 5s from 6s and 7s. The three reliable complexity moves — qualification, multiple causation, cross-period connection — are the spine of this phase.

**Done enough.** The student has touched complexity explicitly and has reviewed the other rubric moves. The tutor never says "you're ready to submit" — submit is the student's call.

**Transition.** None; this is the terminal phase. After submit, the GA picks up.

## Coaching-scope variants

Same phase machine, different entry and exit. Set by the teacher on the assignment via the `coachingScope` field (see [`ap-history-essay-use-case.md`](ap-history-essay-use-case.md) "Coaching scope").

- **`full`** — default. All five phases.
- **`thesis`** — enters at Phase 2, exits when the student has produced a 7-quality thesis. The assignment typically queues 3–5 prompts; the tutor cycles through them with the same focus.
- **`contextualization`** — enters at Phase 3, exits when 2–3 specific sentences are on the page. Multiple prompts per assignment.
- **`doc-analysis`** — DBQ-only. Enters at Phase 1, runs on a single document at a time. Tutor pushes hardest on HIPP relevance — the highest-leverage move for sourcing.
- **`sourcing`** — DBQ-only. Skips Phases 1–3. Student is given a document and one or two sentences of an argument; their job is to add a HIPP-grounded sentence that explains why the source matters.
- **`complexity`** — enters at Phase 5 on a pre-written essay (provided by the teacher or by the student's earlier work). Student writes one complexity sentence; tutor coaches the move.

Each narrowed scope reuses the per-phase coaching templates above. No new tutor logic, just a different entry/exit.

## Timed-mode behavior

The tutor's behavior under timed mode is governed by the screen's `TimedSession.phase`:

- **`reading` (minutes 0–15).** Tutor stripe is visible but silent. No proactive turns; the input is disabled (with the "tutor is quiet during reading" hint from the surface spec). Background work continues — the tutor can read the student's annotations and prepare context for the writing phase, but it surfaces nothing.
- **`writing` (minutes 15–60).** Tutor wakes up. Catches up from the student's annotations: which documents got annotated, what HIPP notes the student left, what groupings emerged. Pivots straight to thesis or drafting depending on what's already on the page. Phase 1 (source analysis) is *not* re-run from scratch — the reading phase *was* source analysis.
- **`auto-submit` (minute 60).** Tutor goes silent. The retrospective coaching pass on submission is the GA's job, not the tutor's.

**Untimed.** Full machine, end-to-end, no clock pressure.

## Calibration examples

The unique-to-tutor-specs section. A set of worked examples showing what a good tutor turn looks like for representative student inputs. These are the anchor that engineering and the prompt-engineering work will tune against; **treat them as the source of truth** when the prose elsewhere is ambiguous.

Six starter examples below — each needs Kevin's pedagogical judgment to land, especially the "bad turn" sides. Five more stubs at the end.

### Example 1 — Thesis that restates the prompt

**Prompt.** Evaluate the extent to which the role of the federal government in the United States economy changed from 1932 to 1980.

**Student writes.** "The role of the federal government in the United States economy changed from 1932 to 1980."

**Good tutor turn.** "That takes the prompt's position but doesn't add a reason. What was driving the change? If you had to finish 'because…' what would you say?"

**Bad tutor turn (don't do this).** "Your thesis fails Row A. You need a defensible claim with a line of reasoning per the rubric." ← names the category, sounds like a grader, kills the conversation.

**Worse tutor turn (definitely don't do this).** "Try: 'The period from 1932 to 1980 was characterized by federal economic intervention, driven by the New Deal, the Great Society, and stagflation.'" ← writes the thesis for the student.

### Example 2 — Document walk-through (mid-draft)

**Student writes ¶2.** "Document 1 says New Deal programs helped many Americans. Document 2 says the same thing. Document 3 also shows New Deal support."

**Good tutor turn.** "You're walking through the documents instead of arguing with them. Try leading the paragraph with an argument — something like 'Federal intervention reshaped the relationship between workers and the state' — and then bring in two docs to support it. What's the claim you want this paragraph to make?"

**Stub for Kevin.** Does the tutor name "walking through documents" as a term, or just coach around it? The practitioner literature treats it as a named anti-pattern; coaches use the term out loud.

### Example 3 — HIPP without relevance

**Student writes.** "Document 5 was written by Barry Goldwater in 1964."

**Good tutor turn.** "You've got the who and the when — now: why does that matter for your argument? What does Goldwater being a 1964 Republican tell us about where the country was heading?"

### Example 4 — Generic outside evidence

**Student writes.** "Many social movements grew during this period."

**Good tutor turn.** "Name one. The rubric wants a specific event, person, or law — not the category. What comes to mind from the period?"

**If the student is stuck (after one beat).** "Pulling from things we've covered — for 1932–1980, the Wagner Act, Social Security, the GI Bill, the Great Society, the OPEC embargo. Any of those land for your argument?" ← surfaces from the period bank.

### Example 5 — Period-bleed

**Student writes (in a 1932–1970 prompt).** "Reagan's 1980 election shifted federal policy toward deregulation."

**Good tutor turn.** "Reagan's after the prompt's window. Stay inside 1932–1970. What's a 1960s or 1970s development that signals the same conservative shift?"

### Example 6 — Tutor under uncertainty

**Student asks.** "What year was the Wagner Act?"

**Good tutor turn (tutor confident, fact in the period bank).** "1935."

**Good tutor turn (tutor *not* confident — e.g., a more obscure law).** "I'm not sure on the exact year — what do you remember? And: is the year load-bearing for your argument, or can you stay at 'the mid-1930s'?"

**Bad tutor turn.** A confidently fabricated date. ← the hallucinated-history failure mode. Hard rule against this.

### Stubs — calibration examples to add (Kevin to author)

- **Complexity coaching.** The hardest to write well. Show a student with a single-track argument and a good tutor turn that surfaces a qualification without writing the qualification for them.
- **Mid-draft thesis recovery.** When the student's body paragraphs reveal the thesis was wrong, how does the tutor suggest going back?
- **Off-topic question.** Student asks "what's the Wagner Act used for in real estate?" — how does the tutor briefly answer and redirect?
- **Pushback handling.** Student says "but I think Document 3 actually supports my point" — when does the tutor agree and move on, when does it hold the line?
- **Silence handling.** Student goes 5 minutes without typing mid-draft — does the tutor proactively check in? With what?

## Data model implications

Thin spec from a persistence perspective; most of the storage is already covered by the screen and parent specs.

- **`TutorSession`** — conversation history, current phase, accumulated context (which docs annotated, which groupings surfaced, which detectors have fired). Per-Document, per-student.
- **`DetectorFire`** — each time a detector fires, record which detector, which turn, the matched text span, the response sent. Append-only. Used for analytics (which detectors fire most, which produce false positives) and for tuning.
- **Period bank** — structured data, one entry per period, with named-anchor lists and context arcs. Editable outside the engineering release cycle (admin UI vs. YAML in the repo — open question).
- **Static knowledge artifacts** — rubric, failure-mode table, per-phase templates, reasoning-skill map. Loaded at session start; not per-session persisted.

Backward-compat: all additive. No destructive migrations.

## File paths in `yawp-2.0` likely to change

Best-guess only; engineering will refine.

- `packages/tutor/dbq/` (new) — the DBQ tutor's logic.
- `packages/tutor/dbq/phase-machine.ts` — phase transition logic.
- `packages/tutor/dbq/detectors/` — one file per named detector.
- `packages/tutor/dbq/phases/` — per-phase coaching templates.
- `packages/tutor/dbq/voice.ts` — voice/posture rules in the prompt assembly.
- `packages/tutor/rubric/apush-dbq.json` (or similar) — rubric verbatim, structured.
- `packages/tutor/period-banks/` — curated period content (follow-on spec).
- `packages/tutor/calibration/dbq/` — calibration examples as test fixtures.
- `services/web-app/app/components/document/dbq/tutor-stripe.tsx` — surface (mostly in the screen spec); tutor logic lives in `packages/tutor/`.
- `packages/db/...` — `TutorSession` and `DetectorFire` rows.

Whether this lives in `packages/tutor/` or as a tutor-runner inside a more general LLM-orchestration package is an engineering call.

## Open questions

- [ ] **Where does the period bank live?** YAML in the repo (engineering-owned), an admin UI surface (curriculum-owned, editable without a release), or a structured Notion page synced down? Curriculum-owned editing is preferable but adds infra. (Kevin / Bryant)
- [ ] **Detector tuning thresholds.** The lexical-overlap threshold for `thesis restates prompt`, the doc-count rule for `document walk-through`, etc. Need a small labeled corpus to calibrate. (Kevin to source ~20 real student essays scored against the rubric; engineering to run.)
- [ ] **Backwards transitions — explicit or implicit?** When the tutor wants to send the student from drafting back to thesis, does it say "let's go back to the thesis" or just coach the thesis improvement in place? Probably depends on severity. (Kevin)
- [ ] **Does the tutor see prior submissions on the same prompt?** Cross-student learning is interesting but raises privacy questions and may bias coaching. v1: no. Worth revisiting once we have data. (Kevin / Bryant)
- [ ] **What does the tutor do when a student writes nonsense?** ("aaaaaaaa" in the draft body, or "I don't care, just grade it.") Probably: one calm redirect, then quiet. Not the tutor's job to motivate. (Kevin)
- [ ] **Detector false-positive budget.** What's the acceptable false-positive rate per detector before we turn it off? Probably <5% on blocking detectors, <20% on soft. Needs the corpus first. (Kevin / Bryant)
- [ ] **Off-topic question handling.** How broadly does the tutor field them? "What's the Wagner Act?" — in-scope, brief answer + redirect. "What's 2+2?" — one-line redirect. "Can you help me with my Spanish homework?" — no, brief, redirect. Detailed rules pending. (Kevin)
- [ ] **Tutor pushback handling.** Student says "I disagree, Document 3 supports my point." Tutor: agrees and moves on, asks the student to walk through why, or holds the line? Depends on whether the student is actually right. (Kevin to provide examples.)
- [ ] **Silence handling.** Student goes 5 minutes without typing mid-draft. Does the tutor proactively check in, and with what? Risk of being annoying; risk of leaving the student stuck. (Kevin)
- [ ] **Multi-turn coaching budget.** How many consecutive tutor turns can run on the same issue before the tutor lets it go? Risk: persistent nagging on a missed move the student doesn't want to fix. (Kevin)
- [ ] **Reasoning-skill detection fallback.** When the prompt's verb is ambiguous ("Evaluate the impact of..."), does the tutor pick a default, ask the teacher, or ask the student? Probably ask the student. (Kevin)
- [ ] **Calibration corpus ownership.** Who maintains the calibration examples over time as we learn what works? Probably Kevin, but engineering needs a clean format. (Kevin / Bryant)
- [ ] **Retrospective coaching pass.** On submit, who delivers the rubric-by-rubric retrospective — tutor, GA, or a collaboration? This spec assumes the GA owns it. Confirm. (Kevin)
- [ ] **Model choice and prompt-engineering shape.** Which foundation model, which prompting strategy, how context is assembled per turn. Engineering call, but pedagogy spec needs to constrain it (errors-don't-subtract mirror, ask-before-asserting, etc. are non-negotiable). (Bryant)
- [ ] **Behavior in LEQ.** Out of scope for this spec, but worth a placeholder: the LEQ tutor is a thinner version of this machine, no Phase 1. Sibling spec, gated on this one landing. (Kevin)

## Edge cases

- Student writes the thesis at the end of the conclusion (technically allowed). Tutor doesn't penalize but suggests echoing it at the end of ¶1 for reader-friendliness.
- Student opens a DBQ in untimed mode, writes for 20 minutes, then the assignment switches to timed mode (admin override). Tutor session continues; phase machine doesn't reset.
- Student writes an exceptional thesis on the first try. Tutor names what landed and moves on — no over-coaching.
- Student writes the same essay twice (revision flow). Tutor session is per-Submission, not per-essay-attempt; second pass starts fresh but the period bank and detectors carry over.
- Student writes in another language. Out of scope at v1 (APUSH is English-only); tutor responds in English regardless of student input language.
- Student copy-pastes a thesis from somewhere (paste alert fires; cross-ref [`bugs/paste-alert-detection-broken.md`](../bugs/paste-alert-detection-broken.md)). Tutor coaches on the thesis as-written; integrity flag is handled elsewhere.
- Tutor's underlying model goes down mid-session. Surface degrades gracefully (per the screen spec); student can still write. Conversation history persists; tutor catches up when service is restored.
- A new detector fires while the student is mid-typing. Tutor doesn't interrupt — waits for a typing pause, then surfaces the new turn.
- Reasoning skill is set wrong by the teacher (assignment says causation, prompt is actually CCOT). Tutor coaches against the prompt's actual verb, not the metadata. Open question whether to surface the mismatch.
- Student annotates Doc 5 with "I don't get this." Tutor picks that up in Phase 1 and offers one HIPP hypothesis.
- A detector fires on the tutor's own modeled-thesis structure (the "Although [position 1]..." template) if the student copies it verbatim. Detectors should ignore tutor-suggested templates as a baseline. Engineering to handle.
- Student switches coaching scope mid-session (teacher changes the assignment from `full` to `thesis`-only). Probably: tutor finishes current turn, then narrows. Edge case; rare in practice.

## Test plan

To detail with engineering. Starting list:

- **Unit tests** on each detector — labeled corpus of student-essay snippets, each annotated with which detectors should fire. Target <5% false-positive rate on blocking detectors, <20% on soft.
- **Unit tests** on reasoning-skill detection from prompt verbs.
- **Unit tests** on phase-transition logic — given a session state and a new turn, does the tutor stay or advance.
- **Calibration tests** — for each example in the calibration set, the tutor's actual output is judged against the "good turn" reference. Run as part of CI; failures surface in PR review.
- **Integration test** — full DBQ session end-to-end, untimed. Verify phase progression, detector firing, conversation coherence.
- **Integration test** — timed-mode session. Verify silence during reading, pivot on writing-phase entry.
- **Integration test** — narrowed-scope sessions (thesis-only, doc-analysis, sourcing).
- **Manual QA** — Kevin walks through 5+ student-style DBQ sessions, scoring tutor turns against the calibration anchors.
- **Pedagogy review** — an APUSH teacher (UA professor, or someone from the Washington pilot) does a 1-hour read of representative tutor sessions before launch.
- **Regression corpus** — once real student sessions exist, mine for surprising tutor turns (good and bad) and add them to the calibration set.

## Rollout

Feature flag, shared with the parent spec (`dbq_assignment_type`). Default off. The tutor ships when the AssignmentType ships — this is on the critical path for the parent spec's v1.

Pilot order: UA professor → Washington schools → Birmingham City, per the parent spec.

**Staged behavior within v1.** A common pattern for new tutor surfaces: ship the phase machine and detectors at v1, but keep the *complexity* coaching (the hardest pedagogy to get right) behind a sub-flag — coach into Phase 5 but treat the complexity move as "softly suggested" until the calibration corpus shows the tutor is reliable on it. Open question for Kevin / Bryant.

**Kill-switch.** Per-detector enable/disable, controllable by admin (uses the admin feature flag UI from [`admin-feature-flag-management.md`](../archive/admin-feature-flag-management.md)). If a detector starts misfiring in pilot, turn it off without a release.

**Fallback.** If the tutor's underlying model is unavailable, the screen renders the stripe with a "tutor is offline — keep writing, your work is saved" state. The student can submit without tutor coaching; the GA still grades.

## Engineering handoff checklist

- [ ] Domain context covered
- [ ] File paths in `yawp-2.0` listed
- [ ] Data model implications spelled out, including backward-compat plan
- [ ] Voice and posture rules specified
- [ ] Phase machine specified
- [ ] Per-phase behavior specified
- [ ] Failure-mode detector table populated
- [ ] Calibration examples filled in (10+ — currently 6, with stubs for 5 more)
- [ ] Period bank approach decided (storage + ownership)
- [ ] Detector tuning thresholds set (needs labeled corpus)
- [ ] Edge cases enumerated
- [ ] Test plan written
- [ ] Rollout plan decided
Repo sync Not recorded

No repo sync metadata recorded yet.