
Agentic Backlog Triage Pipeline

Why This Exists

Engineers spend significant time on repetitive, well-bounded tickets that AI could handle autonomously. But naive "give the AI a ticket and let it code" approaches fail because they lack risk assessment — an AI agent shouldn't be touching payment logic, database migrations, or public API contracts without guardrails.

The triage pipeline solves this by scoring every ticket for AI-readiness before any code is written. It separates the question "can AI do this?" from "should AI do this?" using a deterministic classification system that no LLM can override.

The Full Pipeline

The system is designed as an 8-stage pipeline. Stages 1-4 are the triage/planning phase. Stages 5-8 are the execution/review phase.

┌─────────────────────────────────────────────────────────────────┐
│  TRIAGE PHASE (read-only, no code changes)                      │
│                                                                 │
│  1. Ingest    — Normalize Linear tickets, extract signals       │
│  2. Score     — 7-dimension rubric (LLM) + classify (no LLM)    │
│  3. Cluster   — Group related tickets, generate context docs    │
│  4. Plan      — Read-only planner generates execution plans     │
├─────────────────────────────────────────────────────────────────┤
│  EXECUTION PHASE (code changes in isolated worktrees)           │
│                                                                 │
│  5. Execute   — Worker agent implements in sandboxed worktree   │
│  6. Review    — Separate reviewer agent, tiered by category     │
│  7. Human     — Human sign-off required for all PRs             │
│  8. Rescue    — Async agent inspects failures + false negatives │
└─────────────────────────────────────────────────────────────────┘

Key Design Decisions

  • The orchestrator is NOT an LLM. It's conventional Go code managing worker lifecycle, enforcing budgets, and gating transitions between stages. LLMs are workers within the pipeline, not the pipeline itself.
  • Per-ticket plans flow independently. Tickets are clustered for context sharing, but each ticket moves through the pipeline on its own. A slow ticket doesn't block others.
  • Decision log per ticket per stage. Every classification, score, and gate result is persisted for auditability and rubric tuning.
  • Workers run in ephemeral git worktrees with file allowlists and pre-commit hooks. The AI cannot touch files outside its approved scope.

Rollout Phases

The pipeline is being deployed incrementally:

Phase    Stages  What's enabled                                                      Status
Phase 1  1-3     Triage only: score, classify, cluster                               Shipped
Phase 2  1-4     Add plan generation + validation, no execution                      Shipped
Phase 3  1-7     Autonomous execution for AI_DEFINITE only (blastRadius 1 or less)   In progress
Phase 4  1-8     Expand to AI_LIKELY with rescue agent                               Planned

Phase 3 adds constraint enforcement layers around execution:

  1. Worktree sandbox — isolated git worktree, no access to main branch
  2. File allowlist — .allowed-files manifest checked by pre-commit hook
  3. Branch protection — target branch requires PR approval
  4. Process-level budget — cost ceiling per category (AI_DEFINITE: 500K tokens/30min, AI_LIKELY: 1M tokens/60min)

Phase 4 adds the rescue agent: an async agent that inspects false negatives (tickets classified as HUMAN_ONLY that could have been automated) and failed executions (agent got stuck, tests don't pass after max iterations).


Stage Details

Stage 1: Ingest

Normalizes raw Linear tickets into a standard format with extracted signals.

Signal extraction:

  • mentionsFiles — file paths extracted via regex from description
  • domains — inferred from labels, file extensions, description keywords (frontend, backend, graphql, testing, database, api)
  • dependencies — external dependencies mentioned
  • labels — direct from Linear
  • acceptanceCriteria — detected via keyword matching (explicit vs implicit)
  • hasDesignSpec — keyword detection for design/spec references
  • commentCount, lastUpdated, teamKey, projectName, estimate
  • staleness TTL — configurable hours before a normalized ticket is considered stale
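
As one illustration of signal extraction, mentionsFiles can be pulled out of a description with a pattern like this (a sketch; the regex and extension list in the real ingest.go may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// filePathRe matches slash/dot-separated tokens ending in a known source
// extension. Hypothetical pattern for illustration only.
var filePathRe = regexp.MustCompile(`[\w./-]+\.(?:go|ts|tsx|js|rb|py|sql|graphql)\b`)

// extractFilePaths returns likely file paths mentioned in a ticket description.
func extractFilePaths(description string) []string {
	return filePathRe.FindAllString(description, -1)
}

func main() {
	desc := "Crash in src/api/handlers/create_item.go when estimate is nil; see docs/runbook.md"
	fmt.Println(extractFilePaths(desc)) // runbook.md is filtered out: .md is not a source extension here
}
```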

Stage 2a: Score

Claude scores each ticket on a 7-dimension rubric (0-5 scale). Scoring runs concurrently with configurable parallelism.

Positive dimensions (higher = better for AI):

Dimension           What it measures
clarity             Are requirements explicit and testable?
codeLocality        Is the change confined to one module?
patternMatch        Has this type of problem been solved before in the codebase?
validationStrength  Can success be proven with automated tests?

Negative dimensions (higher = worse for AI):

Dimension         What it measures
dependencyRisk    Unknown infra, external APIs, cross-team sequencing
productAmbiguity  UX judgment calls, stakeholder interpretation
blastRadius       How bad if the implementation is wrong?

Claude also outputs uncertainAxes (which dimensions it's least confident about) and reasons (why it scored each dimension the way it did).

Example — well-scoped backend bug fix:

clarity: 4, codeLocality: 5, patternMatch: 4, validationStrength: 5
dependencyRisk: 0, productAmbiguity: 0, blastRadius: 1
→ AI_DEFINITE

Example — vague cross-team feature:

clarity: 1, codeLocality: 1, patternMatch: 1, validationStrength: 2
dependencyRisk: 4, productAmbiguity: 4, blastRadius: 3
→ HUMAN_REVIEW_REQUIRED

Stage 2b: Classify

Classification is fully deterministic — no LLM involved. This is intentional: the scoring LLM provides signal, but the classification decision is made by a conventional decision tree that humans can audit and tune without prompt engineering.

Categories:

Category               Meaning
AI_DEFINITE            Safe for fully autonomous execution
AI_LIKELY              Probably safe, but review the plan first
HUMAN_REVIEW_REQUIRED  Needs human judgment before AI proceeds
HUMAN_ONLY             Must be done by a human

Decision tree:

1. Hard-stop keywords? → HUMAN_ONLY
   payments, billing, authentication, authorization,
   database migration, public API, incident/sev1/sev2,
   legal/compliance, multi-repo

2. Soft-stop keywords? → HUMAN_REVIEW_REQUIRED
   feature flag, staged rollout, deploy coordination, release train

3. Acceptance criteria missing? → HUMAN_REVIEW_REQUIRED

4. Gate thresholds (ALL must pass for AI_DEFINITE):
   clarity >= 2
   blastRadius < 3
   productAmbiguity < 3
   dependencyRisk < 3

5. If 3+ gates pass → AI_LIKELY
   Otherwise → HUMAN_REVIEW_REQUIRED

Every classification produces a full audit trail: which gates passed/failed, which hard stops matched, and why.
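
The decision tree above can be sketched as conventional code. This is illustrative only — the Scores struct, the function signature, and the abbreviated keyword lists are assumptions, not the real classifier.go:

```go
package main

import (
	"fmt"
	"strings"
)

// Scores holds the rubric dimensions the gates consult (0-5 scale).
type Scores struct {
	Clarity, BlastRadius, ProductAmbiguity, DependencyRisk int
}

// Abbreviated keyword lists for illustration; the real configuration
// carries the full hard-stop and soft-stop sets from the docs above.
var hardStops = []string{"payments", "billing", "authentication", "database migration"}
var softStops = []string{"feature flag", "staged rollout", "release train"}

func classify(text string, hasAcceptanceCriteria bool, s Scores) string {
	t := strings.ToLower(text)
	for _, k := range hardStops { // 1. Hard-stop keywords
		if strings.Contains(t, k) {
			return "HUMAN_ONLY"
		}
	}
	for _, k := range softStops { // 2. Soft-stop keywords
		if strings.Contains(t, k) {
			return "HUMAN_REVIEW_REQUIRED"
		}
	}
	if !hasAcceptanceCriteria { // 3. Missing acceptance criteria
		return "HUMAN_REVIEW_REQUIRED"
	}
	// 4. Gate thresholds: all four must pass for AI_DEFINITE.
	gates := []bool{s.Clarity >= 2, s.BlastRadius < 3, s.ProductAmbiguity < 3, s.DependencyRisk < 3}
	passed := 0
	for _, g := range gates {
		if g {
			passed++
		}
	}
	switch {
	case passed == len(gates):
		return "AI_DEFINITE"
	case passed >= 3: // 5. Three of four gates pass
		return "AI_LIKELY"
	default:
		return "HUMAN_REVIEW_REQUIRED"
	}
}

func main() {
	fmt.Println(classify("Fix nil check in item handler", true, Scores{4, 1, 0, 0})) // AI_DEFINITE
}
```

Because this is plain Go, changing a threshold or adding a hard-stop keyword is a code review, not a prompt change.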

Stage 3: Cluster

Related tickets are grouped using greedy single-linkage clustering based on signal overlap:

Signal type          Weight
Shared domain        0.5 per overlap
Shared file mention  0.5 per overlap
Shared dependency    1.0 per overlap
Merge threshold      2.0 minimum
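
The pairwise merge score under those weights can be sketched as follows (illustrative; the Signals type is an assumed subset of cluster.go's actual types):

```go
package main

import "fmt"

// Signals is a hypothetical subset of a normalized ticket's signals.
type Signals struct {
	Domains, Files, Deps []string
}

// overlap counts elements present in both slices.
func overlap(a, b []string) int {
	seen := map[string]bool{}
	for _, x := range a {
		seen[x] = true
	}
	n := 0
	for _, x := range b {
		if seen[x] {
			n++
		}
	}
	return n
}

// mergeScore applies the documented weights: 0.5 per shared domain,
// 0.5 per shared file mention, 1.0 per shared dependency.
func mergeScore(a, b Signals) float64 {
	return 0.5*float64(overlap(a.Domains, b.Domains)) +
		0.5*float64(overlap(a.Files, b.Files)) +
		1.0*float64(overlap(a.Deps, b.Deps))
}

func main() {
	a := Signals{Domains: []string{"backend", "api"}, Files: []string{"src/api/items.go"}, Deps: []string{"stripe"}}
	b := Signals{Domains: []string{"backend"}, Files: []string{"src/api/items.go"}, Deps: []string{"stripe"}}
	s := mergeScore(a, b) // 0.5 + 0.5 + 1.0 = 2.0, which meets the 2.0 merge threshold
	fmt.Println(s, s >= 2.0)
}
```

Greedy single-linkage then merges any two clusters containing a ticket pair whose score reaches the threshold.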

Each cluster produces a Context Document — structured metadata that flows downstream to the planner and executor:

  • repoAreas — directories the cluster touches
  • knownPatterns — existing patterns in those areas (how similar work was done before)
  • validationPlan — how to test changes in this area
  • risks — what could go wrong
  • costCeiling — token budget and time limit per ticket in this cluster

Context documents are the primary mechanism for constraining the planner and executor to the right parts of the codebase.

Stage 4: Plan

A read-only Claude agent with Read, Grep, and Glob tools (no write access) explores the repository and generates an execution plan for each AI_DEFINITE and AI_LIKELY ticket. The planner receives the ticket, its classification, and the cluster's context document as input.

Plan schema:

{
  "ticketId": "ENG-1234",
  "approach": "Add new API endpoint with input validation...",
  "candidateFiles": ["src/api/handlers/create_item.go"],
  "newFiles": ["src/api/handlers/create_item_test.go"],
  "deletedFiles": [],
  "validation": ["go test ./src/api/handlers/..."],
  "rollback": "Revert the handler and remove the test",
  "stopConditions": ["If the items table doesn't exist, abort"],
  "uncertainties": ["Unclear if the index is needed for this query pattern"]
}

Plan validation (4 gates):

Gate                 What it checks                               Pass condition
files_exist          Do candidateFiles exist on disk?             More than 50% exist
within_repo_areas    Are files within the cluster's repoAreas?    Less than 50% out of scope
stop_conditions      Are stop conditions present and meaningful?  Non-empty
validation_commands  Do test commands use known runners?          Known runners (rspec, jest, yarn, npm, make, etc.)

Plans that fail validation cannot be executed. The validation results (which files are missing, which are out of scope) are shown in the UI so a human can decide whether to fix the plan or skip the ticket.

Stage 5: Execute (Phase 3+)

A worker agent runs in an isolated git worktree with the validated plan as its constraint document. The execution uses the same 9-step work pipeline as boatman work, but with the plan pre-loaded (skipping the planning step).

4-layer constraint enforcement:

  1. Worktree sandbox — all changes happen in an isolated worktree, main branch untouched
  2. File allowlist — .allowed-files manifest generated from the plan's candidateFiles, enforced by pre-commit hook
  3. Branch protection — the target branch requires PR approval before merge
  4. Process-level budget — hard token and time limits per category:
    • AI_DEFINITE: 500K tokens, 30 minutes
    • AI_LIKELY: 1M tokens, 60 minutes
    • Heartbeat-based stall detection with SIGTERM/SIGKILL escalation

A draft PR is created immediately after execution completes (before review), so work is preserved even if subsequent stages fail.

Stage 6: Review (Phase 3+)

A separate reviewer agent (ScottBott) reviews the diff. Review depth is tiered by category:

  • AI_DEFINITE — light pass: check for obvious issues, verify tests pass
  • AI_LIKELY — deep read: thorough review, check for edge cases, verify approach matches plan

If review fails, the refactor loop iterates up to max-iterations (default 3).

Stage 7: Human Sign-off (Phase 3+)

All PRs created by the pipeline require human approval before merge. This is a hard constraint in Phase 3 — no auto-merge regardless of classification.

Stage 8: Rescue (Phase 4)

An async agent that runs after execution completes, inspecting:

  • False negatives — tickets classified as HUMAN_ONLY that had simple, well-bounded solutions (used to tune the rubric)
  • Failed executions — agents that got stuck, exceeded budgets, or produced code that doesn't pass review (used to improve planning and constraint enforcement)

CLI Usage

# Score and classify a team's backlog
boatman triage --teams ENG --states backlog --limit 20
 
# With plan generation
boatman triage --teams ENG --states backlog --generate-plans --repo-path .
 
# Specific tickets
boatman triage --ticket-ids ENG-1234,ENG-5678
 
# Stream events for desktop integration
boatman triage --teams ENG --states backlog --emit-events

Flags

Flag              Default          Description
--teams           —                Team keys to fetch (comma-separated)
--states          —                Workflow states to filter
--limit           50               Maximum tickets to process
--ticket-ids      —                Specific ticket IDs (skips team/state filters)
--concurrency     3                Parallel Claude scoring calls
--generate-plans  false            Run Stage 4 plan generation
--repo-path       .                Repository path for plan generation
--emit-events     false            Emit JSON events to stdout
--output-dir      .boatman-triage  Decision log directory
--post-comments   false            Write classification comments to Linear
--dry-run         false            No side effects

Executing from Triage

When you execute a ticket from the triage results, the pre-generated plan is passed to the work pipeline via --plan-file. This skips the planning step and goes straight to execution with validated candidate files.

Triage Result → Click "Execute" → BoatmanMode session created
  └─ Plan written to temp file → boatman work ENG-1234 --plan-file /tmp/plan.json
     └─ Step 3 (Planning) skipped, uses pre-generated plan
     └─ Draft PR created as checkpoint after execution
     └─ Review/refactor loop runs
     └─ PR finalized and marked ready

Decision Log

Triage writes an audit log to --output-dir (default .boatman-triage/):

  • log.jsonl — One JSONL entry per ticket per stage: classification, scores, gate results, token usage, cost
  • context_{clusterId}.json — Full context document per cluster

The decision log is the primary mechanism for rubric tuning — by reviewing which tickets were classified correctly vs incorrectly, the gate thresholds and hard-stop keywords can be adjusted without changing any LLM prompts.


Event System

When --emit-events is passed, the pipeline emits JSON events to stdout for desktop integration:

Event                    When                      Key data
triage_started           Pipeline begins           ticketCount, teams
triage_fetch_complete    Tickets fetched           ticketCount
triage_scoring_started   Scoring begins            ticketCount, concurrency
triage_ticket_scoring    Individual ticket starts  ticketID, index, total
triage_ticket_scored     Individual ticket done    all 7 scores, index, total
triage_scoring_complete  All scoring done          scored/failed counts
triage_classifying       Classification starts     ticketCount
triage_clustering        Clustering starts         ticketCount
triage_complete          Pipeline done             full TriageResult JSON
plan_started             Plan generation begins    ticketCount
plan_ticket_planning     Individual plan starts    ticketID, index
plan_ticket_planned      Individual plan done      fileCount, stopConditions
plan_ticket_validated    Plan validated            passed, missing/out-of-scope files
plan_complete            All plans done            results array, stats

Model Configuration

claude:
  models:
    scorer: claude-sonnet-4-5         # Rubric scoring (cost-sensitive, runs per ticket)
    triage_planner: claude-opus-4-6   # Plan generation (quality-sensitive, explores repo)

Architecture

cli/internal/triage/
  ├── pipeline.go      # Orchestrator (conventional Go, not LLM)
  ├── ingest.go        # Stage 1: Normalize tickets, extract signals
  ├── scorer.go        # Stage 2a: Claude rubric scoring (concurrent)
  ├── classifier.go    # Stage 2b: Deterministic decision tree (no LLM)
  ├── cluster.go       # Stage 3: Signal-overlap clustering + context docs
  ├── decisionlog.go   # Audit log persistence (JSONL + context docs)
  ├── events.go        # Event emission helpers
  └── types.go         # NormalizedTicket, Classification, Cluster, ContextDoc, etc.

cli/internal/plan/
  ├── generator.go     # Stage 4: Claude with Read/Grep/Glob tools
  ├── validator.go     # 4-gate plan validation (file existence, scope, etc.)
  ├── events.go        # Plan event helpers
  └── types.go         # TicketPlan, PlanValidation, PlanResult, PlanStats

cli/internal/agent/
  └── agent.go         # Stage 5-7: Work pipeline (execute → review → refactor → PR)