
Evaluations and Feedback

Evaluations are the source of learning for the Technical Task Router. They turn a completed task into evidence about whether a route worked.

What to Measure

Use evaluation signals that reflect your harness's real success criteria:

  • Task acceptance
  • Test pass rate
  • Static analysis results
  • Human review outcome
  • Regression detection
  • Cost
  • Latency
  • Retry count
  • Whether the task stayed within budget

The router does not require every harness to use the same scoring system. It needs enough consistent outcome data to compare route quality for similar tasks.
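One way to keep outcome data consistent is to record every completed task in the same shape. The sketch below is illustrative only: the `TaskOutcome` class and all of its field names are assumptions made for this example, not a schema defined by the router. It shows the signals listed above captured as one record per task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskOutcome:
    """One completed task. Hypothetical schema; field names are assumptions."""
    task_id: str
    route_id: str
    accepted: bool                  # task acceptance
    tests_passed: int
    tests_total: int
    static_analysis_findings: int   # count of findings reported by static analysis
    review_outcome: Optional[str]   # e.g. "approved"; None if no human review ran
    regression_detected: bool
    cost_usd: float
    latency_seconds: float
    retry_count: int
    within_budget: bool

    @property
    def test_pass_rate(self) -> Optional[float]:
        """Fraction of tests that passed, or None when no tests ran."""
        if self.tests_total == 0:
            return None
        return self.tests_passed / self.tests_total


# Example values are made up for illustration.
outcome = TaskOutcome(
    task_id="task-001", route_id="route-a", accepted=True,
    tests_passed=48, tests_total=50, static_analysis_findings=0,
    review_outcome="approved", regression_detected=False,
    cost_usd=3.20, latency_seconds=412.0, retry_count=1, within_budget=True,
)
print(outcome.test_pass_rate)
```

Returning `None` when no tests ran keeps "no signal" distinct from a 0% pass rate, which matters when comparing routes across harnesses that run different checks.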

Evaluation Envelope

The evaluation envelope is the set of constraints and checks used to judge a route. For example:

{
  "requiredChecks": ["unit_tests", "lint", "review"],
  "acceptancePolicy": "human_accepts_patch",
  "maxCostUsd": 25,
  "maxWallClockMinutes": 20,
  "regressionPolicy": "block_on_auth_regression"
}

The same route may be acceptable under one envelope and unacceptable under another. Include evaluation details when you request a route and when you report an outcome.
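To make the point about envelopes concrete, here is a minimal sketch that judges one outcome against two envelopes. The envelope keys come from the example above. The outcome fields (`checksPassed`, `costUsd`, `wallClockMinutes`) and the `violations` helper are assumptions made for this illustration. The sketch deliberately ignores `acceptancePolicy` and `regressionPolicy`, because how those are evaluated depends on your harness.

```python
import json

ENVELOPE_JSON = """
{
  "requiredChecks": ["unit_tests", "lint", "review"],
  "acceptancePolicy": "human_accepts_patch",
  "maxCostUsd": 25,
  "maxWallClockMinutes": 20,
  "regressionPolicy": "block_on_auth_regression"
}
"""


def violations(envelope: dict, outcome: dict) -> list:
    """Return human-readable reasons the outcome falls outside the envelope."""
    problems = []
    passed = outcome.get("checksPassed", {})
    for check in envelope.get("requiredChecks", []):
        if not passed.get(check, False):
            problems.append(f"required check failed or missing: {check}")
    if "maxCostUsd" in envelope and outcome["costUsd"] > envelope["maxCostUsd"]:
        problems.append("cost exceeds maxCostUsd")
    if ("maxWallClockMinutes" in envelope
            and outcome["wallClockMinutes"] > envelope["maxWallClockMinutes"]):
        problems.append("wall clock exceeds maxWallClockMinutes")
    return problems


envelope = json.loads(ENVELOPE_JSON)

# Hypothetical outcome for a single routed task (values made up).
outcome = {
    "checksPassed": {"unit_tests": True, "lint": True, "review": True},
    "costUsd": 18.5,
    "wallClockMinutes": 12,
}

# Same outcome, tighter cost ceiling.
strict_envelope = {**envelope, "maxCostUsd": 10}

print(violations(envelope, outcome))         # []
print(violations(strict_envelope, outcome))  # ['cost exceeds maxCostUsd']
```

The outcome is identical in both calls; only the envelope changes the verdict. That is why an outcome reported without its envelope is hard to compare with other outcomes.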

Feedback Loop

Successful routes teach the router what works. Failed routes teach it what to avoid. Budget misses, flaky retries, reviewer misses, and post-merge regressions are all useful signals.

Comparing Against a Baseline

When evaluating an integration, compare Hokusai-selected routes against your current routing policy:

| Metric          | Baseline policy            | Hokusai route             |
|-----------------|----------------------------|---------------------------|
| Acceptance rate | Current harness behavior   | Routed decisions          |
| Average cost    | Current model selection    | Selected route cost       |
| Wall clock      | Current workflow           | Routed workflow           |
| Regression rate | Existing review process    | Selected reviewer route   |
| Retry count     | Existing fallback behavior | Routed fallback behavior  |

Measuring both policies on the same metrics makes routing lift, the improvement attributable to routing, measurable instead of anecdotal.
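A baseline comparison can be as simple as summarizing both sets of outcomes with the same function and subtracting. The sketch below assumes each outcome is a dict with hypothetical keys (`accepted`, `cost_usd`, `wall_clock_minutes`, `regression`, `retries`). The sample records are made up purely to show the arithmetic and are not measured results.

```python
from statistics import mean


def summarize(outcomes: list) -> dict:
    """Aggregate per-task outcomes into the comparison metrics."""
    if not outcomes:
        raise ValueError("need at least one outcome to summarize")
    n = len(outcomes)
    return {
        "acceptance_rate": sum(1 for o in outcomes if o["accepted"]) / n,
        "average_cost_usd": mean(o["cost_usd"] for o in outcomes),
        "average_wall_clock_minutes": mean(o["wall_clock_minutes"] for o in outcomes),
        "regression_rate": sum(1 for o in outcomes if o["regression"]) / n,
        "average_retry_count": mean(o["retries"] for o in outcomes),
    }


def lift(baseline: list, routed: list) -> dict:
    """Routed minus baseline for every metric (negative means routed is lower)."""
    b, r = summarize(baseline), summarize(routed)
    return {metric: r[metric] - b[metric] for metric in b}


# Illustrative, made-up records for the same kind of task under each policy.
baseline = [
    {"accepted": True, "cost_usd": 4.0, "wall_clock_minutes": 10, "regression": False, "retries": 1},
    {"accepted": False, "cost_usd": 6.0, "wall_clock_minutes": 14, "regression": True, "retries": 3},
]
routed = [
    {"accepted": True, "cost_usd": 3.0, "wall_clock_minutes": 8, "regression": False, "retries": 0},
    {"accepted": True, "cost_usd": 5.0, "wall_clock_minutes": 10, "regression": False, "retries": 2},
]

for metric, delta in lift(baseline, routed).items():
    print(f"{metric}: {delta:+.2f}")
```

A positive delta is good for acceptance rate and bad for cost, wall clock, regression rate, and retries, so interpret each metric's sign separately. For a fair comparison, both sets should cover similar tasks judged under the same evaluation envelope.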