Evaluations and Feedback

Evaluations are the source of learning for the Technical Task Router. They turn a completed task into evidence about whether a route worked.

What to Measure

Use evaluation signals that reflect your harness's real success criteria:

Task acceptance
Test pass rate
Static analysis results
Human review outcome
Regression detection
Cost
Latency
Retry count
Whether the task stayed within budget

The router does not require every harness to use the same scoring system. It only needs outcome data that is consistent enough to compare route quality across similar tasks.
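As a rough illustration, the signals above could be collected into one record per completed task. This is a minimal sketch, not a router schema: the class name `RouteOutcome` and every field name are hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouteOutcome:
    """Hypothetical per-task outcome record; all field names are illustrative."""
    task_id: str
    route_id: str
    accepted: bool                  # task acceptance
    tests_passed: int
    tests_total: int
    static_analysis_findings: int   # static analysis results
    human_review: Optional[str]     # e.g. "approved", or None if not reviewed
    regression_detected: bool
    cost_usd: float
    latency_seconds: float
    retry_count: int
    within_budget: bool

    @property
    def test_pass_rate(self) -> Optional[float]:
        """Fraction of tests that passed, or None when no tests ran."""
        if self.tests_total == 0:
            return None
        return self.tests_passed / self.tests_total


# Example values are made up for illustration.
outcome = RouteOutcome(
    task_id="task-1", route_id="route-a", accepted=True,
    tests_passed=3, tests_total=4, static_analysis_findings=0,
    human_review="approved", regression_detected=False,
    cost_usd=4.2, latency_seconds=310.0, retry_count=1, within_budget=True,
)
```

Recording the same fields for every task, whatever scoring system sits behind them, is what keeps outcomes comparable across routes.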

Evaluation Envelope

The evaluation envelope is the set of constraints and checks used to judge a route. For example:

{
  "requiredChecks": ["unit_tests", "lint", "review"],
  "acceptancePolicy": "human_accepts_patch",
  "maxCostUsd": 25,
  "maxWallClockMinutes": 20,
  "regressionPolicy": "block_on_auth_regression"
}

The same route may be acceptable under one envelope and unacceptable under another. Include the evaluation envelope both when you request a route and when you report an outcome.
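To make this concrete, here is a small sketch of checking one outcome against two envelopes. The envelope keys come from the example above. The outcome keys (`passedChecks`, `costUsd`, `wallClockMinutes`) and the function name are assumptions made for this illustration. The two policy fields are left uninterpreted because their semantics depend on your harness.

```python
from typing import List


def envelope_violations(envelope: dict, outcome: dict) -> List[str]:
    """Return the constraints the outcome breaks; an empty list means acceptable.

    Only the numeric limits and required checks are evaluated here.
    acceptancePolicy and regressionPolicy are not interpreted in this sketch.
    """
    violations = []
    passed = set(outcome.get("passedChecks", []))
    for check in envelope.get("requiredChecks", []):
        if check not in passed:
            violations.append(f"required check not passed: {check}")

    max_cost = envelope.get("maxCostUsd")
    if max_cost is not None and outcome["costUsd"] > max_cost:
        violations.append("cost exceeds maxCostUsd")

    max_minutes = envelope.get("maxWallClockMinutes")
    if max_minutes is not None and outcome["wallClockMinutes"] > max_minutes:
        violations.append("wall clock exceeds maxWallClockMinutes")
    return violations


strict = {
    "requiredChecks": ["unit_tests", "lint", "review"],
    "maxCostUsd": 25,
    "maxWallClockMinutes": 20,
}
relaxed = {**strict, "maxWallClockMinutes": 30}

# Hypothetical outcome: all checks passed, but the run took 26 minutes.
outcome = {
    "passedChecks": ["unit_tests", "lint", "review"],
    "costUsd": 18,
    "wallClockMinutes": 26,
}

strict_result = envelope_violations(strict, outcome)
relaxed_result = envelope_violations(relaxed, outcome)
```

The same outcome violates the strict envelope on wall-clock time and passes the relaxed one, which is why an outcome is only meaningful alongside the envelope it was judged under.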

Feedback Loop

Successful routes teach the router what works. Failed routes teach it what to avoid. Budget misses, flaky retries, reviewer misses, and post-merge regressions are all useful signals.
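One way a harness might label outcomes before reporting them is sketched below. The signal names, the outcome keys, and the rules that define each signal are all assumptions for illustration. In particular, treating "accepted after at least one retry" as a flaky retry is only one possible definition.

```python
from typing import List


def feedback_signals(outcome: dict) -> List[str]:
    """Derive illustrative feedback labels from a hypothetical outcome dict."""
    signals = []
    if outcome["costUsd"] > outcome["maxCostUsd"]:
        signals.append("budget_miss")
    # Assumption: a task that was accepted only after retrying counts as flaky.
    if outcome["retryCount"] > 0 and outcome["accepted"]:
        signals.append("flaky_retry")
    if outcome["postMergeRegression"]:
        signals.append("post_merge_regression")
        # Assumption: a regression that slipped past an approving review
        # is also a reviewer miss.
        if outcome["reviewApproved"]:
            signals.append("reviewer_miss")
    if not signals and outcome["accepted"]:
        signals.append("success")
    return signals


# Made-up example: accepted after two retries, over budget, no regression.
example = {
    "accepted": True,
    "costUsd": 31.0,
    "maxCostUsd": 25.0,
    "retryCount": 2,
    "reviewApproved": True,
    "postMergeRegression": False,
}
labels = feedback_signals(example)
```

Keeping the labels explicit lets a single task report several signals at once, for example a route that was accepted but still missed its budget.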

Comparing Against a Baseline

When evaluating an integration, compare Hokusai-selected routes against your current routing policy:

Metric	Baseline policy	Hokusai route
Acceptance rate	Current harness behavior	Routed decisions
Average cost	Current model selection	Selected route cost
Wall clock	Current workflow	Routed workflow
Regression rate	Existing review process	Selected reviewer route
Retry count	Existing fallback behavior	Routed fallback behavior

This makes routing lift measurable instead of anecdotal.
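The comparison in the table can be computed with a few lines of code once both policies have logged outcomes for similar tasks. The sketch below assumes each outcome is a dict with the hypothetical keys `accepted`, `cost_usd`, `wall_clock_minutes`, `regressed`, and `retries`. The sample numbers are invented and are not measured results.

```python
from statistics import mean
from typing import Dict, List


def summarize(outcomes: List[dict]) -> Dict[str, float]:
    """Aggregate one policy's outcomes into the metrics from the table."""
    n = len(outcomes)
    return {
        "acceptance_rate": sum(1 for o in outcomes if o["accepted"]) / n,
        "average_cost_usd": mean(o["cost_usd"] for o in outcomes),
        "average_wall_clock_minutes": mean(o["wall_clock_minutes"] for o in outcomes),
        "regression_rate": sum(1 for o in outcomes if o["regressed"]) / n,
        "average_retry_count": mean(o["retries"] for o in outcomes),
    }


def lift(baseline: List[dict], routed: List[dict]) -> Dict[str, float]:
    """Routed metric minus baseline metric, per metric."""
    base, new = summarize(baseline), summarize(routed)
    return {name: new[name] - base[name] for name in base}


# Invented sample data, two tasks per policy.
baseline_outcomes = [
    {"accepted": True, "cost_usd": 10.0, "wall_clock_minutes": 15.0, "regressed": False, "retries": 1},
    {"accepted": False, "cost_usd": 20.0, "wall_clock_minutes": 25.0, "regressed": True, "retries": 3},
]
routed_outcomes = [
    {"accepted": True, "cost_usd": 8.0, "wall_clock_minutes": 12.0, "regressed": False, "retries": 0},
    {"accepted": True, "cost_usd": 12.0, "wall_clock_minutes": 18.0, "regressed": False, "retries": 2},
]

delta = lift(baseline_outcomes, routed_outcomes)
```

A positive difference is an improvement for acceptance rate, while a negative difference is an improvement for cost, wall clock, regression rate, and retry count. Compare like with like: both sets of outcomes should cover similar tasks judged under the same envelope.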
