Benchmark Specs

Overview

A BenchmarkSpec is a top-level entity — not a sub-resource of any model — that defines how a model is evaluated. It owns the eval_spec JSONB payload, which specifies the primary metric, secondary metrics, guardrails, and the measurement, label, and coverage policies that govern evaluation.

Because BenchmarkSpec is independent of any individual model, a single spec can be referenced by multiple registered model versions. Every time hokusai model register is called with --benchmark-spec-id, the CLI records the spec's UUID as an MLflow tag (benchmark_spec_id) on both the run and the registered model version, making evaluation provenance queryable across the registry.
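For example, here is a minimal sketch of querying that provenance with the standard MLflow client. It assumes a configured tracking URI and a reasonably recent MLflow version; the spec ID is hypothetical:

import mlflow
from mlflow.tracking import MlflowClient

SPEC_ID = "bs-abc123"  # hypothetical spec UUID

# Runs tagged by `hokusai model register` with this spec
runs = mlflow.search_runs(
    search_all_experiments=True,
    filter_string=f"tags.benchmark_spec_id = '{SPEC_ID}'",
)
print(runs[["run_id"]])

# Registered model versions carrying the same tag
# (tag filters on model versions require a recent MLflow)
client = MlflowClient()
for mv in client.search_model_versions(f"tags.benchmark_spec_id = '{SPEC_ID}'"):
    print(mv.name, mv.version)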

BenchmarkSpec became a first-class entity in migration 014 (ef93d3f), which added the eval_spec JSONB column to the benchmark spec table and introduced the full CRUD API under /api/v1/benchmarks.

Lifecycle

Specs are created as active; deleting one soft-deletes (archives) it by setting is_active=false. The CLI rejects inactive specs: you must create a new spec or restore an archived one before registering a model against it.
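One way to bring an archived spec back is the partial-update endpoint documented under REST CRUD below. A minimal Python sketch, assuming BenchmarkSpecUpdate accepts an is_active field (an assumption; this is not confirmed by the API reference here):

import os

import requests

spec_id = "bs-abc123"  # hypothetical archived spec ID

# Hypothetical restore: flip is_active back through the partial-update endpoint.
# Whether BenchmarkSpecUpdate exposes is_active is an assumption.
resp = requests.put(
    f"https://api.hokus.ai/api/v1/benchmarks/{spec_id}",
    headers={"Authorization": f"Bearer {os.environ['HOKUSAI_API_KEY']}"},
    json={"is_active": True},
)
resp.raise_for_status()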

REST CRUD

All endpoints require a bearer token (Authorization: Bearer $HOKUSAI_API_KEY).

Create a spec

curl -X POST https://api.hokus.ai/api/v1/benchmarks \
  -H "Authorization: Bearer $HOKUSAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "chest-xray-auroc-v1",
    "description": "AUROC-based benchmark for chest X-ray diagnostic model",
    "eval_spec": {
      "primary_metric": {
        "name": "auroc",
        "direction": "higher_is_better",
        "threshold": 0.85,
        "unit": "score"
      },
      "secondary_metrics": [
        { "name": "f1", "direction": "higher_is_better" },
        { "name": "precision", "direction": "higher_is_better" }
      ],
      "guardrails": [
        {
          "name": "false_positive_rate",
          "direction": "lower_is_better",
          "threshold": 0.05,
          "blocking": true
        }
      ],
      "measurement_policy": { "ci_method": "bootstrap", "ci_alpha": 0.05 },
      "label_policy": { "labeler": "human", "min_annotators": 2 },
      "coverage_policy": { "min_examples_per_class": 50 }
    }
  }'

The response includes the spec's UUID, e.g. "id": "bs-abc123...".
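For scripted workflows, the same call can be made from Python. This is a sketch using the requests library; the payload is abbreviated, and the response shape follows the example above:

import os

import requests

payload = {
    "name": "chest-xray-auroc-v1",
    "eval_spec": {
        "primary_metric": {
            "name": "auroc",
            "direction": "higher_is_better",
            "threshold": 0.85,
        },
    },
}

resp = requests.post(
    "https://api.hokus.ai/api/v1/benchmarks",
    headers={"Authorization": f"Bearer {os.environ['HOKUSAI_API_KEY']}"},
    json=payload,
)
resp.raise_for_status()
spec_id = resp.json()["id"]  # e.g. "bs-abc123..."
print(spec_id)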

List specs

curl "https://api.hokus.ai/api/v1/benchmarks?page=1&page_size=20" \
-H "Authorization: Bearer $HOKUSAI_API_KEY"

Fetch a single spec

curl "https://api.hokus.ai/api/v1/benchmarks/<spec_id>" \
-H "Authorization: Bearer $HOKUSAI_API_KEY"

Update a spec

curl -X PUT "https://api.hokus.ai/api/v1/benchmarks/<spec_id>" \
  -H "Authorization: Bearer $HOKUSAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "description": "Updated description" }'

BenchmarkSpecUpdate applies partial updates — only the fields you include are changed.

Soft-delete (archive) a spec

curl -X DELETE "https://api.hokus.ai/api/v1/benchmarks/<spec_id>" \
  -H "Authorization: Bearer $HOKUSAI_API_KEY"

This sets is_active=false and does not remove the spec from the database.

Web UI

The Hokusai web UI at hokus.ai/create-model prefills a new BenchmarkSpec from the metric and ticker you configure during model creation (Step 3: Performance Metrics). You can use the generated spec ID directly with --benchmark-spec-id without running any curl commands.

eval_spec Schema

eval_spec is a JSONB payload stored on the BenchmarkSpec row. The six headline fields are:

| Field | Type | Required | Description |
|---|---|---|---|
| primary_metric | MetricSpec | yes | The single metric DeltaOne is computed against. |
| secondary_metrics | list[MetricSpec] | no (default []) | Additional metrics measured but not used for DeltaOne scoring. |
| guardrails | list[GuardrailSpec] | no (default []) | Hard constraints that block promotion if breached. |
| measurement_policy | dict \| null | no | Free-form measurement configuration (e.g. confidence-interval method, bootstrap samples). |
| label_policy | dict \| null | no | Free-form labeling rules (e.g. annotator agreement, label mapping). |
| coverage_policy | dict \| null | no | Free-form coverage or sub-population requirements (e.g. minimum examples per class). |

MetricSpec shape

Used by primary_metric and each entry in secondary_metrics.

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Metric identifier (e.g. auroc, f1). |
| direction | higher_is_better \| lower_is_better | yes | Whether larger values indicate improvement. |
| threshold | float \| null | no | Minimum acceptable value for promotion. |
| unit | string \| null | no | Human-readable unit (e.g. score, ms). |
| mlflow_name | string \| null | no | Override for the MLflow metric key if it differs from name. |
| scorer_ref | string \| null | no | Reference to a scorer plugin registered in the evaluation pipeline. |

GuardrailSpec shape

Each entry in guardrails must include a threshold.

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Guardrail identifier (e.g. false_positive_rate). |
| direction | higher_is_better \| lower_is_better | yes | Violation direction. |
| threshold | float | yes | The hard limit. Breaching it blocks promotion when blocking=true. |
| blocking | bool | no (default true) | If true, a breach prevents model promotion. |
| mlflow_name | string \| null | no | Override for the MLflow metric key. |
| scorer_ref | string \| null | no | Reference to a scorer plugin. |

Auxiliary fields

Three additional fields are available on EvalSpec for advanced use cases:

  • unit_of_analysis (string | null) — The "row" the metric is computed over (e.g. transaction, patient).
  • min_examples (int | null, ≥ 1) — Minimum eligible evaluation rows before the run counts.
  • metric_family (enum, default proportion) — DeltaOne comparator dispatch. Values: proportion, continuous, zero_inflated_continuous, rank_or_ordinal.

Most use cases are covered by the six headline fields above. The auxiliary fields are useful when your metric distribution requires a non-default statistical family or when you need to enforce a minimum sample size.
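To make these shapes concrete, here is an illustrative Pydantic sketch assembled from the tables above. It is a reader's reconstruction for reference, not the server's actual model definitions:

from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class Direction(str, Enum):
    higher_is_better = "higher_is_better"
    lower_is_better = "lower_is_better"


class MetricFamily(str, Enum):
    proportion = "proportion"
    continuous = "continuous"
    zero_inflated_continuous = "zero_inflated_continuous"
    rank_or_ordinal = "rank_or_ordinal"


class MetricSpec(BaseModel):
    name: str
    direction: Direction
    threshold: Optional[float] = None
    unit: Optional[str] = None
    mlflow_name: Optional[str] = None
    scorer_ref: Optional[str] = None


class GuardrailSpec(BaseModel):
    name: str
    direction: Direction
    threshold: float  # required: every guardrail must carry a hard limit
    blocking: bool = True
    mlflow_name: Optional[str] = None
    scorer_ref: Optional[str] = None


class EvalSpec(BaseModel):
    primary_metric: MetricSpec
    secondary_metrics: list[MetricSpec] = Field(default_factory=list)
    guardrails: list[GuardrailSpec] = Field(default_factory=list)
    measurement_policy: Optional[dict] = None
    label_policy: Optional[dict] = None
    coverage_policy: Optional[dict] = None
    unit_of_analysis: Optional[str] = None
    min_examples: Optional[int] = Field(default=None, ge=1)
    metric_family: MetricFamily = MetricFamily.proportion

Running a candidate payload through EvalSpec.model_validate(...) is a quick way to sanity-check it against these rules before POSTing.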

Worked Example

A complete eval_spec payload for an AUROC-based diagnostic benchmark:

{
  "primary_metric": {
    "name": "auroc",
    "direction": "higher_is_better",
    "threshold": 0.85,
    "unit": "score"
  },
  "secondary_metrics": [
    { "name": "f1", "direction": "higher_is_better" },
    { "name": "precision", "direction": "higher_is_better" }
  ],
  "guardrails": [
    {
      "name": "false_positive_rate",
      "direction": "lower_is_better",
      "threshold": 0.05,
      "blocking": true
    }
  ],
  "measurement_policy": {
    "ci_method": "bootstrap",
    "ci_alpha": 0.05,
    "n_bootstrap": 1000
  },
  "label_policy": {
    "labeler": "human",
    "min_annotators": 2,
    "agreement_threshold": 0.8
  },
  "coverage_policy": {
    "min_examples_per_class": 50,
    "stratify_by": "patient_age_band"
  }
}

Registering a Model Against a Spec

Once you have a spec ID, pass it to hokusai model register with --benchmark-spec-id:

hokusai model register \
  --token-id TICKER \
  --benchmark-spec-id <spec_id> \
  --model-path ./models/final_model.pkl

| Flag | Description |
|---|---|
| --token-id | The token ticker for your model (e.g. CHEST). |
| --benchmark-spec-id | The UUID of the BenchmarkSpec to evaluate against. Spec IDs look like bs-abc123. |
| --model-path | Local path to the serialized model artifact. |

The CLI:

  1. Fetches the spec from the API (authenticating with HOKUSAI_API_KEY) and prints a Resolved benchmark spec … confirmation line.
  2. Derives metric and baseline from eval_spec.primary_metric automatically.
  3. Uploads the model artifact to MLflow.
  4. Tags both the MLflow run and the registered model version with benchmark_spec_id for provenance.
  5. Validates that the model meets the spec's primary_metric.threshold before marking the version REGISTERED.
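Step 5's check is direction-aware: a higher_is_better metric must meet the threshold, while a lower_is_better one must not exceed it. A minimal sketch of that logic (illustrative only; whether the CLI's actual comparison is strict or inclusive is an assumption):

def meets_threshold(value: float, threshold: float, direction: str) -> bool:
    """Direction-aware threshold check, as applied to primary_metric at registration."""
    if direction == "higher_is_better":
        return value >= threshold
    return value <= threshold

# An AUROC of 0.91 passes the 0.85 threshold from the spec above:
assert meets_threshold(0.91, 0.85, "higher_is_better")
# A false-positive rate of 0.07 breaches the 0.05 guardrail:
assert not meets_threshold(0.07, 0.05, "lower_is_better")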

If you need to override the auto-derived values, --metric and --baseline may still be supplied explicitly; the CLI will emit a warning and use your overrides. When --benchmark-spec-id is omitted entirely, --metric and --baseline are required (legacy mode).

Legacy scalar fields are auto-uplifted at runtime

Specs created before migration 014 use scalar columns (metric_name, metric_direction, baseline_value) instead of eval_spec. The runtime translation module spec_translation.py (in hokusai-data-pipeline) synthesizes an equivalent eval_spec automatically when one is absent, so existing specs continue to work without any user action. New specs should populate eval_spec directly.
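A sketch of what that synthesis plausibly looks like (illustrative; the actual spec_translation.py logic, including whether baseline_value maps to the primary metric's threshold, is an assumption here):

def synthesize_eval_spec(
    metric_name: str, metric_direction: str, baseline_value: float
) -> dict:
    """Uplift pre-migration-014 scalar columns into an equivalent eval_spec."""
    return {
        "primary_metric": {
            "name": metric_name,
            "direction": metric_direction,
            "threshold": baseline_value,  # assumed mapping, not confirmed
        },
        "secondary_metrics": [],
        "guardrails": [],
    }

# A legacy row (metric_name="auroc", metric_direction="higher_is_better",
# baseline_value=0.85) uplifts to the same primary_metric used throughout this page:
print(synthesize_eval_spec("auroc", "higher_is_better", 0.85))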