Quick Start Guide

Get up and running with the Hokusai data pipeline in 5 minutes. This guide shows you how to test the pipeline locally with sample data.

Prerequisites

This guide assumes you have:

  • Python 3.8+ installed
  • Basic familiarity with the command line
  • Git installed on your system

5-Minute Example

Step 1: Setup Environment

# Clone the pipeline repository (if not already done)
git clone https://github.com/hokusai-protocol/hokusai-data-pipeline
cd hokusai-data-pipeline

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows

# Install dependencies
pip install -r requirements.txt

Step 2: Run Pipeline in Dry-Run Mode

# Run with mock data to test setup
python -m src.pipeline.hokusai_pipeline run \
    --dry-run \
    --contributed-data=data/test_fixtures/test_queries.csv \
    --output-dir=./quick-start-output

This command:

  • Uses --dry-run to generate mock baseline models
  • Processes the test dataset
  • Outputs attestation-ready results

Step 3: View Results

# Check output
ls -la quick-start-output/

# View the attestation output
cat quick-start-output/deltaone_output_*.json | jq '.'

Step 4: View the MLflow UI (Optional)

# Start the MLflow UI
mlflow ui

# Open browser to http://localhost:5000

You'll see:

  • Experiment runs
  • Model parameters
  • Performance metrics
  • Artifacts

Understanding the Output

The pipeline generates several output files:

DeltaOne Attestation

{
  "model_id": "gpt-3.5-turbo",
  "run_id": "run_20240620_153045",
  "deltaone_score": 8.5,
  "improvements": {
    "accuracy": 0.085,
    "response_quality": 0.082,
    "latency_reduction": 0.001
  },
  "metadata": {
    "data_points": 100,
    "evaluation_date": "2024-06-20T15:30:45Z",
    "pipeline_version": "1.0.0"
  }
}

Technical Output Format

{
  "pipeline_version": "1.0.0",
  "timestamp": "2024-01-15T10:30:00Z",
  "baseline_model": {
    "id": "baseline_model_v1",
    "metrics": {
      "accuracy": 0.85,
      "f1_score": 0.82
    }
  },
  "improved_model": {
    "id": "improved_model_v1",
    "metrics": {
      "accuracy": 0.88,
      "f1_score": 0.86
    }
  },
  "delta": {
    "accuracy": 0.03,
    "f1_score": 0.04
  }
}

Key Metrics

  • DeltaOne Score: Percentage improvement (1 DeltaOne = 1% improvement)
  • Component Scores: Breakdown by evaluation criteria
  • Technical Metrics: Raw model performance measurements
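To make the conversion concrete, here is a minimal sketch applying the 1-DeltaOne-per-1%-improvement rule to the accuracy values from the technical output example above. The helper name is illustrative, not part of the pipeline:

```python
# Illustrative helper, not part of the pipeline: converts a raw metric
# delta into a DeltaOne score using the 1 DeltaOne = 1% improvement rule.
def deltaone_score(baseline: float, improved: float) -> float:
    delta = improved - baseline   # e.g. 0.88 - 0.85
    return round(delta * 100, 4)  # scale to percentage points

# Accuracy values from the technical output example:
print(deltaone_score(0.85, 0.88))  # 3.0
```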

Understanding the Pipeline

What Just Happened?

  1. Data Loading: The pipeline loaded your contributed data
  2. Validation: Data was checked for schema compliance and PII
  3. Model Training: A new model was trained with the data
  4. Evaluation: Both models were evaluated on a benchmark
  5. Delta Calculation: Performance improvement was calculated
  6. DeltaOne Score: Improvement converted to DeltaOne rewards
  7. Attestation: Results were formatted for blockchain verification
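Steps 5–7 above can be sketched in a few lines of Python. This is only an illustration using the field names from the Technical Output Format example, not the pipeline's actual implementation:

```python
import json

# Metrics as reported in the Technical Output Format example above.
baseline = {"accuracy": 0.85, "f1_score": 0.82}
improved = {"accuracy": 0.88, "f1_score": 0.86}

# Step 5: per-metric delta between the two models.
delta = {k: round(improved[k] - baseline[k], 4) for k in baseline}

# Steps 6-7: assemble an attestation-shaped record (illustrative fields).
attestation = {
    "pipeline_version": "1.0.0",
    "baseline_model": {"id": "baseline_model_v1", "metrics": baseline},
    "improved_model": {"id": "improved_model_v1", "metrics": improved},
    "delta": delta,
}
print(json.dumps(attestation, indent=2))
```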

Real Data Example

Prepare Your Data

Create a CSV file with your training data:

query,document,relevance
"What is machine learning?","Machine learning is a subset of AI...",1
"How does Python work?","Python is an interpreted language...",1
"What is the weather?","Machine learning algorithms can...",0

Save as my_data.csv.
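Before submitting, it can be worth a quick local sanity check of the file's columns. This sketch writes the three example rows and verifies them with Python's csv module; the authoritative check is the validate_data command shown below:

```python
import csv

# The three example rows from above.
rows = [
    ("What is machine learning?", "Machine learning is a subset of AI...", "1"),
    ("How does Python work?", "Python is an interpreted language...", "1"),
    ("What is the weather?", "Machine learning algorithms can...", "0"),
]

# Write my_data.csv with the expected header.
with open("my_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "document", "relevance"])
    writer.writerows(rows)

# Sanity-check: required columns present, relevance is 0 or 1.
with open("my_data.csv", newline="") as f:
    reader = csv.DictReader(f)
    assert set(reader.fieldnames) == {"query", "document", "relevance"}
    for row in reader:
        assert row["relevance"] in {"0", "1"}
```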

Run Pipeline with Real Data

# Set environment for real run
export HOKUSAI_TEST_MODE=false

# Run pipeline
python -m src.pipeline.hokusai_pipeline run \
    --baseline-model-path=models/baseline.pkl \
    --contributed-data=my_data.csv \
    --output-dir=./outputs \
    --contributor-address=0xYourWalletAddress \
    --experiment-name=my-first-contribution

Monitor Progress

# Watch logs
tail -f outputs/pipeline.log

# Check pipeline status
python -m metaflow tag list

Common Commands

Data Validation

# Validate your data before submission
python -m src.cli.validate_data \
    --input-file=my_data.csv \
    --schema=config/data_schema.json

Preview Mode

# Preview expected improvement
python -m src.preview.preview_improvement \
    --baseline-model=models/baseline.pkl \
    --contributed-data=my_data.csv

Configuration Options

# Custom configuration
python -m src.pipeline.hokusai_pipeline run \
    --config=my_config.yaml \
    --batch-size=64 \
    --epochs=20 \
    --learning-rate=0.001

Output Files

After a successful run, you'll find:

outputs/
├── deltaone_output_20240115_103000.json   # Main attestation file
├── metrics_summary.csv                    # Detailed metrics
├── model_artifacts/                       # Trained models
│   ├── baseline_model.pkl
│   └── improved_model.pkl
├── evaluation_results/                    # Evaluation details
│   └── benchmark_scores.json
└── pipeline.log                           # Execution logs
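Reading the main attestation file back programmatically can be sketched as below. To keep the snippet self-contained it first writes a minimal sample record; in a real run the pipeline produces this file, and the fields here are a made-up subset:

```python
import glob
import json
import os

# Write a minimal sample attestation so the snippet runs standalone;
# a real pipeline run creates this file with the full record.
os.makedirs("outputs", exist_ok=True)
with open("outputs/deltaone_output_20240115_103000.json", "w") as f:
    json.dump({"pipeline_version": "1.0.0", "deltaone_score": 3.0}, f)

# Pick the newest attestation file by name (timestamps sort
# chronologically) and read the score.
latest = sorted(glob.glob("outputs/deltaone_output_*.json"))[-1]
with open(latest) as f:
    attestation = json.load(f)
print(attestation["deltaone_score"])  # 3.0
```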

Next Steps

Now that you've run your first pipeline:

  1. Configure Your Pipeline - Customize evaluation parameters
  2. Supply Real Data - Learn how to contribute your own datasets
  3. First Contribution - Step-by-step guide for your first real contribution
  4. Understanding DeltaOne - Deep dive into reward calculations

Troubleshooting

Python Version Error

# Check your Python version
python --version

# If below 3.8, install a newer version

Pipeline Fails Immediately

# Check Python environment
which python # Should show venv path

# Reinstall dependencies
pip install --upgrade -r requirements.txt

No Output Generated

# Check logs
grep ERROR outputs/pipeline.log

# Verify data format
python -m src.cli.validate_data --input-file=your_data.csv

MLflow UI Not Working

# Kill existing process
pkill -f "mlflow ui"

# Start with specific settings
mlflow ui --backend-store-uri ./mlruns --port 5001

Permission Denied

# Make scripts executable
chmod +x scripts/*.sh

For more troubleshooting, see our Troubleshooting Guide.