Getting Started with Hokusai

Hokusai is a decentralized protocol that incentivizes high-quality data contributions to improve AI models. This guide will help you get started based on your role in the ecosystem.

Quick Overview by Role

For Data Suppliers

Install the Hokusai data pipeline
Prepare your data in supported formats
Run data validation and quality checks
Submit data for evaluation and earn rewards

For AI Model Developers

Set up the Hokusai SDK and pipeline
Integrate your model with the evaluation framework
Define performance metrics for DeltaOne token issuance
Deploy and monitor model improvements

For Token Investors

Understand the bonding curve mechanism
Participate in token auctions
Monitor token supply and burn rates
Track model performance metrics

System Requirements

Minimum Requirements

Python 3.8 or higher
8GB RAM
10GB free disk space
Unix-based OS (macOS, Linux) or WSL on Windows

Recommended Requirements

Python 3.11
16GB RAM
50GB free disk space for model storage
SSD for faster data processing
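
The minimums above can be checked from Python before installing anything. The following is a minimal sketch and not part of the Hokusai repository: the function name `check_minimums` is hypothetical, "10GB" is treated as 10 GiB, and RAM is left out because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)       # minimum Python version listed above
MIN_FREE_DISK_GB = 10     # minimum free disk space listed above (treated as GiB)


def check_minimums(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK.

    `version` and `free_bytes` can be passed explicitly for testing.
    By default they are read from the running interpreter and from `path`.
    """
    if version is None:
        version = sys.version_info[:2]
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if tuple(version) < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, need 3.8 or higher"
        )
    free_gb = free_bytes / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(
            f"{free_gb:.1f} GB free, need at least {MIN_FREE_DISK_GB} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in check_minimums():
        print("WARNING:", problem)
```

Run it from the directory where you plan to clone the repository so that the disk check looks at the right volume.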

Installation

1. Clone the Repository

git clone https://github.com/hokusai/hokusai-data-pipeline.git
cd hokusai-data-pipeline

2. Create Python Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate

# On Windows (native shell; under WSL use the macOS/Linux command above):
venv\Scripts\activate

3. Run Setup Script

The project includes a setup script that handles all dependencies:

./setup.sh

This script will:

Install Python dependencies from requirements.txt
Set up the MLflow tracking directory
Create necessary data directories
Validate the installation

4. Configure Environment

Create a .env file in the project root:

# MLflow Configuration
MLFLOW_TRACKING_URI=file:./mlruns
MLFLOW_EXPERIMENT_NAME=hokusai-pipeline

# Pipeline Configuration
HOKUSAI_TEST_MODE=false
PIPELINE_LOG_LEVEL=INFO
RANDOM_SEED=42

# Optional: Linear API for workflow automation
LINEAR_API_KEY=your_linear_api_key_here
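
To sanity-check a .env file like the one above, a small parser is enough. This is an illustrative sketch only: the guide does not say how the pipeline itself loads .env, the function name `parse_env_text` is hypothetical, and the parser assumes plain KEY=VALUE lines without quoting or `export` prefixes.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


example = """\
# Pipeline Configuration
HOKUSAI_TEST_MODE=false
PIPELINE_LOG_LEVEL=INFO
RANDOM_SEED=42
"""
settings = parse_env_text(example)
print(settings["RANDOM_SEED"])  # prints 42
```

All values come back as strings, so convert `RANDOM_SEED` with `int(...)` if you need a number.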

Quick Start: Run Your First Pipeline

Test Mode (No External Dependencies)

The fastest way to see the pipeline in action is to use dry-run mode:

python -m src.pipeline.hokusai_pipeline run \
    --dry-run \
    --contributed-data=data/test_fixtures/test_queries.csv \
    --output-dir=./outputs

This command:

Uses mock models and data
Completes in ~7 seconds
Generates real output files
Requires no external dependencies

View Results

# View the attestation-ready output (requires jq)
jq '.' outputs/delta_output_*.json

# Check MLflow tracking
mlflow ui
# Open http://localhost:5000 in your browser
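
If `jq` is not installed, the same output can be read from Python. The sketch below relies on the `delta_output_*.json` file name pattern used in the command above; the helper name `load_latest_output` is hypothetical, and picking the most recently modified file is an assumption about which run you want to inspect.

```python
import json
from pathlib import Path


def load_latest_output(output_dir="outputs"):
    """Load the most recently modified delta_output_*.json in output_dir."""
    candidates = sorted(
        Path(output_dir).glob("delta_output_*.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        raise FileNotFoundError(
            f"no delta_output_*.json files in {output_dir}"
        )
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_latest_output("outputs")` after a dry run returns the parsed JSON as a dictionary.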

Understanding the Output

The pipeline generates a JSON output for attestation that records the DeltaOne score, the per-metric deltas, and the contributor attribution:

{
  "schema_version": "1.0",
  "delta_computation": {
    "delta_one_score": 0.0332,
    "metric_deltas": {
      "accuracy": {
        "baseline_value": 0.8545,
        "new_value": 0.8840,
        "absolute_delta": 0.0296,
        "relative_delta": 0.0346,
        "improvement": true
      }
    }
  },
  "contributor_attribution": {
    "contributor_id": "contributor_xyz789",
    "wallet_address": "0x742d35Cc6634C0532925a3b844Bc9e7595f62341",
    "contributed_samples": 100
  }
}
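
The structure above can be turned into a one-line summary per metric. This is a minimal sketch that uses only the field names shown in the example output; the function name `summarize_deltas` is hypothetical, and real output files may contain fields not shown here.

```python
def summarize_deltas(output):
    """Return one summary line per metric in delta_computation.metric_deltas."""
    lines = []
    metric_deltas = output["delta_computation"]["metric_deltas"]
    for name, delta in sorted(metric_deltas.items()):
        direction = "improved" if delta["improvement"] else "did not improve"
        lines.append(
            f"{name}: {delta['baseline_value']:.4f} -> {delta['new_value']:.4f} "
            f"({delta['absolute_delta']:+.4f}, {direction})"
        )
    return lines


# Values copied from the example output above.
sample = {
    "delta_computation": {
        "delta_one_score": 0.0332,
        "metric_deltas": {
            "accuracy": {
                "baseline_value": 0.8545,
                "new_value": 0.8840,
                "absolute_delta": 0.0296,
                "relative_delta": 0.0346,
                "improvement": True,
            }
        },
    }
}
for line in summarize_deltas(sample):
    print(line)
# accuracy: 0.8545 -> 0.8840 (+0.0296, improved)
```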

Working with Real Data

Prepare Your Data

Create a CSV file with your contributed data:

query_id,query,relevant_doc_id,label
q001,"What is machine learning?",doc123,1
q002,"How to train a model?",doc456,1
q003,"Python programming basics",doc789,0
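
Before submitting, it can help to check the file against the column layout shown above. The sketch below is not the pipeline's own validation; the function name `validate_contribution_csv` is hypothetical, and the rules it applies (exact header, non-empty `query_id` and `query`, unique `query_id`, binary `label`) are assumptions inferred from the example rows.

```python
import csv
import io

EXPECTED_COLUMNS = ["query_id", "query", "relevant_doc_id", "label"]


def validate_contribution_csv(handle):
    """Return a list of problems found in an open CSV file object."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    seen_ids = set()
    for row in reader:
        line_no = reader.line_num  # line of the source file just read
        if not row["query_id"] or not row["query"]:
            problems.append(f"line {line_no}: empty query_id or query")
        if row["query_id"] in seen_ids:
            problems.append(
                f"line {line_no}: duplicate query_id {row['query_id']}"
            )
        seen_ids.add(row["query_id"])
        # Assumption: labels are binary, as in the example rows.
        if row["label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: label must be 0 or 1")
    return problems


example = io.StringIO(
    "query_id,query,relevant_doc_id,label\n"
    'q001,"What is machine learning?",doc123,1\n'
    'q003,"Python programming basics",doc789,0\n'
)
print(validate_contribution_csv(example))  # prints []
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` as the `csv` module documentation recommends, and pass the file object to the function.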

Run the Pipeline

python -m src.pipeline.hokusai_pipeline run \
    --contributed-data=path/to/your/data.csv \
    --baseline-model-path=path/to/baseline/model \
    --output-dir=./outputs

Key Configuration Options

Environment Variables

Variable	Default	Description
`HOKUSAI_TEST_MODE`	`false`	Enable test mode with mock data
`PIPELINE_LOG_LEVEL`	`INFO`	Logging verbosity (DEBUG, INFO, WARNING, ERROR)
`MLFLOW_TRACKING_URI`	`file:./mlruns`	MLflow tracking location
`ENABLE_PII_DETECTION`	`true`	Automatic PII detection and hashing
`DATA_VALIDATION_STRICT`	`false`	Fail on any validation warning
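
The defaults in this table can be mirrored in a script that needs the same settings. The sketch below copies the variable names and defaults from the table; the function name `resolve_settings` is hypothetical, and the case-insensitive "true"/"false" parsing is an assumption rather than documented pipeline behavior.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "HOKUSAI_TEST_MODE": "false",
    "PIPELINE_LOG_LEVEL": "INFO",
    "MLFLOW_TRACKING_URI": "file:./mlruns",
    "ENABLE_PII_DETECTION": "true",
    "DATA_VALIDATION_STRICT": "false",
}
BOOLEAN_KEYS = {
    "HOKUSAI_TEST_MODE",
    "ENABLE_PII_DETECTION",
    "DATA_VALIDATION_STRICT",
}
LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def resolve_settings(environ=None):
    """Merge environment values over the documented defaults."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for key, default in DEFAULTS.items():
        value = environ.get(key, default)
        if key in BOOLEAN_KEYS:
            # Assumption: booleans are spelled true/false, case-insensitive.
            if value.lower() not in {"true", "false"}:
                raise ValueError(f"{key} must be true or false, got {value!r}")
            resolved[key] = value.lower() == "true"
        else:
            resolved[key] = value
    level = resolved["PIPELINE_LOG_LEVEL"]
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown PIPELINE_LOG_LEVEL {level!r}")
    return resolved
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test, for example `resolve_settings({"PIPELINE_LOG_LEVEL": "DEBUG"})`.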

Command-Line Arguments

Argument	Description
`--contributed-data`	Path to your contribution data (required)
`--dry-run`	Use mock data and models
`--output-dir`	Where to save results
`--baseline-model-path`	Path to baseline model
`--sample-size`	Limit data samples for testing

For the complete configuration reference, see the Configuration Guide.
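
When the pipeline is launched from another script, the documented arguments can be assembled into an argument list. This is a sketch under stated assumptions: the helper name `build_pipeline_command` is hypothetical, only flags listed in this guide are used, and the `--sample-size=N` form is assumed by analogy with the other `--flag=value` arguments.

```python
import sys


def build_pipeline_command(contributed_data, output_dir=None,
                           baseline_model_path=None, sample_size=None,
                           dry_run=False):
    """Build an argv list for the pipeline using only the documented flags."""
    command = [
        sys.executable,  # the interpreter of the active virtual environment
        "-m", "src.pipeline.hokusai_pipeline", "run",
        f"--contributed-data={contributed_data}",
    ]
    if dry_run:
        command.append("--dry-run")
    if baseline_model_path is not None:
        command.append(f"--baseline-model-path={baseline_model_path}")
    if sample_size is not None:
        # Assumption: the flag takes its value in --flag=value form.
        command.append(f"--sample-size={int(sample_size)}")
    if output_dir is not None:
        command.append(f"--output-dir={output_dir}")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root, which avoids shell quoting problems with paths that contain spaces.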

Common Issues and Solutions

Permission Denied on setup.sh

chmod +x setup.sh

Python Version Mismatch

Ensure you're using Python 3.8+:

python --version

No Output Generated

Check output directory permissions:

mkdir -p outputs
chmod 755 outputs

Next Steps

Now that you have Hokusai running:

Configuration Guide - Detailed configuration options
Supplying Data - Learn about the data contribution process
Architecture Overview - Understand the system design
API Reference - Integrate with your systems

Getting Help

Check the Troubleshooting Guide
Review existing GitHub Issues
Join our Discord community
