Getting Started with Hokusai

Hokusai is a decentralized protocol that incentivizes high-quality data contributions to improve AI models. This guide will help you get started based on your role in the ecosystem.

Quick Overview by Role

For Data Suppliers

  1. Install the Hokusai data pipeline
  2. Prepare your data in supported formats
  3. Run data validation and quality checks
  4. Submit data for evaluation and earn rewards

For AI Model Developers

  1. Set up the Hokusai SDK and pipeline
  2. Integrate your model with the evaluation framework
  3. Define performance metrics for DeltaOne token issuance
  4. Deploy and monitor model improvements

For Token Investors

  1. Understand the bonding curve mechanism
  2. Participate in token auctions
  3. Monitor token supply and burn rates
  4. Track model performance metrics

System Requirements

Minimum Requirements

  • Python 3.8 or higher
  • 8GB RAM
  • 10GB free disk space
  • Unix-based OS (macOS, Linux) or WSL on Windows

Recommended Requirements

  • Python 3.11
  • 16GB RAM
  • 50GB free disk space for model storage
  • SSD for faster data processing
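
The minimum requirements above can be checked before installation with a short script. The following is a minimal sketch, not part of the Hokusai repository: the function name `check_environment` and its parameters are hypothetical, and it covers only the Python version and free disk space, since the standard library offers no portable way to query installed RAM.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)      # minimum Python version listed above
MIN_FREE_DISK_GB = 10    # minimum free disk space listed above


def check_environment(path=".", min_python=MIN_PYTHON, min_free_gb=MIN_FREE_DISK_GB):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # disk_usage reports bytes for the filesystem that contains `path`
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"{min_free_gb} GB free disk required, found {free_gb:.1f} GB")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "Environment meets the minimum requirements.")
```

Passing `min_python=(3, 11)` and `min_free_gb=50` checks against the recommended values instead.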

Installation

1. Clone the Repository

git clone https://github.com/hokusai/hokusai-data-pipeline.git
cd hokusai-data-pipeline

2. Create Python Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate

# On Windows (native shell; inside WSL, use the macOS/Linux command above):
venv\Scripts\activate

3. Run Setup Script

The project includes a setup script that handles all dependencies:

./setup.sh

This script will:

  • Install Python dependencies from requirements.txt
  • Set up the MLflow tracking directory
  • Create necessary data directories
  • Validate the installation

4. Configure Environment

Create a .env file in the project root:

# MLflow Configuration
MLFLOW_TRACKING_URI=file:./mlruns
MLFLOW_EXPERIMENT_NAME=hokusai-pipeline

# Pipeline Configuration
HOKUSAI_TEST_MODE=false
PIPELINE_LOG_LEVEL=INFO
RANDOM_SEED=42

# Optional: Linear API for workflow automation
LINEAR_API_KEY=your_linear_api_key_here
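
To confirm that a .env file contains the keys you expect, the KEY=VALUE format can be parsed with the standard library alone. This is a sketch under assumptions: how the pipeline itself loads .env is not documented here (it may use a dedicated loader), and `parse_env_text` is a hypothetical helper that handles only plain KEY=VALUE lines and `#` comments, without quoting or variable expansion.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Sample content copied from the configuration shown above
sample = """\
# MLflow Configuration
MLFLOW_TRACKING_URI=file:./mlruns
MLFLOW_EXPERIMENT_NAME=hokusai-pipeline

# Pipeline Configuration
HOKUSAI_TEST_MODE=false
PIPELINE_LOG_LEVEL=INFO
RANDOM_SEED=42
"""

settings = parse_env_text(sample)
for name, value in settings.items():
    print(f"{name} = {value}")
```

All parsed values are strings, so a setting such as RANDOM_SEED would still need an explicit `int(...)` conversion before use.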

Quick Start: Run Your First Pipeline

Test Mode (No External Dependencies)

The fastest way to see the pipeline in action is to run it with the --dry-run flag:

python -m src.pipeline.hokusai_pipeline run \
    --dry-run \
    --contributed-data=data/test_fixtures/test_queries.csv \
    --output-dir=./outputs

This command:

  • Uses mock models and data
  • Completes in ~7 seconds
  • Generates real output files
  • Requires no external dependencies

View Results

# View the attestation-ready output
cat outputs/delta_output_*.json | jq '.'

# Check MLflow tracking
mlflow ui
# Open http://localhost:5000 in your browser
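
If jq is not installed, the same output can be read from Python. The sketch below assumes only the `delta_output_*.json` naming pattern used in the command above; `load_latest_output` is a hypothetical helper, and choosing the file by modification time is an illustrative choice rather than documented pipeline behavior.

```python
import glob
import json
import os


def load_latest_output(output_dir="./outputs"):
    """Load the most recently modified delta_output_*.json file, or return None."""
    paths = glob.glob(os.path.join(output_dir, "delta_output_*.json"))
    if not paths:
        return None
    latest = max(paths, key=os.path.getmtime)
    with open(latest, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    result = load_latest_output()
    print(json.dumps(result, indent=2) if result is not None else "No output files found.")
```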

Understanding the Output

The pipeline writes its results as a JSON document intended for attestation:

{
  "schema_version": "1.0",
  "delta_computation": {
    "delta_one_score": 0.0332,
    "metric_deltas": {
      "accuracy": {
        "baseline_value": 0.8545,
        "new_value": 0.8840,
        "absolute_delta": 0.0296,
        "relative_delta": 0.0346,
        "improvement": true
      }
    }
  },
  "contributor_attribution": {
    "contributor_id": "contributor_xyz789",
    "wallet_address": "0x742d35Cc6634C0532925a3b844Bc9e7595f62341",
    "contributed_samples": 100
  }
}
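
The fields under metric_deltas can be sanity-checked by hand. The sketch below assumes the straightforward definitions (absolute delta = new value minus baseline, relative delta = absolute delta divided by baseline, and improvement meaning a positive change for a higher-is-better metric such as accuracy); these definitions are an assumption, not taken from the pipeline source, and `metric_delta` is a hypothetical helper. How delta_one_score is derived from the per-metric deltas is not covered here.

```python
def metric_delta(baseline_value, new_value):
    """Compute deltas under the assumed definitions described above."""
    absolute = new_value - baseline_value
    relative = absolute / baseline_value
    return {
        "absolute_delta": absolute,
        "relative_delta": relative,
        "improvement": absolute > 0,  # assumes higher is better
    }


# Inputs are the four-decimal values shown in the sample output
accuracy = metric_delta(0.8545, 0.8840)
print(f"absolute_delta ~ {accuracy['absolute_delta']:.4f}")
print(f"relative_delta ~ {accuracy['relative_delta']:.4f}")
print(f"improvement: {accuracy['improvement']}")
```

Because the inputs here are already rounded to four decimals, the recomputed deltas can differ from the reported 0.0296 and 0.0346 by about 0.0001; the pipeline presumably computes them from unrounded values.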

Working with Real Data

Prepare Your Data

Create a CSV file with your contributed data:

query_id,query,relevant_doc_id,label
q001,"What is machine learning?",doc123,1
q002,"How to train a model?",doc456,1
q003,"Python programming basics",doc789,0
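
Before submitting, it can help to check the file against the column layout shown above. This is a minimal sketch, separate from the pipeline's own validation step: `validate_rows` is a hypothetical helper, and the rules it applies (binary 0/1 labels, non-empty queries, unique query_id values) are assumptions inferred from the sample rows rather than documented requirements.

```python
import csv
import io

EXPECTED_COLUMNS = ["query_id", "query", "relevant_doc_id", "label"]


def validate_rows(csv_text):
    """Return a list of problems found in the CSV text; empty means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header
    for line_no, row in enumerate(reader, start=2):
        query_id = row["query_id"]
        if query_id in seen_ids:
            problems.append(f"line {line_no}: duplicate query_id {query_id}")
        seen_ids.add(query_id)
        if not row["query"]:
            problems.append(f"line {line_no}: empty query")
        if row["label"] not in {"0", "1"}:  # assumed binary relevance label
            problems.append(f"line {line_no}: label must be 0 or 1, got {row['label']!r}")
    return problems


sample_csv = """\
query_id,query,relevant_doc_id,label
q001,"What is machine learning?",doc123,1
q002,"How to train a model?",doc456,1
q003,"Python programming basics",doc789,0
"""

sample_problems = validate_rows(sample_csv)
print(sample_problems if sample_problems else "No problems found.")
```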

Run the Pipeline

python -m src.pipeline.hokusai_pipeline run \
    --contributed-data=path/to/your/data.csv \
    --baseline-model-path=path/to/baseline/model \
    --output-dir=./outputs

Key Configuration Options

Environment Variables

Variable                Default        Description
HOKUSAI_TEST_MODE       false          Enable test mode with mock data
PIPELINE_LOG_LEVEL      INFO           Logging verbosity (DEBUG, INFO, WARNING, ERROR)
MLFLOW_TRACKING_URI     file:./mlruns  MLflow tracking location
ENABLE_PII_DETECTION    true           Automatic PII detection and hashing
DATA_VALIDATION_STRICT  false          Fail on any validation warning

Command-Line Arguments

Argument               Description
--contributed-data     Path to your contribution data (required)
--dry-run              Use mock data and models
--output-dir           Where to save results
--baseline-model-path  Path to baseline model
--sample-size          Limit data samples for testing
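
When the pipeline is launched from another script, the arguments above can be assembled programmatically. The following sketch uses only the flags listed in the table; `build_pipeline_command` is a hypothetical helper, and passing --sample-size as `--sample-size=N` is an assumption modeled on the `--flag=value` form used in the commands shown earlier.

```python
import shlex


def build_pipeline_command(contributed_data, output_dir=None,
                           baseline_model_path=None, sample_size=None,
                           dry_run=False):
    """Build the argument list for one pipeline run."""
    cmd = ["python", "-m", "src.pipeline.hokusai_pipeline", "run"]
    if dry_run:
        cmd.append("--dry-run")
    cmd.append(f"--contributed-data={contributed_data}")  # required argument
    if baseline_model_path is not None:
        cmd.append(f"--baseline-model-path={baseline_model_path}")
    if output_dir is not None:
        cmd.append(f"--output-dir={output_dir}")
    if sample_size is not None:
        cmd.append(f"--sample-size={int(sample_size)}")  # value syntax is assumed
    return cmd


dry_run_cmd = build_pipeline_command(
    "data/test_fixtures/test_queries.csv",
    output_dir="./outputs",
    dry_run=True,
)
print(shlex.join(dry_run_cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the project root, which avoids shell quoting issues with paths that contain spaces.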

For a complete configuration reference, see the Configuration Guide.

Common Issues and Solutions

Permission Denied on setup.sh

chmod +x setup.sh

Python Version Mismatch

Ensure you're using Python 3.8+:

python --version

No Output Generated

Check output directory permissions:

mkdir -p outputs
chmod 755 outputs

Next Steps

Now that you have Hokusai running, explore these guides:

  1. Configuration Guide - Detailed configuration options
  2. Supplying Data - Learn about the data contribution process
  3. Architecture Overview - Understand the system design
  4. API Reference - Integrate with your systems

Getting Help