Getting Started with Hokusai
Hokusai is a decentralized protocol that incentivizes high-quality data contributions to improve AI models. This guide will help you get started based on your role in the ecosystem.
Quick Overview by Role
For Data Suppliers
- Install the Hokusai data pipeline
- Prepare your data in supported formats
- Run data validation and quality checks
- Submit data for evaluation and earn rewards
For AI Model Developers
- Set up the Hokusai SDK and pipeline
- Integrate your model with the evaluation framework
- Define performance metrics for DeltaOne token issuance
- Deploy and monitor model improvements
For Token Investors
- Understand the bonding curve mechanism
- Participate in token auctions
- Monitor token supply and burn rates
- Track model performance metrics
System Requirements
Minimum Requirements
- Python 3.8 or higher
- 8GB RAM
- 10GB free disk space
- Unix-based OS (macOS, Linux) or WSL on Windows
Recommended Requirements
- Python 3.11
- 16GB RAM
- 50GB free disk space for model storage
- SSD for faster data processing
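The Python version and disk space minimums above can be checked before installing. The following is a minimal sketch using only the standard library; the function name `check_requirements` is hypothetical and not part of Hokusai, and RAM is not checked because the standard library has no portable way to read it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)      # minimum Python version listed above
MIN_FREE_DISK_GB = 10    # minimum free disk space listed above


def check_requirements(path=".", version=None, free_bytes=None):
    """Return a list of problems; an empty list means the minimums are met.

    `version` and `free_bytes` can be passed explicitly for testing;
    by default they are read from the running interpreter and from `path`.
    """
    version = tuple(version or sys.version_info[:2])
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"need {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher"
        )
    free_gb = free_bytes / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(
            f"{free_gb:.1f} GB free disk space, need {MIN_FREE_DISK_GB} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```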
Installation
1. Clone the Repository
git clone https://github.com/hokusai/hokusai-data-pipeline.git
cd hokusai-data-pipeline
2. Create Python Virtual Environment
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows (native shell; under WSL use the macOS/Linux command above):
venv\Scripts\activate
3. Run Setup Script
The project includes a setup script that handles all dependencies:
./setup.sh
This script will:
- Install Python dependencies from requirements.txt
- Set up the MLflow tracking directory
- Create necessary data directories
- Validate the installation
4. Configure Environment
Create a .env file in the project root:
# MLflow Configuration
MLFLOW_TRACKING_URI=file:./mlruns
MLFLOW_EXPERIMENT_NAME=hokusai-pipeline
# Pipeline Configuration
HOKUSAI_TEST_MODE=false
PIPELINE_LOG_LEVEL=INFO
RANDOM_SEED=42
# Optional: Linear API for workflow automation
LINEAR_API_KEY=your_linear_api_key_here
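To confirm what the .env file contains, it can be parsed in a few lines of Python. This is a rough sketch for inspection only: the source does not say how the pipeline itself loads the file, and the helper names `parse_env` and `load_env_file` are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_env_file(path=".env"):
    """Read a .env file from disk and return its settings as a dict."""
    with open(path, encoding="utf-8") as handle:
        return parse_env(handle.read())
```

For example, `load_env_file()["MLFLOW_TRACKING_URI"]` should return `file:./mlruns` for the file shown above.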
Quick Start: Run Your First Pipeline
Test Mode (No External Dependencies)
The fastest way to see the pipeline in action is to use dry-run mode:
python -m src.pipeline.hokusai_pipeline run \
--dry-run \
--contributed-data=data/test_fixtures/test_queries.csv \
--output-dir=./outputs
This command:
- Uses mock models and data
- Completes in ~7 seconds
- Generates real output files
- Requires no external dependencies
View Results
# View the attestation-ready output (requires jq)
jq '.' outputs/delta_output_*.json
# Check MLflow tracking
mlflow ui
# Open http://localhost:5000 in your browser
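If jq is not installed, the same output can be read from Python. The sketch below assumes only the `outputs/delta_output_*.json` naming shown above; the helper name `load_latest_output` is hypothetical, and "latest" here means the most recently modified file.

```python
import glob
import json
import os


def load_latest_output(output_dir="./outputs"):
    """Return (path, parsed JSON) for the newest delta_output_*.json.

    Returns None when the directory holds no matching file.
    """
    pattern = os.path.join(output_dir, "delta_output_*.json")
    paths = glob.glob(pattern)
    if not paths:
        return None
    # Pick the file with the most recent modification time.
    latest = max(paths, key=os.path.getmtime)
    with open(latest, encoding="utf-8") as handle:
        return latest, json.load(handle)


if __name__ == "__main__":
    result = load_latest_output()
    if result is None:
        print("No delta_output_*.json files found in ./outputs")
    else:
        path, data = result
        print(path)
        print(json.dumps(data, indent=2))
```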
Understanding the Output
The pipeline generates a comprehensive JSON output for attestation:
{
  "schema_version": "1.0",
  "delta_computation": {
    "delta_one_score": 0.0332,
    "metric_deltas": {
      "accuracy": {
        "baseline_value": 0.8545,
        "new_value": 0.8840,
        "absolute_delta": 0.0296,
        "relative_delta": 0.0346,
        "improvement": true
      }
    }
  },
  "contributor_attribution": {
    "contributor_id": "contributor_xyz789",
    "wallet_address": "0x742d35Cc6634C0532925a3b844Bc9e7595f62341",
    "contributed_samples": 100
  }
}
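The fields under `metric_deltas` appear to be related arithmetically. The sketch below checks that relationship under stated assumptions that the source does not confirm: `absolute_delta` is `new_value - baseline_value`, `relative_delta` is `absolute_delta / baseline_value`, and `improvement` is true when the metric increased. The values in the example are rounded to four decimals, so the comparison uses a small tolerance. The function name `check_metric_deltas` is hypothetical.

```python
import math


def check_metric_deltas(output, tolerance=5e-4):
    """Return {metric_name: bool}, True when the reported deltas are
    consistent with the baseline and new values (within `tolerance`).

    Assumes absolute = new - baseline, relative = absolute / baseline,
    and that a higher value counts as an improvement.
    """
    results = {}
    metrics = output["delta_computation"]["metric_deltas"]
    for name, m in metrics.items():
        baseline = m["baseline_value"]
        absolute = m["new_value"] - baseline
        # A zero baseline has no defined relative change.
        relative = absolute / baseline if baseline else math.nan
        results[name] = (
            abs(absolute - m["absolute_delta"]) <= tolerance
            and abs(relative - m["relative_delta"]) <= tolerance
            and (absolute > 0) == m["improvement"]
        )
    return results
```

A result of `False` for a metric does not necessarily mean the output is wrong; it may mean the pipeline defines the deltas differently from the assumptions above.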
Working with Real Data
Prepare Your Data
Create a CSV file with your contributed data:
query_id,query,relevant_doc_id,label
q001,"What is machine learning?",doc123,1
q002,"How to train a model?",doc456,1
q003,"Python programming basics",doc789,0
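A quick pre-check of the CSV can catch obvious problems before running the pipeline. This is a sketch, not the pipeline's own validation: the column names come from the example above, while the rules that `label` must be 0 or 1, that `query` must be non-empty, and that `query_id` must be unique are assumptions inferred from that example. The function name `validate_contribution_csv` is hypothetical.

```python
import csv
import io

# Column names taken from the example CSV above.
REQUIRED_COLUMNS = ["query_id", "query", "relevant_doc_id", "label"]


def validate_contribution_csv(text):
    """Return a list of problems found in CSV `text`; empty means none."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    for row_number, row in enumerate(reader, start=1):  # data rows, 1-based
        if not (row["query"] or "").strip():
            problems.append(f"row {row_number}: empty query")
        if row["label"] not in ("0", "1"):  # assumed binary relevance label
            problems.append(f"row {row_number}: label {row['label']!r} is not 0 or 1")
        if row["query_id"] in seen_ids:
            problems.append(f"row {row_number}: duplicate query_id {row['query_id']!r}")
        seen_ids.add(row["query_id"])
    return problems
```

To check a file on disk, read it first, for example `validate_contribution_csv(open("path/to/your/data.csv", encoding="utf-8").read())`.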
Run the Pipeline
python -m src.pipeline.hokusai_pipeline run \
--contributed-data=path/to/your/data.csv \
--baseline-model-path=path/to/baseline/model \
--output-dir=./outputs
Key Configuration Options
Environment Variables
| Variable | Default | Description |
|---|---|---|
| HOKUSAI_TEST_MODE | false | Enable test mode with mock data |
| PIPELINE_LOG_LEVEL | INFO | Logging verbosity (DEBUG, INFO, WARNING, ERROR) |
| MLFLOW_TRACKING_URI | file:./mlruns | MLflow tracking location |
| ENABLE_PII_DETECTION | true | Automatic PII detection and hashing |
| DATA_VALIDATION_STRICT | false | Fail on any validation warning |
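The defaults in the table can be mirrored in a small helper to see which values would apply in the current shell. This is an illustrative sketch: the variable names and defaults come from the table, but the function `resolve_settings` is hypothetical, and treating `true`, `1`, and `yes` as true is an assumption about how the pipeline parses booleans.

```python
import os

# Defaults copied from the environment variable table above.
DEFAULTS = {
    "HOKUSAI_TEST_MODE": "false",
    "PIPELINE_LOG_LEVEL": "INFO",
    "MLFLOW_TRACKING_URI": "file:./mlruns",
    "ENABLE_PII_DETECTION": "true",
    "DATA_VALIDATION_STRICT": "false",
}
BOOLEAN_KEYS = {"HOKUSAI_TEST_MODE", "ENABLE_PII_DETECTION", "DATA_VALIDATION_STRICT"}


def resolve_settings(environ=None):
    """Return the effective settings: environment values over defaults."""
    environ = os.environ if environ is None else environ
    settings = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(key, default)
        if key in BOOLEAN_KEYS:
            # Assumed boolean parsing; the pipeline may accept other spellings.
            settings[key] = raw.strip().lower() in ("true", "1", "yes")
        else:
            settings[key] = raw
    return settings
```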
Command-Line Arguments
| Argument | Description |
|---|---|
| --contributed-data | Path to your contribution data (required) |
| --dry-run | Use mock data and models |
| --output-dir | Where to save results |
| --baseline-model-path | Path to baseline model |
| --sample-size | Limit data samples for testing |
For the complete configuration reference, see the Configuration Guide.
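When the pipeline is launched from another Python script, the documented arguments can be assembled into a command list. This is a sketch built only from the flags listed above; the helper `build_pipeline_command` is hypothetical, and the `--sample-size=N` value form is an assumption modeled on the other `--flag=value` arguments.

```python
import sys


def build_pipeline_command(
    contributed_data,
    output_dir=None,
    dry_run=False,
    baseline_model_path=None,
    sample_size=None,
):
    """Build the argument list for the hokusai_pipeline `run` command."""
    command = [sys.executable, "-m", "src.pipeline.hokusai_pipeline", "run"]
    if dry_run:
        command.append("--dry-run")
    command.append(f"--contributed-data={contributed_data}")  # required
    if baseline_model_path:
        command.append(f"--baseline-model-path={baseline_model_path}")
    if output_dir:
        command.append(f"--output-dir={output_dir}")
    if sample_size is not None:
        # Assumed value syntax; the source does not show this flag in use.
        command.append(f"--sample-size={sample_size}")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the project root with the virtual environment active.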
Common Issues and Solutions
Permission Denied on setup.sh
chmod +x setup.sh
Python Version Mismatch
Ensure you're using Python 3.8+:
python --version
No Output Generated
Check output directory permissions:
mkdir -p outputs
chmod 755 outputs
Next Steps
Now that you have Hokusai running:
- Configuration Guide - Detailed configuration options
- Supplying Data - Learn about the data contribution process
- Architecture Overview - Understand the system design
- API Reference - Integrate with your systems
Getting Help
- Check the Troubleshooting Guide
- Review existing GitHub Issues
- Join our Discord community