Quick Start Guide
Get up and running with the Hokusai data pipeline in 5 minutes. This guide shows you how to test the pipeline locally with sample data.
Prerequisites
This guide assumes you have:
- Python 3.8+ installed
- Basic familiarity with the command line
- Git installed on your system
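The prerequisites above can be checked before you start. This is a minimal sketch, not part of the pipeline: the helper name check_prerequisites is hypothetical, and it only confirms that the running interpreter is at least Python 3.8 and that a git executable is on your PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    return {
        # Compare the running interpreter's version against the minimum.
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```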
5-Minute Example
Step 1: Setup Environment
# Clone the pipeline repository (if not already done)
git clone https://github.com/hokusai-protocol/hokusai-data-pipeline
cd hokusai-data-pipeline
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt
Step 2: Run Pipeline in Dry-Run Mode
# Run with mock data to test setup
python -m src.pipeline.hokusai_pipeline run \
--dry-run \
--contributed-data=data/test_fixtures/test_queries.csv \
--output-dir=./quick-start-output
This command:
- Uses --dry-run to generate mock baseline models
- Processes the test dataset
- Outputs attestation-ready results
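The backslash line continuations in the Step 2 command are specific to Unix-style shells. If you are on Windows, or want to launch the dry run from a script, the same invocation can be sketched in Python. The function names build_dry_run_command and run_dry_run are hypothetical; the module path and flags are exactly those shown in Step 2, and nothing else about the pipeline's command-line interface is assumed.

```python
import subprocess
import sys


def build_dry_run_command(
    contributed_data="data/test_fixtures/test_queries.csv",
    output_dir="./quick-start-output",
):
    """Build the Step 2 command as an argument list (no shell quoting needed).

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    return [
        # sys.executable is the current interpreter, so an activated
        # virtual environment is used automatically.
        sys.executable, "-m", "src.pipeline.hokusai_pipeline", "run",
        "--dry-run",
        f"--contributed-data={contributed_data}",
        f"--output-dir={output_dir}",
    ]


def run_dry_run(**kwargs):
    """Run the dry run; raises subprocess.CalledProcessError on a non-zero exit."""
    return subprocess.run(build_dry_run_command(**kwargs), check=True)
```

Call run_dry_run() from the repository root with the virtual environment active, so that the src package and the installed dependencies are both importable.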
Step 3: View Results
# Check output
ls -la quick-start-output/
# View the attestation output (requires jq)
cat quick-start-output/deltaone_output_*.json | jq '.'
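If jq is not installed, the attestation can be inspected from Python instead. This is a minimal sketch that assumes the output directory and the deltaone_output_*.json file naming used in the commands above; the helper name load_latest_attestation is hypothetical and not part of the pipeline.

```python
import json
from pathlib import Path


def load_latest_attestation(output_dir="quick-start-output"):
    """Return the most recently modified deltaone_output_*.json as a dict.

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    # Sort matching files by modification time, oldest first.
    candidates = sorted(
        Path(output_dir).glob("deltaone_output_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(
            f"No deltaone_output_*.json files found in {output_dir}"
        )
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

After Step 2 has finished, print(json.dumps(load_latest_attestation(), indent=2)) pretty-prints the newest attestation, much like the jq command.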
Step 4: View the MLflow UI (Optional)
# Start the MLflow UI
mlflow ui
# Open browser to http://localhost:5000
You'll see:
- Experiment runs
- Model parameters
- Performance metrics
- Artifacts
Understanding the Output
The pipeline generates several output files:
DeltaOne Attestation
{
  "model_id": "gpt-3.5-turbo",
  "run_id": "run_20240620_153045",
  "deltaone_score": 8.5,
  "improvements": {
    "accuracy": 0.085,
    "response_quality": 0.082,
    "latency_reduction": 0.001
  },
  "metadata": {
    "data_points": 100,
    "evaluation_date": "2024-06-20T15:30:45Z",
    "pipeline_version": "1.0.0"
  }
}
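To read an attestation programmatically, a short sketch such as the following may help. It uses only the field names from the example above. Treating the five top-level keys as always present is an assumption made for illustration, and summarize_attestation is a hypothetical helper, not part of the pipeline.

```python
import json

# Top-level keys taken from the example attestation; treating them as
# required is an assumption made for this sketch.
EXPECTED_KEYS = ("model_id", "run_id", "deltaone_score", "improvements", "metadata")


def summarize_attestation(attestation):
    """Return a short, human-readable summary of an attestation dict.

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    missing = [key for key in EXPECTED_KEYS if key not in attestation]
    if missing:
        raise ValueError(f"Attestation is missing keys: {missing}")

    lines = [
        f"model: {attestation['model_id']}",
        f"run: {attestation['run_id']}",
        f"deltaone_score: {attestation['deltaone_score']}",
    ]
    # List each improvement metric with an explicit sign.
    for metric, value in attestation["improvements"].items():
        lines.append(f"  {metric}: {value:+.3f}")
    return "\n".join(lines)


example = json.loads("""
{
  "model_id": "gpt-3.5-turbo",
  "run_id": "run_20240620_153045",
  "deltaone_score": 8.5,
  "improvements": {
    "accuracy": 0.085,
    "response_quality": 0.082,
    "latency_reduction": 0.001
  },
  "metadata": {
    "data_points": 100,
    "evaluation_date": "2024-06-20T15:30:45Z",
    "pipeline_version": "1.0.0"
  }
}
""")
print(summarize_attestation(example))
```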