Your First Contribution
This guide walks you through submitting your first data contribution to improve a machine learning model using the Hokusai pipeline.
Prerequisites
Before starting, ensure you have:
- ✅ Installed the pipeline
- ✅ Run the quick start
- ✅ An Ethereum wallet address for rewards
- ✅ Training data in a supported format
Step 1: Prepare Your Data
Supported Formats
The pipeline accepts data in three formats:
CSV Format (Recommended for beginners)
query,document,relevance
"What is deep learning?","Deep learning is a subset of machine learning...",1
"How to cook pasta?","Deep learning uses neural networks...",0
"Explain neural networks","Neural networks are computing systems...",1
JSON Format
[
  {
    "query": "What is deep learning?",
    "document": "Deep learning is a subset of machine learning...",
    "relevance": 1
  },
  {
    "query": "How to cook pasta?",
    "document": "Deep learning uses neural networks...",
    "relevance": 0
  }
]
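Before submitting JSON, it is worth checking that every record has the three required keys and a binary relevance label. This is an illustrative sketch, not the pipeline's own validator; `check_records` and `REQUIRED_KEYS` are names introduced here for the example:

```python
import json

REQUIRED_KEYS = {"query", "document", "relevance"}

def check_records(records):
    """Return (index, problem) pairs for malformed entries."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif rec["relevance"] not in (0, 1):
            problems.append((i, "relevance must be 0 or 1"))
    return problems

records = json.loads("""[
  {"query": "What is deep learning?", "document": "...", "relevance": 1},
  {"query": "How to cook pasta?", "document": "...", "relevance": 2}
]""")
print(check_records(records))  # the second record has an invalid label
```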
Parquet Format
For large datasets (>100MB), use Parquet for better performance.
Data Quality Guidelines
Your data should:
- ✅ Be relevant to the model's domain
- ✅ Have accurate labels
- ✅ Not contain duplicate entries
- ✅ Pass PII detection checks
- ✅ Contain at least 100 samples (recommended: 1000+)
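Two of the guidelines above, no duplicates and a minimum sample count, are easy to check locally before submission. This is a rough standard-library sketch (the `quality_report` helper is introduced here for illustration, not part of the pipeline):

```python
import csv

def quality_report(path, min_samples=100):
    """Count rows and exact-duplicate (query, document) pairs in a CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    pairs = [(r["query"], r["document"]) for r in rows]
    return {
        "samples": len(rows),
        "duplicates": len(pairs) - len(set(pairs)),
        "enough_samples": len(rows) >= min_samples,
    }
```

Note this only catches exact duplicates; near-duplicates (e.g. the same document with trailing whitespace) need fuzzier matching.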
Step 2: Validate Your Data
Before submission, validate your data:
# Basic validation
python -m src.cli.validate_data \
    --input-file=my_contribution.csv \
    --check-pii \
    --check-duplicates
Expected output:
✓ File format: Valid CSV
✓ Required columns: Present
✓ Data types: Correct
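If you want a rough local screen before running the CLI's `--check-pii` pass, a simple regex scan catches the most obvious cases. The patterns below are illustrative only and are not the pipeline's actual PII detector:

```python
import re

# Illustrative patterns: obvious emails and US-style phone numbers
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_pii(text):
    """Return the names of the PII patterns that match `text`."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

print(find_pii("Contact me at jane.doe@example.com"))  # flags the email
```

A match here means the row should be removed or redacted before submission; a clean result does not guarantee the pipeline's detector will agree.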