Privacy Compliance

This guide outlines the privacy requirements and compliance standards for contributing data to Hokusai models, including how the data pipeline automatically handles privacy protection.

Overview

Hokusai requires all contributed data to meet strict privacy standards to protect individual rights and comply with data protection regulations. The data pipeline includes automatic privacy protection features to help ensure compliance.

Automatic Privacy Protection

Pipeline Privacy Features

The Hokusai data pipeline automatically handles privacy protection:

# Pipeline with default privacy protection
python -m src.pipeline.hokusai_pipeline run \
    --contributed-data=data.csv \
    --output-dir=./outputs

# Pipeline with strict privacy mode
python -m src.pipeline.hokusai_pipeline run \
    --contributed-data=data.csv \
    --privacy-strict \
    --output-dir=./outputs

Automatic PII Detection

The pipeline automatically scans for and handles PII:

  1. Email Addresses: Automatically hashed using SHA-256
  2. Phone Numbers: Detected and masked
  3. Social Security Numbers: Removed or hashed
  4. Names: Identified using NLP and anonymized
  5. IP Addresses: Converted to region codes
  6. Credit Card Numbers: Detected and removed

Example detection output:

{
  "privacy_scan": {
    "pii_detected": true,
    "fields_processed": {
      "email": "hashed",
      "phone": "masked",
      "user_id": "hashed",
      "ip_address": "generalized"
    },
    "privacy_score": 0.98,
    "compliance": ["GDPR", "CCPA"]
  }
}
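
The handling steps listed above can be illustrated with the Python standard library. This is a minimal sketch, not the pipeline's actual implementation: the function names, the regular expressions, and the mask format are assumptions chosen for illustration, and real phone and SSN formats vary more than these patterns cover.

```python
import hashlib
import re

# Illustrative patterns only (assumptions); production detectors need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def hash_email(match: re.Match) -> str:
    """Replace an email address with the SHA-256 hex digest of its lowercase form."""
    return hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()


def mask_phone(match: re.Match) -> str:
    """Keep only the last four digits of a phone number."""
    return "***-***-" + match.group(0)[-4:]


def scrub(text: str) -> str:
    """Hash emails, remove SSNs, and mask phone numbers in free text."""
    text = EMAIL_RE.sub(hash_email, text)
    text = SSN_RE.sub("[REMOVED]", text)
    text = PHONE_RE.sub(mask_phone, text)
    return text


print(scrub("Contact jane@example.com or 555-123-4567; SSN 123-45-6789."))
```

Note that an unsalted SHA-256 of an email address can be reversed by hashing a list of candidate addresses, so a secret salt (such as the `salt` setting shown later in this guide) matters in practice.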

Privacy Requirements

1. Data Anonymization

All data must be properly anonymized before submission:

Personal Identifiers

Remove or hash all direct identifiers:

  • Names (first, last, middle)
  • Email addresses
  • Phone numbers
  • Social security numbers
  • Account numbers
  • IP addresses
  • Device IDs

Quasi-Identifiers

Remove or generalize indirect identifiers:

  • Specific age → Age ranges (e.g., 25-34)
  • Exact location → Region or city level
  • Job title → Job category
  • Exact income → Income bracket
  • Specific dates → Month/year only
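
The generalizations above can be sketched in a few lines of Python. The band boundaries, bracket width, and function names below are hypothetical choices for illustration; they are not part of the Hokusai SDK.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical age band boundaries; each band runs from one edge up to the next edge minus one.
AGE_EDGES = [18, 25, 35, 45, 55, 65]


def age_to_range(age: int) -> str:
    """Map an exact age to a band such as '25-34'."""
    i = bisect_right(AGE_EDGES, age)
    if i == 0:
        return f"under {AGE_EDGES[0]}"
    if i == len(AGE_EDGES):
        return f"{AGE_EDGES[-1]}+"
    return f"{AGE_EDGES[i - 1]}-{AGE_EDGES[i] - 1}"


def income_to_bracket(income: float, width: int = 10_000) -> str:
    """Map an exact income to a fixed-width bracket."""
    low = int(income // width) * width
    return f"{low}-{low + width - 1}"


def date_to_month(d: date) -> str:
    """Drop the day of the month, keeping year and month only."""
    return d.strftime("%Y-%m")


print(age_to_range(29), income_to_bracket(52_300), date_to_month(date(2024, 1, 15)))
```

Wider bands give stronger protection at the cost of analytical detail, so the right width depends on how many records fall into each band.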

Sensitive Information

Remove or mask sensitive data:

  • Medical records
  • Financial transactions
  • Religious beliefs
  • Political affiliations
  • Biometric data

2. Anonymization Techniques

Using the SDK

from hokusai import PrivacyTools

# Initialize privacy tools
privacy = PrivacyTools()

# Automatic anonymization
anonymized_data = privacy.auto_anonymize(
    data=raw_data,
    sensitivity='high',
    preserve_utility=True
)

# Custom anonymization
custom_data = privacy.anonymize(
    data=raw_data,
    rules={
        'email': 'hash',
        'age': 'generalize:5',  # 5-year ranges
        'location': 'truncate:city',
        'income': 'bracket:10000'
    }
)

Using the Pipeline

import json

# Configure privacy settings
privacy_config = {
    "pii_detection": {
        "enabled": True,
        "sensitivity": "high",
        "custom_patterns": [
            r"EMP\d{6}",   # Employee IDs
            r"CUST-\d{8}"  # Customer IDs
        ]
    },
    "anonymization": {
        "hash_algorithm": "sha256",
        "salt": "your-secret-salt",  # placeholder; never commit a real salt
        "preserve_format": True
    }
}

# Save the settings as the JSON file passed to --privacy-config
with open("privacy_config.json", "w") as f:
    json.dump(privacy_config, f, indent=2)

# Run with custom privacy config
python -m src.pipeline.hokusai_pipeline run \
    --contributed-data=data.csv \
    --privacy-config=privacy_config.json
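
To make the `custom_patterns` and `salt` settings more concrete, here is a rough sketch of how custom identifiers might be found and replaced with salted tokens. How the pipeline actually applies the salt is not documented here; using HMAC-SHA256 is an assumption, and `pseudonymize` and `replace_custom_ids` are hypothetical helper names.

```python
import hashlib
import hmac
import re

CUSTOM_PATTERNS = [r"EMP\d{6}", r"CUST-\d{8}"]  # patterns from the config above
SALT = b"your-secret-salt"  # placeholder; load a real secret from secure storage

# One combined regex so the text is scanned in a single pass.
COMBINED = re.compile("|".join(f"(?:{p})" for p in CUSTOM_PATTERNS))


def pseudonymize(value: str, salt: bytes = SALT) -> str:
    """Keyed SHA-256 (HMAC), so tokens cannot be recomputed without the salt."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def replace_custom_ids(text: str) -> str:
    """Replace every custom identifier with its salted token."""
    return COMBINED.sub(lambda m: pseudonymize(m.group(0)), text)


print(replace_custom_ids("Ticket from EMP123456 about CUST-00001234"))
```

Because the same input and salt always produce the same token, records for one employee or customer stay linkable after replacement, which is what makes this pseudonymization rather than full anonymization.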

3. Privacy Validation

SDK Validation

from hokusai import PrivacyValidator

# Initialize validator
validator = PrivacyValidator(
    sensitivity_level='high',
    compliance_standards=['GDPR', 'CCPA', 'HIPAA']
)

# Validate your data
result = validator.validate(dataset)

if result.is_compliant:
    print('✓ Data meets privacy requirements')
    print(f'Privacy score: {result.privacy_score}/1.0')
else:
    print('✗ Privacy issues found:')
    for issue in result.issues:
        print(f' - {issue.field}: {issue.description}')

    # Apply suggested fixes
    fixed_data = validator.auto_fix(dataset, result.suggestions)

Pipeline Validation

# Validate privacy compliance only
python -m src.pipeline.hokusai_pipeline validate \
    --contributed-data=data.csv \
    --privacy-check-only \
    --compliance-standards="GDPR,CCPA"

Compliance Standards

1. GDPR Compliance

The pipeline supports GDPR compliance through:

  • Data Minimization: Only necessary fields retained
  • Purpose Limitation: Data used only for model improvement
  • Pseudonymization: Identifiers replaced with pseudonyms
  • Right to Erasure: Support for data deletion requests
  • Data Portability: Export data in standard formats
  • Privacy by Design: Built-in privacy protection

Configuration for GDPR:

export PRIVACY_COMPLIANCE_MODE=GDPR
export DATA_RETENTION_DAYS=90
export ENABLE_AUDIT_LOG=true

2. CCPA Compliance

For California residents' data:

  • Opt-Out Mechanism: Honor data sale opt-outs
  • Access Rights: Provide data access APIs
  • Deletion Rights: Support deletion requests
  • Transparency: Clear data usage disclosure
  • Non-Discrimination: Equal service regardless of privacy choices

Configuration for CCPA:

export PRIVACY_COMPLIANCE_MODE=CCPA
export ENABLE_DO_NOT_SELL=true
export DATA_ACCESS_API=true

3. HIPAA Compliance

For healthcare data:

  • De-identification: Remove the 18 identifier types listed in the HIPAA Safe Harbor method
  • Limited Dataset: Create limited datasets when needed
  • Encryption: End-to-end encryption
  • Access Controls: Role-based access
  • Audit Trails: Complete audit logging

Configuration for HIPAA:

export PRIVACY_COMPLIANCE_MODE=HIPAA
export ENABLE_HIPAA_DEIDENTIFICATION=true
export ENCRYPTION_LEVEL=AES256

Privacy Configuration

Environment Variables

# Core privacy settings
ENABLE_PII_DETECTION=true
PII_SENSITIVITY_LEVEL=high # low, medium, high
PII_HASH_ALGORITHM=sha256 # sha256, sha512, bcrypt

# Anonymization settings
ANONYMIZATION_SALT=your-secret-salt
PRESERVE_DATA_UTILITY=true
GENERALIZATION_LEVELS=5

# Compliance settings
PRIVACY_COMPLIANCE_MODE=GDPR,CCPA
DATA_RETENTION_DAYS=90
ENABLE_AUDIT_LOG=true
AUDIT_LOG_ENCRYPTION=true

# Performance settings
PRIVACY_SCAN_BATCH_SIZE=10000
PRIVACY_SCAN_WORKERS=4
PRIVACY_SCAN_TIMEOUT=300
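
If you read these variables from your own tooling, it helps to validate them in one place. The sketch below assumes the variable names listed above; the `PrivacySettings` class, the `load_privacy_settings` function, and the fallback defaults are hypothetical and may not match the pipeline's real defaults.

```python
import os
from dataclasses import dataclass
from typing import Tuple

TRUTHY = {"1", "true", "yes", "on"}
SENSITIVITY_LEVELS = {"low", "medium", "high"}


@dataclass(frozen=True)
class PrivacySettings:
    pii_detection: bool
    sensitivity: str
    compliance_modes: Tuple[str, ...]
    retention_days: int


def load_privacy_settings(env=os.environ) -> PrivacySettings:
    """Read privacy settings from an environment-like mapping and validate them."""
    sensitivity = env.get("PII_SENSITIVITY_LEVEL", "high").strip().lower()
    if sensitivity not in SENSITIVITY_LEVELS:
        raise ValueError(f"Unknown PII_SENSITIVITY_LEVEL: {sensitivity!r}")

    retention_days = int(env.get("DATA_RETENTION_DAYS", "90"))
    if retention_days <= 0:
        raise ValueError("DATA_RETENTION_DAYS must be positive")

    # PRIVACY_COMPLIANCE_MODE may hold several comma-separated standards.
    modes = tuple(
        part.strip().upper()
        for part in env.get("PRIVACY_COMPLIANCE_MODE", "").split(",")
        if part.strip()
    )
    return PrivacySettings(
        pii_detection=env.get("ENABLE_PII_DETECTION", "true").strip().lower() in TRUTHY,
        sensitivity=sensitivity,
        compliance_modes=modes,
        retention_days=retention_days,
    )


settings = load_privacy_settings({
    "PII_SENSITIVITY_LEVEL": "high",
    "PRIVACY_COMPLIANCE_MODE": "GDPR,CCPA",
    "DATA_RETENTION_DAYS": "90",
})
print(settings)
```

Calling `load_privacy_settings()` with no argument reads from the process environment instead of the example dictionary.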

Configuration File

{
  "privacy": {
    "pii_detection": {
      "enabled": true,
      "sensitivity": "high",
      "scan_text_fields": true,
      "scan_numeric_fields": false,
      "custom_patterns": {
        "employee_id": "EMP\\d{6}",
        "customer_id": "CUST-\\d{8}"
      }
    },
    "anonymization": {
      "strategy": "hash",
      "hash_algorithm": "sha256",
      "preserve_format": true,
      "field_rules": {
        "email": {
          "method": "hash",
          "preserve_domain": true
        },
        "age": {
          "method": "generalize",
          "bin_size": 5
        },
        "location": {
          "method": "truncate",
          "level": "city"
        }
      }
    },
    "compliance": {
      "standards": ["GDPR", "CCPA"],
      "data_retention_days": 90,
      "audit_logging": true,
      "encryption_at_rest": true
    }
  }
}
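
To show what the `field_rules` section might mean in practice, here is a small sketch that applies the three rules to one record. It is an interpretation rather than the pipeline's code: `apply_rule`, `anonymize_record`, and the assumption that a location is a dictionary ordered from country down to street are all illustrative.

```python
import hashlib

# Same rules as the "field_rules" section of the configuration file.
FIELD_RULES = {
    "email": {"method": "hash", "preserve_domain": True},
    "age": {"method": "generalize", "bin_size": 5},
    "location": {"method": "truncate", "level": "city"},
}

# Assumed location granularity, ordered from coarse to fine.
LOCATION_LEVELS = ["country", "region", "city", "street"]


def apply_rule(value, rule):
    """Apply one field rule to one value."""
    method = rule["method"]
    if method == "hash":
        text = str(value)
        if rule.get("preserve_domain") and "@" in text:
            # Hash only the local part so the domain stays usable for analysis.
            local, domain = text.rsplit("@", 1)
            return hashlib.sha256(local.encode("utf-8")).hexdigest() + "@" + domain
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if method == "generalize":
        size = rule["bin_size"]
        low = (int(value) // size) * size
        return f"{low}-{low + size - 1}"
    if method == "truncate":
        # Keep every level up to and including the configured one.
        keep = LOCATION_LEVELS[: LOCATION_LEVELS.index(rule["level"]) + 1]
        return {key: value[key] for key in keep if key in value}
    raise ValueError(f"Unsupported method: {method!r}")


def anonymize_record(record, rules=FIELD_RULES):
    """Return a copy of the record with every configured field transformed."""
    return {
        field: apply_rule(value, rules[field]) if field in rules else value
        for field, value in record.items()
    }


record = {
    "email": "jane@example.com",
    "age": 37,
    "location": {"country": "US", "region": "CA", "city": "Oakland", "street": "1 Main St"},
    "plan": "pro",
}
clean = anonymize_record(record)
print(clean["age"], clean["location"])
```

Fields without a rule, such as `plan` here, pass through unchanged, so any field that needs protection has to be listed explicitly.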

Privacy Impact Assessment

Automated Assessment

The pipeline can generate privacy impact assessments:

python -m src.pipeline.hokusai_pipeline assess-privacy \
    --contributed-data=data.csv \
    --output-format=pdf \
    --output-file=privacy_assessment.pdf

Assessment includes:

  1. Data Inventory: All fields and types
  2. Risk Analysis: Privacy risk scoring
  3. Compliance Check: Standards compliance
  4. Recommendations: Improvement suggestions

Manual Assessment Checklist

Before submitting data, confirm that you have:

  • Identified all personal data fields
  • Applied appropriate anonymization
  • Validated privacy compliance
  • Documented data processing purpose
  • Obtained necessary consents
  • Implemented retention policies
  • Tested de-identification effectiveness
  • Created audit trail
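
One common way to test de-identification effectiveness is a k-anonymity check: every combination of quasi-identifier values should be shared by at least k records. This guide does not say which measure Hokusai uses internally, so treat the sketch below as an independent sanity check; the function names and the sample rows are made up for illustration.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group (the dataset's k), or 0 if empty."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_groups(rows, quasi_identifiers, k):
    """List the quasi-identifier combinations shared by fewer than k rows."""
    return [combo for combo, n in group_sizes(rows, quasi_identifiers).items() if n < k]


rows = [
    {"age": "25-34", "region": "West", "score": 0.7},
    {"age": "25-34", "region": "West", "score": 0.4},
    {"age": "35-44", "region": "East", "score": 0.9},
]
print(k_anonymity(rows, ["age", "region"]))
print(risky_groups(rows, ["age", "region"], k=2))
```

In the sample rows the single 35-44/East record is unique, so it would need further generalization or suppression before the dataset reaches k = 2.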

Best Practices

1. Data Collection

  • Collect only necessary data
  • Use privacy-preserving collection methods
  • Implement consent management
  • Document collection purposes

2. Data Processing

# Best practice example
from hokusai import PrivacyPipeline

pipeline = PrivacyPipeline(
    auto_detect_pii=True,
    fail_on_pii=True,
    audit_all_operations=True
)

# Process with full privacy protection
result = pipeline.process(
    data=raw_data,
    purpose="model_improvement",
    retention_days=90
)

3. Data Storage

  • Encrypt data at rest
  • Implement access controls
  • Run regular security audits
  • Automate data expiration
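
Data expiration can be as simple as comparing each record's storage time against the retention window. The following sketch assumes you track a timezone-aware `stored_at` timestamp per record; `is_expired` and `expired_keys` are hypothetical helpers, and the 90-day default mirrors the `DATA_RETENTION_DAYS` value used elsewhere in this guide.

```python
from datetime import datetime, timedelta, timezone


def is_expired(stored_at, retention_days=90, now=None):
    """Return True when a record is older than the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at > timedelta(days=retention_days)


def expired_keys(records, retention_days=90, now=None):
    """Return the keys of all records that should be deleted."""
    return [
        key
        for key, stored_at in records.items()
        if is_expired(stored_at, retention_days, now)
    ]


# Example with a fixed "now" so the result does not depend on the current date.
now = datetime(2024, 4, 15, tzinfo=timezone.utc)
records = {
    "batch-001": datetime(2024, 1, 15, tzinfo=timezone.utc),
    "batch-002": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
print(expired_keys(records, retention_days=90, now=now))
```

Running a check like this on a schedule and logging each deletion also contributes to the audit trail.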

4. Data Sharing

  • Minimize data sharing
  • Use secure transfer methods
  • Track all data transfers
  • Implement data agreements

Troubleshooting

Common Issues

PII Detection False Positives

# Exclude known safe fields
privacy_config = {
    "pii_detection": {
        "exclude_fields": ["product_id", "order_id"],
        "exclude_patterns": ["PROD-*", "ORD-*"]
    }
}
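
The exclusion logic might work roughly as follows. This sketch assumes that `exclude_patterns` entries are shell-style globs, which is a guess based on their `PROD-*` form; if the pipeline treats them as regular expressions instead, `PROD-*` would mean "PROD followed by zero or more hyphens". The `should_scan` function is hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDE_FIELDS = {"product_id", "order_id"}
EXCLUDE_PATTERNS = ["PROD-*", "ORD-*"]  # interpreted as globs (assumption)


def should_scan(field: str, value) -> bool:
    """Return False for fields or values that are known to be safe."""
    if field in EXCLUDE_FIELDS:
        return False
    # Case-sensitive glob match against the whole value.
    return not any(fnmatchcase(str(value), pattern) for pattern in EXCLUDE_PATTERNS)


print(should_scan("product_id", "12345"))
print(should_scan("notes", "PROD-12345"))
print(should_scan("notes", "call me at 555-123-4567"))
```

Keep exclusions narrow: a pattern that matches too broadly can hide real PII from the scan.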

Over-Anonymization

# Balance privacy and utility
anonymizer = PrivacyTools(
    preserve_utility=True,
    utility_threshold=0.8  # Maintain 80% data utility
)

Compliance Conflicts

# Handle multiple compliance requirements
validator = PrivacyValidator(
    compliance_standards=['GDPR', 'CCPA'],
    conflict_resolution='strictest'  # Use most restrictive rules
)

Monitoring and Auditing

Privacy Metrics Dashboard

The pipeline provides privacy metrics:

{
  "privacy_metrics": {
    "pii_fields_found": 5,
    "pii_fields_processed": 5,
    "anonymization_success_rate": 1.0,
    "data_utility_score": 0.85,
    "compliance_score": {
      "GDPR": 1.0,
      "CCPA": 0.98
    },
    "risk_level": "low"
  }
}

Audit Logs

All privacy operations are logged:

2024-01-15 10:30:45 [PRIVACY] PII detected in field 'email'
2024-01-15 10:30:46 [PRIVACY] Applied SHA-256 hashing to 'email'
2024-01-15 10:30:47 [PRIVACY] Generalized 'age' to 5-year ranges
2024-01-15 10:30:48 [PRIVACY] Privacy validation passed
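
If you want to process these logs programmatically, a small parser is enough for the format shown above. The sketch assumes every line follows the `timestamp [CATEGORY] message` layout of the sample; the real log format may carry more fields, and `parse_audit_line` is a hypothetical helper.

```python
import re
from datetime import datetime

# Matches lines such as: 2024-01-15 10:30:45 [PRIVACY] PII detected in field 'email'
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$")


def parse_audit_line(line: str):
    """Split one audit log line into timestamp, category, and message."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # line does not follow the expected layout
    timestamp, category, message = match.groups()
    return {
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "category": category,
        "message": message,
    }


sample = """\
2024-01-15 10:30:45 [PRIVACY] PII detected in field 'email'
2024-01-15 10:30:46 [PRIVACY] Applied SHA-256 hashing to 'email'
"""
entries = [parse_audit_line(line) for line in sample.splitlines()]
print(entries[0]["message"])
```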

Next Steps

For additional support, contact our Support Team or join our Community Forum.