# Performance Guide
This document provides performance benchmarks, optimization tips, and best practices for using the Fairness Pipeline Development Toolkit efficiently.
## Table of Contents

- Performance Benchmarks
- Performance Characteristics
- Optimization Tips
- Scalability Considerations
- Memory Usage
- CI/CD Integration
- Best Practices
- Troubleshooting Performance Issues
- Additional Resources
- Performance Comparison

## Performance Benchmarks

### Benchmark Suite

The toolkit includes a comprehensive benchmark suite located in the `benchmarks/` directory:

- `benchmark_metrics_100k.py`: Benchmarks fairness metrics computation on 100k samples
- `benchmark_pipeline.py`: Benchmarks pipeline operations across different dataset sizes
- `benchmark_bootstrap.py`: Benchmarks bootstrap confidence interval computation
### Performance Test Suite

A pytest-based performance test suite is available in `tests/performance/test_performance_suite.py`:

- **Automated Performance Tests**: Establishes performance baselines and detects regressions
- **CI/CD Integration**: Can be run in CI to track performance over time
- **Scalability Tests**: Validates performance across different data sizes

**Running Performance Tests:**

```bash
# Run all performance tests
pytest tests/performance/test_performance_suite.py -v

# Run with performance markers
pytest -m performance -v
```
### Performance Profiling

A profiling script is available to identify bottlenecks in critical paths:

- `scripts/profile_performance.py`: Uses cProfile to profile critical operations
- Profiles metrics computation, bootstrap CI, pipeline operations, and intersectional analysis
- Identifies top functions by cumulative time

**Running Profiling:**

```bash
# Run profiling script
python scripts/profile_performance.py

# Save profile data for detailed analysis
python -m cProfile -o profile.stats scripts/profile_performance.py
python -m pstats profile.stats
```
### Running Benchmarks

```bash
# Run all benchmarks
python benchmarks/benchmark_metrics_100k.py
python benchmarks/benchmark_pipeline.py
python benchmarks/benchmark_bootstrap.py
```
### Typical Performance (Reference Hardware)

**Metrics Computation (100k samples):**

- Demographic Parity Difference: ~0.5-1.0 seconds
- Equalized Odds Difference: ~0.8-1.5 seconds
- MAE Parity Difference: ~0.6-1.2 seconds
- Intersectional analysis: ~2-4x slower (depends on the number of groups)

**Pipeline Operations:**

- Bias detection: ~0.5-2.0 seconds (10k samples)
- Pipeline transformation: ~0.2-1.0 seconds (10k samples)
- Full pipeline (detect + transform): ~1-3 seconds (10k samples)

**Bootstrap Confidence Intervals:**

- Percentile method (1000 samples): ~5-15 seconds
- BCa method (1000 samples): ~10-30 seconds
- Performance scales linearly with the number of bootstrap samples

**Note**: Actual performance depends on hardware, dataset characteristics, and Python version.
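The linear scaling with bootstrap sample count is easy to see in isolation. The sketch below is *not* the toolkit's implementation — it is a plain-NumPy percentile bootstrap for a sample mean, included only to illustrate why B resamples cost B times the work:

```python
import numpy as np

def percentile_bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: one O(n) resample per
    iteration, so total work is O(n * n_boot)."""
    rng = np.random.default_rng(seed)
    n = len(values)
    stats = np.array([
        rng.choice(values, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

values = np.random.default_rng(42).normal(loc=0.1, scale=1.0, size=5_000)
# Halving n_boot roughly halves runtime, at the cost of noisier interval endpoints
lo, hi = percentile_bootstrap_ci(values, n_boot=500)
```

This is also why reducing `ci_samples` is the single most effective speed knob for CI computation.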
## Performance Characteristics

### Computational Complexity

**Fairness Metrics:**

- Time complexity: O(n), where n is the number of samples
- Space complexity: O(n) for storing predictions and sensitive attributes
- Bootstrap CI: O(n × B), where B is the number of bootstrap samples
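To see why the point estimate is linear in n, here is a minimal illustrative version of a demographic parity difference (the toolkit's actual implementation may differ): each group's positive-prediction rate is a single pass over that group's rows.

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """Largest gap in positive-prediction rate between groups.
    One pass per group over its rows -> O(n) overall for a fixed
    number of groups."""
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

rng = np.random.default_rng(5)
n = 100_000
sensitive = rng.choice(["A", "B"], size=n)
# Group A receives positive predictions at rate 0.6, group B at 0.5
y_pred = (rng.random(n) < np.where(sensitive == "A", 0.6, 0.5)).astype(int)

dpd = demographic_parity_difference(y_pred, sensitive)  # close to the true 0.10 gap
```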
**Pipeline Operations:**

- Bias Detection: O(n × m), where m is the number of features
- Transformations: O(n × m) for most transformers
- Proxy Detection: O(m²) for correlation computation
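The O(m²) term comes from the pairwise correlation matrix. The sketch below is a simplified stand-in for proxy detection, not the toolkit's detector: it flags the feature most correlated with a sensitive attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 8
X = rng.normal(size=(n, m))
# Make feature 0 a strong proxy for the sensitive attribute
sensitive = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(float)

# The full correlation matrix has (m + 1)^2 entries: O(m^2) space and
# O(n * m^2) time, which is why proxy detection dominates for wide data.
corr = np.corrcoef(np.column_stack([X, sensitive]), rowvar=False)
proxy_scores = np.abs(corr[-1, :m])     # |correlation| with the sensitive attribute
suspect = int(np.argmax(proxy_scores))  # feature 0
```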
**Intersectional Analysis:**

- Time complexity: O(n × g), where g is the number of intersectional groups
- Space complexity: O(g) for storing group statistics
- Can be significantly slower when many groups are present
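The key point is that g multiplies with every attribute you intersect. A quick pandas sketch (column names are illustrative, not part of the toolkit's API) shows how fast the group count grows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "age_band": rng.choice(["<30", "30-50", ">50"], size=n),
    "region": rng.choice(["N", "S", "E", "W"], size=n),
})

# g multiplies with each attribute: 2 x 3 x 4 = up to 24 groups here,
# and every group needs its own statistics (and bootstrap, if CI is enabled)
group_sizes = df.groupby(["gender", "age_band", "region"]).size()
n_groups = len(group_sizes)
smallest = int(group_sizes.min())
```

Checking `group_sizes` before running intersectional analysis tells you both how much work to expect and whether some groups are too small to be reliable.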
### Bottlenecks

**Bootstrap Confidence Intervals**: The most computationally expensive operation.

- Use `ci_method="percentile"` for faster computation
- Reduce `ci_samples` for quicker results (at the cost of accuracy)
- Consider disabling CI for quick checks: `with_ci=False`

**Intersectional Analysis**: Slower due to the increased number of groups.

- Use single-attribute analysis when possible
- Filter to the most important intersectional groups if needed

**Large Datasets**: Memory and computation time increase linearly.

- Use batch processing for very large datasets
- Consider sampling for exploratory analysis
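When sampling for exploration, a stratified sample preserves group proportions so per-group rates (and hence fairness gaps) stay representative. A sketch with illustrative column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500_000
df = pd.DataFrame({
    "y_pred": rng.integers(0, 2, size=n),
    "gender": rng.choice(["F", "M"], size=n, p=[0.3, 0.7]),
})

# A stratified 2% sample keeps each group's share of the data intact
sample = df.groupby("gender", group_keys=False).sample(frac=0.02, random_state=0)

full_share = (df["gender"] == "F").mean()
sample_share = (sample["gender"] == "F").mean()  # matches full_share closely
```

A plain `df.sample(frac=0.02)` is usually fine too, but stratifying guards against under-sampling small groups.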
## Optimization Tips

### 1. Disable Confidence Intervals for Quick Checks

```python
# Fast check without CI
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=False,  # Skip bootstrap CI computation
)
```
### 2. Use Percentile Method for Bootstrap CI

```python
# Faster CI method
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True,
    ci_method="percentile",  # Faster than "bca"
    ci_samples=500,          # Fewer samples = faster
)
```
### 3. Reduce Minimum Group Size (When Appropriate)

```python
# Lower threshold for faster computation (use with caution)
analyzer = FairnessAnalyzer(min_group_size=20)  # Default is 30
```
### 4. Batch Processing for Large Datasets

```python
# Process in batches and collect the per-batch results
batch_size = 10_000
results = []
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    results.append(analyzer.demographic_parity_difference(
        y_pred=batch["y_pred"].to_numpy(),
        sensitive=batch["gender"].to_numpy(),
        with_ci=False,  # Disable CI for batch processing
    ))
```

Note that a per-batch metric is not identical to the metric computed on the full dataset; treat batch results as an approximation and aggregate them with care.
### 5. Cache Results When Possible

```python
# Cache metric results if computing multiple times.
# NumPy arrays are not hashable, so key the cache on hashable
# digests of the inputs rather than the arrays themselves.
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_metric_cached(y_pred_hash, sensitive_hash):
    # Look up the arrays for these digests, then compute the metric
    return analyzer.demographic_parity_difference(...)
```
### 6. Use Native Backend (Fastest)

```python
# The native backend is typically fastest
analyzer = FairnessAnalyzer(backend="native")
```
### 7. Parallel Processing (Advanced)

For very large datasets, consider parallel processing:

```python
from multiprocessing import Pool

def compute_metric_for_group(args):
    group_name, group_data = args
    return analyzer.demographic_parity_difference(
        y_pred=group_data["y_pred"],
        sensitive=group_data["sensitive"],
    )

# Process groups in parallel; note that the analyzer and the
# group data must be picklable for multiprocessing to work
with Pool() as pool:
    results = pool.map(compute_metric_for_group, group_data_list)
```
## Scalability Considerations

### Dataset Size Guidelines

| Dataset Size | Recommended Approach | Notes |
|---|---|---|
| < 10k samples | Full analysis with CI | Fast enough for interactive use |
| 10k - 100k samples | Full analysis, consider reducing CI samples | Good balance of speed and accuracy |
| 100k - 1M samples | Batch processing or sampling | Use batch processing to keep memory bounded |
| > 1M samples | Sampling or distributed processing | Consider using Spark/Dask for very large datasets |
### Group Size Considerations

- **Minimum Group Size**: Larger `min_group_size` values reduce computation time but may exclude important groups
- **Number of Groups**: More groups (especially in intersectional analysis) increase computation time
- **Group Imbalance**: Highly imbalanced groups may require more bootstrap samples for accurate CIs
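Before committing to a `min_group_size`, it is worth auditing how many groups (and how much data) the threshold would exclude. A small sketch with made-up data:

```python
import pandas as pd

# Illustrative data: group "C" is too small for stable statistics
sensitive = pd.Series(["A"] * 3_000 + ["B"] * 1_980 + ["C"] * 20)

min_group_size = 30  # matches the toolkit's stated default
counts = sensitive.value_counts()
too_small = sorted(counts[counts < min_group_size].index)       # ["C"]
excluded_frac = counts[counts < min_group_size].sum() / len(sensitive)
```

If `excluded_frac` is non-trivial or a policy-relevant group lands in `too_small`, lower the threshold (accepting noisier estimates) rather than silently dropping the group.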
### Memory Usage

**Typical Memory Footprint:**

- Base toolkit: ~50-100 MB
- Per 100k samples: ~10-20 MB (depending on data types)
- Bootstrap CI (1000 samples): ~50-100 MB of additional memory

**Memory Optimization:**

- Use `with_ci=False` to reduce memory usage
- Process data in batches for very large datasets
- Use appropriate data types (e.g., `int8` instead of `int64` when possible)
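The data-type suggestion is easy to quantify: binary predictions stored as `int64` use eight times the memory of `int8`, with no loss of information.

```python
import numpy as np

n = 100_000
y_pred64 = np.random.default_rng(4).integers(0, 2, size=n)  # int64 by default
y_pred8 = y_pred64.astype(np.int8)  # binary labels fit comfortably in int8

assert np.array_equal(y_pred64, y_pred8)  # no information lost
bytes_saved = y_pred64.nbytes - y_pred8.nbytes  # 700 KB per 100k-sample column
```

The same applies to sensitive attributes: categorical codes rarely need more than `int8` or `int16`.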
## Memory Usage

### Memory Profiling

```python
# Profile memory usage
import tracemalloc

tracemalloc.start()

result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True,
)

current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()
```
### Memory-Efficient Patterns

**Use Generators for Large Datasets:**

```python
def process_in_chunks(df, chunk_size=10_000):
    for i in range(0, len(df), chunk_size):
        yield df.iloc[i:i + chunk_size]
```

**Clear Intermediate Results:**

```python
result = analyzer.demographic_parity_difference(...)
# Process the result
del result  # Explicitly free memory if needed
```

**Use Sparse Data Structures (when applicable):**

```python
from scipy.sparse import csr_matrix
# Use sparse matrices for very sparse data
```
## CI/CD Integration

### Performance Regression Testing

The toolkit includes performance benchmarks and a test suite that can be integrated into CI/CD pipelines.

**Using the Performance Test Suite:**

```yaml
# .github/workflows/ci.yml
- name: Run performance tests
  run: |
    pytest tests/performance/test_performance_suite.py -v
    # Tests will fail if performance degrades beyond baselines
```

**Using the Benchmark Scripts:**

```yaml
# .github/workflows/ci.yml
- name: Run performance benchmarks
  run: |
    python benchmarks/benchmark_metrics_100k.py > benchmark_metrics.txt
    python benchmarks/benchmark_pipeline.py > benchmark_pipeline.txt
    # Check for performance regressions
    python -c "
    import re
    with open('benchmark_metrics.txt') as f:
        content = f.read()
    # Extract timing information and check thresholds
    # Fail if performance degrades significantly
    "
```
### Performance Monitoring

Track performance over time:

```python
# Log performance metrics
import time
import json

start = time.time()
result = analyzer.demographic_parity_difference(...)
duration = time.time() - start

performance_log = {
    "metric": "demographic_parity_difference",
    "duration": duration,
    "n_samples": len(y_pred),
    "timestamp": time.time(),
}

# Save to a file or send to a monitoring system
with open("performance_log.json", "a") as f:
    f.write(json.dumps(performance_log) + "\n")
```
### Benchmark Baselines

Establish performance baselines for your use case:

```bash
# Run benchmarks and save the results
python benchmarks/benchmark_metrics_100k.py > baseline_metrics.txt
python benchmarks/benchmark_pipeline.py > baseline_pipeline.txt

# Compare against baselines in CI
python -c "
# Load baseline and current results
# Compare and fail if performance degrades > 20%
"
```
## Best Practices

1. **Start with Quick Checks**: Use `with_ci=False` for initial exploration
2. **Enable CI for Production**: Always use confidence intervals for production validation
3. **Profile Before Optimizing**: Use profiling tools to identify actual bottlenecks
4. **Monitor Performance**: Track performance metrics over time
5. **Set Appropriate Thresholds**: Balance accuracy (more CI samples) against speed
6. **Use an Appropriate Backend**: The native backend is typically fastest unless you need adapter features
## Troubleshooting Performance Issues

### Slow Metric Computation

**Symptoms**: Metrics take > 10 seconds for 100k samples

**Solutions:**

- Check whether CI is enabled (disable it for quick checks)
- Reduce `ci_samples` if CI is needed
- Verify you're using the native backend
- Check for memory pressure (swap usage)
### High Memory Usage

**Symptoms**: Memory usage > 500 MB for 100k samples

**Solutions:**

- Process data in batches
- Disable CI if not needed
- Check for memory leaks in custom code
- Use appropriate data types
### Slow Pipeline Operations

**Symptoms**: Pipeline operations take > 5 seconds for 10k samples

**Solutions:**

- Check the number of pipeline steps
- Verify transformer implementations are efficient
- Consider caching transformer fits
- Profile individual steps to identify bottlenecks
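A lightweight way to profile individual steps, without reaching for cProfile, is a timing context manager. This is a generic sketch (the `time.sleep` calls stand in for real pipeline steps):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, name):
    """Record the wall-clock duration of one pipeline step under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

timings = {}
with timed(timings, "detect"):
    time.sleep(0.01)  # stand-in for bias detection
with timed(timings, "transform"):
    time.sleep(0.02)  # stand-in for pipeline transformation

slowest = max(timings, key=timings.get)  # "transform"
```

Wrap each real pipeline step in `timed(...)` and optimize the step that dominates `timings` first.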
## Additional Resources

- **Benchmark Suite**: See `benchmarks/README.md` for detailed benchmark documentation
- **API Reference**: See `docs/api.md` for complete API documentation
- **Integration Guide**: See `docs/integration_guide.md` for integration examples
## Performance Comparison

### Backend Performance

| Backend | Speed | Features | Dependencies |
|---|---|---|---|
| Native | Fastest | Core metrics | None (always available) |
| Fairlearn | Medium | Additional metrics | `fairlearn` |
| Aequitas | Slower | Comprehensive reports | `aequitas` |

**Recommendation**: Use the native backend unless you need specific adapter features.
### CI Method Performance

| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Percentile | Fast | Good | General use |
| BCa | Slower | Better | When accuracy is critical |

**Recommendation**: Use the percentile method for most cases and BCa when accuracy is critical.
For questions or performance issues, see the Integration Guide or open an issue on GitHub.