# Performance Guide

This document provides performance benchmarks, optimization tips, and best practices for using the Fairness Pipeline Development Toolkit efficiently.

---

## Table of Contents

- [Performance Benchmarks](#performance-benchmarks)
- [Performance Characteristics](#performance-characteristics)
- [Optimization Tips](#optimization-tips)
- [Scalability Considerations](#scalability-considerations)
- [Memory Usage](#memory-usage)
- [CI/CD Integration](#cicd-integration)
- [Best Practices](#best-practices)
- [Troubleshooting Performance Issues](#troubleshooting-performance-issues)
- [Additional Resources](#additional-resources)
- [Performance Comparison](#performance-comparison)

---

## Performance Benchmarks

### Benchmark Suite

The toolkit includes a comprehensive benchmark suite located in the `benchmarks/` directory:

- **`benchmark_metrics_100k.py`**: Benchmarks fairness metrics computation on 100k samples
- **`benchmark_pipeline.py`**: Benchmarks pipeline operations across different dataset sizes
- **`benchmark_bootstrap.py`**: Benchmarks bootstrap confidence interval computation

### Performance Test Suite

A pytest-based performance test suite is available in `tests/performance/test_performance_suite.py`:

- **Automated Performance Tests**: Establishes performance baselines and detects regressions
- **CI/CD Integration**: Can be run in CI to track performance over time
- **Scalability Tests**: Validates performance across different data sizes

**Running Performance Tests:**

```bash
# Run all performance tests
pytest tests/performance/test_performance_suite.py -v

# Run with performance markers
pytest -m performance -v
```

### Performance Profiling

A profiling script is available to identify bottlenecks in critical paths:

- **`scripts/profile_performance.py`**: Uses cProfile to profile critical operations
- Profiles metrics computation, bootstrap CI, pipeline operations, and intersectional analysis
- Identifies top functions by cumulative time

**Running Profiling:**

```bash
# Run profiling script
python scripts/profile_performance.py

# Save profile data for detailed analysis
python -m cProfile -o profile.stats scripts/profile_performance.py
python -m pstats \
  profile.stats
```

### Running Benchmarks

```bash
# Run all benchmarks
python benchmarks/benchmark_metrics_100k.py
python benchmarks/benchmark_pipeline.py
python benchmarks/benchmark_bootstrap.py
```

### Typical Performance (Reference Hardware)

**Metrics Computation (100k samples):**

- Demographic Parity Difference: ~0.5-1.0 seconds
- Equalized Odds Difference: ~0.8-1.5 seconds
- MAE Parity Difference: ~0.6-1.2 seconds
- Intersectional analysis: ~2-4x slower (depends on the number of groups)

**Pipeline Operations:**

- Bias detection: ~0.5-2.0 seconds (10k samples)
- Pipeline transformation: ~0.2-1.0 seconds (10k samples)
- Full pipeline (detect + transform): ~1-3 seconds (10k samples)

**Bootstrap Confidence Intervals:**

- Percentile method (1000 samples): ~5-15 seconds
- BCa method (1000 samples): ~10-30 seconds
- Performance scales linearly with the number of bootstrap samples

*Note: Actual performance depends on hardware, dataset characteristics, and Python version.*

---

## Performance Characteristics

### Computational Complexity

**Fairness Metrics:**

- **Time Complexity**: O(n), where n is the number of samples
- **Space Complexity**: O(n) for storing predictions and sensitive attributes
- **Bootstrap CI**: O(n × B), where B is the number of bootstrap samples

**Pipeline Operations:**

- **Bias Detection**: O(n × m), where m is the number of features
- **Transformations**: O(n × m) for most transformers
- **Proxy Detection**: O(m²) for correlation computation

**Intersectional Analysis:**

- **Time Complexity**: O(n × g), where g is the number of intersectional groups
- **Space Complexity**: O(g) for storing group statistics
- Can be significantly slower when many groups are present

### Bottlenecks

1. **Bootstrap Confidence Intervals**: The most computationally expensive operation
   - Use `ci_method="percentile"` for faster computation
   - Reduce `ci_samples` for quicker results (at the cost of accuracy)
   - Consider disabling CI for quick checks: `with_ci=False`
2. **Intersectional Analysis**: Slower due to the increased number of groups
   - Use single-attribute analysis when possible
   - Filter to the most important intersectional groups if needed
3. **Large Datasets**: Memory and computation time increase linearly
   - Use batch processing for very large datasets
   - Consider sampling for exploratory analysis

---

## Optimization Tips

### 1. Disable Confidence Intervals for Quick Checks

```python
# Fast check without CI
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=False  # Skip bootstrap CI computation
)
```

### 2. Use Percentile Method for Bootstrap CI

```python
# Faster CI method
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True,
    ci_method="percentile",  # Faster than "bca"
    ci_samples=500  # Fewer samples = faster
)
```

### 3. Reduce Minimum Group Size (When Appropriate)

```python
# Lower threshold for faster computation (use with caution)
analyzer = FairnessAnalyzer(min_group_size=20)  # Default is 30
```

### 4. Batch Processing for Large Datasets

```python
# Process in batches
batch_size = 10_000
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    result = analyzer.demographic_parity_difference(
        y_pred=batch["y_pred"].to_numpy(),
        sensitive=batch["gender"].to_numpy(),
        with_ci=False  # Disable CI for batch processing
    )
```

### 5. Cache Results When Possible

```python
# Cache metric results if computing multiple times
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_metric_cached(y_pred_hash, sensitive_hash):
    # NumPy arrays are not hashable, so pass hashable keys
    # (e.g. y_pred.tobytes()) and look the arrays up inside
    # Compute metric
    return analyzer.demographic_parity_difference(...)
```

### 6. Use Native Backend (Fastest)

```python
# Native backend is typically fastest
analyzer = FairnessAnalyzer(backend="native")
```
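Tips 1 and 2 above both target the bootstrap, and the cost model is easy to see in a standard-library sketch. The following is not the toolkit's implementation — `rate_diff` and `percentile_ci` are illustrative stand-ins — but it shows why runtime grows linearly with the number of resamples:

```python
import random

def rate_diff(y_pred, sensitive):
    """Difference in positive-prediction rates between two groups (O(n))."""
    groups = {}
    for yp, s in zip(y_pred, sensitive):
        groups.setdefault(s, []).append(yp)
    a, b = sorted(groups)  # assumes exactly two groups, for simplicity
    return sum(groups[a]) / len(groups[a]) - sum(groups[b]) / len(groups[b])

def percentile_ci(y_pred, sensitive, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI: n_boot resamples of size n -> O(n * n_boot)."""
    rng = random.Random(seed)
    n = len(y_pred)
    stats = []
    for _ in range(n_boot):  # each extra resample adds O(n) work
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(rate_diff([y_pred[i] for i in idx],
                               [sensitive[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Group "a" has a 60% positive rate, group "b" 30%: point estimate is 0.3
y_pred = [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70
sensitive = ["a"] * 100 + ["b"] * 100
point = rate_diff(y_pred, sensitive)
lo, hi = percentile_ci(y_pred, sensitive, n_boot=500)
```

Halving `n_boot` halves the bootstrap cost, which is why reducing `ci_samples` — or skipping the interval entirely with `with_ci=False` — is usually the largest single speedup.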
### 7. Parallel Processing (Advanced)

For very large datasets, consider parallel processing:

```python
from multiprocessing import Pool

def compute_metric_for_group(args):
    group_name, group_data = args
    return analyzer.demographic_parity_difference(
        y_pred=group_data["y_pred"],
        sensitive=group_data["sensitive"]
    )

# Process groups in parallel
with Pool() as pool:
    results = pool.map(compute_metric_for_group, group_data_list)
```

---

## Scalability Considerations

### Dataset Size Guidelines

| Dataset Size | Recommended Approach | Notes |
|--------------|----------------------|-------|
| < 10k samples | Full analysis with CI | Fast enough for interactive use |
| 10k - 100k samples | Full analysis; consider reducing CI samples | Good balance of speed and accuracy |
| 100k - 1M samples | Batch processing or sampling | Use `with_ci=False` for quick checks |
| > 1M samples | Sampling or distributed processing | Consider using Spark/Dask for very large datasets |

### Group Size Considerations

- **Minimum Group Size**: Larger `min_group_size` values reduce computation time but may exclude important groups
- **Number of Groups**: More groups (especially in intersectional analysis) increase computation time
- **Group Imbalance**: Highly imbalanced groups may require more bootstrap samples for accurate CI

### Memory Usage

**Typical Memory Footprint:**

- Base toolkit: ~50-100 MB
- Per 100k samples: ~10-20 MB (depending on data types)
- Bootstrap CI (1000 samples): ~50-100 MB additional memory

**Memory Optimization:**

- Use `with_ci=False` to reduce memory usage
- Process data in batches for very large datasets
- Use appropriate data types (e.g., `int8` instead of `int64` when possible)

---

## Memory Usage

### Memory Profiling

```python
# Profile memory usage
import tracemalloc

tracemalloc.start()

result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True
)

current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1024 / 1024:.2f} "
      f"MB")
tracemalloc.stop()
```

### Memory-Efficient Patterns

1. **Use Generators for Large Datasets**:

   ```python
   def process_in_chunks(df, chunk_size=10000):
       for i in range(0, len(df), chunk_size):
           yield df.iloc[i:i + chunk_size]
   ```

2. **Clear Intermediate Results**:

   ```python
   result = analyzer.demographic_parity_difference(...)
   # Process result
   del result  # Explicitly free memory if needed
   ```

3. **Use Sparse Data Structures** (when applicable):

   ```python
   from scipy.sparse import csr_matrix
   # Use sparse matrices for very sparse data
   ```

---

## CI/CD Integration

### Performance Regression Testing

The toolkit includes performance benchmarks and a test suite that can be integrated into CI/CD pipelines:

**Using the Performance Test Suite:**

```yaml
# .github/workflows/ci.yml
- name: Run performance tests
  run: |
    pytest tests/performance/test_performance_suite.py -v
    # Tests will fail if performance degrades beyond baselines
```

**Using Benchmark Scripts:**

```yaml
# .github/workflows/ci.yml
- name: Run performance benchmarks
  run: |
    python benchmarks/benchmark_metrics_100k.py > benchmark_metrics.txt
    python benchmarks/benchmark_pipeline.py > benchmark_pipeline.txt
    # Check for performance regressions
    python -c "
    import re
    with open('benchmark_metrics.txt') as f:
        content = f.read()
    # Extract timing information and check thresholds
    # Fail if performance degrades significantly
    "
```

### Performance Monitoring

Track performance over time:

```python
# Log performance metrics
import time
import json

start = time.time()
result = analyzer.demographic_parity_difference(...)
duration = time.time() - start

performance_log = {
    "metric": "demographic_parity_difference",
    "duration": duration,
    "n_samples": len(y_pred),
    "timestamp": time.time()
}

# Save to file or send to monitoring system
with open("performance_log.json", "a") as f:
    f.write(json.dumps(performance_log) + "\n")
```

### Benchmark Baselines

Establish performance baselines for your use case:

```bash
# Run benchmarks and save results
python benchmarks/benchmark_metrics_100k.py > baseline_metrics.txt
python benchmarks/benchmark_pipeline.py > baseline_pipeline.txt

# Compare against baselines in CI
python -c "
# Load baseline and current results
# Compare and fail if performance degrades > 20%
"
```

---

## Best Practices

1. **Start with Quick Checks**: Use `with_ci=False` for initial exploration
2. **Enable CI for Production**: Always use confidence intervals for production validation
3. **Profile Before Optimizing**: Use profiling tools to identify actual bottlenecks
4. **Monitor Performance**: Track performance metrics over time
5. **Set Appropriate Thresholds**: Balance accuracy (more CI samples) against speed
6. **Use Appropriate Backend**: The native backend is typically fastest unless you need adapter features

---

## Troubleshooting Performance Issues

### Slow Metric Computation

**Symptoms**: Metrics take > 10 seconds for 100k samples

**Solutions**:

- Check whether CI is enabled (disable it for quick checks)
- Reduce `ci_samples` if CI is needed
- Verify you're using the native backend
- Check for memory pressure (swap usage)

### High Memory Usage

**Symptoms**: Memory usage > 500 MB for 100k samples

**Solutions**:

- Process data in batches
- Disable CI if not needed
- Check for memory leaks in custom code
- Use appropriate data types

### Slow Pipeline Operations

**Symptoms**: Pipeline operations take > 5 seconds for 10k samples

**Solutions**:

- Check the number of pipeline steps
- Verify transformer implementations are efficient
- Consider caching transformer fits
- Profile individual steps to identify bottlenecks

---

## Additional Resources

- **Benchmark Suite**: See `benchmarks/README.md` for detailed benchmark documentation
- **API Reference**: See `docs/api.md` for complete API documentation
- **Integration Guide**: See `docs/integration_guide.md` for integration examples

---

## Performance Comparison

### Backend Performance

| Backend | Speed | Features | Dependencies |
|---------|-------|----------|--------------|
| Native | Fastest | Core metrics | None (always available) |
| Fairlearn | Medium | Additional metrics | fairlearn |
| Aequitas | Slower | Comprehensive reports | aequitas |

*Recommendation: Use the native backend unless you need specific adapter features.*

### CI Method Performance

| Method | Speed | Accuracy | Use Case |
|--------|-------|----------|----------|
| Percentile | Fast | Good | General use |
| BCa | Slower | Better | When accuracy is critical |

*Recommendation: Use percentile for most cases, BCa when accuracy is critical.*

---

For questions or performance issues, see the [Integration Guide](integration_guide.md) or open an issue on GitHub.
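As a closing worked example, the "fail if performance degrades > 20%" check sketched in the CI/CD section can be made concrete in a few lines of standard-library Python. This is an illustrative sketch, not part of the toolkit; the metric names and the 20% tolerance are assumptions:

```python
def check_regressions(baseline, current, tolerance=0.20):
    """Return the metrics whose duration regressed beyond `tolerance` (0.20 = 20%)."""
    return [
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) > base * (1 + tolerance)
    ]

# Durations in seconds, as might be parsed from the benchmark output files
baseline = {"demographic_parity": 0.80, "equalized_odds": 1.20}
current = {"demographic_parity": 0.85, "equalized_odds": 1.60}

failed = check_regressions(baseline, current)
# In CI, fail the build when anything regressed:
# if failed: raise SystemExit(f"Performance regression in: {failed}")
```

Here `demographic_parity` is within tolerance (0.85 ≤ 0.80 × 1.2) while `equalized_odds` is flagged (1.60 > 1.20 × 1.2).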