Performance Guide

This document provides performance benchmarks, optimization tips, and best practices for using the Fairness Pipeline Development Toolkit efficiently.


Table of Contents

  • Performance Benchmarks

  • Performance Characteristics

  • Optimization Tips

  • Scalability Considerations

  • Memory Usage

  • CI/CD Integration

  • Best Practices

  • Troubleshooting Performance Issues

  • Additional Resources

  • Performance Comparison

Performance Benchmarks

Benchmark Suite

The toolkit includes a comprehensive benchmark suite located in the benchmarks/ directory:

  • benchmark_metrics_100k.py: Benchmarks fairness metrics computation on 100k samples

  • benchmark_pipeline.py: Benchmarks pipeline operations across different dataset sizes

  • benchmark_bootstrap.py: Benchmarks bootstrap confidence interval computation

Performance Test Suite

A pytest-based performance test suite is available in tests/performance/test_performance_suite.py:

  • Automated Performance Tests: Establishes performance baselines and detects regressions

  • CI/CD Integration: Can be run in CI to track performance over time

  • Scalability Tests: Validates performance across different data sizes

Running Performance Tests:

# Run all performance tests
pytest tests/performance/test_performance_suite.py -v

# Run with performance markers
pytest -m performance -v

Performance Profiling

A profiling script is available to identify bottlenecks in critical paths:

  • scripts/profile_performance.py: Uses cProfile to profile critical operations

  • Profiles metrics computation, bootstrap CI, pipeline operations, and intersectional analysis

  • Identifies top functions by cumulative time

Running Profiling:

# Run profiling script
python scripts/profile_performance.py

# Save profile data for detailed analysis
python -m cProfile -o profile.stats scripts/profile_performance.py
python -m pstats profile.stats
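As an alternative to the interactive pstats browser, the saved statistics can be inspected programmatically. A minimal sketch, profiling a stand-in workload rather than an actual toolkit call:

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(100_000))  # stand-in for a toolkit operation
profiler.disable()

# Print the top 5 functions by cumulative time into a string buffer
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same `sort_stats("cumulative")` call works on a saved `profile.stats` file via `pstats.Stats("profile.stats")`.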

Running Benchmarks

# Run all benchmarks
python benchmarks/benchmark_metrics_100k.py
python benchmarks/benchmark_pipeline.py
python benchmarks/benchmark_bootstrap.py

Typical Performance (Reference Hardware)

Metrics Computation (100k samples):

  • Demographic Parity Difference: ~0.5-1.0 seconds

  • Equalized Odds Difference: ~0.8-1.5 seconds

  • MAE Parity Difference: ~0.6-1.2 seconds

  • Intersectional analysis: ~2-4x slower (depends on number of groups)

Pipeline Operations:

  • Bias detection: ~0.5-2.0 seconds (10k samples)

  • Pipeline transformation: ~0.2-1.0 seconds (10k samples)

  • Full pipeline (detect + transform): ~1-3 seconds (10k samples)

Bootstrap Confidence Intervals:

  • Percentile method (1000 samples): ~5-15 seconds

  • BCa method (1000 samples): ~10-30 seconds

  • Performance scales linearly with number of bootstrap samples

Note: Actual performance depends on hardware, dataset characteristics, and Python version.


Performance Characteristics

Computational Complexity

Fairness Metrics:

  • Time Complexity: O(n) where n is the number of samples

  • Space Complexity: O(n) for storing predictions and sensitive attributes

  • Bootstrap CI: O(n × B) where B is the number of bootstrap samples
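The O(n × B) cost of the bootstrap is visible in a minimal sketch of the percentile method. This is illustrative, not the toolkit's implementation; `metric` is any callable computing a fairness statistic:

```python
import numpy as np

def percentile_ci(y_pred, sensitive, metric, n_boot=1000, alpha=0.05, seed=0):
    # Each of the B = n_boot resamples touches all n samples: O(n * B)
    rng = np.random.default_rng(seed)
    n = len(y_pred)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        stats[b] = metric(y_pred[idx], sensitive[idx])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```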

Pipeline Operations:

  • Bias Detection: O(n × m) where m is the number of features

  • Transformations: O(n × m) for most transformers

  • Proxy Detection: O(m²) for correlation computation

Intersectional Analysis:

  • Time Complexity: O(n × g) where g is the number of intersectional groups

  • Space Complexity: O(g) for storing group statistics

  • Can be significantly slower when many groups are present
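The group count g is bounded by the product of the attribute cardinalities, which is why each added attribute inflates runtime quickly. A small illustration (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "gender":   ["M", "F", "F", "M", "F", "M"],
    "region":   ["N", "S", "N", "S", "N", "S"],
    "age_band": ["<30", "30-50", ">50", "<30", ">50", "30-50"],
})

# Intersectional groups actually observed in the data
n_groups = df.groupby(["gender", "region", "age_band"]).ngroups
# Worst case: product of cardinalities (2 * 2 * 3 = 12 here)
upper_bound = df["gender"].nunique() * df["region"].nunique() * df["age_band"].nunique()
print(n_groups, upper_bound)
```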

Bottlenecks

  1. Bootstrap Confidence Intervals: The most computationally expensive operation

    • Use ci_method="percentile" for faster computation

    • Reduce ci_samples for quicker results (at cost of accuracy)

    • Consider disabling CI for quick checks: with_ci=False

  2. Intersectional Analysis: Slower due to increased number of groups

    • Use single-attribute analysis when possible

    • Filter to most important intersectional groups if needed

  3. Large Datasets: Memory and computation time increase linearly

    • Use batch processing for very large datasets

    • Consider sampling for exploratory analysis
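For exploratory sampling, stratifying on the sensitive attribute keeps small groups represented. A hedged sketch using pandas, where `by` is the sensitive column:

```python
import pandas as pd

def stratified_sample(df, by, frac=0.1, seed=42):
    # Draw the same fraction from each sensitive group so that small
    # groups survive the subsampling
    return df.groupby(by).sample(frac=frac, random_state=seed)
```

`DataFrameGroupBy.sample` requires pandas >= 1.1.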


Optimization Tips

1. Disable Confidence Intervals for Quick Checks

# Fast check without CI
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=False  # Skip bootstrap CI computation
)

2. Use Percentile Method for Bootstrap CI

# Faster CI method
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True,
    ci_method="percentile",  # Faster than "bca"
    ci_samples=500  # Fewer samples = faster
)

3. Reduce Minimum Group Size (When Appropriate)

# Lower threshold for faster computation (use with caution)
analyzer = FairnessAnalyzer(min_group_size=20)  # Default is 30

4. Batch Processing for Large Datasets

# Process in batches
batch_size = 10_000
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    result = analyzer.demographic_parity_difference(
        y_pred=batch["y_pred"].to_numpy(),
        sensitive=batch["gender"].to_numpy(),
        with_ci=False  # Disable CI for batch processing
    )

5. Cache Results When Possible

# Cache metric results if computing multiple times
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_metric_cached(y_pred_bytes, sensitive_bytes):
    # Arrays are not hashable, so key the cache on their raw bytes
    y_pred = np.frombuffer(y_pred_bytes, dtype=np.int64)
    sensitive = np.frombuffer(sensitive_bytes, dtype=np.int64)
    return analyzer.demographic_parity_difference(y_pred=y_pred, sensitive=sensitive)

# Call with hashable byte representations
result = compute_metric_cached(y_pred.tobytes(), sensitive.tobytes())

6. Use Native Backend (Fastest)

# Native backend is typically fastest
analyzer = FairnessAnalyzer(backend="native")

7. Parallel Processing (Advanced)

For very large datasets, consider parallel processing:

from multiprocessing import Pool

def compute_metric_for_group(args):
    # args is a (group_name, group_data) pair; group_data holds the
    # predictions and sensitive attribute for that group
    group_name, group_data = args
    result = analyzer.demographic_parity_difference(
        y_pred=group_data["y_pred"],
        sensitive=group_data["sensitive"]
    )
    return group_name, result

# group_data_list: list of (group_name, group_data) pairs prepared upstream
with Pool() as pool:
    results = pool.map(compute_metric_for_group, group_data_list)

Scalability Considerations

Dataset Size Guidelines

| Dataset Size | Recommended Approach | Notes |
|---|---|---|
| < 10k samples | Full analysis with CI | Fast enough for interactive use |
| 10k - 100k samples | Full analysis; consider reducing CI samples | Good balance of speed and accuracy |
| 100k - 1M samples | Batch processing or sampling | Use with_ci=False for quick checks |
| > 1M samples | Sampling or distributed processing | Consider using Spark/Dask for very large datasets |

Group Size Considerations

  • Minimum Group Size: Larger min_group_size values reduce computation time but may exclude important groups

  • Number of Groups: More groups (especially in intersectional analysis) increase computation time

  • Group Imbalance: Highly imbalanced groups may require more bootstrap samples for accurate CI
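A quick pre-check of group sizes shows which groups would fall under min_group_size before any metrics are computed. A sketch; the helper name is hypothetical:

```python
import pandas as pd

def group_size_report(sensitive, min_group_size=30):
    # Count members per group and flag groups below the threshold
    counts = pd.Series(sensitive).value_counts()
    return counts, counts[counts < min_group_size]

counts, too_small = group_size_report(["A"] * 100 + ["B"] * 40 + ["C"] * 10)
print(too_small.index.tolist())  # ['C'] is below the default threshold of 30
```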

Memory Usage

Typical Memory Footprint:

  • Base toolkit: ~50-100 MB

  • Per 100k samples: ~10-20 MB (depending on data types)

  • Bootstrap CI (1000 samples): ~50-100 MB additional memory

Memory Optimization:

  • Use with_ci=False to reduce memory usage

  • Process data in batches for very large datasets

  • Use appropriate data types (e.g., int8 instead of int64 when possible)
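The data-type point is easy to quantify: binary predictions fit in int8, an 8x saving over the int64 default:

```python
import numpy as np

y64 = np.zeros(100_000, dtype=np.int64)  # default integer width on most platforms
y8 = y64.astype(np.int8)                 # sufficient for 0/1 predictions
print(y64.nbytes, "bytes vs", y8.nbytes, "bytes")  # 800000 bytes vs 100000 bytes
```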


Memory Usage

Memory Profiling

# Profile memory usage
import tracemalloc

tracemalloc.start()
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True
)
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()

Memory-Efficient Patterns

  1. Use Generators for Large Datasets:

    def process_in_chunks(df, chunk_size=10000):
        for i in range(0, len(df), chunk_size):
            yield df.iloc[i:i + chunk_size]
    
  2. Clear Intermediate Results:

    result = analyzer.demographic_parity_difference(...)
    # Process result
    del result  # Explicitly free memory if needed
    
  3. Use Sparse Data Structures (when applicable):

    from scipy.sparse import csr_matrix
    # Use sparse matrices for very sparse data
    

CI/CD Integration

Performance Regression Testing

The toolkit includes a benchmark suite and a performance test suite that can be integrated into CI/CD pipelines:

Using Performance Test Suite:

# .github/workflows/ci.yml
- name: Run performance tests
  run: |
    pytest tests/performance/test_performance_suite.py -v
    # Tests will fail if performance degrades beyond baselines

Using Benchmark Scripts:

# .github/workflows/ci.yml
- name: Run performance benchmarks
  run: |
    python benchmarks/benchmark_metrics_100k.py > benchmark_metrics.txt
    python benchmarks/benchmark_pipeline.py > benchmark_pipeline.txt
    
    # Check for performance regressions
python -c "
with open('benchmark_metrics.txt') as f:
    content = f.read()
# Extract timing information and check thresholds
# Fail (exit non-zero) if performance degrades significantly
"

Performance Monitoring

Track performance over time:

# Log performance metrics
import time
import json

start = time.time()
result = analyzer.demographic_parity_difference(...)
duration = time.time() - start

performance_log = {
    "metric": "demographic_parity_difference",
    "duration": duration,
    "n_samples": len(y_pred),
    "timestamp": time.time()
}

# Save to file or send to monitoring system
with open("performance_log.json", "a") as f:
    f.write(json.dumps(performance_log) + "\n")
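The JSON-lines log written above can later be summarized to watch for drift; a minimal sketch (the helper name is hypothetical):

```python
import json

def mean_duration(path, metric):
    # Average the recorded durations for one metric in the JSON-lines log
    durations = []
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["metric"] == metric:
                durations.append(entry["duration"])
    return sum(durations) / len(durations) if durations else None
```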

Benchmark Baselines

Establish performance baselines for your use case:

# Run benchmarks and save results
python benchmarks/benchmark_metrics_100k.py > baseline_metrics.txt
python benchmarks/benchmark_pipeline.py > baseline_pipeline.txt

# Compare against baselines in CI
python -c "
# Load baseline and current results
# Compare and fail if performance degrades > 20%
"
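Once the baseline and current timings are parsed into name-to-seconds mappings, the comparison itself is only a few lines. A sketch with a 20% tolerance, matching the threshold mentioned above:

```python
def check_regression(baseline, current, tolerance=0.20):
    # Return the metrics whose current timing exceeds the baseline by
    # more than the allowed tolerance
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) > base * (1 + tolerance)
    ]
```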

Best Practices

  1. Start with Quick Checks: Use with_ci=False for initial exploration

  2. Enable CI for Production: Always use confidence intervals for production validation

  3. Profile Before Optimizing: Use profiling tools to identify actual bottlenecks

  4. Monitor Performance: Track performance metrics over time

  5. Set Appropriate Thresholds: Balance accuracy (more CI samples) vs speed

  6. Use Appropriate Backend: Native backend is typically fastest unless you need adapter features


Troubleshooting Performance Issues

Slow Metric Computation

Symptoms: Metrics take > 10 seconds for 100k samples

Solutions:

  • Check if CI is enabled (disable for quick checks)

  • Reduce ci_samples if CI is needed

  • Verify you're using the native backend (backend="native")

  • Check for memory pressure (swap usage)

High Memory Usage

Symptoms: Memory usage > 500 MB for 100k samples

Solutions:

  • Process data in batches

  • Disable CI if not needed

  • Check for memory leaks in custom code

  • Use appropriate data types

Slow Pipeline Operations

Symptoms: Pipeline operations take > 5 seconds for 10k samples

Solutions:

  • Check number of pipeline steps

  • Verify transformer implementations are efficient

  • Consider caching transformer fits

  • Profile individual steps to identify bottlenecks


Additional Resources

  • Benchmark Suite: See benchmarks/README.md for detailed benchmark documentation

  • API Reference: See docs/api.md for complete API documentation

  • Integration Guide: See docs/integration_guide.md for integration examples


Performance Comparison

Backend Performance

| Backend | Speed | Features | Dependencies |
|---|---|---|---|
| Native | Fastest | Core metrics | None (always available) |
| Fairlearn | Medium | Additional metrics | fairlearn |
| Aequitas | Slower | Comprehensive reports | aequitas |

Recommendation: Use native backend unless you need specific adapter features.

CI Method Performance

| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Percentile | Fast | Good | General use |
| BCa | Slower | Better | When accuracy is critical |

Recommendation: Use percentile for most cases, BCa when accuracy is critical.


For questions or performance issues, see the Integration Guide or open an issue on GitHub.