Performance Guide

This document provides performance benchmarks, optimization tips, and best practices for using the Fairness Pipeline Development Toolkit efficiently.


Table of Contents

  • Performance Benchmarks

  • Performance Characteristics

  • Optimization Tips

  • Scalability Considerations

  • Memory Usage

  • CI/CD Integration

  • Best Practices

  • Troubleshooting Performance Issues

  • Additional Resources

  • Performance Comparison

Performance Benchmarks

Benchmark Suite

The toolkit includes a comprehensive benchmark suite located in the benchmarks/ directory:

  • benchmark_metrics_100k.py: Benchmarks fairness metrics computation on 100k samples

  • benchmark_pipeline.py: Benchmarks pipeline operations across different dataset sizes

  • benchmark_bootstrap.py: Benchmarks bootstrap confidence interval computation

Performance Test Suite

A pytest-based performance test suite is available in tests/performance/test_performance_suite.py:

  • Automated Performance Tests: Establishes performance baselines and detects regressions

  • CI/CD Integration: Can be run in CI to track performance over time

  • Scalability Tests: Validates performance across different data sizes

Running Performance Tests:

# Run all performance tests
pytest tests/performance/test_performance_suite.py -v

# Run with performance markers
pytest -m performance -v

Performance Profiling

A profiling script is available to identify bottlenecks in critical paths:

  • scripts/profile_performance.py: Uses cProfile to profile critical operations

  • Profiles metrics computation, bootstrap CI, pipeline operations, and intersectional analysis

  • Identifies top functions by cumulative time

Running Profiling:

# Run profiling script
python scripts/profile_performance.py

# Save profile data for detailed analysis
python -m cProfile -o profile.stats scripts/profile_performance.py
python -m pstats profile.stats
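As an alternative to the interactive pstats browser, the saved statistics can be inspected programmatically. A minimal sketch, profiling a stand-in workload rather than an actual toolkit call:

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(100_000))  # stand-in for a toolkit operation
profiler.disable()

# Print the top 5 functions by cumulative time into a string buffer
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same `sort_stats("cumulative")` call works on a saved `profile.stats` file via `pstats.Stats("profile.stats")`.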

Running Benchmarks

# Run all benchmarks
python benchmarks/benchmark_metrics_100k.py
python benchmarks/benchmark_pipeline.py
python benchmarks/benchmark_bootstrap.py

Typical Performance (Reference Hardware)

Metrics Computation (100k samples):

  • Demographic Parity Difference: ~0.5-1.0 seconds

  • Equalized Odds Difference: ~0.8-1.5 seconds

  • MAE Parity Difference: ~0.6-1.2 seconds

  • Intersectional analysis: ~2-4x slower (depends on number of groups)

Pipeline Operations:

  • Bias detection: ~0.5-2.0 seconds (10k samples)

  • Pipeline transformation: ~0.2-1.0 seconds (10k samples)

  • Full pipeline (detect + transform): ~1-3 seconds (10k samples)

Bootstrap Confidence Intervals:

  • Percentile method (1000 samples): ~5-15 seconds

  • BCa method (1000 samples): ~10-30 seconds

  • Performance scales linearly with number of bootstrap samples

Note: Actual performance depends on hardware, dataset characteristics, and Python version.


Performance Characteristics

Computational Complexity

Fairness Metrics:

  • Time Complexity: O(n) where n is the number of samples

  • Space Complexity: O(n) for storing predictions and sensitive attributes

  • Bootstrap CI: O(n × B) where B is the number of bootstrap samples
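The O(n × B) cost of the bootstrap is visible in a minimal sketch of the percentile method. This is illustrative, not the toolkit's implementation; `metric` is any callable computing a fairness statistic:

```python
import numpy as np

def percentile_ci(y_pred, sensitive, metric, n_boot=1000, alpha=0.05, seed=0):
    # Each of the B = n_boot resamples touches all n samples: O(n * B)
    rng = np.random.default_rng(seed)
    n = len(y_pred)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        stats[b] = metric(y_pred[idx], sensitive[idx])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```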

Pipeline Operations:

  • Bias Detection: O(n × m) where m is the number of features

  • Transformations: O(n × m) for most transformers

  • Proxy Detection: O(m²) for correlation computation

Intersectional Analysis:

  • Time Complexity: O(n × g) where g is the number of intersectional groups

  • Space Complexity: O(g) for storing group statistics

  • Can be significantly slower when many groups are present
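The group count g is bounded by the product of the attribute cardinalities, which is why each added attribute inflates runtime quickly. A small illustration (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "gender":   ["M", "F", "F", "M", "F", "M"],
    "region":   ["N", "S", "N", "S", "N", "S"],
    "age_band": ["<30", "30-50", ">50", "<30", ">50", "30-50"],
})

# Intersectional groups actually observed in the data
n_groups = df.groupby(["gender", "region", "age_band"]).ngroups
# Worst case: product of cardinalities (2 * 2 * 3 = 12 here)
upper_bound = df["gender"].nunique() * df["region"].nunique() * df["age_band"].nunique()
print(n_groups, upper_bound)
```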

Bottlenecks

  1. Bootstrap Confidence Intervals: The most computationally expensive operation

    • Use ci_method="percentile" for faster computation

    • Reduce ci_samples for quicker results (at cost of accuracy)

    • Consider disabling CI for quick checks: with_ci=False

  2. Intersectional Analysis: Slower due to increased number of groups

    • Use single-attribute analysis when possible

    • Filter to most important intersectional groups if needed

  3. Large Datasets: Memory and computation time increase linearly

    • Use batch processing for very large datasets

    • Consider sampling for exploratory analysis
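For exploratory sampling, stratifying on the sensitive attribute keeps small groups represented. A hedged sketch using pandas, where `by` is the sensitive column:

```python
import pandas as pd

def stratified_sample(df, by, frac=0.1, seed=42):
    # Draw the same fraction from each sensitive group so that small
    # groups survive the subsampling
    return df.groupby(by).sample(frac=frac, random_state=seed)
```

`DataFrameGroupBy.sample` requires pandas >= 1.1.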


Optimization Tips

1. Disable Confidence Intervals for Quick Checks

# Fast check without CI
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=False  # Skip bootstrap CI computation
)

2. Use Percentile Method for Bootstrap CI

# Faster CI method
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True,
    ci_method="percentile",  # Faster than "bca"
    ci_samples=500  # Fewer samples = faster
)

3. Reduce Minimum Group Size (When Appropriate)

# Lower threshold for faster computation (use with caution)
analyzer = FairnessAnalyzer(min_group_size=20)  # Default is 30

4. Batch Processing for Large Datasets

# Process in batches
batch_size = 10_000
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    result = analyzer.demographic_parity_difference(
        y_pred=batch["y_pred"].to_numpy(),
        sensitive=batch["gender"].to_numpy(),
        with_ci=False  # Disable CI for batch processing
    )

5. Cache Results When Possible

# Cache metric results if computing multiple times
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_metric_cached(y_pred_bytes, sensitive_bytes):
    # Arrays are not hashable, so key the cache on their raw bytes
    y_pred = np.frombuffer(y_pred_bytes, dtype=np.int64)
    sensitive = np.frombuffer(sensitive_bytes, dtype=np.int64)
    return analyzer.demographic_parity_difference(y_pred=y_pred, sensitive=sensitive)

# Call with hashable byte representations
result = compute_metric_cached(y_pred.tobytes(), sensitive.tobytes())

6. Use Native Backend (Fastest)

# Native backend is typically fastest
analyzer = FairnessAnalyzer(backend="native")

7. Parallel Processing (Advanced)

For very large datasets, consider parallel processing:

from multiprocessing import Pool

def compute_metric_for_group(args):
    # args is a (group_name, group_data) pair; group_data holds the
    # predictions and sensitive attribute for that group
    group_name, group_data = args
    result = analyzer.demographic_parity_difference(
        y_pred=group_data["y_pred"],
        sensitive=group_data["sensitive"]
    )
    return group_name, result

# group_data_list: list of (group_name, group_data) pairs prepared upstream
with Pool() as pool:
    results = pool.map(compute_metric_for_group, group_data_list)

Scalability Considerations

Dataset Size Guidelines

| Dataset Size | Recommended Approach | Notes |
|---|---|---|
| < 10k samples | Full analysis with CI | Fast enough for interactive use |
| 10k - 100k samples | Full analysis; consider reducing CI samples | Good balance of speed and accuracy |
| 100k - 1M samples | Batch processing or sampling | Use with_ci=False for quick checks |
| > 1M samples | Sampling or distributed processing | Consider using Spark/Dask for very large datasets |

Group Size Considerations

  • Minimum Group Size: Larger min_group_size values reduce computation time but may exclude important groups

  • Number of Groups: More groups (especially in intersectional analysis) increase computation time

  • Group Imbalance: Highly imbalanced groups may require more bootstrap samples for accurate CI
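A quick pre-check of group sizes shows which groups would fall under min_group_size before any metrics are computed. A sketch; the helper name is hypothetical:

```python
import pandas as pd

def group_size_report(sensitive, min_group_size=30):
    # Count members per group and flag groups below the threshold
    counts = pd.Series(sensitive).value_counts()
    return counts, counts[counts < min_group_size]

counts, too_small = group_size_report(["A"] * 100 + ["B"] * 40 + ["C"] * 10)
print(too_small.index.tolist())  # ['C'] is below the default threshold of 30
```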

Memory Usage

Typical Memory Footprint:

  • Base toolkit: ~50-100 MB

  • Per 100k samples: ~10-20 MB (depending on data types)

  • Bootstrap CI (1000 samples): ~50-100 MB additional memory

Memory Optimization:

  • Use with_ci=False to reduce memory usage

  • Process data in batches for very large datasets

  • Use appropriate data types (e.g., int8 instead of int64 when possible)
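The data-type point is easy to quantify: binary predictions fit in int8, an 8x saving over the int64 default:

```python
import numpy as np

y64 = np.zeros(100_000, dtype=np.int64)  # default integer width on most platforms
y8 = y64.astype(np.int8)                 # sufficient for 0/1 predictions
print(y64.nbytes, "bytes vs", y8.nbytes, "bytes")  # 800000 bytes vs 100000 bytes
```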


Memory Usage

Memory Profiling

# Profile memory usage
import tracemalloc

tracemalloc.start()
result = analyzer.demographic_parity_difference(
    y_pred=y_pred,
    sensitive=sensitive,
    with_ci=True
)
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()

Memory-Efficient Patterns

  1. Use Generators for Large Datasets:

    def process_in_chunks(df, chunk_size=10000):
        for i in range(0, len(df), chunk_size):
            yield df.iloc[i:i + chunk_size]
    
  2. Clear Intermediate Results:

    result = analyzer.demographic_parity_difference(...)
    # Process result
    del result  # Explicitly free memory if needed
    
  3. Use Sparse Data Structures (when applicable):

    from scipy.sparse import csr_matrix
    # Use sparse matrices for very sparse data
    

CI/CD Integration

Performance Regression Testing

The toolkit includes a benchmark suite and a performance test suite that can be integrated into CI/CD pipelines:

Using Performance Test Suite:

# .github/workflows/ci.yml
- name: Run performance tests
  run: |
    pytest tests/performance/test_performance_suite.py -v
    # Tests will fail if performance degrades beyond baselines

Using Benchmark Scripts:

# .github/workflows/ci.yml
- name: Run performance benchmarks
  run: |
    python benchmarks/benchmark_metrics_100k.py > benchmark_metrics.txt
    python benchmarks/benchmark_pipeline.py > benchmark_pipeline.txt
    
    # Check for performance regressions
python -c "
with open('benchmark_metrics.txt') as f:
    content = f.read()
# Extract timing information and check thresholds
# Fail (exit non-zero) if performance degrades significantly
"

Performance Monitoring

Track performance over time:

# Log performance metrics
import time
import json

start = time.time()
result = analyzer.demographic_parity_difference(...)
duration = time.time() - start

performance_log = {
    "metric": "demographic_parity_difference",
    "duration": duration,
    "n_samples": len(y_pred),
    "timestamp": time.time()
}

# Save to file or send to monitoring system
with open("performance_log.json", "a") as f:
    f.write(json.dumps(performance_log) + "\n")
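The JSON-lines log written above can later be summarized to watch for drift; a minimal sketch (the helper name is hypothetical):

```python
import json

def mean_duration(path, metric):
    # Average the recorded durations for one metric in the JSON-lines log
    durations = []
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["metric"] == metric:
                durations.append(entry["duration"])
    return sum(durations) / len(durations) if durations else None
```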

Benchmark Baselines

Establish performance baselines for your use case:

# Run benchmarks and save results
python benchmarks/benchmark_metrics_100k.py > baseline_metrics.txt
python benchmarks/benchmark_pipeline.py > baseline_pipeline.txt

# Compare against baselines in CI
python -c "
# Load baseline and current results
# Compare and fail if performance degrades > 20%
"
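Once the baseline and current timings are parsed into name-to-seconds mappings, the comparison itself is only a few lines. A sketch with a 20% tolerance, matching the threshold mentioned above:

```python
def check_regression(baseline, current, tolerance=0.20):
    # Return the metrics whose current timing exceeds the baseline by
    # more than the allowed tolerance
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) > base * (1 + tolerance)
    ]
```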

Best Practices

  1. Start with Quick Checks: Use with_ci=False for initial exploration

  2. Enable CI for Production: Always use confidence intervals for production validation

  3. Profile Before Optimizing: Use profiling tools to identify actual bottlenecks

  4. Monitor Performance: Track performance metrics over time

  5. Set Appropriate Thresholds: Balance accuracy (more CI samples) vs speed

  6. Use Appropriate Backend: Native backend is typically fastest unless you need adapter features


Troubleshooting Performance Issues

Slow Metric Computation

Symptoms: Metrics take > 10 seconds for 100k samples

Solutions:

  • Check if CI is enabled (disable for quick checks)

  • Reduce ci_samples if CI is needed

  • Verify you're using the native backend (backend="native")

  • Check for memory pressure (swap usage)

High Memory Usage

Symptoms: Memory usage > 500 MB for 100k samples

Solutions:

  • Process data in batches

  • Disable CI if not needed

  • Check for memory leaks in custom code

  • Use appropriate data types

Slow Pipeline Operations

Symptoms: Pipeline operations take > 5 seconds for 10k samples

Solutions:

  • Check number of pipeline steps

  • Verify transformer implementations are efficient

  • Consider caching transformer fits

  • Profile individual steps to identify bottlenecks


Additional Resources

  • Benchmark Suite: See benchmarks/README.md for detailed benchmark documentation

  • API Reference: See docs/api.md for complete API documentation

  • Integration Guide: See docs/integration_guide.md for integration examples


Performance Comparison

Backend Performance

| Backend | Speed | Features | Dependencies |
|---|---|---|---|
| Native | Fastest | Core metrics | None (always available) |
| Fairlearn | Medium | Additional metrics | fairlearn |
| Aequitas | Slower | Comprehensive reports | aequitas |

Recommendation: Use native backend unless you need specific adapter features.

CI Method Performance

| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Percentile | Fast | Good | General use |
| BCa | Slower | Better | When accuracy is critical |

Recommendation: Use percentile for most cases, BCa when accuracy is critical.


For questions or performance issues, see the Integration Guide or open an issue on GitHub.