Explanation Quality Metrics¶
Quantitatively evaluate the quality of your interpretations using standardized metrics.
Overview¶
While visual inspection of heatmaps is helpful, quantitative metrics provide objective measures of explanation quality:
- Faithfulness: Does the explanation reflect model behavior?
- Sensitivity: Is the explanation stable under perturbations?
- Sanity Checks: Does the method pass basic reasonableness tests?
- Localization: Does the explanation correctly identify important regions?
Class: ExplanationMetrics¶
import autotimm as at  # recommended alias
from autotimm.interpretation import ExplanationMetrics, GradCAM

model = at.ImageClassifier(backbone="resnet50", num_classes=10)
explainer = GradCAM(model)
metrics = ExplanationMetrics(model, explainer)
Faithfulness Metrics¶
Faithfulness metrics measure how well the explanation reflects the model's actual behavior.
Deletion¶
Concept: Progressively remove the most important pixels (according to the explanation) and measure how much the prediction drops.
Expected behavior: If the explanation is faithful, removing important pixels should significantly decrease confidence.
result = metrics.deletion(
    image,
    target_class=None,  # None = predicted class
    steps=50,           # Number of deletion steps
    baseline='blur',    # What to replace deleted pixels with
)
print(f"Deletion AUC: {result['auc']:.3f}")
print(f"Final drop: {result['final_drop']:.2%}")
Returned fields:
- auc: Area under the curve (lower = better)
- final_drop: Final prediction drop (higher = better)
- scores: List of prediction scores at each step
- original_score: Original prediction score
Interpretation:
- AUC < 0.7: Excellent - important pixels have strong impact
- AUC 0.7-0.9: Good - reasonable impact
- AUC > 0.9: Poor - removing "important" pixels doesn't affect the prediction much
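For intuition, the AUC can be recomputed from the returned fields. The following is a minimal sketch, assuming uniform step spacing and normalization by the original score (autotimm's exact normalization may differ):

import numpy as np

def deletion_auc_sketch(scores, original_score):
    # Normalize each step's score by the unperturbed prediction,
    # then integrate with the trapezoid rule over [0, 1].
    y = np.asarray(scores, dtype=float) / original_score
    dx = 1.0 / (len(y) - 1)
    return float(np.sum((y[:-1] + y[1:]) * 0.5 * dx))

auc = deletion_auc_sketch(result['scores'], result['original_score'])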
Baseline options:
- 'black': Replace with zeros
- 'blur': Replace with Gaussian blur (recommended)
- 'mean': Replace with mean pixel value
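The three baselines correspond roughly to the following replacements on a (C, H, W) image tensor. This is an illustrative sketch only; the blur kernel size and sigma here are assumptions, not autotimm's internals:

import torch
import torchvision.transforms.functional as TF

def make_baseline_sketch(image: torch.Tensor, kind: str) -> torch.Tensor:
    if kind == 'black':
        return torch.zeros_like(image)               # replace with zeros
    if kind == 'mean':
        return torch.full_like(image, image.mean())  # replace with mean pixel value
    if kind == 'blur':
        return TF.gaussian_blur(image, kernel_size=51, sigma=25.0)
    raise ValueError(f"unknown baseline: {kind}")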
Insertion¶
Concept: Start with a baseline (e.g., blurred) image and progressively add back the most important pixels.
Expected behavior: Adding important pixels should quickly recover the original prediction.
result = metrics.insertion(
    image,
    target_class=None,
    steps=50,
    baseline='blur',
)
print(f"Insertion AUC: {result['auc']:.3f}")
print(f"Final rise: {result['final_rise']:.2%}")
Returned fields:
- auc: Area under the curve (higher = better)
- final_rise: How much prediction recovers (higher = better)
- scores: List of prediction scores at each step
- baseline_score: Score on baseline image
- original_score: Original prediction score
Interpretation:
- AUC > 0.7: Excellent - important pixels quickly recover the prediction
- AUC 0.5-0.7: Good
- AUC < 0.5: Poor - "important" pixels don't help the prediction
Visualizing Curves¶
import matplotlib.pyplot as plt
deletion_result = metrics.deletion(image, steps=50)
insertion_result = metrics.insertion(image, steps=50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Deletion curve
ax1.plot(deletion_result['scores'])
ax1.set_title('Deletion Curve')
ax1.set_xlabel('Steps')
ax1.set_ylabel('Prediction Score')
# Insertion curve
ax2.plot(insertion_result['scores'])
ax2.set_title('Insertion Curve')
ax2.set_xlabel('Steps')
ax2.set_ylabel('Prediction Score')
plt.savefig('faithfulness_curves.png')
Sensitivity Analysis¶
Measures explanation stability under input perturbations.
Sensitivity-N¶
Concept: Add random noise to the input and measure how much the explanation changes.
Expected behavior: Small input changes should lead to small explanation changes.
result = metrics.sensitivity_n(
    image,
    target_class=None,
    n_samples=50,      # Number of noisy samples
    noise_level=0.15,  # Std of Gaussian noise
)
print(f"Sensitivity: {result['sensitivity']:.4f}")
print(f"Std: {result['std']:.4f}")
Returned fields:
- sensitivity: Average explanation change
- std: Standard deviation of changes
- max_change: Maximum change observed
- changes: List of all changes
Interpretation:
- < 0.05: Very stable (excellent)
- 0.05-0.15: Moderately stable (good)
- > 0.15: Unstable (poor)
Adjusting parameters:
- Higher n_samples: More accurate but slower (50 recommended)
- Higher noise_level: Tests stability under larger perturbations
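Conceptually, the metric perturbs the input several times, re-runs the explainer, and averages how far each attribution map moves. A minimal sketch, assuming the explainer is callable on an image tensor and returns a heatmap tensor (autotimm's actual call signature may differ):

import torch

def sensitivity_sketch(explainer, image, n_samples=50, noise_level=0.15):
    base = explainer(image)  # attribution for the clean input
    changes = []
    for _ in range(n_samples):
        noisy = image + noise_level * torch.randn_like(image)
        changes.append((explainer(noisy) - base).abs().mean().item())
    return sum(changes) / len(changes)  # average explanation change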
Sanity Checks¶
Tests whether the explanation method behaves reasonably.
Model Parameter Randomization¶
Concept: Explanation should change significantly if model weights are randomized.
Rationale: If explanations don't change when the model is randomized, the method isn't actually explaining the model's behavior.
result = metrics.model_parameter_randomization_test(image)
print(f"Correlation: {result['correlation']:.3f}")
print(f"Passes: {result['passes']}")
Returned fields:
- correlation: Correlation between original and randomized explanations
- change: Mean absolute difference
- passes: True if correlation < 0.5
Interpretation:
- passes=True: Method is model-sensitive
- passes=False: Method may not be explaining the model
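The logic of the test can be sketched as follows. Here `make_explainer` is a hypothetical factory that builds an explainer for a given model, and re-initializing only the final parameters is one illustrative choice:

import copy
import numpy as np
import torch

def param_randomization_sketch(model, make_explainer, image):
    original = make_explainer(model)(image)
    randomized_model = copy.deepcopy(model)
    for p in list(randomized_model.parameters())[-2:]:  # e.g. head weight + bias
        torch.nn.init.normal_(p)
    randomized = make_explainer(randomized_model)(image)
    a = original.flatten().detach().cpu().numpy()
    b = randomized.flatten().detach().cpu().numpy()
    corr = float(np.corrcoef(a, b)[0, 1])
    return corr, corr < 0.5  # passes if the explanations decorrelate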
Data Randomization¶
Concept: Explanation should change significantly when explaining a different class.
Rationale: Explanations for different classes should look different.
result = metrics.data_randomization_test(image)
print(f"Correlation: {result['correlation']:.3f}")
print(f"Passes: {result['passes']}")
Returned fields:
- correlation: Correlation with explanation for random class
- change: Mean absolute difference
- passes: True if correlation < 0.5
Interpretation:
- passes=True: Method is class-sensitive
- passes=False: Method produces similar explanations for all classes
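A sketch of the underlying comparison, assuming the explainer accepts a `target_class` argument and returns a heatmap tensor (both assumptions for illustration):

import random
import numpy as np

def class_sensitivity_sketch(explainer, image, num_classes, predicted_class):
    other = random.choice([c for c in range(num_classes) if c != predicted_class])
    a = explainer(image, target_class=predicted_class).flatten().detach().cpu().numpy()
    b = explainer(image, target_class=other).flatten().detach().cpu().numpy()
    corr = float(np.corrcoef(a, b)[0, 1])
    return corr, corr < 0.5  # passes if the two classes get distinct maps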
Localization Metrics¶
For tasks where object location is known (e.g., object detection with annotations).
Pointing Game¶
Concept: Does the maximum attention fall within the object's bounding box?
bbox = [50, 30, 180, 200]  # Ground-truth box in [x1, y1, x2, y2] format (example values)
result = metrics.pointing_game(image, bbox)
print(f"Hit: {result['hit']}")
print(f"Max location: {result['max_location']}")
Returned fields:
- hit: True if max attention is inside bbox
- max_location: (y, x) coordinates of maximum attention
- bbox: Input bbox
Interpretation:
- hit=True: Explanation correctly localizes the object
- hit=False: Explanation misses the object
Use cases:
- Evaluate detection explanations
- Validate that attention focuses on relevant objects
- Compare localization accuracy across methods
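The check itself is simple. A minimal sketch over a 2-D numpy heatmap in image coordinates:

import numpy as np

def pointing_game_sketch(heatmap, bbox):
    # Locate the attention maximum and test whether it falls in the box.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x1, y1, x2, y2 = bbox
    hit = (x1 <= x <= x2) and (y1 <= y <= y2)
    return hit, (y, x)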
Comprehensive Evaluation¶
Run all applicable metrics at once:
results = metrics.evaluate_all(
    image,
    target_class=None,
    bbox=None,  # Optional: for pointing game
)
# Access individual metric results
print(f"Deletion AUC: {results['deletion']['auc']:.3f}")
print(f"Insertion AUC: {results['insertion']['auc']:.3f}")
print(f"Sensitivity: {results['sensitivity']['sensitivity']:.4f}")
print(f"Param check: {results['param_randomization']['passes']}")
print(f"Data check: {results['data_randomization']['passes']}")
# If bbox provided
if 'pointing_game' in results:
    print(f"Pointing game: {results['pointing_game']['hit']}")
Comparing Methods¶
Evaluate multiple explanation methods:
from autotimm.interpretation import GradCAM, GradCAMPlusPlus, IntegratedGradients
methods = {
    'GradCAM': GradCAM(model),
    'GradCAM++': GradCAMPlusPlus(model),
    'IntegratedGradients': IntegratedGradients(model),
}
print(f"{'Method':<20} {'Del AUC':<10} {'Ins AUC':<10} {'Sensitivity':<12}")
print("-" * 52)
for name, explainer in methods.items():
    metrics_obj = ExplanationMetrics(model, explainer, use_cuda=False)
    deletion = metrics_obj.deletion(image, steps=20)
    insertion = metrics_obj.insertion(image, steps=20)
    sensitivity = metrics_obj.sensitivity_n(image, n_samples=20)
    print(f"{name:<20} {deletion['auc']:<10.3f} {insertion['auc']:<10.3f} "
          f"{sensitivity['sensitivity']:<12.4f}")
Output example:
Method Del AUC Ins AUC Sensitivity
----------------------------------------------------
GradCAM 0.723 0.812 0.0441
GradCAM++ 0.698 0.835 0.0816
IntegratedGradients 0.654 0.891 0.0569
Analysis:
- IntegratedGradients has the best faithfulness (lowest deletion AUC, highest insertion AUC)
- GradCAM has the best stability (lowest sensitivity)
- There is a trade-off between faithfulness and efficiency
Best Practices¶
1. Use Multiple Metrics¶
Don't rely on a single metric. Good explanations should:
- Have low deletion AUC (< 0.8)
- Have high insertion AUC (> 0.6)
- Have low sensitivity (< 0.15)
- Pass both sanity checks
These criteria can be combined into a single acceptance check, as sketched below.
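A minimal sketch of that acceptance check over `evaluate_all` results (field names follow the examples in this guide):

def passes_quality_bar(results):
    return (
        results['deletion']['auc'] < 0.8
        and results['insertion']['auc'] > 0.6
        and results['sensitivity']['sensitivity'] < 0.15
        and results['param_randomization']['passes']
        and results['data_randomization']['passes']
    )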
2. Consider Your Use Case¶
For model debugging:
- Focus on faithfulness (deletion/insertion)
- Sanity checks are critical

For production deployment:
- Prioritize stability (sensitivity)
- Balance with faithfulness

For scientific research:
- Report all metrics
- Include significance tests
3. Appropriate Baselines¶
# For natural images - use blur
result = metrics.deletion(image, baseline='blur')
# For medical images - consider mean
result = metrics.deletion(image, baseline='mean')
# For stylized/synthetic - black may work
result = metrics.deletion(image, baseline='black')
4. Sufficient Steps¶
# Too few steps (10) - coarse measurement
result = metrics.deletion(image, steps=10)
# Good balance (50) - recommended
result = metrics.deletion(image, steps=50)
# Many steps (100) - more accurate but slower
result = metrics.deletion(image, steps=100)
Performance Considerations¶
Computational Cost¶
| Metric | Relative Cost | Notes |
|---|---|---|
| Deletion | High | steps × forward passes |
| Insertion | High | steps × forward passes |
| Sensitivity | High | n_samples × forward passes |
| Param randomization | Medium | 2 × forward passes + weight copy |
| Data randomization | Low | 2 × forward passes |
| Pointing game | Low | 1 × forward pass |
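As a rough per-image budget at the settings used in this guide, the table's costs add up as follows (illustrative arithmetic; explainer-internal passes, e.g. for IntegratedGradients, come on top):

steps, n_samples = 50, 50
budget = steps + steps + n_samples + 2 + 2 + 1  # deletion + insertion + sensitivity + two sanity checks + pointing game
print(f"~{budget} forward passes per image")  # ~155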
Optimization Tips¶
# Use fewer steps for quick evaluation
quick_result = metrics.deletion(image, steps=10)
# Use fewer samples for sensitivity
quick_sensitivity = metrics.sensitivity_n(image, n_samples=10)
# Skip expensive metrics for large-scale evaluation
# (only run on a sample of images)
Batching (Future Enhancement)¶
Currently, metrics are computed per-image. For evaluation on datasets, process in batches manually:
import numpy as np

all_deletions = []
for image in dataset:
    result = metrics.deletion(image, steps=20)
    all_deletions.append(result['auc'])
mean_deletion = np.mean(all_deletions)
print(f"Mean deletion AUC: {mean_deletion:.3f}")
Troubleshooting¶
For interpretation metrics issues, see the Troubleshooting - Interpretation guide, which covers:
- High deletion AUC (>0.9)
- Sensitivity NaN
- Sanity checks fail
- Pointing game always misses
Examples¶
See examples/interpretation/interpretation_metrics_demo.py for comprehensive examples including:
- Computing faithfulness metrics
- Analyzing sensitivity
- Running sanity checks
- Comparing multiple methods
- Creating visualizations
References¶
Deletion & Insertion:
- Petsiuk et al., "RISE: Randomized Input Sampling for Explanation of Black-box Models", BMVC 2018

Sensitivity:
- Yeh et al., "On the (In)fidelity and Sensitivity of Explanations", NeurIPS 2019

Sanity Checks:
- Adebayo et al., "Sanity Checks for Saliency Maps", NeurIPS 2018

Pointing Game:
- Zhang et al., "Top-down Neural Attention by Excitation Backprop", IJCV 2018
See Also¶
- Interpretation Methods - Available explanation methods
- Main Guide - Overview and quick start
- Feature Visualization - Analyze learned features