Explanation Quality Metrics¶
Quantitatively evaluate the quality of your interpretations using standardized metrics.
Overview¶
While visual inspection of heatmaps is helpful, quantitative metrics provide objective measures of explanation quality:
- Faithfulness: Does the explanation reflect model behavior?
- Sensitivity: Is the explanation stable under perturbations?
- Sanity Checks: Does the method pass basic reasonableness tests?
- Localization: Does the explanation correctly identify important regions?
Class: ExplanationMetrics¶
import autotimm as at  # recommended alias
from autotimm.interpretation import ExplanationMetrics, GradCAM

model = at.ImageClassifier(backbone="resnet50", num_classes=10)
explainer = GradCAM(model)
metrics = ExplanationMetrics(model, explainer)
Faithfulness Metrics¶
Faithfulness metrics measure how well the explanation reflects the model's actual behavior.
Deletion¶
Concept: Progressively remove the most important pixels (according to the explanation) and measure how much the prediction drops.
Expected behavior: If the explanation is faithful, removing important pixels should significantly decrease confidence.
result = metrics.deletion(
    image,
    target_class=None,  # None = predicted class
    steps=50,           # Number of deletion steps
    baseline='blur',    # What to replace deleted pixels with
)
print(f"Deletion AUC: {result['auc']:.3f}")
print(f"Final drop: {result['final_drop']:.2%}")
Returned fields:
- auc: Area under the curve (lower = better)
- final_drop: Final prediction drop (higher = better)
- scores: List of prediction scores at each step
- original_score: Original prediction score
Interpretation:
- AUC < 0.7: Excellent - important pixels have strong impact
- AUC 0.7-0.9: Good - reasonable impact
- AUC > 0.9: Poor - removing "important" pixels doesn't affect the prediction much
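For intuition, the AUC can be recomputed from the returned fields. The following is a minimal sketch, assuming uniform step spacing and normalization by the original score (autotimm's exact normalization may differ):

import numpy as np

def deletion_auc_sketch(scores, original_score):
    # Normalize each step's score by the unperturbed prediction,
    # then integrate with the trapezoid rule over [0, 1].
    y = np.asarray(scores, dtype=float) / original_score
    dx = 1.0 / (len(y) - 1)
    return float(np.sum((y[:-1] + y[1:]) * 0.5 * dx))

auc = deletion_auc_sketch(result['scores'], result['original_score'])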
Baseline options:
- 'black': Replace with zeros
- 'blur': Replace with Gaussian blur (recommended)
- 'mean': Replace with mean pixel value
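The three baselines correspond roughly to the following replacements on a (C, H, W) image tensor. This is an illustrative sketch only; the blur kernel size and sigma here are assumptions, not autotimm's internals:

import torch
import torchvision.transforms.functional as TF

def make_baseline_sketch(image: torch.Tensor, kind: str) -> torch.Tensor:
    if kind == 'black':
        return torch.zeros_like(image)               # replace with zeros
    if kind == 'mean':
        return torch.full_like(image, image.mean())  # replace with mean pixel value
    if kind == 'blur':
        return TF.gaussian_blur(image, kernel_size=51, sigma=25.0)
    raise ValueError(f"unknown baseline: {kind}")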
Insertion¶
Concept: Start with a baseline (e.g., blurred) image and progressively add back the most important pixels.
Expected behavior: Adding important pixels should quickly recover the original prediction.
result = metrics.insertion(
    image,
    target_class=None,
    steps=50,
    baseline='blur',
)
print(f"Insertion AUC: {result['auc']:.3f}")
print(f"Final rise: {result['final_rise']:.2%}")
Returned fields:
- auc: Area under the curve (higher = better)
- final_rise: How much prediction recovers (higher = better)
- scores: List of prediction scores at each step
- baseline_score: Score on baseline image
- original_score: Original prediction score
Interpretation:
- AUC > 0.7: Excellent - important pixels quickly recover the prediction
- AUC 0.5-0.7: Good
- AUC < 0.5: Poor - "important" pixels don't help the prediction
Visualizing Curves¶
import matplotlib.pyplot as plt
deletion_result = metrics.deletion(image, steps=50)
insertion_result = metrics.insertion(image, steps=50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Deletion curve
ax1.plot(deletion_result['scores'])
ax1.set_title('Deletion Curve')
ax1.set_xlabel('Steps')
ax1.set_ylabel('Prediction Score')
# Insertion curve
ax2.plot(insertion_result['scores'])
ax2.set_title('Insertion Curve')
ax2.set_xlabel('Steps')
ax2.set_ylabel('Prediction Score')
plt.savefig('faithfulness_curves.png')
Sensitivity Analysis¶
Measures explanation stability under input perturbations.
Sensitivity-N¶
Concept: Add random noise to the input and measure how much the explanation changes.
Expected behavior: Small input changes should lead to small explanation changes.
result = metrics.sensitivity_n(
    image,
    target_class=None,
    n_samples=50,      # Number of noisy samples
    noise_level=0.15,  # Std of Gaussian noise
)
print(f"Sensitivity: {result['sensitivity']:.4f}")
print(f"Std: {result['std']:.4f}")
Returned fields:
- sensitivity: Average explanation change
- std: Standard deviation of changes
- max_change: Maximum change observed
- changes: List of all changes
Interpretation:
- < 0.05: Very stable (excellent)
- 0.05-0.15: Moderately stable (good)
- > 0.15: Unstable (poor)
Adjusting parameters:
- Higher n_samples: More accurate but slower (50 recommended)
- Higher noise_level: Tests stability under larger perturbations
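Conceptually, the metric perturbs the input several times, re-runs the explainer, and averages how far each attribution map moves. A minimal sketch, assuming the explainer is callable on an image tensor and returns a heatmap tensor (autotimm's actual call signature may differ):

import torch

def sensitivity_sketch(explainer, image, n_samples=50, noise_level=0.15):
    base = explainer(image)  # attribution for the clean input
    changes = []
    for _ in range(n_samples):
        noisy = image + noise_level * torch.randn_like(image)
        changes.append((explainer(noisy) - base).abs().mean().item())
    return sum(changes) / len(changes)  # average explanation change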
Sanity Checks¶
Tests whether the explanation method behaves reasonably.
Model Parameter Randomization¶
Concept: Explanation should change significantly if model weights are randomized.
Rationale: If explanations don't change when the model is randomized, the method isn't actually explaining the model's behavior.
result = metrics.model_parameter_randomization_test(image)
print(f"Correlation: {result['correlation']:.3f}")
print(f"Passes: {result['passes']}")
Returned fields:
- correlation: Correlation between original and randomized explanations
- change: Mean absolute difference
- passes: True if correlation < 0.5
Interpretation:
- passes=True: Method is model-sensitive
- passes=False: Method may not be explaining the model
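The logic of the test can be sketched as follows. Here `make_explainer` is a hypothetical factory that builds an explainer for a given model, and re-initializing only the final parameters is one illustrative choice:

import copy
import numpy as np
import torch

def param_randomization_sketch(model, make_explainer, image):
    original = make_explainer(model)(image)
    randomized_model = copy.deepcopy(model)
    for p in list(randomized_model.parameters())[-2:]:  # e.g. head weight + bias
        torch.nn.init.normal_(p)
    randomized = make_explainer(randomized_model)(image)
    a = original.flatten().detach().cpu().numpy()
    b = randomized.flatten().detach().cpu().numpy()
    corr = float(np.corrcoef(a, b)[0, 1])
    return corr, corr < 0.5  # passes if the explanations decorrelate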
Data Randomization¶
Concept: Explanation should change significantly when explaining a different class.
Rationale: Explanations for different classes should look different.
result = metrics.data_randomization_test(image)
print(f"Correlation: {result['correlation']:.3f}")
print(f"Passes: {result['passes']}")
Returned fields:
- correlation: Correlation with explanation for random class
- change: Mean absolute difference
- passes: True if correlation < 0.5
Interpretation:
- passes=True: Method is class-sensitive
- passes=False: Method produces similar explanations for all classes
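A sketch of the underlying comparison, assuming the explainer accepts a `target_class` argument and returns a heatmap tensor (both assumptions for illustration):

import random
import numpy as np

def class_sensitivity_sketch(explainer, image, num_classes, predicted_class):
    other = random.choice([c for c in range(num_classes) if c != predicted_class])
    a = explainer(image, target_class=predicted_class).flatten().detach().cpu().numpy()
    b = explainer(image, target_class=other).flatten().detach().cpu().numpy()
    corr = float(np.corrcoef(a, b)[0, 1])
    return corr, corr < 0.5  # passes if the two classes get distinct maps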
Localization Metrics¶
For tasks where object location is known (e.g., object detection with annotations).
Pointing Game¶
Concept: Does the maximum attention fall within the object's bounding box?
bbox = [50, 30, 180, 200]  # Ground-truth box in [x1, y1, x2, y2] format (example values)
result = metrics.pointing_game(image, bbox)
print(f"Hit: {result['hit']}")
print(f"Max location: {result['max_location']}")
Returned fields:
- hit: True if max attention is inside bbox
- max_location: (y, x) coordinates of maximum attention
- bbox: Input bbox
Interpretation:
- hit=True: Explanation correctly localizes the object
- hit=False: Explanation misses the object
Use cases:
- Evaluate detection explanations
- Validate that attention focuses on relevant objects
- Compare localization accuracy across methods
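The check itself is simple. A minimal sketch over a 2-D numpy heatmap in image coordinates:

import numpy as np

def pointing_game_sketch(heatmap, bbox):
    # Locate the attention maximum and test whether it falls in the box.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x1, y1, x2, y2 = bbox
    hit = (x1 <= x <= x2) and (y1 <= y <= y2)
    return hit, (y, x)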
Comprehensive Evaluation¶
Run all applicable metrics at once:
results = metrics.evaluate_all(
    image,
    target_class=None,
    bbox=None,  # Optional: for pointing game
)
# Access individual metric results
print(f"Deletion AUC: {results['deletion']['auc']:.3f}")
print(f"Insertion AUC: {results['insertion']['auc']:.3f}")
print(f"Sensitivity: {results['sensitivity']['sensitivity']:.4f}")
print(f"Param check: {results['param_randomization']['passes']}")
print(f"Data check: {results['data_randomization']['passes']}")
# If bbox provided
if 'pointing_game' in results:
    print(f"Pointing game: {results['pointing_game']['hit']}")
Comparing Methods¶
Evaluate multiple explanation methods:
from autotimm.interpretation import GradCAM, GradCAMPlusPlus, IntegratedGradients
methods = {
    'GradCAM': GradCAM(model),
    'GradCAM++': GradCAMPlusPlus(model),
    'IntegratedGradients': IntegratedGradients(model),
}
print(f"{'Method':<20} {'Del AUC':<10} {'Ins AUC':<10} {'Sensitivity':<12}")
print("-" * 52)
for name, explainer in methods.items():
    metrics_obj = ExplanationMetrics(model, explainer, use_cuda=False)
    deletion = metrics_obj.deletion(image, steps=20)
    insertion = metrics_obj.insertion(image, steps=20)
    sensitivity = metrics_obj.sensitivity_n(image, n_samples=20)
    print(f"{name:<20} {deletion['auc']:<10.3f} {insertion['auc']:<10.3f} "
          f"{sensitivity['sensitivity']:<12.4f}")
Output example:
Method Del AUC Ins AUC Sensitivity
----------------------------------------------------
GradCAM 0.723 0.812 0.0441
GradCAM++ 0.698 0.835 0.0816
IntegratedGradients 0.654 0.891 0.0569
Analysis:
- IntegratedGradients has the best faithfulness (lowest deletion AUC, highest insertion AUC)
- GradCAM has the best stability (lowest sensitivity)
- There is a trade-off between faithfulness and efficiency
Best Practices¶
1. Use Multiple Metrics¶
Don't rely on a single metric. Good explanations should:
- Have low deletion AUC (< 0.8)
- Have high insertion AUC (> 0.6)
- Have low sensitivity (< 0.15)
- Pass both sanity checks
These criteria can be combined into a single acceptance check, as sketched below.
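A minimal sketch of that acceptance check over `evaluate_all` results (field names follow the examples in this guide):

def passes_quality_bar(results):
    return (
        results['deletion']['auc'] < 0.8
        and results['insertion']['auc'] > 0.6
        and results['sensitivity']['sensitivity'] < 0.15
        and results['param_randomization']['passes']
        and results['data_randomization']['passes']
    )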
2. Consider Your Use Case¶
For model debugging:
- Focus on faithfulness (deletion/insertion)
- Sanity checks are critical

For production deployment:
- Prioritize stability (sensitivity)
- Balance with faithfulness

For scientific research:
- Report all metrics
- Include significance tests
3. Appropriate Baselines¶
# For natural images - use blur
result = metrics.deletion(image, baseline='blur')
# For medical images - consider mean
result = metrics.deletion(image, baseline='mean')
# For stylized/synthetic - black may work
result = metrics.deletion(image, baseline='black')
4. Sufficient Steps¶
# Too few steps (10) - coarse measurement
result = metrics.deletion(image, steps=10)
# Good balance (50) - recommended
result = metrics.deletion(image, steps=50)
# Many steps (100) - more accurate but slower
result = metrics.deletion(image, steps=100)
Performance Considerations¶
Computational Cost¶
| Metric | Relative Cost | Notes |
|---|---|---|
| Deletion | High | steps × forward passes |
| Insertion | High | steps × forward passes |
| Sensitivity | High | n_samples × forward passes |
| Param randomization | Medium | 2 × forward passes + weight copy |
| Data randomization | Low | 2 × forward passes |
| Pointing game | Low | 1 × forward pass |
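As a rough per-image budget at the settings used in this guide, the table's costs add up as follows (illustrative arithmetic; explainer-internal passes, e.g. for IntegratedGradients, come on top):

steps, n_samples = 50, 50
budget = steps + steps + n_samples + 2 + 2 + 1  # deletion + insertion + sensitivity + two sanity checks + pointing game
print(f"~{budget} forward passes per image")  # ~155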
Optimization Tips¶
# Use fewer steps for quick evaluation
quick_result = metrics.deletion(image, steps=10)
# Use fewer samples for sensitivity
quick_sensitivity = metrics.sensitivity_n(image, n_samples=10)
# Skip expensive metrics for large-scale evaluation
# (only run on a sample of images)
Batching (Future Enhancement)¶
Currently, metrics are computed per-image. For evaluation on datasets, process in batches manually:
import numpy as np

all_deletions = []
for image in dataset:
    result = metrics.deletion(image, steps=20)
    all_deletions.append(result['auc'])
mean_deletion = np.mean(all_deletions)
print(f"Mean deletion AUC: {mean_deletion:.3f}")
Troubleshooting¶
For interpretation metrics issues, see the Troubleshooting - Interpretation guide, which covers:
- High deletion AUC (>0.9)
- Sensitivity NaN
- Sanity checks fail
- Pointing game always misses
Examples¶
See examples/interpretation/interpretation_metrics_demo.py for comprehensive examples including:
- Computing faithfulness metrics
- Analyzing sensitivity
- Running sanity checks
- Comparing multiple methods
- Creating visualizations
References¶
Deletion & Insertion:
- Petsiuk et al., "RISE: Randomized Input Sampling for Explanation of Black-box Models", BMVC 2018

Sensitivity:
- Yeh et al., "On the (In)fidelity and Sensitivity of Explanations", NeurIPS 2019

Sanity Checks:
- Adebayo et al., "Sanity Checks for Saliency Maps", NeurIPS 2018

Pointing Game:
- Zhang et al., "Top-down Neural Attention by Excitation Backprop", IJCV 2018
See Also¶
- Interpretation Methods - Available explanation methods
- Main Guide - Overview and quick start
- Feature Visualization - Analyze learned features