Troubleshooting¶
Comprehensive troubleshooting guide for common issues when using AutoTimm.
Quick Navigation¶
graph LR
A[Problem] --> B[<b>Training</b><br/>Loss, convergence,<br/>gradients, LR]
A --> C[<b>Performance</b><br/>Memory, speed,<br/>GPU, bottlenecks]
A --> D[<b>Data</b><br/>Loading, format,<br/>augmentation]
A --> E[<b>Models</b><br/>Architecture,<br/>pretrained, custom]
A --> F[<b>Deployment</b><br/>Export, serving,<br/>inference]
A --> G[<b>Environment</b><br/>Dependencies,<br/>CUDA, install]
A --> H[<b>Integration</b><br/>HuggingFace, timm,<br/>Lightning]
A --> I[<b>Task-Specific</b><br/>Classification,<br/>detection, segmentation]
style A fill:#1565C0,stroke:#0D47A1
style B fill:#1976D2,stroke:#1565C0
style C fill:#1565C0,stroke:#0D47A1
style D fill:#1976D2,stroke:#1565C0
style E fill:#1565C0,stroke:#0D47A1
style F fill:#1976D2,stroke:#1565C0
style G fill:#1565C0,stroke:#0D47A1
style H fill:#1976D2,stroke:#1565C0
style I fill:#1565C0,stroke:#0D47A1
Browse by Category¶
Training Issues¶
Problems during model training and optimization.
- NaN Losses - Numerical instability and NaN loss values
- Convergence Problems - Overfitting, underfitting, oscillating loss
- Gradient Issues - Gradient explosion and vanishing gradients
- LR Tuning Failures - Learning rate finder issues
Performance Issues¶
Memory, speed, and resource optimization.
- OOM Errors - Out of memory errors and solutions
- Slow Training - Training speed bottlenecks
- Performance Profiling - Identifying and fixing bottlenecks
Data Issues¶
Data loading and augmentation problems.
- Data Loading - Dataset loading and format issues
- Augmentation - Transform and augmentation errors
Model Issues¶
Model loading, checkpoints, and metrics.
- Model Loading - Checkpoint and pretrained model issues
- Metrics - Metric calculation problems
Deployment Issues¶
Export, inference, and production deployment.
- Export & Inference - ONNX, TorchScript, and inference issues
- Production Deployment - C++, iOS, Android, and edge deployment
Environment Issues¶
Hardware, devices, installation, and distributed training.
- Device Errors - CUDA, MPS, and multi-GPU issues
- Installation - Dependencies and version issues
- Distributed Training - Multi-GPU and multi-node problems
Integration Issues¶
External tools and platform integration.
- Loggers - WandB, TensorBoard, MLflow issues
- HuggingFace - HuggingFace Hub integration
- Reproducibility - Seeding and deterministic training
Task-Specific Issues¶
Issues specific to computer vision tasks.
- YOLOX - YOLOX-specific training and inference
- Interpretation - Model interpretation and visualization
Quick Reference¶
Fast lookup tables and common patterns.
- Error Reference - Common error messages and solutions
- Warnings - Common warning messages
Common Workflows¶
"My training loss is NaN"¶
- Check NaN Losses
- Enable gradient clipping
- Reduce learning rate
- Try auto-tuning: LR Tuning
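To make the gradient clipping step concrete, the sketch below shows the arithmetic behind clipping by global norm in plain Python. It is an illustration only, not AutoTimm's implementation: `global_norm` and `clip_by_global_norm` are hypothetical helper names, and the gradients are a flat list of floats rather than tensors. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs the equivalent operation on parameter tensors.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values, treated as one long vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm):
    """Return gradients rescaled so their global L2 norm is at most max_norm."""
    norm = global_norm(grads)
    if not math.isfinite(norm):
        # NaN or inf gradients cannot be repaired by scaling; the step should be skipped.
        raise ValueError("non-finite gradient norm; skip this optimizer step")
    if norm <= max_norm:
        return list(grads)          # already within bounds, leave unchanged
    scale = max_norm / norm         # shrink every value by the same factor
    return [g * scale for g in grads]


grads = [3.0, 4.0]                  # global norm is 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped, global_norm(clipped))
```

Clipping bounds the size of each update but does not fix gradients that are already NaN, which is why it is usually paired with a lower learning rate.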
"CUDA out of memory"¶
- See OOM Errors
- Reduce batch size
- Enable gradient accumulation
- Use mixed precision training
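Reducing the batch size and enabling gradient accumulation work together: a smaller micro-batch fits in memory, and accumulating over several micro-batches keeps the effective batch size close to the original. The sketch below works out that split. `accumulation_plan` is a hypothetical helper for illustration, not part of AutoTimm.

```python
import math


def accumulation_plan(target_batch_size, max_micro_batch):
    """Split a target batch into micro-batches that fit in memory.

    Returns (micro_batch_size, accumulation_steps, effective_batch_size).
    """
    if target_batch_size < 1 or max_micro_batch < 1:
        raise ValueError("batch sizes must be positive integers")
    # Fewest accumulation steps that keep each micro-batch within the limit.
    steps = math.ceil(target_batch_size / max_micro_batch)
    # Spread the target evenly over those steps, rounding up.
    micro = math.ceil(target_batch_size / steps)
    return micro, steps, micro * steps


micro, steps, effective = accumulation_plan(target_batch_size=256, max_micro_batch=48)
print(f"micro-batch {micro} x {steps} steps = effective batch {effective}")
```

When accumulating manually, the loss of each micro-batch is typically divided by the number of accumulation steps so the summed gradient matches an average over the full batch.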
"Training is too slow"¶
- Check Slow Training
- Profile with Performance Profiling
- Increase num_workers
- Enable mixed precision
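Before changing settings, it helps to know whether time is going to data loading or to the training step. The sketch below times the two separately using only the standard library. `time_loader_vs_step` and `slow_batches` are hypothetical names; the loader and step are stand-ins so the example runs without a dataset or GPU.

```python
import time


def time_loader_vs_step(batches, step_fn, max_batches=20):
    """Return (seconds waiting for data, seconds in step_fn) over a few batches."""
    load_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            batch = next(iterator)      # time spent waiting on the data pipeline
        except StopIteration:
            break
        loaded = time.perf_counter()
        step_fn(batch)                  # time spent in the training step
        done = time.perf_counter()
        load_time += loaded - start
        step_time += done - loaded
    return load_time, step_time


# Stand-ins so the sketch runs anywhere: a slow "loader" and a fast "step".
def slow_batches(n):
    for i in range(n):
        time.sleep(0.01)                # simulated decoding/augmentation cost
        yield i


load_s, step_s = time_loader_vs_step(slow_batches(5), step_fn=lambda batch: batch * 2)
print(f"data wait: {load_s:.3f}s, step: {step_s:.3f}s")
```

If the data wait dominates, raising num_workers is the likely fix; if the step dominates, mixed precision is the better lever. On a GPU, CUDA calls are asynchronous, so the step timing is only meaningful if the device is synchronized before reading the clock.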
"Model won't converge"¶
- Review Convergence Problems
- Check for overfitting/underfitting
- Adjust learning rate
- Review data quality
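The overfitting/underfitting check can be roughed out by comparing the recent trend of the training and validation losses. The sketch below is a simple heuristic, not a method taken from AutoTimm: `diagnose` is a hypothetical helper, and the `window` and `tol` defaults are illustrative values, not tuned thresholds.

```python
def diagnose(train_losses, val_losses, window=3, tol=0.01):
    """Rough label for the recent trend of two per-epoch loss curves."""
    if len(train_losses) <= window or len(val_losses) <= window:
        return "not enough epochs"

    def rel_change(xs):
        # Relative change between the latest value and the one `window` epochs earlier.
        old, new = xs[-window - 1], xs[-1]
        return (new - old) / max(abs(old), 1e-12)

    train_delta = rel_change(train_losses)
    val_delta = rel_change(val_losses)
    if train_delta < -tol and val_delta > tol:
        return "overfitting: training loss falls while validation loss rises"
    if abs(train_delta) <= tol and abs(val_delta) <= tol:
        return "plateau: neither loss is moving; revisit learning rate or model capacity"
    if train_delta > tol:
        return "diverging: training loss is rising; lower the learning rate"
    return "still improving"


train = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val = [1.05, 0.80, 0.70, 0.72, 0.78, 0.85]
print(diagnose(train, val))
```

A label from a heuristic like this is a starting point for which page to read next, not a verdict; noisy or oscillating curves need a longer window.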
"Export fails"¶
- See Export & Inference
- Use method="trace" for TorchScript
- Try lower ONNX opset version
- Simplify model architecture
Getting Help¶
If you encounter an issue not covered here:
- Search the docs: Use the search feature to find relevant information
- Check error reference: See Error Reference for common errors
- Enable debug logging: See Debug Mode for more detailed error information
- Create minimal reproduction: Isolate the issue to simplify debugging
- Report issues: Open an issue at GitHub Issues
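A minimal reproduction is easier to act on when it comes with version information. The sketch below collects the interpreter, platform, and installed package versions using only the standard library. `environment_report` is a hypothetical helper, and the default package names (`autotimm`, `torch`, `timm`, `pytorch-lightning`) are assumed distribution names; adjust them to match the actual install.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("autotimm", "torch", "timm", "pytorch-lightning")):
    """Return lines describing the interpreter, OS, and installed package versions."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Missing packages are reported rather than raising, so the report always completes.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


print("\n".join(environment_report()))
```

Pasting this output into an issue alongside the failing snippet saves a round of questions about versions.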
Debug Mode¶
Enable detailed logging for troubleshooting:
import autotimm as at  # recommended alias
from autotimm import LoggingConfig, ImageClassifier

logging_config = LoggingConfig(
    verbosity=2,  # Verbose logging
    log_gradient_norm=True,
    log_learning_rate=True,
)

model = ImageClassifier(
    backbone="resnet50",
    num_classes=10,
    logging_config=logging_config,
)
Related Resources¶
- Training Guide - Complete training documentation
- Data Loading Guide - Data loading best practices
- Deployment Guide - Production deployment guide
- API Reference - Complete API documentation