Troubleshooting¶

Comprehensive troubleshooting guide for common issues when using AutoTimm.

graph LR
    A[Problem] --> B[<b>Training</b><br/>Loss, convergence,<br/>gradients, LR]
    A --> C[<b>Performance</b><br/>Memory, speed,<br/>GPU, bottlenecks]
    A --> D[<b>Data</b><br/>Loading, format,<br/>augmentation]
    A --> E[<b>Models</b><br/>Architecture,<br/>pretrained, custom]
    A --> F[<b>Deployment</b><br/>Export, serving,<br/>inference]
    A --> G[<b>Environment</b><br/>Dependencies,<br/>CUDA, install]
    A --> H[<b>Integration</b><br/>HuggingFace, timm,<br/>Lightning]
    A --> I[<b>Task-Specific</b><br/>Classification,<br/>detection, segmentation]

    style A fill:#1565C0,stroke:#0D47A1
    style B fill:#1976D2,stroke:#1565C0
    style C fill:#1565C0,stroke:#0D47A1
    style D fill:#1976D2,stroke:#1565C0
    style E fill:#1565C0,stroke:#0D47A1
    style F fill:#1976D2,stroke:#1565C0
    style G fill:#1565C0,stroke:#0D47A1
    style H fill:#1976D2,stroke:#1565C0
    style I fill:#1565C0,stroke:#0D47A1

Browse by Category¶

Training Issues¶

Problems during model training and optimization.

NaN Losses - Numerical instability and NaN loss values
Convergence Problems - Overfitting, underfitting, oscillating loss
Gradient Issues - Gradient explosion and vanishing gradients
LR Tuning Failures - Learning rate finder issues

Performance Issues¶

Memory, speed, and resource optimization.

OOM Errors - Out-of-memory errors and solutions
Slow Training - Training speed bottlenecks
Performance Profiling - Identifying and fixing bottlenecks

Data Issues¶

Data loading and augmentation problems.

Data Loading - Dataset loading and format issues
Augmentation - Transform and augmentation errors

Model Issues¶

Model loading, checkpoints, and metrics.

Model Loading - Checkpoint and pretrained model issues
Metrics - Metric calculation problems

Deployment Issues¶

Export, inference, and production deployment.

Export & Inference - ONNX, TorchScript, and inference issues
Production Deployment - C++, iOS, Android, and edge deployment

Environment Issues¶

Hardware, devices, installation, and distributed training.

Device Errors - CUDA, MPS, and multi-GPU issues
Installation - Dependencies and version issues
Distributed Training - Multi-GPU and multi-node problems

Integration Issues¶

External tools and platform integration.

Loggers - WandB, TensorBoard, MLflow issues
HuggingFace - HuggingFace Hub integration
Reproducibility - Seeding and deterministic training

Task-Specific Issues¶

Issues specific to computer vision tasks.

YOLOX - YOLOX-specific training and inference
Interpretation - Model interpretation and visualization

Quick Reference¶

Fast lookup tables and common patterns.

Error Reference - Common error messages and solutions
Warnings - Common warning messages

Common Workflows¶

"My training loss is NaN"¶

Check NaN Losses
Enable gradient clipping
Reduce learning rate
Try auto-tuning: LR Tuning
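
The clipping step above can be illustrated with a minimal sketch in plain Python. It shows the arithmetic of clipping by global L2 norm, plus a finiteness check that catches NaN or infinite gradients before they reach the optimizer. The helper names `global_norm` and `clip_by_global_norm` are hypothetical and are not part of AutoTimm. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs this rescaling on real parameters, and Lightning's `Trainer` accepts `gradient_clip_val`. How AutoTimm exposes these settings is not shown here.

```python
import math


def global_norm(grads):
    """L2 norm of all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm):
    """Return grads rescaled so their global L2 norm is at most max_norm."""
    norm = global_norm(grads)
    if not math.isfinite(norm):
        # NaN or inf gradients: clipping cannot repair these, so fail loudly.
        raise ValueError("non-finite gradient norm; skip the step and inspect the batch")
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]  # global norm is 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(global_norm(clipped))  # approximately 1.0
```

Clipping only bounds the size of an update. If the norm is already NaN, the cause usually lies upstream (bad input values, a learning rate that is too high, or an unstable loss), which is why the sketch raises instead of rescaling.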

"CUDA out of memory"¶

See OOM Errors
Reduce batch size
Enable gradient accumulation
Use mixed precision training
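
Reducing the batch size and enabling gradient accumulation work together: the per-step batch shrinks to fit in memory, while the number of accumulated steps keeps the effective batch size close to the original. The sketch below works out that arithmetic. `accumulation_plan` is a hypothetical helper, not an AutoTimm function. In Lightning, the resulting step count would typically be passed as `Trainer(accumulate_grad_batches=...)`, and mixed precision as `precision="16-mixed"`. Whether AutoTimm's trainer forwards these arguments unchanged is an assumption to verify.

```python
def accumulation_plan(target_batch_size, max_batch_that_fits):
    """Return (per_step_batch_size, accumulation_steps).

    The product is at least target_batch_size, and the per-step batch
    never exceeds max_batch_that_fits.
    """
    if target_batch_size <= 0 or max_batch_that_fits <= 0:
        raise ValueError("batch sizes must be positive")
    # Ceiling division: fewest steps that can cover the target.
    steps = -(-target_batch_size // max_batch_that_fits)
    # Spread the target evenly over those steps.
    per_step = -(-target_batch_size // steps)
    return per_step, steps


per_step, steps = accumulation_plan(target_batch_size=256, max_batch_that_fits=64)
print(per_step, steps, per_step * steps)  # 64 4 256
```

When the target is not evenly divisible, the effective batch size overshoots slightly. For example, a target of 256 with at most 48 samples per step gives 43 samples over 6 steps, for an effective batch of 258.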

"Training is too slow"¶

Check Slow Training
Profile with Performance Profiling
Increase num_workers
Enable mixed precision
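
Before raising `num_workers`, it helps to confirm that data loading is the bottleneck. The sketch below is a rough, framework-agnostic timer that splits wall-clock time between waiting for the next batch and running the training step. `profile_loop` and `slow_loader` are hypothetical names used only for illustration. It is not a replacement for the profiling tools covered in Performance Profiling.

```python
import time


def profile_loop(batches, step_fn):
    """Split wall-clock time between fetching batches and running step_fn."""
    data_time = 0.0
    compute_time = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent in the training step
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    total = data_time + compute_time
    return {
        "data_time": data_time,
        "compute_time": compute_time,
        "data_fraction": data_time / total if total > 0 else 0.0,
    }


def slow_loader(n, delay=0.01):
    """Stand-in for a loader that is slow to decode or read from disk."""
    for i in range(n):
        time.sleep(delay)
        yield i


stats = profile_loop(slow_loader(5), step_fn=lambda batch: batch * 2)
print(f"data loading share: {stats['data_fraction']:.0%}")
```

A high data-loading share suggests that more workers or faster storage will help. A low share points at the model or the GPU instead. On CUDA, kernels run asynchronously, so `step_fn` should call `torch.cuda.synchronize()` before returning for the compute timing to be meaningful.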

"Model won't converge"¶

Review Convergence Problems
Check for overfitting/underfitting
Adjust learning rate
Review data quality
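
The overfitting/underfitting check above can be roughed out from per-epoch training and validation losses. The sketch below is a heuristic only: the function name `diagnose` and both thresholds are illustrative assumptions, not values taken from AutoTimm. Treat the returned label as a hint about where to look next.

```python
def diagnose(train_losses, val_losses, min_drop=0.05, max_rebound=0.10):
    """Label a run from per-epoch losses. Thresholds are illustrative, not tuned."""
    if len(train_losses) < 2 or len(train_losses) != len(val_losses):
        raise ValueError("need at least two epochs of paired train/val losses")
    eps = 1e-12  # guards against division by zero
    # Relative improvement of the training loss from first to last epoch.
    train_drop = (train_losses[0] - train_losses[-1]) / max(abs(train_losses[0]), eps)
    # How far the validation loss has climbed back above its best value.
    best_val = min(val_losses)
    val_rebound = (val_losses[-1] - best_val) / max(abs(best_val), eps)
    if train_drop < min_drop:
        return "underfitting"  # training loss has barely moved
    if val_rebound > max_rebound:
        return "overfitting"  # training improves while validation worsens
    return "ok"


print(diagnose([2.0, 1.0, 0.5, 0.2], [2.1, 1.2, 1.0, 1.4]))  # overfitting
print(diagnose([2.0, 1.99, 1.98], [2.1, 2.1, 2.09]))         # underfitting
```

A flat training loss can also come from a learning rate that is far too low or from mislabeled data, so an "underfitting" result is a reason to revisit the learning rate and data quality steps, not only model capacity.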

"Export fails"¶

See Export & Inference
Use method="trace" for TorchScript
Try a lower ONNX opset version
Simplify model architecture
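
To see what tracing does, and to check that a traced model still matches the original, the sketch below calls PyTorch's `torch.jit.trace` directly on a small stand-in network. `TinyNet` is hypothetical, and AutoTimm's own export helper and its `method="trace"` argument are not shown. This only illustrates the underlying PyTorch mechanism.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Small stand-in for a real backbone plus classification head."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)


model = TinyNet().eval()  # eval mode keeps layers such as dropout deterministic
example = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    traced = torch.jit.trace(model, example)  # records ops run on this input
    eager_out = model(example)
    traced_out = traced(example)

# The traced module should reproduce the eager output on the same input.
max_abs_diff = (eager_out - traced_out).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

Tracing records a single execution path, so data-dependent control flow in `forward` is not captured. This is one reason "Simplify model architecture" appears in the list above.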

Getting Help¶

If you encounter an issue not covered here:

Search the docs: Use the search feature to find relevant information
Check error reference: See Error Reference for common errors
Enable debug logging: See Debug Mode below to get more detailed error information
Create minimal reproduction: Isolate the issue to simplify debugging
Report issues: Open an issue at GitHub Issues
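
When creating a minimal reproduction or opening an issue, it usually helps to include the Python, platform, and package versions. The sketch below gathers them with the standard library only. `environment_report` is a hypothetical helper, and the default package names (including the distribution name `autotimm`) are assumptions to adjust to what is actually installed.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("autotimm", "torch", "timm")):
    """Return a short text block listing interpreter, OS, and package versions."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


report = environment_report()
print(report)
```

Pasting this output alongside the full traceback makes version mismatches easier to spot.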

Debug Mode¶

Enable detailed logging for troubleshooting:

import autotimm as at  # recommended alias
from autotimm import LoggingConfig, ImageClassifier

logging_config = LoggingConfig(
    verbosity=2,  # Verbose logging
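    log_gradient_norm=True,  # Log gradient norms to spot exploding or vanishing gradients
    log_learning_rate=True,  # Log the learning rate to verify the schedule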
)

model = ImageClassifier(
    backbone="resnet50",
    num_classes=10,
    logging_config=logging_config,
)

See Also¶

Training Guide - Complete training documentation
Data Loading Guide - Data loading best practices
Deployment Guide - Production deployment guide
API Reference - Complete API documentation