
Troubleshooting

A guide to diagnosing and fixing common issues when using AutoTimm.

Quick Navigation

graph LR
    A[Problem] --> B[<b>Training</b><br/>Loss, convergence,<br/>gradients, LR]
    A --> C[<b>Performance</b><br/>Memory, speed,<br/>GPU, bottlenecks]
    A --> D[<b>Data</b><br/>Loading, format,<br/>augmentation]
    A --> E[<b>Models</b><br/>Architecture,<br/>pretrained, custom]
    A --> F[<b>Deployment</b><br/>Export, serving,<br/>inference]
    A --> G[<b>Environment</b><br/>Dependencies,<br/>CUDA, install]
    A --> H[<b>Integration</b><br/>HuggingFace, timm,<br/>Lightning]
    A --> I[<b>Task-Specific</b><br/>Classification,<br/>detection, segmentation]

    style A fill:#1565C0,stroke:#0D47A1
    style B fill:#1976D2,stroke:#1565C0
    style C fill:#1565C0,stroke:#0D47A1
    style D fill:#1976D2,stroke:#1565C0
    style E fill:#1565C0,stroke:#0D47A1
    style F fill:#1976D2,stroke:#1565C0
    style G fill:#1565C0,stroke:#0D47A1
    style H fill:#1976D2,stroke:#1565C0
    style I fill:#1565C0,stroke:#0D47A1

Browse by Category

Training Issues

Problems during model training and optimization.

Performance Issues

Memory, speed, and resource optimization.

Data Issues

Data loading and augmentation problems.

Model Issues

Model loading, checkpoints, and metrics.

Deployment Issues

Export, inference, and production deployment.

Environment Issues

Hardware, devices, and distributed training.

Integration Issues

External tools and platform integration.

Task-Specific Issues

Issues specific to computer vision tasks.

  • YOLOX - YOLOX-specific training and inference
  • Interpretation - Model interpretation and visualization

Quick Reference

Fast lookup tables and common patterns.

Common Workflows

"My training loss is NaN"

  1. Check NaN Losses
  2. Enable gradient clipping
  3. Reduce learning rate
  4. Try auto-tuning: LR Tuning
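
Steps 2 and 3 can be sketched in plain PyTorch. This is a minimal illustration, not AutoTimm's own mechanism: the helper name `safe_step`, the `max_norm=1.0` threshold, and the `1e-3` learning rate are assumptions chosen for the example. The helper skips the update when the loss is already NaN or Inf and otherwise clips the gradient norm before stepping.

```python
import torch
from torch import nn


def safe_step(model, optimizer, loss, max_norm=1.0):
    """Backpropagate, clip gradients, and step (hypothetical helper).

    Returns the gradient norm measured before clipping, or None if the
    loss was NaN/Inf and the update was skipped.
    """
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        return None  # do not let a bad batch corrupt the weights
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return float(grad_norm)


torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # deliberately small LR
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
grad_norm = safe_step(model, optimizer, loss)
print(f"gradient norm before clipping: {grad_norm:.4f}")
```

A gradient norm that grows steadily over the steps before the first NaN usually points at the learning rate; a NaN on the very first step more often points at the input data.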

"CUDA out of memory"

  1. See OOM Errors
  2. Reduce batch size
  3. Enable gradient accumulation
  4. Use mixed precision training
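
Gradient accumulation (step 3) trades time for memory: several small batches are processed before a single optimizer step. The sketch below uses plain PyTorch with a toy model; the batch sizes and variable names are illustrative assumptions, and it does not show how AutoTimm itself exposes this setting.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 16)
targets = torch.randint(0, 2, (32,))

micro_batch_size = 8      # what fits in memory
accumulation_steps = 4    # 8 * 4 = effective batch size of 32

optimizer.zero_grad(set_to_none=True)
for step in range(accumulation_steps):
    start = step * micro_batch_size
    xb = inputs[start:start + micro_batch_size]
    yb = targets[start:start + micro_batch_size]
    # Divide so the accumulated gradient equals the mean over all 32 samples.
    loss = loss_fn(model(xb), yb) / accumulation_steps
    loss.backward()       # gradients add up in .grad across iterations

accumulated_grad = model.weight.grad.clone()
optimizer.step()          # one update for the whole effective batch
```

Because only one micro-batch of activations is alive at a time, peak memory follows `micro_batch_size` while the update matches the larger effective batch (for models without batch-dependent layers such as batch norm).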

"Training is too slow"

  1. Check Slow Training
  2. Profile with Performance Profiling
  3. Increase num_workers
  4. Enable mixed precision
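
Before raising `num_workers`, it helps to confirm that data loading is the bottleneck. The following sketch times how long the loop waits for each batch versus how long the training step takes. The names `measure_data_wait` and `slow_loader` are hypothetical; `slow_loader` stands in for a real data loader.

```python
import time


def measure_data_wait(batches, train_step):
    """Return (seconds waiting for data, seconds in train_step) over one pass."""
    data_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_loader(n=5, delay=0.02):
    """Stand-in for a data loader whose workers cannot keep up."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_time, step_time = measure_data_wait(slow_loader(), lambda batch: None)
print(f"data wait: {data_time:.3f}s, compute: {step_time:.3f}s")
```

If the data wait dominates, more workers or faster storage is likely to help. If compute dominates, mixed precision is the more promising change. On a GPU, call `torch.cuda.synchronize()` inside `train_step` before it returns, since CUDA kernels run asynchronously and would otherwise make compute look faster than it is.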

"Model won't converge"

  1. Review Convergence Problems
  2. Check for overfitting/underfitting
  3. Adjust learning rate
  4. Review data quality
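
For step 2, a rough way to tell overfitting from underfitting is to compare per-epoch training and validation losses. The function below is an illustrative heuristic, not part of AutoTimm; the name `diagnose_fit` and both thresholds are assumptions that should be tuned to the task.

```python
def diagnose_fit(train_losses, val_losses, gap_ratio=1.5, min_improvement=0.05):
    """Classify a run from per-epoch losses using two rough thresholds."""
    if len(train_losses) < 2 or len(train_losses) != len(val_losses):
        raise ValueError("need two or more epochs of paired train/val losses")
    first, last_train, last_val = train_losses[0], train_losses[-1], val_losses[-1]
    # Training loss barely moved: the model is not learning.
    if first - last_train < min_improvement * abs(first):
        return "underfitting"
    # Validation loss is far above training loss: the model is memorizing.
    if last_val > gap_ratio * last_train:
        return "overfitting"
    return "ok"


print(diagnose_fit([2.30, 2.29, 2.28], [2.31, 2.30, 2.29]))  # underfitting
print(diagnose_fit([2.30, 0.80, 0.10], [2.20, 1.10, 1.40]))  # overfitting
print(diagnose_fit([2.30, 1.20, 0.70], [2.25, 1.30, 0.85]))  # ok
```

An "underfitting" result typically leads to step 3 (adjust the learning rate), while "overfitting" leads to step 4 and to stronger regularization or augmentation.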

"Export fails"

  1. See Export & Inference
  2. Use method="trace" for TorchScript
  3. Try lower ONNX opset version
  4. Simplify model architecture
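
To isolate whether a failure comes from the model or from the export wrapper, tracing can be tried directly with PyTorch. The sketch below assumes that `method="trace"` corresponds to `torch.jit.trace`; the small `nn.Sequential` model is a stand-in, not an AutoTimm model.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()  # export in eval mode so dropout and batch norm are deterministic

example_input = torch.randn(1, 3, 32, 32)

# Tracing records the operations executed for this one input;
# data-dependent Python control flow is not captured.
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    max_diff = (traced(example_input) - model(example_input)).abs().max().item()

print(f"max |traced - eager| = {max_diff:.2e}")
```

If tracing succeeds but ONNX export still fails, the opset can be set explicitly through the `opset_version` argument of `torch.onnx.export`, which makes it easy to step down one version at a time.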

Getting Help

If you encounter an issue not covered here:

  1. Search the docs: Use the search feature to find relevant information
  2. Check error reference: See Error Reference for common errors
  3. Enable debug logging: Turn on verbose output as described under Debug Mode to get more detailed error information
  4. Create minimal reproduction: Isolate the issue to simplify debugging
  5. Report issues: Open an issue at GitHub Issues
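
An issue report is easier to act on when it includes version information. The snippet below is one way to collect it using only the standard library. The function name `environment_report` is hypothetical, and the package names in the default list are assumptions about what is installed under those distribution names.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("autotimm", "torch", "timm", "pytorch-lightning")):
    """Collect interpreter, OS, and package versions for an issue report."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

Paste the output next to the minimal reproduction and the full traceback.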

Debug Mode

Enable detailed logging for troubleshooting:

import autotimm as at  # recommended alias
from autotimm import LoggingConfig, ImageClassifier

logging_config = LoggingConfig(
    verbosity=2,  # Verbose logging
    log_gradient_norm=True,
    log_learning_rate=True,
)

model = ImageClassifier(
    backbone="resnet50",
    num_classes=10,
    logging_config=logging_config,
)