# Slow Training

Optimizing training speed and identifying bottlenecks.
## Data Loading Bottleneck
```python
# Check if the CPU is the bottleneck (high GPU idle time)

# 1. Increase workers
data = ImageDataModule(
    data_dir="./data",
    num_workers=8,            # Increase based on CPU cores
    persistent_workers=True,  # Keep workers alive between epochs
    pin_memory=True,          # Faster host-to-GPU transfers
    prefetch_factor=4,        # Batches prefetched per worker
)

# 2. Use faster storage (SSD over HDD)

# 3. Preprocess data once
#    Convert images to a faster format like WebP, or use LMDB
```
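Before tuning any of the above, it helps to confirm that data loading is actually the bottleneck. A minimal sketch (the function name and arguments are hypothetical, not part of this library) that times how long each training step waits on the next batch versus how long the step itself takes:

```python
import time


def data_wait_fraction(loader, step_fn, num_batches=50):
    """Estimate the fraction of wall time spent waiting on the dataloader.

    `loader` is any iterable of batches and `step_fn` runs one training
    step (both stand-ins here). A high value (e.g. > 0.3) suggests data
    loading, not the model, is the bottleneck.
    """
    wait = compute = 0.0
    it = iter(loader)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # Time spent waiting for data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)        # Time spent in the training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total else 0.0
```

If the fraction stays high after increasing `num_workers`, look at storage speed or preprocessing instead.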
## GPU Utilization
```python
# 1. Increase the batch size (use gradient accumulation if memory limited)
data = ImageDataModule(batch_size=64, ...)

# 2. Use mixed precision
trainer = AutoTrainer(precision="bf16-mixed", ...)

# 3. Compile the model (PyTorch 2.0+)
import torch

model = ImageClassifier(...)
model = torch.compile(model, mode="reduce-overhead")
```
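The reason gradient accumulation is a safe substitute for a larger batch (point 1 above) is that averaging the loss over several micro-batches produces the same gradient as one large batch. A toy pure-Python check of that equivalence, using a scalar weight and squared error (all names here are illustrative, not library APIs):

```python
def grad_full(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to w over the full batch
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def grad_accumulated(w, xs, ys, accum_steps):
    # Sum micro-batch gradients, each scaled by 1/accum_steps
    # (equivalent to dividing each micro-batch loss by accum_steps
    # before backpropagating, as done in real accumulation loops)
    size = len(xs) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        mx = xs[i * size:(i + 1) * size]
        my = ys[i * size:(i + 1) * size]
        g += grad_full(w, mx, my) / accum_steps
    return g
```

The effective batch size is `micro_batch * accum_steps`, at the memory cost of a single micro-batch.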
## Validation Frequency
```python
# Reduce validation frequency for faster training
trainer = AutoTrainer(
    max_epochs=50,
    check_val_every_n_epoch=2,  # Validate every other epoch instead of every epoch
)

# Or validate every N training steps
trainer = AutoTrainer(
    max_epochs=50,
    val_check_interval=500,  # Validate every 500 steps
)
```

Note that a *float* `val_check_interval` is a fraction of an epoch between checks, so `val_check_interval=0.5` validates twice per epoch, which is more often than the default, not less.
## Related Issues
- OOM Errors - Memory constraints
- Profiling - Detailed performance analysis
- Device Errors - Hardware issues