NaN Losses

NaN (Not a Number) losses typically indicate numerical instability during training: some operation in the forward or backward pass produced an undefined value, which then propagates through every subsequent computation and gradient.
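
The snippet below is a minimal illustration in plain PyTorch of the most common ways a NaN is created and how it propagates:

import torch

print(torch.log(torch.tensor(0.0)))           # -inf: log of zero
print(torch.tensor(0.0) / torch.tensor(0.0))  # nan: 0/0 division
print(torch.tensor(float("inf")) * 0.0)       # nan: inf * 0
print(torch.tensor(65536.0, dtype=torch.float16))  # inf: fp16 overflow (max finite value is ~65504)

# Once a single NaN enters the loss, it spreads to everything downstream:
x = torch.tensor([1.0, float("nan")])
print(x.sum())  # nan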

Common Causes

Cause | Symptom | Solution
Learning rate too high | Loss spikes, then becomes NaN | Reduce the learning rate by 10x
Gradient explosion | Gradient norms grow unbounded | Enable gradient clipping
Division by zero | A specific layer outputs NaN | Check for empty batches and zero denominators
Log of zero or negative | NaN in the loss when labels are out of range | Verify labels are in [0, num_classes)
FP16 overflow/underflow | NaN appears only with fp16 mixed precision | Switch to bf16
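
Before applying a fix, it helps to confirm which failure mode you are in by checking the loss value and gradient norm directly. A minimal sketch in plain PyTorch (the tiny nn.Linear model here is a stand-in for your own):

import torch
import torch.nn as nn

model = nn.Linear(8, 4)  # stand-in model
inputs = torch.randn(16, 8)
targets = torch.randint(0, 4, (16,))

loss = nn.functional.cross_entropy(model(inputs), targets)
if not torch.isfinite(loss):
    # NaN or inf loss: inspect this batch's inputs and labels
    print("non-finite loss:", loss.item())

loss.backward()
# Passing max_norm=inf returns the total gradient norm without clipping;
# a value that grows step over step points to gradient explosion
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
print("grad norm:", total_norm.item())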

Solutions

1. Reduce Learning Rate

from autotimm import ImageClassifier, MetricConfig

metrics = [
    MetricConfig(
        name="accuracy",
        backend="torchmetrics",
        metric_class="Accuracy",
        params={"task": "multiclass"},
        stages=["train", "val"],
    )
]

model = ImageClassifier(
    backbone="resnet50",
    num_classes=10,
    metrics=metrics,
    lr=1e-5,  # Start with a lower learning rate
)

2. Enable Gradient Clipping

from autotimm import AutoTrainer

trainer = AutoTrainer(
    max_epochs=10,
    gradient_clip_val=1.0,  # Clip gradients to max norm of 1.0
)
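
This assumes norm-based clipping, the usual default in Lightning-style trainers. If you are debugging in a hand-written PyTorch loop rather than through AutoTrainer, the equivalent call is torch.nn.utils.clip_grad_norm_ (a sketch with a stand-in model and optimizer):

import torch
import torch.nn as nn

model = nn.Linear(8, 4)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()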

3. Use Learning Rate Finder

from autotimm import AutoTrainer, TunerConfig

trainer = AutoTrainer(
    max_epochs=10,
    tuner_config=TunerConfig(
        auto_lr=True,
        lr_find_kwargs={
            "min_lr": 1e-7,
            "max_lr": 1.0,
            "num_training": 100,
        },
    ),
)
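
Assuming these kwargs are forwarded to a Lightning-style learning-rate finder, this runs a short range test before real training starts: num_training mini-batch steps while the learning rate sweeps from min_lr up to max_lr, after which a rate near where the loss falls fastest is suggested.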

4. Switch to BF16 Precision

from autotimm import AutoTrainer

trainer = AutoTrainer(
    max_epochs=10,
    precision="bf16-mixed",  # bf16 keeps fp32's exponent range, so it is more stable than fp16
)
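
Note that bf16 requires hardware support (for NVIDIA GPUs, Ampere or newer). You can check before switching:

import torch

# True on Ampere (compute capability 8.0) and newer NVIDIA GPUs
if torch.cuda.is_available():
    print("bf16 supported:", torch.cuda.is_bf16_supported())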

Debugging NaN Losses

import torch

# Enable anomaly detection to pinpoint the backward op that produced the NaN.
# This slows training noticeably, so only use it while debugging.
torch.autograd.set_detect_anomaly(True)

# Train with anomaly detection
trainer.fit(model, datamodule=data)

# Remember to disable after debugging
torch.autograd.set_detect_anomaly(False)
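
Anomaly detection reports the offending backward op; to find the layer whose forward output first goes non-finite, you can attach a NaN-detecting forward hook to every submodule. A minimal sketch in plain PyTorch (the Sequential model is a stand-in for your own; the hook works on any nn.Module):

import torch
import torch.nn as nn

def nan_hook(module, inputs, output):
    # Fail fast at the first module whose output contains NaN or inf
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        raise RuntimeError(f"non-finite output in {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))  # stand-in model
for submodule in model.modules():
    submodule.register_forward_hook(nan_hook)

# Any forward pass now raises at the first layer that emits NaN/inf
out = model(torch.randn(16, 8))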