Performance Benchmarks

This guide provides performance benchmarks for different backbone architectures and tasks in AutoTimm. Use these to select the right model for your accuracy, speed, and memory requirements.

Backbone Comparison

CNN Backbones

| Backbone | Parameters | ImageNet Top-1 | Size (MB) | Inference (ms) |
| --- | --- | --- | --- | --- |
| resnet18 | 11.7M | 69.8% | 45 | 1.2 |
| resnet34 | 21.8M | 73.3% | 84 | 1.8 |
| resnet50 | 25.6M | 80.4% | 98 | 2.4 |
| resnet101 | 44.5M | 81.5% | 171 | 4.1 |
| resnet152 | 60.2M | 82.0% | 231 | 5.8 |
| efficientnet_b0 | 5.3M | 77.7% | 21 | 2.1 |
| efficientnet_b1 | 7.8M | 79.2% | 31 | 2.8 |
| efficientnet_b2 | 9.1M | 80.3% | 36 | 3.2 |
| efficientnet_b3 | 12.2M | 81.7% | 48 | 4.0 |
| efficientnet_b4 | 19.3M | 83.0% | 76 | 5.5 |
| convnext_tiny | 28.6M | 82.1% | 110 | 3.2 |
| convnext_small | 50.2M | 83.1% | 192 | 5.1 |
| convnext_base | 88.6M | 83.8% | 339 | 7.8 |
| mobilenetv3_small_100 | 2.5M | 67.7% | 10 | 0.8 |
| mobilenetv3_large_100 | 5.5M | 75.8% | 22 | 1.1 |

Inference times measured on an NVIDIA V100 with batch size 1 and 224x224 images.
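
The Size (MB) column tracks the parameter count closely. The listed values appear consistent with storing every parameter as a 32-bit float (4 bytes) and reporting mebibytes, with small differences presumably coming from non-parameter buffers. A minimal sketch of that estimate, where `fp32_size_mib` is a hypothetical helper and not part of AutoTimm:

```python
def fp32_size_mib(params_millions: float) -> float:
    """Estimate fp32 weight size in MiB from a parameter count in millions."""
    bytes_total = params_millions * 1e6 * 4  # 4 bytes per float32 parameter
    return bytes_total / 2**20               # bytes -> MiB


# (backbone, parameters in millions, size listed in the table)
for name, params_m, listed_mb in [
    ("resnet18", 11.7, 45),
    ("resnet50", 25.6, 98),
    ("vit_base_patch16_224", 86.6, 331),
]:
    estimate = fp32_size_mib(params_m)
    print(f"{name}: estimated {estimate:.1f} MiB, listed {listed_mb} MB")
```

Under the same assumption, weights stored in 16-bit precision would take roughly half of these sizes.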

Vision Transformer Backbones

| Backbone | Parameters | ImageNet Top-1 | Size (MB) | Inference (ms) |
| --- | --- | --- | --- | --- |
| vit_tiny_patch16_224 | 5.7M | 75.5% | 23 | 1.8 |
| vit_small_patch16_224 | 22.1M | 81.4% | 86 | 2.5 |
| vit_base_patch16_224 | 86.6M | 84.5% | 331 | 4.8 |
| vit_large_patch16_224 | 304.3M | 85.8% | 1163 | 14.2 |
| swin_tiny_patch4_window7_224 | 28.3M | 81.3% | 109 | 4.2 |
| swin_small_patch4_window7_224 | 49.6M | 83.0% | 190 | 6.8 |
| swin_base_patch4_window7_224 | 87.8M | 83.5% | 336 | 10.5 |
| deit_tiny_patch16_224 | 5.7M | 72.2% | 23 | 1.6 |
| deit_small_patch16_224 | 22.1M | 79.9% | 86 | 2.3 |
| deit_base_patch16_224 | 86.6M | 81.8% | 331 | 4.5 |

Recommendation by Use Case

| Use Case | Recommended Backbone | Why |
| --- | --- | --- |
| Edge deployment | mobilenetv3_small_100 | Smallest, fastest |
| Mobile apps | efficientnet_b0 | Good accuracy/speed trade-off |
| General purpose | resnet50 | Well-balanced, widely supported |
| High accuracy | convnext_base or swin_base | State-of-the-art |
| Limited GPU memory | efficientnet_b2 | Low memory, good accuracy |
| Transfer learning | resnet50 or vit_base | Best pretrained weights |
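
The same kind of selection can be made programmatically. The sketch below copies a few rows from the backbone tables into a list and picks the most accurate backbone that fits a latency and size budget. `Backbone` and `pick_backbone` are illustrative names, not AutoTimm APIs.

```python
from typing import NamedTuple, Optional


class Backbone(NamedTuple):
    name: str
    top1: float        # ImageNet top-1 accuracy, %
    size_mb: float     # model size, MB
    latency_ms: float  # V100 inference time, ms


# A few rows copied from the backbone comparison tables.
BACKBONES = [
    Backbone("mobilenetv3_small_100", 67.7, 10, 0.8),
    Backbone("efficientnet_b0", 77.7, 21, 2.1),
    Backbone("resnet50", 80.4, 98, 2.4),
    Backbone("convnext_base", 83.8, 339, 7.8),
    Backbone("vit_base_patch16_224", 84.5, 331, 4.8),
]


def pick_backbone(max_latency_ms: float, max_size_mb: float) -> Optional[Backbone]:
    """Return the most accurate backbone that fits both budgets, or None."""
    candidates = [
        b for b in BACKBONES
        if b.latency_ms <= max_latency_ms and b.size_mb <= max_size_mb
    ]
    return max(candidates, key=lambda b: b.top1, default=None)


print(pick_backbone(max_latency_ms=2.5, max_size_mb=100))   # tight budget
print(pick_backbone(max_latency_ms=10.0, max_size_mb=400))  # generous budget
```

With the tight budget the helper settles on resnet50, and with the generous one on vit_base_patch16_224. Extending the list with the remaining rows, or adding a memory column, follows the same pattern.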

Classification Results

CIFAR-10

| Model | Top-1 Accuracy | Training Time | GPU Memory |
| --- | --- | --- | --- |
| ResNet-18 | 95.2% | 12 min | 2.1 GB |
| ResNet-50 | 96.1% | 25 min | 3.8 GB |
| EfficientNet-B0 | 95.8% | 18 min | 2.4 GB |
| ConvNeXt-Tiny | 96.4% | 28 min | 4.2 GB |
| ViT-Small | 96.0% | 32 min | 4.8 GB |

Training: 50 epochs, batch size 128, single V100 GPU

CIFAR-100

| Model | Top-1 Accuracy | Top-5 Accuracy |
| --- | --- | --- |
| ResNet-18 | 77.5% | 93.2% |
| ResNet-50 | 80.2% | 94.8% |
| EfficientNet-B0 | 79.1% | 94.1% |
| ConvNeXt-Tiny | 81.5% | 95.3% |
| ViT-Small | 80.8% | 94.9% |

ImageNet-1K (Transfer Learning)

| Model | Top-1 Accuracy | Fine-tuning Time |
| --- | --- | --- |
| ResNet-50 (pretrained) | 80.4% | - |
| ResNet-50 (fine-tuned) | 82.1% | 8 hours |
| EfficientNet-B3 (pretrained) | 81.7% | - |
| ViT-Base (pretrained) | 84.5% | - |
| Swin-Base (pretrained) | 83.5% | - |

Object Detection Results

COCO Detection

| Backbone | mAP | mAP@50 | mAP@75 | Training Time | Memory |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 + FPN | 38.2 | 58.1 | 41.0 | 16h | 8.5 GB |
| ResNet-101 + FPN | 40.1 | 60.2 | 43.5 | 22h | 10.2 GB |
| EfficientNet-B3 + FPN | 39.5 | 59.2 | 42.8 | 20h | 9.1 GB |
| Swin-Tiny + FPN | 41.2 | 61.8 | 44.6 | 24h | 11.5 GB |
| ConvNeXt-Small + FPN | 42.1 | 62.5 | 45.3 | 28h | 12.8 GB |

Training: 12 epochs, batch size 16, 2x V100 GPUs

Detection Speed vs Accuracy

| Model | mAP | FPS (V100) | FPS (T4) | Use Case |
| --- | --- | --- | --- | --- |
| MobileNetV3 + FPN | 32.5 | 45 | 22 | Real-time |
| ResNet-50 + FPN | 38.2 | 28 | 14 | Balanced |
| Swin-Tiny + FPN | 41.2 | 18 | 9 | High accuracy |

import autotimm as at  # recommended alias
from autotimm import ObjectDetector, MetricConfig

# For real-time applications
model = ObjectDetector(
    backbone="mobilenetv3_large_100",
    num_classes=80,
    metrics=[...],
    fpn_channels=128,  # Smaller FPN
)

# For high accuracy
model = ObjectDetector(
    backbone="swin_small_patch4_window7_224",
    num_classes=80,
    metrics=[...],
    fpn_channels=256,
)

Segmentation Results

Cityscapes Semantic Segmentation

| Backbone | mIoU | Pixel Acc | Training Time | Memory |
| --- | --- | --- | --- | --- |
| ResNet-50 + DeepLabV3+ | 78.2% | 96.1% | 8h | 7.2 GB |
| ResNet-101 + DeepLabV3+ | 79.5% | 96.4% | 12h | 9.8 GB |
| EfficientNet-B3 + DeepLabV3+ | 78.8% | 96.2% | 10h | 8.1 GB |
| Swin-Tiny + DeepLabV3+ | 80.1% | 96.6% | 14h | 10.5 GB |

Training: 80 epochs, 512x1024 crops, single V100 GPU
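
For reference, both metrics in the table can be derived from a per-class confusion matrix. The sketch below uses the standard definitions (pixel accuracy is correct pixels over all pixels; mIoU is the mean over classes of TP / (TP + FP + FN)) on a made-up 3-class matrix. It is an illustration only and not AutoTimm's evaluation code.

```python
def pixel_accuracy(conf):
    """conf[i][j] = number of pixels with true class i predicted as class j."""
    correct = sum(conf[i][i] for i in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total


def mean_iou(conf):
    """Average IoU over classes that occur in the truth or the predictions."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true c, predicted other
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, true other
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Made-up confusion matrix (rows: ground truth, columns: prediction).
conf = [
    [50, 5, 0],
    [10, 30, 0],
    [0, 5, 0],
]
print(f"pixel accuracy: {pixel_accuracy(conf):.3f}")
print(f"mIoU: {mean_iou(conf):.3f}")
```

In this toy example the rare third class is never predicted, which pulls mIoU far below pixel accuracy. The same effect explains why the pixel accuracy figures above sit well over the mIoU figures.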

Pascal VOC Segmentation

| Backbone | mIoU | Parameters |
| --- | --- | --- |
| ResNet-50 + FCN | 72.5% | 26M |
| ResNet-50 + DeepLabV3+ | 78.5% | 28M |
| ResNet-101 + DeepLabV3+ | 80.2% | 47M |

Segmentation Speed

| Model | mIoU | FPS (512x512) | Memory |
| --- | --- | --- | --- |
| MobileNetV3 + FCN | 68.5% | 52 | 2.1 GB |
| ResNet-50 + DeepLabV3+ | 78.2% | 18 | 7.2 GB |
| Swin-Tiny + DeepLabV3+ | 80.1% | 12 | 10.5 GB |

Memory Usage

By Task

| Task | Typical Memory (batch 16) | Peak Memory |
| --- | --- | --- |
| Classification (224x224) | 4-8 GB | 10 GB |
| Detection (640x640) | 8-16 GB | 20 GB |
| Semantic Segmentation (512x512) | 8-12 GB | 16 GB |
| Instance Segmentation (640x640) | 12-24 GB | 28 GB |

By Backbone

| Backbone | Classification | Detection | Segmentation |
| --- | --- | --- | --- |
| MobileNetV3-Small | 1.5 GB | 4 GB | 3 GB |
| EfficientNet-B0 | 2.0 GB | 5 GB | 4 GB |
| ResNet-50 | 3.5 GB | 8 GB | 7 GB |
| ViT-Base | 6.0 GB | 12 GB | 10 GB |
| Swin-Base | 7.0 GB | 14 GB | 12 GB |

Memory usage with batch size 16, mixed precision training

Memory Optimization Tips

from autotimm import AutoTrainer, ImageDataModule

# 1. Reduce batch size
data = ImageDataModule(batch_size=8)  # Instead of 32

# 2. Use gradient accumulation
trainer = AutoTrainer(
    accumulate_grad_batches=4,  # Effective batch = 8 * 4 = 32
)

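# 3. Use mixed precision
#    bf16 is natively supported on Ampere and newer GPUs. On a V100 or T4,
#    fp16 mixed precision ("16-mixed" in Lightning) is the usual alternative,
#    assuming AutoTrainer forwards `precision` to the Lightning Trainer.
trainer = AutoTrainer(precision="bf16-mixed")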

# 4. Reduce image size
data = ImageDataModule(image_size=160)  # Instead of 224
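
Tip 2 works because averaging the gradients of equally sized micro-batches reproduces the gradient of the full batch. The following framework-free sketch checks this with a one-parameter least-squares model. The data and the helper `grad_mse` are made up for illustration and do not use AutoTimm.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# 32 made-up samples following y = 3x.
xs = [i / 10 for i in range(32)]
ys = [3 * x for x in xs]
w = 0.5

# Gradient from one full batch of 32.
full_grad = grad_mse(w, xs, ys)

# Four micro-batches of 8, each scaled by 1 / accumulate_grad_batches.
accumulate_grad_batches = 4
micro_batch_size = 8
accum_grad = 0.0
for k in range(accumulate_grad_batches):
    sl = slice(k * micro_batch_size, (k + 1) * micro_batch_size)
    accum_grad += grad_mse(w, xs[sl], ys[sl]) / accumulate_grad_batches

print(full_grad, accum_grad)  # the two values agree up to rounding
```

Only one micro-batch of activations has to be held in memory at a time, which is where the saving comes from. One caveat: layers that compute per-batch statistics, such as batch normalization, still see batches of 8, so training is not exactly identical to a true batch of 32.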

Inference Speed

Classification (224x224)

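| Backbone | V100 (ms) | T4 (ms) | A100 (ms) | CPU (ms) |
| --- | --- | --- | --- | --- |
| MobileNetV3-Small | 0.8 | 1.2 | 0.5 | 8 |
| EfficientNet-B0 | 2.1 | 3.5 | 1.4 | 25 |
| ResNet-50 | 2.4 | 4.2 | 1.6 | 35 |
| ViT-Base | 4.8 | 8.5 | 3.2 | 120 |
| Swin-Base | 10.5 | 18.0 | 7.0 | 180 |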

Single image inference, batch size 1

Batch Inference Speedup

| Backbone | Batch 1 | Batch 8 | Batch 32 | Speedup |
| --- | --- | --- | --- | --- |
| ResNet-50 | 2.4 ms | 8.5 ms | 28 ms | 2.7x |
| ViT-Base | 4.8 ms | 15 ms | 48 ms | 3.2x |

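Measured on a V100 GPU. Times are per batch; images per second = batch_size / time. Speedup is the batch-32 throughput relative to the batch-1 throughput.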
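
As a worked example, the sketch below converts the ResNet-50 timings from the table into images per second and reproduces the Speedup column. The helper name `throughput` is an illustrative choice.

```python
def throughput(batch_size: int, time_ms: float) -> float:
    """Images per second when one batch takes time_ms milliseconds."""
    return batch_size / (time_ms / 1000.0)


# ResNet-50 timings from the table: batch size -> time per batch (ms).
resnet50 = {1: 2.4, 8: 8.5, 32: 28.0}

base = throughput(1, resnet50[1])
for batch_size, time_ms in resnet50.items():
    ips = throughput(batch_size, time_ms)
    print(f"batch {batch_size:>2}: {ips:7.1f} images/sec ({ips / base:.1f}x)")
```

Per-batch latency rises with batch size while throughput improves, so small batches suit latency-sensitive serving and large batches suit offline processing.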

Detection (640x640)

| Model | V100 FPS | T4 FPS | Note |
| --- | --- | --- | --- |
| MobileNetV3 + FPN | 45 | 22 | Real-time capable |
| ResNet-50 + FPN | 28 | 14 | Good balance |
| Swin-Tiny + FPN | 18 | 9 | Higher accuracy |
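
Frame rates translate directly into a per-frame time budget. The sketch below converts the FPS figures from the table into milliseconds per frame and checks them against a 30 FPS target. The 30 FPS threshold is an assumption made here for illustration; the table itself does not define "real-time".

```python
def frame_latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given frame rate."""
    return 1000.0 / fps


def meets_target(fps: float, target_fps: float = 30.0) -> bool:
    """True if the model keeps up with a stream running at target_fps."""
    return fps >= target_fps


# (model, V100 FPS, T4 FPS) copied from the table.
rows = [
    ("MobileNetV3 + FPN", 45, 22),
    ("ResNet-50 + FPN", 28, 14),
    ("Swin-Tiny + FPN", 18, 9),
]
for name, v100_fps, t4_fps in rows:
    print(
        f"{name}: V100 {frame_latency_ms(v100_fps):.1f} ms/frame, "
        f"T4 {frame_latency_ms(t4_fps):.1f} ms/frame, "
        f"30 FPS on V100: {meets_target(v100_fps)}"
    )
```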

Model Selection Guidelines

By Hardware

| Hardware | Recommended Backbone | Max Batch Size |
| --- | --- | --- |
| 4GB GPU (GTX 1650) | MobileNetV3, EfficientNet-B0 | 16-32 |
| 8GB GPU (RTX 3070) | ResNet-50, EfficientNet-B2 | 32-64 |
| 16GB GPU (V100) | Any CNN, ViT-Base | 64-128 |
| 24GB+ GPU (RTX 4090, A100) | Any, including Swin-Large | 128+ |
| CPU only | MobileNetV3-Small | 1-8 |

By Accuracy Requirement

| Requirement | Classification | Detection | Segmentation |
| --- | --- | --- | --- |
| Maximum accuracy | Swin-Base, ConvNeXt-Large | Swin + FPN | Swin + DeepLabV3+ |
| High accuracy | ResNet-101, EfficientNet-B4 | ResNet-101 + FPN | ResNet-101 + DeepLabV3+ |
| Balanced | ResNet-50, EfficientNet-B2 | ResNet-50 + FPN | ResNet-50 + DeepLabV3+ |
| Fast inference | MobileNetV3, EfficientNet-B0 | MobileNetV3 + FPN | MobileNetV3 + FCN |

By Training Time Budget

| Budget | Recommended | Expected Results |
| --- | --- | --- |
| < 1 hour | MobileNetV3, ResNet-18 | Good baseline |
| 1-4 hours | ResNet-50, EfficientNet-B2 | Strong results |
| 4-12 hours | ResNet-101, ConvNeXt-Small | Near state-of-the-art |
| 12+ hours | Swin-Base, ViT-Large | Best possible |

Benchmark Code

Run your own benchmarks:

import time
import torch
from autotimm import ImageClassifier, MetricConfig


def benchmark_model(
    backbone,
    num_classes=10,
    image_size=224,
    batch_size=32,
    warmup=10,
    iterations=100,
):
    """Benchmark inference speed.

    Returns (average latency per batch in ms, throughput in images/sec).
    """
    metrics = [
        MetricConfig(
            name="accuracy",
            backend="torchmetrics",
            metric_class="Accuracy",
            params={"task": "multiclass"},
            stages=["val"],
        )
    ]

    model = ImageClassifier(
        backbone=backbone,
        num_classes=num_classes,
        metrics=metrics,
    )
    model.eval()

    if torch.cuda.is_available():
        model = model.cuda()
        device = "cuda"
    else:
        device = "cpu"

    # Create dummy input
    x = torch.randn(batch_size, 3, image_size, image_size).to(device)

    # Warmup
    with torch.inference_mode():
        for _ in range(warmup):
            _ = model(x)

    if device == "cuda":
        torch.cuda.synchronize()

    # Benchmark
    start = time.perf_counter()
    with torch.inference_mode():
        for _ in range(iterations):
            _ = model(x)

    if device == "cuda":
        torch.cuda.synchronize()

    end = time.perf_counter()

    total_time = end - start
    avg_time = total_time / iterations * 1000  # ms
    throughput = batch_size * iterations / total_time

    print(f"{backbone} (batch size {batch_size}):")
    print(f"  Average latency per batch: {avg_time:.2f} ms")
    print(f"  Throughput: {throughput:.1f} images/sec")

    return avg_time, throughput


# Run benchmarks
backbones = [
    "mobilenetv3_small_100",
    "efficientnet_b0",
    "resnet50",
    "vit_base_patch16_224",
]

for backbone in backbones:
    benchmark_model(backbone)
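
Averages can hide outliers such as an occasional slow iteration. As an optional extension, the sketch below times each call separately and reports the median and 95th-percentile latency using only the standard library. `time_calls` is a hypothetical helper. For GPU models, the callable would need to include `torch.cuda.synchronize()` so that each measurement covers the finished forward pass.

```python
import statistics
import time


def time_calls(fn, warmup=3, iterations=20):
    """Call fn repeatedly and return per-call latency statistics in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Stand-in workload; replace with a callable that runs the model on a batch.
stats = time_calls(lambda: sum(i * i for i in range(10_000)))
print(stats)
```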

See Also