This guide provides performance benchmarks for different backbone architectures and tasks in AutoTimm. Use these to select the right model for your accuracy, speed, and memory requirements.
Backbone Comparison
CNN Backbones
| Backbone |
Parameters |
ImageNet Top-1 |
Size (MB) |
Inference (ms) |
resnet18 |
11.7M |
69.8% |
45 |
1.2 |
resnet34 |
21.8M |
73.3% |
84 |
1.8 |
resnet50 |
25.6M |
80.4% |
98 |
2.4 |
resnet101 |
44.5M |
81.5% |
171 |
4.1 |
resnet152 |
60.2M |
82.0% |
231 |
5.8 |
efficientnet_b0 |
5.3M |
77.7% |
21 |
2.1 |
efficientnet_b1 |
7.8M |
79.2% |
31 |
2.8 |
efficientnet_b2 |
9.1M |
80.3% |
36 |
3.2 |
efficientnet_b3 |
12.2M |
81.7% |
48 |
4.0 |
efficientnet_b4 |
19.3M |
83.0% |
76 |
5.5 |
convnext_tiny |
28.6M |
82.1% |
110 |
3.2 |
convnext_small |
50.2M |
83.1% |
192 |
5.1 |
convnext_base |
88.6M |
83.8% |
339 |
7.8 |
mobilenetv3_small_100 |
2.5M |
67.7% |
10 |
0.8 |
mobilenetv3_large_100 |
5.5M |
75.8% |
22 |
1.1 |
Inference times measured on NVIDIA V100 with batch size 32, 224x224 images.
| Backbone |
Parameters |
ImageNet Top-1 |
Size (MB) |
Inference (ms) |
vit_tiny_patch16_224 |
5.7M |
75.5% |
23 |
1.8 |
vit_small_patch16_224 |
22.1M |
81.4% |
86 |
2.5 |
vit_base_patch16_224 |
86.6M |
84.5% |
331 |
4.8 |
vit_large_patch16_224 |
304.3M |
85.8% |
1163 |
14.2 |
swin_tiny_patch4_window7_224 |
28.3M |
81.3% |
109 |
4.2 |
swin_small_patch4_window7_224 |
49.6M |
83.0% |
190 |
6.8 |
swin_base_patch4_window7_224 |
87.8M |
83.5% |
336 |
10.5 |
deit_tiny_patch16_224 |
5.7M |
72.2% |
23 |
1.6 |
deit_small_patch16_224 |
22.1M |
79.9% |
86 |
2.3 |
deit_base_patch16_224 |
86.6M |
81.8% |
331 |
4.5 |
Recommendation by Use Case
| Use Case |
Recommended Backbone |
Why |
| Edge deployment |
mobilenetv3_small_100 |
Smallest, fastest |
| Mobile apps |
efficientnet_b0 |
Good accuracy/speed trade-off |
| General purpose |
resnet50 |
Well-balanced, widely supported |
| High accuracy |
convnext_base or swin_base |
State-of-the-art |
| Limited GPU memory |
efficientnet_b2 |
Low memory, good accuracy |
| Transfer learning |
resnet50 or vit_base |
Best pretrained weights |
Classification Results
CIFAR-10
| Model |
Top-1 Accuracy |
Training Time |
GPU Memory |
| ResNet-18 |
95.2% |
12 min |
2.1 GB |
| ResNet-50 |
96.1% |
25 min |
3.8 GB |
| EfficientNet-B0 |
95.8% |
18 min |
2.4 GB |
| ConvNeXt-Tiny |
96.4% |
28 min |
4.2 GB |
| ViT-Small |
96.0% |
32 min |
4.8 GB |
Training: 50 epochs, batch size 128, single V100 GPU
CIFAR-100
| Model |
Top-1 Accuracy |
Top-5 Accuracy |
| ResNet-18 |
77.5% |
93.2% |
| ResNet-50 |
80.2% |
94.8% |
| EfficientNet-B0 |
79.1% |
94.1% |
| ConvNeXt-Tiny |
81.5% |
95.3% |
| ViT-Small |
80.8% |
94.9% |
ImageNet-1K (Transfer Learning)
| Model |
Top-1 Accuracy |
Fine-tuning Time |
| ResNet-50 (pretrained) |
80.4% |
- |
| ResNet-50 (fine-tuned) |
82.1% |
8 hours |
| EfficientNet-B3 (pretrained) |
81.7% |
- |
| ViT-Base (pretrained) |
84.5% |
- |
| Swin-Base (pretrained) |
83.5% |
- |
Object Detection Results
COCO Detection
| Backbone |
mAP |
mAP@50 |
mAP@75 |
Training Time |
Memory |
| ResNet-50 + FPN |
38.2 |
58.1 |
41.0 |
16h |
8.5 GB |
| ResNet-101 + FPN |
40.1 |
60.2 |
43.5 |
22h |
10.2 GB |
| EfficientNet-B3 + FPN |
39.5 |
59.2 |
42.8 |
20h |
9.1 GB |
| Swin-Tiny + FPN |
41.2 |
61.8 |
44.6 |
24h |
11.5 GB |
| ConvNeXt-Small + FPN |
42.1 |
62.5 |
45.3 |
28h |
12.8 GB |
Training: 12 epochs, batch size 16, 2x V100 GPUs
Detection Speed vs Accuracy
| Model |
mAP |
FPS (V100) |
FPS (T4) |
Use Case |
| MobileNetV3 + FPN |
32.5 |
45 |
22 |
Real-time |
| ResNet-50 + FPN |
38.2 |
28 |
14 |
Balanced |
| Swin-Tiny + FPN |
41.2 |
18 |
9 |
High accuracy |
Recommended Configuration
import autotimm as at # recommended alias
from autotimm import ObjectDetector, MetricConfig
# For real-time applications
model = ObjectDetector(
backbone="mobilenetv3_large_100",
num_classes=80,
metrics=[...],
fpn_channels=128, # Smaller FPN
)
# For high accuracy
model = ObjectDetector(
backbone="swin_small_patch4_window7_224",
num_classes=80,
metrics=[...],
fpn_channels=256,
)
Segmentation Results
Cityscapes Semantic Segmentation
| Backbone |
mIoU |
Pixel Acc |
Training Time |
Memory |
| ResNet-50 + DeepLabV3+ |
78.2% |
96.1% |
8h |
7.2 GB |
| ResNet-101 + DeepLabV3+ |
79.5% |
96.4% |
12h |
9.8 GB |
| EfficientNet-B3 + DeepLabV3+ |
78.8% |
96.2% |
10h |
8.1 GB |
| Swin-Tiny + DeepLabV3+ |
80.1% |
96.6% |
14h |
10.5 GB |
Training: 80 epochs, 512x1024 crops, single V100 GPU
Pascal VOC Segmentation
| Backbone |
mIoU |
Parameters |
| ResNet-50 + FCN |
72.5% |
26M |
| ResNet-50 + DeepLabV3+ |
78.5% |
28M |
| ResNet-101 + DeepLabV3+ |
80.2% |
47M |
Segmentation Speed
| Model |
mIoU |
FPS (512x512) |
Memory |
| MobileNetV3 + FCN |
68.5% |
52 |
2.1 GB |
| ResNet-50 + DeepLabV3+ |
78.2% |
18 |
7.2 GB |
| Swin-Tiny + DeepLabV3+ |
80.1% |
12 |
10.5 GB |
Memory Usage
By Task
| Task |
Typical Memory (batch 16) |
Peak Memory |
| Classification (224x224) |
4-8 GB |
10 GB |
| Detection (640x640) |
8-16 GB |
20 GB |
| Semantic Segmentation (512x512) |
8-12 GB |
16 GB |
| Instance Segmentation (640x640) |
12-24 GB |
28 GB |
By Backbone
| Backbone |
Classification |
Detection |
Segmentation |
| MobileNetV3-Small |
1.5 GB |
4 GB |
3 GB |
| EfficientNet-B0 |
2.0 GB |
5 GB |
4 GB |
| ResNet-50 |
3.5 GB |
8 GB |
7 GB |
| ViT-Base |
6.0 GB |
12 GB |
10 GB |
| Swin-Base |
7.0 GB |
14 GB |
12 GB |
Memory usage with batch size 16, mixed precision training
Memory Optimization Tips
from autotimm import AutoTrainer, ImageDataModule
# 1. Reduce batch size
data = ImageDataModule(batch_size=8) # Instead of 32
# 2. Use gradient accumulation
trainer = AutoTrainer(
accumulate_grad_batches=4, # Effective batch = 8 * 4 = 32
)
# 3. Use mixed precision
trainer = AutoTrainer(precision="bf16-mixed")
# 4. Reduce image size
data = ImageDataModule(image_size=160) # Instead of 224
Inference Speed
Classification (224x224)
| Backbone |
V100 (ms) |
T4 (ms) |
A100 (ms) |
CPU (ms) |
| MobileNetV3-Small |
0.8 |
1.2 |
0.5 |
8 |
| EfficientNet-B0 |
2.1 |
3.5 |
1.4 |
25 |
| ResNet-50 |
2.4 |
4.2 |
1.6 |
35 |
| ViT-Base |
4.8 |
8.5 |
3.2 |
120 |
| Swin-Base |
10.5 |
18.0 |
7.0 |
180 |
Single image inference, batch size 1
Batch Inference Speedup
| Backbone |
Batch 1 |
Batch 8 |
Batch 32 |
Speedup |
| ResNet-50 |
2.4 ms |
8.5 ms |
28 ms |
2.7x |
| ViT-Base |
4.8 ms |
15 ms |
48 ms |
3.2x |
V100 GPU, images per second = batch_size / time
Detection (640x640)
| Model |
V100 FPS |
T4 FPS |
Note |
| MobileNetV3 + FPN |
45 |
22 |
Real-time capable |
| ResNet-50 + FPN |
28 |
14 |
Good balance |
| Swin-Tiny + FPN |
18 |
9 |
Higher accuracy |
Model Selection Guidelines
By Hardware
| Hardware |
Recommended Backbone |
Max Batch Size |
| 4GB GPU (GTX 1650) |
MobileNetV3, EfficientNet-B0 |
16-32 |
| 8GB GPU (RTX 3070) |
ResNet-50, EfficientNet-B2 |
32-64 |
| 16GB GPU (V100, RTX 4090) |
Any CNN, ViT-Base |
64-128 |
| 24GB+ GPU (A100) |
Any, including Swin-Large |
128+ |
| CPU only |
MobileNetV3-Small |
1-8 |
By Accuracy Requirement
| Requirement |
Classification |
Detection |
Segmentation |
| Maximum accuracy |
Swin-Base, ConvNeXt-Large |
Swin + FPN |
Swin + DeepLabV3+ |
| High accuracy |
ResNet-101, EfficientNet-B4 |
ResNet-101 + FPN |
ResNet-101 + DeepLabV3+ |
| Balanced |
ResNet-50, EfficientNet-B2 |
ResNet-50 + FPN |
ResNet-50 + DeepLabV3+ |
| Fast inference |
MobileNetV3, EfficientNet-B0 |
MobileNetV3 + FPN |
MobileNetV3 + FCN |
By Training Time Budget
| Budget |
Recommended |
Expected Results |
| < 1 hour |
MobileNetV3, ResNet-18 |
Good baseline |
| 1-4 hours |
ResNet-50, EfficientNet-B2 |
Strong results |
| 4-12 hours |
ResNet-101, ConvNeXt-Small |
Near state-of-the-art |
| 12+ hours |
Swin-Base, ViT-Large |
Best possible |
Benchmark Code
Run your own benchmarks:
import time
import torch
from autotimm import ImageClassifier, MetricConfig
def benchmark_model(backbone, num_classes=10, image_size=224, batch_size=32, warmup=10, iterations=100):
"""Benchmark model inference speed."""
metrics = [
MetricConfig(
name="accuracy",
backend="torchmetrics",
metric_class="Accuracy",
params={"task": "multiclass"},
stages=["val"],
)
]
model = ImageClassifier(
backbone=backbone,
num_classes=num_classes,
metrics=metrics,
)
model.eval()
if torch.cuda.is_available():
model = model.cuda()
device = "cuda"
else:
device = "cpu"
# Create dummy input
x = torch.randn(batch_size, 3, image_size, image_size).to(device)
# Warmup
with torch.inference_mode():
for _ in range(warmup):
_ = model(x)
if device == "cuda":
torch.cuda.synchronize()
# Benchmark
start = time.perf_counter()
with torch.inference_mode():
for _ in range(iterations):
_ = model(x)
if device == "cuda":
torch.cuda.synchronize()
end = time.perf_counter()
total_time = end - start
avg_time = total_time / iterations * 1000 # ms
throughput = batch_size * iterations / total_time
print(f"{backbone}:")
print(f" Average latency: {avg_time:.2f} ms")
print(f" Throughput: {throughput:.1f} images/sec")
return avg_time, throughput
# Run benchmarks
backbones = [
"mobilenetv3_small_100",
"efficientnet_b0",
"resnet50",
"vit_base_patch16_224",
]
for backbone in backbones:
benchmark_model(backbone)
See Also