Building Neural Network Architectures Using Only ONNX

Introduction
ONNX (Open Neural Network Exchange) is most commonly known as an export target — a format you dump a PyTorch or TensorFlow model into for deployment. But ONNX is also a fully self-contained, expressive intermediate representation that you can build directly, without ever touching a training framework. This guide treats ONNX as a first-class construction language, not a second-class export artifact.
Why would you want to build networks directly in ONNX?
Portability without framework lock-in. An ONNX graph can run on any backend for which ONNX Runtime provides an execution provider: CPU, CUDA, DirectML, TensorRT, OpenVINO, CoreML, and more. If you define your architecture in ONNX directly, there is no intermediate framework to install or version-pin.
Deterministic, inspectable graphs. When you export from PyTorch, the resulting graph depends on tracing or scripting heuristics that can produce surprising operator sequences. When you write ONNX directly, you know exactly what every node does.
Extremely lightweight deployments. ONNX plus ONNX Runtime has a far smaller dependency footprint than PyTorch or TensorFlow. For embedded systems, edge devices, or serverless inference, this matters enormously.
Fine-grained graph surgery. If you need to fuse operators, insert quantization nodes, rewire connections, or experiment with non-standard topologies, working at the ONNX level directly gives you exact control with no framework abstractions in the way.
Learning how neural networks really work. Building an architecture from raw matrix multiply and activation nodes forces you to understand every dimension, every weight layout, every broadcasting rule. It is an excellent exercise and deeply illuminating.
This guide assumes basic Python proficiency and some familiarity with neural network concepts (layers, activations, convolutions). It does not assume you have ever used PyTorch or TensorFlow.
Understanding the ONNX Format
An ONNX model is a serialized Protocol Buffer file. The .onnx extension is standard, but the file is just a binary proto. The schema is defined in the ONNX specification.
At the top level, a ModelProto contains:
- ir_version: The ONNX IR (Intermediate Representation) version.
- opset_import: Which operator sets (and which versions of them) this model uses. Most models use the default "" domain with a version like 17 or 21. The proto field is named opset_import; the matching keyword argument of helper.make_model is opset_imports.
- graph: A GraphProto, the actual computation graph.
- producer_name, producer_version, domain, model_version, doc_string: Metadata fields.
The GraphProto contains:
- node: A list of NodeProto objects. Each node is one operation.
- initializer: A list of TensorProto objects representing constant tensors (weights, biases, embedding tables, etc.).
- input: A list of ValueInfoProto describing the graph's external inputs (their names, types, and shapes).
- output: A list of ValueInfoProto describing the graph's outputs.
Each NodeProto contains:
- op_type: The name of the operator, e.g., "Gemm", "Conv", "Relu".
- domain: Usually "" for standard ONNX ops.
- input: A list of string names: the tensors this node consumes.
- output: A list of string names: the tensors this node produces.
- attribute: A list of AttributeProto objects holding static hyperparameters such as kernel size, axis, or epsilon.
Tensor names are just strings. They act as edges in the dataflow graph. If node A produces an output named "relu_out" and node B lists "relu_out" as an input, then B receives A’s output. This is the complete wiring mechanism.
The ONNX Protobuf Schema
You do not need to write raw protobuf. The onnx Python package provides a rich helper library (onnx.helper, onnx.numpy_helper, onnx.checker) that builds proto objects for you. However, understanding the schema directly will save you many hours of debugging.
TensorProto Data Types
Every tensor in ONNX has an element type, encoded as an integer enum:
| Enum Value | Name | Python/NumPy Equivalent |
|---|---|---|
| 1 | FLOAT | np.float32 |
| 2 | UINT8 | np.uint8 |
| 3 | INT8 | np.int8 |
| 4 | UINT16 | np.uint16 |
| 5 | INT16 | np.int16 |
| 6 | INT32 | np.int32 |
| 7 | INT64 | np.int64 |
| 8 | STRING | bytes |
| 9 | BOOL | np.bool_ |
| 10 | FLOAT16 | np.float16 |
| 11 | DOUBLE | np.float64 |
| 12 | UINT32 | np.uint32 |
| 13 | UINT64 | np.uint64 |
| 14 | COMPLEX64 | np.complex64 |
| 15 | COMPLEX128 | np.complex128 |
| 16 | BFLOAT16 | N/A (custom) |
You reference these via onnx.TensorProto.FLOAT, onnx.TensorProto.INT64, etc.
ValueInfoProto and Shape
Inputs and outputs are described by ValueInfoProto, which pairs a name with a type. The type is a TypeProto, which for tensors includes the element type and a shape. Shapes can have:
- Fixed dimensions: dim_value = 4 means exactly 4 elements on that axis.
- Symbolic dimensions: dim_param = "batch_size" means the dimension is variable but named. ONNX Runtime will accept any runtime value for it.
- Fully unknown dimensions: Neither dim_value nor dim_param is set; the dimension is completely dynamic.
Setting Up Your Environment
You need very few packages:
pip install onnx onnxruntime numpy
For visualization (highly recommended):
pip install netron
Netron is a browser-based ONNX graph visualizer. You open a .onnx file in it and see the full computation graph rendered as a node diagram, with attributes, shapes, and connections all visible.
Verify your installation:
import onnx
import onnxruntime as ort
import numpy as np
print(f"ONNX version: {onnx.__version__}")
print(f"ONNX IR version: {onnx.IR_VERSION}")
print(f"ONNX Runtime version: {ort.__version__}")
Core Building Blocks: ONNX Operators
ONNX defines a large standard operator set. Here are the operators you will use most often when building architectures from scratch.
Linear Algebra
Gemm (General Matrix Multiply): Computes alpha * A @ B + beta * C. This is the workhorse for fully connected layers. Attributes include transA, transB, alpha, beta. The C input (bias) is optional.
MatMul: Computes a simple matrix product A @ B, with numpy-style broadcasting for batched inputs. Has no attributes. Use this when you need raw matmul without the alpha/beta scaling of Gemm.
Add, Sub, Mul, Div: Element-wise arithmetic with broadcasting.
Transpose: Permutes axes. The perm attribute lists the new axis order, e.g., perm=[0, 2, 1] for a batch-first transpose of a 3D tensor.
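The Gemm formula is easy to pin down with a NumPy reference. The function below is a sketch of the semantics described above, not the code any runtime uses; gemm_reference is a made-up name.

```python
import numpy as np

def gemm_reference(A, B, C=None, alpha=1.0, beta=1.0, transA=0, transB=0):
    """NumPy model of Gemm: alpha * A' @ B' + beta * C, with optional transposes."""
    A = A.T if transA else A
    B = B.T if transB else B
    Y = alpha * (A @ B)
    if C is not None:
        Y = Y + beta * C  # C is broadcast to the shape of A' @ B'
    return Y

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8)).astype(np.float32)
W_in_out = rng.standard_normal((8, 3)).astype(np.float32)  # [in, out] layout
b = rng.standard_normal(3).astype(np.float32)

y1 = gemm_reference(x, W_in_out, b)                        # transB=0
y2 = gemm_reference(x, W_in_out.T.copy(), b, transB=1)     # [out, in] layout
print(y1.shape, np.allclose(y1, y2))
```

Both weight layouts give the same result as long as transB matches how the weight matrix is stored.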
Activations
Relu: Element-wise max(0, x). No attributes.
Sigmoid: Element-wise 1 / (1 + exp(-x)). No attributes.
Tanh: Element-wise hyperbolic tangent. No attributes.
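Gelu: Gaussian Error Linear Unit. Added in opset 20, so it is not available in the opset 17 models built in this guide.
Softmax: Softmax along a specified axis. The default axis is -1 from opset 13 onward; in earlier opsets the default was 1 and the input was coerced to 2D.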
LeakyRelu: max(alpha * x, x) with a configurable alpha attribute (default 0.01).
Elu: Exponential Linear Unit. Attribute: alpha.
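For reference, two of these activations written out in NumPy. This is a sketch of the math only; subtracting the maximum in softmax is a common numerical-stability trick and is an assumption here, not a statement about how a particular runtime implements the operator.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x for negative inputs, x otherwise
    return np.where(x < 0, alpha * x, x)

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
probs = softmax(logits, axis=-1)
print(probs.sum(axis=-1))                  # each row sums to 1
print(leaky_relu(np.array([-2.0, 3.0])))   # negative value scaled by alpha
```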
Normalization
BatchNormalization: Normalizes inputs across the batch dimension, then scales and shifts with learnable scale and B (bias) parameters, using running mean and var statistics. Has epsilon and momentum attributes. In inference mode (the default in ONNX), it uses the stored running statistics and has only one output.
LayerNormalization: Normalizes across a specified set of axes (usually the last). Introduced in opset 17. Essential for Transformer architectures.
InstanceNormalization: Normalizes per-channel per-sample. Useful for style transfer networks.
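The inference-time arithmetic of the two most common normalization operators can be sketched in NumPy as follows. The function names are illustrative, and the LayerNormalization variant covers only the usual case of normalizing over the last axis.

```python
import numpy as np

def batchnorm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """Per-channel normalization of an N, C, ... tensor using stored statistics."""
    shape = (1, -1) + (1,) * (x.ndim - 2)  # broadcast [C] over N and spatial dims
    x_hat = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + epsilon)
    return scale.reshape(shape) * x_hat + bias.reshape(shape)

def layernorm_last_axis(x, scale, bias, epsilon=1e-5):
    """Normalize each vector along the last axis, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)    # population variance
    return (x - mu) / np.sqrt(var + epsilon) * scale + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
y_bn = batchnorm_inference(x, np.ones(3), np.zeros(3), np.zeros(3), np.ones(3))

tokens = rng.standard_normal((2, 5, 8))
y_ln = layernorm_last_axis(tokens, np.ones(8), np.zeros(8))
print(y_bn.shape, y_ln.mean(axis=-1).round(6))
```

Note that BatchNormalization reads its statistics from inputs, while LayerNormalization computes them from the data on every call.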
Convolutions
Conv: N-dimensional convolution. Key attributes: kernel_shape, strides, pads, dilations, group (for grouped/depthwise convolutions), auto_pad. Inputs: X (data), W (weights), B (bias, optional).
ConvTranspose: Transposed (fractionally-strided) convolution for upsampling. Same attribute set as Conv plus output_padding.
MaxPool, AveragePool: Pooling with kernel_shape, strides, pads.
GlobalAveragePool, GlobalMaxPool: Reduce each spatial map to a single value.
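When stacking convolutions and pools you need the spatial size after every layer. The helper below is a sketch of the usual output-size formula, assuming explicit pads (auto_pad left at its default) and floor rounding (ceil_mode=0 for pooling); conv_out_size is a made-up name.

```python
def conv_out_size(size, kernel, stride=1, pad_begin=0, pad_end=0, dilation=1):
    """Output length along one spatial axis for Conv, MaxPool, or AveragePool."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + pad_begin + pad_end - effective_kernel) // stride + 1

# Three rounds of (3x3 conv with pad 1) followed by (2x2 pool with stride 2)
# on a 32-pixel axis.
size = 32
for _ in range(3):
    size = conv_out_size(size, kernel=3, stride=1, pad_begin=1, pad_end=1)  # unchanged
    size = conv_out_size(size, kernel=2, stride=2)                          # halved
print(size)  # 4
```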
Recurrence
LSTM: A full Long Short-Term Memory layer that iterates over the whole sequence inside one node. Inputs: X, W, R, B, sequence_lens, initial_h, initial_c, P. Attributes: hidden_size, direction (forward, reverse, bidirectional).
GRU: Gated Recurrent Unit. Similar interface to LSTM.
RNN: Simple Elman RNN.
Shape Manipulation
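Reshape: Changes the shape while keeping the elements and their order. Takes a shape tensor as the second input (not an attribute). Use -1 for one inferred dimension.
Flatten: Flattens the input into a 2D tensor: dimensions before axis are merged into the first output dimension, and dimensions from axis onward into the second.
Squeeze: Removes dimensions of size 1 at specified axes. From opset 13 onward the axes are passed as an input tensor, not an attribute.
Unsqueeze: Inserts dimensions of size 1 at specified axes. From opset 13 onward the axes are passed as an input tensor, not an attribute.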
Concat: Concatenates tensors along a specified axis.
Split: Splits a tensor into multiple outputs along an axis.
Slice: Extracts a sub-tensor using starts, ends, axes, and steps inputs.
Gather: Index-based lookup (embedding table access, index selection).
GatherElements: Gathers elements along a specified axis using an index tensor.
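Scatter, ScatterElements: The write-side counterpart of GatherElements: copy the input, then write update values at the positions given by an index tensor. Scatter is deprecated since opset 11 in favor of ScatterElements.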
Pad: Pads a tensor with a constant, edge, reflect, or wrap strategy.
Tile: Repeats a tensor along each axis a specified number of times.
Expand: Broadcasts a tensor to a target shape.
Reduction
ReduceMean, ReduceSum, ReduceMax, ReduceMin, ReduceProd: Reduce along specified axes, with optional keepdims.
ArgMax, ArgMin: Return the index of the max/min value along an axis.
Logical and Comparison
Equal, Less, Greater, LessOrEqual, GreaterOrEqual: Element-wise comparisons returning bool tensors.
And, Or, Not, Xor: Boolean logic.
Where: Selects elements from two tensors based on a bool condition tensor.
Miscellaneous
Cast: Converts element dtype, e.g., from INT64 to FLOAT.
Constant: Embeds a constant tensor directly in the graph as the output of a node. Useful when you want a small literal value (a scalar, an axes list, a shape) defined in the node list rather than stored as a graph initializer.
Shape: Returns the shape of a tensor as a 1D INT64 tensor.
Size: Returns the total number of elements as a scalar INT64.
Dropout: Applies dropout. In ONNX inference mode, this is a pass-through (no masking).
Einsum: General einsum notation. Available from opset 12.
Constructing Graphs with the ONNX Helper API
The onnx.helper module is your primary interface. Here is an overview of its key functions.
onnx.helper.make_node
Creates a NodeProto.
import onnx
from onnx import helper, TensorProto
node = helper.make_node(
    op_type="Relu",           # operator name
    inputs=["linear_out"],    # names of input tensors
    outputs=["relu_out"],     # names of output tensors
    name="relu_1",            # optional: name for the node itself
)
For operators with attributes:
node = helper.make_node(
    op_type="Conv",
    inputs=["x", "W", "b"],
    outputs=["conv_out"],
    kernel_shape=[3, 3],
    strides=[1, 1],
    pads=[1, 1, 1, 1],
    name="conv_1",
)
Attributes are passed as keyword arguments. ONNX infers their types automatically from the Python values you pass (int → INT, float → FLOAT, list of ints → INTS, etc.).
onnx.helper.make_tensor_value_info
Creates a ValueInfoProto for describing graph inputs and outputs.
# Fixed batch size of 1, 784 features
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 784])
# Dynamic batch size (symbolic), 10 classes
y_info = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["batch", 10])
# Completely dynamic shape (passing None leaves even the rank unspecified)
z_info = helper.make_tensor_value_info("z", TensorProto.FLOAT, None)
onnx.numpy_helper.from_array
Converts a NumPy array to a TensorProto for use as an initializer.
import numpy as np
from onnx import numpy_helper
W = np.random.randn(128, 784).astype(np.float32)
W_tensor = numpy_helper.from_array(W, name="fc1_weight")
onnx.helper.make_graph
Assembles nodes, initializers, inputs, and outputs into a GraphProto.
graph = helper.make_graph(
    nodes=[node1, node2, node3],
    name="my_mlp",
    inputs=[x_info],
    outputs=[y_info],
    initializer=[W_tensor, b_tensor],
)
onnx.helper.make_model
Wraps a graph in a ModelProto.
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17)],  # opset 17 of the default domain
)
model.ir_version = 8
model.producer_name = "my_builder"
onnx.checker.check_model
Validates the model’s structural correctness. Always run this before saving or running.
onnx.checker.check_model(model)
onnx.save
Serializes to a .onnx file.
onnx.save(model, "my_model.onnx")
Building a Linear Regression Model
Let us start with the simplest possible “network”: a linear regression that computes y = X @ W + b.
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
# ------------------------------------------------------------------ #
# 1. Define weights and bias as numpy arrays #
# ------------------------------------------------------------------ #
in_features = 8
out_features = 1
W_data = np.random.randn(in_features, out_features).astype(np.float32)
b_data = np.zeros(out_features, dtype=np.float32)
# ------------------------------------------------------------------ #
# 2. Convert to TensorProto initializers #
# ------------------------------------------------------------------ #
W_init = numpy_helper.from_array(W_data, name="W")
b_init = numpy_helper.from_array(b_data, name="b")
# ------------------------------------------------------------------ #
# 3. Define the graph's external input and output shapes #
# ------------------------------------------------------------------ #
# Input: batch of samples, each with 8 features
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", in_features])
# Output: batch of scalars
y_info = helper.make_tensor_value_info("y", TensorProto.FLOAT, ["batch", out_features])
# ------------------------------------------------------------------ #
# 4. Define the computation node #
# ------------------------------------------------------------------ #
# Gemm computes: alpha * A @ B + beta * C
# We want: x @ W + b, which is: 1.0 * x @ W + 1.0 * b
gemm_node = helper.make_node(
op_type="Gemm",
inputs=["x", "W", "b"],
outputs=["y"],
alpha=1.0,
beta=1.0,
transB=0, # W is already (in_features, out_features), no transpose needed
name="linear",
)
# ------------------------------------------------------------------ #
# 5. Build the graph #
# ------------------------------------------------------------------ #
graph = helper.make_graph(
nodes=[gemm_node],
name="linear_regression",
inputs=[x_info],
outputs=[y_info],
initializer=[W_init, b_init],
)
# ------------------------------------------------------------------ #
# 6. Build the model #
# ------------------------------------------------------------------ #
model = helper.make_model(
graph,
opset_imports=[helper.make_opsetid("", 17)],
)
model.ir_version = 8
model.producer_name = "onnx_guide"
# ------------------------------------------------------------------ #
# 7. Validate and save #
# ------------------------------------------------------------------ #
onnx.checker.check_model(model)
onnx.save(model, "linear_regression.onnx")
print("Model saved.")
- The initializers W and b are listed in initializer and implicitly available as named tensors in the graph. You do not list them as graph inputs because they are constants; they do not vary across inference calls.
- Gemm's transB attribute controls whether the second matrix is transposed before the multiply. With transB=0 and W shaped [in_features, out_features], the computation is x @ W, giving output shape [batch, out_features].
- Symbolic dimensions like "batch" in shape specifications tell ONNX Runtime to accept any value on that axis at runtime.
Building a Multi-Layer Perceptron (MLP)
A multi-layer perceptron stacks fully connected layers with nonlinear activations between them. Here we build a 3-layer MLP for classification: input → hidden1 → hidden2 → logits → softmax.
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
# ------------------------------------------------------------------ #
# Architecture hyperparameters #
# ------------------------------------------------------------------ #
in_dim = 784 # e.g., MNIST flattened
h1_dim = 256
h2_dim = 128
out_dim = 10 # classes
def make_fc_weights(in_d, out_d, name_prefix):
    """Create weight and bias initializers for a fully connected layer."""
    W = np.random.randn(in_d, out_d).astype(np.float32) * np.sqrt(2.0 / in_d)
    b = np.zeros(out_d, dtype=np.float32)
    W_init = numpy_helper.from_array(W, name=f"{name_prefix}_W")
    b_init = numpy_helper.from_array(b, name=f"{name_prefix}_b")
    return W_init, b_init
# ------------------------------------------------------------------ #
# Initializers #
# ------------------------------------------------------------------ #
fc1_W, fc1_b = make_fc_weights(in_dim, h1_dim, "fc1")
fc2_W, fc2_b = make_fc_weights(h1_dim, h2_dim, "fc2")
fc3_W, fc3_b = make_fc_weights(h2_dim, out_dim, "fc3")
all_initializers = [fc1_W, fc1_b, fc2_W, fc2_b, fc3_W, fc3_b]
# ------------------------------------------------------------------ #
# Nodes #
# ------------------------------------------------------------------ #
nodes = []
# Layer 1: Linear → ReLU
nodes.append(helper.make_node(
"Gemm", inputs=["x", "fc1_W", "fc1_b"], outputs=["fc1_out"],
name="fc1", alpha=1.0, beta=1.0,
))
nodes.append(helper.make_node(
"Relu", inputs=["fc1_out"], outputs=["relu1_out"],
name="relu1",
))
# Layer 2: Linear → ReLU
nodes.append(helper.make_node(
"Gemm", inputs=["relu1_out", "fc2_W", "fc2_b"], outputs=["fc2_out"],
name="fc2", alpha=1.0, beta=1.0,
))
nodes.append(helper.make_node(
"Relu", inputs=["fc2_out"], outputs=["relu2_out"],
name="relu2",
))
# Layer 3: Linear (logits)
nodes.append(helper.make_node(
"Gemm", inputs=["relu2_out", "fc3_W", "fc3_b"], outputs=["logits"],
name="fc3", alpha=1.0, beta=1.0,
))
# Softmax over class dimension (axis=-1 is the default)
nodes.append(helper.make_node(
"Softmax", inputs=["logits"], outputs=["probs"],
name="softmax", axis=-1,
))
# ------------------------------------------------------------------ #
# Graph inputs / outputs #
# ------------------------------------------------------------------ #
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", in_dim])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", out_dim])
# ------------------------------------------------------------------ #
# Assemble and save #
# ------------------------------------------------------------------ #
graph = helper.make_graph(
nodes, "mlp",
inputs=[x_info],
outputs=[prob_info],
initializer=all_initializers,
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8
onnx.checker.check_model(model)
onnx.save(model, "mlp.onnx")
print("MLP saved.")
- Weight names in make_node must match exactly the names used in numpy_helper.from_array. A single-character mismatch leaves the node input undefined, which fails validation in onnx.checker.check_model or, if unchecked, fails when the runtime loads the model.
- He initialization (np.sqrt(2.0 / in_d)) is baked into the weight values at construction time. ONNX has no concept of an initialization scheme; weights are just constant tensors.
- Gemm expects W shaped [in_dim, out_dim] when transB=0. Some sources store their weights as [out_dim, in_dim] and use transB=1; both are valid.
Building a Convolutional Neural Network (CNN)
CNNs require managing multi-dimensional weight tensors. In ONNX, the Conv operator expects:
- Input X: shape [batch, in_channels, height, width] (NCHW format).
- Weight W: shape [out_channels, in_channels/group, kernel_h, kernel_w].
- Bias B: shape [out_channels] (optional).
Here we build a small CNN for image classification (CIFAR-style input: 3×32×32 → 10 classes).
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
def conv_weight(out_ch, in_ch, kH, kW, name):
    fan_in = in_ch * kH * kW
    W = np.random.randn(out_ch, in_ch, kH, kW).astype(np.float32) * np.sqrt(2.0 / fan_in)
    return numpy_helper.from_array(W, name=name)

def conv_bias(out_ch, name):
    b = np.zeros(out_ch, dtype=np.float32)
    return numpy_helper.from_array(b, name=name)

def bn_params(channels, name_prefix):
    """BatchNorm scale (gamma), bias (beta), running mean, running var."""
    scale = numpy_helper.from_array(np.ones(channels, dtype=np.float32), name=f"{name_prefix}_scale")
    bias = numpy_helper.from_array(np.zeros(channels, dtype=np.float32), name=f"{name_prefix}_bias")
    mean = numpy_helper.from_array(np.zeros(channels, dtype=np.float32), name=f"{name_prefix}_mean")
    var = numpy_helper.from_array(np.ones(channels, dtype=np.float32), name=f"{name_prefix}_var")
    return scale, bias, mean, var

def fc_params(in_d, out_d, name_prefix):
    W = np.random.randn(in_d, out_d).astype(np.float32) * np.sqrt(2.0 / in_d)
    b = np.zeros(out_d, dtype=np.float32)
    return (numpy_helper.from_array(W, name=f"{name_prefix}_W"),
            numpy_helper.from_array(b, name=f"{name_prefix}_b"))
# ------------------------------------------------------------------ #
# Initializers #
# ------------------------------------------------------------------ #
inits = []
# Conv block 1: 3 → 32 channels, 3×3 kernel
inits += [conv_weight(32, 3, 3, 3, "conv1_W"), conv_bias(32, "conv1_b")]
inits += list(bn_params(32, "bn1"))
# Conv block 2: 32 → 64 channels, 3×3 kernel
inits += [conv_weight(64, 32, 3, 3, "conv2_W"), conv_bias(64, "conv2_b")]
inits += list(bn_params(64, "bn2"))
# Conv block 3: 64 → 128 channels, 3×3 kernel
inits += [conv_weight(128, 64, 3, 3, "conv3_W"), conv_bias(128, "conv3_b")]
inits += list(bn_params(128, "bn3"))
# Fully connected layers
# After 3 max-pools on 32×32 input: 32/2/2/2 = 4 spatial → 128 * 4 * 4 = 2048
inits += list(fc_params(128 * 4 * 4, 256, "fc1"))
inits += list(fc_params(256, 10, "fc2"))
# ------------------------------------------------------------------ #
# Nodes #
# ------------------------------------------------------------------ #
nodes = []
def conv_bn_relu(x_name, conv_w, conv_b, bn_prefix, out_name, kH=3, kW=3, pad=1):
    """Returns a list of nodes: Conv → BatchNorm → Relu."""
    conv_out = f"{out_name}_conv"
    bn_out = f"{out_name}_bn"
    return [
        helper.make_node("Conv",
            inputs=[x_name, conv_w, conv_b],
            outputs=[conv_out],
            kernel_shape=[kH, kW],
            strides=[1, 1],
            pads=[pad, pad, pad, pad],
            name=f"{out_name}_conv_op",
        ),
        helper.make_node("BatchNormalization",
            inputs=[conv_out,
                    f"{bn_prefix}_scale", f"{bn_prefix}_bias",
                    f"{bn_prefix}_mean", f"{bn_prefix}_var"],
            outputs=[bn_out],
            epsilon=1e-5,
            momentum=0.9,
            name=f"{out_name}_bn_op",
        ),
        helper.make_node("Relu",
            inputs=[bn_out],
            outputs=[out_name],
            name=f"{out_name}_relu_op",
        ),
    ]
# Block 1 + MaxPool
nodes += conv_bn_relu("x", "conv1_W", "conv1_b", "bn1", "block1_out")
nodes.append(helper.make_node("MaxPool",
inputs=["block1_out"], outputs=["pool1_out"],
kernel_shape=[2, 2], strides=[2, 2], name="pool1",
))
# Block 2 + MaxPool
nodes += conv_bn_relu("pool1_out", "conv2_W", "conv2_b", "bn2", "block2_out")
nodes.append(helper.make_node("MaxPool",
inputs=["block2_out"], outputs=["pool2_out"],
kernel_shape=[2, 2], strides=[2, 2], name="pool2",
))
# Block 3 + MaxPool
nodes += conv_bn_relu("pool2_out", "conv3_W", "conv3_b", "bn3", "block3_out")
nodes.append(helper.make_node("MaxPool",
inputs=["block3_out"], outputs=["pool3_out"],
kernel_shape=[2, 2], strides=[2, 2], name="pool3",
))
# Flatten: [batch, 128, 4, 4] → [batch, 2048]
nodes.append(helper.make_node("Flatten",
inputs=["pool3_out"], outputs=["flat_out"],
axis=1, name="flatten",
))
# FC1 + ReLU
nodes.append(helper.make_node("Gemm",
inputs=["flat_out", "fc1_W", "fc1_b"], outputs=["fc1_out"],
alpha=1.0, beta=1.0, name="fc1",
))
nodes.append(helper.make_node("Relu",
inputs=["fc1_out"], outputs=["fc1_relu"],
name="fc1_relu",
))
# FC2 (logits) + Softmax
nodes.append(helper.make_node("Gemm",
inputs=["fc1_relu", "fc2_W", "fc2_b"], outputs=["logits"],
alpha=1.0, beta=1.0, name="fc2",
))
nodes.append(helper.make_node("Softmax",
inputs=["logits"], outputs=["probs"],
axis=-1, name="softmax",
))
# ------------------------------------------------------------------ #
# Graph, model, validate, save #
# ------------------------------------------------------------------ #
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", 3, 32, 32])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 10])
graph = helper.make_graph(nodes, "cnn_classifier",
inputs=[x_info], outputs=[prob_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8
onnx.checker.check_model(model)
onnx.save(model, "cnn_classifier.onnx")
print("CNN saved.")
- NCHW layout: ONNX Conv assumes channel-first ordering. If your input data is NHWC, you must Transpose it first.
- pads attribute: For Conv, pads list all the begin values first and then all the end values, one per spatial axis. For 2D convolutions this is [pad_h_begin, pad_w_begin, pad_h_end, pad_w_end], that is, [pad_top, pad_left, pad_bottom, pad_right]. This is not the same ordering as left/right/top/bottom conventions used elsewhere, so check the ONNX spec when porting padding values.
- BatchNormalization in inference mode: With training_mode=0 (the default; the attribute exists from opset 14 onward), the node produces only one output, the normalized tensor. The additional optional outputs carry training statistics and are not needed for inference. If you see a BatchNormalization node with more than one output, it was built for training.
- Flatten axis: axis=1 means flatten from dimension 1 onward, preserving the batch dimension. The result is [batch, 128*4*4].
Building a Recurrent Neural Network (RNN/LSTM)
ONNX's LSTM operator encodes a full LSTM layer, including the loop over time steps, in a single node rather than as a cell that you unroll step by step. This makes it compact, but the weight layout requires care.
The ONNX LSTM gate order is IOFC (Input, Output, Forget, Cell), while PyTorch uses IFCO (Input, Forget, Cell, Output). This affects how you lay out the weight tensor if you ever interoperate.
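If you do need to bring in weights stored in the other gate order, the conversion is a block permutation along the first axis. The following sketch assumes the source matrix stacks four blocks of hidden_size rows in input, forget, cell, output order; reorder_ifco_to_iofc is a made-up helper name, and the toy matrix only labels each block so the reordering is visible.

```python
import numpy as np

def reorder_ifco_to_iofc(w, hidden_size):
    """Reorder four stacked gate blocks from (i, f, c, o) to ONNX's (i, o, f, c)."""
    i, f, c, o = np.split(w, 4, axis=0)
    assert i.shape[0] == hidden_size
    return np.concatenate([i, o, f, c], axis=0)

hidden_size, input_size = 3, 2

# Toy weight matrix: every row of a gate block holds that gate's index
# (0 = input, 1 = forget, 2 = cell, 3 = output).
labels = np.repeat(np.arange(4, dtype=np.float32), hidden_size)
w_ifco = np.tile(labels[:, None], (1, input_size))   # [4 * hidden, input]

w_iofc = reorder_ifco_to_iofc(w_ifco, hidden_size)
W_onnx = w_iofc[np.newaxis, ...]                     # add the directions axis
print(w_iofc[:, 0])   # block labels in the new order
print(W_onnx.shape)   # (1, 12, 2)
```

The same permutation applies to the recurrence matrix R and to each half of the bias tensor B.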
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
seq_len = 20 # sequence length
batch_size = 4 # batch size
input_size = 16 # features per timestep
hidden_size = 32 # LSTM hidden dim
num_layers = 1
directions = 1 # 1 for forward, 2 for bidirectional
# ------------------------------------------------------------------ #
# LSTM Weight Layout (ONNX standard): #
# W shape: [directions, 4 * hidden_size, input_size] #
# R shape: [directions, 4 * hidden_size, hidden_size] #
# B shape: [directions, 8 * hidden_size] (W_bias concat R_bias) #
# ------------------------------------------------------------------ #
W_data = np.random.randn(directions, 4 * hidden_size, input_size).astype(np.float32)
R_data = np.random.randn(directions, 4 * hidden_size, hidden_size).astype(np.float32)
B_data = np.zeros((directions, 8 * hidden_size), dtype=np.float32)
W_init = numpy_helper.from_array(W_data, name="lstm_W")
R_init = numpy_helper.from_array(R_data, name="lstm_R")
B_init = numpy_helper.from_array(B_data, name="lstm_B")
# ------------------------------------------------------------------ #
# LSTM node #
# ------------------------------------------------------------------ #
# Inputs: X, W, R, B, sequence_lens (optional), initial_h, initial_c, P (peepholes)
# Outputs: Y (all hidden states), Y_h (final hidden), Y_c (final cell)
lstm_node = helper.make_node(
"LSTM",
inputs=["x", "lstm_W", "lstm_R", "lstm_B"], # omit optional inputs
outputs=["Y", "Y_h", "Y_c"],
hidden_size=hidden_size,
direction="forward",
name="lstm_layer",
)
# Y shape: [seq_len, directions, batch, hidden_size]
# Y_h shape: [directions, batch, hidden_size]
# Y_c shape: [directions, batch, hidden_size]
# We want the final hidden state: Y_h, shape [1, batch, hidden_size]
# Squeeze the directions dimension:
squeeze_axes = numpy_helper.from_array(np.array([0], dtype=np.int64), name="squeeze_axes")
squeeze_node = helper.make_node(
"Squeeze",
inputs=["Y_h", "squeeze_axes"],
outputs=["h_final"], # shape: [batch, hidden_size]
name="squeeze_h",
)
# Final classifier
fc_W_data = np.random.randn(hidden_size, 5).astype(np.float32) # 5 output classes
fc_b_data = np.zeros(5, dtype=np.float32)
fc_W_init = numpy_helper.from_array(fc_W_data, name="fc_W")
fc_b_init = numpy_helper.from_array(fc_b_data, name="fc_b")
fc_node = helper.make_node(
"Gemm",
inputs=["h_final", "fc_W", "fc_b"],
outputs=["logits"],
alpha=1.0, beta=1.0,
name="fc_out",
)
softmax_node = helper.make_node(
"Softmax",
inputs=["logits"],
outputs=["probs"],
axis=-1,
name="softmax",
)
# ------------------------------------------------------------------ #
# Graph assembly #
# ------------------------------------------------------------------ #
# X: [seq_len, batch, input_size] — ONNX LSTM uses time-first layout
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT,
[seq_len, "batch", input_size])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 5])
graph = helper.make_graph(
[lstm_node, squeeze_node, fc_node, softmax_node],
"lstm_classifier",
inputs=[x_info],
outputs=[prob_info],
initializer=[W_init, R_init, B_init, squeeze_axes, fc_W_init, fc_b_init],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8
onnx.checker.check_model(model)
onnx.save(model, "lstm_classifier.onnx")
print("LSTM model saved.")
Crucially, ONNX LSTM takes input in [seq_len, batch, input_size] order (time-first). If your data is batch-first [batch, seq_len, input_size], add a Transpose node before the LSTM:
transpose_node = helper.make_node(
    "Transpose",
    inputs=["x_batch_first"],
    outputs=["x"],
    perm=[1, 0, 2],  # swap seq and batch dimensions
    name="batch_to_seq_first",
)
Building a Transformer Block
A Transformer block is the most involved architecture to assemble in raw ONNX, but it is an outstanding exercise in understanding attention. We build a single encoder block: multi-head self-attention followed by a feed-forward network, both with residual connections and layer normalization.
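Before assembling the graph, it helps to have the attention arithmetic in a form you can step through. The NumPy function below is a reference sketch of multi-head self-attention with the same reshape and transpose sequence; it omits the projection biases for brevity, and the function and variable names are chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    batch, seq, d_model = x.shape
    d_k = d_model // n_heads

    def split_heads(t):
        # [batch, seq, d_model] -> [batch, seq, n_heads, d_k] -> [batch, n_heads, seq, d_k]
        return t.reshape(batch, seq, n_heads, d_k).transpose(0, 2, 1, 3)

    Q, K, V = (split_heads(x @ W) for W in (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # [batch, n_heads, seq, seq]
    weights = softmax(scores, axis=-1)
    context = weights @ V                                  # [batch, n_heads, seq, d_k]
    merged = context.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)
    return merged @ W_O, weights

rng = np.random.default_rng(42)
d_model, n_heads, seq_len = 64, 4, 10
x = rng.standard_normal((2, seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))

out, weights = self_attention(x, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape, weights.shape)  # (2, 10, 64) (2, 4, 10, 10)
```

Each line of the function corresponds to one or two ONNX nodes (MatMul, Reshape, Transpose, Mul, Softmax), which makes it a convenient oracle for checking the graph's output.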
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
d_model = 64 # embedding dimension
n_heads = 4 # attention heads
d_k = d_model // n_heads # key/query dimension per head = 16
d_ff = 256 # feed-forward inner dimension
seq_len = 10 # sequence length (fixed for this example)
eps = 1e-6
rng = np.random.default_rng(42)
def rand_f32(shape, name):
    data = rng.standard_normal(shape).astype(np.float32) * 0.02
    return numpy_helper.from_array(data, name=name)

def zeros_f32(shape, name):
    return numpy_helper.from_array(np.zeros(shape, dtype=np.float32), name=name)

def ones_f32(shape, name):
    return numpy_helper.from_array(np.ones(shape, dtype=np.float32), name=name)
inits = []
nodes = []
# ================================================================== #
# Projection weights for Q, K, V, and output #
# [d_model, d_model] — we will split heads in-graph #
# ================================================================== #
inits += [rand_f32((d_model, d_model), "W_Q"),
rand_f32((d_model, d_model), "W_K"),
rand_f32((d_model, d_model), "W_V"),
rand_f32((d_model, d_model), "W_O"),
zeros_f32((d_model,), "b_Q"),
zeros_f32((d_model,), "b_K"),
zeros_f32((d_model,), "b_V"),
zeros_f32((d_model,), "b_O")]
# Feed-forward weights
inits += [rand_f32((d_model, d_ff), "W_ff1"), zeros_f32((d_ff,), "b_ff1"),
rand_f32((d_ff, d_model), "W_ff2"), zeros_f32((d_model,), "b_ff2")]
# LayerNorm parameters (two sets: after attention, after FFN)
inits += [ones_f32((d_model,), "ln1_scale"), zeros_f32((d_model,), "ln1_bias"),
ones_f32((d_model,), "ln2_scale"), zeros_f32((d_model,), "ln2_bias")]
# Scale factor for attention: 1 / sqrt(d_k)
scale_val = np.array(1.0 / np.sqrt(d_k), dtype=np.float32)
inits.append(numpy_helper.from_array(scale_val, name="attn_scale"))
# Shape constants for Reshape operations
reshape_to_heads = np.array([-1, seq_len, n_heads, d_k], dtype=np.int64)
inits.append(numpy_helper.from_array(reshape_to_heads, name="shape_heads"))
restore_shape = np.array([-1, seq_len, d_model], dtype=np.int64)
inits.append(numpy_helper.from_array(restore_shape, name="shape_restore"))
# ================================================================== #
# MULTI-HEAD SELF-ATTENTION #
# ================================================================== #
# --- Compute Q, K, V projections ---
# MatMul: [batch, seq, d_model] @ [d_model, d_model] → [batch, seq, d_model]
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("MatMul",
        inputs=["x", f"W_{letter}"],
        outputs=[f"{letter}_proj"],
        name=f"matmul_{letter}",
    ))
    nodes.append(helper.make_node("Add",
        inputs=[f"{letter}_proj", f"b_{letter}"],
        outputs=[f"{letter}"],
        name=f"add_bias_{letter}",
    ))
# --- Reshape to [batch, seq, n_heads, d_k] ---
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("Reshape",
        inputs=[letter, "shape_heads"],
        outputs=[f"{letter}_h"],
        name=f"reshape_{letter}",
    ))
# --- Transpose to [batch, n_heads, seq, d_k] ---
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("Transpose",
        inputs=[f"{letter}_h"],
        outputs=[f"{letter}_t"],
        perm=[0, 2, 1, 3],
        name=f"transpose_{letter}",
    ))
# --- Attention scores: Q @ K^T ---
nodes.append(helper.make_node("Transpose",
inputs=["K_t"],
outputs=["K_t_T"],
perm=[0, 1, 3, 2],
name="transpose_K_for_attn",
))
nodes.append(helper.make_node("MatMul",
inputs=["Q_t", "K_t_T"],
outputs=["raw_scores"],
name="attn_scores",
))
nodes.append(helper.make_node("Mul",
inputs=["raw_scores", "attn_scale"],
outputs=["scaled_scores"],
name="scale_scores",
))
nodes.append(helper.make_node("Softmax",
inputs=["scaled_scores"],
outputs=["attn_weights"],
axis=-1,
name="attn_softmax",
))
nodes.append(helper.make_node("MatMul",
inputs=["attn_weights", "V_t"],
outputs=["context_t"],
name="attn_context",
))
nodes.append(helper.make_node("Transpose",
inputs=["context_t"],
outputs=["context_h"],
perm=[0, 2, 1, 3],
name="transpose_context",
))
nodes.append(helper.make_node("Reshape",
inputs=["context_h", "shape_restore"],
outputs=["context"],
name="reshape_context",
))
nodes.append(helper.make_node("MatMul",
inputs=["context", "W_O"],
outputs=["attn_out_proj"],
name="output_proj",
))
nodes.append(helper.make_node("Add",
inputs=["attn_out_proj", "b_O"],
outputs=["attn_out"],
name="add_output_bias",
))
# Residual + LayerNorm
nodes.append(helper.make_node("Add",
inputs=["x", "attn_out"],
outputs=["residual1"],
name="residual1",
))
nodes.append(helper.make_node("LayerNormalization",
inputs=["residual1", "ln1_scale", "ln1_bias"],
outputs=["ln1_out"],
axis=-1,
epsilon=eps,
name="layernorm1",
))
# ================================================================== #
# FEED-FORWARD NETWORK #
# ================================================================== #
nodes.append(helper.make_node("MatMul",
inputs=["ln1_out", "W_ff1"],
outputs=["ff1_proj"],
name="ff1_proj",
))
nodes.append(helper.make_node("Add",
inputs=["ff1_proj", "b_ff1"],
outputs=["ff1"],
name="ff1_bias",
))
nodes.append(helper.make_node("Relu",
inputs=["ff1"],
outputs=["ff1_relu"],
name="ff1_relu",
))
nodes.append(helper.make_node("MatMul",
inputs=["ff1_relu", "W_ff2"],
outputs=["ff2_proj"],
name="ff2_proj",
))
nodes.append(helper.make_node("Add",
inputs=["ff2_proj", "b_ff2"],
outputs=["ff2"],
name="ff2_bias",
))
# Residual + LayerNorm
nodes.append(helper.make_node("Add",
inputs=["ln1_out", "ff2"],
outputs=["residual2"],
name="residual2",
))
nodes.append(helper.make_node("LayerNormalization",
inputs=["residual2", "ln2_scale", "ln2_bias"],
outputs=["output"],
axis=-1,
epsilon=eps,
name="layernorm2",
))
# ================================================================== #
# Graph assembly #
# ================================================================== #
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", seq_len, d_model])
out_info = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["batch", seq_len, d_model])
graph = helper.make_graph(nodes, "transformer_encoder_block",
inputs=[x_info], outputs=[out_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8
onnx.checker.check_model(model)
onnx.save(model, "transformer_block.onnx")
print("Transformer block saved.")
- 3D MatMul: When one operand is 2D [d_model, d_model] and the other is 3D [batch, seq, d_model], ONNX’s MatMul broadcasts over the batch dimension automatically.
- Reshape + Transpose for multi-head attention: The head-splitting is entirely explicit. You reshape the projected Q/K/V to expose the head dimension, then transpose to make it the second axis for batched matrix multiplication.
- LayerNormalization: Available from opset 17. It takes scale and bias as separate inputs (not attributes), and normalizes along all axes from axis to the last.
- Broadcasting of the scale scalar: The attn_scale constant is a scalar np.float32 value. ONNX’s Mul operator broadcasts it across the entire [batch, heads, seq, seq] scores tensor without any reshape.
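As a cross-check on the head-splitting and broadcasting notes above, the same computation can be sketched in plain NumPy. This is an illustrative reference only, not part of the ONNX graph: the function name attention_reference and the small dimensions are assumptions made for the example, and the biases are omitted because the graph initializes them to zero.

```python
import numpy as np

def attention_reference(x, W_Q, W_K, W_V, n_heads):
    """NumPy mirror of the graph's projection, head split, and attention steps."""
    batch, seq_len, d_model = x.shape
    d_k = d_model // n_heads

    def split_heads(t):
        # Reshape to [batch, seq, n_heads, d_k], then Transpose with perm=[0, 2, 1, 3].
        return t.reshape(batch, seq_len, n_heads, d_k).transpose(0, 2, 1, 3)

    # 3D @ 2D: the 2D weight is broadcast over the batch dimension.
    Q, K, V = (split_heads(x @ W) for W in (W_Q, W_K, W_V))

    # The scalar scale broadcasts over the [batch, heads, seq, seq] scores.
    scores = (Q @ K.transpose(0, 1, 3, 2)) * np.float32(1.0 / np.sqrt(d_k))

    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    context = weights @ V  # [batch, heads, seq, d_k]
    # Undo the head split: Transpose back, then Reshape to [batch, seq, d_model].
    context = context.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return weights, context

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8)).astype(np.float32)
W_Q, W_K, W_V = (rng.standard_normal((8, 8)).astype(np.float32) for _ in range(3))
weights, context = attention_reference(x, W_Q, W_K, W_V, n_heads=2)
print(weights.shape, context.shape)  # (2, 2, 5, 5) (2, 5, 8)
```

Running the ONNX model and this function on the same input and weights is one way to confirm that the Reshape and Transpose permutations in the graph are wired as intended.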
Building a Residual (ResNet-style) Block
Residual connections are essential for deep networks. In ONNX, they are simply Add nodes where one input comes from earlier in the graph.
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
def make_conv_bn_relu(x_name, out_name, in_ch, out_ch, stride, inits, nodes, kH=3, kW=3):
    """Adds Conv → BN → ReLU nodes and their initializers in-place."""
    fan_in = in_ch * kH * kW
    # Cast after scaling: np.sqrt returns a float64 scalar, which would otherwise
    # promote the weights to float64 under NumPy 2 promotion rules.
    W_data = (np.random.randn(out_ch, in_ch, kH, kW) * np.sqrt(2.0 / fan_in)).astype(np.float32)
    W_init = numpy_helper.from_array(W_data, name=f"{out_name}_cW")
    b_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_cb")
    sc_init = numpy_helper.from_array(np.ones(out_ch, dtype=np.float32), name=f"{out_name}_bns")
    bi_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bnb")
    mn_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bnm")
    vr_init = numpy_helper.from_array(np.ones(out_ch, dtype=np.float32), name=f"{out_name}_bnv")
    inits += [W_init, b_init, sc_init, bi_init, mn_init, vr_init]
    conv_out = f"{out_name}_c"
    bn_out = f"{out_name}_bn"
    nodes.append(helper.make_node("Conv",
        inputs=[x_name, f"{out_name}_cW", f"{out_name}_cb"],
        outputs=[conv_out],
        kernel_shape=[kH, kW],
        strides=[stride, stride],
        pads=[kH//2, kW//2, kH//2, kW//2],
        name=f"{out_name}_conv",
    ))
    nodes.append(helper.make_node("BatchNormalization",
        inputs=[conv_out, f"{out_name}_bns", f"{out_name}_bnb",
                f"{out_name}_bnm", f"{out_name}_bnv"],
        outputs=[bn_out],
        epsilon=1e-5, momentum=0.1,
        name=f"{out_name}_bn",
    ))
    nodes.append(helper.make_node("Relu",
        inputs=[bn_out],
        outputs=[out_name],
        name=f"{out_name}_relu",
    ))
def make_residual_block(x_name, out_name, in_ch, out_ch, stride, inits, nodes):
    """
    A basic ResNet residual block.
    If in_ch != out_ch or stride != 1, a 1x1 projection shortcut is added.
    """
    mid_name = f"{out_name}_mid"
    make_conv_bn_relu(x_name, mid_name, in_ch, out_ch, stride, inits, nodes)
    fan_in = out_ch * 3 * 3
    # Cast after scaling so the weights stay float32 (np.sqrt returns float64).
    W2_data = (np.random.randn(out_ch, out_ch, 3, 3) * np.sqrt(2.0 / fan_in)).astype(np.float32)
    W2_init = numpy_helper.from_array(W2_data, name=f"{out_name}_c2W")
    b2_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_c2b")
    sc2 = numpy_helper.from_array(np.ones(out_ch, dtype=np.float32), name=f"{out_name}_bn2s")
    bi2 = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bn2b")
    mn2 = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bn2m")
    vr2 = numpy_helper.from_array(np.ones(out_ch, dtype=np.float32), name=f"{out_name}_bn2v")
    inits += [W2_init, b2_init, sc2, bi2, mn2, vr2]
    conv2_out = f"{out_name}_c2"
    bn2_out = f"{out_name}_bn2"
    nodes.append(helper.make_node("Conv",
        inputs=[mid_name, f"{out_name}_c2W", f"{out_name}_c2b"],
        outputs=[conv2_out],
        kernel_shape=[3, 3], strides=[1, 1], pads=[1, 1, 1, 1],
        name=f"{out_name}_conv2",
    ))
    nodes.append(helper.make_node("BatchNormalization",
        inputs=[conv2_out, f"{out_name}_bn2s", f"{out_name}_bn2b",
                f"{out_name}_bn2m", f"{out_name}_bn2v"],
        outputs=[bn2_out],
        epsilon=1e-5, momentum=0.1,
        name=f"{out_name}_bn2",
    ))
    if in_ch != out_ch or stride != 1:
        # Projection shortcut: a 1x1 Conv matches channel count and spatial size.
        Ws_data = (np.random.randn(out_ch, in_ch, 1, 1) * np.sqrt(2.0 / in_ch)).astype(np.float32)
        Ws_init = numpy_helper.from_array(Ws_data, name=f"{out_name}_sW")
        bs_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_sb")
        inits += [Ws_init, bs_init]
        shortcut_name = f"{out_name}_shortcut"
        nodes.append(helper.make_node("Conv",
            inputs=[x_name, f"{out_name}_sW", f"{out_name}_sb"],
            outputs=[shortcut_name],
            kernel_shape=[1, 1], strides=[stride, stride], pads=[0, 0, 0, 0],
            name=f"{out_name}_shortcut_conv",
        ))
    else:
        # Identity shortcut: reuse the block input name directly.
        shortcut_name = x_name
    nodes.append(helper.make_node("Add",
        inputs=[bn2_out, shortcut_name],
        outputs=[f"{out_name}_sum"],
        name=f"{out_name}_add",
    ))
    nodes.append(helper.make_node("Relu",
        inputs=[f"{out_name}_sum"],
        outputs=[out_name],
        name=f"{out_name}_relu_final",
    ))
# ------------------------------------------------------------------ #
# Build a tiny ResNet #
# ------------------------------------------------------------------ #
inits = []
nodes = []
make_conv_bn_relu("x", "stem_out", in_ch=3, out_ch=64, stride=2, inits=inits, nodes=nodes, kH=7, kW=7)
nodes.append(helper.make_node("MaxPool",
inputs=["stem_out"], outputs=["pool_out"],
kernel_shape=[3, 3], strides=[2, 2], pads=[1, 1, 1, 1],
name="stem_pool",
))
make_residual_block("pool_out", "layer1a", in_ch=64, out_ch=64, stride=1, inits=inits, nodes=nodes)
make_residual_block("layer1a", "layer1b", in_ch=64, out_ch=64, stride=1, inits=inits, nodes=nodes)
make_residual_block("layer1b", "layer2a", in_ch=64, out_ch=128, stride=2, inits=inits, nodes=nodes)
make_residual_block("layer2a", "layer2b", in_ch=128, out_ch=128, stride=1, inits=inits, nodes=nodes)
nodes.append(helper.make_node("GlobalAveragePool",
inputs=["layer2b"], outputs=["gap_out"],
name="global_avg_pool",
))
nodes.append(helper.make_node("Flatten",
inputs=["gap_out"], outputs=["flat_out"],
axis=1, name="flatten",
))
fc_W = numpy_helper.from_array(
np.random.randn(128, 10).astype(np.float32) * 0.01, name="fc_W")
fc_b = numpy_helper.from_array(np.zeros(10, dtype=np.float32), name="fc_b")
inits += [fc_W, fc_b]
nodes.append(helper.make_node("Gemm",
inputs=["flat_out", "fc_W", "fc_b"],
outputs=["logits"], alpha=1.0, beta=1.0, name="classifier",
))
nodes.append(helper.make_node("Softmax",
inputs=["logits"], outputs=["probs"], axis=-1, name="softmax",
))
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", 3, 224, 224])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 10])
graph = helper.make_graph(nodes, "tiny_resnet",
inputs=[x_info], outputs=[prob_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8
onnx.checker.check_model(model)
onnx.save(model, "tiny_resnet.onnx")
print("ResNet-style model saved.")
The residual block pattern is elegant in ONNX because the “skip connection” is just a string: you pass the same input name x_name to both the main path and the shortcut Add node. The graph structure itself encodes the skip without any special syntax.
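The Add node in a residual block only works if both of its inputs have the same shape, so it helps to trace spatial sizes by hand before building the graph. The sketch below uses the standard output-size formula for Conv and MaxPool (dilation 1, floor rounding) on the tiny ResNet above. The helper name conv_out_size is hypothetical and only covers one spatial axis.

```python
def conv_out_size(size, kernel, stride, pad_begin, pad_end):
    """Output length along one spatial axis for Conv/MaxPool (dilation 1, floor rounding)."""
    return (size + pad_begin + pad_end - kernel) // stride + 1

# Trace the height (the width behaves identically) through the tiny ResNet.
stem = conv_out_size(224, kernel=7, stride=2, pad_begin=3, pad_end=3)   # 7x7 stem conv
pool = conv_out_size(stem, kernel=3, stride=2, pad_begin=1, pad_end=1)  # 3x3 max-pool

# layer2a downsamples: the main path and the projection shortcut must agree.
main = conv_out_size(pool, kernel=3, stride=2, pad_begin=1, pad_end=1)      # 3x3 conv, stride 2
shortcut = conv_out_size(pool, kernel=1, stride=2, pad_begin=0, pad_end=0)  # 1x1 conv, stride 2

print(stem, pool, main, shortcut)  # 112 56 28 28
```

Because main and shortcut both come out at 28, the Add in layer2a receives two [batch, 128, 28, 28] tensors and no broadcasting is involved.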
Initializers, Constants, and Weight Management
There are two ways to embed constant data in an ONNX graph.
Initializers are TensorProto objects stored in graph.initializer. They represent parameters (weights, biases) or other constant tensors. They are the preferred way to store large parameter tensors because they are memory-efficient and can be memory-mapped at load time.
W = numpy_helper.from_array(np.eye(64, dtype=np.float32), name="identity_W")
# Add to graph initializer list
Constant nodes embed a tensor directly inside a NodeProto. Use these for small scalars or integer constants computed mid-graph (like reshape targets):
const_node = helper.make_node(
"Constant",
inputs=[],
outputs=["const_value"],
value=helper.make_tensor(
name="",
data_type=TensorProto.FLOAT,
dims=[], # scalar
vals=[0.5],
),
)
For integer shape tensors (common when using Reshape), you can also store them as initializers:
shape_const = numpy_helper.from_array(
np.array([-1, 128], dtype=np.int64), name="reshape_target"
)
Weight Initialization Strategies
Since ONNX weights are just NumPy arrays, you apply initialization schemes yourself:
# He (Kaiming) initialization for ReLU networks.
# Assumes an [out, in, ...] layout such as Conv weights [C_out, C_in, kH, kW];
# for a dense weight stored as [in, out], fan_in is shape[0] instead.
def he_init(shape):
    fan_in = np.prod(shape[1:]) if len(shape) > 1 else shape[0]
    # Cast last so the float64 scale factor does not promote the result
    return (np.random.randn(*shape) * np.sqrt(2.0 / fan_in)).astype(np.float32)
# Glorot (Xavier) initialization for tanh/sigmoid (uses only the first two dims)
def glorot_init(shape):
    fan_in = shape[0]
    fan_out = shape[1] if len(shape) > 1 else shape[0]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, shape).astype(np.float32)
# Orthogonal initialization (good for RNNs)
def orthogonal_init(shape):
    flat = np.random.randn(shape[0], int(np.prod(shape[1:]))).astype(np.float32)
    U, _, Vt = np.linalg.svd(flat, full_matrices=False)
    return (U if U.shape == flat.shape else Vt).reshape(shape)
Shape Inference and Validation
ONNX provides automatic shape inference — it propagates shapes through the graph so you can verify that all intermediate tensor shapes are correct before running.
import onnx
from onnx import shape_inference
model = onnx.load("my_model.onnx")
inferred_model = shape_inference.infer_shapes(model)
# Now inspect inferred shapes
for vi in inferred_model.graph.value_info:
    t = vi.type.tensor_type
    shape = [d.dim_value if d.HasField("dim_value") else d.dim_param
             for d in t.shape.dim]
    print(f" {vi.name}: {t.elem_type} {shape}")
Always run both onnx.checker.check_model (structural validity) and shape_inference.infer_shapes (shape consistency) after building a model. The checker catches malformed protos; shape inference surfaces shape mismatches before you waste time debugging at runtime. By default infer_shapes does not raise on a node it cannot resolve, so pass strict_mode=True if you want mismatches reported as errors.
Checking Shapes Programmatically
def get_shape(model, tensor_name):
    """Return the inferred shape of any named tensor in the model."""
    inferred = shape_inference.infer_shapes(model)
    all_vi = (list(inferred.graph.input)
              + list(inferred.graph.value_info)
              + list(inferred.graph.output))
    for vi in all_vi:
        if vi.name == tensor_name:
            t = vi.type.tensor_type
            # HasField keeps a genuine dim_value of 0 from being mistaken for "unset"
            return [d.dim_value if d.HasField("dim_value") else d.dim_param
                    for d in t.shape.dim]
    return None
shape = get_shape(model, "relu1_out")
print(f"relu1_out shape: {shape}")
Running Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np
# Load the session
sess = ort.InferenceSession("mlp.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# Inspect inputs and outputs
for inp in sess.get_inputs():
    print(f"Input: {inp.name} | shape: {inp.shape} | type: {inp.type}")
for out in sess.get_outputs():
    print(f"Output: {out.name} | shape: {out.shape} | type: {out.type}")
# Run inference
x = np.random.randn(8, 784).astype(np.float32)
outputs = sess.run(
output_names=["probs"], # None means "return all outputs"
input_feed={"x": x},
)
probs = outputs[0]
print(f"Output shape: {probs.shape}")
print(f"Predictions: {probs.argmax(axis=1)}")
Choosing an Execution Provider
ONNX Runtime supports multiple backends. Pass them in priority order:
sess = ort.InferenceSession("model.onnx", providers=[
("TensorrtExecutionProvider", {"device_id": 0}),
("CUDAExecutionProvider", {"device_id": 0}),
"CPUExecutionProvider",
])
print(sess.get_providers()) # shows which providers were actually activated
Session Options
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4
opts.enable_profiling = False
sess = ort.InferenceSession("model.onnx", sess_options=opts,
    providers=["CPUExecutionProvider"])
Inspecting and Debugging ONNX Graphs
Printing Graph Structure
import onnx
model = onnx.load("model.onnx")
graph = model.graph
print(f"Opset: {model.opset_import[0].version}")
print("\nInputs:")
for inp in graph.input:
    print(f" {inp.name}")
print("\nOutputs:")
for out in graph.output:
    print(f" {out.name}")
print(f"\nInitializers: {len(graph.initializer)} tensors")
for init in graph.initializer:
    shape = list(init.dims)
    dtype = init.data_type
    print(f" {init.name:30s} shape={shape}, dtype={dtype}")
print(f"\nNodes ({len(graph.node)} total):")
for node in graph.node:
    # Decode each attribute into a plain Python value (int, float, list, tensor, ...)
    attrs = {a.name: onnx.helper.get_attribute_value(a) for a in node.attribute}
    print(f" [{node.op_type:20s}] {list(node.input)} → {list(node.output)} attrs={attrs}")
Extracting Intermediate Outputs
You can expose intermediate tensors as additional graph outputs for debugging:
import onnx
from onnx import shape_inference
model = onnx.load("mlp.onnx")
inferred = shape_inference.infer_shapes(model)
# Identify the value_info for intermediate tensor "relu1_out"
vi_to_expose = None
for vi in inferred.graph.value_info:
    if vi.name == "relu1_out":
        vi_to_expose = vi
        break
if vi_to_expose is None:
    raise ValueError("relu1_out not found; check the tensor name and that shape inference reached it")
# Add it as a graph output
debug_model = onnx.ModelProto()
debug_model.CopyFrom(inferred)
debug_model.graph.output.append(vi_to_expose)
onnx.save(debug_model, "mlp_debug.onnx")
Using Netron
import subprocess
subprocess.Popen(["netron", "model.onnx"])
# or just open the file directly in the Netron app
Netron renders the full computation graph in a browser. Each node shows its op type, attributes, and input/output tensor names with their inferred shapes (if you ran shape inference). It is the single most useful tool for understanding and debugging ONNX models.
Advanced Techniques: Control Flow, Subgraphs, and Custom Ops
Control Flow: If, Loop, Scan
ONNX supports limited control flow via three special operators. These operators contain subgraphs (nested GraphProto objects) inside their attributes.
If: Conditional execution. Takes a boolean scalar condition and contains two subgraph attributes: then_branch and else_branch.
# Pseudocode — then_branch and else_branch are full GraphProto objects
if_node = helper.make_node(
"If",
inputs=["condition"],
outputs=["result"],
then_branch=then_graph,
else_branch=else_graph,
)
Loop: A counted or condition-based loop. Takes a trip count, initial condition, and initial state tensors, and runs a body subgraph repeatedly.
Scan: Applies a body subgraph across the time axis of sequence inputs, accumulating state. Useful for custom RNNs.
These operators are powerful but complex. Their subgraphs must be complete valid GraphProto objects with their own inputs and outputs. Building them requires careful management of variable names and scoping.
Custom Operators
If you need an operation not in the ONNX standard set, you can define a custom operator with a non-standard domain:
custom_node = helper.make_node(
op_type="MySpecialOp",
domain="com.mycompany",
inputs=["x"],
outputs=["y"],
my_custom_attr=42,
name="custom_op_1",
)
# Register the custom domain in the opset imports
model = helper.make_model(
graph,
opset_imports=[
helper.make_opsetid("", 17),
helper.make_opsetid("com.mycompany", 1),
],
)
To run custom ops with ONNX Runtime, you must supply an implementation. ONNX Runtime itself loads custom op kernels from a compiled shared library that you register on the session options; Python implementations are provided by the separate onnxruntime-extensions package rather than by onnxruntime itself.
import onnxruntime as ort
opts = ort.SessionOptions()
# Shared library built against the ONNX Runtime custom op API that implements
# MySpecialOp in the "com.mycompany" domain. The file name is a placeholder.
opts.register_custom_ops_library("libmy_custom_ops.so")
sess = ort.InferenceSession(
    "model_with_custom_op.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
Function-Based Operators
ONNX also allows you to define FunctionProto objects — named, reusable operator definitions composed of existing ONNX ops. These let you package composite operations (like a Transformer block) as a single named op that expands to primitives during execution:
from onnx import helper, TensorProto
func = helper.make_function(
domain="com.myarch",
fname="LayerNormFunc",
inputs=["X", "scale", "bias"],
outputs=["Y"],
nodes=[...], # the expanded graph nodes
opset_imports=[helper.make_opsetid("", 17)],
)
model.functions.append(func)
Optimization and Graph Transformations
Raw hand-built ONNX graphs are often not as efficient as they could be. Several tools exist to optimize them.
ONNX Runtime Graph Optimizations
The simplest approach is to let ONNX Runtime’s optimizer do the work at load time:
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally, save the optimized model for inspection
opts.optimized_model_filepath = "optimized_model.onnx"
sess = ort.InferenceSession("model.onnx", sess_options=opts,
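    providers=["CPUExecutionProvider"])
ONNX Runtime performs operator fusions (for example, folding a BatchNormalization into the preceding Conv and fusing a Conv with the activation that follows it), dead code elimination, constant folding, and more.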
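The arithmetic behind folding an inference-mode BatchNormalization into the preceding Conv is short enough to check by hand. The NumPy sketch below is an illustration of that arithmetic, not ONNX Runtime's actual optimizer code. The function names are hypothetical, and a 1x1 convolution is used so the check needs nothing beyond NumPy.

```python
import numpy as np

def fold_bn_into_conv(W, b, scale, bias, mean, var, eps=1e-5):
    """Return (W', b') such that Conv(x, W', b') equals BN(Conv(x, W, b)) at inference."""
    factor = scale / np.sqrt(var + eps)         # one value per output channel
    W_folded = W * factor[:, None, None, None]  # W is [C_out, C_in, kH, kW]
    b_folded = (b - mean) * factor + bias
    return W_folded.astype(np.float32), b_folded.astype(np.float32)

def conv1x1(x, W, b):
    # A 1x1 convolution is a per-pixel matrix multiply over the channel axis.
    return np.einsum("oc,nchw->nohw", W[:, :, 0, 0], x) + b[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4)).astype(np.float32)
W = rng.standard_normal((5, 3, 1, 1)).astype(np.float32)
b = rng.standard_normal(5).astype(np.float32)
scale, bias, mean = (rng.standard_normal(5).astype(np.float32) for _ in range(3))
var = rng.uniform(0.5, 2.0, 5).astype(np.float32)
eps = 1e-5

# Unfused path: Conv followed by BatchNormalization (per-channel statistics).
y = conv1x1(x, W, b)
c = (None, slice(None), None, None)  # broadcast per-channel vectors over N, H, W
reference = scale[c] * (y - mean[c]) / np.sqrt(var[c] + eps) + bias[c]

# Fused path: a single Conv with folded weights and bias.
W_f, b_f = fold_bn_into_conv(W, b, scale, bias, mean, var, eps)
fused = conv1x1(x, W_f, b_f)
print(np.allclose(reference, fused, atol=1e-4))  # True
```

The same per-output-channel scaling applies to any kernel size, which is why the fusion can be done once on the initializers and then removes a node from every inference call.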
ONNX Simplifier
onnx-simplifier is a third-party tool that applies constant folding and other simplifications:
pip install onnxsim
python -m onnxsim model.onnx simplified_model.onnx
Or programmatically:
from onnxsim import simplify
import onnx
model = onnx.load("model.onnx")
simplified, check = simplify(model)
assert check, "Simplified ONNX model could not be validated!"
onnx.save(simplified, "simplified_model.onnx")
Manual Graph Surgery with onnx.helper and onnx.compose
The onnx.compose module (ONNX ≥ 1.13) provides merge_models and add_prefix utilities for combining and modifying graphs:
from onnx import compose
# Merge two models sequentially (output of model1 feeds input of model2)
combined = compose.merge_models(
model1, model2,
io_map=[("model1_output", "model2_input")],
)
For direct graph surgery (removing nodes, inserting nodes, rewiring edges), you work directly with the graph.node list:
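import onnx
from onnx import helper
model = onnx.load("model.onnx")
graph = model.graph
# Remove a specific node by name. Protobuf repeated fields do not support
# slice assignment, so clear the list and re-add the nodes to keep.
# Any consumer of the removed node's output must be rewired to its input.
kept = [n for n in graph.node if n.name != "relu_to_remove"]
del graph.node[:]
graph.node.extend(kept)
# Insert a new node after a specific point
new_node = helper.make_node("Tanh", inputs=["linear_out"], outputs=["tanh_out"], name="tanh_inserted")
insert_idx = next(i for i, n in enumerate(graph.node) if n.name == "linear")
graph.node.insert(insert_idx + 1, new_node)
# Downstream nodes that should see the Tanh result must now read "tanh_out"
# instead of "linear_out"; update their input lists accordingly.
onnx.checker.check_model(model)
onnx.save(model, "modified_model.onnx")
Quantization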
ONNX Runtime provides post-training quantization tools:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic(
model_input="model.onnx",
model_output="model_quant.onnx",
weight_type=QuantType.QInt8,
)
For static quantization (requires calibration data):
from onnxruntime.quantization import (
    quantize_static, CalibrationDataReader, QuantFormat, QuantType)
class MyCalibReader(CalibrationDataReader):
    def get_next(self):
        # Return a dict mapping input names to NumPy arrays for the next
        # calibration batch, or None once the data is exhausted.
        ...
quantize_static(
    model_input="model.onnx",
    model_output="model_quant_static.onnx",
    calibration_data_reader=MyCalibReader(),
    quant_format=QuantFormat.QDQ,   # quant_format takes a QuantFormat, not a QuantType
    weight_type=QuantType.QInt8,
)
Best Practices and Common Pitfalls
Always Use Unique Tensor Names
Every intermediate tensor name in the graph must be unique. ONNX graphs are in single-assignment form, so each name may be produced by only one node; reusing a name makes the graph invalid, and onnx.checker.check_model will reject it. A simple convention is to prefix names with the layer or block name:
"block2_conv1_out" rather than "conv_out"
Match Opset to Your Runtime
ONNX Runtime versions support specific ONNX opset ranges. Using an opset that is too new will cause load failures. Check the ONNX Runtime release notes for the supported opset range, and pin your opset_imports accordingly. Opset 17 is a safe choice for most current runtimes as of 2025.
Initializer vs. Graph Input: Know the Difference
Initializers represent constant parameters that are part of the model. Graph inputs are external tensors provided at inference time. Do not list your weights in graph.input — they belong only in graph.initializer. ONNX Runtime will warn about (and older versions will fail on) weights that appear in both places.
In older ONNX IR versions (IR < 4), initializers were required to also appear as graph inputs. From IR version 4 onward, this is no longer needed. Set model.ir_version = 8 and list weights only as initializers.
Check Data Types Carefully
All of the following will cause silently incorrect results or runtime errors if you mix them up:
- Mixing float32 and float64 inputs/weights without an explicit Cast.
- Using Python int (64-bit) where the model expects int32.
- Passing NHWC image data to a Conv that expects NCHW.
Always verify numpy dtypes when constructing initializers:
W = my_array.astype(np.float32) # always explicit
Pads Are Symmetric Lists, Not Single Values
The pads attribute on Conv and MaxPool is a flat list of all padding values: [pad_h_begin, pad_w_begin, pad_h_end, pad_w_end] for 2D. For 3D convolutions it extends further. Do not pass a single integer.
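The begin/end ordering is easy to get wrong for non-square or even-sized kernels, so it can help to generate the list instead of typing it. The helpers below are a small sketch with hypothetical names. same_pads assumes stride 1 and dilation 1 and places any odd leftover padding at the end of each axis.

```python
def onnx_pads(per_axis):
    """Flatten [(begin, end), ...] per spatial axis into ONNX order: all begins, then all ends."""
    begins = [b for b, _ in per_axis]
    ends = [e for _, e in per_axis]
    return begins + ends

def same_pads(kernel_shape):
    """Pads that preserve spatial size at stride 1 and dilation 1."""
    per_axis = []
    for k in kernel_shape:
        total = k - 1  # total padding needed along this axis
        per_axis.append((total // 2, total - total // 2))
    return onnx_pads(per_axis)

print(same_pads([3, 3]))     # [1, 1, 1, 1]
print(same_pads([7, 7]))     # [3, 3, 3, 3]
print(same_pads([2, 4]))     # [0, 1, 1, 2]
print(same_pads([3, 3, 3]))  # [1, 1, 1, 1, 1, 1]
```

For odd kernels this matches the [kH//2, kW//2, kH//2, kW//2] expression used in make_conv_bn_relu; the two only differ for even kernel sizes, where the padding cannot be split evenly.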
Use Squeeze and Unsqueeze on Axes Inputs (Opset ≥ 13)
In ONNX opset 13+, the axes argument to Squeeze and Unsqueeze moved from an attribute to an input tensor. This means you must create a constant tensor for it:
axes_const = numpy_helper.from_array(np.array([0], dtype=np.int64), name="squeeze_axes")
inits.append(axes_const)
squeeze_node = helper.make_node("Squeeze",
inputs=["my_tensor", "squeeze_axes"],
outputs=["squeezed"],
name="squeeze",
)
Reshape Takes a Tensor Input, Not an Attribute
In opset 5+, the target shape for Reshape is a 1D INT64 tensor input, not an attribute. Store it as an initializer:
target_shape = numpy_helper.from_array(np.array([-1, 128], dtype=np.int64), name="tgt_shape")
inits.append(target_shape)
reshape_node = helper.make_node("Reshape",
inputs=["flat_input", "tgt_shape"],
outputs=["reshaped"],
)
Profile Before Optimizing
ONNX Runtime provides built-in profiling. Enable it to find bottleneck operators before spending time on manual optimizations:
opts = ort.SessionOptions()
opts.enable_profiling = True
sess = ort.InferenceSession("model.onnx", sess_options=opts)
sess.run(...)
prof_file = sess.end_profiling() # returns path to JSON profile
# Open in Chrome at chrome://tracing
Reference: Commonly Used ONNX Operators
Below is a quick-reference table of the operators used most frequently in architecture construction, with their key attributes and input/output conventions.
| Operator | Key Inputs | Key Attributes | Output Shape (example) |
|---|---|---|---|
| Gemm | A, B, C (bias) | transA, transB, alpha, beta | [M, N] |
| MatMul | A, B | — | [..., M, N] |
| Conv | X, W, B | kernel_shape, strides, pads, dilations, group | [N, C_out, H_out, W_out] |
| ConvTranspose | X, W, B | kernel_shape, strides, pads, output_padding | [N, C_out, H_out, W_out] |
| BatchNormalization | X, scale, B, mean, var | epsilon, momentum | same as X |
| LayerNormalization | X, scale, B | axis, epsilon | same as X |
| Relu | X | — | same as X |
| Sigmoid | X | — | same as X |
| Tanh | X | — | same as X |
| Softmax | X | axis | same as X |
| Gelu (opset 20+) | X | approximate | same as X |
| MaxPool | X | kernel_shape, strides, pads | [N, C, H_out, W_out] |
| GlobalAveragePool | X | — | [N, C, 1, 1] |
| Reshape | data, shape | — | as specified by shape |
| Flatten | X | axis | [N, M] |
| Transpose | X | perm | permuted axes |
| Squeeze | X, axes | — | removes specified dims |
| Unsqueeze | X, axes | — | inserts specified dims |
| Concat | inputs… | axis | concatenated |
| Split | X, split (optional input since opset 13) | axis | list of tensors |
| Add | A, B | — | broadcast shape |
| Mul | A, B | — | broadcast shape |
| ReduceMean | X (axes becomes an input in opset 18) | axes, keepdims | reduced shape |
| Cast | X | to (dtype enum) | same shape, new dtype |
| Gather | data, indices | axis | indexed shape |
| LSTM | X, W, R, B | hidden_size, direction | Y, Y_h, Y_c |
| GRU | X, W, R, B | hidden_size, direction | Y, Y_h |
| Where | cond, X, Y | — | broadcast shape |
| Einsum | inputs… | equation | per equation |
| Constant | — | value | shape of value |
| Shape | X | — | [rank(X)] INT64 |
| Expand | X, shape | — | broadcast target shape |
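As one way to read a row of the table, the Gemm entry corresponds to Y = alpha * A' @ B' + beta * C, where A' and B' are optionally transposed. The NumPy sketch below mirrors that definition for illustration; gemm_reference is a hypothetical name, and the shapes follow the classifier layer of the tiny ResNet.

```python
import numpy as np

def gemm_reference(A, B, C=None, alpha=1.0, beta=1.0, transA=0, transB=0):
    """NumPy reading of the Gemm row: Y = alpha * A' @ B' + beta * C."""
    A2 = A.T if transA else A
    B2 = B.T if transB else B
    Y = alpha * (A2 @ B2)
    if C is not None:
        Y = Y + beta * C  # C is broadcast to [M, N]
    return Y

A = np.ones((4, 128), dtype=np.float32)        # [M, K], e.g. flattened features
W = np.full((128, 10), 0.5, dtype=np.float32)  # [K, N], stored as [in, out]
bias = np.arange(10, dtype=np.float32)         # [N]

Y = gemm_reference(A, W, bias)
print(Y.shape)  # (4, 10)

# A weight stored as [out, in] gives the same result with transB=1.
Y_t = gemm_reference(A, W.T, bias, transB=1)
print(np.allclose(Y, Y_t))  # True
```

The transB attribute is what lets a weight saved in [out, in] layout be used without inserting a separate Transpose node.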
Further Reading
For everything beyond this guide, the following resources are authoritative:
- ONNX Operator Specification: https://onnx.ai/onnx/operators/ — the canonical reference for every operator, every opset version, and every attribute’s exact semantics.
- ONNX Protobuf Schema: https://github.com/onnx/onnx/blob/main/onnx/onnx.proto
- ONNX Runtime Documentation: https://onnxruntime.ai/docs/
- ONNX Runtime Python API Reference: https://onnxruntime.ai/docs/api/python/api_summary.html
- Netron Visualizer: https://netron.app/