Introduction

ONNX (Open Neural Network Exchange) is most commonly known as an export target — a format you dump a PyTorch or TensorFlow model into for deployment. But ONNX is also a fully self-contained, expressive intermediate representation that you can build directly, without ever touching a training framework. This guide treats ONNX as a first-class construction language, not a second-class export artifact.

Why would you want to build networks directly in ONNX?

Portability without framework lock-in. ONNX Runtime executes the same graph on many hardware backends through its execution providers: CPU, CUDA, DirectML, TensorRT, OpenVINO, CoreML, and more. If you define your architecture in ONNX directly, there is no intermediate framework to install or version-pin.

Deterministic, inspectable graphs. When you export from PyTorch, the resulting graph depends on tracing or scripting heuristics that can produce surprising operator sequences. When you write ONNX directly, you know exactly what every node does.

Lightweight deployments. The onnx and onnxruntime packages together have a far smaller dependency footprint than PyTorch or TensorFlow, which matters for embedded systems, edge devices, and serverless inference.

Fine-grained graph surgery. If you need to fuse operators, insert quantization nodes, rewire connections, or experiment with non-standard topologies, working at the ONNX level directly gives you exact control with no framework abstractions in the way.

Learning how neural networks really work. Building an architecture from raw matrix multiply and activation nodes forces you to understand every dimension, every weight layout, every broadcasting rule. It is an excellent exercise and deeply illuminating.

This guide assumes basic Python proficiency and some familiarity with neural network concepts (layers, activations, convolutions). It does not assume you have ever used PyTorch or TensorFlow.

Understanding the ONNX Format

An ONNX model is a serialized Protocol Buffer file. The .onnx extension is standard, but the file is just a binary proto. The schema is defined in the ONNX specification.

At the top level, a ModelProto contains:

  • ir_version: The ONNX IR (Intermediate Representation) version.
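  • opset_import: Which operator sets (and which versions of them) this model uses. Most models import the default "" domain (also written "ai.onnx") at a version such as 17 or 21. The helper function make_model exposes this field through its opset_imports keyword.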
  • graph: A GraphProto — the actual computation graph.
  • producer_name, producer_version, domain, model_version, doc_string: Metadata fields.

The GraphProto contains:

  • node: A list of NodeProto objects. Each node is one operation.
  • initializer: A list of TensorProto objects representing constant tensors — weights, biases, embedding tables, etc.
  • input: A list of ValueInfoProto describing the graph’s external inputs (their names, types, and shapes).
  • output: A list of ValueInfoProto describing the graph’s outputs.

Each NodeProto contains:

  • op_type: The name of the operator, e.g., "Gemm", "Conv", "Relu".
  • domain: Usually "" for standard ONNX ops.
  • input: A list of string names — the tensors this node consumes.
  • output: A list of string names — the tensors this node produces.
  • attribute: A list of AttributeProto objects — static hyperparameters like kernel size, axis, epsilon, etc.
Tip

Tensor names are just strings. They act as edges in the dataflow graph. If node A produces an output named "relu_out" and node B lists "relu_out" as an input, then B receives A’s output. This is the complete wiring mechanism.

The ONNX Protobuf Schema

You do not need to write raw protobuf. The onnx Python package provides a rich helper library (onnx.helper, onnx.numpy_helper, onnx.checker) that builds proto objects for you. However, understanding the schema directly will save you many hours of debugging.

TensorProto Data Types

Every tensor in ONNX has an element type, encoded as an integer enum:

ONNX TensorProto data type enum values

Enum value   Name         Python/NumPy equivalent
1            FLOAT        np.float32
2            UINT8        np.uint8
3            INT8         np.int8
4            UINT16       np.uint16
5            INT16        np.int16
6            INT32        np.int32
7            INT64        np.int64
8            STRING       bytes
9            BOOL         np.bool_
10           FLOAT16      np.float16
11           DOUBLE       np.float64
12           UINT32       np.uint32
13           UINT64       np.uint64
14           COMPLEX64    np.complex64
15           COMPLEX128   np.complex128
16           BFLOAT16     N/A (no native NumPy dtype)

You reference these via onnx.TensorProto.FLOAT, onnx.TensorProto.INT64, etc.

ValueInfoProto and Shape

Inputs and outputs are described by ValueInfoProto, which pairs a name with a type. The type is a TypeProto, which for tensors includes the element type and a shape. Shapes can have:

  • Fixed dimensions: dim_value = 4 means exactly 4 elements on that axis.
  • Symbolic dimensions: dim_param = "batch_size" means the dimension is variable but named. ONNX Runtime will accept any runtime value for it.
  • Fully unknown dimensions: Neither dim_value nor dim_param is set — completely dynamic.

Setting Up Your Environment

You need very few packages:

pip install onnx onnxruntime numpy

For visualization (highly recommended):

pip install netron

Netron is a graph visualizer for ONNX and other model formats. The pip package serves the viewer in your browser (for example, netron my_model.onnx), showing the full computation graph as a node diagram with attributes, shapes, and connections all visible.

Verify your installation:

import onnx
import onnxruntime as ort
import numpy as np

print(f"ONNX version: {onnx.__version__}")
print(f"ONNX IR version: {onnx.IR_VERSION}")
print(f"ONNX Runtime version: {ort.__version__}")

Core Building Blocks: ONNX Operators

ONNX defines a large standard operator set. Here are the operators you will use most often when building architectures from scratch.

Linear Algebra

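Gemm (General Matrix Multiply): Computes alpha * A' @ B' + beta * C, where A' and B' are A and B optionally transposed according to transA and transB. A and B must be 2D. This is the workhorse for fully connected layers. Attributes include transA, transB, alpha, beta. The C input (bias) is optional and is broadcast to the output shape.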

MatMul: Computes a simple matrix product A @ B, with numpy-style broadcasting for batched inputs. Has no attributes. Use this when you need raw matmul without the alpha/beta scaling of Gemm.

Add, Sub, Mul, Div: Element-wise arithmetic with broadcasting.

Transpose: Permutes axes. The perm attribute lists the new axis order, e.g., perm=[0, 2, 1] keeps the first axis of a 3D tensor in place and swaps the last two.
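
The Gemm formula is easy to check in plain NumPy. The sketch below is a reference implementation of the semantics described above, not the ONNX kernel itself; gemm_reference is a hypothetical helper name.

```python
import numpy as np

def gemm_reference(A, B, C=None, alpha=1.0, beta=1.0, transA=0, transB=0):
    """NumPy sketch of Gemm semantics (hypothetical helper, not part of onnx)."""
    A2 = A.T if transA else A
    B2 = B.T if transB else B
    Y = alpha * (A2 @ B2)
    if C is not None:
        Y = Y + beta * C  # C is broadcast to the (M, N) output
    return Y

x = np.arange(6, dtype=np.float32).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
W_out_in = np.ones((4, 3), dtype=np.float32)       # weights stored as [out, in]
b = np.full(4, 0.5, dtype=np.float32)

# transB=1 transposes the [out, in] weights on the fly.
y = gemm_reference(x, W_out_in, b, transB=1)
print(y.shape)  # (2, 4)
```

With all-ones weights each output is the row sum of x plus the bias, so the two rows come out as 3.5 and 12.5.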

Activations

Relu: Element-wise max(0, x). No attributes.

Sigmoid: Element-wise 1 / (1 + exp(-x)). No attributes.

Tanh: Element-wise hyperbolic tangent. No attributes.

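Gelu: Gaussian Error Linear Unit. Introduced in opset 20, with an approximate attribute ("none" or "tanh"). In earlier opsets it has to be composed from primitives such as Erf, Mul, and Add.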

Softmax: Softmax along a specified axis. The default axis is -1 from opset 13 onward; earlier opsets default to axis 1 and flatten the input to 2D first.

LeakyRelu: max(alpha * x, x) with a configurable alpha attribute (default 0.01).

Elu: Exponential Linear Unit. Attribute: alpha.

Normalization

BatchNormalization: Normalizes inputs across the batch dimension, then scales and shifts with learnable scale and B (bias) parameters, using running mean and var statistics. Has epsilon and momentum attributes. In inference mode (the default in ONNX), it uses the stored running statistics and has only one output.

LayerNormalization: Normalizes across a specified set of axes (usually the last). Introduced in opset 17. Essential for Transformer architectures.

InstanceNormalization: Normalizes per-channel per-sample. Useful for style transfer networks.
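
For intuition, inference-mode BatchNormalization reduces to a per-channel affine transform. The following NumPy sketch shows that computation for channel-first input; batchnorm_inference is a hypothetical helper and the input values are random placeholders.

```python
import numpy as np

def batchnorm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """NumPy sketch of inference-mode BatchNormalization on (N, C, ...) input."""
    # Reshape the per-channel vectors so they broadcast over (N, C, H, W).
    shape = (1, -1) + (1,) * (x.ndim - 2)
    x_hat = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + epsilon)
    return scale.reshape(shape) * x_hat + bias.reshape(shape)

x = np.random.default_rng(0).standard_normal((2, 3, 4, 4)).astype(np.float32)
channels = x.shape[1]

# With mean 0, var 1, scale 1, bias 0 the output is almost exactly the input.
y = batchnorm_inference(
    x,
    scale=np.ones(channels, dtype=np.float32),
    bias=np.zeros(channels, dtype=np.float32),
    mean=np.zeros(channels, dtype=np.float32),
    var=np.ones(channels, dtype=np.float32),
)
print(y.shape)
```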

Convolutions

Conv: N-dimensional convolution. Key attributes: kernel_shape, strides, pads, dilations, group (for grouped/depthwise convolutions), auto_pad. Inputs: X (data), W (weights), B (bias, optional).

ConvTranspose: Transposed (fractionally-strided) convolution for upsampling. Same attribute set as Conv plus output_padding and output_shape.

MaxPool, AveragePool: Pooling with kernel_shape, strides, pads.

GlobalAveragePool, GlobalMaxPool: Reduce each spatial map to a single value.
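
When choosing kernel_shape, strides, and pads it helps to compute the resulting spatial size ahead of time. The sketch below uses the standard floor-based formula, assuming explicit padding (auto_pad left at its default) and, for pooling, ceil_mode=0. The function name conv_out_size is hypothetical.

```python
def conv_out_size(size, kernel, stride=1, pad_begin=0, pad_end=0, dilation=1):
    """Output length along one spatial axis for Conv, MaxPool, or AveragePool."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + pad_begin + pad_end - effective_kernel) // stride + 1

# A 3x3 convolution with padding 1 keeps a 32-pixel axis at 32.
same = conv_out_size(32, kernel=3, stride=1, pad_begin=1, pad_end=1)
# A 2x2 pool with stride 2 halves it.
pooled = conv_out_size(same, kernel=2, stride=2)
print(same, pooled)  # 32 16
```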

Recurrence

LSTM: Full Long Short-Term Memory cell. Inputs: X, W, R, B, sequence_lens, initial_h, initial_c, P. Attributes: hidden_size, direction (forward, reverse, bidirectional).

GRU: Gated Recurrent Unit. Similar interface to LSTM.

RNN: Simple Elman RNN.

Shape Manipulation

Reshape: Changes shape without copying data. Takes a shape tensor as the second input (not an attribute). Use -1 for one inferred dimension.

Flatten: Flattens from axis axis onward into a 2D tensor.

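Squeeze: Removes dimensions of size 1 at specified axes. From opset 13 onward, axes is an input tensor rather than an attribute.

Unsqueeze: Inserts dimensions of size 1 at specified axes. From opset 13 onward, axes is an input tensor rather than an attribute.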

Concat: Concatenates tensors along a specified axis.

Split: Splits a tensor into multiple outputs along an axis.

Slice: Extracts a sub-tensor using starts, ends, axes, and steps inputs.

Gather: Index-based lookup (embedding table access, index selection).

GatherElements: Gathers elements along a specified axis using an index tensor.

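ScatterElements, ScatterND: Write updates into a tensor at given indices; these are the inverses of GatherElements and GatherND. The older Scatter operator is deprecated in favor of ScatterElements.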

Pad: Pads a tensor using the constant, reflect, or edge mode. A wrap mode was added in opset 19.

Tile: Repeats a tensor along each axis a specified number of times.

Expand: Broadcasts a tensor to a target shape.

Reduction

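ReduceMean, ReduceSum, ReduceMax, ReduceMin, ReduceProd: Reduce along specified axes, with optional keepdims. The axes are an attribute in older opsets and an input tensor from opset 13 (ReduceSum) or opset 18 (the others).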

ArgMax, ArgMin: Return the index of the max/min value along an axis.

Logical and Comparison

Equal, Less, Greater, LessOrEqual, GreaterOrEqual: Element-wise comparisons returning bool tensors.

And, Or, Not, Xor: Boolean logic.

Where: Selects elements from two tensors based on a bool condition tensor.

Miscellaneous

Cast: Converts element dtype, e.g., from INT64 to FLOAT.

Constant: Embeds a constant tensor directly as a node, with the value stored in an attribute. Useful when you want a fixed value to appear inline in the node list rather than as a separate initializer.

Shape: Returns the shape of a tensor as a 1D INT64 tensor.

Size: Returns the total number of elements as a scalar INT64.

Dropout: Applies dropout. In ONNX inference mode, this is a pass-through (no masking).

Einsum: General einsum notation. Available from opset 12.

Constructing Graphs with the ONNX Helper API

The onnx.helper module is your primary interface. Here is an overview of its key functions.

onnx.helper.make_node

Creates a NodeProto.

import onnx
from onnx import helper, TensorProto

node = helper.make_node(
    op_type="Relu",          # operator name
    inputs=["linear_out"],   # names of input tensors
    outputs=["relu_out"],    # names of output tensors
    name="relu_1",           # optional: name for the node itself
)

For operators with attributes:

node = helper.make_node(
    op_type="Conv",
    inputs=["x", "W", "b"],
    outputs=["conv_out"],
    kernel_shape=[3, 3],
    strides=[1, 1],
    pads=[1, 1, 1, 1],
    name="conv_1",
)

Attributes are passed as keyword arguments. ONNX infers their types automatically from the Python values you pass (int → INT, float → FLOAT, list of ints → INTS, etc.).

onnx.helper.make_tensor_value_info

Creates a ValueInfoProto for describing graph inputs and outputs.

# Fixed batch size of 1, 784 features
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 784])

# Dynamic batch size (symbolic), 10 classes
y_info = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["batch", 10])

# Completely dynamic shape
z_info = helper.make_tensor_value_info("z", TensorProto.FLOAT, None)

onnx.numpy_helper.from_array

Converts a NumPy array to a TensorProto for use as an initializer.

import numpy as np
from onnx import numpy_helper

W = np.random.randn(128, 784).astype(np.float32)
W_tensor = numpy_helper.from_array(W, name="fc1_weight")

onnx.helper.make_graph

Assembles nodes, initializers, inputs, and outputs into a GraphProto.

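# node1..node3, x_info, y_info, W_tensor, and b_tensor stand for objects
# built with the helpers shown above (b_tensor is made the same way as W_tensor).
graph = helper.make_graph(
    nodes=[node1, node2, node3],
    name="my_mlp",
    inputs=[x_info],
    outputs=[y_info],
    initializer=[W_tensor, b_tensor],
)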

onnx.helper.make_model

Wraps a graph in a ModelProto.

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17)],  # opset 17 of the default domain
)
model.ir_version = 8
model.producer_name = "my_builder"

onnx.checker.check_model

Validates the model’s structural correctness. Always run this before saving or running.

onnx.checker.check_model(model)

onnx.save

Serializes to a .onnx file.

onnx.save(model, "my_model.onnx")

Building a Linear Regression Model

Let us start with the simplest possible “network”: a linear regression that computes y = X @ W + b.

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

# ------------------------------------------------------------------ #
# 1. Define weights and bias as numpy arrays                          #
# ------------------------------------------------------------------ #
in_features  = 8
out_features = 1

W_data = np.random.randn(in_features, out_features).astype(np.float32)
b_data = np.zeros(out_features, dtype=np.float32)

# ------------------------------------------------------------------ #
# 2. Convert to TensorProto initializers                              #
# ------------------------------------------------------------------ #
W_init = numpy_helper.from_array(W_data, name="W")
b_init = numpy_helper.from_array(b_data, name="b")

# ------------------------------------------------------------------ #
# 3. Define the graph's external input and output shapes              #
# ------------------------------------------------------------------ #
# Input: batch of samples, each with 8 features
x_info = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["batch", in_features])
# Output: batch of scalars
y_info = helper.make_tensor_value_info("y", TensorProto.FLOAT, ["batch", out_features])

# ------------------------------------------------------------------ #
# 4. Define the computation node                                      #
# ------------------------------------------------------------------ #
# Gemm computes: alpha * A @ B + beta * C
# We want: x @ W + b, which is: 1.0 * x @ W + 1.0 * b
gemm_node = helper.make_node(
    op_type="Gemm",
    inputs=["x", "W", "b"],
    outputs=["y"],
    alpha=1.0,
    beta=1.0,
    transB=0,  # W is already (in_features, out_features), no transpose needed
    name="linear",
)

# ------------------------------------------------------------------ #
# 5. Build the graph                                                  #
# ------------------------------------------------------------------ #
graph = helper.make_graph(
    nodes=[gemm_node],
    name="linear_regression",
    inputs=[x_info],
    outputs=[y_info],
    initializer=[W_init, b_init],
)

# ------------------------------------------------------------------ #
# 6. Build the model                                                  #
# ------------------------------------------------------------------ #
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17)],
)
model.ir_version = 8
model.producer_name = "onnx_guide"

# ------------------------------------------------------------------ #
# 7. Validate and save                                                #
# ------------------------------------------------------------------ #
onnx.checker.check_model(model)
onnx.save(model, "linear_regression.onnx")
print("Model saved.")
Note: Key details
  • The initializers W and b are listed in initializer and implicitly available as named tensors in the graph. You do not list them as graph inputs because they are constants — they do not vary across inference calls.
  • Gemm’s transB attribute controls whether the second matrix is transposed before multiply. With transB=0 and W shaped [in_features, out_features], the compute is x @ W, giving output shape [batch, out_features].
  • Symbolic dimensions like "batch" in shape specifications tell ONNX Runtime to accept any value on that axis at runtime.

Building a Multi-Layer Perceptron (MLP)

A multi-layer perceptron stacks fully connected layers with nonlinear activations between them. Here we build a 3-layer MLP for classification: input → hidden1 → hidden2 → logits → softmax.

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

# ------------------------------------------------------------------ #
# Architecture hyperparameters                                        #
# ------------------------------------------------------------------ #
in_dim  = 784   # e.g., MNIST flattened
h1_dim  = 256
h2_dim  = 128
out_dim = 10    # classes

def make_fc_weights(in_d, out_d, name_prefix):
    """Create weight and bias initializers for a fully connected layer."""
    W = np.random.randn(in_d, out_d).astype(np.float32) * np.sqrt(2.0 / in_d)
    b = np.zeros(out_d, dtype=np.float32)
    W_init = numpy_helper.from_array(W, name=f"{name_prefix}_W")
    b_init = numpy_helper.from_array(b, name=f"{name_prefix}_b")
    return W_init, b_init

# ------------------------------------------------------------------ #
# Initializers                                                        #
# ------------------------------------------------------------------ #
fc1_W, fc1_b = make_fc_weights(in_dim, h1_dim, "fc1")
fc2_W, fc2_b = make_fc_weights(h1_dim, h2_dim, "fc2")
fc3_W, fc3_b = make_fc_weights(h2_dim, out_dim, "fc3")

all_initializers = [fc1_W, fc1_b, fc2_W, fc2_b, fc3_W, fc3_b]

# ------------------------------------------------------------------ #
# Nodes                                                               #
# ------------------------------------------------------------------ #
nodes = []

# Layer 1: Linear → ReLU
nodes.append(helper.make_node(
    "Gemm", inputs=["x", "fc1_W", "fc1_b"], outputs=["fc1_out"],
    name="fc1", alpha=1.0, beta=1.0,
))
nodes.append(helper.make_node(
    "Relu", inputs=["fc1_out"], outputs=["relu1_out"],
    name="relu1",
))

# Layer 2: Linear → ReLU
nodes.append(helper.make_node(
    "Gemm", inputs=["relu1_out", "fc2_W", "fc2_b"], outputs=["fc2_out"],
    name="fc2", alpha=1.0, beta=1.0,
))
nodes.append(helper.make_node(
    "Relu", inputs=["fc2_out"], outputs=["relu2_out"],
    name="relu2",
))

# Layer 3: Linear (logits)
nodes.append(helper.make_node(
    "Gemm", inputs=["relu2_out", "fc3_W", "fc3_b"], outputs=["logits"],
    name="fc3", alpha=1.0, beta=1.0,
))

# Softmax over class dimension (axis=-1 is the default)
nodes.append(helper.make_node(
    "Softmax", inputs=["logits"], outputs=["probs"],
    name="softmax", axis=-1,
))

# ------------------------------------------------------------------ #
# Graph inputs / outputs                                              #
# ------------------------------------------------------------------ #
x_info    = helper.make_tensor_value_info("x",     TensorProto.FLOAT, ["batch", in_dim])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", out_dim])

# ------------------------------------------------------------------ #
# Assemble and save                                                   #
# ------------------------------------------------------------------ #
graph = helper.make_graph(
    nodes, "mlp",
    inputs=[x_info],
    outputs=[prob_info],
    initializer=all_initializers,
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8

onnx.checker.check_model(model)
onnx.save(model, "mlp.onnx")
print("MLP saved.")
Note: Key observations
  • Weight names in make_node must match exactly the names used in numpy_helper.from_array. A single character mismatch leaves the node input undefined, which onnx.checker.check_model rejects and which ONNX Runtime would otherwise refuse to load.
  • He initialization (np.sqrt(2.0 / in_d)) is baked into the weight values at construction time. ONNX does not have an initialization scheme concept; weights are just constant tensors.
  • Gemm expects W shaped [in_dim, out_dim] when transB=0. Some sources store their weights as [out_dim, in_dim] and use transB=1; both conventions are valid.

Building a Convolutional Neural Network (CNN)

CNNs require managing multi-dimensional weight tensors. In ONNX, the Conv operator expects:

  • Input X: shape [batch, in_channels, height, width] (NCHW format).
  • Weight W: shape [out_channels, in_channels/group, kernel_h, kernel_w].
  • Bias B: shape [out_channels] (optional).

Here we build a small CNN for image classification (CIFAR-style input: 3×32×32 → 10 classes).

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

def conv_weight(out_ch, in_ch, kH, kW, name):
    fan_in = in_ch * kH * kW
    W = np.random.randn(out_ch, in_ch, kH, kW).astype(np.float32) * np.sqrt(2.0 / fan_in)
    return numpy_helper.from_array(W, name=name)

def conv_bias(out_ch, name):
    b = np.zeros(out_ch, dtype=np.float32)
    return numpy_helper.from_array(b, name=name)

def bn_params(channels, name_prefix):
    """BatchNorm scale (gamma), bias (beta), running mean, running var."""
    scale = numpy_helper.from_array(np.ones(channels,  dtype=np.float32), name=f"{name_prefix}_scale")
    bias  = numpy_helper.from_array(np.zeros(channels, dtype=np.float32), name=f"{name_prefix}_bias")
    mean  = numpy_helper.from_array(np.zeros(channels, dtype=np.float32), name=f"{name_prefix}_mean")
    var   = numpy_helper.from_array(np.ones(channels,  dtype=np.float32), name=f"{name_prefix}_var")
    return scale, bias, mean, var

def fc_params(in_d, out_d, name_prefix):
    W = np.random.randn(in_d, out_d).astype(np.float32) * np.sqrt(2.0 / in_d)
    b = np.zeros(out_d, dtype=np.float32)
    return (numpy_helper.from_array(W, name=f"{name_prefix}_W"),
            numpy_helper.from_array(b, name=f"{name_prefix}_b"))

# ------------------------------------------------------------------ #
# Initializers                                                        #
# ------------------------------------------------------------------ #
inits = []

# Conv block 1: 3 → 32 channels, 3×3 kernel
inits += [conv_weight(32, 3,  3, 3, "conv1_W"), conv_bias(32, "conv1_b")]
inits += list(bn_params(32, "bn1"))

# Conv block 2: 32 → 64 channels, 3×3 kernel
inits += [conv_weight(64, 32, 3, 3, "conv2_W"), conv_bias(64, "conv2_b")]
inits += list(bn_params(64, "bn2"))

# Conv block 3: 64 → 128 channels, 3×3 kernel
inits += [conv_weight(128, 64, 3, 3, "conv3_W"), conv_bias(128, "conv3_b")]
inits += list(bn_params(128, "bn3"))

# Fully connected layers
# After 3 max-pools on 32×32 input: 32/2/2/2 = 4 spatial → 128 * 4 * 4 = 2048
inits += list(fc_params(128 * 4 * 4, 256, "fc1"))
inits += list(fc_params(256, 10, "fc2"))

# ------------------------------------------------------------------ #
# Nodes                                                               #
# ------------------------------------------------------------------ #
nodes = []

def conv_bn_relu(x_name, conv_w, conv_b, bn_prefix, out_name, kH=3, kW=3, pad=1):
    """Returns a list of nodes: Conv → BatchNorm → Relu."""
    conv_out = f"{out_name}_conv"
    bn_out   = f"{out_name}_bn"
    return [
        helper.make_node("Conv",
            inputs=[x_name, conv_w, conv_b],
            outputs=[conv_out],
            kernel_shape=[kH, kW],
            strides=[1, 1],
            pads=[pad, pad, pad, pad],
            name=f"{out_name}_conv_op",
        ),
        helper.make_node("BatchNormalization",
            inputs=[conv_out,
                    f"{bn_prefix}_scale", f"{bn_prefix}_bias",
                    f"{bn_prefix}_mean",  f"{bn_prefix}_var"],
            outputs=[bn_out],
            epsilon=1e-5,
            momentum=0.9,
            name=f"{out_name}_bn_op",
        ),
        helper.make_node("Relu",
            inputs=[bn_out],
            outputs=[out_name],
            name=f"{out_name}_relu_op",
        ),
    ]

# Block 1 + MaxPool
nodes += conv_bn_relu("x", "conv1_W", "conv1_b", "bn1", "block1_out")
nodes.append(helper.make_node("MaxPool",
    inputs=["block1_out"], outputs=["pool1_out"],
    kernel_shape=[2, 2], strides=[2, 2], name="pool1",
))

# Block 2 + MaxPool
nodes += conv_bn_relu("pool1_out", "conv2_W", "conv2_b", "bn2", "block2_out")
nodes.append(helper.make_node("MaxPool",
    inputs=["block2_out"], outputs=["pool2_out"],
    kernel_shape=[2, 2], strides=[2, 2], name="pool2",
))

# Block 3 + MaxPool
nodes += conv_bn_relu("pool2_out", "conv3_W", "conv3_b", "bn3", "block3_out")
nodes.append(helper.make_node("MaxPool",
    inputs=["block3_out"], outputs=["pool3_out"],
    kernel_shape=[2, 2], strides=[2, 2], name="pool3",
))

# Flatten: [batch, 128, 4, 4] → [batch, 2048]
nodes.append(helper.make_node("Flatten",
    inputs=["pool3_out"], outputs=["flat_out"],
    axis=1, name="flatten",
))

# FC1 + ReLU
nodes.append(helper.make_node("Gemm",
    inputs=["flat_out", "fc1_W", "fc1_b"], outputs=["fc1_out"],
    alpha=1.0, beta=1.0, name="fc1",
))
nodes.append(helper.make_node("Relu",
    inputs=["fc1_out"], outputs=["fc1_relu"],
    name="fc1_relu",
))

# FC2 (logits) + Softmax
nodes.append(helper.make_node("Gemm",
    inputs=["fc1_relu", "fc2_W", "fc2_b"], outputs=["logits"],
    alpha=1.0, beta=1.0, name="fc2",
))
nodes.append(helper.make_node("Softmax",
    inputs=["logits"], outputs=["probs"],
    axis=-1, name="softmax",
))

# ------------------------------------------------------------------ #
# Graph, model, validate, save                                       #
# ------------------------------------------------------------------ #
x_info    = helper.make_tensor_value_info("x",     TensorProto.FLOAT, ["batch", 3, 32, 32])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 10])

graph = helper.make_graph(nodes, "cnn_classifier",
    inputs=[x_info], outputs=[prob_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8

onnx.checker.check_model(model)
onnx.save(model, "cnn_classifier.onnx")
print("CNN saved.")
Warning: Important notes on the CNN
  • NCHW layout: ONNX Conv assumes channel-first ordering. If your input data is NHWC, you must Transpose it first.
  • pads attribute: The order is all begin values followed by all end values, one per spatial axis. For 2D convolutions that is [pad_h_begin, pad_w_begin, pad_h_end, pad_w_end], which is the same thing as [pad_top, pad_left, pad_bottom, pad_right]. The values are not interleaved per axis.
  • BatchNormalization in inference mode: With training_mode=0 (the default; the attribute was added in opset 14), the node produces a single output, the normalized tensor. With training_mode=1 it also emits the updated running_mean and running_var. Graphs written against opsets before 14 may instead declare up to five outputs (including saved mean and saved variance), which indicates a training-mode export.
  • Flatten axis: axis=1 means flatten from dimension 1 onward, preserving the batch dimension. The result is [batch, 128*4*4].

Building a Recurrent Neural Network (RNN/LSTM)

ONNX's LSTM operator encodes a full LSTM layer, including the loop over timesteps, in a single node rather than exposing a per-timestep cell. This makes the graph compact, but the weight layout requires care.

Important: Gate order difference

The ONNX LSTM gate order is IOFC (Input, Output, Forget, Cell), while PyTorch uses IFCO (Input, Forget, Cell, Output). This affects how you lay out the weight tensor if you ever interoperate.
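
If you do need to carry weights across, the reordering amounts to shuffling four equal blocks along the first axis. The NumPy sketch below assumes a single-direction weight of shape [4 * hidden_size, ...] stacked in the PyTorch order and rearranges it to the ONNX order; torch_to_onnx_gate_order is a hypothetical helper, and the tagged values exist only to make the permutation visible.

```python
import numpy as np

def torch_to_onnx_gate_order(w):
    """Reorder stacked gate blocks from (i, f, c, o) to ONNX's (i, o, f, c).

    Hypothetical helper: `w` has shape [4 * hidden_size, ...].
    """
    i, f, c, o = np.split(w, 4, axis=0)
    return np.concatenate([i, o, f, c], axis=0)

hidden_size, input_size = 2, 3

# Tag each gate block with a distinct value: i=0, f=1, c=2, o=3.
tags = np.repeat(np.array([0, 1, 2, 3], dtype=np.float32), hidden_size)
w_torch = tags[:, None] * np.ones((1, input_size), dtype=np.float32)

# Reorder, then add the leading directions axis that ONNX expects.
w_onnx = torch_to_onnx_gate_order(w_torch)[None, ...]

print(w_onnx.shape)            # (1, 8, 3)
print(w_onnx[0, :, 0].tolist())
```

The same permutation applies to the recurrence weights and to each half of the bias before the two halves are concatenated into B.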

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

seq_len    = 20     # sequence length
batch_size = 4      # batch size
input_size = 16     # features per timestep
hidden_size = 32    # LSTM hidden dim
num_layers = 1
directions = 1      # 1 for forward, 2 for bidirectional

# ------------------------------------------------------------------ #
# LSTM Weight Layout (ONNX standard):                                 #
# W shape: [directions, 4 * hidden_size, input_size]                  #
# R shape: [directions, 4 * hidden_size, hidden_size]                 #
# B shape: [directions, 8 * hidden_size]  (W_bias concat R_bias)      #
# ------------------------------------------------------------------ #

W_data = np.random.randn(directions, 4 * hidden_size, input_size).astype(np.float32)
R_data = np.random.randn(directions, 4 * hidden_size, hidden_size).astype(np.float32)
B_data = np.zeros((directions, 8 * hidden_size), dtype=np.float32)

W_init = numpy_helper.from_array(W_data, name="lstm_W")
R_init = numpy_helper.from_array(R_data, name="lstm_R")
B_init = numpy_helper.from_array(B_data, name="lstm_B")

# ------------------------------------------------------------------ #
# LSTM node                                                           #
# ------------------------------------------------------------------ #
# Inputs:  X, W, R, B, sequence_lens (optional), initial_h, initial_c, P (peepholes)
# Outputs: Y (all hidden states), Y_h (final hidden), Y_c (final cell)
lstm_node = helper.make_node(
    "LSTM",
    inputs=["x", "lstm_W", "lstm_R", "lstm_B"],  # omit optional inputs
    outputs=["Y", "Y_h", "Y_c"],
    hidden_size=hidden_size,
    direction="forward",
    name="lstm_layer",
)

# Y shape:   [seq_len, directions, batch, hidden_size]
# Y_h shape: [directions, batch, hidden_size]
# Y_c shape: [directions, batch, hidden_size]

# We want the final hidden state: Y_h, shape [1, batch, hidden_size]
# Squeeze the directions dimension:
squeeze_axes = numpy_helper.from_array(np.array([0], dtype=np.int64), name="squeeze_axes")

squeeze_node = helper.make_node(
    "Squeeze",
    inputs=["Y_h", "squeeze_axes"],
    outputs=["h_final"],   # shape: [batch, hidden_size]
    name="squeeze_h",
)

# Final classifier
fc_W_data = np.random.randn(hidden_size, 5).astype(np.float32)  # 5 output classes
fc_b_data = np.zeros(5, dtype=np.float32)
fc_W_init = numpy_helper.from_array(fc_W_data, name="fc_W")
fc_b_init = numpy_helper.from_array(fc_b_data, name="fc_b")

fc_node = helper.make_node(
    "Gemm",
    inputs=["h_final", "fc_W", "fc_b"],
    outputs=["logits"],
    alpha=1.0, beta=1.0,
    name="fc_out",
)

softmax_node = helper.make_node(
    "Softmax",
    inputs=["logits"],
    outputs=["probs"],
    axis=-1,
    name="softmax",
)

# ------------------------------------------------------------------ #
# Graph assembly                                                      #
# ------------------------------------------------------------------ #
# X: [seq_len, batch, input_size] — ONNX LSTM uses time-first layout
x_info    = helper.make_tensor_value_info("x", TensorProto.FLOAT,
                                           [seq_len, "batch", input_size])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 5])

graph = helper.make_graph(
    [lstm_node, squeeze_node, fc_node, softmax_node],
    "lstm_classifier",
    inputs=[x_info],
    outputs=[prob_info],
    initializer=[W_init, R_init, B_init, squeeze_axes, fc_W_init, fc_b_init],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8

onnx.checker.check_model(model)
onnx.save(model, "lstm_classifier.onnx")
print("LSTM model saved.")

Crucially, ONNX LSTM takes input in [seq_len, batch, input_size] order (time-first). If your data is batch-first [batch, seq_len, input_size], add a Transpose node before the LSTM:

transpose_node = helper.make_node(
    "Transpose",
    inputs=["x_batch_first"],
    outputs=["x"],
    perm=[1, 0, 2],  # swap seq and batch dimensions
    name="batch_to_seq_first",
)

Building a Transformer Block

A Transformer block is the most involved architecture to assemble in raw ONNX, but it is an outstanding exercise in understanding attention. We build a single encoder block: multi-head self-attention followed by a feed-forward network, both with residual connections and layer normalization.

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

d_model   = 64    # embedding dimension
n_heads   = 4     # attention heads
d_k       = d_model // n_heads  # key/query dimension per head = 16
d_ff      = 256   # feed-forward inner dimension
seq_len   = 10    # sequence length (fixed for this example)
eps       = 1e-6

rng = np.random.default_rng(42)

def rand_f32(shape, name):
    data = rng.standard_normal(shape).astype(np.float32) * 0.02
    return numpy_helper.from_array(data, name=name)

def zeros_f32(shape, name):
    return numpy_helper.from_array(np.zeros(shape, dtype=np.float32), name=name)

def ones_f32(shape, name):
    return numpy_helper.from_array(np.ones(shape, dtype=np.float32), name=name)

inits  = []
nodes  = []

# ================================================================== #
# Projection weights for Q, K, V, and output                        #
# [d_model, d_model] — we will split heads in-graph                  #
# ================================================================== #
inits += [rand_f32((d_model, d_model), "W_Q"),
          rand_f32((d_model, d_model), "W_K"),
          rand_f32((d_model, d_model), "W_V"),
          rand_f32((d_model, d_model), "W_O"),
          zeros_f32((d_model,), "b_Q"),
          zeros_f32((d_model,), "b_K"),
          zeros_f32((d_model,), "b_V"),
          zeros_f32((d_model,), "b_O")]

# Feed-forward weights
inits += [rand_f32((d_model, d_ff), "W_ff1"), zeros_f32((d_ff,),    "b_ff1"),
          rand_f32((d_ff, d_model), "W_ff2"), zeros_f32((d_model,), "b_ff2")]

# LayerNorm parameters (two sets: after attention, after FFN)
inits += [ones_f32((d_model,),  "ln1_scale"), zeros_f32((d_model,), "ln1_bias"),
          ones_f32((d_model,),  "ln2_scale"), zeros_f32((d_model,), "ln2_bias")]

# Scale factor for attention: 1 / sqrt(d_k)
scale_val = np.array(1.0 / np.sqrt(d_k), dtype=np.float32)
inits.append(numpy_helper.from_array(scale_val, name="attn_scale"))

# Shape constants for Reshape operations
reshape_to_heads = np.array([-1, seq_len, n_heads, d_k], dtype=np.int64)
inits.append(numpy_helper.from_array(reshape_to_heads, name="shape_heads"))

restore_shape = np.array([-1, seq_len, d_model], dtype=np.int64)
inits.append(numpy_helper.from_array(restore_shape, name="shape_restore"))

# ================================================================== #
# MULTI-HEAD SELF-ATTENTION                                           #
# ================================================================== #

# --- Compute Q, K, V projections ---
# MatMul: [batch, seq, d_model] @ [d_model, d_model] → [batch, seq, d_model]
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("MatMul",
        inputs=["x", f"W_{letter}"],
        outputs=[f"{letter}_proj"],
        name=f"matmul_{letter}",
    ))
    nodes.append(helper.make_node("Add",
        inputs=[f"{letter}_proj", f"b_{letter}"],
        outputs=[f"{letter}"],
        name=f"add_bias_{letter}",
    ))

# --- Reshape to [batch, seq, n_heads, d_k] ---
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("Reshape",
        inputs=[letter, "shape_heads"],
        outputs=[f"{letter}_h"],
        name=f"reshape_{letter}",
    ))

# --- Transpose to [batch, n_heads, seq, d_k] ---
for letter in ["Q", "K", "V"]:
    nodes.append(helper.make_node("Transpose",
        inputs=[f"{letter}_h"],
        outputs=[f"{letter}_t"],
        perm=[0, 2, 1, 3],
        name=f"transpose_{letter}",
    ))

# --- Attention scores: Q @ K^T ---
nodes.append(helper.make_node("Transpose",
    inputs=["K_t"],
    outputs=["K_t_T"],
    perm=[0, 1, 3, 2],
    name="transpose_K_for_attn",
))

nodes.append(helper.make_node("MatMul",
    inputs=["Q_t", "K_t_T"],
    outputs=["raw_scores"],
    name="attn_scores",
))

nodes.append(helper.make_node("Mul",
    inputs=["raw_scores", "attn_scale"],
    outputs=["scaled_scores"],
    name="scale_scores",
))

nodes.append(helper.make_node("Softmax",
    inputs=["scaled_scores"],
    outputs=["attn_weights"],
    axis=-1,
    name="attn_softmax",
))

nodes.append(helper.make_node("MatMul",
    inputs=["attn_weights", "V_t"],
    outputs=["context_t"],
    name="attn_context",
))

nodes.append(helper.make_node("Transpose",
    inputs=["context_t"],
    outputs=["context_h"],
    perm=[0, 2, 1, 3],
    name="transpose_context",
))

nodes.append(helper.make_node("Reshape",
    inputs=["context_h", "shape_restore"],
    outputs=["context"],
    name="reshape_context",
))

nodes.append(helper.make_node("MatMul",
    inputs=["context", "W_O"],
    outputs=["attn_out_proj"],
    name="output_proj",
))
nodes.append(helper.make_node("Add",
    inputs=["attn_out_proj", "b_O"],
    outputs=["attn_out"],
    name="add_output_bias",
))

# Residual + LayerNorm
nodes.append(helper.make_node("Add",
    inputs=["x", "attn_out"],
    outputs=["residual1"],
    name="residual1",
))
nodes.append(helper.make_node("LayerNormalization",
    inputs=["residual1", "ln1_scale", "ln1_bias"],
    outputs=["ln1_out"],
    axis=-1,
    epsilon=eps,
    name="layernorm1",
))

# ================================================================== #
# FEED-FORWARD NETWORK                                                #
# ================================================================== #

nodes.append(helper.make_node("MatMul",
    inputs=["ln1_out", "W_ff1"],
    outputs=["ff1_proj"],
    name="ff1_proj",
))
nodes.append(helper.make_node("Add",
    inputs=["ff1_proj", "b_ff1"],
    outputs=["ff1"],
    name="ff1_bias",
))
nodes.append(helper.make_node("Relu",
    inputs=["ff1"],
    outputs=["ff1_relu"],
    name="ff1_relu",
))
nodes.append(helper.make_node("MatMul",
    inputs=["ff1_relu", "W_ff2"],
    outputs=["ff2_proj"],
    name="ff2_proj",
))
nodes.append(helper.make_node("Add",
    inputs=["ff2_proj", "b_ff2"],
    outputs=["ff2"],
    name="ff2_bias",
))

# Residual + LayerNorm
nodes.append(helper.make_node("Add",
    inputs=["ln1_out", "ff2"],
    outputs=["residual2"],
    name="residual2",
))
nodes.append(helper.make_node("LayerNormalization",
    inputs=["residual2", "ln2_scale", "ln2_bias"],
    outputs=["output"],
    axis=-1,
    epsilon=eps,
    name="layernorm2",
))

# ================================================================== #
# Graph assembly                                                      #
# ================================================================== #
x_info   = helper.make_tensor_value_info("x",      TensorProto.FLOAT, ["batch", seq_len, d_model])
out_info = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["batch", seq_len, d_model])

graph = helper.make_graph(nodes, "transformer_encoder_block",
    inputs=[x_info], outputs=[out_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8

onnx.checker.check_model(model)
onnx.save(model, "transformer_block.onnx")
print("Transformer block saved.")

Note

Transformer-specific ONNX patterns
  • 3D MatMul: When one operand is 2D [d_model, d_model] and the other is 3D [batch, seq, d_model], ONNX’s MatMul broadcasts over the batch dimension automatically.
  • Reshape + Transpose for multi-head attention: The head-splitting is entirely explicit. You reshape the projected Q/K/V to expose the head dimension, then transpose to make it the second axis for batched matrix multiplication.
  • LayerNormalization: Available from opset 17. It takes scale and bias as separate inputs (not attributes), and normalizes along all axes from axis to the last.
  • Broadcasting of the scale scalar: The attn_scale constant is a scalar np.float32 value. ONNX’s Mul operator broadcasts it across the entire [batch, heads, seq, seq] scores tensor without any reshape.
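
To make the head-splitting pattern concrete, here is a plain NumPy sketch of the same Reshape, Transpose, MatMul, and Softmax chain. It is only a reference for checking shapes and values, under the assumption that Q, K, and V are already projected; attention_reference is an illustrative name, not part of ONNX.

```python
import numpy as np

def attention_reference(Q, K, V, n_heads):
    """NumPy mirror of the Reshape/Transpose/MatMul/Softmax chain.

    Q, K, V: [batch, seq, d_model] arrays that have already been projected.
    Returns the merged context tensor of shape [batch, seq, d_model].
    """
    batch, seq, d_model = Q.shape
    d_k = d_model // n_heads

    def split_heads(t):
        # [batch, seq, d_model] -> [batch, n_heads, seq, d_k]
        return t.reshape(batch, seq, n_heads, d_k).transpose(0, 2, 1, 3)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = (Qh @ Kh.transpose(0, 1, 3, 2)) / np.sqrt(d_k)  # [batch, heads, seq, seq]
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over last axis
    context = weights @ Vh                                   # [batch, heads, seq, d_k]
    # [batch, heads, seq, d_k] -> [batch, seq, d_model]
    return context.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((2, 10, 64)).astype(np.float32) for _ in range(3))
context = attention_reference(Q, K, V, n_heads=4)
print(context.shape)  # (2, 10, 64)
```

Comparing this reference against the "context" tensor of the ONNX graph (exposed as an extra output) is one way to validate the hand-built nodes.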

Building a Residual (ResNet-style) Block

Residual connections are essential for deep networks. In ONNX, they are simply Add nodes where one input comes from earlier in the graph.

import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper

def make_conv_bn_relu(x_name, out_name, in_ch, out_ch, stride, inits, nodes, kH=3, kW=3):
    """Adds Conv → BN → ReLU nodes and their initializers in-place."""
    fan_in  = in_ch * kH * kW
    # Cast last: multiplying by a NumPy float64 scalar can promote the array to float64
    W_data  = (np.random.randn(out_ch, in_ch, kH, kW) * np.sqrt(2.0 / fan_in)).astype(np.float32)
    W_init  = numpy_helper.from_array(W_data, name=f"{out_name}_cW")
    b_init  = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_cb")
    sc_init = numpy_helper.from_array(np.ones(out_ch,  dtype=np.float32), name=f"{out_name}_bns")
    bi_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bnb")
    mn_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bnm")
    vr_init = numpy_helper.from_array(np.ones(out_ch,  dtype=np.float32), name=f"{out_name}_bnv")
    inits += [W_init, b_init, sc_init, bi_init, mn_init, vr_init]

    conv_out = f"{out_name}_c"
    bn_out   = f"{out_name}_bn"

    nodes.append(helper.make_node("Conv",
        inputs=[x_name, f"{out_name}_cW", f"{out_name}_cb"],
        outputs=[conv_out],
        kernel_shape=[kH, kW],
        strides=[stride, stride],
        pads=[kH//2, kW//2, kH//2, kW//2],
        name=f"{out_name}_conv",
    ))
    nodes.append(helper.make_node("BatchNormalization",
        inputs=[conv_out, f"{out_name}_bns", f"{out_name}_bnb",
                f"{out_name}_bnm", f"{out_name}_bnv"],
        outputs=[bn_out],
        epsilon=1e-5, momentum=0.9,  # ONNX convention: weight on the running statistic; unused at inference
        name=f"{out_name}_bn",
    ))
    nodes.append(helper.make_node("Relu",
        inputs=[bn_out],
        outputs=[out_name],
        name=f"{out_name}_relu",
    ))

def make_residual_block(x_name, out_name, in_ch, out_ch, stride, inits, nodes):
    """
    A basic ResNet residual block.
    If in_ch != out_ch or stride != 1, a 1x1 projection shortcut is added.
    """
    mid_name = f"{out_name}_mid"
    make_conv_bn_relu(x_name, mid_name, in_ch, out_ch, stride, inits, nodes)

    fan_in  = out_ch * 3 * 3
    W2_data = (np.random.randn(out_ch, out_ch, 3, 3) * np.sqrt(2.0 / fan_in)).astype(np.float32)
    W2_init = numpy_helper.from_array(W2_data, name=f"{out_name}_c2W")
    b2_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_c2b")
    sc2     = numpy_helper.from_array(np.ones(out_ch,  dtype=np.float32), name=f"{out_name}_bn2s")
    bi2     = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bn2b")
    mn2     = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_bn2m")
    vr2     = numpy_helper.from_array(np.ones(out_ch,  dtype=np.float32), name=f"{out_name}_bn2v")
    inits  += [W2_init, b2_init, sc2, bi2, mn2, vr2]

    conv2_out = f"{out_name}_c2"
    bn2_out   = f"{out_name}_bn2"

    nodes.append(helper.make_node("Conv",
        inputs=[mid_name, f"{out_name}_c2W", f"{out_name}_c2b"],
        outputs=[conv2_out],
        kernel_shape=[3, 3], strides=[1, 1], pads=[1, 1, 1, 1],
        name=f"{out_name}_conv2",
    ))
    nodes.append(helper.make_node("BatchNormalization",
        inputs=[conv2_out, f"{out_name}_bn2s", f"{out_name}_bn2b",
                f"{out_name}_bn2m", f"{out_name}_bn2v"],
        outputs=[bn2_out],
        epsilon=1e-5, momentum=0.9,  # ONNX convention: weight on the running statistic; unused at inference
        name=f"{out_name}_bn2",
    ))

    if in_ch != out_ch or stride != 1:
        Ws_data = (np.random.randn(out_ch, in_ch, 1, 1) * np.sqrt(2.0 / in_ch)).astype(np.float32)
        Ws_init = numpy_helper.from_array(Ws_data, name=f"{out_name}_sW")
        bs_init = numpy_helper.from_array(np.zeros(out_ch, dtype=np.float32), name=f"{out_name}_sb")
        inits  += [Ws_init, bs_init]
        shortcut_name = f"{out_name}_shortcut"
        nodes.append(helper.make_node("Conv",
            inputs=[x_name, f"{out_name}_sW", f"{out_name}_sb"],
            outputs=[shortcut_name],
            kernel_shape=[1, 1], strides=[stride, stride], pads=[0, 0, 0, 0],
            name=f"{out_name}_shortcut_conv",
        ))
    else:
        shortcut_name = x_name

    nodes.append(helper.make_node("Add",
        inputs=[bn2_out, shortcut_name],
        outputs=[f"{out_name}_sum"],
        name=f"{out_name}_add",
    ))
    nodes.append(helper.make_node("Relu",
        inputs=[f"{out_name}_sum"],
        outputs=[out_name],
        name=f"{out_name}_relu_final",
    ))

# ------------------------------------------------------------------ #
# Build a tiny ResNet                                                 #
# ------------------------------------------------------------------ #
inits = []
nodes = []

make_conv_bn_relu("x", "stem_out", in_ch=3, out_ch=64, stride=2, inits=inits, nodes=nodes, kH=7, kW=7)
nodes.append(helper.make_node("MaxPool",
    inputs=["stem_out"], outputs=["pool_out"],
    kernel_shape=[3, 3], strides=[2, 2], pads=[1, 1, 1, 1],
    name="stem_pool",
))

make_residual_block("pool_out", "layer1a", in_ch=64, out_ch=64, stride=1, inits=inits, nodes=nodes)
make_residual_block("layer1a",  "layer1b", in_ch=64, out_ch=64, stride=1, inits=inits, nodes=nodes)
make_residual_block("layer1b", "layer2a", in_ch=64,  out_ch=128, stride=2, inits=inits, nodes=nodes)
make_residual_block("layer2a", "layer2b", in_ch=128, out_ch=128, stride=1, inits=inits, nodes=nodes)

nodes.append(helper.make_node("GlobalAveragePool",
    inputs=["layer2b"], outputs=["gap_out"],
    name="global_avg_pool",
))
nodes.append(helper.make_node("Flatten",
    inputs=["gap_out"], outputs=["flat_out"],
    axis=1, name="flatten",
))

fc_W = numpy_helper.from_array(
    np.random.randn(128, 10).astype(np.float32) * 0.01, name="fc_W")
fc_b = numpy_helper.from_array(np.zeros(10, dtype=np.float32), name="fc_b")
inits += [fc_W, fc_b]

nodes.append(helper.make_node("Gemm",
    inputs=["flat_out", "fc_W", "fc_b"],
    outputs=["logits"], alpha=1.0, beta=1.0, name="classifier",
))
nodes.append(helper.make_node("Softmax",
    inputs=["logits"], outputs=["probs"], axis=-1, name="softmax",
))

x_info    = helper.make_tensor_value_info("x",     TensorProto.FLOAT, ["batch", 3, 224, 224])
prob_info = helper.make_tensor_value_info("probs", TensorProto.FLOAT, ["batch", 10])

graph = helper.make_graph(nodes, "tiny_resnet",
    inputs=[x_info], outputs=[prob_info], initializer=inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8

onnx.checker.check_model(model)
onnx.save(model, "tiny_resnet.onnx")
print("ResNet-style model saved.")
Tip

The residual block pattern is elegant in ONNX because the “skip connection” is just a string: you pass the same input name x_name to both the main path and the shortcut Add node. The graph structure itself encodes the skip without any special syntax.
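
The Add node is only valid when the main path and the shortcut produce the same shape. A small sketch of the standard output-size formula (dilation 1 and floor rounding, which is the default for Conv and MaxPool) traces the spatial size through the tiny ResNet above. conv_out_size is a hypothetical helper introduced here for illustration.

```python
def conv_out_size(size, kernel, stride, pad_begin, pad_end):
    """Spatial output size of Conv/MaxPool with dilation 1 and floor rounding."""
    return (size + pad_begin + pad_end - kernel) // stride + 1

stem   = conv_out_size(224,  kernel=7, stride=2, pad_begin=3, pad_end=3)  # stem conv
pool   = conv_out_size(stem, kernel=3, stride=2, pad_begin=1, pad_end=1)  # stem max-pool
layer1 = conv_out_size(pool, kernel=3, stride=1, pad_begin=1, pad_end=1)  # stride-1 convs keep size

# layer2a: the strided 3x3 main path and the strided 1x1 shortcut must agree
layer2_main     = conv_out_size(layer1, kernel=3, stride=2, pad_begin=1, pad_end=1)
layer2_shortcut = conv_out_size(layer1, kernel=1, stride=2, pad_begin=0, pad_end=0)

print(stem, pool, layer1, layer2_main, layer2_shortcut)  # 112 56 56 28 28
```

Because both layer2a paths end at 28, the Add in that block receives matching [batch, 128, 28, 28] tensors.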

Initializers, Constants, and Weight Management

There are two ways to embed constant data in an ONNX graph.

Initializers are TensorProto objects stored in graph.initializer. They represent parameters (weights, biases) or other constant tensors. They are the preferred way to store large parameter tensors because they are memory-efficient and can be memory-mapped at load time.

W = numpy_helper.from_array(np.eye(64, dtype=np.float32), name="identity_W")
# Add to graph initializer list

Constant nodes embed a tensor directly inside a NodeProto. Use these for small scalars or integer constants that are needed at a single point in the graph (like reshape targets):

const_node = helper.make_node(
    "Constant",
    inputs=[],
    outputs=["const_value"],
    value=helper.make_tensor(
        name="",
        data_type=TensorProto.FLOAT,
        dims=[],          # scalar
        vals=[0.5],
    ),
)

For integer shape tensors (common when using Reshape), you can also store them as initializers:

shape_const = numpy_helper.from_array(
    np.array([-1, 128], dtype=np.int64), name="reshape_target"
)

Weight Initialization Strategies

Since ONNX weights are just NumPy arrays, you apply initialization schemes yourself:

import numpy as np

# He (Kaiming) initialization for ReLU networks.
# Assumes the Conv layout [out_ch, in_ch, kH, kW], so fan_in = prod(shape[1:]).
def he_init(shape):
    fan_in = int(np.prod(shape[1:])) if len(shape) > 1 else shape[0]
    # Cast last: multiplying by a NumPy float64 scalar can promote to float64
    return (np.random.randn(*shape) * np.sqrt(2.0 / fan_in)).astype(np.float32)

# Glorot (Xavier) initialization for tanh/sigmoid.
# Written for 2-D weights stored as [fan_in, fan_out].
def glorot_init(shape):
    fan_in  = shape[0]
    fan_out = shape[1] if len(shape) > 1 else shape[0]
    limit   = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, shape).astype(np.float32)

# Orthogonal initialization (good for RNNs)
def orthogonal_init(shape):
    flat = np.random.randn(shape[0], int(np.prod(shape[1:]))).astype(np.float32)
    U, _, Vt = np.linalg.svd(flat, full_matrices=False)
    return (U if U.shape == flat.shape else Vt).reshape(shape)
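
A quick way to convince yourself that the SVD trick yields an orthogonal matrix is to check the Gram matrix. The sketch below uses a seeded 2-D variant, orthogonal_matrix, which is a hypothetical helper written for this check rather than a function from any library.

```python
import numpy as np

def orthogonal_matrix(rows, cols, seed=0):
    """Hypothetical helper: orthogonal init for a 2-D weight via SVD."""
    rng = np.random.default_rng(seed)
    flat = rng.standard_normal((rows, cols))
    U, _, Vt = np.linalg.svd(flat, full_matrices=False)
    # Whichever factor has the requested shape carries the orthonormal vectors
    Q = U if U.shape == flat.shape else Vt
    return Q.astype(np.float32)

wide = orthogonal_matrix(4, 16)   # more columns than rows: rows are orthonormal
tall = orthogonal_matrix(16, 4)   # more rows than columns: columns are orthonormal

print(np.allclose(wide @ wide.T, np.eye(4), atol=1e-5))  # True
print(np.allclose(tall.T @ tall, np.eye(4), atol=1e-5))  # True
```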

Shape Inference and Validation

ONNX provides automatic shape inference — it propagates shapes through the graph so you can verify that all intermediate tensor shapes are correct before running.

import onnx
from onnx import shape_inference

model = onnx.load("my_model.onnx")
inferred_model = shape_inference.infer_shapes(model)

# Now inspect inferred shapes
for vi in inferred_model.graph.value_info:
    t = vi.type.tensor_type
    shape = [d.dim_value if d.HasField("dim_value") else d.dim_param
             for d in t.shape.dim]
    print(f"  {vi.name}: {t.elem_type} {shape}")
Tip

Always run both onnx.checker.check_model (structural validity) and shape_inference.infer_shapes (shape consistency) after building a model. The checker will catch malformed protos. By default, infer_shapes simply stops when it hits an inconsistency instead of raising; pass strict_mode=True so that shape mismatches surface as errors before you waste time debugging at runtime.

Checking Shapes Programmatically

def get_shape(model, tensor_name):
    """Return the inferred shape of any named tensor in the model."""
    inferred = shape_inference.infer_shapes(model)
    all_vi   = (list(inferred.graph.input)
               + list(inferred.graph.value_info)
               + list(inferred.graph.output))
    for vi in all_vi:
        if vi.name == tensor_name:
            t = vi.type.tensor_type
            # HasField distinguishes a fixed size from a symbolic dimension name
            return [d.dim_value if d.HasField("dim_value") else d.dim_param
                    for d in t.shape.dim]
    return None

shape = get_shape(model, "relu1_out")
print(f"relu1_out shape: {shape}")

Running Inference with ONNX Runtime

import onnxruntime as ort
import numpy as np

# Load the session
sess = ort.InferenceSession("mlp.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Inspect inputs and outputs
for inp in sess.get_inputs():
    print(f"Input:  {inp.name} | shape: {inp.shape} | type: {inp.type}")
for out in sess.get_outputs():
    print(f"Output: {out.name} | shape: {out.shape} | type: {out.type}")

# Run inference
x = np.random.randn(8, 784).astype(np.float32)
outputs = sess.run(
    output_names=["probs"],  # None means "return all outputs"
    input_feed={"x": x},
)
probs = outputs[0]
print(f"Output shape: {probs.shape}")
print(f"Predictions:  {probs.argmax(axis=1)}")
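
Before calling sess.run, it can help to compare an array against the declared input shape. The sketch below assumes the convention used by sess.get_inputs(), where fixed dimensions are integers and dynamic ones are strings or None; feed_matches is a hypothetical helper, not an ONNX Runtime function.

```python
def feed_matches(declared_shape, array_shape):
    """Return True if array_shape is compatible with a declared input shape.

    Integers in declared_shape are fixed sizes; strings or None mark
    dynamic dimensions that accept any size.
    """
    if len(declared_shape) != len(array_shape):
        return False
    return all(not isinstance(d, int) or d == a
               for d, a in zip(declared_shape, array_shape))

print(feed_matches(["batch", 784], (8, 784)))     # True
print(feed_matches(["batch", 784], (8, 28, 28)))  # False (rank mismatch)
print(feed_matches(["batch", 784], (8, 100)))     # False (fixed dim mismatch)
```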

Choosing an Execution Provider

ONNX Runtime supports multiple backends. Pass them in priority order:

sess = ort.InferenceSession("model.onnx", providers=[
    ("TensorrtExecutionProvider", {"device_id": 0}),
    ("CUDAExecutionProvider",     {"device_id": 0}),
    "CPUExecutionProvider",
])
print(sess.get_providers())  # shows which providers were actually activated
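
If you prefer to avoid warnings about unavailable providers, one option is to filter the priority list yourself. This is a sketch: pick_providers is a hypothetical helper, and the hard-coded available list stands in for the result of onnxruntime.get_available_providers().

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that are actually available, in order."""
    chosen = []
    for entry in preferred:
        # Entries are either "Name" or ("Name", {options})
        name = entry[0] if isinstance(entry, tuple) else entry
        if name in available:
            chosen.append(entry)
    return chosen

preferred = [
    ("TensorrtExecutionProvider", {"device_id": 0}),
    ("CUDAExecutionProvider",     {"device_id": 0}),
    "CPUExecutionProvider",
]
# In a real script this list would come from onnxruntime.get_available_providers()
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]

providers = pick_providers(preferred, available)
print(providers)
```

The filtered list can then be passed as the providers argument of InferenceSession.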

Session Options

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4
opts.enable_profiling = False

sess = ort.InferenceSession("model.onnx", sess_options=opts,
    providers=["CPUExecutionProvider"])

Inspecting and Debugging ONNX Graphs

Printing Graph Structure

import onnx

model = onnx.load("model.onnx")
graph = model.graph

print(f"Opset: {model.opset_import[0].version}")
print(f"\nInputs:")
for inp in graph.input:
    print(f"  {inp.name}")

print(f"\nOutputs:")
for out in graph.output:
    print(f"  {out.name}")

print(f"\nInitializers: {len(graph.initializer)} tensors")
for init in graph.initializer:
    shape = list(init.dims)
    dtype = init.data_type
    print(f"  {init.name:30s} shape={shape}, dtype={dtype}")

print(f"\nNodes ({len(graph.node)} total):")
for node in graph.node:
    attrs = [a.name for a in node.attribute]  # attribute names set on this node
    print(f"  [{node.op_type:20s}] {list(node.input)} → {list(node.output)}  attrs={attrs}")

Extracting Intermediate Outputs

You can expose intermediate tensors as additional graph outputs for debugging:

import onnx
from onnx import shape_inference

model    = onnx.load("mlp.onnx")
inferred = shape_inference.infer_shapes(model)

# Identify the value_info for intermediate tensor "relu1_out"
vi_to_expose = None
for vi in inferred.graph.value_info:
    if vi.name == "relu1_out":
        vi_to_expose = vi
        break

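# Add it as a graph output (fail loudly if the tensor name was not found)
if vi_to_expose is None:
    raise ValueError("relu1_out not found among the inferred intermediate tensors")
debug_model = onnx.ModelProto()
debug_model.CopyFrom(inferred)
debug_model.graph.output.append(vi_to_expose)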

onnx.save(debug_model, "mlp_debug.onnx")

Using Netron

import subprocess
subprocess.Popen(["netron", "model.onnx"])
# or just open the file directly in the Netron app
Tip

Netron renders the full computation graph in a browser. Each node shows its op type, attributes, and input/output tensor names with their inferred shapes (if you ran shape inference). It is the single most useful tool for understanding and debugging ONNX models.

Advanced Techniques: Control Flow, Subgraphs, and Custom Ops

Control Flow: If, Loop, Scan

ONNX supports limited control flow via three special operators. These operators contain subgraphs (nested GraphProto objects) inside their attributes.

If: Conditional execution. Takes a boolean scalar condition and contains two subgraph attributes: then_branch and else_branch.

# Pseudocode — then_branch and else_branch are full GraphProto objects
if_node = helper.make_node(
    "If",
    inputs=["condition"],
    outputs=["result"],
    then_branch=then_graph,
    else_branch=else_graph,
)

Loop: A counted or condition-based loop. Takes a trip count, initial condition, and initial state tensors, and runs a body subgraph repeatedly.

Scan: Applies a body subgraph across the time axis of sequence inputs, accumulating state. Useful for custom RNNs.

These operators are powerful but complex. Their subgraphs must be complete valid GraphProto objects with their own inputs and outputs. Building them requires careful management of variable names and scoping.

Custom Operators

If you need an operation not in the ONNX standard set, you can define a custom operator with a non-standard domain:

custom_node = helper.make_node(
    op_type="MySpecialOp",
    domain="com.mycompany",
    inputs=["x"],
    outputs=["y"],
    my_custom_attr=42,
    name="custom_op_1",
)

# Register the custom domain in the opset imports
model = helper.make_model(
    graph,
    opset_imports=[
        helper.make_opsetid("", 17),
        helper.make_opsetid("com.mycompany", 1),
    ],
)

To run custom ops with ONNX Runtime, you must supply a kernel implementation. The core onnxruntime package loads kernels from a compiled shared library registered on the session options; Python-implemented kernels are possible through the separate onnxruntime-extensions package.

import onnxruntime as ort

opts = ort.SessionOptions()
# Load a compiled library that implements MySpecialOp in the "com.mycompany"
# domain. The path is a placeholder for your own build artifact.
opts.register_custom_ops_library("path/to/libmy_custom_ops.so")

sess = ort.InferenceSession(
    "model_with_custom_op.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)

Function-Based Operators

ONNX also allows you to define FunctionProto objects — named, reusable operator definitions composed of existing ONNX ops. These let you package composite operations (like a Transformer block) as a single named op that expands to primitives during execution:

from onnx import helper, TensorProto

func = helper.make_function(
    domain="com.myarch",
    fname="LayerNormFunc",
    inputs=["X", "scale", "bias"],
    outputs=["Y"],
    nodes=[...],  # the expanded graph nodes
    opset_imports=[helper.make_opsetid("", 17)],
)
model.functions.append(func)

Optimization and Graph Transformations

Raw hand-built ONNX graphs are often not as efficient as they could be. Several tools exist to optimize them.

ONNX Runtime Graph Optimizations

The simplest approach is to let ONNX Runtime’s optimizer do the work at load time:

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Optionally, save the optimized model for inspection
opts.optimized_model_filepath = "optimized_model.onnx"

sess = ort.InferenceSession("model.onnx", sess_options=opts,
    providers=["CPUExecutionProvider"])

ONNX Runtime performs operator fusions (for example, folding BatchNormalization into a preceding Conv and fusing Conv with its activation), dead code elimination, constant folding, and more.

ONNX Simplifier

onnx-simplifier is a third-party tool that applies constant folding and other simplifications:

pip install onnxsim
python -m onnxsim model.onnx simplified_model.onnx

Or programmatically:

from onnxsim import simplify
import onnx

model      = onnx.load("model.onnx")
simplified, check = simplify(model)
assert check, "Simplified ONNX model could not be validated!"
onnx.save(simplified, "simplified_model.onnx")

Manual Graph Surgery with onnx.helper and onnx.compose

The onnx.compose module (ONNX ≥ 1.13) provides merge_models and add_prefix utilities for combining and modifying graphs:

from onnx import compose

# Merge two models sequentially (output of model1 feeds input of model2)
combined = compose.merge_models(
    model1, model2,
    io_map=[("model1_output", "model2_input")],
)

For direct graph surgery (removing nodes, inserting nodes, rewiring edges), you work directly with the graph.node list:

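import onnx
from onnx import helper

model = onnx.load("model.onnx")
graph = model.graph

def rewire(graph, old_name, new_name):
    """Point every consumer of old_name at new_name instead."""
    for n in graph.node:
        for i, name in enumerate(n.input):
            if name == old_name:
                n.input[i] = new_name

# Remove a node by name, reconnecting its consumers to its input.
# (Repeated message fields do not support slice assignment, so use remove().)
removed = next(n for n in graph.node if n.name == "relu_to_remove")
rewire(graph, removed.output[0], removed.input[0])
graph.node.remove(removed)

# Insert a new node after "linear" and route downstream consumers through it
rewire(graph, "linear_out", "tanh_out")
new_node = helper.make_node("Tanh", inputs=["linear_out"], outputs=["tanh_out"], name="tanh_new")
insert_idx = next(i for i, n in enumerate(graph.node) if n.name == "linear")
graph.node.insert(insert_idx + 1, new_node)

onnx.checker.check_model(model)
onnx.save(model, "modified_model.onnx")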

Quantization

ONNX Runtime provides post-training quantization tools:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quant.onnx",
    weight_type=QuantType.QInt8,
)

For static quantization (requires calibration data):

from onnxruntime.quantization import (
    quantize_static, CalibrationDataReader, QuantFormat, QuantType)

class MyCalibReader(CalibrationDataReader):
    def get_next(self):
        # Return the next {input_name: ndarray} dict, or None when the
        # calibration data is exhausted
        ...

quantize_static(
    model_input="model.onnx",
    model_output="model_quant_static.onnx",
    calibration_data_reader=MyCalibReader(),
    quant_format=QuantFormat.QDQ,        # layout of the quantized graph
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
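
To see what quantization does numerically, here is a NumPy sketch of per-tensor symmetric int8 quantization. It illustrates the arithmetic only and is not ONNX Runtime's implementation, which also handles zero points, per-channel scales, and calibration; both function names are hypothetical.

```python
import numpy as np

def quantize_symmetric_int8(w):
    """Per-tensor symmetric quantization: w is approximated by scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                      # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to float32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)

max_err = float(np.abs(w - w_hat).max())
print(q.dtype, f"scale={scale:.5f}", f"max_err={max_err:.5f}")
```

The round-trip error is bounded by roughly half the scale, which is why tensors with a few large outliers quantize poorly: the outliers inflate the scale for every other value.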

Best Practices and Common Pitfalls

Always Use Unique Tensor Names

Every tensor name in the graph must be unique: the ONNX IR requires single static assignment (SSA) form, so each name may be produced by exactly one node. onnx.checker.check_model rejects a graph in which two nodes write the same name. A simple convention is to prefix names with the layer or block name:

"block2_conv1_out"  rather than  "conv_out"

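When assembling node lists programmatically, a small check can catch accidental name reuse early. The sketch below is a hypothetical helper that works on any objects exposing an output sequence, which includes the NodeProto entries of graph.node; the SimpleNamespace objects are stand-ins for real nodes.

```python
from collections import Counter
from types import SimpleNamespace

def find_duplicate_outputs(nodes):
    """Return the output names that are produced by more than one node."""
    counts = Counter(name for node in nodes for name in node.output if name)
    return sorted(name for name, n in counts.items() if n > 1)

# Stand-in nodes for illustration; with a real model, pass model.graph.node
nodes = [
    SimpleNamespace(output=["conv_out"]),
    SimpleNamespace(output=["block2_conv1_out"]),
    SimpleNamespace(output=["conv_out"]),   # accidental reuse
]
print(find_duplicate_outputs(nodes))  # ['conv_out']
```
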
Match Opset to Your Runtime

ONNX Runtime versions support specific ONNX opset ranges. Using an opset that is too new will cause load failures. Check the ONNX Runtime release notes for the supported opset range, and pin your opset_imports accordingly. Opset 17 is a safe choice for most current runtimes as of 2025.

Initializer vs. Graph Input: Know the Difference

Initializers represent constant parameters that are part of the model. Graph inputs are external tensors provided at inference time. Do not list your weights in graph.input; they belong only in graph.initializer. ONNX Runtime treats an initializer that also appears in graph.input as an overridable default rather than a constant, which blocks optimizations such as constant folding, and it logs a warning to that effect.

In older ONNX IR versions (IR < 4), initializers were required to also appear as graph inputs. From IR version 4 onward, this is no longer needed. Set model.ir_version = 8 and list weights only as initializers.

Check Data Types Carefully

Warning

All of the following will cause silently incorrect results or runtime errors if you mix them up:

  • Mixing float32 and float64 inputs/weights without an explicit Cast.
  • Using Python int (64-bit) where the model expects int32.
  • Passing NHWC image data to a Conv that expects NCHW.

Always verify numpy dtypes when constructing initializers:

W = my_array.astype(np.float32)  # always explicit
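
The layout and dtype pitfalls often appear together when feeding image data. A minimal sketch of a conversion helper follows; to_nchw_f32 is a hypothetical name, and the random batch is placeholder data.

```python
import numpy as np

def to_nchw_f32(images_nhwc):
    """Convert an NHWC image batch of any numeric dtype to NCHW float32."""
    if images_nhwc.ndim != 4:
        raise ValueError(f"expected a 4-D NHWC batch, got shape {images_nhwc.shape}")
    # ascontiguousarray materializes the permuted layout instead of a strided view
    return np.ascontiguousarray(images_nhwc.transpose(0, 3, 1, 2), dtype=np.float32)

batch = np.random.rand(2, 224, 224, 3)   # NumPy defaults to float64
x = to_nchw_f32(batch)
print(x.shape, x.dtype)  # (2, 3, 224, 224) float32
```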

Pads Are Flat Begin/End Lists, Not Single Values

The pads attribute on Conv and MaxPool is a flat list of all padding values: [pad_h_begin, pad_w_begin, pad_h_end, pad_w_end] for 2D. For 3D convolutions it extends further. Do not pass a single integer.
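
Begin and end padding need not be equal. The sketch below computes a pads list that keeps the output size at ceil(input / stride), placing any odd padding at the end as the SAME_UPPER convention does. It assumes dilation 1, and both helper names are hypothetical.

```python
import math

def same_upper_pads_1d(size, kernel, stride):
    """Begin/end padding so that the output size equals ceil(size / stride)."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    begin = total // 2
    return begin, total - begin          # extra padding, if any, goes at the end

def same_upper_pads_2d(h, w, kH, kW, stride):
    """Flat ONNX pads list: [h_begin, w_begin, h_end, w_end]."""
    hb, he = same_upper_pads_1d(h, kH, stride)
    wb, we = same_upper_pads_1d(w, kW, stride)
    return [hb, wb, he, we]

print(same_upper_pads_2d(224, 224, 3, 3, stride=1))  # [1, 1, 1, 1]
print(same_upper_pads_2d(224, 224, 3, 3, stride=2))  # [0, 0, 1, 1]
```

The second result shows an asymmetric case: a 3x3 kernel with stride 2 on an even-sized input needs padding only at the end of each axis.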

Pass axes to Squeeze and Unsqueeze as an Input (Opset ≥ 13)

In ONNX opset 13+, the axes argument to Squeeze and Unsqueeze moved from an attribute to an input tensor. This means you must create a constant tensor for it:

axes_const = numpy_helper.from_array(np.array([0], dtype=np.int64), name="squeeze_axes")
inits.append(axes_const)

squeeze_node = helper.make_node("Squeeze",
    inputs=["my_tensor", "squeeze_axes"],
    outputs=["squeezed"],
    name="squeeze",
)

Reshape Takes a Tensor Input, Not an Attribute

In opset 5+, the target shape for Reshape is a 1D INT64 tensor input, not an attribute. Store it as an initializer:

target_shape = numpy_helper.from_array(np.array([-1, 128], dtype=np.int64), name="tgt_shape")
inits.append(target_shape)

reshape_node = helper.make_node("Reshape",
    inputs=["flat_input", "tgt_shape"],
    outputs=["reshaped"],
)

Profile Before Optimizing

ONNX Runtime provides built-in profiling. Enable it to find bottleneck operators before spending time on manual optimizations:

opts = ort.SessionOptions()
opts.enable_profiling = True
sess = ort.InferenceSession("model.onnx", sess_options=opts)
sess.run(...)
prof_file = sess.end_profiling()  # returns path to JSON profile
# Open in Chrome at chrome://tracing

Reference: Commonly Used ONNX Operators

Below is a quick-reference table of the operators used most frequently in architecture construction, with their key attributes and input/output conventions.

Commonly used ONNX operators with key attributes and output shapes

Operator           | Key Inputs             | Key Attributes                                | Output Shape (example)
-------------------|------------------------|-----------------------------------------------|-------------------------
Gemm               | A, B, C (bias)         | transA, transB, alpha, beta                   | [M, N]
MatMul             | A, B                   | -                                             | [..., M, N]
Conv               | X, W, B                | kernel_shape, strides, pads, dilations, group | [N, C_out, H_out, W_out]
ConvTranspose      | X, W, B                | kernel_shape, strides, pads, output_padding   | [N, C_out, H_out, W_out]
BatchNormalization | X, scale, B, mean, var | epsilon, momentum                             | same as X
LayerNormalization | X, scale, B            | axis, epsilon                                 | same as X
Relu               | X                      | -                                             | same as X
Sigmoid            | X                      | -                                             | same as X
Tanh               | X                      | -                                             | same as X
Softmax            | X                      | axis                                          | same as X
Gelu (opset 20+)   | X                      | approximate                                   | same as X
MaxPool            | X                      | kernel_shape, strides, pads                   | [N, C, H_out, W_out]
GlobalAveragePool  | X                      | -                                             | [N, C, 1, 1]
Reshape            | data, shape            | -                                             | as specified by shape
Flatten            | X                      | axis                                          | [N, M]
Transpose          | X                      | perm                                          | permuted axes
Squeeze            | X, axes                | -                                             | removes specified dims
Unsqueeze          | X, axes                | -                                             | inserts specified dims
Concat             | inputs…                | axis                                          | concatenated
Split              | X, split (optional)    | axis                                          | list of tensors
Add                | A, B                   | -                                             | broadcast shape
Mul                | A, B                   | -                                             | broadcast shape
ReduceMean         | X                      | axes, keepdims (axes: input in opset 18+)     | reduced shape
Cast               | X                      | to (dtype enum)                               | same shape, new dtype
Gather             | data, indices          | axis                                          | indexed shape
LSTM               | X, W, R, B             | hidden_size, direction                        | Y, Y_h, Y_c
GRU                | X, W, R, B             | hidden_size, direction                        | Y, Y_h
Where              | cond, X, Y             | -                                             | broadcast shape
Einsum             | inputs…                | equation                                      | per equation
Constant           | -                      | value                                         | shape of value
Shape              | X                      | -                                             | [rank(X)] INT64
Expand             | X, shape               | -                                             | broadcast target shape
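
As an example of reading an operator's conventions, Gemm computes Y = alpha * A' @ B' + beta * C, where A' and B' are optionally transposed. The NumPy sketch below mirrors that definition for checking results by hand; gemm_reference is an illustrative name and the inputs are toy values.

```python
import numpy as np

def gemm_reference(A, B, C=None, alpha=1.0, beta=1.0, transA=0, transB=0):
    """NumPy mirror of Gemm: Y = alpha * A' @ B' + beta * C."""
    A2 = A.T if transA else A
    B2 = B.T if transB else B
    Y = alpha * (A2 @ B2)
    if C is not None:
        Y = Y + beta * C          # C broadcasts to [M, N]
    return Y

x = np.ones((8, 128), dtype=np.float32)
W = np.full((128, 10), 0.5, dtype=np.float32)
b = np.arange(10, dtype=np.float32)

y = gemm_reference(x, W, b)
print(y.shape)    # (8, 10)
print(y[0, :3])   # [64. 65. 66.]

# Weights stored as [out, in] can be used directly with transB=1
y_t = gemm_reference(x, W.T.copy(), b, transB=1)
print(np.allclose(y, y_t))  # True
```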

Further Reading

For everything beyond this guide, the following resources are authoritative:


Guide written for ONNX opset 17, ONNX Runtime 1.18+, and Python 3.10+. All code examples use numpy 1.24+ and onnx 1.15+.