TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
A practical guide to deploying neural networks on embedded hardware — model architecture selection, quantisation, pruning, and benchmark frameworks from MobileNet to TensorFlow Lite Micro.
Server-side inference with GPT-4 or SDXL is straightforward — you have essentially unlimited compute. Edge inference means running a neural network on a Cortex-M33 with 512 KB flash and 256 KB SRAM, or on a Raspberry Pi without a discrete GPU, while consuming milliwatts of power. Every architectural decision — layer type, width multiplier, precision — has direct, measurable consequences for latency, memory footprint, and accuracy.
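To make the memory constraint concrete, here is a minimal sketch of a flash-budget check. It uses the 512 KB flash figure quoted above; the 250,000-parameter model and the function names are hypothetical, chosen only for illustration. Firmware and runtime code also occupy flash, so the real budget for weights will be smaller than this.

```python
# Rough flash-budget check. The parameter count below is hypothetical.
FLASH_BUDGET_BYTES = 512 * 1024  # Cortex-M33 flash size quoted in the text
BYTES_PER_WEIGHT = {"float32": 4, "int8": 1}


def model_flash_bytes(num_params: int, dtype: str, overhead_bytes: int = 0) -> int:
    """Estimate stored model size: weights plus any fixed overhead."""
    return num_params * BYTES_PER_WEIGHT[dtype] + overhead_bytes


def fits_in_flash(num_params: int, dtype: str, budget: int = FLASH_BUDGET_BYTES) -> bool:
    """True if the estimated model size is within the flash budget."""
    return model_flash_bytes(num_params, dtype) <= budget


if __name__ == "__main__":
    params = 250_000  # hypothetical small CNN
    for dtype in ("float32", "int8"):
        size_kb = model_flash_bytes(params, dtype) / 1024
        print(f"{dtype}: {size_kb:.1f} KB, fits: {fits_in_flash(params, dtype)}")
```

Under these assumptions the float32 weights alone exceed the flash budget, while the same weights stored as 8-bit integers fit with room to spare.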
Full-integer (INT8) post-training quantisation typically reduces model size by about 4x relative to float32 and inference latency by 2-3x, with less than 1% accuracy drop for most CNN architectures:
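import numpy as np
import tensorflow as tf
def representative_dataset():
    # Yield calibration inputs preprocessed exactly as they will be at inference time
    for image in calibration_images:  # 100-500 representative samples
        yield [image[np.newaxis, ...].astype(np.float32) / 255.0]
# `model` is the trained Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to integer-only kernels so conversion fails loudly if an op cannot be quantised
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# int8 input/output matches the data.int8 tensor access used in the TFLite Micro code
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
quantised_model = converter.convert()
print(f"Model size: {len(quantised_model) / 1024:.1f} KB")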
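The converter uses the representative dataset to estimate value ranges, then maps each range onto 8-bit integers. The following is a minimal sketch of that affine mapping, real_value = scale * (q - zero_point), in plain Python. The function names are hypothetical and the code illustrates the arithmetic only, not the converter's internals.

```python
def quant_params(x_min: float, x_max: float, q_min: int = -128, q_max: int = 127):
    """Derive scale and zero point so that [x_min, x_max] maps onto [q_min, q_max]."""
    # The representable range must include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        raise ValueError("range is empty; cannot derive a scale")
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantise(x: float, scale: float, zero_point: int, q_min: int = -128, q_max: int = 127) -> int:
    """Map a real value to the nearest integer level, clamped to the int8 range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantise(q: int, scale: float, zero_point: int) -> float:
    """Recover the approximate real value from its integer level."""
    return scale * (q - zero_point)


if __name__ == "__main__":
    scale, zero_point = quant_params(0.0, 1.0)  # e.g. pixel values normalised to [0, 1]
    q = quantise(0.25, scale, zero_point)
    print(q, dequantise(q, scale, zero_point))
```

Each value now occupies one byte instead of four, which is where the roughly 4x size reduction comes from. The round-trip error for any in-range value is at most half of `scale`, which is why a calibration range that is too wide costs accuracy.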
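Pruning zeroes out low-magnitude weights before quantisation. On domain-specific tasks this often enables a further 30-50% size reduction without accuracy degradation, provided the model is then compressed or stored in a sparse format, since zeroed weights otherwise still occupy space: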
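import tensorflow_model_optimization as tfmot
# Ramp sparsity from 0% to 50% of weights over the first 1000 training steps
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000,  # Should not exceed total training steps (batches per epoch x epochs)
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule)
pruned_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# UpdatePruningStep is required: it advances the pruning schedule during training
pruned_model.fit(train_data, epochs=10, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
# Remove the pruning wrappers before export, leaving a standard Keras model
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)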
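To show what the schedule and the pruning step do numerically, here is a minimal sketch in plain Python. It assumes a polynomial ramp with exponent 3, which to my understanding is the `PolynomialDecay` default, and a simple global magnitude threshold. The function names are hypothetical, and the library applies pruning per layer with masks during training rather than in one pass like this.

```python
def polynomial_sparsity(step: int, initial: float = 0.0, final: float = 0.5,
                        begin: int = 0, end: int = 1000, power: float = 3.0) -> float:
    """Target sparsity at a training step, ramping from `initial` to `final`."""
    if step <= begin:
        return initial
    if step >= end:
        return final
    progress = (step - begin) / (end - begin)
    # Sparsity rises quickly at first, then flattens as it approaches `final`.
    return final + (initial - final) * (1.0 - progress) ** power


def prune_by_magnitude(weights, sparsity: float):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    num_to_prune = int(len(weights) * sparsity)
    smallest_first = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in smallest_first[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    for step in (0, 500, 1000):
        print(step, polynomial_sparsity(step))
    print(prune_by_magnitude([0.9, -0.05, 0.3, 0.01], sparsity=0.5))
```

With the schedule parameters used above, most of the sparsity is introduced in the first half of the ramp, which leaves the later steps for the remaining weights to recover accuracy.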
TFLite Micro runs inference on bare-metal microcontrollers (no OS required). A complete inference loop on Cortex-M:
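#include <cstddef>
#include <cstdint>
#include <cstring>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/all_ops_resolver.h" // Newer TFLM releases favour MicroMutableOpResolver, which links only the ops you register
#include "model_data.h" // Model flatbuffer as C array
constexpr int kTensorArenaSize = 48 * 1024; // Tune based on model
alignas(16) uint8_t tensor_arena[kTensorArenaSize]; // Keep the arena 16-byte aligned
// Runs one inference; returns the int8 output buffer, or nullptr on failure
const int8_t* RunInference(const int8_t* input_data, size_t input_size) {
    // Statics are initialised once, on the first call
    static tflite::AllOpsResolver resolver;
    static const tflite::Model* model = tflite::GetModel(g_model_data);
    static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kTensorArenaSize);
    static const bool allocated = (interpreter.AllocateTensors() == kTfLiteOk);
    if (!allocated || input_size != interpreter.input(0)->bytes) return nullptr;
    // Copy input data
    std::memcpy(interpreter.input(0)->data.int8, input_data, input_size);
    if (interpreter.Invoke() != kTfLiteOk) return nullptr;
    return interpreter.output(0)->data.int8;
}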
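The `model_data.h` header referenced above has to come from somewhere. A common route is `xxd -i model.tflite`; the sketch below does the equivalent in Python. The function name and output layout are hypothetical, and a matching `extern` declaration of `g_model_data` in `model_data.h` is assumed.

```python
def to_c_array(data: bytes, name: str = "g_model_data", per_line: int = 12) -> str:
    """Render a byte string as a C/C++ array definition, similar to `xxd -i`."""
    lines = [f"alignas(16) const unsigned char {name}[] = {{"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        # Each byte becomes a two-digit hex literal; a trailing comma is valid in C and C++.
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Demo with placeholder bytes; in practice pass the contents of the .tflite file.
    print(to_c_array(b"\x1c\x00\x00\x00TFL3", per_line=4))
```

In a build script you would pass the bytes returned by the converter (or read from the `.tflite` file) and write the result to a `.cc` file that is compiled into the firmware image.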
For edge targets running Linux (Raspberry Pi, Jetson, Banana Pi), ONNX Runtime is often preferable to TFLite. It supports execution providers that leverage hardware-specific acceleration:
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession(
    "mobilenet_v3.onnx",
    providers=["CPUExecutionProvider"],  # Or "CUDAExecutionProvider" on Jetson
)
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: preprocessed_image})[0]
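Rather than hard-coding the provider list per device, one option is to choose it at start-up from whatever the installed build supports. This is a minimal sketch with a hypothetical helper, `pick_providers`. The provider names are the ones used above, and the list of available providers is passed in so the function itself has no dependencies.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are available, always ending with a CPU fallback."""
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        # Keep a CPU fallback so session creation does not depend on an accelerator.
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))  # e.g. a Jetson build
    print(pick_providers(["CPUExecutionProvider"]))  # e.g. a Raspberry Pi build
```

With ONNX Runtime installed, the available list can be obtained from `ort.get_available_providers()` and the result passed as the `providers` argument of `ort.InferenceSession`.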
Always benchmark on your target hardware under your operating conditions. Use:
- perf stat and perf record for CPU profiling on Linux targets
- benchmark_model --graph=model.tflite --num_runs=50 for TFLite model latency
Report p50 and p99 latency, not mean: p99 is what determines real-time viability.
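As an illustration of why percentiles matter more than the mean, here is a minimal timing-harness sketch using only the Python standard library. The function names, the warm-up count, and the nearest-rank percentile method are assumptions made for this example, not part of any benchmark tool mentioned above.

```python
import math
import time


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(fn, num_runs: int = 50, warmup: int = 5):
    """Time `fn` repeatedly and report p50, p99, and mean latency in milliseconds."""
    for _ in range(warmup):  # Warm caches and lazy initialisation before measuring
        fn()
    latencies_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
    }


if __name__ == "__main__":
    # 98 fast runs and 2 slow ones: the mean hides the tail that p99 exposes.
    samples = [10.0] * 98 + [100.0] * 2
    print(percentile(samples, 50), percentile(samples, 99), sum(samples) / len(samples))
```

To time the ONNX Runtime session from earlier, pass a callable such as `lambda: session.run(None, {input_name: preprocessed_image})` to `benchmark`. With only 50 runs, p99 is effectively the slowest sample, so use more runs when the tail matters.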
Deploying neural networks at the edge is an engineering discipline as much as a data science problem. Architecture selection, quantisation, and pruning must be co-optimised with target hardware constraints from the beginning, not bolted on at the end. The tools — TFLite, ONNX Runtime, TFLite Micro, CMSIS-NN — are mature and production-proven. The skill is in applying them together systematically to meet your specific latency and memory budget.