A practical guide to deploying neural networks on embedded hardware — model architecture selection, quantisation, pruning, and benchmark frameworks from MobileNet to TensorFlow Lite Micro.

The Edge Inference Constraint Problem

Server-side inference with GPT-4 or SDXL is comparatively straightforward from a resource standpoint: compute and memory scale with the budget. Edge inference means running a neural network on a Cortex-M33 with 512 KB flash and 256 KB SRAM at a power draw measured in milliwatts, or on a Raspberry Pi with no discrete GPU and only a few watts to spend. Every architectural decision (layer type, width multiplier, precision) has direct, measurable consequences for latency, memory footprint, and accuracy.
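
To make the constraint concrete, here is a rough back-of-the-envelope check, assuming weights are stored in flash and activations live in SRAM. The function names, the 300k-parameter model, and the 100 KB figures for code size and peak activations are hypothetical values chosen for illustration, not measurements.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights at the given precision (rounded up)."""
    return (num_params * bits_per_weight + 7) // 8


def fits_budget(num_params, bits_per_weight, peak_activation_bytes,
                flash_bytes=512 * 1024, sram_bytes=256 * 1024, code_bytes=0):
    """True if weights plus code fit in flash and peak activations fit in SRAM."""
    flash_needed = weight_bytes(num_params, bits_per_weight) + code_bytes
    return flash_needed <= flash_bytes and peak_activation_bytes <= sram_bytes


# Hypothetical 300k-parameter model on the 512 KB flash / 256 KB SRAM target
for bits in (32, 8):
    ok = fits_budget(300_000, bits, peak_activation_bytes=100 * 1024,
                     code_bytes=100 * 1024)
    print(f"{bits}-bit weights: {weight_bytes(300_000, bits) / 1024:.0f} KB, fits={ok}")
```

Under these assumptions the float32 model overflows flash on weights alone, while the 8-bit version leaves room for the runtime and application code.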

Architecture Families for Edge Deployment

MobileNetV3: built on depthwise separable convolutions, which cut multiply-accumulate operations by roughly 8-9x versus standard 3x3 convolutions with minimal accuracy loss. The "large" variant achieves 75.2% top-1 on ImageNet at 219M multiply-adds.
EfficientNet-Lite: compound scaling optimised for mobile; the Lite0 variant is designed specifically for inference hardware without advanced GPU ops.
SqueezeNet: extremely small (about 1.25M parameters, roughly 1.2 MB at 8-bit precision), designed for memory-constrained targets at the cost of some accuracy.
MCUNet: designed specifically for microcontrollers; its TinyNAS architecture search, co-designed with the TinyEngine runtime, generates models for specific hardware budgets.
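
The 8-9x figure for depthwise separable convolutions can be checked with simple arithmetic. The sketch below counts multiply-accumulates for one layer, assuming stride 1 and "same" padding so the output is h x w; the layer dimensions are arbitrary illustrative values.

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """MACs for a standard k x k convolution producing an h x w x c_out output."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


def reduction_factor(h, w, k, c_in, c_out):
    """How many times fewer MACs the separable form needs."""
    return (standard_conv_macs(h, w, k, c_in, c_out)
            / depthwise_separable_macs(h, w, k, c_in, c_out))


# Illustrative layer: 14 x 14 feature map, 3 x 3 kernel, 256 -> 256 channels
print(f"{reduction_factor(14, 14, 3, 256, 256):.2f}x fewer MACs")
```

The ratio simplifies to k²·c_out / (k² + c_out), so for 3x3 kernels it approaches 9x as the channel count grows and sits between 8x and 9x for typical layer widths.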

Post-Training Quantisation

Relative to float32, INT8 quantisation reduces model size by about 4x and typically cuts inference latency by 2-3x on hardware with integer kernels, with less than 1% accuracy drop for most CNN architectures:

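import numpy as np
import tensorflow as tf

# Assumes `model` is a trained Keras model and `calibration_images` is an
# iterable of HxWxC uint8 images drawn from the training distribution.
def representative_dataset():
    for image in calibration_images:  # 100-500 representative samples
        yield [image[np.newaxis, ...].astype(np.float32) / 255.0]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to INT8 builtins so conversion fails if an op cannot be quantised
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# int8 input/output matches the data.int8 accesses in the TFLite Micro example
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

quantised_model = converter.convert()
print(f"Model size: {len(quantised_model) / 1024:.1f} KB")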
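
The arithmetic behind INT8 quantisation is an affine mapping, real = scale * (q - zero_point). The NumPy sketch below illustrates that mapping for a single tensor; it is a simplified illustration, not the converter's actual calibration code, and the helper names and the [-1, 3] value range are made up for the example.

```python
import numpy as np


def quantisation_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point mapping [x_min, x_max] onto the int8 range."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)           # assumes x_max > x_min
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantise(x, scale, zero_point, qmin=-128, qmax=127):
    """float -> int8: divide by scale, shift by zero point, round and clip."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)


def dequantise(q, scale, zero_point):
    """int8 -> float32 approximation of the original values."""
    return (q.astype(np.float32) - zero_point) * scale


x = np.linspace(-1.0, 3.0, 101)
scale, zero_point = quantisation_params(x.min(), x.max())
q = quantise(x, scale, zero_point)
x_hat = dequantise(q, scale, zero_point)
print(f"scale={scale:.5f}, zero_point={zero_point}, "
      f"max error={np.abs(x - x_hat).max():.5f}")
```

The round-trip error is bounded by half the scale, which is why a representative calibration set matters: an overly wide observed range inflates the scale and with it the error on every value.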

Structured Pruning

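Pruning removes redundant weights before quantisation, often enabling a further 30-50% size reduction without accuracy degradation on domain-specific tasks. Note that the TensorFlow Model Optimization API shown here performs magnitude-based pruning of individual weights, which produces unstructured sparsity: the tensor shapes stay the same, so the size saving appears once the model file is compressed or run on a sparsity-aware runtime. Removing whole channels or filters requires a separate structured approach.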

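import tensorflow_model_optimization as tfmot

# Assumes `model` is a trained Keras model and `train_data` is a batched
# dataset with one-hot labels. Ramp sparsity from 0% to 50% of weights.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000,  # Set to roughly steps_per_epoch * epochs
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule)
pruned_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# UpdatePruningStep is required; it advances the pruning schedule each batch
pruned_model.fit(train_data, epochs=10, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
# Remove the pruning wrappers before export; the zeroed weights remain
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)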
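
The effect of magnitude pruning on storage is easy to misread, so here is a small NumPy sketch of the idea on a random weight matrix. It does not use the tfmot API; `magnitude_prune` and `gzipped_size` are hypothetical helpers, and the random weights stand in for a real layer.

```python
import gzip

import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    flat = weights.ravel().copy()
    k = int(sparsity * flat.size)
    if k > 0:
        smallest = np.argpartition(np.abs(flat), k - 1)[:k]
        flat[smallest] = 0.0
    return flat.reshape(weights.shape)


def gzipped_size(weights):
    """Size in bytes of the raw weight buffer after gzip compression."""
    return len(gzip.compress(weights.tobytes()))


rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)
pruned = magnitude_prune(dense, sparsity=0.5)

print(f"raw bytes:     dense={dense.nbytes}, pruned={pruned.nbytes}")
print(f"gzipped bytes: dense={gzipped_size(dense)}, pruned={gzipped_size(pruned)}")
```

The raw buffers are the same size because the zeros are still stored; only the compressed size shrinks. That is the practical difference between unstructured sparsity and removing whole channels.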

TensorFlow Lite Micro for Microcontrollers

TFLite Micro runs inference on bare-metal microcontrollers (no OS required). A complete inference loop on Cortex-M:

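#include <cstdint>
#include <cstring>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h"  // Model flatbuffer as C array (g_model_data)

constexpr int kTensorArenaSize = 48 * 1024;  // Tune based on model
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

// Runs one inference on int8-quantised input; returns nullptr on failure.
const int8_t* RunInference(const int8_t* input_data, size_t input_size) {
  // Newer TFLite Micro releases favour MicroMutableOpResolver with only the
  // ops the model uses, which also saves flash.
  static tflite::AllOpsResolver resolver;
  static const tflite::Model* model = tflite::GetModel(g_model_data);
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kTensorArenaSize);
  static const bool ready = (interpreter.AllocateTensors() == kTfLiteOk);
  if (!ready || input_size != interpreter.input(0)->bytes) return nullptr;

  // Copy input data
  memcpy(interpreter.input(0)->data.int8, input_data, input_size);
  if (interpreter.Invoke() != kTfLiteOk) return nullptr;
  return interpreter.output(0)->data.int8;
}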
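
The `model_data.h` header is usually generated from the `.tflite` file, for example with `xxd -i`. A minimal Python equivalent might look like the sketch below; `to_c_array` is a hypothetical helper, and the array name `g_model_data` is chosen only to match the C++ example above.

```python
def to_c_array(data: bytes, name: str = "g_model_data", per_line: int = 12) -> str:
    """Render raw bytes as a C/C++ array definition plus a length constant."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"alignas(16) const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


# Stand-in bytes; in practice pass the output of converter.convert()
print(to_c_array(bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33])))
```

Declaring the array `const` lets the linker place it in flash rather than SRAM on most toolchains, which is the point of shipping the model as a compiled-in array.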

ONNX Runtime for Linux-Based Edge Targets

For edge targets running Linux (Raspberry Pi, Jetson, Banana Pi), ONNX Runtime is often preferable to TFLite. It supports execution providers that leverage hardware-specific acceleration:

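import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "mobilenet_v3.onnx",
    providers=["CPUExecutionProvider"]  # Or "CUDAExecutionProvider" on Jetson
)

# Assumes `preprocessed_image` is a float32 NumPy array whose shape matches
# session.get_inputs()[0].shape (typically 1 x 3 x H x W for exported CNNs)
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: preprocessed_image})[0]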
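
The input array has to match the layout the model was exported with. A minimal preprocessing sketch follows, assuming an RGB image already resized to the model's input size, NCHW layout, and ImageNet mean/std normalisation. All three are assumptions that depend on how the ONNX model was produced, and `preprocess` is a hypothetical helper.

```python
import numpy as np

# Widely used ImageNet channel statistics; an assumption for this model
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_hwc_uint8):
    """HxWx3 uint8 RGB image -> 1x3xHxW float32 tensor, normalised per channel."""
    x = image_hwc_uint8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # broadcasts over the channel axis
    x = np.transpose(x, (2, 0, 1))              # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...], dtype=np.float32)


frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a camera frame
preprocessed_image = preprocess(frame)
print(preprocessed_image.shape, preprocessed_image.dtype)
```

On small boards the preprocessing step can cost a noticeable share of the frame time, so it is worth including in any latency measurement.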

Benchmarking Methodology

Always benchmark on your target hardware under your operating conditions. Use:

perf stat and perf record for CPU profiling on Linux targets
tegrastats on Jetson for power consumption alongside latency
The TFLite benchmark tool: benchmark_model --graph=model.tflite --num_runs=50

Report p50 and p99 latency rather than the mean: the tail (p99) determines whether a real-time deadline is met, and the mean hides it.
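
A minimal timing harness for collecting those percentiles could look like the following. It uses only the standard library and the nearest-rank percentile definition; `benchmark` and `percentile` are hypothetical helpers, and the lambda is a placeholder for a real inference call such as `session.run(...)`.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a list sorted in ascending order."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(fn, warmup=10, runs=200):
    """Time fn() repeatedly and return p50, p99 and max latency in milliseconds."""
    for _ in range(warmup):          # warm caches and lazy initialisation
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()
    return {
        "p50_ms": percentile(samples_ms, 50),
        "p99_ms": percentile(samples_ms, 99),
        "max_ms": samples_ms[-1],
    }


# Placeholder workload; replace with the real inference call on the target
print(benchmark(lambda: sum(range(10_000)), warmup=5, runs=100))
```

With only 100 to 200 runs the p99 value rests on one or two samples, so more runs are needed before drawing conclusions about tail latency.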

Conclusion

Deploying neural networks at the edge is an engineering discipline as much as a data science problem. Architecture selection, quantisation, and pruning must be co-optimised with target hardware constraints from the beginning, not bolted on at the end. The tools — TFLite, ONNX Runtime, TFLite Micro, CMSIS-NN — are mature and production-proven. The skill is in applying them together systematically to meet your specific latency and memory budget.

Lightweight Neural Networks for Edge Inference: MobileNet to TinyML

Worksprout Team