TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
How to combine YOLOv8 real-time object detection with RGB-D depth cameras to build spatial perception pipelines for robotic manipulation and navigation tasks.
A robot that can identify objects in a 2D camera image has done half the work. To grasp an object, navigate around it, or estimate its trajectory, the robot needs 3D spatial information: where is the object in the robot's frame, how far away is it, and what is its orientation? RGB-D cameras (Intel RealSense, Microsoft Azure Kinect, Luxonis OAK-D) provide both colour and depth in a single unit, enabling the spatial perception pipeline that drives most modern robotic manipulation systems.
Ultralytics YOLOv8 is the current state-of-the-art for real-time object detection. Its nano and small variants run at 100+ FPS on a GPU and 15-30 FPS on embedded hardware (Jetson Nano, Raspberry Pi 5), making them practical for real-time robotic perception:
from ultralytics import YOLO
import cv2
model = YOLO("yolov8s.pt") # Small variant: 22M params, ~50 FPS on Jetson Nano
# Custom training on robotic manipulation dataset
model.train(
data="robot_objects.yaml",
epochs=100,
imgsz=640,
batch=16,
device="cuda:0",
augment=True, # Mosaic, mixup, geometric augmentations
)
import pyrealsense2 as rs
import numpy as np
import cv2
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
profile = pipeline.start(config)
align = rs.align(rs.stream.color)
def get_object_3d_position(box, depth_frame, intrinsics):
cx = int((box.xyxy[0][0] + box.xyxy[0][2]) / 2)
cy = int((box.xyxy[0][1] + box.xyxy[0][3]) / 2)
depth = depth_frame.get_distance(cx, cy)
point_3d = rs.rs2_deproject_pixel_to_point(intrinsics, [cx, cy], depth)
return np.array(point_3d) # [x, y, z] in metres
frames = pipeline.wait_for_frames()
aligned = align.process(frames)
colour_frame = np.asanyarray(aligned.get_color_frame().get_data())
depth_frame = aligned.get_depth_frame()
results = model(colour_frame, conf=0.5)
for box in results[0].boxes:
position_3d = get_object_3d_position(box, depth_frame,
profile.get_stream(rs.stream.depth).as_video_stream_profile().get_intrinsics())
print(f"{model.names[int(box.cls)]}: {position_3d} m")
3D positions from the camera are in the camera frame. Robotic systems need positions in the robot's base frame (or world frame) for planning. Use a transformation chain: camera frame → end-effector frame → robot base frame. tf2 in ROS 2 manages this:
from tf2_ros import TransformListener, Buffer
from geometry_msgs.msg import PointStamped
tf_buffer = Buffer()
tf_listener = TransformListener(tf_buffer, node)
camera_point = PointStamped()
camera_point.header.frame_id = "camera_color_optical_frame"
camera_point.point.x, camera_point.point.y, camera_point.point.z = position_3d
base_point = tf_buffer.transform(camera_point, "robot_base_link")
Structured-light and time-of-flight depth cameras produce noisy measurements, especially on reflective surfaces, at edges, and beyond their rated range. Apply spatial and temporal filtering:
rs.spatial_filter())rs.temporal_filter())Combining YOLOv8's detection speed with RGB-D depth data gives robotic systems the spatial awareness needed for reliable manipulation. The key engineering challenges are coordinate frame management and depth noise handling — invest in robust tf2 transforms from the start, and apply depth filtering appropriate to your sensor and scene. This pipeline forms the perception backbone of most practical robotic systems in production today.
Continue reading — handpicked articles you might enjoy