TECHNICAL BLOG

Deep Dives for Engineers

Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.

Computer Vision for Robotics: Object Detection with YOLOv8 and Depth Cameras
Robotics and AI

Computer Vision for Robotics: Object Detection with YOLOv8 and Depth Cameras

Worksprout Team Nov 08, 2024 11 min read

How to combine YOLOv8 real-time object detection with RGB-D depth cameras to build spatial perception pipelines for robotic manipulation and navigation tasks.

Spatial Perception: The Core Robotics Challenge

A robot that can identify objects in a 2D camera image has done half the work. To grasp an object, navigate around it, or estimate its trajectory, the robot needs 3D spatial information: where is the object in the robot's frame, how far away is it, and what is its orientation? RGB-D cameras (Intel RealSense, Microsoft Azure Kinect, Luxonis OAK-D) provide both colour and depth in a single unit, enabling the spatial perception pipeline that drives most modern robotic manipulation systems.

YOLOv8 for Real-Time Detection

Ultralytics YOLOv8 is the current state-of-the-art for real-time object detection. Its nano and small variants run at 100+ FPS on a GPU and 15-30 FPS on embedded hardware (Jetson Nano, Raspberry Pi 5), making them practical for real-time robotic perception:

from ultralytics import YOLO
import cv2

model = YOLO("yolov8s.pt")  # Small variant: 22M params, ~50 FPS on Jetson Nano

# Custom training on robotic manipulation dataset
model.train(
    data="robot_objects.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    device="cuda:0",
    augment=True,  # Mosaic, mixup, geometric augmentations
)

Integrating RealSense Depth Data

import pyrealsense2 as rs
import numpy as np
import cv2

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)

profile = pipeline.start(config)
align = rs.align(rs.stream.color)

def get_object_3d_position(box, depth_frame, intrinsics):
    cx = int((box.xyxy[0][0] + box.xyxy[0][2]) / 2)
    cy = int((box.xyxy[0][1] + box.xyxy[0][3]) / 2)
    depth = depth_frame.get_distance(cx, cy)
    point_3d = rs.rs2_deproject_pixel_to_point(intrinsics, [cx, cy], depth)
    return np.array(point_3d)  # [x, y, z] in metres

frames = pipeline.wait_for_frames()
aligned = align.process(frames)
colour_frame = np.asanyarray(aligned.get_color_frame().get_data())
depth_frame = aligned.get_depth_frame()

results = model(colour_frame, conf=0.5)
for box in results[0].boxes:
    position_3d = get_object_3d_position(box, depth_frame,
        profile.get_stream(rs.stream.depth).as_video_stream_profile().get_intrinsics())
    print(f"{model.names[int(box.cls)]}: {position_3d} m")

Coordinate Frame Transforms

3D positions from the camera are in the camera frame. Robotic systems need positions in the robot's base frame (or world frame) for planning. Use a transformation chain: camera frame → end-effector frame → robot base frame. tf2 in ROS 2 manages this:

from tf2_ros import TransformListener, Buffer
from geometry_msgs.msg import PointStamped

tf_buffer = Buffer()
tf_listener = TransformListener(tf_buffer, node)

camera_point = PointStamped()
camera_point.header.frame_id = "camera_color_optical_frame"
camera_point.point.x, camera_point.point.y, camera_point.point.z = position_3d

base_point = tf_buffer.transform(camera_point, "robot_base_link")

Handling Depth Noise

Structured-light and time-of-flight depth cameras produce noisy measurements, especially on reflective surfaces, at edges, and beyond their rated range. Apply spatial and temporal filtering:

  • Spatial filter: fill holes and smooth depth edges (rs.spatial_filter())
  • Temporal filter: average depth values across frames (rs.temporal_filter())
  • Depth clipping: reject measurements outside [0.2 m, 3.0 m] for manipulation tasks
  • Median filter on the bounding box ROI: more robust than centre-pixel depth for estimating object distance

Conclusion

Combining YOLOv8's detection speed with RGB-D depth data gives robotic systems the spatial awareness needed for reliable manipulation. The key engineering challenges are coordinate frame management and depth noise handling — invest in robust tf2 transforms from the start, and apply depth filtering appropriate to your sensor and scene. This pipeline forms the perception backbone of most practical robotic systems in production today.

Share

Worksprout Team

The Worksprout engineering team specialises in embedded Linux, RDK-B broadband platforms, edge AI, and robotics systems. Based in Rajshahi, Bangladesh, we design and deploy production embedded intelligence for clients across South Asia and beyond.

Related Posts

Continue reading — handpicked articles you might enjoy