TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
Practical techniques for collecting, cleaning, and augmenting real-world training datasets for robotics vision tasks — combining ethical web scraping, synthetic data generation, and domain randomisation.
Robot learning systems need data that reflects the real-world conditions the robot will encounter — specific object categories, lighting conditions, viewpoints, and backgrounds. Generic ImageNet or COCO-pretrained models provide an excellent starting point, but fine-tuning for domain-specific tasks (grasping industrial components, sorting produce, navigating warehouse environments) requires task-specific data that rarely exists in public datasets. Collecting it requires combining targeted web scraping, synthetic generation, and careful augmentation.
Web scraping for machine learning datasets sits in a legally and ethically sensitive space. Always check robots.txt, respect rate limits, attribute sources, and avoid scraping personal data. For object datasets, product image databases, manufacturer catalogues, and Creative Commons image sources are appropriate targets.
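import asyncio
import hashlib
from pathlib import Path

import httpx

# File extensions for common image content types; other image types fall back to ".img"
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "image/webp": ".webp"}


async def scrape_images(urls: list[str], output_dir: Path, rate_limit: float = 1.0):
    """Download images one at a time, pausing rate_limit seconds before each request."""
    output_dir.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        for url in urls:
            await asyncio.sleep(rate_limit)  # Respect server rate limits
            try:
                response = await client.get(url)
                # Drop parameters such as "; charset=..." before comparing
                content_type = response.headers.get("content-type", "").split(";")[0].strip().lower()
                if response.status_code == 200 and content_type.startswith("image/"):
                    # Name files by content hash so byte-identical downloads collapse into one file
                    img_hash = hashlib.md5(response.content).hexdigest()
                    suffix = EXTENSIONS.get(content_type, ".img")
                    (output_dir / f"{img_hash}{suffix}").write_bytes(response.content)
            except (httpx.HTTPError, OSError) as e:
                print(f"Failed: {url}: {e}")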
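The scraper above paces its requests but does not itself consult robots.txt. A minimal sketch of that check, using only the standard library's `urllib.robotparser`, might look like the following. The robots.txt text, the URLs, the `dataset-bot` user agent, and the helper names `parse_robots` and `filter_allowed` are all hypothetical; in practice the file would be fetched from the target site before filtering the URL list.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(urls: list[str], parser: RobotFileParser, user_agent: str) -> list[str]:
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URLs, for illustration only
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = parse_robots(EXAMPLE_ROBOTS)
candidates = [
    "https://example.com/catalogue/widget.jpg",
    "https://example.com/private/internal.jpg",
]
allowed = filter_allowed(candidates, parser, user_agent="dataset-bot")
# crawl_delay returns None when the site does not declare one, so fall back to 1 second
delay = parser.crawl_delay("dataset-bot") or 1.0
```

The filtered list and the declared crawl delay could then be passed to `scrape_images` as `urls` and `rate_limit`.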
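Raw scraped images contain duplicates, corrupted files, and irrelevant content. The cleaning pipeline below handles the first two automatically by removing near-duplicates and unreadable files; irrelevant content still needs a separate filtering or review step.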
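import os

import imagehash
from PIL import Image


def compute_phash(path: str) -> imagehash.ImageHash:
    # Open inside a context manager so the file handle is closed promptly
    with Image.open(path) as img:
        return imagehash.phash(img)


def deduplicate_dataset(image_dir: str, threshold: int = 8) -> int:
    '''Remove near-duplicate and unreadable images using perceptual hashing.

    Returns the number of files deleted. Every file that cannot be opened
    as an image is deleted, so keep label files in a separate directory.
    '''
    kept_hashes = []  # hashes of the images kept so far
    to_remove = []
    for fname in sorted(os.listdir(image_dir)):  # sorted so the kept copy is deterministic
        path = os.path.join(image_dir, fname)
        if not os.path.isfile(path):
            continue  # Skip subdirectories
        try:
            h = compute_phash(path)
        except Exception:
            to_remove.append(path)  # Remove corrupted files
            continue
        # Subtracting two ImageHash objects gives their Hamming distance
        if any(h - existing < threshold for existing in kept_hashes):
            to_remove.append(path)
        else:
            kept_hashes.append(h)
    for path in to_remove:
        os.remove(path)
    return len(to_remove)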
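The `threshold` parameter is a Hamming distance: the number of bits in which two perceptual hashes differ. The sketch below illustrates that comparison with the standard library only, assuming 64-bit hashes written as 16 hex digits (which is what `imagehash.phash` produces with its default hash size). The hash values and the helper name `hamming_distance` are made up for illustration.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical 64-bit hashes (16 hex digits), not taken from real images
hash_a = "ffd8a0c3e1b27f00"
hash_b = "ffd8a0c3e1b27f0f"  # differs from hash_a in the last hex digit only
distance = hamming_distance(hash_a, hash_b)  # 0x0 vs 0xf: four bits differ
is_near_duplicate = distance < 8  # same comparison as the default threshold
```

With a threshold of 8, two images count as duplicates when fewer than 8 of their 64 hash bits differ. A lower threshold keeps more near-identical images; a higher one risks discarding images that are merely similar.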
For objects where real images are scarce (custom industrial components, proprietary hardware), synthetic rendering with domain randomisation is a proven approach. Blender's Python API enables programmatic scene generation:
import random

import bpy
from mathutils import Vector


def render_synthetic_dataset(model_path: str, n_images: int, output_dir: str):
    # model_path is a .blend file that contains objects named "Camera" and "Light"
    bpy.ops.wm.open_mainfile(filepath=model_path)
    camera = bpy.data.objects["Camera"]
    light = bpy.data.lights["Light"]
    background = bpy.data.worlds["World"].node_tree.nodes["Background"]
    target = Vector((0.0, 0.0, 0.0))  # Assumes the object of interest sits at the world origin
    for i in range(n_images):
        # Domain randomisation: vary lighting, background, viewpoint
        light.energy = random.uniform(500, 2000)
        camera.location = (
            random.uniform(-2, 2),
            random.uniform(-4, -2),
            random.uniform(1, 3),
        )
        # Aim the camera at the target: Blender cameras look along their local -Z axis
        direction = target - camera.location
        camera.rotation_euler = direction.to_track_quat("-Z", "Y").to_euler()
        # Random background colour (RGBA)
        background.inputs[0].default_value = (random.random(), random.random(), random.random(), 1)
        bpy.context.scene.render.filepath = f"{output_dir}/synth_{i:05d}.png"
        bpy.ops.render.render(write_still=True)
Domain randomisation — varying textures, lighting, camera angles, and background in rendering — forces the model to learn object-intrinsic features rather than memorising scene-specific correlations, improving real-world transfer.
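It can help to sample the randomised parameters separately from the renderer, so that each image's settings are logged and a dataset can be regenerated from a seed. The following is a minimal sketch of that idea in plain Python; it reuses the numeric ranges from the render loop above, while `SceneParams`, `sample_scene_params`, and the seed value are hypothetical names and choices, not part of Blender's API.

```python
import random
from dataclasses import asdict, dataclass


@dataclass
class SceneParams:
    light_energy: float
    camera_location: tuple[float, float, float]
    background_rgb: tuple[float, float, float]


def sample_scene_params(rng: random.Random) -> SceneParams:
    """Draw one set of randomised scene parameters."""
    return SceneParams(
        light_energy=rng.uniform(500, 2000),
        camera_location=(rng.uniform(-2, 2), rng.uniform(-4, -2), rng.uniform(1, 3)),
        background_rgb=(rng.random(), rng.random(), rng.random()),
    )


rng = random.Random(42)  # fixed seed so the same parameter sequence can be regenerated
params = [sample_scene_params(rng) for _ in range(3)]
manifest = [asdict(p) for p in params]  # plain dicts, ready to be written out as JSON
```

Storing the manifest next to the rendered images makes it possible to trace a failure case back to the exact lighting and viewpoint that produced it.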
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    # Newer Albumentations releases expect size=(640, 640) here instead of positional height and width
    A.RandomResizedCrop(640, 640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1, p=0.7),
    A.GaussNoise(p=0.3),
    A.MotionBlur(p=0.2),    # Simulate camera motion in robotics
    A.RandomShadow(p=0.2),  # Simulate lighting variation
    # ImageNet channel statistics, matching ImageNet-pretrained backbones
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]))
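The `bbox_params` argument matters because geometric transforms move objects within the image, so the labels have to move with them. As a small illustration of what that means for one transform, here is a sketch of a horizontal flip applied to a YOLO-format box (x_center, y_center, width, height, all normalised to the 0..1 range). It is not Albumentations' internal code, and the helper name and box values are hypothetical.

```python
def hflip_yolo_bbox(
    bbox: tuple[float, float, float, float],
) -> tuple[float, float, float, float]:
    """Mirror a YOLO-format box (x_center, y_center, width, height) left to right."""
    x_center, y_center, width, height = bbox
    # Only the horizontal centre changes; the box size and vertical position stay the same
    return (1.0 - x_center, y_center, width, height)


# Hypothetical box: an object in the left half of the image
original = (0.25, 0.5, 0.2, 0.4)
flipped = hflip_yolo_bbox(original)  # the box moves to the right half
```

With the pipeline above, the library handles this bookkeeping when the transform is called with keyword arguments, for example `train_transform(image=image, bboxes=bboxes, class_labels=class_labels)`, and the adjusted boxes come back under the `"bboxes"` key of the result.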
Track these metrics for every dataset version:
High-quality domain-specific training data, not model architecture, is often the limiting factor in robotics vision performance. A model trained on data from a systematic pipeline (ethical scraping, deduplication, synthetic augmentation with domain randomisation, and rigorous quality tracking) will often outperform a larger model trained on poor data. Invest in your dataset pipeline as seriously as your model architecture.