Building Real-World Training Datasets for Robotics with Web Scraping and Synthetic Data

Worksprout Team, Apr 05, 2025

Practical techniques for collecting, cleaning, and augmenting real-world training datasets for robotics vision tasks — combining ethical web scraping, synthetic data generation, and domain randomisation.

The Data Bottleneck in Robotics

Robot learning systems need data that reflects the conditions the robot will actually encounter: specific object categories, lighting, viewpoints, and backgrounds. Models pretrained on ImageNet or COCO provide an excellent starting point, but fine-tuning for domain-specific tasks (grasping industrial components, sorting produce, navigating warehouse environments) requires task-specific data that rarely exists in public datasets. Building such a dataset usually means combining targeted web scraping, synthetic generation, and careful augmentation.

Ethical and Legal Web Scraping

Scraping images for machine learning datasets raises both legal and ethical questions, so treat every source deliberately. Always check robots.txt, respect rate limits, attribute sources, and avoid scraping personal data. For object datasets, appropriate targets include product image databases, manufacturer catalogues, and Creative Commons image sources.

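import asyncio
import hashlib
import mimetypes
from pathlib import Path

import httpx

async def scrape_images(urls: list[str], output_dir: Path, rate_limit: float = 1.0):
    """Download images one at a time, pausing rate_limit seconds before each request."""
    output_dir.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        for url in urls:
            await asyncio.sleep(rate_limit)  # Respect server rate limits
            try:
                response = await client.get(url)
                # Drop parameters such as "; charset=..." before inspecting the MIME type
                content_type = response.headers.get("content-type", "").split(";")[0].strip()
                if response.status_code == 200 and content_type.startswith("image/"):
                    # Name files by content hash so byte-identical downloads collapse to one file
                    img_hash = hashlib.md5(response.content).hexdigest()
                    # Use the reported image type for the extension instead of assuming JPEG
                    ext = mimetypes.guess_extension(content_type) or ".img"
                    (output_dir / f"{img_hash}{ext}").write_bytes(response.content)
            except Exception as e:
                # Best effort: log the failure and continue with the next URL
                print(f"Failed: {url}: {e}")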
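The scraper above paces its requests but does not itself consult robots.txt. As a minimal sketch of that check, the standard library's urllib.robotparser can evaluate a robots.txt body that has already been downloaded. The helper names is_allowed and crawl_delay and the "dataset-bot" user agent are illustrative assumptions, not part of the original pipeline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "dataset-bot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawl_delay(robots_txt: str, user_agent: str = "dataset-bot",
                default: float = 1.0) -> float:
    """Return the declared Crawl-delay for user_agent, or default if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

One way to use this would be to filter the URL list with is_allowed before calling scrape_images, and to pass the result of crawl_delay as its rate_limit argument.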

Automated Data Cleaning

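Raw scraped images contain duplicates, corrupted files, and irrelevant content. The function below handles the first two: it deletes near-duplicates found by perceptual hashing, along with any file that cannot be opened as an image. Filtering irrelevant content is a separate step and is not covered by this code.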

import os

import imagehash
from PIL import Image

def compute_phash(path: str) -> imagehash.ImageHash:
    # Close the file handle promptly; the hash is computed inside the with block
    with Image.open(path) as img:
        return imagehash.phash(img)

def deduplicate_dataset(image_dir: str, threshold: int = 8) -> int:
    '''Remove near-duplicate and unreadable images; return the number deleted.'''
    kept_hashes = []
    to_remove = []
    # Sort so that the copy which survives is deterministic across runs
    for fname in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, fname)
        if not os.path.isfile(path):
            continue  # Skip subdirectories
        try:
            h = compute_phash(path)
        except Exception:
            to_remove.append(path)  # Remove unreadable or corrupted files
            continue
        # Subtracting two ImageHash objects gives their Hamming distance
        if any(h - kept < threshold for kept in kept_hashes):
            to_remove.append(path)
        else:
            kept_hashes.append(h)

    for path in to_remove:
        os.remove(path)
    return len(to_remove)

Synthetic Data with Blender and Domain Randomisation

For objects where real images are scarce (custom industrial components, proprietary hardware), synthetic rendering with domain randomisation is a proven approach. Blender's Python API enables programmatic scene generation:

import random

import bpy
from mathutils import Vector

def render_synthetic_dataset(model_path: str, n_images: int, output_dir: str):
    bpy.ops.wm.open_mainfile(filepath=model_path)
    # Assumes the .blend file keeps Blender's default names: "Camera", "Light", "World"
    camera = bpy.data.objects["Camera"]
    light = bpy.data.lights["Light"]
    background = bpy.data.worlds["World"].node_tree.nodes["Background"]
    target = Vector((0.0, 0.0, 0.0))  # Assumes the object of interest sits at the world origin

    for i in range(n_images):
        # Domain randomisation: vary lighting, background, viewpoint
        light.energy = random.uniform(500, 2000)
        camera.location = (random.uniform(-2, 2), random.uniform(-4, -2), random.uniform(1, 3))
        # Aim the camera at the target: cameras look along local -Z with +Y up
        direction = target - camera.location
        camera.rotation_euler = direction.to_track_quat("-Z", "Y").to_euler()

        # Random background colour (RGBA)
        background.inputs[0].default_value = (random.random(), random.random(), random.random(), 1)

        bpy.context.scene.render.filepath = f"{output_dir}/synth_{i:05d}.png"
        bpy.ops.render.render(write_still=True)

Domain randomisation — varying textures, lighting, camera angles, and background in rendering — forces the model to learn object-intrinsic features rather than memorising scene-specific correlations, improving real-world transfer.
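A common variation on viewpoint randomisation is to sample the camera on a hemisphere around the object, so that distance stays fixed while azimuth and elevation vary. The sketch below is plain Python with no Blender dependency; the function name sample_camera_position and the default elevation range are assumptions chosen for illustration.

```python
import math
import random
from typing import Optional, Tuple


def sample_camera_position(radius: float,
                           min_elevation_deg: float = 10.0,
                           max_elevation_deg: float = 80.0,
                           rng: Optional[random.Random] = None) -> Tuple[float, float, float]:
    """Sample an (x, y, z) camera position on a hemisphere centred on the origin.

    Elevation is drawn uniformly in angle, which is simple but not uniform
    over the surface area of the hemisphere.
    """
    rng = rng or random.Random()
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = math.radians(rng.uniform(min_elevation_deg, max_elevation_deg))
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)
```

Passing a seeded random.Random instance makes a rendered dataset reproducible, which helps when comparing dataset versions.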

Data Augmentation Pipeline

import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.RandomResizedCrop(size=(640, 640), scale=(0.5, 1.0)),  # albumentations >= 1.4; older releases take height=640, width=640
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1, p=0.7),
    A.GaussNoise(p=0.3),
    A.MotionBlur(p=0.2),  # Simulate camera motion in robotics
    A.RandomShadow(p=0.2),  # Simulate lighting variation
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]))
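The bbox_params argument above declares the boxes to be in YOLO format: normalised centre x, centre y, width, and height, each in the range 0 to 1. Geometric transforms must move the boxes along with the pixels, which the library handles internally. As an illustration of what that means for a horizontal flip, here is a small standalone sketch; the helper names are hypothetical and do not belong to Albumentations.

```python
from typing import Tuple

YoloBox = Tuple[float, float, float, float]  # (x_centre, y_centre, width, height), normalised


def hflip_yolo_bbox(bbox: YoloBox) -> YoloBox:
    """Mirror a YOLO box left to right: only the x centre changes."""
    x_c, y_c, w, h = bbox
    return (1.0 - x_c, y_c, w, h)


def yolo_to_pixel_corners(bbox: YoloBox, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a YOLO box to (x_min, y_min, x_max, y_max) in pixels."""
    x_c, y_c, w, h = bbox
    return (
        (x_c - w / 2) * img_w,
        (y_c - h / 2) * img_h,
        (x_c + w / 2) * img_w,
        (y_c + h / 2) * img_h,
    )
```

Converting a few augmented boxes back to pixel corners and drawing them on the augmented image is a quick sanity check that labels still line up after the pipeline runs.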

Dataset Quality Metrics

Track these metrics for every dataset version:

  • Class balance — heavily imbalanced datasets produce models that ignore minority classes
  • Annotation consistency — inter-annotator agreement (Cohen's kappa) for human-labelled data
  • Coverage — do your images cover the full range of expected deployment conditions?
  • Train/val/test overlap — check for near-duplicates across splits using perceptual hashing
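Three of the metrics above can be computed with a few lines of standard-library Python. This is a rough sketch rather than a full tooling suggestion: the function names are hypothetical, and the overlap check assumes each perceptual hash has already been converted to an integer (for example with int(str(h), 16) on an imagehash value).

```python
from collections import Counter
from typing import Hashable, Sequence


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Most common class count divided by least common (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical class throughout
    return (observed - expected) / (1.0 - expected)


def cross_split_overlap(train_hashes: Sequence[int], test_hashes: Sequence[int],
                        threshold: int = 8) -> int:
    """Count test hashes within threshold bits (Hamming distance) of any train hash."""
    return sum(
        any(bin(t ^ s).count("1") < threshold for s in train_hashes)
        for t in test_hashes
    )
```

A non-zero result from cross_split_overlap suggests that near-duplicate images have leaked across splits, in which case reported test accuracy is likely to be optimistic.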

Conclusion

High-quality domain-specific training data, rather than model architecture, is often the limiting factor in robotics vision system performance. A model trained on data from a systematic pipeline (ethical scraping, deduplication, synthetic generation with domain randomisation, and rigorous quality tracking) will often outperform a larger model trained on poor data. Invest in your dataset pipeline as seriously as your model architecture.

Worksprout Team

The Worksprout engineering team specialises in embedded Linux, RDK-B broadband platforms, edge AI, and robotics systems. Based in Rajshahi, Bangladesh, we design and deploy production embedded intelligence for clients across South Asia and beyond.
