Practical techniques for collecting, cleaning, and augmenting real-world training datasets for robotics vision tasks — combining ethical web scraping, synthetic data generation, and domain randomisation.

The Data Bottleneck in Robotics

Robot learning systems need data that reflects the real-world conditions the robot will encounter — specific object categories, lighting conditions, viewpoints, and backgrounds. Generic ImageNet or COCO-pretrained models provide an excellent starting point, but fine-tuning for domain-specific tasks (grasping industrial components, sorting produce, navigating warehouse environments) requires task-specific data that rarely exists in public datasets. Collecting it requires combining targeted web scraping, synthetic generation, and careful augmentation.

Ethical and Legal Web Scraping

Web scraping for machine learning datasets sits in a legally and ethically nuanced space. Always check robots.txt, respect rate limits, attribute sources, and avoid scraping personal data. For object datasets, appropriate targets include product image databases, manufacturer catalogues, and Creative Commons image sources.

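import asyncio
import hashlib
from pathlib import Path

import httpx

# Content types we keep, mapped to the file extension used on disk
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "image/webp": ".webp"}

async def scrape_images(urls: list[str], output_dir: Path, rate_limit: float = 1.0):
    """Download images one at a time, pausing rate_limit seconds before each request."""
    output_dir.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        for url in urls:
            await asyncio.sleep(rate_limit)  # Respect server rate limits
            try:
                response = await client.get(url)
                # Drop parameters such as "; charset=..." before comparing
                content_type = response.headers.get("content-type", "").split(";")[0].strip().lower()
                if response.status_code == 200 and content_type in EXTENSIONS:
                    # Name files by content hash so byte-identical downloads collapse into one file
                    img_hash = hashlib.md5(response.content).hexdigest()
                    (output_dir / f"{img_hash}{EXTENSIONS[content_type]}").write_bytes(response.content)
            except (httpx.HTTPError, OSError) as e:
                print(f"Failed: {url}: {e}")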
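The scraper above paces its requests but does not itself consult robots.txt. A minimal sketch of that check using the standard library's urllib.robotparser follows. The robots.txt text, the URLs, and the "dataset-bot" user agent are hypothetical placeholders, and the sketch assumes the robots.txt body has already been downloaded as a string.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(urls: list[str], parser: RobotFileParser, user_agent: str) -> list[str]:
    """Keep only the URLs that the parsed rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URLs, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_robots_parser(EXAMPLE_ROBOTS)
allowed = filter_allowed(
    ["https://example.com/images/bolt.jpg", "https://example.com/private/draft.jpg"],
    parser,
    user_agent="dataset-bot",
)
# crawl_delay returns None when the site declares no Crawl-delay
delay = parser.crawl_delay("dataset-bot")
```

The filtered list can be passed to scrape_images, and a declared crawl delay is a reasonable lower bound for its rate_limit argument. Rules differ per host, so a real pipeline would keep one parser per domain.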

Automated Data Cleaning

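Raw scraped images contain duplicates, corrupted files, and irrelevant content. The function below handles the first two: it removes near-duplicates using perceptual hashing and deletes files that cannot be opened as images.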

import os

import imagehash
from PIL import Image

def compute_phash(path: str) -> imagehash.ImageHash:
    # Open inside a context manager so the file handle is closed promptly
    with Image.open(path) as img:
        return imagehash.phash(img)

def deduplicate_dataset(image_dir: str, threshold: int = 8) -> int:
    '''Delete near-duplicate and unreadable images; return the number of files removed.'''
    kept_hashes = []
    to_remove = []
    for fname in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, fname)
        if not os.path.isfile(path):
            continue  # Skip subdirectories
        try:
            h = compute_phash(path)
        except Exception:
            to_remove.append(path)  # Unreadable or corrupted file
            continue
        # Subtracting two ImageHash objects gives their Hamming distance in bits
        if any(h - kept < threshold for kept in kept_hashes):
            to_remove.append(path)
        else:
            kept_hashes.append(h)

    for path in to_remove:
        os.remove(path)
    return len(to_remove)
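To make the threshold concrete: a perceptual hash with the default 8x8 size is a 64-bit value, usually stored as 16 hex characters, and two images count as near-duplicates when fewer than threshold bits differ. The sketch below reimplements that bit comparison in plain Python and reuses it to look for near-duplicates between two dataset splits. It assumes the hashes were computed beforehand and saved as hex strings; the file names and hash values are made up for illustration.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def find_cross_split_overlap(
    train: dict[str, str], test: dict[str, str], threshold: int = 8
) -> list[tuple[str, str, int]]:
    """Return (train_name, test_name, distance) for near-duplicate pairs across splits."""
    overlaps = []
    for test_name, test_hash in test.items():
        for train_name, train_hash in train.items():
            distance = hamming_distance(train_hash, test_hash)
            if distance < threshold:
                overlaps.append((train_name, test_name, distance))
    return overlaps


# Made-up 64-bit hashes keyed by file name.
train_hashes = {"a.jpg": "ffff0000ffff0000", "b.jpg": "0123456789abcdef"}
test_hashes = {"c.jpg": "ffff0000ffff0001", "d.jpg": "0000ffff0000ffff"}

overlaps = find_cross_split_overlap(train_hashes, test_hashes)
```

Here only c.jpg is flagged, because it differs from a.jpg by a single bit. The pairwise loop is quadratic in the number of images, which is fine for a few thousand files but would need an index for much larger datasets.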

Synthetic Data with Blender and Domain Randomisation

For objects where real images are scarce (custom industrial components, proprietary hardware), synthetic rendering with domain randomisation is a proven approach. Blender's Python API enables programmatic scene generation:

import random

import bpy
from mathutils import Vector

def render_synthetic_dataset(model_path: str, n_images: int, output_dir: str):
    # model_path is a .blend file. The scene is assumed to contain objects named
    # "Camera" and "Light", a world named "World", and the target object at the origin.
    bpy.ops.wm.open_mainfile(filepath=model_path)
    camera = bpy.data.objects["Camera"]
    target = Vector((0.0, 0.0, 0.0))

    for i in range(n_images):
        # Domain randomisation: vary lighting, background, viewpoint
        bpy.data.lights["Light"].energy = random.uniform(500, 2000)
        camera.location = (random.uniform(-2, 2), random.uniform(-4, -2), random.uniform(1, 3))
        # Rotate the camera so its viewing axis (-Z) points at the target
        direction = target - camera.location
        camera.rotation_euler = direction.to_track_quat("-Z", "Y").to_euler()

        # Random background colour (RGBA)
        bpy.data.worlds["World"].node_tree.nodes["Background"].inputs[0].default_value = (
            random.random(), random.random(), random.random(), 1
        )

        bpy.context.scene.render.filepath = f"{output_dir}/synth_{i:05d}.png"
        bpy.ops.render.render(write_still=True)

Domain randomisation — varying textures, lighting, camera angles, and background in rendering — forces the model to learn object-intrinsic features rather than memorising scene-specific correlations, improving real-world transfer.
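One way to keep randomisation controlled is to sample scene parameters separately from rendering, so they can be seeded, logged, and inspected. The sketch below runs outside Blender and is only an illustration: it samples camera positions on a spherical shell around an object assumed to sit at the origin, which bounds distance and elevation more directly than independent x, y, z ranges. The radius and elevation ranges are arbitrary assumptions, and the light energy range is borrowed from the rendering code above.

```python
import math
import random


def sample_camera_position(
    rng: random.Random,
    radius_range: tuple[float, float] = (2.0, 4.5),
    elevation_deg_range: tuple[float, float] = (15.0, 60.0),
) -> tuple[float, float, float]:
    """Sample a camera position on a spherical shell centred on the origin."""
    radius = rng.uniform(*radius_range)
    elevation = math.radians(rng.uniform(*elevation_deg_range))
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)


def sample_scene_parameters(rng: random.Random) -> dict:
    """Draw one set of randomised scene parameters."""
    return {
        "camera_location": sample_camera_position(rng),
        "light_energy": rng.uniform(500.0, 2000.0),
        "background_rgb": (rng.random(), rng.random(), rng.random()),
    }


# A fixed seed makes the sampled parameter set reproducible.
rng = random.Random(0)
params = [sample_scene_parameters(rng) for _ in range(100)]
```

Storing each parameter dictionary next to its rendered image makes it possible to check later which conditions the synthetic set actually covers.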

Data Augmentation Pipeline

import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    # Albumentations 1.4+ takes size=(height, width); older releases take height=640, width=640
    A.RandomResizedCrop(size=(640, 640), scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1, p=0.7),
    A.GaussNoise(p=0.3),
    A.MotionBlur(p=0.2),  # Simulate camera motion in robotics
    A.RandomShadow(p=0.2),  # Simulate lighting variation
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]))
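The bbox_params argument matters because geometric transforms move objects, so box labels have to move with the pixels. As an illustration of what that involves, here is a hand-rolled sketch of a horizontal flip applied to YOLO-format boxes. It is not Albumentations' implementation, and hflip_yolo_boxes is a hypothetical helper written only for this example.

```python
def hflip_yolo_boxes(
    boxes: list[tuple[float, float, float, float]],
) -> list[tuple[float, float, float, float]]:
    """Mirror YOLO boxes (x_center, y_center, width, height), all normalised to [0, 1].

    A horizontal flip mirrors the x centre around 0.5 and leaves the
    y centre, width, and height unchanged.
    """
    return [(1.0 - xc, yc, w, h) for xc, yc, w, h in boxes]


# One box centred a quarter of the way across the image.
boxes = [(0.25, 0.5, 0.2, 0.4)]
flipped = hflip_yolo_boxes(boxes)
```

With the pipeline above, this bookkeeping is handled by the library when it is called as train_transform(image=image, bboxes=bboxes, class_labels=class_labels), and the transformed boxes come back under the "bboxes" key of the result.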

Dataset Quality Metrics

Track these metrics for every dataset version:

Class balance — heavily imbalanced datasets produce models that ignore minority classes
Annotation consistency — inter-annotator agreement (Cohen's kappa) for human-labelled data
Coverage — do your images cover the full range of expected deployment conditions?
Train/val/test overlap — check for near-duplicates across splits using perceptual hashing
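Two of these metrics reduce to short calculations. The sketch below computes a simple imbalance ratio (largest class count divided by smallest) and Cohen's kappa for two annotators who labelled the same items. The function names and the example labels are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def imbalance_ratio(labels: list[str]) -> float:
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four images from two annotators.
annotator_a = ["bolt", "bolt", "nut", "nut"]
annotator_b = ["bolt", "bolt", "nut", "bolt"]
kappa = cohens_kappa(annotator_a, annotator_b)

ratio = imbalance_ratio(["bolt"] * 90 + ["nut"] * 10)
```

In this example the annotators agree on three of four items (0.75) against a chance agreement of 0.5, giving a kappa of 0.5. The 90/10 label split gives an imbalance ratio of 9.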

Conclusion

High-quality domain-specific training data, not model architecture, is often the limiting factor in robotics vision system performance. A model trained on data from a systematic pipeline (ethical scraping, deduplication, synthetic generation with domain randomisation, and rigorous quality tracking) will often outperform a larger model trained on poor data. Invest in your dataset pipeline as seriously as your model architecture.

Building Real-World Training Datasets for Robotics with Web Scraping and Synthetic Data

Worksprout Team
