TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
A practical guide to building robust NLP pipelines — text preprocessing, tokenisation strategies, embeddings, and fine-tuning transformer models for domain-specific tasks.
Modern NLP applications — intent classification, named entity recognition, sentiment analysis, question answering — all share a common pipeline structure: raw text enters, passes through preprocessing and tokenisation, gets converted to embeddings or token IDs, flows through a model, and produces structured output. Understanding each stage in depth enables you to debug failures, optimise performance, and adapt pre-trained models to your domain efficiently.
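The stage-by-stage structure described above can be sketched as a chain of plain functions. This is a minimal illustration, not a production design: the `Prediction` type, the `run_pipeline` helper, and every toy stage below are hypothetical stand-ins so the example runs without any ML dependency.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prediction:
    """Structured output of the pipeline (fields are illustrative)."""
    label: str
    score: float


def run_pipeline(
    raw_text: str,
    preprocess: Callable[[str], str],
    tokenise: Callable[[str], List[str]],
    encode: Callable[[List[str]], List[int]],
    model: Callable[[List[int]], Prediction],
) -> Prediction:
    """Pass raw text through each stage in order."""
    cleaned = preprocess(raw_text)
    tokens = tokenise(cleaned)
    token_ids = encode(tokens)
    return model(token_ids)


# Toy stand-ins for each stage (all hypothetical).
toy_vocab = {"Reboot": 1, "the": 2, "gateway": 3}


def toy_model(token_ids: List[int]) -> Prediction:
    # Pretend token ID 1 signals a reboot intent.
    if 1 in token_ids:
        return Prediction("reboot_device", 1.0)
    return Prediction("other", 0.5)


result = run_pipeline(
    "  Reboot the gateway ",
    preprocess=str.strip,
    tokenise=str.split,
    encode=lambda toks: [toy_vocab.get(t, 0) for t in toks],
    model=toy_model,
)
print(result)
```

Keeping each stage behind its own callable makes it easier to test stages in isolation and to swap one of them (for example the tokeniser) when debugging a failure.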
Preprocessing decisions made before tokenisation have significant downstream impact. For technical domains (embedded systems documentation, support tickets, log files), standard preprocessing often hurts performance. Preserve technical tokens: model numbers, version strings, CLI commands, and error codes carry semantic meaning that generic normalisation destroys.
import re
def preprocess_technical(text: str) -> str:
    # Normalise whitespace but preserve structure:
    # CRLF and bare CR become LF, tabs become spaces
    text = re.sub(r"\r\n|\r", "\n", text)
    text = re.sub(r"\t", " ", text)
    # Collapse runs of whitespace other than newlines into a single space
    text = re.sub(r"[^\S\n]+", " ", text)
    # Do NOT lowercase: RDK-B and rdk-b are different entities
    return text.strip()
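To see why generic normalisation is risky, it can help to count how many technical tokens survive it. The sketch below is an assumption-laden illustration: the three regular expressions, the `find_technical_tokens` helper, and the `naive_normalise` function are hypothetical and would need tuning to a real corpus.

```python
import re

# Illustrative patterns only; real corpora need their own rules.
TECHNICAL_PATTERNS = {
    "version": re.compile(r"\bv?\d+(?:\.\d+){1,3}\b"),
    "error_code": re.compile(r"\b0x[0-9A-Fa-f]+\b|\bE[A-Z]{3,}\b"),
    "identifier": re.compile(r"\b[A-Z]{2,}-[A-Z0-9]+\b"),
}


def find_technical_tokens(text: str) -> list:
    """Return (kind, token) pairs in order of appearance."""
    found = []
    for kind, pattern in TECHNICAL_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), kind, match.group()))
    return [(kind, token) for _, kind, token in sorted(found)]


def naive_normalise(text: str) -> str:
    """Generic normalisation: lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


sample = "RDK-B v2.4.1 failed with ECONNREFUSED (code 0x1F) on TR-181 set"
before = find_technical_tokens(sample)
after = find_technical_tokens(naive_normalise(sample))
print(len(before), "technical tokens before normalisation:", before)
print(len(after), "technical tokens after normalisation:", after)
```

Running a check like this over a sample of your own documents gives a quick estimate of how much domain signal a given preprocessing step removes.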
Modern transformer models use subword tokenisation (BPE or WordPiece) that handles out-of-vocabulary technical terms by decomposing them into known subword units. The choice of tokeniser and its pre-trained vocabulary significantly affects how well technical terminology is represented:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
text = "Configure TR-181 Device.WiFi.AccessPoint.1.Security.ModeEnabled via CWMP"
tokens = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]))
# Analyse how technical tokens are split and whether adding them to vocabulary helps
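The decomposition into subword units can be illustrated with a toy greedy longest-match splitter in the style of WordPiece. This is a sketch under stated assumptions: the `wordpiece_split` function and the six-entry vocabulary are hypothetical, and real tokenisers use learned vocabularies with tens of thousands of entries and may use a different algorithm entirely.

```python
from typing import List, Set


def wordpiece_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword units.

    Units that do not start the word carry a "##" prefix, following the
    WordPiece convention.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known unit covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"Access", "##Point", "Mode", "##Enabled", "Wi", "##Fi"}

for term in ["AccessPoint", "ModeEnabled", "WiFi", "CWMP"]:
    print(term, "->", wordpiece_split(term, toy_vocab))
```

In this toy setup the first three terms split into two known units each, while a term with no matching units falls back to the unknown token. That fallback is the failure mode a domain-specific vocabulary is meant to avoid.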
For highly specialised domains, extend the tokeniser vocabulary with domain-specific terms and continue pre-training on domain text before task fine-tuning.
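One possible shape for the vocabulary extension step is sketched below. `tokenizer.add_tokens` and `model.resize_token_embeddings` are the Hugging Face calls normally used for this. The `select_fragmented_terms` helper, the `max_pieces` threshold, and the toy tokeniser are hypothetical and only show one way to pick candidate terms.

```python
from typing import Callable, Iterable, List


def select_fragmented_terms(
    terms: Iterable[str],
    tokenise: Callable[[str], List[str]],
    max_pieces: int = 3,
) -> List[str]:
    """Return the terms that the tokeniser splits into more than max_pieces units."""
    return [term for term in terms if len(tokenise(term)) > max_pieces]


def extend_vocabulary(tokenizer, model, new_tokens: List[str]) -> int:
    """Add tokens to a Hugging Face tokenizer and resize the model embeddings.

    The new embedding rows are freshly initialised, which is why continued
    pre-training on domain text is recommended before task fine-tuning.
    """
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added:
        model.resize_token_embeddings(len(tokenizer))
    return num_added


# Toy tokeniser (hypothetical): fixed four-character chunks.
def toy_tokenise(term: str) -> List[str]:
    return [term[i:i + 4] for i in range(0, len(term), 4)]


candidates = select_fragmented_terms(
    ["CWMP", "TR-181", "ModeEnabled"], toy_tokenise, max_pieces=2
)
print(candidates)
```

With a real tokenizer you would pass `tokenizer.tokenize` as the `tokenise` argument, then call `extend_vocabulary(tokenizer, model, candidates)` on the loaded tokenizer and model.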
For semantic search and RAG, sentence-level embeddings matter more than token-level representations. Evaluate embedding models on your specific domain using MTEB (Massive Text Embedding Benchmark) as a framework, but always validate on your own data — benchmark rankings do not always transfer to specialised domains.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
docs = [
    "RDK-B implements TR-181 via CCSP components",
    "Yocto Project builds custom Linux for embedded targets",
]
queries = ["How does RDK manage device parameters?"]
# normalize_embeddings=True returns unit-length vectors
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode(queries, normalize_embeddings=True)
# scores has shape (n_queries, n_docs); higher means more similar
scores = cosine_similarity(query_embedding, doc_embeddings)
print(scores)
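Validating on your own data can start with a very small harness: a handful of queries, each labelled with the documents that should be retrieved, and a hit rate at k. The sketch below is pure Python and assumes the embeddings are already unit-normalised, so the dot product equals cosine similarity. The two-dimensional vectors and the `relevant` labels are made-up toy data, and the helper names are hypothetical.

```python
from typing import List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k highest-scoring documents, best first."""
    scores = [dot(query_vec, doc) for doc in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def hit_rate_at_k(
    query_vecs: List[Sequence[float]],
    doc_vecs: List[Sequence[float]],
    relevant: List[Set[int]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for query_vec, wanted in zip(query_vecs, relevant):
        if wanted & set(top_k(query_vec, doc_vecs, k)):
            hits += 1
    return hits / len(query_vecs)


# Toy unit-length embeddings (made up for illustration).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vecs = [[0.8, 0.6], [0.0, 1.0]]
relevant = [{0}, {1}]  # relevant document indices for each query

print("hit rate @1:", hit_rate_at_k(query_vecs, doc_vecs, relevant, k=1))
print("hit rate @2:", hit_rate_at_k(query_vecs, doc_vecs, relevant, k=2))
```

Running the same harness against two candidate embedding models on your own labelled queries gives a direct comparison that does not depend on public benchmark rankings.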
Fine-tuning a pre-trained transformer on labelled domain data typically outperforms a generic model by 10-25% on in-domain tasks. Parameter-efficient fine-tuning (PEFT) with LoRA adapters achieves near-full fine-tuning performance while training only a small fraction of the parameters:
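from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
# Hypothetical placeholder labels; replace with your own intent set
INTENT_LABELS = ["reboot_device", "check_status", "update_firmware"]
base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=len(INTENT_LABELS)
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,            # rank of the low-rank update matrices
    lora_alpha=32,   # the update is scaled by lora_alpha / r
    target_modules=["query_proj", "key_proj", "value_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Example output (exact counts depend on the label count and library version):
# trainable params: 1,835,008 || all params: 184,435,716 (0.99% trainable)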
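The "fraction of the trainable parameters" claim follows from simple arithmetic: LoRA replaces the update to a d_out x d_in weight matrix with two factors of shapes d_out x r and r x d_in, so each adapted matrix adds r * (d_in + d_out) parameters. The sketch below works through that count. It assumes a base-size encoder with 12 layers and hidden size 768, and it counts only the adapter matrices. The figure PEFT prints can also include other modules left trainable, such as the classification head, so the two numbers are not expected to match.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_adapter_params(hidden_size: int, num_layers: int, num_targets: int, r: int) -> int:
    """Adapter parameters when num_targets square projections are adapted per layer."""
    return num_layers * num_targets * lora_params_per_matrix(hidden_size, hidden_size, r)


# Assumed dimensions of a base-size encoder (not read from a checkpoint).
hidden_size, num_layers = 768, 12
num_targets = 3            # query, key and value projections
r, lora_alpha = 16, 32

adapter_params = lora_adapter_params(hidden_size, num_layers, num_targets, r)
full_params = num_layers * num_targets * hidden_size * hidden_size
scaling = lora_alpha / r   # factor applied to the low-rank update

print(f"adapter parameters: {adapter_params:,}")
print(f"same matrices, fully trained: {full_params:,}")
print(f"ratio: {adapter_params / full_params:.4f}, scaling: {scaling}")
```

For square matrices the ratio simplifies to 2r / hidden_size, which is why doubling the rank doubles the adapter size while the base model stays frozen.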
Macro F1 on a balanced test set is the standard metric, but for production NLP systems also measure:
Building effective NLP pipelines for technical domains requires disciplined attention to how domain-specific language is represented at every stage. Fine-tuned transformer models with domain-appropriate preprocessing consistently outperform generic off-the-shelf solutions. Invest in evaluation infrastructure from the start — it is the only reliable way to know whether your pipeline improvements are actually improvements.