NLP Pipeline Engineering: From Tokenisation to Transformer Fine-Tuning
Machine Learning

Worksprout Team · Nov 25, 2024 · 11 min read

A practical guide to building robust NLP pipelines — text preprocessing, tokenisation strategies, embeddings, and fine-tuning transformer models for domain-specific tasks.

The NLP Pipeline View

Modern NLP applications — intent classification, named entity recognition, sentiment analysis, question answering — all share a common pipeline structure: raw text enters, passes through preprocessing and tokenisation, gets converted to embeddings or token IDs, flows through a model, and produces structured output. Understanding each stage in depth enables you to debug failures, optimise performance, and adapt pre-trained models to your domain efficiently.

Text Preprocessing

Preprocessing decisions made before tokenisation have significant downstream impact. For technical domains (embedded systems documentation, support tickets, log files), standard preprocessing such as lowercasing and punctuation stripping often hurts performance. Preserve technical tokens: model numbers, version strings, CLI commands, and error codes carry semantic meaning that generic normalisation destroys.

import re

def preprocess_technical(text: str) -> str:
    # Normalise whitespace but preserve structure
    text = re.sub(r"\r\n|\r", "\n", text)
    text = re.sub(r"\t", " ", text)
    # Collapse runs of non-newline whitespace into a single space, keeping newlines
    text = re.sub(r"[^\S\n]+", " ", text)
    # Do NOT lowercase — RDK-B and rdk-b are different entities
    return text.strip()
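
One way to check that a preprocessing step respects this rule is to extract technical tokens before and after it runs and compare the two sets. The sketch below is illustrative only: `TECHNICAL_TOKEN`, `naive_normalise`, `whitespace_only`, and `lost_tokens` are hypothetical names that are not part of the original pipeline, and the regexes cover just three token shapes (hex codes, version strings, hyphenated upper-case identifiers).

```python
import re

# Hypothetical patterns for illustration; tune them to your own corpus.
TECHNICAL_TOKEN = re.compile(
    r"""
      \b0x[0-9A-Fa-f]+\b                # hex error codes, e.g. 0x8007000E
    | \bv?\d+(?:\.\d+){1,3}\b           # version strings, e.g. v2.4.1
    | \b[A-Z]{2,}(?:-[A-Z0-9]+)+\b      # hyphenated identifiers, e.g. RDK-B
    """,
    re.VERBOSE,
)


def naive_normalise(text: str) -> str:
    """Generic normalisation: lowercase and replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())


def whitespace_only(text: str) -> str:
    """Conservative normalisation: collapse spaces and tabs, keep case and punctuation."""
    return re.sub(r"[^\S\n]+", " ", text).strip()


def lost_tokens(original: str, processed: str) -> list[str]:
    """Technical tokens found in the original text but missing after processing."""
    kept = set(TECHNICAL_TOKEN.findall(processed))
    return [tok for tok in TECHNICAL_TOKEN.findall(original) if tok not in kept]


sample = "Upgrade RDK-B\tto v2.4.1  fixed error 0x8007000E on TR-181 lookups"
print(lost_tokens(sample, naive_normalise(sample)))   # tokens destroyed by generic cleanup
print(lost_tokens(sample, whitespace_only(sample)))   # tokens destroyed by whitespace cleanup
```

Running a check like this over a sample of real documents gives a quick signal about whether a proposed normalisation step is safe for your domain.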

Tokenisation and Vocabulary

Modern transformer models use subword tokenisation (BPE or WordPiece) that handles out-of-vocabulary technical terms by decomposing them into known subword units. The choice of tokeniser and its pre-trained vocabulary significantly affects how well technical terminology is represented:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

text = "Configure TR-181 Device.WiFi.AccessPoint.1.Security.ModeEnabled via CWMP"
tokens = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]))
# Analyse how technical tokens are split and whether adding them to vocabulary helps

For highly specialised domains, extend the tokeniser vocabulary with domain-specific terms and continue pre-training on domain text before task fine-tuning.
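
Before extending the vocabulary, it helps to measure which terms are actually over-fragmented. The sketch below is a toy, self-contained illustration: `wordpiece_split` is a simplified greedy longest-match splitter and `TOY_VOCAB` is a made-up vocabulary, so neither reproduces the DeBERTa tokeniser. With a real Hugging Face tokeniser you would pass `tokenizer.tokenize` as the `tokenize` argument instead, then register the selected terms with `tokenizer.add_tokens(...)` and call `model.resize_token_embeddings(len(tokenizer))` so the embedding matrix matches the new vocabulary size.

```python
from typing import Callable, Iterable

# Made-up vocabulary for illustration only; "##" marks a word-internal piece.
TOY_VOCAB = {"config", "##ure", "tr", "##-", "##18", "##1", "cw", "##mp", "wifi"}


def wordpiece_split(word: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Simplified greedy longest-match-first subword split of one word."""
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no vocabulary entry covers this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def over_fragmented(
    terms: Iterable[str],
    tokenize: Callable[[str], list[str]],
    max_pieces: int = 2,
) -> dict[str, int]:
    """Map each term that splits into more than max_pieces pieces to its piece count."""
    counts = {term: len(tokenize(term)) for term in terms}
    return {term: n for term, n in counts.items() if n > max_pieces}


terms = ["configure", "tr-181", "cwmp", "wifi"]
for term in terms:
    print(term, wordpiece_split(term))
print(over_fragmented(terms, wordpiece_split))  # candidates for vocabulary extension
```

Piece count is only one heuristic. Weigh it against how often each term appears in your corpus, since every added token needs enough training examples to learn a useful embedding.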

Embeddings: Choosing the Right Representation

For semantic search and RAG, sentence-level embeddings matter more than token-level representations. Evaluate embedding models on your specific domain using MTEB (Massive Text Embedding Benchmark) as a framework, but always validate on your own data — benchmark rankings do not always transfer to specialised domains.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

docs = ["RDK-B implements TR-181 via CCSP components",
        "Yocto Project builds custom Linux for embedded targets"]
queries = ["How does RDK manage device parameters?"]

doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode(queries, normalize_embeddings=True)

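# With normalised embeddings, cosine similarity equals the dot product
scores = cosine_similarity(query_embedding, doc_embeddings)
print(scores)  # shape (1, 2): one row per query, one column per document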
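
Validating on your own data can be as simple as a small set of queries with one known relevant document each. The following sketch computes recall@k and mean reciprocal rank (MRR) from embeddings. The function names and the toy 2-D vectors are hypothetical stand-ins for real `model.encode(...)` output, and the code assumes L2-normalised vectors so that the dot product equals cosine similarity.

```python
from typing import Sequence

Vector = Sequence[float]


def rank_of(query: Vector, docs: Sequence[Vector], relevant: int) -> int:
    """1-based rank of the relevant document when docs are sorted by dot product."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order.index(relevant) + 1


def recall_at_k(queries: Sequence[Vector], docs: Sequence[Vector],
                relevant: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    ranks = [rank_of(q, docs, r) for q, r in zip(queries, relevant)]
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_reciprocal_rank(queries: Sequence[Vector], docs: Sequence[Vector],
                         relevant: Sequence[int]) -> float:
    """Average of 1/rank of the relevant document across all queries."""
    ranks = [rank_of(q, docs, r) for q, r in zip(queries, relevant)]
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Toy unit-length vectors; relevant[i] is the index of the correct doc for queries[i].
docs = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
queries = [(0.8, 0.6), (0.0, 1.0)]
relevant = [0, 2]

print(recall_at_k(queries, docs, relevant, k=1))
print(recall_at_k(queries, docs, relevant, k=2))
print(mean_reciprocal_rank(queries, docs, relevant))
```

Running the same labelled set through two or three candidate embedding models gives a like-for-like comparison on your domain rather than on a public benchmark.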

Fine-Tuning for Domain Adaptation

Fine-tuning a pre-trained transformer on labelled domain data typically outperforms a generic model on in-domain tasks, often by 10-25%, though the margin depends on the task, the metric, and how much labelled data you have. Parameter-efficient fine-tuning (PEFT) with LoRA adapters gets close to full fine-tuning performance while training only a small fraction of the parameters:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType

# Placeholder label set for illustration; replace with your own intent classes
INTENT_LABELS = ["configure", "diagnose", "upgrade"]

base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=len(INTENT_LABELS)
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    target_modules=["query_proj", "key_proj", "value_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Example output (exact counts vary with num_labels and library versions):
# trainable params: 1,835,008 || all params: 184,435,716 (0.99% trainable)
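
To see why the trainable fraction is so small, it helps to work out the adapter arithmetic. LoRA attaches two low-rank matrices to each adapted weight, so the added parameter count per weight is r × (d_in + d_out). The sketch below assumes a hidden size of 768 and 12 transformer layers for `deberta-v3-base`, and it counts adapter weights only. The figure printed by `print_trainable_parameters()` will typically be higher, because it also includes any non-adapter modules left trainable, such as the classification head.

```python
def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank matrices per adapted weight: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


HIDDEN = 768   # assumed hidden size of deberta-v3-base
LAYERS = 12    # assumed number of transformer layers
TARGETS = 3    # query_proj, key_proj, value_proj
RANK = 16      # r in the LoraConfig above

per_matrix = lora_adapter_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * TARGETS * LAYERS
print(f"per adapted matrix: {per_matrix:,}")
print(f"all adapters:       {total:,}")
```

Doubling `r` doubles the adapter parameter count, which makes the rank a convenient dial for trading capacity against memory.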

Evaluation That Matters

Macro F1 on a balanced test set is the standard metric, but for production NLP systems also measure:

  • Confidence calibration: is a prediction made with 90% confidence actually correct 90% of the time?
  • Performance on tail classes: rare intents or entities often have disproportionate business importance
  • Latency at p99: transformers are large, so profile on your inference hardware
  • Robustness to typos and paraphrasing: test with augmented inputs that mimic real user behaviour
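
The calibration and latency checks above can be turned into numbers with very little code. This is a minimal sketch: `expected_calibration_error` uses equal-width confidence bins (one common formulation of ECE) and `percentile` uses the nearest-rank method. Both function names and the sample values are hypothetical. In practice the confidences would come from your model's softmax outputs and the latencies from timing real inference calls.

```python
import math
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy across bins."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Ten predictions at 90% confidence: well calibrated if about nine are correct.
confs = [0.9] * 10
print(expected_calibration_error(confs, [True] * 9 + [False]))       # close to 0
print(expected_calibration_error(confs, [True] * 5 + [False] * 5))   # close to 0.4

latencies_ms = list(range(1, 101))  # placeholder timings
print(percentile(latencies_ms, 99))
```

A low ECE on the overall test set can still hide poorly calibrated tail classes, so it is worth computing it per class as well.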

Conclusion

Building effective NLP pipelines for technical domains requires disciplined attention to how domain-specific language is represented at every stage. Fine-tuned transformer models with domain-appropriate preprocessing consistently outperform generic off-the-shelf solutions. Invest in evaluation infrastructure from the start — it is the only reliable way to know whether your pipeline improvements are actually improvements.

Worksprout Team

The Worksprout engineering team specialises in embedded Linux, RDK-B broadband platforms, edge AI, and robotics systems. Based in Rajshahi, Bangladesh, we design and deploy production embedded intelligence for clients across South Asia and beyond.
