NLP Search Engine

Solo Developer · 2026 · Ongoing · 1 person · 4 min read

Designed and built a modular NLP search pipeline from scratch with clean separation between ingestion, preprocessing, and indexing — resulting in a working inverted index and keyword lookup system ready for future enhancements like ranking and semantic search.

Overview

Building a search engine from the ground up to deeply understand how information retrieval works. The project focuses on creating a modular pipeline architecture where each stage — loading, normalization, tokenization, and indexing — is isolated and independently extensible. Currently at Stage 1 with a functional inverted index and basic keyword lookup.

Problem

Most developers use search as a black box. I wanted to understand how search actually works under the hood — from raw text ingestion to query resolution. The challenge was designing a system that is not just functional but architecturally clean enough to grow into a full-featured search engine with ranking, boolean queries, and eventually semantic search.

Constraints

  • Solo project — all design and implementation decisions are self-directed
  • Must be modular enough to support future features without rewriting
  • No external search libraries allowed — the goal is learning, not convenience
  • Python-only implementation for simplicity and rapid iteration
  • Dataset starts small (text files) but architecture must support scaling

Approach

Started with architecture before code. Designed a pipeline with clear separation of concerns: loaders handle file I/O, processors handle text normalization and tokenization, and the pipeline manager orchestrates the flow. The main entry point acts as a coordinator that assigns document IDs and builds the inverted index. This separation ensures each component can evolve independently.
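The stage separation described above can be sketched as a minimal coordinator. The names here are illustrative, not the project's actual module layout:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    """Coordinator wiring the isolated stages together (hypothetical sketch)."""
    load: Callable[[str], str]            # loader: path -> raw text
    normalize: Callable[[str], str]       # processor: raw text -> normalized text
    tokenize: Callable[[str], list[str]]  # processor: normalized text -> tokens

    def run(self, path: str) -> list[str]:
        # Each stage only sees the previous stage's output, so any one of
        # them can be swapped out without touching the others.
        return self.tokenize(self.normalize(self.load(path)))
```

Swapping in a better tokenizer or a new file format then means replacing a single callable rather than editing the pipeline itself.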

Key Decisions

Pipeline architecture with isolated stages

Reasoning:

Separating loaders, processors, and indexing into distinct modules means future upgrades (new file formats, better tokenizers, ranking algorithms) only require changing one component. This is how production search engines are structured.

Alternatives considered:
  • Single script approach (fast to write but impossible to maintain)
  • Framework-based approach (overkill for learning purposes)

Regex-based tokenization over simple split

Reasoning:

Python's str.split() only breaks on whitespace, so punctuation stays attached to tokens ("magic." instead of "magic") and there is no control over word boundaries. A regex pattern like \b\w+(?:[-']\w+)*\b keeps hyphenated words and contractions intact while stripping surrounding punctuation, giving consistent tokenization behavior.

Alternatives considered:
  • str.split() (too simplistic)
  • NLTK tokenizer (external dependency, defeats learning purpose)
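A minimal sketch of such a tokenizer using the pattern quoted above; the function name and the lowercasing step are assumptions, not the project's actual code:

```python
import re

# Pattern from the text: word characters, optionally joined by hyphens or apostrophes.
TOKEN_RE = re.compile(r"\b\w+(?:[-']\w+)*\b")

def tokenize(text: str) -> list[str]:
    """Return lowercase tokens, keeping hyphenated words and contractions intact."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("State-of-the-art search isn't magic."))
# ['state-of-the-art', 'search', "isn't", 'magic']
```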

Set-based inverted index

Reasoning:

Using sets for document ID storage automatically prevents duplicates and provides O(1) membership testing. For the current stage, presence/absence per document is sufficient — term frequency can be added later.

Alternatives considered:
  • List-based index (allows duplicates, needs manual dedup)
  • Dict with term frequency (premature optimization at this stage)
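A sketch of what the set-based index can look like; the use of defaultdict here is an assumption for brevity:

```python
from collections import defaultdict

def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)  # sets dedupe repeated terms for free
    return dict(index)

docs = {0: ["search", "engine"], 1: ["search", "index", "index"]}
index = build_index(docs)
print(index["search"])  # {0, 1}
print(index["index"])   # {1} -- duplicate occurrences collapsed
```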

Unicode NFKC normalization

Reasoning:

NFKC canonicalizes equivalent character sequences (e.g., a precomposed "é" versus "e" plus a combining accent) and folds compatibility characters like ligatures, so words such as "café" and "naïve" in real-world text match regardless of how they were encoded.

Alternatives considered:
  • NFC normalization (less aggressive, misses compatibility characters)
  • No normalization (would cause missed matches)
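A quick illustration using Python's standard unicodedata module; the wrapper function name is illustrative:

```python
import unicodedata

def normalize(text: str) -> str:
    """Fold text to NFKC so equivalent encodings compare equal."""
    return unicodedata.normalize("NFKC", text)

# A combining accent and a precomposed character normalize to the same string:
assert normalize("cafe\u0301") == "café"   # 'e' + U+0301 -> 'é'
# Compatibility characters are folded too, which NFC would leave alone:
assert normalize("\ufb01le") == "file"     # ligature 'fi' (U+FB01) -> 'f' + 'i'
```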

pathlib for file handling

Reasoning:

pathlib provides a modern, expressive API for filesystem operations compared to os.path. It makes the ingestion code more readable and maintains consistency across platforms.

Alternatives considered:
  • os.path (older API, less readable)
  • glob module (limited functionality)
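An ingestion sketch along these lines, which also sorts paths so document IDs are reproducible across runs (the flat .txt directory layout and function name are assumptions):

```python
from pathlib import Path

def load_documents(data_dir: str) -> dict[int, str]:
    """Read every .txt file under data_dir, assigning stable document IDs."""
    docs: dict[int, str] = {}
    # sorted() makes IDs deterministic; raw glob order is filesystem-dependent.
    for doc_id, path in enumerate(sorted(Path(data_dir).glob("*.txt"))):
        docs[doc_id] = path.read_text(encoding="utf-8")
    return docs
```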

Tech Stack

  • Python
  • Regex
  • pathlib
  • Unicode (NFKC)

Result & Impact

  • Pipeline Stages: 4 isolated modules (loader, normalizer, tokenizer, manager)
  • Index Coverage: full inverted index across all documents in the dataset
  • Architecture: modular design ready for 13+ planned enhancements
  • Query Support: single-keyword lookup with document ID resolution

This project is fundamentally about understanding, not just building. By constructing every component from scratch — text normalization, tokenization, inverted indexing — I gained deep insight into how search engines actually work. The modular architecture proved its value immediately: adding document ID assignment and the inverted index required zero changes to the processing pipeline. The system is now a solid foundation for adding TF-IDF ranking, boolean queries, and eventually semantic search.

Learnings

  • Architecture-first thinking pays off — designing the pipeline before writing code prevented rewrites
  • Separation of concerns is not theoretical — it directly reduces debugging time when adding features
  • Unicode handling matters immediately in real text, not just in edge cases
  • Regex tokenization reveals tradeoffs (e.g., a naive \w+ pattern splits "don't" into "don" and "t") that every search engine must address
  • Simple data structures (sets for inverted index) are often the right choice before optimizing
  • File iteration order affects reproducibility — sorting paths ensures consistent document IDs across runs

🔗 View on GitHub


For a detailed walkthrough of the implementation with code examples and architectural diagrams, read the full article series:

  • Stage 1: Building the Pipeline — Designing the modular architecture, implementing text normalization and tokenization, constructing the inverted index, and wiring up basic keyword search.

What’s Next

The foundation is set. The next stages will focus on making search actually intelligent:

  • Stopword removal — Filtering out common words like “the” and “a” that add noise
  • Stemming & lemmatization — Reducing words to their root forms for better recall
  • TF-IDF / BM25 ranking — Moving beyond presence/absence to relevance scoring
  • Boolean queries — Supporting AND, OR, NOT operators for complex searches
  • Multi-word queries — Handling phrases and multi-term searches
  • Semantic search — Adding embedding-based similarity for meaning-aware retrieval