Skip to main content
HP
Back to All Projects

Redis Search Engine

TF-IDF search engine built on Redis sorted sets

PythonRedisTF-IDFCLI

The Problem

Full-text search is a fundamental problem in computer science, but most developers just plug in Elasticsearch without understanding what happens underneath. This project builds a search engine from scratch using Redis sorted sets for storage and TF-IDF for relevance ranking. This proves you can build high-performance search with surprisingly simple primitives.

Architecture

ZADDZUNIONSTORECLI InputargparseTokenizerStop WordsTF CalculatorTerm FrequencyRedisSorted SetsQuery EngineIDF + RankingRanked ResultsJSON Output

Technical Decisions

Why Redis sorted sets instead of a regular database?

Redis sorted sets give me O(log N) inserts and O(log N + M) range queries, where M is the result size. For a search engine, the "score" in a sorted set IS the relevance score. ZUNIONSTORE lets me combine scores across multiple terms with custom weights (the IDF component) in a single atomic operation. It's elegant. The data structure IS the algorithm.

How does TF-IDF scoring work here?

TF (term frequency) counts how often a word appears in a document, stored as the score in Redis sorted sets at index time. IDF (inverse document frequency) is computed at query time: log(total_documents / documents_containing_term). The final score is TF × IDF. High TF-IDF means a word is frequent in this document but rare globally. Exactly what you want for relevance ranking.

How do negative search terms work?

When you search 'python -flask', the engine first finds all documents matching 'python', then computes the set of documents containing 'flask', and subtracts using Redis ZDIFFSTORE. The set operations make this trivially efficient. No iteration needed.

Code Spotlight

search/engine.py
def _compute_scores(self, terms, n_docs):
    """Compute TF-IDF scores for search terms."""
    weights = {}
    for term in terms:
        df = self.redis.scard(f"idx:{term}")
        if df == 0:
            continue
        # IDF: log(N / df) - rare terms score higher
        idf = math.log(n_docs / df)
        weights[f"idx:{term}"] = idf

    if not weights:
        return []

    # ZUNIONSTORE: combine sorted sets with IDF weights
    # TF is already the score; multiply by IDF
    dest = f"tmp:search:{uuid4().hex}"
    self.redis.zunionstore(dest, weights)
    results = self.redis.zrevrange(
        dest, 0, -1, withscores=True
    )
    self.redis.delete(dest)
    return results

The core of the search engine. TF is stored as Redis sorted set scores at index time. At query time, IDF weights are computed and ZUNIONSTORE combines everything in one atomic operation. The data structure IS the algorithm.

Key Features

Search & Indexing

  • TF-IDF relevance scoring with logarithmic IDF
  • Tokenization with English stop word filtering
  • Negative term exclusion (e.g., 'search -redis')
  • JSON bulk loader for batch document indexing

Performance

  • Redis pipelines for batch operations during indexing
  • Sub-millisecond query response on 10,000+ documents
  • Pagination support with --offset and --limit

Interface

  • Full CLI: index, search, remove, stats commands
  • JSON output support for scripting
  • Programmatic usage via SearchEngine class

Testing

Unit and integration tests covering indexing, search ranking, negative terms, and edge cases.

Unit + integration test suite

  • Tests run against a local Redis instance
  • Covers indexing, search, negative terms, pagination, and stats

Results

01

10,000+ documents indexed with sub-millisecond query response

02

TF-IDF scoring produces Google-relevant ranking quality

03

Negative term exclusion via Redis set operations

04

Clean CLI interface with JSON output support