When brute force beats HNSW — a DuckDB VSS lesson
- At small N, DuckDB's vectorised columnar scan beats HNSW because round-trip and planner overhead swamp the O(log N) saving; the index is dead weight until ~10k+ items.
- Lesson: a vector index is a property of the query shapes the planner can rewrite, not of the column. Run `EXPLAIN ANALYZE` before trusting the 'index created successfully' message from DuckDB.
I was reading this post on HNSW vectors the other day, and it prompted me to share a few notes from my own experience re-engineering this site, once again on top of a static-site generator (this time written in Python).
This new static-site generator, among other things, computes semantic similarity between blog posts so the build step can do three things automatically:
- surface "related posts" at the bottom of every article,
- pair English and Italian translations for hreflang, and
- suggest internal-link opportunities at chunk level.
All three rely on cosine similarity over sentence-transformers embeddings, stored and queried inside DuckDB's VSS extension.
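The storage side is simple enough that a short sketch covers it. The model name, table schema, and file name below are placeholders rather than the site's actual code; only the 384-dimensional FLOAT array column matches what the pipeline really uses:

```python
import duckdb
from sentence_transformers import SentenceTransformer

# Placeholder names throughout; the real pipeline differs in detail.
model = SentenceTransformer("all-MiniLM-L6-v2")   # a 384-dimensional model
con = duckdb.connect("site.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS articles (slug TEXT, lang TEXT, embedding FLOAT[384])
""")

vec = model.encode("full article text here", normalize_embeddings=True)
con.execute(
    "INSERT INTO articles SELECT ?, ?, ?::FLOAT[384]",
    ["some-post", "en", vec.tolist()],
)
```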
When I first wired this up I added an HNSW index. It seemed like the obvious thing to do: vectors deserve a vector index, the docs even encourage it. It turned out the index was never being used, not by any of the three queries. I only noticed by chance: the SQL I had written didn't match the pattern DuckDB's optimiser needs in order to invoke HNSW at all.
This post is the post-mortem. It's the kind of mistake that's easy to make when the words "vector index" feel like the right answer, and the cost of being wrong is invisible because the dataset is small enough that the brute-force fallback finishes in milliseconds anyway.
What is HNSW?
HNSW stands for Hierarchical Navigable Small World. It's a graph-based algorithm for approximate nearest-neighbour search in high-dimensional space — the workhorse data structure inside Pinecone, Weaviate, Qdrant, FAISS, and the VSS extensions of Postgres (pgvector) and DuckDB.
The way I ended up picturing it, after a couple of passes at the original paper: there is a layered graph sitting over the vectors. The top layer is sparse — only a handful of nodes — and each layer below it gets progressively denser, down to the bottom layer that holds every vector I've indexed. When I issue a query I drop into the top layer, walk greedily towards the query vector, descend a level, refine, and keep going until I hit the bottom. The cost per query collapses from the O(N) scan I'd otherwise pay to roughly O(log N) — and the first time it clicked in my head I remember thinking the design was almost too elegant to be true.
That's the magic, and it really does work. At a million vectors HNSW finishes a top-K query in tens of microseconds when an exhaustive scan would take seconds. It's why every modern vector database leans on it.
The catch — and this is the part I missed — is that HNSW only knows how to answer one kind of question: "given this single query vector, what are its K nearest neighbours in the index?" That's the only operation. Everything else is up to you and your query planner.
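To make the layered walk from a couple of paragraphs back concrete, here is a toy sketch of the descent in plain Python over a handcrafted layer structure. It leaves out everything that makes real HNSW good (the ef-bounded beam search, insertion heuristics, level assignment), so treat it as a picture of the walk rather than an implementation:

```python
import numpy as np

def greedy_descent(layers, vectors, query, entry):
    """layers: adjacency dicts from sparsest (top) to densest (bottom);
    vectors: an (N, d) array of indexed points; entry: start node in the top layer."""
    current = entry
    for graph in layers:
        improved = True
        while improved:                       # greedy local search within this layer
            improved = False
            best = np.linalg.norm(vectors[current] - query)
            for neighbour in graph.get(current, ()):
                dist = np.linalg.norm(vectors[neighbour] - query)
                if dist < best:               # hop to any strictly closer neighbour
                    best, current, improved = dist, neighbour, True
    return current                            # an approximate nearest neighbour
```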
Why I added the index in the first place
When I plugged DuckDB's VSS into the build pipeline, the codebase looked like this:
con.execute("INSTALL vss; LOAD vss;")
# … bulk-load article vectors …
con.execute("""
CREATE INDEX hnsw_idx ON articles
USING HNSW (embedding) WITH (metric = 'cosine')
""")
I added the CREATE INDEX line on autopilot. The reasoning was: I have a 384-dimensional FLOAT array column, I'm going to query it for similarity, this is what one does. The build logs reported the index was created successfully and I moved on to the next thing.
What I didn't do, and what would have caught the mistake immediately, was run `EXPLAIN ANALYZE` on the actual SQL the application was issuing.
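The check is a couple of lines. Assuming `internal_links_sql` holds one of the app's real queries (the internal-links SQL shown below, say), something like this prints the executed plan; if no HNSW operator appears in it, the index never ran. If your DuckDB version won't prepare an EXPLAIN with bound parameters, inline the literals first:

```python
# Ask DuckDB for the plan it actually executed, parameters bound as in production.
plan_rows = con.execute("EXPLAIN ANALYZE " + internal_links_sql, [0.55, 3]).fetchall()
print("\n".join(str(col) for row in plan_rows for col in row))
```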
The three queries
I used DuckDB vector similarity in three places:
- Related posts, at build time. For every article, find its top-3 same-language siblings.
- Hreflang pair suggestions. For every Italian post, find the closest English counterpart above a high threshold.
- Internal-link suggestions. For every chunk of every body, find the top-3 posts with cosine ≥ 0.55, excluding the source.
All three are written in the same shape — a self-join on the embedding column, cosine in the projection, and a ROW_NUMBER() window function picking the top-K per source. The internal-links query, which is the heaviest of the three, looks like this:
WITH pairs AS (
SELECT c.slug AS src, c.chunk_idx,
a.slug AS dst,
array_cosine_similarity(c.embedding, a.embedding) AS score
FROM chunks c, articles a
WHERE c.lang = a.lang AND c.slug != a.slug
),
ranked AS (
SELECT src, chunk_idx, dst, score,
ROW_NUMBER() OVER (PARTITION BY src, chunk_idx
ORDER BY score DESC) AS rn
FROM pairs WHERE score >= ?
)
SELECT src, chunk_idx, dst, score
FROM ranked WHERE rn <= ?
That's a CROSS JOIN that materialises every (chunk, article) pair, computes the cosine in vectorised columnar form, and then partitions and ranks. Nowhere in there does the planner see the only shape it knows how to rewrite into an HNSW probe, which is roughly:
SELECT slug FROM articles
ORDER BY array_cosine_similarity(embedding, ?) DESC
LIMIT K
A bound parameter vector, ORDER BY similarity, LIMIT K. That single shape, and as far as I can tell only that shape, triggers HNSW. The CROSS JOIN with cosine in the projection falls back to brute-force pairwise computation. So does the equivalent self-join used by the related-posts query. So does the hreflang-pair query. The HNSW index is built, lives in memory, and answers exactly zero questions.
The benchmark
Once I understood that, I wanted to know what the alternatives looked like. I wrote a small benchmark that runs three method shapes on the same dataset and compares wall-clock time and result-set agreement:
| Method | Description |
|---|---|
| A — brute force CROSS JOIN | the current production query (ignored by HNSW) |
| B — HNSW per-chunk Python loop | pull every chunk's vector to Python; issue one parameterised ORDER BY similarity LIMIT K query per chunk; this shape does invoke HNSW |
| C — HNSW LATERAL join | a single SQL using LATERAL so the inner query runs once per outer row, hoping the planner pushes HNSW into the lateral side |
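The harness itself is unremarkable. Roughly this shape, with `run_cross_join`, `run_python_loop`, and `run_lateral` standing in for the three methods and each returning its result rows as a set:

```python
import statistics
import time

def bench(label, fn, repeats=3):
    """Time one method a few times; return its result set and print median/best wall-clock."""
    times, result = [], None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        times.append((time.perf_counter() - start) * 1000)
    print(f"{label}: median {statistics.median(times):.0f} ms, best {min(times):.0f} ms")
    return result

a = bench("A cross join", run_cross_join)
b = bench("B python loop", run_python_loop)
c = bench("C lateral", run_lateral)
assert a == b == c   # identical sets means recall 1.0, nothing missing, nothing extra
```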
I ran each method three times on the English subset of my corpus — 118 posts split into 2,111 sentence-level chunks, ≈530,000 (chunk, article) pairs to score:
| Method | Median | Best | vs A |
|---|---|---|---|
| A · brute force CROSS JOIN | 328 ms | 307 ms | 1.00× |
| B · HNSW per-chunk Python loop | 112,634 ms | 102,717 ms | 0.003× (343× slower) |
| C · HNSW LATERAL join | 616 ms | 558 ms | 0.53× (1.9× slower) |
All three methods produced byte-for-byte identical result sets — recall 1.0000, zero missing, zero extra. So this isn't a quality trade-off; it's pure performance.
Why each non-winner loses
Method B is catastrophically slow for one reason that has nothing to do with HNSW. Every iteration of the loop pays a Python-to-DuckDB round-trip of around 60 ms just to ship the parameter vector and pull the result back. Across 2,111 chunks that's ≈127 seconds of plumbing before HNSW gets to do anything useful. The O(log N) saving inside DuckDB is real but tiny — maybe a millisecond per call against a brute-force scan of 250 articles — and the round-trip cost dwarfs it by two orders of magnitude.
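For reference, the loop looks roughly like this (`chunk_rows` is a placeholder for the chunk vectors pulled out beforehand). It keeps the inner statement in the bare top-K shape and pushes the production filters (same language, self-exclusion, the 0.55 threshold) to the Python side, which is a simplification of the real Method B; the `FLOAT[384]` cast assumes the 384-dimensional embeddings from earlier:

```python
K = 3
suggestions = {}
for slug, chunk_idx, vec in chunk_rows:        # 2,111 iterations, one round trip each
    rows = con.execute(
        """
        SELECT slug, array_cosine_similarity(embedding, ?::FLOAT[384]) AS score
        FROM articles
        ORDER BY score DESC
        LIMIT 13   -- over-fetch a little so the Python-side filters can drop rows
        """,
        [vec],
    ).fetchall()
    hits = [(dst, score) for dst, score in rows if dst != slug and score >= 0.55]
    suggestions[(slug, chunk_idx)] = hits[:K]
```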
Method C is more interesting, because on paper it looks perfect. It's a single SQL statement — no Python loop, no round-trips — and it uses LATERAL, a SQL construct that, for each row of the outer query, re-runs the inner subquery with that row's values. The inner subquery is written in exactly the shape HNSW recognises: a single input vector, ORDER BY similarity, LIMIT K.
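Concretely, Method C is a single statement along these lines (a sketch using the post's table names; the language and self-exclusion filters the production version needs are left out so the inner block stays in that bare shape):

```python
method_c_sql = """
    SELECT c.slug AS src, c.chunk_idx, hit.dst, hit.score
    FROM chunks c,
         LATERAL (
             SELECT a.slug AS dst,
                    array_cosine_similarity(a.embedding, c.embedding) AS score
             FROM articles a
             ORDER BY score DESC
             LIMIT 3
         ) AS hit
    WHERE hit.score >= 0.55
"""
rows = con.execute(method_c_sql).fetchall()
```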
Except it doesn't work. I asked DuckDB for the actual execution plan with `EXPLAIN ANALYZE` and found that, for every chunk in the outer query, the optimiser ignores HNSW and scans the whole table, exactly the way brute force does. The index sits in memory doing nothing here too. And the query pays the extra cost of the LATERAL machinery (preparing the subquery, running it row by row, collecting the results) on top of that same full scan. Net result: 1.9× slower than Method A's plain brute-force CROSS JOIN.
So at our scale all three methods produce the same answers, two of them are slower than the dumbest one, and the only one that uses HNSW at all is also the worst by far.
Why brute force wins this size class
Two things compound to make brute force the right answer at small N:
- DuckDB's columnar engine vectorises the dot-product loop. Computing cosine over a `FLOAT[384]` column on 800,000 pairs is sub-second on a laptop because the inner loop is SIMD-friendly and the data lives contiguously. There's nothing magic to optimise away (a rough analogue is sketched after this list).
- Round-trip and planner overhead are not free. At small N each call into DuckDB carries hundreds of microseconds to milliseconds of fixed cost, and that floor is enough to swamp HNSW's logarithmic savings until the index is large enough.
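To see why the scan is cheap, here is a rough NumPy stand-in for the kind of work the columnar engine does, with sizes taken from the benchmark; it illustrates the arithmetic, not the production code path:

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = rng.standard_normal((2_111, 384), dtype=np.float32)   # chunk embeddings
articles = rng.standard_normal((250, 384), dtype=np.float32)   # article embeddings

# Normalise rows so a plain dot product equals cosine similarity.
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
articles /= np.linalg.norm(articles, axis=1, keepdims=True)

scores = chunks @ articles.T                  # every (chunk, article) pair in one pass
top3 = np.argsort(-scores, axis=1)[:, :3]     # top-3 article indices per chunk
```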
The crossover where HNSW genuinely starts winning depends on the query shape and on how much fixed cost the host language adds per call, but for our setup it's somewhere upwards of 10,000 indexed items, and even then only if DuckDB's planner gains support for at least one of the join-like shapes. Below that, brute force is the right tool.
What I changed
I dropped the `CREATE INDEX hnsw_idx ON articles USING HNSW (embedding)` line from the embeddings service. The build is slightly faster (no time spent constructing an index nobody asks about) and the SQL is unchanged. All three queries (related posts, hreflang pairs, internal-link suggestions) keep working exactly as before, because they were already running brute force.
Whenever the corpus grows enough to warrant revisiting, the right move will be a Python loop issuing `ORDER BY similarity LIMIT K` probes (Method B in the benchmark), at which point the round-trip overhead becomes a fixed cost worth paying for the per-call O(log N) gain.
The takeaway
The mistake wasn't adding an HNSW index. The mistake was assuming the index was being used because it was created. Two habits I'm trying to internalise:
- A vector index is not a property of the column. It's a property of the query shapes a planner can rewrite into index probes. "Index created successfully" tells you nothing about whether your application's actual queries will ever touch it. The only authoritative answer is `EXPLAIN ANALYZE` on the SQL the app issues: read the plan, look for the index operator, and if it isn't there, neither is your speedup.
- Don't reach for an index until brute force is too slow to ship. At small N, DuckDB's vectorised columnar scan is hard to beat. Indexes pay off when their amortised cost (O(log N) per call instead of O(N)) clears the per-call overhead, which means you need both a large enough N and a query shape the planner accepts. Either one missing and the index is dead weight.
Most of my debugging time over the past few years has gone into things that looked like they were working because the surface logs said so. This was another one. The fix was a one-line deletion. The lesson is the same one as always: verify, don't assume.