Skip to content

bug: Vector collection search causes seqscan on 2M+ rows, hitting 10s context timeout (HNSW index ignored) #10186

@alixanderthegreat

Description

@alixanderthegreat

LocalAI version:
v4.3.6-59-gef80a0e8 (ef80a0e)

Environment, CPU architecture, OS, and Version:
Linux x86_64 / Compute Node configuration / PostgreSQL backend with pgvector

Describe the bug
When querying a large database collection (~2.4 million rows) via the /api/agents/collections/{name}/search endpoint using the PostgreSQL backend, the internal query engine constructs a SQL layout that completely blinds the PostgreSQL query planner.

Instead of executing an index-backed nearest-neighbor search, it triggers a full sequential table scan (seqscan). On massive datasets, this calculation takes longer than 10 seconds, consistently hitting LocalAI's internal client-side context deadline and dropping the request with an empty response payload ({"count":0,"results":[]}).

To Reproduce

  1. Populate a collection table via the PostgreSQL backend with a large dataset (>2,000,000 rows).
  2. Build a standard pgvector HNSW index on the embedding column.
  3. Submit a standard vector search request to the API collection endpoint.
  4. Check PostgreSQL engine logs to observe a statement cancellation precisely at the 10-second mark.

Expected behavior
The query engine should leverage the underlying pgvector HNSW/IVFFlat index structure to return the nearest neighbors within milliseconds, regardless of the overall row count.

Logs:

Jun 05 08:38:34 compute postgres[1989806]: LOG:  execute stmtcache_c2e7e...:
Jun 05 08:38:34 compute postgres[1989806]:         SELECT
Jun 05 08:38:34 compute postgres[1989806]:                 id::text,
Jun 05 08:38:34 compute postgres[1989806]:                 COALESCE(title, '') as title,
Jun 05 08:38:34 compute postgres[1989806]:                 content,
Jun 05 08:38:34 compute postgres[1989806]:                 metadata,
Jun 05 08:38:34 compute postgres[1989806]:                 (
Jun 05 08:38:34 compute postgres[1989806]:                         COALESCE((1 - (embedding <=> $3::vector)), 0) * $4
Jun 05 08:38:34 compute postgres[1989806]:                 ) as similarity
Jun 05 08:38:34 compute postgres[1989806]:         FROM documents_literature
Jun 05 08:38:34 compute postgres[1989806]:         WHERE embedding IS NOT NULL
Jun 05 08:38:34 compute postgres[1989806]:         ORDER BY similarity DESC
Jun 05 08:38:34 compute postgres[1989806]:         LIMIT $5
...
Jun 05 08:38:44 compute postgres[1989806]: ERROR:  canceling statement due to statement timeout

Additional context:
The mathematical abstraction wrapper used to calculate a combined similarity score inside the SELECT clause completely invalidates pgvector's HNSW indexing path:

  1. Index Invalidation: PostgreSQL's HNSW implementation can only optimize an unadulterated, raw distance operator sort path directly in the clause: ORDER BY embedding <=> $3::vector.
  2. The Scalar Trap: Because the query targets a dynamic scalar alias (ORDER BY similarity DESC), the query planner cannot walk the HNSW index tree. It is mathematically forced to calculate the distance operator, handle the COALESCE logic, and multiply it by the scalar weight parameter ($4) for all 2.4 million rows sequentially before sorting in-memory.

Verification of Underlying Index Health:
Extracting the target vector out of the query payload and executing a raw database connection via native Go source code completely bypasses the bug and works instantaneously:

// Direct query execution using the raw distance operator natively runs in <10ms
sqlQuery := `
	SELECT id::text, COALESCE(title, ''), content 
	FROM documents_literature 
	WHERE embedding IS NOT NULL 
	ORDER BY embedding <=> $1::vector 
	LIMIT 3;`

Suggested Fix:
Modify the backend query builder so that if a vector search is being executed, the distance operator is passed natively into the ORDER BY clause without mathematical modifiers or scalar transformations that break index parsing.


Note: This technical optimization analysis, logs trace verification, and root-cause mapping were investigated and drafted in collaboration with Google's Gemini.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions