bug: Vector collection search causes seqscan on 2M+ rows, hitting 10s context timeout (HNSW index ignored)

**LocalAI version:**
v4.3.6-59-gef80a0e8 (ef80a0e8259e610277150ccf2a015ff28e9387b7)

**Environment, CPU architecture, OS, and Version:**
Linux x86_64 / Compute Node configuration / PostgreSQL backend with pgvector

**Describe the bug**
When querying a large database collection (~2.4 million rows) via the `/api/agents/collections/{name}/search` endpoint using the PostgreSQL backend, the internal query engine constructs a SQL layout that completely blinds the PostgreSQL query planner. 

Instead of executing an index-backed nearest-neighbor search, it triggers a full sequential table scan (`seqscan`). On massive datasets, this calculation takes longer than 10 seconds, consistently hitting LocalAI's internal client-side context deadline and dropping the request with an empty response payload (`{"count":0,"results":[]}`).

**To Reproduce**
1. Populate a collection table via the PostgreSQL backend with a large dataset (>2,000,000 rows).
2. Build a standard `pgvector` HNSW index on the `embedding` column.
3. Submit a standard vector search request to the API collection endpoint.
4. Check PostgreSQL engine logs to observe a statement cancellation precisely at the 10-second mark.

**Expected behavior**
The query engine should leverage the underlying `pgvector` HNSW/IVFFlat index structure to return the nearest neighbors within milliseconds, regardless of the overall row count.

**Logs:**
```text
Jun 05 08:38:34 compute postgres[1989806]: LOG:  execute stmtcache_c2e7e...:
Jun 05 08:38:34 compute postgres[1989806]:         SELECT
Jun 05 08:38:34 compute postgres[1989806]:                 id::text,
Jun 05 08:38:34 compute postgres[1989806]:                 COALESCE(title, '') as title,
Jun 05 08:38:34 compute postgres[1989806]:                 content,
Jun 05 08:38:34 compute postgres[1989806]:                 metadata,
Jun 05 08:38:34 compute postgres[1989806]:                 (
Jun 05 08:38:34 compute postgres[1989806]:                         COALESCE((1 - (embedding <=> $3::vector)), 0) * $4
Jun 05 08:38:34 compute postgres[1989806]:                 ) as similarity
Jun 05 08:38:34 compute postgres[1989806]:         FROM documents_literature
Jun 05 08:38:34 compute postgres[1989806]:         WHERE embedding IS NOT NULL
Jun 05 08:38:34 compute postgres[1989806]:         ORDER BY similarity DESC
Jun 05 08:38:34 compute postgres[1989806]:         LIMIT $5
...
Jun 05 08:38:44 compute postgres[1989806]: ERROR:  canceling statement due to statement timeout

```

**Additional context:**
The mathematical abstraction wrapper used to calculate a combined `similarity` score inside the `SELECT` clause completely invalidates `pgvector`'s HNSW indexing path:

1. **Index Invalidation:** PostgreSQL's HNSW implementation can *only* optimize an unadulterated, raw distance operator sort path directly in the clause: `ORDER BY embedding <=> $3::vector`.
2. **The Scalar Trap:** Because the query targets a dynamic scalar alias (`ORDER BY similarity DESC`), the query planner cannot walk the HNSW index tree. It is mathematically forced to calculate the distance operator, handle the `COALESCE` logic, and multiply it by the scalar weight parameter (`$4`) for **all 2.4 million rows sequentially** before sorting in-memory.

**Verification of Underlying Index Health:**
Extracting the target vector out of the query payload and executing a raw database connection via native Go source code completely bypasses the bug and works instantaneously:

```go
// Direct query execution using the raw distance operator natively runs in <10ms
sqlQuery := `
	SELECT id::text, COALESCE(title, ''), content 
	FROM documents_literature 
	WHERE embedding IS NOT NULL 
	ORDER BY embedding <=> $1::vector 
	LIMIT 3;`

```

**Suggested Fix:**
Modify the backend query builder so that if a vector search is being executed, the distance operator is passed natively into the `ORDER BY` clause without mathematical modifiers or scalar transformations that break index parsing.

---

*Note: This technical optimization analysis, logs trace verification, and root-cause mapping were investigated and drafted in collaboration with Google's Gemini.*

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug: Vector collection search causes seqscan on 2M+ rows, hitting 10s context timeout (HNSW index ignored) #10186

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

bug: Vector collection search causes seqscan on 2M+ rows, hitting 10s context timeout (HNSW index ignored) #10186

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions