fix: prevent SQL injection in Databricks vector store #4558

Merged
kartik-mem0 merged 3 commits into main from fix/databricks-sql-injection-4073
Mar 26, 2026

@utkarsh240799
Contributor

Linked Issue

Closes #4073

Description

The Databricks vector store implementation (mem0/vector_stores/databricks.py) used f-string interpolation to build SQL queries, creating SQL injection vulnerabilities in three methods. A malicious memory_id like '; DELETE FROM table; -- would allow arbitrary SQL execution.
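To make the risk concrete, here is a minimal sketch of the vulnerable pattern. The function name `build_delete_unsafe` and the table name `memories` are hypothetical and are not taken from the actual module; only the f-string shape follows the description above.

```python
def build_delete_unsafe(table: str, vector_id: str) -> str:
    # Vulnerable pattern: the ID is pasted straight into the SQL text,
    # so a quote inside the ID terminates the string literal early.
    return f"DELETE FROM {table} WHERE memory_id = '{vector_id}'"


malicious_id = "'; DELETE FROM table; --"
unsafe_sql = build_delete_unsafe("memories", malicious_id)
print(unsafe_sql)
# DELETE FROM memories WHERE memory_id = ''; DELETE FROM table; --'
```

The attacker's text ends the intended literal and appends a second statement, with the trailing `--` commenting out the leftover quote.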

Problem

All three write methods were vulnerable:

| Method | Vulnerable code | Risk |
| --- | --- | --- |
| delete() | f"...WHERE memory_id = '{vector_id}'" | Attacker-controlled vector_id injected directly into SQL |
| update() | f"{key} = '{value}'" and f"...WHERE memory_id = '{vector_id}'" | Both payload values and vector_id injected into SQL. Payload keys used as column names without validation |
| insert() | Values built via _format_sql_value() manual escaping | Manual replace("'", "''") escaping is fragile and not equivalent to parameterized queries |

Other vector stores in this codebase (pgvector, azure_mysql) already use parameterized queries correctly.

Solution

  • delete(): Replaced '{vector_id}' with :vector_id parameter marker + StatementParameterListItem
  • update(): Parameterized vector_id and all payload values with :payload_{key} markers. Added regex validation (^[A-Za-z_][A-Za-z0-9_]*$) on payload keys used as column names to prevent column name injection
  • insert(): Replaced _format_sql_value() calls for all user-controlled values with :{col}_{row_index} parameter markers. NULL values use literal NULL. Embedding vectors remain inlined via _format_sql_value() since StatementParameterListItem does not support ARRAY types — these are safe as they contain only numeric floats from the embedding model
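The update() approach in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `Param` is a local stand-in for the SDK's StatementParameterListItem, `build_update` and the table name `memories` are hypothetical, and silently skipping invalid keys (rather than raising) is a guess at the behavior.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

# Regex quoted in the PR description for payload keys used as column names.
IDENTIFIER = r"^[A-Za-z_][A-Za-z0-9_]*$"


@dataclass
class Param:
    """Hypothetical stand-in for StatementParameterListItem (name/value/type)."""
    name: str
    value: Optional[str] = None
    type: Optional[str] = None


def build_update(table: str, vector_id: str,
                 payload: Dict[str, Any]) -> Tuple[str, List[Param]]:
    set_clauses: List[str] = []
    params: List[Param] = []
    for key, value in payload.items():
        # Column names cannot be bound as parameters, so they are validated.
        if re.fullmatch(IDENTIFIER, key) is None:
            continue  # assumption: invalid keys are skipped
        set_clauses.append(f"{key} = :payload_{key}")
        params.append(Param(name=f"payload_{key}", value=str(value)))
    if not set_clauses:
        raise ValueError("no valid columns to update")
    params.append(Param(name="vector_id", value=vector_id))
    sql = (f"UPDATE {table} SET {', '.join(set_clauses)} "
           "WHERE memory_id = :vector_id")
    return sql, params


update_sql, update_params = build_update(
    "memories", "abc", {"data": "hello", "x; DROP TABLE t": "v"}
)
print(update_sql)
# UPDATE memories SET data = :payload_data WHERE memory_id = :vector_id
```

Values travel only in the parameter list; the one piece of user input that must appear in the SQL text, the column name, is restricted to plain identifiers.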

All parameterized queries use the Databricks SDK's StatementParameterListItem class, which binds values server-side and never interpolates them into the SQL string.
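A small sketch of what "never interpolates" means in practice: the SQL text is identical for benign and hostile input, and only the bound value differs. `Param` is again a local stand-in for StatementParameterListItem, and `build_delete` and `memories` are hypothetical names; in real code the parameter list would be handed to the Databricks SDK's statement execution call.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Param:
    """Hypothetical stand-in for StatementParameterListItem (name/value/type)."""
    name: str
    value: Optional[str] = None
    type: Optional[str] = None


def build_delete(table: str, vector_id: str) -> Tuple[str, List[Param]]:
    # The statement contains only the :vector_id marker, never the value.
    sql = f"DELETE FROM {table} WHERE memory_id = :vector_id"
    return sql, [Param(name="vector_id", value=vector_id)]


benign_sql, benign_params = build_delete("memories", "abc-123")
hostile_sql, hostile_params = build_delete(
    "memories", "'; DROP TABLE memories; --"
)
print(benign_sql == hostile_sql)  # True
```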

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactor (no functional changes)
  • Documentation update

Breaking Changes

N/A

Test Coverage

  • I added/updated unit tests
  • I added/updated integration tests
  • I tested manually (describe below)
  • No tests needed (explain why)

Testing details

57 tests total (28 new, 29 updated to verify parameterization). All pass.

SQL injection prevention tests (20 tests) — Each of these 4 payloads is tested against delete(), update() (vector_id and payload values), and insert() (IDs and data):

  • '; DELETE FROM table; --
  • ' OR '1'='1
  • '; DROP TABLE memories; --
  • 1' UNION SELECT * FROM secrets --

Each test verifies: (1) the malicious string does NOT appear in the SQL statement, (2) it IS safely passed via StatementParameterListItem parameters.
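The two assertions can be illustrated with a mocked client. This is only a sketch of the testing pattern: the `delete` function, the `client.execute_statement` call shape, and the table name are assumptions for illustration and may not match the real test suite.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

PAYLOADS = [
    "'; DELETE FROM table; --",
    "' OR '1'='1",
    "'; DROP TABLE memories; --",
    "1' UNION SELECT * FROM secrets --",
]


def delete(client, table: str, vector_id: str) -> None:
    """Hypothetical stand-in for the store's parameterized delete()."""
    client.execute_statement(
        statement=f"DELETE FROM {table} WHERE memory_id = :vector_id",
        parameters=[SimpleNamespace(name="vector_id", value=vector_id)],
    )


def check_payload(payload: str) -> bool:
    client = MagicMock()
    delete(client, "memories", payload)
    kwargs = client.execute_statement.call_args.kwargs
    # (1) the malicious string does not appear in the SQL statement
    assert payload not in kwargs["statement"]
    # (2) it is passed only through the bound parameters
    assert [p.value for p in kwargs["parameters"]] == [payload]
    return True


results = [check_payload(p) for p in PAYLOADS]
```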

Column name injection tests (4 tests) — Verifies update() rejects payload keys containing SQL metacharacters (; DROP, ' OR, newlines) while still processing valid keys.
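A short sketch of how the identifier regex from the Solution section behaves on such keys. The helper name `is_safe_column` is hypothetical. One Python detail worth noting when newlines are in scope: with `re.match`, `$` also matches just before a trailing newline, whereas `re.fullmatch` requires the pattern to consume the whole string. Which of the two the merged code uses is not stated here.

```python
import re

IDENTIFIER = r"^[A-Za-z_][A-Za-z0-9_]*$"


def is_safe_column(key: str) -> bool:
    # fullmatch must consume the entire key, so "data\n" is rejected.
    return re.fullmatch(IDENTIFIER, key) is not None


candidates = ["data", "user_id", "x; DROP TABLE t", "a' OR '1'='1",
              "data\n", "1abc", ""]
accepted = [k for k in candidates if is_safe_column(k)]
print(accepted)  # ['data', 'user_id']

# With re.match, the same pattern accepts a key with a trailing newline.
match_allows_trailing_newline = re.match(IDENTIFIER, "data\n") is not None
print(match_allows_trailing_newline)  # True
```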

Functional tests (4 new) — Multi-row insert with unique parameter names, NULL value handling, payload-only update, vector-only update.
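The multi-row insert and NULL cases can be pictured with the sketch below. It follows the :{col}_{row_index} naming from the Solution section, but `build_insert`, `format_embedding`, the `Param` stand-in, and the column names are all hypothetical and not the module's real helpers.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence, Tuple


@dataclass
class Param:
    """Hypothetical stand-in for StatementParameterListItem (name/value/type)."""
    name: str
    value: Optional[str] = None
    type: Optional[str] = None


def format_embedding(vector: Sequence[float]) -> str:
    # float() raises on non-numeric input, so only numbers reach the SQL text.
    return "array(" + ", ".join(repr(float(x)) for x in vector) + ")"


def build_insert(table: str, columns: List[str],
                 rows: List[Dict[str, Any]]) -> Tuple[str, List[Param]]:
    params: List[Param] = []
    tuples: List[str] = []
    for i, row in enumerate(rows):
        cells: List[str] = []
        for col in columns:
            value = row.get(col)
            if value is None:
                cells.append("NULL")  # literal NULL, no parameter
            elif col == "embedding":
                cells.append(format_embedding(value))  # inlined numeric array
            else:
                name = f"{col}_{i}"  # unique per column and row
                cells.append(f":{name}")
                params.append(Param(name=name, value=str(value)))
        tuples.append("(" + ", ".join(cells) + ")")
    sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
           f"VALUES {', '.join(tuples)}")
    return sql, params


rows = [
    {"memory_id": "a", "data": "x", "embedding": [0.1, 0.2]},
    {"memory_id": "b", "data": None, "embedding": [0.3, 0.4]},
]
insert_sql, insert_params = build_insert(
    "memories", ["memory_id", "data", "embedding"], rows
)
print(insert_sql)
print([p.name for p in insert_params])
```

Suffixing each marker with the row index keeps parameter names unique when several rows share one statement.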

Existing tests (29 updated) — All pre-existing tests updated to assert parameterized query usage instead of inline values.

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation if needed

utkarsh240799 and others added 3 commits March 26, 2026 19:03
Replace f-string interpolation with Databricks parameterized queries
using StatementParameterListItem in delete(), update(), and insert()
methods. Add column name validation in update() to reject invalid SQL
identifiers from payload keys. Embedding vectors remain inlined as
they are numeric arrays not supported by the parameterization API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add explicit type='TIMESTAMP' on StatementParameterListItem for
  created_at/updated_at columns instead of relying on implicit
  STRING->TIMESTAMP casting
- Fix pre-existing bug: update() used Python list repr [0.1, 0.2]
  for embedding which is invalid Databricks SQL, now uses
  _format_sql_value() to produce array(0.1, 0.2) syntax

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
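
The two points in the second commit message can be illustrated as follows. This is a sketch under assumptions: `format_embedding` is a hypothetical helper standing in for _format_sql_value(), `Param` stands in for StatementParameterListItem, and the timestamp string format is illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class Param:
    """Hypothetical stand-in for StatementParameterListItem (name/value/type)."""
    name: str
    value: Optional[str] = None
    type: Optional[str] = None


def format_embedding(vector: Sequence[float]) -> str:
    # Produces array(...) syntax instead of Python's list repr.
    return "array(" + ", ".join(repr(float(x)) for x in vector) + ")"


embedding = [0.1, 0.2]
python_repr = str(embedding)               # '[0.1, 0.2]'
sql_literal = format_embedding(embedding)  # 'array(0.1, 0.2)'

# Explicit type instead of relying on an implicit STRING -> TIMESTAMP cast.
created_at = Param(
    name="created_at",
    value=datetime(2026, 3, 26, 19, 3).strftime("%Y-%m-%d %H:%M:%S"),
    type="TIMESTAMP",
)
print(python_repr, sql_literal, created_at)
```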
@kartik-mem0 kartik-mem0 merged commit a2ffca3 into main Mar 26, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

Security: SQL injection in Databricks vector store via f-string interpolation