
perf(data-warehouse): Optimise mongodb connections #43994

Merged

Gilbert09 merged 6 commits into PostHog:master from zhou-quickli:mongo-fix-timeout on Dec 30, 2025

Conversation

@zhou-quickli
Contributor

Problem

MongoDB data source connection times out ("upstream request timeout") when connecting to databases with a large number of complex collections. The root causes were:

  1. filter_mongo_incremental_fields created a new MongoDB connection for each collection just to check its indexes. Every new connection pays the cost of a TCP and authentication handshake, so this scales poorly with the number of collections.
  2. validate_credentials called the full get_schemas(), running expensive schema inference just to check that any collections exist.
  3. Schema inference was slow, with some collections taking up to 10 seconds to return a result.

Closes #43912

Changes

  • Connection reuse: get_indexes() and filter_mongo_incremental_fields() now accept a Collection object instead of a connection string, so a single connection is reused across all collections
  • Parallel schema inference: a ThreadPoolExecutor processes collections in parallel (up to 4 workers)
  • Faster credential validation: validate_credentials now only calls get_collection_names() instead of running full schema inference
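The connection-reuse and parallel-inference changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FakeCollection` is a stand-in for pymongo's `Collection`, and `get_indexes`/`infer_schemas` only approximate the real helpers' signatures.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeCollection:
    """Hypothetical stand-in for pymongo's Collection. In the real source,
    these helpers receive Collection objects from one shared MongoClient."""

    def __init__(self, name: str, indexes: dict):
        self.name = name
        self._indexes = indexes

    def index_information(self) -> dict:
        return self._indexes


def get_indexes(collection) -> list[str]:
    # Accepts a Collection rather than a connection string, so the caller's
    # connection is reused and no new TCP/auth handshake happens per collection.
    return list(collection.index_information().keys())


def infer_schemas(collections: list) -> dict[str, list[str]]:
    # Bound parallelism at 4 workers. max(1, ...) guards the empty-database
    # case, where min(0, 4) == 0 would make ThreadPoolExecutor raise ValueError.
    max_workers = max(1, min(len(collections), 4))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(lambda c: (c.name, get_indexes(c)), collections)
        return dict(results)
```

With per-collection connections, 130+ collections means 130+ handshakes in sequence; with one shared connection and 4 workers, the remaining cost is mostly the inference work itself.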

How did you test this code?

Ran it locally.

Tested adding a new data source against a MongoDB database with 130+ collections.

  • Before: ~160+ seconds, timing out
  • After: ~15 seconds total

@assign-reviewers-posthog assign-reviewers-posthog bot requested a review from a team December 26, 2025 07:40

@greptile-apps greptile-apps bot left a comment


Additional Comments (3)

  1. posthog/temporal/data_imports/sources/mongodb/mongo.py, lines 283-286

    style: generator expression in executor.map() may cause issues with Collection object creation timing

  2. posthog/temporal/data_imports/sources/mongodb/mongo.py, line 281

    logic: missing nameOnly parameter. According to the issue description, this is needed to allow MongoDB to filter collections based on user permissions

  3. posthog/temporal/data_imports/sources/mongodb/mongo.py, line 297

    logic: nameOnly parameter is also missing here; should be consistent with the fix needed in get_schemas()

2 files reviewed, 3 comments
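The nameOnly comments above refer to MongoDB's listCollections server command. A minimal sketch of the command document it asks for is below; the helper name is hypothetical, and the option names come from the MongoDB listCollections command reference. Setting nameOnly together with authorizedCollections is what lets users without a database-wide listCollections privilege see only the collections they are authorized to access.

```python
def build_list_collections_cmd(name_only: bool = True) -> dict:
    # listCollections command document. nameOnly skips per-collection
    # metadata (faster); authorizedCollections additionally allows the
    # command to succeed for users who can only access a subset of
    # collections, returning just that subset.
    cmd: dict = {"listCollections": 1, "nameOnly": name_only}
    if name_only:
        cmd["authorizedCollections"] = True
    return cmd
```

In pymongo, one way to set these is via the keyword arguments of `Database.list_collections(...)`, which are passed through to the listCollections command.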



@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files



<file name="posthog/temporal/data_imports/sources/mongodb/mongo.py">

<violation number="1" location="posthog/temporal/data_imports/sources/mongodb/mongo.py:283">
P1: `ThreadPoolExecutor(max_workers=0)` will raise `ValueError` when connecting to an empty database. If `collection_names` is empty, `min(0, 4) = 0` which is invalid.</violation>
</file>
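The failure mode cubic flags here is easy to demonstrate: Python's ThreadPoolExecutor rejects max_workers=0 with a ValueError. A guard along these lines avoids it (the helper name is hypothetical, not the repo's actual code):

```python
from concurrent.futures import ThreadPoolExecutor


def safe_worker_count(n_items: int, cap: int = 4) -> int:
    # min(0, cap) == 0 for an empty database, and
    # ThreadPoolExecutor(max_workers=0) raises ValueError,
    # so clamp the result to at least one worker.
    return max(1, min(n_items, cap))
```

So `safe_worker_count(0)` yields 1 rather than the invalid 0, while large collection counts are still capped at 4.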



@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files



<file name="posthog/temporal/data_imports/sources/mongodb/mongo.py">

<violation number="1" location="posthog/temporal/data_imports/sources/mongodb/mongo.py:298">
P2: Missing database name validation that exists in similar functions. If the connection string doesn't include a database name, `client[connection_params["database"]]` will access `client[None]` or `client[""]`, causing confusing errors instead of a clear validation message.</violation>
</file>
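A validation along the lines cubic suggests could look like the sketch below; the function name and error message are hypothetical, not the repo's actual code. It relies on the fact that the database is the path component of a mongodb:// URI, which Python's standard urlparse extracts.

```python
from urllib.parse import urlparse


def database_from_connection_string(connection_string: str) -> str:
    # mongodb://host:27017/mydb -> "mydb". Fail loudly when the database
    # is absent rather than letting client[None] or client[""] produce a
    # confusing error deeper in the pipeline.
    database = urlparse(connection_string).path.lstrip("/")
    if not database:
        raise ValueError("Connection string must include a database name")
    return database
```

Running this check once, up front, turns a confusing downstream failure into a clear validation message at connection time.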


Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

@Gilbert09 Gilbert09 left a comment


This is great, thanks for the work here!

@Gilbert09 Gilbert09 merged commit ad71ade into PostHog:master Dec 30, 2025
157 checks passed


Development

Successfully merging this pull request may close these issues.

Data Pipeline MongoDB Connector: Schema discovery times out on large databases (missing $sample + nameOnly)

3 participants