perf(data-warehouse): Optimise mongodb connections#43994
perf(data-warehouse): Optimise mongodb connections#43994Gilbert09 merged 6 commits intoPostHog:masterfrom
Conversation
There was a problem hiding this comment.
Additional Comments (3)
-
posthog/temporal/data_imports/sources/mongodb/mongo.py, line 283-286 (link)style: generator expression in
executor.map()may cause issues with Collection object creation timing -
posthog/temporal/data_imports/sources/mongodb/mongo.py, line 281 (link)logic: missing nameOnly parameter - according to the issue description, this is needed to allow MongoDB to filter collections based on user permissions
-
posthog/temporal/data_imports/sources/mongodb/mongo.py, line 297 (link)logic: also missing nameOnly parameter here - should be consistent with the fix needed in
get_schemas()
2 files reviewed, 3 comments
There was a problem hiding this comment.
1 issue found across 2 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="posthog/temporal/data_imports/sources/mongodb/mongo.py">
<violation number="1" location="posthog/temporal/data_imports/sources/mongodb/mongo.py:283">
P1: `ThreadPoolExecutor(max_workers=0)` will raise `ValueError` when connecting to an empty database. If `collection_names` is empty, `min(0, 4) = 0` which is invalid.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
1 issue found across 2 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="posthog/temporal/data_imports/sources/mongodb/mongo.py">
<violation number="1" location="posthog/temporal/data_imports/sources/mongodb/mongo.py:298">
P2: Missing database name validation that exists in similar functions. If the connection string doesn't include a database name, `client[connection_params["database"]]` will access `client[None]` or `client[""]`, causing confusing errors instead of a clear validation message.</violation>
</file>
Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Gilbert09
left a comment
There was a problem hiding this comment.
This is great, thanks for the work here!
Problem
MongoDB data source connection times out ("upstream request timeout") when connecting to databases with a large number of complex collections. The root causes were:
filter_mongo_incremental_fieldscreated a new MongoDB connection for each collection to check indexes. This is a problem because creating each connection is slow due to the overhead of establishing a TCP connection.validate_credentialswas calling fullget_schemas()unnecessarily, doing expensive schema inference just to check that any collections exist.Closes #43912
Changes
get_indexes()andfilter_mongo_incremental_fields()to accept aCollectionobject instead of connection string, allowing a single connection to be reused across all collectionsThreadPoolExecutorto process collections in parallel (up to 4 workers)validate_credentialsnow just callsget_collection_names()instead of full schema inferenceHow did you test this code?
Ran it locally.
Tested adding a new data source against a MongoDB database with 130+ collections.