Description
In what situation are you experiencing subpar performance?
When connecting a MongoDB data source with a large number of collections (70+), the initial connection/validation times out with "upstream request timeout" error. The schema discovery process in the MongoDB connector scans all documents in every collection using an expensive aggregation pipeline, which doesn't scale well.
Additionally, there's no way to limit which collections are discovered, even when using MongoDB user-level permissions: the list_collection_names() call passes authorizedCollections=True but not nameOnly=True, and without nameOnly=True MongoDB does not filter the results based on collection-level permissions.
Root Cause Analysis
Issue 1: Expensive Schema Discovery
The schema discovery in posthog/temporal/data_imports/sources/mongodb/mongo.py uses this aggregation pipeline for every collection:
```python
def _get_schema_from_query(collection: Collection) -> list[tuple[str, str]]:
    pipeline: list[dict[str, Any]] = [
        {"$project": {"arrayofkeyvalue": {"$objectToArray": "$$ROOT"}}},
        {"$unwind": "$arrayofkeyvalue"},
        {
            "$group": {
                "_id": "$arrayofkeyvalue.k",
                "types": {"$addToSet": {"$type": "$arrayofkeyvalue.v"}},
            }
        },
    ]
    result = list(collection.aggregate(pipeline))
```

This pipeline:
- Scans every document in the collection
- Explodes each document into N documents (one per field) via $unwind
- Groups all of the exploded documents
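In plain Python terms, the pipeline computes a mapping from each top-level field name to the set of value types observed across documents. A minimal stdlib sketch of the same computation (illustrative only, not PostHog's actual code; type names here are Python's, not BSON's):

```python
from collections import defaultdict


def infer_schema(documents: list[dict]) -> dict[str, set[str]]:
    """Map each top-level field to the set of type names seen, mirroring
    what the $objectToArray / $unwind / $group pipeline returns."""
    types_by_field: dict[str, set[str]] = defaultdict(set)
    for doc in documents:               # one pass over every document
        for key, value in doc.items():  # "$unwind" over the key/value pairs
            types_by_field[key].add(type(value).__name__)  # "$addToSet"
    return dict(types_by_field)


docs = [
    {"_id": 1, "name": "a", "score": 10},
    {"_id": 2, "name": "b", "score": 9.5},
]
schema = infer_schema(docs)
assert schema["score"] == {"int", "float"}  # mixed types are captured
```

The key point is that this is inherently a full pass over every key of every document, which is why it degrades badly on large collections.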
For a collection with 100,000 documents and 30 fields, this creates 3,000,000 intermediate documents to process. With 70+ collections, this quickly exceeds any reasonable timeout.
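The figures above can be sanity-checked with a back-of-the-envelope cost model (the document and field counts are the assumed numbers from this report):

```python
# Each document with F top-level fields expands into F documents after
# $unwind, so the $group stage must process docs * fields intermediates.
def unwind_intermediates(num_docs: int, num_fields: int) -> int:
    return num_docs * num_fields


full_scan = unwind_intermediates(100_000, 30)  # whole-collection scan
sampled = unwind_intermediates(1_000, 30)      # with {"$sample": {"size": 1000}}

assert full_scan == 3_000_000
assert sampled == 30_000  # a 100x reduction per collection
```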
Suggested fix: Add $sample to limit the scan:
```python
pipeline = [
    {"$sample": {"size": 1000}},  # Only sample 1000 docs for schema inference
    {"$project": {"arrayofkeyvalue": {"$objectToArray": "$$ROOT"}}},
    {"$unwind": "$arrayofkeyvalue"},
    {"$group": {"_id": "$arrayofkeyvalue.k", "types": {"$addToSet": {"$type": "$arrayofkeyvalue.v"}}}},
]
```

Issue 2: Cannot Filter Collections via MongoDB Permissions
In get_schemas(), the code calls:
```python
collection_names = db.list_collection_names(authorizedCollections=True)
```

However, according to the MongoDB documentation, the authorizedCollections parameter only filters based on user permissions when nameOnly=True is also set:

"When authorizedCollections is true and nameOnly is true, users without the required privilege can run the command and list those collections for which they have privileges."

Without nameOnly=True, MongoDB requires the user to have the listCollections privilege on the database, which returns all collections regardless of collection-level read permissions.
Suggested fix:
```python
collection_names = db.list_collection_names(authorizedCollections=True, nameOnly=True)
```

This would allow users to create MongoDB users with limited collection access, which would:
- Reduce the number of collections to scan
- Allow users to selectively sync only the collections they need
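The documented semantics can be illustrated with a small stdlib simulation (a sketch of MongoDB's described behavior, not pymongo code; the FakeDatabase class and its fields are hypothetical):

```python
class FakeDatabase:
    """Simulates MongoDB's documented listCollections behavior:
    - with authorizedCollections=True AND nameOnly=True, a user sees only
      the collections they hold privileges on;
    - otherwise the user needs the database-wide listCollections privilege,
      and the command returns every collection or fails outright."""

    def __init__(self, collections, user_privileges, has_list_collections):
        self.collections = collections
        self.user_privileges = user_privileges
        self.has_list_collections = has_list_collections

    def list_collection_names(self, authorizedCollections=False, nameOnly=False):
        if authorizedCollections and nameOnly:
            return [c for c in self.collections if c in self.user_privileges]
        if not self.has_list_collections:
            raise PermissionError("not authorized to execute command listCollections")
        return list(self.collections)


db = FakeDatabase(
    collections=["users", "events", "internal_logs"],
    user_privileges={"users", "events"},
    has_list_collections=False,  # user has only collection-level read access
)

# The current call: fails for a collection-scoped user.
try:
    db.list_collection_names(authorizedCollections=True)
except PermissionError as exc:
    print(exc)

# The suggested call: returns only the permitted collections.
print(db.list_collection_names(authorizedCollections=True, nameOnly=True))
# ['users', 'events']
```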
How to reproduce
- Create a MongoDB Atlas cluster with a database containing 50+ collections, some with large document counts (10,000+)
- In PostHog, go to Data Warehouse → Add Source → MongoDB
- Enter the connection string: mongodb+srv://user:password@cluster.mongodb.net/database?authSource=admin
- Click to validate/connect
- Observe "upstream request timeout" error
To reproduce the permission filtering issue:
- Create a MongoDB user with collection-specific read permissions (not readAnyDatabase)
- Attempt to connect with this user
- Observe that either:
  - All collections are still returned (if the user has listCollections on the database), or
  - The connection fails with "not authorized to execute command listCollections"
Environment
- PostHog Cloud
Additional context
Investigation steps performed:
- Verified the credentials work with a direct pymongo connection
- Extracted PostHog's MongoDB connector code and ran it locally
- Observed schema discovery taking 60+ seconds for 73 collections before being killed
- Identified the two issues above through code analysis
Relevant code locations:
- posthog/temporal/data_imports/sources/mongodb/mongo.py
  - get_schemas() (line ~265) - calls list_collection_names()
  - _get_schema_from_query() (line ~187) - runs the expensive aggregation without sampling
Workaround attempted:
Creating a MongoDB user with collection-level permissions doesn't work because nameOnly=True is not passed to list_collection_names().
Reproducible Example
See test.py for a standalone script that reproduces the issue using PostHog's exact MongoDB connector logic.
```bash
# Install dependencies
pip install pymongo certifi

# Set your connection string
export MONGODB_CONNECTION_STRING="mongodb+srv://user:password@cluster.mongodb.net/database?authSource=admin"

# Run the test
python test.py
```