Data Pipeline MongoDB Connector: Schema discovery times out on large databases (missing $sample + nameOnly) #43912

@rawmarshmellows

Description

In what situation are you experiencing subpar performance?

When connecting a MongoDB data source with a large number of collections (70+), the initial connection/validation times out with an "upstream request timeout" error. The schema discovery step in the MongoDB connector scans every document in every collection using an expensive aggregation pipeline, which doesn't scale.

Additionally, there's no way to limit which collections are discovered, even when using MongoDB user-level permissions: the list_collection_names() call passes authorizedCollections=True but not nameOnly=True, and without nameOnly=True MongoDB does not filter the results by collection-level permissions.

Root Cause Analysis

Issue 1: Expensive Schema Discovery

The schema discovery in posthog/temporal/data_imports/sources/mongodb/mongo.py uses this aggregation pipeline for every collection:

from typing import Any

from pymongo.collection import Collection

def _get_schema_from_query(collection: Collection) -> list[tuple[str, str]]:
    pipeline: list[dict[str, Any]] = [
        {"$project": {"arrayofkeyvalue": {"$objectToArray": "$$ROOT"}}},
        {"$unwind": "$arrayofkeyvalue"},
        {
            "$group": {
                "_id": "$arrayofkeyvalue.k",
                "types": {"$addToSet": {"$type": "$arrayofkeyvalue.v"}},
            }
        },
    ]
    result = list(collection.aggregate(pipeline))

This pipeline:

  1. Scans every document in the collection
  2. Explodes each document into N documents (one per field) via $unwind
  3. Groups all exploded documents

For a collection with 100,000 documents and 30 fields, this creates 3,000,000 intermediate documents to process. With 70+ collections, this quickly exceeds any reasonable timeout.
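The blow-up is easy to quantify, since $unwind emits one intermediate document per (document, field) pair. A quick back-of-the-envelope sketch using the sizes from the example above:

```python
def unwind_intermediate_docs(doc_count: int, avg_fields: int) -> int:
    """Intermediate documents produced by $unwind: one per (doc, field) pair."""
    return doc_count * avg_fields

# Full collection scan vs. a hypothetical $sample of 1,000 documents.
full_scan = unwind_intermediate_docs(100_000, 30)
sampled = unwind_intermediate_docs(1_000, 30)

print(full_scan)  # 3000000
print(sampled)    # 30000
```

Sampling 1,000 documents cuts the intermediate work by two orders of magnitude per collection, which compounds across 70+ collections.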

Suggested fix: Add $sample to limit the scan:

pipeline = [
    {"$sample": {"size": 1000}},  # Only sample 1000 docs for schema inference
    {"$project": {"arrayofkeyvalue": {"$objectToArray": "$$ROOT"}}},
    {"$unwind": "$arrayofkeyvalue"},
    {"$group": {"_id": "$arrayofkeyvalue.k", "types": {"$addToSet": {"$type": "$arrayofkeyvalue.v"}}}},
]
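For reference, what the $unwind/$group stages compute (field name → set of observed BSON type names) can be sketched client-side over the sampled documents. This is a hypothetical illustration, not connector code; mongo_type_name covers only a subset of BSON types:

```python
from typing import Any

def mongo_type_name(value: Any) -> str:
    """Rough client-side analogue of MongoDB's $type operator (subset of types)."""
    if isinstance(value, bool):  # bool must be checked before int in Python
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    if value is None:
        return "null"
    return type(value).__name__

def infer_schema(sampled_docs: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Field name -> set of observed type names, like the $group/$addToSet stage."""
    schema: dict[str, set[str]] = {}
    for doc in sampled_docs:
        for key, value in doc.items():
            schema.setdefault(key, set()).add(mongo_type_name(value))
    return schema

docs = [{"_id": 1, "name": "a", "tags": []}, {"_id": 2, "name": None}]
print(infer_schema(docs))
```

As with the server-side $sample, the trade-off is that fields appearing only in unsampled documents are missed; a sample of 1,000 is usually enough for schema inference.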

Issue 2: Cannot Filter Collections via MongoDB Permissions

In get_schemas(), the code calls:

collection_names = db.list_collection_names(authorizedCollections=True)

However, according to MongoDB documentation, the authorizedCollections parameter only filters based on user permissions when nameOnly=True is also set:

When authorizedCollections is true and nameOnly is true, users without the required privilege can run the command and list those collections for which they have privileges.

Without nameOnly=True, MongoDB instead requires the user to hold the listCollections privilege on the whole database, and the command returns all collections regardless of collection-level read permissions.

Suggested fix:

collection_names = db.list_collection_names(authorizedCollections=True, nameOnly=True)

This would allow users to create MongoDB users with limited collection access, which would:

  1. Reduce the number of collections to scan
  2. Allow users to selectively sync only the collections they need
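Under the hood, pymongo passes the keyword arguments of list_collection_names() into the listCollections command document, so the corrected call sends both flags together. A sketch of the resulting command document (the driver call itself needs a live connection and is shown only in a comment):

```python
# Command document pymongo builds for the corrected call; MongoDB only
# filters by collection-level privileges when BOTH flags are set.
list_collections_cmd = {
    "listCollections": 1,
    "authorizedCollections": True,
    "nameOnly": True,
}

# Equivalent driver call (requires a live connection):
# collection_names = db.list_collection_names(authorizedCollections=True, nameOnly=True)
print(list_collections_cmd)
```

Note that nameOnly=True also means the server returns only collection names, not full collection metadata, which is all the connector needs here.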

How to reproduce

  1. Create a MongoDB Atlas cluster with a database containing 50+ collections, some with large document counts (10,000+)
  2. In PostHog, go to Data Warehouse → Add Source → MongoDB
  3. Enter connection string: mongodb+srv://user:password@cluster.mongodb.net/database?authSource=admin
  4. Click to validate/connect
  5. Observe "upstream request timeout" error

To reproduce the permission filtering issue:

  1. Create a MongoDB user with collection-specific read permissions (not readAnyDatabase)
  2. Attempt to connect with this user
  3. Observe that either:
    • All collections are still returned (if user has listCollections on db), or
    • Connection fails with "not authorized to execute command listCollections"
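With the nameOnly fix in place, the restricted user in step 1 could be defined via a custom role granting find on only the collections to sync. The command documents below are a hypothetical sketch; the role, user, database, and collection names are placeholders:

```python
# Hypothetical custom role: read access to a single collection only.
create_role_cmd = {
    "createRole": "events_only_reader",
    "privileges": [
        {
            "resource": {"db": "analytics", "collection": "events"},
            "actions": ["find"],
        }
    ],
    "roles": [],
}

# Hypothetical user holding that role (run via db.command on the admin/auth db).
create_user_cmd = {
    "createUser": "pipeline_user",
    "pwd": "<password>",
    "roles": [{"role": "events_only_reader", "db": "analytics"}],
}
print(create_user_cmd["roles"])
```

With authorizedCollections=True and nameOnly=True, such a user needs no database-wide listCollections privilege, and discovery would return only the permitted collections.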

Environment

  • PostHog Cloud

Additional context

Investigation steps performed:

  1. Verified credentials work with direct pymongo connection
  2. Extracted PostHog's MongoDB connector code and ran locally
  3. Observed schema discovery taking 60+ seconds for 73 collections before being killed
  4. Identified the two issues above through code analysis

Relevant code locations:

  • posthog/temporal/data_imports/sources/mongodb/mongo.py:
    • get_schemas() (line ~265) - calls list_collection_names()
    • _get_schema_from_query() (line ~187) - expensive aggregation without sampling

Workaround attempted:
Creating a MongoDB user with collection-level permissions doesn't work because nameOnly=True is not passed to list_collection_names().


Reproducible Example

See test.py for a standalone script that reproduces the issue using PostHog's exact MongoDB connector logic.

# Install dependencies
pip install pymongo certifi

# Set your connection string
export MONGODB_CONNECTION_STRING="mongodb+srv://user:password@cluster.mongodb.net/database?authSource=admin"

# Run the test
python test.py

    Labels

    performance
