Skip to content

fix: Improve Postgres performance#13318

Closed
erezrokah wants to merge 1 commit intocloudquery:mainfrom
erezrokah:fix/postgres_performance
Closed

fix: Improve Postgres performance#13318
erezrokah wants to merge 1 commit intocloudquery:mainfrom
erezrokah:fix/postgres_performance

Conversation

@erezrokah
Copy link
Copy Markdown
Member

@erezrokah erezrokah commented Aug 24, 2023

Summary

Still need to figure out how to replace the logic I deleted, but listing information schema on each batch insert is creating a bottleneck (saw it on pprof). Probably due to read/write locks or our queries to list tables and columns are slow.
Changes in this PR result in about x10 improvement.

Spec used (I used an old source since I tested on an old Postgres version to get the expected numbers):

kind: source
spec:
  name: aws
  registry: "github"
  path: "cloudquery/aws"
  version: "v19.0.0"
  tables:
    - "*"
  skip_tables:
    - "aws_cloudtrail_*"
    - "aws_iam_*"
    - "aws_servicequotas_*"
  destinations:
    - postgresql
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  version: "v5.0.5"
  migrate_mode: forced
  spec:
    connection_string: "postgresql://postgres:pass@localhost:5432/postgres?sslmode=disable"

Before:
image

After (with Postgres running from localhost with this fix):
image

defaultBatchSize = 10000
defaultBatchSizeBytes = 1000000
defaultBatchTimeout = 10 * time.Second
defaultBatchSizeBytes = 100000000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should be in the doc + should be a major bump

@erezrokah
Copy link
Copy Markdown
Member Author

Going to split this PR so we can release the removal of list tables in a non breaking change

@erezrokah
Copy link
Copy Markdown
Member Author

Closing in favor of #13324 and #13323

@erezrokah erezrokah closed this Aug 25, 2023
@erezrokah erezrokah deleted the fix/postgres_performance branch August 25, 2023 09:45
kodiakhq bot pushed a commit that referenced this pull request Aug 25, 2023

#### Summary

Extracted from #13318. Witnessed the bottleneck using `pprof`.
With default batch settings I'm getting `2m29s` sync time instead of ~`11m`,
With `batch_size_bytes: 100000000` and `batch_timeout: 60s` I'm getting `1m40s`. (Fixed in #13324)

Used spec:

```yaml
kind: source
spec:
  name: aws
  path: "cloudquery/aws"
  version: "v19.0.0"
  tables:
    - "*"
  skip_tables:
    - "aws_cloudtrail_*"
    - "aws_iam_*"
    - "aws_servicequotas_*"
  destinations:
    - postgresql
---
kind: destination
spec:
  name: "postgresql"
  registry: "grpc"
  path: localhost:8888
  # path: cloudquery/postgresql
  # version "v5.0.5"
  spec:
    connection_string: "postgresql://postgres:pass@localhost:5432/postgres?sslmode=disable"
    # batch_size_bytes: 100000000
    # batch_timeout: "60s"
```


<!--
kodiakhq bot pushed a commit that referenced this pull request Aug 28, 2023

#### Summary

Extracted from #13318.

BEGIN_COMMIT_OVERRIDE
feat: Increase default batch size bytes to `100000000` (100 MB) and default batch timeout to `60` seconds.
BREAKING-CHANGE: Increase default batch size bytes to `100000000` (100 MB) and default batch timeout to `60` seconds. We discovered a default higher batch size bytes and timeout settings provide better out of the box performance for the PostgreSQL destination. We're marking it as a breaking change as it might increase memory consumption in some environments.

END_COMMIT_OVERRIDE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants