
feat!: Add overwrite-delete-stale mode for destination plugins #224

Merged

yevgenypats merged 8 commits into main from feat/delete_stale on Oct 3, 2022

Conversation

yevgenypats (Contributor) commented Oct 2, 2022

This introduces a number of things:

  • Major: Adds overwrite-delete-stale mode
  • Major: Makes the destination plugin API consistent with the source plugin API
  • Minor: Removes the dependency on the snapshot testing library, as we want to avoid snapshot tests and use standard comparisons and/or Google's diff library instead (see one of many articles: https://medium.com/@sapegin/whats-wrong-with-snapshot-tests-37fbe20dfe8e). I also had extensive experience with snapshot testing in React, so I definitely agree with the article.

Related postgresql and CLI support - cloudquery/cloudquery#2220
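The core semantics of the new mode can be illustrated with a small runnable sketch (Python/sqlite3 purely for illustration; the actual SDK and destination plugins are Go, and the table name `aws_s3_buckets` and the timestamp values are hypothetical): every row is stamped with `_cq_source_name` and `_cq_sync_time`, and after a sync completes the destination deletes rows from the same source whose sync time is older than the current sync's.

```python
import sqlite3

# Illustrative sketch of overwrite-delete-stale semantics; not the actual
# Go implementation. Table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE aws_s3_buckets (
        _cq_source_name TEXT,
        _cq_sync_time TEXT,
        name TEXT
    )
""")

# A row left over from an earlier sync of source "aws-prod".
conn.execute(
    "INSERT INTO aws_s3_buckets VALUES ('aws-prod', '2022-10-01T00:00:00Z', 'old-bucket')"
)

# The current sync writes its resources stamped with the new sync time...
sync_time = "2022-10-02T00:00:00Z"
conn.execute(
    "INSERT INTO aws_s3_buckets VALUES ('aws-prod', ?, 'new-bucket')", (sync_time,)
)

# ...then deletes stale rows: same source name, older sync time.
conn.execute(
    "DELETE FROM aws_s3_buckets WHERE _cq_source_name = ? AND _cq_sync_time < ?",
    ("aws-prod", sync_time),
)

remaining = [row[0] for row in conn.execute("SELECT name FROM aws_s3_buckets")]
print(remaining)  # ['new-bucket']
```

Note that rows from a differently named source are untouched by the delete, which is what makes the uniqueness of `source_name` matter in the discussion below.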

hermanschaaf (Contributor) left a comment

I'm not quite done with the review, but leaving some questions in the meantime

Comment on lines +14 to +15
// GetSyncSummary returns the latest sync summary of the source plugin. we don't want to send the summary on
// every sync request.
Contributor

Why wouldn't we want to send the sync summary with every sync request? It seems like a minor overhead for a function that only gets called ~once per day by most users, and means the plugin doesn't have to store that state. My vote would be to send it as part of the Sync response.

Contributor Author

That can work.

Contributor Author

Let's do this in a follow-up PR though, as this one is already pretty big.

erezrokah (Member) left a comment

Added a few comments, but I think a live overview of this feature would make the PR much easier to understand.

If I got this correctly, we use sync_time and source_name to determine which resources to delete, which makes sense.

If so, and also based on your comment, source_name has to be unique across configuration files, correct?

Also, if someone renames a source, the logic fails, correct?

Comment on lines +16 to +17
|_cq_source_name|String|
|_cq_sync_time|Timestamp|
Contributor

Instead of _cq_source_name and _cq_sync_time, which will be the same for all inserted rows in a sync, I would like to suggest that we introduce a CloudQuery syncs table, for which an ID gets referenced here.

So for example, here we'd add one new column, _cq_sync_id. In another table, maybe called cq_syncs, we can list all past syncs with the following columns to start with:

id text (pk)
source_name text
start_time timestamp -- indexed

--- other ideas, not strictly required now:
end_time timestamp -- update at end of sync
status text -- updated when we transition from one state to another
errors uint
warnings uint
metadata jsonb -- plugins can use this to store cursors, for example

For delete-stale functionality, we wouldn't need to join or reference this table: we would only need to delete all rows with sync IDs that are not the current sync's ID.

Users who wish to delete stale records on their own can query this table to get the latest sync ID and then delete all rows with a different sync id. If we expand it with some more data (some ideas listed above), this table could also give users a great place to start monitoring their syncs without needing to parse CLI output or logs.

In any case, even if we keep it minimal for now, such a table will allow us to expand our metadata in the future without polluting all tables with new columns.

This table could also be useful for future functionality, such as cursors, where source plugins need to store some information for retrieval on the next sync. But that can also be designed differently.
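For comparison, the suggested `cq_syncs`/`_cq_sync_id` design can be sketched the same way (Python/sqlite3 purely for illustration; this PR does not implement it, and the table names and data are hypothetical, following the proposal above): data rows carry only a sync ID, per-sync metadata lives in `cq_syncs`, and delete-stale drops every row not tagged with the current sync's ID, without any join.

```python
import sqlite3
import uuid

# Hypothetical sketch of the proposed _cq_sync_id design; not what this PR
# implements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cq_syncs (id TEXT PRIMARY KEY, source_name TEXT, start_time TEXT)"
)
conn.execute("CREATE TABLE aws_s3_buckets (_cq_sync_id TEXT, name TEXT)")

old_id, new_id = str(uuid.uuid4()), str(uuid.uuid4())

# An earlier sync registered itself and wrote a row.
conn.execute("INSERT INTO cq_syncs VALUES (?, 'aws-prod', '2022-10-01T00:00:00Z')", (old_id,))
conn.execute("INSERT INTO aws_s3_buckets VALUES (?, 'old-bucket')", (old_id,))

# The current sync registers itself and writes its rows.
conn.execute("INSERT INTO cq_syncs VALUES (?, 'aws-prod', '2022-10-02T00:00:00Z')", (new_id,))
conn.execute("INSERT INTO aws_s3_buckets VALUES (?, 'new-bucket')", (new_id,))

# Delete-stale needs no join: drop every row not tagged with the current ID.
conn.execute("DELETE FROM aws_s3_buckets WHERE _cq_sync_id != ?", (new_id,))

remaining = [row[0] for row in conn.execute("SELECT name FROM aws_s3_buckets")]
print(remaining)  # ['new-bucket']
```

As sketched, the delete is not scoped per source, so it assumes either one source per destination table or an additional filter via `cq_syncs.source_name`; that trade-off is part of what the thread below weighs against the simpler per-row columns.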

Contributor Author

For now I would avoid the need for an additional table and foreign keys. I think it is quite common in data lakes to have duplicates in favour of simplicity.

So basically: simple implementation first -> operational experience -> more complex one later.

We can still add a table with all sync times later on, but I would still avoid foreign keys, as they are not supported in all databases. I don't think this is a blocker, though, and we can always add this table later.

Contributor

Agree, makes sense to keep things simple at first, and data lakes often do denormalize for the sake of simplicity. I don't think a database needs to strictly support foreign keys for this secondary table to be useful though; you can always look up the latest ID and then run further queries using it (accepting the risk in this case that operations won't necessarily be atomic, again depending on the database).

For me the important thing is having a handle to a unique sync id with which you can reference further information, should you need it. If we think that's something we'll add later, to maintain backwards-compatibility we will then have 5 _cq columns instead of 3:

|_cq_id|UUID|
|_cq_parent_id|UUID|
|_cq_source_name|String|
|_cq_sync_time|Timestamp|
|_cq_sync_id|UUID|

I think that's okay, and if you think it's okay we can do it, but worth noting that we have the chance now to keep the number of internal columns to their theoretical minimum.

Or the other option, down the line, would be to use sync_time and source_name as a combined PK, if we want to keep it to 4 internal columns but make joins a bit harder.

Contributor Author

I think _cq_sync_time (plus _cq_source_name) should be good enough for uniqueness for now, and I wouldn't optimize for a small number of columns yet, as it doesn't really cost us anything at the moment. So I'd suggest looking into that down the line, when/if it becomes an option.

yevgenypats and others added 2 commits October 3, 2022 13:19
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
yevgenypats (Contributor Author)

> Added a few comments, but I think a live overview of this feature will make the PR much easier to understand.
>
> If I got this correctly we use sync_time and source_name to determine which resources to delete which makes sense.
>
> If so and also based on your comment, source_name has to be unique across configuration files, correct?
>
> Also, if someone renames a source the logic fails correct?

Exactly. If someone changes the name, it will delete older stale data from a plugin with that name (if such a sync ran before). But again, it goes back to the comment in the issue: we cannot implement a magic function in a reliable way that knows what is new, what was fetched, and where we had permission errors and where not. The only way to do this is with a marker per sync, with the user responsible for running it correctly. This is why I wasn't in favour of this feature in general, as we could just leave it to the user to delete whatever data they wish on their own schedule. But we can keep it as an option and see if it causes more confusion than value.

@yevgenypats yevgenypats requested a review from erezrokah October 3, 2022 11:03
@hermanschaaf hermanschaaf changed the title feat!: Add overwrite-deletestale mode for destination plugins feat!: Add overwrite-delete-stale mode for destination plugins Oct 3, 2022
@github-actions github-actions bot added breaking and removed feat labels Oct 3, 2022
@github-actions github-actions bot added the feat label Oct 3, 2022
erezrokah (Member) left a comment

I think we can merge the initial PR and iterate on it with follow-up PRs.

@yevgenypats yevgenypats enabled auto-merge (squash) October 3, 2022 15:07
@yevgenypats yevgenypats merged commit 567121d into main Oct 3, 2022
@yevgenypats yevgenypats deleted the feat/delete_stale branch October 3, 2022 15:08
yevgenypats pushed a commit that referenced this pull request Oct 3, 2022
🤖 I have created a release *beep* *boop*
---

## [0.12.0](v0.11.7...v0.12.0) (2022-10-03)


### ⚠ BREAKING CHANGES

* Add overwrite-delete-stale mode for destination plugins (#224)

### Features

* Add overwrite-delete-stale mode for destination plugins
([#224](#224))
([567121d](567121d))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
yevgenypats added a commit to cloudquery/cloudquery that referenced this pull request Oct 3, 2022

#### Summary

This adds the following:
- Support for delete-stale via cloudquery/plugin-sdk#224
- Numerous bugfixes
- Much-needed tests for the postgresql destination plugin, split into a number of files as it had already grown a lot
- Status reporting via normal counts rather than via logs, so behaviour is consistent between debug/release; in general, log parsing can be quite fragile

Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
3 participants