
feat!: Add overwrite-delete-stale mode for destination plugins #224

Merged

yevgenypats merged 8 commits into main from feat/delete_stale on Oct 3, 2022

Conversation

yevgenypats (Contributor) commented Oct 2, 2022

This introduces a number of things:

  • Major: Adds overwrite-delete-stale mode
  • Major: Makes the destination plugin API consistent with the source plugin API
  • Minor: Removes the dependency on the snapshot testing library, as we want to avoid snapshot tests and use standard comparisons and/or Google's diff library instead (see one of many articles: https://medium.com/@sapegin/whats-wrong-with-snapshot-tests-37fbe20dfe8e). I also had extensive experience with snapshot testing in React, so I definitely agree with the article.

Related postgresql and CLI support - cloudquery/cloudquery#2220
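The core semantics of the new mode can be illustrated with a small runnable sketch (Python/sqlite3 purely for illustration; the actual SDK and destination plugins are Go, and the table name `aws_s3_buckets` and the timestamp values are hypothetical): every row is stamped with `_cq_source_name` and `_cq_sync_time`, and after a sync completes the destination deletes rows from the same source whose sync time is older than the current sync's.

```python
import sqlite3

# Illustrative sketch of overwrite-delete-stale semantics; not the actual
# Go implementation. Table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE aws_s3_buckets (
        _cq_source_name TEXT,
        _cq_sync_time TEXT,
        name TEXT
    )
""")

# A row left over from an earlier sync of source "aws-prod".
conn.execute(
    "INSERT INTO aws_s3_buckets VALUES ('aws-prod', '2022-10-01T00:00:00Z', 'old-bucket')"
)

# The current sync writes its resources stamped with the new sync time...
sync_time = "2022-10-02T00:00:00Z"
conn.execute(
    "INSERT INTO aws_s3_buckets VALUES ('aws-prod', ?, 'new-bucket')", (sync_time,)
)

# ...then deletes stale rows: same source name, older sync time.
conn.execute(
    "DELETE FROM aws_s3_buckets WHERE _cq_source_name = ? AND _cq_sync_time < ?",
    ("aws-prod", sync_time),
)

remaining = [row[0] for row in conn.execute("SELECT name FROM aws_s3_buckets")]
print(remaining)  # ['new-bucket']
```

Note that rows from a differently named source are untouched by the delete, which is what makes the uniqueness of `source_name` matter in the discussion below.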

hermanschaaf (Contributor) left a comment

I'm not quite done with the review, but leaving some questions in the meantime

Comment on lines +14 to +15
// GetSyncSummary returns the latest sync summary of the source plugin. we don't want to send the summary on
// every sync request.
Contributor

Why wouldn't we want to send the sync summary with every sync request? It seems like a minor overhead for a function that only gets called ~once per day by most users, and means the plugin doesn't have to store that state. My vote would be to send it as part of the Sync response.

Contributor Author

That can work.

Contributor Author

Let's do this in a follow-up PR though, as this one is already pretty big.

erezrokah (Member) left a comment

Added a few comments, but I think a live overview of this feature would make the PR much easier to understand.

If I got this correctly, we use sync_time and source_name to determine which resources to delete, which makes sense.

If so, and also based on your comment, source_name has to be unique across configuration files, correct?

Also, if someone renames a source, the logic fails, correct?

Comment on lines +16 to +17
|_cq_source_name|String|
|_cq_sync_time|Timestamp|
Contributor

Instead of _cq_source_name and _cq_sync_time, which will be the same for all inserted rows in a sync, I would like to suggest that we introduce a CloudQuery syncs table, for which an ID gets referenced here.

So for example, here we'd add one new column, _cq_sync_id. In another table, maybe called cq_syncs, we can list all past syncs with the following columns to start with:

id text (pk)
source_name text
start_time timestamp -- indexed

--- other ideas, not strictly required now:
end_time timestamp -- update at end of sync
status text -- updated when we transition from one state to another
errors uint
warnings uint
metadata jsonb -- plugins can use this to store cursors, for example

For delete-stale functionality, we wouldn't need to join or reference this table: we would only need to delete all rows with sync IDs that are not the current sync's ID.

Users who wish to delete stale records on their own can query this table to get the latest sync ID and then delete all rows with a different sync id. If we expand it with some more data (some ideas listed above), this table could also give users a great place to start monitoring their syncs without needing to parse CLI output or logs.

In any case, even if we keep it minimal for now, such a table will allow us to expand our metadata in the future without polluting all tables with new columns.

This table could also be useful for future functionality, such as cursors, where source plugins need to store some information for retrieval on the next sync. But that can also be designed differently.
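For comparison, the suggested `cq_syncs`/`_cq_sync_id` design can be sketched the same way (Python/sqlite3 purely for illustration; this PR does not implement it, and the table names and data are hypothetical, following the proposal above): data rows carry only a sync ID, per-sync metadata lives in `cq_syncs`, and delete-stale drops every row not tagged with the current sync's ID, without any join.

```python
import sqlite3
import uuid

# Hypothetical sketch of the proposed _cq_sync_id design; not what this PR
# implements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cq_syncs (id TEXT PRIMARY KEY, source_name TEXT, start_time TEXT)"
)
conn.execute("CREATE TABLE aws_s3_buckets (_cq_sync_id TEXT, name TEXT)")

old_id, new_id = str(uuid.uuid4()), str(uuid.uuid4())

# An earlier sync registered itself and wrote a row.
conn.execute("INSERT INTO cq_syncs VALUES (?, 'aws-prod', '2022-10-01T00:00:00Z')", (old_id,))
conn.execute("INSERT INTO aws_s3_buckets VALUES (?, 'old-bucket')", (old_id,))

# The current sync registers itself and writes its rows.
conn.execute("INSERT INTO cq_syncs VALUES (?, 'aws-prod', '2022-10-02T00:00:00Z')", (new_id,))
conn.execute("INSERT INTO aws_s3_buckets VALUES (?, 'new-bucket')", (new_id,))

# Delete-stale needs no join: drop every row not tagged with the current ID.
conn.execute("DELETE FROM aws_s3_buckets WHERE _cq_sync_id != ?", (new_id,))

remaining = [row[0] for row in conn.execute("SELECT name FROM aws_s3_buckets")]
print(remaining)  # ['new-bucket']
```

As sketched, the delete is not scoped per source, so it assumes either one source per destination table or an additional filter via `cq_syncs.source_name`; that trade-off is part of what the thread below weighs against the simpler per-row columns.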

Contributor Author

For now I would avoid the need for an additional table and foreign keys. I think it is quite common in data lakes to have duplicates in favour of simplicity.

So basically: simple implementation first -> operational experience -> more complex one later.

We can still add a table with all sync times later on, but I would still avoid foreign keys, as they are not supported in all databases. I don't think this is a blocker, though, and we can always add this table later.

Contributor

Agree, makes sense to keep things simple at first, and data lakes often do denormalize for the sake of simplicity. I don't think a database needs to strictly support foreign keys for this secondary table to be useful though; you can always look up the latest ID and then run further queries using it (accepting the risk in this case that operations won't necessarily be atomic, again depending on the database).

For me the important thing is having a handle to a unique sync id with which you can reference further information, should you need it. If we think that's something we'll add later, to maintain backwards-compatibility we will then have 5 _cq columns instead of 3:

|_cq_id|UUID|
|_cq_parent_id|UUID|
|_cq_source_name|String|
|_cq_sync_time|Timestamp|
|_cq_sync_id|UUID|

I think that's okay, and if you think it's okay we can do it, but worth noting that we have the chance now to keep the number of internal columns to their theoretical minimum.

Or the other option, down the line, would be to use sync_time and source_name as a combined PK, if we want to keep it to 4 internal columns but make joins a bit harder.

Contributor Author

I think _cq_sync_time (plus _cq_source_name) should be good enough for uniqueness for now, and I wouldn't optimize for a small number of columns yet, as it doesn't really cost us anything at the moment. So I'd suggest looking into that down the line, when/if it becomes an option.

yevgenypats and others added 2 commits October 3, 2022 13:19
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
yevgenypats (Contributor Author)

> Added a few comments, but I think a live overview of this feature will make the PR much easier to understand.
>
> If I got this correctly we use sync_time and source_name to determine which resources to delete which makes sense.
>
> If so and also based on your comment, source_name has to be unique across configuration files, correct?
>
> Also, if someone renames a source the logic fails correct?

Exactly. If someone changes the name, it will delete older stale data from a plugin with that name (if such a sync ran before). But again, it goes back to the comment in the issue: we cannot implement a magic function in a reliable way that knows what is new, what was fetched, and where we had permission errors and where not. The only way to do this is with a marker per sync, with the user responsible for running it correctly. This is why I wasn't in favour of this feature in general, as we could just leave it to the user to delete whatever data they wish on their own schedule. But we can keep it as an option and see if it causes more confusion than value.

@yevgenypats yevgenypats requested a review from erezrokah October 3, 2022 11:03
@hermanschaaf hermanschaaf changed the title feat!: Add overwrite-deletestale mode for destination plugins feat!: Add overwrite-delete-stale mode for destination plugins Oct 3, 2022
@github-actions github-actions bot added breaking and removed feat labels Oct 3, 2022
@github-actions github-actions bot added the feat label Oct 3, 2022
erezrokah (Member) left a comment

I think we can merge the initial PR and iterate on it with follow-up PRs.

@yevgenypats yevgenypats enabled auto-merge (squash) October 3, 2022 15:07
@yevgenypats yevgenypats merged commit 567121d into main Oct 3, 2022
@yevgenypats yevgenypats deleted the feat/delete_stale branch October 3, 2022 15:08
yevgenypats pushed a commit that referenced this pull request Oct 3, 2022
🤖 I have created a release *beep* *boop*
---

## [0.12.0](v0.11.7...v0.12.0) (2022-10-03)


### ⚠ BREAKING CHANGES

* Add overwrite-delete-stale mode for destination plugins (#224)

### Features

* Add overwrite-delete-stale mode for destination plugins
([#224](#224))
([567121d](567121d))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
yevgenypats added a commit to cloudquery/cloudquery that referenced this pull request Oct 3, 2022

#### Summary

This adds the following:
- Support for delete-stale via cloudquery/plugin-sdk#224
- Numerous bugfixes
- Much-needed tests for the postgresql destination plugin, split into a number of files as it had already grown a lot
- Status reporting via normal counts rather than via logs, so behaviour is consistent between debug/release; in general, log parsing can be quite fragile

Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
3 participants