feat!: Add overwrite-delete-stale mode for destination plugins#224
feat!: Add overwrite-delete-stale mode for destination plugins#224yevgenypats merged 8 commits intomainfrom
Conversation
hermanschaaf
left a comment
There was a problem hiding this comment.
I'm not quite done with the review, but leaving some questions in the meantime
| // GetSyncSummary returns the latest sync summary of the source plugin. we don't want to send the summary on | ||
| // every sync request. |
There was a problem hiding this comment.
Why wouldn't we want to send the sync summary with every sync request? It seems like a minor overhead for a function that only gets called ~once per day by most users, and means the plugin doesn't have to store that state. My vote would be to send it as part of the Sync response.
There was a problem hiding this comment.
let's do this in a follow-up PR though as it's already pretty big
erezrokah
left a comment
There was a problem hiding this comment.
Added a few comments, but I think a live overview of this feature will make the PR much easier to understand.
If I got this correctly we use sync_time and source_name to determine which resources to delete which makes sense.
If so and also based on your comment, source_name has to be unique across configuration files, correct?
Also, if someone renames a source the logic fails correct?
| |_cq_source_name|String| | ||
| |_cq_sync_time|Timestamp| |
There was a problem hiding this comment.
Instead of _cq_source_name and _cq_sync_time, which will be the same for all inserted rows in a sync, I would like to suggest that we introduce a CloudQuery syncs table, for which an ID gets referenced here.
So for example, here we'd add one new column, _cq_sync_id. In another table, maybe called cq_syncs, we can list all past syncs with the following columns to start with:
id text (pk)
source_name text
start_time timestamp -- indexed
--- other ideas, not strictly required now:
end_time timestamp -- update at end of sync
status text -- updated when we transition from one state to another
errors uint
warnings uint
metadata jsonb -- plugins can use this to store cursors, for example
For delete-stale functionality, we wouldn't need to join or reference this table: we would only need to delete all rows with sync IDs that are not the current sync's ID.
Users who wish to delete stale records on their own can query this table to get the latest sync ID and then delete all rows with a different sync id. If we expand it with some more data (some ideas listed above), this table could also give users a great place to start monitoring their syncs without needing to parse CLI output or logs.
In any case, even if we keep it minimal for now, such a table will allow us to expand our metadata in the future without polluting all tables with new columns.
This table could also be useful for future functionality, such as cursors, where source plugins need to store some information for retrieval on the next sync. But that can also be designed differently.
There was a problem hiding this comment.
I would avoid for now the need for additional table and foreign keys. I think it is quite common in datalakes to have duplicates in favour of simplicity.
So basically simple implementation first -> operational experience -> more complex one later.
we can still add later on a table with all sync times but I would still avoid the foreign keys as they are also not supported in all databases. but I don't think this a blocker and we can always add this table later.
There was a problem hiding this comment.
Agree, makes sense to keep things simple at first, and data lakes often do denormalize for the sake of simplicity. I don't think a database needs to strictly support foreign keys for this secondary table to be useful though; you can always look up the latest ID and then run further queries using it (accepting the risk in this case that operations won't necessarily be atomic, again depending on the database).
For me the important thing is having a handle to a unique sync id with which you can reference further information, should you need it. If we think that's something we'll add later, to maintain backwards-compatibility we will then have 5 _cq columns instead of 3:
|_cq_id|UUID|
|_cq_parent_id|UUID|
|_cq_source_name|String|
|_cq_sync_time|Timestamp|
|_cq_sync_id|UUID|
I think that's okay, and if you think it's okay we can do it, but worth noting that we have the chance now to keep the number of internal columns to their theoretical minimum.
Or the other option, down the line, would be to use sync_time and source_name as a combined PK, if we want to keep it to 4 internal columns but make joins a bit harder.
There was a problem hiding this comment.
I think _cq_sync_time (_source_name) should be good enough for now for uniqueness and I wouldn't optimize yet for small amount of columns it doesn't really cost us anything atm. so would suggest to look into that down the line when/if it will become an option.
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
exactly if the someone will change the name of this, it will delete older stale data from a plugin with that name (if such run before). but again it goes back to the comment in the issue is that we cannot implement a magic function in a reliable way that will know what is new what was fetched, where we had permission errors, where not. the only way to do so is just with a marker per sync and the user is responsible in running this correctly. This is why I wasn't in favour of this feature in general as we could put it just on the user to delete whatever data they wish on their schedule. But we can keep it as option and see if it causes more confusion than value. |
erezrokah
left a comment
There was a problem hiding this comment.
I think we can merge the initial PR and iterate on it with follow up PRs
🤖 I have created a release *beep* *boop* --- ## [0.12.0](v0.11.7...v0.12.0) (2022-10-03) ### ⚠ BREAKING CHANGES * Add overwrite-delete-stale mode for destination plugins (#224) ### Features * Add overwrite-delete-stale mode for destination plugins ([#224](#224)) ([567121d](567121d)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
<!-- 🎉 Thank you for making CloudQuery awesome by submitting a PR 🎉 --> #### Summary This adds the following: - Support for delete stale via cloudquery/plugin-sdk#224 - numerous bugfixes - much needed tests for the postgresql destination plugin and split to number of files as it grew alot already. - report status via normal counts and not via logs so behaviour is consistent between debug/release and in general log parsing can be quite fragile. Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
This introduce number of things:
overwrite-deletestalemodeRelated postgresql and CLI support - cloudquery/cloudquery#2220