Deletion of Stale Data #2168

@bbernays

Overview:

In v1, CloudQuery never deletes any data. This means that resources that have been deleted in the source are never updated or removed from the database. This negatively impacts users in two ways:

  • Deleted resources that fail policy evaluations will always be marked as failed
  • Users must manually access the CQ database and identify which tables/rows they want to delete

Overall, CloudQuery needs a mechanism for indicating to users that a resource is no longer present in the source. This can be done either by marking the resource as deleted via a boolean/timestamp column or by actually deleting the record. During this process CloudQuery should only interact with resources that were fetched.
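
The two options above (marking versus deleting) can be compared with a minimal sketch. Everything here is an assumption made for illustration: the `Row` type, the `DeletedAtUnix` column, and the `Reconcile` function are hypothetical names and not part of CloudQuery. The sketch also assumes all rows passed in belong to a resource that was actually fetched.

```go
package main

import (
	"fmt"
	"time"
)

// Row is a hypothetical stored resource. DeletedAtUnix is 0 while the
// resource is still believed to exist in the source.
type Row struct {
	ID            string
	DeletedAtUnix int64
}

// Reconcile compares stored rows with the IDs seen during a fetch.
// With hardDelete it drops rows that were not seen; otherwise it keeps
// them and stamps DeletedAtUnix the first time they go missing.
func Reconcile(rows []Row, seen map[string]bool, hardDelete bool, nowUnix int64) []Row {
	out := make([]Row, 0, len(rows))
	for _, r := range rows {
		if seen[r.ID] {
			r.DeletedAtUnix = 0 // present in the source, clear any earlier mark
			out = append(out, r)
			continue
		}
		if hardDelete {
			continue // drop the record entirely
		}
		if r.DeletedAtUnix == 0 {
			r.DeletedAtUnix = nowUnix // soft delete: keep the row, mark it
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []Row{{ID: "i-1"}, {ID: "i-2"}}
	seen := map[string]bool{"i-1": true}
	now := time.Now().Unix()

	soft := Reconcile(rows, seen, false, now)
	hard := Reconcile(rows, seen, true, now)
	fmt.Println(len(soft), soft[1].DeletedAtUnix != 0, len(hard)) // prints "2 true 1"
}
```

With the soft-delete variant, a policy query could filter on the marker column, while the hard-delete variant needs no query changes but loses the history of the resource.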

Cases:
All examples use AWS, but they apply to any resource that supports dynamic multiplexing.

  1. Single config file: hard-coded accounts
    • At the end of the fetch, the only resources left should be those that exist in the source accounts
  2. Single config file: multiple concurrent triggers + overlapping accounts and resources
    • Case does not need to be supported
  3. Single config file: dynamic accounts (via Orgs)
    • At the end of the fetch, for each multiplexed resource that was successfully fetched we should delete records that were not present in the source
  4. Multiple config files and triggers: same hard-coded accounts + unique resources
    • At the end of the fetch, for each multiplexed resource that was successfully fetched we should delete records that were not present in the source
  5. Multiple config files and triggers: different hard-coded accounts + overlapping resources
    • At the end of the fetch, for each multiplexed resource that was successfully fetched we should delete records that were not present in the source
  6. Multiple config files and concurrent triggers: same dynamic accounts + unique resources
    • At the end of the fetch, for each multiplexed resource that was successfully fetched we should delete records that were not present in the source
  7. Multiple config files and concurrent triggers: same dynamic accounts + overlapping resources
    • Case does not need to be supported
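
The rule repeated in the supported cases ("for each multiplexed resource that was successfully fetched, delete records that were not present in the source") can be sketched as follows. This is only an illustration under assumptions: `Scope`, `Row`, `ListResult`, `StaleRows`, and the table and account values are hypothetical and do not reflect CloudQuery's actual schema or SDK. The sketch assumes each stored row records the table/account/region combination that produced it.

```go
package main

import (
	"errors"
	"fmt"
)

// errAccessDenied stands in for any failed list call (hypothetical).
var errAccessDenied = errors.New("access denied")

// Scope is a hypothetical key for one multiplexed list call.
type Scope struct {
	Table, Account, Region string
}

// Row is a hypothetical stored record tagged with the scope that produced it.
type Row struct {
	Scope Scope
	ID    string
}

// ListResult is the outcome of listing one scope during a fetch.
type ListResult struct {
	Scope Scope
	IDs   []string
	Err   error
}

// StaleRows returns stored rows whose scope was listed successfully in this
// fetch but whose ID was not returned. Rows in failed or unfetched scopes
// are never reported, so they are left untouched.
func StaleRows(stored []Row, results []ListResult) []Row {
	seen := make(map[Scope]map[string]bool)
	for _, res := range results {
		if res.Err != nil {
			continue // failed list: existing rows must not be deleted
		}
		ids := seen[res.Scope]
		if ids == nil {
			ids = make(map[string]bool)
			seen[res.Scope] = ids // registered even when the list is empty
		}
		for _, id := range res.IDs {
			ids[id] = true
		}
	}
	var stale []Row
	for _, row := range stored {
		if ids, fetched := seen[row.Scope]; fetched && !ids[row.ID] {
			stale = append(stale, row)
		}
	}
	return stale
}

func main() {
	instA := Scope{"instances", "111111111111", "us-east-1"}
	instB := Scope{"instances", "222222222222", "us-east-1"}
	bktA := Scope{"buckets", "111111111111", "us-east-1"}
	stored := []Row{{instA, "i-1"}, {instA, "i-2"}, {instB, "i-9"}, {bktA, "b-1"}}
	results := []ListResult{
		{Scope: instA, IDs: []string{"i-1"}}, // succeeded, i-2 is gone
		{Scope: bktA, Err: errAccessDenied},  // failed, keep b-1
		// instB belongs to another config file and was not fetched here
	}
	for _, r := range StaleRows(stored, results) {
		fmt.Println(r.Scope.Account, r.ID) // prints "111111111111 i-2"
	}
}
```

Keying the comparison on the scope is what would let separate config files with different accounts or unique resources (cases 4 to 6) run without deleting each other's rows. It does not address the concurrent overlapping cases (2 and 7), which the issue marks as unsupported.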

Biggest Takeaways:

  1. Deletion of deleted resources is required
  2. Fetching of concurrent non-overlapping resources is required
  3. Resources should only be deleted if the list call succeeded

Open Question:

  1. If dynamic multiplexing returns different data, or if the hard-coded accounts change, should CloudQuery assume the data changed because resources were deleted?
    • Permissions might have changed so that CQ is not able to fetch them
    • Configuration might have changed to narrow the data being fetched
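
To make the open question concrete, the sketch below detects accounts or regions that a previous fetch covered but the current one does not. It does not answer the question. It only shows one conservative option, which is to surface such scopes separately instead of treating their rows as deleted. `Scope` and `DroppedScopes` are hypothetical names, and the sketch assumes CloudQuery could record the set of scopes covered by each fetch.

```go
package main

import "fmt"

// Scope identifies one account/region pair a fetch covered (hypothetical).
type Scope struct {
	Account, Region string
}

// DroppedScopes returns scopes covered by the previous fetch but absent from
// the current one. Their rows may be stale, or the scope may have vanished
// because of a permissions or configuration change.
func DroppedScopes(prev, curr []Scope) []Scope {
	current := make(map[Scope]bool, len(curr))
	for _, s := range curr {
		current[s] = true
	}
	var dropped []Scope
	for _, s := range prev {
		if !current[s] {
			dropped = append(dropped, s)
		}
	}
	return dropped
}

func main() {
	prev := []Scope{{"111111111111", "us-east-1"}, {"222222222222", "us-east-1"}}
	curr := []Scope{{"111111111111", "us-east-1"}}
	fmt.Println(DroppedScopes(prev, curr)) // prints "[{222222222222 us-east-1}]"
}
```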

Notes:
When I refer to CQ deleting records that have been removed from the source, I do not mean that CQ must delete them from the DB, only that those records need to be easily identifiable.
