[OpenLineage] Fix datasets in GCSDeleteObjectsOperator #39059

kacpermuda · 2024-04-16T10:00:07Z

Currently, the OpenLineage method includes every deleted file as a dataset, which can inflate the size of the event and make it harder to match datasets between jobs.

With this change, when prefixes are used, they become the dataset names instead of the full file paths. This lets the user control the size of the event and ensures proper matching when the same prefixes are passed to different operators. I am also removing the list of files that was saved for the purpose of lineage datasets, introduced in #35838.
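The prefix-based naming described above can be sketched roughly as follows. This is a simplified illustration, not the operator's actual code: the `Dataset` dataclass and the `lineage_inputs` helper are hypothetical stand-ins, and the `gs://{bucket}` namespace format is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for an OpenLineage dataset (not the real client class)."""

    namespace: str
    name: str


def lineage_inputs(
    bucket_name: str,
    objects: Optional[Sequence[str]] = None,
    prefix: Union[str, Sequence[str], None] = None,
) -> List[Dataset]:
    """Build lineage datasets without listing every deleted file.

    Explicitly passed objects are used as-is; otherwise each prefix
    becomes one dataset name, so the event size is bounded by the
    number of prefixes rather than the number of matching files.
    """
    if objects is not None:
        names = list(objects)
    elif prefix is not None:
        # A single string prefix is treated as a one-element list.
        names = [prefix] if isinstance(prefix, str) else list(prefix)
    else:
        names = []
    # Assumed namespace format for this sketch: gs://<bucket>
    return [Dataset(namespace=f"gs://{bucket_name}", name=n) for n in names]
```

With this shape, two operators that receive the same prefix produce the same dataset name, regardless of how many files currently match that prefix.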

When reviewing, please take a look at the test cases to see how the code will behave now.

I am also adjusting the prefix typing (hook.list accepts a list of prefixes) and adding error raising that, in my opinion, was missing.
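The description does not spell out which error was missing, so the sketch below is only one plausible reading: it assumes that exactly one of `objects` and `prefix` must be provided, and that a prefix may be either a string or a list of strings. The `normalize_delete_args` helper and its error messages are hypothetical and do not come from the provider code.

```python
from typing import List, Optional, Sequence, Tuple, Union


def normalize_delete_args(
    objects: Optional[Sequence[str]] = None,
    prefix: Union[str, Sequence[str], None] = None,
) -> Tuple[List[str], List[str]]:
    """Validate the arguments and return (objects, prefixes) as lists.

    Assumption for this sketch: exactly one of the two must be set.
    """
    if objects is None and prefix is None:
        raise ValueError("Either objects or prefix should be set. Both are None.")
    if objects is not None and prefix is not None:
        raise ValueError("Objects and prefix cannot both be set.")
    if objects is not None:
        return list(objects), []
    # Accept a single prefix or a list of prefixes.
    prefixes = [prefix] if isinstance(prefix, str) else list(prefix)
    return [], prefixes
```

Failing early in this way surfaces a misconfigured task at argument-handling time instead of during the delete call.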


tests/providers/google/cloud/operators/test_gcs.py

…in GCSDeleteObjectsOperator Signed-off-by: Kacper Muda <mudakacper@gmail.com>

boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Apr 16, 2024

kacpermuda force-pushed the ol-fix-gcs-delete branch from 5442e7d to d0d875f Compare April 16, 2024 10:57

mobuchowski reviewed Apr 17, 2024


tests/providers/google/cloud/operators/test_gcs.py (outdated)

fix: Use prefixes instead of all file paths for OpenLineage datasets …

9fbdfbb

…in GCSDeleteObjectsOperator Signed-off-by: Kacper Muda <mudakacper@gmail.com>

kacpermuda force-pushed the ol-fix-gcs-delete branch from d0d875f to 9fbdfbb Compare April 18, 2024 07:22

kacpermuda requested a review from mobuchowski April 18, 2024 10:37

mobuchowski approved these changes Apr 18, 2024


mobuchowski merged commit 17e60b0 into apache:main Apr 18, 2024

kacpermuda deleted the ol-fix-gcs-delete branch April 18, 2024 17:32

eladkal mentioned this pull request May 1, 2024

Status of testing Providers that were prepared on May 01, 2024 #39346

Closed

eladkal mentioned this pull request May 12, 2024

Status of testing Providers that were prepared on May 12, 2024 #39578

Closed

66 tasks

2 participants
