Skip to content

Conversation

@sschaetz
Copy link
Contributor

@sschaetz sschaetz commented Jan 31, 2021

This change adds a new GCS transform operator. It is based on the existing transform operator with the addition that it will transform multiple files that match the prefix and that were updated within a time-span. The time-span is implicitly defined: it is the time between the current execution timestamp of the DAG instance (time-span start) and the next execution timestamp of the DAG (time-span end).

The use-case is some entity generates files at irregular intervals and an undefined number. The operator will pick up all files that were updated since it executed last. Typically the transform script will iterate over the files, open them, extract some information, collate them into one or more files and upload them to GCS. These result files can then be loaded into BigQuery or processed further or served via a webserver.

Here is an illustration that shows the concept:

bitmap

@boring-cyborg boring-cyborg bot added the provider:google Google (including GCP) related issues label Jan 31, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Jan 31, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@sschaetz sschaetz force-pushed the gcs-xform-timespan-operator branch 3 times, most recently from 9043e63 to c3e85d6 Compare January 31, 2021 20:08
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import dateutil.tz
from airflow.settings import TIMEZONE

I think using this would be better. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - when we compare against GCS modified time we want to use UTC (I think GCS reports in UTC). But maybe for the transfer function it would make more sense to get the timestamps in the local time. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess airflow.timezone.utc would be better for the GCS modified time then, just for uniformity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I took this from an existing hook - but the next commit has it fixed in both spots.

@sschaetz
Copy link
Contributor Author

sschaetz commented Feb 6, 2021

@turbaszek since you implemented the original transform-operator I wonder if you would be willing to take a look at this one?

@sschaetz
Copy link
Contributor Author

Capturing the outcome of the slack conversation between @turbaszek and myself - I will add:

  • a retry mechanism
  • a flag to allow users to configure to warn or error if a failure occurs

@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@github-actions
Copy link

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@sschaetz sschaetz force-pushed the gcs-xform-timespan-operator branch from e93fa56 to 850fe4c Compare February 27, 2021 14:22
@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@sschaetz sschaetz force-pushed the gcs-xform-timespan-operator branch from 850fe4c to 99eed57 Compare February 27, 2021 15:26
@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@github-actions
Copy link

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@sschaetz sschaetz force-pushed the gcs-xform-timespan-operator branch 2 times, most recently from f43abe1 to d1a1df0 Compare February 27, 2021 20:50
@github-actions
Copy link

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the the workflow link to check the reason.

@sschaetz sschaetz force-pushed the gcs-xform-timespan-operator branch from d1a1df0 to c79a55f Compare March 14, 2021 11:20
@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Mar 14, 2021
@github-actions
Copy link

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest master or amend the last commit of the PR, and push it with --force-with-lease.

@kaxil kaxil merged commit ddc9133 into apache:master Mar 15, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Mar 15, 2021

Awesome work, congrats on your first merged pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

okay to merge It's ok to merge this PR as it does not require more tests provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants