Skip to content

Split providers out of the main "airflow/" tree into a UV workspace project#42505

Merged
ashb merged 1 commit intoapache:mainfrom
astronomer:uv-workspace-for-providers
Oct 9, 2024
Merged

Split providers out of the main "airflow/" tree into a UV workspace project#42505
ashb merged 1 commit intoapache:mainfrom
astronomer:uv-workspace-for-providers

Conversation

@ashb
Copy link
Member

@ashb ashb commented Sep 26, 2024

closes #42857

As discussed https://lists.apache.org/thread/dyv5jhvt65xs6l5o2byc2b67f4wlwf6r, this is the first part of the new layout.

This is only a partial split so far. It moves all the code and tests, but the provider release generation code hasn't moved yet as it's a few weeks until the next batch, and I wan't to try and keep this already huge PR as small as possible. Likewise the creation of core/ will be a follow on PR

In addition to the straight file rename the other changes I had to make here are:

  • Some mypy/typing fixes.

    Mypy can be fragile about what it picks up when, so maybe some of those
    changes were caused by that. But the typing changes aren't large.

  • Improve typing in common.sql type stub

    Again, likely a mypy file oddity, but the types should be safe

  • Removed the check-providers-init-file-missing check

    This isn't needed now that airflow/providers shouldn't exist at all in the
    main tree.

  • Create a "dev.tests_common" package that contains helper files and common
    pytest fixtures

    Since the provider tests are no longer under tests/ they don't automatically
    share the fixtures from the parent tests/conftest.py so they needed
    extracted.

    Ditto for tests.test_utils -- they can't be easily imported in provider
    tests anymore, so they are moved to a more explicit shared location.

In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remvoe a lot of custom code.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@ashb
Copy link
Member Author

ashb commented Sep 26, 2024

Oh, prepare provider packages is failing for prod image build. I guess that's going to have to be in this PR then

@ashb ashb force-pushed the uv-workspace-for-providers branch 10 times, most recently from 056127e to 2e952e5 Compare September 30, 2024 20:40
@potiuk
Copy link
Member

potiuk commented Oct 1, 2024

Smalll comment (I am returning back from holidays) and slowly going through the two week's history.

In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remvoe a lot of custom code.

I tried various things before (including uv caching and local cache mounts) and I could not find better solution. With all the other approaches I tried, our approach is way more efficient for size and time and especially shines because with the current approach you have a single remote cache layer in ghcr.io that is used even if dependencies change, and you do not have cache duplication.

There are two problems with using uv cache:

  1. Enabling UV cache (I tested it having similar hope when UV was released) makes our image 4 GB instead of 2GB (can't remember exactly but it was almost 2x factor). also it does not work with docker layer caching. It's very time-efficient to keep the UV cache but yoy pay a huge price of used space. Note that UV will cache not only the actual packages that it downloads but also the packages it considers as candidates - which in case of significant backtracking might mean that you keep multiple versions of the same package in the cache - this adds up very quickly and uv cache is really huge.

UV cache benefits from the fact that the same cache is potentially reused by multiple projects you have installed on the same machine. The UV cache is stored in HOME dir and reused between all venvs used in the same machine - so in docker it's nearly useless because it multiplies the space used for both - installed and cached files but cached files are not reusable in any way for other venvs (because we have only one venv in docker container). And since layers in docker image cannot be removed, even if we delete the cache after final installation step - the cache will remain in the docker image layer additionally to installed packages - increasing significantly size of the final image.

Of course - you could potentially use locally mounted cache volumes in docker to keep the UV cache on, in order to not pollute docke image, but that would only be useful after you build your image locally once and you won't be able to use remotely cached layers which are currently used when you build the image.

Using remote cache in the way we do it now, is much nicer, because the cache is actually the same as most installed packages (with the exception of the new/ updated packages change after the original cached layer is created. This has the nice property that every time new patchlevel base image is released by python (every two-three weeks or so) the cache is rebuit from scratch, and we get a new "fresh" main cache with latest package versions.

  1. Another problem is that you will not avoid the problem that you need to have all the "source" dependency files copied to the image layer in order to run uv pip install command.

Whatever that command will produce as cache will be invalidated the moment the source dependency files (pyproject.toml, hatch_build.py, provider.yaml files, generated/provider_dependencies.json) - so if you want to effectively use cache, you need to build it BEFORE you copy any of those files into the new image. This is precisely what "our" caching is doing - it does not use any "local file" and does not copy them to the image, but uses "https://github" URL to install the dependencies for the first time - which means that next time that layer will not be invalidated when pyproject.toml or hatch_build.py or any other files that decide about dependencies will change.

Now - I'd love to make it simpler, but so far all the attempts to find a better solution failed because the one we have is the only one that has good "caching" properties.

@potiuk
Copy link
Member

potiuk commented Oct 1, 2024

But maybe you can come up with a better approach that I have not thought about and tried already, of course :)

@ashb
Copy link
Member Author

ashb commented Oct 1, 2024

Yeah, that comment was mostly a note to myself, and not something I plan on even looking at as part of this work.

@ashb ashb force-pushed the uv-workspace-for-providers branch 2 times, most recently from ecf9f37 to 7cb016b Compare October 1, 2024 16:13
@ashb ashb force-pushed the uv-workspace-for-providers branch 3 times, most recently from d7e5864 to ddc7402 Compare October 1, 2024 21:41
@ashb ashb force-pushed the uv-workspace-for-providers branch from ddc7402 to 9cd5774 Compare October 2, 2024 14:37
@ashb
Copy link
Member Author

ashb commented Oct 2, 2024

I think I'm getting closer now 🤞🏻 .

Oddly enough a lot of the time the er diagram static check is failing in CI , even though I've run breeze static-checks -a update-er-diagram and it says pass.

Edit: this turned out to be a (purposefully) untracked file in the migrations folder in my local checkout.

@ashb ashb force-pushed the uv-workspace-for-providers branch 2 times, most recently from 0956d95 to 36ab177 Compare October 8, 2024 10:58
potiuk added a commit that referenced this pull request Oct 14, 2024
Follow-up after #42505 fixing teething problem with tests_common.

Originally in #42505 common test code was moved to "dev" folder, but
the "dev" folder is really dedicated to "build" scripts and the
problem with moving "tests_common" to the folder was that the
whole "dev" folder is replaced (for non-committer PRs) with
the content from the target branch.

This is done for security reasons, because we can accidentally
use any of the scripts from dev in the CI build scripts and we
might not notice, which will open us to a security issue where
a file in "dev" coming from PR could be accidentally executed
during the "pull_request_target" workflow - which would expose
our secrets and GitHub Package write permissions to a
contributor coming from a fork.

This change moves the files, fixes pre-commit specification and
docs, also fixes a number of "doc" issues detected by "ruff" in
the tests_common folder as they were detected after the move.
The tests_common folder is added to folders mounted when
breeze is executed with local folders mounted (in order to
avoid accidental mounting of randomly generated files to
inside the breeze container).

All imports for the common tests were updated to reflect this
move.
pavansharma36 pushed a commit to pavansharma36/airflow that referenced this pull request Oct 14, 2024
This is a no-op change right now, but as part of the provider re-org in apache#42505
this sets us up to be able to load the providers code in the tests

The reason this change is done separately is that changes to breeze code form
forks doesn't take effect, and this small change makes it easier to land on
main without having to re-create that large PR.
pavansharma36 pushed a commit to pavansharma36/airflow that referenced this pull request Oct 14, 2024
…roject (apache#42505)

This is only a partial split so far. It moves all the code and tests, but
leaves the creation of `core/` to a separate PR as this is already large
enough.

In addition to the straight file rename the other changes I had to make here
are:

- Some mypy/typing fixes.

  Mypy can be fragile about what it picks up when, so maybe some of those
  changes were caused by that. But the typing changes aren't large.

- Improve typing in common.sql type stub

  Again, likely a mypy file oddity, but the types should be safe

- Removed the `check-providers-init-file-missing` check

  This isn't needed now that airflow/providers shouldn't exist at all in the
  main tree.

- Create a "dev.tests_common" package that contains helper files and common
  pytest fixtures

  Since the provider tests are no longer under tests/ they don't automatically
  share the fixtures from the parent `tests/conftest.py` so they needed
  extracted.

  Ditto for `tests.test_utils` -- they can't be easily imported in provider
  tests anymore, so they are moved to a more explicit shared location.

In future we should switch how the CI image is built to make better use of UV
caching than our own approach as that would remvoe a lot of custom code.
pavansharma36 pushed a commit to pavansharma36/airflow that referenced this pull request Oct 14, 2024
…roject (apache#42505) (apache#42624)

This is only a partial split so far. It moves all the code and tests, but
leaves the creation of `core/` to a separate PR as this is already large
enough.

In addition to the straight file rename the other changes I had to make here
are:

- Some mypy/typing fixes.

  Mypy can be fragile about what it picks up when, so maybe some of those
  changes were caused by that. But the typing changes aren't large.

- Improve typing in common.sql type stub

  Again, likely a mypy file oddity, but the types should be safe

- Removed the `check-providers-init-file-missing` check

  This isn't needed now that airflow/providers shouldn't exist at all in the
  main tree.

- Create a "dev.tests_common" package that contains helper files and common
  pytest fixtures

  Since the provider tests are no longer under tests/ they don't automatically
  share the fixtures from the parent `tests/conftest.py` so they needed
  extracted.

  Ditto for `tests.test_utils` -- they can't be easily imported in provider
  tests anymore, so they are moved to a more explicit shared location.

In future we should switch how the CI image is built to make better use of UV
caching than our own approach as that would remvoe a lot of custom code.

Co-authored-by: Ash Berlin-Taylor <ash@apache.org>
Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com>
potiuk added a commit that referenced this pull request Oct 15, 2024
Follow-up after #42505 fixing teething problem with tests_common.

Originally in #42505 common test code was moved to "dev" folder, but
the "dev" folder is really dedicated to "build" scripts and the
problem with moving "tests_common" to the folder was that the
whole "dev" folder is replaced (for non-committer PRs) with
the content from the target branch.

This is done for security reasons, because we can accidentally
use any of the scripts from dev in the CI build scripts and we
might not notice, which will open us to a security issue where
a file in "dev" coming from PR could be accidentally executed
during the "pull_request_target" workflow - which would expose
our secrets and GitHub Package write permissions to a
contributor coming from a fork.

This change moves the files, fixes pre-commit specification and
docs, also fixes a number of "doc" issues detected by "ruff" in
the tests_common folder as they were detected after the move.
The tests_common folder is added to folders mounted when
breeze is executed with local folders mounted (in order to
avoid accidental mounting of randomly generated files to
inside the breeze container).

All imports for the common tests were updated to reflect this
move.
potiuk added a commit that referenced this pull request Oct 15, 2024
* Move tests common without changes

* Fix docstrings in tests_common

* Move tests_common from "dev" to top-level.

Follow-up after #42505 fixing teething problem with tests_common.

Originally in #42505 common test code was moved to "dev" folder, but
the "dev" folder is really dedicated to "build" scripts and the
problem with moving "tests_common" to the folder was that the
whole "dev" folder is replaced (for non-committer PRs) with
the content from the target branch.

This is done for security reasons, because we can accidentally
use any of the scripts from dev in the CI build scripts and we
might not notice, which will open us to a security issue where
a file in "dev" coming from PR could be accidentally executed
during the "pull_request_target" workflow - which would expose
our secrets and GitHub Package write permissions to a
contributor coming from a fork.

This change moves the files, fixes pre-commit specification and
docs, also fixes a number of "doc" issues detected by "ruff" in
the tests_common folder as they were detected after the move.
The tests_common folder is added to folders mounted when
breeze is executed with local folders mounted (in order to
avoid accidental mounting of randomly generated files to
inside the breeze container).

All imports for the common tests were updated to reflect this
move.
R7L208 pushed a commit to R7L208/airflow that referenced this pull request Oct 17, 2024
…roject (apache#42505)

This is only a partial split so far. It moves all the code and tests, but
leaves the creation of `core/` to a separate PR as this is already large
enough.

In addition to the straight file rename the other changes I had to make here
are:

- Some mypy/typing fixes.

  Mypy can be fragile about what it picks up when, so maybe some of those
  changes were caused by that. But the typing changes aren't large.

- Improve typing in common.sql type stub

  Again, likely a mypy file oddity, but the types should be safe

- Removed the `check-providers-init-file-missing` check

  This isn't needed now that airflow/providers shouldn't exist at all in the
  main tree.

- Create a "dev.tests_common" package that contains helper files and common
  pytest fixtures

  Since the provider tests are no longer under tests/ they don't automatically
  share the fixtures from the parent `tests/conftest.py` so they needed
  extracted.

  Ditto for `tests.test_utils` -- they can't be easily imported in provider
  tests anymore, so they are moved to a more explicit shared location.

In future we should switch how the CI image is built to make better use of UV
caching than our own approach as that would remvoe a lot of custom code.
@dstandish
Copy link
Contributor

#protm

@potiuk
Copy link
Member

potiuk commented Oct 29, 2024

breeze testing tests provider/tests/x.py currently fails. (Temp workaround: breeze shell then run pytest provider/tests/x.py directly. Bug is that the testing test command includes tests in the pytest command even when another file is specified, leading to pytest loading tests/conftest.py and providers/tests/conftest.py which it doesn't like as they are both "top level")

This was never supposed to work. the parameters that you can pass to "testing tests" is really "extra pytest paramaters" you can pass - and it's not really inteded to use "pytest files" there. That's why for example the extra parameters are not autocomplete'able with files/paths. The "tests" command is really meant to run the whole "test type" and you can easily use it now.

For example running all airbyte tests can be achieved by:

breeze testing tests --test-type "Providers[airbyte]"

And it nicely works now.

@potiuk
Copy link
Member

potiuk commented Oct 29, 2024

(so I think indeed all known "teething" problems" are solved - and i can try to attempt to split out individual providers).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

7 participants