
Conversation

@potiuk (Member) commented Feb 15, 2024

…trigger run method (#37363)"

This reverts commit 0640e6d.



boring-cyborg bot added labels: area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues), provider:google (Google (including GCP) related issues) - Feb 15, 2024
@pankajastro (Member) commented
Is this because some tests are failing on main? CI was green on the PR branch. I'll investigate it.

@potiuk (Member, Author) commented Feb 15, 2024

Hey @pankajastro - I believe (and have demonstrable proof) that #37363 broke some of the behaviour of the KPO tests. The tests started to fail yesterday in main after the change was merged, and while the tests work in isolation, it seems they introduce some heavily flaky behaviour in the async KPO when the system is busy - and I think it could have an impact in production, so it's better to revert it now, find the root cause, and fix it.

I was not able to pinpoint the root cause, but I believe it's caused by this change:

https://github.com/apache/airflow/pull/37363/files#diff-9e582418b4d9f5578d2466b2e5ccd32faa3d8bb651fd7873c488469ccf599c0dR220

It seems that on a busy system, the loop finishes much faster than it should and errors out with:
Pod did not leave 'Pending' phase within specified timeout
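
For readers without the diff open: a generic sketch of the kind of wait-and-timeout loop being discussed here (this is not the actual code from #37363; the function name, defaults, and phase strings are illustrative):

```python
# Generic sketch only - NOT the actual code from #37363.
# Illustrates a loop that polls a pod's phase and raises once a startup timeout
# elapses; on a heavily loaded machine, slow polling or a stale start timestamp
# can make the deadline appear to fire much earlier than expected.
import asyncio
from datetime import datetime, timedelta, timezone


async def wait_for_pod_start(get_pod_phase, startup_timeout: int = 120, poll_interval: float = 2.0) -> str:
    deadline = datetime.now(timezone.utc) + timedelta(seconds=startup_timeout)
    while datetime.now(timezone.utc) < deadline:
        phase = await get_pod_phase()  # e.g. "Pending", "Running", "Succeeded"
        if phase != "Pending":
            return phase
        await asyncio.sleep(poll_interval)
    raise TimeoutError("Pod did not leave 'Pending' phase within specified timeout")
```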

The problem is that it does not happen even if you run the complete Providers[-amazon,-google] test type in isolation. It only happens when you run all test types in parallel.

The issue is relatively easy (but a bit time-consuming) to reproduce. On my local env, I can reproduce it rather repeatably (same as on CI) with:

breeze testing tests --run-in-parallel

The error is the same as in the last few main builds - for example this one: https://github.com/apache/airflow/actions/runs/7911809922/job/21596873757

When I revert this one, it works nicely without errors.

It's rather fishy and difficult to guess what the problem is.

@pankajastro (Member) commented
Thanks @potiuk for looking into it. I'll check it.

@potiuk (Member, Author) commented Feb 15, 2024

Yeah. It's rather strange - because generally the timeout should not happen until 120 seconds have passed, and the longest test in the whole CI run is 30s - so there is no way those failed tests took > 120 seconds. Really strange (but at least we have a recipe for reproduction).

potiuk merged commit 0be6430 into apache:main on Feb 15, 2024
potiuk deleted the revert-kpo-change branch on February 15, 2024 at 13:58
@eladkal (Contributor) commented Feb 15, 2024

This means we are not ready for RC2 of the cncf.kubernetes provider, right?

@potiuk (Member, Author) commented Feb 15, 2024

Yeah, I think we need to wait for @pankajastro's investigation :)

@pankajastro (Member) commented
For some tests, mocking of the trigger's _wait_for_pod_start method was missing. The trigger run method now calls _wait_for_pod_start, which waits for a pod to start. Not sure why it didn't fail on my PR. PR #37454 fixes it.
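
For illustration, a minimal sketch of the kind of patch that was missing - assuming the trigger lives at airflow.providers.cncf.kubernetes.triggers.pod and that _wait_for_pod_start is awaited by run(); this is not the exact change from #37454:

```python
# Sketch only - not the exact change from #37454.
# Since #37363 the trigger's run() calls _wait_for_pod_start(), so tests that
# drive run() need to patch it out or they really wait for a pod to start.
from unittest import mock

from airflow.providers.cncf.kubernetes.triggers.pod import KubernetesPodTrigger

with mock.patch.object(
    KubernetesPodTrigger,
    "_wait_for_pod_start",
    new_callable=mock.AsyncMock,    # assuming run() awaits this method
    return_value=mock.MagicMock(),  # dummy "pod" so the rest of run() can proceed
):
    ...  # drive trigger.run() here and assert on the yielded TriggerEvent
```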

@potiuk (Member, Author) commented Feb 15, 2024

Merged!

That explains it indeed. It did not fail in your PR because it would generally work without mocking (and it worked for me locally as well) when run in isolation / on a not-so-busy system. Generally, PRs only run the subset of tests that selective checks find relevant.

So in the original PR #37363 only a small subset of tests was run:

Always
Providers[amazon]
Providers[apache.beam,apache.cassandra,apache.flink,apache.spark,celery,cncf.kubernetes,common.sql,facebook,hashicorp,microsoft.azure,microsoft.mssql,mysql,openlineage,oracle,postgres,presto,sa
Providers[google]

In the new PR #37454 - I saw you added "full tests needed" and the set of tests was the same as in main:

Providers[-amazon,google]
Core
WWW
CLI
Other
Serialization
Always
PythonVenv
API
BranchExternalPython
BranchPythonVenv
ExternalPython
Operators
PlainAsserts
Providers[amazon]
Providers[google]

That's why I could quickly merge it, being more certain that the flakiness is fixed (because this time the PR had the same set of tests).

That's why the original change was "green": the set of tests run did not make the machine "busy enough" to trigger the flaky condition. In a way, our main (canary) tests are a good way to find such issues, because they run up to 8 parallel test suites at a time (out of the 16 above) in 8 complete docker-compose instances on the same docker engine (each with its own database). So we are REALLY stressing the poor AWS instance, and that uncovers all kinds of races and flakiness (which, as we saw here, is a good thing).

@potiuk (Member, Author) commented Feb 15, 2024

Thanks for the super-quick diagnosis and fix @pankajastro, BTW!

