
Conversation

@potiuk (Member) commented Feb 15, 2024

…trigger run method (#37363)"

This reverts commit 0640e6d.



boring-cyborg bot added labels: area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues), provider:google (Google (including GCP) related issues) - Feb 15, 2024
@pankajastro (Member) commented
Is this because some tests are failing on main? CI was green on the PR branch. I'll investigate it.

@potiuk (Member, Author) commented Feb 15, 2024

Hey @pankajastro - I believe (and have demonstrable proof) that #37363 broke some of the behaviour of the KPO tests. The tests started to fail yesterday in main after the change was merged, and while the tests work in isolation, it seems they introduce some heavily flaky behaviour in the async KPO when the system is busy - and I think it could have an impact in production, so it's better to revert it now, find the root cause, and fix it.

I was not able to pinpoint the root cause, but I believe it's caused by this change:

https://github.com/apache/airflow/pull/37363/files#diff-9e582418b4d9f5578d2466b2e5ccd32faa3d8bb651fd7873c488469ccf599c0dR220

It seems that on a busy system, the loop finishes much faster than it should and errors out with:
Pod did not leave 'Pending' phase within specified timeout
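
For readers without the diff open: a generic sketch of the kind of wait-and-timeout loop being discussed here (this is not the actual code from #37363; the function name, defaults, and phase strings are illustrative):

```python
# Generic sketch only - NOT the actual code from #37363.
# Illustrates a loop that polls a pod's phase and raises once a startup timeout
# elapses; on a heavily loaded machine, slow polling or a stale start timestamp
# can make the deadline appear to fire much earlier than expected.
import asyncio
from datetime import datetime, timedelta, timezone


async def wait_for_pod_start(get_pod_phase, startup_timeout: int = 120, poll_interval: float = 2.0) -> str:
    deadline = datetime.now(timezone.utc) + timedelta(seconds=startup_timeout)
    while datetime.now(timezone.utc) < deadline:
        phase = await get_pod_phase()  # e.g. "Pending", "Running", "Succeeded"
        if phase != "Pending":
            return phase
        await asyncio.sleep(poll_interval)
    raise TimeoutError("Pod did not leave 'Pending' phase within specified timeout")
```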

The problem is that it does not happen even if you run the complete Providers[-amazon,-google] test type in isolation. It only happens when you run all test types in parallel.

The issue is relatively easy (but a bit time-consuming) to reproduce. On my local env, I can reproduce it rather repeatably (same as on CI) with:

breeze testing tests --run-in-parallel

The error is the same as in the last few main builds - for example this one: https://github.com/apache/airflow/actions/runs/7911809922/job/21596873757

When I revert this one, it works nicely without errors.

It's rather fishy and difficult to guess what the problem is.

@pankajastro (Member) commented
Thanks @potiuk for looking into it. I'll check it.

@potiuk (Member, Author) commented Feb 15, 2024

Yeah. It's rather strange - because generally the timeout should not happen until 120 seconds have passed, and the longest test in the whole CI run is 30s - so there is no way those failed tests took > 120 seconds. Really strange (but at least we have a recipe for reproduction).

potiuk merged commit 0be6430 into apache:main on Feb 15, 2024
potiuk deleted the revert-kpo-change branch on February 15, 2024 at 13:58
@eladkal (Contributor) commented Feb 15, 2024

This means we are not ready for RC2 of the cncf.kubernetes provider, right?

@potiuk (Member, Author) commented Feb 15, 2024

Yeah, I think we need to wait for @pankajastro's investigation :)

@pankajastro (Member) commented
For some tests, mocking of the trigger's _wait_for_pod_start method was missing. The trigger run method now calls _wait_for_pod_start, which waits for a pod to start. Not sure why it didn't fail on my PR. PR #37454 fixes it.
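
For illustration, a minimal sketch of the kind of patch that was missing - assuming the trigger lives at airflow.providers.cncf.kubernetes.triggers.pod and that _wait_for_pod_start is awaited by run(); this is not the exact change from #37454:

```python
# Sketch only - not the exact change from #37454.
# Since #37363 the trigger's run() calls _wait_for_pod_start(), so tests that
# drive run() need to patch it out or they really wait for a pod to start.
from unittest import mock

from airflow.providers.cncf.kubernetes.triggers.pod import KubernetesPodTrigger

with mock.patch.object(
    KubernetesPodTrigger,
    "_wait_for_pod_start",
    new_callable=mock.AsyncMock,    # assuming run() awaits this method
    return_value=mock.MagicMock(),  # dummy "pod" so the rest of run() can proceed
):
    ...  # drive trigger.run() here and assert on the yielded TriggerEvent
```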

@potiuk (Member, Author) commented Feb 15, 2024

Merged!

That explains it indeed. It did not fail in your PR because it would generally work without mocking (and it worked for me locally as well) when run in isolation / on a not-so-busy system. Generally, PRs only run the subset of tests that selective checks find relevant.

So in the original PR #37363 only a small subset of tests was run:

Always
Providers[amazon]
Providers[apache.beam,apache.cassandra,apache.flink,apache.spark,celery,cncf.kubernetes,common.sql,facebook,hashicorp,microsoft.azure,microsoft.mssql,mysql,openlineage,oracle,postgres,presto,sa
Providers[google]

In the new PR #37454 - I saw you added "full tests needed" and the set of tests was the same as in main:

Providers[-amazon,google]
Core
WWW
CLI
Other
Serialization
Always
PythonVenv
API
BranchExternalPython
BranchPythonVenv
ExternalPython
Operators
PlainAsserts
Providers[amazon]
Providers[google]

That's why I could quickly merge it, being more certain that the flakiness is fixed (because this time the PR had the same set of tests).

That's why the original change was "green": the set of tests run did not make the machine "busy enough" to trigger the flaky condition. In a way, our main (canary) tests are a good way to find such issues, because they run up to 8 parallel test suites at a time (out of the 16 above) in 8 complete docker-compose instances on the same docker engine (each with its own database). So we are REALLY stressing the poor AWS instance, and that uncovers all kinds of races and flakiness (which, as we saw here, is a good thing).

@potiuk (Member, Author) commented Feb 15, 2024

Thanks for the super-quick diagnosis and fix @pankajastro, BTW!

