Skip to content

feat: inbound only graceful draining#38430

Merged
yanavlasov merged 12 commits intoenvoyproxy:mainfrom
keithmattix:feature/inbound-only-graceful-draining
Mar 4, 2025
Merged

feat: inbound only graceful draining#38430
yanavlasov merged 12 commits intoenvoyproxy:mainfrom
keithmattix:feature/inbound-only-graceful-draining

Conversation

@keithmattix
Copy link
Copy Markdown
Contributor

@keithmattix keithmattix commented Feb 12, 2025

Trying #37873 again to get to the bottom of the flakes that caused us to need to revert. This got a lot more complex after #38333, so I may need @yanavlasov to help me out with the proper semantics

Commit Message:

Fixes https://github.com/envoyproxy/envoy/issues/35020. The inbound_only query param for the drain_listeners admin
endpoint doesn't work with the graceful query parameter. As a result,
outbound listeners will send connection: close headers to upstreams
which is undesired. This PR adds the ability for drain_manager to drain
in a single direction.

Additional Description:
Risk Level: Medium
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix keithmattix changed the title Directional Draining feat: inbound only graceful draining Feb 12, 2025
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

Can someone from @envoyproxy/envoy-mobile-triage help me with these Android test failures? The smoking gun appears to be errors like these:

java.lang.UnsatisfiedLinkError: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: undefined symbol: __atomic_load

I tried to add a link command to the jni bazel config but no dice. I've never touched Envoy native development, so I truly have no idea what I'm doing

@abeyad
Copy link
Copy Markdown
Contributor

abeyad commented Feb 13, 2025

java.lang.UnsatisfiedLinkError: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: undefined symbol: __atomic_load

You added the -latomic to libenvoy_jni.so, not to libenvoy_jni_with_test_extensions.so. Can you try adding it to

name = "libenvoy_jni_with_test_extensions.so",
? That said, it's unclear why you would get an std::atomic linker error with your change, as we've had std::atomic in Envoy code prior and no linker issues on Android.

@fredyw
Copy link
Copy Markdown
Member

fredyw commented Feb 13, 2025

java.lang.UnsatisfiedLinkError: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: undefined symbol: __atomic_load

You added the -latomic to libenvoy_jni.so, not to libenvoy_jni_with_test_extensions.so. Can you try adding it to

name = "libenvoy_jni_with_test_extensions.so",

? That said, it's unclear why you would get an std::atomic linker error with your change, as we've had std::atomic in Envoy code prior and no linker issues on Android.

That's what I thought, too but when I tested it in 40faf7e, it didn't help, though: https://github.com/envoyproxy/envoy/actions/runs/13312980039/job/37180020453

I'm also not able to reproduce this locally.

@fredyw
Copy link
Copy Markdown
Member

fredyw commented Feb 13, 2025

java.lang.UnsatisfiedLinkError: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: /mnt/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/test/java/io/envoyproxy/envoymobile/engine/testing/quic_test_server_test.runfiles/envoy_mobile/test/jni/libenvoy_jni_with_test_extensions.so: undefined symbol: __atomic_load

You added the -latomic to libenvoy_jni.so, not to libenvoy_jni_with_test_extensions.so. Can you try adding it to

name = "libenvoy_jni_with_test_extensions.so",

? That said, it's unclear why you would get an std::atomic linker error with your change, as we've had std::atomic in Envoy code prior and no linker issues on Android.

That's what I thought, too but when I tested it in 40faf7e, it didn't help, though: https://github.com/envoyproxy/envoy/actions/runs/13312980039/job/37180020453

I'm also not able to reproduce this locally.

Oh looks like it did help. I just need to link it with -latomic in a different shared library. Let me test it again.

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

Android is now passing after integrating @fredyw's changes; thanks for the assist!

This is now ready for review

@keithmattix
Copy link
Copy Markdown
Contributor Author

/retest

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Copy link
Copy Markdown
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which test was flaky?

/wait

# leave room for compiler/linker.
# The number 3G is chosen heuristically to both support large VM and small VM with RBE.
# Startup options cannot be selected via config.
# TODO: Adding just to test android
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from testing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes android doesn't run in PRs by default; I wanted to get an approval on the code first before removing so there aren't surprises on main post merge

std::atomic<DrainPair> draining_{DrainPair{false, Network::DrainDirection::None}};
// A map of timers keyed by the direction that triggered the drain
std::map<Network::DrainDirection, Event::TimerPtr> drain_tick_timers_;
std::map<Network::DrainDirection, MonotonicTime> drain_deadlines_ = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a struct would be a better choice? Are the timer keys dynamic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we only expect a max of two keys (All and InboundOnly) at any time (and we have an assertion in the code to check).

@keithmattix
Copy link
Copy Markdown
Contributor Author

keithmattix commented Feb 14, 2025

Which test was flaky?

/wait

AdminDrainInboundOnlyIdempotent (or something similar). I believe I addressed it by adding the stopped_listener flatHashSet check into a callback function that runs on the main thread which should prevent the race conditions that led to the flaky test. Here's a failed run: https://github.com/envoyproxy/envoy/actions/runs/13172376287/job/36764947882

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

@yanavlasov do you have any additional feedback?

Copy link
Copy Markdown
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

/retest

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

/retest

@yanavlasov yanavlasov merged commit c53f917 into envoyproxy:main Mar 4, 2025
25 checks passed
alam0rt pushed a commit to alam0rt/envoy that referenced this pull request Mar 14, 2025
Trying envoyproxy#37873 again to get to the bottom of the flakes that caused us to
need to revert. This got a lot more complex after envoyproxy#38333, so I may need
@yanavlasov to help me out with the proper semantics

Commit Message:
```
Fixes envoyproxy#35020. The inbound_only query param for the drain_listeners admin
endpoint doesn't work with the graceful query parameter. As a result,
outbound listeners will send connection: close headers to upstreams
which is undesired. This PR adds the ability for drain_manager to drain
in a single direction.
```
Additional Description:
Risk Level: Medium
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional [API
Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):]

---------

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
jewertow pushed a commit to jewertow/envoy that referenced this pull request Apr 2, 2025
Trying envoyproxy#37873 again to get to the bottom of the flakes that caused us to
need to revert. This got a lot more complex after envoyproxy#38333, so I may need
@yanavlasov to help me out with the proper semantics

Commit Message: 
```
Fixes envoyproxy#35020. The inbound_only query param for the drain_listeners admin
endpoint doesn't work with the graceful query parameter. As a result,
outbound listeners will send connection: close headers to upstreams
which is undesired. This PR adds the ability for drain_manager to drain
in a single direction.
```
Additional Description:
Risk Level: Medium
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional [API
Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):]

---------

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
agrawroh pushed a commit to agrawroh/envoy that referenced this pull request Apr 9, 2025
Trying envoyproxy#37873 again to get to the bottom of the flakes that caused us to
need to revert. This got a lot more complex after envoyproxy#38333, so I may need
@yanavlasov to help me out with the proper semantics

Commit Message: 
```
Fixes envoyproxy#35020. The inbound_only query param for the drain_listeners admin
endpoint doesn't work with the graceful query parameter. As a result,
outbound listeners will send connection: close headers to upstreams
which is undesired. This PR adds the ability for drain_manager to drain
in a single direction.
```
Additional Description:
Risk Level: Medium
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional [API
Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):]

---------

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants