Skip to content

Conversation

@bbockelm
Copy link
Collaborator

This implements a simple HTTP-based stress test that, on my laptop, fails 100% of the time. All it does is try an increasing number of concurrent HTTP uploads.

By "stress tests", it tends to fail at 10 concurrent transfers provided this line is set:

    if (idx != 2) curl_easy_setopt(handle.get(), CURLOPT_FORBID_REUSE, 1L);

If, instead, that line is:

    curl_easy_setopt(handle.get(), CURLOPT_FORBID_REUSE, 1L);

then I've never gotten the stress test to fail, even with more than 100 concurrent transfers. Note -- CURLOPT_FORBID_REUSE, when set, causes a new TCP socket to be created for each transfer. By turning it off, there's multiple requests on a single TCP socket.

I believe this is due to the fact that HTTP is such a simple protocol the poller only fires once per socket unless there is socket reuse -- at which point we trigger the race condition mentioned in #2212.

Hopefully the GitHub runner will trigger the failure just as reliably as my laptop!

@bbockelm bbockelm requested a review from amadio January 16, 2025 23:27
@amadio
Copy link
Member

amadio commented Jan 17, 2025

This is quite well timed, thanks! We just received a new mac mini that will be used for EOS/XRootD development and CI. I will try this there as well.

@bbockelm bbockelm changed the base branch from devel to master January 17, 2025 13:15
@amadio amadio force-pushed the macos_stress_test branch from 284311e to 5a87b83 Compare January 27, 2025 13:06
@amadio
Copy link
Member

amadio commented Jan 27, 2025

This pull requests has warnings in CDash, and a test failure on Alpine Linux. @bbockelm Could you please investigate? Thanks.

@bbockelm bbockelm marked this pull request as draft January 27, 2025 18:17
@bbockelm
Copy link
Collaborator Author

Happy to do that! I was a bit confused by the Mac mini comment and whether you were going to pick things up.

Anyhow, now that master is otherwise clean, it's a lot easier to figure out what needs to be tweaked.

@amadio
Copy link
Member

amadio commented Jan 27, 2025

Yes, I do intend to test on the new Mac mini, but the failures in CI are on Alpine and Ubuntu. Submitted changes should pass the CI before we do further testing on them.

@bbockelm
Copy link
Collaborator Author

bbockelm commented Mar 3, 2025

@amadio - tinkered with this a bit more yesterday. Some observations:

  • The comment I added about only one thread writing to the table is a lie. Only one thread modifies the entries but any thread creates new entries (including filling in holes). That's why I originally took the approach of making a copy and didn't swap: you have to synchronize/copy between the two somehow.
  • The failures we're seeing appear to be real -- but I've not been able to reproduce them enough to understand what they are. I tried modifying the PR to move all writes to the table to be from a single thread but haven't gotten it working yet.
  • I find it really curious that alpine fails. Particularly, the relevant code changes are only accessible if defined( __linux__ ) evaluates to false which would be quite surprising on the Alpine platform.

@bbockelm bbockelm force-pushed the macos_stress_test branch from 1fff941 to 59bb5e7 Compare March 3, 2025 16:03
@amadio
Copy link
Member

amadio commented Mar 3, 2025

Alpine uses MUSL instead of glibc. The failure is likely related to that.

@bbockelm
Copy link
Collaborator Author

bbockelm commented Mar 3, 2025

Ah -- actually, I think it was me misreading the test logs from the prior run. On closer inspection, there's something basic missing on my added new unit test which seems quite fixable.

@bbockelm
Copy link
Collaborator Author

@amadio - can you re-run the Mac OS X job a handful of times? I lack the permission to re-run tests in GitHub. I want to make sure that the tests are reliably passing and this isn't a fluke where I got lucky.

The larger refactor -- rewriting the logic so only one thread writes to the table at once (which would obviate needing making a copy of the table) -- was unsuccessful, unfortunately.

@amadio amadio force-pushed the macos_stress_test branch from 450c8ae to d52d48c Compare March 14, 2025 16:54
If we want to programmatically access the HTTP port (instead of
hardcoding it in the configuration), then it will reduce code
duplication to simply use the same port.
This exposes a small gtest-driven stress test that will probe
ever-increasing number of concurrent uploads, looking for a point
where the server may stop responding.
Previously, MacOS failed the stress test because the poller was not
thread safe.  Various threads manipulated the poll table and read from
it without a lock consistently held.

This now ensures the PollMutex lock is held whenever we read or write
from the PollTab.
@bbockelm bbockelm force-pushed the macos_stress_test branch from d52d48c to 1cf9f55 Compare March 15, 2025 15:52
@bbockelm
Copy link
Collaborator Author

@amadio - ok, tests have passed several times in a row so I think I'm happy.

@bbockelm bbockelm marked this pull request as ready for review March 15, 2025 16:29
@amadio amadio linked an issue Mar 17, 2025 that may be closed by this pull request
@amadio amadio added this to the 5.8.0 milestone Mar 17, 2025
@amadio amadio merged commit b3e0091 into xrootd:master Mar 17, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

XRootD drops poller events on Mac OS X

2 participants