Skip to content

Conversation

@dynamic-entropy
Copy link
Collaborator

@dynamic-entropy dynamic-entropy commented Feb 25, 2025

Todos:

  • Add TPC tests for HTTP
  • Merge with tests.sh and refactor

This draft PR only contains a full mesh TPC test only via the xroot protocol among the 4 nodes of the xrootd cluster.
I plan to add HTTP tests soon and refactor the code along with tests.sh.

Feel free to provide any initial reviews or suggestions.

@bbockelm
Copy link
Collaborator

Notes from Zoom chat with @dynamic-entropy:

  • Should cover HTTPS as this is the most common way things are used. Would have slight preference doing this via curl command line so we can more tightly control the headers utilized.
  • For realism, we should utilize the actual authentication methods instead of leaving auth wide open. This means leveraging the SciTokens fixture to make tokens and host certificates available. Luckily, since that fixture already exists, pulling this configuration in should require minimal additional work. @amadio -- if we are pulling in authentication, is it better to improve the cluster test or create a new fixture for the configuration?
  • Also for realism, we should have the test leverage TLS (e.g., HTTPS) instead of an unauthenticated protocol.
  • Currently, there's no "negative test". We should cover the case where we try to transfer a source file that is not found and a file that exists but where there's no permission to read.

@bbockelm
Copy link
Collaborator

One more addendum:

  • Please run the script through the shellcheck command to help identify common bash pitfalls.

@dynamic-entropy
Copy link
Collaborator Author

I need to either do some CMake magic or create two nodes with the same cluster or an entirely different setup which only runs these tests on platforms that have SciTokens available.

We cannot do HTTP TPC without SSL enabled, and I am relying on the SciTokens fixture to create the necessary certs and CAs.

@amadio
Copy link
Member

amadio commented Mar 3, 2025

Please rebase on master again. I just pushed a small fix to get rid of PollerTest.FunctionTest failures. Cheers,

@dynamic-entropy dynamic-entropy force-pushed the tpc-tests branch 3 times, most recently from 144ad8f to 80bd4f3 Compare March 5, 2025 13:55
@dynamic-entropy dynamic-entropy force-pushed the tpc-tests branch 4 times, most recently from d2100fb to 97eea5e Compare March 13, 2025 06:23
@dynamic-entropy dynamic-entropy marked this pull request as ready for review March 13, 2025 06:23
@amadio
Copy link
Member

amadio commented Mar 14, 2025

@dynamic-entropy Please check that at each commit point, things still work. On the first commit, I get this:

-- Found Python: /usr/bin/python3 (found version "3.12.9") found components: Interpreter Development.Module
CMake Error: File /home/amadio/src/xrootd/tests/TPCTests/srv1.cfg does not exist.
CMake Error at tests/TPCTests/CMakeLists.txt:12 (configure_file):
  configure_file Problem configuring file


CMake Error: File /home/amadio/src/xrootd/tests/TPCTests/srv2.cfg does not exist.
CMake Error at tests/TPCTests/CMakeLists.txt:12 (configure_file):
  configure_file Problem configuring file


CMake Error: File /home/amadio/src/xrootd/tests/TPCTests/common.cfg does not exist.
CMake Error at tests/TPCTests/CMakeLists.txt:12 (configure_file):
  configure_file Problem configuring file
...
-- Configuring incomplete, errors occurred!
Command exited with the value: 1
Error(s) when configuring the project
CMake Error at /home/amadio/src/xrootd/test.cmake:213 (message):
  Configuration failed

This means we cannot git-bisect through your changes. You can simply merge some of the commits to fix this, I think (or we can merge with squash, although that's suboptimal).

@dynamic-entropy
Copy link
Collaborator Author

Hello @amadio
Thanks for taking a look. I have edited some and squashed relevant commits.
Cheers

@amadio
Copy link
Member

amadio commented Mar 14, 2025

Ok, thanks. I will try again.

@amadio
Copy link
Member

amadio commented Mar 16, 2025

It seems that, when run alone, the test may pass, but when I run it along with the other tests, it always fails as shown below:

 58/131 Test #107: XRootD::tpc::test .........................................................***Failed    0.01 sec
++ uname
+ [[ Linux == \D\a\r\w\i\n ]]
+ : /home/amadio/src/xrootd/build/src/xrdadler32
+ : /home/amadio/src/xrootd/build/src/xrdcrc32c
+ : /home/amadio/src/xrootd/build/src/XrdCl/xrdcp
+ : /home/amadio/src/xrootd/build/src/XrdCl/xrdfs
++ command -v openssl
+ : /usr/bin/openssl
++ command -v curl
+ : /usr/bin/curl
+ check_commands /home/amadio/src/xrootd/build/src/xrdadler32 /home/amadio/src/xrootd/build/src/xrdcrc32c /home/amadio/src/xrootd/build/src/XrdCl/xrdcp /home/amadio/src/xrootd/build/src/XrdCl/xrdfs /usr/bin/openssl /usr/bin/curl
+ missing=()
+ local missing
+ for PROG in "$@"
+ [[ ! -x /home/amadio/src/xrootd/build/src/xrdadler32 ]]
+ for PROG in "$@"
+ [[ ! -x /home/amadio/src/xrootd/build/src/xrdcrc32c ]]
+ for PROG in "$@"
+ [[ ! -x /home/amadio/src/xrootd/build/src/XrdCl/xrdcp ]]
+ for PROG in "$@"
+ [[ ! -x /home/amadio/src/xrootd/build/src/XrdCl/xrdfs ]]
+ for PROG in "$@"
+ [[ ! -x /usr/bin/openssl ]]
+ for PROG in "$@"
+ [[ ! -x /usr/bin/curl ]]
+ [[ 0 -gt 0 ]]
+ hosts=(['srv1']='root://localhost:10951' ['srv2']='root://localhost:10952')
+ declare -A hosts
+ hosts_http=(['srv1']='https://localhost:10951' ['srv2']='https://localhost:10952')
+ declare -A hosts_http
+ trap cleanup ERR
+ RMTDATADIR=/srvdata/tpc
+ LCLDATADIR=/home/amadio/src/xrootd/build/tests/TPCTests/localdata/tpc
+ mkdir -p /home/amadio/src/xrootd/build/tests/TPCTests/localdata/tpc
+ mkdir -p /home/amadio/src/xrootd/build/tests/TPCTests/generated_tokens
+ mkdir -p /home/amadio/src/xrootd/build/tests/TPCTests/.cache/scitokens
+ rm -rf '/home/amadio/src/xrootd/build/tests/TPCTests/.cache/scitokens/*'
+ setup_scitokens
+ /home/amadio/src/xrootd/build/tests/scitokens/xrdscitokens-create-token /home/amadio/src/xrootd/build/tests/issuer/issuer_pub_1.pem /home/amadio/src/xrootd/build/tests/issuer/issuer_key_1.pem test_1 https://localhost:7095/issuer/one 'storage.modify:/ storage.create:/ storage.read:/'
Failed to open /home/amadio/src/xrootd/build/tests/issuer/issuer_pub_1.pem: No such file or directory
+ echo 'Failed to create token'
Failed to create token
+ exit 1

@amadio amadio self-requested a review March 16, 2025 16:22
Copy link
Member

@amadio amadio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, this is mostly done, just need to add the extra required fixture to the list and it should be good to merge. Cheers,

@dynamic-entropy
Copy link
Collaborator Author

It seems that, when run alone, the test may pass, but when I run it along with the other tests, it always fails as shown below:

It seems that the tests take more than a minute to run everything that requires a token.
Should I add a commit to increase the lifetime of the token?

@dynamic-entropy dynamic-entropy requested a review from amadio March 16, 2025 16:47
@dynamic-entropy
Copy link
Collaborator Author

@amadio I have added a commit to request a token with an extended (2 min) lifetime for the TPC tests.
Cheers

@dynamic-entropy
Copy link
Collaborator Author

dynamic-entropy commented Mar 17, 2025

I have changed the timeout to 30 minutes
.

@amadio
Copy link
Member

amadio commented Mar 17, 2025

The test now fails with a checksum mismatch: https://my.cdash.org/tests/260410204

@dynamic-entropy
Copy link
Collaborator Author

Okay, I can reproduce this on Ubuntu.
This does not look like a shortcoming of the test though.

The transfers seem to stall in Ubuntu and fail with "Internal server error".
I will give the server logs a look again in the afternoon. For now, I failed to pinpoint the problem.

@amadio
Copy link
Member

amadio commented Mar 18, 2025

Yes, it looks like the test has caught a real issue. The test itself is not the problem. However, we should not merge a test that fails, we should fix the issue it caught along with the addition of the test.

@bbockelm
Copy link
Collaborator

@amadio - do you think the prior failures could be related to #2464?

That is, if there's a problem with the IPv6 code for HTTP redirects as reported b @ellert, then it might be what causes the periodic test failure (and, depending on the runner you land on for retries, may explain why the failures are not consistent -- latest rebase passes just fine). So, it's possible it's not really the scitokens/TPC-related pieces but the underlying HTTP tests for multi-server setups.

Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dynamic-entropy and I went over the tests in Zoom. All my initial suggestions have been addressed and this is starting to look quite nice from my point of view.

@amadio
Copy link
Member

amadio commented Apr 1, 2025

@bbockelm Indeed, the test is looking good. However, before merging, we need to understand and fix what is causing it to fail. I'd like to avoid merging tests that fail into master/devel, because then other pull requests will have failures as well. It could be that the reasons are the same as #2464. If yes, then we will need to adapt what runs to the network environment we have (or document that a working ipv4/ipv6 environment is a requirement to run the test suite).

@amadio
Copy link
Member

amadio commented Apr 1, 2025

There is an actual failure to transfer, but the return code is still 201, which doesn't make sense to me (from https://my.cdash.org/tests/260410204):

+++ /usr/bin/curl -X COPY -L -s -o /dev/fd/63 -w '%{http_code}' -H 'Destination: https://localhost:10951//srvdata/tpc/srv2_to_srv1.ref_http_push' -H 'Authorization: Bearer eyJhbGciOiJFUzI1NiIsImtpZCI6InRlc3RfMSJ9.eyJhdWQiOiJodHRwczpcL1wvd2xjZy5jZXJuLmNoXC9qd3RcL3YxXC9hbnkiLCJleHAiOjE3NDIyNDExMzYsImlhdCI6MTc0MjIzOTMzNiwiaXNzIjoiaHR0cHM6XC9cL2xvY2FsaG9zdDo3MDk1XC9pc3N1ZXJcL29uZSIsImp0aSI6IjRjY2Q0MGI5LTI3YzAtNDU2NS1iMTY2LTk1MTUzMzQ5OTNlNSIsIm5iZiI6MTc0MjIzOTMzNiwic2NvcGUiOiJzdG9yYWdlLm1vZGlmeTpcLyBzdG9yYWdlLmNyZWF0ZTpcLyBzdG9yYWdlLnJlYWQ6XC8iLCJzdWIiOiJ0ZXN0Iiwid2xjZy52ZXIiOiIxLjAifQ.OUmqXthbJlkDlxW5m98oDp3ale79EoMNjsKhg29fdPyDoAhLEonuIqCTO4VOORyeduxBHoYC2Af2BwoaMtLX_g' -H 'TransferHeaderAuthorization: Bearer eyJhbGciOiJFUzI1NiIsImtpZCI6InRlc3RfMSJ9.eyJhdWQiOiJodHRwczpcL1wvd2xjZy5jZXJuLmNoXC9qd3RcL3YxXC9hbnkiLCJleHAiOjE3NDIyNDExMzYsImlhdCI6MTc0MjIzOTMzNiwiaXNzIjoiaHR0cHM6XC9cL2xvY2FsaG9zdDo3MDk1XC9pc3N1ZXJcL29uZSIsImp0aSI6IjRjY2Q0MGI5LTI3YzAtNDU2NS1iMTY2LTk1MTUzMzQ5OTNlNSIsIm5iZiI6MTc0MjIzOTMzNiwic2NvcGUiOiJzdG9yYWdlLm1vZGlmeTpcLyBzdG9yYWdlLmNyZWF0ZTpcLyBzdG9yYWdlLnJlYWQ6XC8iLCJzdWIiOiJ0ZXN0Iiwid2xjZy52ZXIiOiIxLjAifQ.OUmqXthbJlkDlxW5m98oDp3ale79EoMNjsKhg29fdPyDoAhLEonuIqCTO4VOORyeduxBHoYC2Af2BwoaMtLX_g' --cacert /home/runner/work/xrootd/xrootd/build/tests/issuer/tlsca.pem https://localhost:10952//srvdata/tpc/srv2.ref
...
Perf Marker
Timestamp: 1742239457
Stripe Index: 0
Stripe Bytes Transferred: 0
Total Stripe Count: 1
RemoteConnections: tcp:[::1]:10951
End
failure: Internal transfer failure, local=/srvdata/tpc/srv2.ref, remote=https://localhost:10951//srvdata/tpc/srv2_to_srv1.ref_http_push, HTTP library failure=Timeout was reached++ http_code=201
++ echo 201
++ return 0
+ assert_eq 201 201 'HTTP TPC push failed'
+ [[ 201 == \2\0\1 ]]

@amadio
Copy link
Member

amadio commented Apr 1, 2025

The error comes from this place in the code:

} else if (res != CURLE_OK) {
std::stringstream ss2;
ss2 << "Internal transfer failure";
std::stringstream ss3;
ss3 << ss2.str() << ": " << curl_easy_strerror(res);
logTransferEvent(LogMask::Error, rec, "TRANSFER_FAIL", ss3.str());
ss << generateClientErr(ss2, rec, res);

@amadio
Copy link
Member

amadio commented Apr 10, 2025

Looks like the current changes are still not enough to fix the issue, the test still failed on Alma 10.

@dynamic-entropy dynamic-entropy force-pushed the tpc-tests branch 2 times, most recently from 8bacd94 to 0c7c455 Compare April 10, 2025 09:43
@dynamic-entropy
Copy link
Collaborator Author

Hello @amadio

Brian just fixed the content length check. Could you please trigger the jobs to run? Seems like GitHub did not run them on my recent push.

@amadio
Copy link
Member

amadio commented Apr 10, 2025

@dynamic-entropy There are some spurious changes in the first commit. Could you please rebase on master again (now that I merged the other PR), and push here so that I can merge also this one? Thanks!

@amadio amadio added this to the 5.8.1 milestone Apr 10, 2025
 - Add configuration files for TPCTests server instances
 - Add setup and test scripts

Signed-off-by: Rahul Chauhan <rahul.chauhan@cern.ch>
-  Add scitokens auth to HTTP TPC

Signed-off-by: Rahul Chauhan <rahul.chauhan@cern.ch>
Signed-off-by: Rahul Chauhan <rahul.chauhan@cern.ch>
…eter

Signed-off-by: Rahul Chauhan <rahul.chauhan@cern.ch>
@amadio amadio merged commit a387d18 into xrootd:master Apr 10, 2025
11 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in Release Planning Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants