It's difficult and error-prone to get a test run status #1329

@laurentsenta

Description

It is hard to tell whether a job succeeded or not.
For example, this CI action in the Rust SDK looks like it captures whether the test succeeds, but it doesn't:

https://github.com/testground/sdk-rust/blob/5c50ba4f63d2aff6b815cfbfcb0c8f9e4d73809e/.github/workflows/ci.yml#L106-L114

The CLI exits with code 0, so the job looks successful and the action completes, but the actual testground job outcome is "failed".

How to actually get a test result

There is a difference between the "run finished successfully" message in the log and the actual status shown on the dashboard (when you visit http://localhost:8042 with the daemon running locally).

If you run a test with the --wait flag, the CLI exits with code 0 even when the test fails.
This is why the action above doesn't catch failing tests.

Here is an example of the "correct" way to get the outcome; it takes 40 lines:

https://github.com/galargh/testground-github-action/blob/7dbfde22f7158acfdeb1b22352a14992f41b0310/entrypoint.sh#L46-L83

After polling for status, you have to run:

testground status --task c9p3s72el22n9gatgq70 --extended 

Which outputs something like:

› testground status --task c9p3s72el22n9gatgq70 --extended
May  4 08:54:21.573004  INFO    using home directory: /Users/laurent/testground
May  4 08:54:21.573192  INFO    .env.toml loaded from: /Users/laurent/testground/.env.toml
May  4 08:54:21.573203  INFO    testground client initialized   {"addr": "http://localhost:8043"}

>>> Result:

ID:             c9p3s72el22n9gatgq70
Priority:       1
Created:        2022-05-04 08:49:32.426292982 +0000 UTC
Type:           run
Status:         complete
Last update:    2022-05-04 08:51:15.295936558 +0000 UTC

Input:
{"Sources":{"base_dir":"/home/laurent/testground/data/work/requests/5950d816","extra_dir":"","plan_dir":"/home/laurent/testground/data/work/requests/5950d816/plan","sdk_dir":""},"build_groups":[0],"composition":{"global":{"build":null,"build_config":{"enabled":true},"builder":"docker:generic","case":"example","disable_metrics":false,"plan":"sdk-rust","run":null,"run_config":{"enabled":true},"runner":"local:docker","total_instances":1},"groups":[{"build":{"dependencies":[],"selectors":null},"id":"single","instances":{"count":1,"percentage":0},"resources":{"cpu":"","memory":""},"run":{"artifact":"51db1e46f370","profiles":null,"test_params":{}}}],"metadata":{"author":"","name":""}},"created_by":{},"manifest":{"Builders":{"docker:generic":{"enabled":true}},"ExtraSources":null,"Name":"sdk-rust","Runners":{"local:docker":{"enabled":true}},"TestCases":[{"Instances":{"Maximum":1,"Minimum":1},"Name":"example","Parameters":null}]},"priority":1}

Result:
{"journal":{"events":{},"pods_statuses":{}},"outcome":"failure","outcomes":{"single":{"ok":0,"total":1}}}

Note the Status: complete at the beginning,
BUT if you use the --extended parameter, the Result JSON shows "outcome": "failure".
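As a rough illustration of how a script could read the real outcome from that output, here is a minimal Python sketch. The helper name `extract_outcome` is hypothetical, and it assumes the extended output always has a bare "Result:" line followed by a single line of JSON, as in the sample above; other output layouts are not handled.

```python
import json


def extract_outcome(status_output: str) -> str:
    """Return the "outcome" field from `testground status --extended` output.

    Assumption: the JSON result sits on the first non-empty line after a
    line that is exactly "Result:" (the ">>> Result:" banner is skipped).
    """
    lines = status_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Result:":
            for candidate in lines[i + 1:]:
                if candidate.strip():
                    return json.loads(candidate)["outcome"]
    raise ValueError("no Result JSON found in status output")
```

A wrapper script could feed the captured stdout of the status command to this function and compare the returned string against "failure", instead of relying on the Status: line or on the CLI exit code.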

What defines this endeavour to be complete?

The CLI should be more explicit about failures. For example, it should exit with a non-zero code when a job run with --wait fails, and when the status command is called on a failed task.

Note that we have to deal with 2 kinds of errors: errors between the CLI and the running job (the job upload failed, the cluster is down, etc.), which are different from the test run actually failing.

  • List the operations we regularly use in the CLI, especially the ones that require a lot of scripting.
  • Expose a cleaner UI for these.
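One possible way to keep the two kinds of errors apart is to give each its own exit code. The sketch below is only an illustration: the function name, the constants, and the specific code values are hypothetical, and treating "success" as the passing outcome string is an assumption, since only "failure" appears in the output above.

```python
from typing import Optional

# Hypothetical exit codes, not something the testground CLI defines today.
EXIT_OK = 0
EXIT_TEST_FAILED = 1   # the test run itself failed
EXIT_INFRA_ERROR = 2   # upload failed, cluster down, no outcome available


def exit_code_for(cli_returncode: int, outcome: Optional[str]) -> int:
    """Map a CLI return code and a task outcome to a single exit code."""
    # A CLI error or a missing outcome means we never got a test verdict.
    if cli_returncode != 0 or outcome is None:
        return EXIT_INFRA_ERROR
    # Assumption: "success" is the outcome string of a passing run.
    if outcome == "success":
        return EXIT_OK
    return EXIT_TEST_FAILED
```

With a mapping like this, a CI step can fail on any non-zero code while still letting a script tell a broken cluster apart from a failing test.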
