
Fix raw data job query duplicates. #1800

Merged
aaronweeden merged 1 commit into ubccr:xdmod11.0 from aaronweeden:fix-jobs-raw-data-optimization
Dec 15, 2023

Contributor

aaronweeden commented Dec 14, 2023

Description

This PR fixes a bug introduced in #1780 in which queries for raw data in the Jobs realm would return duplicates if a single day was requested and that day contained jobs spanning multiple days. For example, if the request's start_date and end_date were both 2023-01-03, and a job started on 2023-01-01 and ended on 2023-01-03, the job would be included three times, once for each of the aggregation days 2023-01-01, 2023-01-02, and 2023-01-03. This happened because the query contained neither a DISTINCT modifier nor WHERE conditions on the day_id of the aggregation table. This PR adds the WHERE conditions; in the previous example, the job is now included only once, for the aggregation day 2023-01-03. This is sufficient to fix the bug; the DISTINCT modifier is unnecessary for single-day queries, so it remains omitted for performance.
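The duplication can be reproduced in miniature. The following is only a sketch using an in-memory SQLite database: the table names job_tasks and job_days, their columns, and the helper query_job_ids are hypothetical stand-ins chosen for illustration, not the actual XDMoD schema or the query built by JobDataset.php.

```python
import sqlite3

# Hypothetical schema: one row per job, plus one aggregation row per job per day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_tasks (job_id INTEGER PRIMARY KEY, start_day TEXT, end_day TEXT);
    CREATE TABLE job_days (job_id INTEGER, day_id TEXT);
    INSERT INTO job_tasks VALUES (1, '2023-01-01', '2023-01-03');
    INSERT INTO job_days VALUES
        (1, '2023-01-01'), (1, '2023-01-02'), (1, '2023-01-03');
""")


def query_job_ids(conn, start_day, end_day, filter_on_day_id):
    """Return job IDs for the date range, optionally filtering the aggregation rows."""
    sql = """
        SELECT t.job_id
        FROM job_tasks AS t
        JOIN job_days AS d ON d.job_id = t.job_id
        WHERE t.end_day BETWEEN ? AND ?
    """
    params = [start_day, end_day]
    if filter_on_day_id:
        # The kind of condition this PR describes: restrict the aggregation day too.
        sql += " AND d.day_id BETWEEN ? AND ?"
        params += [start_day, end_day]
    return [row[0] for row in conn.execute(sql, params)]


# Without the day_id condition, the join matches all three aggregation rows.
buggy_rows = query_job_ids(conn, "2023-01-03", "2023-01-03", filter_on_day_id=False)
# With it, only the aggregation row for 2023-01-03 matches.
fixed_rows = query_job_ids(conn, "2023-01-03", "2023-01-03", filter_on_day_id=True)
print(len(buggy_rows), len(fixed_rows))
```

In this toy setup, adding DISTINCT would also collapse the three rows into one, but it would make the database deduplicate the whole result set, which is the cost the description says is being avoided.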

Tests performed

The bug was caught by the regression test changes in #1792. I rebased that branch on this one and confirmed the tests pass again.

This PR causes the jobs to be returned in a different order depending on whether XDMOD_TEST_MODE is fresh_install or upgrade, so the regression test artifacts are updated. The Cloud realm's artifact is also updated: since the artifacts were last regenerated, tests/regression/runtests.sh was changed to search for username hashes and replace them with <username>. The cloud tests have continued to pass with the actual username hashes in the committed file, which is fine, but this PR synchronizes the committed file with what tests/regression/runtests.sh generates when REG_TEST_FORCE_GENERATION=1; namely, the username hashes in the file are replaced with <username>.
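The search-and-replace step can be pictured roughly as follows. This is a hedged sketch, not the code in tests/regression/runtests.sh: it assumes, purely for illustration, that a username hash is a 40-character lowercase hexadecimal string, and the names HASH_PATTERN and normalize_usernames are hypothetical.

```python
import re

# Assumption for illustration only: a username hash is 40 lowercase hex characters.
# The real pattern used by tests/regression/runtests.sh may differ.
HASH_PATTERN = re.compile(r"\b[0-9a-f]{40}\b")


def normalize_usernames(artifact_text):
    """Replace every hash-like token with the placeholder <username>."""
    return HASH_PATTERN.sub("<username>", artifact_text)


sample = '{"Username": "' + "ab12" * 10 + '", "Wall Time": "3600"}'
normalized = normalize_usernames(sample)
print(normalized)
```

Normalizing in this way keeps the committed artifact stable across environments in which the hashes themselves would differ.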

I also did the following on xdmod-dev:

  1. Set up my dev port with this branch.
  2. Copy the main development branch's classes/DataWarehouse/Query/Jobs/JobDataset.php to my dev port as classes/DataWarehouse/Query/Jobs/JobDatasetOld.php. Change its class name to JobDatasetOld.
  3. Edit classes/Rest/Controllers/WarehouseControllerProvider.php to add a raw-data-old endpoint that calls getRawDataOld, which is a copy of getRawData that calls getRawDataQueryOld, which is a copy of getRawDataQuery that sets $className to '\DataWarehouse\Query\Jobs\JobDatasetOld'.
  4. For the script from https://github.com/ubccr/xdmod-xsede/pull/436:
    1. Set 'old': 'https://xdmod-dev.ccr.xdmod.org:9001/rest/warehouse/raw-data-old'.
    2. Comment out all the realms except Jobs.
    3. Change NUM_TRIALS_PER_REALM to 5.
    4. Run the script with the --time argument. Confirm there are no differences.
    5. Set NUM_DAYS_PER_REQUEST to 1.
    6. Run the script with the --time argument. Confirm the number of rows in new is always less than or equal to old, and there are no differences.
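The final check in the steps above can be sketched as follows. The actual script from the xdmod-xsede pull request is not shown here, so the function compare_raw_data and its return shape are hypothetical, and the sketch assumes that "no differences" means the two endpoints return the same set of distinct rows.

```python
def compare_raw_data(old_rows, new_rows):
    """Compare rows from the old and new endpoints (hypothetical helper).

    Rows are compared as distinct tuples, so duplicates in old_rows do not
    count as differences; only the row counts reveal them.
    """
    old_distinct = set(map(tuple, old_rows))
    new_distinct = set(map(tuple, new_rows))
    return {
        "new_not_larger": len(new_rows) <= len(old_rows),
        "only_in_old": old_distinct - new_distinct,
        "only_in_new": new_distinct - old_distinct,
    }


# Example: the old endpoint repeats a multi-day job, the new one does not.
old_rows = [["job-1", "2023-01-03"], ["job-1", "2023-01-03"], ["job-2", "2023-01-03"]]
new_rows = [["job-1", "2023-01-03"], ["job-2", "2023-01-03"]]
result = compare_raw_data(old_rows, new_rows)
print(result)
```

Comparing distinct rows separately from row counts makes both expectations visible: the new endpoint returns no more rows than the old one, and nothing is missing or extra once duplicates are ignored.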

Checklist:

  • The pull request description is suitable for a Changelog entry
  • The milestone is set correctly on the pull request
  • The appropriate labels have been added to the pull request

aaronweeden added this to the 11.0.0 milestone Dec 14, 2023
aaronweeden force-pushed the fix-jobs-raw-data-optimization branch 3 times, most recently from eb3a18e to 7d15428 Dec 14, 2023 20:39
aaronweeden merged commit 48c7119 into ubccr:xdmod11.0 Dec 15, 2023
aaronweeden deleted the fix-jobs-raw-data-optimization branch Dec 15, 2023 14:36