Implement incremental json loading by m-goggins · Pull Request #389 · CDCgov/RecordLinker

m-goggins · 2025-05-22T19:56:00Z

Description

This PR updates the load-testing scrambler to load and process a single cluster at a time from a very large (~2GB) file of generated data. I used ijson for the incremental JSON loading, which lets us read each cluster individually as a complete object rather than a few bytes at a time.
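For readers unfamiliar with ijson, here is a minimal sketch of per-cluster streaming. It is not the code from this PR: it assumes the generated file is a top-level JSON array of cluster objects, and the helper names (`iter_clusters`, `write_demo_file`) and the field names (`cluster_id`, `records`, `first_name`) are hypothetical. The `json.load` fallback exists only so the sketch runs without ijson installed; it reads the whole file and is not incremental.

```python
import json
import os
import tempfile

try:
    import ijson  # third-party streaming JSON parser
except ImportError:  # fallback keeps the sketch runnable without ijson
    ijson = None


def iter_clusters(path):
    """Yield one cluster (a dict) at a time from a JSON array on disk."""
    with open(path, "rb") as f:
        if ijson is not None:
            # The "item" prefix selects each element of the top-level array,
            # so only one cluster is materialized at a time.
            yield from ijson.items(f, "item")
        else:
            # NOT incremental: loads the entire file into memory.
            yield from json.load(f)


def write_demo_file(n_clusters):
    """Write a tiny stand-in for the generated data and return its path."""
    clusters = [
        {"cluster_id": i, "records": [{"first_name": f"name-{i}"}]}
        for i in range(n_clusters)
    ]
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(clusters, f)
    return path


if __name__ == "__main__":
    demo_path = write_demo_file(3)
    for cluster in iter_clusters(demo_path):
        print(cluster["cluster_id"], len(cluster["records"]))
    os.remove(demo_path)
```

Because `iter_clusters` is a generator, the scrambling step can consume clusters one by one without holding the full dataset in memory.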

To test this, I generated 1.5 million clusters (each with a single record) using gen_seed_data_script.py. The resulting file was roughly 2GB (200MB when zipped). I then ran the scrambler on the unzipped file and was able to sample and generate ~5 million scrambled records in 5 minutes on my local machine:

python3 -m tests.load.scripts.scramble_data --file="./test_original_data.json" --max-seed-records 5000000

Related Issues

Closes #387

Additional Notes

The next bit of work will be updating the locustfile to reflect the per-cluster randomization and ensuring that we hit the /seed endpoint with only 100 records at a time, using ijson with a batching function.
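The batching function mentioned here is future work and is not part of this PR. As a rough sketch of the idea, the hypothetical helper below (`batched`) groups any iterator of records into lists of at most 100. It assumes nothing about how the records are posted to /seed.

```python
from itertools import islice


def batched(records, batch_size=100):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(records)
    while True:
        # islice pulls at most batch_size items without reading ahead.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(250), batch_size=100):
        print(len(batch))
```

Since the helper only pulls items on demand, it could be fed directly from a streaming parser so that no more than one batch is in memory at once.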

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
I have reviewed my changes to ensure they are clear, concise, and well-documented.
I have updated the documentation, if applicable.
I have added or updated test cases to cover my changes, if applicable.
I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

The code follows best practices and conventions.
The changes implement the desired functionality or fix the reported issue.
The tests cover the new changes and pass successfully.
Any potential edge cases or error scenarios have been considered.

codecov · 2025-05-22T19:58:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.51%. Comparing base (7505a75) to head (924ffeb).
Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #389   +/-   ##
=======================================
  Coverage   98.51%   98.51%           
=======================================
  Files          33       33           
  Lines        1948     1948           
=======================================
  Hits         1919     1919           
  Misses         29       29


ericbuckley · 2025-05-22T20:23:25Z

@m-goggins when you ran the test on 5 million records, did you happen to monitor memory usage over that 5-minute span? Did the program quickly reach its max memory usage, or did it rise slowly over the entire 5 minutes? Also, I'm curious what the total amount of memory required was.
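One way to answer this kind of question is to sample memory while the records stream through. The sketch below uses the standard-library tracemalloc module. It is not the memory tracking code that was added and later removed in this PR (that code is not shown here), and `track_memory` is a hypothetical helper name. Note that tracemalloc reports only allocations made by Python, not the process's total resident memory.

```python
import tracemalloc


def track_memory(items, every=1000):
    """Consume items, sampling traced memory every `every` items.

    Returns (samples, peak_bytes), where samples is a list of
    (items_seen, current_bytes) tuples.
    """
    tracemalloc.start()
    samples = []
    try:
        for i, _item in enumerate(items, start=1):
            if i % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((i, current))
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return samples, peak


if __name__ == "__main__":
    samples, peak = track_memory(({"id": i} for i in range(5000)))
    for seen, current in samples:
        print(f"{seen} items: {current} bytes currently traced")
    print(f"peak: {peak} bytes")
```

A flat series of samples would suggest the program reaches its working-set size quickly, while a steadily rising series would point to data accumulating over the run.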

ericbuckley

Looks good @m-goggins, just one small suggestion about the memory tracking code.

m-goggins added 3 commits May 22, 2025 15:51

bump mac duplicate cases

6bab7bc

modify json scrambler to work on a single cluster at a time

ae984f6

modify scrambler to work over a single cluster at time from the original data

415f9ca

m-goggins marked this pull request as ready for review May 22, 2025 20:05

m-goggins requested review from bamader, ericbuckley and johanna-skylight as code owners May 22, 2025 20:05

m-goggins and others added 8 commits May 27, 2025 15:01

adjust scrambling for dates to ensure they are always valid

235df48

write clusters incrementally to avoid loading too many into memory

2e6cc47

add memory tracing

160a60c

add flush and update output metrics

3b8d689

clean up viz file

8980be6

improving memory usage in gen_seed_test_data.py

3697e4c

ensure birthdates cannot be generated as future dates

dbfc836

Merge branch 'implement-incremental-json-loading' of https://github.com/CDCgov/RecordLinker into implement-incremental-json-loading

b0a06f1

ericbuckley approved these changes May 28, 2025

Comment thread tests/load/scripts/scramble_data.py Outdated

remove memory tracking

924ffeb

m-goggins merged commit b2deb48 into main May 28, 2025
15 checks passed

m-goggins deleted the implement-incremental-json-loading branch May 28, 2025 21:28

m-goggins merged 12 commits into main from implement-incremental-json-loading
