
Implement incremental json loading#389

Merged
m-goggins merged 12 commits into
mainfrom
implement-incremental-json-loading
May 28, 2025
Conversation

@m-goggins

@m-goggins m-goggins commented May 22, 2025

Collaborator

Description

This PR updates the load testing scrambler to load and process a single cluster at a time from a very large (~2GB) file of generated data. I used ijson for the incremental JSON loading, which fortunately lets us read each cluster in as a complete object rather than a few bytes at a time.

To test this, I generated 1.5 million clusters (each with a single record) using gen_seed_data_script.py. The resulting file was roughly 2GB (200MB when zipped). I then ran the scrambler on the unzipped file and was able to sample and generate ~5 million scrambled records in 5 minutes on my local machine:

python3 -m tests.load.scripts.scramble_data --file="./test_original_data.json" --max-seed-records 5000000

Related Issues

Closes #387

Additional Notes

The next bit of work will be updating the locustfile to reflect the per-cluster randomization and ensuring that we hit the /seed endpoint with only 100 records at a time, using ijson with a batching function.
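One possible shape for the batching function mentioned above, as a rough sketch: it groups any iterator (for example, records streamed by ijson) into lists of at most 100 items. The name `batched` and the use of `range` as stand-in data are assumptions; the actual locustfile change is not part of this PR.

```python
from itertools import islice


def batched(records, size=100):
    """Yield lists of at most `size` items from any iterable, lazily."""
    it = iter(records)
    while True:
        # islice pulls only the next `size` items, so the source iterator
        # is consumed incrementally rather than materialized up front.
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Stand-in for a stream of records; 250 items give two full batches and a remainder.
batch_sizes = [len(batch) for batch in batched(range(250), size=100)]
print(batch_sizes)
```

Each yielded batch could then be sent as one request body, keeping every call to the endpoint at 100 records or fewer.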

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@codecov

codecov Bot commented May 22, 2025


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.51%. Comparing base (7505a75) to head (924ffeb).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #389   +/-   ##
=======================================
  Coverage   98.51%   98.51%           
=======================================
  Files          33       33           
  Lines        1948     1948           
=======================================
  Hits         1919     1919           
  Misses         29       29           


@m-goggins m-goggins marked this pull request as ready for review May 22, 2025 20:05
@ericbuckley

Collaborator

@m-goggins when you ran the test on 5 million records, did you happen to monitor memory usage over that 5-minute span? Did the program quickly reach its max memory usage, or did it rise slowly over the entire 5 minutes? Also, I'm curious what the total amount of memory required was.
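One way to answer this kind of question is to sample memory at intervals while the work runs. The sketch below uses the standard-library `tracemalloc` module; the helper name `run_with_memory_samples` and the stand-in workload are hypothetical, and this is not necessarily how the PR's own memory tracking code works. Note that `tracemalloc` only sees allocations made through Python's memory allocator.

```python
import tracemalloc


def run_with_memory_samples(items, work, sample_every=1000):
    """Run `work` on each item, recording (count, current, peak) bytes periodically."""
    tracemalloc.start()
    samples = []
    try:
        for i, item in enumerate(items, start=1):
            work(item)
            if i % sample_every == 0:
                # get_traced_memory returns (current, peak) in bytes.
                current, peak = tracemalloc.get_traced_memory()
                samples.append((i, current, peak))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return samples, peak


# Stand-in workload; a real run would iterate over streamed clusters.
samples, peak = run_with_memory_samples(range(5000), lambda x: str(x) * 10)
print(len(samples), "samples; peak bytes:", peak)
```

A flat `current` column across samples would suggest memory plateaus early, while a steadily rising one would suggest something is accumulating over the run.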

@ericbuckley ericbuckley left a comment

Collaborator

Looks good @m-goggins, just one small suggestion on the memory tracking code.

Comment thread tests/load/scripts/scramble_data.py Outdated
@m-goggins m-goggins merged commit b2deb48 into main May 28, 2025
15 checks passed
@m-goggins m-goggins deleted the implement-incremental-json-loading branch May 28, 2025 21:28
