proc calc with iterators by ericbuckley · Pull Request #469 · CDCgov/RecordLinker

ericbuckley · 2025-07-16T00:32:20Z

Description

Reduces peak memory usage in tuning calculations by ~70% by using iterators.

Related Issues

closes #456

Additional Notes

A couple of dataclasses, TuningPair and TuningProbabilities, were added to schemas/tuning to help with passing data between prob_calc and base. The former is particularly helpful with the shift to iterators, as it's no longer possible to return values from the mpi_service queries.
AlgorithmPass.resolved_label was added to help with type checks, as it guarantees a string will be returned (AlgorithmPass.label could be null).
Feature was configured to be "frozen" (meaning once a Feature object is instantiated, it can't be changed). This is useful because Feature objects can now be used as keys in our log-odds dicts, rather than the feature string values. This reduces the number of times we need to convert back and forth between strings and Features.
tuning/base::tune was split into 2 sub-functions, run_log_odds and run_rms. With the iterator version, we need to gather true-match and non-match pairs from the database twice. Separating the calculations into two functions makes it clearer where the pairs are being used.
The looping constructs in calculate_and_sort_tuning_scores were essentially reversed. Previously we looped over all the algorithm passes, then over the pairs. Now that we have iterators, that doesn't work, as we can only access each pair once. To accommodate this, we now loop over the pairs, then loop over the algorithm passes. _score_pairs_in_class was renamed to _score_records_in_pair, as it now just computes the score for 1 pair over N passes.
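
For readers who have not opened the diff, here is a minimal sketch of two of the ideas above: a frozen dataclass used as a dict key, and the inverted loop that consumes each pair exactly once. It is not the code from this PR. The field names, `stream_pairs`, `score_pair`, `score_all`, and the equality-based scoring rule are all assumptions made for illustration; only the class names Feature and TuningPair come from the notes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Feature:
    # Hypothetical stand-in for the project's Feature class.
    # frozen=True makes instances immutable and hashable, so they work as dict keys.
    name: str


@dataclass
class TuningPair:
    # Hypothetical stand-in: each record is a plain dict of Feature -> value.
    record1: dict[Feature, str]
    record2: dict[Feature, str]


def stream_pairs() -> Iterator[TuningPair]:
    """Yield pairs one at a time, the way a streamed database query might."""
    first, last = Feature("FIRST_NAME"), Feature("LAST_NAME")
    yield TuningPair({first: "ann", last: "lee"}, {first: "ann", last: "lee"})
    yield TuningPair({first: "ann", last: "lee"}, {first: "anne", last: "lee"})
    yield TuningPair({first: "bob", last: "ray"}, {first: "rob", last: "roy"})


def score_pair(
    pair: TuningPair, features: list[Feature], log_odds: dict[Feature, float]
) -> float:
    """Toy scoring rule (an assumption): add the log-odds of each exactly matching feature."""
    return sum(
        (log_odds[f] for f in features if pair.record1.get(f) == pair.record2.get(f)),
        0.0,
    )


def score_all(
    pairs: Iterable[TuningPair],
    passes: dict[str, list[Feature]],
    log_odds: dict[Feature, float],
) -> dict[str, list[float]]:
    """Score every pair against every pass while reading the pairs only once."""
    scores: dict[str, list[float]] = {label: [] for label in passes}
    for pair in pairs:  # outer loop: the iterator is consumed a single time
        for label, features in passes.items():  # inner loop: passes are cheap to revisit
            scores[label].append(score_pair(pair, features, log_odds))
    return {label: sorted(values) for label, values in scores.items()}
```

With the loops in the old order (passes outside, pairs inside), the second pass would see an already exhausted iterator and score nothing. That is also why a second computation over the same pairs needs a fresh query rather than a reused iterator.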

Performance Notes

Results from running tuning job tests with a 500k cluster using the default params.

main stats

runtime: 42s
max memory: 665MB

PR-469 stats

runtime: 44s
max memory: 204MB
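
As a rough illustration of why streaming lowers peak memory, the sketch below compares materializing pairs in a list against consuming them from a generator, using the standard-library `tracemalloc` module. This is not how the numbers above were collected. The helper names and the synthetic integer pairs are assumptions made for illustration.

```python
import tracemalloc
from typing import Callable, Iterator


def make_pairs(n: int) -> Iterator[tuple[int, int]]:
    """Synthetic stand-in for a query that yields one pair at a time."""
    for i in range(n):
        yield (i, i + 1)


def total_from_list(n: int = 100_000) -> int:
    # Materializes every pair before processing, so all pairs are alive at once.
    pairs = list(make_pairs(n))
    return sum(a + b for a, b in pairs)


def total_from_iterator(n: int = 100_000) -> int:
    # Processes each pair as it is produced; only one pair is alive at a time.
    return sum(a + b for a, b in make_pairs(n))


def peak_bytes(fn: Callable[[], int]) -> int:
    """Return the peak number of bytes allocated by Python while fn runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling `peak_bytes(total_from_list)` and `peak_bytes(total_from_iterator)` should show a much smaller peak for the iterator version, while both functions return the same total. The exact figures depend on the Python version and platform.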

codecov · 2025-07-16T02:45:27Z

Codecov Report

Attention: Patch coverage is 99.11504% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.43%. Comparing base (ff50edf) to head (79daea9).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/recordlinker/tuning/base.py	97.56%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #469      +/-   ##
==========================================
- Coverage   98.46%   98.43%   -0.03%     
==========================================
  Files          41       41              
  Lines        2407     2435      +28     
==========================================
+ Hits         2370     2397      +27     
- Misses         37       38       +1


bamader

Looks like a good clean refactor! Just a couple of minor comments around docstrings and an organizational question, but nothing that I think is blocking.

ericbuckley added 8 commits July 11, 2025 12:49

changing tuning match queries to return iterators

c8b3bae

small change

fcbf1c1

add TuningPair class to help with iterators

f1b5047

calculate_class_probs with iterator

215261e

adding TuningPairs and TuningProbabilities

c52b9ec

add AlgorithmPass.resolved_label to guarantee a label is always available

bd414a1

rework prob_calc functions to work with TuningPair iterables

6898ae9

split log odds and RMS recommendations processes

bc70149

ericbuckley self-assigned this Jul 16, 2025

ericbuckley added the qa Technical improvements to increase code quality label Jul 16, 2025

Merge branch 'main' into qa/456-proc-calc-with-iterators

822153f

ericbuckley marked this pull request as ready for review July 16, 2025 17:15

ericbuckley requested review from bamader and m-goggins as code owners July 16, 2025 17:15

Merge branch 'main' into qa/456-proc-calc-with-iterators

baeb9e9

bamader approved these changes Jul 17, 2025


Comment thread src/recordlinker/schemas/tuning.py Outdated

Comment thread src/recordlinker/schemas/tuning.py

Comment thread src/recordlinker/tuning/prob_calc.py Outdated

ericbuckley added 2 commits July 17, 2025 11:00

update comment

f5b4361

remove no agreement true match scores in estimate_rms_bounds

79daea9

ericbuckley merged commit 2fc2c34 into main Jul 17, 2025
15 checks passed

ericbuckley deleted the qa/456-proc-calc-with-iterators branch July 17, 2025 23:44

ericbuckley merged 12 commits into main from qa/456-proc-calc-with-iterators
