proc calc with iterators #469

Merged
ericbuckley merged 12 commits into main from qa/456-proc-calc-with-iterators on Jul 17, 2025

@ericbuckley ericbuckley commented Jul 16, 2025

Description

Reduces peak memory usage in tuning calculations by ~70% by using iterators.

Related Issues

closes #456

Additional Notes

  • A couple of dataclasses, TuningPair and TuningProbabilities, were added to schemas/tuning to help with passing data between prob_calc and base. The former is particularly helpful with the shift to iterators, as it's no longer possible to return values from the mpi_service queries.
  • AlgorithmPass.resolved_label was added to help with type checks, as it guarantees that a string will be returned (AlgorithmPass.label could be null).
  • Feature was configured to be "frozen" (meaning once a Feature object is instantiated, it can't be changed). This is useful because we can now use Feature objects as keys in our log-odds dicts, rather than the feature string values. This reduces the number of times we need to convert back and forth between strings and Features.
  • tuning/base::tune was split into 2 sub-functions, run_log_odds and run_rms. With the iterator-based version, we need to gather true-match and non-match pairs from the database twice. Separating the calculations into two functions makes it clearer where the pairs are being used.
  • The looping constructs in calculate_and_sort_tuning_scores were essentially reversed. Previously we looped over all the algorithm passes, then over the pairs. Now that we have iterators, that doesn't work, as we can only access each pair once. Instead, we now loop over the pairs, then loop over the algorithm passes. _score_pairs_in_class was renamed to _score_records_in_pair, as it now just performs the scoring for 1 pair over N passes.
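
The loop reversal and the frozen-Feature keys described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: Feature, Pass, pairs, score_pair and score_all are hypothetical stand-ins, and the scoring rule (sum the log-odds of features whose values agree) is an assumption made to keep the example small.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in; frozen=True makes instances hashable."""
    name: str


@dataclass
class Pass:
    """Hypothetical stand-in for an algorithm pass."""
    label: str
    features: list[Feature]


def pairs() -> Iterator[tuple[dict, dict]]:
    """Yield record pairs one at a time, as a streaming query would."""
    yield {"FIRST_NAME": "Ann", "ZIP": "12345"}, {"FIRST_NAME": "Ann", "ZIP": "99999"}
    yield {"FIRST_NAME": "Bob", "ZIP": "55555"}, {"FIRST_NAME": "Rob", "ZIP": "55555"}


def score_pair(pair, passes, log_odds) -> dict[str, float]:
    """Score a single pair against every pass (assumed scoring rule)."""
    a, b = pair
    return {
        p.label: sum(log_odds[f] for f in p.features if a[f.name] == b[f.name])
        for p in passes
    }


def score_all(pair_iter, passes, log_odds) -> dict[str, list[float]]:
    """Outer loop over pairs, inner loop over passes."""
    scores = {p.label: [] for p in passes}
    for pair in pair_iter:  # each pair is consumed exactly once
        for label, score in score_pair(pair, passes, log_odds).items():
            scores[label].append(score)
    return {label: sorted(vals) for label, vals in scores.items()}


first, zip_code = Feature("FIRST_NAME"), Feature("ZIP")
log_odds = {first: 6.0, zip_code: 4.0}  # Feature objects used directly as keys
passes = [Pass("pass_1", [first]), Pass("pass_2", [first, zip_code])]
result = score_all(pairs(), passes, log_odds)
```

Because the pair iterator is exhausted after one traversal, putting it in the outer loop is what lets every pass see every pair without materializing the pairs in a list.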

Performance Notes

Results from running tuning job tests with a 500k cluster using the default params.

main stats

runtime: 42s
max memory: 665MB

PR-469 stats

runtime: 44s
max memory: 204MB
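
As a quick check of the ~70% figure, the relative change can be computed from the numbers above. percent_change is a hypothetical helper written for this illustration only.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


memory_change = percent_change(665, 204)   # max memory, MB
runtime_change = percent_change(42, 44)    # runtime, seconds

print(f"max memory: {memory_change:+.1f}%")  # max memory: -69.3%
print(f"runtime: {runtime_change:+.1f}%")    # runtime: +4.8%
```

So the memory reduction comes at the cost of roughly a 5% longer runtime in this test.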

@ericbuckley ericbuckley self-assigned this Jul 16, 2025
@ericbuckley ericbuckley added the qa Technical improvements to increase code quality label Jul 16, 2025
codecov Bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 99.11504% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.43%. Comparing base (ff50edf) to head (79daea9).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/recordlinker/tuning/base.py | 97.56% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #469      +/-   ##
==========================================
- Coverage   98.46%   98.43%   -0.03%     
==========================================
  Files          41       41              
  Lines        2407     2435      +28     
==========================================
+ Hits         2370     2397      +27     
- Misses         37       38       +1     

@ericbuckley ericbuckley marked this pull request as ready for review July 16, 2025 17:15

@bamader bamader left a comment

Looks like a good clean refactor! Just a couple of minor comments around docstrings and an organizational question, but nothing that I think is blocking.

Comment thread src/recordlinker/schemas/tuning.py Outdated
Comment thread src/recordlinker/schemas/tuning.py
Comment thread src/recordlinker/tuning/prob_calc.py Outdated
@ericbuckley ericbuckley merged commit 2fc2c34 into main Jul 17, 2025
15 checks passed
@ericbuckley ericbuckley deleted the qa/456-proc-calc-with-iterators branch July 17, 2025 23:44

Labels

qa Technical improvements to increase code quality

Development

Successfully merging this pull request may close these issues.

Reduce tuning memory used

2 participants