Add endpoint to return orphaned persons by m-goggins · Pull Request #225 · CDCgov/RecordLinker

m-goggins · 2025-02-25T23:55:05Z

Description

This PR adds an endpoint that returns a paginated list of all persons with no members (orphaned persons).

While working on this PR I also made some small changes to get_orphaned_patients:

- Added an explicit ORDER BY after discovering inconsistent results when I included a limit but not a cursor. I also updated the tests accordingly.
- Genericized schemas.PaginatedRefs so that the paginated results for orphaned patients and orphaned persons share the same shape.
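
To illustrate why the explicit ordering matters, here is a minimal sketch using the standard-library `sqlite3` module. It is not the code in `mpi_service.py`: the helper name `get_orphaned_patient_ids`, the in-memory table, and the assumption that an orphaned patient is a row with a NULL `person_id` are all hypothetical and chosen only for illustration.

```python
import sqlite3


def get_orphaned_patient_ids(conn, limit=None, cursor=None):
    """Hypothetical helper: ids of patients without a person, ordered by id."""
    sql = "SELECT id FROM mpi_patient WHERE person_id IS NULL"
    params = []
    if cursor is not None:
        # Only rows strictly after the cursor are returned.
        sql += " AND id > ?"
        params.append(cursor)
    # Without an explicit ORDER BY the database may return rows in any order,
    # so a LIMIT could yield a different subset from one call to the next.
    sql += " ORDER BY id"
    if limit is not None:
        # The limit is applied after the cursor filter.
        sql += " LIMIT ?"
        params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mpi_patient (id INTEGER PRIMARY KEY, person_id INTEGER)")
conn.executemany(
    "INSERT INTO mpi_patient (id, person_id) VALUES (?, ?)",
    [(1, None), (2, 10), (3, None), (4, None), (5, 11)],
)

first_page = get_orphaned_patient_ids(conn, limit=2)
# The last id of a page serves as the cursor for the next page.
next_page = get_orphaned_patient_ids(conn, limit=2, cursor=first_page[-1])
```

With a stable ordering, the last id on one page can be passed as the cursor for the next, and a limit-only request always returns the same leading rows.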

Related Issues

#163

Additional Notes

I spent a good amount of time thinking about how we should execute this query to make it as efficient as possible. My working assumptions were that:

1) the patient table will be larger than the person table,
2) both tables will be large, and
3) orphaned persons will be relatively rare.

I considered two approaches, "LEFT JOIN" (which I ultimately landed on) and "NOT EXISTS", but I am open to hearing others.

LEFT JOIN approach:

SELECT p.*
FROM mpi_person p
LEFT JOIN mpi_patient pt ON pt.person_id = p.id
WHERE pt.id IS NULL

From EXPLAIN QUERY PLAN, we can see that the query scans mpi_person, uses a Bloom filter when scanning the larger patient table to more quickly eliminate rows that definitely do not match any rows in Person, and uses a covering index on mpi_patient.id, which should keep things as quick as possible.
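
For anyone who wants to inspect a plan locally, the following is a rough sketch using Python's built-in `sqlite3` module. The two-column tables are a simplified stand-in for the real schema (an assumption), and the plan text depends on the SQLite version and table statistics, so it may not match the output described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns the query touches.
conn.executescript(
    """
    CREATE TABLE mpi_person (id INTEGER PRIMARY KEY);
    CREATE TABLE mpi_patient (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES mpi_person(id)
    );
    INSERT INTO mpi_person (id) VALUES (1), (2), (3);
    INSERT INTO mpi_patient (id, person_id) VALUES (10, 1), (11, 1), (12, 3);
    """
)

LEFT_JOIN_SQL = """
    SELECT p.id
    FROM mpi_person p
    LEFT JOIN mpi_patient pt ON pt.person_id = p.id
    WHERE pt.id IS NULL
    ORDER BY p.id
"""

# Person 2 has no patients, so it is the only orphan in this toy data set.
orphaned_ids = [row[0] for row in conn.execute(LEFT_JOIN_SQL)]

# Each plan row's last column holds the human-readable step description.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + LEFT_JOIN_SQL)]
for step in plan:
    print(step)
```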

NOT EXISTS approach:

SELECT *
FROM mpi_person p
WHERE NOT EXISTS (
    SELECT 1 
    FROM mpi_patient pt 
    WHERE pt.person_id = p.id
)

This approach is less efficient because the subquery executes once for each row in Person, checking for matching rows in Patient. It would be more efficient if we added an index on Patient.person_id, but I still think LEFT JOIN is a better choice given the assumptions stated above (especially assumption 1).
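
As a sanity check, here is a small sketch (again with the standard-library `sqlite3` module and an assumed, simplified schema) showing that both formulations return the same orphaned persons, and how one might print the NOT EXISTS plan after adding an index on `person_id`. The index name `idx_patient_person_id` is hypothetical, and the sketch makes no claim about which plan is faster on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema with two orphaned persons (ids 2 and 4).
conn.executescript(
    """
    CREATE TABLE mpi_person (id INTEGER PRIMARY KEY);
    CREATE TABLE mpi_patient (id INTEGER PRIMARY KEY, person_id INTEGER);
    INSERT INTO mpi_person (id) VALUES (1), (2), (3), (4);
    INSERT INTO mpi_patient (id, person_id) VALUES (10, 1), (11, 3);
    """
)
# Hypothetical index name; lets the subquery look up person_id directly.
conn.execute("CREATE INDEX idx_patient_person_id ON mpi_patient (person_id)")

LEFT_JOIN_SQL = """
    SELECT p.id FROM mpi_person p
    LEFT JOIN mpi_patient pt ON pt.person_id = p.id
    WHERE pt.id IS NULL
    ORDER BY p.id
"""
NOT_EXISTS_SQL = """
    SELECT p.id FROM mpi_person p
    WHERE NOT EXISTS (
        SELECT 1 FROM mpi_patient pt WHERE pt.person_id = p.id
    )
    ORDER BY p.id
"""

left_join_ids = [row[0] for row in conn.execute(LEFT_JOIN_SQL)]
not_exists_ids = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]

# Print the plan for the NOT EXISTS form; output varies by SQLite version.
for row in conn.execute("EXPLAIN QUERY PLAN " + NOT_EXISTS_SQL):
    print(row[3])
```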

I also briefly considered a view, but given the number of updates to the Patient and Person tables, I don't think this is our best path forward.

All that said, I am open to other approaches and would love to get folks' thoughts.


codecov · 2025-02-26T16:20:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.72%. Comparing base (d54af84) to head (475c6d5).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
+ Coverage   97.69%   97.72%   +0.02%     
==========================================
  Files          32       32              
  Lines        1651     1672      +21     
==========================================
+ Hits         1613     1634      +21     
  Misses         38       38


ericbuckley

Looks great @m-goggins. Thanks for wrapping up all the MPIAPI work!

Co-authored-by: Eric Buckley <eric.buckley@gmail.com>

m-goggins added 25 commits February 19, 2025 14:19

delete empty person + tests

55960df

Merge branch 'main' of https://github.com/CDCgov/RecordLinker into feature/163-manage-orphan-data

88e0f69

fix expected return

bba6265

add endpoint for returning oprhaned patient data

55afb49

tests for get_orphaned_patients

08f433b

fix expected return value

4b62106

add router tests

a157ebb

add cursor pagination

cbed38f

Merge branch 'main' of https://github.com/CDCgov/RecordLinker into feature/163-get-orphaned-patients

5c340c9

update typing

220ae71

update cursor & tests

d83b766

update returned cursor to str

a91e48d

add tests for patient router

d0c88a9

limit upper limit

271247d

remove print statements

17fa1ec

use request.base_url

a2fb1ce

update get_orphaned_patients to accept id instead of ref_id

376243c

add check for valid patient_ref_id

e2538d1

update Paginated results to be generic for patients or persons

6f2aa24

add get_orphaned_persons

b16fb9c

Merge branch 'feature/163-get-orphaned-patients' of https://github.com/CDCgov/RecordLinker into feature/163-get-orphaned-persons

e322006

Merge branch 'main' of https://github.com/CDCgov/RecordLinker into feature/163-get-orphaned-persons

a519e9d

cleanup

29575f8

genericize PaginatedRefs

adb14f1

update tests for generic paginated data

ca1d177

m-goggins added 3 commits February 26, 2025 08:36

change query to left join

d258e1d

add tests for get_orphaned_persons

4431971

update tests & add check for cursor validity

c9e860a

m-goggins changed the title ~~Feature/163 get orphaned persons~~ Add endpoint to return orphaned persons Feb 26, 2025

m-goggins added 16 commits February 26, 2025 12:27

try exists instead

0a45a3e

just test mpi_service

e48d0fe

go back to outerjoin

88e0de9

test 2nd test

28f96e3

go back to single test

625e844

test with correctly added persons

1f9774f

add back person_router tests

ae718bd

remove commented out code

bbfbe8a

test only limit with get patients

8d3ed7f

update test_patients with limit

7d8adb4

ugh, update limit count typo

ba1ceb1

add cursor and invalid cursor tests

bdf4cd5

add cursor+limit tests

97940b4

add only limit

dfc735a

add explicit order by

cbd7da8

add explicit order by for get_orphaned_patients as well

8dc5842

m-goggins marked this pull request as ready for review February 26, 2025 22:47

m-goggins requested review from bamader and ericbuckley as code owners February 26, 2025 22:47

ericbuckley reviewed Feb 27, 2025


Comment thread src/recordlinker/database/mpi_service.py Outdated

ericbuckley previously approved these changes Feb 27, 2025


add comment about limit applied after cursor

ed15a79

m-goggins dismissed ericbuckley’s stale review via ed15a79 February 27, 2025 16:43

m-goggins requested a review from ericbuckley February 27, 2025 16:49

ericbuckley reviewed Feb 28, 2025


Comment thread src/recordlinker/database/mpi_service.py Outdated

Update src/recordlinker/database/mpi_service.py

475c6d5

ericbuckley approved these changes Feb 28, 2025


ericbuckley merged commit a2798c1 into main Feb 28, 2025

ericbuckley deleted the feature/163-get-orphaned-persons branch February 28, 2025 18:37

ericbuckley merged 47 commits into main from feature/163-get-orphaned-persons
