Skip to content

Add endpoint to return orphaned persons#225

Merged
ericbuckley merged 47 commits into
mainfrom
feature/163-get-orphaned-persons
Feb 28, 2025
Merged

Add endpoint to return orphaned persons#225
ericbuckley merged 47 commits into
mainfrom
feature/163-get-orphaned-persons

Conversation

@m-goggins

@m-goggins m-goggins commented Feb 25, 2025

Copy link
Copy Markdown
Collaborator

Description

The PR adds an endpoint to return a paginated list of all the persons with no members.

While working on this PR I also made some small changes to get_orphaned_patients:

  • added an explicit ORDER_BY after discovering some inconsistent results when I included a limit but not a cursor. I also updated the tests accordingly.
  • genericized schemas.PaginatedRefs so that the paginated results for both get orphaned patients and persons are the same.

Related Issues

#163

Additional Notes

I spent a good amount of time thinking about how we should execute this query to make it as efficient as possible. My working assumptions were that:

  1. the patient table will be larger than the person table,
  2. both tables will be large, and
  3. orphaned persons will be relatively rare.

I considered two approaches, "LEFT JOIN" (which I ultimately landed on) and "NOT EXISTS", but I am open to hearing others.

LEFT JOIN approach:

SELECT p.*
FROM mpi_person p
LEFT JOIN mpi_patient pt ON pt.person_id = p.id
WHERE pt.id IS NULL

From EXPLAIN QUERY PLAN, we can see that we are scanning mpi_person, using a Bloom filter when scanning the larger patient table to more quickly eliminate rows that definitely do not match any rows in Person, and uses a covering index on mpi_patient.id, which should keep things as quick as possible.

NOT EXISTS approach:

SELECT *
FROM mpi_person p
WHERE NOT EXISTS (
    SELECT 1 
    FROM mpi_patient pt 
    WHERE pt.person_id = p.id
)

This approach is less efficient because of the subquery executes multiple times checking for matching rows in Patient for each row in Person. It would be more efficient if we add an index on Patient.person_id, but I still think LEFT JOIN is a better choice given the assumptions stated above (especially #1).

I also briefly considered a view, but given the number of updates to the Patient and Person tables, I don't think this is our best path forward.

All that said, I am open to other approaches and would love to get folks' thoughts.

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@codecov

codecov Bot commented Feb 26, 2025

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.72%. Comparing base (d54af84) to head (475c6d5).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
+ Coverage   97.69%   97.72%   +0.02%     
==========================================
  Files          32       32              
  Lines        1651     1672      +21     
==========================================
+ Hits         1613     1634      +21     
  Misses         38       38              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@m-goggins m-goggins changed the title Feature/163 get orphaned persons Add endpoint to return orphaned persons Feb 26, 2025
@m-goggins m-goggins marked this pull request as ready for review February 26, 2025 22:47
Comment thread src/recordlinker/database/mpi_service.py Outdated
ericbuckley
ericbuckley previously approved these changes Feb 27, 2025

@ericbuckley ericbuckley left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @m-goggins. Thanks for wrapping up all the MPIAPI work!

Comment thread src/recordlinker/database/mpi_service.py Outdated
@ericbuckley ericbuckley merged commit a2798c1 into main Feb 28, 2025
@ericbuckley ericbuckley deleted the feature/163-get-orphaned-persons branch February 28, 2025 18:37
bamader pushed a commit that referenced this pull request Mar 19, 2025
## Description
The PR adds an endpoint to return a paginated list of all the persons
with no members.

While working on this PR I also made some small changes to
`get_orphaned_patients`:

- added an explicit `ORDER_BY` after discovering some inconsistent
results when I included a `limit` but not a `cursor`. I also updated the
tests accordingly.
- genericized `schemas.PaginatedRefs` so that the paginated results for
both get orphaned patients and persons are the same.

## Related Issues
#163 

## Additional Notes
I spent a good amount of time thinking about how we should execute this
query to make it as efficient as possible. My working assumptions were
that:
1) the patient table will be larger than the person table, 
2) both tables will be large, and 
3) orphaned persons will be relatively rare.

I considered two approaches, "LEFT JOIN" (which I ultimately landed on)
and "NOT EXISTS", but I am open to hearing others.

LEFT JOIN approach:
```
SELECT p.*
FROM mpi_person p
LEFT JOIN mpi_patient pt ON pt.person_id = p.id
WHERE pt.id IS NULL
``` 

From `EXPLAIN QUERY PLAN`, we can see that we are scanning `mpi_person`,
using a Bloom filter when scanning the larger patient table to more
quickly eliminate rows that definitely do not match any rows in Person,
and uses a covering index on `mpi_patient.id`, which should keep things
as quick as possible.

NOT EXISTS approach:
```
SELECT *
FROM mpi_person p
WHERE NOT EXISTS (
    SELECT 1 
    FROM mpi_patient pt 
    WHERE pt.person_id = p.id
)
```
This approach is less efficient because of the subquery executes
multiple times checking for matching rows in Patient for each row in
Person. It would be more efficient if we add an index on
Patient.person_id, but I still think LEFT JOIN is a better choice given
the assumptions stated above (especially #1).

I also briefly considered a view, but given the number of updates to the
Patient and Person tables, I don't think this is our best path forward.

All that said, I am open to other approaches and would love to get
folks' thoughts.

---------

Co-authored-by: Eric Buckley <eric.buckley@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants