Inspiration

Medical records are difficult to understand on their own. Every major lab also reports results in its own format, which makes the records even harder for non-specialists to read. On top of that, many older paper records are torn or worn and have to be restored before they can be used. Get Med Data was born from these two goals: restoring damaged records and making them easier to understand.

What it does

Get Med Data takes an image of a medical record and produces a single three-page PDF.

Page 1: Original image

Page 2: Original image with the gene mutation highlighted

Page 3: Data from the record, presented in an ordered form that anyone can understand

How it works

Step 1: The original image is converted to grayscale and saved as a new image. The new image is then read back, and Tesseract OCR makes a first attempt at extracting its text.
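A minimal sketch of this step, assuming Pillow and pytesseract are installed and the Tesseract binary is on the PATH. The function names and the `_gray` file suffix are illustrative choices, not taken from the project's code.

```python
from pathlib import Path


def grayscale_path(image_path):
    """Derive the path of the grayscale copy, e.g. scan.png -> scan_gray.png."""
    path = Path(image_path)
    return path.with_name(f"{path.stem}_gray{path.suffix}")


def extract_text(image_path):
    """Save a grayscale copy of the image, then run Tesseract OCR on that copy."""
    # Imported here so the path helper above works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    gray_path = grayscale_path(image_path)
    with Image.open(image_path) as original:
        original.convert("L").save(gray_path)  # "L" is Pillow's 8-bit grayscale mode
    with Image.open(gray_path) as gray:
        return pytesseract.image_to_string(gray)
```

Calling `extract_text("record.png")` would write `record_gray.png` next to the input and return the recognized text as a string.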

Step 2: There are two possibilities. Either the original image was clean, or it is a picture of an old record whose paper was physically damaged, which leads to incorrect OCR output. To detect inaccurate readings, we check the extracted text for grammatical errors. A high error rate indicates a problem with the original image.
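The check could look roughly like the sketch below. It assumes the error rate is measured as grammar matches per word and that 0.2 is a reasonable cutoff; both the metric and the threshold are assumptions, since the write-up does not state them. The checker is passed in as a function so the logic can be tried without LanguageTool installed.

```python
def error_rate(text, check):
    """Return the number of issues reported by `check` per word of `text`."""
    words = text.split()
    if not words:
        return 1.0  # no readable text at all is treated as the worst case
    return len(check(text)) / len(words)


def looks_damaged(text, check, threshold=0.2):
    """Flag the OCR output as unreliable when the error rate exceeds the threshold.

    The default threshold is an assumed value, not one from the project.
    """
    return error_rate(text, check) > threshold


def languagetool_checker(language="en-US"):
    """Build a checker backed by language-tool-python (needs Java at runtime)."""
    import language_tool_python

    tool = language_tool_python.LanguageTool(language)
    return tool.check  # tool.check(text) returns a list of matches
```

With the real library this would be used as `looks_damaged(text, languagetool_checker())`.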

Step 3: Once the text is clean, we search it for the variant name and highlight that name on the image.
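One way to locate and highlight the variant is to ask Tesseract for word-level bounding boxes and draw a rectangle around each matching word. This is a sketch under assumptions: the variant is taken to appear as a single OCR word, and the function names and red outline are illustrative rather than the project's actual choices.

```python
def find_boxes(ocr_data, target):
    """Return (left, top, right, bottom) boxes for OCR words containing `target`.

    `ocr_data` is the dict returned by pytesseract.image_to_data(...,
    output_type=pytesseract.Output.DICT), which holds parallel lists.
    """
    if not target:
        return []
    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        if target.lower() in word.lower():
            left, top = ocr_data["left"][i], ocr_data["top"][i]
            right = left + ocr_data["width"][i]
            bottom = top + ocr_data["height"][i]
            boxes.append((left, top, right, bottom))
    return boxes


def highlight_variant(image_path, target, out_path):
    """Draw a red outline around every occurrence of `target` and save a copy."""
    from PIL import Image, ImageDraw
    import pytesseract

    with Image.open(image_path) as source:
        image = source.convert("RGB")  # make sure a colored outline can be drawn
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(image)
    for box in find_boxes(data, target):
        draw.rectangle(box, outline="red", width=3)
    image.save(out_path)
```

A variant name that OCR splits across several words would not be found by this simple per-word match and would need a multi-word search.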

Step 4: Build a standardized presentation format that is applied to every record, so that the data is easier to understand.
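As an illustration only, standardization could mean mapping "Label: value" lines from the OCR text onto a fixed, ordered set of fields. The field list below is hypothetical, as is the assumption that records use a colon between label and value; the write-up does not say which fields the project extracts.

```python
# Hypothetical field list; the real project may use different fields.
FIELDS = ("Patient", "Test", "Gene", "Variant", "Result")


def standardize(text, fields=FIELDS):
    """Map 'Label: value' lines onto a fixed field order, marking missing fields."""
    lookup = {name.lower(): name for name in fields}
    record = {name: "Not found" for name in fields}  # dicts keep insertion order
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        field = lookup.get(label.strip().lower())
        if separator and field and value.strip():
            record[field] = value.strip()
    return record
```

Because every record is returned with the same fields in the same order, results from different labs end up laid out identically on the final page.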

Step 5: Assemble the final output by combining the original image, the highlighted image, and the standardized data into a single PDF file.
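The three-page layout can be sketched with ReportLab alone, as below. The margin, font, and line spacing are assumed values, and `record` is assumed to be a dict of field names to values. How the project uses PyPDF2 is not described, so it is left out of this sketch.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) to fill the given box while keeping the aspect ratio."""
    scale = min(max_width / width, max_height / height)
    return width * scale, height * scale


def build_report(original_path, highlighted_path, record, out_path):
    """Write a PDF: original image, highlighted image, then the standardized data."""
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    page_width, page_height = letter
    margin = 36  # half an inch, in points (assumed layout value)
    pdf = canvas.Canvas(out_path, pagesize=letter)

    # Pages 1 and 2: one image per page, anchored at the top-left margin.
    for path in (original_path, highlighted_path):
        image = ImageReader(path)
        width, height = fit_within(
            *image.getSize(), page_width - 2 * margin, page_height - 2 * margin
        )
        pdf.drawImage(image, margin, page_height - margin - height,
                      width=width, height=height)
        pdf.showPage()

    # Page 3: one "Field: value" line per entry, top to bottom.
    pdf.setFont("Helvetica", 12)
    y = page_height - margin - 12
    for field, value in record.items():
        pdf.drawString(margin, y, f"{field}: {value}")
        y -= 18
    pdf.showPage()
    pdf.save()
```

A long record would run off the bottom of page 3 in this version; a fuller implementation would start a new page when `y` drops below the margin.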

Challenges

Several challenges came up along the way, including figuring out how the various libraries interact and how to extract and separate the data for standardization.

Accomplishments that we're proud of

Get Med Data combines more than four kinds of technologies. Getting all of them to work together was an accomplishment for me.

What we learned

I learned a lot about Tesseract and OCR in general. I also learned a lot about grammar correction in Python using language-tool-python.

What's next for Get Med Data

Get Med Data has many practical applications. In the future, it should work with not just one type of medical record but many kinds of forms, which would significantly broaden its possible use cases.

Built With

  • language-tool-python
  • pil
  • pypdf2
  • pytesseract
  • python
  • reportlab
  • tesseract-ocr