Replace the PDF parsing code with a large language model (already trained) #107

@woodthom2

Description

We currently have a very basic "predict" method here: https://github.com/harmonydata/harmony/blob/main/src/harmony/parsing/pdf_parser.py#L44

Recently we ran a competition to fine-tune an LLM on HuggingFace to better extract questions (and response options) from PDFs. The competition is here and was won by Aashvin.

Aashvin's winning model is at https://fastdatascience.z33.web.core.windows.net/submission-5a83e434-58bc-492d-9852-37cd9128cd7e.tar.gz

I am not sure of the best way to get Aashvin's model into Harmony. One option is to load it directly from this URL. Another is to upload it to our HuggingFace account, https://huggingface.co/harmonydata, so that Harmony loads it from the HuggingFace Hub. Jay Dugad (you can find him on the Discord) could give you access to upload the model to HuggingFace.
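For the "load from this URL" option, here is a minimal sketch using only the Python standard library. It downloads the archive once, unpacks it into a local cache, and returns the directory. The function name `fetch_model_archive`, the cache location, and the `.complete` marker file are all assumptions for illustration; I have not checked what the archive actually contains, so the loading step that follows is left out.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import Optional

# URL of the winning submission, taken from this issue.
MODEL_URL = (
    "https://fastdatascience.z33.web.core.windows.net/"
    "submission-5a83e434-58bc-492d-9852-37cd9128cd7e.tar.gz"
)


def fetch_model_archive(url: str = MODEL_URL, cache_dir: Optional[Path] = None) -> Path:
    """Download a .tar.gz archive once, unpack it, and return the unpacked directory.

    Hypothetical helper: the name, cache layout and marker file are assumptions.
    """
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "harmony_pdf_model"
    target = cache_dir / "extracted"
    marker = cache_dir / ".complete"
    if marker.exists():
        # Already downloaded and unpacked on an earlier call.
        return target

    cache_dir.mkdir(parents=True, exist_ok=True)
    archive_path = cache_dir / "model.tar.gz"
    urllib.request.urlretrieve(url, archive_path)

    with tarfile.open(archive_path, "r:gz") as tar:
        # Basic path traversal check: refuse members that would land outside target.
        root = target.resolve()
        for member in tar.getmembers():
            if not (root / member.name).resolve().is_relative_to(root):
                raise ValueError(f"Unsafe path in archive: {member.name}")
        # Use the stricter "data" extraction filter where this Python version has it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(target, **extra)

    # Written last, so an interrupted extraction is retried next time.
    marker.write_text("ok")
    return target
```

If the model ends up on the HuggingFace Hub instead, `huggingface_hub.snapshot_download(repo_id=...)` would probably replace this helper, with the repo id being whatever name the model is uploaded under.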

You would need to replace the predict method in pdf_parser.py https://github.com/harmonydata/harmony/blob/main/src/harmony/parsing/pdf_parser.py#L44 with the code that runs the model.
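One possible shape for the replacement, as a rough sketch only. I am assuming here that predict takes the document text and returns a list of question strings; the real signature in pdf_parser.py may differ. `make_predict`, `QuestionModel` and `rule_based_predict` are hypothetical names, and `rule_based_predict` is a placeholder, not Harmony's current logic. The idea is to load the model lazily on first use and fall back to the old behaviour if loading fails.

```python
from functools import lru_cache
from typing import Callable, List

# Assumption: a "model" is any callable mapping document text to question strings.
QuestionModel = Callable[[str], List[str]]


def rule_based_predict(text: str) -> List[str]:
    """Placeholder fallback (NOT Harmony's real logic): keep lines ending in '?'."""
    return [line.strip() for line in text.splitlines() if line.strip().endswith("?")]


def make_predict(
    load_model: Callable[[], QuestionModel],
    fallback: QuestionModel = rule_based_predict,
) -> QuestionModel:
    """Build a predict() that loads the model once, on first use."""

    @lru_cache(maxsize=1)
    def get_model() -> QuestionModel:
        # lru_cache stores only successful results, so a failed load is retried.
        return load_model()

    def predict(text: str) -> List[str]:
        try:
            model = get_model()
        except Exception:
            # Model unavailable (e.g. no network): use the fallback instead.
            return fallback(text)
        return model(text)

    return predict
```

Here `load_model` would be whatever function downloads and wraps Aashvin's model. Keeping it as a parameter means the tests can pass in a fake model without downloading anything.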

One difference is that the current PDF extraction model only gets question texts, e.g. "How often do you feel anxious", but not response options ("Somewhat" / "Very often" etc.). The new model gets both.
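Since the new model returns response options as well, the parser output needs somewhere to put them. A minimal sketch, assuming the model output can be reduced to a sequence of `(label, text)` pairs labelled `"question"` or `"option"`. That label scheme, the `ExtractedQuestion` class and `group_spans` are all hypothetical and would need adapting to the model's real output format.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class ExtractedQuestion:
    """One question plus the response options that follow it (hypothetical type)."""
    question_text: str
    options: List[str] = field(default_factory=list)


def group_spans(spans: Iterable[Tuple[str, str]]) -> List[ExtractedQuestion]:
    """Attach each "option" span to the most recent "question" span."""
    questions: List[ExtractedQuestion] = []
    for label, text in spans:
        if label == "question":
            questions.append(ExtractedQuestion(text))
        elif label == "option" and questions:
            questions[-1].options.append(text)
        # Options seen before any question, and unknown labels, are ignored.
    return questions


spans = [
    ("question", "How often do you feel anxious"),
    ("option", "Somewhat"),
    ("option", "Very often"),
]
result = group_spans(spans)
```

With the example spans above, `result` holds a single `ExtractedQuestion` whose `options` are `["Somewhat", "Very often"]`.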

Rationale

Harmony often extracts the wrong text from a PDF, so the current parser is quite inaccurate. Replacing it with the fine-tuned model should improve extraction accuracy.

Metadata

Labels

enhancement (New feature or request)
