Description
We currently have a very basic "predict" method here: https://github.com/harmonydata/harmony/blob/main/src/harmony/parsing/pdf_parser.py#L44
Recently we ran a competition to fine-tune an LLM on HuggingFace to better extract questions (and response options) from PDFs. The competition was won by Aashvin.
Aashvin's winning model is at https://fastdatascience.z33.web.core.windows.net/submission-5a83e434-58bc-492d-9852-37cd9128cd7e.tar.gz
I am not sure of the best way to get Aashvin's model into Harmony. One option is to load it directly from this URL. Another is to upload it to our HuggingFace account, https://huggingface.co/harmonydata, so that Harmony loads it from the HuggingFace Hub. Jay Dugad (you can find him on the Discord) could give you access to upload the model to HuggingFace.
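As a rough illustration of the first option (loading from the URL), here is a minimal sketch that downloads the archive once, unpacks it into a local cache folder, and reuses the cached copy on later calls. The function name `download_and_extract_model`, the cache layout, and the `model` subfolder are assumptions made for this example, not existing Harmony code, and the contents of the archive are not known here.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL of the winning submission, as given in this issue.
MODEL_URL = (
    "https://fastdatascience.z33.web.core.windows.net/"
    "submission-5a83e434-58bc-492d-9852-37cd9128cd7e.tar.gz"
)


def download_and_extract_model(url: str, cache_dir: Path) -> Path:
    """Download a .tar.gz archive and unpack it under cache_dir/model.

    Hypothetical helper: the name and cache layout are assumptions.
    """
    cache_dir = Path(cache_dir).resolve()
    model_dir = cache_dir / "model"

    # Reuse a previous download if the folder already has content.
    if model_dir.is_dir() and any(model_dir.iterdir()):
        return model_dir

    cache_dir.mkdir(parents=True, exist_ok=True)
    archive_path = cache_dir / "model.tar.gz"

    # Stream the archive to disk rather than holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)

    model_dir.mkdir(exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Refuse archive members that would be written outside model_dir.
        for member in archive.getmembers():
            target = (model_dir / member.name).resolve()
            if target != model_dir and model_dir not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(model_dir)

    archive_path.unlink()
    return model_dir
```

Calling `download_and_extract_model(MODEL_URL, Path.home() / ".harmony_models")` would then return the folder containing the unpacked files (the cache location is just an example). If the model were uploaded to the HuggingFace Hub instead, the download step could probably be replaced by `snapshot_download` from the `huggingface_hub` package, with the repository name still to be decided.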
You would need to replace the predict method in pdf_parser.py (https://github.com/harmonydata/harmony/blob/main/src/harmony/parsing/pdf_parser.py#L44) with code that runs the new model.
One difference is that the current PDF extraction model only returns question texts, e.g. "How often do you feel anxious", but not response options ("Somewhat" / "Very often" etc.). The new model returns both.
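To make the change in output shape concrete, here is a hedged sketch of what a replacement predict function might return. It assumes, purely for illustration, that the model can be wrapped as a function that labels each line of text as "question", "answer" or "other"; the real model's input and output format may differ. The names `ExtractedQuestion` and `label_lines` are hypothetical, and the signature does not necessarily match the existing predict method.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExtractedQuestion:
    """A question text plus its response options (hypothetical structure)."""
    question_text: str
    options: List[str] = field(default_factory=list)


# Assumption: the model is wrapped as a callable that takes a list of lines
# and returns one label per line: "question", "answer" or "other".
LineLabeller = Callable[[List[str]], List[str]]


def predict(text: str, label_lines: LineLabeller) -> List[ExtractedQuestion]:
    """Group labelled lines into questions with their response options."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    labels = label_lines(lines)

    questions: List[ExtractedQuestion] = []
    for line, label in zip(lines, labels):
        if label == "question":
            questions.append(ExtractedQuestion(question_text=line))
        elif label == "answer" and questions:
            # Attach each option to the most recent question.
            questions[-1].options.append(line)
        # Lines labelled "other", and options seen before any question, are skipped.
    return questions
```

Passing the model in as `label_lines` keeps the grouping logic testable without loading the model; a stand-in labeller is enough to check that options end up attached to the right question.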
Rationale
Harmony often extracts the wrong text from a PDF, so question extraction is quite inaccurate. Using the new model should improve extraction accuracy.
