-
Notifications
You must be signed in to change notification settings - Fork 51
Replace the PDF parsing code with LLM model prediction #108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi Montygole, Thanks for this PR! I've pulled the branch locally and confirmed the following: The updated predict() function correctly loads and runs the LLM model The remaining TODO to update create_instrument_from_list for parsed answers looks good as a next step. Let me know if you’d like help testing that once it’s added. Great work! |
|
@jaydugad @woodthom2 should the answers be placed in an |
|
The response options of a question object. Thanks
…On Fri, 2 May 2025, 19:40 montygole, ***@***.***> wrote:
*montygole* left a comment (harmonydata/harmony#108)
<#108 (comment)>
@jaydugad <https://github.com/jaydugad> @woodthom2
<https://github.com/woodthom2> should the answers be placed in an Answer
object, or should the answers be the options of a Question object?
—
Reply to this email directly, view it on GitHub
<#108 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADUBTVNSQ67Q3VVDYKHLYJL24O3ZFAVCNFSM6AAAAAB4IXXGFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQNBXHA3DCMRVHA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
…onvert_pdf_to_instruments()
|
@jaydugad @woodthom2 In order to prevent the tests from failing, we should add a huggingface token into the repo's secrets. However, I do not have permissions to do this. We should add this to the workflow yaml file like this: |
|
Hi Monty, can you make this code download the model directly from HuggingFace without needing any kind of authentication? For example, here we download the model for the matcher just by supplying a path to the model. If the model is public you should not need any |
|
@woodthom2 it seems as though the tests are passing. Perhaps huggingface was down when the tests failed on github actions last week. It should be working without needing the HF_TOKEN. |
|
@jaydugad @woodthom2 I have made changes to |
|
Hi @montygole yes thank you! Please can you also update |
|
@woodthom2 Yes. Here is the PR: harmonydata/harmonyapi#24. |
|
It looks like there's a conflict between |
|
Sure! I believe that commit a82ec7b introduced some dependency issues (ex. pyproject.toml contains both |
…ned LLM parser
Description
Replaced PDF parsing code with LLM (https://huggingface.co/harmonydata/debertaV2_pdfparser) loading and prediction. This new
predictfunction will return the questions and answers from a given string of text. This requirestransformersandtorch. I added a test case to check that the function which maps predicted classes (other, question, answer) to relevant substrings works with multiple substrings within each predicted class.TODO: update
harmony.create_instrument_from_listinharmony.util.instrument_helper.pyto handle parsed answers. Add test cases for thepredictfunction.Fixes #107
Type of change
Please delete options that are not relevant.
Testing
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Since the Harmony Python package is used by the Harmony API (which is itself used by the R library and the web app), we need to avoid making any changes that break the Harmony API. Please also run the Harmony API unit tests and check that the API still runs with your changes to the Python package: https://github.com/harmonydata/harmonyapi
Test Configuration
Checklist
requirements.txt,pyproject.tomland also in therequirements.txtin the API repo