Replace the PDF parsing code with LLM model prediction #108

montygole · 2025-05-01T22:02:38Z

Description

Replaced the PDF parsing code with code that loads an LLM (https://huggingface.co/harmonydata/debertaV2_pdfparser) and runs prediction with it. The new predict function returns the questions and answers found in a given string of text. This requires transformers and torch. I added a test case to check that the function that maps predicted classes (other, question, answer) to the relevant substrings works when each predicted class has multiple substrings.
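To illustrate the class-to-substring mapping described above, here is a minimal sketch, not the code in this PR. It assumes the model yields one predicted class per token along with each token's character offsets. The name group_runs_by_label, its signature, and the whitespace tokenisation are hypothetical stand-ins chosen for this example only.

```python
import re
from itertools import groupby


def group_runs_by_label(text, offsets, labels):
    """Merge each run of adjacent tokens sharing a class into one substring.

    offsets: one (start, end) character span per token (assumed input shape).
    labels:  one predicted class per token, e.g. "other", "question", "answer".
    Returns a dict mapping each class to its substrings, in document order.
    """
    if len(offsets) != len(labels):
        raise ValueError("offsets and labels must have the same length")
    grouped = {}
    for label, run in groupby(zip(labels, offsets), key=lambda pair: pair[0]):
        spans = [span for _, span in run]
        # A run covers text from its first token's start to its last token's end.
        start, end = spans[0][0], spans[-1][1]
        grouped.setdefault(label, []).append(text[start:end])
    return grouped


text = "Q1 How old are you? Under 18 Over 18 Q2 Are you happy? Yes No"
# Whitespace tokens stand in for the model's real subword tokens.
offsets = [m.span() for m in re.finditer(r"\S+", text)]
labels = (["other"] + ["question"] * 4 + ["answer"] * 4
          + ["other"] + ["question"] * 3 + ["answer"] * 2)
grouped = group_runs_by_label(text, offsets, labels)
# grouped["question"] == ["How old are you?", "Are you happy?"]
```

Each class ends up with several substrings, which is the case the new test targets. Note that in this sketch adjacent answer options merge into a single span ("Under 18 Over 18"); splitting them into individual options would be a separate step.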

TODO: update harmony.create_instrument_from_list in harmony.util.instrument_helper.py to handle parsed answers. Add test cases for the predict function.

Fixes #107

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Requires a documentation revision

Testing

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Test that multiple question, answer, and "other" tokens are properly classified by the LLM. Perhaps I'm not using valid sample input text.

Since the Harmony Python package is used by the Harmony API (which is itself used by the R library and the web app), we need to avoid making any changes that break the Harmony API. Please also run the Harmony API unit tests and check that the API still runs with your changes to the Python package: https://github.com/harmonydata/harmonyapi

Test Configuration

Library version:
OS: macOS 12.7.6
Toolchain: Python 3.11, PyCharm

Checklist

jaydugad · 2025-05-02T11:07:54Z

Hi Montygole,

Thanks for this PR! I've pulled the branch locally and confirmed the following:

The updated predict() function correctly loads and runs the LLM model
group_token_spans_by_class() handles multi-span token classes well
Tests in test_convert_pdf.py pass without issues
Local package builds correctly with the updated dependencies

The remaining TODO to update create_instrument_from_list for parsed answers looks good as a next step. Let me know if you’d like help testing that once it’s added.

Great work!

montygole · 2025-05-02T18:39:55Z

@jaydugad @woodthom2 should the answers be placed in an Answer object, or should the answers be the options of a Question object?

woodthom2 · 2025-05-02T19:12:25Z

The response options of a question object. Thanks
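As a rough illustration of that decision, the sketch below attaches each question's parsed answers as its response options. QuestionSketch and attach_options are hypothetical stand-ins written for this example; they are not Harmony's Question class or create_instrument_from_list, and the assumption that answers arrive as one list per question is mine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuestionSketch:
    """Hypothetical stand-in for a question object with response options."""
    question_text: str
    options: List[str] = field(default_factory=list)


def attach_options(
    question_texts: List[str],
    answer_texts: Optional[List[List[str]]] = None,
) -> List[QuestionSketch]:
    """Pair each question with its parsed answers, stored as options.

    answer_texts is assumed to hold one list of answers per question;
    when it is omitted, every question gets an empty options list.
    """
    if answer_texts is None:
        answer_texts = [[] for _ in question_texts]
    if len(answer_texts) != len(question_texts):
        raise ValueError("need exactly one answer list per question")
    return [
        QuestionSketch(question_text=q, options=list(a))
        for q, a in zip(question_texts, answer_texts)
    ]


questions = attach_options(
    ["How old are you?", "Are you happy?"],
    [["Under 18", "Over 18"], ["Yes", "No"]],
)
```

Keeping the answers on the question means no separate Answer object is needed, and a question with no parsed answers simply has an empty options list.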

montygole · 2025-05-02T20:28:54Z

@jaydugad @woodthom2 To prevent the tests from failing, we should add a Hugging Face token to the repo's secrets. However, I do not have permission to do this.

We should add this to the workflow yaml file like this:

env:
  HF_TOKEN: ${{ secrets.HF_TOKEN }}

woodthom2 · 2025-05-06T12:57:38Z

Hi Monty, can you make this code download the model directly from HuggingFace without needing any kind of authentication? For example, here we download the model for the matcher just by supplying a path to the model. If the model is public you should not need any HF_TOKEN or secret to be set.
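A minimal sketch of what a token-free download could look like, assuming the model is public and exposes a token-classification head (the second point is an assumption, not something confirmed in this thread). The helper name load_pdf_parser is hypothetical; from_pretrained is the standard transformers loading call.

```python
import os

# Turn off tokenizer parallelism unless the caller has already chosen a value;
# this avoids the fork-related warning the tokenizers library can emit.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

MODEL_ID = "harmonydata/debertaV2_pdfparser"


def load_pdf_parser(model_id: str = MODEL_ID):
    """Download (or reuse from the local cache) the tokenizer and model.

    No HF_TOKEN is passed: a public repository can be fetched by its
    identifier alone. The token-classification head is an assumption.
    """
    # Imported lazily so that importing this module does not itself
    # require transformers and torch to be installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_pdf_parser() needs network access on the first run; later runs read from the local Hugging Face cache.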

montygole · 2025-05-06T18:00:25Z

@woodthom2 The tests now appear to be passing. Perhaps Hugging Face was down when the tests failed on GitHub Actions last week. It should work without needing the HF_TOKEN.

montygole · 2025-05-06T19:07:30Z

@jaydugad @woodthom2 I have made changes to requirements.txt in a new branch on my local fork of the harmonyapi. Shall I create a PR for this once this branch is merged? (see harmonydata/harmonyapi#23)

woodthom2 · 2025-05-07T13:22:52Z

Hi @montygole yes thank you! Please can you also update requirements.txt in the harmonyapi repo and make a separate PR for that. Since your harmonyapi fork will need to import the harmony Python library as a submodule, please can you leave your harmonyapi fork pointing to the main Python library and not to your fork of the Python library if that's possible. Thanks. Please ping me on Discord or here if you have any questions. Thanks so much for the contribution, this is really valued!

montygole · 2025-05-07T17:38:36Z

@woodthom2 Yes. Here is the PR: harmonydata/harmonyapi#24.

jaydugad · 2025-05-08T11:16:00Z

It looks like there's a conflict between XlsxWriter==3.0.9 and XlsxWriter>=3.2.3. Could you please choose one version that’s compatible across environments and update the dependency accordingly?

montygole · 2025-05-09T17:48:01Z

Sure! I believe that commit a82ec7b introduced some dependency issues (e.g. pyproject.toml contains both "XlsxWriter>=3.2.3; python_version <= '3.13.3'" and "XlsxWriter==3.0.9; python_version <= '3.13'"). The python_version marker only uses the major.minor format (see https://packaging.pypa.io/en/stable/markers.html#:~:text=python_version%20(str)%20%E2%80%93%20The%20Python%20version%20as%20string%20%27major.minor%27), so these lines can be changed either to python_full_version <= '3.13.3' or to python_version <= '3.13'.
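To make the marker point concrete, the sketch below derives the two marker values the way the environment-marker specification defines them, then checks both conditions with a simplified comparison. The helpers release and version_le are hypothetical and only approximate PEP 440 ordering for plain numeric versions; real installers use the packaging library instead.

```python
import platform
import re

# Marker values as the environment-marker specification defines them.
python_version = ".".join(platform.python_version_tuple()[:2])  # major.minor
python_full_version = platform.python_version()                 # e.g. major.minor.micro


def release(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '3.13.3' -> (3, 13, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a numeric version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def version_le(left: str, right: str) -> bool:
    """Simplified 'left <= right': missing components count as zero."""
    a, b = release(left), release(right)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a <= b


# Both markers from pyproject.toml, evaluated against this interpreter.
newer_pin_applies = version_le(python_version, "3.13.3")  # XlsxWriter>=3.2.3
older_pin_applies = version_le(python_version, "3.13")    # XlsxWriter==3.0.9
print(python_version, python_full_version, newer_pin_applies, older_pin_applies)
```

Because python_version never carries a micro component, both conditions hold on any interpreter up to 3.13, so both XlsxWriter requirements would be selected together, which is consistent with the conflict reported above.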

montygole · 2025-05-09T19:11:12Z

@jaydugad I have fixed the dependency issues in the following commits: 41691be and a56af86.

montygole added 9 commits May 1, 2025 16:36

Refactor predict() and convert_pdf_to_instruments() to handle fine-tuned LLM parser

daf5637

Reformat code with Pycharm linter

89dbe1a

Add default tokenizer to token grouping function

d62b265

Update requirements.txt

fbd75c7

Disable tokenizer parallelism (prevents warning)

b5c262e

Fix test case

255e97a

Add group_token_spans_by_class to __init__.py

0524196

Remove print statement. Applied default Pycharm linter

4f97d1e

Updated requirements in pyproject.toml

ab7d83f

montygole mentioned this pull request May 1, 2025

Replace the PDF parsing code with a large language model (already trained) #107

Closed

jaydugad marked this pull request as ready for review May 2, 2025 11:07

Merge branch 'main' into llm_predict

297caa6

montygole added 4 commits May 2, 2025 15:57

Integrated parsed answers into create_instrument_from_list. Updated convert_pdf_to_instruments()

82863e6

Updated implementation of create_instrument_from_list() in test cases

81af90d

Added test case for multiple answer_texts in create_instrument_from_list

b2d6f13

Merge remote-tracking branch 'origin/llm_predict' into llm_predict

38d1ec3

montygole added 2 commits May 6, 2025 14:01

Merge branch 'main' into llm_predict

0d7942e

update function usage in test case to match new implementation

4d1450d

montygole mentioned this pull request May 6, 2025

Update requirements.txt based on updates from harmony PR #108 for issue #107 harmonydata/harmonyapi#23

Open

montygole added a commit to montygole/harmonyapi that referenced this pull request May 6, 2025

Update requirements.txt in line with harmonydata/harmony#108

3d7a81a

Merge branch 'main' into llm_predict

fe1b288

montygole added 2 commits May 9, 2025 14:52

Updated pyproject.toml and requirements.txt

a56af86

Updated pyproject.toml

41691be

jaydugad merged commit d8511c5 into harmonydata:main May 12, 2025
1 check passed
