Skip to content

Conversation

@montygole
Copy link
Contributor

…ned LLM parser

Description

Replaced PDF parsing code with LLM (https://huggingface.co/harmonydata/debertaV2_pdfparser) loading and prediction. This new predict function will return the questions and answers from a given string of text. This requires transformers and torch. I added a test case to check that the function which maps predicted classes (other, question, answer) to relevant substrings works with multiple substrings within each predicted class.

TODO: update harmony.create_instrument_from_list in harmony.util.instrument_helper.py to handle parsed answers. Add test cases for the predict function.

Fixes #107

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Requires a documentation revision

Testing

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Test that multiple questions, answer and "other" tokens are properly classified by the LLM. Perhaps I'm not using valid sample input text.

Since the Harmony Python package is used by the Harmony API (which is itself used by the R library and the web app), we need to avoid making any changes that break the Harmony API. Please also run the Harmony API unit tests and check that the API still runs with your changes to the Python package: https://github.com/harmonydata/harmonyapi

Test Configuration

  • Library version:
  • OS: MacOS 12.7.6
  • Toolchain: Python3.11, PyCharm

Checklist

  • My PR is for one issue, rather than for multiple unrelated fixes.
  • My code follows the style guidelines of this project. I have applied a Linter (recommended: Pycharm's code formatter) to make my whitespace consistent with the rest of the project.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings
  • The Harmony API is not broken by my change to the Harmony Python library
  • I add third party dependencies only when necessary. If I changed the requirements, it changes in requirements.txt, pyproject.toml and also in the requirements.txt in the API repo
  • If I introduced a new feature, I documented it (e.g. making a script example in the script examples repository so that people will know how to use it.

@jaydugad
Copy link
Collaborator

jaydugad commented May 2, 2025

Hi Montygole,

Thanks for this PR! I've pulled the branch locally and confirmed the following:

The updated predict() function correctly loads and runs the LLM model
group_token_spans_by_class() handles multi-span token classes well
Tests in test_convert_pdf.py pass without issues
Local package builds correctly with the updated dependencies

The remaining TODO to update create_instrument_from_list for parsed answers looks good as a next step. Let me know if you’d like help testing that once it’s added.

Great work!

@jaydugad jaydugad marked this pull request as ready for review May 2, 2025 11:07
@montygole
Copy link
Contributor Author

@jaydugad @woodthom2 should the answers be placed in an Answer object, or should the answers be the options of a Question object?

@woodthom2
Copy link
Contributor

woodthom2 commented May 2, 2025 via email

@montygole
Copy link
Contributor Author

montygole commented May 2, 2025

@jaydugad @woodthom2 In order to prevent the tests from failing, we should add a huggingface token into the repo's secrets. However, I do not have permissions to do this.

We should add this to the workflow yaml file like this:

env:
      HF_TOKEN: ${{ secrets.HF_TOKEN }}

@woodthom2
Copy link
Contributor

Hi Monty, can you make this code download the model directly from HuggingFace without needing any kind of authentication? For example, here we download the model for the matcher just by supplying a path to the model. If the model is public you should not need any HF_TOKEN or secret to be set.

@montygole
Copy link
Contributor Author

@woodthom2 it seems as though the tests are passing. Perhaps huggingface was down when the tests failed on github actions last week. It should be working without needing the HF_TOKEN.

@montygole
Copy link
Contributor Author

montygole commented May 6, 2025

@jaydugad @woodthom2 I have made changes to requirements.txt in a new branch on my local fork of the harmonyapi. Shall I create a PR for this once this branch is merged? (see harmonydata/harmonyapi#23)

@woodthom2
Copy link
Contributor

Hi @montygole yes thank you! Please can you also update requirements.txt in the harmonyapi repo and make a separate PR for that. Since your harmonyapi fork will need to import the harmony Python library as a submodule, please can you leave your harmonyapi fork pointing to the main Python library and not to your fork of the Python library if that's possible. Thanks. Please ping me on Discord or here if you have any questions. Thanks so much for the contribution, this is really valued!

@montygole
Copy link
Contributor Author

@woodthom2 Yes. Here is the PR: harmonydata/harmonyapi#24.

@jaydugad
Copy link
Collaborator

jaydugad commented May 8, 2025

It looks like there's a conflict between XlsxWriter==3.0.9 and XlsxWriter>=3.2.3. Could you please choose one version that’s compatible across environments and update the dependency accordingly?

@montygole
Copy link
Contributor Author

Sure! I believe that commit a82ec7b introduced some dependency issues (ex. pyproject.toml contains both "XlsxWriter>=3.2.3; python_version <= '3.13.3'", and "XlsxWriter==3.0.9; python_version <= '3.13'",). The python_version marker only uses the major.minor versioning format (see https://packaging.pypa.io/en/stable/markers.html#:~:text=python_version%20(str)%20%E2%80%93%20The%20Python%20version%20as%20string%20%27major.minor%27) so these lines can either be changed to python_full_version <= '3.13.3' or to 'python_version <= '3.13''.

@montygole
Copy link
Contributor Author

montygole commented May 9, 2025

@jaydugad I have fixed dependency issues with the following commits: 41691be a56af86

@jaydugad jaydugad merged commit d8511c5 into harmonydata:main May 12, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace the PDF parsing code with a large language model (already trained)

3 participants