Skip to content

Add a title guess method to get "better" title#12018

Merged
koppor merged 24 commits into
JabRef:mainfrom
leaf-soba:close-issue-11999
Oct 30, 2024
Merged

Add a title guess method to get "better" title#12018
koppor merged 24 commits into
JabRef:mainfrom
leaf-soba:close-issue-11999

Conversation

@leaf-soba

Copy link
Copy Markdown
Contributor
  • Closes:Improve BibTeX-from-PDF import #11999
  • As the issue said I should "traditional" (non-AI) code, so I write a AI-liked code in a simple traditional way. Since I don't have enough data, so the rules is limited now.
  • I didn't add it PdfMergeMetadataImporterTest now, since I'm not sure if we like this kind of method or not, so I want to set up the PR with a demo code to discuss it first. I'll add it to PdfMergeMetadataImporter with a better clear code if we like it.

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Add title guess method
fix unit test

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

update unit test to JDK 21 style

@koppor koppor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

}

@Test
void se2Pdfs() throws Exception {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void se2Pdfs() throws Exception {
void se2Pdf() throws Exception {

Comment thread src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java Outdated
update unit test
update get title by area
@leaf-soba

Copy link
Copy Markdown
Contributor Author

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

Wow, I thought I wrote something cool and AI-like method, but it looks like it’s not quite what we need! 😅 I’ve now added a method to grab the title by position. This is just a demo code to see if the approach works for us—there’s definitely room for improvement, especially in formatting and commented code. If we’re happy with it, I’ll clean things up and refactor the org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent input!

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

remove StringUtils.isBlank and add @AllowedToUseAwt

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@leaf-soba leaf-soba requested a review from koppor October 21, 2024 12:26
@koppor

koppor commented Oct 21, 2024

Copy link
Copy Markdown
Member
  • Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.


I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

image

@leaf-soba

Copy link
Copy Markdown
Contributor Author
  • Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.

I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

image

Sorry, so far I have only tested it with the single file mentioned in the issue, se2paper.pdf. I haven’t written academic papers in years, so I’m not very familiar with the IEEE and LNCS formats. I’ll do my best to learn and research them from the test data tomorrow

ToDo:find a minimal pdf for test
@koppor koppor added the status: changes-required Pull requests that are not yet complete label Oct 23, 2024
@koppor koppor marked this pull request as draft October 23, 2024 19:10
@leaf-soba

Copy link
Copy Markdown
Contributor Author

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder, new code can pass them all so far, if we need more test I can add it.
The new idea is the title should be the largest font size in most of papers which I checked in the test data.

@leaf-soba leaf-soba marked this pull request as ready for review October 25, 2024 03:19
RemoveTestPrefix
I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?
fix the unit test and open rewrite issue
@koppor

koppor commented Oct 25, 2024

Copy link
Copy Markdown
Member

Remove commented code. You have it in your git history if you ever need it.

@leaf-soba

Copy link
Copy Markdown
Contributor Author

Remove commented code. You have it in your git history if you ever need it.

sorry it is a dirty code now, I want to set up a demo code to check if we like this solution or not, if the solution confirmed I'll refactor it.

remove commented code
@koppor

koppor commented Oct 25, 2024

Copy link
Copy Markdown
Member

Please propose a PR to the user documentation at https://docs.jabref.org/collect/findunlinkedfiles#pdfs-for-which-it-works - explaining the larger titles.

@koppor

koppor commented Oct 25, 2024

Copy link
Copy Markdown
Member

I am currently very busy, thus quick replies only. Sorry for that.

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder,

Sounds good.

The idea using the largest object is smart! For future work, maybe, the second page needs to be checked, too. Sometimes, there is a cover page in front. I currently do not have an example at hand, but I don't have an example at hand - and I think, the heuristics will get harder then.

if we need more test I can add it.

Can you add follwing papers as test cases:

  1. https://link.springer.com/article/10.1007/s10664-023-10367-y
  2. https://link.springer.com/article/10.1007/s10664-020-09875-y --> note that it would be great if the sub title was parsed
  3. https://onlinelibrary.wiley.com/doi/10.1002/spe.3169
  4. https://peerj.com/articles/cs-213/#
  5. https://dl.acm.org/doi/10.1145/3597503.3639130 (ACM format)

When adding, add a README.md to the folder where there are stored with the source of the PDF. I mean, a table with filename as first column and link as second column.


In all these papers, you can also test for the abstract.

In all thesse papers, you can test for the authors (which is probably a brain-teaser and left for future work).

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

remove Blank line at start of block
rename and replace unit test file
add bib and readme.md
koppor
koppor previously approved these changes Oct 30, 2024

@koppor koppor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this functionality, therefore merging.

Please try to adress the comments in a follow-up PR.

@@ -0,0 +1,110 @@
@inproceedings{se2paper,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The bib file should be in THE SAME directory as the .pdf files
  • Set . as library-specific directory (in the library properties)
  • Have a file attached to each entry in the bib file.
  • Try to improve BibTeX data in the bib file:
    • Add DOI
    • Fix authors
    • Add Year
    • Generate citation key
    • Rename PDF files to nicer name.

@koppor koppor enabled auto-merge October 30, 2024 00:01
@koppor koppor disabled auto-merge October 30, 2024 00:01
@koppor

koppor commented Oct 30, 2024

Copy link
Copy Markdown
Member

I added a CHANGELOG.md entry

@koppor koppor enabled auto-merge October 30, 2024 00:03

@ParameterizedTest
@MethodSource("providePdfData")
void pdfTitleExtraction(String filePath, String expectedTitle) throws Exception {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up pull request: swap the arguments.

at assert in JUnit, first the expected value appears, then the input data. - This should also be done when wrapping inside @ParameterizedTests

@koppor

koppor commented Oct 30, 2024

Copy link
Copy Markdown
Member

@leaf-soba I think, you need to do some cleanup of the filenames before a merge can go through. See test output on Windows test:

 Error: error: invalid path 'src/test/resources/pdfs/PdfContentImporter/peerj-cs-213 - On the impact of service-oriented patterns on software evolvability: a controlled experiment and metric-based analysis.pdf'

Sorry für that.

Maybe, you can adress all comments by me then.

rename the file to pass CI
auto-merge was automatically disabled October 30, 2024 03:04

Head branch was pushed to by a user without write access

address all comments
@leaf-soba

leaf-soba commented Oct 30, 2024

Copy link
Copy Markdown
Contributor Author

Sorry I know it should be in another follow-up PR as you said before we should keep PR small but this one is too big, but I need to change the file name to pass the CI workflow, so it is a little hard to separate them.

Set . as library-specific directory (in the library properties)
This one is only comment I didn't finish, since I don't understand it very clearly.

Do you mean:

  1. Open JabRef, load my BibTeX file.
  2. Go to Library → Library Properties.
  3. In File Directory, enter a .
  4. save my settings

But I don't know how to change it in code or project file as a PR, could you explain it in the follow-up PR.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

fix the file name in unit test
@leaf-soba leaf-soba requested a review from koppor October 30, 2024 03:42
@koppor koppor added this pull request to the merge queue Oct 30, 2024
@koppor

koppor commented Oct 30, 2024

Copy link
Copy Markdown
Member

Do you mean:

[...]]

4. save my settings

Yes.

Then save the file. The .bib file is modified then. Then you can normally usie "git" to commit ato a new branch.

Since this is not a show-stopper, I let this PR an and we can do that in a follow-up PR.

@koppor koppor removed the status: changes-required Pull requests that are not yet complete label Oct 30, 2024
Merged via the queue into JabRef:main with commit 7232836 Oct 30, 2024
@@ -0,0 +1,128 @@
@inproceedings{Kriha2018Teaching,
author = {Walter Kriha and Tobias Jordine}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is INVALID BibTeX. All commans are missing.

In the future: Please try to open the file in JabRef.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'll open the file in JabRef next time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants