Filename like XMP titles overriding correct content extracted titles #15339

Merged
Siedlerchr merged 8 commits into JabRef:main from faneeshh:fix-11999
Mar 16, 2026

Conversation

@faneeshh faneeshh commented Mar 13, 2026

Collaborator

Related issues and pull requests

Closes #11999

PR Description

PdfMergeMetadataImporter.mergeCandidates() merges on a first-come, first-served basis, so when a PDF's XMP metadata contains a filename-like title (Microsoft Word - ieee_on_how_we_teach_jul_01.docx), that title silently takes precedence over the correctly extracted title from PdfContentImporter. So far I've just added a failing test using Kriha2018.pdf, which has exactly this metadata, to document the bug.

Added an isTitleLikelyFilename() heuristic and post-merge override logic in mergeCandidates() so that a filename-like title is replaced by the first candidate title that does not look like a filename.
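
The heuristic and the post-merge override described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the merged code: plain strings stand in for BibEntry titles, the class name FilenameTitleSketch and the helper chooseTitle are hypothetical, and the use of matches() (whole-string match) is an assumption. The regex is the one first proposed in this PR, before it was refactored during review.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative sketch only: plain strings stand in for JabRef's BibEntry titles.
final class FilenameTitleSketch {

    // Pattern as first proposed in this PR (it was refactored later during review).
    private static final Pattern FILENAME_TITLE_PATTERN = Pattern.compile(
            "(?i)(.*\\.(docx|doc|pdf|tex|odt|rtf|ps|eps|html|htm|pptx|ppt|xlsx)$"
                    + "|microsoft (word|powerpoint|excel).*|.*\\\\.*)");

    static boolean isTitleLikelyFilename(String title) {
        // Assumption: the whole title must match; the real method may differ.
        return FILENAME_TITLE_PATTERN.matcher(title).matches();
    }

    // Hypothetical helper: keeps the merged title unless it looks like a
    // filename and some candidate offers a title that does not.
    static String chooseTitle(String mergedTitle, List<String> candidateTitles) {
        if (!isTitleLikelyFilename(mergedTitle)) {
            return mergedTitle;
        }
        Optional<String> better = candidateTitles.stream()
                .filter(title -> !isTitleLikelyFilename(title))
                .findFirst();
        return better.orElse(mergedTitle);
    }
}
```

If no candidate has a usable title, the filename-like title is kept, so the override never removes information.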

Steps to test

Download this PDF (https://kriha.de/dload/se2paper.pdf)

  1. Create a new library and go to File -> Import
  2. Select the downloaded PDF
  3. The title should now show the real paper title instead of the Word filename

Checklist

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • I added screenshots in the PR description (if change is visible to the user)
  • [/] I added a screenshot in the PR description showing a library with a single entry with me as author and as title the issue number
  • I described the change in CHANGELOG.md in a way that can be understood by the average user (if change is visible to the user)
  • I checked the user documentation for up to dateness and submitted a pull request to our user documentation repository

@github-actions github-actions Bot added good second issue Issues that involve a tour of two or three interweaved components in JabRef status: no-bot-comments labels Mar 13, 2026

public class PdfMergeMetadataImporter extends PdfImporter {

    private static final Logger LOGGER = LoggerFactory.getLogger(PdfMergeMetadataImporter.class);
    private static final Pattern FILENAME_TITLE_PATTERN = Pattern.compile("(?i)(.*\\.(docx|doc|pdf|tex|odt|rtf|ps|eps|html|htm|pptx|ppt|xlsx)$|microsoft (word|powerpoint|excel).*|.*\\\\.*)");

Member

Maybe this could reuse ExternalFileTypes? Please investigate.

Collaborator Author

StandardExternalFileType is in jabgui, so can we use it in jablib?

StandardFileType is missing MS Office formats (docx, pptx, xlsx) and eps.

Member

Then add them to StandardFileType

Collaborator Author

Done. Also had to filter out ANY_FILE since its * extension is a regex metacharacter.
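
A rough sketch of what building the pattern from a list of extensions might look like, including the wildcard filter mentioned above. This is an illustration under assumptions: a hard-coded list stands in for the extensions supplied by StandardFileType, and the class and method names are hypothetical. An unescaped `*` placed directly after `|` in the alternation has nothing to repeat, which is why the wildcard entry is dropped before the regex is assembled.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch: a plain list stands in for the extensions of StandardFileType.
final class ExtensionPatternSketch {

    // Hypothetical helper: builds ".*\.(ext1|ext2|...)$" from the given extensions.
    static Pattern buildFilenamePattern(List<String> extensions) {
        String alternation = extensions.stream()
                .filter(extension -> !"*".equals(extension)) // drop the "any file" wildcard
                .map(Pattern::quote)                         // treat each extension literally
                .collect(Collectors.joining("|"));
        return Pattern.compile(".*\\.(" + alternation + ")$", Pattern.CASE_INSENSITIVE);
    }
}
```

Quoting each extension with Pattern.quote is a defensive choice here; it is not something the PR is known to do.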

@github-actions github-actions Bot added status: changes-required Pull requests that are not yet complete and removed status: no-bot-comments labels Mar 14, 2026
@faneeshh faneeshh marked this pull request as ready for review March 14, 2026 20:26
@qodo-free-for-open-source-projects

Contributor

Review Summary by Qodo

Fix PDF import preferring filename-like XMP titles over content titles

🐞 Bug fix

Walkthroughs

Description
• Prevent filename-like XMP titles from overriding correctly extracted PDF content titles
• Added heuristic to detect filename patterns in title metadata
• Implement post-merge logic to replace suspicious titles with better alternatives
• Added test case validating fix with real PDF containing Word filename metadata
Diagram
flowchart LR
  A["PDF with XMP metadata"] -->|Extract candidates| B["PdfMergeMetadataImporter"]
  B -->|Merge candidates| C["Check if title is filename-like"]
  C -->|Yes| D["Replace with real title from content"]
  C -->|No| E["Keep original title"]
  D --> F["Final BibEntry"]
  E --> F

File Changes

1. jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java 🐞 Bug fix +22/-3

Add filename detection and title override logic

• Added FILENAME_TITLE_PATTERN regex to detect filename-like titles (docx, doc, pdf, tex, etc.
 extensions and Microsoft Office patterns)
• Implemented isTitleLikelyFilename() method to check if a title matches filename patterns
• Modified mergeCandidates() to accept List<BibEntry> instead of Stream<BibEntry>
• Added post-merge override logic to replace filename-like titles with the first real title found
 from candidates
• Changed import from Stream to Pattern for regex matching
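
The switch from Stream<BibEntry> to List<BibEntry> listed above fits a method that needs two passes over the candidates: one for the merge and one for the title override. That motivation is an inference, not something the PR states. The sketch below only demonstrates the general Java behaviour behind it, with strings standing in for entries and hypothetical class and method names.

```java
import java.util.List;
import java.util.stream.Stream;

// Illustrative sketch: strings stand in for BibEntry candidates.
final class StreamReuseSketch {

    // Returns true when a second traversal of the same Stream is rejected.
    static boolean secondPassFails(Stream<String> titles) {
        titles.forEach(title -> { });       // first pass, e.g. the merge
        try {
            titles.forEach(title -> { });   // second pass, e.g. the title override
            return false;
        } catch (IllegalStateException e) {
            return true;                    // a Stream can only be consumed once
        }
    }

    // A List can be streamed as often as needed.
    static long countTwice(List<String> titles) {
        return titles.stream().count() + titles.stream().count();
    }
}
```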

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java


2. jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java 🧪 Tests +21/-0

Add test for filename title override behavior

• Added test filenameLikeTitleFromXmpIsOverriddenByContentTitle() to verify the fix
• Test uses Kriha2018.pdf which contains "Microsoft Word - ieee_on_how_we_teach_jul_01.docx" as XMP
 title
• Validates that the real title "On How We Can Teach – Exploring New Ways in Professional Software
 Development for Students" is extracted instead
• Mocks GrobidPreferences and ImportFormatPreferences for offline testing

jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java


3. CHANGELOG.md 📝 Documentation +1/-0

Document PDF title import fix

• Added entry documenting the fix for PDF import preferring content-extracted titles over
 filename-like XMP metadata
• References issue #11999

CHANGELOG.md


@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Mar 14, 2026

Contributor

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (3) 📎 Requirement gaps (0)

Remediation recommended

1. Changelog uses “filename like” 📘 Rule violation ✓ Correctness
Description
The new CHANGELOG entry contains slightly unpolished phrasing (“filename like”) that should be
corrected for professional, precise user-facing text. This affects end-user readability and
translation quality expectations for release notes.
Code

CHANGELOG.md[37]

+- We fixed PDF import to prefer the content extracted title over filename like XMP metadata titles. [#11999](https://github.com/JabRef/jabref/issues/11999)
Evidence
PR Compliance requires professional, correct user-facing text and end-user-focused changelog
entries. The added CHANGELOG line includes the phrase “filename like”, which should be
“filename-like” (and is user-facing release text).

AGENTS.md
CHANGELOG.md[37-37]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The newly added CHANGELOG entry contains slightly incorrect/awkward user-facing wording ("filename like").
## Issue Context
CHANGELOG.md is user-facing release text and should be grammatically correct and professional.
## Fix Focus Areas
- CHANGELOG.md[37-37]


2. Test uses result.getFirst() 📘 Rule violation ⛯ Reliability
Description
The new test calls result.getFirst() without asserting the list is non-empty, which can throw and
obscure the intended assertion failure. This violates the guideline to avoid unsafe list access
without presence checks.
Code

jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[187]

+                result.getFirst().getTitle());
Evidence
The compliance checklist explicitly calls out getFirst() as an unsafe access pattern when used
without emptiness checks. The added test directly calls result.getFirst() without first asserting
result is non-empty.

jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[185-187]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test uses `result.getFirst()` without checking that `result` is non-empty, which can throw `NoSuchElementException` and make failures less diagnosable.
## Issue Context
The project compliance rules discourage unsafe list/optional access patterns (including `getFirst()`) without presence checks.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[180-187]
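
One way to make the failure mode described above diagnosable is to check for an empty result before reading the first element. The sketch below is plain Java without JUnit so that it stays self-contained; the class and method names are hypothetical and strings stand in for entry titles. In the actual test, a JUnit assertion on the whole list would serve the same purpose.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch in plain Java; a real JabRef test would use JUnit assertions.
final class FirstResultSketch {

    // Returns the first title, or an empty Optional when the importer produced nothing.
    static Optional<String> firstTitle(List<String> titles) {
        return titles.stream().findFirst();
    }

    // Fails with a descriptive message instead of NoSuchElementException on an empty result.
    static void checkFirstTitle(String expected, List<String> titles) {
        String actual = firstTitle(titles)
                .orElseThrow(() -> new AssertionError("importer returned no entries"));
        if (!expected.equals(actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }
}
```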


3. New BibEntry uses setField 📘 Rule violation ✓ Correctness
Description
mergeCandidates constructs a new BibEntry and then uses entry.setField(...) to set the title,
despite the preference for withField when creating new entries. This diverges from the project’s
preferred BibEntry construction style.
Code

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[R165-172]

+        entry.getField(StandardField.TITLE)
+             .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
+             .ifPresent(title -> candidates.stream()
+                                           .map(candidate -> candidate.getField(StandardField.TITLE))
+                                           .flatMap(Optional::stream)
+                                           .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
+                                           .findFirst()
+                                           .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));
Evidence
The compliance checklist requests using BibEntry withers (withField) rather than setField when
constructing new entries. In the modified method, a new BibEntry is created and later mutated via
setField for the title.

AGENTS.md
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[162-172]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A newly created `BibEntry` is mutated via `setField` in the modified code, while JabRef conventions prefer using `withField` when creating new entries.
## Issue Context
`mergeCandidates` creates `final BibEntry entry = new BibEntry()` and then sets the title via `entry.setField(...)`.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[162-172]


4. LaTeX title misclassified 🐞 Bug ✓ Correctness
Description
PdfMergeMetadataImporter.isTitleLikelyFilename treats any title containing a backslash as
filename-like, so a legitimate LaTeX-encoded title (e.g., "\\section{X}" or "$\\alpha$") can be
considered a filename and then replaced by a different candidate title during mergeCandidates,
changing intended merge priority for the title field.
Code

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[R165-172]

+        entry.getField(StandardField.TITLE)
+             .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
+             .ifPresent(title -> candidates.stream()
+                                           .map(candidate -> candidate.getField(StandardField.TITLE))
+                                           .flatMap(Optional::stream)
+                                           .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
+                                           .findFirst()
+                                           .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));
Evidence
The merge override logic replaces the merged TITLE whenever isTitleLikelyFilename(...) returns true.
The new heuristic returns true for any backslash because the regex includes the alternative
".*\\\\.*". The codebase explicitly treats backslash-containing LaTeX as valid title content (so
backslash is not, by itself, a filename signal). Also, PdfMergeMetadataImporter runs a BibTeX
parser-based importer (PdfVerbatimBibtexImporter), which can produce LaTeX-escaped field values, so
this misclassification can realistically occur when a candidate title contains LaTeX commands.

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[42-58]
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[153-172]
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfVerbatimBibtexImporter.java[27-33]
jablib/src/test/java/org/jabref/logic/integrity/LatexIntegrityCheckerTest.java[40-54]
jablib/src/test/java/org/jabref/logic/cleanup/CleanupWorkerTest.java[150-152]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`PdfMergeMetadataImporter` uses a heuristic regex to detect filename-like titles. The current regex treats **any** backslash in a title as filename-like (`.*\\.*`), which can misclassify legitimate LaTeX-encoded titles and cause `mergeCandidates(...)` to replace them with another candidate title.
### Issue Context
The override logic is executed when the merged `TITLE` is considered filename-like; at that point the code replaces the title with the first alternative non-filename-like title.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[42-43]
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[153-172]
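
The misclassification can be reproduced in isolation. The sketch below takes only the backslash alternative from the originally proposed pattern and compares it with a narrower check that requires a Windows drive prefix. WINDOWS_PATH and the class name are hypothetical and shown purely for contrast; this is not the fix that was adopted in the PR.

```java
import java.util.regex.Pattern;

// Illustrative sketch; WINDOWS_PATH is a hypothetical alternative, not the adopted fix.
final class BackslashHeuristicSketch {

    // The backslash alternative of the proposed pattern: any backslash anywhere.
    static final Pattern ANY_BACKSLASH = Pattern.compile(".*\\\\.*");

    // Hypothetical narrower signal: a drive letter followed by ":\" at the start.
    static final Pattern WINDOWS_PATH = Pattern.compile("(?i)^[a-z]:\\\\.*");

    static boolean matchesAnyBackslash(String title) {
        return ANY_BACKSLASH.matcher(title).matches();
    }

    static boolean matchesWindowsPath(String title) {
        return WINDOWS_PATH.matcher(title).matches();
    }
}
```

With these definitions, a LaTeX title such as "Decay of $\alpha$ particles" matches ANY_BACKSLASH but not WINDOWS_PATH, while a path such as C:\Users\me\draft.docx matches both.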


@github-actions github-actions Bot added status: no-bot-comments and removed status: changes-required Pull requests that are not yet complete labels Mar 14, 2026
Comment on lines +176 to +183
entry.getField(StandardField.TITLE)
     .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
     .ifPresent(title -> candidates.stream()
                                   .map(candidate -> candidate.getField(StandardField.TITLE))
                                   .flatMap(Optional::stream)
                                   .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
                                   .findFirst()
                                   .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));

Member

Just a minor code-style nitpick: a second nested ifPresent makes it a bit less readable. Just store the result of findFirst in a new Optional and check that with ifPresent.

Member

You can call stream() directly on an Optional, so that should work.

Collaborator Author

Noted.
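
For readers following the thread, here is one possible shape of the flattened logic the reviewers ask for. It is a sketch under assumptions, not the code that was merged: Optional<String> values stand in for the results of BibEntry.getField, the filename check is passed in as a predicate, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative sketch: Optional<String> values stand in for BibEntry#getField results.
final class TitleOverrideSketch {

    // Hypothetical helper: returns a replacement title, or empty if none is needed or available.
    static Optional<String> findBetterTitle(Optional<String> mergedTitle,
                                            List<Optional<String>> candidateTitles,
                                            Predicate<String> looksLikeFilename) {
        // Only search when the merged title exists and looks like a filename.
        if (mergedTitle.filter(looksLikeFilename).isEmpty()) {
            return Optional.empty();
        }
        return candidateTitles.stream()
                .flatMap(Optional::stream)           // skip candidates without a title
                .filter(looksLikeFilename.negate())
                .findFirst();
    }
}
```

The caller then needs a single ifPresent on the returned Optional to set the title, which removes the nesting the review comment points at.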

@testlens-app

testlens-app Bot commented Mar 15, 2026


✅ All tests passed ✅

🏷️ Commit: ea5c97c
▶️ Tests: 10162 executed
⚪️ Checks: 67/67 completed



@koppor koppor added status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers and removed status: no-bot-comments labels Mar 16, 2026
@Siedlerchr Siedlerchr added this pull request to the merge queue Mar 16, 2026
@github-actions github-actions Bot added the status: to-be-merged PRs which are accepted and should go into the merge-queue. label Mar 16, 2026
Merged via the queue into JabRef:main with commit d06f7ed Mar 16, 2026
67 checks passed
faneeshh added a commit to faneeshh/jabref that referenced this pull request Mar 16, 2026
…abRef#15339)

* Added failing test for filename like title override in PdfMergeMetadataImporterTest

* Prefer content extracted title over filename like XMP title

* Changelog

* Changelog grammar

* Refactor to use StandardFileType extensions

* Simplified mergeCandidates title override logic
AnvitaPrasad pushed a commit to AnvitaPrasad/jabref that referenced this pull request Mar 18, 2026
FynnianB pushed a commit to FynnianB/jabref that referenced this pull request Mar 19, 2026
Labels

good second issue Issues that involve a tour of two or three interweaved components in JabRef status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers status: to-be-merged PRs which are accepted and should go into the merge-queue.

Development

Successfully merging this pull request may close these issues.

Improve BibTeX-from-PDF import

4 participants