Filename like XMP titles overriding correct content extracted titles by faneeshh · Pull Request #15339 · JabRef/jabref

faneeshh · 2026-03-13T22:12:31Z

Related issues and pull requests

Closes #11999

PR Description

PdfMergeMetadataImporter.mergeCandidates() merges on a first-come, first-served basis, so when a PDF's XMP metadata contains a filename-like title (Microsoft Word - ieee_on_how_we_teach_jul_01.docx), it silently overrides the correctly extracted title from PdfContentImporter. So far I've just added a failing test using Kriha2018.pdf, which has exactly this metadata, to document the bug.

Added an isTitleLikelyFilename() heuristic and post-merge override logic in mergeCandidates() so that filename-like titles are replaced by the best available real title.
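
To illustrate why first-come, first-served merging produces this bug, here is a minimal sketch. It uses plain maps as stand-ins for BibEntry and an assumed candidate order in which the XMP-derived entry comes first; the class and method names are hypothetical and are not JabRef's actual code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FirstComeMergeSketch {

    // Copies fields from the candidates in order; a field that is already set is never replaced.
    static Map<String, String> mergeFirstWins(List<Map<String, String>> candidates) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> candidate : candidates) {
            candidate.forEach(merged::putIfAbsent);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed order: the XMP-derived candidate happens to be merged first.
        Map<String, String> xmp = Map.of("title", "Microsoft Word - ieee_on_how_we_teach_jul_01.docx");
        // Title abbreviated for the example; the "year" value is purely illustrative.
        Map<String, String> content = Map.of("title", "On How We Can Teach", "year", "2018");

        // The filename-like XMP title is kept because it was seen first.
        System.out.println(mergeFirstWins(List.of(xmp, content)).get("title"));
    }
}
```

Fields missing from the first candidate are still filled from later ones, which is why only the title ends up wrong.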

Steps to test

Download this PDF (https://kriha.de/dload/se2paper.pdf)

Create a new library and go to File -> Import
Select the downloaded PDF
The title should now show the real paper title instead of the Word filename

Checklist

I own the copyright of the code submitted and I license it under the MIT license
I manually tested my changes in running JabRef (always required)
I added JUnit tests for changes (if applicable)
I added screenshots in the PR description (if change is visible to the user)
[/] I added a screenshot in the PR description showing a library with a single entry with me as author and as title the issue number
I described the change in CHANGELOG.md in a way that can be understood by the average user (if change is visible to the user)
I checked the user documentation for up to dateness and submitted a pull request to our user documentation repository


calixtus · 2026-03-14T14:23:52Z

 public class PdfMergeMetadataImporter extends PdfImporter {

    private static final Logger LOGGER = LoggerFactory.getLogger(PdfMergeMetadataImporter.class);
+    private static final Pattern FILENAME_TITLE_PATTERN = Pattern.compile("(?i)(.*\\.(docx|doc|pdf|tex|odt|rtf|ps|eps|html|htm|pptx|ppt|xlsx)$|microsoft (word|powerpoint|excel).*|.*\\\\.*)");


Maybe this could reuse ExternalFileTypes? Please investigate.

StandardExternalFileType is in jabgui, so can we use it in jablib?

StandardFileType is missing MS Office formats (docx, pptx, xlsx) and eps.

Then add them to StandardFileType

Done. Also had to filter out ANY_FILE since its * extension is a regex metacharacter.
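
A minimal sketch of what building the pattern from a list of extensions could look like. The extension list and all names here are hypothetical stand-ins; how the extensions are actually read from StandardFileType is not shown in this thread. The sketch both skips the `*` wildcard and quotes each extension, since an unquoted `*` inside the alternation would normally make Pattern.compile fail.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class ExtensionPatternSketch {

    // Builds a case-insensitive pattern matching any string that ends in ".<extension>".
    static Pattern endsWithKnownExtension(List<String> extensions) {
        String alternatives = extensions.stream()
                .filter(extension -> !"*".equals(extension)) // skip the wildcard "any file" entry
                .map(Pattern::quote)                          // treat every extension as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile(".*\\.(" + alternatives + ")$", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // Hypothetical extension list; the real values would come from the file type enum.
        Pattern pattern = endsWithKnownExtension(List.of("docx", "pdf", "tex", "*"));
        System.out.println(pattern.matcher("Microsoft Word - draft.DOCX").matches());
        System.out.println(pattern.matcher("On How We Can Teach").matches());
    }
}
```

Quoting is arguably redundant once the wildcard is filtered out, but it keeps the pattern safe if an extension containing a regex metacharacter is added later.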

qodo-free-for-open-source-projects · 2026-03-14T20:26:55Z

Review Summary by Qodo

Fix PDF import preferring filename-like XMP titles over content titles

🐞 Bug fix

Walkthroughs

Description

• Prevent filename-like XMP titles from overriding correctly extracted PDF content titles
• Added heuristic to detect filename patterns in title metadata
• Implement post-merge logic to replace suspicious titles with better alternatives
• Added test case validating fix with real PDF containing Word filename metadata

Diagram

flowchart LR
  A["PDF with XMP metadata"] -->|Extract candidates| B["PdfMergeMetadataImporter"]
  B -->|Merge candidates| C["Check if title is filename-like"]
  C -->|Yes| D["Replace with real title from content"]
  C -->|No| E["Keep original title"]
  D --> F["Final BibEntry"]
  E --> F

File Changes

1. jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java 🐞 Bug fix +22/-3

Add filename detection and title override logic

• Added FILENAME_TITLE_PATTERN regex to detect filename-like titles (docx, doc, pdf, tex, etc. extensions and Microsoft Office patterns)
• Implemented isTitleLikelyFilename() method to check if a title matches filename patterns
• Modified mergeCandidates() to accept List<BibEntry> instead of Stream<BibEntry>
• Added post-merge override logic to replace filename-like titles with the first real title found from candidates
• Changed import from Stream to Pattern for regex matching

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java

2. jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java 🧪 Tests +21/-0

Add test for filename title override behavior

• Added test filenameLikeTitleFromXmpIsOverriddenByContentTitle() to verify the fix
• Test uses Kriha2018.pdf which contains "Microsoft Word - ieee_on_how_we_teach_jul_01.docx" as XMP title
• Validates that the real title "On How We Can Teach – Exploring New Ways in Professional Software Development for Students" is extracted instead
• Mocks GrobidPreferences and ImportFormatPreferences for offline testing

jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java

3. CHANGELOG.md 📝 Documentation +1/-0

Document PDF title import fix

• Added entry documenting the fix for PDF import preferring content-extracted titles over filename-like XMP metadata
• References issue #11999

CHANGELOG.md

qodo-free-for-open-source-projects · 2026-03-14T20:26:56Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (3) 📎 Requirement gaps (0)

1. Changelog uses “filename like” 📘 Rule violation ✓ Correctness

Description

The new CHANGELOG entry contains slightly unpolished phrasing (“filename like”) that should be
corrected for professional, precise user-facing text. This affects end-user readability and
translation quality expectations for release notes.

Code

CHANGELOG.md[37]

+- We fixed PDF import to prefer the content extracted title over filename like XMP metadata titles. [#11999](https://github.com/JabRef/jabref/issues/11999)

Evidence

PR Compliance requires professional, correct user-facing text and end-user-focused changelog
entries. The added CHANGELOG line includes the phrase “filename like”, which should be
“filename-like” (and is user-facing release text).

AGENTS.md
CHANGELOG.md[37-37]
Best Practice: Learned patterns

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The newly added CHANGELOG entry contains slightly incorrect/awkward user-facing wording ("filename like").
## Issue Context
CHANGELOG.md is user-facing release text and should be grammatically correct and professional.
## Fix Focus Areas
- CHANGELOG.md[37-37]


2. Test uses result.getFirst() 📘 Rule violation ⛯ Reliability

Description

The new test calls result.getFirst() without asserting the list is non-empty, which can throw and
obscure the intended assertion failure. This violates the guideline to avoid unsafe list access
without presence checks.

Code

jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[187]
+                result.getFirst().getTitle());

Evidence
The compliance checklist explicitly calls out getFirst() as an unsafe access pattern when used
without emptiness checks. The added test directly calls result.getFirst() without first asserting
result is non-empty.
jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[185-187]
Best Practice: Learned patterns

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test uses `result.getFirst()` without checking that `result` is non-empty, which can throw `NoSuchElementException` and make failures less diagnosable.
## Issue Context
The project compliance rules discourage unsafe list/optional access patterns (including `getFirst()`) without presence checks.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporterTest.java[180-187]


3. New BibEntry uses setField 📘 Rule violation ✓ Correctness

Description

mergeCandidates constructs a new BibEntry and then uses entry.setField(...) to set the title,
despite the preference for withField when creating new entries. This diverges from the project’s
preferred BibEntry construction style.

Code

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[R165-172]

+        entry.getField(StandardField.TITLE)
+             .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
+             .ifPresent(title -> candidates.stream()
+                                           .map(candidate -> candidate.getField(StandardField.TITLE))
+                                           .flatMap(Optional::stream)
+                                           .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
+                                           .findFirst()
+                                           .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));

Evidence
The compliance checklist requests using BibEntry withers (withField) rather than setField when
constructing new entries. In the modified method, a new BibEntry is created and later mutated via
setField for the title.
AGENTS.md
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[162-172]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A newly created `BibEntry` is mutated via `setField` in the modified code, while JabRef conventions prefer using `withField` when creating new entries.
## Issue Context
`mergeCandidates` creates `final BibEntry entry = new BibEntry()` and then sets the title via `entry.setField(...)`.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[162-172]



4. LaTeX title misclassified 🐞 Bug ✓ Correctness

Description

PdfMergeMetadataImporter.isTitleLikelyFilename treats any title containing a backslash as
filename-like, so a legitimate LaTeX-encoded title (e.g., "\\section{X}" or "$\\alpha$") can be
considered a filename and then replaced by a different candidate title during mergeCandidates,
changing intended merge priority for the title field.

Code

jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[R165-172]

+        entry.getField(StandardField.TITLE)
+             .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
+             .ifPresent(title -> candidates.stream()
+                                           .map(candidate -> candidate.getField(StandardField.TITLE))
+                                           .flatMap(Optional::stream)
+                                           .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
+                                           .findFirst()
+                                           .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));

Evidence
The merge override logic replaces the merged TITLE whenever isTitleLikelyFilename(...) returns true.
The new heuristic returns true for any backslash because the regex includes the alternative
".*\\\\.*". The codebase explicitly treats backslash-containing LaTeX as valid title content (so
backslash is not, by itself, a filename signal). Also, PdfMergeMetadataImporter runs a BibTeX
parser-based importer (PdfVerbatimBibtexImporter), which can produce LaTeX-escaped field values, so
this misclassification can realistically occur when a candidate title contains LaTeX commands.
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[42-58]
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[153-172]
jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfVerbatimBibtexImporter.java[27-33]
jablib/src/test/java/org/jabref/logic/integrity/LatexIntegrityCheckerTest.java[40-54]
jablib/src/test/java/org/jabref/logic/cleanup/CleanupWorkerTest.java[150-152]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`PdfMergeMetadataImporter` uses a heuristic regex to detect filename-like titles. The current regex treats **any** backslash in a title as filename-like (`.*\\.*`), which can misclassify legitimate LaTeX-encoded titles and cause `mergeCandidates(...)` to replace them with another candidate title.
### Issue Context
The override logic is executed when the merged `TITLE` is considered filename-like; at that point the code replaces the title with the first alternative non-filename-like title.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[42-43]
- jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java[153-172]
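
The misclassification can be reproduced in isolation. The sketch below compares the backslash alternative from the reviewed pattern with a narrower rule that only accepts a Windows-style drive prefix. The narrower rule and both example titles are assumptions made for illustration; the thread does not show which fix, if any, was adopted.

```java
import java.util.regex.Pattern;

final class BackslashHeuristicSketch {

    // The backslash alternative from the reviewed pattern: any title containing "\" matches.
    static final Pattern ANY_BACKSLASH = Pattern.compile(".*\\\\.*");

    // Hypothetical narrower rule: only a Windows-style path prefix such as "C:\" counts.
    static final Pattern WINDOWS_PATH = Pattern.compile("(?i)^[a-z]:\\\\.*");

    public static void main(String[] args) {
        String latexTitle = "A note on $\\alpha$-stable processes"; // illustrative LaTeX title
        String pathTitle = "C:\\Users\\me\\paper.docx";              // illustrative path-like title

        System.out.println(ANY_BACKSLASH.matcher(latexTitle).matches()); // LaTeX title is flagged
        System.out.println(WINDOWS_PATH.matcher(latexTitle).matches());  // LaTeX title is not flagged
        System.out.println(WINDOWS_PATH.matcher(pathTitle).matches());   // path is still flagged
    }
}
```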



calixtus · 2026-03-15T21:44:45Z

+        entry.getField(StandardField.TITLE)
+             .filter(PdfMergeMetadataImporter::isTitleLikelyFilename)
+             .ifPresent(title -> candidates.stream()
+                                           .map(candidate -> candidate.getField(StandardField.TITLE))
+                                           .flatMap(Optional::stream)
+                                           .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle))
+                                           .findFirst()
+                                           .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle)));


Just a minor code-style nitpick: the second, nested ifPresent makes it a bit less readable. Put the result of findFirst in a new Optional and check that with ifPresent.

You can call stream() directly on an Optional, so that should work.
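
One possible reading of the two suggestions above, sketched with plain maps standing in for BibEntry. The method name, the map-based types, and the predicate parameter are hypothetical; this is not the code that was eventually merged.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Predicate;

final class TitleOverrideSketch {

    // Replaces a filename-like merged title with the first candidate title that is not filename-like.
    static void overrideFilenameLikeTitle(Map<String, String> merged,
                                          List<Map<String, String>> candidates,
                                          Predicate<String> looksLikeFilename) {
        Optional<String> betterTitle = Optional.ofNullable(merged.get("title"))
                .filter(looksLikeFilename)
                .stream() // empty stream when the merged title is fine, so nothing below runs
                .flatMap(ignored -> candidates.stream())
                .map(candidate -> candidate.get("title"))
                .filter(Objects::nonNull)
                .filter(looksLikeFilename.negate())
                .findFirst();

        // A single, non-nested ifPresent on the extracted Optional.
        betterTitle.ifPresent(title -> merged.put("title", title));
    }

    public static void main(String[] args) {
        Map<String, String> merged = new HashMap<>(Map.of("title", "draft.docx"));
        List<Map<String, String>> candidates = List.of(
                Map.of("title", "draft.docx"),
                Map.of("title", "On How We Can Teach"));

        overrideFilenameLikeTitle(merged, candidates, title -> title.endsWith(".docx"));
        System.out.println(merged.get("title"));
    }
}
```

Calling stream() on the filtered Optional turns "only if the merged title is filename-like" into an ordinary pipeline stage, so the lookup and the update read as two flat steps.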

testlens-app · 2026-03-15T23:55:53Z

✅ All tests passed ✅

🏷️ Commit: ea5c97c
▶️ Tests: 10162 executed
⚪️ Checks: 67/67 completed


…abRef#15339) * Added failing test for filename like title override in PdfMergeMetadataImporterTest * Prefer content extracted title over filename like XMP title * Changelog * Changelog grammar * Refactor to use StandardFileType extensions * Simplified mergeCandidates title override logic

Added failing test for filename like title override in PdfMergeMetada…

6f4ad50

…taImporterTest

github-actions Bot added good second issue Issues that involve a tour of two or three interweaved components in JabRef status: no-bot-comments labels Mar 13, 2026