Filename like XMP titles overriding correct content extracted titles #15339
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| public class PdfMergeMetadataImporter extends PdfImporter { | ||
|
|
||
| private static final Logger LOGGER = LoggerFactory.getLogger(PdfMergeMetadataImporter.class); | ||
| private static final Pattern FILENAME_TITLE_PATTERN = Pattern.compile("(?i)(.*\\.(docx|doc|pdf|tex|odt|rtf|ps|eps|html|htm|pptx|ppt|xlsx)$|microsoft (word|powerpoint|excel).*|.*\\\\.*)"); |
There was a problem hiding this comment.
Maybe this could reuse ExternalFileTypes? Please investigate.
There was a problem hiding this comment.
StandardExternalFileType is in jabgui, so can we use it in jablib?
StandardFileType is missing MS Office formats (docx, pptx, xlsx) and eps.
There was a problem hiding this comment.
Then add them to StandardFileType
There was a problem hiding this comment.
Done. Also had to filter out ANY_FILE since its * extension is a regex metacharacter.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Review Summary by QodoFix PDF import preferring filename-like XMP titles over content titles
WalkthroughsDescription• Prevent filename-like XMP titles from overriding correctly extracted PDF content titles • Added heuristic to detect filename patterns in title metadata • Implement post-merge logic to replace suspicious titles with better alternatives • Added test case validating fix with real PDF containing Word filename metadata Diagramflowchart LR
A["PDF with XMP metadata"] -->|Extract candidates| B["PdfMergeMetadataImporter"]
B -->|Merge candidates| C["Check if title is filename-like"]
C -->|Yes| D["Replace with real title from content"]
C -->|No| E["Keep original title"]
D --> F["Final BibEntry"]
E --> F
File Changes1. jablib/src/main/java/org/jabref/logic/importer/fileformat/pdf/PdfMergeMetadataImporter.java
|
Code Review by Qodo
1. Changelog uses “filename like”
|
This comment has been minimized.
This comment has been minimized.
| entry.getField(StandardField.TITLE) | ||
| .filter(PdfMergeMetadataImporter::isTitleLikelyFilename) | ||
| .ifPresent(title -> candidates.stream() | ||
| .map(candidate -> candidate.getField(StandardField.TITLE)) | ||
| .flatMap(Optional::stream) | ||
| .filter(candidateTitle -> !isTitleLikelyFilename(candidateTitle)) | ||
| .findFirst() | ||
| .ifPresent(betterTitle -> entry.setField(StandardField.TITLE, betterTitle))); |
There was a problem hiding this comment.
Just minor codestyle nitpick: A second nested ifPresent makes it a bit less readable. Just put the result of findFirst in a new Optional and check this with ifPresent.
There was a problem hiding this comment.
you can directly call stream on an optional, so that should work
This comment has been minimized.
This comment has been minimized.
✅ All tests passed ✅🏷️ Commit: ea5c97c Learn more about TestLens at testlens.app. |
…abRef#15339) * Added failing test for filename like title override in PdfMergeMetadataImporterTest * Prefer content extracted title over filename like XMP title * Changelog * Changelog grammar * Refactor to use StandardFileType extensions * Simplified mergeCandidates title override logic
…abRef#15339) * Added failing test for filename like title override in PdfMergeMetadataImporterTest * Prefer content extracted title over filename like XMP title * Changelog * Changelog grammar * Refactor to use StandardFileType extensions * Simplified mergeCandidates title override logic
…abRef#15339) * Added failing test for filename like title override in PdfMergeMetadataImporterTest * Prefer content extracted title over filename like XMP title * Changelog * Changelog grammar * Refactor to use StandardFileType extensions * Simplified mergeCandidates title override logic
Related issues and pull requests
Closes #11999
PR Description
PdfMergeMetadataImporter.mergeCandidates( ) uses first come first serve merging, so when a PDF's XMP metadata contains a filename like title (Microsoft Word - ieee_on_how_we_teach_jul_01.docx) it silently overwrites the correctly extracted title from PdfContentImporter. So far I've just added a failing test using Kriha2018.pdf which has exactly this metadata to document the bug.
Added isTitleLikelyFilename( ) heuristic and post merge override logic in mergeCandidates( ) so that filename like titles are replaced by the best available real title.
Steps to test
Download this Pdf (https://kriha.de/dload/se2paper.pdf)
Checklist
CHANGELOG.mdin a way that can be understood by the average user (if change is visible to the user)