Skip to content

Feature: Add ability to remove XMP metadata from Linked Files#14853

Merged
koppor merged 21 commits into
JabRef:mainfrom
ganesh-vk:remove-xmp-metadata
Jan 27, 2026
Merged

Feature: Add ability to remove XMP metadata from Linked Files#14853
koppor merged 21 commits into
JabRef:mainfrom
ganesh-vk:remove-xmp-metadata

Conversation

@ganesh-vk

@ganesh-vk ganesh-vk commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

User description

Closes #8277

This PR adds a new CleanupJob that removes all XMP metadata from linked PDF files.

Steps to test

  • Open an example library and attach a file to any BibEntry
  • Use JabRef's existing functionality to write XMP metadata onto said file
  • Select the entry with the linked file -> Quality -> Cleanup Entries -> File Related -> Ensure Remove XMP Metadata is selected
  • Cleanup to remove XMP metadata from the linked PDF file
  • Exiftool can also be used to view the metadata before and after cleanup

The implementation in #8681 tried to use XmpPreferences and only delete specific fields giving the user more flexibility. However I think the approach of removing all XMP metadata works better since it does what was intended, let me know if this shouldn't be the expected behaviour.

Mandatory checks

image

PR Type

Enhancement


Description

  • Add new cleanup job to remove XMP metadata from linked PDF files

  • Implement XmpMetadataCleanup class with failure tracking and error handling

  • Integrate XMP removal into cleanup dialog UI with checkbox control

  • Add comprehensive unit tests for metadata removal functionality


Diagram Walkthrough

flowchart LR
  A["CleanupPreferences"] -->|"adds REMOVE_XMP_METADATA step"| B["CleanupWorker"]
  B -->|"creates job"| C["XmpMetadataCleanup"]
  C -->|"uses"| D["XmpUtilWriter.removeXmpMetadata"]
  D -->|"modifies"| E["PDF File"]
  F["CleanupFileViewModel"] -->|"tracks selection"| G["CleanupFileRelatedPanel"]
  G -->|"binds to"| C
Loading

File Walkthrough

Relevant files
Enhancement
8 files
XmpMetadataCleanup.java
New cleanup job implementation for XMP metadata removal   
+70/-0   
CleanupPreferences.java
Add REMOVE_XMP_METADATA cleanup step enum                               
+1/-0     
CleanupWorker.java
Integrate XmpMetadataCleanup job and failure tracking       
+5/-1     
XmpUtilWriter.java
Add removeXmpMetadata static method for metadata removal 
+38/-0   
XmpUtilShared.java
Simplify hasMetadata to check raw XMP without preferences
+3/-5     
CleanupFileViewModel.java
Add removeXmpMetadataSelected property and job selection 
+7/-1     
CleanupFileRelatedPanel.java
Add checkbox control for XMP metadata removal option         
+2/-0     
CleanupFileRelatedPanel.fxml
Add UI checkbox element for remove XMP metadata                   
+1/-0     
Documentation
3 files
CleanupDialogViewModel.java
Fix javadoc comment from iff to if                                             
+1/-1     
JabRef_en.properties
Add localization string for remove XMP metadata                   
+1/-0     
CHANGELOG.md
Document new XMP metadata removal feature                               
+1/-0     
Tests
2 files
XmpMetadataCleanupTest.java
New comprehensive unit tests for XMP metadata cleanup       
+114/-0 
CleanupWorkerTest.java
Add integration test for XMP metadata removal workflow     
+45/-0   

@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Temporary file cleanup failure

Description: The temporary file created at line 337 may not be deleted if an exception occurs before
the finally block, potentially leaving sensitive PDF data on disk. The
Files.deleteIfExists call is only in the finally block which may not execute if the JVM
crashes. XmpUtilWriter.java [333-359]

Referred Code
public static void removeXmpMetadata(Path path) throws IOException, TransformerException {
    // Read from another file
    // Reason: Apache PDFBox does not support writing while the file is opened
    // See https://issues.apache.org/jira/browse/PDFBOX-4028
    Path newFile = Files.createTempFile("JabRef", "pdf");
    try (PDDocument document = Loader.loadPDF(path.toFile())) {
        if (document.isEncrypted()) {
            throw new EncryptedPdfsNotSupportedException();
        }
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        if (catalog.getMetadata() != null) {
            catalog.setMetadata(null);
        }
        document.save(newFile.toFile());
        FileUtil.copyFile(newFile, path, true);
    } catch (IOException e) {
        LOGGER.debug("Could not remove XMP metadata", e);
        throw new TransformerException(
                "Could not remove XMP metadata: " + e.getLocalizedMessage(), e);
    } finally {
        try {


 ... (clipped 6 lines)
Silent failure handling

Description: The code silently continues processing other files after catching exceptions during XMP
removal, which could leave some files in an inconsistent state without clear user
notification of which specific files failed. XmpMetadataCleanup.java [48-56]

Referred Code
if (XmpUtilShared.hasMetadata(path)) {
    try {
        XmpUtilWriter.removeXmpMetadata(path);
        changed.set(true);
    } catch (IOException | TransformerException e) {
        LOGGER.error("Problem removing XMP metadata from file {}", path, e);
        failures.add(new JabRefException(Localization.lang("Problem removing XMP metadata from file %0", path.toString()), e));
    }
}
Ticket Compliance
🟡
🎫 #8277
🟢 Add a feature that allows removing all XMP metadata from one or multiple PDFs
The feature should work with PDFs that have linked entries in JabRef
The feature should address the problem where unwanted metadata exists alongside desired
metadata
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Path exposure in error: Error message exposes full file system path to user which may reveal internal system
structure

Referred Code
    failures.add(new JabRefException(Localization.lang("Problem removing XMP metadata from file %0", path.toString()), e));
}

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
File path in logs: Full file system path is logged which may expose sensitive directory structure or user
information

Referred Code
LOGGER.error("Problem removing XMP metadata from file {}", path, e);
failures.add(new JabRefException(Localization.lang("Problem removing XMP metadata from file %0", path.toString()), e));

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@ganesh-vk ganesh-vk changed the title Remove xmp metadata Feature: Add ability to remove XMP metadata from Linked Files Jan 13, 2026
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Report metadata removal more accurately

Instead of reporting a FieldChange on the file field, modify a different field
like comment to more accurately reflect that XMP metadata was removed.

jablib/src/main/java/org/jabref/logic/cleanup/XmpMetadataCleanup.java [60-63]

 if (changed.get()) {
     // Since only metadata is removed, no field "changes" but we still return a list
-    return Collections.singletonList(new FieldChange(entry, StandardField.FILE, entry.getField(StandardField.FILE).orElse(null), entry.getField(StandardField.FILE).orElse(null)));
+    // We use the comment field to record the change.
+    String oldComment = entry.getField(StandardField.COMMENT).orElse("");
+    String newComment = oldComment + "\n[JabRef] XMP metadata removed from linked files.";
+    return Collections.singletonList(new FieldChange(entry, StandardField.COMMENT, oldComment, newComment.trim()));
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that returning a FieldChange on the file field is a workaround and proposes a more explicit alternative by modifying the comment field, which improves traceability.

Low
  • Update

@github-actions github-actions Bot added the status: changes-required Pull requests that are not yet complete label Jan 13, 2026
@ganesh-vk

Copy link
Copy Markdown
Contributor Author

Will fix the failing tests in a bit

Comment thread jablib/src/main/java/org/jabref/logic/xmp/XmpUtilWriter.java
@github-actions github-actions Bot removed the status: changes-required Pull requests that are not yet complete label Jan 13, 2026
@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jan 14, 2026
filePath.ifPresent(path -> {
try {
XmpUtilWriter.removeXmpMetadata(path);
changed.set(true);

@calixtus calixtus Jan 14, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isnt it a bit bold to assume that if a file exists it will be changed? Maybe there was no metadata to remove in that file?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the current approach does not remove selective fields from the metadata and erases it all entirely, checking if metadata is present adds unnecessary overhead. Erasing all metadata without a check would result in the same outcome.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a con to this approach which I did not look at. If a PDF has no metadata and the cleanup is run, the UI reports a change (The entries containing any PDF regardless of if they have metadata or not). The benefit is if metadata exists -> cleanup without adding checking overhead but I think reporting that an entry was cleaned up when it had no metadata in the first place shouldn't be intended. What do you think? cc @Siedlerchr

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t I think reporting that an entry was cleaned up when it had no metadata in the first place shouldn't be intended

Yes. Always think of a huge library with 1000 or more entries.

  1. I do cleanup: 1500 files changed (because i have mulitple files attached)
  2. I do cleanup: 1500 files changed
  3. I think: Why? Why does JabRef need to change again?!

Comment thread jablib/src/main/java/org/jabref/logic/cleanup/XmpMetadataCleanup.java Outdated

if (changed.get()) {
// Since only metadata is removed, no field "changes" but we still return a list
return Collections.singletonList(new FieldChange(entry, StandardField.FILE, entry.getField(StandardField.FILE).orElse(null), entry.getField(StandardField.FILE).orElse(null)));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrElse(null) looks very suspicious to me.
Please provide some rational here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first returned an empty list since no field was being changed but the UI reports no changes (since it counts the number of lists returned) I asked an LLM for ideas on what to return and this approach made sense. Yes the orElse(null) is unnecessary and I will remove it. Will be more wary of the same in the future

return List.of(new FieldChange(
entry,
StandardField.FILE,
entry.getField(StandardField.FILE).orElse(StandardFileType.PDF.toString()),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I currently lack the resources for a detailed review, can you @Siedlerchr check on this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the orElse is wrong. It should be get() -- but then there is an IntelliJ warning on this, which cannot be configured to be ignored properly.

Proposal: Java comment + .get().

Comment thread jablib/src/main/java/org/jabref/logic/xmp/XmpUtilWriter.java Outdated
@github-actions github-actions Bot added status: changes-required Pull requests that are not yet complete and removed status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers labels Jan 22, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

@github-actions github-actions Bot removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jan 26, 2026
@koppor

koppor commented Jan 26, 2026

Copy link
Copy Markdown
Member

Hi, I made that comment after the last commit. I was just wondering if it is possible for IntelliJ to enforce the formatting rules. Originally, the first character from the description for each throws and param tag were not vertically aligned - this is expected by the format check (I had fixed this by looking at Markdown docs of other methods). IntelliJ's auto format did not perform this when I tried. Is there any other way to automatically format files?

Double check your project setup. The project.xml in the repo was wrong until some days ago, but the guideline for setting up still comtains the hint.

https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/intellij-13-code-style.html#have-auto-format-working-properly-in-javadoc

@ganesh-vk

Copy link
Copy Markdown
Contributor Author

Double check your project setup. The project.xml in the repo was wrong until some days ago, but the guideline for setting up still comtains the hint.

https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/intellij-13-code-style.html#have-auto-format-working-properly-in-javadoc

Auto Formatting somehow working again :), settings were fine so I did not have to change anything.

@github-actions github-actions Bot removed the status: changes-required Pull requests that are not yet complete label Jan 26, 2026
@ganesh-vk

Copy link
Copy Markdown
Contributor Author

I've resolved all conflicts, feel free to let me know if any other changes are necessary

@koppor koppor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests need to be rectored. They seem to be AI-generated not following JabRef's code style.

return List.of(new FieldChange(
entry,
StandardField.FILE,
entry.getField(StandardField.FILE).orElse(StandardFileType.PDF.toString()),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the orElse is wrong. It should be get() -- but then there is an IntelliJ warning on this, which cannot be configured to be ignored properly.

Proposal: Java comment + .get().

Comment on lines +93 to +106
BibEntry metadataEntry = new BibEntry();
metadataEntry.setCitationKey("Toot");
metadataEntry.setField(StandardField.PDF, "aPdfFile");
metadataEntry.setField(new UnknownField("some"), "1st");
metadataEntry.setField(StandardField.DOI, "http://dx.doi.org/10.1016/0001-8708(80)90035-3");
metadataEntry.setField(StandardField.MONTH, "01");
metadataEntry.setField(StandardField.PAGES, "1-2");
metadataEntry.setField(StandardField.DATE, "01/1999");
metadataEntry.setField(StandardField.PDF, "aPdfFile");
metadataEntry.setField(StandardField.ISSN, "aPsFile");
metadataEntry.setField(StandardField.FILE, "link::");
metadataEntry.setField(StandardField.JOURNAL, "test");
metadataEntry.setField(StandardField.TITLE, "<b>hallo</b> units 1 A case AlGaAs and latex $\\alpha$$\\beta$");
metadataEntry.setField(StandardField.ABSTRACT, "Réflexions");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, use withField chain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test here was copied/inspired by the pre existing test below cleanupDoesNothingByDefault. Maybe that needs refactoring too?

metadataEntry.setField(StandardField.DATE, "01/1999");
metadataEntry.setField(StandardField.PDF, "aPdfFile");
metadataEntry.setField(StandardField.ISSN, "aPsFile");
metadataEntry.setField(StandardField.FILE, "link::");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put test.pdf as content - and not a wrong format.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to the format from XmpMetadataCleanupTest since you said that the format is correct there

metadataEntry.setField(StandardField.ISSN, "aPsFile");
metadataEntry.setField(StandardField.FILE, "link::");
metadataEntry.setField(StandardField.JOURNAL, "test");
metadataEntry.setField(StandardField.TITLE, "<b>hallo</b> units 1 A case AlGaAs and latex $\\alpha$$\\beta$");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why a bogus title is used - why not just "title"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as earlier, copied from cleanupDoesNothingByDefault


CleanupPreferences preset = new CleanupPreferences(CleanupPreferences.CleanupStep.REMOVE_XMP_METADATA);
List<FieldChange> changes = worker.cleanup(preset, entry);
assertFalse(changes.isEmpty());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertFalse(changes.isEmpty());
assertEquals(List.of(), changes);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, should be assertNotEquals

olly2018.setField(StandardField.ABSTRACT, "ABSTRACT");
olly2018.setField(StandardField.COMMENT, "COMMENT");
olly2018.setField(StandardField.DOI, "10/3212.3123");
olly2018.setField(StandardField.FILE, ":article_dublinCore.pdf:PDF");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the format is right.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it was inspired by the earlier Pull Request for the issue #8681

Comment on lines +67 to +71
olly2018.setField(StandardField.AUTHOR, "Olly and Johannes");
olly2018.setField(StandardField.TITLE, "Stefan's palace");
olly2018.setField(StandardField.JOURNAL, "Test Journal");
olly2018.setField(StandardField.VOLUME, "1");
olly2018.setField(StandardField.NUMBER, "1");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .withField

Regarding the citation key - see other test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to useField, could you please elaborate on the citation key?

Comment on lines +97 to +98
assertFalse(changes.isEmpty(), "Cleanup should report a change");
assertEquals(StandardField.FILE, changes.getFirst().getField());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not equalling the field changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

BibEntry bibEntry = new BibEntry().withFiles(Collections.singletonList(linkedFile));

List<FieldChange> changes = cleanupJob.cleanup(bibEntry);
assertTrue(changes.isEmpty(), "Cleanup should report no changes for file without metadata");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reasoning at the assertions - test title should do it

Use List.of(). See other comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this other than style? I personally feel assertion comments offer more clarity.

@github-actions github-actions Bot added the status: changes-required Pull requests that are not yet complete label Jan 27, 2026
@github-actions github-actions Bot removed the status: changes-required Pull requests that are not yet complete label Jan 27, 2026
@ganesh-vk

Copy link
Copy Markdown
Contributor Author

Javadoc CI check unrelated to changes made by me looks like, seems to have failed on the last merged commit as well: https://github.com/JabRef/jabref/actions/runs/21373986928/job/61525481411

@koppor

koppor commented Jan 27, 2026

Copy link
Copy Markdown
Member

Javadoc CI check unrelated to changes made by me looks like, seems to have failed on the last merged commit as well: JabRef/jabref/actions/runs/21373986928/job/61525481411

Note that runs will be deleted after some weeks. Normally, one copies the smallest possible excerpt of the run.

For the concrete case, fixed in main. I merged main.

JavaDoc would be working at JDK27 - see https://bugs.openjdk.org/browse/JDK-8371504

Comment thread jablib/src/main/java/org/jabref/logic/cleanup/XmpMetadataCleanup.java Outdated
@github-actions github-actions Bot added the status: changes-required Pull requests that are not yet complete label Jan 27, 2026
@koppor koppor enabled auto-merge January 27, 2026 22:47
@github-actions github-actions Bot removed the status: changes-required Pull requests that are not yet complete label Jan 27, 2026
@koppor koppor added this pull request to the merge queue Jan 27, 2026
@github-actions github-actions Bot added the status: to-be-merged PRs which are accepted and should go into the merge-queue. label Jan 27, 2026
Merged via the queue into JabRef:main with commit 850c6dc Jan 27, 2026
51 checks passed
@ganesh-vk ganesh-vk deleted the remove-xmp-metadata branch February 2, 2026 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add feature that removes XMP metadata from pdf(s)

5 participants