Skip to content

Improve KeywordList parsing to support escaped characters and nested structures#12888

Closed
krishnagjsForGit wants to merge 10 commits into
JabRef:mainfrom
krishnagjsForGit:fix-for-issue-12810
Closed

Improve KeywordList parsing to support escaped characters and nested structures#12888
krishnagjsForGit wants to merge 10 commits into
JabRef:mainfrom
krishnagjsForGit:fix-for-issue-12810

Conversation

@krishnagjsForGit

@krishnagjsForGit krishnagjsForGit commented Apr 6, 2025

Copy link
Copy Markdown

Closes #12810

What I changed
Implemented escape handling in KeywordList.parse(String, Character, Character) to support escaping the keyword separator using backslash (). This prevents incorrect splitting of keywords when delimiters appear within the keyword itself.

Refactored the parsing logic for clarity, including renaming loop variables for better readability and intent.

Added test cases in KeywordListTest to cover escape scenarios such as:

Escaped delimiter: "one\,two" → ["one,two"]

Escaped backslash: "one\\two" → ["one\two"]

Mixed escaped and unescaped delimiters.

Where the changes are
KeywordList.java: Modified the parse method to include escape handling logic using a character-by-character loop.

KeywordListTest.java: Added JUnit test cases to ensure parsing behaves correctly with escaped delimiters and backslashes.

Why I made these changes
Fixes issue #12810: Current parsing breaks when delimiters appear inside keywords (e.g., in MeSH terms). There was no way to escape the delimiter character, leading to incorrect keyword splitting.

Improves consistency: Provides a more robust, predictable behavior for keyword parsing, especially when users or importers use delimiters in keyword values.

Next Steps
Awaiting review and feedback.

If the community prefers a different escape character or behavior, I’m happy to adjust the implementation.

After this is merged, similar logic could be extracted or reused where keyword parsing happens elsewhere (e.g., importers).

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • [/] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • [/] Screenshots added in PR description (if change is visible to the user)
  • [/] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
Comment thread src/test/java/org/jabref/model/entry/KeywordListTest.java
@krishnagjsForGit

Copy link
Copy Markdown
Author

can someone help me on this. is this a valid comment from Bot ? I have never used DisplayName annotation and I changed method name to be more comprehensive. Am I missing anything here ?

@Siedlerchr

Copy link
Copy Markdown
Member

Seems like a false positive from the bot @koppor

@koppor

koppor commented Apr 6, 2025

Copy link
Copy Markdown
Member

@krishnagjsForGit please change the PR title to contain text summarizing the fix.

@krishnagjsForGit krishnagjsForGit changed the title Fix for issue https://github.com/JabRef/jabref/issues/12810 Improve KeywordList parsing to support escaped characters and nested structures Apr 6, 2025
@krishnagjsForGit

krishnagjsForGit commented Apr 6, 2025

Copy link
Copy Markdown
Author

@krishnagjsForGit please change the PR title to contain text summarizing the fix.

Done. is this fine @koppor ?

Comment thread src/main/java/org/jabref/model/entry/KeywordList.java Outdated
Comment thread src/main/java/org/jabref/model/entry/KeywordList.java Outdated
keywordList.add(chainRoot);
List<String> tokens = splitRespectingEscapes(keywordString, delimiter);

for (String token : tokens) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there's an opportunity to eliminate this loop. In the _splitRespectingEscapes_ method, all tokens are already available and accessible. It feels inefficient to first collect all tokens in a List and then iterate over it.

Sketch:

public static KeywordList parse(String keywordString, Character delimiter, Character hierarchicalDelimiter) {
    if (StringUtil.isBlank(keywordString)) {
        return new KeywordList();
    }

    Objects.requireNonNull(delimiter);
    Objects.requireNonNull(hierarchicalDelimiter);

    KeywordList keywordList = new KeywordList();
    List<String> hierarchy = new ArrayList<>();
    StringBuilder currentToken = new StringBuilder();
    boolean isEscaping = false;

    for (int i = 0; i < keywordString.length(); i++) {
        char currentChar = keywordString.charAt(i);

        if (isEscaping) {
            currentToken.append(currentChar);
            isEscaping = false;
        } else if (currentChar == '\\') {
            isEscaping = true;
        } else if (currentChar == hierarchicalDelimiter) {
            hierarchy.add(currentToken.toString().trim());
            currentToken.setLength(0);
        } else if (currentChar == delimiter) {
            hierarchy.add(currentToken.toString().trim());
            keywordList.add(Keyword.of(hierarchy.toArray(new String[0])));
            hierarchy.clear();
            currentToken.setLength(0);
        } else {
            currentToken.append(currentChar);
        }
    }

    // Handle final token
    if (currentToken.length() > 0 || !hierarchy.isEmpty()) {
        hierarchy.add(currentToken.toString().trim());
        keywordList.add(Keyword.of(hierarchy.toArray(new String[0])));
    }

    return keywordList;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krishnagjsForGit I see this marked as resolved. Was this discussed further?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I see this loop still present)

String chain = tok.nextToken();
Keyword chainRoot = Keyword.of(chain.split(hierarchicalDelimiter.toString()));
keywordList.add(chainRoot);
List<String> tokens = splitRespectingEscapes(keywordString, delimiter);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yubo-Cao What do you think about this? Maybe an ANTLR grammar would help here, too?

@Yubo-Cao

Copy link
Copy Markdown
Collaborator

I'm sorry for not getting back to you sooner. I have two suggestions:

  1. ANTLR probably won't help in this use case, since the delimiters are dynamically specified
  2. I am concerned regarding the Unicode handling of the code, since Java char is finicky, and potentially you want to use codepoints instead of .charAt
  3. One way to address 2 and increase readability without necessarily using ANTLR is to dynamically construct a Regular Expression with a lookbehind and use only .split and .replace. Take a look at this: https://regex101.com/r/O1kLWF/1

@krishnagjsForGit

Copy link
Copy Markdown
Author

I'm sorry for not getting back to you sooner. I have two suggestions:

  1. ANTLR probably won't help in this use case, since the delimiters are dynamically specified
  2. I am concerned regarding the Unicode handling of the code, since Java char is finicky, and potentially you want to use codepoints instead of .charAt
  3. One way to address 2 and increase readability without necessarily using ANTLR is to dynamically construct a Regular Expression with a lookbehind and use only .split and .replace. Take a look at this: https://regex101.com/r/O1kLWF/1

@koppor you want me to change the code accordingly ?

@koppor

koppor commented May 1, 2025

Copy link
Copy Markdown
Member

I think the current code is good.

Is a result of current JabRefs way of parsing. If it was done "correctly", the Bib parser would do the escape handling - and all other data structures would use tokens. But this is much much effort to change.

@koppor

koppor commented May 1, 2025

Copy link
Copy Markdown
Member

@krishnagjsForGit please add a CHANGELOG.md entry

@subhramit

Copy link
Copy Markdown
Member

@ungerts thank you for helping with reviews!

@koppor koppor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested at #12888 (comment) need to be addressed.

@krishnagjsForGit

Copy link
Copy Markdown
Author

@krishnagjsForGit please add a CHANGELOG.md entry

Can you please check if I updated correctly ?

@subhramit

Copy link
Copy Markdown
Member

How did you merge, by the way?
Did you do git pull upstream main?

@koppor

koppor commented May 2, 2025

Copy link
Copy Markdown
Member

Seems like it was not a clean merge. @koppor may be able to help with some git trick Otherwise, you'll have to git reset and push a fresh commit.

Just checking the output of GitHub here, I also suggest git reset origin/main after a git merge origin/main. - A force push afterwards is OK. Please use gitk --all& as a double check - and run this command before any action and leave the window open. You can press F5 to refresh.

@koppor

koppor commented May 2, 2025

Copy link
Copy Markdown
Member

Alternatively, one could work with git cherry-pick on a new branch - and just cherry-pick the "good" commits.

+ Localization.lang("You can restore the entry using the \"Undo\" operation."));

stateManager.setSelectedEntries(List.of());
stateManager.setSelectedEntries(Collections.emptyList());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JabRef, to create an empty list we use List.of() instead of Collections.emptyList(). This aligns with the project's conventions and ensures consistency across the codebase.

@krishnagjsForGit

Copy link
Copy Markdown
Author

Something went amiss and PR got closed. Can I create a separate Clean PR and fix this ? @koppor

when(stateManager.searchQueryProperty()).thenReturn(mock(StringProperty.class));
when(stateManager.activeTabProperty()).thenReturn(OptionalObjectProperty.empty());
KeyBindingRepository keyBindingRepository = new KeyBindingRepository(List.of(), List.of());
KeyBindingRepository keyBindingRepository = new KeyBindingRepository(Collections.emptyList(), Collections.emptyList());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JabRef, to create an empty list, List.of() should be used instead of Collections.emptyList() to adhere to the project's coding standards.

@krishnagjsForGit krishnagjsForGit marked this pull request as draft May 3, 2025 03:51
private static Path downloadLtwaFile() throws IOException, URISyntaxException {
LOGGER.info("Downloading LTWA file from {}.", LtwaListMvGenerator.LTWA_URL);
var in = new URI(LTWA_URL).toURL().openStream();
var path = Files.writeString(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of 'Files.createTempFile' should be replaced with '@tempdir' JUnit5 annotation for better test management and cleanup.

if (word.startsWith("-")) {
String key = word.substring(1);
suffixMap.computeIfAbsent(key, k ->
Stream.<LtwaEntry>builder().build().collect(Collectors.toList())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of 'collect(Collectors.toList())' should be replaced with 'toList()' for simplicity and modern Java practices.

BibWriter bibWriter = new BibWriter(outputWriter, OS.NEWLINE);
SelfContainedSaveConfiguration saveConfiguration = new SelfContainedSaveConfiguration(SaveOrder.getDefaultSaveOrder(), false, BibDatabaseWriter.SaveType.WITH_JABREF_META_DATA, false);
FieldPreferences fieldPreferences = new FieldPreferences(true, List.of(), List.of());
FieldPreferences fieldPreferences = new FieldPreferences(true, Collections.emptyList(), Collections.emptyList());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use List.of() instead of Collections.emptyList() for creating empty lists, as per the project's conventions.

@koppor

koppor commented May 3, 2025

Copy link
Copy Markdown
Member

Something went amiss and PR got closed. Can I create a separate Clean PR and fix this ? @koppor

Yes, please.

@koppor koppor force-pushed the fix-for-issue-12810 branch from 081b9dc to 85cb7c7 Compare May 9, 2025 21:28
@koppor

koppor commented May 9, 2025

Copy link
Copy Markdown
Member

I restored the state as of April, 16th - commit 081b9dc. Mayb,e this helps.

If you do a git fetch origin and then git reset --hard origin/fix-for-issue-12810 you should be on the same page.

Attention: The command git reset --hard .... throws away all your work you did inbetween. Not sure if you did something...

@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jun 16, 2025
@trag-bot

trag-bot Bot commented Jul 7, 2025

Copy link
Copy Markdown

@trag-bot didn't find any issues in the code! ✅✨

@koppor

koppor commented Jul 7, 2025

Copy link
Copy Markdown
Member

I saw no activtiy since 2 months, thus closing this and making it free for another contributor.

@koppor koppor closed this Jul 7, 2025
@koppor koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jul 7, 2025
@jabref-machine

Copy link
Copy Markdown
Collaborator

Your code currently does not meet JabRef's code guidelines. We use Checkstyle to identify issues. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Checkstyle (pull_request)" and click on it.

In case of issues with the import order, double check that you activated Auto Import. You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.

Please carefully follow the setup guide for the codestyle. Afterwards, please run checkstyle locally and fix the issues, commit, and push.

miguel-cordoba added a commit to miguel-cordoba/jabref that referenced this pull request Jul 24, 2025
miguel-cordoba added a commit to miguel-cordoba/jabref that referenced this pull request Jul 24, 2025
…bRef#12888)

- Addresses unresolved previous reviews: JabRef#12888 (comment)

- Adds CHANGELOG.md entry
miguel-cordoba added a commit to miguel-cordoba/jabref that referenced this pull request Jul 24, 2025
…bRef#12888)

- Addresses unresolved previous reviews: JabRef#12888 (comment)

- Adds CHANGELOG.md entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement escaping for keyword separators

7 participants