Preserve no break spaces in Latex to Unicode conversion#15174
Conversation
Review Summary by QodoFix LaTeX tilde to non-breaking space conversion
WalkthroughsDescription• Replace LaTeX tildes with non-breaking spaces instead of standard spaces
• Use negative lookbehind regex to preserve LaTeX tilde accents like \~{n}
• Add test case for no-break space preservation in author names
• Update CHANGELOG with fix description
Diagramflowchart LR
A["LaTeX input with tilde"] -->|"negative lookbehind regex"| B["Standalone tildes identified"]
B -->|"replace with \\u00a0"| C["Non-breaking space output"]
D["LaTeX accents like \\~{n}"] -->|"not matched"| E["Preserved unchanged"]
File Changes1. jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java
|
Code Review by Qodo
1. NBSP breaks SQL search
|
| String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0"); | ||
| toFormat = UNDERSCORE_MATCHER.matcher(toFormat).replaceAll(REPLACEMENT_CHAR); |
There was a problem hiding this comment.
1. Nbsp breaks sql search 🐞 Bug ✓ Correctness
LatexToUnicodeAdapter.format now emits NBSP for ~, and the search index stores this value as the transformed field. Because search queries use LIKE/ILIKE against the transformed column without normalizing whitespace, searching with a normal space may no longer match entries whose transformed text contains NBSP.
Agent Prompt
## Issue description
`LatexToUnicodeAdapter` now converts LaTeX `~` to NBSP (U+00A0). This value is written into the search index’s transformed column and is then queried using SQL LIKE/ILIKE. Because LIKE/ILIKE are character-sensitive, search terms containing normal spaces (U+0020) will not match transformed content containing NBSP.
## Issue Context
The adapter is used for both UI-ish formatting and backend indexing/search normalization. Preserving NBSP for display is desirable, but indexing/search likely needs whitespace normalization to maintain expected matching behavior.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/search/indexing/BibFieldsIndexer.java[474-476]
- jablib/src/main/java/org/jabref/logic/search/query/SearchToSqlVisitor.java[222-263]
- jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-35]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
…me to something more clear
|
Your pull request conflicts with the target branch. Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line. |
|
Hello! I am a new contributor looking to get involved. I would love to work on this issue. Could you please assign it to me? |
Are you aware that this is a PR and not an issue? |
This comment has been minimized.
This comment has been minimized.
|
|
||
| private static final Pattern TILDE_MATCHER = Pattern.compile("(?<!\\\\)~"); | ||
|
|
||
| private static final String NON_BREAKING_SPACE = "\u00a0"; |
There was a problem hiding this comment.
| private static final String NON_BREAKING_SPACE = "\u00a0"; | |
| private static final String NO_BREAK_SPACE = "\u00a0"; |
There was a problem hiding this comment.
The formal unicode character name is "no-break" space (https://www.unicode.org/charts/charindex.html), which I think is a better choice than the alternative name "non-breaking space" unless JabRef has a precedent of using the alternative. Thanks for the fix, @faneeshh.
| /// @return an `Optional<String>` with LaTeX resolved into Unicode or `empty` on failure. | ||
| public static Optional<String> parse(@NonNull String inField) { | ||
| String toFormat = UNDERSCORE_MATCHER.matcher(inField).replaceAll(REPLACEMENT_CHAR); | ||
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE); |
There was a problem hiding this comment.
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE); | |
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NO_BREAK_SPACE); |
| } | ||
|
|
||
| @Test | ||
| void formatPreservesNoBreakingSpaces() { |
There was a problem hiding this comment.
| void formatPreservesNoBreakingSpaces() { | |
| void formatPreservesNoBreakSpaces() { |
|
|
||
| ### Fixed | ||
|
|
||
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) |
There was a problem hiding this comment.
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) | |
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of no-break spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) |
|
8 |
I'm guessing it's failing because it expects a standard space where I've now introduced the no-break space(\u00a0)? If that is the case then should I update the test's expected string to match the new Unicode char? Does that sound right |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Why? -- https://www.doi.org/resources/DOI_URI_Scheme.pdf does not contain any reference to ~ |
Yes, please - then we can easily see it if it s OK. Now, five people in a call are starring at the screen and not understanding anything. Please do it ASAP. |
This comment has been minimized.
This comment has been minimized.
I understand that real DOIs don't contain tildes per the DOI Scheme but the test 6 in CitationStyleGeneratorTest (10.1161/circ.108_827022~special) appeared to be a edge case to test unusual characters which is why I got confused probably. Anyways I've updated the tests to use the no-break space. |
This comment has been minimized.
This comment has been minimized.
Learning: Think "out of the box" Go to https://doi.org/ Enter the doi with ~"
See
--> Test is wrong. When reviewing the test at hand, the reviewers did not see. |
Yeah learning something new with every PR I open... really grateful tho |
|
| Check | Project/Task | Test | Runs |
|---|---|---|---|
| Source Code Tests / Unit tests – jabgui | :jabgui:test | DownloadLinkedFileActionTest > doesntReplaceSourceURL(boolean) | 🔇 |
| Source Code Tests / Unit tests – jabgui | :jabgui:test | DownloadLinkedFileActionTest > replacesLinkedFiles(Path) | 🔇 |
| Source Code Tests / Unit tests – jabgui | :jabgui:test | LinkedFileViewModelTest > downloadPdfFileWhenLinkedFilePointsToPdfUrl(boolean) | 🔇 |
🏷️ Commit: fd3e4bc
⚪️ Checks: 50/50 completed
Muted Tests
Select tests to mute in this pull request:
- DownloadLinkedFileActionTest > doesntReplaceSourceURL(boolean)
- DownloadLinkedFileActionTest > replacesLinkedFiles(Path)
- LinkedFileViewModelTest > downloadPdfFileWhenLinkedFilePointsToPdfUrl(boolean)
Reuse successful test results:
- ♻️ Only rerun the tests that failed or were muted before
Click the checkbox to trigger a rerun:
- Rerun jobs
Learn more about TestLens at testlens.app.
…rg.openrewrite.recipe-rewrite-recipe-bom-3.25.0 * upstream/main: (35 commits) Chore: add dependency-management.md (#15278) Chore(deps): Bump dev.langchain4j:langchain4j-bom in /versions (#15277) New Crowdin updates (#15274) Chore(deps): Bump actions/upload-artifact from 6 to 7 (#15271) Chore(deps): Bump actions/download-artifact from 7 to 8 (#15270) Chore(deps): Bump docker/login-action from 3 to 4 (#15268) Fix threading issues in citations relations tab (#15233) Fix: Citavi XML importer now preserves citation keys (#14658) (#15257) Preserve no break spaces in Latex to Unicode conversion (#15174) Fix: open javafx.scene.control.skin to controlsfx (#15260) Reduce complexity in dependencies setup (restore) (#15194) New translations jabref_en.properties (French) (#15256) Fix: exception dialog shows up when moving sidepanel down/up (#15248) Implement reset for Name Display Preferences (#15136) Chore(deps): Bump net.bytebuddy:byte-buddy in /versions (#15252) Chore(deps): Bump io.zonky.test.postgres:embedded-postgres-binaries-bom (#15253) Chore(deps): Bump io.zonky.test:embedded-postgres in /versions (#15254) Chore(deps): Bump net.ltgt.errorprone from 5.0.0 to 5.1.0 in /jablib (#15251) New Crowdin updates (#15247) Refined the "Select files to import" page in "Search for unlinked local files" dialog (#15110) ...


Closes #15158
PR Description
I updated the Latex to Unicode conversion to ensure the tilde (~) is converted to a non breaking space (\u00a0) instead of a standard space and I should mention that I used a negative lookbehind regex to make sure this only affects standalone tildes and doesn't break Latex tilde accents like ~{n}. I think this would maintain support for accented characters.
Screenshot
Checklist
CHANGELOG.mdin a way that can be understood by the average user (if change is visible to the user)