Preserve no break spaces in Latex to Unicode conversion #15174

Merged
koppor merged 15 commits into JabRef:main from faneeshh:fix-latex-unicode-nbsp on Mar 3, 2026

Conversation

@faneeshh

@faneeshh faneeshh commented Feb 21, 2026

Collaborator

Closes #15158
PR Description

I updated the LaTeX to Unicode conversion so that a tilde (~) is converted to a non-breaking space (\u00a0) instead of a standard space. I used a negative lookbehind regex so that only standalone tildes are affected and LaTeX tilde accents like \~{n} are left intact, which should maintain support for accented characters.
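The negative lookbehind described above can be sketched roughly as follows. This is a minimal illustration, not the actual JabRef code: the class `TildeLookbehindDemo` and the method `tildesToNoBreakSpaces` are hypothetical names, and only the regex and the U+00A0 replacement are taken from this PR.

```java
// Hypothetical demo class; not part of JabRef.
class TildeLookbehindDemo {

    // Replaces every tilde that is NOT directly preceded by a backslash
    // with U+00A0, so accent commands such as \~{n} stay untouched.
    static String tildesToNoBreakSpaces(String input) {
        return input.replaceAll("(?<!\\\\)~", "\u00a0");
    }

    public static void main(String[] args) {
        // Standalone tilde: converted to a no-break space.
        System.out.println(tildesToNoBreakSpaces("Y.~Matsumoto").equals("Y.\u00a0Matsumoto"));
        // Accent command: left as-is for the later LaTeX parsing step.
        System.out.println(tildesToNoBreakSpaces("Espa\\~{n}a").equals("Espa\\~{n}a"));
    }
}
```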

Screenshot

image

Checklist

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • [/] I added screenshots in the PR description (if change is visible to the user)
  • I added a screenshot in the PR description showing a library with a single entry with me as author and as title the issue number
  • I described the change in CHANGELOG.md in a way that can be understood by the average user (if change is visible to the user)
  • I checked the user documentation for up to dateness and submitted a pull request to our user documentation repository

@github-actions github-actions Bot added the status: changes-required Pull requests that are not yet complete label Feb 21, 2026
@qodo-free-for-open-source-projects

Contributor

Review Summary by Qodo

Fix LaTeX tilde to non-breaking space conversion

🐞 Bug fix

Walkthroughs

Description
• Replace LaTeX tildes with non-breaking spaces instead of standard spaces
• Use negative lookbehind regex to preserve LaTeX tilde accents like \~{n}
• Add test case for no-break space preservation in author names
• Update CHANGELOG with fix description
Diagram
flowchart LR
  A["LaTeX input with tilde"] -->|"negative lookbehind regex"| B["Standalone tildes identified"]
  B -->|"replace with \\u00a0"| C["Non-breaking space output"]
  D["LaTeX accents like \\~{n}"] -->|"not matched"| E["Preserved unchanged"]

File Changes

1. jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java 🐞 Bug fix +2/-1

Implement non-breaking space conversion for LaTeX tildes

• Added regex pattern (?<!\\\\)~ to match standalone tildes not preceded by backslash
• Replace matched tildes with Unicode non-breaking space character \u00a0
• Preserve LaTeX tilde accents by using negative lookbehind assertion

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java


2. jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java 🧪 Tests +6/-1

Add tests for non-breaking space conversion

• Update existing test equationsMoreComplicatedFormatting to expect non-breaking space
• Add new test formatPreserveNoBreakSpaces to verify tilde conversion in author names
• Test case validates Y.~Matsumoto converts to Y.\u00a0Matsumoto

jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java


3. CHANGELOG.md 📝 Documentation +1/-0

Document LaTeX tilde conversion fix

• Add entry documenting fix for LaTeX tilde to non-breaking space conversion
• Reference issue #15158 in the changelog entry
• Place entry in Fixed section with appropriate formatting

CHANGELOG.md


@github-actions github-actions Bot added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Feb 21, 2026
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Feb 21, 2026

Contributor

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 📎 Requirement gaps (0)


Action required

1. NBSP breaks SQL search 🐞 Bug ✓ Correctness
Description
LatexToUnicodeAdapter.format now emits NBSP for ~, and the search index stores this value as the
transformed field. Because search queries use LIKE/ILIKE against the transformed column without
normalizing whitespace, searching with a normal space may no longer match entries whose transformed
text contains NBSP.
Code

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[R33-34]

+        String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0");
+        toFormat = UNDERSCORE_MATCHER.matcher(toFormat).replaceAll(REPLACEMENT_CHAR);
Evidence
The adapter replaces ~ with NBSP, the indexer persists LatexToUnicodeAdapter.format(value) as
FIELD_VALUE_TRANSFORMED, and SQL search uses LIKE/ILIKE comparisons against
FIELD_VALUE_TRANSFORMED with user-provided terms. LIKE/ILIKE are character-based, so U+0020
(space) does not match U+00A0 (NBSP).

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-35]
jablib/src/main/java/org/jabref/logic/search/indexing/BibFieldsIndexer.java[474-476]
jablib/src/main/java/org/jabref/logic/search/query/SearchToSqlVisitor.java[280-299]
jablib/src/main/java/org/jabref/logic/search/query/SearchToSqlVisitor.java[551-555]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LatexToUnicodeAdapter` now converts LaTeX `~` to NBSP (U+00A0). This value is written into the search index’s transformed column and is then queried using SQL LIKE/ILIKE. Because LIKE/ILIKE are character-sensitive, search terms containing normal spaces (U+0020) will not match transformed content containing NBSP.
## Issue Context
The adapter is used for both UI-ish formatting and backend indexing/search normalization. Preserving NBSP for display is desirable, but indexing/search likely needs whitespace normalization to maintain expected matching behavior.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/search/indexing/BibFieldsIndexer.java[474-476]
- jablib/src/main/java/org/jabref/logic/search/query/SearchToSqlVisitor.java[222-263]
- jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-35]
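
One possible way to address the whitespace mismatch described above is to normalize U+00A0 to a plain space on both the indexed value and the query term. The sketch below only illustrates that idea with in-memory strings; `SearchWhitespaceDemo` and `normalizeForSearch` are hypothetical names, and JabRef's indexer and SQL visitor may handle this differently.

```java
// Hypothetical helper; not taken from BibFieldsIndexer or SearchToSqlVisitor.
class SearchWhitespaceDemo {

    // Maps the no-break space (U+00A0) to a regular space (U+0020)
    // so that character-based comparisons treat them as equal.
    static String normalizeForSearch(String value) {
        return value.replace('\u00a0', ' ');
    }

    public static void main(String[] args) {
        String indexed = "Y.\u00a0Matsumoto"; // value as produced by the converter
        String query = "Y. Matsumoto";        // what a user would type

        // Without normalization the substring match fails.
        System.out.println(indexed.contains(query));
        // With normalization on both sides it succeeds.
        System.out.println(normalizeForSearch(indexed).contains(normalizeForSearch(query)));
    }
}
```

Normalizing only at index time would keep the no-break space available for display while leaving search behavior unchanged, which matches the split between display and indexing that the review comment points out.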



Remediation recommended

2. Inline regex in parse() 📘 Rule violation ➹ Performance
Description
parse() uses String.replaceAll with a nontrivial regex, which recompiles the pattern on each
call and reduces readability. This violates the rule to prefer compiled Pattern reuse for regex
operations where applicable.
Code

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[33]

+        String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0");
Evidence
PR Compliance ID 14 requires compiling and reusing regex patterns via Pattern.compile(...) and
matcher(...) where applicable; the changed line uses an inline regex replacement directly in
replaceAll.

AGENTS.md
jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[33-33]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LatexToUnicodeAdapter.parse()` uses an inline regex via `String.replaceAll`, which recompiles the pattern each time and conflicts with the guideline to use compiled `Pattern`s for nontrivial regex.
## Issue Context
The replacement regex contains a negative lookbehind and runs as part of `parse()`, which may be called frequently.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[33-34]
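
The difference between the inline call and a reused compiled pattern can be sketched as below. The class `CompiledPatternDemo` and its two methods are hypothetical; the constant names mirror those that appear later in this thread. Both variants produce the same output, and the second simply avoids recompiling the regex on every call.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not the actual LatexToUnicodeAdapter.
class CompiledPatternDemo {

    // Compiled once when the class is loaded and reused for every call.
    private static final Pattern TILDE_MATCHER = Pattern.compile("(?<!\\\\)~");
    private static final String NO_BREAK_SPACE = "\u00a0";

    // Inline variant: String.replaceAll compiles the regex on each invocation.
    static String inline(String input) {
        return input.replaceAll("(?<!\\\\)~", NO_BREAK_SPACE);
    }

    // Precompiled variant: same result, no per-call compilation.
    static String precompiled(String input) {
        return TILDE_MATCHER.matcher(input).replaceAll(NO_BREAK_SPACE);
    }

    public static void main(String[] args) {
        String sample = "Y.~Matsumoto and Espa\\~{n}a";
        System.out.println(inline(sample).equals(precompiled(sample)));
    }
}
```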


3. Changelog uses Latex spelling 📘 Rule violation ✓ Correctness
Description
The new changelog entry uses Latex instead of the standard LaTeX spelling, which is a
typographical/style mistake in user-facing documentation. This reduces professionalism and
consistency of written content.
Code

CHANGELOG.md[45]

+- We fixed an issue where Latex to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158))
Evidence
PR Compliance ID 28 requires new/modified changelog text to be free of typographical/formatting
mistakes; the added line uses the nonstandard capitalization Latex instead of LaTeX.

CHANGELOG.md[45-45]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The changelog entry contains a typographical/style issue: `Latex` should be `LaTeX`.
## Issue Context
This is user-facing documentation and should match established terminology/capitalization used in the project.
## Fix Focus Areas
- CHANGELOG.md[45-45]



Advisory comments

4. Test name not self-documenting 📘 Rule violation ✓ Correctness
Description
The new test method name formatPreserveNoBreakSpaces is grammatically awkward and uses a less
common term (NoBreak) instead of the clearer NonBreaking. This reduces readability and
self-documentation in tests.
Code

jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java[R209-210]

+    void formatPreserveNoBreakSpaces() {
+        assertEquals("Y.\u00a0Matsumoto", formatter.format("Y.~Matsumoto"));
Evidence
PR Compliance ID 2 requires meaningful, self-documenting names; the added test name is less clear
than a conventional alternative like formatPreservesNonBreakingSpaces.

Rule 2: Generic: Meaningful Naming and Self-Documenting Code
jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java[209-210]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The newly added test method name is not as clear/self-documenting as it could be.
## Issue Context
Test names serve as documentation for expected behavior; using conventional terminology improves readability.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java[209-211]


5. Tilde after \\ not handled 🐞 Bug ✓ Correctness
Description
The negative lookbehind only converts ~ when not preceded by a backslash, so it will not convert
\\~ (two backslashes then tilde). If such sequences are used in fields (and JabRef code treats them
as removable standalone tildes), NBSP preservation would be incomplete.
Code

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[33]

+        String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0");
Evidence
The replacement explicitly skips any tilde immediately preceded by \. JabRef’s own formatter tests
treat \\~ as a case where the tilde is not an accent command and should be processed (removed
there), suggesting it can appear as a standalone tilde in practice.

jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-34]
jablib/src/test/java/org/jabref/logic/layout/format/RemoveTildeTest.java[28-37]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The current negative-lookbehind replacement skips any `~` preceded by a backslash. This likely avoids breaking accent commands like `\~{n}`, but it also skips cases like `\\~` (two backslashes then tilde), which JabRef tests treat as a standalone tilde in other formatting contexts.
## Issue Context
Depending on intended LaTeX semantics in JabRef fields, `\\~` may represent a non-breaking space after a LaTeX command (or simply literal backslashes followed by a standalone tilde). If so, NBSP preservation would be incomplete.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-35]
- jablib/src/test/java/org/jabref/logic/layout/format/LatexToUnicodeFormatterTest.java[133-212]
- jablib/src/test/java/org/jabref/logic/layout/format/RemoveTildeTest.java[28-37]
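
To make the advisory above concrete, the sketch below compares the lookbehind from this PR with a hypothetical alternative that treats a tilde as standalone when it follows an even number of backslashes. The alternative regex and the class `EscapedTildeDemo` are assumptions for illustration only; the PR does not adopt them, and whether `\\~` should become a no-break space depends on the intended LaTeX semantics.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the parity-aware pattern is NOT part of this PR.
class EscapedTildeDemo {

    // Pattern used in the PR: skips any tilde directly preceded by a backslash.
    private static final Pattern CURRENT = Pattern.compile("(?<!\\\\)~");

    // Assumed alternative: zero or more backslash PAIRS (captured in group 1)
    // followed by a tilde, where the run is not itself preceded by a backslash.
    private static final Pattern EVEN_BACKSLASHES =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)~");

    static String current(String input) {
        return CURRENT.matcher(input).replaceAll("\u00a0");
    }

    static String parityAware(String input) {
        // $1 re-inserts the captured backslash pairs unchanged.
        return EVEN_BACKSLASHES.matcher(input).replaceAll("$1\u00a0");
    }

    public static void main(String[] args) {
        String doubleBackslash = "a\\\\~b"; // a, two backslashes, tilde, b
        System.out.println(current(doubleBackslash).equals(doubleBackslash));
        System.out.println(parityAware(doubleBackslash).equals("a\\\\\u00a0b"));
        System.out.println(parityAware("\\~{n}").equals("\\~{n}"));
    }
}
```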

Comment on lines +33 to +34
String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0");
toFormat = UNDERSCORE_MATCHER.matcher(toFormat).replaceAll(REPLACEMENT_CHAR);

Action required

1. NBSP breaks SQL search 🐞 Bug ✓ Correctness

LatexToUnicodeAdapter.format now emits NBSP for ~, and the search index stores this value as the
transformed field. Because search queries use LIKE/ILIKE against the transformed column without
normalizing whitespace, searching with a normal space may no longer match entries whose transformed
text contains NBSP.

@github-actions

Contributor

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

@aniruddhkorde9-dot


Hello! I am a new contributor looking to get involved. I would love to work on this issue. Could you please assign it to me?

@calixtus calixtus changed the title Fix #15158: Preserve no break spaces in Latex to Unicode conversion Preserve no break spaces in Latex to Unicode conversion Feb 21, 2026
@subhramit

Member

Hello! I am a new contributor looking to get involved. I would love to work on this issue. Could you please assign it to me?

Are you aware that this is a PR and not an issue?

calixtus
calixtus previously approved these changes Feb 21, 2026

private static final Pattern TILDE_MATCHER = Pattern.compile("(?<!\\\\)~");

private static final String NON_BREAKING_SPACE = "\u00a0";

Suggested change
-private static final String NON_BREAKING_SPACE = "\u00a0";
+private static final String NO_BREAK_SPACE = "\u00a0";

The formal unicode character name is "no-break" space (https://www.unicode.org/charts/charindex.html), which I think is a better choice than the alternative name "non-breaking space" unless JabRef has a precedent of using the alternative. Thanks for the fix, @faneeshh.
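
The naming point can be checked from Java itself: `Character.getName(int)` returns the formal Unicode name of a code point. The wrapper class `UnicodeNameDemo` below is a hypothetical name used only for this sketch.

```java
// Hypothetical demo class for looking up the formal Unicode name.
class UnicodeNameDemo {

    // Returns the Unicode name of the given code point, as known to the JDK.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        // U+00A0 is listed as NO-BREAK SPACE in the Unicode character data.
        System.out.println(nameOf(0x00A0));
    }
}
```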

/// @return an `Optional<String>` with LaTeX resolved into Unicode or `empty` on failure.
public static Optional<String> parse(@NonNull String inField) {
-String toFormat = UNDERSCORE_MATCHER.matcher(inField).replaceAll(REPLACEMENT_CHAR);
+String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE);

Suggested change
-String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE);
+String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NO_BREAK_SPACE);

}

@Test
void formatPreservesNoBreakingSpaces() {

Suggested change
-void formatPreservesNoBreakingSpaces() {
+void formatPreservesNoBreakSpaces() {

Comment thread CHANGELOG.md Outdated

### Fixed

- We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158))

Suggested change
-- We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158))
+- We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of no-break spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158))

calixtus
calixtus previously approved these changes Feb 23, 2026
@calixtus calixtus enabled auto-merge February 23, 2026 21:51
@Siedlerchr

Member

8
BibEntryTest > getFieldOrAliasLatexFreeComplexConversionInAlias() FAILED
org.opentest4j.AssertionFailedError at BibEntryTest.java:243

@subhramit subhramit disabled auto-merge February 23, 2026 22:34
@faneeshh

Collaborator Author

8 BibEntryTest > getFieldOrAliasLatexFreeComplexConversionInAlias() FAILED org.opentest4j.AssertionFailedError at BibEntryTest.java:243

I'm guessing it's failing because it expects a standard space where I've now introduced the no-break space (\u00a0)? If that is the case, should I update the test's expected string to match the new Unicode char? Does that sound right?

@koppor

koppor commented Mar 2, 2026

Member

and DOIs.

Why? -- https://www.doi.org/resources/DOI_URI_Scheme.pdf does not contain any reference to ~

@koppor

koppor commented Mar 2, 2026

Member

I'm guessing it's failing because it expects a standard space where I've now introduced the no-break space (\u00a0)? If that is the case, should I update the test's expected string to match the new Unicode char? Does that sound right?

Yes, please - then we can easily see if it's OK. Right now, five people in a call are staring at the screen and not understanding anything.

Please do it ASAP.
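
Because a no-break space looks identical to a regular space in most test output, a mismatch like this one is hard to spot by eye. A small helper that prints non-ASCII characters as escapes can make the difference visible. The class `VisibleEscapesDemo` and the method `escapeNonAscii` are hypothetical and not part of JabRef or JUnit; this is only a sketch of the idea.

```java
// Hypothetical debugging helper; not part of JabRef or JUnit.
class VisibleEscapesDemo {

    // Renders control characters and anything outside printable ASCII
    // as a backslash-u escape with four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder result = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                result.append(String.format("\\u%04x", (int) c));
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String expected = "Y. Matsumoto";      // regular space
        String actual = "Y.\u00a0Matsumoto";   // no-break space
        // The two strings look alike when printed raw, but differ once escaped.
        System.out.println(escapeNonAscii(expected));
        System.out.println(escapeNonAscii(actual));
    }
}
```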

@faneeshh

faneeshh commented Mar 2, 2026

Collaborator Author

and DOIs.

Why? -- https://www.doi.org/resources/DOI_URI_Scheme.pdf does not contain any reference to ~

I understand that real DOIs don't contain tildes per the DOI scheme, but test 6 in CitationStyleGeneratorTest (10.1161/circ.108_827022~special) appeared to be an edge case to test unusual characters, which is probably why I got confused. Anyway, I've updated the tests to use the no-break space.

@github-actions github-actions Bot added status: no-bot-comments and removed status: changes-required Pull requests that are not yet complete labels Mar 2, 2026
@koppor

koppor commented Mar 3, 2026

Member

and DOIs.

Why? -- https://www.doi.org/resources/DOI_URI_Scheme.pdf does not contain any reference to ~

I understand that real DOIs don't contain tildes per the DOI scheme, but test 6 in CitationStyleGeneratorTest (10.1161/circ.108_827022~special) appeared to be an edge case to test unusual characters, which is probably why I got confused. Anyway, I've updated the tests to use the no-break space.

Learning: Think "out of the box"

Go to https://doi.org/

Enter the DOI with ~

image

See

image

--> The test is wrong. When the test at hand was reviewed, the reviewers did not notice this.

@koppor koppor enabled auto-merge March 3, 2026 09:01
@faneeshh

faneeshh commented Mar 3, 2026

Collaborator Author

Learning: Think "out of the box"

Yeah learning something new with every PR I open... really grateful tho

@koppor koppor added status: to-be-merged PRs which are accepted and should go into the merge-queue. and removed status: no-bot-comments labels Mar 3, 2026
@testlens-app

testlens-app Bot commented Mar 3, 2026


⚠️ All checks passed after TestLens muted 3 tests ⚠️

Here is what you should do:

  • Inspect the muted tests carefully.
  • If you are convinced it's fine to ignore these tests, go ahead and merge this PR.
  • If not, re-enable relevant tests by deselecting checkboxes below and rerun checks.

Test Summary

Check | Project/Task | Test | Runs
Source Code Tests / Unit tests – jabgui | :jabgui:test | DownloadLinkedFileActionTest > doesntReplaceSourceURL(boolean) | 🔇
Source Code Tests / Unit tests – jabgui | :jabgui:test | DownloadLinkedFileActionTest > replacesLinkedFiles(Path) | 🔇
Source Code Tests / Unit tests – jabgui | :jabgui:test | LinkedFileViewModelTest > downloadPdfFileWhenLinkedFilePointsToPdfUrl(boolean) | 🔇

🏷️ Commit: fd3e4bc
▶️ Tests: 10099 executed | 3 muted
⚪️ Checks: 50/50 completed

@koppor koppor added this pull request to the merge queue Mar 3, 2026
Merged via the queue into JabRef:main with commit 14a34e5 Mar 3, 2026
50 checks passed
Siedlerchr added a commit that referenced this pull request Mar 5, 2026
…rg.openrewrite.recipe-rewrite-recipe-bom-3.25.0

* upstream/main: (35 commits)
  Chore: add dependency-management.md (#15278)
  Chore(deps): Bump dev.langchain4j:langchain4j-bom in /versions (#15277)
  New Crowdin updates (#15274)
  Chore(deps): Bump actions/upload-artifact from 6 to 7 (#15271)
  Chore(deps): Bump actions/download-artifact from 7 to 8 (#15270)
  Chore(deps): Bump docker/login-action from 3 to 4 (#15268)
  Fix threading issues in citations relations tab (#15233)
  Fix: Citavi XML importer now preserves citation keys (#14658) (#15257)
  Preserve no break spaces in Latex to Unicode conversion (#15174)
  Fix: open javafx.scene.control.skin to controlsfx (#15260)
  Reduce complexity in dependencies setup (restore) (#15194)
  New translations jabref_en.properties (French) (#15256)
  Fix: exception dialog shows up when moving sidepanel down/up (#15248)
  Implement reset for Name Display Preferences (#15136)
  Chore(deps): Bump net.bytebuddy:byte-buddy in /versions (#15252)
  Chore(deps): Bump io.zonky.test.postgres:embedded-postgres-binaries-bom (#15253)
  Chore(deps): Bump io.zonky.test:embedded-postgres in /versions (#15254)
  Chore(deps): Bump net.ltgt.errorprone from 5.0.0 to 5.1.0 in /jablib (#15251)
  New Crowdin updates (#15247)
  Refined the "Select files to import" page in "Search for unlinked local files" dialog (#15110)
  ...
Labels

good first issue An issue intended for project-newcomers. Varies in difficulty. status: no-bot-comments status: to-be-merged PRs which are accepted and should go into the merge-queue.

Development

Successfully merging this pull request may close these issues.

Convert LaTeX to Unicode feature should preseve no-break spaces

7 participants