refactor(metadata): avoid regex for AmazonBookParser by imnotjames · Pull Request #926 · grimmory-tools/grimmory

imnotjames · 2026-04-27T02:21:20Z

Description

this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression. This defines whitespace as only blank text nodes & br elements. In the future we could include other values or improve these in other ways, but this is close enough to match the existing behavior.

Linked Issue: Fixes #924

Changes

use jsoup rather than regex for cleaning up line breaks
add test to confirm behavior

Summary by CodeRabbit

Bug Fixes
- Improved book description formatting for titles imported from Amazon, normalizing line breaks and trimming extraneous whitespace so content blocks render with consistent spacing.
Tests
- Added a unit test to verify normalized description output after HTML cleaning to prevent regressions.

this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression

coderabbitai · 2026-04-27T02:21:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cb3c373c-a824-4233-bebf-7af2afc20556

📥 Commits

Reviewing files that changed from the base of the PR and between 73284ee and 915d2d0.

📒 Files selected for processing (1)

backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java

🚧 Files skipped from review as they are similar to previous changes (1)

backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: Test Suite / Backend Tests
GitHub Check: Test Suite / Frontend Tests
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (java-kotlin)
GitHub Check: Frontend Lint Threshold Check

📝 Walkthrough

Walkthrough

Replaces regex-based HTML cleaning in AmazonBookParser with DOM inspection via Jsoup: adds an isWhitespaceNode helper, removes leading/trailing blank text and   nodes, simplifies   iteration, and removes unused regex constants. Adds a test asserting normalized description output from fetchTopMetadata.

Changes

Cohort / File(s)	Summary
HTML Cleaning Refactor `backend/src/main/java/.../AmazonBookParser.java`	Replaces regex-driven `<br>`/whitespace normalization with DOM node inspection; adds `isWhitespaceNode(Node)`; removes blank text/`br` nodes at document boundaries; simplifies `br` iteration; deletes now-unused regex constants.
Test Coverage `backend/src/test/java/.../AmazonBookParserTest.java`	Adds a test for `fetchTopMetadata` to verify description HTML is cleaned/normalized (consistent `<br><br>` between blocks); configures `Jsoup.parse` to use real parsing during the test and adds necessary assertions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

zachyale

Poem

🐰
I nibbled regex down to size,
Jsoup helped tidy sleepy eyes.
Brs and blanks hop out of sight,
Cleaned descriptions, snug and right.
Hooray — the parser's feeling light! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title follows the conventional commit format with a type (refactor) and scope (metadata), clearly summarizing the main change.
Description check	✅ Passed	The pull request description includes all required sections: description explaining the changes, linked issue reference, and detailed changes list.
Linked Issues check	✅ Passed	The code changes directly address issue `#924` by replacing regex-based BR processing with DOM-based jsoup handling, eliminating inefficient regex and avoiding HTML parsing with regular expressions.
Out of Scope Changes check	✅ Passed	All changes are scoped to addressing the inefficient regex issue in AmazonBookParser; modifications to the parser logic and addition of test coverage are directly aligned with issue `#924`.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

imnotjames · 2026-04-27T02:25:33Z

This is related to Code scanning alert 126 but is frankly the best practice when dealing with HTML. HTML is not a regular language, so regular expressions for parsing it becomes difficult if not outright impossible.

More information here on why you should avoid parsi̶n̷g̷ HTML with a̶ ̴r̸e̵g̵u̵l̴a̵r̶ ̶e̷x̵p̷r̴e̴š̵̜̟̳̼͉̆̆̍̎͝s̸̤̰͈̭̑ì̵̡̭̖̟͒̾͊ō̸̧̻̿͜n̶͖̏͋

coderabbitai

🧹 Nitpick comments (2)

backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (1)
886-902: Optional: simplify with pattern matching.

Given Java 25 with --enable-preview is the build target for this codebase, this helper could collapse to a single boolean expression using pattern matching for instanceof, eliminating the explicit casts.
♻️ Proposed simplification
- private boolean isWhitespaceNode(Node node) {
- if (node instanceof TextNode) {
- if (((TextNode) node).isBlank()) {
- return true;
- }
- }
-
- if (node instanceof Element) {
- Element element = (Element) node;
-
- if ("br".equals(element.tagName())) {
- return true;
- }
- }
-
- return false;
- }
+ private boolean isWhitespaceNode(Node node) {
+ if (node instanceof TextNode textNode) {
+ return textNode.isBlank();
+ }
+ if (node instanceof Element element) {
+ return "br".equals(element.tagName());
+ }
+ return false;
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java`
around lines 886 - 902, Replace the current body of isWhitespaceNode with a
single boolean expression using Java 25 pattern-matching instanceof: return
(node instanceof TextNode t && t.isBlank()) || (node instanceof Element e &&
"br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank
and Element.tagName() == "br") while removing explicit casts and nested ifs.
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java (1)
197-220: LGTM — test exercises the new DOM-based cleanup end-to-end.

The HTML input intentionally mixes  ,  , and   with surrounding whitespace and embedded <div> blocks, so this verifies both the consecutive-  collapsing (more than 2 → 2) and the new leading/trailing whitespace+  trim via isWhitespaceNode. Routing Jsoup.parse(anyString()) to the real method (Line 214) is necessary because Jsoup is statically mocked in setup() and cleanDescriptionHtml invokes Jsoup.parse internally.

One small nit: consider asserting on metadata.getDescription() directly (without the \n strip) so any unintended newline reintroduction is caught — body().html().trim() already produces a deterministic, newline-free output for this fixture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java`
around lines 197 - 220, In AmazonBookParserTest.test
fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks),
remove the manual newline strip on metadata.getDescription() and assert the
description value directly; specifically replace the current assertEquals that
compares actual (built by metadata.getDescription().replace("\n", "")) with a
direct assertEquals(expected, metadata.getDescription()) so the test will catch
any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java`:
- Around line 886-902: Replace the current body of isWhitespaceNode with a
single boolean expression using Java 25 pattern-matching instanceof: return
(node instanceof TextNode t && t.isBlank()) || (node instanceof Element e &&
"br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank
and Element.tagName() == "br") while removing explicit casts and nested ifs.

In
`@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java`:
- Around line 197-220: In AmazonBookParserTest.test
fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks),
remove the manual newline strip on metadata.getDescription() and assert the
description value directly; specifically replace the current assertEquals that
compares actual (built by metadata.getDescription().replace("\n", "")) with a
direct assertEquals(expected, metadata.getDescription()) so the test will catch
any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4e830b2c-aa14-4e28-8400-a70659b1b0ac

📥 Commits

Reviewing files that changed from the base of the PR and between 1c447e6 and 73284ee.

📒 Files selected for processing (2)

backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Packaging Smoke Test

🧰 Additional context used

📓 Path-based instructions (2)

backend/src/**/*.java

📄 CodeRabbit inference engine (AGENTS.md)

backend/src/**/*.java: Use 4-space indentation and match surrounding Java style in backend code
Prefer constructor injection via Lombok patterns already used in the codebase. Do not introduce @Autowired field injection in backend code
Use MapStruct for entity/DTO mapping in backend code
Keep JPA entities on the *Entity suffix in backend code

Files:

backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java

backend/src/test/**/*.java

📄 CodeRabbit inference engine (AGENTS.md)

Prefer focused unit tests; use @SpringBootTest only when the Spring context is required in backend code

Files:

backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java

🧠 Learnings (2)

📚 Learning: 2026-04-10T08:15:37.436Z

Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 449
File: booklore-api/src/main/java/org/booklore/service/book/BookDownloadService.java:139-145
Timestamp: 2026-04-10T08:15:37.436Z
Learning: When using Spring `ContentDisposition.builder(...).filename(name, StandardCharsets.UTF_8).build()` (i.e., explicitly providing UTF-8), the resulting header value should include both the quoted `filename="=?UTF-8?..."` and the RFC 5987 `filename*=` parameters. In this case, any extra ASCII fallback computation (e.g., deriving an ASCII `fallbackFilename` via `NON_ASCII_PATTERN` and calling `.filename(fallbackFilename)`) is likely redundant—prefer calling only `.filename(fallbackName?, StandardCharsets.UTF_8)` as appropriate and let Spring handle the UTF-8 header parameters. Verify by comparing the emitted header for `filename` and `filename*` before deciding to keep an ASCII fallback.

Applied to files:

backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java

📚 Learning: 2026-04-14T12:43:08.698Z

Learnt from: balazs-szucs
Repo: grimmory-tools/grimmory PR: 502
File: booklore-api/src/main/java/org/booklore/service/reader/ChapterCacheService.java:0-0
Timestamp: 2026-04-14T12:43:08.698Z
Learning: For this codebase (booklore-api), target Java 25 with `--enable-preview`, so `_` is intentionally used as an unnamed/ignored variable (e.g., lambda parameter or pattern variable) per Java’s preview feature JEP 456. Do not flag `_` in those contexts as an invalid/reserved identifier; only flag it if it’s used in a non-supported position (e.g., where an unnamed variable is not applicable for the Java preview rules).

Applied to files:

backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java

🔇 Additional comments (2)

backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (2)

933-948: LGTM — iteration is safe over removed nodes.

The enhanced-for cleanly replaces the index-based traversal. Capturing temp before advancing next = next.nextElementSibling() ensures the walk isn't disturbed by temp.remove(). Outer iterations that later land on already-detached   elements are safe: a detached node returns null from nextElementSibling(), so the inner loop is a no-op.

950-960: LGTM — DOM-based trim is a clean replacement for the regex.

Removing leading/trailing whitespace text nodes and   elements via firstChild()/lastChild() walk is equivalent to (and safer than) the previous regex-based stripping, and correctly addresses the CodeQL inefficient-regex finding for HTML content. Returning document.body().html().trim() gives a stable, normalized output that matches the new test expectation.

)

fix(metadata): avoid regex for AmazonBookParser

73284ee

this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression

imnotjames marked this pull request as ready for review April 27, 2026 02:25

coderabbitai Bot added backend enhancement labels Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

imnotjames changed the title ~~fix(metadata): avoid regex for AmazonBookParser~~ refactor(metadata): avoid regex for AmazonBookParser Apr 27, 2026

refactor: simplify checking

915d2d0

coderabbitai Bot requested a review from zachyale April 27, 2026 02:30

balazs-szucs approved these changes Apr 27, 2026

View reviewed changes

imnotjames merged commit bf7c2d8 into grimmory-tools:develop Apr 27, 2026
16 checks passed

paulbovbel pushed a commit to paulbovbel/grimmory that referenced this pull request Apr 27, 2026

refactor(metadata): avoid regex for AmazonBookParser (grimmory-tools#926

d3257db

)

dsmouse pushed a commit to dsmouse/grimmory that referenced this pull request May 11, 2026

refactor(metadata): avoid regex for AmazonBookParser (grimmory-tools#926

5b0c31d

)

coderabbitai Bot mentioned this pull request May 19, 2026

HTML Markup From Metadata Sources Should Be Automatically Sanitized (remove unnecesary tags and such) #1402

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(metadata): avoid regex for AmazonBookParser#926

refactor(metadata): avoid regex for AmazonBookParser#926
imnotjames merged 2 commits into
grimmory-tools:developfrom
imnotjames:refactor/924/use-jsoup-rather-than-regex

imnotjames commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

imnotjames commented Apr 27, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

imnotjames commented Apr 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

imnotjames commented Apr 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

imnotjames commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading