Skip to content

refactor(metadata): avoid regex for AmazonBookParser#926

Merged
imnotjames merged 2 commits into
grimmory-tools:developfrom
imnotjames:refactor/924/use-jsoup-rather-than-regex
Apr 27, 2026
Merged

refactor(metadata): avoid regex for AmazonBookParser#926
imnotjames merged 2 commits into
grimmory-tools:developfrom
imnotjames:refactor/924/use-jsoup-rather-than-regex

Conversation

@imnotjames

@imnotjames imnotjames commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Description

this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression. This defines whitespace as only blank text nodes & br elements. In the future we could include other values or improve these in other ways, but this is close enough to match the existing behavior.

Linked Issue: Fixes #924

Changes

  • use jsoup rather than regex for cleaning up line breaks
  • add test to confirm behavior

Summary by CodeRabbit

  • Bug Fixes
    • Improved book description formatting for titles imported from Amazon, normalizing line breaks and trimming extraneous whitespace so content blocks render with consistent spacing.
  • Tests
    • Added a unit test to verify normalized description output after HTML cleaning to prevent regressions.

this resolves the CodeQL warning that an inefficient regular expression
is in place for AmazonBookParser and also avoids trying to parse HTML
with a regular expression
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cb3c373c-a824-4233-bebf-7af2afc20556

📥 Commits

Reviewing files that changed from the base of the PR and between 73284ee and 915d2d0.

📒 Files selected for processing (1)
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Suite / Backend Tests
  • GitHub Check: Test Suite / Frontend Tests
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (java-kotlin)
  • GitHub Check: Frontend Lint Threshold Check

📝 Walkthrough

Walkthrough

Replaces regex-based HTML cleaning in AmazonBookParser with DOM inspection via Jsoup: adds an isWhitespaceNode helper, removes leading/trailing blank text and <br> nodes, simplifies <br> iteration, and removes unused regex constants. Adds a test asserting normalized description output from fetchTopMetadata.

Changes

Cohort / File(s) Summary
HTML Cleaning Refactor
backend/src/main/java/.../AmazonBookParser.java
Replaces regex-driven <br>/whitespace normalization with DOM node inspection; adds isWhitespaceNode(Node); removes blank text/br nodes at document boundaries; simplifies br iteration; deletes now-unused regex constants.
Test Coverage
backend/src/test/java/.../AmazonBookParserTest.java
Adds a test for fetchTopMetadata to verify description HTML is cleaned/normalized (consistent <br><br> between blocks); configures Jsoup.parse to use real parsing during the test and adds necessary assertions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • zachyale

Poem

🐰
I nibbled regex down to size,
Jsoup helped tidy sleepy eyes.
Brs and blanks hop out of sight,
Cleaned descriptions, snug and right.
Hooray — the parser's feeling light! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title follows the conventional commit format with a type (refactor) and scope (metadata), clearly summarizing the main change.
Description check ✅ Passed The pull request description includes all required sections: description explaining the changes, linked issue reference, and detailed changes list.
Linked Issues check ✅ Passed The code changes directly address issue #924 by replacing regex-based BR processing with DOM-based jsoup handling, eliminating inefficient regex and avoiding HTML parsing with regular expressions.
Out of Scope Changes check ✅ Passed All changes are scoped to addressing the inefficient regex issue in AmazonBookParser; modifications to the parser logic and addition of test coverage are directly aligned with issue #924.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@imnotjames

Copy link
Copy Markdown
Contributor Author

This is related to Code scanning alert 126 but is frankly the best practice when dealing with HTML. HTML is not a regular language, so regular expressions for parsing it becomes difficult if not outright impossible.

More information here on why you should avoid parsi̶n̷g̷ HTML with a̶ ̴r̸e̵g̵u̵l̴a̵r̶ ̶e̷x̵p̷r̴e̴š̵̜̟̳̼͉̆̆̍̎͝s̸̤̰͈̭̑ì̵̡̭̖̟͒̾͊ō̸̧̻̿͜n̶͖̏͋

@imnotjames imnotjames marked this pull request as ready for review April 27, 2026 02:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (1)

886-902: Optional: simplify with pattern matching.

Given Java 25 with --enable-preview is the build target for this codebase, this helper could collapse to a single boolean expression using pattern matching for instanceof, eliminating the explicit casts.

♻️ Proposed simplification
-    private boolean isWhitespaceNode(Node node) {
-        if (node instanceof TextNode) {
-            if (((TextNode) node).isBlank()) {
-                return true;
-            }
-        }
-
-        if (node instanceof Element) {
-            Element element = (Element) node;
-
-            if ("br".equals(element.tagName())) {
-                return true;
-            }
-        }
-
-        return false;
-    }
+    private boolean isWhitespaceNode(Node node) {
+        if (node instanceof TextNode textNode) {
+            return textNode.isBlank();
+        }
+        if (node instanceof Element element) {
+            return "br".equals(element.tagName());
+        }
+        return false;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java`
around lines 886 - 902, Replace the current body of isWhitespaceNode with a
single boolean expression using Java 25 pattern-matching instanceof: return
(node instanceof TextNode t && t.isBlank()) || (node instanceof Element e &&
"br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank
and Element.tagName() == "br") while removing explicit casts and nested ifs.
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java (1)

197-220: LGTM — test exercises the new DOM-based cleanup end-to-end.

The HTML input intentionally mixes <br />, <br >, and <br> with surrounding whitespace and embedded <div> blocks, so this verifies both the consecutive-<br> collapsing (more than 2 → 2) and the new leading/trailing whitespace+<br> trim via isWhitespaceNode. Routing Jsoup.parse(anyString()) to the real method (Line 214) is necessary because Jsoup is statically mocked in setup() and cleanDescriptionHtml invokes Jsoup.parse internally.

One small nit: consider asserting on metadata.getDescription() directly (without the \n strip) so any unintended newline reintroduction is caught — body().html().trim() already produces a deterministic, newline-free output for this fixture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java`
around lines 197 - 220, In AmazonBookParserTest.test
fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks),
remove the manual newline strip on metadata.getDescription() and assert the
description value directly; specifically replace the current assertEquals that
compares actual (built by metadata.getDescription().replace("\n", "")) with a
direct assertEquals(expected, metadata.getDescription()) so the test will catch
any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java`:
- Around line 886-902: Replace the current body of isWhitespaceNode with a
single boolean expression using Java 25 pattern-matching instanceof: return
(node instanceof TextNode t && t.isBlank()) || (node instanceof Element e &&
"br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank
and Element.tagName() == "br") while removing explicit casts and nested ifs.

In
`@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java`:
- Around line 197-220: In AmazonBookParserTest.test
fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks),
remove the manual newline strip on metadata.getDescription() and assert the
description value directly; specifically replace the current assertEquals that
compares actual (built by metadata.getDescription().replace("\n", "")) with a
direct assertEquals(expected, metadata.getDescription()) so the test will catch
any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4e830b2c-aa14-4e28-8400-a70659b1b0ac

📥 Commits

Reviewing files that changed from the base of the PR and between 1c447e6 and 73284ee.

📒 Files selected for processing (2)
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
  • backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Packaging Smoke Test
🧰 Additional context used
📓 Path-based instructions (2)
backend/src/**/*.java

📄 CodeRabbit inference engine (AGENTS.md)

backend/src/**/*.java: Use 4-space indentation and match surrounding Java style in backend code
Prefer constructor injection via Lombok patterns already used in the codebase. Do not introduce @Autowired field injection in backend code
Use MapStruct for entity/DTO mapping in backend code
Keep JPA entities on the *Entity suffix in backend code

Files:

  • backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
backend/src/test/**/*.java

📄 CodeRabbit inference engine (AGENTS.md)

Prefer focused unit tests; use @SpringBootTest only when the Spring context is required in backend code

Files:

  • backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
🧠 Learnings (2)
📚 Learning: 2026-04-10T08:15:37.436Z
Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 449
File: booklore-api/src/main/java/org/booklore/service/book/BookDownloadService.java:139-145
Timestamp: 2026-04-10T08:15:37.436Z
Learning: When using Spring `ContentDisposition.builder(...).filename(name, StandardCharsets.UTF_8).build()` (i.e., explicitly providing UTF-8), the resulting header value should include both the quoted `filename="=?UTF-8?..."` and the RFC 5987 `filename*=` parameters. In this case, any extra ASCII fallback computation (e.g., deriving an ASCII `fallbackFilename` via `NON_ASCII_PATTERN` and calling `.filename(fallbackFilename)`) is likely redundant—prefer calling only `.filename(fallbackName?, StandardCharsets.UTF_8)` as appropriate and let Spring handle the UTF-8 header parameters. Verify by comparing the emitted header for `filename` and `filename*` before deciding to keep an ASCII fallback.

Applied to files:

  • backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
📚 Learning: 2026-04-14T12:43:08.698Z
Learnt from: balazs-szucs
Repo: grimmory-tools/grimmory PR: 502
File: booklore-api/src/main/java/org/booklore/service/reader/ChapterCacheService.java:0-0
Timestamp: 2026-04-14T12:43:08.698Z
Learning: For this codebase (booklore-api), target Java 25 with `--enable-preview`, so `_` is intentionally used as an unnamed/ignored variable (e.g., lambda parameter or pattern variable) per Java’s preview feature JEP 456. Do not flag `_` in those contexts as an invalid/reserved identifier; only flag it if it’s used in a non-supported position (e.g., where an unnamed variable is not applicable for the Java preview rules).

Applied to files:

  • backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
  • backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
🔇 Additional comments (2)
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (2)

933-948: LGTM — iteration is safe over removed nodes.

The enhanced-for cleanly replaces the index-based traversal. Capturing temp before advancing next = next.nextElementSibling() ensures the walk isn't disturbed by temp.remove(). Outer iterations that later land on already-detached <br> elements are safe: a detached node returns null from nextElementSibling(), so the inner loop is a no-op.


950-960: LGTM — DOM-based trim is a clean replacement for the regex.

Removing leading/trailing whitespace text nodes and <br> elements via firstChild()/lastChild() walk is equivalent to (and safer than) the previous regex-based stripping, and correctly addresses the CodeQL inefficient-regex finding for HTML content. Returning document.body().html().trim() gives a stable, normalized output that matches the new test expectation.

@imnotjames imnotjames changed the title fix(metadata): avoid regex for AmazonBookParser refactor(metadata): avoid regex for AmazonBookParser Apr 27, 2026
@coderabbitai coderabbitai Bot requested a review from zachyale April 27, 2026 02:30
@imnotjames imnotjames merged commit bf7c2d8 into grimmory-tools:develop Apr 27, 2026
16 checks passed
paulbovbel pushed a commit to paulbovbel/grimmory that referenced this pull request Apr 27, 2026
dsmouse pushed a commit to dsmouse/grimmory that referenced this pull request May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CodeQL Reports inefficient Regular Expression in AmazonBookParser

2 participants