refactor(metadata): avoid regex for AmazonBookParser#926
Conversation
this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
📝 WalkthroughWalkthroughReplaces regex-based HTML cleaning in AmazonBookParser with DOM inspection via Jsoup: adds an Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
This is related to Code scanning alert 126 but is frankly the best practice when dealing with HTML. HTML is not a regular language, so regular expressions for parsing it becomes difficult if not outright impossible. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (1)
886-902: Optional: simplify with pattern matching.Given Java 25 with
--enable-previewis the build target for this codebase, this helper could collapse to a single boolean expression using pattern matching forinstanceof, eliminating the explicit casts.♻️ Proposed simplification
- private boolean isWhitespaceNode(Node node) { - if (node instanceof TextNode) { - if (((TextNode) node).isBlank()) { - return true; - } - } - - if (node instanceof Element) { - Element element = (Element) node; - - if ("br".equals(element.tagName())) { - return true; - } - } - - return false; - } + private boolean isWhitespaceNode(Node node) { + if (node instanceof TextNode textNode) { + return textNode.isBlank(); + } + if (node instanceof Element element) { + return "br".equals(element.tagName()); + } + return false; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java` around lines 886 - 902, Replace the current body of isWhitespaceNode with a single boolean expression using Java 25 pattern-matching instanceof: return (node instanceof TextNode t && t.isBlank()) || (node instanceof Element e && "br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank and Element.tagName() == "br") while removing explicit casts and nested ifs.backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java (1)
197-220: LGTM — test exercises the new DOM-based cleanup end-to-end.The HTML input intentionally mixes
<br />,<br >, and<br>with surrounding whitespace and embedded<div>blocks, so this verifies both the consecutive-<br>collapsing (more than 2 → 2) and the new leading/trailing whitespace+<br>trim viaisWhitespaceNode. RoutingJsoup.parse(anyString())to the real method (Line 214) is necessary becauseJsoupis statically mocked insetup()andcleanDescriptionHtmlinvokesJsoup.parseinternally.One small nit: consider asserting on
metadata.getDescription()directly (without the\nstrip) so any unintended newline reintroduction is caught —body().html().trim()already produces a deterministic, newline-free output for this fixture.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java` around lines 197 - 220, In AmazonBookParserTest.test fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks), remove the manual newline strip on metadata.getDescription() and assert the description value directly; specifically replace the current assertEquals that compares actual (built by metadata.getDescription().replace("\n", "")) with a direct assertEquals(expected, metadata.getDescription()) so the test will catch any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java`:
- Around line 886-902: Replace the current body of isWhitespaceNode with a
single boolean expression using Java 25 pattern-matching instanceof: return
(node instanceof TextNode t && t.isBlank()) || (node instanceof Element e &&
"br".equals(e.tagName())); this preserves the existing checks (TextNode.isBlank
and Element.tagName() == "br") while removing explicit casts and nested ifs.
In
`@backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java`:
- Around line 197-220: In AmazonBookParserTest.test
fetchTopMetadata_cleansLineBreaks (method fetchTopMetadata_cleansLineBreaks),
remove the manual newline strip on metadata.getDescription() and assert the
description value directly; specifically replace the current assertEquals that
compares actual (built by metadata.getDescription().replace("\n", "")) with a
direct assertEquals(expected, metadata.getDescription()) so the test will catch
any unintended newline reintroduction by cleanDescriptionHtml/Jsoup.parse.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 4e830b2c-aa14-4e28-8400-a70659b1b0ac
📒 Files selected for processing (2)
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.javabackend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Packaging Smoke Test
🧰 Additional context used
📓 Path-based instructions (2)
backend/src/**/*.java
📄 CodeRabbit inference engine (AGENTS.md)
backend/src/**/*.java: Use 4-space indentation and match surrounding Java style in backend code
Prefer constructor injection via Lombok patterns already used in the codebase. Do not introduce@Autowiredfield injection in backend code
Use MapStruct for entity/DTO mapping in backend code
Keep JPA entities on the *Entity suffix in backend code
Files:
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.javabackend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
backend/src/test/**/*.java
📄 CodeRabbit inference engine (AGENTS.md)
Prefer focused unit tests; use
@SpringBootTestonly when the Spring context is required in backend code
Files:
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.java
🧠 Learnings (2)
📚 Learning: 2026-04-10T08:15:37.436Z
Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 449
File: booklore-api/src/main/java/org/booklore/service/book/BookDownloadService.java:139-145
Timestamp: 2026-04-10T08:15:37.436Z
Learning: When using Spring `ContentDisposition.builder(...).filename(name, StandardCharsets.UTF_8).build()` (i.e., explicitly providing UTF-8), the resulting header value should include both the quoted `filename="=?UTF-8?..."` and the RFC 5987 `filename*=` parameters. In this case, any extra ASCII fallback computation (e.g., deriving an ASCII `fallbackFilename` via `NON_ASCII_PATTERN` and calling `.filename(fallbackFilename)`) is likely redundant—prefer calling only `.filename(fallbackName?, StandardCharsets.UTF_8)` as appropriate and let Spring handle the UTF-8 header parameters. Verify by comparing the emitted header for `filename` and `filename*` before deciding to keep an ASCII fallback.
Applied to files:
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.javabackend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
📚 Learning: 2026-04-14T12:43:08.698Z
Learnt from: balazs-szucs
Repo: grimmory-tools/grimmory PR: 502
File: booklore-api/src/main/java/org/booklore/service/reader/ChapterCacheService.java:0-0
Timestamp: 2026-04-14T12:43:08.698Z
Learning: For this codebase (booklore-api), target Java 25 with `--enable-preview`, so `_` is intentionally used as an unnamed/ignored variable (e.g., lambda parameter or pattern variable) per Java’s preview feature JEP 456. Do not flag `_` in those contexts as an invalid/reserved identifier; only flag it if it’s used in a non-supported position (e.g., where an unnamed variable is not applicable for the Java preview rules).
Applied to files:
backend/src/test/java/org/booklore/service/metadata/parser/AmazonBookParserTest.javabackend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java
🔇 Additional comments (2)
backend/src/main/java/org/booklore/service/metadata/parser/AmazonBookParser.java (2)
933-948: LGTM — iteration is safe over removed nodes.The enhanced-for cleanly replaces the index-based traversal. Capturing
tempbefore advancingnext = next.nextElementSibling()ensures the walk isn't disturbed bytemp.remove(). Outer iterations that later land on already-detached<br>elements are safe: a detached node returnsnullfromnextElementSibling(), so the inner loop is a no-op.
950-960: LGTM — DOM-based trim is a clean replacement for the regex.Removing leading/trailing whitespace text nodes and
<br>elements viafirstChild()/lastChild()walk is equivalent to (and safer than) the previous regex-based stripping, and correctly addresses the CodeQL inefficient-regex finding for HTML content. Returningdocument.body().html().trim()gives a stable, normalized output that matches the new test expectation.
Description
this resolves the CodeQL warning that an inefficient regular expression is in place for AmazonBookParser and also avoids trying to parse HTML with a regular expression. This defines whitespace as only blank text nodes &
brelements. In the future we could include other values or improve these in other ways, but this is close enough to match the existing behavior.Linked Issue: Fixes #924
Changes
Summary by CodeRabbit