
Errors with threaded replies#326

Merged
tonyalaribe merged 139 commits into master from errors-chp
Feb 28, 2026

Conversation

@dawkaka
Contributor

@dawkaka dawkaka commented Feb 24, 2026

Summary

Two major features: Error Tracking with Sentry-style fingerprinting and spike detection, and Log Pattern Detection using the DRAIN algorithm with baseline volume tracking. Both integrate with a new threaded notification system for Slack, Discord, Email, and PagerDuty.

Error Tracking System

  • Fingerprinting: Normalizes stack traces per runtime (Go, Node.js, Python, Java, PHP, .NET) into canonical module|function format; normalizes messages by replacing UUIDs, IPs, emails, timestamps with placeholders; uses XXHash for deterministic fingerprints
  • State machine: New → Escalating → Ongoing → Resolved (with Regressed for reactivation)
  • Baseline learning: Median + MAD (Median Absolute Deviation) — robust against outliers, establishes after 24h of data
  • Spike detection: z-score threshold >3σ AND ≥5 absolute increase above mean (avoids noise on low-volume errors)
  • Subscriptions: Per-error subscribe with configurable notifyEveryMinutes throttle (default 30min)
  • Assignment: Errors can be assigned to team members with notification on assignment
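The fingerprinting pipeline described above (normalize the message, join the components, hash) can be sketched language-agnostically. This Python sketch is illustrative only: the real implementation is Haskell and uses XXHash, so sha256, these regexes, and the placeholder tokens are stand-ins.

```python
import hashlib
import re

# Placeholder substitutions approximating the message normalization the PR
# describes; the real patterns and placeholder tokens live in the Haskell code.
NORMALIZERS = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "{uuid}"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "{ip}"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "{email}"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*"), "{timestamp}"),
]

def normalize_message(msg: str) -> str:
    for pattern, placeholder in NORMALIZERS:
        msg = pattern.sub(placeholder, msg)
    return msg

def fingerprint(project_id: str, service: str, error_type: str, message: str) -> str:
    # The real implementation hashes normalized stack frames with XXHash;
    # sha256 stands in here purely for illustration.
    combined = "|".join(p for p in (project_id, service, error_type, normalize_message(message)) if p)
    return hashlib.sha256(combined.encode()).hexdigest()[:16]
```

Two messages that differ only in a UUID then map to the same fingerprint, which is what makes grouping deterministic.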

Log Pattern Detection

  • DRAIN algorithm: 3-level tree (root → token count → first token → log groups) with similarity threshold 0.5 and LRU eviction at 50 groups per leaf
  • Baseline tracking: LearningEstablished state; hourly volume mean + MAD stored in log_pattern_hourly_stats
  • Spike detection: Rate-change alerts using baseline comparison (>3σ + absolute threshold)
  • Source fields: Patterns extracted from body, summary, url_path (with url_path noise suppression)
  • State machine: New → Acknowledged → Ignored; auto-acknowledges stale patterns after 30 days
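As a rough illustration of the clustering step, here is a minimal single-leaf sketch in Python. The names, the position-wise similarity, and the wildcard token are assumptions; the real Haskell implementation adds the 3-level routing tree and the 50-group LRU eviction described above.

```python
def token_similarity(tokens_a, tokens_b):
    # DRAIN compares logs position by position; similarity is the
    # fraction of positions whose tokens are identical.
    if len(tokens_a) != len(tokens_b):
        return 0.0
    same = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return same / len(tokens_a)

def merge_template(template, tokens):
    # Positions that disagree become the wildcard "<*>".
    return [t if t == u else "<*>" for t, u in zip(template, tokens)]

class DrainLeaf:
    """One leaf group list; real DRAIN first routes through token-count
    and first-token tree levels before reaching a leaf like this."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.templates = []

    def add(self, line):
        tokens = line.split()
        for i, tpl in enumerate(self.templates):
            if token_similarity(tpl, tokens) >= self.threshold:
                self.templates[i] = merge_template(tpl, tokens)
                return i
        self.templates.append(tokens)
        return len(self.templates) - 1
```

With the 0.5 threshold from the description, two log lines differing only in a host name and a duration fall into one template with wildcards at the variable positions.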

Notification System

  • Slack threading: Sends first alert, stores thread_ts, subsequent updates reply in same thread
  • Discord threading: Stores message_id for reply-to references
  • Email templates: New templates for log pattern detection, log pattern rate changes, runtime errors, escalating errors, regressed errors, and error spikes
  • Throttling: Respects per-subscription frequency to prevent alert fatigue
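The threading flow (send the first alert, persist the thread id, reply into the thread afterwards) can be sketched as follows. `post_slack_message` is a hypothetical callable standing in for the real Slack API call, not an actual client method.

```python
def notify_with_thread(error, post_slack_message):
    """First alert starts a thread; later alerts reply into it.
    `post_slack_message` is a hypothetical stand-in that returns the
    posted message's thread timestamp."""
    if error.get("slack_thread_ts") is None:
        ts = post_slack_message(text=f"New error: {error['title']}", thread_ts=None)
        # In the real system this is persisted to the errors row
        # (the slack_thread_ts column from migration 0035).
        error["slack_thread_ts"] = ts
    else:
        post_slack_message(
            text=f"Update: {error['title']} occurred again",
            thread_ts=error["slack_thread_ts"],
        )
    return error["slack_thread_ts"]
```

The first call posts with no thread reference; every subsequent call for the same error replies with the stored timestamp, keeping one channel message per error.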

Background Jobs

  • ErrorBaselineCalculation: Calculate baseline stats from error_events hourly buckets
  • ErrorSpikeDetection: Detect spikes, create issues, send alerts
  • NewErrorDetected: Create initial issue for new error fingerprint
  • ErrorAssigned: Notify on error assignment
  • FiveMinuteLogPatternExtraction: Run DRAIN extraction from recent logs
  • LogPatternPeriodicProcessing: Create baseline issues for new patterns
  • LogPatternHourlyProcessing: Update hourly stats & detect rate spikes
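As a sketch of what ErrorBaselineCalculation and ErrorSpikeDetection compute, here is the Median + MAD baseline and the combined z-score / absolute-increase rule in Python. The 1.4826 MAD-to-sigma scaling is an assumption for illustration; the Haskell code may scale differently.

```python
import statistics

def baseline(hourly_counts):
    # Median + MAD: robust to outlier hours, as the PR description notes.
    med = statistics.median(hourly_counts)
    mad = statistics.median(abs(c - med) for c in hourly_counts)
    return med, mad

def is_spike(current, med, mad, z_threshold=3.0, min_increase=5):
    # 1.4826 scales MAD to a stddev-equivalent under normality (an
    # assumption here, not taken from the Haskell implementation).
    sigma = 1.4826 * mad if mad > 0 else 1.0
    z = (current - med) / sigma
    # Both conditions must hold, which avoids noisy alerts on
    # low-volume errors where a +3 jump can exceed 3 sigma.
    return z > z_threshold and (current - med) >= min_increase
```

The absolute-increase guard is what makes the rule quiet on low-volume errors: a count of 13 against a median of 11 clears neither test, while 30 clears both.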

Database Migrations

  • 0033_error.sql: apis.errors and apis.error_events tables with fingerprint indexes, triggers, and error_escalating/error_regressed issue types
  • 0034_error_notification_columns.sql: subscribed, notify_every_minutes, last_notified_at on errors
  • 0035_error_thread_ids.sql: slack_thread_ts and discord_message_id on errors for threaded notifications

Note: Log pattern migrations (log_patterns, log_pattern_hourly_stats, issue column extensions) were already merged via log-patterns-chp PR as 0032_log_patterns.sql.

@claude

claude bot commented Feb 28, 2026

Code Review

Great PR overall — the error tracking system is well-designed and the threaded notification changes are clean. The refactoring wins (DRY buildDrainTree, WrappedEnumSC JSON instances, BaselineState extraction) are all solid improvements. A few things worth addressing:


🔴 ErrorPattern mixed snake_case fields — entity column mismatch risk

-- ErrorPatterns.hs
, occurrences_1m :: Int -- snake_case fields match DB column names exactly
, occurrences_5m :: Int

The record is derived with FieldModifiers '[CamelToSnake]. When quietSnake processes a field that already contains underscores (e.g. occurrences_1m), the output depends on the casing library — it may pass through unchanged, but it's inconsistent with the rest of the codebase which uses camelCase and lets the modifier do the work. Safer to rename to occurrences1m / occurrences5m etc. and let CamelToSnake produce occurrences_1m / occurrences_5m — that's the established pattern everywhere else.


🟡 goStdlibPrefixes — use Set for O(log n) lookups

-- ErrorFingerprint.hs
goStdlibPrefixes :: [Text]
goStdlibPrefixes = ["runtime.", "syscall.", "net.", ...]  -- 55 entries

nonePrefix :: [Text] -> Text -> Bool
nonePrefix needles txt = all (not . (`T.isPrefixOf` txt)) needles

Every Go stack frame iterates all 55 prefixes. Data.Set (already in deps) drops this to O(log n):

import Data.Set qualified as S

goStdlibPrefixes :: S.Set Text
goStdlibPrefixes = S.fromList ["runtime.", ...]

isGoStdlib :: Text -> Bool
isGoStdlib line = case S.lookupLE line goStdlibPrefixes of
  Just prefix -> prefix `T.isPrefixOf` line
  Nothing     -> False

(Note: `any` over a Set would still be a linear scan; the O(log n) win comes from `S.lookupLE`, which finds the only candidate prefix in one ordered lookup.)

🟡 parseStackFrame — runtime as Text is fragile

parseStackFrame runtime line
  | runtime == "go"                    = parseGoFrame trimmed
  | runtime `elem` ["nodejs", "webjs"] = parseJsFrame trimmed

A typo silently falls through to parseGenericFrame. The existing RequestDumps.SDKTypes enum (already imported in ErrorPatterns.hs) or a local ADT would give compile-time safety.


🟡 computeErrorFingerprint — service-less errors collide across services

fingerprintComponents =
  if | not (T.null normalizedStack) ->
       [projectIdText, fromMaybe "" mService, normalizedType, normalizedStack]
  ...
combined = T.intercalate "|" $ filter (not . T.null) fingerprintComponents

When mService = Nothing, the empty string is filtered out. Two errors from different services where both lack a service name will share a fingerprint. May be intentional but deserves a comment, since it affects how multi-service errors are grouped.
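A tiny demonstration of the collapse being described, with `combine` as a hypothetical Python stand-in for the Haskell expression:

```python
def combine(components):
    # Mirrors `T.intercalate "|" $ filter (not . T.null) ...`:
    # empty components vanish entirely from the joined key.
    return "|".join(c for c in components if c)

# Two errors from *different* deployed services, both reporting no
# service name, produce identical keys because the distinguishing
# component is simply absent:
key_a = combine(["proj-1", "", "TypeError", "frame1|frame2"])
key_b = combine(["proj-1", "", "TypeError", "frame1|frame2"])

# Joining without the filter would at least keep the slot visible:
positional = "|".join(["proj-1", "", "TypeError", "frame1|frame2"])
```

Whether the shared fingerprint is desirable depends on how multi-service grouping should behave, which is why the comment is worth adding either way.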


🟡 bulkCalculateAndUpdateBaselines — 6-argument parameter order is hard to audit

(pid, now, now, pid, now, pid)

Three pids and three nows in the params tuple make it easy to get the order wrong when the SQL CTE changes. A comment mapping each ? position to its value would help maintainability.


🟢 Good things in this PR

  • encodeEnumSC/decodeEnumSC extracted — eliminates repetition across all WrappedEnumSC instances
  • WrappedEnumSC gains ToJSON/FromJSON — removes the separate CustomJSON '[ConstructorTagModifier ...] deriving for NotificationChannel, QueryLibType, MonitorStatus, SeverityLevel, SpanStatus, SpanKind, LogPatternState etc. Nice sweep.
  • buildDrainTree — clean extraction removing two near-identical V.foldl' patterns
  • batchUpsertErrorPatterns — correct single-round-trip unnest approach; the HM.insertWith grouping that avoids the ON CONFLICT DO UPDATE "cannot affect row a second time" error is the right call
  • Unsafe.fromJust err.hash → hash :: Text — removes a footgun
  • Replacing _ <- ...; pass with void $ ... throughout is a nice cleanup

🔵 Minor

  • Models.Apis.ErrorPatterns still imports RequestDumps only for SDKTypes. Now that ATError lives in ErrorPatterns, consider moving SDKTypes to a more foundational module (or noting the coupling).
  • runtimeErrorVariantEmail takes three always-paired string args (heading, subjectPrefix, intro) — a small record could enforce the pairing, though it's readable as-is.
  • 0034_error_notification_columns.sql and 0035_error_thread_ids.sql are mentioned in the PR description but their columns appear to be included inline in 0033_error.sql. The .cabal only registers 0033. If they're real separate files coming later, they'll need registering there too.

@claude

claude bot commented Feb 28, 2026

Revision history for monoscope

0.1.0.0 -- YYYY-mm-dd

  • First version. Released on an unsuspecting world.

@claude

claude bot commented Feb 28, 2026

Code Review — PR #326 (Error Tracking + Threaded Notifications)

Overall this is a solid, well-structured PR. The test suite is comprehensive (20 scenarios covering the full state machine, concurrency, cross-project isolation, and thread-ID persistence), the SQL uses efficient batch patterns throughout, and the WrappedEnumSC consolidation reduces boilerplate across multiple types. A few things worth addressing:


🐛 Potential Bug — Duplicate Notifications

BackgroundJobs.hs: notifyErrorSubscriptions

The JOIN:

JOIN apis.issues i ON i.project_id = e.project_id AND i.target_hash = e.hash AND i.issue_type = ?

can return multiple rows per error pattern when more than one RuntimeException issue exists for the same target_hash (e.g. original issue + a later regressed issue). This means forConcurrently fires duplicate notifications and updateErrorPatternThreadIdsAndNotifiedAt races with itself on the same row. Fix with DISTINCT ON (e.id) / ORDER BY e.id, i.created_at DESC:

SELECT DISTINCT ON (e.id)
  e.id, e.error_data, e.state, i.id, i.title,
  e.slack_thread_ts, e.discord_message_id
FROM apis.error_patterns e
JOIN apis.issues i ON i.project_id = e.project_id AND i.target_hash = e.hash
WHERE ...
ORDER BY e.id, i.created_at DESC

⚡ Performance — Sequential Spike Detection

BackgroundJobs.hs: detectErrorSpikes

forM_ is sequential. Each iteration does DB writes and potentially sends notifications. For projects with many error patterns this is a bottleneck — same pattern as notifyErrorSubscriptions which correctly uses forConcurrently. Switch to:

forConcurrently_ errRates \errRate -> ...

🔍 Design — 7 Positional Text Args

Pkg/ErrorFingerprint.hs line 565

computeErrorFingerprint :: Text -> Maybe Text -> Maybe Text -> Text -> Text -> Text -> Text -> Text
computeErrorFingerprint projectIdText mService spanName runtime exceptionType message stackTrace = 

Seven positional Text / Maybe Text arguments is a silent footgun — callers can swap runtime/exceptionType/message without the compiler noticing. A small input record (or even a type alias tuple) would make call sites self-documenting and safer.


🧹 Style — Named Thread-ID Pair

The (Maybe Text, Maybe Text) tuple representing (slackThreadTs, discordMessageId) is threaded through sendAlertToChannels, createAndNotifyErrorIssue, processNewError, and notifyErrorSubscriptions. A type alias:

type ThreadIds = (Maybe Text, Maybe Text)

or a small record would make all those signatures easier to read without adding code.


🧹 Succinctness — cleanPythonFunction

Pkg/ErrorFingerprint.hs line 317

cleanPythonFunction = flip (foldr (uncurry T.replace)) ([("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")] :: [(Text, Text)])

The double-flip/foldr encoding is harder to read than the equivalent eta-expanded form which is the same length:

cleanPythonFunction func = foldr (\(f, t) -> T.replace f t) func [("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")]

✅ Good Stuff

  • Batch SQL throughout: unnest arrays in batchUpsertErrorPatterns and upsertErrorPatternHourlyStats avoid N+1 round-trips entirely.
  • bulkCalculateAndUpdateBaselines: single CTE with PERCENTILE_CONT + MAD is exactly right — no per-row Haskell computation needed.
  • {-# NOINLINE goStdlibPrefixes #-}: correct to prevent repeated allocation of the list CAF.
  • truncateHour in Haskell: avoids date_trunc per row in SQL (already applied in upsertErrorPatternHourlyStats).
  • ~(subj, html) lazy irrefutable in notifyErrorSubscriptions: avoids building email content when only Slack/Discord is configured.
  • WrappedEnumSC JSON instances: adding ToJSON/FromJSON to the newtype derivation eliminates the separate DAE.CustomJSON derivation across ErrorState, BaselineState, LogPatternState, MonitorStatus, etc. — clean reduction.
  • Test coverage: the 20-scenario ErrorPatternsSpec is thorough. Especially good: concurrency test (5e), cross-project isolation (test 9), decay chain (7a), and thread-ID reuse (test 6).






@claude

claude bot commented Feb 28, 2026

PR Review part 1 of 2.

Bugs:
  • createOtelTraceWithExceptionAtTime has broken UUID encoding: UUID.toText returns 36 chars with dashes, so T.replicate (32-36) returns an empty string and B16.decode fails silently — traceIdBS/spanIdBS are always empty bytes. Fix: strip dashes with T.filter before padding.
  • insertErrorQueryAndParams in Anomalies.hs is dead code: processProjectErrors now delegates to batchUpsertErrorPatterns and nothing calls insertErrorQueryAndParams. Remove it.

Performance:
  • goStdlibPrefixes is a 50-entry [Text] list scanned per Go stack frame. Use HashSet Text and check only the package segment via fst (splitDotted funcPart) for O(1) lookup.

@claude

claude bot commented Feb 28, 2026

PR Review part 2 of 2.

Exports:
  • sendSlackAlertWith/sendDiscordAlertWith are exported but documented as Internal — either remove them from the export list or drop the doc comment.

Style:
  • errorSubscriptionAction has 3 consecutive single-binding lets (isActive, notifyEvery, actionUrl) — combine them into one let block.
  • createAndNotifyErrorIssue uses Effectful.Reader.Static.ask when ask is imported unqualified — use ask @Config.AuthContext.
  • anomalyDetailCore uses the ampersand-LambdaCase pattern only once — case issue.issueType of matches the rest of the file.
  • Multiple new sites use whenJust ... Relude.when — for_ projectM is cleaner and avoids the qualified import.

Good:
  • void $ replaces _ <- ...; pass throughout
  • WrappedEnumSC gains ToJSON/FromJSON directly, eliminating ~6 duplicate via CustomJSON derivations
  • colIdxMap via zip cols [0..] removes sync-drift bugs
  • the Permissions ordering fix with Ord makes >= PEdit semantically correct
  • sortBy before groupBy in processIssuesEnhancement is the right defensive fix
  • buildDrainTree removes the duplicated fold pattern
  • batchUpsertErrorPatterns hash-grouping correctly avoids the ON CONFLICT duplicate-row error


Add DISTINCT ON to notifyErrorSubscriptions to prevent duplicate
notifications when multiple issues exist for the same error pattern.
Switch detectErrorSpikes from sequential forM_ to forConcurrently.
Remove redundant CustomJSON derivations on IssueType/RateChangeDirection.
@claude

claude bot commented Feb 28, 2026

PR Review: Errors with threaded replies

Good overall structure. The fingerprinting algorithm is well-documented with doctests, the single-CTE baseline calculation is efficient, and the WrappedEnumSC JSON/DAE.CustomJSON consolidation is a clean DRY win. A few things worth addressing:


Potential bugs

Race condition in detectErrorSpikes (BackgroundJobs.hs)
The check-then-act for escalation is not atomic. Two concurrent ErrorSpikeDetection jobs for the same project can both see state != ESEscalating and create duplicate issues. Fix with a conditional DB update, proceeding only if rowsAffected > 0:

UPDATE apis.error_patterns SET state = 'escalating' WHERE id = ? AND state != 'escalating'

Permissions constructor order changed silently
Adding Ord derives ordering from constructor position (PView < PEdit < PAdmin). The string-based WrappedEnumSC serialisation is unaffected, but if any code compares permissions for authorisation via Ord, this is a silent semantic change. Worth a note.


Succinctness / package usage

Unbounded forConcurrently in detectErrorSpikes and notifyErrorSubscriptions
A project with hundreds of active errors will launch hundreds of concurrent alert-sending coroutines. Consider pooledForConcurrentlyN (from unliftio) or chunking to bound concurrency.

cleanPythonFunction opaque pointfree (Pkg/ErrorFingerprint.hs)
The flip + uncurry hides intent without saving lines. Prefer an explicit lambda: foldr (\(old, new) -> T.replace old new) f [...].

Dead branch in extractGenericLineNumber
The second case branch (_ -> Nothing) is unreachable since T.splitOn always returns at least 1 element. Drop it.

occurrences_1m / occurrences_5m snake_case in a camelCase record
The ErrorPattern record uses snake_case for number-suffixed fields while all other fields are camelCase. Functionally correct (CamelToSnake is a no-op on already-snake fields) but inconsistent. A brief comment explaining why (CamelToSnake cannot produce _1m from camelCase) would clarify intent.


Minor

  • RequestDumps.ATError -- if the old type still exists alongside the new ErrorPatterns.ATError, worth removing to avoid confusion. Telemetry.hs was migrated but worth confirming nothing else imports the old definition.
  • batchUpsertErrorPatterns insertWith semantics -- HM.insertWith f new old keeps the new error and sums counts. Correct behaviour (dedup-by-hash, latest data wins), but a one-line comment would help.
  • PR description mentions separate 0034/0035 migrations that do not exist as separate files -- they are correctly consolidated into 0033. The description is just inaccurate.

Good changes worth calling out

  • Removing the _ <- ...; pass antipattern with void throughout
  • WrappedEnumSC now derives AE.ToJSON/AE.FromJSON, eliminating all the DAE.CustomJSON/DAE.ConstructorTagModifier/DAE.StripPrefix/DAE.CamelToSnake boilerplate on SeverityLevel, SpanStatus, SpanKind
  • buildDrainTree is a clean higher-order generalisation that reduces duplication across processBatch and fetchLogPatterns
  • {-# NOINLINE goStdlibPrefixes #-} + CAF is correct for the O(log n) Set.lookupLE prefix check

looksLikeJson used isInfixOf for "{" which matched replacement tokens
like {integer}, {YYYY-MM-DDThh:mm:ss.sTZD}, causing tokenizeJsonLike
to split them into separate tokens and break drain clustering.
@claude

claude bot commented Feb 28, 2026

Code Review

Great feature addition — the error tracking state machine, log pattern detection, and notification threading all fit well into the existing architecture. Several cleanups here are excellent. Here are my observations:


Highlights

  • WrappedEnumSC unification across 7+ types (MonitorStatus, SeverityLevel, LogPatternState, NotificationChannel, QueryLibType, etc.) — replacing 2-line derive chains with a single derive is exactly the kind of GHC extension leverage we want.
  • void $ cleanup replacing _ <- ...; pass throughout is a welcome hygiene pass.
  • Drain.buildDrainTree abstraction eliminates the local processNewLog helper in two call sites cleanly.
  • pg-entity _select/_selectWhere in place of manual column lists in Anomalies.hs and Issues.hs is the right direction.
  • unnest batch upserts in batchUpsertErrorPatterns and upsertErrorPatternHourlyStats — single round-trip instead of N+1, good.
  • V.groupBy sort-before-group fix in processIssuesEnhancement (and same fix for errors in processOneMinuteErrors) was a real silent correctness bug. Good catch.
  • Removing QueryMonitorsTriggered and inlining monitor notification logic into evaluateQueryMonitor — simpler, removes dead code.

Issues / suggestions

1. Non-idiomatic snake_case field names in ErrorPattern

  , occurrences_1m :: Int -- snake_case fields match DB column names exactly
  , occurrences_5m :: Int
  , occurrences_1h :: Int
  , occurrences_24h :: Int

Haskell record fields should be camelCase. With FieldModifiers '[CamelToSnake], a field occurrences1m would map to column occurrences1m (no uppercase to convert), not occurrences_1m. For fields where the column name can't be derived by CamelToSnake, use pg-entity's field quasi-quoter with an explicit column name override rather than putting underscores in Haskell identifiers. Consider occurrencesLast1m / occurrencesLast5m etc.

2. Z-score recomputed in createErrorSpikeIssue

detectErrorSpikes already computes the z-score via detectSpikeOrDrop, but createErrorSpikeIssue recomputes it from scratch:

let zScore = if baselineStddev > 0 then (currentRate - baselineMean) / baselineStddev else 0

Pass the already-computed spike.zScore from the call site instead.

3. Redundant drainContent helper in processBatch

  tokenize = tok . drainContent
  drainContent = \case SeedPattern c -> c; NewEvent _ c -> c

Used exactly once — inline it: tokenize = tok . \case SeedPattern c -> c; NewEvent _ c -> c.

4. Destructive migration without data migration (0033_error.sql)

DROP TABLE IF EXISTS apis.errors CASCADE;

This drops apis.errors with no migration of existing rows to apis.error_patterns. The PR description doesn't mention this — please confirm apis.errors has no production data worth preserving.

5. bool 0 1 for Int conversion

addError acc e = HM.insertWith addCounts e.hash (1 :: Int, bool 0 1 (isJust e.userId)) acc

fromEnum (isJust e.userId) is more idiomatic for converting Bool to Int.

6. Lazy irrefutable pattern ~(subj, html) in sendAlertToChannels

~(subj, html) = case alertType of ...

Smart optimization to avoid computing email content when NEmail is not in the channel list. A short comment would help maintainability since irrefutable patterns are uncommon in application code.

7. selectIssueByHash may return the wrong issue in ErrorAssigned job

issueM <- Issues.selectIssueByHash pid err.hash

If multiple issues exist for the same hash (e.g., after a regression creates a second issue), this returns an arbitrary match. Ensure the underlying query orders by created_at DESC so the assignee email links to the most recent issue.

8. addCounts argument order

addCounts (c1, u1) (c2, u2) = (c1 + c2, u1 + u2)

With HM.insertWith f key newVal map the signature is f newValue oldValue. Since summation is symmetric this is harmless, but a comment would clarify intent.


Summary

The design is solid and the code is well-structured. Items worth addressing before merge:

  • [Confirm] Intent of dropping apis.errors without data migration
  • [Style] Snake_case field names in ErrorPattern record
  • [Correctness] selectIssueByHash ordering for the assignee email link
  • [Minor] Z-score duplication, drainContent inline, fromEnum over bool 0 1

The test was passing Nothing for service and omitting the actual
service name used during OTEL ingestion (test-service), causing
the computed fingerprint hash to not match the DB.
- createErrorSpikeIssue now receives the already-computed z-score from
  detectSpikeOrDrop instead of recomputing it from mean/stddev
- Inline single-use drainContent helper into tokenize composition
@tonyalaribe tonyalaribe merged commit 7ce9230 into master Feb 28, 2026
7 checks passed
@tonyalaribe tonyalaribe deleted the errors-chp branch February 28, 2026 20:13
@claude

claude bot commented Feb 28, 2026

Code Review — Error Tracking + Log Pattern + Threaded Notifications

Overall this is a well-structured, well-tested PR. The fingerprinting, baseline/spike logic, and threading support are all coherent. Notes below focus on succinctness, correctness, and performance.


Bugs / Correctness

batchUpsertErrorPatterns — non-obvious which occurrence wins per hash

HM.insertWith f key new old calls f new old, so in the lambda (\(a, n) (_, k) -> (a, n + k)) the kept a is the incoming ATError: the grouping keeps the most-recent occurrence per hash and sums the counts. That is likely the right behaviour (recent_trace_id then carries the latest trace ID), but the argument order is easy to misread in review — a short comment would prevent someone "fixing" it into keeping the first occurrence:

-- grouped in Models/Apis/ErrorPatterns.hs
-- insertWith applies (f new old); newE is the incoming error, so latest wins
HM.insertWith (\(newE, n) (_, k) -> (newE, n + k)) e.hash (e, 1) m

isGoStdlib — undocumented invariant on the Set

The lookupLE trick is correct today because every entry in goStdlibPrefixes ends with . or /. If someone adds a bare entry like "runtime" (without the dot), lookupLE "runtimex.foo" would return "runtime", and isPrefixOf would produce a false positive. Add a comment asserting the invariant:

-- INVARIANT: every entry must end with '.' or '/' so the lookupLE prefix
-- check in isGoStdlib cannot produce false positives against sibling packages.
goStdlibPrefixes :: S.Set Text
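The invariant, and the false positive it prevents, can be demonstrated with a sorted list and bisect, a Python stand-in for Set.lookupLE:

```python
import bisect

def lookup_le(sorted_prefixes, key):
    # Greatest element <= key, like Data.Set.lookupLE.
    i = bisect.bisect_right(sorted_prefixes, key)
    return sorted_prefixes[i - 1] if i > 0 else None

def is_stdlib(prefixes, symbol):
    p = lookup_le(prefixes, symbol)
    return p is not None and symbol.startswith(p)

good = sorted(["net.", "runtime.", "syscall."])
bad = sorted(["net.", "runtime", "syscall."])  # invariant violated: bare "runtime"
```

With the '.' terminator, "runtimex.foo" sorts after "runtime." but does not start with it, so the check correctly rejects it; drop the dot and the sibling package is misclassified as stdlib.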

Succinctness / Style

cleanPythonFunctionflip (foldr (uncurry T.replace)) is opaque

The point-free form with flip foldr is harder to read than needed. A direct fold is shorter and clearer:

cleanPythonFunction f = foldr (\(a, b) -> T.replace a b) f
  [("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")]

Unnecessary lazy pattern in notifyErrorSubscriptions

~(subj, html) = case alertType of ...

In a let binding the pattern match is already lazy and irrefutable, so the ~ adds nothing here: plain let (subj, html) = ... behaves identically and is less surprising to readers.


extractGenericLineNumber — unreachable branch

case parts of
  _ : rest -> listToMaybe $ mapMaybe readText rest
  _         -> Nothing   -- unreachable: T.splitOn always returns >= 1 element

T.splitOn on a non-empty string always returns a non-empty list, so the wildcard is dead. Remove it.


deduplicateByHash — document which occurrence is kept

HM.insert overwrites, so the last vector element per hash wins. The comment says "deduplication" but doesn't state this. Worth noting for callers who care about recency vs. first-seen.


Performance

getErrorPatternsWithCurrentRates — no pagination + unbounded concurrency

getErrorPatternsWithCurrentRates fetches all non-resolved, non-ignored errors for a project in one shot, then detectErrorSpikes fans them all out via forConcurrently. For a project with thousands of errors this results in an unbounded concurrent DB write fan-out. Consider capping concurrency (pooledForConcurrently or chunked batching), or pushing spike detection into a single SQL UPDATE so it's one round-trip.
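A bounded fan-out can be sketched with Python's ThreadPoolExecutor as a stand-in for pooledForConcurrentlyN; handle_one and the worker cap are hypothetical names for this illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def process_spikes_bounded(error_rates, handle_one, max_workers=8):
    """Cap concurrent DB-touching work at `max_workers`, the shape of
    the pooled-concurrency suggestion above. `handle_one` stands in for
    the per-error work (update state, insert issue, notify)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order while never running more than
        # max_workers tasks at once.
        return list(pool.map(handle_one, error_rates))
```

The same shape works for notifyErrorSubscriptions: the pool bounds connection usage regardless of how many error patterns a project has.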


bulkCalculateAndUpdateBaselines — positional parameters are hard to audit

The tuple (pid, now, now, pid, now, pid) has six positional bindings spread across two CTEs and the final UPDATE. A comment counting them, or restructuring with a values CTE (SELECT ? AS pid, ?::timestamptz AS cutoff) to bind them once, would reduce the risk of mis-ordering on future edits.


Minor Notes

  • ErrorPattern mixes occurrences_1m/occurrences_5m/occurrences_1h/occurrences_24h (snake_case) with all other camelCase fields. The inline comment explains it, but worth considering aliasing to camelCase + explicit column mapping long-term.

  • processNewError silently skips ESOngoing/ESEscalating states (alertType = Nothing). That's intentional per the state machine, but a Log.logDebug would help during incident triage.

  • The ErrorAssigned job constructor appears in the BgJobs ADT — confirm its handler is wired up in processBackgroundJob (not visible in the diff summary).

  • SQL in updateErrorPatternSubscription passes subscribed as both a value and a condition (CASE WHEN ? AND NOT subscribed THEN NULL ...). This relies on postgresql-simple binding Bool as boolean; worth a targeted integration test if not already covered.


Overall the test coverage in ErrorPatternsSpec.hs is thorough and the fingerprinting algorithm is well-designed. Main asks: document the insertWith grouping semantics in batchUpsertErrorPatterns, document the isGoStdlib invariant, and address the unbounded concurrency in spike detection.

@claude

claude bot commented Feb 28, 2026

Code Review

This is a substantial, well-structured PR. The test coverage (20 cases in ErrorPatternsSpec.hs) is thorough, the SQL is efficient (CTE baselines, unnest batch upserts), and the state machine design is sound. A few things worth tightening:


1. {-# NOINLINE #-} on a top-level CAF is redundant

goStdlibPrefixes is a top-level constant — GHC already memoises it as a CAF without any pragma. The NOINLINE pragma is a no-op here (it matters on functions, not top-level values). Safe to remove.


2. cleanPythonFunction — opaque point-free

Current:

cleanPythonFunction = flip (foldr (uncurry T.replace)) ([("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")] :: [(Text, Text)])

flip (foldr (uncurry T.replace)) is hard to parse at a glance. A direct fold is clearer and the same length:

cleanPythonFunction func = foldr (\(o, n) -> T.replace o n) func [("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")]

3. truncateHour — exposes UTCTime internals unnecessarily

Utils.hs already imports utcTimeToPOSIXSeconds. The current decomposition reaches into DiffTime internals:

-- current
truncateHour (UTCTime day dt) = UTCTime day (secondsToDiffTime $ floor dt `div` 3600 * 3600)

Using the existing POSIX import (just add posixSecondsToUTCTime) keeps UTCTime opaque and avoids the pattern match:

truncateHour = posixSecondsToUTCTime . fromIntegral . (* 3600) . (`div` 3600) . (floor :: NominalDiffTime -> Int) . utcTimeToPOSIXSeconds

4. detectSpikeOrDropInt64 parameter is immediately fromIntegral-ed

detectSpikeOrDrop :: Double -> Double -> Double -> Double -> Int64 -> Maybe Issues.SpikeResult
-- body: rate = fromIntegral currentCount

Both call sites pass fromIntegral x where x :: Int. Changing the last parameter to Int removes the fromIntegral at every call site and makes the signature more natural.


5. forConcurrently in detectErrorSpikes — unbounded DB concurrency

void $ forConcurrently errorsWithRates \errRate -> do
  ErrorPatterns.updateErrorPatternState ...
  Issues.insertIssue ...

For a project with many error patterns this fires one DB-touching thread per pattern simultaneously. The logsPatternExtraction pipeline already uses Ki-based bounded concurrency. Consider a similar cap here to avoid connection pool exhaustion on busy projects.


6. notifyErrorSubscriptions embeds a complex SQL query in BackgroundJobs.hs

The DISTINCT ON join across error_patterns and issues belongs in Models.Apis.ErrorPatterns alongside the other query functions, not in the job runner. This would also make it testable in isolation.


Minor nits

  • Relude.when (V.length spansWithErrors > 0) should be unless (V.null spansWithErrors) — more idiomatic and signals intent directly
  • Relude.when is used qualified throughout BackgroundJobs.hs even though Relude is imported unqualified — bare when suffices unless there is an actual ambiguity worth documenting
  • parseJsFrame strips line twice (T.strip $ T.drop 3 $ T.strip line); the outer strip is only needed if "at" may be followed by multiple spaces — worth a comment if intentional

Positives

  • Batch SQL patterns (unnest arrays, single-CTE baseline calc) are the right approach throughout
  • isGoStdlib O(log n) prefix check via S.lookupLE is clever and correct
  • batchUpsertErrorPatterns grouping by hash before ON CONFLICT correctly avoids the "cannot affect row a second time" error
  • Test coverage is comprehensive: state machine cycle, idempotency, flat-baseline edge case (MAD floor), cross-project isolation, and thread-ID threading are all covered

@claude

claude bot commented Feb 28, 2026

Code Review - PR 326: Errors with threaded replies

Big feature addition overall. Architecture is sound and the batch-upsert and threading designs are well thought out. Some targeted notes below.


Correctness / Potential Bugs

batchUpsertErrorPatterns - grouping semantics undocumented

insertWith f key new old means (a, n) is the new (incoming) value, (_, k) is the existing one in the lambda \(a, n) (_, k) -> (a, n + k). So this keeps the last-seen error data and sums counts. That is probably the right call (most-recent error wins for message/stack), but the intent is non-obvious from the lambda alone. A brief comment would prevent future surprises.

ErrorState missing Ord

The state machine progresses New -> Escalating -> Ongoing -> Resolved. Without a derived Ord, you cannot write guards like state >= ESEscalating and instead need exhaustive case matches scattered through BackgroundJobs. Since the declaration order already encodes the progression, adding deriving stock (… Ord, Enum, Bounded) costs nothing and makes conditions like state `elem` [ESNew, ESEscalating, ESOngoing, ESRegressed] replaceable with state < ESResolved.


Conciseness / GHC-extension opportunities

normalizeFunction - three-let chain vs. composition

Current (in normalizeStackTrace where clause):

normalizeFunction func =
  let noAddr     = unwords $ filter (not . ("0x" `T.isPrefixOf`)) $ words func
      noGenerics = T.takeWhile (/= '<') noAddr
      noParams   = T.takeWhile (/= '(') noGenerics
   in T.strip noParams

With point-free composition (no intermediate names needed):

normalizeFunction =
  T.strip . T.takeWhile (/= '(') . T.takeWhile (/= '<')
  . unwords . filter (not . T.isPrefixOf "0x") . words

cleanPythonFunction - flip (foldr ...) is opaque

-- current
cleanPythonFunction = flip (foldr (uncurry T.replace)) ([("<lambda>", "lambda"), ...] :: [(Text, Text)])
-- clearer
cleanPythonFunction f =
  foldl' (\t (k, v) -> T.replace k v t) f
    [("<lambda>", "lambda"), ("<listcomp>", "listcomp"), ("<dictcomp>", "dictcomp")]

extractJsModule - operator section replaces the lambda

-- current
asum $ map (\ext -> T.stripSuffix ext baseName) [".js", ".ts", ".mjs", ".cjs"]
-- simpler
asum $ map (`T.stripSuffix` baseName) [".js", ".ts", ".mjs", ".cjs"]

getErrorPatterns - duplicated ORDER BY / LIMIT / OFFSET suffix

The suffix " ORDER BY updated_at DESC LIMIT ? OFFSET ?" is copy-pasted in both branches of the case mstate of. Factor it into a where binding.


Test Coverage

Pkg.ErrorFingerprint - doctest examples are not wired up

The module has rich >>> examples covering Go stdlib detection, Python line-number exclusion, Java generics, etc., but there is no doctest suite in monoscope.cabal. Either add a doctest suite or migrate the examples to an ErrorFingerprintSpec. This is the most critical correctness surface in the PR and currently has zero automated test coverage.

Background jobs lack unit tests

processNewError, detectErrorSpikes, calculateErrorBaselines, and notifyErrorSubscriptions are new and complex but only reachable via the full integration flow. Targeted tests for the spike-detection logic (z-score threshold, BSLearning skip) and state-machine transitions would give much faster feedback on regressions.


Minor

  • updateErrorPatternThreadIds' (Bool flag): the two public wrappers are the right API; the private helper could accept Maybe UTCTime instead of Bool + UTCTime to avoid passing now twice, but this is a nit.
  • BackgroundJobs.hs is growing. No action needed now, but calculateErrorBaselines, detectErrorSpikes, and processNewError could eventually live in a dedicated Background.ErrorJobs module.

What is good

  • Batch unnest upsert in batchUpsertErrorPatterns / upsertErrorPatternHourlyStats correctly avoids N+1 and the ON CONFLICT DO UPDATE duplicate-row pitfall via the HashMap grouping step.
  • isGoStdlib via Set.lookupLE - O(log n) prefix matching instead of a linear scan; well commented.
  • WrappedEnumSC ToJSON/FromJSON consolidation in DeriveUtils eliminates a class of CustomJSON boilerplate across the codebase.
  • BaselineState centralised in DeriveUtils is the right call given it is now shared across LogPatterns and ErrorPatterns.
  • Threaded Slack/Discord design is clean: sendNotificationWithReply returns Maybe Text and thread IDs are stored separately, so retries do not duplicate threads.
