
Log patterns: Log patterns related code #300

Merged: tonyalaribe merged 77 commits into master from log-patterns-chp, Feb 26, 2026.

Conversation


@dawkaka (Contributor) commented Jan 22, 2026

Saving and presenting log patterns in the UI, plus log pattern anomaly detection:

  • Unusual increase in log patterns within a time window
  • New detected log patterns

Closes #

How to test




-- | Get a pattern by ID
getLogPatternById :: DB es => LogPatternId -> Eff es (Maybe LogPattern)
Contributor:

Use the _selectWhere pattern instead of enumerating the fields one by one.
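Roughly what that could look like. `_selectWhere`'s exact signature is a project detail this review doesn't show, and `queryOneWith` is a hypothetical helper name, so this is only a sketch:

```haskell
-- Hypothetical sketch: assumes a `_selectWhere`-style helper that derives
-- the column list from the LogPattern type instead of hand-writing it,
-- and a `queryOneWith` runner. Both names are assumptions.
getLogPatternById :: DB es => LogPatternId -> Eff es (Maybe LogPattern)
getLogPatternById lpId = queryOneWith (_selectWhere @LogPattern ["id"]) (Only lpId)
```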

Comment on lines +250 to 264
Issues.LogPattern ->
"Describe this log pattern issue and its implications.\n"
<> "Title: "
<> issue.title
<> "\n"
<> "Service: "
<> fromMaybe "unknown-service" issue.service
Issues.LogPatternRateChange ->
"Describe this log pattern rate change and its implications.\n"
<> "Title: "
<> issue.title
<> "\n"
<> "Service: "
<> fromMaybe "unknown-service" issue.service

Contributor:

Please use the [text|] quasiquote (or any other), so it's easier to visually see the shape of this message without the Haskell semigroup noise.
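Roughly what the reviewer is asking for, assuming the same [text|] quasiquoter used later in this PR (NeatInterpolation-style, which interpolates bound variables rather than arbitrary expressions):

```haskell
-- Sketch only: bind the pieces first, since [text|] interpolates variables.
Issues.LogPattern ->
  let title = issue.title
      service = fromMaybe "unknown-service" issue.service
   in [text|
        Describe this log pattern issue and its implications.
        Title: $title
        Service: $service
      |]
```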


Comment on lines +288 to +307
SELECT
lp.id,
lp.project_id,
lp.log_pattern,
lp.pattern_hash,
lp.baseline_state,
lp.baseline_volume_hourly_mean,
lp.baseline_volume_hourly_stddev,
COALESCE(counts.current_count, 0)::INT AS current_hour_count
FROM apis.log_patterns lp
LEFT JOIN (
SELECT log_pattern, COUNT(*) AS current_count
FROM otel_logs_and_spans
WHERE project_id = ?
AND timestamp >= date_trunc('hour', NOW())
AND log_pattern IS NOT NULL
GROUP BY log_pattern
) counts ON counts.log_pattern = lp.log_pattern
WHERE lp.project_id = ?
AND lp.state != 'ignored' AND lp.baseline_state = 'established'
Contributor:

How is log_pattern supposed to join on otel_logs_and_spans when they're not in the same database?

Or is log_patterns supposed to be a timeseries table in timefusion as well? If that's the case, then you can't make queries on timeseries tables that don't depend on a timestamp range.

Contributor:

Also, you never join two timeseries tables. Some databases might support the join operation, but the performance is always horrible in that case.

Comment on lines +573 to +584
[text|
SELECT lp.log_pattern, count(*) as p_count
FROM apis.log_patterns lp
INNER JOIN otel_logs_and_spans ols
ON lp.log_pattern = ols.log_pattern AND lp.project_id::text = ols.project_id
WHERE lp.project_id = ?
AND lp.state != 'ignored'
AND ${whereCondition}
GROUP BY lp.log_pattern
ORDER BY p_count DESC
OFFSET ? LIMIT 15
|]
Contributor:

Same as above: how is log_pattern joining on otel_logs_and_spans?

Contributor (author):

otel_logs_and_spans has a log_pattern column

Contributor:

Is log_pattern going to be in timefusion or postgres? If it's in timefusion, you can't do joins, and you can't query it without a time range being part of the query.


-- | Get pattern stats from otel_logs_and_spans
-- Returns median and MAD (Median Absolute Deviation) for robust baseline calculation
getPatternStats :: DB es => Projects.ProjectId -> Text -> Int -> Eff es (Maybe PatternStats)
Contributor:

Why are you not using our widgets, or at least KQL, for stats and numbers? Isn't this for display?

Contributor (author):

This is for detecting spikes


-- | Calculate baselines for log patterns
-- Uses hourly counts from otel_logs_and_spans over the last 7 days
calculateLogPatternBaselines :: Projects.ProjectId -> ATBackgroundCtx ()
Contributor:

I don't want to support magic alerts. If we're detecting spikes, we should implement an alert system that our users can enable on any metric as well, so it's the same code and logic for all cases, not magic logic we run in the background.


[text|
SELECT id, created_at, updated_at, project_id, issue_type::text, endpoint_hash, acknowledged_at, acknowledged_by, archived_at, title, service, critical,
CASE WHEN critical THEN 'critical' ELSE 'info' END, affected_requests, affected_clients, NULL::double precision,
CASE WHEN critical THEN 'critical' ELSE 'info' END, 0::int, 0::int, NULL::double precision,
Contributor:

Why did you set these to 0?

Contributor (author):

Those are not part of the new issues table.

Contributor:

Why are we adding it to the query?

Comment on lines +149 to +162
Issues.LogPattern ->
"Generate a concise title for this log pattern issue.\n"
<> "Title: "
<> issue.title
<> "\n"
<> "Service: "
<> fromMaybe "unknown-service" issue.service
Issues.LogPatternRateChange ->
"Generate a concise title for this log pattern rate change.\n"
<> "Title: "
<> issue.title
<> "\n"
<> "Service: "
<> fromMaybe "unknown-service" issue.service
Contributor:

Use the text quasiquotes here.

@@ -0,0 +1,68 @@
BEGIN;

CREATE TABLE IF NOT EXISTS apis.log_patterns (
Contributor:

We will support patterns on different fields, not just the default body/message field. This table doesn't seem aware of that expectation.

CREATE INDEX IF NOT EXISTS idx_log_patterns_last_seen ON apis.log_patterns(project_id, last_seen_at DESC);
CREATE INDEX IF NOT EXISTS idx_log_patterns_service ON apis.log_patterns(project_id, service_name);

CREATE OR REPLACE FUNCTION apis.new_log_pattern_proc() RETURNS trigger AS $$
Contributor:

Again, the question remains: is this a timeseries table or a regular table? If it's a timeseries table (to be on timefusion), then it won't be in the same database as the background jobs, and hence would be unable to queue jobs from within the DB.

Contributor (author):

It's a regular table.


@tonyalaribe (Contributor):

Rename patterns function from 15mins to 5mins.


…rom master

Restore the original 0030_replay_event_file_count and 0031_slack_bot_token
migrations from master and move log patterns changes into a new 0032 migration.
@claude bot commented Feb 25, 2026

PR Review: Log Patterns

Good overall design — detectSpikeOrDrop being pure and exported for unit testing is excellent, and the coverage across unit, integration, and bench specs is solid. A few concerns worth addressing:


Potential Bug: parseIssueType drops the "shape" alias

The old code had an explicit alias:

parseIssueType "shape" = Just APIChange -- Handle DB anomaly_type

The new implementation (rightToMaybe . parseUrlPiece) will return Nothing for "shape". If any code path (filters, URL params, existing rows) still passes "shape" here, it silently drops the issue type. Worth verifying no callers rely on this alias before merging.
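If the alias does need to survive, a minimal sketch (reusing only the names quoted above; their exact signatures in the codebase are assumptions) would be:

```haskell
-- Hedged sketch: keep the legacy alias in front of the generic parser.
-- `APIChange`, `rightToMaybe`, and `parseUrlPiece` are the names quoted
-- in this review, not verified project definitions.
parseIssueType :: Text -> Maybe IssueType
parseIssueType "shape" = Just APIChange -- legacy DB anomaly_type rows
parseIssueType t = rightToMaybe (parseUrlPiece t)
```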


Succinctness: unzip instead of two traversals

In extractFieldPatterns:

void $ LogPatterns.upsertLogPatternBatch (map fst allUps)
void $ LogPatterns.upsertHourlyStatBatch (map snd allUps)

allUps is traversed twice. unzip eliminates that:

let (ups, hss) = unzip allUps
void $ LogPatterns.upsertLogPatternBatch ups
void $ LogPatterns.upsertHourlyStatBatch hss

Same applies to persistSummaryPatterns where the two list comprehensions over prepared each re-traverse it — use unzip4 prepared instead.


Style: _ <- ...; pass vs void $ inconsistency

The two new scheduling loops use the old pattern while the third already uses void $. Use void $ consistently across all three.


Style: tryLog control flow is subtle

\(e :: SomeException) ->
  whenJust (fromException @SomeAsyncException e) throwIO
    >> Log.logAttention ...

throwIO aborts so >> is dead in the async branch, but the sequencing makes it look like both effects always run. case is more explicit:

\(e :: SomeException) -> case fromException @SomeAsyncException e of
  Just ae -> throwIO ae
  Nothing -> Log.logAttention ("LogPattern pipeline step failed: " <> label) (show e)

updateBaselineBatch: 6-tuple decomposition is hard to read

updateBaselineBatch :: ... -> V.Vector (Text, Text, BaselineState, Double, Double, Int) -> ...
srcFields = V.map (\(s, _, _, _, _, _) -> s) rows
hashes    = V.map (\(_, h, _, _, _, _) -> h) rows

A small named record would let you write V.map (.sourceField) rows etc., and makes the construction site in calculateLogPatternBaselines self-documenting.


Minor: Default LoggingDestination derivation is order-sensitive

deriving anyclass (Default) picks def as the first constructor. StdOut is first so behaviour matches the removed manual instance — but constructor reordering would silently break it. A short comment helps preserve the intent.


Breaking change in replaceAllFormats — stored pattern drift

"Oct 14, 2023" now produces "Oct {integer}, {integer}" instead of "{Mon DD, YYYY}". The test update documents it, but patterns already stored in the DB under the old format will diverge from newly-generated ones, causing duplicate clusters at the boundary. Worth calling out explicitly if there is no migration to re-normalize stored patterns.


Highlights

  • V.minIndexBy (comparing lastSeen) — excellent use of existing library function, cleanly replaces the 10-line fold.
  • bool Drain.generateDrainTokens Drain.generateSummaryDrainTokens isSummary — clean function selection.
  • WrappedEnumSC "" for IssueType — great use of the project utility; eliminates all the manual FromField/ToField/Display/FromHttpApiData boilerplate.
  • anyclass (Default, NFData) cleanup across Anomalies, Fields, Logging — good housekeeping.
  • detectSpikeOrDrop being pure and exported is the right call; unit tests cover boundary cases including the MAD=0 floor.
  • resolveFieldExpr whitelist for SQL injection protection in RequestDumps is the right approach.

@claude bot commented Feb 25, 2026

PR Review: Log Patterns Feature

Good overall structure. The Drain algorithm integration, hourly stats table, and MAD-based spike detection are well thought-out. Several concrete improvements below, prioritised by impact.


Bugs / Correctness

replaceAllFormats: replaceEmails is O(n^2) in @-count

On a non-matching @ the current code does replaceEmails (T.drop 1 after), which re-scans the prefix. For a string with k @ chars that don't form valid emails this is O(n*k). The fix is to advance from after directly:

-- current (buggy):
else before <> "@" <> replaceEmails (T.drop 1 after)

-- fix: don't re-scan `before`:
else T.reverse prefixRev <> T.singleton '@' <> replaceEmails (T.drop 1 after)

acknowledgeLogPatternsIN (SELECT unnest, unnest) is non-obvious

PostgreSQL does correlate both unnests pair-wise here, but the behaviour is surprising. Prefer an explicit lateral join for future readers:

FROM unnest(?::text[], ?::text[]) AS t(sf, ph)
WHERE source_field = t.sf AND pattern_hash = t.ph

Code Duplication (contradicts "less lines" priority)

upsertLogPattern / upsertHourlyStat duplicate their batch siblings' SQL entirely.

Both have identical ON CONFLICT clauses (~20 lines each). Delegate instead:

upsertLogPattern up = upsertLogPatternBatch [up]
upsertHourlyStat pid sf h t c = upsertHourlyStatBatch [(pid, sf, h, t, c)]

Hardcoded Enum Strings in SQL (type-safety regression)

pruneStalePatterns and autoAcknowledgeStaleNewPatterns compare against string literals:

[sql| WHERE state = 'acknowledged' ... |]   -- should be a typed parameter
[sql| WHERE state = 'new' ...          |]   -- should be a typed parameter

upsertLogPattern correctly uses the enum type in its ON CONFLICT clause. A renamed constructor silently breaks the prune queries. Fix:

pruneStalePatterns pid now staleDays =
  PG.execute [sql| DELETE FROM apis.log_patterns
                   WHERE project_id = ? AND last_seen_at < ?::timestamptz - INTERVAL '1 day' * ?
                   AND state = ? |]
             (pid, now, staleDays, LPSAcknowledged)

Same treatment for autoAcknowledgeStaleNewPatterns.


Opaque 6-tuple in updateBaselineBatch

updateBaselineBatch :: DB es => Projects.ProjectId
                   -> V.Vector (Text, Text, BaselineState, Double, Double, Int) -> Eff es Int64
srcFields = V.map (\(s, _, _, _, _, _) -> s) rows   -- 6 lines of this

A small record eliminates the positional destructuring and makes call sites self-documenting:

data BaselineUpdate = BaselineUpdate
  { buSourceField :: Text, buHash :: Text
  , buState :: BaselineState, buMean :: Double
  , buMad :: Double, buSamples :: Int }

Minor Issues

maybe 0 (\(Only n) -> n) → maybe 0 fromOnly in fetchLogPatterns (already used correctly in getTotalEventCount).

replaceAllFormats: isOctet does a toString/readMaybe round-trip. A direct fold avoids the String allocation:

isOctet t = T.length t <= 3 && T.all isDigit t &&
  T.foldl' (\acc c -> acc * 10 + (fromEnum c - 48)) 0 t <= (255 :: Int)

fetchLogPatterns: drop 1 $ T.splitOn "." f when f = "attributes." produces [""], passing an empty path to the #>> operator. Add a not (null parts) guard.
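A hedged sketch of that guard; `f` and `parts` mirror the snippet quoted above, and the surrounding plumbing is assumed:

```haskell
-- Sketch only: reject field paths like "attributes." whose split would
-- yield an empty segment, instead of passing [""] to the #>> operator.
let parts = drop 1 (T.splitOn "." f)
 in if null parts || any T.null parts
      then Nothing -- skip malformed field paths
      else Just parts
```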

processBatch nested lambda can use flip + LambdaCase:

V.foldl' (flip \case
  SeedPattern c  -> processNewLog isSummary "" Nothing c now
  NewEvent lid c -> processNewLog isSummary lid (Just c) c now)
  inTree batch

Positives

  • ExceptT / runExceptT in Enhancement.hs – clean removal of nested case ladders.
  • MkIssueOpts record replacing 11 positional args – major readability win.
  • WrappedEnumSC "" for IssueType – removes ~20 lines of hand-written ToField/FromField.
  • deriving anyclass (Default) on enum types – correct use of Generic machinery.
  • DrainInput sum type replacing Bool isSampleLog threading through 5 call layers.
  • tryLog async-exception re-throw – correct pattern (fromException @SomeAsyncException).
  • Spike detection unit tests covering edge cases including MAD=0 and min-delta guard.
  • Integration tests covering end-to-end extraction → baseline → spike → issue creation → acknowledgment.

@claude bot commented Feb 25, 2026

Code Review: PR #300 -- Log Patterns Pipeline

Good overall structure. Here are the key findings:


Bugs

1. ON CONFLICT deduplication collapses unrelated issues when target_hash is empty
src/Models/Apis/Issues.hs -- insertIssue

The new unique index has no guard for empty hashes. For RuntimeException issues where atError.hash = Nothing, targetHash = fromMaybe "" ... = "". Two unrelated runtime exceptions from different services both get target_hash = "" and the same issue_type, so ON CONFLICT deduplicates them into one issue. The old code had AND endpoint_hash != '' in the partial index -- this guard needs to be restored:

CREATE UNIQUE INDEX ... WHERE acknowledged_at IS NULL AND archived_at IS NULL AND target_hash != '';

2. "shape" alias in parseIssueType is silently broken
src/Models/Apis/Issues.hs

The old parseIssueType "shape" = Just APIChange was removed. The new implementation delegates to WrappedEnumSC via parseUrlPiece, which has no alias support. If any DB record or code path still produces "shape", it silently returns Nothing rather than Just ApiChange. Check whether Anomalies.hs or any SQL still references anomaly_type = 'shape'.


Correctness / Logic

3. Spike projection is 4x too aggressive for the first 15 minutes of every hour
src/BackgroundJobs.hs -- detectLogPatternSpikes

let minutesIntoHour = max 15 $ todMin (timeToTimeOfDay $ utctDayTime now)

todMin returns only the minute component (0-59), not total minutes since the hour start. At HH:00:30 this gives max 15 0 = 15 -> scaleFactor = 4.0, causing systematic 4x over-projection and false positives at the top of every hour. Use total elapsed seconds instead:

let totalSecs = round (realToFrac (utctDayTime now) :: Double) :: Int
    minutesIntoHour = max 15 $ (totalSecs `mod` 3600) `div` 60

Code Conciseness / Best Practices

4. tryLog should use catchSync from safe-exceptions
src/BackgroundJobs.hs

safe-exceptions is already in deps and provides exactly this pattern without the manual fromException @SomeAsyncException check:

import Control.Exception.Safe (catchSync)

tryLog :: Text -> ATBackgroundCtx () -> ATBackgroundCtx ()
tryLog label = (`catchSync` \(e :: SomeException) ->
  Log.logAttention ("LogPattern pipeline step failed: " <> label) (show e))

5. updateBaselineBatch does 6 separate passes over the same vector
src/Models/Apis/LogPatterns.hs

Six V.map (\(a,_,_,_,_,_) -> a) calls each traverse the full vector. Use a named record for the tuple with RecordWildCards, or restructure as a single traversal.

6. extractFieldPatterns: use unzip instead of two list traversals
src/BackgroundJobs.hs

-- Current:
let allUps = urlUpserts ++ excUpserts
void $ LogPatterns.upsertLogPatternBatch (map fst allUps)
void $ LogPatterns.upsertHourlyStatBatch (map snd allUps)

-- Better:
let (ups, hss) = unzip (urlUpserts ++ excUpserts)
void $ LogPatterns.upsertLogPatternBatch ups
void $ LogPatterns.upsertHourlyStatBatch hss

7. processNewLog uses bool with non-obvious argument order
src/BackgroundJobs.hs

bool f g b is if b then g else f -- false-case first, opposite to natural reading. Prefer explicit if:

let tokens = (if isSummary then Drain.generateSummaryDrainTokens else Drain.generateDrainTokens) content

8. Deriving Default via anyclass on enum types -- good cleanup
src/Models/Apis/Anomalies.hs, src/Models/Apis/Fields.hs

Good consolidation. Confirm the first constructor in each type is the intended def (ATUnknown, AAUnknown, FTUnknown, FCQueryParam) -- these match the removed instances, so this is correct.


Test Coverage Gaps

9. Spike detection test has non-deterministic branching
test/integration/Pages/LogPatternsSpec.hs -- test 4

The test branches based on whether an established pattern already exists in the DB, seeding one manually only if not found. This covers different code paths depending on DB state. Make seed setup unconditional for determinism.

10. hashes column update not verified in extraction test
test/integration/Pages/LogPatternsSpec.hs -- test 1

The test checks that log_pattern_hourly_stats rows exist but does not verify that otel_logs_and_spans.hashes was tagged with "pat:..." values -- the primary side-effect of persistSummaryPatterns.

11. Missing boundary test for pruneOldHourlyStats
test/integration/Pages/LogPatternsSpec.hs -- test 7

Only verifies a stat at -100h is pruned. No test that a stat at -(window - 1)h is NOT pruned; an off-by-one in the SQL interval expression would be silent.
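A sketch of the missing boundary case, in the hspec style of the existing spec. `seedHourlyStat`, `pruneOldHourlyStats`, `fetchHourlyStats`, and `windowHours` are stand-ins for whatever helpers LogPatternsSpec actually uses:

```haskell
-- Hypothetical boundary test: every helper name here is an assumption
-- about the existing spec's setup, not a real project function.
it "does not prune a stat just inside the retention window" $ do
  now <- getCurrentTime
  let justInside = addUTCTime (negate (fromIntegral (windowHours - 1) * 3600)) now
  seedHourlyStat projectId justInside
  void $ pruneOldHourlyStats projectId now
  stats <- fetchHourlyStats projectId
  stats `shouldSatisfy` (not . null) -- an off-by-one in the interval would empty this
```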


Info

  • Schema: log_patterns.state and baseline_state are TEXT rather than Postgres enums. Not a bug, but invalid values can be stored without DB-level rejection.
  • Job retry double-counting: ON CONFLICT DO UPDATE SET event_count = event_count + EXCLUDED.event_count is documented as acceptable. The z-score threshold of 3.0 and MAD floor of 1.0 should absorb occasional retries.

@claude bot commented Feb 25, 2026

PR Review: Log Pattern Processing

Good overall architecture — the extraction→baseline→spike→new-pattern pipeline is clearly structured, and the refactoring of mkIssue, enhanceIssueWithLLM (via ExceptT), and Default derivations is clean. A few conciseness and correctness notes below.


Potential Bug: occurrence_count::INT cast can overflow

In fetchLogPatterns, the precomputed-path query casts occurrence_count::INT but the column is BIGINT. For high-volume patterns this silently truncates.

Fix: select the column without the cast and convert with fromIntegral when constructing PatternRow.count.


Conciseness: duplicate SQL strings

upsertLogPattern and upsertLogPatternBatch share identical SQL (same for upsertHourlyStat / upsertHourlyStatBatch). Extracting each as a top-level Query constant lets both functions share it:

upsertLogPatternSql :: Query
upsertLogPatternSql = [sql| INSERT INTO apis.log_patterns (...) ON CONFLICT ... DO UPDATE SET ... |]

upsertLogPattern up = PG.execute upsertLogPatternSql up
upsertLogPatternBatch [] = pure 0
upsertLogPatternBatch ups = PG.executeMany upsertLogPatternSql ups

Conciseness: redundant Data.List (lookup) import

Relude already re-exports lookup, so import Data.List (lookup) in LogPatterns.hs is unnecessary.


Conciseness: updateBaselineBatch – 6 separate V.map passes

Six positional mappers on a 6-tuple vector are fragile. Data.List.Extra (already in the project) exposes unzip6. Using V.toList → unzip6 → V.fromList, or a small local record, removes the positional indexing.


Conciseness: map fst knownPatternFields repeated at call sites

Called at three sites (LogQueryBox, RequestDumps, LogPatterns). Exposing a knownSourceFields :: [Text] helper from LogPatterns avoids repeating the transformation:

knownSourceFields :: [Text]
knownSourceFields = map fst knownPatternFields

Conciseness: prepared 4-tuple in persistSummaryPatterns

(_, _, up, _) / (_, _, _, hs) positional access is fragile. The patternHash at position 2 is redundant — it already lives in UpsertPattern.hash. Dropping it yields a 3-tuple:

-- build as (filteredIds, up, hs)
forM_ prepared \(filteredIds, up, _) -> ...
void $ LogPatterns.upsertLogPatternBatch [up | (_, up, _) <- prepared]
void $ LogPatterns.upsertHourlyStatBatch [hs | (_, _, hs) <- prepared]

Conciseness: classifyHexContinuation uses case () of _ | ... guards

MultiWayIf (enabled in this project) is cleaner here:

-- current
case () of
  _ | len == 64 && atBoundary -> "{sha256}" ...
    | ...

-- cleaner
if | len == 64, atBoundary -> "{sha256}" ...
   | ...

(recognizeHex already uses case len of with guards which is fine.)


Performance: replaceTextDates does up to 12 traversals

The pre-pass traverses the string once per month name (up to 12 times). A single pass that scans for 3-letter word-boundary tokens and validates against a HashSet would be more efficient for this hot path.


Minor: LogPatternState missing Default instance

All other domain enums use deriving anyclass (Default). LogPatternState is the exception — LPSNew is the natural default.
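A sketch of that derivation. The constructor list is inferred from the states ('new', 'acknowledged', 'ignored') mentioned elsewhere in this PR and may not match the real type exactly:

```haskell
-- deriving anyclass (Default) takes the FIRST constructor as `def`,
-- so LPSNew must stay first for it to be the natural default.
-- Constructor names beyond LPSNew/LPSAcknowledged are assumptions.
data LogPatternState = LPSNew | LPSAcknowledged | LPSIgnored
  deriving stock (Eq, Show, Generic)
  deriving anyclass (Default, NFData)
```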


Tests

Good end-to-end coverage in LogPatternsSpec.hs. The conditional branching in test 4 (spike detection) that forks based on whether an established pattern with positive MAD already exists could be simplified by always seeding a synthetic pattern at the start of the test. SpikeDetectionSpec is a good pure unit test addition.


Overall solid work. The SQL migration is careful (backfilling target_hash, placing ALTER TYPE outside the BEGIN/COMMIT block as required by PG), tryLog correctly re-throws async exceptions, and replacing regex-based replaceAllFormats with the single-pass scanner is a meaningful improvement.

tonyalaribe and others added 5 commits February 26, 2026 00:48
- Use elapsed seconds for spike projection instead of todMin indirection
- Remove redundant async exception re-throw in tryLog (UnliftIO.catch handles it)
- Use lens tuple accessors in updateBaselineBatch for readability
- Use unzip in extractFieldPatterns instead of map fst/snd
- Replace bool with explicit if in processNewLog for clarity
@claude bot commented Feb 26, 2026

Code Review

Good overall implementation of the log pattern pipeline. A few issues worth addressing, ordered by severity.


Bugs / Correctness

1. parseIssueType loses the "shape" alias (regression risk)

Before, there was an explicit alias parseIssueType "shape" = Just APIChange for legacy DB rows. The new one-liner parseIssueType = rightToMaybe . parseUrlPiece drops it. If any rows have issue_type = 'shape', they will silently fail to parse. Either keep the alias via a custom FromHttpApiData instance, or confirm no such rows exist and add a migration to rewrite them.


2. detectLogPatternSpikes ignores _scheduledTime

The handler receives LogPatternPeriodicProcessing _scheduledTime pid but calls now <- Time.currentTime (wall clock) for the minutesIntoHour calculation. Other jobs use the scheduled time for window calculations. If the queue is backed up, minutesIntoHour will reflect actual wall time rather than the intended window start. Either use _scheduledTime or document why wall clock is correct here.


3. LogPatternRateChange badge always shows "trending-up" icon

Issues.LogPatternRateChange -> ("bg-fillWarning-strong", "trending-up", "RATE CHANGE")

The issue can represent a spike or a drop, but the icon is hardcoded to "trending-up". The changeDirection field exists in issueData — either decode it at the badge site or use a neutral icon like "activity".
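One possible shape for the fix, sketched against the tuple quoted above; how the direction is decoded out of issueData is an assumption:

```haskell
-- Sketch: pick the icon from the decoded change direction, falling back
-- to a neutral "activity" icon. `direction` extraction is hypothetical.
Issues.LogPatternRateChange ->
  let icon = case direction of
        Just "spike" -> "trending-up"
        Just "drop"  -> "trending-down"
        _            -> "activity"
   in ("bg-fillWarning-strong", icon, "RATE CHANGE")
```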


Code Conciseness / DRY

4. upsertLogPattern duplicates upsertLogPatternBatch SQL verbatim

The entire SQL body is copy-pasted between the two. Since executeMany works with a singleton list, this is sufficient:

upsertLogPattern :: DB es => UpsertPattern -> Eff es Int64
upsertLogPattern = upsertLogPatternBatch . pure

5. updateBaselineBatch uses view _1..view _6 on a 6-tuple

let srcFields = V.map (view _1) rows
    hashes    = V.map (view _2) rows
    ...
    samples   = V.map (view _6) rows

Pulling in Control.Lens just to project positional tuple fields is heavy. At minimum, use pattern-matching lambdas:

let srcFields = V.map (\(f,_,_,_,_,_) -> f) rows

Better still: define a BaselineUpdate record and project via field names. The same 6-tuple also makes calculateLogPatternBaselines fragile (V.filter (\(_, _, s, _, _, _) -> s == BSEstablished)). A named record cleans up both sites.


6. Three identical case AE.fromJSON (getAeson issue.issueData) blocks in anomalyDetailPage

Enhancement.hs already introduced withIssueData for this exact pattern. A rendering-context version collapses the three separate case/_ -> pass blocks:

withIssueDataH :: (AE.FromJSON a, Applicative m) => Issue -> (a -> m ()) -> m ()
withIssueDataH issue f = case AE.fromJSON (getAeson issue.issueData) of
  AE.Success d -> f d
  _            -> pure ()

Usage:

Issues.LogPattern           -> withIssueDataH @Issues.LogPatternData issue \d -> ...
Issues.LogPatternRateChange -> withIssueDataH @Issues.LogPatternRateChangeData issue \d -> ...

SQL Type Safety

7. Literal 'new'/'acknowledged' strings in prune functions

-- pruneStalePatterns
AND state = 'acknowledged'
-- autoAcknowledgeStaleNewPatterns
WHERE state = 'new'
SET state = 'acknowledged'

The rest of the module uses LPSAcknowledged/LPSNew via the WrappedEnumSC ToField instance. Typos in these literals won't be caught at compile time. Use parameterized values consistently:

AND state = ?   -- bind LPSAcknowledged / LPSNew

8. Parallel unnest in acknowledgeLogPatterns — prefer ROWS FROM

WHERE (source_field, pattern_hash) IN (SELECT unnest(?::text[]), unnest(?::text[]))

This works on Postgres >=9.4 but the safer canonical idiom is:

WHERE (source_field, pattern_hash) IN (
  SELECT * FROM unnest(?::text[], ?::text[]) AS t(source_field, pattern_hash)
)

Minor

  • UtilsBenchSpec uses wall-clock getCurrentTime diffs for timing assertions, which can flake on loaded CI. Consider tasty-bench or criterion if timing correctness matters long-term.
  • getLogPatternTexts has no LIMIT and logWarn fires after the full load. Consider adding LIMIT 6000 and warning before loading.

Overall the mkIssue smart constructor, runExceptT chains in Enhancement.hs, deriving anyclass Default, and the replaceAllFormats single-pass rewrite are all solid improvements. Main asks: fix the parseIssueType regression risk and replace the 6-tuple with a named record.

@claude bot commented Feb 26, 2026

Code Review: Log Patterns Pipeline

Great feature overall — the architecture is solid and the new MkIssueOpts/mkIssue builder is a nice improvement. Here are issues worth addressing, roughly prioritised.


Bugs / Correctness

1. occurrence_count truncation in fetchLogPatterns

occurrence_count is BIGINT in the schema; occurrence_count::INT in the SQL silently truncates values > 2^31. The Haskell tuple type Int should be Int64, and the SQL cast should be ::BIGINT.

2. scaleFactor precision in detectLogPatternSpikes

utctDayTime :: DiffTime routed through Double loses sub-second precision unnecessarily. Prefer truncate (utctDayTime now) :: Int.

3. paginateTree missing type signature

The paginateTree where-binding inside logsPatternExtraction has no type annotation. Add one.


Code Duplication

4. Copy-pasted SQL in upsert helpers

upsertLogPattern/upsertLogPatternBatch and upsertHourlyStat/upsertHourlyStatBatch contain identical SQL strings. Either extract as a named Query constant or implement the single-item variant via the batch one (upsertLogPattern up = upsertLogPatternBatch [up]).


Packages / Extensions Already Available

5. Lens-based 6-tuple access in updateBaselineBatch

Using view _1 ... view _6 makes 6 passes over the vector and pulls in Control.Lens just for this one function. Plain pattern matching in a single V.map removes the dependency entirely. Better still: introduce a small record type for the batch row to avoid positional 6-tuples altogether.

6. Redundant Data.List (lookup) import in LogPatterns.hs

Relude re-exports Prelude.lookup; the explicit import is a no-op.


Design / Clarity

7. Deprecated endpointHash on Issue with no tracking ticket

mkIssue always sets endpointHash = opts.targetHash. Every insert and select carries a redundant column. The inline comment notes it should be dropped once queries are migrated — please open a follow-up issue.

8. knownPatternFields includes "body" but nothing writes to it

extractSummaryPatterns uses sourceField = "summary" and extractFieldPatterns produces "url_path" / "exception". The "body" entry will appear in the UI dropdown but always return empty results.

9. Retry-induced double-counting in hourly stats

The upsertHourlyStat comment acknowledges: "SUM semantics mean job retries double-count." A crash-and-retry within the same 5-minute window doubles the bucket count, which can trigger false-positive spike alerts. If the extraction run re-aggregates from source data, GREATEST(existing, excluded) is idempotent and avoids this. Worth a deliberate decision rather than an accepted limitation.
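A sketch of the idempotent variant, under assumed table and column names (illustrative only; the real migration may name these differently):

```sql
-- Illustrative names. Safe under retries when each run re-aggregates
-- the full bucket from source data: a replayed identical value is a no-op.
INSERT INTO log_pattern_hourly_stats (project_id, pattern_hash, hour_bucket, event_count)
VALUES ($1, $2, $3, $4)
ON CONFLICT (project_id, pattern_hash, hour_bucket)
DO UPDATE SET event_count = GREATEST(log_pattern_hourly_stats.event_count, EXCLUDED.event_count);
```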

10. ON CONFLICT DO NOTHING without a conflict target in insertTestLog

Omitting the conflict target silently suppresses any constraint violation during tests. Use ON CONFLICT (id) DO NOTHING.


Minor

11. Operator precedence in shouldSkipLoad

&& binding tighter than || makes the expression correct but easy to misread. Wrap the && group in explicit parentheses.
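As a hypothetical sketch of the fix (shouldSkipLoad's real operands aren't shown here):

```haskell
-- (&&) is infixr 3 and (||) is infixr 2, so `a && b || c` already parses
-- as `(a && b) || c`; the parentheses just make that grouping explicit.
shouldSkip :: Bool -> Bool -> Bool -> Bool
shouldSkip a b c = (a && b) || c
```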

12. UtilsBenchSpec in unit-tests stanza

If this runs criterion/tasty-bench it will inflate unit test runtimes. Benchmarks should be a separate cabal stanza.

tonyalaribe and others added 3 commits February 26, 2026 01:08
…ate suite

- upsertLogPattern now delegates to upsertLogPatternBatch via pure
- Replace repeated case AE.fromJSON/getAeson/pass blocks with withIssueDataH
- Move UtilsBenchSpec out of unit tests into dedicated bench suite
- Pass scheduledTime to detectLogPatternSpikes for accurate partial-hour projection
@tonyalaribe tonyalaribe merged commit 9e1f936 into master Feb 26, 2026
7 checks passed
@tonyalaribe tonyalaribe deleted the log-patterns-chp branch February 26, 2026 00:11
@github-actions github-actions bot restored the log-patterns-chp branch February 26, 2026 00:11
@tonyalaribe tonyalaribe deleted the log-patterns-chp branch February 26, 2026 00:12
@claude

claude bot commented Feb 26, 2026

PR Review: Log Patterns Feature

Overall a solid, well-structured PR with good use of GHC extensions and batch DB operations. A few suggestions for improvement.


Positives

  • mkIssue + MkIssueOpts builder — eliminates ~90 lines of repeated Issue construction across all issue creation functions. Excellent refactor.
  • runExceptT/hoistEither in Enhancement.hs — the nested case chains were rightfully replaced; much cleaner.
  • deriving anyclass (Default) instead of manual instances throughout — good cleanup of boilerplate.
  • withIssueDataH helper — avoids the repeated AE.fromJSON (getAeson ...) >>= case pattern across the UI layer.
  • Batch DB operations everywhere (upsert, baseline update, stats) — avoids N+1.

Suggestions

1. updateBaselineBatch — 6-tuple indexed by lens is opaque

let srcFields = V.map (view _1) rows
    hashes    = V.map (view _2) rows
    states    = V.map (view _3) rows
    means     = V.map (view _4) rows
    mads      = V.map (view _5) rows
    samples   = V.map (view _6) rows

This 6-tuple is hard to read and maintain. A small named record:

data BaselineUpdate = BaselineUpdate
  { sourceField :: Text, patternHash :: Text, state :: BaselineState
  , mean :: Double, mad :: Double, samples :: Int }

…would let you use V.map (.sourceField) rows etc., removing the lens dependency from this function and making future changes obvious.

2. paginateTree event-meta 3-tuple should be a record

HM.HashMap Text (Maybe Text, Maybe Text, Maybe Text)
-- traceId?  serviceName? level?

Anonymous 3-tuple of Maybe Text is indistinguishable at a glance. A tiny record (or even a type alias with a comment) would make the destructuring in persistSummaryPatterns self-documenting.

3. fetchLogPatterns on-the-fly branch is too large

The function is ~100 lines mixing: SQL building, Drain tree construction, aggregation, and sparkline computation. The on-the-fly branch should be extracted as fetchLogPatternsOnTheFly (or similar) so the top-level function remains a readable dispatcher between precomputed and live paths.

4. Hourly stats are not idempotent on job retry

-- Note: SUM semantics mean job retries double-count.

This is acknowledged but worth addressing. A double-counted hourly bucket skews the baseline statistics, which can shift the noise floor and either suppress real spike alerts or trigger false ones. Since MAD is robust to occasional outliers, this is probably fine for now — but it's worth a follow-up issue, especially once this runs at scale.

5. endpointHash deprecation needs a tracking issue

, endpointHash :: Text -- Deprecated: always == targetHash (set by mkIssue). Drop column once ...

No follow-up issue is referenced. This will persist indefinitely without one. A -- TODO(#NNN) link would help.

6. replaceAllFormats: fromEnum c - 48 in isOctet

T.foldl' (\acc c -> acc * 10 + (fromEnum c - 48)) 0 t <= (255 :: Int)

This works but ord c - ord '0' (or just digitToInt c from Data.Char) is more readable. The old regex-based implementation was replaced for good reason (performance), but this helper can be a bit clearer.
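The two spellings are equivalent; a String-based sketch for comparison (the PR folds over Text, this version is just for illustration):

```haskell
import Data.Char (digitToInt)
import Data.List (foldl')

-- Both accumulate a decimal value digit by digit; digitToInt states the
-- intent directly instead of relying on the ASCII offset 48.
viaSubtraction, viaDigitToInt :: String -> Int
viaSubtraction = foldl' (\acc c -> acc * 10 + (fromEnum c - 48)) 0
viaDigitToInt  = foldl' (\acc c -> acc * 10 + digitToInt c) 0
```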

7. Benchmark suite uses hspec wall-clock for timing assertions

elapsed `shouldSatisfy` (< 1.0) -- 2k rows under 1 second

This is brittle under CI load variance. Fine as a smoke test but consider tasty-bench or criterion if you want meaningful regression tracking. Alternatively, make the threshold generous enough that it only fails on catastrophic slowdowns.

8. selectIssues raw query still uses hardcoded column positions

The selectIssues query and IssueL's FromRow are tied by positional binding. With target_hash added and service changed to Maybe Text, this coupling is a latent bug source. Consider migrating to postgresql-entity's _select like findOpenIssueForEndpoint was (nicely done there), or at least document the column order explicitly near the IssueL definition.


Minor

  • In processNewLogPatterns, the acknowledgement of ALL new patterns happens before checking volume threshold, which means low-volume projects will silently drop new-pattern alerts. This is documented in the comment ("even when volume too low") but could surprise operators — a Log.logInfo confirming how many were acknowledged-without-issue would be useful observability.
  • bool BSLearning BSEstablished est is idiomatic but some readers may not immediately recognise Data.Bool.bool; a brief comment or if est then BSEstablished else BSLearning is equally short and universally clear.

@claude

claude bot commented Feb 26, 2026

Code Review — PR #300: Log Patterns

Solid feature work overall — the pipeline design (extract → baseline → spike/drop → new pattern) is well-structured, the SQL is correct, and the doctest coverage on replaceAllFormats is excellent. The runExceptT/hoistEither refactoring in Enhancement.hs is a clear improvement. A few issues worth addressing before merging.


Bug / Security

getLogPatternById / getLogPatternsByIds missing project-level filter

-- LogPatterns.hs
getLogPatternById :: DB es => LogPatternId -> Eff es (Maybe LogPattern)
getLogPatternById lpid = listToMaybe <$> PG.query (_selectWhere @LogPattern [[DAT.field| id |]]) (Only lpid)

getLogPatternsByIds :: DB es => V.Vector LogPatternId -> Eff es (V.Vector LogPattern)
getLogPatternsByIds ids
  | V.null ids = pure V.empty
  | otherwise = V.fromList <$> PG.query (_select @LogPattern <> " WHERE id = ANY(?)") (Only ids)

Every other function in this module takes Projects.ProjectId and filters on it. These two don't. If a caller ever derives IDs from user input or another project's context, they can leak patterns across project boundaries. Add projectId to both signatures and filter on it (consistent with every other function in this module).


Correctness / Subtle Bug

persistSummaryPatterns runs N individual UPDATEs against otel_logs_and_spans

forM_ prepared \(filteredIds, patternHash, _, _) -> do
  let hashTag = "pat:" <> patternHash
  void $ withTimefusion tfEnabled $ PG.execute [sql|UPDATE otel_logs_and_spans SET hashes = array_append(hashes, ?) WHERE project_id = ? AND timestamp >= ? AND id = ANY(?::uuid[]) AND NOT (hashes @> ARRAY[?])|] (hashTag, pid, since, filteredIds, hashTag)

For a project with many patterns this is O(patterns) queries against otel_logs_and_spans, which can be large. The existing pattern in this codebase for bulk updates is unnest() arrays — the same approach used in updateBaselineBatch. Consider a single query: UPDATE … SET hashes = array_append(hashes, tag_for_id) FROM (SELECT unnest(?::uuid[]) AS id, unnest(?::text[]) AS tag) v WHERE …. At minimum, a log warning when length prepared > N would catch unexpectedly large batches.


Code Conciseness

6-tuple V.Vector with view _1 … view _6 — use a record or V.unzip

-- updateBaselineBatch — requires Control.Lens import just for this
let srcFields = V.map (view _1) rows
    hashes    = V.map (view _2) rows
    states    = V.map (view _3) rows
    means     = V.map (view _4) rows
    mads      = V.map (view _5) rows
    samples   = V.map (view _6) rows

With RecordWildCards available, a small named record (or even just pattern-matching with a helper lambda) avoids the Lens import and the indexed-field mystery. Alternatively, accept V.Vector PatternBaselineUpdate where data PatternBaselineUpdate = PatternBaselineUpdate { sourceField, patternHash, state, mean, mad, samples } and unzip inside. The 6-tuple signature leaks into every caller.

monthSet defined inside replaceAllFormats

replaceAllFormats !input = go Nothing (replacePrePass input)
  where
    monthSet :: HS.HashSet Text
    monthSet = HS.fromList ["Jan", "Feb", …]

This is a constant with no free variables. GHC may float it as a CAF, but it's not guaranteed — move it to a top-level binding to make the intent explicit and avoid any risk of repeated allocation on every call to replaceAllFormats.

sourceFieldLabel is fine as-is

sourceFieldLabel f = fromMaybe f $ lookup f knownPatternFields

Already concise. knownPatternFields is a plain association list, and with only four entries the linear lookup is not a concern even on a hot path.

processNewLogPatterns: unnecessary V.fromList $ filter then V.forM

let issueWorthy = V.fromList $ filter (\lp -> lp.sourceField /= "url_path") newPatterns
issueIds <- V.forM issueWorthy \lp -> …

newPatterns is a [LogPattern]. Filter, then forM (list), then collect — no need for the intermediate Vector:

let issueWorthy = filter (\lp -> lp.sourceField /= "url_path") newPatterns
issueIds <- forM issueWorthy \lp -> …
unless (null issueIds) $ …

Design / Maintainability

knownPatternFields lists "body" but the extraction pipeline never processes it

logsPatternExtraction only handles summary, url_path, and exception. The "body" -> "Log body" entry in knownPatternFields (and the corresponding selector in LogQueryBox) suggests an unimplemented path. Either implement it or remove the entry to avoid user confusion (the UI will offer it but it will always fall back to the on-the-fly query in fetchLogPatterns).

LPSIgnored is a dead state

data LogPatternState = LPSNew | LPSAcknowledged | LPSIgnored

Nothing in this PR transitions a pattern to LPSIgnored, and pruneStalePatterns only prunes LPSAcknowledged. If this is future work, a TODO comment would help; if not planned, remove it to keep the state machine honest.

BatchPatternStats.totalEvents is computed but unused

calculateLogPatternBaselines uses hourlyMedian, hourlyMADScaled, and totalHours from BatchPatternStats, but not totalEvents. The CTE and COALESCE(SUM(event_count),0) column in getBatchPatternStats are dead weight. Remove the field (or use it — it would let you warn on patterns with very few total events, which might produce unreliable baselines).

endpointHash as deprecated alias

, endpointHash :: Text -- Deprecated: always == targetHash. Drop column once selectIssues/selectIssueByHash are migrated.

The migration backfills and the comment acknowledges the debt. It's fine for now, but worth tracking as a follow-up since it keeps two fields in sync by convention rather than by code.


Minor

  • upsertHourlyStat delegates to upsertHourlyStatBatch . pure — clean and concise, no issues.
  • detectSpikeOrDrop exported from BackgroundJobs — pure function, easy to test. The unit test spec looks solid. One edge case not tested: mad = 0, mean = 0, currentCount = 0 (floor at 1.0 makes z = 0, so no alert — correct).
  • replaceAllFormats rewrite — the single-pass scanner is a meaningful improvement over the multi-regex approach. The doctest suite is thorough. The pre-pass for emails/JWTs/text dates before the main go scanner is a reasonable decomposition.
  • bool BSLearning BSEstablished est — nice use of Data.Bool.bool.
  • tryLog wrapper — good defensive pattern for the pipeline steps.
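The floor-at-1.0 behaviour noted for detectSpikeOrDrop above can be sketched with a simplified stand-in (the real function also takes thresholds and returns a SpikeResult):

```haskell
-- Simplified stand-in for the z-score core of detectSpikeOrDrop:
-- flooring the MAD at 1.0 keeps the denominator sane when a pattern's
-- history is flat (mad = 0), so mean = mad = current = 0 yields z = 0.
robustZ :: Double -> Double -> Double -> Double
robustZ mean mad current = (current - mean) / max 1.0 mad
```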

Overall the logic is sound. The three most important things to fix before merging: the missing projectId filter on getLogPatternById/getLogPatternsByIds, the dead LPSIgnored state (or document it), and the "body" field in knownPatternFields that has no backing extraction.

@claude

claude bot commented Feb 26, 2026

Code Review — Log Patterns (#300)

Overall this is a well-structured addition. The baseline/MAD spike detection, the Drain integration, and the migration are solid. A few issues worth addressing before merge:


1. Replace the 6-tuple in updateBaselineBatch with a named record

-- Current: fragile — mean and mad are both Double, easy to swap silently
updateBaselineBatch :: DB es => Projects.ProjectId -> V.Vector (Text, Text, BaselineState, Double, Double, Int) -> Eff es Int64

let srcFields = V.map (view _1) rows
    hashes    = V.map (view _2) rows
    states    = V.map (view _3) rows
    means     = V.map (view _4) rows
    mads      = V.map (view _5) rows
    samples   = V.map (view _6) rows

You already have the UpsertPattern and BatchPatternStats pattern — define a BaselineUpdate record (or reuse an inline record with {sourceField, patternHash, baselineState, mean, mad, samples}) and destructure it in the SQL unnest block. Eliminates the Control.Lens._1/_6 import and makes the type self-documenting.


2. N+1 SQL round-trips in persistSummaryPatterns

forM_ prepared \(filteredIds, patternHash, _, _) -> do
  let hashTag = "pat:" <> patternHash
  void $ withTimefusion tfEnabled $ PG.execute [sql|UPDATE otel_logs_and_spans SET hashes = array_append(hashes, ?) ...|] (hashTag, pid, since, filteredIds, hashTag)

This fires one round-trip per Drain pattern group (typically 20–100 per 5-minute job). You're already using unnest everywhere else — a single batched update via unnest(?::text[], ?::uuid[]) or a subquery join would replace the loop entirely:

UPDATE otel_logs_and_spans SET hashes = array_append(hashes, v.tag)
FROM (SELECT unnest(?::uuid[]) AS id, unnest(?::text[]) AS tag) v
WHERE otel_logs_and_spans.id = v.id
  AND project_id = ? AND timestamp >= ?
  AND NOT (hashes @> ARRAY[v.tag])

Where you pass the flattened (allIds, corresponding tags) from the prepared list.
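A hypothetical sketch of that flattening step, with String standing in for the real UUID/Text types:

```haskell
-- Flatten the per-pattern groups into the two parallel arrays a single
-- unnest-based UPDATE expects; ids and tags stay aligned 1:1.
flattenPrepared :: [([String], String)] -> ([String], [String])
flattenPrepared groups =
  unzip [ (rowId, "pat:" <> patternHash) | (ids, patternHash) <- groups, rowId <- ids ]
```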


3. knownPatternFields — export knownSourceFields to avoid map fst at call sites

-- Log.hs
if target `elem` map fst LogPatterns.knownPatternFields

-- Settings.hs
let isCustom = maybe False (\s -> s `notElem` map fst knownPatternFields) config.patternSelected

Add the following to the LogPatterns exports:

knownSourceFields :: [Text]
knownSourceFields = map fst knownPatternFields

Two call sites then become (`elem` LogPatterns.knownSourceFields) with no extra allocation.


4. detectSpikeOrDrop — 4 consecutive Double parameters

detectSpikeOrDrop :: Double -> Double -> Double -> Double -> Int64 -> Maybe Issues.SpikeResult
detectSpikeOrDrop zThreshold minDelta mean mad currentCount

mean and mad can be silently transposed at the call site. The tests do catch this for the fixed-threshold calls, but callers passing dynamic values (e.g. from BatchPatternStats.hourlyMedian/hourlyMADScaled) have no protection. A simple data BaselineStats = BaselineStats { mean :: Double, mad :: Double } record (or even collapsing the first two params into a SpikeConfig) would eliminate the risk.


5. Minor / style

  • V.fromList $ map (\lp -> (lp.sourceField, lp.patternHash)) newPatterns — with Control.Arrow.(&&&) (from base, no extra dependency): V.fromList $ map ((.sourceField) &&& (.patternHash)) newPatterns.

  • import Data.List (lookup) in LogPatterns.hs — Relude already re-exports lookup; this import looks redundant and can likely be dropped.

  • endpointHash = opts.targetHash in mkIssue — the deprecation comment is clear; a linked issue/TODO would make the migration intent easier to track.

  • Benchmark target in monoscope.cabal duplicates most of the extension/flag list from the unit-tests stanza verbatim — consider factoring these into a common stanza if cabal version allows.


Strengths worth noting

  • The replaceAllFormats single-pass Data.Text.Lazy.Builder rewrite is a clear improvement over the multi-pass regex approach — fewer allocations, more patterns handled, better doctest coverage.
  • MAD-based baseline with a floor of 1.0 is a robust choice. The getBatchPatternStats single-query PERCENTILE_CONT approach is efficient.
  • The mkIssue factory eliminating repetition across all issue creation functions is a good refactor.
  • Good test coverage: unit tests for detectSpikeOrDrop, integration pipeline tests, and the new benchmark.
