Skip to content

[Correctness] ServyFailureEmail.ps1 — Timestamp only persisted on email success causes event storm after SMTP outage #760

@Christophe-Rogiers

Description

@Christophe-Rogiers

Severity: Warning
File: setup/taskschd/ServyFailureEmail.ps1 lines ~260-303

Description:

The per-event loop only updates $lastSuccessfulTimestamp and writes the timestamp file inside the email-success branch:

if (Send-NotificationEmail -Subject $subject -Body $htmlBody -scriptDir $scriptDir) {
    Write-Host "Email Notification sent for '$serviceName'."
    $lastSuccessfulTimestamp = $evt.TimeCreated

    # Update timestamp immediately for this specific event
    if ($null -ne $lastSuccessfulTimestamp) {
        ...
        [System.IO.File]::WriteAllText($timestampFile, $timestampString, [System.Text.Encoding]::UTF8)
        ...
    }
} else {
    Write-Host "Aborting further processing due to email failure." -ForegroundColor Yellow
    break
}

Failure mode:

If the SMTP server is unreachable for some window (DNS glitch, network partition, auth failure, relay unavailable), the else branch fires break and exits the foreach without updating the timestamp file. The next scheduled-task tick re-queries Event Log from the old $lastSuccessfulTimestamp and retries every event that happened in the meantime. When SMTP eventually recovers, the script sends one email per failure event accumulated across the entire outage window — an email storm exactly when the operator is already dealing with an incident.

Worse, because of the break, only the first retried event even reaches Send-NotificationEmail in the recovery tick; if that succeeds, the loop continues from event #2 and sends the whole backlog in one run. If it fails again, the queue keeps growing.

Suggested fix:

Separate "event was observed" from "email was delivered". Options:

  1. Update the timestamp to $evt.TimeCreated regardless of send result, and queue failed notifications to a local retry file (next run drains the retry file before querying Event Log).
  2. Update timestamp on failure too, and accept "one missed email per outage" — safer than storm.
  3. Keep current behaviour but cap the lookback window (e.g. max($lastSuccessfulTimestamp, now - 1h)) so long outages self-limit.

Option 1 is the most correct; option 3 is the smallest diff that kills the storm. Whichever is chosen, the 'break' on send failure should be replaced with 'continue' so a single bad event doesn't block the rest of the batch indefinitely.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions