[Correctness] ServyFailureEmail.ps1 — Timestamp only persisted on email success causes event storm after SMTP outage

**Severity:** Warning
**File:** `setup/taskschd/ServyFailureEmail.ps1` lines ~260-303

**Description:**

The per-event loop only updates `$lastSuccessfulTimestamp` and writes the timestamp file inside the email-success branch:

```powershell
if (Send-NotificationEmail -Subject $subject -Body $htmlBody -scriptDir $scriptDir) {
    Write-Host "Email Notification sent for '$serviceName'."
    $lastSuccessfulTimestamp = $evt.TimeCreated

    # Update timestamp immediately for this specific event
    if ($null -ne $lastSuccessfulTimestamp) {
        ...
        [System.IO.File]::WriteAllText($timestampFile, $timestampString, [System.Text.Encoding]::UTF8)
        ...
    }
} else {
    Write-Host "Aborting further processing due to email failure." -ForegroundColor Yellow
    break
}
```

**Failure mode:**

If the SMTP server is unreachable for some window (DNS glitch, network partition, auth failure, relay unavailable), the `else` branch fires `break` and exits the foreach without updating the timestamp file. The next scheduled-task tick re-queries Event Log from the old `$lastSuccessfulTimestamp` and retries every event that happened in the meantime. When SMTP eventually recovers, the script sends one email per failure event accumulated across the entire outage window — an email storm exactly when the operator is already dealing with an incident.

Worse, because of the `break`, only the _first_ retried event even reaches Send-NotificationEmail in the recovery tick; if that succeeds, the loop continues from event #2 and sends the whole backlog in one run. If it fails again, the queue keeps growing.

**Suggested fix:**

Separate "event was observed" from "email was delivered". Options:

1. Update the timestamp to `$evt.TimeCreated` regardless of send result, and queue failed notifications to a local retry file (next run drains the retry file before querying Event Log).
2. Update timestamp on failure too, and accept "one missed email per outage" — safer than storm.
3. Keep current behaviour but cap the lookback window (e.g. `max($lastSuccessfulTimestamp, now - 1h)`) so long outages self-limit.

Option 1 is the most correct; option 3 is the smallest diff that kills the storm. Whichever is chosen, the 'break' on send failure should be replaced with 'continue' so a single bad event doesn't block the rest of the batch indefinitely.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Correctness] ServyFailureEmail.ps1 — Timestamp only persisted on email success causes event storm after SMTP outage #760

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

[Correctness] ServyFailureEmail.ps1 — Timestamp only persisted on email success causes event storm after SMTP outage #760

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions