Severity: Warning
File: setup/taskschd/ServyFailureEmail.ps1 lines ~260-303
Description:
The per-event loop only updates $lastSuccessfulTimestamp and writes the timestamp file inside the email-success branch:
if (Send-NotificationEmail -Subject $subject -Body $htmlBody -scriptDir $scriptDir) {
Write-Host "Email Notification sent for '$serviceName'."
$lastSuccessfulTimestamp = $evt.TimeCreated
# Update timestamp immediately for this specific event
if ($null -ne $lastSuccessfulTimestamp) {
...
[System.IO.File]::WriteAllText($timestampFile, $timestampString, [System.Text.Encoding]::UTF8)
...
}
} else {
Write-Host "Aborting further processing due to email failure." -ForegroundColor Yellow
break
}
Failure mode:
If the SMTP server is unreachable for some window (DNS glitch, network partition, auth failure, relay unavailable), the else branch fires break and exits the foreach without updating the timestamp file. The next scheduled-task tick re-queries Event Log from the old $lastSuccessfulTimestamp and retries every event that happened in the meantime. When SMTP eventually recovers, the script sends one email per failure event accumulated across the entire outage window — an email storm exactly when the operator is already dealing with an incident.
Worse, because of the break, only the first retried event even reaches Send-NotificationEmail in the recovery tick; if that succeeds, the loop continues from event #2 and sends the whole backlog in one run. If it fails again, the queue keeps growing.
Suggested fix:
Separate "event was observed" from "email was delivered". Options:
- Update the timestamp to
$evt.TimeCreated regardless of send result, and queue failed notifications to a local retry file (next run drains the retry file before querying Event Log).
- Update timestamp on failure too, and accept "one missed email per outage" — safer than storm.
- Keep current behaviour but cap the lookback window (e.g.
max($lastSuccessfulTimestamp, now - 1h)) so long outages self-limit.
Option 1 is the most correct; option 3 is the smallest diff that kills the storm. Whichever is chosen, the 'break' on send failure should be replaced with 'continue' so a single bad event doesn't block the rest of the batch indefinitely.
Severity: Warning
File:
setup/taskschd/ServyFailureEmail.ps1lines ~260-303Description:
The per-event loop only updates
$lastSuccessfulTimestampand writes the timestamp file inside the email-success branch:Failure mode:
If the SMTP server is unreachable for some window (DNS glitch, network partition, auth failure, relay unavailable), the
elsebranch firesbreakand exits the foreach without updating the timestamp file. The next scheduled-task tick re-queries Event Log from the old$lastSuccessfulTimestampand retries every event that happened in the meantime. When SMTP eventually recovers, the script sends one email per failure event accumulated across the entire outage window — an email storm exactly when the operator is already dealing with an incident.Worse, because of the
break, only the first retried event even reaches Send-NotificationEmail in the recovery tick; if that succeeds, the loop continues from event #2 and sends the whole backlog in one run. If it fails again, the queue keeps growing.Suggested fix:
Separate "event was observed" from "email was delivered". Options:
$evt.TimeCreatedregardless of send result, and queue failed notifications to a local retry file (next run drains the retry file before querying Event Log).max($lastSuccessfulTimestamp, now - 1h)) so long outages self-limit.Option 1 is the most correct; option 3 is the smallest diff that kills the storm. Whichever is chosen, the 'break' on send failure should be replaced with 'continue' so a single bad event doesn't block the rest of the batch indefinitely.