Severity: Info
File: src/Servy.Core/Services/ServiceManager.cs
Lines: 830–885 (GetAllServices, Parallel.ForEach body)
Description:
GetAllServices enumerates the SCM list and calls PopulateNativeDetails for each service in parallel:
    Parallel.ForEach(services, new ParallelOptions
    {
        CancellationToken = cancellationToken,
        MaxDegreeOfParallelism = Math.Min(Environment.ProcessorCount, MaxParallelScmQueries),
    },
    service =>
    {
        try
        {
            if (cancellationToken.IsCancellationRequested) return;
            ServiceInfo info = new ServiceInfo { ... };
            // Fetch deep details natively
            PopulateNativeDetails(scmHandle, info);
            results.Add(info);
        }
        finally
        {
            service.Dispose();
        }
    });
The CancellationToken only prevents new iterations from starting; it cannot interrupt a native SCM call that is already in flight. PopulateNativeDetails issues QueryServiceConfig / QueryServiceConfig2W against the SCM, and these calls have been observed to hang on:
- protected services where the calling token lacks the right access mask
- driver services in transitional states
- corrupted service registry entries
- machines where a kernel filter driver intercepts SCM calls
When that happens, one of the parallel workers (typically 4–8) stays blocked until the native call returns. With MaxDegreeOfParallelism = min(ProcessorCount, MaxParallelScmQueries), a handful of concurrent hangs can occupy every worker, and the user-visible Manager UI stalls for as long as they last. The cancellation request is honoured for queued iterations, but Parallel.ForEach does not return until the in-flight calls do, so cancellation cannot complete in the meantime.
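The behaviour described above can be reproduced without the SCM. The following is a minimal sketch, not code from ServiceManager: CancellationDemo, SimulatedNativeCall and MeasureCancellationLatency are hypothetical names, and Thread.Sleep stands in for a blocking native query that ignores the token.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationDemo
{
    // Hypothetical stand-in for a blocking native SCM query.
    // Thread.Sleep never looks at a CancellationToken, just like a P/Invoke call.
    private static void SimulatedNativeCall(int blockMs)
    {
        Thread.Sleep(blockMs);
    }

    // Measures how long Parallel.ForEach takes to return after the token is cancelled.
    public static TimeSpan MeasureCancellationLatency(int blockMs, int cancelAfterMs)
    {
        using (var cts = new CancellationTokenSource(cancelAfterMs))
        {
            var options = new ParallelOptions
            {
                CancellationToken = cts.Token,
                MaxDegreeOfParallelism = 2,
            };

            var stopwatch = Stopwatch.StartNew();
            try
            {
                Parallel.ForEach(Enumerable.Range(0, 8), options,
                    _ => SimulatedNativeCall(blockMs));
            }
            catch (OperationCanceledException)
            {
                // Raised only once every iteration that had already started has returned.
            }
            return stopwatch.Elapsed;
        }
    }
}
```

Calling CancellationDemo.MeasureCancellationLatency(500, 50) should report a latency close to the 500 ms block time rather than the 50 ms cancellation delay: the queued iterations are skipped, but the loop still waits for the ones already running.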
Reproduction (general shape):
- Have a service with an unusual access ACL or a driver service in START_PENDING for an extended period.
- Open the Manager UI on that machine.
- Click cancel and observe that the UI does not actually unblock until each in-flight native call returns on its own.
Suggested fix:
Wrap PopulateNativeDetails in a Task.Run(...).Wait(timeoutMs, cancellationToken) so a stuck call cannot keep a worker indefinitely:
    service =>
    {
        try
        {
            if (cancellationToken.IsCancellationRequested) return;
            ServiceInfo info = new ServiceInfo { ... };
            // Run the native query on a separate thread-pool thread and wait at most the configured timeout.
            bool populated = Task.Run(() => PopulateNativeDetails(scmHandle, info), cancellationToken)
                .Wait(AppConfig.PopulateNativeDetailsTimeoutMs, cancellationToken);
            if (!populated)
            {
                // Emit the basic info we already have rather than dropping the service entirely.
                info.Description = "(details unavailable: native query timed out)";
            }
            results.Add(info);
        }
        catch (OperationCanceledException) { /* token cancelled */ }
        finally
        {
            service.Dispose();
        }
    }
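Pick a sensible default in AppConfig (e.g. 1000–2000 ms per service). The abandoned native call will still complete on its own, but it no longer holds up the Parallel.ForEach worker or the cancellation path. Two caveats apply: each abandoned call keeps a thread-pool thread busy until it returns, and it may still write to info and use scmHandle after the timeout, so the handle should not be closed underneath it and the timed-out info should be treated as possibly still changing.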
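One way to keep an abandoned call from touching an object that has already been published is to have the query return a fresh value and read it only when the wait succeeds. The following is a minimal sketch of that idea, not the project's code: TimeoutGuard and TryRun are hypothetical names, and applying it to ServiceManager would assume a variant of PopulateNativeDetails that returns its details instead of mutating info.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutGuard
{
    // Hypothetical helper: runs `query` on the thread pool and waits at most timeoutMs.
    // Returns false on timeout. Throws OperationCanceledException if `token` is
    // cancelled while waiting, and AggregateException if `query` itself fails in time.
    public static bool TryRun<T>(Func<T> query, int timeoutMs, CancellationToken token, out T result)
    {
        result = default(T);
        Task<T> task = Task.Run(query);

        // If the call is abandoned and fails later, read the exception so it is observed.
        task.ContinueWith(t => { var ignored = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);

        if (!task.Wait(timeoutMs, token))
        {
            // Timed out: the call keeps running, but nothing reads its result.
            return false;
        }

        result = task.Result;
        return true;
    }
}
```

A call such as TimeoutGuard.TryRun(() => 42, 1000, CancellationToken.None, out int value) returns true with value set to 42, while a query that blocks longer than the timeout returns false and leaves the caller free to emit the basic info it already has. The abandoned call still occupies a thread-pool thread until it returns, so this limits the damage from a hang rather than removing it.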
Severity rationale:
Marked Info rather than Warning because the failure mode requires a specifically misbehaving service in the SCM list — most production environments will never trip it. Where it does trip, however, the symptom (Manager UI permanently unresponsive, cancel button "doing nothing") is severe and hard to diagnose without the source context above.