[Robustness] ServiceManager.GetAllServices — Parallel.ForEach has no per-service timeout, single hung SCM RPC blocks a worker #819

@Christophe-Rogiers

Severity: Info

File: src/Servy.Core/Services/ServiceManager.cs
Lines: 830–885 (GetAllServices, Parallel.ForEach body)

Description:

GetAllServices enumerates the SCM list and calls PopulateNativeDetails for each service in parallel:

Parallel.ForEach(services, new ParallelOptions
{
    CancellationToken = cancellationToken,
    MaxDegreeOfParallelism = Math.Min(Environment.ProcessorCount, MaxParallelScmQueries),
},
service =>
{
    try
    {
        if (cancellationToken.IsCancellationRequested) return;

        ServiceInfo info = new ServiceInfo { ... };

        // Fetch deep details natively
        PopulateNativeDetails(scmHandle, info);

        results.Add(info);
    }
    finally
    {
        service.Dispose();
    }
});

The CancellationToken only prevents new iterations from starting; it cannot interrupt an in-flight native SCM call. PopulateNativeDetails issues QueryServiceConfig / QueryServiceConfig2W against the SCM, and these calls have been observed to hang on:

  • protected services where the calling token lacks the right access mask
  • driver services in transitional states
  • corrupted service registry entries
  • machines where a kernel filter driver intercepts SCM calls

When that happens, one of the (typically 4–8) parallel workers stays blocked until the native call eventually returns. Because MaxDegreeOfParallelism = min(ProcessorCount, MaxParallelScmQueries), a handful of concurrent hangs can occupy every worker, and the user-visible Manager UI stalls indefinitely. Cancelling does not help: the token stops queued iterations from starting, but Parallel.ForEach does not return until every in-flight iteration has finished, so the cancellation never completes while an RPC is hung.

Reproduction (general shape):

  1. Have a service with an unusual access ACL or a driver service in START_PENDING for an extended period.
  2. Open the Manager UI on that machine and let it start enumerating services.
  3. Click cancel and observe that the UI does not unblock until each in-flight native call returns on its own.
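
The stall can be modelled without a real SCM. The sketch below is a stand-alone illustration, not code from the repository: SimulatedHungQuery is a hypothetical stand-in for a native call that ignores the token (here just Thread.Sleep), and the class and method names are made up for the example. It measures how long Parallel.ForEach takes to return when the token is cancelled shortly after the loop starts.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationStallDemo
{
    // Hypothetical stand-in for a native SCM query: it blocks and never looks at the token.
    private static void SimulatedHungQuery(int blockMs)
    {
        Thread.Sleep(blockMs);
    }

    // Cancels the token after cancelAfterMs and returns how long Parallel.ForEach took to return.
    public static TimeSpan Measure(int blockMs, int cancelAfterMs)
    {
        using (var cts = new CancellationTokenSource(cancelAfterMs))
        {
            var stopwatch = Stopwatch.StartNew();
            try
            {
                Parallel.ForEach(
                    Enumerable.Range(0, 8),
                    new ParallelOptions { CancellationToken = cts.Token, MaxDegreeOfParallelism = 4 },
                    _ => SimulatedHungQuery(blockMs));
            }
            catch (OperationCanceledException)
            {
                // Raised only once the iterations that were already running have returned.
            }
            return stopwatch.Elapsed;
        }
    }

    public static void Main()
    {
        TimeSpan elapsed = Measure(blockMs: 2000, cancelAfterMs: 100);
        Console.WriteLine("Cancel requested at ~100 ms; ForEach returned after {0:F0} ms",
                          elapsed.TotalMilliseconds);
    }
}
```

On a typical run the reported time should be close to the simulated block time rather than the cancel delay, which is the same shape as the Manager UI stall described above.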

Suggested fix:

Wrap PopulateNativeDetails in a Task.Run(...).Wait(timeoutMs, cancellationToken) so a stuck call cannot keep a worker indefinitely:

service =>
{
    try
    {
        if (cancellationToken.IsCancellationRequested) return;

        ServiceInfo info = new ServiceInfo { ... };

        // Run the native query on a thread-pool thread and wait at most the configured timeout.
        // Wait returns false on timeout and throws OperationCanceledException on cancellation.
        bool populated = Task.Run(() => PopulateNativeDetails(scmHandle, info), cancellationToken)
                             .Wait(AppConfig.PopulateNativeDetailsTimeoutMs, cancellationToken);

        if (!populated)
        {
            // Emit the basic info we already have rather than dropping the service entirely.
            // Caveat: the abandoned task still holds info and may overwrite its fields later.
            info.Description = "(details unavailable: native query timed out)";
        }

        results.Add(info);
    }
    catch (OperationCanceledException) { /* token cancelled */ }
    finally
    {
        service.Dispose();
    }
}

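Pick a sensible default in AppConfig (e.g. 1000–2000 ms per service). The abandoned native call still runs to completion on its own thread-pool thread, but it no longer holds up the Parallel.ForEach worker or the cancellation path. Two consequences follow from the snippet as written: scmHandle must stay open until any abandoned call has returned, and info can still be written by the abandoned call after it has been added to results.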
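
One way to avoid the late write into info is to have the timed call return its result instead of mutating a shared object, and to publish that result only when the call finished in time. The sketch below is a minimal stand-alone version of that idea. BoundedQuery.TryRun is a hypothetical helper, not an existing Servy API, and it assumes the native lookup can be expressed as a function that returns its details, which PopulateNativeDetails as quoted above does not do.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedQuery
{
    // Hypothetical helper: runs 'query' on the thread pool and waits at most timeoutMs.
    // Returns true with the query's result if it finished in time, false with default(T) on timeout.
    // Throws OperationCanceledException if 'token' is cancelled while waiting, and
    // AggregateException if 'query' itself throws before the timeout.
    public static bool TryRun<T>(Func<T> query, int timeoutMs, CancellationToken token, out T result)
    {
        Task<T> task = Task.Run(query, token);

        if (task.Wait(timeoutMs, token))
        {
            result = task.Result;
            return true;
        }

        // The call is abandoned, not aborted: it keeps its thread until it returns.
        // Touch a late exception so it is not left unobserved.
        task.ContinueWith(t => { var ignored = t.Exception; },
                          TaskContinuationOptions.OnlyOnFaulted);

        result = default(T);
        return false;
    }

    public static void Main()
    {
        string details;

        // Finishes well inside the timeout.
        bool fast = TryRun(() => "details", 1000, CancellationToken.None, out details);
        Console.WriteLine("fast query populated: {0}", fast);

        // Simulated hang: the caller gets control back after about 200 ms.
        bool slow = TryRun(() => { Thread.Sleep(3000); return "late"; },
                           200, CancellationToken.None, out details);
        Console.WriteLine("slow query populated: {0}", slow);
    }
}
```

Because the abandoned call only ever produces a return value that nobody reads, it cannot race with code that consumes results. It still occupies a thread-pool thread until the native call returns, so a timeout bounds the caller's wait, not the number of stuck threads.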

Severity rationale:
Marked Info rather than Warning because the failure mode requires a specifically misbehaving service in the SCM list, and most production environments will never trip it. Where it does trip, however, the symptom (Manager UI permanently unresponsive, cancel button "doing nothing") is severe and hard to diagnose without the source context above.

Labels: enhancement
