
Send k8s pod events in TaskExecutionEvent updates #3825

@andrewwdye

Description

Motivation: Why do you think this is important?

It can be difficult to understand delays in task start up. We recently added runtime metrics to the timeline view to better surface where time is spent, and PodCondition reasons are included in the task state tooltip to explain state transitions.

[Screenshots: timeline view with runtime metrics; task state tooltip showing PodCondition reasons]

The full text of this tooltip is of the form:

7/3/2023 6:41:18 PM UTC task submitted to K8s

7/3/2023 6:41:18 PM UTC Unschedulable:0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

7/3/2023 6:42:20 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [aqrp2plhk79cj5dwzg5z-n0-0]|

However, this doesn't indicate ongoing node allocation or image pull, two of the most common delays in "happy path" task start up. By comparison, kubectl get events has much richer information:

❯ kubectl get events -n flytesnacks-development --sort-by='{.metadata.creationTimestamp}' --field-selector involvedObject.name=aqrp2plhk79cj5dwzg5z-n0-0
LAST SEEN   TYPE      REASON                 OBJECT                          MESSAGE
6m57s       Warning   FailedScheduling       pod/aqrp2plhk79cj5dwzg5z-n0-0   0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
6m50s       Normal    TriggeredScaleUp       pod/aqrp2plhk79cj5dwzg5z-n0-0   pod triggered scale-up: [{eks-opta-oc-production-nodegroup1-d7fdbb758a882b40-dec46029-5e1e-5bf0-4999-238661b4dc51 0->1 (max: 5)}]
5m55s       Normal    Scheduled              pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully assigned flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0 to ip-10-0-148-239.us-east-2.compute.internal
5m53s       Normal    Pulling                pod/aqrp2plhk79cj5dwzg5z-n0-0   Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"
5m52s       Normal    TaintManagerEviction   pod/aqrp2plhk79cj5dwzg5z-n0-0   Cancelling deletion of Pod flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Pulled                 pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.9-latest" in 30.158785098s
5m23s       Normal    Created                pod/aqrp2plhk79cj5dwzg5z-n0-0   Created container aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Started                pod/aqrp2plhk79cj5dwzg5z-n0-0   Started container aqrp2plhk79cj5dwzg5z-n0-0

Goal: What should the final outcome look like, ideally?

The execution closure should include task-specific event details, including scheduling attempts, node allocations, and image pulls.

Describe alternatives you've considered

A more complete solution might overhaul event information in the execution closure so that reasons are not coupled to Flyte state transitions and could instead expose a stream of structured or unstructured event information. That is beyond the scope of this particular issue, but the proposal below does not preclude such an investment in the future.

Propose: Link/Inline OR Additional context

As a potential solution, update DemystifyPending to interleave k8s pod events alongside existing PodCondition reasons.

Note the reporting interface assumes a single event per state; however, a recent change made it possible to report multiple events using a phase version.

A relatively naive solution proposed by @hamersaw might be:

  • Have propeller keep a watch on k8s events. I assume the kube-client has this functionality. Store these in a local cache (with configurable size), keyed on the object or Flyte task they are associated with.
  • When sending a TaskExecutionEvent, look up the cached k8s events and, instead of a singular reason, return a list of reasons (probably update the name) containing all unreported k8s events (using some kind of lastSeen indicator: timestamp, resourceVersion, hash of message, etc.).
  • Merge the k8s events into the ExecutionClosure.
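The cache-and-dedup step above can be sketched in isolation. This is an illustrative, stdlib-only sketch: the client-go watch that would feed it is not shown, the type and method names are hypothetical, and it uses a hash of the message as the lastSeen indicator (a timestamp or resourceVersion would work equally well).

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// eventCache sketches the per-object cache described above: propeller would
// feed it from a watch on v1.Event objects (not shown), and the event
// recorder would drain unreported entries when emitting a TaskExecutionEvent.
type eventCache struct {
	mu       sync.Mutex
	maxSize  int                          // configurable cap per object
	events   map[string][]string          // keyed on namespace/name of the involved object
	reported map[string]map[[32]byte]bool // lastSeen indicator: hash of message
}

func newEventCache(maxSize int) *eventCache {
	return &eventCache{
		maxSize:  maxSize,
		events:   map[string][]string{},
		reported: map[string]map[[32]byte]bool{},
	}
}

// Add records an event message for an object, evicting the oldest when full.
func (c *eventCache) Add(objKey, message string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	evs := append(c.events[objKey], message)
	if len(evs) > c.maxSize {
		evs = evs[len(evs)-c.maxSize:]
	}
	c.events[objKey] = evs
}

// Unreported returns events not yet sent for this object and marks them seen,
// so each k8s event lands in at most one TaskExecutionEvent.
func (c *eventCache) Unreported(objKey string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	seen := c.reported[objKey]
	if seen == nil {
		seen = map[[32]byte]bool{}
		c.reported[objKey] = seen
	}
	var fresh []string
	for _, msg := range c.events[objKey] {
		h := sha256.Sum256([]byte(msg))
		if !seen[h] {
			seen[h] = true
			fresh = append(fresh, msg)
		}
	}
	return fresh
}

func main() {
	c := newEventCache(10)
	key := "flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0"
	c.Add(key, "TriggeredScaleUp")
	c.Add(key, "Scheduled")
	fmt.Println(c.Unreported(key)) // first call returns both events
	fmt.Println(c.Unreported(key)) // second call returns none; already reported
}
```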

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Metadata


    Labels

    enhancement (New feature or request), untriaged (This issue has not yet been looked at by the Maintainers)
