Motivation: Why do you think this is important?
It can be difficult to understand delays in task startup. We recently added runtime metrics to the timeline view to better surface where time is spent, and PodCondition reasons are included in the task state tooltip to explain state transitions.
The full text of this tooltip is of the form:
7/3/2023 6:41:18 PM UTC task submitted to K8s
7/3/2023 6:41:18 PM UTC Unschedulable:0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
7/3/2023 6:42:20 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [aqrp2plhk79cj5dwzg5z-n0-0]|
However, this doesn't indicate ongoing node allocation or image pulls, two of the most common delays in "happy path" task startup. By comparison, kubectl get events has much richer information.
❯ kubectl get events -n flytesnacks-development --sort-by='{.metadata.creationTimestamp}' --field-selector involvedObject.name=aqrp2plhk79cj5dwzg5z-n0-0
LAST SEEN TYPE REASON OBJECT MESSAGE
6m57s Warning FailedScheduling pod/aqrp2plhk79cj5dwzg5z-n0-0 0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
6m50s Normal TriggeredScaleUp pod/aqrp2plhk79cj5dwzg5z-n0-0 pod triggered scale-up: [{eks-opta-oc-production-nodegroup1-d7fdbb758a882b40-dec46029-5e1e-5bf0-4999-238661b4dc51 0->1 (max: 5)}]
5m55s Normal Scheduled pod/aqrp2plhk79cj5dwzg5z-n0-0 Successfully assigned flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0 to ip-10-0-148-239.us-east-2.compute.internal
5m53s Normal Pulling pod/aqrp2plhk79cj5dwzg5z-n0-0 Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"
5m52s Normal TaintManagerEviction pod/aqrp2plhk79cj5dwzg5z-n0-0 Cancelling deletion of Pod flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0
5m23s Normal Pulled pod/aqrp2plhk79cj5dwzg5z-n0-0 Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.9-latest" in 30.158785098s
5m23s Normal Created pod/aqrp2plhk79cj5dwzg5z-n0-0 Created container aqrp2plhk79cj5dwzg5z-n0-0
5m23s Normal Started pod/aqrp2plhk79cj5dwzg5z-n0-0 Started container aqrp2plhk79cj5dwzg5z-n0-0
Goal: What should the final outcome look like, ideally?
The execution closure should include task-specific event details, including scheduling attempts, node allocations, and image pulls.
Describe alternatives you've considered
A more complete solution may overhaul event information in the execution closure so that reasons are not coupled to Flyte state transitions and could instead surface a sink of structured or unstructured event information. This is beyond the scope of this particular issue, but the proposal below does not preclude such an investment in the future.
Propose: Link/Inline OR Additional context
As a potential solution, update DemystifyPending to interleave k8s pod events alongside existing PodCondition reasons.
Note the reporting interface assumes a single-event per state; however, a recent change made it possible to report multiple events using a phase version.
A relatively naive solution, proposed by @hamersaw, might be:
- Have propeller keep a watch on k8s events. I assume the kube-client has this functionality. Store these in a local cache (with a configurable size), keyed on the object or Flyte task they are associated with.
- When sending a TaskExecutionEvent, look up the cached k8s events and, instead of a singular reason, return a list of reasons (probably updating the name) containing all unreported k8s events (using some kind of lastSeen indicator: timestamp, resourceVersion, hash of message, etc.).
- Merge the k8s events into the ExecutionClosure.
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?