
Send k8s pod events in TaskExecutionEvent updates #3825

@andrewwdye

Description

Motivation: Why do you think this is important?

It can be difficult to understand delays in task start up. We recently added runtime metrics to the timeline view to better surface where time is spent, and PodCondition reasons are included in the task state tooltip to explain state transitions.

[Screenshots: timeline view with runtime metrics; task state tooltip showing PodCondition reasons]

The full text of this tooltip is of the form:

7/3/2023 6:41:18 PM UTC task submitted to K8s

7/3/2023 6:41:18 PM UTC Unschedulable:0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

7/3/2023 6:42:20 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [aqrp2plhk79cj5dwzg5z-n0-0]|

However, this doesn't indicate ongoing node allocation or image pull, two of the most common delays in "happy path" task start up. By comparison, kubectl get events has much richer information:

❯ kubectl get events -n flytesnacks-development --sort-by='{.metadata.creationTimestamp}' --field-selector involvedObject.name=aqrp2plhk79cj5dwzg5z-n0-0
LAST SEEN   TYPE      REASON                 OBJECT                          MESSAGE
6m57s       Warning   FailedScheduling       pod/aqrp2plhk79cj5dwzg5z-n0-0   0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
6m50s       Normal    TriggeredScaleUp       pod/aqrp2plhk79cj5dwzg5z-n0-0   pod triggered scale-up: [{eks-opta-oc-production-nodegroup1-d7fdbb758a882b40-dec46029-5e1e-5bf0-4999-238661b4dc51 0->1 (max: 5)}]
5m55s       Normal    Scheduled              pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully assigned flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0 to ip-10-0-148-239.us-east-2.compute.internal
5m53s       Normal    Pulling                pod/aqrp2plhk79cj5dwzg5z-n0-0   Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"
5m52s       Normal    TaintManagerEviction   pod/aqrp2plhk79cj5dwzg5z-n0-0   Cancelling deletion of Pod flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Pulled                 pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.9-latest" in 30.158785098s
5m23s       Normal    Created                pod/aqrp2plhk79cj5dwzg5z-n0-0   Created container aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Started                pod/aqrp2plhk79cj5dwzg5z-n0-0   Started container aqrp2plhk79cj5dwzg5z-n0-0

Goal: What should the final outcome look like, ideally?

The execution closure should include task-specific event details, including scheduling attempts, node allocations, and image pulls.

Describe alternatives you've considered

A more complete solution might overhaul event information in the execution closure so that reasons are not coupled to Flyte state transitions and could instead expose a stream of structured or unstructured event information. That is beyond the scope of this particular issue, but the proposal below does not preclude such an investment in the future.

Propose: Link/Inline OR Additional context

As a potential solution, update DemystifyPending to interleave k8s pod events alongside existing PodCondition reasons.

Note the reporting interface assumes a single event per state; however, a recent change made it possible to report multiple events using a phase version.

A relatively naive solution proposed by @hamersaw might be:

  • Have propeller keep a watch on k8s events. I assume the kube-client has this functionality. Store these in a local cache (with configurable size), keyed on the object or Flyte task they are associated with.
  • When sending a TaskExecutionEvent, look up the cached k8s events and, instead of a singular reason, return a list of reasons (probably update the name) containing all unreported k8s events (using some kind of lastSeen indicator: timestamp, resourceVersion, hash of message, etc.).
  • Merge the k8s events into the ExecutionClosure.
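The cache-and-dedup step above can be sketched in isolation. This is an illustrative, stdlib-only sketch: the client-go watch that would feed it is not shown, the type and method names are hypothetical, and it uses a hash of the message as the lastSeen indicator (a timestamp or resourceVersion would work equally well).

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// eventCache sketches the per-object cache described above: propeller would
// feed it from a watch on v1.Event objects (not shown), and the event
// recorder would drain unreported entries when emitting a TaskExecutionEvent.
type eventCache struct {
	mu       sync.Mutex
	maxSize  int                          // configurable cap per object
	events   map[string][]string          // keyed on namespace/name of the involved object
	reported map[string]map[[32]byte]bool // lastSeen indicator: hash of message
}

func newEventCache(maxSize int) *eventCache {
	return &eventCache{
		maxSize:  maxSize,
		events:   map[string][]string{},
		reported: map[string]map[[32]byte]bool{},
	}
}

// Add records an event message for an object, evicting the oldest when full.
func (c *eventCache) Add(objKey, message string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	evs := append(c.events[objKey], message)
	if len(evs) > c.maxSize {
		evs = evs[len(evs)-c.maxSize:]
	}
	c.events[objKey] = evs
}

// Unreported returns events not yet sent for this object and marks them seen,
// so each k8s event lands in at most one TaskExecutionEvent.
func (c *eventCache) Unreported(objKey string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	seen := c.reported[objKey]
	if seen == nil {
		seen = map[[32]byte]bool{}
		c.reported[objKey] = seen
	}
	var fresh []string
	for _, msg := range c.events[objKey] {
		h := sha256.Sum256([]byte(msg))
		if !seen[h] {
			seen[h] = true
			fresh = append(fresh, msg)
		}
	}
	return fresh
}

func main() {
	c := newEventCache(10)
	key := "flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0"
	c.Add(key, "TriggeredScaleUp")
	c.Add(key, "Scheduled")
	fmt.Println(c.Unreported(key)) // first call returns both events
	fmt.Println(c.Unreported(key)) // second call returns none; already reported
}
```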

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Metadata


    Labels

    enhancement (New feature or request), untriaged (This issue has not yet been looked at by the Maintainers)
