
add result field and propagate and store errors in tasks#1154

Merged
nonsense merged 23 commits into add-stdout-to-logs from tasks-status
Oct 16, 2020

Conversation

@nonsense
Member

@nonsense nonsense commented Sep 24, 2020

This PR is:

  1. Adding a Result field to run outputs and starting to use it as part of Task. Result is an interface whose concrete type depends on the runner that produces it; it should contain all information relevant to a given test plan run executed on a given runner (i.e. not just status and exit codes for test instances, but also enough debug information for developers to understand what happened in their run).
  2. Adding an Error field as part of Task - this is an error returned by Testground. You can think of it as higher-level than Result; it could be "global timeout", "not enough resources on the cluster", "build failed", etc.
  3. cluster:k8s - Subscribing to all WARN events and pod statuses relevant to a given run while it executes, collecting these events, and storing them inside Result for developers' reference.
  4. Fixing bugs with respect to error propagation.
  5. Fixing double-JSON marshalling for logs.
  6. Fixing long line log parsing due to bufio.ReadBytes.
  7. Adding functionality to post updates to Slack and GitHub.
  8. Slightly refactoring the Task model and separating concerns (DB operations vs. business logic, etc.)
  9. Adding StateCanceled to task states
  10. Extracting timeouts - currently we have run and task timeouts on cluster:k8s, with the idea to unify them into one timeout in the future.
    ...

TODO:

  • solve logs are not always flushed to output file #1153
  • make sure outcome:ok is taken into account when deciding whether status is ok/fail (i.e. fully support the strict case for now); a leniency mode should be added as a parameter to manifest.toml
  • make global timeout configurable
  • introduce some schema for run journal (events from k8s + pod statuses)
  • dedup journal entries
  • fix tests
  • godoc
  • refactor post to github

FOLLOW UP:

  • consider removing run_timeout in cluster:k8s runner, and relying only on task_timeout
  • gracefully handle testground daemon shutdown without leaving tasks in store (we wouldn't need delete task func)
  • fix journaled Kubernetes statuses in Result
  • support multiple github repos for status check updates
  • some kind of ACL maybe - should userA be able to kill userB's task?
  • admin control panel (for tasks such as delete)
  • fix the took value on the dashboard at https://ci.testground.ipfs.team/ to include only processing time, not scheduled time
  • remove plan and case, since we already have a Composition in Task

Fixes: #1147
Fixes: #1153
Fixes: #1163

@nonsense nonsense force-pushed the tasks-status branch 2 times, most recently from 5a2eea9 to 6d56ed7 on September 24, 2020 17:25
@nonsense nonsense requested review from hacdias and raulk September 29, 2020 17:22
@nonsense nonsense marked this pull request as ready for review September 29, 2020 17:23
@nonsense
Member Author

@hacdias @raulk opening this for review, even though there are still a few TODOs, as it is getting quite large and already addresses a few bugs that would be nice to merge to master.


file, err := os.Open(path)
stop := make(chan struct{})
file, err := newTailReader(path, stop)
Member Author

@nonsense nonsense Sep 29, 2020


@hacdias @raulk if you have better ideas on how to mimic tail -f behaviour, I'd be happy to refactor this.

The idea here is that we continuously read the file (which includes logs for a given task), and we sleep a bit if we encounter io.EOF.

engine.Logs is blocking, and e.signals[id] governs when we return from this function - this is used for the --wait flag, as well for the testground logs --follow command.


bufio.ReadBytes was not a good implementation, as it doesn't handle long lines, so I am switching to json.NewDecoder.

Contributor


I think we should get rid of the tailing entirely. The way I see it:

  1. we have run events in the sync service to update the user view.
  2. we should consume those run events live, and update the UI accordingly. IMO that's what the new "tail" should be.
  3. results aren't really actionable/useful until the run has finished, so those should be downloaded as outputs.


logs, err := c.getPodLogs(ow, podName)
if err != nil {
_, err = ow.WriteProgress([]byte(logs))
Member Author


For now we forward stdout logs from test plan instances to the logs output for the given task. If test plans are well-implemented, in the future we might not need to do that.

Contributor


With proper run events in the sync service now, we should stop doing this. The fact that the runner behaves differently with different pod counts is confusing. We should also get rid of the tailing.

Member Author


I both agree and disagree with you:
a) agree, because this is really hacky, confusing, and not consistent across runners.
b) disagree about removing it now, because we would lose stdout and it wouldn't be available to users, which would make it much harder to debug test plans that incorrectly use stdout for various debug messages and errors.

I'd rather have a better UX now and a slightly more complicated codebase, than a well-architected runner and a bad UX.

@nonsense nonsense force-pushed the tasks-status branch 5 times, most recently from 7b1aa83 to dc809fe on September 30, 2020 15:33
Member

@hacdias hacdias left a comment


After going over the 21 changed files, it looks good. I didn't spot any glaring flaw, though I didn't try running the code.

@nonsense nonsense marked this pull request as ready for review October 14, 2020 17:37
@nonsense
Member Author

nonsense commented Oct 14, 2020

@raulk @hacdias this is ready for review again (having merged another set of changes from #1156 in order not to have 2 huge PRs).

@hacdias you have reviewed most of this already, the changes until your previous review are:

git diff dfc34b642afa232a4acbbde7c02df7eb7b18d4a3..HEAD

@hacdias
Member

hacdias commented Oct 15, 2020

Quickly went through it and it looks good! Just to confirm, what's the difference between run_timeout_limit and task_timeout_limit? Is the first only for k8s runs, while the second applies to everything? Why do we need a specific one for k8s?

@nonsense
Member Author

nonsense commented Oct 15, 2020

@hacdias the timeout on the cluster:k8s side has been there since the initial implementation, and is measured from the moment the test plan run starts processing. It is present only on cluster:k8s; I think we don't have such a timeout on the other runners at all. The task timeout covers both the run AND build processes.

Ultimately we could have a single timeout that unifies the build and run processes (i.e. the task timeout) instead of two, but I didn't want to introduce even more changes in this PR; this has been added as a follow-up.

Contributor

@raulk raulk left a comment


This is looking really good! 😍 Honestly, most of my comments are just nits. I plan to review the whole thing more holistically once it's on master, before it's released.

The two sticking points for me are:

  • the very direct dependency on the k8s runner output/result. This might be fine if we consider that TaaS will only ever run with a k8s runner, but I don't think that should be the case. I would like to be able to set up Testground quickly on a remote machine with Docker, use it to schedule jobs with local:docker, and have the dashboard work with that setup. We don't need to resolve this right here, right now, but we should before we release TaaS as beta/GA.
  • the log tailing. Now that we've landed run events via the sync service, I think we should not be tailing container/pod outputs at all. (1) Live updates will take the form of push run events, and (2) outputs should be downloaded from the runner.

Comment on lines +59 to +63
Tasks(filters TasksFilters) ([]task.Task, error)
GetTask(id string) (*task.Task, error)
Logs(ctx context.Context, taskId string, follow bool, cancel bool, w io.Writer) (*task.Task, error)
Kill(taskId string) error
DeleteTask(taskId string) error
Contributor


nit: maybe extract these tasks-related methods to a TaskManager interface, and embed that here?

Member Author

@nonsense nonsense Oct 16, 2020


Adding the interface now, but will do a refactor of the Engine later: it is not straightforward to just remove the methods; the signals need to be abstracted, and methods like worker use them, so the interface might grow.

Comment on lines +87 to +94
// Result of the run
// Depending on runner, might include:
// - Status of run (green, red, yellow :: success, fail, partial success)
// - Event log containing various information related to the run, for example:
// -- Kubernetes events
// -- Kubernetes pod Status
// -- etc.
Result interface{}
Contributor


nit: may want to normalise the value into an interface going forward. I think it's worth marking this field as // EXPERIMENTAL so that users expect changes.

return
}

result := decodeResultK8s(tsk.Result)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, this is one of those bits you suspected to be overly specific. I agree, this is too runner-specific. We need to figure out the right abstraction here.

Member Author


Yes, but it will be difficult to get the right abstraction now, without having used a few runners for a while with all the necessary outputs stored in Result. So I'd rather make it specific until we nail the correct abstraction, rather than work with a bad one.


I guess once we introduce more runners in TaaS, or once we want to use these dashboards locally, we will have to solve that problem.

@nonsense
Member Author

I think we should get rid of the tailing entirely. The way I see it:

  1. we have run events in the sync service to update the user view.
  2. we should consume those run events live, and update the UI accordingly. IMO that's what the new "tail" should be.
  3. results aren't really actionable/useful until the run has finished, so those should be downloaded as outputs.

This is used when a user schedules a task and then wants to follow its progress:
a) from the CLI
b) from the web interface (via the /logs handler)

Run events (from test plan instances) are not the only relevant information at that point: as a user, you also care about the logs from the actual runner, and these are stored in a log file. So we seem to need to tail the log file, unless we store these aggregated runner + test plan instance outputs in another store rather than a file.

@nonsense nonsense force-pushed the tasks-status branch 2 times, most recently from 6d39d31 to d3e9603 on October 16, 2020 16:18
@nonsense nonsense merged commit 289c245 into add-stdout-to-logs Oct 16, 2020
@nonsense nonsense deleted the tasks-status branch October 16, 2020 18:31
@nonsense
Member Author

@raulk I've addressed some of your comments, but not all. While I agree with pretty much all of them, they require larger refactors that I'd rather not do in an already big PR. I will go over the comments and the TODOs in the PR and create issues for each of them, so that we can address them in easier-to-digest PRs in the coming week.


Thanks for the multiple reviews @hacdias @raulk

nonsense added a commit that referenced this pull request Oct 16, 2020
Co-authored-by: Henrique Dias <hacdias@gmail.com>
Co-authored-by: Raúl Kripalani <raul@protocol.ai>