add result field and propagate and store errors in tasks #1154
nonsense merged 23 commits into add-stdout-to-logs from
Conversation
- file, err := os.Open(path)
+ stop := make(chan struct{})
+ file, err := newTailReader(path, stop)
@hacdias @raulk if you have better ideas on how to mimic tail -f behaviour, I'd be happy to refactor this.
The idea here is that we continuously read the file (which includes the logs for a given task), and sleep a bit whenever we encounter io.EOF.
engine.Logs is blocking, and e.signals[id] governs when we return from this function - this is used for the --wait flag, as well as for the testground logs --follow command.
bufio.ReadBytes was not a good implementation, as it doesn't handle long lines, so I am switching to json.NewDecoder.
I think we should get rid of the tailing entirely. The way I see it:
- we have run events in the sync service to update the user view.
- we should consume those run events live, and update the UI accordingly. IMO that's what the new "tail" should be.
- results aren't really actionable/useful until the run has finished, so those should be downloaded as outputs.
logs, err := c.getPodLogs(ow, podName)
if err != nil {
_, err = ow.WriteProgress([]byte(logs))
For now we forward stdout logs from test plan instances to the logs output for a given task. If test plans are well-implemented, we might not need to do that in the future.
With proper run events in the sync service now, we should stop doing this. The fact that the runner behaves differently with different pod counts is confusing. We should also get rid of the tailing.
I both agree and disagree with you:
a) agree, because this is really hacky, confusing, and not consistent across runners.
b) disagree about removing it now, because we would lose stdout and it wouldn't be available to users, which would make it much harder to debug test plans that are incorrectly using stdout for various debug messages and errors.
I'd rather have a better UX at the moment and a slightly more complicated codebase, than a well-architected runner and bad UX.
hacdias left a comment
After going over the 21 changed files, it looks good. I didn't spot any glaring flaws! I didn't try running the code, though.
Quickly went through it and it looks good! Just to confirm, what's the difference between
@hacdias the timeout on the Ultimately we can have only one timeout that unifies the build and run processes (i.e. the task timeout) and not have two timeouts, but I didn't want to introduce even more changes in this PR. This has been added as a follow-up.
This is looking really good! 😍 Honestly, most of my comments are just nits. I plan to review the whole thing more holistically once it's on master, before it's released.
The two sticking points for me are:
- the very direct dependency on the k8s runner output/result. This might be fine if we consider that TaaS will only ever run with a k8s runner, but I don't think that should be the case. I would like to be able to set up Testground quickly on a remote machine with Docker, use it to schedule jobs with local:docker, and have the dashboard work with that setup. We don't need to resolve this right here, right now, but we should before we release TaaS as beta/GA.
- the log tailing. Now that we've landed run events via the sync service, I think we should not be tailing container/pod outputs at all. (1) Live updates will take the form of push run events, and (2) outputs should be downloaded from the runner.
pkg/api/engine.go
Tasks(filters TasksFilters) ([]task.Task, error)
GetTask(id string) (*task.Task, error)
Logs(ctx context.Context, taskId string, follow bool, cancel bool, w io.Writer) (*task.Task, error)
Kill(taskId string) error
DeleteTask(taskId string) error
nit: maybe extract these task-related methods to a TaskManager interface, and embed that here?
Adding the interface now, but I will refactor the Engine later - it is not straightforward to just remove the methods; the signals need to be abstracted, and methods like worker use them, so the interface might grow.
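The suggested extraction could look roughly like the sketch below. Task and TasksFilters are stubs standing in for the real pkg/task and pkg/api types, and the Engine embedding is an assumption about the final shape:

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// Stub types standing in for the real pkg/task and pkg/api definitions.
type Task struct{ ID string }
type TasksFilters struct{}

// TaskManager groups the task-related methods currently on the Engine interface.
type TaskManager interface {
	Tasks(filters TasksFilters) ([]Task, error)
	GetTask(id string) (*Task, error)
	Logs(ctx context.Context, taskID string, follow bool, cancel bool, w io.Writer) (*Task, error)
	Kill(taskID string) error
	DeleteTask(taskID string) error
}

// Engine embeds TaskManager alongside its other responsibilities.
type Engine interface {
	TaskManager
	// build/run methods would live here
}

func main() {
	var _ TaskManager // any Engine implementation also satisfies TaskManager
	fmt.Println("ok")
}
```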
// Result of the run
// Depending on runner, might include:
// - Status of run (green, red, yellow :: success, fail, partial success)
// - Event log containing various information related to the run, for example:
// -- Kubernetes events
// -- Kubernetes pod Status
// -- etc.
Result interface{}
nit: may want to normalise the return value into an interface going forward. I think it's worth marking this method as // EXPERIMENTAL so that users expect changes.
return
}

result := decodeResultK8s(tsk.Result)
Alright, this is one of those bits you suspected to be overly specific. I agree, this is too runner-specific. We need to figure out the right abstraction here.
Yes, but it will be difficult to get the right abstraction now, without having used a few runners for a while with all the necessary outputs stored in Result. So I'd rather make it specific until we nail the correct abstraction, than work with a bad abstraction.
I guess once we introduce more runners in TaaS, or once we want to use these dashboards locally, we will have to solve that problem.
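To make the runner-specificity concrete, here is a sketch of what a k8s-specific decode could look like. The decodeResultK8s name appears in the diff; the KubernetesResult shape, its fields, and the JSON round-trip are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KubernetesResult is a hypothetical typed view of the runner-specific
// Result payload stored on a Task.
type KubernetesResult struct {
	Outcome string   `json:"outcome"`
	Journal []string `json:"journal"` // e.g. Kubernetes events and pod statuses
}

// decodeResultK8s round-trips the untyped Result through JSON to recover
// the k8s-specific shape. A generic abstraction would replace this per-runner
// decoding once the right interface is nailed down.
func decodeResultK8s(result interface{}) (*KubernetesResult, error) {
	b, err := json.Marshal(result)
	if err != nil {
		return nil, err
	}
	var kr KubernetesResult
	if err := json.Unmarshal(b, &kr); err != nil {
		return nil, err
	}
	return &kr, nil
}

func main() {
	raw := map[string]interface{}{
		"outcome": "success",
		"journal": []string{"pod testplan-0: Running"},
	}
	kr, err := decodeResultK8s(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(kr.Outcome)
}
```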
Co-authored-by: Raúl Kripalani <raul@protocol.ai>
This is used when a user schedules a task and then wants to follow its progress. Run events (from test plan instances) are not the only relevant information at that point: as a user, you also care about the logs from the actual runner. These are stored in a log file, so we seem to need to tail that file, unless we store the aggregate runner + test plan instance outputs in another store instead of a file.
@raulk I've addressed some of your comments, but not all. While I agree with pretty much all of them, they require larger refactors that I'd rather not do in an already big PR. I will go over the comments and TODOs in the PR and create issues for each of them, so that we can address them in easier-to-digest PRs in the coming week.
Co-authored-by: Henrique Dias <hacdias@gmail.com>
Co-authored-by: Raúl Kripalani <raul@protocol.ai>
This PR is:
- `Result` field in run outputs, and starting to use it as part of `Task` - Result is an interface and depends on the runner that produces it; it should contain all information that is relevant to a given testplan run executed on a given runner (i.e. not just status and exit codes for test instances, but also enough debug information for developers to understand what happened in their run).
- `Error` field as part of `Task` - this is an error returned by Testground. You can think of it as a higher level than `Result` - could be "global timeout", "not enough resources on the cluster", "build failed", etc.
- `cluster:k8s` - Subscribing to all WARN events and pod statuses relevant to a given run, while executing it. Also collecting these events and storing them inside `Result` for reference for developers.
- `bufio.ReadBytes`.
- `Task` model and separation of concerns (DB operations vs. business logic, etc.)
- `StateCanceled` to task states
- `run` and `task` timeouts on `cluster:k8s`, with the idea to unify them into one timeout in the future.
- ...
TODO:
- `outcome:ok` is taken into account when deciding if status is ok/fail (i.e. support strict case fully for now); leniency mode should be added as a parameter to `manifest.toml`

FOLLOW UP:
- `run_timeout` in `cluster:k8s` runner, and relying only on `task_timeout`
- `delete` task func)
- `delete`)
- `took` on dashboard at https://ci.testground.ipfs.team/ to not include `scheduled` time, but only `processing` time.
- `plan`, `case`, since we already have a `Composition` in `Task`

Fixes: #1147
Fixes: #1153
Fixes: #1163