
add result field and propagate and store errors in tasks#1154

Merged
nonsense merged 23 commits into add-stdout-to-logs from tasks-status
Oct 16, 2020

Conversation

@nonsense
Member

@nonsense nonsense commented Sep 24, 2020

This PR is:

  1. Adding a Result field to run outputs and starting to use it as part of Task. Result is an interface whose concrete type depends on the runner that produces it; it should contain all information relevant to a given test plan run executed on a given runner (i.e. not just status and exit codes for test instances, but also enough debug information for developers to understand what happened in their run).
  2. Adding an Error field as part of Task - this is an error returned by Testground. You can think of it as higher-level than Result; it could be "global timeout", "not enough resources on the cluster", "build failed", etc.
  3. cluster:k8s - Subscribing to all WARN events and pod statuses relevant to a given run while it executes, collecting these events, and storing them inside Result for developers' reference.
  4. Fixing bugs with respect to error propagation.
  5. Fixing double-JSON marshalling for logs.
  6. Fixing long line log parsing due to bufio.ReadBytes.
  7. Adding functionality to post updates to Slack and GitHub.
  8. Slightly refactoring the Task model and separating concerns (DB operations vs. business logic, etc.)
  9. Adding StateCanceled to task states
  10. Extracting timeouts - currently we have run and task timeouts on cluster:k8s, with the idea to unify them into one timeout in the future.
    ...

TODO:

  • solve logs are not always flushed to output file #1153
  • make sure outcome:ok is taken into account when deciding whether status is ok/fail (i.e. fully support the strict case for now); a leniency mode should be added as a parameter to manifest.toml
  • make global timeout configurable
  • introduce some schema for run journal (events from k8s + pod statuses)
  • dedup journal entries
  • fix tests
  • godoc
  • refactor post to github

FOLLOW UP:

  • consider removing run_timeout in cluster:k8s runner, and relying only on task_timeout
  • gracefully handle testground daemon shutdown without leaving tasks in store (we wouldn't need delete task func)
  • fix journaled Kubernetes statuses in Result
  • support multiple github repos for status check updates
  • some kind of ACL maybe - should userA be able to kill userB's task?
  • admin control panel (for tasks such as delete)
  • fix the took value on the dashboard at https://ci.testground.ipfs.team/ to include only processing time, not scheduled time
  • remove plan and case, since we already have a Composition in Task

Fixes: #1147
Fixes: #1153
Fixes: #1163

@nonsense nonsense force-pushed the tasks-status branch 2 times, most recently from 5a2eea9 to 6d56ed7 on September 24, 2020 17:25
@nonsense nonsense requested review from hacdias and raulk September 29, 2020 17:22
@nonsense nonsense marked this pull request as ready for review September 29, 2020 17:23
@nonsense
Member Author

@hacdias @raulk opening this for review, even though there are still a few TODOs, as it is getting quite large and already addresses a few bugs that would be nice to merge to master.


file, err := os.Open(path)
stop := make(chan struct{})
file, err := newTailReader(path, stop)
Member Author

@nonsense nonsense Sep 29, 2020


@hacdias @raulk if you have better ideas on how to mimic tail -f behaviour, I'd be happy to refactor this.

The idea here is that we continuously read the file (which includes logs for a given task), and we sleep a bit if we encounter io.EOF.

engine.Logs is blocking, and e.signals[id] governs when we return from this function - this is used for the --wait flag, as well for the testground logs --follow command.


bufio.ReadBytes was not a good implementation, as it doesn't handle long lines, so I am switching to json.NewDecoder.

Contributor


I think we should get rid of the tailing entirely. The way I see it:

  1. we have run events in the sync service to update the user view.
  2. we should consume those run events live, and update the UI accordingly. IMO that's what the new "tail" should be.
  3. results aren't really actionable/useful until the run has finished, so those should be downloaded as outputs.


logs, err := c.getPodLogs(ow, podName)
if err != nil {
_, err = ow.WriteProgress([]byte(logs))
Member Author


For now we forward stdout logs from test plan instances to the logs output for the given task. If test plans are well-implemented, in the future we might not need to do that.

Contributor


With proper run events in the sync service now, we should stop doing this. The fact that the runner behaves differently with different pod counts is confusing. We should also get rid of the tailing.

Member Author


I both agree and disagree with you:
a) agree, because this is really hacky, confusing, and not consistent across runners.
b) disagree about removing it now, because we would lose stdout and it wouldn't be available to users, which would make it much harder to debug test plans that incorrectly use stdout for various debug messages and errors.

I'd rather have a better UX now and a slightly more complicated codebase, than a well-architected runner and a bad UX.

@nonsense nonsense force-pushed the tasks-status branch 5 times, most recently from 7b1aa83 to dc809fe on September 30, 2020 15:33
Member

@hacdias hacdias left a comment


After going over the 21 changed files, it looks good. I didn't spot any glaring flaw, though I didn't try running the code.

@nonsense nonsense marked this pull request as ready for review October 14, 2020 17:37
@nonsense
Member Author

nonsense commented Oct 14, 2020

@raulk @hacdias this is ready for review again (having merged another set of changes from #1156 in order not to have 2 huge PRs).

@hacdias you have reviewed most of this already, the changes until your previous review are:

git diff dfc34b642afa232a4acbbde7c02df7eb7b18d4a3..HEAD

@hacdias
Member

hacdias commented Oct 15, 2020

Quickly went through it and it looks good! Just to confirm, what's the difference between run_timeout_limit and task_timeout_limit? Is the first only for k8s runs, while the second applies to everything? Why do we need a specific one for k8s?

@nonsense
Member Author

nonsense commented Oct 15, 2020

@hacdias the timeout on the cluster:k8s side has been there since the initial implementation, and is measured from the moment the test plan run starts processing. It is present only on cluster:k8s; I think we don't have such a timeout on the other runners at all. The task timeout covers both the run AND build processes.

Ultimately we could have a single timeout that unifies the build and run processes (i.e. the task timeout) instead of two, but I didn't want to introduce even more changes in this PR; this has been added as a follow-up.

Contributor

@raulk raulk left a comment


This is looking really good! 😍 Honestly, most of my comments are just nits. I plan to review the whole thing more holistically once it's on master, before it's released.

The two sticking points for me are:

  • the very direct dependency on the k8s runner output/result. This might be fine if we consider that TaaS will only ever run with a k8s runner, but I don't think that should be the case. I would like to be able to set up Testground quickly on a remote machine with Docker, use it to schedule jobs with local:docker, and have the dashboard work with that setup. We don't need to resolve this right here, right now, but we should before we release TaaS as beta/GA.
  • the log tailing. Now that we've landed run events via the sync service, I think we should not be tailing container/pod outputs at all. (1) Live updates will take the form of push run events, and (2) outputs should be downloaded from the runner.

Comment on lines +59 to +63
Tasks(filters TasksFilters) ([]task.Task, error)
GetTask(id string) (*task.Task, error)
Logs(ctx context.Context, taskId string, follow bool, cancel bool, w io.Writer) (*task.Task, error)
Kill(taskId string) error
DeleteTask(taskId string) error
Contributor


nit: maybe extract these tasks-related methods to a TaskManager interface, and embed that here?

Member Author

@nonsense nonsense Oct 16, 2020


Adding the interface now, but will do a refactor of the Engine later: it is not straightforward to just remove the methods; the signals need to be abstracted, and methods like worker use them, so the interface might grow.

Comment on lines +87 to +94
// Result of the run
// Depending on runner, might include:
// - Status of run (green, red, yellow :: success, fail, partial success)
// - Event log containing various information related to the run, for example:
// -- Kubernetes events
// -- Kubernetes pod Status
// -- etc.
Result interface{}
Contributor


nit: may want to normalise the value into an interface going forward. I think it's worth marking this field as // EXPERIMENTAL so that users expect changes.

return
}

result := decodeResultK8s(tsk.Result)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, this is one of those bits you suspected to be overly specific. I agree, this is too runner-specific. We need to figure out the right abstraction here.

Member Author


Yes, but it will be difficult to get the right abstraction now, without having used a few runners for a while with all the necessary outputs stored in Result. So I'd rather make it specific until we nail the correct abstraction, rather than work with a bad one.


I guess once we introduce more runners in TaaS, or once we want to use these dashboards locally, we will have to solve that problem.

@nonsense
Member Author

I think we should get rid of the tailing entirely. The way I see it:

  1. we have run events in the sync service to update the user view.
  2. we should consume those run events live, and update the UI accordingly. IMO that's what the new "tail" should be.
  3. results aren't really actionable/useful until the run has finished, so those should be downloaded as outputs.

This is used when a user schedules a task and then wants to follow its progress:
a) from the CLI
b) from the web interface (via the /logs handler)

Run events (from test plan instances) are not the only relevant information at that point: as a user, you also care about the logs from the actual runner, and these are stored in a log file. So we seem to need to tail the log file, unless we store these aggregated runner + test plan instance outputs in another store rather than a file.

@nonsense nonsense force-pushed the tasks-status branch 2 times, most recently from 6d39d31 to d3e9603 on October 16, 2020 16:18
@nonsense nonsense merged commit 289c245 into add-stdout-to-logs Oct 16, 2020
@nonsense nonsense deleted the tasks-status branch October 16, 2020 18:31
@nonsense
Member Author

@raulk I've addressed some of your comments, but not all. While I agree with pretty much all of them, they require larger refactors that I'd rather not do in an already big PR. I will go over the comments and the TODOs in the PR and create issues for each of them, so that we can address them in easier-to-digest PRs in the coming week.


Thanks for the multiple reviews @hacdias @raulk

nonsense added a commit that referenced this pull request Oct 16, 2020
Co-authored-by: Henrique Dias <hacdias@gmail.com>
Co-authored-by: Raúl Kripalani <raul@protocol.ai>