Add ON-CALL playbook #3466

Merged
istio-merge-robot merged 7 commits into master from oncall on Feb 15, 2018

@andraxylia
Contributor

andraxylia commented Feb 14, 2018

@ldemailly
Member

try making branches only after pulling
the circle lint failure was because your branch was out of date
#3471

@ldemailly
Member

/lgtm

ON-CALL.md Outdated
Every week, one Istio developer is on-call and is responsible for maintaining Istio build and help out users and other developers. The on-call person should prioritize on-call duties on top of their daily work.

## Schedule
The on-call schedule can be found [here][https://docs.google.com/spreadsheets/d/1FaHwPpad3F3hva2suJweNeTnocjKtLnbgLkyMRPzgUY/edit#gid=1475801904].
Contributor

Incorrect md syntax.
Use () after [] for hyperlinks.

You can preview the text using the "view" button above too.

Contributor Author

The [] works for me in my golang IDE viewer, but I will change it.

ON-CALL.md Outdated
* Help with creating the release when needed.
* Check the schedule sheet to make sure the next on call is defined.
* On Friday, notify the next on-call.
* On Tuesday morning, handoff to next oncall and send a summary to istio-dev containing these stats, that can be obtained by querying [github with dates ranges][https://help.github.com/articles/searching-issues-and-pull-requests/]
Contributor

Please send hand off summary to istio-oncall@googlegroups.com

* Join google groups [istio-oncall][https://groups.google.com/forum/#!forum/istio-oncall]
* Join the `oncall` Slack channel

## Responsibilities
Contributor

Please add post submit monitoring and fix

@ldemailly
Member

ldemailly commented Feb 14, 2018

I would add in the summary that it replaces the other doc and in the other doc a pointer to this
EDIT: I did that

also make sure nothing is lost

@ldemailly
Member

/hold

@istio-testing added the do-not-merge/hold label (Block automatic merging of a PR.) Feb 14, 2018
@ldemailly
Member

(my lgtm was mostly to re-kick the build btw - also I (wrongly) assumed the preview had been looked at and we'd iterate)

@istio-merge-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @andraxylia @ldemailly


## Responsibilities
* Build cop: monitor the builds, the presubmit automated tests, the postsubmit automated tests:
* Familiarize yourself with the current [open issues affecting automated tests](https://github.com/istio/istio/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Ftest-failure).
Member

That seems too long a list to be familiar with. Is there a shorter list covering just build breaks and critical test-break issues?

Contributor Author

This is the list unfortunately, 20 open issues. If it is too long, people should prioritize fixing those bugs.

Contributor Author

This is the list for critical issues, not all the issues.

#PRs with review approved / in flight:
53 baseline / 22 current

Member

It would be good to include the queries for these so they can be reused.

Contributor Author

They have to be adapted anyway with the new dates, so no point in adding them here. Please feel free to add them later if you find some that can be re-used.
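
For anyone who wants to try this later, here is a minimal sketch of how the date-adapted queries could be generated rather than rewritten by hand each week. It is not part of the playbook. The helper names (`search_url`, `handoff_queries`), the choice of stats, and the example shift dates are assumptions made for illustration; the search qualifiers (`is:`, `created:`, `closed:`, `review:approved`) are the standard ones from the GitHub search documentation linked in the doc.

```python
from datetime import date
from urllib.parse import quote_plus

REPO = "istio/istio"  # repository discussed in this thread


def search_url(qualifiers: str) -> str:
    """Return a GitHub issue-search URL for the given search qualifiers."""
    query = f"repo:{REPO} {qualifiers}"
    return "https://github.com/search?type=Issues&q=" + quote_plus(query)


def handoff_queries(start: date, end: date) -> dict:
    """Build queries for one on-call shift; start and end dates are inclusive.

    The set of stats here is hypothetical; adjust it to match the summary.
    """
    window = f"{start.isoformat()}..{end.isoformat()}"
    return {
        "issues opened during shift": search_url(f"is:issue created:{window}"),
        "issues closed during shift": search_url(f"is:issue closed:{window}"),
        "open PRs with approved review": search_url("is:pr is:open review:approved"),
        "open issues total": search_url("is:issue is:open"),
    }


if __name__ == "__main__":
    # Hypothetical shift: Tuesday Feb 6 to Tuesday Feb 13, 2018.
    for name, url in handoff_queries(date(2018, 2, 6), date(2018, 2, 13)).items():
        print(f"{name}: {url}")
```

Only the two dates change from week to week, so the next on-call could rerun this with their own shift boundaries.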

ON-CALL.md Outdated
* If there are new failures, open issues and label them with kind/test-failure, with the appropriate area label, with "prow" or "circleci" label,
and assign them either directly to an engineer or to the area lead.
The issue must contain a link to the failed test log.
Add a comment in github cc-ing the assignees and explaining this must be fixed or reverted with highest priority. Nag people when needed.
Member

What is the process for disabling a required test when the test is determined to be unstable?

Contributor Author

I have instructions to disable prow in an email, I can add them, but this can only be done by an administrator.

Contributor Author

I added instructions, PTAL.


#Issues total
488 baseline / 526 current

Member

How do I interpret baseline? Is it the number of issues at the beginning of the on-call shift?

Member

wonder if we should put the data in an online spreadsheet so that we can easily see trends among multiple weeks. that'd be more interesting data.

Member

i am keeping a log in the oncall folder, maybe we can turn it into a sheet with weekly data

Contributor Author

Sure, that can be done later. K8s does fancy graphs with the data; this is more for information purposes. It is not the on-call's duty to make judgment calls related to the number of open issues.

@ldemailly removed the do-not-merge/hold label (Block automatic merging of a PR.) Feb 14, 2018
@ldemailly
Member

I find keeping an eye on https://github.com/orgs/istio/dashboard useful
and during handoff look at https://github.com/istio/istio/pulse which conveniently defaults to looking at 1 week in the past

@andraxylia
Contributor Author

Only a handful of people have GitHub permissions to disable tests. Instructions are here for reference, but it will not actually be the on-call doing it. There is no need to add the note about the slippery slope; I hope those who have been given GitHub admin permissions understand this.

@andraxylia
Contributor Author

Can I get an approval to merge this? People can iterate subsequently.

ON-CALL.md Outdated
# Istio On-call Playbook

## Who
Every week, one Istio developer is on-call and is responsible for maintaining Istio build and for helping out users and other developers. The on-call person should prioritize on-call duties on top of the daily work.
Contributor

maintaining THE Istio build PROCESS
on top of THEIR daily work

Contributor

And should we mention it's not 24*7 on-call?

Contributor Author

I will fix the grammar mistakes. I already mentioned this is during business hours for the time zone.

ON-CALL.md Outdated
* If there are new failures, open issues and label them with kind/test-failure, with the appropriate area label, with "prow" or "circleci" label,
and assign them either directly to an engineer or to the area lead.
The issue must contain a link to the failed test log.
Add a comment in github cc-ing the assignees and explaining this must be fixed or reverted with highest priority. Nag people when needed.
Contributor

github -> GitHub

here and elsewhere in the doc.

ON-CALL.md Outdated
* Help with creating the release when needed.
* Check the schedule sheet to make sure the next on call is defined.
* On Friday, notify the next on-call.
* On Tuesday morning, handoff to next oncall and send a summary to istio-oncall and istio-dev. Include the stats below, that can be obtained by querying [github with dates ranges:](https://help.github.com/articles/searching-issues-and-pull-requests/)
Contributor

oncall -> on-call

Contributor Author

ok

Contributor

douglas-reid left a comment

a few comments.

ON-CALL.md Outdated
## Schedule
The on-call schedule can be found [here](https://docs.google.com/spreadsheets/d/1FaHwPpad3F3hva2suJweNeTnocjKtLnbgLkyMRPzgUY/edit#gid=1475801904).

On-call duty starts on Tuesday at noon PST, ends the following week on Tuesday at noon PST, and is performed during regular working hours for your time zone.
Contributor

so, for non-PST timezoned folks, this implies that their oncall cycle might start at the beginning of their regular working hours some time thereafter, correct? and their cycles would end before the handoff?

For the oncall members in Israel, noon PST is like 10PM and 10AM is midnight PST. I'm not sure how much this matters, but is worth mentioning, especially as it regards oncall expectations.

Contributor Author

Right, it is kind of implied by the business hours I mention previously.
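
To make the time-zone arithmetic in this thread concrete, here is a small sketch that converts the Tuesday noon PST handoff into a local time. The fixed UTC offsets are assumptions that hold for winter only (PST at UTC-8, Israel Standard Time at UTC+2) and ignore daylight saving time; the helper name `to_local` and the example date are also illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets, assumed for illustration; daylight saving time is ignored.
PST = timezone(timedelta(hours=-8), "PST")
ISRAEL = timezone(timedelta(hours=2), "IST")


def to_local(moment: datetime, zone: timezone) -> datetime:
    """Convert a timezone-aware datetime to the given zone."""
    return moment.astimezone(zone)


# Hypothetical handoff: Tuesday Feb 13, 2018, at noon PST.
handoff = datetime(2018, 2, 13, 12, 0, tzinfo=PST)
local = to_local(handoff, ISRAEL)
print(local.strftime("%A %H:%M %Z"))  # Tuesday 22:00 IST

# The reverse direction: 10 AM in Israel is midnight PST.
morning = datetime(2018, 2, 13, 10, 0, tzinfo=ISRAEL)
print(to_local(morning, PST).strftime("%A %H:%M %Z"))  # Tuesday 00:00 PST
```

Under these assumptions the nominal handoff falls outside regular working hours in Israel, which is the case the comment above raises.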

ON-CALL.md Outdated
* Help with creating the release when needed.
* Check the schedule sheet to make sure the next on call is defined.
* On Friday, notify the next on-call.
* On Tuesday morning, handoff to next oncall and send a summary to istio-oncall and istio-dev. Include the stats below, that can be obtained by querying [github with dates ranges:](https://help.github.com/articles/searching-issues-and-pull-requests/)
Contributor

i'd be careful with "morning" given timezone differences. I think I'd phrase this as "at the end of your oncall shift".

Contributor Author

ok

* Join the `oncall` Slack channel

## Responsibilities
* Build cop: monitor the builds, the presubmit automated tests, the postsubmit automated tests:
Contributor

can we add a link to the test grid (that shows post-submit test status across PRs, etc.) somewhere here? http://k8s-testgrid.appspot.com/istio#Summary

ON-CALL.md Outdated
* Uncheck the affected tests.



Contributor

Remove empty lines.

## Who
Every week, one Istio developer is on-call and is responsible for maintaining Istio build and for helping out users and other developers. The on-call person should prioritize on-call duties on top of the daily work.

## Schedule
Contributor

And how do we do this on-call rotation? Is it volunteer-based, or do we have some rotation mechanism like round robin?

Contributor Author

I do not know this.

@andraxylia
Contributor Author

Updated the doc, PTAL. Remember this is not a legally binding document, so we can iterate if we find it does not work.

@ldemailly
Member

Thanks Andra ! I'll add more too after my shift
/lgtm

@istio-merge-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ldemailly

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@istio-merge-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@istio-merge-robot

Automatic merge from submit-queue.

@istio-merge-robot merged commit ed69c3a into master Feb 15, 2018
@ldemailly deleted the oncall branch February 15, 2018 05:38
10 participants