
WIP: Modify kube-push for GCE to bring down the existing master VM and completely replace it with a new one#3174

Closed
a-robinson wants to merge 1 commit into kubernetes:master from a-robinson:gce-push

Conversation

@a-robinson
Contributor

This makes upgrades less likely to break in weird ways and adds support for easily upgrading underlying components on the master like the guest OS or etcd.

To do this, we reserve the IP address of the master after it's created and store all dynamically created files on a persistent disk (PD). Then, kube-push consists of swapping the PD and reserved IP address over to a new VM with the desired components on it.

This has been tested to pass the /validate endpoint after upgrading between a few different recent commit versions.
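The swap itself isn't spelled out in this thread; as a rough dry-run sketch of the sequence (the resource names and zone are hypothetical, the boot-image flags are omitted, and the flags shown are the modern `gcloud` equivalents of the `gcutil` calls the scripts used at the time):

```shell
#!/bin/sh
# Dry-run sketch of the PD + reserved-IP swap; names and zone are hypothetical.
# Replace "echo gcloud" with "gcloud" to actually run the commands.
GCLOUD="echo gcloud"
MASTER="kubernetes-master"
ZONE="us-central1-b"
REGION="us-central1"

# One-time setup: reserve a static external IP for the master.
$GCLOUD compute addresses create "${MASTER}-ip" --region "$REGION"

# Push: tear down the old master VM but keep its data disk.
$GCLOUD compute instances delete "$MASTER" --zone "$ZONE" --keep-disks data

# Bring up a fresh master VM with the new components, reattaching the
# reserved address and the persistent disk (boot image flags omitted).
$GCLOUD compute instances create "$MASTER" --zone "$ZONE" \
  --address "${MASTER}-ip" \
  --disk "name=${MASTER}-pd,device-name=master-pd"
```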

@a-robinson
Contributor Author

Related to #2524

Contributor


How is it verified that everything important on the master is stored in a dir mounted from this PD?

Contributor Author


It's not, and frankly I don't have any good plan for getting that assumption under test, given that usage of the filesystem is spread all over the place in tons of different shell scripts and salt configs :/

Have any ideas?
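One way to sanity-check that assumption (a hypothetical sketch, not part of this PR; the mount point and directory list are illustrative) would be a script on the master that verifies each state directory actually resolves to the PD mount:

```shell
#!/bin/sh
# Hypothetical check (not from this PR): verify that the master's state
# directories live on the master-pd mount rather than the boot disk.
MOUNT_POINT="/mnt/master-pd"

# Print the mount point a path resolves to (last column of `df -P`).
mount_of() { df -P "$1" | awk 'NR==2 {print $6}'; }

# Illustrative list of dirs the shell scripts and salt configs write to.
for d in /var/etcd /srv/kubernetes; do
  if [ "$(mount_of "$d" 2>/dev/null)" = "$MOUNT_POINT" ]; then
    echo "on master-pd: $d"
  else
    echo "NOT on master-pd: $d"
  fi
done
```

This still only covers directories someone thought to list, so it narrows the problem rather than solving it.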

@lavalamp
Contributor

Can you run the following sequence of commands and report back?

hack/build-go.sh
go run hack/e2e.go -v -build -up -test
go run hack/e2e.go -v -build -push -test

I think the monitoring.sh test is broken at the moment but the other tests should all pass.

@lavalamp lavalamp self-assigned this Dec 30, 2014
@davidopp davidopp mentioned this pull request Dec 30, 2014
@a-robinson
Contributor Author

Hm, sorry for the delay, but I've been having a lot of trouble with the e2e tests. After finally getting a cluster up, it only passed 7/10 tests before the push, and only 2/10 after. I'm running them again and looking into why they failed, but do you know if there've been any issues with the e2e tests today?

@a-robinson
Contributor Author

Thanks for reminding me to run the e2e tests -- it turns out the /validate endpoint is woefully insufficient for validating the cluster, and that the cluster doesn't work properly (it can't even schedule pods).

The reason is that the kubelet on the minions can't connect to etcd or the apiserver: they currently talk to them over internal IPs, and GCE doesn't seem to offer a way to transfer internal IP addresses between VMs. This could be fixed by using a route for all minion-to-master traffic, or we could wait until the kubelet's direct dependency on etcd goes away (PR #846 / Issue #2483) and then have the kubelet speak to the apiserver over its external IP instead. CJ and I chatted and would lean toward the latter to avoid adding more network cruft, so this PR may have to wait a bit unless you feel differently.
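For reference, the route-based workaround would look roughly like this (a dry-run sketch; the route name, CIDR, and zone are hypothetical, and this is not something the PR implements). The idea is to give minions a stable "virtual" master IP that a GCE route points at whichever VM is currently the master:

```shell
#!/bin/sh
# Dry-run sketch of the route workaround; names and CIDR are hypothetical.
# Replace "echo gcloud" with "gcloud" to actually run the commands.
GCLOUD="echo gcloud"

# Pick a stable virtual master IP outside the instance range and route it
# at the current master VM; minions would be configured to use this IP.
$GCLOUD compute routes create kubernetes-master-route \
  --destination-range "10.250.0.1/32" \
  --next-hop-instance kubernetes-master \
  --next-hop-instance-zone us-central1-b

# On each push, repoint the route at the replacement master VM.
$GCLOUD compute routes delete kubernetes-master-route --quiet
$GCLOUD compute routes create kubernetes-master-route \
  --destination-range "10.250.0.1/32" \
  --next-hop-instance kubernetes-master-v2 \
  --next-hop-instance-zone us-central1-b
```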

@a-robinson
Contributor Author

Also closely related to #3168

@lavalamp
Contributor

lavalamp commented Jan 2, 2015

The reason is that the kubelet on the minions can't connect to etcd or the apiserver: they currently talk to them over internal IPs, and GCE doesn't seem to offer a way to transfer internal IP addresses between VMs.

That's unfortunate. Does the internal IP change if you reboot the master? (Is it possible to reboot & supply a new disk at the same time?)

@a-robinson
Contributor Author

The IP stays the same when you reboot, but boot disks can't be detached from their instance, so there's no way to swap in a fresh one.

@lavalamp
Contributor

lavalamp commented Jan 6, 2015

The status of this is that it's blocked on either making a special route or using the master's public IP address, correct?

@a-robinson
Contributor Author

Yup. I'll self-assign this until it's unblocked and ready again.

@a-robinson a-robinson assigned a-robinson and unassigned lavalamp Jan 6, 2015
@bgrant0607
Member

@a-robinson Are you still working on this?

@a-robinson
Contributor Author

I haven't touched it since my last comment. I should probably check out how it works now that the minion's dependency on etcd has been removed on GCP, but I expect that the change of internal IP will still break salt, at the very least.

I'll strip out the kube-push change from the PD mounting improvements and get those checked in.

@a-robinson
Contributor Author

I tried this out again after rebasing to head. It still doesn't work, with the current cause being the use of the master's internal IP address by minions rather than its hostname or external IP. I'll take a look into whether changing our salt configs would break anything. In the meantime, I've split out the directory and static IP changes from this into #4715.

@a-robinson a-robinson changed the title Modify kube-push for GCE to bring down the existing master VM and completely replace it with a new one WIP: Modify kube-push for GCE to bring down the existing master VM and completely replace it with a new one Feb 23, 2015
a-robinson added a commit to a-robinson/kubernetes that referenced this pull request Feb 23, 2015
…nd reserve

the master's IP upon creation to make it easier to replace the master later.

This pulls out the parts of PR kubernetes#3174 that don't break anything and will
make upgrading existing clusters in the future less painful.

Add /etc/salt to the master-pd
@a-robinson
Contributor Author

Closing; I'll open a new PR once I've played with the salt configs to avoid using the internal IP explicitly.

@a-robinson a-robinson closed this Mar 9, 2015
@mbforbes mbforbes mentioned this pull request Mar 27, 2015
@a-robinson a-robinson deleted the gce-push branch June 5, 2015 01:20