[4.0] memcached: Make config non-HA-aware (bsc#1038223)#1340

Closed
cmurphy wants to merge 1 commit into crowbar:stable/4.0 from cmurphy:fix-memcached-4.0

Conversation

@cmurphy
Contributor

@cmurphy cmurphy commented Oct 2, 2017

Without this patch, the keystone and nova barclamps set their cache
servers to all of the memcached servers in the cluster, in
lexicographical order. This is not an optimal way to configure memcached
servers: if part of the cluster is down, the memcached servers living on
it will be inaccessible. The python-memcached backend is not tied to
pacemaker and has no way of knowing that a server is down, so it
attempts to connect to each server serially, not trying the next one
until the first times out. The effect is that any query to the OpenStack
service takes a very long time. This patch fixes the issue by using only
the local memcached server for keystonemiddleware instead of all the
servers in the cluster. This means every controller in the cluster will
use only its own memcached server, similar to how it would work if it
were using an in-process cache.
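For illustration, the change amounts to something like the following (a minimal Python sketch; the function and node names are invented for illustration and are not the actual Chef recipe code):

```python
# Hypothetical sketch of the config change; names are our invention,
# not identifiers from the barclamp code.
def memcached_servers(cluster_nodes, local_node, port=11211, ha_aware=True):
    if ha_aware:
        # old behavior: every cluster member's memcached, sorted lexicographically
        return ",".join("%s:%d" % (n, port) for n in sorted(cluster_nodes))
    # new behavior: only the memcached running on this controller
    return "%s:%d" % (local_node, port)

nodes = ["node3", "node1", "node2"]
print(memcached_servers(nodes, "node2"))                  # node1:11211,node2:11211,node3:11211
print(memcached_servers(nodes, "node2", ha_aware=False))  # node2:11211
```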

@cmurphy
Contributor Author

cmurphy commented Oct 2, 2017

Cloud8 version is here: #1341 (not cherry-picked)

Contributor

@nicolasbock nicolasbock left a comment


If I understood this correctly, then we won't use several memcached instances within HA but only the on-node instance. That gives up quite a bit of potential performance, doesn't it? This sounds like a pretty fundamental limitation of the oslo.cache code.

@cmurphy
Contributor Author

cmurphy commented Oct 2, 2017

@nicolasbock what performance gains did we get from using multiple cache instances? I can't find information on how configuring memcached in a cluster improves performance, and in fact from https://github.com/memcached/memcached/wiki/Performance#maximum-number-of-nodes-in-a-cluster there is the potential for it to impede performance.

The issue is actually in python-memcached, not oslo.cache: https://github.com/linsomniac/python-memcached/blob/1.58/memcache.py#L444-L448
python-memcached is common to both the dogpile.cache.memcached and oslo_cache.memcache_pool backends of oslo.cache, so I don't think there's any way to tune oslo.cache for this. I guess it's worth noting that I reproduced this with cache backend oslo_cache.memcache_pool but Dirk reported the bug when we were still using dogpile.cache.memcache as the cache backend.
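The serial retry loop linked above can be modeled roughly like this (a heavily simplified toy, not the actual memcache.py code; `FakeServer` and `get_server` are invented names):

```python
import time

# Toy model (our simplification) of python-memcached's Client._get_server
# loop: on a failed connect it rehashes and tries the next server, paying
# a full connect timeout for every unreachable host before moving on.
SOCKET_TIMEOUT = 3.0  # python-memcached 1.58 defaults to a 3-second socket timeout

class FakeServer:
    def __init__(self, name, alive):
        self.name, self.alive = name, alive

    def connect(self):
        if not self.alive:
            time.sleep(SOCKET_TIMEOUT)  # stand-in for a blocking TCP timeout
            return False
        return True

def get_server(servers, key_hash, retries=3):
    """Serial failover: each retry targets the next server in hash order."""
    for attempt in range(retries):
        server = servers[(key_hash + attempt) % len(servers)]
        if server.connect():
            return server
    return None
```

With the lexicographically first server sitting on a downed cluster node, every request whose key hashes to it stalls for a full timeout before failing over, which matches the very slow OpenStack queries described above.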

@nicolasbock
Contributor

what performance gains did we get from using multiple cache instances?

In a multiprocess keystone setup in which all keystone instances access the same pool of memcached servers, I would expect a potential increase in keystone performance because of the shared cache. While you are correct to point out that a large number of servers can slow down a client, we don't know at what number that happens. And of course it would help if we had some actual benchmarks to show that we do end up with a performance gain. At this point I am just speculating. 😄

It sounds like the memcached design does not consider failed nodes, though (I couldn't find anything in their documentation), and waiting for a connection to time out definitely decreases performance. Since you found that we can't tune anything to change that behavior, I agree that it's better to simply use the local memcached only.

Thanks for the additional details!

nicolasbock
nicolasbock previously approved these changes Oct 2, 2017
@stefannica
Contributor

@cmurphy there were two `memcached_servers.join` calls in the nova config file

@cmurphy
Contributor Author

cmurphy commented Oct 4, 2017

@stefannica thanks, fixed

nicolasbock
nicolasbock previously approved these changes Oct 4, 2017
@cmurphy cmurphy force-pushed the fix-memcached-4.0 branch from 5562d2f to 764c1b9 Compare October 4, 2017 14:41
@cmurphy cmurphy changed the title memcached: Make config non-HA-aware (bsc#1038223) [4.0] memcached: Make config non-HA-aware (bsc#1038223) Oct 6, 2017
@cmurphy cmurphy added the wip label Oct 6, 2017
@cmurphy
Contributor Author

cmurphy commented Oct 6, 2017

I'm pretty sure the failure here must be caused by this change

@cmurphy
Contributor Author

cmurphy commented Oct 12, 2017

@nicolasbock after doing a lot of reading I understand better what you were saying about performance gains: if we have separate caches, then each controller has to cache everything itself, doubling or tripling the number of writes we have to do. Not ideal.
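That write amplification can be put in rough numbers (a back-of-the-envelope model, not a benchmark; the function name is invented):

```python
# Back-of-the-envelope model: cache fills needed to warm every controller
# for the same set of hot keys (e.g. validated keystone tokens).
def warmup_fills(hot_keys, controllers, shared_cache):
    if shared_cache:
        return hot_keys               # first miss anywhere fills it for all nodes
    return hot_keys * controllers     # every node misses and fills independently

print(warmup_fills(1000, 3, shared_cache=True))   # 1000
print(warmup_fills(1000, 3, shared_cache=False))  # 3000
```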

The HA job is failing here because when the ceph cookbook tries to make a role assignment, it makes two requests: one to GET the role assignments for the ceph user and one to PUT the new role assignment. As luck would have it, it was fairly consistent about which controller ended up receiving each request. A GET for role assignments would go to one controller and produce a cache hit containing just the member role, which comes from the ceph user having a default tenant set, and which is not the intended assignment of 'admin'. It would then issue a PUT to try to correct the role assignment and fail with an HTTP 409, because it had already created this role assignment in an earlier chef run; that request had simply gone to a different controller and was therefore only cached there.
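A toy simulation of that sequence (heavily simplified; the class and method names are invented, not keystone code):

```python
# Heavily simplified model of the failure: two controllers sharing one
# keystone database but each holding a private, never-invalidated cache.
class Controller:
    def __init__(self, db):
        self.db = db            # shared database: set of (user, role) grants
        self.cache = {}         # per-node memcached stand-in

    def get_assignments(self, user):
        if user not in self.cache:            # cache miss: read through to the DB
            self.cache[user] = {r for (u, r) in self.db if u == user}
        return self.cache[user]

    def put_assignment(self, user, role):
        if (user, role) in self.db:
            return 409                        # duplicate grant is rejected
        self.db.add((user, role))
        self.cache[user] = {r for (u, r) in self.db if u == user}
        return 201

db = set()
a, b = Controller(db), Controller(db)
b.get_assignments("ceph")                     # b now caches "no admin grant"
a.put_assignment("ceph", "admin")             # earlier chef run grants via a
# a later chef run reads via b, sees the stale cache, and retries the PUT:
assert "admin" not in b.get_assignments("ceph")
print(b.put_assignment("ceph", "admin"))      # -> 409
```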

I think this particular issue could be corrected by using the keystone v3 API for role assignments (which we already do in master) which wouldn't consider a default project to be a role assignment and would therefore have a cache miss and seek the role assignments from the database. But this illustrates the potential for a sort of split-brain problem that is not really acceptable, in addition to the performance hit.

I commented on the bug that I think the problem that prompted this is not really a problem any more since we switched to the memcache_pool backend, so closing this.

@cmurphy cmurphy closed this Oct 12, 2017
@nicolasbock
Contributor

That's interesting @cmurphy. Nice analysis!
