Monotonic clock and updates to AE #7644
Conversation
Rebase to latest OSS
looks good indeed. TBH we really need to reduce calls to |
@guybe7 You are correct. The monotonic clock should never be used in an attempt to "deduce the wall-clock time". It is only useful for determining elapsed time (aka deltas) - which is a common need. Re: reduction of calls to |
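To illustrate the delta-only usage pattern, here is a minimal sketch using the `getMonotonicUs`/`monotime` names from the PR description (the surrounding function and the printed message are purely illustrative, and the code assumes it is linked against the PR's monotonic clock source):

```c
#include <stdint.h>
#include <stdio.h>

/* Provided by the monotonic clock API added in this PR: micro-seconds
 * elapsed from an arbitrary (unspecified) starting point. */
typedef uint64_t monotime;
extern monotime getMonotonicUs(void);

void timedSection(void) {
    monotime start = getMonotonicUs();

    /* ... work being measured ... */

    uint64_t elapsed_us = getMonotonicUs() - start;  /* delta only, never wall-clock */
    printf("section took %llu us\n", (unsigned long long)elapsed_us);
}
```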
What a great idea Jim! Disclosure: this is an AWS contribution, so we already run a similar variant of this internally and have for a while. Although I think this is specifically beneficial here, I also agree that it is probably useful to replace some of the other usages of mstime/gettimeofday within Redis when precise time isn't necessary and we can get a little performance boost. I am good with the current implementation as is, but this is a change worthy of consensus. The specific concerns I had that it would be great for others to validate:
oranagra left a comment:
@JimB123 thanks a lot for this contribution.
I agree with everyone that it's an important step forward, and we can merge it as is and later worry about using it in more places.
I also agree with Guy and others that in some (many) places we should just cache the mstime/ustime of the current command rather than sample it repeatedly, but that's certainly another topic.
Regarding styling:
I personally think ifdefs should not induce code indentation. Instead i like the #else and #endif to have a comment indicating to which #if they belong.
I usually prefer statements that change execution flow (goto, return, continue, break) to be on a line of their own (not with the if), except in a repeated error handling code which would add too many lines and hide the actual logic code.
But these are not a "one size fits all" (specifically since this is a new file), it's up to you to decide if you wanna change it.
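For illustration only, here is a small fragment in the style described above: no indentation induced by the preprocessor conditionals, and a comment on #else/#endif naming the #if they close. The macro name and the helper function are placeholders, not the PR's actual code:

```c
#include <stdint.h>
#include <time.h>

#if defined(USE_PROCESSOR_CLOCK)
static uint64_t getMonotonicUs(void) {
    return readProcessorClockUs();   /* placeholder for the RDTSC/CNTVCT path */
}
#else /* USE_PROCESSOR_CLOCK */
static uint64_t getMonotonicUs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
}
#endif /* USE_PROCESSOR_CLOCK */
```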
I saw the PR description says the default is to use clock_gettime, but correct me if i'm wrong, you mean that it's the "fallback", right? i.e. by default it will use the faster HW specific methods and only fallback to POSIX if they're missing at compile time.
@oranagra Originally, I coded this to use |
I fail to see what the risks are of enabling it by default (@madolson can you elaborate?). Many will not be aware of this flag and will just remain with a less efficient Redis.
@yossigo what do you think? The second bullet brings me back to the assertions. Again, assert is more suitable for checking a condition that should logically never happen; in this case it checks external conditions (parsing a file), and on failure it should print a message to stderr and call |
@JimB123 This is a great contribution, thanks!
@yossigo so just to be sure, you support making it enabled in a default build, and having it simply print an error and exit if there is a problem (i.e. no fallback at runtime is needed).
@oranagra no. I don't have a strong opinion about it being built by default; I know it can have undesired side effects in some cases (e.g. depending on the CPU / firmware / hypervisor in use, in case a CPU is hotswapped, etc.). But if we do compile support for it, I don't see a reason to exit if we fail to initialize. We can just produce a warning and fall back to |
ohh, missed one word in your previous post. Personally i don't really like an important optimization to be opt-in. i think there's a high chance that many will not enjoy it.
@oranagra I submit that this is an acceptable use of assert. These conditions should never happen unless there is a fundamental error in the code or compilation options. I agree that this is an "external condition" in that it's checking the expected functionality of the operating system (at startup). But this is not "parsing a file" as there is no "file". Linux "files" under
I don't see much difference between
While I think use of |
Note that supporting run-time fallback would probably offset the value of providing |
@oranagra My main concern was to make sure Redis kept building out of the box and didn't exhibit weird clock behavior. There seem to be a lot of weird compatibility issues around which timers are supported on different OSs. It looks like it's gotten much more consistent in the last couple of years. If someone is running an old enough x86 processor, it might not actually work correctly as the TSCs between cores aren't synced. So I thought letting people slowly test it out with an opt-in flag for a while might work, and then at some point we build it by default. Looking back at the internal review, I was also concerned it might not initialize correctly. So the alternative approach is to make sure at startup everything works as expected and otherwise fall back to clock_gettime(). I was also okay with that approach, but it's slightly more complex.
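A rough sketch of that alternative (validate at startup, otherwise fall back) might look like the following; the function and variable names here are hypothetical and this is not the PR's actual code:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint64_t (*monotonicFn)(void);

static uint64_t monotonicPosixUs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
}

/* Hypothetical: returns nonzero and fills *fn only if the processor clock
 * (TSC/CNTVCT) can be calibrated and trusted on this host. */
extern int tryInitProcessorClock(monotonicFn *fn);

static monotonicFn monotonicUs = monotonicPosixUs;

void monotonicInit(void) {
    monotonicFn hw;
    if (tryInitProcessorClock(&hw)) {
        monotonicUs = hw;    /* fast path, validated at startup */
    } else {
        fprintf(stderr, "Processor clock unusable, falling back to clock_gettime\n");
    }
}
```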
Thank you all for the positive reception of this feature. It's nice to see that there are no issues/concerns with the functional interface or the changes to
@madolson, as primary reviewer for this, please reach consensus (or make an executive decision) regarding the following open questions:
Hello, I guess I'll post the update:
We also need to document it somewhere (maybe in the README).
The code used to decide on the next time to wake on a timer with microsecond accuracy, but when deciding to go to sleep it used millisecond accuracy (with truncation). This means that it would wake up too early, see that there's no timer to process, and go to sleep again for 0ms, again and again, until the right microsecond arrived. I.e. for a 100ms timer, it would sleep for 99ms, but then do a busy loop through the kernel in the last millisecond, triggering many calls to beforeSleep. The fix is to change all the logic in ae.c to work with microseconds, which is good since most of the ae backends support micro (or even nano) seconds. However, the epoll backend doesn't support microseconds, so to avoid this problem it needs to round upwards, rather than truncate. Issue created by the monotonic timer PR #7644 (redis 6.2). Before that, all the timers in ae.c were in milliseconds (using mstime), so when it requested the backend to sleep till the next timer event, it would have worked ok.
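A minimal sketch of the rounding change described above, assuming a helper that converts the microsecond delay computed in ae.c into the millisecond timeout that epoll_wait() accepts (the helper name is illustrative, not the actual patch):

```c
/* Convert "microseconds until the nearest timer" into an epoll_wait() timeout.
 * Truncating would wake up before the timer is due and busy-loop through
 * beforeSleep; rounding upwards sleeps at least as long as requested. */
static int usToEpollTimeoutMs(long long us_until_timer) {
    if (us_until_timer < 0) return -1;            /* no timer: block indefinitely */
    return (int)((us_until_timer + 999) / 1000);  /* round up, don't truncate */
}
```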
Sorry for asking a maybe stupid question... Does it mean that as of Redis 6.2, Redis uses a monotonic clock? Or is it a config option... Does it depend on the platform? Thanks in advance
@jhmilan there is an internal API which provides a monotonic clock for some (not all) Redis functions. Eventually most of Redis should use that API. Currently most of Redis uses gettimeofday.
Behind the API is a clock implementation which is determined at compile time. There are options to directly use the X86 TSC, the ARM equivalent, or a POSIX implementation. Note that, other than slight performance variations across platforms, this has no impact on the Redis client experience.
Thanks for your quick reply @JimB123
@jhmilan the expiration mechanism always uses wall-clock (not monotonic), and i'm not sure it should change.
@oranagra sure. This is a very interesting read where Martin Kleppmann talks about this topic: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
From this I understand that it is possible that a given key expires before its expected TTL. If that is the case, then, as explained in the article, Redlock could violate the safety property for the lock acquisition.
that's all true (except for the clock jumping backwards, since a proper configuration should use slewing), but the question is if / how this breaks redlock (which i don't know much about).
Say you have 5 Redis nodes for your Redlock implementation. Then 2 of them are down. You acquire a lock for the resource "foo" for 30 seconds (it basically means setting the key "foo" in a majority of nodes with a 30-second TTL). Now imagine that the 2 nodes that were down come back to life again, and suppose there is a clock shift in node B which makes the TTL expire. Basically it would mean that 2/5 nodes (a minority) would hold the lock, but someone trying to acquire "foo" would succeed since 3/5 nodes would set that key. In this situation 2 clients could have acquired the same lock simultaneously. The chances for this to happen are quite low I think (especially for short-lived TTLs), but still, it makes Redlock unsafe and not eligible for some scenarios where consistency is crucial. By using a monotonic clock for the expiration mechanism that issue would be gone. I guess that this is not important when using Redis as a cache layer, where you would just be missing some key that you can easily refresh.
Small thought here... Regardless of the Redlock use case, in my opinion the TTL mechanism should be reliable, and the fact that it uses the wall-clock makes it weak. I never thought about it very thoroughly, but it scopes Redis to just being a simple cache layer and no more than this. It is a bit disappointing.
Wall-clocks have been getting more reliable and consistent. Within AWS, you can configure NTP to use AWS time sync which provides a very accurate global clock which correctly slews with changes. I don't think the answer to more precise time measurement is to necessarily use the hardware clocks/monotonic APIs, especially for distributed systems.
@madolson "more reliable" is not enough in my opinion. Your argument is that "you can set up the system in a way that makes this problem less likely" and, ok, but this is not sufficient I would say. It is not about being against NTP and wall clocks in general, it is just about this specific use case in Redis TTL. I think it should be (100%) reliable. A TTL should never ever expire before its expiration time. It never ever should be affected by something external. |
@jhmilan This is not as simple as it looks; you are looking at it from the perspective of Redlock, but Redis has other concerns as well.
Thanks for your response. And please, don't take this the wrong way.
My only suggestion here is adding some disclaimer in the documentation clearly stating the limitations and risks around TTLs, clocks, etc. (or maybe they are there and I missed them). As a user of Redis I assumed that TTLs are consistent. It is not something that you think of upfront. So discovering that wall-clock issues can affect Redis TTLs scares you a bit as a user. It would be nice to know exactly what the potential problems are, how you can prevent them, what the implications are, what you should/should not do, etc... And particularly, the Redlock docs must warn about it very clearly. I love Redis and I think it is a great solution. My problem has been that I implemented something that Redis has brought up publicly (Redlock) and, after implementing it, I discovered that something in the guts of Redis makes this (official) solution weak.
Any light you can bring in that sense would be very welcome. Thank you very much in advance.
Hi. Regarding Redlock, it could be that something has changed since it was designed. Well, many things have changed, but i can't think of something that broke it. Regarding other documentation about TTL, i'm not certain what's already documented and where it could be added.
I will, for sure. Thanks for clarifying this to me.
Update adds a general source for retrieving a monotonic time. In addition, AE has been updated to utilize the new monotonic clock for timer processing. This performance improvement is **not** enabled in a default build due to various H/W compatibility concerns, see README.md for details. It does, however, replace the default use of gettimeofday with clock_gettime and somewhat improves performance.
This update provides the following:
1. An interface for retrieving a monotonic clock. getMonotonicUs returns a uint64_t (aka monotime) with the number of micro-seconds from an arbitrary point. No more messing with tv_sec/tv_usec. Simple routines are provided for measuring elapsed milli-seconds or elapsed micro-seconds (the most common use case for a monotonic timer). No worries about time moving backwards.
2. High-speed assembler implementation for x86 and ARM. The standard method for retrieving the monotonic clock is POSIX.1b (1993): clock_gettime(CLOCK_MONOTONIC, timespec*). However, most modern processors provide a constant speed instruction clock which can be retrieved in a fraction of the time that it takes to call clock_gettime. For x86, this is provided by the RDTSC instruction. For ARM, this is provided by the CNTVCT_EL0 instruction. As a compile-time option, these high-speed timers can be chosen. (Default is POSIX clock_gettime.)
3. Refactor of event loop timers. The timer processing in ae.c has been refactored to use the new monotonic clock interface. This results in simpler/cleaner logic and improved performance.
Background
Redis needs an internal API for using a monotonic clock. Currently, most of Redis uses gettimeofday. This is a problem because time of day can be altered due to system time changes. The system time can be altered manually or by NTP. There is actually code, included in ae.c, which checks to see if the clock has moved backwards. However, most of Redis uses gettimeofday for timing, etc. - simply assuming that it is a monotonic clock.
This update provides the following:
1. An interface for retrieving a monotonic clock. getMonotonicUs returns a uint64_t (aka monotime) with the number of micro-seconds from an arbitrary point. No more messing with tv_sec/tv_usec. Simple routines are provided for measuring elapsed milli-seconds or elapsed micro-seconds (the most common use case for a monotonic timer). No worries about time moving backwards.
2. High-speed assembler implementation for x86 and ARM (see the sketch below). The standard method for retrieving the monotonic clock is POSIX.1b (1993): clock_gettime(CLOCK_MONOTONIC, timespec*). However, most modern processors provide a constant speed instruction clock which can be retrieved in a fraction of the time that it takes to call clock_gettime. For x86, this is provided by the RDTSC instruction. For ARM, this is provided by the CNTVCT_EL0 instruction. As a compile-time option, these high-speed timers can be chosen. (Default is POSIX clock_gettime.)
3. Refactor of event loop timers. The timer processing in ae.c has been refactored to use the new monotonic clock interface. This results in simpler/cleaner logic and improved performance.
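As a rough illustration of item 2 (not the PR's exact code), reading the raw constant-rate counters with GCC/Clang intrinsics and inline assembly; converting the resulting ticks to micro-seconds still requires the counter frequency:

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
static inline uint64_t readRawClockTicks(void) {
    return __rdtsc();                                      /* x86 time-stamp counter */
}
#elif defined(__aarch64__)
static inline uint64_t readRawClockTicks(void) {
    uint64_t ticks;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks));  /* ARM virtual counter */
    return ticks;
}
#endif
```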
Performance implications
With clock_gettime, timer processing is improved about 18% due to streamlined logic in ae.c.
Test details
Simple test program creating an event loop and 3 timers - each firing with 0ms delay. Linux host, with processor: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. The events each incremented a counter. A beforeSleep proc checked elapsed time (using the monotonic clock) until 5 seconds had passed. Compared average count of timers fired.
Note: on my test machine, clock_gettime is measured to run about 5% slower than gettimeofday. However it is expected that this varies from system to system and that they are nominally similar in terms of performance.
Future updates
Introduction of this API provides the opportunity to begin replacing ustime and mstime calls throughout the code. With the huge performance difference provided by using the processor clock, this will provide benefits to any feature which relies on time-based code.
For those unfamiliar with the x86 TSC
The x86 maintains a counter which is based on a constant clock rate at the nominal clocking of the processor. Even though individual cores are running at various and changing clock rates, they all share the same constant-rate clock. On multi-CPU systems, all CPUs are sync'd to a common clock source.
Though you can still find articles about older CPUs that did not support a constant-rate TSC, all of the reasonably modern x86 (Intel and AMD) processors support a constant-rate TSC. On Linux, you can check for the constant_tsc flag in /proc/cpuinfo.
Note also, though the TSC runs at a constant rate at the nominal speed of the processor, the processor itself doesn't actually know its speed. The OS will report the speed in /proc/cpuinfo (Linux). On Windows, the QueryPerformanceFrequency function may be used. The ticks-per-second needs to be used to convert the TSC to common units of time (milli-seconds/micro-seconds).
The TSC is not guaranteed to be wall-clock precise - it may suffer from clock-drift and is not corrected by NTP. It's fine for timing short intervals, but (for example) system uptime should be measured with a different mechanism (like gettimeofday).
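For example, a hedged sketch of that ticks-to-time conversion, assuming the ticks-per-second value has already been read from the OS at startup (the 2.5 GHz value below is only a placeholder, and the calibration step itself is omitted):

```c
#include <stdint.h>

/* Constant-rate counter frequency, obtained from the OS at startup
 * (e.g. derived from /proc/cpuinfo); placeholder value only. */
static uint64_t ticks_per_second = 2500000000ULL;

static inline uint64_t ticksToUs(uint64_t ticks) {
    /* Split into whole seconds and remainder to avoid 64-bit overflow
     * when multiplying by 1,000,000. */
    uint64_t secs = ticks / ticks_per_second;
    uint64_t rem  = ticks % ticks_per_second;
    return secs * 1000000ULL + (rem * 1000000ULL) / ticks_per_second;
}
```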