In previous comments, I made the statement:
I would treat making a single system call per ktcp cycle as very fast.
The system call overhead is almost identical to the hw interrupt processing overhead ... and of course the 100 timer calls/second are thought as minuscule overhead.
After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story, it's actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (a call to __udivsi3) plus an additional 32-bit modulo (a call to __umodsi3) are made, along with a couple of other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here, and they loop.)
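For illustration, the shape of the arithmetic that drags those library routines in is roughly this - a sketch only, not the actual sys_gettimeofday source, just assuming the usual jiffies/HZ tick counter:

    /* Sketch: turning a 32-bit tick count into a struct timeval.
       On ia16, gcc compiles the 32-bit divide and modulo below into
       calls to __udivsi3 and __umodsi3 - the slow part in question. */
    tv->tv_sec  = jiffies / HZ;                      /* 32-bit divide */
    tv->tv_usec = (jiffies % HZ) * (1000000UL / HZ); /* 32-bit modulo */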
Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16 second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16-second ticks by dividing by 62,500:
timeq_t timer_get_time(void)
{
    struct timezone tz;
    struct timeval tv;

    gettimeofday(&tv, &tz);

    /* return 1/16 second ticks, 1,000,000/16 = 62500 */
    return (tv.tv_sec << 4) | ((unsigned long)tv.tv_usec / 62500U);
}
Besides being quite slow, the resolution is poor: 1/16 second = 62.5 msecs, more than 6 times the 10ms clock resolution and well above the 5-10 msecs for an ACK response discussed here.
(I also notice, while writing this, that the second ("&tz") parameter to gettimeofday is not used, and filling it in wastes more time - it should be NULL.)
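That is, assuming the kernel accepts a NULL timezone pointer as the usual gettimeofday does, the call would simply become:

    gettimeofday(&tv, NULL);    /* second argument unused, so skip it */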
In another post I stated:
Whether you choose 1, 10 or 100ms isn't probably a big deal if we stick with the jiffies-related (100ms) hw timer as base real time, but the lower the better
After thinking more on this and looking at the timer_get_time routine, that answer also gives only half the story. The problem is that, for efficiency, any "ktcp" time needs to fit in 32 bits rather than a timeval struct, so that elapsed time is a quick subtraction instead of a trip through gcc's long div/mod routines each time it's used.
Thus the tradeoff: the more resolution given to micro- or milliseconds, the fewer bits left for the "seconds" portion of a timeq_t.
In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. That allows for 2^28 seconds = 256M/86,400 = ~3106 days of timer. (A timeq_t may have to stay valid for the total duration of a TCP connection, I believe.) Moving to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (=4096 seconds) for the seconds part - a little over an hour. So that won't work, and the kernel hw timer has at most 10ms (= 1/HZ = 1/100 sec) resolution anyway.
Assuming we stay with a roughly 10ms timer, which would at least measure ACK responses favorably on a 286/12.5MHz or V20/8MHz, how would the math work, given that the gettimeofday/timer_get_time path is already quite slow?
If we considered 1/256 second resolution (~3.9ms) or 1/128 second resolution (~7.8ms), that would help a lot, and the timer_get_time routine could be made faster (see below). If 1/256 sec is used, that leaves 24 bits for seconds (=16M/86,400 = ~194 days); 1/128 would leave 25 bits, or ~388 days. But we should be sampling at less than half the timer resolution, so I would say 1/256 second resolution is best, and TLVC/ELKS is likely just fine keeping a TCP connection open for 194 days before worrying about RTT miscalculations.
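To make the layout concrete, a 1/256-second timeq_t could be packed like this (hypothetical macro names; this assumes timeq_t stays a 32-bit unsigned type as it is today):

    /* Hypothetical packing for a 1/256-second timeq_t:
       bits 31..8 = seconds (24 bits, ~194 days), bits 7..0 = 1/256 sec */
    #define TIMEQ_PACK(sec, frac256) (((timeq_t)(sec) << 8) | ((frac256) & 0xFF))
    #define TIMEQ_SECS(t)            ((t) >> 8)
    #define TIMEQ_FRAC(t)            ((t) & 0xFF)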
One might ask, why not just use 1/100 second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it exploits the fact that 1,000,000/16 = 62,500, so the microsecond divide produces a value that fits exactly into 4 bits (=1/16 sec). Converting to 1/100-second ticks means dividing by 10,000, which is still quite slow, and the 0-99 result doesn't pack neatly into a 7-bit field. And if we're going to use 10ms resolution, why not add a special new system call that returns jiffies directly? That would be quite a bit faster, and computing elapsed time would only take a compare and a subtract or two.
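As a rough sketch of that idea (the __get_jiffies name and interface are hypothetical - no such call exists today):

    /* Hypothetical: a system call returning the raw 32-bit jiffies counter,
       skipping the timeval conversion entirely. Elapsed time in 10ms ticks
       is then a single unsigned subtraction, which also handles wraparound. */
    unsigned long start, now, elapsed;

    start = __get_jiffies();       /* hypothetical new system call */
    /* ... wait for an ACK, etc ... */
    now = __get_jiffies();
    elapsed = now - start;         /* elapsed time in 1/HZ (10ms) ticks */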
Another thought would be to just use a right shift by 12 (approximating 1M/256 with 2^20/2^8 = 4096) to roughly convert struct timeval microseconds to 1/256 seconds, which would leave the last line of the routine as:
return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);
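Putting the pieces together, the routine might end up looking something like this (a sketch under the 1/256-second assumption, not tested):

    timeq_t timer_get_time(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);    /* NULL: skip the unused timezone */

        /* 1/256-second ticks: approximate usec/3906 with usec >> 12,
           pack 24 bits of seconds above 8 bits of fraction */
        return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);
    }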
I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines actually take, so I'm not fully sure how important all this is, given the kernel's 10ms max resolution.
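One crude way to put a number on it would be to time a large batch of 32-bit divides against gettimeofday itself - a sketch only, with volatile operands just to keep the compiler from folding the divides away; even at 10ms clock resolution, 10,000 iterations should give a usable average:

    #include <stdio.h>
    #include <sys/time.h>

    #define N 10000

    int main(void)
    {
        struct timeval t1, t2;
        volatile unsigned long num = 123456789UL, den = 62500UL, res;
        long usec;
        int i;

        gettimeofday(&t1, NULL);
        for (i = 0; i < N; i++)
            res = num / den;            /* each pass calls __udivsi3 on ia16 */
        gettimeofday(&t2, NULL);

        usec = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
        printf("%d divides in %ld usec (~%ld usec each)\n", N, usec, usec / N);
        return 0;
    }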
What do you think?