
Thoughts on changing ktcp timer resolution #71

@ghaerr

Description


In previous comments, I made the statement:

I would treat making a single system call per ktcp cycle as very fast.
The system call overhead is almost identical to the hw interrupt processing overhead ... and of course the 100 timer calls/second are thought of as minuscule overhead.

After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story; it's actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (calling __udivsi3) and an additional 32-bit modulo (calling __umodsi3) are performed, along with a couple of other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here, and they loop.)
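
Roughly, the shape of that conversion looks like the following (a simplified sketch of the math described above, not the actual kernel source; HZ is the kernel tick rate, the other names are made up):

/* Sketch: building a struct timeval from the HZ-rate tick counter takes
 * one 32-bit divide and one 32-bit modulo, which gcc lowers to looping
 * __udivsi3/__umodsi3 calls on ia16, plus a couple of 32-bit adds.
 */
static void gettimeofday_sketch(struct timeval *tv, unsigned long ticks,
                                unsigned long boot_sec)
{
    tv->tv_sec  = boot_sec + ticks / HZ;            /* __udivsi3 + 32-bit add */
    tv->tv_usec = (ticks % HZ) * (1000000UL / HZ);  /* __umodsi3 */
}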

Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16-second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16-second units by dividing by 62,500:

timeq_t timer_get_time(void)
{
    struct timezone tz;
    struct timeval  tv;

    gettimeofday(&tv, &tz);

    /* return 1/16 second ticks, 1,000,000/16 = 62500*/
    return (tv.tv_sec << 4) | ((unsigned long)tv.tv_usec / 62500U);
}

Besides being quite slow, 1/16 second = 62.5 msecs of resolution, more than six times coarser than the 10ms clock resolution and the 5-10 msecs for an ACK response discussed here.

(I also notice in the above code, as I'm writing this, that the second ("&tz") parameter to gettimeofday is not used and wastes more time filling it in - it should be NULL.)
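
Assuming the libc gettimeofday here accepts a NULL timezone pointer like the standard interface does, that call would simply become:

    gettimeofday(&tv, NULL);    /* don't waste time filling an unused struct timezone */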

In another post I stated:

Whether you choose 1, 10 or 100ms probably isn't a big deal if we stick with the jiffies-related (10ms) hw timer as base real time, but the lower the better

After thinking more on this and looking at the timer_get_time routine - that answer also tells only half the story. The problem is that, for efficiency reasons, any "ktcp" time needs to fit in 32 bits, not a timeval struct, so that elapsed time can be computed with a quick subtraction rather than by calling the gcc long div/mod routines each time it's used.

Thus the tradeoff: the more resolution given to micro- or milliseconds, the fewer bits left for the "seconds" portion of a timeq_t.

In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. This allows for 2^28 = 256M seconds / 86,400 = ~3106 days of timer. (A timeq_t may have to remain valid for the total duration of a TCP connection, I believe.) Moving to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (= 4096 seconds, a little over an hour) for seconds. So that won't work, and the kernel hw timer has at most 10ms (1/HZ, with HZ = 100) resolution anyway.
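
It's worth noting why the packed 32-bit format works at all: since (tv_sec << 4) | sixteenths equals tv_sec*16 + sixteenths, a timeq_t is really just a count of 1/16-second ticks, so elapsed time is a single 32-bit subtract with no div/mod calls. For illustration (the helper name here is made up):

/* Elapsed-time sketch: subtracting two timeq_t values yields elapsed
 * time directly in 1/16-second units - no __udivsi3/__umodsi3 needed.
 */
timeq_t timer_elapsed(timeq_t start)
{
    return timer_get_time() - start;    /* elapsed 1/16-second ticks */
}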

Considering that we stay with an approx ~10ms timer, which would at least measure ACK responses usefully on a 286/12.5MHz or V20/8MHz, how would the math work, given that the gettimeofday/timer_get_time path is already quite slow?

If we considered 1/256-second resolution (~3.9ms) or 1/128-second resolution (~7.8ms), this would help a lot and the timer_get_time routine could be made faster (see below). If 1/256 sec is used, that leaves 24 bits for seconds (16M/86,400 = ~194 days); 1/128 would leave 25 bits, or ~388 days. But we should be sampling with a granularity of less than half the 10ms timer resolution, so I would say 1/256-second resolution would be best, and TLVC/ELKS is likely just fine with an open TCP connection for 194 days before worrying about RTT miscalculations.

One might ask, why not just use 1/100-second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it uses the fact that 1M/16 = 62,500, so the converted microseconds fit exactly into 4 bits (1/16 sec). Converting microseconds to 1/100-second units means dividing by 10,000, which is still quite slow, and the 0-99 result doesn't pack neatly into 7 bits the way 0-15 fills 4. And if we're going to use 10ms resolution, why not make a special new system call and just return jiffies directly, which would be quite a bit faster and would only need a compare and a subtract or two to compute elapsed time.
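
For comparison, a jiffies-based version might collapse to something like this (purely a sketch: no such system call exists today, and the get_jiffies name is invented):

/* Hypothetical syscall wrapper returning the kernel's raw tick counter */
unsigned long get_jiffies(void);

/* Sketch: timeq_t becomes a count of 1/HZ (10ms) ticks; elapsed time is
 * still just a 32-bit subtract, and no divides are needed anywhere.
 */
timeq_t timer_get_time(void)
{
    return get_jiffies();
}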

Another thought would be to just use a right shift by 12 (approximating 1M as 2^20, so 2^20/2^8 = 2^12) to roughly convert struct timeval microseconds to 1/256 seconds; this would leave the last line of the routine as:

    return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);
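
Pulled together, the revised routine might look like this (a sketch of the idea only, untested). Since the shift divides by 4096 rather than the exact 3906.25, the fractional part tops out at 244 instead of 255, so the packed count skips a few values at each second boundary; it stays monotonic and sub-second deltas come out roughly 5% low, which is probably fine for RTT estimation since all samples use the same scale:

/* Sketch: timer_get_time returning ~1/256-second ticks, with no
 * division in user space beyond what gettimeofday itself does.
 */
timeq_t timer_get_time(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);    /* NULL: skip the unused timezone */

    /* return ~1/256 second ticks; >>12 approximates dividing by 3906 (1M/256) */
    return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);
}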

I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines take, so I'm not fully sure just how important all this is, given the kernel's 10ms max resolution.
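
One crude way to put a number on it might be a user-space loop that brackets a large batch of 32-bit divides with gettimeofday calls (a rough, untested sketch; the loop count is arbitrary and the result is only as good as the 10ms clock granularity allows):

/* Rough benchmark sketch: each num/den below is a 32-bit divide that gcc
 * lowers to a __udivsi3 call on ia16.  volatile keeps the compiler from
 * hoisting or folding the divide; the two bracketing gettimeofday calls
 * are amortized over N iterations.
 */
#include <stdio.h>
#include <sys/time.h>

#define N 10000U

int main(void)
{
    volatile unsigned long num = 123456789UL, den = 62500UL, sink;
    struct timeval t0, t1;
    unsigned int i;
    long usec;

    gettimeofday(&t0, NULL);
    for (i = 0; i < N; i++)
        sink = num / den;                   /* one __udivsi3 call per pass */
    gettimeofday(&t1, NULL);

    usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("~%ld usec per divide (including loop overhead)\n", usec / (long)N);
    return 0;
}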

What do you think?

Labels: Discussion - Technical discussion related to other issues