-
Notifications
You must be signed in to change notification settings - Fork 1
Description
A rather interesting issue related to the discussion in #71, the implementation of a precision timer and a fix for divide-by-zero.
The problem occurs only in QEMU, which means that QEMU itself may be the culprit. That said, it takes either a very fast or very recent version (maybe both) of QEMU for the problem to occur. Which makes the analysis even harder. The symptoms simply don't make sense. A classic start of a real rabbit chase.
The problem: A special sequence of actions cause printk to either cause the current function to exit or the system to hang, a classic stack pollution case.
The situation leading to the problem:
- using the value returned from
get_ptime - a long divide involving that returned value
- a call to
printkusing the result of that division - and encapsulating it all, a very fast QEMU - or possibly just a very fast processor, which may be a clue to where the rabbit hides, a race condition??
@ghaerr, it would be interesting if you run this code snippet on ELKS on your QEMU which is likely quite a bit faster than my Mac Mini, which does not show the problem.
Anyway, this code snippet (not doing anything other than demonstrating the problem) demonstrates the problem: Either the statements from the second printk to the end of the function are skipped (like an immediate return) - or we have a hang (the experimental code resides in init/main.c):
{ unsigned long temp;
int dd;
unsigned d5, d6;
d5 = 50;
get_ptime();
temp = get_ptime();
printk("Direct %lu $$", temp); <--- this is printed, value between 3 and 6 (11 on the MacMini)
//temp = 3UL;
dd = temp*838UL/1000UL;
printk("XXXX %lu", temp); <---- this prints too, showing that the div is OK and temp us unaffected
printk("UL dd=%d $$ ", dd); <--- this and any printk below in the same function never show, the function just exits - or the system hangs right here
dd = ((long)temp * 100L)/(long)d5;
printk("Signed dd=%d $$ ", dd);
...
}
Some observations from a seemingly infinite number of testvariants:
- Activating the
//temp = 3UL;line eliminates the problem - which is very interesting because it means that...
int d5;
get_ptime();
d5 = get_ptime();
... is not the same as
d5 = 4351;
... even if that number is exactly what was returned from get_ptime();. So it's not the number returned but the call to get_ptime() that seems to be important. That said,
get_ptime();
get_ptime();
d5 = 4351;
... doesn't cut it either. The error cannot be provoked unless the actual return value from get_ptime() is used. Very strange indeed.
- Still, the variable itself is unimportant - as is its size. Setting
d6 = temp;whered6is anunsigned intand subsequently used6instead oftempin the division still triggers the problem. - Adding delays between the two
get_ptime()calls to increase the # of pticks to > 64k makes no difference, IOW whether get_ptime returns a long or a short doesn't make any difference. - This observation is supported by changing
temptounsigned intand casting the 2nd call toget_ptime(), still no difference. - Changing the
dd(int) in the first divide statement tod6(unsigned) changes behaviour slightly, from 'exit function and continue' to 'hang right there' - which means nothing other than that a 'stray' return happens to 'land' on a slightly different address. - The only way to make the routine 'work' with the return value from
get_ptime()is to AVOID long divide.
d5 = d6*838/1000; /* THIS ONE WORKS */
printk("2: Direct %x $$", d5);
... which obviously is no good if d6 > 78 - but we're closer to a conclusion: Somehow the interaction between get_ptime() and long div (signed and unsigned) cause a stack pollution which in turn cause printk to malfunction on very fast machines - or there is a bug in QEMU 7.2.0.
And the hunt continues.