Skip to content

Subtle combination causing printk to go astray #76

@Mellvik

Description

@Mellvik

A rather interesting issue related to the discussion in #71, the implementation of a precision timer and a fix for divide-by-zero.

The problem occurs only in QEMU, which means that QEMU itself may be the culprit. That said, it takes either a very fast or very recent version (maybe both) of QEMU for the problem to occur. Which makes the analysis even harder. The symptoms simply don't make sense. A classic start of a real rabbit chase.

The problem: A special sequence of actions cause printk to either cause the current function to exit or the system to hang, a classic stack pollution case.

The situation leading to the problem:

  • using the value returned from get_ptime
  • a long divide involving that returned value
  • a call to printk using the result of that division
  • and encapsulating it all, a very fast QEMU - or possibly just a very fast processor, which may be a clue to where the rabbit hides, a race condition??

@ghaerr, it would be interesting if you run this code snippet on ELKS on your QEMU which is likely quite a bit faster than my Mac Mini, which does not show the problem.

Anyway, this code snippet (not doing anything other than demonstrating the problem) demonstrates the problem: Either the statements from the second printk to the end of the function are skipped (like an immediate return) - or we have a hang (the experimental code resides in init/main.c):

{               unsigned long temp;
                int dd;
                unsigned d5, d6;

                d5 = 50;
                get_ptime();
                temp = get_ptime();
                printk("Direct %lu $$", temp);       <--- this is printed, value between 3 and 6 (11 on the MacMini)
                //temp = 3UL;
                dd = temp*838UL/1000UL;
                printk("XXXX %lu", temp);                 <---- this prints too, showing that the div is OK and temp us unaffected
                printk("UL dd=%d $$ ", dd);    <--- this and any printk below in the same function never show, the function just exits - or the system hangs right here
                dd = ((long)temp * 100L)/(long)d5;
                printk("Signed dd=%d $$ ", dd);
                ...
}

Some observations from a seemingly infinite number of testvariants:

  1. Activating the //temp = 3UL; line eliminates the problem - which is very interesting because it means that...
   int d5;
   get_ptime();
   d5 = get_ptime();

... is not the same as

   d5 = 4351;

... even if that number is exactly what was returned from get_ptime();. So it's not the number returned but the call to get_ptime() that seems to be important. That said,

   get_ptime();
   get_ptime();
   d5 = 4351;

... doesn't cut it either. The error cannot be provoked unless the actual return value from get_ptime() is used. Very strange indeed.

  1. Still, the variable itself is unimportant - as is its size. Setting d6 = temp; where d6 is an unsigned int and subsequently use d6 instead of temp in the division still triggers the problem.
  2. Adding delays between the two get_ptime() calls to increase the # of pticks to > 64k makes no difference, IOW whether get_ptime returns a long or a short doesn't make any difference.
  3. This observation is supported by changing temp to unsigned int and casting the 2nd call to get_ptime(), still no difference.
  4. Changing the dd (int) in the first divide statement to d6 (unsigned) changes behaviour slightly, from 'exit function and continue' to 'hang right there' - which means nothing other than that a 'stray' return happens to 'land' on a slightly different address.
  5. The only way to make the routine 'work' with the return value from get_ptime() is to AVOID long divide.
                d5 = d6*838/1000;       /* THIS ONE WORKS */
                printk("2: Direct %x $$", d5);

... which obviously is no good if d6 > 78 - but we're closer to a conclusion: Somehow the interaction between get_ptime() and long div (signed and unsigned) cause a stack pollution which in turn cause printk to malfunction on very fast machines - or there is a bug in QEMU 7.2.0.

And the hunt continues.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions