ARM ARM update
The latest revision of the ARM ARM (or to give it its full title, the ARMv7-AR Architecture Reference Manual) was released this week. (It’s available from the ARM Infocenter website; you need to register as a user on the website to be able to download it, but this is a quick and painless process.) If you’ve got a copy of revision B you should grab rev C now. It folds the previously separate documentation of the virtualization and LPAE extensions into the main architecture specification, and sweeps up a few loose ends like documentation of fused multiply-accumulate.
Working on CPU and device models means spending quite a lot of time looking at hardware reference manuals; you quickly develop an appreciation for the good ones.
Writing a model is creating a from-scratch reimplementation of the hardware. Unfortunately hardware documentation is often written for the device driver writer, not the implementor. You can see this difference of focus most clearly in documents that use phrasing like “you must do X” but which don’t say what happens when you do something else. That’s fine for a device driver writer, who can just stay safely in the area the documentation describes, but to write a good model you also need to know how to behave when the guest OS does do something non-standard. The ARM ARM scores well here, describing both sides of the hardware/software contract rather than merely making rules for software; it also carefully marks out the areas which are implementation defined or unpredictable.
I also like documentation that doesn’t skimp on the details. If I’m halfway through writing some CPU emulation code and I reach a corner case, I want to be able to grab the manual and look up exactly how that corner case needs to be handled. The ARM ARM’s extensive use of pseudocode is a fantastic help here — it acts as a guide for the authors to ensure they really did write down all the corner case behaviours, and it’s a concise and unambiguous way to communicate them. (There’s a price, of course — the rev C is over 2600 pages — but I’ll willingly pay that.)
So it’s cool to see a new revision of an old friend; I wish everybody else’s docs were this good!
Computing a*b+c : how hard can it be?
Recently I’ve been working on adding “fused multiply-accumulate” support to QEMU (‘FMAC’ for short); the patches were accepted upstream earlier this week and will be in upstream QEMU 1.0, and have already appeared in qemu-linaro 2011.10.
FMAC is easy enough to describe: it’s just a floating point operation that computes (a * b) + c without doing the intermediate rounding that would happen if you used separate multiplication and addition instructions. It was added to the IEEE floating point arithmetic standard in IEEE 754-2008, and (as Wikipedia shows) has been implemented in various CPU architectures. It appears in the ARM architecture starting with VFPv4, which is implemented in the Cortex-A5 and Cortex-A15 cores, as the instructions VFMA, VFMS, VFNMA, and VFNMS. (Don’t confuse these with the older VMLA, VNMLA and friends, which do similar operations but do perform the rounding between the multiply and the addition.) Implementation is slightly more complicated than result = (a * b) + c, however, and it takes over 250 lines of code just to implement this for single precision…
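You can see the effect of skipping that intermediate rounding with C99’s standard fma() function. Here’s a minimal demonstration (compile with something like gcc demo.c -lm; the pragma asks the compiler not to fuse a * b + c itself, though not all compilers honour it and gcc users may need -ffp-contract=off instead):

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

/* Ask the compiler not to contract a*b + c into a fused operation
 * behind our backs; support for this pragma varies by compiler.
 */
#pragma STDC FP_CONTRACT OFF

int main(void)
{
    double a = 1.0 + DBL_EPSILON;  /* 1 + 2^-52 */
    double b = 1.0 - DBL_EPSILON;  /* 1 - 2^-52 */
    double c = -1.0;

    /* a*b is exactly 1 - 2^-104, which rounds to 1.0 in double
     * precision, so the separately rounded version gives 0.0...
     */
    printf("separate ops: %g\n", a * b + c);

    /* ...while the fused version keeps the full-width product and
     * returns -2^-104, the exact answer.
     */
    printf("fma:          %g\n", fma(a, b, c));
    return 0;
}
```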
When QEMU generates code for target floating point operations, it doesn’t turn them into floating point instructions for the host CPU; instead they are emulated using integer operations only. This surprises some people, since almost all CPUs support IEEE754 floating point, which specifies bit-exact results. Unfortunately, there is a swathe of special cases where IEEE permits implementation-defined results, or where CPUs have modes which deviate from IEEE for performance reasons; in order to get these right QEMU is forced to do floating point “the hard way”.
FMAC itself provides some good examples of these special cases:
- NaN propagation
IEEE says that if you have an operation which has several inputs which are NaN (“Not a Number”, a special ‘error’ representation generated for things like square roots of negative numbers) then the result will be one of the input NaNs. However, it’s up to the implementation which one it picks, and different CPU architectures make different choices. FMAC is interesting here because it is the only three-input operation in IEEE, and so the “pick a NaN” code for it is completely FMAC specific. (The first sketch after this list shows the shape of that choice.)
- Denormal handling
IEEE defines arithmetic on “denormal” numbers (which are so close to zero that they can’t be represented except with reduced precision). Handling these is slower than dealing with normal numbers, so some CPU architectures allow the user to select a “fast” mode where denormals are “flushed” to zero, either on input or on output or both. Implementations vary a lot in how this is controlled, which kind of flushing is done and whether status flags are raised when flushing occurs. On ARM, flushing of denormals can be enabled via an FPSCR bit; Neon instructions always work in “flush denormals” mode, which means C compilers typically won’t use the Neon variants for floating point arithmetic unless the user enables a “fast math” mode. FMAC is no exception here — there are VFP instructions which are (by default) fully IEEE compliant and Neon versions which flush. (The second sketch after this list shows what input flushing boils down to.)
- Architecture-specific choices of negation
IEEE specifies only the basic (a * b) + c operation; most CPU architectures have extended this to provide some negated variants, but not always in the same way. So for instance x86 provides -(a * b) + c, but PPC has -((a * b) + c), and ARM has (-a * b) + c. It might look at first as if you can implement some of these by providing a small set of core functions and having the architecture-specific code negate inputs and outputs. This unfortunately doesn’t always work for the special case where one of the inputs is a NaN. In some cases (like ARM) the instruction is defined to be implemented as a simple negation followed by a multiply-add — in this case if the negated input was a NaN it will emerge from the other side with its sign bit flipped. But some CPU architectures, like PPC, specify that the whole instruction including negation is a single operation for NaN handling purposes: if the negated input is a NaN it must not have its sign bit flipped. This means the negation needs to be handled in the “core” floating point emulation so that it can be done after the special-case processing of NaN inputs.
- (0 * Inf) + QNaN
IEEE allows an implementation choice in how the special case of (0 * Inf) + QNaN is handled — it will always result in a QNaN, but whether the InvalidOp exception flag should be raised is implementation defined. (If your conceptual model of this operation is that you first do the multiply before you even look at the addend, then you’ll raise InvalidOp because multiplying zero by infinity isn’t a valid operation. If your model is that handling of NaN inputs is the first step before you start computing anything, then you won’t raise InvalidOp because you won’t get to the point of trying to multiply anything.)
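To make the “pick a NaN” problem concrete, here’s a hypothetical sketch of the kind of architecture-parameterised choice involved. The rule names and function are my own invention for illustration; QEMU’s real logic lives in pickNaNMulAdd in fpu/softfloat-specialize.h and also has to care about quiet versus signalling NaNs and default-NaN mode:

```c
#include <stdbool.h>

/* Which of the three muladd inputs' NaNs to propagate; the choice is
 * architecture-specific, so the caller selects a rule per target CPU.
 * Both rules here are invented for illustration, not taken from any
 * particular CPU manual.
 */
typedef enum {
    RULE_OPERAND_ORDER,  /* prefer a, then b, then c */
    RULE_ADDEND_FIRST,   /* prefer the addend c, then a, then b */
} MulAddNaNRule;

/* Return the index (0 = a, 1 = b, 2 = c) of the NaN to propagate;
 * at least one of the inputs is assumed to be a NaN.
 */
static int pick_muladd_nan(bool a_is_nan, bool b_is_nan, bool c_is_nan,
                           MulAddNaNRule rule)
{
    if (rule == RULE_ADDEND_FIRST && c_is_nan) {
        return 2;
    }
    if (a_is_nan) {
        return 0;
    }
    if (b_is_nan) {
        return 1;
    }
    return 2; /* none of a, b is a NaN, so it must be c */
}
```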
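And here’s roughly what flushing a denormal input boils down to for single precision (again a sketch with an invented name; QEMU’s softfloat has a helper along these lines, float32_squash_input_denormal):

```c
#include <stdbool.h>
#include <stdint.h>

/* In flush-to-zero mode, a denormal input (exponent bits all zero,
 * fraction non-zero) is replaced by zero of the same sign before any
 * arithmetic happens. Real code also raises the "input denormal"
 * status flag at this point; this sketch omits that.
 */
static uint32_t flush_input_denormal32(uint32_t f, bool flush_to_zero)
{
    if (flush_to_zero &&
        (f & 0x7f800000) == 0 && (f & 0x007fffff) != 0) {
        return f & 0x80000000;  /* keep only the sign bit */
    }
    return f;
}
```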
Handling of special cases takes up about half the FMAC routine, but even when we’re down to the common case it’s still about a hundred lines of code, because we have to do a multiply, and then either an addition or a subtraction, all with only integer operations. To understand how this works you have to know that a floating point number is represented as (-1)^s * 2^e * m, where s, e and m are the sign bit, exponent and mantissa (fractional part). If you break out the separate fields for the input operands then you can calculate the required result sign, exponent and mantissa, and then pack them back into the floating point binary format. I’m just going to sketch the general idea here (and in particular I have omitted some details such as calculation of the result’s sign bit).
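In C, breaking out those fields for single precision looks roughly like this (a minimal sketch; the helper name is my own, though QEMU’s softfloat has equivalent internal helpers such as extractFloat32Frac and friends):

```c
#include <stdint.h>

/* Unpack a single-precision value from its raw 32-bit pattern:
 * 1 sign bit, 8 exponent bits, 23 fraction bits.
 */
static void float32_fields(uint32_t f, int *sign, int *exp,
                           uint32_t *frac)
{
    *sign = f >> 31;
    *exp = (f >> 23) & 0xff;
    *frac = f & 0x7fffff;
    if (*exp != 0) {
        /* Normal numbers have an implicit leading 1 bit; make it
         * explicit so later arithmetic can treat the mantissa as a
         * plain 24-bit integer.
         */
        *frac |= 0x800000;
    }
    /* (exp == 0 means zero or a denormal; exp == 0xff means infinity
     * or a NaN; real code special-cases all of these.)
     */
}
```

Packing the result goes the other way: shift the mantissa back into place, drop the implicit bit, and OR the three fields back together.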
- Multiplication:
(2^a * b) * (2^c * d) == 2^(a+c) * (b * d), so all we need to do is add the exponents and multiply the mantissae. Well, nearly all: then we need to fix up the result if the most significant bit of the mantissa isn’t in the right place, by shifting the mantissa and decrementing the exponent. This doesn’t affect the answer because it’s doing both a multiplication by two (the shift) and a division by two (the exponent adjustment), and they cancel out. Of course multiplying the two 24-bit mantissae (including the implicit leading bit) gives 48 bits of product, so the following addition or subtraction has to be done at the widened precision, to satisfy the “no intermediate rounding” requirement. (Both this step and the addition step are sketched in C after this list.)
- Addition:
We can adjust one of the inputs (by the same trick of shifting the mantissa and adjusting the exponent) so that both inputs have the same exponent. Then we can calculate the result mantissa by adding the two input mantissae, because
(2^x * y) + (2^x * z) == 2^x * (y + z). As with multiplication, we may then need to fix up the result to put the most significant bit of the mantissa in the right place.
- Subtraction:
This is like addition, although the details are sufficiently different that it has to be done as a separate code path.
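Here are those sketches of the multiply and add steps. Each is shown on its own at single precision’s natural widths rather than chained together as the real FMAC routine must do, the names are my own, exponents are unbiased, and rounding, signs, special cases and large exponent differences are all ignored; real code such as QEMU’s softfloat has to deal with all of those:

```c
#include <stdint.h>

/* Multiplication: (2^a * b) * (2^c * d) == 2^(a+c) * (b * d).
 * xm and ym are 24-bit significands with the implicit bit already
 * made explicit at bit 23.
 */
static void toy_float32_mul(int xe, uint32_t xm, int ye, uint32_t ym,
                            int *re, uint64_t *rm)
{
    *re = xe + ye;
    *rm = (uint64_t)xm * ym;  /* 24 bits x 24 bits -> up to 48 bits */
    /* Normalize: we want the product's top bit at bit 47; if it
     * landed at bit 46 instead, shift left and decrement the
     * exponent (the doubling and halving cancel out).
     */
    if (!(*rm & (1ULL << 47))) {
        *rm <<= 1;
        *re -= 1;
    }
}

/* Addition: align exponents, then use
 * (2^x * y) + (2^x * z) == 2^x * (y + z).
 */
static void toy_float32_add(int xe, uint32_t xm, int ye, uint32_t ym,
                            int *re, uint32_t *rm)
{
    /* 2^e * m == 2^(e+k) * (m >> k): shift the mantissa of the
     * operand with the smaller exponent until the exponents match.
     * (The bits shifted out here are what real code must keep as
     * guard/sticky bits for correct rounding; we just discard them.)
     */
    if (xe < ye) {
        xm >>= ye - xe;
        xe = ye;
    } else {
        ym >>= xe - ye;
    }
    *re = xe;
    *rm = xm + ym;
    /* The sum of two 24-bit mantissae can carry out into bit 24; if
     * so, renormalize by halving the mantissa and incrementing the
     * exponent.
     */
    if (*rm & (1u << 24)) {
        *rm >>= 1;
        *re += 1;
    }
}
```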
Working through this really solidified my understanding of what QEMU’s floating point emulation code was doing, and if you’re curious about what’s actually lurking behind the ‘single’ and ‘double’ data types I’d encourage you to work through the function with a copy of the IEEE specification and try to figure out why it gives you the right answers…
Some QEMU patch statistics
I’m Peter Maydell, and I work in the Linaro Toolchain Working Group on QEMU, the open source emulator, improving its support for ARM processors. This blog is a place for me to talk about what Linaro is doing with QEMU, interesting corners of the ARM architecture from a CPU modelling perspective, neat QEMU tricks, and generally anything in the intersection of “ARM” and “QEMU”. To kick things off I thought I’d start with a kind of high-level summary of what we’ve been doing so far…
At this year’s KVM Forum Anthony Liguori presented some statistics about contributions to QEMU over the last year. Linaro came in third in this league table of ‘corporate’ contributors, which I think is pretty impressive given we only really started seriously upstreaming patches about halfway through the year:
Company | Commits | Percent
--------+---------+--------
Red Hat |     962 |     31%
IBM     |     343 |     11%
Linaro  |     222 |      7%
Siemens |     159 |      5%
SuSE    |     124 |      4%
So what exactly were we doing with those 200+ patches? I did a little analysis of our committed patches, dividing them into four broad categories:
- arm: patches to the core ARM CPU emulation code
- board: patches to ARM boards and devices; this includes fixing bugs in existing board models like versatilepb and also adding new boards (vexpress-a9)
- linux-user: patches to the code that implements QEMU’s “linux user” mode; mostly this is adding support for passing through new system calls
- other: everything else
and generated this graph of committed patches every month by category:
You can see that initially we were doing a lot of ARM emulation code fixes, correcting a lot of bugs in the Neon and VFP emulation code. That activity is now complete, and for the last few months patches in this category have been typically code cleanup or adding new instructions rather than fixing bugs in the existing code. In its place, we’ve been doing more patches to ARM boards and devices. As well as getting an A9 Versatile Express board model upstream in time for QEMU 0.15, we’ve been working recently on getting the OMAP3 support patches from the Meego tree upstream, which you can see pretty clearly in the sudden increase in board patches in August.
And future patches? Well, there’s work to do to support the Cortex-A15, starting with some new user-visible instructions for integer divide and floating-point fused multiply-accumulate. We also need an A15 Versatile Express board model, which will tie in with the work being done for KVM virtualization on A15. And we’re planning to push on with the OMAP3 upstreaming as a “background” task. So I reckon Linaro will still be a strong contender in next year’s “patches league table”…
