ARM ARM update
The latest revision of the ARM ARM (or to give it its full title, the ARMv7-AR Architecture Reference Manual) was released this week. (It’s available from the ARM Infocenter website; you need to register as a user on the website to be able to download it, but this is a quick and painless process.) If you’ve got a copy of revision B you should grab rev C now. It folds the previously separate documentation of the virtualization and LPAE extensions into the main architecture specification, and sweeps up a few loose ends like documentation of fused multiply-accumulate.
Working on CPU and device models means spending quite a lot of time looking at hardware reference manuals; you quickly develop an appreciation for the good ones.
Writing a model is creating a from-scratch reimplementation of the hardware. Unfortunately hardware documentation is often written for the device driver writer, not the implementor. You can see this difference of focus most clearly in documents that use phrasing like “you must do X” but which don’t say what happens when you do something else. That’s fine for a device driver writer, who can just stay safely in the area the documentation describes, but to write a good model you also need to know how to behave when the guest OS does do something non-standard. The ARM ARM scores well here, describing both sides of the hardware/software contract rather than merely making rules for software; it also carefully marks out the areas which are implementation defined or unpredictable.
I also like documentation that doesn’t skimp on the details. If I’m halfway through writing some CPU emulation code and I reach a corner case, I want to be able to grab the manual and look up exactly how that corner case needs to be handled. The ARM ARM’s extensive use of pseudocode is a fantastic help here — it acts as a guide for the authors to ensure they really did write down all the corner case behaviours, and it’s a concise and unambiguous way to communicate them. (There’s a price, of course — the rev C is over 2600 pages — but I’ll willingly pay that.)
So it’s cool to see a new revision of an old friend; I wish everybody else’s docs were this good!
Computing a*b+c : how hard can it be?
Recently I’ve been working on adding “fused multiply-accumulate” support to QEMU (‘FMAC’ for short); the patches were accepted upstream earlier this week and will be in upstream QEMU 1.0, and have already appeared in qemu-linaro 2011.10.
FMAC is easy enough to describe: it’s just a floating point operation that computes (a * b) + c without doing the intermediate rounding that would happen if you used separate multiplication and addition instructions. It was added to the IEEE floating point arithmetic standard in IEEE 754-2008, and (as Wikipedia shows) has been implemented in various CPU architectures. It appears in the ARM architecture starting with VFPv4, which is implemented in the Cortex-A5 and Cortex-A15 cores, as the instructions VFMA, VFMS, VFNMA, and VFNMS. (Don’t confuse these with the older VMLA, VNMLA and friends, which do similar operations but do perform the rounding between the multiply and the addition.) Implementation is slightly more complicated than result = (a * b) + c, however, and it takes over 250 lines of code just to implement this for single precision…
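You can see the effect of skipping that intermediate rounding with C99’s standard fma() function. Here’s a minimal demonstration (compile with something like gcc demo.c -lm; the pragma asks the compiler not to fuse a * b + c itself, though not all compilers honour it and gcc users may need -ffp-contract=off instead):

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

/* Ask the compiler not to contract a*b + c into a fused operation
 * behind our backs; support for this pragma varies by compiler.
 */
#pragma STDC FP_CONTRACT OFF

int main(void)
{
    double a = 1.0 + DBL_EPSILON;  /* 1 + 2^-52 */
    double b = 1.0 - DBL_EPSILON;  /* 1 - 2^-52 */
    double c = -1.0;

    /* a*b is exactly 1 - 2^-104, which rounds to 1.0 in double
     * precision, so the separately rounded version gives 0.0...
     */
    printf("separate ops: %g\n", a * b + c);

    /* ...while the fused version keeps the full-width product and
     * returns -2^-104, the exact answer.
     */
    printf("fma:          %g\n", fma(a, b, c));
    return 0;
}
```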
When QEMU generates code for target floating point operations, it doesn’t turn them into floating point instructions for the host CPU; instead they are emulated using integer operations only. This surprises some people, since almost all CPUs support IEEE754 floating point, which specifies bit-exact results. Unfortunately, there is a swathe of special cases where IEEE permits implementation-defined results, or where CPUs have modes which deviate from IEEE for performance reasons; in order to get these right QEMU is forced to do floating point “the hard way”.
FMAC itself provides some good examples of these special cases:
- NaN propagation
IEEE says that if you have an operation which has several inputs which are NaN (“Not a Number”, a special ‘error’ representation generated for things like square roots of negative numbers) then the result will be one of the input NaNs. However, it’s up to the implementation which one it picks, and different CPU architectures make different choices. FMAC is interesting here because it is the only three-input operation in IEEE, and so the “pick a NaN” code for it is completely FMAC specific. (The first sketch after this list shows the shape of that choice.)
- Denormal handling
IEEE defines arithmetic on “denormal” numbers (which are so close to zero that they can’t be represented except with reduced precision). Handling these is slower than dealing with normal numbers, so some CPU architectures allow the user to select a “fast” mode where denormals are “flushed” to zero, either on input or on output or both. Implementations vary a lot in how this is controlled, which kind of flushing is done and whether status flags are raised when flushing occurs. On ARM, flushing of denormals can be enabled via an FPSCR bit; Neon instructions always work in “flush denormals” mode, which means C compilers typically won’t use the Neon variants for floating point arithmetic unless the user enables a “fast math” mode. FMAC is no exception here — there are VFP instructions which are (by default) fully IEEE compliant and Neon versions which flush. (The second sketch after this list shows what input flushing boils down to.)
- Architecture-specific choices of negation
IEEE specifies only the basic (a * b) + c operation; most CPU architectures have extended this to provide some negated variants, but not always in the same way. So for instance x86 provides -(a * b) + c, but PPC has -((a * b) + c), and ARM has (-a * b) + c. It might look at first as if you can implement some of these by providing a small set of core functions and having the architecture-specific code negate inputs and outputs. This unfortunately doesn’t always work for the special case where one of the inputs is a NaN. In some cases (like ARM) the instruction is defined to be implemented as a simple negation followed by a multiply-add — in this case if the negated input was a NaN it will emerge from the other side with its sign bit flipped. But some CPU architectures, like PPC, specify that the whole instruction including negation is a single operation for NaN handling purposes: if the negated input is a NaN it must not have its sign bit flipped. This means the negation needs to be handled in the “core” floating point emulation so that it can be done after the special-case processing of NaN inputs.
- (0 * Inf) + QNaN
IEEE allows an implementation choice in how the special case of (0 * Inf) + QNaN is handled — it will always result in a QNaN, but whether the InvalidOp exception flag should be raised is implementation defined. (If your conceptual model of this operation is that you first do the multiply before you even look at the addend, then you’ll raise InvalidOp because multiplying zero by infinity isn’t a valid operation. If your model is that handling of NaN inputs is the first step before you start computing anything, then you won’t raise InvalidOp because you won’t get to the point of trying to multiply anything.)
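To make the “pick a NaN” problem concrete, here’s a hypothetical sketch of the kind of architecture-parameterised choice involved. The rule names and function are my own invention for illustration; QEMU’s real logic lives in pickNaNMulAdd in fpu/softfloat-specialize.h and also has to care about quiet versus signalling NaNs and default-NaN mode:

```c
#include <stdbool.h>

/* Which of the three muladd inputs' NaNs to propagate; the choice is
 * architecture-specific, so the caller selects a rule per target CPU.
 * Both rules here are invented for illustration, not taken from any
 * particular CPU manual.
 */
typedef enum {
    RULE_OPERAND_ORDER,  /* prefer a, then b, then c */
    RULE_ADDEND_FIRST,   /* prefer the addend c, then a, then b */
} MulAddNaNRule;

/* Return the index (0 = a, 1 = b, 2 = c) of the NaN to propagate;
 * at least one of the inputs is assumed to be a NaN.
 */
static int pick_muladd_nan(bool a_is_nan, bool b_is_nan, bool c_is_nan,
                           MulAddNaNRule rule)
{
    if (rule == RULE_ADDEND_FIRST && c_is_nan) {
        return 2;
    }
    if (a_is_nan) {
        return 0;
    }
    if (b_is_nan) {
        return 1;
    }
    return 2; /* none of a, b is a NaN, so it must be c */
}
```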
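And here’s roughly what flushing a denormal input boils down to for single precision (again a sketch with an invented name; QEMU’s softfloat has a helper along these lines, float32_squash_input_denormal):

```c
#include <stdbool.h>
#include <stdint.h>

/* In flush-to-zero mode, a denormal input (exponent bits all zero,
 * fraction non-zero) is replaced by zero of the same sign before any
 * arithmetic happens. Real code also raises the "input denormal"
 * status flag at this point; this sketch omits that.
 */
static uint32_t flush_input_denormal32(uint32_t f, bool flush_to_zero)
{
    if (flush_to_zero &&
        (f & 0x7f800000) == 0 && (f & 0x007fffff) != 0) {
        return f & 0x80000000;  /* keep only the sign bit */
    }
    return f;
}
```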
Handling of special cases takes up about half the FMAC routine, but even when we’re down to the common case it’s still about a hundred lines of code, because we have to do a multiply, and then either an addition or a subtraction, all with only integer operations. To understand how this works you have to know that a floating point number is represented as (-1)^s * 2^e * m, where s, e and m are the sign bit, exponent and mantissa (fractional part). If you break out the separate fields for the input operands then you can calculate the required result sign, exponent and mantissa, and then pack them back into the floating point binary format. I’m just going to sketch the general idea here (and in particular I have omitted some details such as calculation of the result’s sign bit).
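In C, breaking out those fields for single precision looks roughly like this (a minimal sketch; the helper name is my own, though QEMU’s softfloat has equivalent internal helpers such as extractFloat32Frac and friends):

```c
#include <stdint.h>

/* Unpack a single-precision value from its raw 32-bit pattern:
 * 1 sign bit, 8 exponent bits, 23 fraction bits.
 */
static void float32_fields(uint32_t f, int *sign, int *exp,
                           uint32_t *frac)
{
    *sign = f >> 31;
    *exp = (f >> 23) & 0xff;
    *frac = f & 0x7fffff;
    if (*exp != 0) {
        /* Normal numbers have an implicit leading 1 bit; make it
         * explicit so later arithmetic can treat the mantissa as a
         * plain 24-bit integer.
         */
        *frac |= 0x800000;
    }
    /* (exp == 0 means zero or a denormal; exp == 0xff means infinity
     * or a NaN; real code special-cases all of these.)
     */
}
```

Packing the result goes the other way: shift the mantissa back into place, drop the implicit bit, and OR the three fields back together.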
- Multiplication:
(2^a * b) * (2^c * d) == 2^(a+c) * (b * d), so all we need to do is add the exponents and multiply the mantissae. Well, nearly all: then we need to fix up the result if the most significant bit of the mantissa isn’t in the right place, by shifting the mantissa and decrementing the exponent. This doesn’t affect the answer because it’s doing both a multiplication by two (the shift) and a division by two (the exponent adjustment), and they cancel out. Of course multiplying the two 24-bit mantissae (including the implicit leading bit) gives 48 bits of product, so the following addition or subtraction has to be done at the widened precision, to satisfy the “no intermediate rounding” requirement. (Both this step and the addition step are sketched in C after this list.)
- Addition:
We can adjust one of the inputs (by the same trick of shifting the mantissa and adjusting the exponent) so that both inputs have the same exponent. Then we can calculate the result mantissa by adding the two input mantissae, because
(2^x * y) + (2^x * z) == 2^x * (y + z). As with multiplication, we may then need to fix up the result to put the most significant bit of the mantissa in the right place.
- Subtraction:
This is like addition, although the details are sufficiently different that it has to be done as a separate code path.
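Here are those sketches of the multiply and add steps. Each is shown on its own at single precision’s natural widths rather than chained together as the real FMAC routine must do, the names are my own, exponents are unbiased, and rounding, signs, special cases and large exponent differences are all ignored; real code such as QEMU’s softfloat has to deal with all of those:

```c
#include <stdint.h>

/* Multiplication: (2^a * b) * (2^c * d) == 2^(a+c) * (b * d).
 * xm and ym are 24-bit significands with the implicit bit already
 * made explicit at bit 23.
 */
static void toy_float32_mul(int xe, uint32_t xm, int ye, uint32_t ym,
                            int *re, uint64_t *rm)
{
    *re = xe + ye;
    *rm = (uint64_t)xm * ym;  /* 24 bits x 24 bits -> up to 48 bits */
    /* Normalize: we want the product's top bit at bit 47; if it
     * landed at bit 46 instead, shift left and decrement the
     * exponent (the doubling and halving cancel out).
     */
    if (!(*rm & (1ULL << 47))) {
        *rm <<= 1;
        *re -= 1;
    }
}

/* Addition: align exponents, then use
 * (2^x * y) + (2^x * z) == 2^x * (y + z).
 */
static void toy_float32_add(int xe, uint32_t xm, int ye, uint32_t ym,
                            int *re, uint32_t *rm)
{
    /* 2^e * m == 2^(e+k) * (m >> k): shift the mantissa of the
     * operand with the smaller exponent until the exponents match.
     * (The bits shifted out here are what real code must keep as
     * guard/sticky bits for correct rounding; we just discard them.)
     */
    if (xe < ye) {
        xm >>= ye - xe;
        xe = ye;
    } else {
        ym >>= xe - ye;
    }
    *re = xe;
    *rm = xm + ym;
    /* The sum of two 24-bit mantissae can carry out into bit 24; if
     * so, renormalize by halving the mantissa and incrementing the
     * exponent.
     */
    if (*rm & (1u << 24)) {
        *rm >>= 1;
        *re += 1;
    }
}
```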
Working through this really solidified my understanding of what QEMU’s floating point emulation code was doing, and if you’re curious about what’s actually lurking behind the ‘single’ and ‘double’ data types I’d encourage you to work through the function with a copy of the IEEE specification and try to figure out why it gives you the right answers…
Some QEMU patch statistics
I’m Peter Maydell, and I work in the Linaro Toolchain Working Group on QEMU, the open source emulator, improving its support for ARM processors. This blog is a place for me to talk about what Linaro is doing with QEMU, interesting corners of the ARM architecture from a CPU modelling perspective, neat QEMU tricks, and generally anything in the intersection of “ARM” and “QEMU”. To kick things off I thought I’d start with a kind of high-level summary of what we’ve been doing so far…
At this year’s KVM Forum Anthony Liguori presented some statistics about contributions to QEMU over the last year. Linaro came in third in this league table of ‘corporate’ contributors, which I think is pretty impressive given we only really started seriously upstreaming patches about halfway through the year:
Company | Commits | Percent
--------+---------+--------
Red Hat |     962 |     31%
IBM     |     343 |     11%
Linaro  |     222 |      7%
Siemens |     159 |      5%
SuSE    |     124 |      4%
So what exactly were we doing with those 200+ patches? I did a little analysis of our committed patches, dividing them into four broad categories:
- arm: patches to the core ARM CPU emulation code
- board: patches to ARM boards and devices; this includes fixing bugs in existing board models like versatilepb and also adding new boards (vexpress-a9)
- linux-user: patches to the code that implements QEMU’s “linux user” mode; mostly this is adding support for passing through new system calls
- other: everything else
and generated this graph of committed patches every month by category:
You can see that initially we were doing a lot of ARM emulation code fixes, correcting a lot of bugs in the Neon and VFP emulation code. That activity is now complete, and for the last few months patches in this category have been typically code cleanup or adding new instructions rather than fixing bugs in the existing code. In its place, we’ve been doing more patches to ARM boards and devices. As well as getting an A9 Versatile Express board model upstream in time for QEMU 0.15, we’ve been working recently on getting the OMAP3 support patches from the Meego tree upstream, which you can see pretty clearly in the sudden increase in board patches in August.
And future patches? Well, there’s work to do to support the Cortex-A15, starting with some new user-visible instructions for integer divide and floating-point fused multiply-accumulate. We also need an A15 Versatile Express board model, which will tie in with the work being done for KVM virtualization on A15. And we’re planning to push on with the OMAP3 upstreaming as a “background” task. So I reckon Linaro will still be a strong contender in next year’s “patches league table”…
