
# Verifying Mathematical Calculations

Ensure the accuracy of your math operations in 64-bit architecture.

## Overview

Integer math can produce different results in the 64-bit runtime, where types such as `long` are wider than they are in the 32-bit runtime. Review the accuracy of the results of any calculations your app performs. Check signed value results to ensure that they're correct for their operation and operands. Verify that your bit mask code doesn't make assumptions about type size.

### Verify the Accuracy of Signed Math Operations

C and similar languages use a set of sign extension rules to determine whether to treat the top bit in an integer as a sign bit when the value is assigned to a variable of larger width. The sign extension rules are as follows:

1. Unsigned values are zero extended (not sign extended) when promoted to a larger type.

2. Signed values are always sign extended when promoted to a larger type, even if the resulting type is unsigned.

3. Constants (unless modified by a suffix, such as `0x8L`) are treated as the smallest size that will hold the value. Hexadecimal numbers may be treated by the compiler as `int`, `long`, and `long long` types and may be either `signed` or `unsigned` types. Decimal numbers are always treated as `signed` types.

4. The sum of a signed value and an unsigned value of the same size is an unsigned value.
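The interaction of rules 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not a listing from the article: the function names are hypothetical, `int` is assumed to be 32 bits wide, and `int64_t` stands in for the 64-bit runtime's `long` so that the behavior doesn't depend on the host.

```c
#include <stdint.h>

/* Hypothetical helper: mimics assigning (signed + unsigned) to a 64-bit long. */
int64_t add_then_widen(int32_t a, uint32_t b)
{
    /* a is converted to unsigned (rule 4), the sum wraps modulo 2^32,
       and the unsigned 32-bit result is zero extended (rule 1). */
    return a + b;
}

/* Hypothetical helper: mimics assigning the same sum to a 32-bit long. */
int32_t add_then_keep_32(int32_t a, uint32_t b)
{
    /* Converting an out-of-range unsigned value to a signed type is
       implementation-defined; two's-complement compilers keep the bit
       pattern, so 0xffffffff reads back as -1. */
    return (int32_t)(a + b);
}
```

Calling `add_then_widen(-2, 1u)` yields 4294967295, while `add_then_keep_32(-2, 1u)` yields -1 on mainstream compilers.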

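Consider code that adds a signed `int` to an `unsigned int` variable `b`, where the mathematical sum is -1, and assigns the result to a `long` integer. When this code is executed in the 32-bit runtime, the result is -1 (`0xffffffff`). When the same code is run in the 64-bit runtime, the result is 4294967295 (`0x00000000ffffffff`), which is not the intended value.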

To understand why this happens, consider that when these numbers are added, the signed value plus the unsigned value results in an unsigned value (rule 4). That result is promoted to a larger type, but because the result is unsigned, the promotion zero extends it instead of sign extending it (rule 1).

To fix this problem, cast `b` to a `long` integer. This cast forces the non-sign-extended promotion of `b` to a 64-bit type before the addition operation, thus forcing the signed integer to be promoted (in a signed fashion) to match. With that change, the result is the expected -1.
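A minimal sketch of this fix, assuming a 32-bit `int` and using `int64_t` as a stand-in for the 64-bit runtime's `long`. The function name and the parameter name `a` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: widens the unsigned operand before adding. */
int64_t widen_then_add(int32_t a, uint32_t b)
{
    /* The cast zero extends b to 64 bits first (rule 1). a is then
       sign extended to match (rule 2), so the addition is a signed
       64-bit addition. */
    return a + (int64_t)b;
}
```

With this ordering, `widen_then_add(-2, 1u)` returns -1, and large unsigned operands no longer wrap at 32 bits.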

Consider code in which bit shifting moves the value of a `short` integer `a` (here, 1) left by 31 bits into a different position in a `long` integer `b`, which is then copied into `c` and printed with `printf`. The expected result from the `printf` (and the result from a 32-bit executable) is `0x80000000`. The result generated by a 64-bit executable, however, is `0xffffffff80000000`.

There are two reasons for this result. First, when the left-shift operator `<<` is invoked, the variable `a` is promoted to a variable of type `int`. Because all values of a `short` integer can fit into a signed `int` type, the result of this promotion is signed. Second, when the left-shift completes, the result is stored in a `long` integer. Thus, the 32-bit signed value represented by `(a << 31)` is sign extended (rule 2) when it's promoted to a 64-bit value (even though the resulting type is unsigned).
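The second step, widening a signed 32-bit value into an unsigned 64-bit type, can be sketched in isolation. The function names are hypothetical. Because shifting a 1 into the sign bit of an `int` is formally undefined in ISO C, this sketch passes the bit pattern `0x80000000` directly (as `INT32_MIN`) rather than computing it with a shift.

```c
#include <stdint.h>

/* Hypothetical helper: the source is signed, so the conversion sign
   extends (rule 2), even though the destination type is unsigned. */
uint64_t widen_signed_32(int32_t shifted)
{
    return (uint64_t)shifted;
}

/* Hypothetical helper: the source is unsigned, so the conversion
   zero extends (rule 1). */
uint64_t widen_unsigned_32(uint32_t shifted)
{
    return (uint64_t)shifted;
}
```

Here `widen_signed_32(INT32_MIN)` produces `0xffffffff80000000`, whereas `widen_unsigned_32(0x80000000u)` produces `0x80000000`.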

The solution is to cast the initial value to a `long` integer before the shift. The `short` integer is then promoted only once, directly to a 64-bit type (when compiled as a 64-bit executable).
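A minimal sketch of casting before the shift, using `uint64_t` as a stand-in for a 64-bit unsigned `long`. The function name and the `count` parameter are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: widens the operand, then shifts.
   count must be less than 64. */
uint64_t shift_wide(unsigned short a, unsigned int count)
{
    /* The cast happens first, so the shift is carried out on an
       unsigned 64-bit value and no 32-bit signed intermediate exists. */
    return (uint64_t)a << count;
}
```

With the cast in place, `shift_wide(1, 31)` gives `0x80000000`, and shifts past bit 31 also behave as expected.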

When working with bit masks with 64-bit values, follow these tips to avoid inadvertently getting 32-bit values.

Don't assume that a data type has a particular length. If you're shifting through the bits stored in a variable of type `long` integer, use the `LONG_BIT` value to determine the number of bits. Shifting by a count that equals or exceeds the width of the variable's type is undefined behavior in C, and the result you get in practice is architecture dependent.
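As an illustration of sizing a loop with `LONG_BIT`, the sketch below scans a `long`-sized value for its highest set bit. The function name is hypothetical. `LONG_BIT` comes from `<limits.h>` on POSIX systems but isn't part of ISO C, so the sketch assumes a fallback computed from `sizeof(long)` and `CHAR_BIT` where the macro is missing.

```c
#include <limits.h>

/* Assumed fallback for toolchains that don't expose LONG_BIT. */
#ifndef LONG_BIT
#define LONG_BIT ((int)(sizeof(long) * CHAR_BIT))
#endif

/* Hypothetical helper: returns the index of the highest set bit,
   or -1 if no bit is set. */
int highest_set_bit(unsigned long value)
{
    /* The shift count never reaches LONG_BIT, so every shift is defined. */
    for (int i = LONG_BIT - 1; i >= 0; i--) {
        if (value & (1UL << i)) {
            return i;
        }
    }
    return -1;
}
```

Because the loop bound follows the type, the same source scans 32 bits in the 32-bit runtime and 64 bits in the 64-bit runtime.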

Use inverted bit masks, if needed. Be careful when using bit masks with long integers, because the width differs between 32-bit and 64-bit runtimes. There are two ways to create a bit mask, depending on whether you want it to be zero extended or one extended:

- If you want the bit mask value to contain zeros in the upper 32 bits in the 64-bit runtime, the usual fixed-width bit mask (for example, `0xfffffffc`) works as expected, because it's extended in an unsigned fashion to a 64-bit quantity.

- If you want the bit mask value to contain ones in the upper bits, write the bit mask as the bitwise inverse of its inverse (for example, `~0x3` instead of `0xfffffffc`).

Note that in the 64-bit runtime, the upper bits of a bit mask written as an inverse are all ones.
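The two kinds of mask can be compared in a short sketch. The function names and the mask values are assumptions chosen for illustration, `int` is assumed to be 32 bits wide, and `int64_t` stands in for the 64-bit runtime's `long`.

```c
#include <stdint.h>

/* Hypothetical helper: fixed-width mask. The constant 0xfffffffc has
   type unsigned int, so it is zero extended to 0x00000000fffffffc. */
int64_t mask_zero_extended(int64_t value)
{
    int64_t mask = 0xfffffffc;
    return value & mask;
}

/* Hypothetical helper: inverted mask. ~0x3 is the int value -4, so it
   is sign extended to 0xfffffffffffffffc. */
int64_t mask_one_extended(int64_t value)
{
    int64_t mask = ~0x3;
    return value & mask;
}
```

For the input `0x100000007`, the zero-extended mask discards the upper 32 bits and returns `0x4`, while the one-extended mask clears only the two low bits and returns `0x100000004`.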