Verifying Mathematical Calculations

Ensure the accuracy of your math operations in the 64-bit runtime.


Some integer types are wider in the 64-bit runtime, which can change the results of math operations. Review the accuracy of the results of any calculations your app performs. Check signed value results to ensure that they're correct for their operation and operands. Verify that your bit mask code doesn't make assumptions about type size.

Verify the Accuracy of Signed Math Operations

C and similar languages use a set of sign extension rules to determine whether to treat the top bit in an integer as a sign bit when the value is assigned to a variable of larger width. The sign extension rules are as follows:

  1. Unsigned values are zero extended (not sign extended) when promoted to a larger type.

  2. Signed values are always sign extended when promoted to a larger type, even if the resulting type is unsigned.

  3. Constants (unless modified by a suffix, such as 0x8L) are treated as the smallest size that will hold the value. Hexadecimal numbers may be treated by the compiler as int, long, or long long types and may be either signed or unsigned types. Decimal numbers are always treated as signed types.

  4. The sum of a signed value and an unsigned value of the same size is an unsigned value.
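Rules 1 and 2 can be checked with a small sketch. The function names below are hypothetical and chosen only for illustration, and the sketch assumes that int is 32 bits wide, as it is in both runtimes.

```c
/* Rule 1: an unsigned value is zero extended when it is widened. */
long long widen_unsigned(unsigned int value)
{
    return value;
}

/* Rule 2: a signed value is sign extended when it is widened,
   even though the destination type here is unsigned. */
unsigned long long widen_signed(int value)
{
    return value;
}
```

Under those assumptions, widen_unsigned(0xffffffffu) yields 4294967295, while widen_signed(-1) yields 0xffffffffffffffff.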

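For example, the following code adds a signed integer to an unsigned integer and stores the sum in a wider type:

int a = -2;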
unsigned int b = 1;
long c = a + b;
long long d = c; // To get a consistent size for printing.

printf("%lld\n", d);

When this code is executed in the 32-bit runtime, the result is -1 (0xffffffff). When the same code is run in the 64-bit runtime, the result is 4294967295 (0x00000000ffffffff), which is probably not what you intended.

To understand why this happens, consider that when these numbers are added, a is converted to unsigned int, so the signed value plus the unsigned value results in an unsigned value (rule 4), here the 32-bit value 0xffffffff. That result is promoted to a larger type, but because it's unsigned, the promotion zero extends it instead of sign extending it (rule 1).

To fix this problem, cast b to a long integer. This cast forces the non-sign-extended promotion of b to a 64-bit type before the addition operation, thus forcing the signed integer to be promoted (in a signed fashion) to match. With that change, the result is the expected -1.
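A minimal sketch of that fix follows, wrapped in a hypothetical helper function (the name isn't part of the original example) so the result can be inspected:

```c
/* Hypothetical helper: adds a signed int and an unsigned int
   without losing the sign of the result. */
long long add_with_cast(int a, unsigned int b)
{
    /* Casting b to long widens it before the addition, so a is
       then converted to long (with sign extension) to match. */
    long c = a + (long)b;
    return c;
}
```

With a set to -2 and b set to 1, as in the example above, add_with_cast returns -1 whether long is 32 or 64 bits wide.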

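Sign extension can also produce unexpected results when you shift bits. Consider the following code:

unsigned short a = 1;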
unsigned long b = (a << 31);
unsigned long long c = b;

printf("%llx\n", c);

In the above code, bit shifting is used to move the value of a into a different position in b, which is copied into c. The expected result from the printf (and the result from a 32-bit executable) is 0x80000000. The result generated by a 64-bit executable, however, is 0xffffffff80000000.

There are two reasons for this result. First, when the left-shift operator << is invoked, the variable a is promoted to a variable of type int. Because all values of a short integer can fit into a signed int type, the result of this promotion is signed. Second, when the left-shift completes, the result is stored in a long integer. Thus, the 32-bit signed value represented by (a << 31) is sign extended (rule 2) when it's promoted to a 64-bit value (even though the resulting type is unsigned).

The solution is to cast the initial value to a long integer before the shift. The short integer is promoted only once—this time, to a 64-bit type (when compiled as a 64-bit executable).
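The fix might look like the following sketch. The function name is hypothetical, and the cast uses unsigned long instead of a plain long. That choice is an assumption made here so that shifting into bit 31 stays well defined even when long is only 32 bits wide.

```c
/* Hypothetical helper: moves a into bit position 31 and above. */
unsigned long long shift_with_cast(unsigned short a)
{
    /* The cast widens a before the shift, so no 32-bit signed
       intermediate value exists to be sign extended later. */
    unsigned long b = ((unsigned long)a << 31);
    unsigned long long c = b;
    return c;
}
```

For a equal to 1, this returns 0x80000000 whether long is 32 or 64 bits wide.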

Check Your Code for Assumptions About Type Size

When working with bit masks with 64-bit values, follow these tips to avoid inadvertently getting 32-bit values.

Don’t assume that a data type has a particular length. If you're shifting through the bits stored in a variable of type long integer, use the LONG_BIT value to determine the number of bits. Shifting by a count that equals or exceeds the width of the variable is undefined in C, and the actual result depends on the architecture.
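As an illustration, the sketch below walks every bit of an unsigned long using LONG_BIT as the loop bound. The function name is hypothetical. LONG_BIT comes from limits.h on systems that expose the X/Open limits; the fallback definition is an assumption added here for toolchains that don't define it, and it presumes the type has no padding bits.

```c
#include <limits.h>

/* Fallback (an assumption of this sketch) for toolchains whose
   limits.h doesn't expose LONG_BIT by default. */
#ifndef LONG_BIT
#define LONG_BIT ((int)(CHAR_BIT * sizeof(long)))
#endif

/* Counts the bits that are set in value, using LONG_BIT
   instead of a hard-coded 32 or 64 as the loop bound. */
int count_set_bits(unsigned long value)
{
    int count = 0;
    for (int i = 0; i < LONG_BIT; i++) {
        if (value & (1UL << i)) {
            count++;
        }
    }
    return count;
}
```

Because the shift count never reaches LONG_BIT, each shift stays within the width of the type in either runtime.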

Use inverted bit masks, if needed. Be careful when using bit masks with long integers, because the width differs between 32-bit and 64-bit runtimes. There are two ways to create a bit mask, depending on whether you want it to be zero extended or one extended:

  • If you want the bit mask value to contain zeros in the upper 32 bits in the 64-bit runtime, the usual fixed-width bit mask works as expected, because it's extended in an unsigned fashion to a 64-bit quantity.

  • If you want the bit mask value to contain ones in the upper bits, write the bit mask as the bitwise inverse of its inverse.

long function_name(long value)
{
    // Use the complement (~) operator to get ones instead of zeros.
    // Mask will be 0xfffffffc in the 32-bit runtime,
    //   or 0xfffffffffffffffc in the 64-bit runtime.
    long mask = ~0x3;
    return (value & mask);
}

Note that in the 64-bit runtime, the upper bits in the bit mask are all ones.
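To contrast the two kinds of mask, here is a short sketch with hypothetical function names. Both functions clear the low two bits; they differ only in what happens to bit 32 and above when long is 64 bits wide.

```c
/* Fixed-width mask: 0xfffffffc is an unsigned 32-bit constant, so it
   is zero extended and clears the upper bits of a 64-bit value. */
unsigned long mask_zero_extended(unsigned long value)
{
    return value & 0xfffffffc;
}

/* Inverted mask: ~0x3 is a negative int, so it is sign extended
   and preserves the upper bits of a 64-bit value. */
unsigned long mask_ones_extended(unsigned long value)
{
    return value & ~0x3;
}
```

In the 64-bit runtime, passing a value with every bit set gives 0x00000000fffffffc from the first function and 0xfffffffffffffffc from the second.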

See Also

Performance and Accuracy

Optimizing Memory Performance

Measure the impact of the 64-bit runtime on your app's memory usage.