
This document is a Mac OS X manual page. Manual pages are a command-line technology for providing documentation. You can view these manual pages locally using the man(1) command. These manual pages come from many different sources, and thus have a variety of writing styles. For more information about the manual page format, see the manual page for manpages(5).

FLOAT(3)                 BSD Library Functions Manual                 FLOAT(3)

NAME
     float -- description of floating-point types available on OS X

DESCRIPTION
     This page describes the available C floating-point types.  For a list of
     math library functions that operate on these types, see the page on the
     math library, "man math".

TERMINOLOGY
     Floating-point numbers are represented in three parts: a sign, a mantissa
     (or significand), and an exponent.  Given such a representation with sign
     s, mantissa m, and exponent e, the corresponding numerical value is
     s*m*2**e.

     Floating-point types differ in the number of bits of accuracy in the
     mantissa (called the precision), and in the set of available exponents
     (the exponent range).

     Floating-point numbers with the maximum available exponent are reserved
     operands, denoting an infinity if the significand is precisely zero, and
     a Not-a-Number, or NaN, otherwise.

     Floating-point numbers with the minimum available exponent are zero if
     the significand is precisely zero, and denormal otherwise.  Note that
     zero is signed: +0 and -0 are distinct floating-point numbers.

     Floating-point numbers with exponents other than the maximum and minimum
     available are called normal numbers.

PROPERTIES OF IEEE-754 FLOATING-POINT
     Basic arithmetic operations in IEEE-754 floating-point are correctly
     rounded: this means that the result delivered is the same as the result
     that would be achieved by computing the exact real-number operation on
     the operands, then rounding the real-number result to a floating-point
     value.

     Overflow occurs when the value of the exact result is too large in
     magnitude to be represented in the floating-point type in which the
     computation is being performed; representing it would require an
     exponent outside of the exponent range of the type.  By default,
     computations that result in overflow return a signed infinity.

     Underflow occurs when the value of the exact result is too small in
     magnitude to be represented as a normal number in the floating-point
     type in which the computation is being performed.  By default, underflow
     is gradual, and produces a denormal number or a zero.

     All floating-point numbers of a given type are integer multiples of the
     smallest non-zero floating-point number of that type; however, the
     converse is not true.  This means that, in the default mode, (x-y) = 0
     only if x = y.

     The sign of zero transforms correctly through multiplication and
     division, and is preserved by addition of zeros with like signs, but
     x - x yields +0 for every finite floating-point number x.  The only
     operations that reveal the sign of a zero are x/(±0) and
     copysign(x,±0).  In particular, comparisons (x > y, x != y, etc) are not
     affected by the sign of zero.

     The sign of infinity transforms correctly through multiplication and
     division, and infinities are unaffected by addition or subtraction of
     any finite floating-point number.  But Inf-Inf, Inf*0, and Inf/Inf are,
     like 0/0 or sqrt(-3), invalid operations that produce NaN.

     NaNs are the default results of invalid operations, and they propagate
     through subsequent arithmetic operations.  If x is a NaN, then x != x is
     TRUE, and every other comparison predicate (x > y, x == y, x <= y, etc)
     evaluates to FALSE, regardless of the value of y.  Additionally,
     predicates that entail an ordered comparison (rather than mere equality
     or inequality) signal Invalid Operation when one of the arguments is
     NaN.

     IEEE-754 provides five kinds of floating-point exceptions, listed below:

     Exception            Default Result
     __________________________________________
     Invalid Operation    NaN or FALSE
     Overflow             ±Infinity
     Divide by Zero       ±Infinity
     Underflow            Gradual Underflow
     Inexact              Rounded Value

     NOTE: An exception is not an error unless it is handled incorrectly.
     What makes a class of exceptions exceptional is that no single default
     response can be satisfactory in every instance.  On the other hand,
     because a default response will serve most instances of the exception
     satisfactorily, simply aborting the computation cannot be justified.

     For each kind of floating-point exception, IEEE-754 provides a flag that
     is raised each time its exception is signaled, and remains raised until
     the program resets it.  Programs may test, save, and restore the flags,
     or a subset thereof.

PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES
     On both Intel and PPC Macs, the type float corresponds to IEEE-754
     single precision.  A single-precision number is represented in 32 bits,
     and has a precision of 24 significant bits, roughly like 7 significant
     decimal digits.  8 bits are used to encode the exponent, which gives an
     exponent range from -126 to 127, inclusive.

     The header <float.h> defines several useful constants for the float
     type:
     FLT_MANT_DIG - the number of binary digits in the significand of a
     float.
     FLT_MIN_EXP - one more than the smallest exponent available in the float
     type.
     FLT_MAX_EXP - one more than the largest exponent available in the float
     type.
     FLT_DIG - the precision in decimal digits of a float.  A decimal value
     with this many digits, stored as a float, always yields the same value
     up to this many digits when converted back to decimal notation.
     FLT_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
     number as a float.
     FLT_MAX_10_EXP - the largest n such that 10**n is finite as a float.
     FLT_MIN - the smallest positive normal float.
     FLT_MAX - the largest finite float.
     FLT_EPSILON - the difference between 1.0 and the smallest float bigger
     than 1.0.

     On both Intel and PPC Macs, the type double corresponds to IEEE-754
     double precision.  A double-precision number is represented in 64 bits,
     and has a precision of 53 significant bits, roughly like 16 significant
     decimal digits.
     11 bits are used to encode the exponent, which gives an exponent range
     from -1022 to 1023, inclusive.

     The header <float.h> defines several useful constants for the double
     type:
     DBL_MANT_DIG - the number of binary digits in the significand of a
     double.
     DBL_MIN_EXP - one more than the smallest exponent available in the
     double type.
     DBL_MAX_EXP - one more than the largest exponent available in the double
     type.
     DBL_DIG - the precision in decimal digits of a double.  A decimal value
     with this many digits, stored as a double, always yields the same value
     up to this many digits when converted back to decimal notation.
     DBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
     number as a double.
     DBL_MAX_10_EXP - the largest n such that 10**n is finite as a double.
     DBL_MIN - the smallest positive normal double.
     DBL_MAX - the largest finite double.
     DBL_EPSILON - the difference between 1.0 and the smallest double bigger
     than 1.0.

     On Intel Macs, the type long double corresponds to IEEE-754 double
     extended precision.  A double extended number is represented in 80 bits,
     and has a precision of 64 significant bits, roughly like 19 significant
     decimal digits.  15 bits are used to encode the exponent, which gives an
     exponent range from -16382 to 16383, inclusive.

     The header <float.h> defines several useful constants for the long
     double type:
     LDBL_MANT_DIG - the number of binary digits in the significand of a long
     double.
     LDBL_MIN_EXP - one more than the smallest exponent available in the long
     double type.
     LDBL_MAX_EXP - one more than the largest exponent available in the long
     double type.
     LDBL_DIG - the precision in decimal digits of a long double.  A decimal
     value with this many digits, stored as a long double, always yields the
     same value up to this many digits when converted back to decimal
     notation.
     LDBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
     number as a long double.
     LDBL_MAX_10_EXP - the largest n such that 10**n is finite as a long
     double.
     LDBL_MIN - the smallest positive normal long double.
     LDBL_MAX - the largest finite long double.
     LDBL_EPSILON - the difference between 1.0 and the smallest long double
     bigger than 1.0.

LONG DOUBLE ON POWERPC MACS
     On PowerPC Macs, by default the type long double is mapped to IEEE-754
     double precision, described above.  If additional precision is required,
     a non-IEEE-754 128-bit long double format is also available.  To use
     this format, compile with the -mlong-double-128 flag.  If you wish to
     use the <math.h> functions, you must include the linker flag -lmx as
     well as the usual -lm.  The -mlong-double-128 flag is only valid when
     the target architecture is ppc or ppc64.

     Each 128-bit long double is made up of two IEEE doubles (head and tail).
     The value of the long double is the sum of the values of the two parts
     (unless the head double has value -0.0, in which case the value of the
     long double is -0.0).  The value of the head is required to be the value
     of the long double rounded to the nearest double.  If the head is an
     infinity, the value of the long double is the value of the head, and the
     tail must be ±0.0.  The tail of a NaN can be any double value.  There
     are many 128-bit bit patterns that are not valid as long doubles.  These
     do not represent any value.

     The 128-bit long double format provides 106 significant bits, which is
     roughly like 31 significant decimal digits.  It has the same exponent
     range as double, from -1022 to 1023, inclusive.  The usual constants are
     provided from <float.h>, as described above.

     In the 128-bit long double format, addition and subtraction have a
     relative error bound of one ulp, or 2**-106.  Multiplication has a
     relative error within 2 ulps, and division a relative error within 3
     ulps.

SEE ALSO
     math(3), complex(3)

STANDARDS
     Floating-point arithmetic conforms to the ISO/IEC 9899:1999(E) standard.

BSD                             March 20, 2007                             BSD