Floating-Point Numbers

On this page
About Floating-Point Numbers
Scientific Notation
The IEEE Format
Range and Precision
Exceptional Arithmetic

About Floating-Point Numbers
You can represent any binary floating-point number in scientific notation form as f × 2^e, where f is the fraction (or mantissa), 2 is the radix or base (binary in this case), and e is the exponent of the radix. The radix is always a positive number, while f and e can be positive or negative.
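As a concrete illustration of the f × 2^e form, the following Python sketch (Python is used here only for illustration; the surrounding text is language-neutral) uses the standard-library function math.frexp, which decomposes a float into exactly this fraction/exponent pair:

```python
import math

# math.frexp returns (f, e) such that x == f * 2**e,
# with 0.5 <= abs(f) < 1 for nonzero x (a normalized fraction).
f, e = math.frexp(6.5)
print(f, e)            # 0.8125 3  (6.5 == 0.8125 * 2**3)
print(f * 2**e)        # 6.5

# The fraction carries the sign; the radix (2) is always positive.
f_neg, e_neg = math.frexp(-6.5)
print(f_neg, e_neg)    # -0.8125 3
```

Note that frexp normalizes the fraction into [0.5, 1); the IEEE formats described below normalize differently (1.f), but the f × 2^e decomposition is the same idea.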
When performing arithmetic operations, floating-point hardware must take into account that the sign, exponent, and
fraction are all encoded within the same binary word. This results in complex logic circuits when compared with the circuits
for binary fixed-point operations.
The Simulink Fixed Point software supports single-precision and double-precision floating-point numbers as defined by the
IEEE Standard 754. Additionally, a nonstandard IEEE-style number is supported.
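The word sizes of the two standard formats can be checked directly. The following Python sketch (illustrative only; the Simulink software itself is not involved) packs a value into IEEE 754 single and double precision with the standard-library struct module:

```python
import struct

x = 1.0
single = struct.pack('>f', x)   # big-endian IEEE 754 single precision
double = struct.pack('>d', x)   # big-endian IEEE 754 double precision
print(len(single) * 8)          # 32 bits
print(len(double) * 8)          # 64 bits

# 1.0 in single precision: sign 0, biased exponent 127, fraction 0
print(format(int.from_bytes(single, 'big'), '032b'))
# 00111111100000000000000000000000
```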
Scientific Notation
A direct analogy exists between scientific notation and radix point notation. For example, scientific notation using five decimal digits for the fraction would take the form

±d.dddd × 10^p,

where d = 0, ..., 9 and p is an integer. If the exponent were greater than 0 or less than −3, then the representation would involve many zeros.
These extra zeros never change to ones, however, so they don't show up in the hardware. Furthermore, unlike
floating-point exponents, a fixed-point exponent never shows up in the hardware, so fixed-point exponents are not limited
by a finite number of bits.
Note Restricting the binary point to being contiguous with the fraction is unnecessary; the Simulink Fixed Point
software allows you to extend the binary point to any arbitrary location.
Back to Top
The IEEE Format

Single-Precision Format
The IEEE single-precision floating-point format is a 32-bit word divided into a 1-bit sign indicator s, an 8-bit biased
exponent e, and a 23-bit fraction f. For more information, see The Sign Bit, The Exponent Field, and The Fraction Field. A
representation of this format is given below.
The relationship between this format and the representation of real numbers is given by

v = (−1)^s (2^(e − 127)) (1.f)
Double-Precision Format
The IEEE double-precision floating-point format is a 64-bit word divided into a 1-bit sign indicator s, an 11-bit biased exponent e, and a 52-bit fraction f. For more information, see The Sign Bit, The Exponent Field, and The Fraction Field. A representation of this format is shown in the following figure.

The relationship between this format and the representation of real numbers is given by

v = (−1)^s (2^(e − 1023)) (1.f)
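The field split and the reconstruction formula can be verified with a short Python sketch (illustrative only) that pulls the sign, biased exponent, and fraction out of a 64-bit double:

```python
import struct

def decode_double(x):
    """Split a 64-bit IEEE 754 double into (s, e, f) and rebuild its value."""
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    s = bits >> 63                    # 1-bit sign
    e = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)        # 52-bit fraction
    # Normal numbers: v = (-1)^s * 2^(e - 1023) * (1.f)
    v = (-1) ** s * 2.0 ** (e - 1023) * (1 + f / 2 ** 52)
    return s, e, f, v

s, e, f, v = decode_double(-6.5)
print(s, e - 1023)   # 1 2  (-6.5 == -1.625 * 2**2)
print(v)             # -6.5
```

The same decomposition works for single precision with an 8-bit exponent, bias 127, and a 23-bit fraction.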
Range

The range of representable numbers for an IEEE floating-point number with f bits allocated for the fraction, e bits allocated for the exponent, and the bias of e given by bias = 2^(e−1) − 1, is described below.

Normalized positive numbers are defined within the range 2^(1−bias) to (2 − 2^(−f)) × 2^bias, and normalized negative numbers within the range −(2 − 2^(−f)) × 2^bias to −2^(1−bias). Positive numbers greater than (2 − 2^(−f)) × 2^bias and negative numbers less than −(2 − 2^(−f)) × 2^bias are overflows; positive numbers less than 2^(1−bias) and negative numbers greater than −2^(1−bias) are underflows.

Overflows and underflows result from exceptional arithmetic conditions. Floating-point numbers outside the defined range are always mapped to ±Inf.
Note: You can use the MATLAB commands realmin and realmax to determine the dynamic range of floating-point numbers for your computer.
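The range formulas can be checked against a real machine. In Python (used here as an illustration; sys.float_info plays the role of MATLAB's realmin and realmax), plugging f = 52 and bias = 1023 into the formulas reproduces the double-precision limits exactly:

```python
import sys

bias = 1023          # double-precision exponent bias, 2**(11-1) - 1
frac_bits = 52       # double-precision fraction bits

# Smallest normalized positive number: 2^(1 - bias)
realmin = 2.0 ** (1 - bias)
# Largest finite number: (2 - 2^(-f)) * 2^bias
realmax = (2 - 2.0 ** -frac_bits) * 2.0 ** bias

print(realmin == sys.float_info.min)   # True
print(realmax == sys.float_info.max)   # True
```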
Precision
Because of a finite word size, a floating-point number is only an approximation of the "true" value. Therefore, it is important
to have an understanding of the precision (or accuracy) of a floating-point result. In general, a value v with an accuracy q
is specified by v ± q. For IEEE floating-point numbers,

v = (−1)^s (2^(e − bias)) (1.f)

and

q = 2^(−f) × 2^(e − bias)

Thus, the precision is associated with the number of bits in the fraction field.
Note: In the MATLAB software, floating-point relative accuracy is given by the command eps, which returns the distance from 1.0 to the next larger floating-point number. For a computer that supports the IEEE Standard 754, eps = 2^(−52), or about 2.22045 × 10^(−16).
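Python exposes the same quantity as sys.float_info.epsilon, and (on Python 3.9+) math.nextafter lets you confirm the "distance to the next larger float" definition directly; a minimal check:

```python
import math
import sys

# eps: distance from 1.0 to the next larger double-precision number
eps = sys.float_info.epsilon
print(eps == 2.0 ** -52)                      # True
print(math.nextafter(1.0, 2.0) - 1.0 == eps)  # True
```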
The following table summarizes these limits for the supported data types:

Data Type      Low Limit                   High Limit                  Exponent Bias    Precision
Single         2^(−126) ≈ 10^(−38)         2^(128) ≈ 3 × 10^(38)       127              2^(−23) ≈ 10^(−7)
Double         2^(−1022) ≈ 2 × 10^(−308)   2^(1024) ≈ 2 × 10^(308)     1023             2^(−52) ≈ 10^(−16)
Nonstandard    2^(1 − bias)                (2 − 2^(−f)) × 2^(bias)     2^(e − 1) − 1    2^(−f)
Because of the sign/magnitude representation of floating-point numbers, there are two representations of zero, one positive and one negative. For both representations, e = 0 and f = 0.
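The two zeros compare equal but carry different sign bits, which a short Python check (illustrative only) makes visible via math.copysign:

```python
import math

pos_zero = 0.0
neg_zero = -0.0

# The two zeros compare equal...
print(pos_zero == neg_zero)            # True
# ...but carry different sign bits, visible via copysign
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0
```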
Exceptional Arithmetic
In addition to specifying a floating-point format, the IEEE Standard 754 specifies practices and procedures so that
predictable results are produced independently of the hardware platform. Specifically, denormalized numbers, Inf, and
NaN are defined to deal with exceptional arithmetic (underflow and overflow).
If an underflow or overflow is handled as Inf or NaN, then significant processor overhead is required to deal with this
exception. Although the IEEE Standard 754 specifies practices and procedures to deal with exceptional arithmetic
conditions in a consistent manner, microprocessor manufacturers might handle these conditions in ways that depart from
the standard. Some of the alternative approaches, such as saturation and wrapping, are discussed in Arithmetic
Operations.
Denormalized Numbers
Denormalized numbers are used to handle cases of exponent underflow. When the exponent of the result is too small
(i.e., a negative exponent with too large a magnitude), the result is denormalized by right-shifting the fraction and leaving
the exponent at its minimum value. The use of denormalized numbers is also referred to as gradual underflow. Without
denormalized numbers, the gap between the smallest representable nonzero number and zero is much wider than the gap
between the smallest representable nonzero number and the next larger number. Gradual underflow fills that gap and
reduces the impact of exponent underflow to a level comparable with roundoff among the normalized numbers. Thus,
denormalized numbers provide extended range for small numbers at the expense of precision.
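Gradual underflow is easy to observe in double precision. In this Python sketch (illustrative; behavior assumes an IEEE 754 platform with round-to-nearest), values below the smallest normalized number become subnormal rather than jumping to zero:

```python
import sys

realmin = sys.float_info.min          # smallest normalized double, 2^(-1022)

# Halving realmin does not jump straight to zero: the result is a
# denormalized (subnormal) number, i.e. gradual underflow.
sub = realmin / 2
print(sub > 0)                        # True
print(sub == 2.0 ** -1023)            # True

# The smallest subnormal is realmin * eps = 2^(-1074); halving it
# finally underflows to zero (round-to-nearest-even).
tiny = sys.float_info.min * sys.float_info.epsilon
print(tiny > 0)                       # True
print(tiny / 2 == 0.0)                # True
```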
Inf
Arithmetic involving Inf (infinity) is treated as the limiting case of real arithmetic, with infinite values defined as those outside the range of representable numbers, or −∞ < (representable numbers) < +∞. With the exception of the special cases discussed below (NaN), any arithmetic operation involving Inf yields Inf. Inf is represented by the largest biased exponent allowed by the format and a fraction of zero.
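Both properties of Inf, its behavior under arithmetic and its bit pattern, can be demonstrated in Python (illustrative only):

```python
import math
import struct

inf = math.inf
print(inf + 1.0)          # inf: arithmetic with Inf yields Inf
print(1e308 * 10)         # inf: overflow maps to Inf

# Bit pattern of a double Inf: largest biased exponent (all ones),
# fraction zero.
bits = int.from_bytes(struct.pack('>d', inf), 'big')
exponent = (bits >> 52) & 0x7FF
fraction = bits & ((1 << 52) - 1)
print(exponent == 0x7FF, fraction == 0)   # True True
```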
NaN
A NaN (not-a-number) is a symbolic entity encoded in floating-point format. There are two types of NaN: signaling and quiet. A signaling NaN signals an invalid operation exception. A quiet NaN propagates through almost every arithmetic operation without signaling an exception. The following operations result in a NaN: ∞ − ∞, −∞ + ∞, 0 × ∞, 0/0, and ∞/∞.

Both types of NaN are represented by the largest biased exponent allowed by the format and a fraction that is nonzero. The bit pattern for a quiet NaN is given by 0.f where the most significant bit of f must be one, while the bit pattern for a signaling NaN is given by 0.f where the most significant bit of f must be zero and at least one of the remaining bits must be nonzero.
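A Python sketch (illustrative only; note that Python raises an exception for 0/0 rather than producing a NaN, so ∞ − ∞ is used here) shows NaN generation, propagation, and the bit pattern:

```python
import math
import struct

nan = math.inf - math.inf     # Inf - Inf is one of the invalid operations
print(math.isnan(nan))        # True
print(nan == nan)             # False: NaN compares unequal even to itself

# Bit pattern of a double NaN: largest biased exponent (all ones),
# nonzero fraction.
bits = int.from_bytes(struct.pack('>d', nan), 'big')
exponent = (bits >> 52) & 0x7FF
fraction = bits & ((1 << 52) - 1)
print(exponent == 0x7FF, fraction != 0)   # True True
```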