
Floating Point Arithmetic

ICS 233
Computer Architecture and Assembly Language
Dr. Aiman El-Maleh
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals
[Adapted from slides of Dr. M. Mudawar, ICS 233, KFUPM]
Outline

❖ Floating-Point Numbers

❖ IEEE 754 Floating-Point Standard

Floating Point ICS 233 – KFUPM © Muhamed Mudawar slide 2


The World is Not Just Integers
❖ Programming languages support numbers with fractions
 Called floating-point numbers
 Examples:
3.14159265… (π)
2.71828… (e)
0.000000001 or 1.0 × 10^–9 (seconds in a nanosecond)
86,400,000,000,000 or 8.64 × 10^13 (nanoseconds in a day)
The last number is a large integer that cannot fit in a 32-bit integer

❖ We use scientific notation to represent
 Very small numbers (e.g. 1.0 × 10^–9)
 Very large numbers (e.g. 8.64 × 10^13)
 Scientific notation: ± d.f₁f₂f₃f₄… × 10^(± e₁e₂e₃)
Floating-Point Numbers
❖ Examples of floating-point numbers in base 10 …
 5.341×103 , 0.05341×105 , –2.013×10–1 , –201.3×10–3
decimal point
❖ Examples of floating-point numbers in base 2 …
 1.00101×223 , 0.0100101×225 , –1.101101×2–3 , –1101.101×2–6
binary point
 Exponents are kept in decimal for clarity
 The binary number (1101.101)2 = 23+22+20+2–1+2–3 = 13.625
❖ Floating-point numbers should be normalized
 Exactly one non-zero digit should appear before the point
▪ In a decimal number, this digit can be from 1 to 9
▪ In a binary number, this digit should be 1
 Normalized FP Numbers: 5.341×10^3 and –1.101101×2^–3
 NOT Normalized: 0.05341×10^5 and –1101.101×2^–6
Floating-Point Representation
❖ A floating-point number is represented by the triple (S, E, F)
 S is the Sign bit (0 is positive and 1 is negative)
▪ Representation is called sign and magnitude
 E is the Exponent field (signed)
▪ Very large numbers have large positive exponents
▪ Very small close-to-zero numbers have negative exponents
▪ More bits in exponent field increases range of values
 F is the Fraction field (fraction after binary point)
▪ More bits in fraction field improves the precision of FP numbers

Field layout: | S | Exponent | Fraction |

Value of a floating-point number = (–1)^S × val(F) × 2^val(E)
FP Overflow & Underflow
• Fixed-sized representation leads to limitations
• Large positive exponent: overflow, the result rounds to –∞ or +∞
• Large negative exponent: underflow, the result rounds to zero
• Unlike integer arithmetic, overflow → imprecise result (∞), not inaccurate result
[Figure: the real number line, from left to right: negative overflow, expressible negative values, negative underflow, zero, positive underflow, expressible positive values, positive overflow]
Alan L. Cox alc@rice.edu
Next . . .

❖ Floating-Point Numbers

❖ IEEE 754 Floating-Point Standard



IEEE 754 Floating-Point Standard
❖ Single Precision Floating Point Numbers (32 bits)
 1-bit sign + 8-bit exponent + 23-bit fraction
 Layout: | S (1) | Exponent (8) | Fraction (23) |
❖ Double Precision Floating Point Numbers (64 bits)
 1-bit sign + 11-bit exponent + 52-bit fraction
 Layout: | S (1) | Exponent (11) | Fraction (52, continued in the second 32-bit word) |


Normalized Floating Point Numbers
❖ For a normalized floating point number (S, E, F)
Fields: | S | E | F = f₁f₂f₃f₄… |
❖ Significand is equal to (1.F)₂ = (1.f₁f₂f₃f₄…)₂
 IEEE 754 assumes hidden 1. (not stored) for normalized numbers
 Significand is 1 bit longer than fraction
❖ Value of a Normalized Floating Point Number is
(–1)^S × (1.F)₂ × 2^val(E)
= (–1)^S × (1.f₁f₂f₃f₄…)₂ × 2^val(E)
= (–1)^S × (1 + f₁×2^–1 + f₂×2^–2 + f₃×2^–3 + f₄×2^–4 + …) × 2^val(E)
(–1)^S is 1 when S is 0 (positive), and –1 when S is 1 (negative)
Biased Exponent Representation
❖ How to represent a signed exponent? Choices are …
 Sign + magnitude representation for the exponent
 Two’s complement representation
 Biased representation
❖ IEEE 754 uses biased representation for the exponent
 Value of exponent = val(E) = E – Bias (Bias is a constant)
❖ Recall that exponent field is 8 bits for single precision
 E can be in the range 0 to 255
 E = 0 and E = 255 are reserved for special use (discussed later)
 E = 1 to 254 are used for normalized floating point numbers
 Bias = 127 (half of 254), val(E) = E – 127
 val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
Biased Exponent – Cont’d
❖ For double precision, exponent field is 11 bits
 E can be in the range 0 to 2047
 E = 0 and E = 2047 are reserved for special use
 E = 1 to 2046 are used for normalized floating point numbers
 Bias = 1023 (half of 2046), val(E) = E – 1023
 val(E=1) = –1022, val(E=1023) = 0, val(E=2046) = 1023
❖ Value of a Normalized Floating Point Number is

(–1)^S × (1.F)₂ × 2^(E – Bias)
= (–1)^S × (1.f₁f₂f₃f₄…)₂ × 2^(E – Bias)
= (–1)^S × (1 + f₁×2^–1 + f₂×2^–2 + f₃×2^–3 + f₄×2^–4 + …) × 2^(E – Bias)


Examples of Single Precision Float
❖ What is the decimal value of this Single Precision float?
10111110001000000000000000000000

❖ Solution:
 Sign = 1 is negative
 Exponent = (01111100)₂ = 124, E – bias = 124 – 127 = –3
 Significand = (1.0100 … 0)₂ = 1 + 2^–2 = 1.25 (1. is implicit)
 Value in decimal = –1.25 × 2^–3 = –0.15625
❖ What is the decimal value of this Single Precision float?
01000001001001100000000000000000
❖ Solution:
 Sign = 0 is positive
 Exponent = (10000010)₂ = 130, E – bias = 130 – 127 = 3
 Value in decimal = +(1.01001100 … 0)₂ × 2^(130 – 127) (1. is implicit) =
(1.01001100 … 0)₂ × 2^3 = (1010.01100 … 0)₂ = 10.375
Examples of Double Precision Float
❖ What is the decimal value of this Double Precision float ?
01000000010100101010000000000000
00000000000000000000000000000000

❖ Solution:
 Value of exponent = (10000000101)₂ – Bias = 1029 – 1023 = 6
 Value of double float = (1.00101010 … 0)₂ × 2^6 (1. is implicit) =
(1001010.10 … 0)₂ = 74.5
❖ What is the decimal value of this Double Precision float?
10111111100010000000000000000000
00000000000000000000000000000000

❖ Do it yourself! (answer should be –1.5 × 2^–7 = –0.01171875)
Converting FP Decimal to Binary
❖ Convert –0.8125 to binary in single and double precision
❖ Solution:
 Fraction bits can be obtained using multiplication by 2
▪ 0.8125 × 2 = 1.625
▪ 0.625 × 2 = 1.25
▪ 0.25 × 2 = 0.5
▪ 0.5 × 2 = 1.0
▪ Stop when fractional part is 0
 0.8125 = (0.1101)₂ = ½ + ¼ + 1/16 = 13/16
 Fraction = (0.1101)₂ = (1.101)₂ × 2^–1 (Normalized)
 Exponent = –1 + Bias = 126 (single precision) and 1022 (double)
Single Precision:
10111111010100000000000000000000
Double Precision:
10111111111010100000000000000000
00000000000000000000000000000000
Basic Technique

• Represent the decimal in the form ± 1.xxx (binary) × 2^y
• And “fill in the fields”
– Remember biased exponent and implicit “1.” mantissa!
• Examples:
– 0.0: 0 00000000 00000000000000000000000
– 1.0 (1.0 x 2^0): 0 01111111 00000000000000000000000
– 0.5 (0.1 binary = 1.0 x 2^-1): 0 01111110 00000000000000000000000
– 0.75 (0.11 binary = 1.1 x 2^-1): 0 01111110 10000000000000000000000
– 3.0 (11 binary = 1.1*2^1): 0 10000000 10000000000000000000000
– -0.375 (-0.011 binary = -1.1*2^-2): 1 01111101 10000000000000000000000
– 1 10000011 01000000000000000000000 = - 1.01 * 2^4 = -20.0

Lec 14 Systems Architecture 15


http://www.math-cs.gordon.edu/courses/cs311/lectures-2003/binary.html
Copyright ©2003 - Russell C. Bjork
Floating-Point Example
• Represent –0.75
– –0.75 = (–1)^1 × (1.1)₂ × 2^–1
– S = 1
– Fraction = (1000…00)₂
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = (01111110)₂
• Double: –1 + 1023 = 1022 = (01111111110)₂
• Single: 1011111101000…00
• Double: 1011111111101000…00


Jeremy R. Johnson, Anatole D. Ruslanov, William M. Mongan
Floating-Point Example
• What number is represented by the single-precision float
11000000101000…00
– S = 1
– Fraction = (01000…00)₂
– Exponent = (10000001)₂ = 129
• x = (–1)^1 × (1 + (0.01)₂) × 2^(129 – 127)
= (–1) × 1.25 × 2^2
= –5.0
Largest Normalized Float
❖ What is the Largest normalized float?
❖ Solution for Single Precision:
01111111011111111111111111111111

 Exponent – bias = 254 – 127 = 127 (largest exponent for SP)
 Significand = (1.111 … 1)₂ = almost 2
 Value in decimal ≈ 2 × 2^127 ≈ 2^128 ≈ 3.4028 … × 10^38
❖ Solution for Double Precision:
01111111111011111111111111111111
11111111111111111111111111111111

 Value in decimal ≈ 2 × 2^1023 ≈ 2^1024 ≈ 1.79769 … × 10^308
❖ Overflow: exponent is too large to fit in the exponent field
Smallest Normalized Float
❖ What is the smallest (in absolute value) normalized float?
❖ Solution for Single Precision:
00000000100000000000000000000000
 Exponent – bias = 1 – 127 = –126 (smallest exponent for SP)
 Significand = (1.000 … 0)₂ = 1
 Value in decimal = 1 × 2^–126 = 1.17549 … × 10^–38
❖ Solution for Double Precision:
00000000000100000000000000000000
00000000000000000000000000000000

 Value in decimal = 1 × 2^–1022 = 2.22507 … × 10^–308
❖ Underflow: exponent is too small to fit in exponent field
Zero, Infinity, and NaN
❖ Zero
 Exponent field E = 0 and fraction F = 0
 +0 and –0 are possible according to sign bit S
❖ Infinity
 Infinity is a special value represented with maximum E and F = 0
▪ For single precision with 8-bit exponent: maximum E = 255
▪ For double precision with 11-bit exponent: maximum E = 2047
 Infinity can result from overflow or division by zero
 +∞ and –∞ are possible according to sign bit S
❖ NaN (Not a Number)
 NaN is a special value represented with maximum E and F ≠ 0
 Result from exceptional situations, such as 0/0 or sqrt(negative)
 An operation on a NaN results in NaN: Op(X, NaN) = NaN
Denormalized Numbers
❖ IEEE standard uses denormalized numbers to …
 Fill the gap between 0 and the smallest normalized float
 Provide gradual underflow to zero
❖ Denormalized: exponent field E is 0 and fraction F ≠ 0
 Implicit 1. before the fraction now becomes 0. (not normalized)
❖ Value of denormalized number ( S, 0, F )
Single precision: (–1)^S × (0.F)₂ × 2^–126
Double precision: (–1)^S × (0.F)₂ × 2^–1022
[Figure: number line for single precision, from left to right: –∞ (negative overflow), normalized negative values from –2^128 to –2^–126, negative denormalized values (negative underflow), 0, positive denormalized values (positive underflow), normalized positive values from 2^–126 to 2^128, +∞ (positive overflow)]
Special Value Rules

Operation            Result
n / ±∞               ±0
±∞ × ±∞              ±∞
nonzero / 0          ±∞
∞ + ∞                ∞ (similar for –∞)
±0 / ±0              NaN
∞ – ∞                NaN (similar for –∞)
±∞ / ±∞              NaN
±∞ × ±0              NaN
NaN op anything      NaN


Summary of IEEE 754 Encoding
Single-Precision       Exponent = 8   Fraction = 23   Value
Normalized Number      1 to 254       Anything        ± (1.F)₂ × 2^(E – 127)
Denormalized Number    0              nonzero         ± (0.F)₂ × 2^–126
Zero                   0              0               ±0
Infinity               255            0               ±∞
NaN                    255            nonzero         NaN

Double-Precision       Exponent = 11  Fraction = 52   Value
Normalized Number      1 to 2046      Anything        ± (1.F)₂ × 2^(E – 1023)
Denormalized Number    0              nonzero         ± (0.F)₂ × 2^–1022
Zero                   0              0               ±0
Infinity               2047           0               ±∞
NaN                    2047           nonzero         NaN


Simple 6-bit Floating Point Example
❖ 6-bit floating point representation
Layout: | S (1) | Exponent (3) | Fraction (2) |
 Sign bit is the most significant bit
 Next 3 bits are the exponent with a bias of 3
 Last 2 bits are the fraction
❖ Same general form as IEEE
 Normalized, denormalized
 Representation of 0, infinity and NaN
❖ Value of normalized numbers (–1)^S × (1.F)₂ × 2^(E – 3)
❖ Value of denormalized numbers (–1)^S × (0.F)₂ × 2^–2


Values Related to Exponent

Exp.   exp   E     2^E
0      000   -2    ¼            Denormalized
1      001   -2    ¼            Normalized
2      010   -1    ½            Normalized
3      011   0     1            Normalized
4      100   1     2            Normalized
5      101   2     4            Normalized
6      110   3     8            Normalized
7      111   n/a   Inf or NaN


Dynamic Range of Values
s exp frac E value
0 000 00 -2 0
0 000 01 -2 1/4*1/4=1/16 smallest denormalized
0 000 10 -2 2/4*1/4=2/16
0 000 11 -2 3/4*1/4=3/16 largest denormalized
0 001 00 -2 4/4*1/4=4/16=1/4=0.25 smallest normalized
0 001 01 -2 5/4*1/4=5/16
0 001 10 -2 6/4*1/4=6/16
0 001 11 -2 7/4*1/4=7/16
0 010 00 -1 4/4*2/4=8/16=1/2=0.5
0 010 01 -1 5/4*2/4=10/16
0 010 10 -1 6/4*2/4=12/16=0.75
0 010 11 -1 7/4*2/4=14/16
Dynamic Range of Values
s exp frac E value
0 011 00 0 4/4*4/4=16/16=1
0 011 01 0 5/4*4/4=20/16=1.25
0 011 10 0 6/4*4/4=24/16=1.5
0 011 11 0 7/4*4/4=28/16=1.75
0 100 00 1 4/4*8/4=32/16=2
0 100 01 1 5/4*8/4=40/16=2.5
0 100 10 1 6/4*8/4=48/16=3
0 100 11 1 7/4*8/4=56/16=3.5
0 101 00 2 4/4*16/4=64/16=4
0 101 01 2 5/4*16/4=80/16=5
0 101 10 2 6/4*16/4=96/16=6
0 101 11 2 7/4*16/4=112/16=7
Dynamic Range of Values
s exp frac E value
0 110 00 3 4/4*32/4=128/16=8
0 110 01 3 5/4*32/4=160/16=10
0 110 10 3 6/4*32/4=192/16=12
0 110 11 3 7/4*32/4=224/16=14 largest normalized
0 111 00 n/a +∞
0 111 01 n/a NaN
0 111 10 n/a NaN
0 111 11 n/a NaN


FP Behavior
Programmer must be aware of accuracy limitations!

(1.0 + 6.0) ÷ 640.0 =? (1.0 ÷ 640.0) + (6.0 ÷ 640.0)
7.0 ÷ 640.0 =? .001563 + .009375
.010937 ≠ .010938

×,÷ not distributive across +,-

(10^10 + 10^30) + –10^30 =? 10^10 + (10^30 + –10^30)
10^30 – 10^30 =? 10^10 + 0
0 ≠ 10^10

Operations not associative!
• Associativity law for addition: a + (b + c) = (a + b) + c
• Let a = –2.7 × 10^23, b = 2.7 × 10^23, and c = 1.0
• a + (b + c) = –2.7 × 10^23 + (2.7 × 10^23 + 1.0) = –2.7 × 10^23 + 2.7 × 10^23 = 0.0
• (a + b) + c = (–2.7 × 10^23 + 2.7 × 10^23) + 1.0 = 0.0 + 1.0 = 1.0
• Beware: floating-point addition is not associative!
• The result is approximate…
• Why did the smaller number disappear?
FP vs. Integer Results

int i = 1000 / 6;
float f = 1000.0 / 6.0;

True mathematical answer: 1000 ÷ 6 = 166 2/3

i = ? 166 Integer division ignores remainder

f = ? 166.666672 FP arithmetic rounds result

Surprise!
Arithmetic is done in binary but printed in decimal,
which doesn’t always give the expected result
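FP ↔ Integer Conversions in C

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* UINT_MAX is 2^32 - 1 when unsigned int is 32 bits wide */
    unsigned int ui = UINT_MAX;
    /* Implicit conversion to float, which has a 24-bit significand */
    float f = ui;
    printf("ui: %u\nf: %f\n", ui, f);
    return 0;
}

Surprisingly, this program prints the following. Why?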

ui: 4294967295
f: 4294967296.000000