COMP211 Computer Logic Design

Lecture 6. Fixed and Floating Point Numbers

Prof. Taeweon Suh

Computer Science Education
Korea University
Number Systems

• So far we have studied the following integer number systems in computers
  - Unsigned numbers
  - Sign/magnitude numbers
  - Two’s complement numbers

• What about rational numbers?
  - For example, 2.5, -10.04, 0.75, etc.

• Two common notations to represent rational numbers in computers
  - Fixed-point numbers
  - Floating-point numbers

Korea Univ
Fixed-Point Numbers
• Fixed-point notation has an implied binary point between the integer and fraction bits
  - The binary point is not part of the representation; it is implied
  - Example: fixed-point representation of 6.75 using 4 integer bits and 4 fraction bits:

        01101100
        0110.1100
        2^2 + 2^1 + 2^-1 + 2^-2 = 6.75

• The number of integer and fraction bits must be agreed upon by those generating and those reading the number
  - There is no way of knowing where the binary point is except through agreement among those interpreting the number
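
A minimal sketch of this idea, in Python purely for illustration. The names `to_fixed` and `from_fixed` and the constants are hypothetical and not from the lecture. An unsigned fixed-point value with 4 fraction bits is an ordinary integer scaled by 2^4, and only the agreed-upon scale tells the reader where the binary point is:

```python
# Unsigned fixed-point with an implied binary point (illustrative sketch).
FRAC_BITS = 4    # assumed: 4 fraction bits, as in the 6.75 example
TOTAL_BITS = 8   # assumed: 8-bit word (4 integer + 4 fraction bits)

def to_fixed(value):
    """Scale by 2^FRAC_BITS and keep the resulting integer bit pattern."""
    raw = round(value * (1 << FRAC_BITS))
    if not 0 <= raw < (1 << TOTAL_BITS):
        raise ValueError("value does not fit in the chosen format")
    return raw

def from_fixed(raw):
    """Reinterpret the bit pattern by dividing by 2^FRAC_BITS."""
    return raw / (1 << FRAC_BITS)

bits = to_fixed(6.75)
print(format(bits, "08b"))   # 01101100
print(from_fixed(bits))      # 6.75
```

The stored byte carries no binary point; changing `FRAC_BITS` on the reading side would silently change the value.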

Signed Fixed-Point Numbers

• As with whole numbers, negative fractional numbers can be represented in two ways
  - Sign/magnitude notation
  - Two’s complement notation

• Example: -2.375 using 8 bits (4 bits each for the integer and fractional parts)
  - 2.375 = 0010.0110
  - Sign/magnitude notation: 1010 0110
  - Two’s complement notation:
      1. Flip all the bits:   1101 1001
      2. Add 1:             +         1
                              1101 1010

• With 2’s complement notation, addition and subtraction work in the computer just as they do for integers
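
The two negation schemes can be sketched as follows. This is an illustration only: the helper names `sign_magnitude` and `twos_complement` are hypothetical, and an 8-bit word with 4 fraction bits is assumed as in the example.

```python
FRAC_BITS = 4                  # assumed: 4 fraction bits
TOTAL_BITS = 8                 # assumed: 8-bit word
MASK = (1 << TOTAL_BITS) - 1   # 0xFF

def sign_magnitude(raw):
    """Negate by setting the most significant (sign) bit."""
    return raw | (1 << (TOTAL_BITS - 1))

def twos_complement(raw):
    """Negate by flipping all the bits and then adding 1."""
    return ((raw ^ MASK) + 1) & MASK

positive = round(2.375 * (1 << FRAC_BITS))        # 0010 0110
print(format(sign_magnitude(positive), "08b"))    # 10100110
print(format(twos_complement(positive), "08b"))   # 11011010
```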

Example

• Suppose that we have 8 bits to represent a number
  - 4 bits for the integer part and 4 bits for the fraction

• Compute 0.75 + (-0.625)
  - 0.75 = 0000 1100
  - 0.625 = 0000 1010
  - -0.625 in 2’s complement form: 1111 0110

        0.75          0000 1100
      + (-0.625)    + 1111 0110
      -------------------------
        0.125         0000 0010    (the carry out of the most significant bit is discarded)
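
The same computation as a short sketch. The `encode` and `decode` helpers are hypothetical names, and the 8-bit format with 4 fraction bits is assumed. The addition itself is a plain integer add, with the result masked to 8 bits:

```python
FRAC_BITS = 4   # assumed: 4 fraction bits
MASK = 0xFF     # assumed: 8-bit word

def encode(value):
    """Two's complement fixed-point encoding, wrapped into 8 bits."""
    return round(value * (1 << FRAC_BITS)) & MASK

def decode(raw):
    """Interpret an 8-bit pattern as signed, then rescale."""
    if raw & 0x80:
        raw -= 0x100
    return raw / (1 << FRAC_BITS)

a = encode(0.75)          # 0000 1100
b = encode(-0.625)        # 1111 0110
total = (a + b) & MASK    # integer add; the carry out of bit 7 is dropped
print(format(total, "08b"), decode(total))   # 00000010 0.125
```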

Fixed-Point Number Systems

• Fixed-point number systems are limited by having a constant number of integer and fractional bits
  - What are the largest and the smallest rational numbers you can represent with 32 bits, assuming 16 bits each for the integer and fractional parts?

• Some low-end digital signal processors support fixed-point numbers
  - Example: TMS320C550x DSPs from TI (Texas Instruments): www.ti.com
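
One way to explore the question above is the sketch below. The slide does not say whether the 32-bit format is signed, so both an unsigned and a two's complement interpretation are computed here as assumptions; the variable names are illustrative.

```python
INT_BITS = 16
FRAC_BITS = 16
total_bits = INT_BITS + FRAC_BITS
scale = 1 << FRAC_BITS

# Assumption 1: unsigned, raw values 0 .. 2^32 - 1
unsigned_max = ((1 << total_bits) - 1) / scale
# Assumption 2: two's complement, raw values -2^31 .. 2^31 - 1
signed_min = -(1 << (total_bits - 1)) / scale
signed_max = ((1 << (total_bits - 1)) - 1) / scale
# Smallest positive step (resolution) under either assumption
resolution = 1 / scale

print(unsigned_max)             # just under 65536
print(signed_min, signed_max)   # -32768.0 and just under 32768
print(resolution)               # 1.52587890625e-05, i.e. 2^-16
```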

Floating-Point Numbers
• Floating-point number systems circumvent the limitation of having a constant number of integer and fractional bits
  - They allow the representation of very large and very small numbers

• The binary point floats to the right of the most significant 1
  - Similar to decimal scientific notation
  - For example, write 273(10) in scientific notation:
      • Move the decimal point to the right of the most significant digit and increase the exponent:

            273 = 2.73 × 10^2

• In general, a number is written in scientific notation as:

      ± M × B^E

  where
  - M = mantissa
  - B = base
  - E = exponent
  - In the example, M = 2.73, B = 10, and E = 2 (that is, +2.73 × 10^2)
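
The same normalization can be sketched in base 2 (B = 2). The function name `normalize` is hypothetical. Python's `math.frexp` returns a mantissa in [0.5, 1), so it is rescaled here to put the binary point right after the most significant 1:

```python
import math

def normalize(x):
    """Return (M, E) with 1 <= |M| < 2 and x == M * 2**E."""
    if x == 0:
        raise ValueError("0 has no normalized form")
    m, e = math.frexp(x)     # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1

print(normalize(228))   # (1.78125, 7), i.e. 1.11001(2) x 2^7
```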

Floating-Point Numbers

• Floating-point number representation using 32 bits
  - 1 sign bit
  - 8 exponent bits
  - 23 bits for the mantissa

      Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)

• The following slides show three versions of the floating-point representation of 228(10) using 32 bits
  - The final version is the IEEE 754 floating-point standard

Floating-Point Representation #1

• First, convert the decimal number to binary
  - 228(10) = 11100100(2) = 1.11001(2) × 2^7

• Next, fill in each field of the 32-bit number:
  - The sign bit (1 bit) is 0, since the number is positive
  - The exponent (8 bits) is 7 (111)
  - The mantissa (23 bits) is 1.11001

      Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
      0            | 00000111          | 111 0010 0000 0000 0000 0000

Floating-Point Representation #2

• You may have noticed that the first bit of the mantissa is always 1, since the binary point floats to the right of the most significant 1
  - Example: 228(10) = 11100100(2) = 1.11001(2) × 2^7

• Thus, storing the most significant 1 (also called the implicit leading 1) is redundant

• We can store just the fraction bits in the 23-bit field
  - Now, the leading 1 is implied

      Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
      0            | 00000111          | 110 0100 0000 0000 0000 0000

Floating-Point Representation #3

• The exponent needs to represent both positive and negative values

• The final change is to use a biased exponent
  - The IEEE 754 standard uses a bias of 127 for the 32-bit format
  - Biased exponent = bias + exponent
      • For example, an exponent of 7 is stored as 127 + 7 = 134 = 10000110(2)

• Thus, 228(10) in the IEEE 754 32-bit floating-point standard is

      Sign (1 bit) | Biased exponent (8 bits) | Fraction (23 bits)
      0            | 10000110                 | 110 0100 0000 0000 0000 0000
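
The field assembly can be sketched by hand as below. The function `encode_single` is a hypothetical helper, not a library call. It takes the sign, the unbiased exponent, and the fraction bits after the leading 1, and packs them into a 32-bit word; the last line cross-checks the word by decoding it with Python's `struct` module.

```python
import struct

def encode_single(sign, exponent, fraction_bits):
    """Pack sign, unbiased exponent, and a fraction bit string into 32 bits."""
    bias = 127
    biased = exponent + bias
    fraction = int(fraction_bits.ljust(23, "0"), 2)   # pad the fraction to 23 bits
    return (sign << 31) | (biased << 23) | fraction

word = encode_single(0, 7, "11001")   # 228 = 1.11001(2) x 2^7
print(format(word, "032b"))           # 01000011011001000000000000000000
print(hex(word))                      # 0x43640000
print(struct.unpack(">f", struct.pack(">I", word))[0])   # 228.0
```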

Example

• Represent -58(10) using the IEEE 754 floating-point standard
  - First, convert the decimal number to binary
      • 58(10) = 111010(2) = 1.1101(2) × 2^5
  - Next, fill in each field of the 32-bit number
      • The sign bit is 1 (negative)
      • The 8 exponent bits are 127 + 5 = 132 = 10000100(2)
      • The remaining 23 bits are the fraction bits: 11010000...000(2)

      Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
      1            | 10000100          | 110 1000 0000 0000 0000 0000

  - It is 0xC2680000 in hexadecimal form

• Check this against the result of the sample program on slide 3
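
The sample program referred to above is not part of this text. As a stand-in, here is a small sketch (the helper name `float_to_hex32` is hypothetical) that prints the single-precision bit pattern of a value using Python's `struct` module:

```python
import struct

def float_to_hex32(x):
    """Return the IEEE 754 single-precision bit pattern of x as a hex string."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return f"0x{word:08X}"

print(float_to_hex32(-58.0))   # 0xC2680000
```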

Floating-Point Numbers: Special Cases

• The IEEE 754 standard includes special cases for numbers that are difficult to represent, such as 0, which lacks an implicit leading 1

      Number | Sign | Exponent | Fraction
      0      | X    | 00000000 | 00000000000000000000000
      ∞      | 0    | 11111111 | 00000000000000000000000
      -∞     | 1    | 11111111 | 00000000000000000000000
      NaN    | X    | 11111111 | non-zero

• NaN is used for numbers that don’t exist, such as √(-1) or log(-5)
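
These encodings can be checked by decoding the bit patterns directly. The helper `bits_to_float` is a hypothetical name, and 0x7FC00000 is just one of many possible NaN patterns, since any non-zero fraction qualifies:

```python
import math
import struct

def bits_to_float(word):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(bits_to_float(0x00000000))               # 0.0
print(bits_to_float(0x80000000))               # -0.0 (the sign bit of zero is a don't-care)
print(bits_to_float(0x7F800000))               # inf
print(bits_to_float(0xFF800000))               # -inf
print(math.isnan(bits_to_float(0x7FC00000)))   # True
```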

Floating-Point Number Precision
• The IEEE 754 standard also defines a 64-bit double-precision format that provides greater precision and greater range
  - Single precision (use the float declaration in the C language)
      • 32-bit notation
      • 1 sign bit, 8 exponent bits, 23 fraction bits
      • bias = 127
      • Normalized magnitudes span from 1.175494 × 10^-38 to 3.402823 × 10^38
  - Double precision (use the double declaration in the C language)
      • 64-bit notation
      • 1 sign bit, 11 exponent bits, 52 fraction bits
      • bias = 1023
      • Normalized magnitudes span from 2.22507385850720 × 10^-308 to 1.79769313486232 × 10^308

• Most general-purpose processors (including Intel and AMD processors) provide hardware support for double-precision floating-point numbers and operations
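
The range limits follow from the field widths. As a sketch (the function `normal_range` is a hypothetical helper), the smallest normalized magnitude uses biased exponent 1 with a zero fraction, and the largest uses the highest non-reserved biased exponent with an all-ones fraction:

```python
import sys

def normal_range(exp_bits, frac_bits):
    """Smallest and largest positive normalized values of a binary format."""
    bias = (1 << (exp_bits - 1)) - 1
    smallest = 2.0 ** (1 - bias)                         # 1.0 x 2^(1 - bias)
    largest = (2.0 - 2.0 ** -frac_bits) * 2.0 ** bias    # 1.11...1 x 2^bias
    return smallest, largest

print(normal_range(8, 23))     # roughly (1.175494e-38, 3.402823e+38)
print(normal_range(11, 52) == (sys.float_info.min, sys.float_info.max))   # True on IEEE 754 platforms
```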

Double Precision Example
• Represent -58(10) using IEEE 754 double precision
  - First, convert the decimal number to binary
      • 58(10) = 111010(2) = 1.1101(2) × 2^5
  - Next, fill in each field of the 64-bit number
      • The sign bit is 1 (negative)
      • The 11 exponent bits are 1023 + 5 = 1028 = 10000000100(2)
      • The remaining 52 bits are the fraction bits: 11010000...000(2)
  - It is 0xC04D0000_00000000 in hexadecimal form

• Check this against the result of the sample program on slide 4
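
A sketch of the same check for double precision. The helper name `float_to_hex64` is hypothetical, and the field extraction simply follows the 1/11/52 layout described above:

```python
import struct

def float_to_hex64(x):
    """Return the IEEE 754 double-precision bit pattern of x as a hex string."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"0x{word:016X}"

word_hex = float_to_hex64(-58.0)
print(word_hex)   # 0xC04D000000000000

word = int(word_hex, 16)
sign = word >> 63
biased_exponent = (word >> 52) & 0x7FF
fraction = word & ((1 << 52) - 1)
print(sign, biased_exponent, format(fraction >> 48, "04b"))   # 1 1028 1101
```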

Represent 0.7

• Represent 0.7 in IEEE 754 single-precision form
  - 1/2 = 0.5 = 0.1(2)                     // 0.7 - 0.5 = 0.2
  - 1/8 = 0.125 = 0.001(2)                 // 0.2 - 0.125 = 0.075
  - 1/16 = 0.0625 = 0.0001(2)              // 0.075 - 0.0625 = 0.0125
  - 1/128 = 0.0078125 = 0.0000001(2)       // 0.0125 - 0.0078125 = 0.0046875
  - 1/256 = 0.00390625 = 0.00000001(2)     // 0.0046875 - 0.00390625 = 0.00078125
  - ...
  - Thus, 0.7 = 0.10110011...(2) = 1.0110011...(2) × 2^-1

• In IEEE 754 single precision, 0.7 = 0x3F333333
  - Check it against slide 6

• The IEEE 754 floating-point standard can’t represent some numbers exactly
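
The expansion above can be reproduced with exact rational arithmetic. In this sketch the function name `binary_fraction_digits` is hypothetical; it uses repeated doubling, which yields the same digits as subtracting powers of two. The second half encodes 0.7 as a single-precision value and shows that the stored value is only close to 0.7:

```python
import struct
from fractions import Fraction

def binary_fraction_digits(value, count):
    """First `count` binary digits after the point of an exact 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value >= 1)
        digits.append(str(bit))
        value -= bit
    return "".join(digits)

print(binary_fraction_digits(Fraction(7, 10), 16))   # 1011001100110011

(word,) = struct.unpack(">I", struct.pack(">f", 0.7))
print(f"0x{word:08X}")                               # 0x3F333333
stored = struct.unpack(">f", struct.pack(">I", word))[0]
print(stored == 0.7)                                 # False
```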

Binary Coded Decimal (BCD)
• Since floating-point number systems can’t represent some numbers exactly, such as 0.1 and 0.7, some applications (for example, calculators) use BCD (binary coded decimal)
  - BCD numbers encode each decimal digit using 4 bits with a range of 0 to 9

      Decimal digit | BCD
      0             | 0000
      1             | 0001
      2             | 0010
      3             | 0011
      4             | 0100
      5             | 0101
      6             | 0110
      7             | 0111
      8             | 1000
      9             | 1001

  - BCD fixed-point notation examples
      • 1.7 = 0001 . 0111
      • 4.9 = 0100 . 1001

• BCD is very common in electronic systems where a numeric value is to be displayed, especially in systems consisting solely of digital logic (not containing a microprocessor) - Wiki
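
A minimal sketch of the digit-by-digit encoding. The function name `to_bcd` is hypothetical, and the decimal point is kept in the output only as a visual marker, since in a real BCD fixed-point format it would be implied:

```python
def to_bcd(text):
    """Encode a decimal string digit by digit, 4 bits per digit."""
    return " ".join("." if ch == "." else format(int(ch), "04b") for ch in text)

print(to_bcd("1.7"))   # 0001 . 0111
print(to_bcd("4.9"))   # 0100 . 1001
```
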
Backup Slides
Floating-Point Numbers: Rounding
https://www.youtube.com/watch?v=Iw77CYUT74c
• Arithmetic results that fall outside the available precision must be rounded to a neighboring number
• Rounding modes
  - Round down
  - Round up
  - Round toward zero
  - Round to nearest

• Example
  - Round 1.100101(2) (1.578125) so that it uses only 3 fraction bits
      • Round down: 1.100
      • Round up: 1.101
      • Round toward zero: 1.100
      • Round to nearest: 1.101
          - 1.625 is closer to 1.578125 than 1.5 is
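
The four modes can be sketched with exact fractions. The function `round_modes` is a hypothetical helper. Note that Python's `round` breaks exact ties toward the even neighbor, a detail that does not matter for this example:

```python
import math
from fractions import Fraction

def round_modes(value, frac_bits):
    """Round an exact value to `frac_bits` fraction bits in four ways."""
    scaled = Fraction(value) * (1 << frac_bits)
    unit = Fraction(1, 1 << frac_bits)
    return {
        "down": math.floor(scaled) * unit,
        "up": math.ceil(scaled) * unit,
        "toward zero": math.trunc(scaled) * unit,
        "nearest": round(scaled) * unit,
    }

x = Fraction(0b1100101, 1 << 6)   # 1.100101(2) = 1.578125
for mode, result in round_modes(x, 3).items():
    print(mode, float(result))    # down 1.5, up 1.625, toward zero 1.5, nearest 1.625
```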

Floating-Point Addition with the Same Sign

• Addition with floating-point numbers is not as simple as addition with 2’s complement numbers

• The steps for adding floating-point numbers with the same sign are as follows
  1. Extract exponent and fraction bits
  2. Prepend leading 1 to form mantissa
  3. Compare exponents
  4. Shift smaller mantissa if necessary
  5. Add mantissas
  6. Normalize mantissa and adjust exponent if necessary
  7. Round result
  8. Assemble exponent and fraction back into floating-point format

Floating-Point Addition Example

Add the following floating-point numbers:

      1.5 + 3.25

      1.5(10) = 1.1(2) × 2^0
      3.25(10) = 11.01(2) = 1.101(2) × 2^1

      1.5(10) = 0x3FC00000 in IEEE 754 single precision
      3.25(10) = 0x40500000 in IEEE 754 single precision

Floating-Point Addition Example

1. Extract exponent and fraction bits

      Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
      0            | 01111111          | 100 0000 0000 0000 0000 0000    (N1)
      0            | 10000000          | 101 0000 0000 0000 0000 0000    (N2)

      For the first number (N1): S = 0, E = 127, F = .1
      For the second number (N2): S = 0, E = 128, F = .101

2. Prepend leading 1 to form mantissa

      N1: 1.1
      N2: 1.101

Floating-Point Addition Example

3. Compare exponents
      127 - 128 = -1, so shift N1 right by 1 bit

4. Shift smaller mantissa if necessary
      Shift N1’s mantissa: 1.1 >> 1 = 0.11 (× 2^1)

5. Add mantissas
         0.11  × 2^1
      +  1.101 × 2^1
      --------------
        10.011 × 2^1

Floating-Point Addition Example

6. Normalize mantissa and adjust exponent if necessary
      10.011 × 2^1 = 1.0011 × 2^2

7. Round result
      No need (fits in 23 bits)

8. Assemble exponent and fraction back into floating-point format
      S = 0, E = 2 + 127 = 129 = 10000001(2), F = 001100...

      Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
      0            | 10000001          | 001 1000 0000 0000 0000 0000

      4.75(10) = 0x40980000 in hexadecimal form
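
The eight steps can be sketched on raw 32-bit patterns as below. This is an illustration under simplifying assumptions, not a full IEEE 754 adder: the function name `fp_add_same_sign` is hypothetical, both inputs must be normalized and share a sign, bits shifted out are truncated rather than rounded, and exponent overflow is not handled.

```python
import struct

def fp_add_same_sign(a_bits, b_bits):
    """Add two normalized single-precision numbers with the same sign (sketch)."""
    # 1. Extract exponent and fraction bits
    sign = a_bits >> 31
    exp_a, frac_a = (a_bits >> 23) & 0xFF, a_bits & 0x7FFFFF
    exp_b, frac_b = (b_bits >> 23) & 0xFF, b_bits & 0x7FFFFF
    # 2. Prepend the leading 1 to form 24-bit mantissas
    man_a, man_b = frac_a | (1 << 23), frac_b | (1 << 23)
    # 3-4. Compare exponents and shift the mantissa of the smaller number right
    if exp_a < exp_b:
        man_a >>= exp_b - exp_a
        exp = exp_b
    else:
        man_b >>= exp_a - exp_b
        exp = exp_a
    # 5. Add mantissas
    man = man_a + man_b
    # 6. Normalize: a carry into bit 24 means the sum is >= 2
    if man >> 24:
        man >>= 1
        exp += 1
    # 7. Rounding is skipped in this sketch (truncation only)
    # 8. Assemble the fields again, dropping the implicit leading 1
    return (sign << 31) | (exp << 23) | (man & 0x7FFFFF)

result = fp_add_same_sign(0x3FC00000, 0x40500000)          # 1.5 + 3.25
print(hex(result))                                         # 0x40980000
print(struct.unpack(">f", struct.pack(">I", result))[0])   # 4.75
```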
