External Sorting: Comp 521 - Files and Databases Fall 2010 1

External Sorting

Chapter 13

Comp 521 – Files and Databases Fall 2010 1


Why Sort?
  A classic problem in computer science!
  Advantages of requesting data in sorted order
  gathers duplicates
  allows for efficient searches
  Sorting is first step in bulk loading B+ tree index.
  Sort-merge join algorithm involves sorting.
  Problem: sort 20 GB of data with 1 GB of RAM.
  why not let the OS handle it with virtual memory?



2-Way Sort: Requires 3 Buffers
  Pass 0: Read a page, sort it, write it.
  only one buffer page is used
  Pass 1, 2, 3, …, etc.:
  Read two pages, merge them, and write merged page
  Requires three buffer pages.

[Figure: Disk → INPUT 1, INPUT 2 → OUTPUT (main memory buffers) → Disk]



Two-Way External Merge Sort
  Each pass we read + write each page in file.
  N pages in the file => the number of passes = ⌈log2 N⌉ + 1
  So total cost is: 2N (⌈log2 N⌉ + 1)
  Idea: Divide and conquer: sort pages and merge

Example (7-page file, 2 records per page):
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 (4-page runs):  2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 (8-page runs):  1,2,2,3,3,4,4,5,6,6,7,8,9
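The two-way scheme can be sketched in a few lines of Python. This is an in-memory simulation only, not a disk-based implementation: each inner list stands for one page, and the function name `two_way_external_sort` is made up for illustration. It uses the 7-page input file from the slide's example.

```python
from heapq import merge

def two_way_external_sort(pages):
    """Simulate two-way external merge sort; return the runs after each pass."""
    # Pass 0: sort each page on its own, giving 1-page runs.
    runs = [sorted(page) for page in pages]
    history = [runs]
    # Later passes: merge neighbouring runs pairwise until one run remains.
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        history.append(runs)
    return history

# The 7-page input file from the example (2 records per page).
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
history = two_way_external_sort(pages)
for number, runs in enumerate(history):
    print(f"pass {number}: {runs}")
```

The simulation needs four passes (0 through 3) for the 7-page file, which agrees with ⌈log2 7⌉ + 1 = 4.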
General External Merge Sort
  More than 3 buffer pages. How can we utilize them?
  To sort a file with N pages using B buffer pages:
  Pass 0: use B buffer pages. Produce ⌈N / B⌉ sorted runs of B pages each.
  Pass 1, 2, …, etc.: merge B-1 runs.
[Figure: Disk → INPUT 1, INPUT 2, ..., INPUT B-1 → OUTPUT (B main memory buffers) → Disk]
General External Merge Sort
  More than 3 buffer pages. How can we utilize them?
  Key Insight #1: We can merge more than 2 input buffers at a time… affects fanout → base of log!
  Key Insight #2: The output buffer is generated
incrementally, so only one buffer page is needed for
any size of run!
  To sort a file with N pages using B buffer pages:
  Pass 0: use B buffer pages. Produce ⌈N / B⌉ sorted runs of B pages each.
  Pass 1, 2, …, etc.: merge B-1 runs, leaving one page for output.
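A minimal sketch of the general algorithm, again simulated in memory rather than on disk. The function name `external_merge_sort` and the parameters `page_size` and `num_buffers` are illustrative assumptions; `num_buffers` plays the role of B.

```python
from heapq import merge

def external_merge_sort(records, page_size, num_buffers):
    """Sort records with a simulated B-buffer external merge sort.

    Returns (sorted_records, number_of_passes).
    """
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: fill all B buffers, sort them in memory, write one run.
    run_len = page_size * num_buffers
    runs = [sorted(records[i:i + run_len])
            for i in range(0, len(records), run_len)]
    passes = 1
    # Later passes: merge B-1 runs at a time; one buffer is kept for output.
    fan_in = num_buffers - 1
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

# 216 records at 2 records per page is a 108-page file; sort it with B = 5.
data = list(range(216, 0, -1))
result, passes = external_merge_sort(data, page_size=2, num_buffers=5)
print(passes, result[:5])
```

`heapq.merge` consumes its inputs lazily and yields one record at a time, which mirrors Key Insight #2: the output is produced incrementally, so a single output page is enough no matter how long the merged run becomes.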
Cost of External Merge Sort
  Number of passes: 1 + ⌈log_(B-1) ⌈N / B⌉⌉
  Cost = 2N * (# of passes)
  E.g., with 5 buffer pages, to sort 108 page file:
  Pass 0: ⌈108 / 5⌉ = 22 sorted runs of 5 pages each
(last run is only 3 pages)
  Pass 1: ⌈22 / 4⌉ = 6 sorted runs of 20 pages each
(last run is only 8 pages)
  Pass 2: ⌈6 / 4⌉ = 2 sorted runs, 80 pages and 28 pages
  Pass 3: Sorted file of 108 pages
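The arithmetic of the 108-page example can be checked with a short script. The helper names `runs_per_pass` and `sort_cost` are hypothetical. The loop counts runs with integer ceilings instead of evaluating a logarithm, which avoids floating-point rounding.

```python
from math import ceil

def runs_per_pass(num_pages, num_buffers):
    """Number of sorted runs left after each pass, starting with pass 0."""
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = ceil(num_pages / num_buffers)        # pass 0: runs of B pages
    counts = [runs]
    while runs > 1:
        runs = ceil(runs / (num_buffers - 1))   # merge B-1 runs at a time
        counts.append(runs)
    return counts

def sort_cost(num_pages, num_buffers):
    """Total page I/Os: every pass reads and writes every page once."""
    return 2 * num_pages * len(runs_per_pass(num_pages, num_buffers))

print(runs_per_pass(108, 5))  # [22, 6, 2, 1]
print(sort_cost(108, 5))      # 864
```

Four passes over 108 pages, each reading and writing every page, gives 2 * 108 * 4 = 864 page I/Os.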


Number of Passes of External Sort



Internal Sort Algorithm
  Quicksort is a fast way to sort in memory.
  Very fast on average
  Alternative is “replacement sort”
  Top: Read in B blocks
  Fill: Find the smallest record that is not smaller than the largest value in the output buffer
•  if one exists, add it to the end of the output buffer
•  fill the moved record’s slot with the next value from the input buffer; if the input buffer is empty, refill it
  else end run
  goto Fill
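The loop above can be sketched with a heap. This is one common way to implement replacement selection and not necessarily the exact procedure the slide has in mind: instead of scanning the workspace for the smallest eligible record, each record is tagged with the run it belongs to, so records that must wait for the next run sink below those of the current run. The name `replacement_selection` and the `capacity` parameter (workspace size in records) are assumptions.

```python
import heapq
from itertools import islice

def replacement_selection(records, capacity):
    """Split records into sorted runs using a workspace of `capacity` records."""
    source = iter(records)
    exhausted = object()   # sentinel marking the end of the input
    # Each heap entry is (run_number, value); entries of the current run
    # sort before entries held back for the next run.
    heap = [(0, value) for value in islice(source, capacity)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_number, value = heapq.heappop(heap)
        if run_number == len(runs):
            runs.append([])            # first record of a new run
        runs[run_number].append(value)
        nxt = next(source, exhausted)
        if nxt is not exhausted:
            # A record smaller than the one just written cannot join this run.
            target = run_number if nxt >= value else run_number + 1
            heapq.heappush(heap, (target, nxt))
    return runs

runs = replacement_selection([3, 4, 6, 2, 9, 4, 8, 7, 5, 6, 3, 1, 2], capacity=3)
print(runs)  # [[3, 4, 6, 9], [2, 4, 5, 6, 7, 8], [1, 2, 3]]
```

With a workspace of only 3 records, the first two runs already hold 4 and 6 records, longer than the 3-record runs that sorting one workspace at a time would give.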



More on Replacement Sort
  Fact: average length of a run is 2B
  The “snowplow” analogy
  Imagine a snowplow moving
around a circular track on
which snow falls at a steady rate.
  At any instant, there is a certain amount of snow S on the track. Some falling snow comes in front of the plow, some behind.
  During the next revolution of the plow, all of this is
removed, plus 1/2 of what falls during that revolution.
  Thus, the plow removes 2S amount of snow.
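The 2B claim can be checked empirically. The sketch below is an experiment under stated assumptions, not a proof: it feeds uniformly random keys from a seeded generator through a heap-based replacement selection (the helper `run_lengths` is hypothetical) and compares the average run length with the workspace size. The first run and the last two runs are excluded because they are start-up and wind-down cases.

```python
import heapq
import random

def run_lengths(records, capacity):
    """Lengths of the runs produced by heap-based replacement selection."""
    source = iter(records)
    heap = []
    for value in source:               # load the initial workspace
        heap.append((0, value))
        if len(heap) == capacity:
            break
    heapq.heapify(heap)
    lengths = []
    exhausted = object()
    while heap:
        run, value = heapq.heappop(heap)
        if run == len(lengths):
            lengths.append(0)          # a new run starts
        lengths[run] += 1
        nxt = next(source, exhausted)
        if nxt is not exhausted:
            # Smaller records must wait for the next run.
            heapq.heappush(heap, (run if nxt >= value else run + 1, nxt))
    return lengths

rng = random.Random(0)
capacity = 100
data = [rng.random() for _ in range(50_000)]
lengths = run_lengths(data, capacity)
steady = lengths[1:-2]                 # drop start-up and wind-down runs
ratio = sum(steady) / len(steady) / capacity
print(f"average run length = {ratio:.2f} x workspace size")
```

For random input the printed ratio should land close to 2, in line with the snowplow argument.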



More on Replacement Sort
  Fact: average length of a run in heapsort is 2B
  The “snowplow” analogy
  Worst-Case:
  What is min length of a run?
  How does this arise?
  Best-Case:
  What is max length of a run?
  How does this arise?
  Quicksort is faster, but ...



I/O for External Merge Sort
  … longer runs imply fewer passes!
  So far we have assumed I/O is done a page at a time
  In fact, read a block of pages sequentially!
  Suggests we should make each buffer (input/output) be a block of pages.
  But this will reduce fan-out during merge passes!
  In practice, most files still sorted in 2-3 passes.



Number of Passes of Optimized Sort

  Block size = 32, initial pass produces runs of size 2B.
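Pass counts under these settings can be estimated with a small script. The formulas are assumptions in the spirit of the slide: with B buffer pages and blocks of b pages, the merge fan-in is taken to be ⌊B / b⌋ - 1 (one block reserved for output), and pass 0 is assumed to produce ⌈N / 2B⌉ runs. B = 1,000 is borrowed from the later comparison slide, and the function name `passes_blocked` is hypothetical.

```python
from math import ceil

def passes_blocked(num_pages, num_buffers, block_size):
    """Passes when each buffer is a block of pages and pass 0 yields runs of 2B pages."""
    fan_in = num_buffers // block_size - 1    # one block is kept for output
    if fan_in < 2:
        raise ValueError("too few buffers for this block size")
    runs = ceil(num_pages / (2 * num_buffers))
    passes = 1
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return passes

for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"N = {n:>11,}: {passes_blocked(n, 1000, 32)} passes")
```

With B = 1,000 and b = 32 the fan-in is only 30 instead of 999, yet a million-page file still sorts in 3 passes under these assumptions.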


Double Buffering
  To reduce wait time for I/O request to complete, can prefetch into a “shadow block”.
  Potentially, more passes; in practice, most
files still sorted in 2-3 passes.
[Figure: double buffering. Disk → INPUT 1 / INPUT 1', INPUT 2 / INPUT 2', ..., INPUT k / INPUT k' → OUTPUT / OUTPUT' → Disk; each buffer is a block of size b; B main memory buffers, k-way merge]


Sorting Records!
  Sorting has become a blood sport!
  Parallel external sorting is the name of the game ...
  2005 IBM Almaden
  Sort 1 TB of 100-byte records
  Typical DBMS: > 5 days
  World record: 17 min, 37 seconds
•  RS/6000 SP with 488 nodes
•  Each node: four 332 MHz 604 processors, 1.5 GB of RAM, and a 9 GB SCSI disk
  New benchmarks proposed:
  Minute Sort: How many can you sort in 1 minute?
  Dollar Sort: How many can you sort for $1.00?
Using B+ Trees for Sorting

  Scenario: Table to be sorted has B+ tree index on sorting column(s).
  Idea: Can retrieve records in order by traversing
leaf pages.
  Is this a good idea?
  Cases to consider:
  B+ tree is clustered: Good idea!
  B+ tree is not clustered: Could be a very bad idea!



Clustered B+ Tree Used for Sorting
  Cost: root to the left-most leaf, then retrieve all leaf pages (Alternative 1)
  If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once.
  Fill factor of < 100% introduces a small overhead: extra pages fetched
  Always better than external sorting!

[Figure: clustered B+ tree. Index (directs search) → Data Entries ("Sequence set") → Data Records]
Unclustered B+ Tree Used for Sorting
  Alternative (2) for data entries; each data
entry contains rid of a data record. In general,
one I/O per data record!

[Figure: unclustered B+ tree. Index (directs search) → Data Entries ("Sequence set") → Data Records]
External Sorting vs. Unclustered Index

  p: # of records per page


  B=1,000 and block size=32 for sorting
  p=100 is the more realistic value.
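A rough version of this comparison can be computed directly. The sketch assumes the unclustered index costs one I/O per data record (p * N page I/Os for an N-page file, ignoring the index pages themselves), and that external sort costs 2N per pass with fan-in ⌊B / b⌋ - 1 and initial runs of 2B pages. The function names are hypothetical.

```python
from math import ceil

def blocked_sort_cost(num_pages, num_buffers=1000, block_size=32):
    """Page I/Os for external sort: 2N per pass, blocked I/O, initial runs of 2B pages."""
    fan_in = num_buffers // block_size - 1
    runs = ceil(num_pages / (2 * num_buffers))
    passes = 1
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return 2 * num_pages * passes

def unclustered_index_cost(num_pages, records_per_page):
    """Page I/Os when every record fetch costs one I/O (index pages ignored)."""
    return num_pages * records_per_page

print("N, sorting, then index with p = 1, 10, 100")
for n in (100, 10_000, 1_000_000):
    by_index = [unclustered_index_cost(n, p) for p in (1, 10, 100)]
    print(n, blocked_sort_cost(n), by_index)
```

Under these assumptions a million-page file costs about 6 million I/Os to sort, against 100 million through an unclustered index at p = 100.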
Summary
  External sorting is important; DBMS may dedicate
part of buffer pool just for sorting!
  External merge sort minimizes disk I/O cost:
  Pass 0: Produces sorted runs of size B (# buffer pages).
  Later passes: merge runs.
  # of runs merged at a time depends on B, and block size.
  Larger block size means less I/O cost per page.
  Larger block size means smaller # runs merged.
  In practice, # of passes rarely more than 2 or 3.



Summary, cont.
  Choice of internal sort algorithm may matter:
  Quicksort: Quick!
  Replacement sort: slower (2x), but with longer
runs
  The best sorts are wildly fast:
  Despite 40+ years of research, we’re still
improving!
  Clustered B+ tree is good for sorting;
unclustered tree is usually very bad.

