Skip to content

Commit 3ba11d3

Browse files
committed
Teach CLUSTER to use seqscan-and-sort when it's faster than indexscan.
... or at least, when the planner's cost estimates say it will be faster. Leonardo Francalanci, reviewed by Itagaki Takahiro and Tom Lane
1 parent 694c56a commit 3ba11d3

File tree

14 files changed

+713
-138
lines changed

14 files changed

+713
-138
lines changed

doc/src/sgml/ref/cluster.sgml

Lines changed: 30 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -128,18 +128,33 @@ CLUSTER [VERBOSE]
128128
</para>
129129

130130
<para>
131-
During the cluster operation, a temporary copy of the table is created
132-
that contains the table data in the index order. Temporary copies of
133-
each index on the table are created as well. Therefore, you need free
134-
space on disk at least equal to the sum of the table size and the index
135-
sizes.
131+
<command>CLUSTER</> can re-sort the table using either an indexscan
132+
on the specified index, or (if the index is a b-tree) a sequential
133+
scan followed by sorting. It will attempt to choose the method that
134+
will be faster, based on planner cost parameters and available statistical
135+
information.
136136
</para>
137137

138138
<para>
139-
Because <command>CLUSTER</command> remembers the clustering information,
140-
one can cluster the tables one wants clustered manually the first time, and
141-
setup a timed event similar to <command>VACUUM</command> so that the tables
142-
are periodically reclustered.
139+
When an indexscan is used, a temporary copy of the table is created that
140+
contains the table data in the index order. Temporary copies of each
141+
index on the table are created as well. Therefore, you need free space on
142+
disk at least equal to the sum of the table size and the index sizes.
143+
</para>
144+
145+
<para>
146+
When a sequential scan and sort is used, a temporary sort file is
147+
also created, so that the peak temporary space requirement is as much
148+
as double the table size, plus the index sizes. This method is often
149+
faster than the indexscan method, but if the disk space requirement is
150+
intolerable, you can disable this choice by temporarily setting <xref
151+
linkend="guc-enable-sort"> to <literal>off</>.
152+
</para>
153+
154+
<para>
155+
It is advisable to set <xref linkend="guc-maintenance-work-mem"> to
156+
a reasonably large value (but not more than the amount of RAM you can
157+
dedicate to the <command>CLUSTER</> operation) before clustering.
143158
</para>
144159

145160
<para>
@@ -150,35 +165,13 @@ CLUSTER [VERBOSE]
150165
</para>
151166

152167
<para>
153-
There is another way to cluster data. The
154-
<command>CLUSTER</command> command reorders the original table by
155-
scanning it using the index you specify. This can be slow
156-
on large tables because the rows are fetched from the table
157-
in index order, and if the table is disordered, the
158-
entries are on random pages, so there is one disk page
159-
retrieved for every row moved. (<productname>PostgreSQL</productname> has
160-
a cache, but the majority of a big table will not fit in the cache.)
161-
The other way to cluster a table is to use:
162-
163-
<programlisting>
164-
CREATE TABLE <replaceable class="parameter">newtable</replaceable> AS
165-
SELECT * FROM <replaceable class="parameter">table</replaceable> ORDER BY <replaceable class="parameter">columnlist</replaceable>;
166-
</programlisting>
167-
168-
which uses the <productname>PostgreSQL</productname> sorting code
169-
to produce the desired order;
170-
this is usually much faster than an index scan for disordered data.
171-
Then you drop the old table, use
172-
<command>ALTER TABLE ... RENAME</command>
173-
to rename <replaceable class="parameter">newtable</replaceable> to the
174-
old name, and recreate the table's indexes.
175-
The big disadvantage of this approach is that it does not preserve
176-
OIDs, constraints, foreign key relationships, granted privileges, and
177-
other ancillary properties of the table &mdash; all such items must be
178-
manually recreated. Another disadvantage is that this way requires a sort
179-
temporary file about the same size as the table itself, so peak disk usage
180-
is about three times the table size instead of twice the table size.
168+
Because <command>CLUSTER</command> remembers which indexes are clustered,
169+
one can cluster the tables one wants clustered manually the first time,
170+
then set up a periodic maintenance script that executes
171+
<command>CLUSTER</> without any parameters, so that the desired tables
172+
are periodically reclustered.
181173
</para>
174+
182175
</refsect1>
183176

184177
<refsect1>

src/backend/commands/cluster.c

Lines changed: 121 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,7 @@
3636
#include "commands/trigger.h"
3737
#include "commands/vacuum.h"
3838
#include "miscadmin.h"
39+
#include "optimizer/planner.h"
3940
#include "storage/bufmgr.h"
4041
#include "storage/procarray.h"
4142
#include "storage/smgr.h"
@@ -49,6 +50,7 @@
4950
#include "utils/snapmgr.h"
5051
#include "utils/syscache.h"
5152
#include "utils/tqual.h"
53+
#include "utils/tuplesort.h"
5254

5355

5456
/*
@@ -69,7 +71,10 @@ static void copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
6971
int freeze_min_age, int freeze_table_age,
7072
bool *pSwapToastByContent, TransactionId *pFreezeXid);
7173
static List *get_tables_to_cluster(MemoryContext cluster_context);
72-
74+
static void reform_and_rewrite_tuple(HeapTuple tuple,
75+
TupleDesc oldTupDesc, TupleDesc newTupDesc,
76+
Datum *values, bool *isnull,
77+
bool newRelHasOids, RewriteState rwstate);
7378

7479

7580
/*---------------------------------------------------------------------------
@@ -759,6 +764,8 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
759764
TransactionId OldestXmin;
760765
TransactionId FreezeXid;
761766
RewriteState rwstate;
767+
bool use_sort;
768+
Tuplesortstate *tuplesort;
762769

763770
/*
764771
* Open the relations we need.
@@ -845,12 +852,30 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
845852
rwstate = begin_heap_rewrite(NewHeap, OldestXmin, FreezeXid, use_wal);
846853

847854
/*
848-
* Scan through the OldHeap, either in OldIndex order or sequentially, and
849-
* copy each tuple into the NewHeap. To ensure we see recently-dead
850-
* tuples that still need to be copied, we scan with SnapshotAny and use
855+
* Decide whether to use an indexscan or seqscan-and-optional-sort to
856+
* scan the OldHeap. We know how to use a sort to duplicate the ordering
857+
* of a btree index, and will use seqscan-and-sort for that case if the
858+
* planner tells us it's cheaper. Otherwise, always indexscan if an
859+
* index is provided, else plain seqscan.
860+
*/
861+
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
862+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
863+
else
864+
use_sort = false;
865+
866+
/* Set up sorting if wanted */
867+
if (use_sort)
868+
tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
869+
maintenance_work_mem, false);
870+
else
871+
tuplesort = NULL;
872+
873+
/*
874+
* Prepare to scan the OldHeap. To ensure we see recently-dead tuples
875+
* that still need to be copied, we scan with SnapshotAny and use
851876
* HeapTupleSatisfiesVacuum for the visibility test.
852877
*/
853-
if (OldIndex != NULL)
878+
if (OldIndex != NULL && !use_sort)
854879
{
855880
heapScan = NULL;
856881
indexScan = index_beginscan(OldHeap, OldIndex,
@@ -862,17 +887,21 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
862887
indexScan = NULL;
863888
}
864889

890+
/*
891+
* Scan through the OldHeap, either in OldIndex order or sequentially;
892+
* copy each tuple into the NewHeap, or transiently to the tuplesort
893+
* module. Note that we don't bother sorting dead tuples (they won't
894+
* get to the new table anyway).
895+
*/
865896
for (;;)
866897
{
867898
HeapTuple tuple;
868-
HeapTuple copiedTuple;
869899
Buffer buf;
870900
bool isdead;
871-
int i;
872901

873902
CHECK_FOR_INTERRUPTS();
874903

875-
if (OldIndex != NULL)
904+
if (indexScan != NULL)
876905
{
877906
tuple = index_getnext(indexScan, ForwardScanDirection);
878907
if (tuple == NULL)
@@ -951,45 +980,50 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
951980
continue;
952981
}
953982

954-
/*
955-
* We cannot simply copy the tuple as-is, for several reasons:
956-
*
957-
* 1. We'd like to squeeze out the values of any dropped columns, both
958-
* to save space and to ensure we have no corner-case failures. (It's
959-
* possible for example that the new table hasn't got a TOAST table
960-
* and so is unable to store any large values of dropped cols.)
961-
*
962-
* 2. The tuple might not even be legal for the new table; this is
963-
* currently only known to happen as an after-effect of ALTER TABLE
964-
* SET WITHOUT OIDS.
965-
*
966-
* So, we must reconstruct the tuple from component Datums.
967-
*/
968-
heap_deform_tuple(tuple, oldTupDesc, values, isnull);
983+
if (tuplesort != NULL)
984+
tuplesort_putheaptuple(tuplesort, tuple);
985+
else
986+
reform_and_rewrite_tuple(tuple,
987+
oldTupDesc, newTupDesc,
988+
values, isnull,
989+
NewHeap->rd_rel->relhasoids, rwstate);
990+
}
969991

970-
/* Be sure to null out any dropped columns */
971-
for (i = 0; i < natts; i++)
992+
if (indexScan != NULL)
993+
index_endscan(indexScan);
994+
if (heapScan != NULL)
995+
heap_endscan(heapScan);
996+
997+
/*
998+
* In scan-and-sort mode, complete the sort, then read out all live
999+
* tuples from the tuplestore and write them to the new relation.
1000+
*/
1001+
if (tuplesort != NULL)
1002+
{
1003+
tuplesort_performsort(tuplesort);
1004+
1005+
for (;;)
9721006
{
973-
if (newTupDesc->attrs[i]->attisdropped)
974-
isnull[i] = true;
975-
}
1007+
HeapTuple tuple;
1008+
bool shouldfree;
9761009

977-
copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
1010+
CHECK_FOR_INTERRUPTS();
9781011

979-
/* Preserve OID, if any */
980-
if (NewHeap->rd_rel->relhasoids)
981-
HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
1012+
tuple = tuplesort_getheaptuple(tuplesort, true, &shouldfree);
1013+
if (tuple == NULL)
1014+
break;
9821015

983-
/* The heap rewrite module does the rest */
984-
rewrite_heap_tuple(rwstate, tuple, copiedTuple);
1016+
reform_and_rewrite_tuple(tuple,
1017+
oldTupDesc, newTupDesc,
1018+
values, isnull,
1019+
NewHeap->rd_rel->relhasoids, rwstate);
9851020

986-
heap_freetuple(copiedTuple);
987-
}
1021+
if (shouldfree)
1022+
heap_freetuple(tuple);
1023+
}
9881024

989-
if (OldIndex != NULL)
990-
index_endscan(indexScan);
991-
else
992-
heap_endscan(heapScan);
1025+
tuplesort_end(tuplesort);
1026+
}
9931027

9941028
/* Write out any remaining tuples, and fsync if needed */
9951029
end_heap_rewrite(rwstate);
@@ -1488,3 +1522,50 @@ get_tables_to_cluster(MemoryContext cluster_context)
14881522

14891523
return rvs;
14901524
}
1525+
1526+
1527+
/*
1528+
* Reconstruct and rewrite the given tuple
1529+
*
1530+
* We cannot simply copy the tuple as-is, for several reasons:
1531+
*
1532+
* 1. We'd like to squeeze out the values of any dropped columns, both
1533+
* to save space and to ensure we have no corner-case failures. (It's
1534+
* possible for example that the new table hasn't got a TOAST table
1535+
* and so is unable to store any large values of dropped cols.)
1536+
*
1537+
* 2. The tuple might not even be legal for the new table; this is
1538+
* currently only known to happen as an after-effect of ALTER TABLE
1539+
* SET WITHOUT OIDS.
1540+
*
1541+
* So, we must reconstruct the tuple from component Datums.
1542+
*/
1543+
static void
1544+
reform_and_rewrite_tuple(HeapTuple tuple,
1545+
TupleDesc oldTupDesc, TupleDesc newTupDesc,
1546+
Datum *values, bool *isnull,
1547+
bool newRelHasOids, RewriteState rwstate)
1548+
{
1549+
HeapTuple copiedTuple;
1550+
int i;
1551+
1552+
heap_deform_tuple(tuple, oldTupDesc, values, isnull);
1553+
1554+
/* Be sure to null out any dropped columns */
1555+
for (i = 0; i < newTupDesc->natts; i++)
1556+
{
1557+
if (newTupDesc->attrs[i]->attisdropped)
1558+
isnull[i] = true;
1559+
}
1560+
1561+
copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
1562+
1563+
/* Preserve OID, if any */
1564+
if (newRelHasOids)
1565+
HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
1566+
1567+
/* The heap rewrite module does the rest */
1568+
rewrite_heap_tuple(rwstate, tuple, copiedTuple);
1569+
1570+
heap_freetuple(copiedTuple);
1571+
}

0 commit comments

Comments
 (0)
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy