Skip to content

Commit ab4bb5c

Browse files
committed
Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
1 parent fc27b40 commit ab4bb5c

File tree

7 files changed

+99
-43
lines changed

7 files changed

+99
-43
lines changed

src/backend/access/nbtree/nbtree.c

Lines changed: 43 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -59,8 +59,8 @@ typedef struct
5959
IndexBulkDeleteCallback callback;
6060
void *callback_state;
6161
BTCycleId cycleid;
62-
BlockNumber lastBlockVacuumed; /* last blkno reached by Vacuum scan */
63-
BlockNumber lastUsedPage; /* blkno of last non-recyclable page */
62+
BlockNumber lastBlockVacuumed; /* highest blkno actually vacuumed */
63+
BlockNumber lastBlockLocked; /* highest blkno we've cleanup-locked */
6464
BlockNumber totFreePages; /* true total # of free pages */
6565
MemoryContext pagedelcontext;
6666
} BTVacState;
@@ -665,7 +665,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
665665
vstate.callback_state = callback_state;
666666
vstate.cycleid = cycleid;
667667
vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
668-
vstate.lastUsedPage = BTREE_METAPAGE;
668+
vstate.lastBlockLocked = BTREE_METAPAGE;
669669
vstate.totFreePages = 0;
670670

671671
/* Create a temporary memory context to run _bt_pagedel in */
@@ -721,27 +721,30 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
721721
}
722722

723723
/*
724-
* InHotStandby we need to scan right up to the end of the index for
725-
* correct locking, so we may need to write a WAL record for the final
726-
* block in the index if it was not vacuumed. It's possible that VACUUMing
727-
* has actually removed zeroed pages at the end of the index so we need to
728-
* take care to issue the record for last actual block and not for the
729-
* last block that was scanned. Ignore empty indexes.
724+
* If the WAL is replayed in hot standby, the replay process needs to get
725+
* cleanup locks on all index leaf pages, just as we've been doing here.
726+
* However, we won't issue any WAL records about pages that have no items
727+
* to be deleted. For pages between pages we've vacuumed, the replay code
728+
* will take locks under the direction of the lastBlockVacuumed fields in
729+
* the XLOG_BTREE_VACUUM WAL records. To cover pages after the last one
730+
* we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
731+
* against the last leaf page in the index, if that one wasn't vacuumed.
730732
*/
731733
if (XLogStandbyInfoActive() &&
732-
num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1))
734+
vstate.lastBlockVacuumed < vstate.lastBlockLocked)
733735
{
734736
Buffer buf;
735737

736738
/*
737-
* We can't use _bt_getbuf() here because it always applies
738-
* _bt_checkpage(), which will barf on an all-zero page. We want to
739-
* recycle all-zero pages, not fail. Also, we want to use a
740-
* nondefault buffer access strategy.
739+
* The page should be valid, but we can't use _bt_getbuf() because we
740+
* want to use a nondefault buffer access strategy. Since we aren't
741+
* going to delete any items, getting cleanup lock again is probably
742+
* overkill, but for consistency do that anyway.
741743
*/
742-
buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
743-
info->strategy);
744+
buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
745+
RBM_NORMAL, info->strategy);
744746
LockBufferForCleanup(buf);
747+
_bt_checkpage(rel, buf);
745748
_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
746749
_bt_relbuf(rel, buf);
747750
}
@@ -816,10 +819,6 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
816819
}
817820
}
818821

819-
/* If the page is in use, update lastUsedPage */
820-
if (!_bt_page_recyclable(page) && vstate->lastUsedPage < blkno)
821-
vstate->lastUsedPage = blkno;
822-
823822
/* Page is valid, see what to do with it */
824823
if (_bt_page_recyclable(page))
825824
{
@@ -855,6 +854,13 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
855854
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
856855
LockBufferForCleanup(buf);
857856

857+
/*
858+
* Remember highest leaf page number we've taken cleanup lock on; see
859+
* notes in btvacuumscan
860+
*/
861+
if (blkno > vstate->lastBlockLocked)
862+
vstate->lastBlockLocked = blkno;
863+
858864
/*
859865
* Check whether we need to recurse back to earlier pages. What we
860866
* are concerned about is a page split that happened since we started
@@ -921,18 +927,26 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
921927
*/
922928
if (ndeletable > 0)
923929
{
924-
BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf);
925-
926-
_bt_delitems_vacuum(rel, buf, deletable, ndeletable, vstate->lastBlockVacuumed);
930+
/*
931+
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
932+
* instruction to the replay code to get cleanup lock on all pages
933+
* between the previous lastBlockVacuumed and this page. This
934+
* ensures that WAL replay locks all leaf pages at some point.
935+
*
936+
* Since we can visit leaf pages out-of-order when recursing,
937+
* replay might end up locking such pages an extra time, but it
938+
* doesn't seem worth the amount of bookkeeping it'd take to avoid
939+
* that.
940+
*/
941+
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
942+
vstate->lastBlockVacuumed);
927943

928944
/*
929-
* Keep track of the block number of the lastBlockVacuumed, so we
930-
* can scan those blocks as well during WAL replay. This then
931-
* provides concurrency protection and allows btrees to be used
932-
* while in recovery.
945+
* Remember highest leaf page number we've issued a
946+
* XLOG_BTREE_VACUUM WAL record for.
933947
*/
934-
if (lastBlockVacuumed > vstate->lastBlockVacuumed)
935-
vstate->lastBlockVacuumed = lastBlockVacuumed;
948+
if (blkno > vstate->lastBlockVacuumed)
949+
vstate->lastBlockVacuumed = blkno;
936950

937951
stats->tuples_removed += ndeletable;
938952
/* must recompute maxoff */

src/backend/access/nbtree/nbtxlog.c

Lines changed: 30 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -489,28 +489,47 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
489489
BTPageOpaque opaque;
490490

491491
/*
492-
* If queries might be active then we need to ensure every block is
492+
* If queries might be active then we need to ensure every leaf page is
493493
* unpinned between the lastBlockVacuumed and the current block, if there
494-
* are any. This ensures that every block in the index is touched during
495-
* VACUUM as required to ensure scans work correctly.
494+
* are any. This prevents replay of the VACUUM from reaching the stage of
495+
* removing heap tuples while there could still be indexscans "in flight"
496+
* to those particular tuples (see nbtree/README).
497+
*
498+
* It might be worth checking if there are actually any backends running;
499+
* if not, we could just skip this.
500+
*
501+
* Since VACUUM can visit leaf pages out-of-order, it might issue records
502+
* with lastBlockVacuumed >= block; that's not an error, it just means
503+
* nothing to do now.
504+
*
505+
* Note: since we touch all pages in the range, we will lock non-leaf
506+
* pages, and also any empty (all-zero) pages that may be in the index. It
507+
* doesn't seem worth the complexity to avoid that. But it's important
508+
* that HotStandbyActiveInReplay() will not return true if the database
509+
* isn't yet consistent; so we need not fear reading still-corrupt blocks
510+
* here during crash recovery.
496511
*/
497-
if (standbyState == STANDBY_SNAPSHOT_READY &&
498-
(xlrec->lastBlockVacuumed + 1) != xlrec->block)
512+
if (HotStandbyActiveInReplay())
499513
{
500-
BlockNumber blkno = xlrec->lastBlockVacuumed + 1;
514+
BlockNumber blkno;
501515

502-
for (; blkno < xlrec->block; blkno++)
516+
for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
503517
{
504518
/*
519+
* We use RBM_NORMAL_NO_LOG mode because it's not an error
520+
* condition to see all-zero pages. The original btvacuumpage
521+
* scan would have skipped over all-zero pages, noting them in FSM
522+
* but not bothering to initialize them just yet; so we mustn't
523+
* throw an error here. (We could skip acquiring the cleanup lock
524+
* if PageIsNew, but it's probably not worth the cycles to test.)
525+
*
505526
* XXX we don't actually need to read the block, we just need to
506527
* confirm it is unpinned. If we had a special call into the
507528
* buffer manager we could optimise this so that if the block is
508529
* not in shared_buffers we confirm it as unpinned.
509-
*
510-
* Another simple optimization would be to check if there's any
511-
* backends running; if not, we could just skip this.
512530
*/
513-
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
531+
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
532+
RBM_NORMAL_NO_LOG);
514533
if (BufferIsValid(buffer))
515534
{
516535
LockBufferForCleanup(buffer);

src/backend/access/transam/xlog.c

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7120,7 +7120,8 @@ RecoveryInProgress(void)
71207120
* true. Postmaster knows this by way of signal, not via shared memory.
71217121
*
71227122
* Unlike testing standbyState, this works in any process that's connected to
7123-
* shared memory.
7123+
* shared memory. (And note that standbyState alone doesn't tell the truth
7124+
* anyway.)
71247125
*/
71257126
bool
71267127
HotStandbyActive(void)
@@ -7146,6 +7147,16 @@ HotStandbyActive(void)
71467147
}
71477148
}
71487149

7150+
/*
7151+
* Like HotStandbyActive(), but to be used only in WAL replay code,
7152+
* where we don't need to ask any other process what the state is.
7153+
*/
7154+
bool
7155+
HotStandbyActiveInReplay(void)
7156+
{
7157+
return LocalHotStandbyActive;
7158+
}
7159+
71497160
/*
71507161
* Is this process allowed to insert new WAL records?
71517162
*

src/backend/access/transam/xlogutils.c

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -264,6 +264,10 @@ XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
264264
*
265265
* In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
266266
* relation is extended with all-zeroes pages up to the given block number.
267+
*
268+
* In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
269+
* exist, and we don't check for all-zeroes. Thus, no log entry is made
270+
* to imply that the page should be dropped or truncated later.
267271
*/
268272
Buffer
269273
XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
@@ -304,6 +308,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
304308
log_invalid_page(rnode, forknum, blkno, false);
305309
return InvalidBuffer;
306310
}
311+
if (mode == RBM_NORMAL_NO_LOG)
312+
return InvalidBuffer;
307313
/* OK to extend the file */
308314
/* we do this in recovery only - no rel-extension lock needed */
309315
Assert(InRecovery);

src/backend/storage/buffer/bufmgr.c

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -201,7 +201,8 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
201201
* Assume when this function is called, that reln has been opened already.
202202
*
203203
* In RBM_NORMAL mode, the page is read from disk, and the page header is
204-
* validated. An error is thrown if the page header is not valid.
204+
* validated. An error is thrown if the page header is not valid. (But
205+
* note that an all-zero page is considered "valid"; see PageIsVerified().)
205206
*
206207
* RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
207208
* valid, the page is zeroed instead of throwing an error. This is intended
@@ -215,6 +216,8 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
215216
* current physical EOF; that is likely to cause problems in md.c when
216217
* the page is modified and written out. P_NEW is OK, though.
217218
*
219+
* RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
220+
*
218221
* If strategy is not NULL, a nondefault buffer access strategy is used.
219222
* See buffer/README for details.
220223
*/

src/include/access/xlog.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -295,6 +295,7 @@ extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
295295

296296
extern bool RecoveryInProgress(void);
297297
extern bool HotStandbyActive(void);
298+
extern bool HotStandbyActiveInReplay(void);
298299
extern bool XLogInsertAllowed(void);
299300
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
300301
extern XLogRecPtr GetXLogReplayRecPtr(void);

src/include/storage/bufmgr.h

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,9 @@ typedef enum
3838
RBM_NORMAL, /* Normal read */
3939
RBM_ZERO, /* Don't read from disk, caller will
4040
* initialize */
41-
RBM_ZERO_ON_ERROR /* Read, but return an all-zeros page on error */
41+
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
42+
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
43+
* replay; otherwise same as RBM_NORMAL */
4244
} ReadBufferMode;
4345

4446
/* in globals.c ... this duplicates miscadmin.h */

0 commit comments

Comments
 (0)
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy