Skip to content

Commit d2086b0

Browse files
committed
Reduce path length for locking leaf B-tree pages during insertion
In our B-tree implementation appropriate leaf page for new tuple insertion is acquired using _bt_search() function. This function always returns leaf page locked in shared mode. In order to obtain exclusive lock, caller have to relock the page. This commit makes _bt_search() function lock leaf page immediately in exclusive mode when needed. That removes unnecessary relock and, in turn reduces lock contention for B-tree leaf pages. Our experiments on multi-core systems showed acceleration up to 4.5 times in corner case. Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
1 parent 8a9b72c commit d2086b0

File tree

2 files changed

+41
-22
lines changed

2 files changed

+41
-22
lines changed

src/backend/access/nbtree/nbtinsert.c

Lines changed: 4 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -216,23 +216,12 @@ _bt_doinsert(Relation rel, IndexTuple itup,
216216

217217
if (!fastpath)
218218
{
219-
/* find the first page containing this key */
220-
stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
221-
NULL);
222-
223-
/* trade in our read lock for a write lock */
224-
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
225-
LockBuffer(buf, BT_WRITE);
226-
227219
/*
228-
* If the page was split between the time that we surrendered our read
229-
* lock and acquired our write lock, then this page may no longer be
230-
* the right place for the key we want to insert. In this case, we
231-
* need to move right in the tree. See Lehman and Yao for an
232-
* excruciatingly precise description.
220+
* Find the first page containing this key. Buffer returned by
221+
* _bt_search() is locked in exclusive mode.
233222
*/
234-
buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
235-
true, stack, BT_WRITE, NULL);
223+
stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
224+
NULL);
236225
}
237226

238227
/*

src/backend/access/nbtree/nbtsearch.c

Lines changed: 37 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -87,17 +87,18 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
8787
* place during the descent through the tree. This is not needed when
8888
* positioning for an insert or delete, so NULL is used for those cases.
8989
*
90-
* NOTE that the returned buffer is read-locked regardless of the access
91-
* parameter. However, access = BT_WRITE will allow an empty root page
92-
* to be created and returned. When access = BT_READ, an empty index
93-
* will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
94-
* any incomplete splits encountered during the search will be finished.
90+
* The returned buffer is locked according to access parameter. Additionally,
91+
* access = BT_WRITE will allow an empty root page to be created and returned.
92+
* When access = BT_READ, an empty index will result in *bufP being set to
93+
* InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
94+
* during the search will be finished.
9595
*/
9696
BTStack
9797
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
9898
Buffer *bufP, int access, Snapshot snapshot)
9999
{
100100
BTStack stack_in = NULL;
101+
int page_access = BT_READ;
101102

102103
/* Get the root page to start with */
103104
*bufP = _bt_getroot(rel, access);
@@ -132,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
132133
*/
133134
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
134135
(access == BT_WRITE), stack_in,
135-
BT_READ, snapshot);
136+
page_access, snapshot);
136137

137138
/* if this is a leaf page, we're done */
138139
page = BufferGetPage(*bufP);
@@ -166,13 +167,42 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
166167
new_stack->bts_btentry = blkno;
167168
new_stack->bts_parent = stack_in;
168169

170+
/*
171+
* Page level 1 is lowest non-leaf page level prior to leaves. So,
172+
* if we're on the level 1 and asked to lock leaf page in write mode,
173+
* then lock next page in write mode, because it must be a leaf.
174+
*/
175+
if (opaque->btpo.level == 1 && access == BT_WRITE)
176+
page_access = BT_WRITE;
177+
169178
/* drop the read lock on the parent page, acquire one on the child */
170-
*bufP = _bt_relandgetbuf(rel, *bufP, blkno, BT_READ);
179+
*bufP = _bt_relandgetbuf(rel, *bufP, blkno, page_access);
171180

172181
/* okay, all set to move down a level */
173182
stack_in = new_stack;
174183
}
175184

185+
/*
186+
* If we're asked to lock leaf in write mode, but didn't manage to, then
187+
* relock. That may happend when root page appears to be leaf.
188+
*/
189+
if (access == BT_WRITE && page_access == BT_READ)
190+
{
191+
/* trade in our read lock for a write lock */
192+
LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);
193+
LockBuffer(*bufP, BT_WRITE);
194+
195+
/*
196+
* If the page was split between the time that we surrendered our read
197+
* lock and acquired our write lock, then this page may no longer be
198+
* the right place for the key we want to insert. In this case, we
199+
* need to move right in the tree. See Lehman and Yao for an
200+
* excruciatingly precise description.
201+
*/
202+
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
203+
true, stack_in, BT_WRITE, snapshot);
204+
}
205+
176206
return stack_in;
177207
}
178208

0 commit comments

Comments
 (0)
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy