gh-134004: Dbm vacuuming #134028

Andrea-Oliveri · 2025-05-15T05:39:36Z

gh-134004:

Added a .reorganize() method to dbm.sqlite3 and dbm.dumb, which reclaims unused space from database files. The name was chosen for consistency with dbm.gnu.reorganize(). dbm.ndbm does not support any such optimization.
dbm.dumb.reorganize() is a new custom implementation that rewrites the database file in place, removes the gaps left by deletions, and then truncates the excess.
Added 3 tests to ensure .reorganize() runs as expected on all compatible dbm submodules.
Updated the documentation of the dbm module to include the contributed methods and to highlight some limitations of the current implementations, such as the lack of concurrency support (for dbm.dumb) and the lack of automatic reorganizing.
Added a .reorganize() method to shelve to expose the dbm submodules' methods.
Added documentation for shelve.reorganize().
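
For readers unfamiliar with this kind of compaction, here is a minimal sketch of the "rewrite in place, close the gaps, truncate" idea on a toy layout. It is not the code from Lib/dbm/dumb.py: the function name compact_in_place and the {key: (offset, size)} index are assumptions made for illustration, and dbm.dumb's block alignment is ignored.

```python
def compact_in_place(data_path, index):
    """Close the gaps in data_path and return the updated index.

    `index` maps each key to the (offset, size) of its value in the file.
    Hypothetical toy layout, not dbm.dumb's real on-disk format.
    """
    new_index = {}
    with open(data_path, "rb+") as f:
        write_pos = 0
        # Walk the values in file order so each one only ever moves toward
        # the start of the file and never overwrites data not yet copied.
        for key, (pos, size) in sorted(index.items(), key=lambda kv: kv[1][0]):
            if pos != write_pos:
                f.seek(pos)
                chunk = f.read(size)
                f.seek(write_pos)
                f.write(chunk)
            new_index[key] = (write_pos, size)
            write_pos += size
        # Everything past write_pos is now unused; give it back to the OS.
        f.truncate(write_pos)
    return new_index
```

A crash partway through the loop leaves the file half-moved while the caller still holds the old index, which is the fail-safety concern raised later in the review.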

Note: Unfortunately I did not feel comfortable changing the C code at this time, so I chose not to add an empty .reorganize() method to dbm.ndbm. This resulted in:

Inconsistencies between the methods supported by dbm submodules.
The added tests need to verify, using hasattr(), whether the dbm objects have the .reorganize() method and succeed early if that is not the case.
The documentation clearly explains the situation described here.
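
Because not every backend exposes the method, calling code can use the same hasattr()-style check as the tests. The helper below is a hedged sketch: reorganize_if_supported is a hypothetical name, not part of the dbm API, and on Python versions that predate this PR it simply returns False.

```python
import dbm.dumb
import os
import tempfile


def reorganize_if_supported(db):
    """Call db.reorganize() if the backend has it; return whether it ran."""
    reorganize = getattr(db, "reorganize", None)
    if reorganize is None:
        # For example dbm.ndbm, or any backend on an older Python.
        return False
    reorganize()
    return True


def demo():
    """Write two keys, delete one, then try to reclaim the freed space."""
    with tempfile.TemporaryDirectory() as tmp:
        with dbm.dumb.open(os.path.join(tmp, "demo"), "c") as db:
            db[b"keep"] = b"1"
            db[b"drop"] = b"2" * 4096
            del db[b"drop"]
            ran = reorganize_if_supported(db)
            return ran, sorted(db.keys())
```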

bedevere-app · 2025-05-15T05:39:41Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

python-cla-bot · 2025-05-15T05:39:41Z

All commit authors signed the Contributor License Agreement.

Lib/dbm/dumb.py

serhiy-storchaka

Very good code.

However, there are a few issues.

dbm.dumb does not support concurrent writing, but it supports limited concurrent access with a single writer and multiple readers. Readers may not see new data, and can read deleted data (because they cache the index), but otherwise they do not conflict with the writer. But when the writer calls reorganize(), it breaks the readers. So using this feature can cause new bugs.

I am not sure that we can do anything about this.
Another issue is that reorganize() is inherently not fail-safe. If something happens during the reorganization process, you may lose data.

Do gdbm_reorganize() and SQLite VACUUM have the same issues, or do they provide some protection?

Doc/library/dbm.rst

Lib/dbm/dumb.py

Andrea-Oliveri · 2025-05-17T08:39:38Z

Good morning @serhiy-storchaka,
thank you very much for the review and the nice compliment 😃 .

Easy changes

I have pushed changes to address the documentation and dbm.dumb comment remarks.

Reorganize not being fail-safe

On the topic of reorganize() being inherently not fail-safe, I agree with you. It is my understanding that dbm.dumb does very little to be robust in general. Most notably, it relies on its __del__ method to write back, which may not always be called in case of trouble. This is why I chose this less resource-intensive implementation over a more robust one. If we want more robustness, either of these two approaches should work, although neither offers a 100% robustness guarantee:

After every block is moved in the binary file, rewrite the index
- Advantage: keeps disk space usage down.
- Disadvantage: many small writes can occur.
- Disadvantage: if a crash / KeyboardInterrupt happens between the moment a block is moved and the moment the index is rewritten, the database is corrupt.
Reorganize the binary file into a temporary file and, only once that is complete, rename it over the old version and rewrite the index
- Advantage: minimizes the window of time during which the database would be corrupt in case of a crash.
- Disadvantage: uses more disk space.
- Disadvantage: if a crash / KeyboardInterrupt happens between replacing the old binary file with the new one and replacing the old index file with the new one, the database is corrupt.
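
To make the second approach concrete, here is a rough sketch on a toy layout. It is an illustration only, not a proposed patch: compact_via_temp_file and the {key: (offset, size)} index are hypothetical, and dbm.dumb's real block alignment and index file handling are left out.

```python
import os
import tempfile


def compact_via_temp_file(data_path, index):
    """Write a gap-free copy of data_path, then swap it in with os.replace().

    Returns the new {key: (offset, size)} index, which the caller still has
    to persist. Hypothetical toy layout, not dbm.dumb's on-disk format.
    """
    new_index = {}
    directory = os.path.dirname(os.path.abspath(data_path))
    # Keep the temporary file in the same directory so the final rename
    # does not have to cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(data_path, "rb") as src:
            for key, (pos, size) in sorted(index.items(), key=lambda kv: kv[1][0]):
                src.seek(pos)
                new_index[key] = (dst.tell(), size)
                dst.write(src.read(size))
            dst.flush()
            os.fsync(dst.fileno())
        # Until this call the original file is untouched.
        os.replace(tmp_path, data_path)
    except BaseException:
        # Includes KeyboardInterrupt: drop the partial copy, keep the original.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return new_index
```

The remaining risk is exactly the one listed above: after os.replace() succeeds, the on-disk index is stale until the caller rewrites it.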

Let me know how you think we should proceed. SQLite and GDBM are designed to be much more robust and use the second technique to ensure failed reorganizations don't corrupt the database. Crucially, however, they handle a single binary file containing both index and data, hence either everything is renamed atomically or nothing is.
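
For reference, this is roughly what space reclamation looks like with the plain sqlite3 module. It is a standalone sketch, not the dbm.sqlite3 implementation: the table name kv and the helper vacuum_demo are made up for the example.

```python
import os
import sqlite3
import tempfile


def vacuum_demo():
    """Return (size_after_delete, size_after_vacuum) for a throwaway database."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite3")
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE kv (key INTEGER PRIMARY KEY, value BLOB)")
            conn.executemany(
                "INSERT INTO kv (key, value) VALUES (?, ?)",
                ((i, b"x" * 1000) for i in range(1000)),
            )
            conn.commit()
            conn.execute("DELETE FROM kv")
            conn.commit()  # VACUUM cannot run inside an open transaction
            size_after_delete = os.path.getsize(path)
            conn.execute("VACUUM")
            size_after_vacuum = os.path.getsize(path)
        finally:
            conn.close()
        return size_after_delete, size_after_vacuum
```

Deleting the rows alone leaves the freed pages inside the file; only the VACUUM step shrinks it.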

Concurrency

My apologies, but I would disagree that dbm.dumb supports any concurrency at all... I totally understand what you mean, but for me, if readers can access deleted keys and are not notified when the value of an existing key is updated, then the behaviour under concurrency is undefined rather than supported in a limited way.

Examples:

Suppose I want to update a key with a new value. __setitem__ first checks whether the previous value used at least as many blocks as the new value requires. If so, it makes the change in place. Say the old value occupied twice as many blocks as the new value: __setitem__ writes the new value in place but leaves the second half of the blocks untouched. A reader holding the old index will read as many bytes as the old value required, so it gets the new value followed by leftover gibberish that __setitem__ did not overwrite. This could break modules like shelve that expect each value to be a valid pickle.
Similarly, if I update a key with a new value that needs more blocks than the old one, readers will keep reading the old value.
We already touched upon deletion.
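
The first example can be reproduced with a short script. This is a sketch that relies on the current in-place update behaviour of dbm.dumb's __setitem__ described above, which is an implementation detail rather than documented API; stale_read_demo is a name chosen for the example.

```python
import dbm.dumb
import os
import tempfile


def stale_read_demo():
    """Return (stale, fresh): what an old reader and the writer see."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo")
        writer = dbm.dumb.open(path, "c")
        writer[b"key"] = b"A" * 20
        writer.sync()  # make sure the index is on disk before the reader opens

        # The reader loads the index once and caches (offset, size) per key.
        reader = dbm.dumb.open(path, "r")

        # The shorter value fits in the old block, so it is written in place.
        writer[b"key"] = b"B" * 5
        writer.sync()

        stale = reader[b"key"]  # uses the cached size of the old value
        fresh = writer[b"key"]
        reader.close()
        writer.close()
        return stale, fresh
```

The reader's result starts with the new value but is as long as the old one, with the tail made of bytes from the previous value.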

Of course this problem is well-known to any developer who looked at the source code for dbm and dbm.dumb. However, I found the documentation was a bit lackluster on these problems, so I have also updated it to warn users of these limitations (see the warning under dbm.dumb.open).

One thing that was already well explained in the documentation, however:

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing.

Please let me know how you would like to proceed on this issue as well.

In the meantime, have a great day! 😃

Lib/test/test_dbm.py

serhiy-storchaka

Thank you for your response. What is left:

Add .. versionadded:: next for all new methods.
Add What's New entries.

serhiy-storchaka

LGTM. 👍

Thank you for your contribution, @Andrea-Oliveri. Please ping me next week if I don't merge this PR before then.

Andrea-Oliveri · 2025-05-28T18:20:19Z

Thank you @serhiy-storchaka 😄 . Just FYI: I had merge conflicts in the whatsnew file, so I merged main into this branch after you approved the PR but before I read your message.

erlend-aasland · 2025-06-02T09:33:32Z

This is great; thanks for the proposal and implementation, @Andrea-Oliveri; thanks for managing this, @serhiy-storchaka.

Andrea-Oliveri added 5 commits May 14, 2025 17:42

Added tests for vacuuming functionality of dbm

7505713

Added vacuuming logic to dbm.sqlite

1147774

Added vacuuming logic to dbm.dumb

109a378

Updated documentation of dbm

cdacb53

Adapted vacuum tests to allow for submodules missing method

02a7b8a

Andrea-Oliveri requested review from berkerpeksag, erlend-aasland, corona10 and serhiy-storchaka as code owners May 15, 2025 05:39

bedevere-app bot added the awaiting review label May 15, 2025

Pushing news and acks entries

dcb43a2

Andrea-Oliveri changed the title ~~gh134004: Dbm vacuuming~~ gh-134004: Dbm vacuuming May 15, 2025

bedevere-app bot mentioned this pull request May 15, 2025

DBM Module Vacuuming #134004

Closed

Andrea-Oliveri added 7 commits May 15, 2025 07:52

Changed News entry to avoid failure during Doc testing due to referen…

89fb2db

…ce to module links not found

Changed method names from .vacuum to .reorganize in dbm.sqlite and db…

476dc55

…m.dumb for consistency with dbm.gnu. Also updated documentations and tests to reflect the change

Added .reorganize() method in shelve to expose dbm submodule's own .r…

19c0c8d

…eorganize()

Added documentation for shelve.reorganize

88b4014

Fixed link in doc

5c1d45f

Updated news

992e7aa

PR review: removed unnecessary .keys()

b96480b

Yang-Wei-Ting reviewed May 16, 2025

Lib/dbm/dumb.py (outdated, resolved)

Andrea-Oliveri requested a review from Yang-Wei-Ting May 16, 2025 16:25

serhiy-storchaka reviewed May 16, 2025

Doc/library/dbm.rst (outdated, resolved)

Lib/dbm/dumb.py (outdated, resolved)

Andrea-Oliveri added 3 commits May 17, 2025 09:52

Updated documentation to correct notes indentation

4c23b64

Left previously removed comment as requested in PR

8a80977

Modified documentation of dbm.dumb warning to align with shelve warning

166a553

serhiy-storchaka reviewed May 28, 2025

Lib/test/test_dbm.py (outdated, resolved)

Skipping test instead of succeeding if method not implemented for sub…

6f34de5

…module

serhiy-storchaka reviewed May 28, 2025

Lib/test/test_dbm.py (outdated, resolved)

Converted redundant f-string to regular string

059ad82

serhiy-storchaka reviewed May 28, 2025

Andrea-Oliveri added 2 commits May 28, 2025 19:33

Added versionadded to method documentations

3e7049f

Added whatsnew entries

2f5af38

serhiy-storchaka approved these changes May 28, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels May 28, 2025

Merged changes from branch main

e2370ac

serhiy-storchaka merged commit f806463 into python:main Jun 1, 2025
39 checks passed

bedevere-app bot removed the awaiting merge label Jun 1, 2025

github-project-automation bot moved this to Done in lavitaconnect@MOSTAFAAMMER Jun 1, 2025

serhiy-storchaka linked an issue Jun 1, 2025 that may be closed by this pull request

DBM Module Vacuuming #134004

Closed

Andrea-Oliveri deleted the dbm-vacuum-gh134004 branch June 6, 2025 06:01
