gh-134004: Dbm vacuuming #134028

Merged: 21 commits, Jun 1, 2025

Conversation

Andrea-Oliveri
Contributor

@Andrea-Oliveri Andrea-Oliveri commented May 15, 2025

gh-134004:

  1. Added a .reorganize() method to dbm.sqlite3 and dbm.dumb, which reclaims unused space from database files. The name was chosen for consistency with dbm.gnu.reorganize(). dbm.ndbm does not support any such optimization.
  2. dbm.dumb.reorganize() is a new custom implementation rewriting the database file in-place and removing the gaps left by deletions, then truncating the excess.
  3. Added 3 tests to ensure .reorganize() runs as expected on all compatible dbm submodules.
  4. Updated the documentation of the dbm module to include the contributed methods and to highlight some limitations of the current implementations, such as the lack of concurrency compatibility (for dbm.dumb) and the lack of automatic reorganizing.
  5. Added a .reorganize() method to shelve to expose the dbm submodules' methods.
  6. Added documentation for shelve.reorganize().
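
A minimal usage sketch of the method described above, assuming a Python build that includes this change. The `hasattr()` guard keeps the snippet runnable on older versions, where `dbm.dumb` objects have no `.reorganize()`. The helper names `fill_and_delete` and `reclaim` are hypothetical and exist only for illustration.

```python
import dbm.dumb
import os
import tempfile


def fill_and_delete(path):
    """Create a dbm.dumb database, then delete most of its keys."""
    with dbm.dumb.open(path, "c") as db:
        for i in range(100):
            db[f"key{i}"] = b"x" * 1000
        for i in range(1, 100):
            del db[f"key{i}"]  # leaves gaps in the .dat file


def reclaim(path):
    """Call reorganize() when available; return the .dat size before and after."""
    dat = path + ".dat"
    before = os.path.getsize(dat)
    with dbm.dumb.open(path, "w") as db:
        if hasattr(db, "reorganize"):  # absent on Python versions without this change
            db.reorganize()
    return before, os.path.getsize(dat)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example")
    fill_and_delete(path)
    before, after = reclaim(path)
    with dbm.dumb.open(path, "r") as db:
        remaining = db["key0"]  # the surviving key must still be readable
    print(f".dat size before: {before}, after: {after}")
```

On a version without `.reorganize()`, the two sizes are simply equal.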

Note: Unfortunately I did not feel comfortable changing the C code at this time, so I chose not to add an empty .reorganize() method to dbm.ndbm. This resulted in:

  1. Inconsistencies between the methods supported by dbm submodules.
  2. The added tests need to verify, using hasattr(), whether the dbm objects have the .reorganize() method and succeed early if that is not the case.
  3. The documentation clearly explains the situation described here.
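
The `hasattr()` check mentioned in point 2 can be sketched roughly as follows. This is an illustration of the pattern, not the test code from the PR; `check_reorganize` and its return strings are hypothetical.

```python
import importlib
import os
import tempfile

SUBMODULES = ("dbm.sqlite3", "dbm.gnu", "dbm.ndbm", "dbm.dumb")


def check_reorganize(module_name):
    """Return 'unavailable', 'skipped' or 'ok' for one dbm submodule."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "unavailable"  # submodule not built on this platform
    with tempfile.TemporaryDirectory() as tmp:
        db = module.open(os.path.join(tmp, "db"), "c")
        try:
            db[b"kept"] = b"value"
            db[b"dropped"] = b"x" * 4096
            del db[b"dropped"]
            if not hasattr(db, "reorganize"):
                return "skipped"  # succeed early, e.g. for dbm.ndbm
            db.reorganize()
            assert db[b"kept"] == b"value"
            return "ok"
        finally:
            db.close()


results = {name: check_reorganize(name) for name in SUBMODULES}
print(results)
```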

@bedevere-app

bedevere-app bot commented May 15, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot

python-cla-bot bot commented May 15, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@Andrea-Oliveri Andrea-Oliveri changed the title from "gh134004: Dbm vacuuming" to "gh-134004: Dbm vacuuming" May 15, 2025
@bedevere-app bedevere-app bot mentioned this pull request May 15, 2025
Member

@serhiy-storchaka serhiy-storchaka left a comment

Very good code.

However, there are a few issues.

  • dbm.dumb does not support concurrent writing, but it supports limited concurrent access with a single writer and multiple readers. Readers may not see new data and can read deleted data (because they cache the index), but otherwise they do not conflict with the writer. But when the writer calls reorganize(), it breaks the readers. So using this feature can cause new bugs.

    I am not sure that we can do anything about this.

  • The other issue is that reorganize() is inherently not fail-safe. If something happens during the reorganization process, you may lose data.

Do gdbm_reorganize() and SQLite VACUUM have the same issues, or do they provide some protection?

@Andrea-Oliveri
Contributor Author

Good morning @serhiy-storchaka,
thank you very much for the review and the nice compliment 😃 .

Easy changes

I have pushed changes to address the documentation and dbm.dumb comment remarks.

Reorganize not being fail-safe

On the topic of reorganize() being inherently not fail-safe, I agree with you. It is my understanding that dbm.dumb does very little to be robust in general. Most notably, it relies on its __del__ method to write back, which may not always be called in case of trouble. This is why I chose this less resource-intensive implementation over a more robust one. If we want more robustness, either of these two approaches should work, although neither offers a 100% robustness guarantee:

  • After every block is moved in the binary file, rewrite the index
    • Advantage: keeps disk space usage down.
    • Disadvantage: many small writes can occur.
    • Disadvantage: if a crash / KeyboardInterrupt happens between the moment a block is moved and the moment the index is rewritten, the database is corrupt.
  • Reorganize the binary file into a temporary file and, only once completed, rename it to overwrite the old version and rewrite the index
    • Advantage: minimizes the window of time during which the db would be corrupt in case of a crash.
    • Disadvantage: uses more disk space.
    • Disadvantage: if a crash / KeyboardInterrupt happens between replacing the old binary file with the new one and replacing the old index file with the new one, the database is corrupt.

Let me know how you think we should proceed. SQLite and GDBM are designed to be much more robust and use the second technique to ensure failed reorganizations don't corrupt the database. Crucially, however, they handle a single binary file containing both index and data, hence either everything is renamed atomically or nothing is.
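
As a rough illustration of the second technique applied to a single file, here is a sketch that rewrites a file into a temporary sibling and swaps it in with `os.replace()`. The line-based "store" format, the `live:`/`dead:` prefixes and the `compact_atomically` helper are all hypothetical; this is not how dbm.dumb, SQLite or GDBM are implemented.

```python
import os
import tempfile


def compact_atomically(path, is_live):
    """Rewrite `path`, keeping only the lines for which is_live(line) is true."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            for line in src:
                if is_live(line):
                    tmp.write(line)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk before the swap
        os.replace(tmp_path, path)  # the old file stays intact until this call
    except BaseException:  # includes KeyboardInterrupt
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "store.txt")
    with open(store, "wb") as f:
        f.write(b"live:a\ndead:b\nlive:c\n")
    compact_atomically(store, lambda line: line.startswith(b"live:"))
    with open(store, "rb") as f:
        compacted = f.read()
    leftovers = sorted(os.listdir(tmp))
```

With two files (data plus index), no single rename covers both, which is the gap described in the last disadvantage above.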

Concurrency

My apologies, but I would disagree that dbm.dumb supports any concurrency at all... I totally understand what you mean, but for me, if readers can access deleted keys and are not notified when the value of an existing key is updated, then the behaviour under concurrency is undefined rather than supported in a limited way.

Examples:

  • I want to update a key with a new value. In this circumstance, __setitem__ first checks whether the previous value occupies at least as many blocks as the new value requires. If so, it makes the change in place. Let's say the old value occupied twice as many blocks as the new value. __setitem__ will write the new value in place but leave the second half of the blocks untouched. When a reader with the old index reads, it reads as many bytes as the old value required. So it reads the new value, followed by some gibberish left behind by __setitem__. This would crash modules like shelve that expect the value to be a valid pickle.
  • In a similar fashion, if I update a key with a new value larger than the old one, readers will keep reading the old value.
  • We already touched upon the deletion.
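
The first two cases can be reproduced with a small sketch. It relies on implementation details of CPython's dbm.dumb that are assumed here rather than documented guarantees: 512-byte blocks, data written to the .dat file immediately, and an index that each handle loads once at open time.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared")

    writer = dbm.dumb.open(path, "c")
    writer[b"shrinks"] = b"A" * 1024  # occupies two 512-byte blocks
    writer[b"grows"] = b"C" * 512     # occupies one block
    writer.sync()                     # write the index so a reader can open it

    reader = dbm.dumb.open(path, "r")  # caches the index as of this moment

    writer[b"shrinks"] = b"B" * 512   # fits in the old blocks: written in place
    writer[b"grows"] = b"D" * 2048    # does not fit: appended at the end

    stale_shrunk = reader[b"shrinks"]  # read with the old size of 1024 bytes
    stale_grown = reader[b"grows"]     # read from the old position
    fresh_shrunk = writer[b"shrinks"]

    reader.close()
    writer.close()

print(len(stale_shrunk), len(fresh_shrunk))
```

Under those assumptions the reader gets the new 512 bytes followed by 512 leftover bytes of the old value for the first key, and the old value for the second key.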

Of course this problem is well-known to any developer who looked at the source code for dbm and dbm.dumb. However, I found the documentation was a bit lackluster on these problems, so I have also updated it to warn users of these limitations (see the warning under dbm.dumb.open).

One thing that was already well explained in the documentation, however:

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing.

Please let me know how you would like to proceed on this issue as well.

In the meantime, have a great day! 😃

Member

@serhiy-storchaka serhiy-storchaka left a comment

Thank you for your response. What is left:

  • Add .. versionadded:: next for all new methods.
  • Add What's New entries.

Member

@serhiy-storchaka serhiy-storchaka left a comment

LGTM. 👍

Thank you for your contribution, @Andrea-Oliveri. Please ping me next week if I don't merge this PR before then.

@Andrea-Oliveri
Contributor Author

Thank you @serhiy-storchaka 😄 . Just FYI: I had merge conflicts in the whatsnew file, so I merged main into this branch after you approved but before I read your message.

@serhiy-storchaka serhiy-storchaka merged commit f806463 into python:main Jun 1, 2025
39 checks passed
@serhiy-storchaka serhiy-storchaka linked an issue Jun 1, 2025 that may be closed by this pull request
@erlend-aasland
Contributor

This is great; thanks for the proposal and implementation, @Andrea-Oliveri; thanks for managing this, @serhiy-storchaka.

@Andrea-Oliveri Andrea-Oliveri deleted the dbm-vacuum-gh134004 branch June 6, 2025 06:01
Development

Successfully merging this pull request may close these issues.

DBM Module Vacuuming
4 participants