gh-51067: Add remove() and repack() to ZipFile #134627

Status: Open (50 commits to be merged into base branch main)

danny0838 commented May 24, 2025

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.
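
For comparison, the workaround available today is to rebuild the whole archive without the unwanted entry. The sketch below uses only existing zipfile calls; `drop_entry` is a hypothetical helper name, and unlike the proposed remove() it drops every entry with a matching name and recompresses all remaining data.

```python
import io
import zipfile


def drop_entry(raw: bytes, arcname: str) -> bytes:
    """Return a rebuilt copy of the ZIP in `raw` without `arcname`.

    Hypothetical helper, not part of zipfile: every surviving entry is
    decompressed and recompressed, which is the cost that an in-place
    remove() followed by repack() is meant to avoid.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(raw)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename == arcname:
                continue  # skip the entry being "removed"
            # Passing the ZipInfo keeps the name, timestamp and compression type.
            dst.writestr(info, src.read(info))
    return out.getvalue()
```

With the methods proposed here, the equivalent would be to open the archive in mode 'a' and call remove() followed by repack(), leaving the data of the remaining entries untouched.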

Rationales

Heuristics Used in repack()

Since remove() does not physically clean up an entry's data at the time it is called, the header information of removed file entries may be missing by the time repack() runs, so it can be technically difficult to determine whether certain stale bytes really belong to previously removed files and are safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.
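
A small illustration of why the signature alone is not enough: an archive that stores another ZIP uncompressed contains more signature matches than it has entries. This is only a demonstration using existing zipfile calls; `make_zip` is a hypothetical helper.

```python
import io
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # local file header signature


def make_zip(name: str, payload: bytes) -> bytes:
    """Build a one-entry, uncompressed ZIP in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


inner = make_zip("inner.txt", b"hello")
outer = make_zip("inner.zip", inner)  # stored verbatim inside the outer archive

with zipfile.ZipFile(io.BytesIO(outer)) as zf:
    entry_count = len(zf.infolist())

# One real entry, yet more than one signature match in the raw bytes.
signature_count = outer.count(LOCAL_SIG)
print(entry_count, signature_count)
```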

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.

Check the doc in the source code of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) for more details.
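
As a rough illustration of the "consecutive entries" assumption, the sketch below lists the byte ranges in front of the central directory that no referenced entry covers and raises BadZipFile on overlap. It is not the algorithm in _ZipRepacker, which additionally has to decide whether such ranges look like stale entries. `unreferenced_gaps` is a hypothetical helper, it ignores data descriptors, and it relies on the undocumented `start_dir` attribute of ZipFile.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def unreferenced_gaps(raw: bytes) -> list[tuple[int, int]]:
    """Return byte ranges before the central directory that no entry covers.

    Simplifications (assumptions of this sketch): entries carry no data
    descriptor, and `start_dir` is an undocumented ZipFile attribute.
    """
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        start_dir = zf.start_dir
        spans = []
        for info in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(raw, info.header_offset)
            name_len, extra_len = fields[9], fields[10]
            end = (info.header_offset + LOCAL_HEADER.size
                   + name_len + extra_len + info.compress_size)
            spans.append((info.header_offset, end))

    gaps, pos = [], 0
    for start, end in sorted(spans):
        if start < pos:
            raise zipfile.BadZipFile("overlapping local file entries")
        if start > pos:
            gaps.append((pos, start))
        pos = end
    if start_dir > pos:
        gaps.append((pos, start_dir))
    return gaps
```

For a freshly written archive the result is empty; prepending 16 arbitrary bytes (as a self-extracting stub would) yields a single gap covering those bytes.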

Supported Modes

There have been opinions that repacking should support modes 'w' and 'x' (e.g. #51067 (comment)).

This is NOT introduced, since such modes do not truncate the file at the end of writing and would not really shrink the file after a removal has been made. Although we could change that behavior for the existing API, further care would be needed, because modes 'w' and 'x' may be used on an unseekable file and would be broken by such a change. OTOH, mode 'a' is not expected to work with an unseekable file, since an initial seek is made immediately when the file is opened.
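
The difference between the modes can be observed with a write-only stream. In this sketch `WriteOnlyStream` is a hypothetical stand-in for a pipe or socket; the exact exception type raised for mode 'a' is an implementation detail and is deliberately not asserted.

```python
import io
import zipfile


class WriteOnlyStream(io.RawIOBase):
    """Minimal unseekable sink, e.g. a pipe or socket (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def seekable(self):
        return False

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


# Mode 'w' streams the archive out without ever seeking.
sink = WriteOnlyStream()
with zipfile.ZipFile(sink, "w") as zf:
    zf.writestr("a.txt", b"hello")
archive = b"".join(sink.chunks)

# Mode 'a' has to look for an existing central directory first, so it fails.
try:
    zipfile.ZipFile(WriteOnlyStream(), "a")
except (OSError, zipfile.BadZipFile) as exc:
    append_error = exc
else:
    append_error = None
```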



📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

bedevere-app bot commented May 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

sharktide

This comment was marked as off-topic.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:29
sharktide (Contributor) left a comment:

It would probably be better to raise an AttributeError instead of a ValueError here, since you are trying to access an attribute that a closed ZipFile doesn’t have.

danny0838 (Author) replied:

> It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

This behavior simply resembles open() and write(), which raises a ValueError in various cases. Furthermore there has been a change from raising RuntimeError since Python 3.6:

Changed in version 3.6: Calling open() on a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

Changed in version 3.6: Calling write() on a ZipFile created with mode 'r' or a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.
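
The existing behavior being mirrored can be checked directly on Python 3.6 or later. The sketch uses writestr() for the read-only case, since it is documented to behave like write() here and needs no file on disk.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")

errors = []

# zf is closed once the with-block exits.
try:
    zf.open("a.txt")
except ValueError as exc:
    errors.append(("open on closed", type(exc).__name__))

# Writing through a read-only handle fails the same way.
with zipfile.ZipFile(buf, "r") as ro:
    try:
        ro.writestr("b.txt", b"nope")
    except ValueError as exc:
        errors.append(("writestr in mode 'r'", type(exc).__name__))
```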

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:58
danny0838 (Author) commented May 24, 2025

Kindly notifying @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken, as the space-reclaiming algorithm may carry some risk.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 added 6 commits May 25, 2025 16:18
- Separate individual validation tests.
- Check underlying repacker not called in validation.
- Use `unlink` to prevent FileNotFoundError.
- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.
danny0838 (Author) commented May 31, 2025

Is there still any problem with this PR?

Here are some points that may need further consideration or discussion:

1. Should strict_descriptor option for repack() default to True or False?

Summary of trade-offs

| Option | Pros ✅ | Cons ❌ |
|---|---|---|
| strict_descriptor=False | Correctly strips any entry with an unsigned data descriptor; adheres more strictly to the ZIP spec | ~150× slower in worst cases; might open a hole for DoS (if an attacker crafts offensive entries); slightly higher false-positive risk on random bytes |
| strict_descriptor=True | Much faster; safer against DoS scenarios; lower false-positive risk | Cannot strip unsigned descriptors; less strict to the ZIP spec (but doesn't violate it) |

Background

When a local file entry has the flag bit indicating usage of data descriptor:

  1. This method first attempts to scan for a signed data descriptor.
  2. If no valid one is found:
    1. For supported compression methods (ZIP_DEFLATED, ZIP_BZIP2, or ZIP_ZSTANDARD), it decompresses the data to find its end offset.
    2. Otherwise it performs a byte-by-byte scan for an unsigned data descriptor.

This option only affects case 2.2, which is used when neither signed descriptor nor decompression-based validation is applicable.
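
Step 1 can be sketched as follows. This is a simplified illustration, not the code in the PR: it assumes 32-bit (non-ZIP64) descriptor fields, validates only the compressed size, and `find_signed_descriptor_end` is a hypothetical name.

```python
import struct
import zlib

DD_SIG = b"PK\x07\x08"  # optional data descriptor signature


def find_signed_descriptor_end(raw: bytes, data_start: int):
    """Return the offset just past a plausible signed descriptor, or None.

    Assumes 32-bit (non-ZIP64) descriptor fields: CRC-32, compressed size,
    uncompressed size. Only the compressed size is validated.
    """
    pos = raw.find(DD_SIG, data_start)
    while pos != -1:
        if pos + 16 <= len(raw):
            _crc, csize, _usize = struct.unpack_from("<3L", raw, pos + 4)
            if csize == pos - data_start:
                return pos + 16
        pos = raw.find(DD_SIG, pos + 1)
    return None


# Stored payload that itself contains the signature bytes as a decoy.
payload = b"abc" + DD_SIG + b"xyz"
descriptor = DD_SIG + struct.pack(
    "<3L", zlib.crc32(payload), len(payload), len(payload))
raw = payload + descriptor + b"NEXT"
end = find_signed_descriptor_end(raw, 0)
```

Because only positions where the signature occurs are examined, this scan can rely on a fast substring search instead of testing every offset.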

Performance comparison

Based on the benchmark (see tests in test_zipfile64.py):

  • 8 GiB ZIP_STORED file with signed data descriptor: ~56.7s
  • 400 MiB ZIP_STORED file with unsigned data descriptor: ~270s

The latter is over 150× slower due to the byte-by-byte scanning for a valid data descriptor.

This may also raise a security concern since strict_descriptor=False may open a path for a DoS Attack (if an attacker crafts a ZIP file with offensive entries).

False-positive risk

It's not possible to guarantee the "real" file size of a local entry with a data descriptor without the information from the central directory.

If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size. Checking CRCs could reduce false positives but would significantly degrade performance.) The byte-by-byte validation can increase the risk of a false positive compared to signature-based or decompression-based validation, which only checks certain points and has more prerequisites.

A false positive should be unlikely in practice. If one did occur, a stale local file entry would be stripped incompletely (e.g. a 30 MiB entry treated as 20 MiB, leaving 10 MiB of leftover bytes), and the following entries would not be stripped (since the algorithm requires consecutive entries). However, the ZIP file would remain uncorrupted.
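
The following sketch constructs such a false positive on purpose. It assumes stored data and 32-bit descriptor fields; `find_unsigned_descriptor_end` and its `check_crc` switch are hypothetical and exist only to show the trade-off between the size-only check and a CRC check.

```python
import struct
import zlib


def find_unsigned_descriptor_end(raw: bytes, data_start: int,
                                 check_crc: bool = False):
    """Byte-by-byte scan for a 12-byte unsigned descriptor (32-bit fields).

    Returns the offset just past the first plausible descriptor, or None.
    `check_crc` is a hypothetical switch for this sketch only.
    """
    for pos in range(data_start, len(raw) - 11):
        crc, csize, _usize = struct.unpack_from("<3L", raw, pos)
        if csize != pos - data_start:
            continue
        if check_crc and crc != zlib.crc32(raw[data_start:pos]):
            continue  # size matched by coincidence, CRC did not
        return pos + 12
    return None


# 30 bytes of stored data whose bytes 4..15 happen to look like a descriptor
# for a 4-byte entry (arbitrary CRC, compressed size 4, uncompressed size 4).
payload = b"ABCD" + struct.pack("<3L", 0xDEADBEEF, 4, 4) + b"more real data"
raw = payload + struct.pack(
    "<3L", zlib.crc32(payload), len(payload), len(payload))

size_only_end = find_unsigned_descriptor_end(raw, 0)
with_crc_end = find_unsigned_descriptor_end(raw, 0, check_crc=True)
```

The size-only scan stops at the decoy and reports a 4-byte entry, while the CRC variant finds the real end at the price of recomputing a checksum for every candidate offset.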

Spec compliance

According to the ZIP file format specification: Applications SHOULD write signed descriptors and SHOULD support both forms when reading.

Unsigned descriptors are thus considered legacy, but it is unclear whether they are still used widely.

strict_descriptor=True adheres less strictly to the spec but does not violate it, because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.

2. Should we also implement copy() for ZipFile?

Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:

  • Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
  • Use a temporary file for buffered copying.

Both approaches are more complex and less performant, due to the need to decompress and recompress data.
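
The first approach looks roughly like this with the current API. `copy_entry` is a hypothetical helper, shown only to make the decompress-then-recompress round trip visible.

```python
import io
import zipfile


def copy_entry(zf: zipfile.ZipFile, src: str, dst: str) -> None:
    """Copy `src` to `dst` inside an archive opened in mode 'a'.

    Hypothetical helper: the whole entry is decompressed into memory and
    then recompressed.
    """
    info = zf.getinfo(src)
    data = zf.read(info)  # full decompression into memory
    zf.writestr(dst, data, compress_type=info.compress_type)  # recompression


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello" * 100)
with zipfile.ZipFile(buf, "a") as zf:
    copy_entry(zf, "a.txt", "b.txt")
```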

It would be much more performant and user-friendly to implement a copy() method, using a buffered copying technique similar to the one _ZipRepacker uses internally.

Additionally, this also opens the door to support an efficient move() operation, composed of copy(), remove(), and optionally repack().

An additional question: should this be included in the current PR, or proposed separately as a follow-up?

The current draft for copy() support can be found here.

gpshead (Member) commented May 31, 2025

A higher level question: Why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features and their resulting trail of bugfixes to people who might want them. All features have an ongoing cost.

Is there a real need for these zipfile features to not simply be advanced ones available in a PyPI package? @jaraco FYI

I appreciate the enthusiasm for implementing something interesting. I'm not sure we actually want to maintain that within the CPython project though. Are there compelling use cases for these features to be part of Python rather than external?

(edit: posted this on the Issue, leaving here for posterity)

danny0838 (Author) commented May 31, 2025

@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠

gpshead (Member) commented May 31, 2025

good point, moved, thanks! (where to have what discussions is an area where I consider github's UX to be... not great)
