gh-51067: Add `remove()` and `repack()` to `ZipFile` #134627
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
It probably would be better to raise an `AttributeError` instead of a `ValueError` here, since you are trying to access an attribute that a closed zipfile doesn't have.
This behavior simply resembles the existing behavior of other `ZipFile` methods on a closed file.
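The existing behavior being referenced is easy to demonstrate: operations on an already-closed `ZipFile` raise `ValueError`. A minimal check:

```python
import io
import zipfile

# Build a small in-memory archive, then close the reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")

zf = zipfile.ZipFile(buf)
zf.close()

try:
    zf.read("a.txt")
except ValueError as exc:
    # Existing ZipFile methods raise ValueError on a closed archive.
    print(type(exc).__name__)  # ValueError
```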
Pinging @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken, as there may be a potential risk in the space-reclaiming algorithm. The previous PR is kept open in case some folks are interested in it; it will be closed when either one is accepted.
- Separate individual validation tests.
- Check that the underlying repacker is not called in validation.
- Use `unlink` to prevent `FileNotFoundError`.
- Fix the mode `'x'` test.

- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.

- According to the ZIP spec, both uncompressed and compressed size should be `0xffffffff` when zip64 is used.

- The previous implementation might cause the [archive decryption header] and/or [archive extra data record] preceding the [central directory] to be stripped.
Are there still any problems with this PR? Here are some things that may need further consideration/discussion:

1. Should `strict_descriptor` be enabled by default?
| Option | Pros ✅ | Cons ❌ |
|---|---|---|
| `strict_descriptor=True` | Correctly strips any entry with an unsigned data descriptor; stricter adherence to the ZIP spec | ~150× slower in worst cases; might open a hole for DoS (if an attacker crafts offensive entries); slightly higher false-positive risk on random bytes |
| `strict_descriptor=False` | Much faster; safer against DoS scenarios; lower false-positive risk | Cannot strip unsigned descriptors; less strict to the ZIP spec (but doesn't violate it) |
Background

When a local file entry has the flag bit indicating usage of a data descriptor:

1. This method first attempts to scan for a signed data descriptor.
2. If no valid one is found:
   1. For supported compression methods (`ZIP_DEFLATED`, `ZIP_BZIP2`, or `ZIP_ZSTANDARD`), it decompresses the data to find its end offset.
   2. Otherwise, it performs a byte-by-byte scan for an unsigned data descriptor.

This option only affects case 2.2, which is used when neither a signed descriptor nor decompression-based validation is applicable.
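For reference, the two descriptor forms the scan distinguishes can be sketched with `struct` (the helper name and return shape below are mine, not CPython API; per the ZIP spec, the signed form prefixes the three fields with `PK\x07\x08`):

```python
import struct

DD_SIGNATURE = b"PK\x07\x08"  # optional data descriptor signature

def parse_data_descriptor(blob, offset):
    """Parse a (non-zip64) data descriptor, signed or unsigned.

    Returns (crc32, compressed_size, uncompressed_size, total_length).
    Illustrative helper only, not the actual CPython implementation.
    """
    if blob[offset:offset + 4] == DD_SIGNATURE:
        offset += 4
        siglen = 4
    else:
        siglen = 0  # legacy unsigned form: fields start immediately
    crc, csize, usize = struct.unpack_from("<LLL", blob, offset)
    return crc, csize, usize, siglen + 12

# Signed and unsigned descriptors carrying the same fields:
signed = DD_SIGNATURE + struct.pack("<LLL", 0xDEADBEEF, 20, 30)
unsigned = struct.pack("<LLL", 0xDEADBEEF, 20, 30)
print(parse_data_descriptor(signed, 0))    # (3735928559, 20, 30, 16)
print(parse_data_descriptor(unsigned, 0))  # (3735928559, 20, 30, 12)
```

Only the signed form can be located by its signature; the unsigned form is indistinguishable from arbitrary bytes, which is what forces the byte-by-byte scan.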
Performance comparison

Based on the benchmark (see tests in `test_zipfile64.py`):

- 8 GiB `ZIP_STORED` file with a signed data descriptor: ~56.7 s
- 400 MiB `ZIP_STORED` file with an unsigned data descriptor: ~270 s
The latter is over 150× slower due to the byte-by-byte scanning for a valid data descriptor.
This may also raise a security concern, since `strict_descriptor=True` may open a path for a DoS attack (if an attacker crafts a ZIP file with offensive entries).
False-positive risk

It's not possible to guarantee the "real" size of a local file entry with a data descriptor without the information from the central directory.

If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size; checking CRCs could reduce false positives but would significantly degrade performance.) Byte-by-byte validation increases the risk of a false positive compared to signature- or decompression-based validation, which only checks certain points and has more prerequisites.

A false positive should be unlikely in practice. If it were to happen, a stale local file entry would be stripped incompletely (e.g., a 30 MiB entry treated as 20 MiB, leaving 10 MiB of random bytes behind), causing subsequent entries not to be stripped (since the algorithm requires consecutive entries). However, the ZIP file remains uncorrupted.
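The compressed-size-only validation can be illustrated with a toy scanner (an illustrative sketch of the idea, not the actual CPython implementation): bytes that coincidentally match the size check are accepted before the real descriptor is reached.

```python
import struct

def scan_unsigned_descriptor(blob, start):
    """Byte-by-byte scan for an unsigned (non-zip64) data descriptor.

    Toy version of the validation described above: accept the first
    offset whose 'compressed size' field equals the number of bytes
    consumed so far. Illustrative only.
    """
    for pos in range(start, len(blob) - 11):
        crc, csize, usize = struct.unpack_from("<LLL", blob, pos)
        if csize == pos - start:  # the only check: compressed size matches
            return pos, csize
    return None

# Entry data beginning with zero bytes: at pos == start the csize field
# reads as 0, matching 0 consumed bytes -- a false positive.
data = b"\x00" * 8 + b"ABCD" + struct.pack("<LLL", 0x1234, 12, 12)
hit = scan_unsigned_descriptor(data, 0)
print(hit)  # (0, 0): matched long before the real descriptor at offset 12
```

Checking the CRC as well would reject this hit, at the cost of hashing every candidate range.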
Spec compliance

According to the ZIP file format specification, applications SHOULD write signed descriptors and SHOULD support both forms when reading.

Unsigned descriptors are thus considered legacy, but it is unclear whether they are still widely used.

`strict_descriptor=False` adheres less strictly to the spec, but does not violate it, because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.
2. Should we also implement `copy()` for `ZipFile`?
Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:
- Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
- Use a temporary file for buffered copying.
Both approaches are more complex and less performant, due to the need to decompress and recompress data.
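The first workaround looks like this with today's stdlib (the `copy_entry` helper is hypothetical, not a stdlib API): the payload is fully decompressed into memory and recompressed on write.

```python
import io
import zipfile

def copy_entry(zf, src_name, dst_name):
    """Copy an entry by full read + rewrite (the workaround described above).

    Hypothetical helper, not a stdlib API: memory-intensive for large
    entries, and the data is decompressed and recompressed.
    """
    info = zf.getinfo(src_name)
    data = zf.read(src_name)  # decompresses the whole entry into memory
    zf.writestr(dst_name, data, compress_type=info.compress_type)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello world")

with zipfile.ZipFile(buf, "a") as zf:  # mode 'a' allows read, then append
    copy_entry(zf, "a.txt", "b.txt")

with zipfile.ZipFile(buf) as zf:
    print(zf.read("b.txt").decode())  # hello world
```

A `copy()` method backed by raw buffered copying could skip the decompress/recompress round trip entirely.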
It would be much more performant and user-friendly to implement a `copy()` method using the same internal buffered copying technique that `_ZipRepacker` uses.
Additionally, this also opens the door to supporting an efficient `move()` operation, composed of `copy()`, `remove()`, and optionally `repack()`.
An additional question: should this be included in the current PR, or proposed separately as a follow-up?
The current draft for `copy()` support can be found here.
A higher-level question: why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features, and their resulting trail of bugfixes, to the people who might want them. All features have an ongoing cost. Is there a real need for these zipfile features to not simply be advanced ones available in a PyPI package? @jaraco FYI. I appreciate the enthusiasm for implementing something interesting; I'm just not sure we actually want to maintain it within the CPython project. Are there compelling use cases for these features to be part of Python rather than external? (edit: posted this on the Issue, leaving here for posterity)
@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠
good point, moved, thanks! (where to have what discussions is an area where I consider github's UX to be... not great) |
This is a revised version of PR #103033, implementing two new methods in `zipfile.ZipFile`: `remove()` and `repack()`, as suggested in this comment.

Features
`ZipFile.remove(zinfo_or_arcname)`

- Removes a file entry (by `str` path or `ZipInfo`) from the central directory.
- Removes the last matching entry if a `str` path is provided.
- Returns the removed `ZipInfo` instance.
- Supported modes: `'a'`, `'w'`, `'x'`.

`ZipFile.repack(removed=None)`

- Physically removes stale local file entry data that is no longer referenced by the central directory. If `removed` is passed (as a sequence of removed `ZipInfo`s), only their corresponding local file entry data are removed.
- Supported mode: `'a'`.
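As a sketch of how the two proposed methods would combine (this API exists only in this PR, not in any released Python; the calls are illustrative):

```python
import zipfile

# Proposed API from this PR -- not available in released CPython.
with zipfile.ZipFile("archive.zip", "a") as zf:
    info = zf.remove("old.txt")  # drops the entry from the central directory
    zf.repack(removed=[info])    # reclaims the stale local file entry bytes
```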
Rationales

Heuristics Used in `repack()`

Since `repack()` does not immediately clean up removed entries at the time a `remove()` is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes really are previously removed files that are safe to remove.

While local file entries begin with the magic signature `PK\x03\x04`, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space, `repack()` assumes that in a normal ZIP file, local file entries are stored consecutively; if this assumption does not hold, a `BadZipFile` error is raised and no changes are made.

Check the documentation in the source code of `_ZipRepacker.repack()` (which is internally called by `ZipFile.repack()`) for more details.

Supported Modes
There have been opinions that repacking should support modes `'w'` and `'x'` (e.g. #51067 (comment)).

This is NOT introduced, since such modes do not truncate the file at the end of writing and won't really shrink the file size after a removal has been made. Although we could change this behavior for the existing API, some further care has to be taken, because modes `'w'` and `'x'` may be used on an unseekable file and would be broken by such a change. OTOH, mode `'a'` is not expected to work with an unseekable file, since an initial seek is made immediately when it is opened.

📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/