bpo-37538: Zipfile refactor #14957

danifus · 2019-07-26T09:26:36Z

This pull request aims to make ZipFile easier to subclass and extend.

The general goals of the refactor were to:

- Add hooks for extending the way zipfile works to enable a subclass to
  add AES encryption without having to duplicate most of the zipfile
  module. This included adding hooks to:
  - Read and write new "extra" data records in the central file directory
    and local header records.
  - Provide a mechanism to substitute the ZipInfo, ZipExtFile and ZipWriteFile
    classes used in a subclass of ZipFile, to ease use of a subclassed ZipInfo,
    ZipExtFile or ZipWriteFile. This avoids having to rewrite large parts of
    the zipfile module if we only want to change the behaviour of a small
    part of one of those classes.
- Contain all code that reads the header, contents and tail of a file in the
  archive within ZipExtFile. Previously, reading the header and some other
  things was done in the ZipFile class before handing the rest of the
  processing to ZipExtFile.
- Contain all code that writes the header, contents and tail of a file in the
  archive within ZipWriteFile. Previously, writing the header and some other
  things was done in the ZipFile class before handing the rest of the
  processing to ZipWriteFile.
- Move generation of the local file header and central directory record
  content to the ZipInfo class, alongside the data that it packs.
- Add comments to provide context from the zip spec. Replace magic numbers
  with descriptively named variables, or explain them in comments.

This patch contains some bug fixes (seeking an encrypted file) that were made
possible by the refactor.

Happy for suggestions to improve this patch.

https://bugs.python.org/issue37538

Replace masking against integer literals with the new global variables.

Easier than writing out `flags | mask` each time.
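As a rough illustration of the two commit notes above, here is a minimal sketch of named flag masks plus read-only flag properties. The bit positions come from the zip APPNOTE (section 4.4.4); the constant names, the `Entry` class and its property names are stand-ins invented for this example and are not taken from the patch.

```python
# General purpose bit flag positions, per the zip APPNOTE section 4.4.4.
# The names below are illustrative, not the ones used in the patch.
MASK_ENCRYPTED = 1 << 0
MASK_USE_DATA_DESCRIPTOR = 1 << 3
MASK_UTF_FILENAME = 1 << 11


class Entry:
    """Hypothetical stand-in for the flag-carrying part of ZipInfo."""

    def __init__(self, flag_bits=0):
        self.flag_bits = flag_bits

    @property
    def is_encrypted(self):
        return bool(self.flag_bits & MASK_ENCRYPTED)

    @property
    def use_datadescriptor(self):
        return bool(self.flag_bits & MASK_USE_DATA_DESCRIPTOR)

    @property
    def is_utf_filename(self):
        return bool(self.flag_bits & MASK_UTF_FILENAME)


# 0x0809 has bits 0, 3 and 11 set.
entry = Entry(flag_bits=0x0809)
plain = Entry(flag_bits=0)
```

Call sites can then test `entry.is_encrypted` rather than repeating `flag_bits & 0x01`.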

** This commit changes the __init__ signature of ZipExtFile **

- ZipExtFile is now exclusively responsible for the following segments: [local file header] [encryption header] [file data] [data descriptor]
- It is responsible for initialising any decryptors too.

** This undoes the previous __init__ method change a few commits ago **

The code to select compressors and decompressors has been moved to subclasses to allow subclasses to extend this process. Also adds a method around _check_compression in ZipFile for a similar purpose.
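One way to picture this is the sketch below. It assumes a hypothetical `Writer` class with a `get_compressor()` method (both names are made up for this example). The compression method numbers (0, 8, 12) are the ones defined by the zip spec.

```python
import bz2
import zlib

ZIP_STORED = 0
ZIP_DEFLATED = 8
ZIP_BZIP2 = 12


class Writer:
    """Hypothetical writer class that owns compressor selection."""

    def get_compressor(self, compress_type, compresslevel=None):
        """Return a compressor object, or None for stored (uncompressed) data."""
        if compress_type == ZIP_STORED:
            return None
        if compress_type == ZIP_DEFLATED:
            level = zlib.Z_DEFAULT_COMPRESSION if compresslevel is None else compresslevel
            # wbits=-15 produces a raw deflate stream without zlib header/trailer.
            return zlib.compressobj(level, zlib.DEFLATED, -15)
        raise NotImplementedError(f"compression type {compress_type} is not supported")


class Bzip2Writer(Writer):
    """Subclass that adds one method and defers everything else to the base."""

    def get_compressor(self, compress_type, compresslevel=None):
        if compress_type == ZIP_BZIP2:
            return bz2.BZ2Compressor(9 if compresslevel is None else compresslevel)
        return super().get_compressor(compress_type, compresslevel)


payload = b"spam " * 100
compressor = Bzip2Writer().get_compressor(ZIP_BZIP2)
packed = compressor.compress(payload) + compressor.flush()
```

The subclass handles only the case it adds and falls back to the base class for everything else, so no selection logic is duplicated.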

This allows these classes which are used inside ZipFile to be overridden in ZipFile subclasses without having to duplicate and alter any method which contains references to them.
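The general pattern can be sketched with toy classes. Everything below (`Archive`, `Entry`, `Reader`, and the `entry_cls` / `reader_cls` attributes) is hypothetical and only mirrors the roles of ZipFile, ZipInfo and ZipExtFile. It is not the patch's actual code.

```python
import io


class Entry:
    """Stand-in for a per-member metadata class (the role ZipInfo plays)."""

    def __init__(self, name):
        self.name = name


class Reader(io.BytesIO):
    """Stand-in for a per-member read handle (the role ZipExtFile plays)."""


class Archive:
    # Hook attributes: subclasses swap these instead of copying every
    # method that instantiates the helper classes.
    entry_cls = Entry
    reader_cls = Reader

    def __init__(self):
        self._members = {}

    def add(self, name, data):
        self._members[name] = (self.entry_cls(name), data)

    def open(self, name):
        _entry, data = self._members[name]
        return self.reader_cls(data)


class UpperReader(Reader):
    """Example override that changes one small piece of read behaviour."""

    def read(self, size=-1):
        return super().read(size).upper()


class UpperArchive(Archive):
    reader_cls = UpperReader  # the only line needed to plug in the subclass


archive = UpperArchive()
archive.add("hello.txt", b"hello")
result = archive.open("hello.txt").read()
```

`Archive.open()` is inherited unchanged. Only the class attribute differs in the subclass.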

** This changes the default content of the `extra` field in the local header to be empty **

Previously, if a file was opened via a ZipInfo instance that had data in the `extra` field, we might have erroneously left the previous values there while appending any new or modified values after the existing content. This behaviour differs from that of writing the central header `extra` field, where we check whether a zip64 entry is already present and remove it if so (via `_strip_extra`). All other extra fields are copied across in that case (which may not be correct either).

** Changes the behaviour of zip64 extra data handling: it now works when a diskno field is present alongside only 1 or 2 other fields **

- We now move an index over the extra fields rather than rewriting each time an extra block was read.
- Methods that handle the extra data now just take the length and payload bytes.
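A minimal sketch of the index-based walk, assuming the record layout from the zip spec (2-byte header ID, 2-byte payload length, then the payload, all little-endian). The function name `iter_extra_fields` and the sample payloads are made up for illustration.

```python
import struct

ZIP64_EXTRA_ID = 0x0001  # header ID of the zip64 extended information field


def iter_extra_fields(extra):
    """Yield (header_id, payload) pairs from a zip 'extra' byte string."""
    idx = 0
    total = len(extra)
    while idx + 4 <= total:
        header_id, length = struct.unpack_from("<HH", extra, idx)
        start = idx + 4
        end = start + length
        if end > total:
            raise ValueError(f"Corrupt extra field {header_id:04x} (size={length})")
        # Handlers only need the payload bytes (and their length).
        yield header_id, extra[start:end]
        idx = end  # advance the index instead of re-slicing the buffer


# Two records: a zip64 record holding one 8-byte size, then an arbitrary
# record with a 2-byte payload (contents chosen only for the example).
sample = (
    struct.pack("<HHQ", ZIP64_EXTRA_ID, 8, 5_000_000_000)
    + struct.pack("<HH", 0x9901, 2) + b"\x02\x00"
)
fields = list(iter_extra_fields(sample))
```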

- This creates a hook for subclasses to add additional integrity checks after the file has been read.
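To illustrate the idea, the toy reader below keeps the running CRC update separate from the final check. The class and method names (`ChecksumReader`, `_update_crc`, `check_integrity`) are hypothetical and are not the names used in the patch.

```python
import zlib


class ChecksumReader:
    """Toy reader that separates accumulating a CRC from verifying it."""

    def __init__(self, data, expected_crc):
        self._data = data
        self._pos = 0
        self._expected_crc = expected_crc
        self._running_crc = zlib.crc32(b"")

    def _update_crc(self, chunk):
        self._running_crc = zlib.crc32(chunk, self._running_crc)

    def check_integrity(self):
        # Hook: a subclass could extend this with further checks
        # (for example an authentication code) once the data is read.
        if self._running_crc != self._expected_crc:
            raise ValueError("Bad CRC-32")

    def read(self, size):
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        self._update_crc(chunk)
        if self._pos == len(self._data):
            self.check_integrity()
        return chunk


payload = b"spam and eggs"
reader = ChecksumReader(payload, zlib.crc32(payload))
out = reader.read(4) + reader.read(100)
```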

This makes all writing of files (directories are handled differently) contained within this class. The local file header often gets rewritten when closing the file item to fix up the compressed size and some other things. One of the tests needed a slight adjustment so `StoredTestsWithSourceFile` would pass when testing broken files. This doesn't change the behaviour of writing files. `StoredTestsWithSourceFile.test_writing_errors()` would fail as OSError wasn't being raised in the `_ZipWriteFile.close()` (in addition to where `stop == count` would indicate OSError should have been raised).

Still not as fast as the module level decrypt approach prior to fixing the seeking bug. From some basic profiling, if we use a coroutine to encapsulate `decrypt()`, we can get speeds slightly faster than the original approach. The question is whether we want that additional complexity.

To enable subclasses of the classes defined in the zipfile module to alter the contents of the written zipfile, the methods responsible for encoding the local file header, central directory and end of file records have been refactored into the following pattern:

- A method collects the parameters to be encoded.
- A method encodes those parameters to a struct.
- A method ties those two methods together.

The `get_*_params()` methods can be overridden to alter the params to be written and implement new features defined in the zip spec. The separate methods for encoding the structs (`_encode_*()`) also act as a sanity check that all the required parameters have been supplied and no unknown parameters are present.
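The three-method pattern might look roughly like this for the (non-zip64) data descriptor record, whose layout is a 4-byte signature followed by CRC-32, compressed size and uncompressed size. The `Entry` class and the exact method names are assumptions that follow the `get_*_params()` / `_encode_*()` naming described above, not the patch itself.

```python
import struct

DATA_DESCRIPTOR_SIGNATURE = 0x08074B50  # packs to b"PK\x07\x08"


class Entry:
    """Hypothetical stand-in for the part of ZipInfo that packs records."""

    def __init__(self, crc, compress_size, file_size):
        self.crc = crc
        self.compress_size = compress_size
        self.file_size = file_size

    def get_data_descriptor_params(self):
        """Collect the values to encode; subclasses override this."""
        return {
            "signature": DATA_DESCRIPTOR_SIGNATURE,
            "crc": self.crc,
            "compress_size": self.compress_size,
            "file_size": self.file_size,
        }

    def _encode_data_descriptor(self, *, signature, crc, compress_size, file_size):
        """Pack the values; keyword-only arguments reject missing or unknown params."""
        return struct.pack("<LLLL", signature, crc, compress_size, file_size)

    def data_descriptor(self):
        """Tie the two steps together."""
        return self._encode_data_descriptor(**self.get_data_descriptor_params())


class ZeroCrcEntry(Entry):
    """Example override that alters one parameter without touching the packing."""

    def get_data_descriptor_params(self):
        params = super().get_data_descriptor_params()
        params["crc"] = 0
        return params


record = Entry(0xDEADBEEF, 10, 20).data_descriptor()
zeroed = ZeroCrcEntry(0xDEADBEEF, 10, 20).data_descriptor()
```

Because the encoder takes keyword-only arguments, a `get_*_params()` override that drops a required key or adds an unknown one fails with a `TypeError` instead of silently producing a malformed record.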

A previous change in the zipfile refactor changeset defaulted the extra data to be encoded in the local file header to be empty bytes. This was because different content may appear in the local file extra data compared to the central directory extra data (different zip64 fields for instance). If opening a file from a ZipInfo instance, the extra data is initialised with data read from the central directory. On reflection, the zip64 difference is the only difference between the two encodings I know of and we can account for that by stripping and rewriting the zip64 content. Prior to this changeset the zip64 section was not stripped in the local file header, which may have led to multiple zip64 sections appearing in files written after being opened with a ZipInfo instance which had zip64 data in its extra data.
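The strip-and-rewrite step can be sketched as follows. `strip_extra` here is a standalone helper written for this example (it is not the module's `_strip_extra`), and the `0xCAFE` record is an invented placeholder for "some other extra field".

```python
import struct

ZIP64_EXTRA_ID = 0x0001


def strip_extra(extra, ids_to_drop):
    """Return 'extra' with every record whose header ID is in ids_to_drop removed."""
    kept = []
    idx = 0
    while idx + 4 <= len(extra):
        header_id, length = struct.unpack_from("<HH", extra, idx)
        end = idx + 4 + length
        if header_id not in ids_to_drop:
            kept.append(extra[idx:end])
        idx = end
    # Preserve any trailing bytes too short to form a record header.
    kept.append(extra[idx:])
    return b"".join(kept)


# Extra data as read from the central directory: a stale zip64 record
# followed by an unrelated record (ID and payload invented for the example).
stale_zip64 = struct.pack("<HHQQ", ZIP64_EXTRA_ID, 16, 1, 2)
other = struct.pack("<HH", 0xCAFE, 3) + b"abc"

# When writing the local header: drop the old zip64 record, then prepend
# a freshly computed one so that exactly one zip64 record is present.
fresh_zip64 = struct.pack("<HHQQ", ZIP64_EXTRA_ID, 16, 7, 9)
local_extra = fresh_zip64 + strip_extra(stale_zip64 + other, {ZIP64_EXTRA_ID})
```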

The signature of `open()` remains unchanged but _open_to_write() and _open_to_read() can take kwargs now. This will enable subclasses to be able to pass additional arguments to `open()`, to pass through to `_open_to_write()` and `_open_to_read()` without having to duplicate the contents of `open()`.

While we still raise an error if a password is supplied when trying to write, this will help people subclass ZipFile and add encryption functionality.

Small unification of how compress_size is counted when compression method is ZIP_STORED.

This clean up fixes some shortcomings identified when implementing the AES code used to show the utility of this refactor.

danifus · 2019-07-27T03:23:34Z

Here is a gist which shows the refactor in action. It implements winzip's AES encryption and decryption based on these changes.

https://gist.github.com/danifus/73d258df243bbb386c1dd64c0888cddf

If this pull request is too large, I'm happy to split it up with some guidance as to what would be acceptable.

Also, the CI tests are having an issue with 'env changed' but I'm unable to reproduce it locally. Any tips for chasing down those types of errors?

Thanks, Dan

gnprice · 2019-07-30T09:38:14Z

Hi @danifus, and thanks! This sounds like a useful direction.

You're right that this is a large diff. In general, that makes it challenging for someone to review it. I'd recommend you try to split it into a sequence of smaller changes, each of which on its own is a good change and can be understood by the reader. That can make for much less total work for someone to understand and review, even when the sequence adds up to the exact same total change in the end.

As a logistical matter, I'd suggest you then send the first change in the sequence as its own PR; once that's merged, send the next one. You might keep posting the whole sequence as a branch, and linking to it for context, but that'd let a reviewer focus the detailed code-reading on just the one change at a time. (Someone else might have another suggestion, though.)

For the substance of where to split it:

> Add hooks for extending the way zipfile works to enable a subclass to
> add AES encryption without having to duplicate most of the zipfile
> module. This included adding hooks to [...]

This sounds like a good separate change. It might come after some other changes (maybe all of them?), if you need some of your other refactors before this change makes sense.

> This patch contains some bug fixes (seeking an encrypted file) that were made
> possible by the refactor.

This sounds great, and I'd definitely recommend making it its own separate change -- that will help in crisply explaining what behavior you're changing. I'd make it the first PR in the sequence if you can (without making it a big one). If not, then I'd make it the second or third, and start with whichever refactors you need specifically for this.

# Zip Appnote: 4.4.4 general purpose bit flag: (2 bytes)
_MASK_ENCRYPTED = 1 << 0
_MASK_COMPRESS_OPTION_1 = 1 << 1
# ...

This looks like a very nice improvement (vs. `if self.flag_bits & 0x08:` etc.) -- and one that will be easy for someone to review, if you can separate out a PR that's just this change. (I.e. adding these constants and their comments, and replacing the `0x08` and so on with the named constants.) If it weren't for the bugfix, this is probably the one I'd start with.

There are probably more splits it'd be helpful to make, which you can probably think of as you look back through your code with these in mind.

You also don't have to split everything into polished self-contained changes at once -- it's enough to identify one change you can send as a first PR in the series. That will help people get started reviewing, and then you can split more changes out as you go.

danifus · 2019-08-03T02:01:02Z

Thanks for the advice! I'll start splitting this patch into bite size pieces.

danifus · 2025-07-29T06:26:00Z

Closing this big PR in favour of implementing something similar but in smaller PRs

danifus added 27 commits July 26, 2019 19:02

Add descriptive global variables for general purpose bit flags

a0db1c9

Replace masking against integer literals with the new global variables.

Add global variable for zip64 extra data header id

6710baf

Add flag properties to ZipInfo

3777389

Easier than writing out `flags | mask` each time.

Fix bug when seeking on encrypted zip files

ca41137

Refactor _ZipDecrypter with a BaseZipDecrypter class

00c87ee

** This undoes the previous __init__ method change a few commits ago **

Move compressor and decompressor selection code into classes

b8364a6

The code to select compressors and decompressors has been moved to subclasses to allow subclasses to extend this process. Also adds a method around _check_compression in ZipFile for a similar purpose.

Add zipinfo_cls, zipextfile_cls and zipwritefile_cls to ZipFile

6b256c0

This allows these classes which are used inside ZipFile to be overridden in ZipFile subclasses without having to duplicate and alter any method which contains references to them.

Fix typo datadescripter -> datadescriptor

af8864b

Add dosdate and dostime properties to ZipInfo

42c4be6

Move encoding datadescriptor to ZipInfo

801d966

Move central directory encoding to ZipInfo

7d28d8f

Move struct packing of central directory record to a ZipInfo method

c784d7f

Refactor _decodeExtra to allow subclasses to support new extra fields

f84e481

** Changes the behaviour of zip64 extra data handling: it now works when a diskno field is present alongside only 1 or 2 other fields **

Change the way zipfile _decodeExtra loops through the extra bytes

1a07518

- We now move an index over the extra fields rather than rewriting each time an extra block was read.
- Methods that handle the extra data now just take the length and payload bytes.

Decouple updating and checking crc when reading a zipfile

6de1a9a

- This creates a hook for subclasses to add additional integrity checks after the file has been read.

Move writing zipfile local header to _ZipWriteFile

6b90dfd

Add some comments to zipfile's LZMACompressor

bfa8a7e

Add comments to ZipFile._write_end_record describing structs

a211abe

Change ZipFile._open_to_write() to accept pwd argument.

5a88b2d

While we still raise an error if a password is supplied when trying to write, this will help people subclass ZipFile and add encryption functionality.

ZipFile remove special case path for ZIP_STORED

fa374ee

Small unification of how compress_size is counted when compression method is ZIP_STORED.

the-knights-who-say-ni added the CLA signed label Jul 26, 2019

bedevere-bot added the awaiting review label Jul 26, 2019

📜🤖 Added by blurb_it.

5bb4c17

bpo-37538: Small clean up of zipfile refactor

366f79f

This clean up fixes some shortcomings identified when implementing the AES code used to show the utility of this refactor.

csabella requested a review from serhiy-storchaka February 6, 2020 02:32

danifus mannequin mentioned this pull request Apr 10, 2022

Refactor zipfile to ease subclassing and enhancement #81719

Open

ezio-melotti removed the CLA signed label Jul 13, 2022

adiroiban mentioned this pull request Jul 18, 2025

Make it easier to extend zipfile code #136741

Open

danifus closed this Jul 29, 2025
