
ENH: Enable custom compression levels in np.savez_compressed #29294

Open
wants to merge 13 commits into
base: main
from

Conversation

a-sajjad72

@a-sajjad72 a-sajjad72 commented Jun 30, 2025

This pull request adds flexible compression support to np.savez_compressed while preserving backward compatibility for existing code.

  1. Implementation

    • Introduce a single zipfile_kwargs parameter whose contents are forwarded to zipfile.ZipFile.
      • Anything supported by the std-lib (compression, compresslevel, etc.) can now be passed through.
      • Defaults to compression=ZIP_DEFLATED when savez_compressed is used and the caller does not specify one.
    • Human-readable aliases ("stored", "deflated", "bzip2", "lzma") are mapped case-insensitively to the appropriate zipfile constants.
    • Added validation for compresslevel ranges per algorithm and for invalid types.
    • The old reserved names compression= / compression_opts= are removed, so existing code that stores an array called "compression" will continue to work.
  2. Tests

    • Re-worked TestSavezCompressed suite:
      • Uses zipfile_kwargs everywhere.
      • Covers basic use, algorithm/level permutations, error paths, mixed dtypes, file-handle / pathlib targets, and large-array “slow” tests.
      • Adds new edge-case checks for bad compression / compresslevel types and out-of-range levels.
    • All tests pass locally on CPython 3.11+ (NumPy’s current support floor).
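
The alias mapping and level validation described in the Implementation list could look roughly like the sketch below. It is not the PR's code: the names `_ALIASES`, `_LEVEL_RANGES`, and `normalize_zipfile_kwargs` are hypothetical, and the level ranges are the ones documented for `zipfile.ZipFile` (0 to 9 for deflate, 1 to 9 for bzip2).

```python
import zipfile

# Hypothetical alias table; the PR's actual names and internals may differ.
_ALIASES = {
    "stored": zipfile.ZIP_STORED,
    "deflated": zipfile.ZIP_DEFLATED,
    "bzip2": zipfile.ZIP_BZIP2,
    "lzma": zipfile.ZIP_LZMA,
}

# Level ranges documented for zipfile.ZipFile; other methods ignore the level.
_LEVEL_RANGES = {
    zipfile.ZIP_DEFLATED: range(0, 10),
    zipfile.ZIP_BZIP2: range(1, 10),
}


def normalize_zipfile_kwargs(zipfile_kwargs=None):
    """Return a copy of zipfile_kwargs with aliases resolved and levels checked."""
    kwargs = dict(zipfile_kwargs or {})
    compression = kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
    if isinstance(compression, str):
        try:
            compression = _ALIASES[compression.lower()]  # case-insensitive lookup
        except KeyError:
            raise ValueError(f"unknown compression alias: {compression!r}") from None
    elif isinstance(compression, bool) or not isinstance(compression, int):
        raise TypeError("compression must be a str alias or a zipfile constant")
    kwargs["compression"] = compression

    level = kwargs.get("compresslevel")
    if level is not None:
        if isinstance(level, bool) or not isinstance(level, int):
            raise TypeError("compresslevel must be an int or None")
        allowed = _LEVEL_RANGES.get(compression)
        if allowed is not None and level not in allowed:
            raise ValueError(
                f"compresslevel {level} is outside "
                f"[{allowed.start}, {allowed.stop - 1}] for this method"
            )
    return kwargs
```

Returning a fresh dict keeps the caller's `zipfile_kwargs` unmodified, which seems the safer choice if the same dict is reused across calls.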

API changes

np.savez_compressed(
    file, *arrays,
    allow_pickle=True,
    zipfile_kwargs=None,   # NEW – dict forwarded to zipfile.ZipFile
    **named_arrays,
)
  • If zipfile_kwargs is None, it falls back to {"compression": ZIP_DEFLATED}.
  • Example:
np.savez_compressed(
    "data.npz",
    a=my_array,
    zipfile_kwargs={"compression": "lzma", "compresslevel": 9},
)
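
For readers who want to try the idea against a released NumPy, a similar effect can be approximated by writing the archive members directly. This is a minimal sketch, not the PR's implementation: `save_npz_with_zipfile_kwargs` is a hypothetical helper, and it relies only on `zipfile.ZipFile` and `np.lib.format.write_array`.

```python
import io
import zipfile

import numpy as np


def save_npz_with_zipfile_kwargs(file, arrays, **zipfile_kwargs):
    """Write arrays (a dict of name -> ndarray) to an .npz-style archive.

    Hypothetical helper, not part of NumPy: it only illustrates forwarding
    keyword arguments to zipfile.ZipFile.
    """
    zipfile_kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
    with zipfile.ZipFile(file, mode="w", **zipfile_kwargs) as zf:
        for name, arr in arrays.items():
            # Each member is an ordinary .npy stream named "<key>.npy".
            with zf.open(name + ".npy", "w", force_zip64=True) as member:
                np.lib.format.write_array(
                    member, np.asanyarray(arr), allow_pickle=False
                )


buf = io.BytesIO()
data = np.arange(10_000, dtype=np.int64)
save_npz_with_zipfile_kwargs(buf, {"a": data}, compresslevel=9)

# The result can be read back with the regular loader.
buf.seek(0)
with np.load(buf) as npz:
    restored = npz["a"]
```

Note that, per the standard-library documentation, `compresslevel` has no effect with `ZIP_STORED` or `ZIP_LZMA`; it is honoured for `ZIP_DEFLATED` (0 to 9) and `ZIP_BZIP2` (1 to 9).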

Backward compatibility

  • No previously valid call is broken:
    • Calls without zipfile_kwargs continue to work unchanged.
    • Array names such as "compression" or "compression_opts" are no longer shadowed by function parameters.
  • Internal default remains ZIP_DEFLATED level 6 (as in the original implementation).
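
The name-shadowing point can be illustrated with a stand-in function that only reports how its arguments are bound. `savez_sketch` is hypothetical and saves nothing; it assumes the signature shown under "API changes".

```python
def savez_sketch(file, *args, allow_pickle=True, zipfile_kwargs=None, **kwds):
    """Hypothetical stand-in that only reports how arguments are bound."""
    return {"array_names": sorted(kwds), "zipfile_kwargs": zipfile_kwargs}


bound = savez_sketch(
    "data.npz",
    compression=[1, 2, 3],        # bound as an array named "compression"
    compression_opts=[4, 5, 6],   # likewise an ordinary array name
    zipfile_kwargs={"compresslevel": 1},
)
```

Under this signature the only newly reserved keyword is `zipfile_kwargs` itself, so an array could no longer be saved under that name via a keyword argument.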

Checklist

  • Core implementation in _npyio_impl.py & zipfile_factory.
  • Comprehensive test coverage updated in TestSavezCompressed.
  • Full test suite passes (pytest -q) on Linux/macOS/Windows, CPython 3.11+.
  • Documentation & docstrings updated.

@ngoldbaum
Member

I shared this feedback over Slack but realized it makes more sense here.

Not sure why there’s code in this PR to support old Python versions like Python 3.3; we support Python 3.11 or newer.

It’s not clear to me why this feature is needed. NumPy is relatively conservative about adding new API surface, why add this?

You might want to ping the mailing list to get people to take a look. One of the contribution guidelines is to ping the mailing list before opening a PR adding a new feature.

@seberg
Member

seberg commented Jul 11, 2025

So one reason we were a bit hesitant in the past is that it'll break an array named compression=... in currently working code.

My gut feeling is that a zipfile_kwargs=dict may make sense (a weirder name, unless we add new API, and we would be clear that we forward it directly). That would still mean we promise to keep using ZipFile, but I suspect that is OK.
(Yes, that is a bit awkward API-wise and we don't have such API elsewhere in NumPy, so not sure.)

@a-sajjad72
Author

Thank you for looking into this and sharing your feedback.

Not sure why there’s code in this PR to support Python 3.3; we support Python 3.11 or newer.

You’re right. That Python version check was part of my initial implementation. I’ve since realized NumPy no longer supports versions earlier than Python 3.11, so I’ve already removed those blocks from the current PR version.

It’s not clear to me why this feature is needed.

The idea behind adding compression and compression_opts parameters to np.savez/np.savez_compressed is to give users more control when exporting large arrays, especially those exceeding 1 GiB. This helps make storage more efficient and provides flexibility beyond just the default ZIP_DEFLATED method.
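
As a rough illustration of the storage trade-off mentioned here, the sketch below compares archive sizes for one highly repetitive payload using only the standard library. `archive_size` is a hypothetical helper, and real arrays may compress very differently from this synthetic data.

```python
import io
import zipfile


def archive_size(payload, **zipfile_kwargs):
    """Return the size in bytes of a one-member zip archive holding payload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", **zipfile_kwargs) as zf:
        zf.writestr("payload.bin", payload)
    return len(buf.getvalue())


payload = bytes(range(256)) * 4096  # 1 MiB of repetitive bytes

sizes = {
    "stored": archive_size(payload, compression=zipfile.ZIP_STORED),
    "deflated-1": archive_size(
        payload, compression=zipfile.ZIP_DEFLATED, compresslevel=1
    ),
    "deflated-9": archive_size(
        payload, compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ),
}

for label, size in sizes.items():
    print(f"{label:>10}: {size} bytes")
```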

NumPy is relatively conservative about adding new API surface, why add this?

I fully understand that. My intention wasn’t to expand the public API surface unnecessarily. Rather, this is a small, backward-compatible enhancement of the existing np.savez_compressed function that exposes the underlying zipfile compression settings as user-facing options.
If it would help clarify things, I can write to the mailing list and ask if there’s interest in officially supporting this enhancement.

…n options

- Refactored `savez_compressed` to accept a `zipfile_kwargs` dictionary for compression settings, replacing the previous `compression` and `compression_opts` parameters.
- Updated related tests to utilize the new `zipfile_kwargs` structure for specifying compression methods and levels.
- Improved validation for compression levels and methods within the new structure.
@a-sajjad72
Author

So one reason we were a bit hesitant in the past is that it'll break an array named compression=... in currently working code.

My gut feeling is that a zipfile_kwargs=dict may make sense (a weirder name, unless we add new API, and we would be clear that we forward it directly). That would still mean we promise to keep using ZipFile, but I suspect that is OK. (Yes, that is a bit awkward API-wise and we don't have such API elsewhere in NumPy, so not sure.)

@seberg Thanks for pointing that out!
I hadn’t considered that compression= can already be an array name.
I have now switched to a single zipfile_kwargs dict that we pass straight to ZipFile; that should avoid the conflict while still exposing the new functionality. I have pushed the changes; please take a look when you can.

- Improved default behavior for compression settings, ensuring DEFLATED is used when compressing unless specified otherwise.
- Added support for translating string-based compression methods to their corresponding integer constants.
- Enhanced validation for `compresslevel`, ensuring it is an integer or None, and updated error messages for clarity.
- Refactored internal logic to streamline compression method handling and validation.
- Simplified exception handling for unavailable compression methods in tests.
- Removed legacy tests for Python versions <3.3, as NumPy now targets >=3.11.
- Added new tests for invalid compression types and levels, ensuring robust validation.
- Introduced case-insensitive handling for compression aliases in tests.
Projects
Status: Awaiting a code review
3 participants