ENH: Add a public API and versioned schema for generating and loading hashable buffers #29229

@ngoldbaum

Proposed new feature or change:

cf. #29226

The hash functions in the standard-library hashlib module use the buffer protocol to read the raw bytes of the ndarray buffer.

This is suboptimal for dtypes that hold references (such as StringDType; see #29226 for how this goes wrong), but the problem is more general. Using hashlib like this breaks in several ways:

  • Where two values that compare equal can have different byte representations and thus different hashes:
>>> hashlib.sha256(np.array(+0.0)).hexdigest()
'af5570f5a1810b7af78caf4bc70a660f0df51e42baf91d4de5b2328de0e83dfc'
>>> hashlib.sha256(np.array(-0.0)).hexdigest()
'e6ad6c9a3a3b7658c35bacf6553fcb8ffe34387534a648fe18f875b8f7a86ddb'
>>> np.array(-0.0) == np.array(0.0)
np.True_
  • Where two arrays that differ in shape or dtype have identical byte buffers and thus identical hashes:
>>> hashlib.sha256(np.array(['ab', 'cd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> hashlib.sha256(np.array(['abcd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.uint8)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.int64)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'

Proposal

IMO we should add a new array hashing API to NumPy. The hash should be based on the bytes of all the values in the array as well as the shape of the array and DType metadata.

We can also compute hashes for all user DTypes that don't include references, and add an API that user DTypes that do include references (and StringDType) can use to produce a hash from array items.

To avoid adding a new ndarray member function, the Python API could be something like:

>>> np.hash(arr, algorithm='sha256')

I don't think it needs any other keyword arguments, since it's a statistic based on the whole array and all its metadata.
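
To make the idea concrete, here is a minimal sketch of a hash that covers the dtype and shape as well as the data bytes. The name `array_digest` and the way the metadata is serialized (`dtype.str`, then the shape, separated by `|`) are assumptions made for illustration; this is not an existing NumPy API and not a proposed schema.

```python
import hashlib

import numpy as np


def array_digest(arr, algorithm="sha256"):
    """Hypothetical helper (not a NumPy API): hash dtype, shape, and data bytes."""
    arr = np.asarray(arr)
    if arr.dtype.hasobject:
        # Dtypes that hold references need a per-item hashing hook,
        # which this sketch does not attempt to provide.
        raise TypeError("dtypes that hold references are not supported here")

    h = hashlib.new(algorithm)
    # dtype.str encodes byte order, kind, and item size, e.g. '<i8' or '<U2'.
    # Assumption: this is enough metadata for simple, non-structured dtypes.
    h.update(arr.dtype.str.encode("ascii"))
    h.update(b"|")
    # Include the shape so that reshaped views of the same buffer differ.
    h.update(repr(arr.shape).encode("ascii"))
    h.update(b"|")
    # tobytes() always yields the elements in C order, so strided views
    # hash the same as a contiguous copy with the same values.
    h.update(arr.tobytes())
    return h.hexdigest()


byte_data = b"\x01\x02\x03\x04\x05\x06\x07\x08"
as_uint8 = array_digest(np.frombuffer(byte_data, dtype=np.uint8))
as_int64 = array_digest(np.frombuffer(byte_data, dtype=np.int64))
print(as_uint8 != as_int64)
```

With the metadata mixed in, the two pairs of colliding arrays shown earlier produce different digests. The sketch does not address values that compare equal but have different bytes (such as `+0.0` and `-0.0`), and it does not describe structured dtypes fully, since `dtype.str` reports those only as an opaque void type.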
