ENH: np.unique: support hash based unique for string dtype #28767
Merged
88 commits
f620f3b Support NPY_STRING, NPY_UNICODE (math-hiyoko)
20ccefe unique for NPY_STRING and NPY_UNICODE (math-hiyoko)
38626b9 fix construct array (math-hiyoko)
56bd858 remove unneccessary include (math-hiyoko)
f79736a refactor (math-hiyoko)
c4e5438 refactoring (math-hiyoko)
7c51049 comment (math-hiyoko)
bd70552 feature: unique for NPY_VSTRING (math-hiyoko)
cc8ece6 refactoring (math-hiyoko)
f7b20a0 remove unneccessary include (math-hiyoko)
d0170ed add test (math-hiyoko)
dbb140f add error message (math-hiyoko)
49ed502 linter (math-hiyoko)
0238cee linter (math-hiyoko)
6905978 reserve bucket (math-hiyoko)
2fc1378 remove emoji from testcase (math-hiyoko)
1ad6d6c fix testcase (math-hiyoko)
b478e15 remove error (math-hiyoko)
95bc405 fix testcase (math-hiyoko)
3f1811b fix testcase name (math-hiyoko)
99e3662 use basic_string (math-hiyoko)
b99542a fix testcase (math-hiyoko)
2589dd7 add ValueError (math-hiyoko)
3f40cdc fix testcase (math-hiyoko)
68d5a7b fix memory error (math-hiyoko)
d38c3e3 remove multibyte char (math-hiyoko)
8cf2c63 refactoring (math-hiyoko)
0165d6a add multibyte char (math-hiyoko)
243be6b refactoring (math-hiyoko)
a6e5d3c fix memory error (math-hiyoko)
78b9dc6 fix GIL (math-hiyoko)
0464617 fix strlen (math-hiyoko)
908f495 remove PyArray_GETPTR1 (math-hiyoko)
30d1d1a refactoring (math-hiyoko)
36c167c refactoring (math-hiyoko)
79d31e4 use optional (math-hiyoko)
00143f9 refactoring (math-hiyoko)
1cc09f3 refactoring (math-hiyoko)
b29981d refactoring (math-hiyoko)
91c5d42 refactoring (math-hiyoko)
e9c3aac fix comment (math-hiyoko)
8191f5f linter (math-hiyoko)
4faf36a add doc (math-hiyoko)
c6aaf39 DOC: fix (math-hiyoko)
1053bcb DOC: fix format (math-hiyoko)
1afefbe MNT: refactoring (math-hiyoko)
b5610b1 MNT: refactoring (math-hiyoko)
c28a7ce ENH: Store pointers to strings in the set instead of the strings them… (math-hiyoko)
b17011e FIX: length in memcmp (math-hiyoko)
c2d5868 ENH: refactoring (math-hiyoko)
7d4afe0 DOC: 49sec -> 34sec (math-hiyoko)
ad843b0 Update numpy/lib/_arraysetops_impl.py (math-hiyoko)
45ec2b3 DOC: Mention that hash-based np.unique returns unsorted strings (math-hiyoko)
52a982d Merge branch 'feature/#28364' of github.com:math-hiyoko/numpy into fe… (math-hiyoko)
fff254e ENH: support medium and long vstrings (math-hiyoko)
370bd8f FIX: comment (math-hiyoko)
49dfcb4 ENH: use RAII wrapper (math-hiyoko)
c5745bf FIX: error handling of string packing (math-hiyoko)
3ba9788 FIX: error handling of string packing (math-hiyoko)
376ad09 FIX: change default bucket size (math-hiyoko)
aa0db48 FIX: include (math-hiyoko)
7a2892f FIX: cast (math-hiyoko)
896bcba ENH: support equal_nan=False (math-hiyoko)
f1c1947 FIX: function equal (math-hiyoko)
f35123a FIX: check the case if pack_status douesn't return NULL (math-hiyoko)
e6ea015 FIX: check the case if pack_status douesn't return NULL (math-hiyoko)
ddff98f FIX: stderr (math-hiyoko)
2758e27 ENH: METH_VARARGS -> METH_FASTCALL (math-hiyoko)
a6dc86a FIX: log (math-hiyoko)
9a936eb FIX: release allocator (math-hiyoko)
1e967ee FIX: comment (math-hiyoko)
52c2326 FIX: delete log (math-hiyoko)
6f18a43 ENH: implemented FNV-1a as hash function (math-hiyoko)
2a1bd41 bool -> npy_bool (math-hiyoko)
8b632f2 FIX: cast (math-hiyoko)
a7bfc08 34sec -> 35.1sec (math-hiyoko)
dd0d8f5 Merge branch 'main' into feature/#28364 (math-hiyoko)
9fc9ce3 fix: lint (math-hiyoko)
998ca00 fix: cast using const void * (math-hiyoko)
3dd2667 fix: fix fnv1a hash (math-hiyoko)
94926cb fix: lint (math-hiyoko)
a711635 35.1sec -> 33.5sec (math-hiyoko)
ccccc44 Merge branch 'main' into feature/#28364 (math-hiyoko)
2b6b9b5 enh: define macro HASH_TABLE_INITIAL_BUCKETS (math-hiyoko)
e92a387 enh: error handling of NpyString_load (math-hiyoko)
397a594 enh: delete comments on GIL (math-hiyoko)
425a166 fix: PyErr_SetString when NpyString_load failed (math-hiyoko)
12eb788 fix: PyErr_SetString -> npy_gil_error (math-hiyoko)
@@ -0,0 +1,10 @@
``unique_values`` for string dtypes may return unsorted data
------------------------------------------------------------
``np.unique`` now supports hash-based duplicate removal for string dtypes.
This enhancement extends the hash-table algorithm to byte strings ('S'),
Unicode strings ('U'), and the experimental string dtype ('T', StringDType).
As a result, calling ``np.unique()`` on an array of strings will use
the faster hash-based method to obtain unique values.
Note that this hash-based method does not guarantee that the returned unique values will be sorted.
This also works for StringDType arrays containing None (missing values)
when using ``equal_nan=True`` (treating missing values as equal).
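To illustrate why a hash-based pass yields unique values without sorting them, here is a minimal, self-contained sketch for fixed-width byte strings. It is not the code from this pull request: the commit titles suggest the PR stores pointers to strings in a set, but that file is not shown on this page. Every name below (`sketch_hash`, `sketch_unique_fixed_width`) is hypothetical, and the open-addressing table with linear probing is an assumption chosen for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 64-bit FNV-1a, used here only as an example hash function. */
uint64_t
sketch_hash(const unsigned char *p, size_t len)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < len; i++) {
        h ^= (uint64_t)p[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/*
 * Copy the distinct items of `data` (n items of `width` bytes each) into
 * `out`, in first-seen order. `out` must have room for n * width bytes.
 * Returns the number of distinct items, or (size_t)-1 if allocation fails.
 */
size_t
sketch_unique_fixed_width(const unsigned char *data, size_t n, size_t width,
                          unsigned char *out)
{
    /* A power-of-two capacity of at least 2 * n keeps probe chains short. */
    size_t cap = 16;
    while (cap < 2 * n) {
        cap *= 2;
    }

    /* Each slot holds a pointer into `data`; NULL marks an empty slot. */
    const unsigned char **slots = malloc(cap * sizeof(*slots));
    if (slots == NULL) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < cap; i++) {
        slots[i] = NULL;
    }

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *item = data + i * width;
        size_t pos = (size_t)sketch_hash(item, width) & (cap - 1);

        /* Linear probing: stop at an empty slot or at an equal item. */
        while (slots[pos] != NULL && memcmp(slots[pos], item, width) != 0) {
            pos = (pos + 1) & (cap - 1);
        }
        if (slots[pos] == NULL) {
            /* First time this value is seen: record it and emit it. */
            slots[pos] = item;
            memcpy(out + count * width, item, width);
            count++;
        }
    }

    free((void *)slots);
    return count;
}
```

With the five two-byte items `bb aa bb cc aa`, this sketch writes `bb aa cc`: each value appears once, in the order it was first seen, and nothing is sorted. The order produced by the PR's own implementation may differ, which is why the note above only says that sorted output is not guaranteed.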
@@ -0,0 +1,10 @@
Performance improvements to ``np.unique`` for string dtypes
-----------------------------------------------------------
The hash-based algorithm for unique extraction provides
an order-of-magnitude speedup on large string arrays.
In an internal benchmark with about 1 billion string elements,
the hash-based ``np.unique`` completed in roughly 33.5 seconds,
compared to 498 seconds with the sort-based method,
about 15× faster for unsorted unique operations on strings.
This improvement greatly reduces the time to find unique values
in very large string datasets.
@@ -0,0 +1,85 @@
/*
    FNV-1a hash algorithm implementation
    Based on the implementation from:
    https://github.com/lcn2/fnv
*/

#define NPY_NO_DEPRECATED_API NPY_API_VERSION
#define _MULTIARRAYMODULE

#include <Python.h>
#include "numpy/npy_common.h"
#include "fnv.h"


#define FNV1A_32_INIT ((npy_uint32)0x811c9dc5)
#define FNV1A_64_INIT ((npy_uint64)0xcbf29ce484222325ULL)

/*
    Compute a 32-bit FNV-1a hash of buffer
    original implementation from:
    https://github.com/lcn2/fnv/blob/b7fcbee95538ee6a15744e756e7e7f1c02862cb0/hash_32a.c
*/
npy_uint32
npy_fnv1a_32(const void *buf, size_t len, npy_uint32 hval)
{
    const unsigned char *bp = (const unsigned char *)buf;  /* start of buffer */
    const unsigned char *be = bp + len;  /* beyond end of buffer */

    /*
        FNV-1a hash each octet in the buffer
    */
    while (bp < be) {

        /* xor the bottom with the current octet */
        hval ^= (npy_uint32)*bp++;

        /* multiply by the 32 bit FNV magic prime */
        /* hval *= 0x01000193; */
        hval += (hval<<1) + (hval<<4) + (hval<<7) + (hval<<8) + (hval<<24);
    }

    return hval;
}

/*
    Compute a 64-bit FNV-1a hash of the given data
    original implementation from:
    https://github.com/lcn2/fnv/blob/b7fcbee95538ee6a15744e756e7e7f1c02862cb0/hash_64a.c
*/
npy_uint64
npy_fnv1a_64(const void *buf, size_t len, npy_uint64 hval)
{
    const unsigned char *bp = (const unsigned char *)buf;  /* start of buffer */
    const unsigned char *be = bp + len;  /* beyond end of buffer */

    /*
        FNV-1a hash each octet in the buffer
    */
    while (bp < be) {

        /* xor the bottom with the current octet */
        hval ^= (npy_uint64)*bp++;

        /* multiply by the 64 bit FNV magic prime */
        /* hval *= 0x100000001b3ULL; */
        hval += (hval << 1) + (hval << 4) + (hval << 5) +
                (hval << 7) + (hval << 8) + (hval << 40);
    }

    return hval;
}

/*
 * Compute a size_t FNV-1a hash of the given data
 * This will use 32-bit or 64-bit hash depending on the size of size_t
 */
size_t
npy_fnv1a(const void *buf, size_t len)
{
#if NPY_SIZEOF_SIZE_T == 8
    return (size_t)npy_fnv1a_64(buf, len, FNV1A_64_INIT);
#else /* NPY_SIZEOF_SIZE_T == 4 */
    return (size_t)npy_fnv1a_32(buf, len, FNV1A_32_INIT);
#endif
}
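The shift-and-add lines in the file above stand in for the commented-out multiplications by the FNV primes: 0x01000193 = 2^24 + 2^8 + 2^7 + 2^4 + 2^1 + 1 and 0x100000001b3 = 2^40 + 2^8 + 2^7 + 2^5 + 2^4 + 2^1 + 1. The following standalone sketch checks that equivalence and compares the results with widely published FNV-1a test vectors. It assumes `uint32_t`/`uint64_t` in place of `npy_uint32`/`npy_uint64`, and every `sketch_` name is hypothetical rather than part of NumPy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a with a plain multiplication by the FNV prime. */
uint32_t
sketch_fnv1a_32_mul(const void *buf, size_t len)
{
    const unsigned char *bp = (const unsigned char *)buf;
    uint32_t hval = UINT32_C(0x811c9dc5);
    for (size_t i = 0; i < len; i++) {
        hval ^= (uint32_t)bp[i];
        hval *= UINT32_C(0x01000193);
    }
    return hval;
}

/* 32-bit FNV-1a with the shift-and-add form used in the diff. */
uint32_t
sketch_fnv1a_32_shift(const void *buf, size_t len)
{
    const unsigned char *bp = (const unsigned char *)buf;
    uint32_t hval = UINT32_C(0x811c9dc5);
    for (size_t i = 0; i < len; i++) {
        hval ^= (uint32_t)bp[i];
        hval += (hval << 1) + (hval << 4) + (hval << 7) +
                (hval << 8) + (hval << 24);
    }
    return hval;
}

/* 64-bit FNV-1a with a plain multiplication by the FNV prime. */
uint64_t
sketch_fnv1a_64_mul(const void *buf, size_t len)
{
    const unsigned char *bp = (const unsigned char *)buf;
    uint64_t hval = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < len; i++) {
        hval ^= (uint64_t)bp[i];
        hval *= UINT64_C(0x100000001b3);
    }
    return hval;
}

/* 64-bit FNV-1a with the shift-and-add form used in the diff. */
uint64_t
sketch_fnv1a_64_shift(const void *buf, size_t len)
{
    const unsigned char *bp = (const unsigned char *)buf;
    uint64_t hval = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < len; i++) {
        hval ^= (uint64_t)bp[i];
        hval += (hval << 1) + (hval << 4) + (hval << 5) +
                (hval << 7) + (hval << 8) + (hval << 40);
    }
    return hval;
}

/* Assert that both forms agree and match the published test vectors. */
void
sketch_fnv1a_selfcheck(void)
{
    static const char *inputs[] = {"", "a", "foobar", "numpy"};
    for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++) {
        size_t len = strlen(inputs[i]);
        assert(sketch_fnv1a_32_mul(inputs[i], len) ==
               sketch_fnv1a_32_shift(inputs[i], len));
        assert(sketch_fnv1a_64_mul(inputs[i], len) ==
               sketch_fnv1a_64_shift(inputs[i], len));
    }
    /* The empty input returns the offset basis unchanged. */
    assert(sketch_fnv1a_32_mul("", 0) == UINT32_C(0x811c9dc5));
    assert(sketch_fnv1a_32_mul("a", 1) == UINT32_C(0xe40c292c));
    assert(sketch_fnv1a_32_mul("foobar", 6) == UINT32_C(0xbf9cf968));
    assert(sketch_fnv1a_64_mul("a", 1) == UINT64_C(0xaf63dc4c8601ec8c));
    assert(sketch_fnv1a_64_mul("foobar", 6) == UINT64_C(0x85944171f73967e8));
}
```

Calling `sketch_fnv1a_selfcheck()` from a small test program exercises all of the checks. Because the arithmetic is on unsigned types, both forms wrap modulo 2^32 or 2^64 in the same way.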
@@ -0,0 +1,26 @@
/*
    FNV-1a hash algorithm implementation
    Based on the implementation from:
    https://github.com/lcn2/fnv
*/

#ifndef NUMPY_CORE_INCLUDE_NUMPY_MULTIARRAY_FNV_H_
#define NUMPY_CORE_INCLUDE_NUMPY_MULTIARRAY_FNV_H_


/*
    Compute a size_t FNV-1a hash of the given data
    This will use 32-bit or 64-bit hash depending on the size of size_t

    Parameters:
    -----------
    buf - pointer to the data to be hashed
    len - length of the data in bytes

    Returns:
    -----------
    size_t hash value
*/
size_t npy_fnv1a(const void *buf, size_t len);

#endif // NUMPY_CORE_INCLUDE_NUMPY_MULTIARRAY_FNV_H_