ENH: np.unique: support hash based unique for string dtype #28767

Merged: 88 commits, Jun 20, 2025
Commits
f620f3b
Support NPY_STRING, NPY_UNICODE
math-hiyoko Apr 15, 2025
20ccefe
unique for NPY_STRING and NPY_UNICODE
math-hiyoko Apr 16, 2025
38626b9
fix construct array
math-hiyoko Apr 16, 2025
56bd858
remove unneccessary include
math-hiyoko Apr 16, 2025
f79736a
refactor
math-hiyoko Apr 16, 2025
c4e5438
refactoring
math-hiyoko Apr 17, 2025
7c51049
comment
math-hiyoko Apr 17, 2025
bd70552
feature: unique for NPY_VSTRING
math-hiyoko Apr 18, 2025
cc8ece6
refactoring
math-hiyoko Apr 18, 2025
f7b20a0
remove unneccessary include
math-hiyoko Apr 18, 2025
d0170ed
add test
math-hiyoko Apr 18, 2025
dbb140f
add error message
math-hiyoko Apr 18, 2025
49ed502
linter
math-hiyoko Apr 18, 2025
0238cee
linter
math-hiyoko Apr 18, 2025
6905978
reserve bucket
math-hiyoko Apr 18, 2025
2fc1378
remove emoji from testcase
math-hiyoko Apr 18, 2025
1ad6d6c
fix testcase
math-hiyoko Apr 18, 2025
b478e15
remove error
math-hiyoko Apr 18, 2025
95bc405
fix testcase
math-hiyoko Apr 18, 2025
3f1811b
fix testcase name
math-hiyoko Apr 18, 2025
99e3662
use basic_string
math-hiyoko Apr 18, 2025
b99542a
fix testcase
math-hiyoko Apr 18, 2025
2589dd7
add ValueError
math-hiyoko Apr 18, 2025
3f40cdc
fix testcase
math-hiyoko Apr 18, 2025
68d5a7b
fix memory error
math-hiyoko Apr 18, 2025
d38c3e3
remove multibyte char
math-hiyoko Apr 18, 2025
8cf2c63
refactoring
math-hiyoko Apr 18, 2025
0165d6a
add multibyte char
math-hiyoko Apr 18, 2025
243be6b
refactoring
math-hiyoko Apr 18, 2025
a6e5d3c
fix memory error
math-hiyoko Apr 18, 2025
78b9dc6
fix GIL
math-hiyoko Apr 18, 2025
0464617
fix strlen
math-hiyoko Apr 18, 2025
908f495
remove PyArray_GETPTR1
math-hiyoko Apr 19, 2025
30d1d1a
refactoring
math-hiyoko Apr 19, 2025
36c167c
refactoring
math-hiyoko Apr 19, 2025
79d31e4
use optional
math-hiyoko Apr 19, 2025
00143f9
refactoring
math-hiyoko Apr 19, 2025
1cc09f3
refactoring
math-hiyoko Apr 19, 2025
b29981d
refactoring
math-hiyoko Apr 19, 2025
91c5d42
refactoring
math-hiyoko Apr 19, 2025
e9c3aac
fix comment
math-hiyoko Apr 19, 2025
8191f5f
linter
math-hiyoko Apr 19, 2025
4faf36a
add doc
math-hiyoko Apr 19, 2025
c6aaf39
DOC: fix
math-hiyoko Apr 19, 2025
1053bcb
DOC: fix format
math-hiyoko Apr 20, 2025
1afefbe
MNT: refactoring
math-hiyoko Apr 20, 2025
b5610b1
MNT: refactoring
math-hiyoko Apr 20, 2025
c28a7ce
ENH: Store pointers to strings in the set instead of the strings them…
math-hiyoko Apr 24, 2025
b17011e
FIX: length in memcmp
math-hiyoko Apr 24, 2025
c2d5868
ENH: refactoring
math-hiyoko Apr 24, 2025
7d4afe0
DOC: 49sec -> 34sec
math-hiyoko Apr 24, 2025
ad843b0
Update numpy/lib/_arraysetops_impl.py
math-hiyoko Apr 25, 2025
45ec2b3
DOC: Mention that hash-based np.unique returns unsorted strings
math-hiyoko Apr 25, 2025
52a982d
Merge branch 'feature/#28364' of github.com:math-hiyoko/numpy into fe…
math-hiyoko Apr 25, 2025
fff254e
ENH: support medium and long vstrings
math-hiyoko Apr 26, 2025
370bd8f
FIX: comment
math-hiyoko Apr 29, 2025
49dfcb4
ENH: use RAII wrapper
math-hiyoko Apr 29, 2025
c5745bf
FIX: error handling of string packing
math-hiyoko Apr 29, 2025
3ba9788
FIX: error handling of string packing
math-hiyoko Apr 29, 2025
376ad09
FIX: change default bucket size
math-hiyoko Apr 29, 2025
aa0db48
FIX: include
math-hiyoko Apr 30, 2025
7a2892f
FIX: cast
math-hiyoko Apr 30, 2025
896bcba
ENH: support equal_nan=False
math-hiyoko May 1, 2025
f1c1947
FIX: function equal
math-hiyoko May 1, 2025
f35123a
FIX: check the case if pack_status douesn't return NULL
math-hiyoko May 1, 2025
e6ea015
FIX: check the case if pack_status douesn't return NULL
math-hiyoko May 1, 2025
ddff98f
FIX: stderr
math-hiyoko May 1, 2025
2758e27
ENH: METH_VARARGS -> METH_FASTCALL
math-hiyoko May 2, 2025
a6dc86a
FIX: log
math-hiyoko May 2, 2025
9a936eb
FIX: release allocator
math-hiyoko May 3, 2025
1e967ee
FIX: comment
math-hiyoko May 3, 2025
52c2326
FIX: delete log
math-hiyoko May 3, 2025
6f18a43
ENH: implemented FNV-1a as hash function
math-hiyoko May 3, 2025
2a1bd41
bool -> npy_bool
math-hiyoko May 3, 2025
8b632f2
FIX: cast
math-hiyoko May 3, 2025
a7bfc08
34sec -> 35.1sec
math-hiyoko May 4, 2025
dd0d8f5
Merge branch 'main' into feature/#28364
math-hiyoko May 21, 2025
9fc9ce3
fix: lint
math-hiyoko May 21, 2025
998ca00
fix: cast using const void *
math-hiyoko May 26, 2025
3dd2667
fix: fix fnv1a hash
math-hiyoko Jun 1, 2025
94926cb
fix: lint
math-hiyoko Jun 1, 2025
a711635
35.1sec -> 33.5sec
math-hiyoko Jun 1, 2025
ccccc44
Merge branch 'main' into feature/#28364
math-hiyoko Jun 16, 2025
2b6b9b5
enh: define macro HASH_TABLE_INITIAL_BUCKETS
math-hiyoko Jun 19, 2025
e92a387
enh: error handling of NpyString_load
math-hiyoko Jun 19, 2025
397a594
enh: delete comments on GIL
math-hiyoko Jun 19, 2025
425a166
fix: PyErr_SetString when NpyString_load failed
math-hiyoko Jun 19, 2025
12eb788
fix: PyErr_SetString -> npy_gil_error
math-hiyoko Jun 19, 2025
fix: cast using const void *
math-hiyoko committed May 26, 2025
commit 998ca00a17d4db14e34fd62d1f4ba44db7efa496
17 changes: 10 additions & 7 deletions numpy/_core/src/multiarray/unique.cpp
@@ -29,8 +29,7 @@ FinalAction<F> finally(F f) {
 }

 // function to caluculate the hash of a string
-template <typename T>
-size_t str_hash(const T *str, npy_intp num_chars) {
+size_t str_hash(const void *buf, size_t len) {
     // http://www.isthe.com/chongo/tech/comp/fnv/#FNV-1a
     #if NPY_SIZEOF_INTP == 4
         static const size_t FNV_OFFSET_BASIS = 2166136261U;
@@ -39,12 +38,16 @@ size_t str_hash(const T *str, npy_intp num_chars) {
         static const size_t FNV_OFFSET_BASIS = 14695981039346656037ULL;
         static const size_t FNV_PRIME = 1099511628211ULL;
     #endif
-    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(str);
+
+    unsigned char *bp = (unsigned char *)buf;  /* start of buffer */
+    unsigned char *be = bp + len;              /* beyond end of buffer */
+
     size_t hash = FNV_OFFSET_BASIS;
-    for (npy_intp i = 0; i < num_chars * (npy_intp)sizeof(T); ++i) {
-        hash ^= bytes[i];
+    while (bp < be) {
+        hash ^= *bp++;
         hash *= FNV_PRIME;
     }
+
     return hash;
 }

@@ -144,7 +147,7 @@ unique_string(PyArrayObject *self, npy_bool equal_nan)
     npy_intp itemsize = descr->elsize;
     npy_intp num_chars = itemsize / sizeof(T);
     auto hash = [num_chars](const T *value) -> size_t {
-        return str_hash(value, num_chars);
+        return str_hash(value, num_chars * sizeof(T));
     };
     auto equal = [itemsize](const T *lhs, const T *rhs) -> bool {
         return std::memcmp(lhs, rhs, itemsize) == 0;
@@ -232,7 +235,7 @@ unique_vstring(PyArrayObject *self, npy_bool equal_nan)
                 return std::hash<const npy_static_string *>{}(value);
             }
         }
-        return str_hash(value->buf, value->size);
+        return str_hash(value->buf, value->size * sizeof(char));
     };
     auto equal = [equal_nan](const npy_static_string *lhs, const npy_static_string *rhs) -> bool {
         if (lhs->buf == NULL && rhs->buf == NULL) {
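
For readers who want to try the idea outside NumPy, the following is a minimal standalone sketch of the pattern this commit moves toward: one FNV-1a routine that hashes a raw byte buffer, with each caller passing a length in bytes. It is not the code from `unique.cpp`. The names `fnv1a_hash` and `count_unique_fixed_width` are hypothetical, the sketch hard-codes the 64-bit FNV constants (the diff selects 32- or 64-bit constants via `NPY_SIZEOF_INTP`), and it assumes the input is a packed buffer of fixed-width records rather than a NumPy array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_set>

// FNV-1a over a raw byte buffer. Assumption: 64-bit variant only.
inline std::uint64_t fnv1a_hash(const void *buf, std::size_t len) {
    const std::uint64_t kOffsetBasis = 14695981039346656037ULL;
    const std::uint64_t kPrime = 1099511628211ULL;

    const unsigned char *bp = static_cast<const unsigned char *>(buf);
    const unsigned char *be = bp + len;  // one past the last byte

    std::uint64_t hash = kOffsetBasis;
    while (bp < be) {
        hash ^= *bp++;   // FNV-1a: XOR the byte in first ...
        hash *= kPrime;  // ... then multiply by the prime
    }
    return hash;
}

// Hypothetical helper: count distinct fixed-width records stored back to
// back in `data`. The set holds pointers into `data`, not copies, and
// compares records byte-wise with memcmp.
inline std::size_t count_unique_fixed_width(const char *data,
                                            std::size_t n_items,
                                            std::size_t itemsize) {
    assert(itemsize > 0);

    auto hash = [itemsize](const char *p) -> std::size_t {
        // The length is given in bytes, as in the updated call sites.
        return static_cast<std::size_t>(fnv1a_hash(p, itemsize));
    };
    auto equal = [itemsize](const char *lhs, const char *rhs) -> bool {
        return std::memcmp(lhs, rhs, itemsize) == 0;
    };

    std::unordered_set<const char *, decltype(hash), decltype(equal)>
        seen(n_items, hash, equal);
    for (std::size_t i = 0; i < n_items; ++i) {
        seen.insert(data + i * itemsize);
    }
    return seen.size();
}
```

Because the hash takes `const void *` plus a byte count, the same routine can serve 1-byte and 4-byte character types without a template parameter; the caller only has to multiply the character count by the element size, which is what the two call-site changes in the diff do.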