Skip to content

PERF: parallelize libjoin calls #51364

@jbrockmendel

Description

@jbrockmendel
import numpy as np
import pandas as pd
import pandas._libs.join as libjoin

left = np.random.randint(0, 10**8, size=10**8)
right = np.random.randint(0, 10**8, size=10**8)

left.sort()
right.sort()

# The current usage
In [28]: %timeit result = libjoin.inner_join_indexer(left, right)
2.89 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# An alternative that _may_ be parallelizable
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc])  # 734 ns ± 16.6 ns

chunk1 = libjoin.inner_join_indexer(left[:lloc], right[:rloc])
chunk2 = libjoin.inner_join_indexer(left[lloc:], right[rloc:])
result = tuple([np.r_[chunk1[i], chunk2[i]] for i in range(3)])
# totals 3.9 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The chunk1/chunk2 calls are each 1.47 s ± 40 ms and the concat is 1.05 s ± 179 ms. In a pyarrow or otherwise-chunked context the concat might not be needed.

For the non-object-dtype case, I'm pretty sure we could release the GIL in inner_join_indexer. Could we then do the chunk1/chunk2 calls in parallel? If so, could we split it further? cc @WillAyd?

Metadata

Metadata

Assignees

No one assigned

    Labels

    MultithreadingParallelism in pandasPerformanceMemory or execution speed performanceReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      pFad - Phonifier reborn

      Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

      Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


      Alternative Proxies:

      Alternative Proxy

      pFad Proxy

      pFad v3 Proxy

      pFad v4 Proxy