Description
🐛 Random hangs and failures when sending tensors that are split using torch.split through a JoinableQueue
Splitting tensors with torch.split and sending the chunks to worker processes through a JoinableQueue seems to cause random errors and hangs on 2.0.0.dev20230130+cu116, while it works perfectly fine on 1.9.1+cu102.
I tried to make the reproduction code as small as I could. The key ingredients are torch.split and JoinableQueue.
The following script hangs on my CUDA machine with PyTorch 2.0, while it completes successfully with PyTorch 1.9.
import os
import sys
import tempfile
from typing import Tuple

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank: int, world_size: int) -> None:
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    dist.init_process_group(backend, init_method='tcp://{}'.format('127.0.0.1:23456'), rank=rank, world_size=world_size)


def cleanup() -> None:
    dist.destroy_process_group()


def demo_basic(rank: int, queue: mp.JoinableQueue, world_size: int) -> None:
    # Worker: pull chunks off its queue and do a trivial read on each one.
    setup(rank, world_size)
    device = f'cuda:{rank}' if torch.cuda.is_available() else 'cpu'
    while True:
        batch = queue.get()
        batch = batch.to(device)
        try:
            negative_in_batch = batch.lt(0).any().item()
            if negative_in_batch:
                print("Found negative in batch", file=sys.stderr)
        finally:
            queue.task_done()


def split_batch(batch: torch.Tensor, world_size: int) -> Tuple[torch.Tensor, ...]:
    # If I clone each chunk instead, no error is observed.
    return torch.split(batch, batch.shape[0] // world_size)


def run_demo(world_size: int) -> None:
    print(torch.__version__, file=sys.stderr)
    num_batches = 10000
    batch_size = 64
    ctx = mp.get_context('spawn')
    queues = [ctx.JoinableQueue() for _ in range(world_size)]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    processes = [ctx.Process(target=demo_basic, args=(i, queues[i], world_size)) for i in range(world_size)]
    for p in processes:
        p.start()
    for i in range(num_batches):
        large_batch = torch.randint(100000, size=(batch_size,))
        batches = split_batch(large_batch, world_size)  # If I remove this line and send the large batch instead, no error is observed.
        print(f'queuing batch {i}', file=sys.stderr)
        for batch, queue in zip(batches, queues):
            queue.put(batch)
    for q in queues:
        q.join()
    for p in processes:
        p.terminate()


def main() -> None:
    run_demo(4)


if __name__ == '__main__':
    main()
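As the inline comments note, cloning the chunks (so that each one owns its own storage instead of being a view into the shared parent buffer) avoids the problem. A minimal sketch of that workaround, reusing the imports from the script above; the function name split_batch_cloned is mine, chosen for illustration:

def split_batch_cloned(batch: torch.Tensor, world_size: int) -> Tuple[torch.Tensor, ...]:
    # Clone each view returned by torch.split so every chunk gets its own storage
    # before it is pickled and pushed onto the JoinableQueue.
    return tuple(chunk.clone() for chunk in torch.split(batch, batch.shape[0] // world_size))

With this in place of split_batch, no error is observed.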
On CPU the behaviour is more random: I sometimes observe the following error after some runtime:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
metadata = storage._share_filename_cpu_()
RuntimeError: Trying to resize storage that is not resizable
while at other times the code runs successfully.
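For context, the chunks returned by torch.split are views that share the parent tensor's storage rather than owning their own buffers, which seems consistent with the storage-sharing code path (reduce_storage / _share_filename_cpu_) in the traceback above. A quick illustrative check, not part of the repro:

import torch

large = torch.randint(100000, size=(64,))
chunks = torch.split(large, 16)

# Each chunk is a view into large's buffer, not an independent copy.
print(chunks[0].data_ptr() == large.data_ptr())                           # True: same underlying buffer
print((chunks[1].data_ptr() - large.data_ptr()) // large.element_size())  # 16: second chunk starts 16 elements in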
I verified that the code runs fine on 1.9.1+cu102 on both CPU and GPU, but I don't know about other versions.
Versions
CUDA environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2700.202
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.03
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi
CPU environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.49-linuxkit-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping: 10
CPU MHz: 2591.608
BogoMIPS: 5183.21
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 arat
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi