Description
🐛 Random hangs and failures when sending tensors that are split using torch.split through a JoinableQueue
Splitting tensors with torch.split and sending the chunks to worker processes through a JoinableQueue seems to cause random errors and hangs on 2.0.0.dev20230130+cu116, while it works perfectly fine on 1.9.1+cu102.
I tried to make the reproduction code as small as I could. The key ingredients are torch.split and JoinableQueue.
The following script hangs on my CUDA machine with PyTorch 2.0, while it completes successfully with PyTorch 1.9.
import os
import sys
import tempfile
from typing import Tuple

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank: int, world_size: int) -> None:
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    dist.init_process_group(backend, init_method='tcp://{}'.format('127.0.0.1:23456'), rank=rank, world_size=world_size)


def cleanup() -> None:
    dist.destroy_process_group()


def demo_basic(rank: int, queue: mp.JoinableQueue, world_size: int) -> None:
    # Worker: pull chunks off its queue and do a trivial read on each one.
    setup(rank, world_size)
    device = f'cuda:{rank}' if torch.cuda.is_available() else 'cpu'
    while True:
        batch = queue.get()
        batch = batch.to(device)
        try:
            negative_in_batch = batch.lt(0).any().item()
            if negative_in_batch:
                print("Found negative in batch", file=sys.stderr)
        finally:
            queue.task_done()


def split_batch(batch: torch.Tensor, world_size: int) -> Tuple[torch.Tensor, ...]:
    # If I clone each chunk instead, no error is observed.
    return torch.split(batch, batch.shape[0] // world_size)


def run_demo(world_size: int) -> None:
    print(torch.__version__, file=sys.stderr)
    num_batches = 10000
    batch_size = 64
    ctx = mp.get_context('spawn')
    queues = [ctx.JoinableQueue() for _ in range(world_size)]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    processes = [ctx.Process(target=demo_basic, args=(i, queues[i], world_size)) for i in range(world_size)]
    for p in processes:
        p.start()
    for i in range(num_batches):
        large_batch = torch.randint(100000, size=(batch_size,))
        batches = split_batch(large_batch, world_size)  # If I remove this line and send the large batch instead, no error is observed.
        print(f'queuing batch {i}', file=sys.stderr)
        for batch, queue in zip(batches, queues):
            queue.put(batch)
    for q in queues:
        q.join()
    for p in processes:
        p.terminate()


def main() -> None:
    run_demo(4)


if __name__ == '__main__':
    main()
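As the inline comments note, cloning the chunks (so that each one owns its own storage instead of being a view into the shared parent buffer) avoids the problem. A minimal sketch of that workaround, reusing the imports from the script above; the function name split_batch_cloned is mine, chosen for illustration:

def split_batch_cloned(batch: torch.Tensor, world_size: int) -> Tuple[torch.Tensor, ...]:
    # Clone each view returned by torch.split so every chunk gets its own storage
    # before it is pickled and pushed onto the JoinableQueue.
    return tuple(chunk.clone() for chunk in torch.split(batch, batch.shape[0] // world_size))

With this in place of split_batch, no error is observed.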
On CPU the behaviour is more random: I sometimes observe the following error after some runtime:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
metadata = storage._share_filename_cpu_()
RuntimeError: Trying to resize storage that is not resizable
while at other times the code runs successfully.
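For context, the chunks returned by torch.split are views that share the parent tensor's storage rather than owning their own buffers, which seems consistent with the storage-sharing code path (reduce_storage / _share_filename_cpu_) in the traceback above. A quick illustrative check, not part of the repro:

import torch

large = torch.randint(100000, size=(64,))
chunks = torch.split(large, 16)

# Each chunk is a view into large's buffer, not an independent copy.
print(chunks[0].data_ptr() == large.data_ptr())                           # True: same underlying buffer
print((chunks[1].data_ptr() - large.data_ptr()) // large.element_size())  # 16: second chunk starts 16 elements in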
I verified that the code runs fine on 1.9.1+cu102 on both CPU and GPU, but I don't know about other versions.
Versions
CUDA environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2700.202
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.03
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi
CPU environment:
PyTorch version: 2.0.0.dev20230130+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.49-linuxkit-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping: 10
CPU MHz: 2591.608
BogoMIPS: 5183.21
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 arat
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] numpyro==0.6.0
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230130+cu116
[pip3] torchaudio==2.0.0.dev20230130+cu116
[pip3] torchvision==0.15.0.dev20230130+cu116
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.24.1 pypi_0 pypi
[conda] numpyro 0.6.0 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchaudio 2.0.0.dev20230130+cu116 pypi_0 pypi
[conda] torchvision 0.15.0.dev20230130+cu116 pypi_0 pypi