Data Basics

Introduction to DataLoader and Dataset

Read through the link.

Common Objects in DataLoader

  • Sampler: chooses which index to load per iteration. When batch_size is not None, the sampler is wrapped so that it yields a list of indices per batch.
  • Fetcher: takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes collate_fn on each batch of data, and drops the last, partially filled batch if drop_last is set (see the sketch below).
    • For an IterableDataset, the fetcher simply takes the next batch_size elements from the iterator as a batch.
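
The sketch below wires these pieces together explicitly, using a toy map-style dataset and an explicit RandomSampler so the index → fetch → collate path is visible. SquaresDataset is an illustrative name, not a library class; only torch is assumed.

```python
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler

class SquaresDataset(Dataset):
    """Toy map-style dataset: __getitem__ maps an index to one sample."""

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor([idx, idx * idx])

ds = SquaresDataset()
# DataLoader wraps the sampler so it yields lists of 4 indices; the fetcher
# calls ds[idx] for each index, and the default collate_fn stacks the 4
# samples into a single [4, 2] batch tensor.
loader = DataLoader(ds, batch_size=4, sampler=RandomSampler(ds), drop_last=True)
for batch in loader:
    print(batch.shape)  # torch.Size([4, 2]); the leftover batch of 2 is dropped
```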

Data/Control flow in DataLoader

  • Single Process:
         Sampler
            |
      index/indices
            |
            V
         Fetcher
            |
      index/indices
            |
            V
         dataset
            |
      Batch of data
            |
            V
        collate_fn
            |
            V
         output
  • Multiple processes:
          Sampler (Main process)
                    |
              index/indices
                    |
                    V
Index Multiprocessing Queue (one per worker)
                    |
              index/indices
                    |
                    V
          Fetcher (Worker process)
                    |
              index/indices
                    |
                    V
                 dataset
                    |
              Batch of data
                    |
                    V
                collate_fn
                    |
                    V
        Result Multiprocessing Queue
                    |
                   Data
                    |
                    V
      pin_memory_thread (Main process)
                    |
                    V
                  output

This is just the general data and control flow in DataLoader. There are further, more detailed mechanisms such as prefetching, worker status tracking, etc. A minimal configuration that exercises the multi-process path is sketched below.
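
A minimal configuration exercising that multi-process path, with the arguments annotated against the diagram (the dataset and sizes here are arbitrary placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(64, 3))
loader = DataLoader(
    ds,
    batch_size=8,
    num_workers=2,            # two worker processes run the Fetcher side
    pin_memory=True,          # pin_memory_thread in the main process pins each batch
    prefetch_factor=2,        # each worker keeps 2 batches in flight
    persistent_workers=True,  # keep workers alive across epochs
)
for (batch,) in loader:
    pass  # batch sits in pinned memory, ready for .to("cuda", non_blocking=True)
```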

Common gotchas for DataLoader

Most common questions about DataLoader involve multiple workers, i.e., when multiprocessing is enabled.

  • Default multiprocessing start methods differ across platforms, per Python (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
    • Control randomness per worker using worker_init_fn. Otherwise, DataLoader either becomes non-deterministic when using spawn, or every worker shares the same random state when using fork (see the first sketch after this list).
    • COW in fork effectively becomes copy-on-access under Python, because merely reading a Python object updates its reference count. The simplest fix inside a Dataset implementation is to store data in Tensors or NumPy arrays instead of arbitrary Python objects like list and dict (see the second sketch below).
  • Difference between Map-style Dataset and Iterable-style Dataset:
    • A Map-style Dataset can utilize the indices sampled in the main process to get automatic sharding.
    • An Iterable-style Dataset requires users to manually implement sharding inside the __iter__ method using torch.utils.data.get_worker_info(). Please check the example in the IterableDataset documentation; the third sketch below follows it.
  • Shuffling is not provided for Iterable-style Datasets. If needed, users must implement the shuffle logic inside the IterableDataset class. (This is solved by the TorchData project.)
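
First sketch (per-worker randomness): PyTorch already seeds each worker's torch RNG as base_seed + worker_id; the seed_worker helper below (a name chosen here, not a library API) forwards that seed to NumPy and random so forked workers stop sharing state.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # torch.initial_seed() inside a worker is already base_seed + worker_id;
    # reuse it so NumPy and random also diverge across workers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

ds = TensorDataset(torch.arange(10.0).unsqueeze(1))
loader = DataLoader(ds, batch_size=4, num_workers=2, worker_init_fn=seed_worker)
```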
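Second sketch (copy-on-access): merely reading an element of a Python list updates that object's reference count, dirtying the memory page so the OS copies it into each forked worker; a NumPy buffer is read without per-element refcounting, so its pages stay shared.

```python
import numpy as np

# Big Python list: a forked worker that only *reads* these elements still
# bumps per-object refcounts, dirtying pages -> effectively copy-on-access.
labels_as_list = [i % 1000 for i in range(10_000_000)]

# Same data as one NumPy buffer: reads leave the pages clean, so the
# parent process and the forked workers keep sharing the memory.
labels_as_array = np.asarray(labels_as_list, dtype=np.int64)
```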
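Third sketch (manual sharding), modeled on the IterableDataset example in the PyTorch docs; RangeIterableDataset is an illustrative name. Each worker inspects get_worker_info() and iterates only its own disjoint slice.

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeIterableDataset(IterableDataset):
    """Yields integers in [start, end); each worker takes a disjoint slice."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process loading: yield the whole range
            lo, hi = self.start, self.end
        else:  # in a worker: split the range evenly across num_workers
            per_worker = int(math.ceil((self.end - self.start) / info.num_workers))
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

# Without the sharding above, 2 workers would each yield all of 0..7.
loader = DataLoader(RangeIterableDataset(0, 8), batch_size=2, num_workers=2)
```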

Introduction to next-generation Data API (TorchData)

Read through link and link.

Expected features:

  • Automatic/Dynamic sharding
  • Determinism Control
  • Snapshotting
  • DataFrame integration
  • etc.
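
For a taste of the DataPipe style (a hedged sketch assuming the separate torchdata package, 0.x series, is installed; double is a named helper defined here so it pickles under spawn):

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def double(x):  # named function so worker processes can pickle it
    return x * 2

# shuffle() gives loader-controlled shuffling for iterable-style data, and
# sharding_filter() automatically gives each worker a disjoint shard.
dp = IterableWrapper(range(10)).shuffle().sharding_filter().map(double)

print(list(DataLoader(dp, batch_size=None, num_workers=2)))
```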

Lab for DataLoader and DataPipe

Go to N1222094 for the Data Lab.

Next

Unit 8: function transforms/Training Loops (Optional) - vmap
