4251 Assignment 6
Ref. https://matlab.mathworks.com/
https://www.electronicdesign.com/technologies/embedded/software/article/21801909/tall-arrays-allow-cross-platform-manipulation
https://www.matlabexpo.com/content/dam/mathworks/mathworks-dot-com/images/events/matlabexpo/in/2018/tackling-big-data-using-matlab.pdf
https://idsc.at/wp-content/uploads/2017/06/c_Martynenko-MathWorks@IDSC_2017.pdf
https://www.uc.pt/site/assets/files/818807/parallel_and_distributed_computing-1.pdf
https://drive.mathworks.com/files/10042023_Cummins/Solution_predictDriverTip.mlx
https://drive.mathworks.com/files/10042023_Cummins/MapReduce/CountAirlinesMapReduceExample.mlx
Brief Information:
❖ Data Store:
A data store is a repository for persistently storing and managing collections of data, ranging from traditional SQL databases to NoSQL databases, data lakes, and cloud storage solutions.
❖ Tall Array:
In MATLAB, a tall array is a data type designed to handle large datasets that do not fit entirely in memory. Tall arrays let you work with this big data efficiently, using familiar MATLAB syntax, while computations are performed in a chunk-wise manner. This is particularly useful for datasets larger than the available RAM, allowing analysis and manipulation without loading the entire dataset at once; a short example follows the list below.
❖ Out-of-Memory Operations: Tall arrays are used for datasets that exceed your system's
memory. MATLAB processes these arrays by using algorithms that break the data into
manageable pieces or chunks, performing operations on each chunk sequentially and then
combining the results.
❖ Lazy Evaluation: MATLAB employs a technique called lazy evaluation with tall arrays.
Operations on tall arrays are queued up and only executed when the results are needed,
such as for display, plotting, or further calculations that require an actual numerical value.
This approach minimizes unnecessary computations and memory use.
❖ Scalability: Operations on tall arrays can scale from small datasets on your desktop to
large datasets distributed across a cluster when used in conjunction with MATLAB
Parallel Server.
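Below is a minimal sketch of the tall-array workflow. It assumes the airlinesmall.csv sample file that ships with MATLAB; swap in your own datastore location and variable names as needed.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';            % this sample file encodes missing values as 'NA'

tt = tall(ds);                       % tall table backed by the datastore
delay = tt.ArrDelay;

m = mean(delay, 'omitnan');          % queued, not yet evaluated (lazy evaluation)
m = gather(m)                        % triggers the chunk-wise computation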
❖ Distributed arrays:
In MATLAB, distributed arrays are a data type designed for parallel computing, enabling data to
be stored and processed across multiple workers in a cluster environment. This capability is
particularly useful for handling very large arrays that exceed the memory capacity of a single
machine, allowing you to perform computations that are otherwise not feasible on standard
hardware.
❖ Parallel Data Storage: Distributed arrays split data across the workers in a parallel pool.
Each worker holds a portion of the entire array, thus distributing memory and
computational load.
❖ Scalability: Because data is distributed, the size of the data that can be handled scales
with the number of workers and the memory available to each worker in the cluster.
❖ Integration with MATLAB’s Parallel Computing Tools: Distributed arrays are deeply
integrated with MATLAB’s parallel computing tools, making it easy to parallelize
computations without extensive rewriting of code.
When you create a distributed array, MATLAB divides the array into smaller chunks, which are
then spread out across the available workers in a MATLAB parallel pool (a group of MATLAB
workers running on a cluster or a multi-core machine). Each worker operates on its chunk of the
array independently. This division of labor means that computations can be done in parallel,
significantly speeding up operations that are suitable for parallel execution.
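As a minimal sketch (this requires Parallel Computing Toolbox; the matrix size is arbitrary), the following creates a distributed array on a local pool and operates on it with ordinary MATLAB syntax:

pool = parpool('local');              % size the pool to your machine

A = distributed.randn(10000, 1000);   % each worker stores a slice of A
colMeans = mean(A, 1);                % computed in parallel across the workers

result = gather(colMeans);            % collect the (small) result on the client
delete(pool);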
Use Cases
Integrating big data platforms can significantly aid in managing and processing large datasets.
Technologies like Apache Hadoop and Apache Spark provide frameworks for distributed storage
and computing, which can be leveraged to preprocess data, perform feature extraction, and even
train machine learning models.
❖ Apache Hadoop: Useful for batch processing of very large data sets, particularly when
the data is too big to fit in the memory of a single machine. Hadoop's HDFS (Hadoop
Distributed File System) offers reliable data storage and can work with commodity
hardware, reducing costs.
❖ Apache Spark: Excels in fast data processing and supports in-memory computing, which
is especially effective for iterative algorithms common in machine learning. Spark also
provides libraries such as MLlib for scalable machine learning algorithms.
Online learning algorithms are designed to update models incrementally as new data arrives,
rather than retraining models from scratch with each new data batch. This can be particularly
useful for large datasets or data streams.
❖ Online Learning: Algorithms like stochastic gradient descent (SGD) are inherently online
and can process data points individually or in mini-batches, thus handling large datasets
efficiently.
Reducing the volume of data without losing valuable information can make machine learning workloads far more manageable; a sampling-plus-PCA sketch follows the list below.
❖ Sampling: Techniques such as random sampling or stratified sampling can reduce the
dataset size while maintaining its statistical properties.
❖ Dimensionality Reduction: Techniques like PCA (Principal Component Analysis), t-
SNE, or autoencoders reduce the number of random variables under consideration, which
can simplify the data and speed up the learning process.
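As a minimal sketch of sampling followed by PCA (pca requires Statistics and Machine Learning Toolbox; the data here is a synthetic stand-in):

rng(0);                                 % reproducible sampling
X = randn(1e5, 50);                     % stand-in for a large numeric dataset

idx = randperm(size(X,1), 1e4);         % 10% simple random sample
Xs = X(idx, :);

[coeff, score, ~, ~, explained] = pca(Xs);
k = find(cumsum(explained) >= 95, 1);   % components explaining 95% of variance
Xreduced = score(:, 1:k);               % reduced-dimension representation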
Incremental learning models are trained progressively, assimilating new data as it becomes available. This method is ideal for datasets that continuously grow over time; a short sketch follows.
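As a minimal sketch of the online/incremental style described above, the following uses MATLAB's incrementalClassificationLinear (Statistics and Machine Learning Toolbox, R2020b or later); the data stream is simulated, and SGD-family solvers can be selected via the 'Solver' option:

rng(1);
Mdl = incrementalClassificationLinear('ClassNames', [0 1]);

for k = 1:20
    X = randn(100, 5);                          % simulated incoming chunk
    y = double(X(:,1) + 0.1*randn(100,1) > 0);  % simulated 0/1 labels
    Mdl = updateMetricsAndFit(Mdl, X, y);       % track performance, then update model
end

Mdl.Metrics                                     % cumulative and windowed error estimates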
Cloud platforms offer scalable hardware resources and managed services for machine learning.
These platforms often provide tools that automate many aspects of scaling machine learning
workflows.
❖ AWS, Google Cloud, Azure: These platforms offer managed machine learning services
like Google Cloud ML Engine, Amazon SageMaker, and Azure Machine Learning
Studio, which can scale as needed based on the dataset size and computational
requirements.
❖ Data Formats: Formats like HDF5 or Parquet are optimized for high-speed data access and efficient storage, which is essential when working with large datasets; a short read/write sketch follows.
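As a minimal sketch (parquetwrite and parquetread are built into MATLAB R2019a and later; the file name is arbitrary):

T = table((1:5)', rand(5,1), 'VariableNames', {'id', 'value'});
parquetwrite('sample.parquet', T);    % columnar, compressed on-disk format
T2 = parquetread('sample.parquet');   % read it back as a table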
● MapReduce in MATLAB:
MapReduce in MATLAB is a data processing technique that enables you to manage and analyze
large data sets that might not fit into memory all at once. This method borrows from the
MapReduce paradigm popularized by Google and implemented in technologies such as Apache
Hadoop. The paradigm is especially beneficial for distributed processing across multiple machines; MATLAB's implementation runs on a single computer by default and can scale out to a cluster or to Hadoop via Parallel Computing Toolbox and MATLAB Parallel Server.
❖ Processing Large Datasets: MATLAB can handle very large data sets by working on one
chunk at a time, allowing for data analysis that exceeds system memory limits.
❖ Map and Reduce Functions:
Map Function: Breaks down the big data problem into smaller chunks, operates
on each chunk of data independently, and emits key-value pairs.
Reduce Function: Takes the output from the Map function (intermediate key-
value pairs), consolidates the results with the same keys, and processes these
grouped results to produce the final output.
❖ Scalability:
MATLAB uses efficient data handling strategies during the MapReduce process, such as automatic partitioning of data and management of intermediate data storage. This maximizes the use of available memory and disk resources.
❖ Flexibility:
The user defines the operations performed in the Map and Reduce functions, providing flexibility to perform a wide range of data analysis and mathematical operations.
❖ Ease of Use:
The results from MapReduce can be integrated seamlessly with other MATLAB functions and toolboxes for further analysis or visualization.
● Implementation:
The map function countMapper leverages the fact that the data is categorical. The countcats and
categories functions are used on each block of the input data to generate key/value pairs of the
airline name and associated count.
The reduce function countReducer reads in the intermediate data produced by the map function
and adds together all of the counts to produce a single final count for each airline carrier.
Run mapreduce on the data. The map and reduce functions count the number of instances of
each airline carrier name in each block of data, then combine those intermediate counts into a
final count. This method leverages the intermediate sorting by unique key performed by
mapreduce. The functions countMapper and countReducer are included at the end of this script.
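A minimal sketch of that workflow follows, modeled on the CountAirlinesMapReduceExample referenced above. It assumes the airlinesmall.csv sample file that ships with MATLAB, and the function bodies are reconstructions rather than the script's verbatim code.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'UniqueCarrier'};
ds.SelectedFormats = {'%C'};                  % read the carrier column as categorical

result = mapreduce(ds, @countMapper, @countReducer);
readall(result)                               % final key/value table of counts

function countMapper(data, ~, intermKVStore)
    % Count category occurrences in this block; emit one key/value pair per airline.
    counts = countcats(data.UniqueCarrier);
    keys = categories(data.UniqueCarrier);
    addmulti(intermKVStore, keys, num2cell(counts));
end

function countReducer(key, intermValIter, outKVStore)
    % Sum the per-block counts for one airline into a single total.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end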
Conclusion: Advantages of using datastore, tall arrays, and mapreduce for Big Data
When working with Big Data in MATLAB, it's crucial to use tools and techniques that can
handle large volumes of data efficiently. MATLAB provides several features designed for this
purpose, such as datastores, tall arrays, and the mapreduce programming paradigm. Each of these
features has specific advantages for dealing with Big Data, making them indispensable tools for
data scientists and engineers. Advantages of these features are as follows:
1. Datastore
A Datastore in MATLAB is a repository for collections of data that are too large to fit in memory. Using a datastore allows you to manage data in a way that enhances the performance and scalability of data-intensive applications; a chunk-wise reading sketch follows the list below.
Advantages:
❖ Handling Large Files: Datastore can handle very large datasets that cannot be loaded into
memory all at once. It enables efficient reading in chunks, which is perfect for processing
and analyzing large files incrementally.
❖ Support for Different Formats: MATLAB supports various types of datastores for
different data formats including spreadsheets, images, text files, and key-value pairs. This
flexibility allows seamless integration and manipulation of data from diverse sources.
❖ Ease of Preprocessing: Datastores simplify the preprocessing of data with functionalities
like data transformation and filtering during the import phase, which can significantly
streamline workflows.
❖ Efficiency: Datastores are optimized for reading and processing data in large blocks,
reducing the time and system resources required for data handling.
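As a minimal sketch of chunk-wise reading (again using the bundled airlinesmall.csv; adjust the file and variable names for your data):

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';

total = 0; n = 0;
while hasdata(ds)
    t = read(ds);                         % one manageable block at a time
    d = t.ArrDelay(~isnan(t.ArrDelay));   % drop missing values in this block
    total = total + sum(d);
    n = n + numel(d);
end
meanDelay = total / n                     % mean arrival delay across all blocks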
2. Tall Arrays
Tall arrays extend MATLAB's numeric and logical arrays to handle big data. They are designed
to work with datasets that are too large to fit in memory, facilitating operations and computations
as if they were in-memory arrays.
Advantages:
❖ Transparent Handling of Large Data: Users can operate on tall arrays almost the same
way they work with in-memory arrays. MATLAB manages the memory and disk
operations in the background, making the code simpler and cleaner.
❖ Integration with Datastore: Tall arrays can be directly created from datastores, providing
a seamless workflow from data reading to processing.
❖ Lazy Evaluation: MATLAB uses optimized lazy evaluation with tall arrays.
Computations are queued and only executed when required, ensuring that the most
efficient sequence of operations is used.
❖ Scalability and Speed: By leveraging MATLAB's built-in functions that are automatically parallelized and optimized for tall arrays, users can achieve high-performance analytics on large datasets; the sketch below shows several deferred operations evaluated in a single pass.
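As a minimal sketch of lazy evaluation (same sample file as before): several deferred operations are evaluated together in a single gather call, so the data is read only once.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);

d = tt.ArrDelay;
m = mean(d, 'omitnan');    % deferred
x = max(d);                % deferred (max omits NaN by default)

[m, x] = gather(m, x)      % one pass over the data computes both results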
3. MapReduce
The mapreduce function in MATLAB is a programming technique for performing big data
analytics. It allows you to process large amounts of data in a scalable way, even on a single
computer or across a cluster.
Advantages:
❖ Scalability Across Different Environments: The mapreduce algorithm can run on a single PC, a cluster, or cloud infrastructure, which makes it highly scalable and flexible (see the sketch after this list).
❖ Handling of Key-Value Pairs: It works by processing data in key-value pairs, a robust
structure for managing diverse data types and complex data manipulations.
❖ Efficient Data Reduction: The mapreduce framework is designed to filter and condense
large datasets into more manageable sizes during the 'map' phase, then further aggregate
or summarize this data in the 'reduce' phase, which is ideal for analytics.
❖ Customizable Processing: Users can define custom functions for both the 'map' and
'reduce' phases, providing a high degree of control over how data is analyzed and results
are generated.
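As a minimal sketch of that portability (the pool case needs Parallel Computing Toolbox; running against Hadoop additionally needs MATLAB Parallel Server):

mapreducer(0);          % serial execution in the client MATLAB session
% mapreducer(gcp);      % or: run on the current parallel pool instead
% result = mapreduce(ds, @countMapper, @countReducer);   % unchanged either way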