
Assignment 6

Name – Renu Tamsekar


C number – C22020111255

Aim: Implementation and tackling of Big Data using MATLAB

Ref. https://matlab.mathworks.com/

https://www.electronicdesign.com/technologies/embedded/software/article/21801909/tall-arrays-allow-cross-platform-manipulation

https://www.matlabexpo.com/content/dam/mathworks/mathworks-dot-com/images/events/matlabexpo/in/2018/tackling-big-data-using-matlab.pdf

https://idsc.at/wp-content/uploads/2017/06/c_Martynenko-MathWorks@IDSC_2017.pdf

https://www.uc.pt/site/assets/files/818807/parallel_and_distributed_computing-1.pdf

https://drive.mathworks.com/files/10042023_Cummins/Solution_predictDriverTip.mlx

https://drive.mathworks.com/files/10042023_Cummins/MapReduce/CountAirlinesMapReduceExample.mlx

Brief Information:

❖ Data Store:

A data store is a repository for persistently storing and managing collections of data. It can range
from traditional relational (SQL) databases to NoSQL databases, data lakes, and cloud storage
solutions.

❖ Tall Array:

In MATLAB, a tall array is a data type designed to handle large datasets that do not fit entirely in
memory. Tall arrays provide a means to work with this big data efficiently, using familiar
MATLAB syntax, while performing computations in a chunk-wise manner. This is particularly
useful for datasets that are larger than the amount of available RAM, allowing for analysis and
manipulation without requiring the entire dataset to be loaded at once.

Characteristics of Tall Arrays

❖ Out-of-Memory Operations: Tall arrays are used for datasets that exceed your system's
memory. MATLAB processes these arrays by using algorithms that break the data into
manageable pieces or chunks, performing operations on each chunk sequentially and then
combining the results.
❖ Lazy Evaluation: MATLAB employs a technique called lazy evaluation with tall arrays.
Operations on tall arrays are queued up and only executed when the results are needed,
such as for display, plotting, or further calculations that require an actual numerical value.
This approach minimizes unnecessary computations and memory use.
❖ Scalability: Operations on tall arrays can scale from small datasets on your desktop to
large datasets distributed across a cluster when used in conjunction with MATLAB
Parallel Server.
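
For example, a minimal sketch of lazy evaluation, assuming the airlinesmall.csv sample file that ships with MATLAB:

% Operations on a tall array are queued; gather triggers the evaluation.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
tt = tall(ds);                       % tall table backed by the datastore
m = mean(tt.ArrDelay, 'omitnan');    % deferred: no data has been read yet
m = gather(m)                        % now MATLAB streams through the chunks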

❖ Distributed arrays:

In MATLAB, distributed arrays are a data type designed for parallel computing, enabling data to
be stored and processed across multiple workers in a cluster environment. This capability is
particularly useful for handling very large arrays that exceed the memory capacity of a single
machine, allowing you to perform computations that are otherwise not feasible on standard
hardware.

Key Features of Distributed Arrays

❖ Parallel Data Storage: Distributed arrays split data across the workers in a parallel pool.
Each worker holds a portion of the entire array, thus distributing memory and
computational load.
❖ Scalability: Because data is distributed, the size of the data that can be handled scales
with the number of workers and the memory available to each worker in the cluster.
❖ Integration with MATLAB’s Parallel Computing Tools: Distributed arrays are deeply
integrated with MATLAB’s parallel computing tools, making it easy to parallelize
computations without extensive rewriting of code.

How Distributed Arrays Work

When you create a distributed array, MATLAB divides the array into smaller chunks, which are
then spread out across the available workers in a MATLAB parallel pool (a group of MATLAB
workers running on a cluster or a multi-core machine). Each worker operates on its chunk of the
array independently. This division of labor means that computations can be done in parallel,
significantly speeding up operations that are suitable for parallel execution.
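
A minimal sketch, assuming the Parallel Computing Toolbox is installed so a parallel pool can be opened:

% Spread a large matrix across the workers of a parallel pool.
p = parpool;                     % start a pool (errors if one is already open)
A = distributed.rand(10000);     % 10000x10000 array partitioned across workers
total = gather(sum(A(:)));       % each worker sums its slice; scalar comes back
delete(p);                       % shut the pool down when finished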

Use Cases

Distributed arrays are particularly useful in scenarios involving:

❖ Large-scale numerical simulations and optimizations
❖ Big data processing tasks that can be parallelized
❖ Machine learning algorithms on large datasets

❖ Scaling Machine Learning to Large Datasets:


Scaling machine learning to handle large datasets involves overcoming challenges related to data
volume, computation speed, memory constraints, and potentially the complexity of the models
themselves. Efficiently scaling machine learning processes is critical for leveraging the full value
of big data in predictive analytics and other AI-driven applications. The following are some
strategies and considerations for scaling machine learning to large datasets:

1. Utilizing Big Data Technologies

Integrating big data platforms can significantly aid in managing and processing large datasets.
Technologies like Apache Hadoop and Apache Spark provide frameworks for distributed storage
and computing, which can be leveraged to preprocess data, perform feature extraction, and even
train machine learning models.

❖ Apache Hadoop: Useful for batch processing of very large data sets, particularly when
the data is too big to fit in the memory of a single machine. Hadoop's HDFS (Hadoop
Distributed File System) offers reliable data storage and can work with commodity
hardware, reducing costs.
❖ Apache Spark: Excels in fast data processing and supports in-memory computing, which
is especially effective for iterative algorithms common in machine learning. Spark also
provides libraries such as MLlib for scalable machine learning algorithms.

2. Using Distributed Computing

Distributed computing involves parallelizing the workload across multiple machines or
processors. This approach is essential when scaling machine learning algorithms to handle
large-scale datasets.

❖ Parallel Processing: Use multi-core processing or GPU computing to parallelize data
processing and model training. Technologies like CUDA for NVIDIA GPUs enable
significant reductions in training time for complex models like deep neural networks
(a GPU sketch follows this list).
❖ Distributed Machine Learning Libraries: Tools like TensorFlow, PyTorch, and Dask
support distributed computing, allowing the data and the computational workload to be
spread across many servers.
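
Where GPU hardware is available, MATLAB exposes it through gpuArray in the Parallel Computing Toolbox. A minimal sketch, assuming a CUDA-capable GPU and that toolbox:

% Move data to the GPU, compute there, and bring the result back.
G = gpuArray(rand(5000));   % copy a 5000x5000 matrix into GPU memory
H = G * G';                 % the matrix multiply executes on the GPU
result = gather(H);         % copy the result back to host (CPU) memory
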
3. Employing Online Algorithms

Online learning algorithms are designed to update models incrementally as new data arrives,
rather than retraining models from scratch with each new data batch. This can be particularly
useful for large datasets or data streams.

❖ Online Learning: Algorithms like stochastic gradient descent (SGD) are inherently online
and can process data points individually or in mini-batches, thus handling large datasets
efficiently.
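
As an illustration, a minimal hand-rolled mini-batch SGD loop for linear regression on synthetic data (illustrative only; MATLAB's fitrlinear with 'Solver','sgd' is a built-in alternative):

% Mini-batch SGD for linear regression; each pass touches one small batch.
rng(0);
n = 1e5; p = 10;
X = randn(n, p);
wTrue = randn(p, 1);
y = X*wTrue + 0.1*randn(n, 1);

w = zeros(p, 1);          % model weights, updated incrementally
eta = 0.01;               % learning rate
batchSize = 100;          % mini-batch size
for i = 1:batchSize:n
    idx = i:min(i+batchSize-1, n);
    Xb = X(idx, :); yb = y(idx);
    grad = Xb' * (Xb*w - yb) / numel(idx);   % gradient of the mean squared error
    w = w - eta*grad;                        % SGD update on this mini-batch
end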

4. Data Reduction Techniques

Reducing the volume of data without losing valuable information can make machine learning
algorithms more manageable.

❖ Sampling: Techniques such as random sampling or stratified sampling can reduce the
dataset size while maintaining its statistical properties.
❖ Dimensionality Reduction: Techniques like PCA (Principal Component Analysis),
t-SNE, or autoencoders reduce the number of variables under consideration, which
can simplify the data and speed up the learning process.
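
For example, a minimal PCA sketch using the pca function from the Statistics and Machine Learning Toolbox, shown on the built-in Fisher iris data:

% Project the data onto the components that explain most of the variance.
load fisheriris                          % provides `meas`, a 150x4 matrix
[coeff, score, ~, ~, explained] = pca(meas);
k = find(cumsum(explained) >= 95, 1);    % components explaining >= 95% variance
Xreduced = score(:, 1:k);                % lower-dimensional representation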

5. Incremental Learning

Incremental learning models are trained progressively, assimilating new data as it becomes
available. This method is ideal for datasets that continuously grow over time.

❖ Models Supporting Incremental Learning: Some models naturally support incremental
learning, such as decision trees, k-nearest neighbors, and some ensemble methods like
AdaBoost.

6. Cloud-Based Machine Learning Platforms

Cloud platforms offer scalable hardware resources and managed services for machine learning.
These platforms often provide tools that automate many aspects of scaling machine learning
workflows.

❖ AWS, Google Cloud, Azure: These platforms offer managed machine learning services
like Google Cloud ML Engine, Amazon SageMaker, and Azure Machine Learning
Studio, which can scale as needed based on the dataset size and computational
requirements.

7. Efficient Data Storage and Access


Efficiently accessing and storing data can be a bottleneck in processing large datasets. Using
databases and data formats that support high-throughput and scalable read/write operations is
crucial.

❖ Data Formats: Formats like HDF5 or Parquet are optimized for high-speed data access
and efficient storage, which is essential when working with large datasets.
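
As a small illustration, MATLAB's built-in HDF5 functions support chunked, partial reads and writes, so only the blocks you need ever touch memory. The file name, dataset name, and sizes below are arbitrary choices for illustration:

% Chunked HDF5 storage: write and read one block without loading the rest.
h5create('bigdata.h5', '/X', [1e6 50], 'ChunkSize', [1e4 50]);
h5write('bigdata.h5', '/X', rand(1e4, 50), [1 1], [1e4 50]);  % write one block
block = h5read('bigdata.h5', '/X', [1 1], [1e4 50]);          % read it back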

● MapReduce in MATLAB:

MapReduce in MATLAB is a data processing technique that enables you to manage and analyze
large data sets that might not fit into memory all at once. This method borrows from the
MapReduce paradigm popularized by Google and implemented in technologies such as Apache
Hadoop. It's especially beneficial for distributed processing across multiple machines, although
MATLAB's implementation focuses on processing on a single computer using its parallel
processing tools.

Characteristics of MapReduce in MATLAB

❖ Processing Large Datasets: MATLAB can handle very large data sets by working on one
chunk at a time, allowing for data analysis that exceeds system memory limits.
❖ Map and Reduce Functions:

Map Function: Breaks down the big data problem into smaller chunks, operates
on each chunk of data independently, and emits key-value pairs.

Reduce Function: Takes the output from the Map function (intermediate key-value
pairs), consolidates the results with the same keys, and processes these grouped
results to produce the final output.

❖ Efficient Data Handling:

MATLAB uses efficient data handling strategies during the MapReduce process,
such as automatic partitioning of data and management of intermediate data
storage. This maximizes the use of available memory and disk resources.

❖ Scalability:

Though primarily designed to run on a single workstation, MATLAB’s
implementation of MapReduce can be scaled up to work across clusters using
MATLAB Parallel Server. This allows MATLAB computations to be scaled
using a distributed framework in which the data is spread over the cluster.

❖ Flexibility:
The user defines the operations performed in the Map and Reduce functions,
providing flexibility to perform a wide range of data analysis and mathematical
operations.

❖ Integration with MATLAB Ecosystem:

The results from MapReduce can be integrated seamlessly with other MATLAB
functions and toolboxes for further analysis or visualization.

❖ Ease of Use:

MATLAB provides a simplified approach to applying the MapReduce paradigm,
allowing users who may not be experts in distributed computing to implement and
benefit from this powerful data processing technique.

● Implementation:

1. Creating a tall array using a datastore
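
A minimal sketch of this step, assuming the airlinesmall.csv sample file that ships with MATLAB:

% Create a datastore over the file, then back a tall table with it.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay', 'UniqueCarrier'};
tt = tall(ds);          % tall table; the data itself stays on disk
gather(head(tt))        % evaluate just enough to preview the first rows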


2. Implementation of MapReduce

Count Flights by Airline: use mapreduce to count the number of flights made by each
unique airline carrier in a data set.

The map function countMapper leverages the fact that the data is categorical. The countcats and
categories functions are used on each block of the input data to generate key/value pairs of the
airline name and associated count.

The reduce function countReducer reads in the intermediate data produced by the map function
and adds together all of the counts to produce a single final count for each airline carrier.
Run mapreduce on the data. The map and reduce functions count the number of instances of
each airline carrier name in each block of data, then combine those intermediate counts into a
final count. This method leverages the intermediate sorting by unique key performed by
mapreduce. The functions countMapper and countReducer are included at the end of the
referenced script; a sketch of the full workflow follows.
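
A minimal sketch along the lines of MATLAB's documented airline example, assuming the airlinesmall.csv sample file (details may differ from the referenced .mlx):

% Read only the carrier column, imported as categorical data.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'UniqueCarrier'};
ds.SelectedFormats = {'%C'};                   % '%C' imports as categorical

outds = mapreduce(ds, @countMapper, @countReducer);
readall(outds)                                 % table of airline keys and counts

function countMapper(data, ~, intermKVStore)
    % Emit one key-value pair per airline seen in this chunk of data.
    a = data.UniqueCarrier;
    counts = countcats(a);                     % per-category counts in the chunk
    keys = categories(a);                      % airline names
    addmulti(intermKVStore, keys, num2cell(counts));
end

function countReducer(key, intermValIter, outKVStore)
    % Sum the per-chunk counts that share the same airline key.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
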
Conclusion: Advantages of using datastore, tall arrays, and mapreduce for Big Data

When working with Big Data in MATLAB, it's crucial to use tools and techniques that can
handle large volumes of data efficiently. MATLAB provides several features designed for this
purpose, such as datastores, tall arrays, and the mapreduce programming paradigm. Each of these
features has specific advantages for dealing with Big Data, making them indispensable tools for
data scientists and engineers. Advantages of these features are as follows:

1. Datastore

A Datastore in MATLAB is a repository for collections of data that are too large to fit in
memory. Using a datastore allows you to manage data in a way that enhances the performance
and scalability of data-intensive applications.

Advantages:

❖ Handling Large Files: Datastore can handle very large datasets that cannot be loaded into
memory all at once. It enables efficient reading in chunks, which is perfect for processing
and analyzing large files incrementally.
❖ Support for Different Formats: MATLAB supports various types of datastores for
different data formats including spreadsheets, images, text files, and key-value pairs. This
flexibility allows seamless integration and manipulation of data from diverse sources.
❖ Ease of Preprocessing: Datastores simplify the preprocessing of data with functionalities
like data transformation and filtering during the import phase, which can significantly
streamline workflows (a transform sketch follows this list).
❖ Efficiency: Datastores are optimized for reading and processing data in large blocks,
reducing the time and system resources required for data handling.
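
A minimal sketch of chunk-wise preprocessing with a transformed datastore, again assuming the airlinesmall.csv sample file:

% Drop rows with a missing arrival delay, one chunk at a time on read.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay', 'UniqueCarrier'});
tds = transform(ds, @(t) t(~isnan(t.ArrDelay), :));   % filter applied per chunk
firstChunk = read(tds);                               % reads and filters one chunk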

2. Tall Arrays

Tall arrays extend MATLAB's numeric and logical arrays to handle big data. They are designed
to work with datasets that are too large to fit in memory, facilitating operations and computations
as if they were in-memory arrays.

Advantages:

❖ Transparent Handling of Large Data: Users can operate on tall arrays almost the same
way they work with in-memory arrays. MATLAB manages the memory and disk
operations in the background, making the code simpler and cleaner.
❖ Integration with Datastore: Tall arrays can be directly created from datastores, providing
a seamless workflow from data reading to processing.
❖ Lazy Evaluation: MATLAB uses optimized lazy evaluation with tall arrays.
Computations are queued and only executed when required, ensuring that the most
efficient sequence of operations is used.
❖ Scalability and Speed: By leveraging MATLAB's built-in functions, which are
automatically parallelized and optimized for tall arrays, users can achieve
high-performance analytics on large datasets.

3. MapReduce

The mapreduce function in MATLAB is a programming technique for performing big data
analytics. It allows you to process large amounts of data in a scalable way, even on a single
computer or across a cluster.

Advantages:

❖ Scalability Across Different Environments: The mapreduce algorithm can run on a single
PC, a cluster, or cloud infrastructure, which makes it highly scalable and flexible.
❖ Handling of Key-Value Pairs: It works by processing data in key-value pairs, a robust
structure for managing diverse data types and complex data manipulations.
❖ Efficient Data Reduction: The mapreduce framework is designed to filter and condense
large datasets into more manageable sizes during the 'map' phase, then further aggregate
or summarize this data in the 'reduce' phase, which is ideal for analytics.
❖ Customizable Processing: Users can define custom functions for both the 'map' and
'reduce' phases, providing a high degree of control over how data is analyzed and results
are generated.
