4251 Assignment 6
Ref. https://matlab.mathworks.com/
https://www.electronicdesign.com/technologies/embedded/software/article/21801909/tall-arrays-allow-cross-platform-manipulation
https://www.matlabexpo.com/content/dam/mathworks/mathworks-dot-com/images/events/matlabexpo/in/2018/tackling-big-data-using-matlab.pdf
https://idsc.at/wp-content/uploads/2017/06/c_Martynenko-MathWorks@IDSC_2017.pdf
https://www.uc.pt/site/assets/files/818807/parallel_and_distributed_computing-1.pdf
https://drive.mathworks.com/files/10042023_Cummins/Solution_predictDriverTip.mlx
https://drive.mathworks.com/files/10042023_Cummins/MapReduce/CountAirlinesMapReduceExample.mlx
Brief Information:
❖ Data Store:
A data store is a repository for persistently storing and managing collections of data, ranging from traditional SQL databases to NoSQL databases, data lakes, and cloud storage solutions.
❖ Tall Array:
In MATLAB, a tall array is a data type designed to handle large datasets that do not fit entirely in memory. Tall arrays let you work with this big data efficiently, using familiar MATLAB syntax, while computations are performed in a chunk-wise manner. This is particularly useful for datasets larger than the available RAM, allowing analysis and manipulation without loading the entire dataset at once; a short example follows the list below.
❖ Out-of-Memory Operations: Tall arrays are used for datasets that exceed your system's
memory. MATLAB processes these arrays by using algorithms that break the data into
manageable pieces or chunks, performing operations on each chunk sequentially and then
combining the results.
❖ Lazy Evaluation: MATLAB employs a technique called lazy evaluation with tall arrays.
Operations on tall arrays are queued up and only executed when the results are needed,
such as for display, plotting, or further calculations that require an actual numerical value.
This approach minimizes unnecessary computations and memory use.
❖ Scalability: Operations on tall arrays can scale from small datasets on your desktop to
large datasets distributed across a cluster when used in conjunction with MATLAB
Parallel Server.
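Below is a minimal sketch of the tall-array workflow. It assumes the airlinesmall.csv sample file that ships with MATLAB; swap in your own datastore location and variable names as needed.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';            % this sample file encodes missing values as 'NA'

tt = tall(ds);                       % tall table backed by the datastore
delay = tt.ArrDelay;

m = mean(delay, 'omitnan');          % queued, not yet evaluated (lazy evaluation)
m = gather(m)                        % triggers the chunk-wise computation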
❖ Distributed arrays:
In MATLAB, distributed arrays are a data type designed for parallel computing, enabling data to
be stored and processed across multiple workers in a cluster environment. This capability is
particularly useful for handling very large arrays that exceed the memory capacity of a single
machine, allowing you to perform computations that are otherwise not feasible on standard
hardware.
❖ Parallel Data Storage: Distributed arrays split data across the workers in a parallel pool.
Each worker holds a portion of the entire array, thus distributing memory and
computational load.
❖ Scalability: Because data is distributed, the size of the data that can be handled scales
with the number of workers and the memory available to each worker in the cluster.
❖ Integration with MATLAB’s Parallel Computing Tools: Distributed arrays are deeply
integrated with MATLAB’s parallel computing tools, making it easy to parallelize
computations without extensive rewriting of code.
When you create a distributed array, MATLAB divides the array into smaller chunks, which are
then spread out across the available workers in a MATLAB parallel pool (a group of MATLAB
workers running on a cluster or a multi-core machine). Each worker operates on its chunk of the
array independently. This division of labor means that computations can be done in parallel,
significantly speeding up operations that are suitable for parallel execution.
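As a minimal sketch (this requires Parallel Computing Toolbox; the matrix size is arbitrary), the following creates a distributed array on a local pool and operates on it with ordinary MATLAB syntax:

pool = parpool('local');              % size the pool to your machine

A = distributed.randn(10000, 1000);   % each worker stores a slice of A
colMeans = mean(A, 1);                % computed in parallel across the workers

result = gather(colMeans);            % collect the (small) result on the client
delete(pool);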
Use Cases
Integrating big data platforms can significantly aid in managing and processing large datasets.
Technologies like Apache Hadoop and Apache Spark provide frameworks for distributed storage
and computing, which can be leveraged to preprocess data, perform feature extraction, and even
train machine learning models.
❖ Apache Hadoop: Useful for batch processing of very large data sets, particularly when
the data is too big to fit in the memory of a single machine. Hadoop's HDFS (Hadoop
Distributed File System) offers reliable data storage and can work with commodity
hardware, reducing costs.
❖ Apache Spark: Excels in fast data processing and supports in-memory computing, which
is especially effective for iterative algorithms common in machine learning. Spark also
provides libraries such as MLlib for scalable machine learning algorithms.
Online learning algorithms are designed to update models incrementally as new data arrives,
rather than retraining models from scratch with each new data batch. This can be particularly
useful for large datasets or data streams.
❖ Online Learning: Algorithms like stochastic gradient descent (SGD) are inherently online
and can process data points individually or in mini-batches, thus handling large datasets
efficiently.
Reducing the volume of data without losing valuable information can make machine learning workloads far more manageable; a sampling-plus-PCA sketch follows the list below.
❖ Sampling: Techniques such as random sampling or stratified sampling can reduce the
dataset size while maintaining its statistical properties.
❖ Dimensionality Reduction: Techniques like PCA (Principal Component Analysis), t-
SNE, or autoencoders reduce the number of random variables under consideration, which
can simplify the data and speed up the learning process.
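As a minimal sketch of sampling followed by PCA (pca requires Statistics and Machine Learning Toolbox; the data here is a synthetic stand-in):

rng(0);                                 % reproducible sampling
X = randn(1e5, 50);                     % stand-in for a large numeric dataset

idx = randperm(size(X,1), 1e4);         % 10% simple random sample
Xs = X(idx, :);

[coeff, score, ~, ~, explained] = pca(Xs);
k = find(cumsum(explained) >= 95, 1);   % components explaining 95% of variance
Xreduced = score(:, 1:k);               % reduced-dimension representation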
Incremental learning models are trained progressively, assimilating new data as it becomes available. This method is ideal for datasets that continuously grow over time; a short sketch follows.
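As a minimal sketch of the online/incremental style described above, the following uses MATLAB's incrementalClassificationLinear (Statistics and Machine Learning Toolbox, R2020b or later); the data stream is simulated, and SGD-family solvers can be selected via the 'Solver' option:

rng(1);
Mdl = incrementalClassificationLinear('ClassNames', [0 1]);

for k = 1:20
    X = randn(100, 5);                          % simulated incoming chunk
    y = double(X(:,1) + 0.1*randn(100,1) > 0);  % simulated 0/1 labels
    Mdl = updateMetricsAndFit(Mdl, X, y);       % track performance, then update model
end

Mdl.Metrics                                     % cumulative and windowed error estimates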
Cloud platforms offer scalable hardware resources and managed services for machine learning.
These platforms often provide tools that automate many aspects of scaling machine learning
workflows.
❖ AWS, Google Cloud, Azure: These platforms offer managed machine learning services
like Google Cloud ML Engine, Amazon SageMaker, and Azure Machine Learning
Studio, which can scale as needed based on the dataset size and computational
requirements.
❖ Data Formats: Formats like HDF5 or Parquet are optimized for high-speed data access and efficient storage, which is essential when working with large datasets; a short read/write sketch follows.
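As a minimal sketch (parquetwrite and parquetread are built into MATLAB R2019a and later; the file name is arbitrary):

T = table((1:5)', rand(5,1), 'VariableNames', {'id', 'value'});
parquetwrite('sample.parquet', T);    % columnar, compressed on-disk format
T2 = parquetread('sample.parquet');   % read it back as a table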
● MapReduce in MATLAB:
MapReduce in MATLAB is a data processing technique that enables you to manage and analyze
large data sets that might not fit into memory all at once. This method borrows from the
MapReduce paradigm popularized by Google and implemented in technologies such as Apache
Hadoop. The paradigm is especially beneficial for distributed processing across multiple machines; MATLAB's implementation runs on a single computer by default and can scale out to a cluster or to Hadoop via Parallel Computing Toolbox and MATLAB Parallel Server.
❖ Processing Large Datasets: MATLAB can handle very large data sets by working on one
chunk at a time, allowing for data analysis that exceeds system memory limits.
❖ Map and Reduce Functions:
Map Function: Breaks down the big data problem into smaller chunks, operates
on each chunk of data independently, and emits key-value pairs.
Reduce Function: Takes the output from the Map function (intermediate key-
value pairs), consolidates the results with the same keys, and processes these
grouped results to produce the final output.
❖ Scalability:
MATLAB uses efficient data handling strategies during the MapReduce process, such as automatic partitioning of data and management of intermediate data storage. This maximizes the use of available memory and disk resources.
❖ Flexibility:
The user defines the operations performed in the Map and Reduce functions, providing flexibility to perform a wide range of data analysis and mathematical operations.
❖ Ease of Use:
The results from MapReduce can be integrated seamlessly with other MATLAB functions and toolboxes for further analysis or visualization.
● Implementation:
The map function countMapper leverages the fact that the data is categorical. The countcats and
categories functions are used on each block of the input data to generate key/value pairs of the
airline name and associated count.
The reduce function countReducer reads in the intermediate data produced by the map function
and adds together all of the counts to produce a single final count for each airline carrier.
Run mapreduce on the data. The map and reduce functions count the number of instances of
each airline carrier name in each block of data, then combine those intermediate counts into a
final count. This method leverages the intermediate sorting by unique key performed by
mapreduce. The functions countMapper and countReducer are included at the end of this script.
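A minimal sketch of that workflow follows, modeled on the CountAirlinesMapReduceExample referenced above. It assumes the airlinesmall.csv sample file that ships with MATLAB, and the function bodies are reconstructions rather than the script's verbatim code.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'UniqueCarrier'};
ds.SelectedFormats = {'%C'};                  % read the carrier column as categorical

result = mapreduce(ds, @countMapper, @countReducer);
readall(result)                               % final key/value table of counts

function countMapper(data, ~, intermKVStore)
    % Count category occurrences in this block; emit one key/value pair per airline.
    counts = countcats(data.UniqueCarrier);
    keys = categories(data.UniqueCarrier);
    addmulti(intermKVStore, keys, num2cell(counts));
end

function countReducer(key, intermValIter, outKVStore)
    % Sum the per-block counts for one airline into a single total.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end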
Conclusion: Advantages of using datastore, tall arrays, and mapreduce for Big Data
When working with Big Data in MATLAB, it's crucial to use tools and techniques that can
handle large volumes of data efficiently. MATLAB provides several features designed for this
purpose, such as datastores, tall arrays, and the mapreduce programming paradigm. Each of these
features has specific advantages for dealing with Big Data, making them indispensable tools for
data scientists and engineers. Advantages of these features are as follows:
1. Datastore
A Datastore in MATLAB is a repository for collections of data that are too large to fit in memory. Using a datastore allows you to manage data in a way that enhances the performance and scalability of data-intensive applications; a chunk-wise reading sketch follows the list below.
Advantages:
❖ Handling Large Files: Datastore can handle very large datasets that cannot be loaded into
memory all at once. It enables efficient reading in chunks, which is perfect for processing
and analyzing large files incrementally.
❖ Support for Different Formats: MATLAB supports various types of datastores for
different data formats including spreadsheets, images, text files, and key-value pairs. This
flexibility allows seamless integration and manipulation of data from diverse sources.
❖ Ease of Preprocessing: Datastores simplify the preprocessing of data with functionalities
like data transformation and filtering during the import phase, which can significantly
streamline workflows.
❖ Efficiency: Datastores are optimized for reading and processing data in large blocks,
reducing the time and system resources required for data handling.
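As a minimal sketch of chunk-wise reading (again using the bundled airlinesmall.csv; adjust the file and variable names for your data):

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';

total = 0; n = 0;
while hasdata(ds)
    t = read(ds);                         % one manageable block at a time
    d = t.ArrDelay(~isnan(t.ArrDelay));   % drop missing values in this block
    total = total + sum(d);
    n = n + numel(d);
end
meanDelay = total / n                     % mean arrival delay across all blocks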
2. Tall Arrays
Tall arrays extend MATLAB's numeric and logical arrays to handle big data. They are designed
to work with datasets that are too large to fit in memory, facilitating operations and computations
as if they were in-memory arrays.
Advantages:
❖ Transparent Handling of Large Data: Users can operate on tall arrays almost the same
way they work with in-memory arrays. MATLAB manages the memory and disk
operations in the background, making the code simpler and cleaner.
❖ Integration with Datastore: Tall arrays can be directly created from datastores, providing
a seamless workflow from data reading to processing.
❖ Lazy Evaluation: MATLAB uses optimized lazy evaluation with tall arrays.
Computations are queued and only executed when required, ensuring that the most
efficient sequence of operations is used.
❖ Scalability and Speed: By leveraging MATLAB's built-in functions that are automatically parallelized and optimized for tall arrays, users can achieve high-performance analytics on large datasets; the sketch below shows several deferred operations evaluated in a single pass.
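As a minimal sketch of lazy evaluation (same sample file as before): several deferred operations are evaluated together in a single gather call, so the data is read only once.

ds = tabularTextDatastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);

d = tt.ArrDelay;
m = mean(d, 'omitnan');    % deferred
x = max(d);                % deferred (max omits NaN by default)

[m, x] = gather(m, x)      % one pass over the data computes both results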
3. MapReduce
The mapreduce function in MATLAB is a programming technique for performing big data
analytics. It allows you to process large amounts of data in a scalable way, even on a single
computer or across a cluster.
Advantages:
❖ Scalability Across Different Environments: The mapreduce algorithm can run on a single PC, a cluster, or cloud infrastructure, which makes it highly scalable and flexible (see the sketch after this list).
❖ Handling of Key-Value Pairs: It works by processing data in key-value pairs, a robust
structure for managing diverse data types and complex data manipulations.
❖ Efficient Data Reduction: The mapreduce framework is designed to filter and condense
large datasets into more manageable sizes during the 'map' phase, then further aggregate
or summarize this data in the 'reduce' phase, which is ideal for analytics.
❖ Customizable Processing: Users can define custom functions for both the 'map' and
'reduce' phases, providing a high degree of control over how data is analyzed and results
are generated.
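As a minimal sketch of that portability (the pool case needs Parallel Computing Toolbox; running against Hadoop additionally needs MATLAB Parallel Server):

mapreducer(0);          % serial execution in the client MATLAB session
% mapreducer(gcp);      % or: run on the current parallel pool instead
% result = mapreduce(ds, @countMapper, @countReducer);   % unchanged either way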