
KGRL MCA

BIG DATA ANALYTICS

BY
K.ISSACK BABU MCA
Assistant Professor



BIG DATA ANALYTICS

1.Introduction to Big Data:


What is Big Data and Big Data Analytics (BDA)?
“Big Data is an evolving term that describes any large amount of structured, semi-
structured and unstructured data that has the potential to be mined for information.”
“Big Data Analytics (BDA) is the process of examining large data sets containing a
variety of data types -- i.e., big data -- to uncover hidden patterns, unknown correlations,

market trends, customer preferences and other useful business information.”

Characteristics of Big Data (or) Why is Big Data different from any other data?

There are "Five V's" that characterize this data: Volume, Velocity, Variety, Veracity
and Validity.
1. Volume (Data Quantity):
Most organizations were already struggling with the increasing size of their databases
as the Big Data tsunami hit the data stores.
2. Velocity (Data Speed):
There are two aspects to velocity: one representing the throughput of data and the other representing latency.
a. Throughput represents the data moving through the pipes.
b. Latency is the other measure of velocity. Analytics used to be a "store and report" environment where reporting typically contained data as of yesterday, popularly represented as "D-1."
3. Variety (Data Types):
The source data includes unstructured text, sound, and video in addition to structured data. A number of applications are gathering data from emails, documents, or blogs.
4. Veracity (Data Quality):
Veracity represents both the credibility of the data source and the suitability of the data for the target audience.
5. Validity (Data Correctness):
Validity asks whether the data is correct and accurate for its intended future use. Clearly, valid data is key to making the right decisions.
As per IBM, Big Data is characterized by three V's, as described in the following Figure:

Importance of Big Data:


1. Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies. Marketing agencies are learning about the response to their campaigns, promotions, and other advertising mediums.
2. Traditional customer feedback systems are getting replaced by new systems designed with 'Big Data' technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.



3. Based on information in social media, like the preferences and product perceptions of their consumers, product companies and retail organizations are planning their production.
4. Determining root causes of failures, issues and defects in near-real time
5. Detecting fraudulent behavior before it affects the organization.
When do you use Big Data technologies?
1. Big Data solutions are ideal for analyzing not only raw structured data, but semi-
structured and unstructured data from a wide variety of sources.
2. Big Data solutions are ideal when all, or most, of the data needs to be analyzed versus
a sample of the data; or a sampling of data isn't nearly as effective as a larger set of
data from which to derive analysis.
3. Big Data solutions are ideal for iterative and exploratory analysis when business
measures on data are not predetermined.
Patterns for Big Data Development:
The following six most common usage patterns represent great Big Data opportunities (business problems that weren't easy to solve before) and help us gain an understanding of how Big Data can help us (or how it's helping our competitors make us less competitive if we are not paying attention).
1. IT for IT Log Analytics
2. The Fraud Detection Pattern
3. The Social Media Pattern
4. The Call Center Mantra: “This Call May Be Recorded for Quality Assurance Purposes”
5. Risk: Patterns for Modeling and Management
6. Big Data and the Energy Sector
1. IT for IT Log Analytics: IT departments need logs at their disposal, and today they just can't store enough logs and analyze them in a cost-efficient manner, so logs are typically kept for emergencies and discarded as soon as possible. Another reason why IT departments keep large amounts of data in logs is to look for rare problems. It is often the case that the most common problems are known and easy to deal with, but the problem that happens "once in a while" is typically more difficult to diagnose and prevent from occurring again.
But there are more reasons why log analysis is a Big Data problem apart from its sheer volume. The nature of these logs is semi-structured and raw, so they aren't always suited for traditional database processing. In addition, log formats are constantly changing due to hardware and software upgrades, so they can't be tied to strict, inflexible analysis paradigms.



Finally, not only do we need to perform analysis on the longevity of the logs to determine trends and patterns and to find failures, but we also need to ensure the analysis is done on all the data.
Log analytics is actually a pattern that IBM established after working with a number of companies, including some large financial services sector (FSS) companies. This use case comes up with quite a few customers; for that reason, this pattern is called IT for IT. If we are new to this usage pattern and wondering just who is interested in IT for IT Big Data solutions, we should know that this is an internal use case within an organization itself. An internal IT for IT implementation is well suited for any organization with a large data center footprint, especially if it is relatively complex. For example, service-oriented architecture (SOA) applications with lots of moving parts, federated data centers, and so on, all suffer from the same issues outlined in this section.
Some large insurance and retail clients need to know the answers to such questions as, "What are the precursors to failures?", "How are these systems all related?", and more. These are the types of questions that conventional monitoring doesn't answer; a Big Data platform finally offers the opportunity to get some new and better insights into the problems at hand.
2. The Fraud Detection Pattern: Fraud detection comes up a lot in the financial services vertical, but we will find it in any sort of claims- or transaction-based environment (online auctions, insurance claims, underwriting entities, and so on). Pretty much anywhere some sort of financial transaction is involved presents a potential for misuse and the universal threat of fraud. If we leverage a Big Data platform, we have the opportunity to do more than we have ever done before to identify it or, better yet, stop it.
Traditionally, in fraud cases, samples and models are used to identify customers that characterize a certain kind of profile. The problem with this is that although it works, we are profiling a segment and not working at the granularity of an individual transaction or person. As per customer experiences, it is estimated that only 20 percent (or maybe less) of the available information that could be useful for fraud modeling is actually being used. The traditional approach is shown in the following Figure.


We can use BigInsights to provide a flexible and cost-effective repository to establish which of the remaining 80 percent of the information is useful for fraud modeling, and then feed newly discovered high-value information back into the fraud model, as shown in the following Figure.



A modern-day fraud detection ecosystem provides a low-cost Big Data platform for
exploratory modeling and discovery. Typically, fraud detection works after a transaction gets
stored only to get pulled out of storage and analyzed; storing something to instantly pull it
back out again feels like latency to us. With Streams, we can apply the fraud detection
models as the transaction is happening.

3. The Social Media Pattern: Perhaps the most talked about Big Data usage pattern is social
media and customer sentiment. More specifically, we can determine how sentiment is
impacting sales, the effectiveness or receptiveness of marketing campaigns, the accuracy of
marketing mix (product, price, promotion, and placement), and so on.
Social media analytics is a pretty hot topic, so hot in fact that IBM has built a solution
specifically to accelerate our use of it: Cognos Consumer Insights (CCI). CCI can tell what
people are saying, how topics are trending in social media, and all sorts of things that affect
the business, all packed into a rich visualization engine.
4. The Call Center Mantra: "This Call May Be Recorded for Quality Assurance Purposes": When we want our call with a customer service representative (CSR) to be recorded for quality assurance purposes, it seems the "may" part never works in our favor. The challenge of call center efficiencies is somewhat similar to the fraud detection pattern.
Call centers of all kinds want to find better ways to process information to address what's going on in the business with lower latency. This is a really interesting Big Data use case, because it uses both analytics-in-motion and analytics-at-rest. Using in-motion analytics (Streams) means that we basically build our models and find out what's interesting based



upon the conversations that have been converted from voice to text or with voice analysis as
the call is happening. Using at-rest analytics (BigInsights), we build up these models and then
promote them back into Streams to examine and analyze the calls that are actually happening
in real time: it's truly a closed-loop feedback mechanism.
5. Risk: Patterns for Modeling and Management: Risk modeling and management is another big opportunity and common Big Data usage pattern. Risk modeling brings into focus a frequent question when it comes to the Big Data usage patterns: "How much of our data do we use in our modeling?" The financial crisis of 2008, the associated subprime loan crisis, and its outcome have made risk modeling and management a key area of focus for financial institutions.
Two problems are associated with this usage pattern: "How much of the data will we use for our model?" and "How can we keep up with the data's velocity?" The answer to the second question, unfortunately, is often, "We can't." Finally, consider that financial services firms tend to move their risk models and dashboards to intra-day positions rather than just close-of-day positions, and we can see yet another challenge that can't be solved with traditional systems alone. Another characteristic of today's financial markets is the massive trading volumes, which require better ways to model and manage risk.

6. Big Data and the Energy Sector: The energy sector provides many Big Data use case challenges in how to deal with the massive volumes of sensor data from remote installations. Many companies are using only a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data.
Vestas is primarily engaged in the development, manufacturing, sale, and maintenance of power systems that use wind energy to generate electricity through its wind turbines. Its product range includes land and offshore wind turbines. At the time this book was written, it had more than 43,000 wind turbines in 65 countries on 5 continents. Vestas used the IBM BigInsights platform to achieve its vision of generating clean energy.
Data in the Warehouse and Data in Hadoop:
Traditional warehouses are mostly ideal for analyzing structured data from various
systems and producing insights with known and relatively stable measurements. On the other
hand, a Hadoop-based platform is well suited to deal with semi-structured and unstructured
data, as well as when a data discovery process is needed.
The authors could say that data warehouse data is trusted enough to be “public,”
while Hadoop data isn't as trusted (public can mean vastly distributed within the company
and not for external consumption), and although this will likely change in the future, today



this is something that experience suggests characterizes these repositories.
A Hadoop-based repository scheme stores the entire business entity, and the reliability of the Tweet, transaction, Facebook post, and more is kept intact. Data in Hadoop might seem of low value today. IT departments pick and choose high-valued data and put it through difficult cleansing and transformation processes because they know that data has a high known value per byte.
Unstructured data can't be easily stored in a warehouse. A Big Data platform can
store all of the data in its native business object format and get value out of it through
massive parallelism on readily available components.
2. Introduction to Hadoop

Hadoop: Hadoop is an open source framework for writing and running distributed applications
that process large amounts of data. Distributed computing is a wide and varied field, but the
key distinctions of Hadoop are that it is: Accessible—Hadoop runs on large clusters of
commodity machines or on cloud computing services such as Amazon’s Elastic Compute
Cloud (EC2 ). Robust—Because it is intended to run on commodity hardware, Hadoop is
architected with the assumption of frequent hardware malfunctions (errors). It can gracefully
handle most such failures. Scalable— Hadoop scales linearly to handle larger data by adding
more nodes to the cluster. Simple— Hadoop allows users to quickly write efficient parallel
code. The following Figure illustrates how one interacts with a Hadoop cluster. As we can
see, a Hadoop cluster is a set of commodity machines networked together in one location. Data
storage and processing all occur within this “cloud” of machines. Different users can submit
computing “jobs” to Hadoop from individual clients, which can be their own desktop machines
in remote locations from the Hadoop cluster.



Understanding distributed systems and Hadoop: A lot of low-end/commodity machines tied together as a single functional unit is known as a distributed system. A high-end machine with four I/O channels, each having a throughput of 100 MB/sec, will require about three hours to read a 4 TB data set (4 TB at 4 × 100 MB/sec is roughly 10,000 seconds). With Hadoop, this same data set will be divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. And such a cluster of commodity machines turns out to be cheaper than one high-end server.
Comparing SQL databases and Hadoop: SQL (structured query
language) is designed for structured data. Many of Hadoop’s initial applications deal with
unstructured data such as text. From this perspective Hadoop provides a more general paradigm
than SQL. SQL is a query language which can be implemented on top of Hadoop as the
execution engine. But in practice, SQL databases tend to refer to a whole set of legacy
technologies, with several dominant vendors, optimized for a historical set of applications. The
following concepts explain a more detailed comparison of Hadoop with typical SQL databases
on specific dimensions.
1. Scale-Out Instead of Scale-Up: Scaling commercial relational databases is expensive. Their
design is friendlier to scaling up. To run a bigger database we need to buy a bigger machine
which is expensive. Unfortunately, at some point there won’t be a big enough machine available
for the larger data sets. Hadoop is designed to be a scale-out architecture operating on a cluster of
commodity PC machines. Adding more resources means adding more machines to the Hadoop
cluster. A Hadoop cluster with ten to hundreds of commodity machines is standard. In fact, other
than for development purposes, there’s no reason to run Hadoop on a single server.
2. Key/Value Pairs Instead of Relational Tables: A fundamental principle of relational databases is that
data resides in tables having relational structure defined by a schema. Hadoop uses key/value pairs as its
basic data unit, which is flexible enough to work with less-structured data types. In Hadoop, data can originate in any form (structured, unstructured, or semi-structured), but it eventually transforms into (key/value) pairs for the processing functions to work on.
3. Functional Programming (MapReduce) Instead of Declarative Queries (SQL): SQL is fundamentally a
high-level declarative language. By executing queries, the required data will be retrieved from database.
Under MapReduce we specify the actual steps in processing the data, which is more similar to an
execution plan for a SQL engine. Under SQL we have query statements; under MapReduce we have
scripts and code. MapReduce allows us to process data in a more general fashion than SQL queries. For
example, we can build complex statistical models from our data or reformat our image data. SQL is not
well designed for such tasks.
4. Offline Batch Processing Instead of Online Transactions: Hadoop is designed for offline processing and
analysis of large-scale data. It doesn’t work for random reading and writing of a few records, which is the
type of load for online transaction processing. In fact, Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world. This is how Hadoop relates to distributed systems and SQL databases at a high level.
Understanding MapReduce: MapReduce is a
data processing model. The main advantage is easy scaling of data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This is what has attracted many programmers to the MapReduce model.
Ex: Count the number of times each word occurs in a set of documents. Suppose we have a document set containing only one document, with only one sentence:
Do as I say, not as I do.
We derive the word counts shown below: "do" occurs 2 times, "as" 2 times, "I" 2 times, "say" 1 time, and "not" 1 time.

When the set of documents is small, a straightforward program will do the job. The pseudo-code is:
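(The original pseudo-code figure is not reproduced here; a sketch of it, based on the description in the next paragraph, would look roughly as follows. wordCount is a multiset and tokenize() splits a document into words.)

define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);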

The program loops through all the documents. For each document, the words are
extracted one by one using a tokenization process. For each word, its corresponding entry in a multiset called wordCount is incremented by one. At the end, a display() function prints out all the entries in wordCount.
The above code works fine until the set of documents we want to process becomes large. If it is large, we speed it up by rewriting the program so that it distributes the work over several machines. Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines. The pseudo-code for the first phase, to be distributed over many machines, is:
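(A sketch of this first-phase pseudo-code, based on the surrounding description: each machine processes only its own subset of the documents and produces a partial wordCount; sendToSecondPhase() is a hypothetical helper.)

define wordCount as Multiset;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);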

The pseudo-code for the second phase is:
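(A sketch of the second-phase pseudo-code: a single machine merges the partial counts received from all the first-phase machines; multisetAdd() is a hypothetical helper.)

define totalWordCount as Multiset;
for each wordCount received from a first-phase machine {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);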

This word counting program is getting complicated. To make it work across a cluster of distributed machines, we need to add a number of functionalities:
a. Store files over many processing machines (of phase one).
b. Write a disk-based hash table permitting processing without being limited by RAM capacity.
c. Partition the intermediate data (that is, wordCount) from phase one.
d. Shuffle the partitions to the appropriate machines in phase two.
Scaling the same program in MapReduce:
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data processing function, and these functions are called mapper and
reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper. In the reducing phase, the reducer processes all the outputs from
the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and
transform the input into something that the reducer can aggregate over.
The MapReduce framework was designed to aid in writing scalable, distributed programs. This two-phase design pattern is used in scaling many programs, and it became the basis of the
framework. Partitioning and shuffling are common design patterns along with mapping and
reducing. The MapReduce framework provides a default implementation that works in most
situations. MapReduce uses lists and (key/value) pairs as its main data primitives. The keys and
values are often integers or strings but can also be dummy values to be ignored or complex

object types. The map and reduce functions must obey the following constraint on the types of
keys and values.
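(The constraint, shown as a figure in the original, is the standard MapReduce typing: the intermediate key/value types produced by map are exactly the ones consumed by reduce.)

map:    (k1, v1)        --> list(<k2, v2>)
reduce: (k2, list(v2))  --> list(<k3, v3>)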

In the MapReduce framework, we write applications by specifying the mapper and reducer. The following steps explain the complete data flow:
1. The input to the application must be structured as a list of (key/value) pairs, list(<k1, v1>). The input format for processing multiple files is usually list(<String filename, String file_content>).
2. The list of (key/value) pairs is broken up, and each individual (key/value) pair, <k1, v1>, is processed by calling the map function of the mapper. For word counting, the mapper takes <String filename, String file_content> and promptly ignores filename. It can output a list of <String word, Integer count>, but to leave the complete aggregation for later we can output a list of <String word, Integer 1> with repeated entries. That is, in the output list we can have the (key/value) pair <"foo", 3> once, or we can have the pair <"foo", 1> three times.
3. The output of all the mappers is (conceptually) aggregated into one giant list of <k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value) pair, <k2, list(v2)>. The framework asks the reducer to process each one of these aggregated (key/value) pairs individually. For example, the map output for one document may be a list with the pair <"foo", 1> three times, and the map output for another document may be a list with the pair <"foo", 1> twice. The aggregated pair the reducer will see is <"foo", list(1,1,1,1,1)>.

In word counting, the output of our reducer is <"foo", 5>, which is the total number of times "foo" has occurred in our document set. Each reducer works on a different word. The MapReduce framework automatically collects all the <k3, v3> pairs and writes them to file(s).
Pseudo-code for the map and reduce functions for word counting:

map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}

In the above pseudo-code, a special function called emit() is provided by the framework; it is used to generate the elements in the output list one at a time, and it relieves the programmer from managing a large list. In this way, Hadoop makes building scalable distributed programs easy.

Counting words with Hadoop—running your first program:


To run Hadoop on a single machine is mainly useful for development work. Linux is
the official development and production platform for Hadoop, although Windows is a

supported development platform as well. For a Windows box, we'll need to install Cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.
To run Hadoop requires Java (version 1.6 or higher). Mac users should get it from
Apple. We can download the latest JDK for other operating systems from Sun at
http://java.sun.com/javase/downloads/index.jsp (or) www.oracle.com. Install it and remember
the root of the Java installation, which we’ll need later.
To install Hadoop, first get the latest version release at
http://hadoop.apache.org/core/releases.html. After we unpack the distribution, edit the script
"conf/hadoop-env.sh" to set JAVA_HOME to the root of the Java installation we have
remembered from earlier. For example, in Mac OS X, we’ll replace this line
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
with the following line
export JAVA_HOME=/Library/Java/Home
We’ll be using the Hadoop script quite often. Run the following command:
bin/hadoop
We only need to know that the command to run a (Java) Hadoop program is bin/hadoop
jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files
for execution. The following command shows about a dozen example programs prepackaged
with Hadoop:
bin/hadoop jar hadoop-*-examples.jar
One of the programs is "wordcount". The important (inner) classes of that program are:



1. WordCount uses Java's StringTokenizer in its default setting, which tokenizes based only on whitespace. To ignore standard punctuation marks, we add them to the StringTokenizer's list of delimiter characters:
StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");
When looping through the set of tokens, each token is extracted and cast into a Text object.
2. In Hadoop, the special class Text is used in place of String. We want the word count to ignore capitalization, so we lowercase all the words before turning them into Text objects.
word.set(itr.nextToken().toLowerCase());
3. Finally, we want only words that appear more than four times. We modify the reducer to collect the word count into the output only if that condition is met. (This is Hadoop's equivalent of the emit() function in our pseudo-code.)
if (sum > 4) output.collect(key, new IntWritable(sum));
After making changes to those three lines, we can recompile the program and execute it again. The results are shown in the following table.



Without specifying any arguments, executing wordcount will show its usage information:
bin/hadoop jar hadoop-*-examples.jar wordcount
which shows the arguments list:
wordcount [-m <maps>] [-r <reduces>] <input> <output>
The only parameters are an input directory (<input>) of text documents we want to
analyze and an output directory (<output>) where the program will dump its output. To execute
wordcount, we need to first create an input directory:
mkdir input
and put some documents in it. We can add any text document to the directory. To execute wordcount and then view the results:
bin/hadoop jar hadoop-*-examples.jar wordcount input output
more output/*
We'll see a word count of every word used in the documents, listed in alphabetical order.
The source code for wordcount is available and included in the installation at
src/examples/org/apache/hadoop/examples/WordCount.java. We can modify it as per our
requirements.
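For reference, a minimal sketch of how the three modifications above might sit inside the mapper and reducer is shown below. It assumes the older org.apache.hadoop.mapred API that the prepackaged example used at the time; the class names and the omitted driver code are hypothetical, and details may differ across Hadoop versions.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ModifiedWordCount {

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            // Modification 1: treat standard punctuation marks as delimiters
            StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");
            while (itr.hasMoreTokens()) {
                // Modification 2: lowercase each token before turning it into a Text object
                word.set(itr.nextToken().toLowerCase());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Modification 3: collect only words that appear more than four times
            if (sum > 4) {
                output.collect(key, new IntWritable(sum));
            }
        }
    }
}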
History of Hadoop:
Hadoop is a versatile (flexible) tool that allows new users to access the power of
distributed computing. By using distributed storage and transferring code instead of data,
Hadoop avoids the costly transmission step when working with large data sets. Moreover, the
redundancy of data allows Hadoop to recover should a single node fail. It is easy to create programs with Hadoop using the MapReduce framework. On a fully configured cluster,
“running Hadoop” means running a set of daemons, or resident programs, on the different
servers in the network. These daemons have specific roles; some exist only on one server, some
exist across multiple servers. The daemons include:

1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
1. NameNode: The distributed storage system is called the Hadoop File System, or HDFS.
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform
the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how the
files are broken down into file blocks, which nodes store those blocks, and the overall health
of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine. This means that the NameNode server doesn't double as a DataNode or a TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of the Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or we can quickly restart it. Not so for the NameNode.
2. DataNode: Each slave machine in the cluster will host a DataNode daemon to perform the
grunt work of the distributed filesystem—reading and writing HDFS blocks to actual files on
the local filesystem. When we want to read or write a HDFS file, the file is broken into
blocks and the NameNode will tell the client which DataNode each block resides in. The
client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks. Furthermore, a DataNode may communicate with other
DataNodes to replicate its data blocks for redundancy. The following figure illustrates the
roles of NameNode and DataNodes.

The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file
consists of blocks 4 and 5. The content of the files are distributed among the DataNodes. In this
illustration, each block has three replicas. For example, block 1 (used for data1) is replicated
over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes
inaccessible over the network, we’ll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.
3. Secondary NameNode: The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. However, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
4. JobTracker: There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster. The JobTracker daemon is the link between our application and Hadoop. Once we submit our code to the cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. If a task fails, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
5. TaskTracker: The JobTracker is the master control for the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node. The interaction between JobTracker and TaskTracker is shown in the following diagram.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.


One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

The topology of one typical Hadoop cluster is described in the following figure:



UNIT II

Real Time Analytics:

Real Time Analytics Definition

Real time analytics lets users see, analyze and understand data as it arrives in a system. Logic
and mathematics are applied to the data so it can give users insights for making real-time
decisions.

What Is Real Time Analytics?

Real-time analytics allows businesses to get insights and act on data immediately or
soon after the data enters their system.

Real time app analytics answer queries within seconds. They handle large amounts of data with
high velocity and low response times. For example, real-time big data analytics uses data
in financial databases to inform trading decisions.

Analytics can be on-demand or continuous. On-demand delivers results when the user requests it.
Continuous updates users as events happen and can be programmed to respond automatically to
certain events. For example, real-time web analytics might update an administrator if page load
performance goes out of preset parameters.
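As a toy illustration of this continuous, event-driven style (all names and the 2.0-second threshold below are hypothetical, not part of any particular analytics product), the following sketch checks every page-load measurement as it arrives and alerts an administrator when a preset limit is exceeded:

public class PageLoadMonitor {

    // Hypothetical preset parameter: maximum acceptable page-load time in seconds
    private static final double MAX_LOAD_SECONDS = 2.0;

    // Called for every new measurement as it arrives in the system
    public static void onMeasurement(String page, double loadSeconds) {
        if (loadSeconds > MAX_LOAD_SECONDS) {
            notifyAdministrator(page, loadSeconds);  // automatic response to the event
        }
    }

    private static void notifyAdministrator(String page, double loadSeconds) {
        System.out.printf("ALERT: %s loaded in %.2f s (limit %.2f s)%n",
                page, loadSeconds, MAX_LOAD_SECONDS);
    }

    public static void main(String[] args) {
        // Simulated stream of incoming measurements
        onMeasurement("/home", 0.8);
        onMeasurement("/checkout", 3.1);
    }
}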

Examples of real-time customer analytics include:

 Viewing orders as they happen for better tracking and to identify trends.
 Continually updated customer activity like page views and shopping cart use to
understand user behavior.
 Targeting customers with promotions as they shop for items in a store, influencing real-
time decisions.



How Do Real Time Analytics Work?

Real-time data analytics tools can either push or pull data. Streaming requires the ability to push
massive amounts of fast-moving data. When streaming takes too many resources and isn’t
practical, data can be pulled at intervals that can range from seconds to hours. The pull can happen in between business needs that require computing resources, so as not to disrupt operations.

The response times for real time analytics can vary from nearly instantaneous to a few seconds or
minutes. Some products like a self-driving car need to respond to new information within
milliseconds. But other products like an oil drill or windmill can get by with a minute between
updates. Several minutes might be enough for a bank looking at credit scores of loan applicants.

The components of real-time data analytics include:

 Aggregator — Compiles real time streaming data analytics from many different data
sources.
 Broker — Makes data in real time available for use.
 Analytics Engine — Correlates values and blends data streams together while analyzing
the data.
 Stream Processor — Executes real time app analytics and logic by receiving and sending
data streams.

Real-time analytics can be deployed at the edge, which looks at the data at the closest point of its
arrival. Other technologies that make real time analytics possible include:

 Processing In Memory (PIM) — Latency is reduced by integrating the processor in a


memory chip.
 In-Database Analytics — Data processing happens within the database and the analytic
logic is also built into the database.
 In-Memory Analytics — Queries data in random access memory (RAM) instead of
physical disks.
 Massively Parallel Programming (MPP) — Multiple processors tackle different parts of a
program and each processor has its own operating system and memory.

What Are the Benefits of Using Real Time Analytics?

Speed is the main benefit of real time data analytics. The less time a business must wait to access
data between the time it arrives and is processed, the faster a business can use data insights to
make changes and act on critical decisions. For instance, analyzing monitoring data from a
manufacturing line would help early intervention before machinery malfunctions.

Similarly, real-time data analytics tools let companies see how users interact with a product upon
release, which means there is no delay in understanding user behavior for making needed
adjustments.



Real-time analysis offers the following advantages over traditional analytics:

 Create custom interactive analytics tools.


 Share information through transparent dashboards.
 Customize monitoring of behavior.
 Make immediate changes when needed.
 Apply machine learning.

Other benefits and uses include:

 Managing location data — Helps determine relevant data sets for a geographic location
and what should be updated for optimal location intelligence.
 Detecting anomalies — Identifies the statistical outliers caused by security breaches and
technological failures.
 Better marketing — Finds insights in demographics and customer behavior to improve
effectiveness of advertising and marketing campaigns. Helps determine best pricing
strategies and audience targeting.

Examples

Here’s a look at some use cases of real-time data analytics in action:

 Marketing campaigns: When running a marketing campaign, most people rely on A/B tests.
With the ability to access data instantly, you can adjust campaign parameters to boost success.
For example, if you run an ad campaign and retrieve data in real-time of people clicking and
converting, then you can adjust your message and parameters to target that audience directly.

 Financial trading: Financial institutions need to make buy and sell decisions in milliseconds.
With analytics provided in real-time, traders can take advantage of information from financial
databases, news sources, social media, weather reports and more to have a wide angle
perspective on the market in real-time. This broad picture helps to make smart trading decisions.

 Financial operations: Financial teams are experiencing a transformation by which they not only
are responsible for back-office procedures, but they also add value to the organisation by
providing strategic insights. The production of financial statements must be accurate to help
inform the best decisions for the business. Analytics in real-time helps to spot errors and can aid
in reducing operational risks. The software’s ability to match records (i.e. account
reconciliation), store data securely (in a centralised system) and transform raw data into insights
(real-time analytics) makes all the difference in a team’s ability to remain accurate, agile and
ahead of the curve.

 Credit scoring: Any financial provider understands the value of credit scores. With real-time
analysis, institutions can approve or deny loans immediately.

 Healthcare: Wearable devices are an example of real-time analytics which can track a human’s
health statistics. For example, real-time data provides information like a person’s heartbeat, and
these immediate updates can be used to save lives and even predict ailments in advance.



What is Apache Spark:
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It was built on top of Hadoop MapReduce, and it extends the MapReduce model to efficiently use more types of computations, which include interactive queries and stream processing. This is a brief tutorial that explains the basics of Spark Core programming.

Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Spark Framework and become a Spark Developer. In addition, it would be
useful for Analytics Professionals and ETL developers as well.

Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to Scala
programming, database concepts, and any of the Linux operating system flavors.
Why Spark when Hadoop is there:
Industries are using Hadoop extensively to analyze their data sets. The reason is that Hadoop
framework is based on a simple programming model (MapReduce) and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to
maintain speed in processing large datasets in terms of waiting time between queries and
waiting time to run the program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing speed
of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since Feb 2014.
Features of Apache Spark
Apache Spark has following features.
 Speed − Spark helps to run an application in a Hadoop cluster, up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk. It stores the intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java,
Scala, or Python. Therefore, you can write applications in different
languages. Spark comes up with 80 high-level operators for interactive
querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It
also supports SQL queries, Streaming data, Machine learning (ML),
and Graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with
Hadoop components.

There are three ways of Spark deployment as explained below.


 Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce will run side by side to cover all Spark jobs on the cluster.
 Hadoop Yarn − Hadoop Yarn deployment means, simply, that Spark runs on Yarn without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack. It allows other components to run on top of the stack.



 Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.

Components of Spark
The following illustration depicts the different components of Spark.

Apache Spark Core


Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
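As a minimal sketch of the Spark Core Java API (the application name and the local master URL below are arbitrary choices for illustration), a tiny program can create an RDD from an in-memory collection, cache it, and run a distributed reduction over it:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCoreSketch {
    public static void main(String[] args) {
        // Run locally with all available cores; on a real cluster the master URL would differ
        SparkConf conf = new SparkConf().setAppName("CoreSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from an in-memory collection and keep it cached in memory
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache();

        // A distributed computation over the cached data set
        int sum = numbers.reduce((a, b) -> a + b);
        System.out.println("sum = " + sum);

        sc.close();
    }
}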

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
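A minimal sketch of this mini-batch model using the Spark Streaming Java API (the 5-second batch interval and the localhost:9999 socket source are arbitrary choices for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("MiniBatchSketch").setMaster("local[2]");
        // Each mini-batch covers 5 seconds of incoming data
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Example source: a plain text stream arriving on a socket
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // An RDD transformation applied to every mini-batch: count the records it contains
        JavaDStream<Long> counts = lines.count();
        counts.print();

        jssc.start();              // start receiving and processing
        jssc.awaitTermination();   // block until the streaming job is stopped
    }
}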

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework on top of Spark, taking advantage of the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).



GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model the user-defined graphs by using Pregel
abstraction API. It also provides an optimized runtime for this abstraction.

Spark Features

1. Objective
Apache Spark, being an open-source framework for Big Data, has various advantages over other big data solutions: Apache Spark is dynamic in nature, it supports in-memory computation of RDDs, and it provides reusability, fault tolerance, real-time stream processing and many more. In this tutorial on the features of Apache Spark, we will discuss various advantages of Spark which answer the questions: Why should we learn Apache Spark? Why is Spark better than Hadoop MapReduce, and why is Spark called the 3G of Big Data?

2. Introduction to Apache Spark

Apache Spark is a lightning-fast, in-memory data processing engine. Spark is mainly designed for data science, and the abstractions of Spark make it easier. Apache Spark provides high-level APIs
in Java, Scala, Python and R. It also has an optimized engine for general execution graph. In data
processing, Apache Spark is the largest open source project. Follow this guide to learn How
Apache Spark works in detail.
3. Features of Apache Spark

Let’s discuss sparkling features of Apache Spark:



a. Swift Processing

Using Apache Spark, we achieve a high data processing speed: about 100x faster in memory and 10x faster on disk. This is made possible by reducing the number of read-write operations to disk.
b. Dynamic in Nature

We can easily develop a parallel application, as Spark provides 80


high-level operators.
c. In-Memory Computation in Spark

With in-memory processing, we can increase the processing speed.


Here the data is being cached, so we need not fetch data from the disk every time; thus time is saved. Spark has a DAG execution engine which facilitates in-memory computation and acyclic data flow, resulting in high speed.
d. Reusability

We can reuse the Spark code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
e. Fault Tolerance in Spark
Apache Spark provides fault tolerance through the Spark abstraction RDD. Spark RDDs are designed to handle the failure of any worker node in the cluster. Thus, it ensures that the loss of data is reduced to zero. Learn different ways to create RDDs in Apache Spark.

f. Real-Time Stream Processing


Spark has a provision for real-time stream processing. Earlier, the problem with Hadoop MapReduce was that it could handle and process data which is already present, but not real-time data. But with Spark Streaming we can solve this problem.

g. Lazy Evaluation in Apache Spark


All the transformations we make on Spark RDDs are lazy in nature; that is, Spark does not give the result right away, rather a new RDD is formed from the existing one. Thus, this increases the efficiency of the system. Follow this guide to learn more about Spark Lazy Evaluation in great detail.
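A small sketch of this behavior using the Spark Java API (Spark 2.x style, where flatMap returns an Iterator; the sample data is made up): the two transformations below only record lineage, and nothing executes until the count() action is called.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvaluationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazySketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("spark is fast", "spark is lazy"));

        // Transformations: nothing runs yet, Spark only records the lineage of new RDDs
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaRDD<String> sparkWords = words.filter(w -> w.equals("spark"));

        // Action: only here does the whole chain of transformations actually execute
        long count = sparkWords.count();
        System.out.println("occurrences of 'spark': " + count);

        sc.close();
    }
}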

h. Support Multiple Languages

In Spark, there is support for multiple languages like Java, R, Scala, and Python. Thus, it provides dynamicity and overcomes the limitation of Hadoop that applications can be built only in Java.

i. Active, Progressive and Expanding Spark Community

Developers from over 50 companies were involved in the making of Apache Spark. This project was initiated in the year 2009 and is still expanding; now there are about 250 developers who have contributed to its expansion. It is the most important project of the Apache Community.

j. Support for Sophisticated Analysis

Spark comes with dedicated tools for streaming data, interactive/declarative queries, and machine learning, which are add-ons to map and reduce.

k. Integrated with Hadoop


Spark can run independently and also on Hadoop YARN Cluster Manager and thus it can read
existing Hadoop data. Thus, Spark is flexible.

l. Spark GraphX
Spark has GraphX, which is a component for graph and graph-parallel computation. It simplifies graph analytics tasks by providing a collection of graph algorithms and builders.

m. Cost Efficient
Apache Spark is a cost-effective solution for Big Data problems, whereas with Hadoop a large amount of storage and a large data center are required during replication.
4. Conclusion
In conclusion, Apache Spark is the most advanced and popular product of the Apache Community. It provides the ability to work with streaming data, has various machine learning libraries, can work on structured and unstructured data, and can deal with graphs, among other things. After learning the features of Apache Spark, follow this guide to compare Apache Spark with Hadoop MapReduce.

Getting started with Spark:

Apache Spark is explained as a ‘fast and general engine for large-scale data processing.’ However, that doesn’t even begin to encapsulate the reason it has become such a prominent player in the big data space. Apache Spark is a distributed computing platform, and its adoption by big data companies has been on the rise at an eye-catching rate.
Spark Architecture
Spark is a distributed processing engine, but it does not have its own distributed storage and cluster manager for resources. It runs on top of an out-of-the-box cluster resource manager and distributed storage.

Spark core has two parts to it:

 Core APIs: the Unstructured APIs (RDDs) and the Structured APIs (DataFrames, Datasets). Available in Scala, Python, Java, and R.

 Compute Engine: Memory Management, Task Scheduling, Fault Recovery, Interacting with the Cluster Manager.

Note: We will see Core API implementations in Java towards the end of the article.

Outside the Core APIs Spark provides:

 Spark SQL: Interact with structured data through SQL-like queries.

 Streaming: Consume and process a continuous stream of data.

 MLlib: Machine Learning Library. However, I wouldn’t recommend training deep learning models here.

 GraphX: Typical graph processing algorithms.



All the above four directly depend upon spark core APIs for
distributed computing.

Advantages of Spark

 Spark provides a unified platform for batch processing, structured data handling, streaming, and much more.

 Compared with the map-reduce of Hadoop, Spark code is much easier to write and use.

 The most important feature of Spark is that it abstracts the parallel programming aspect. Spark core abstracts the complexities of distributed storage, computation, and parallel programming.

One of the primary use cases of Apache Spark is large-scale data processing. We create programs and execute them on Spark clusters.

Executions of Program on a Cluster

There are primarily two methods to execute programs on a Spark cluster:

1. Interactive clients like spark-shell, pyspark, notebooks etc.

2. Submit a job.

Most of the development process happens on interactive clients, but when we have to put our application into production, we use the submit-a-job approach.

For both a long-running streaming job and a periodic batch job, we package our application and submit it to the Spark cluster for execution.

Spark, a distributed processing engine, follows the master-slave architecture. In Spark terminology, the master is the driver, and the slaves are the executors.

Driver is responsible for:

1. Analyzing

2. Distributing

3. Monitoring

4. Scheduling

5. Maintaining all the necessary information during the lifetime of the Spark process.

Executors are only responsible for executing the part of the code assigned to them by the driver and reporting the status back to the driver.

Each Spark process would have a separate driver and exclusive executors.

Modes of execution

1. Client Mode: The driver runs on the local VM where you submit your application. By default, Spark submits all applications in client mode. Since the driver is the master node of the entire Spark process, this is not advisable in a production setup; client mode makes more sense for debugging.

2. Cluster Mode: The driver runs on one of the executors in the cluster. In spark-submit, you can pass the argument as follows:

--deploy-mode cluster

Cluster Resource Manager


Yarn and Mesos are the commonly used cluster managers. Kubernetes is a general-purpose container orchestrator.

Note: Spark on Kubernetes, as of this writing, is not production ready.

Yarn being the most popular resource manager for Spark, let us see its inner working:

In a client mode application the driver is our local VM. For starting a Spark application:

Step 1: As soon as the driver starts, a Spark session request goes to Yarn to create a Yarn application.

Step 2: The Yarn Resource Manager creates an Application Master (AM). For client mode, the AM acts as an executor launcher.

Step 3: The AM reaches out to the Yarn Resource Manager to request further containers.

Step 4: The Resource Manager allocates new containers, and the AM starts executors in each container. After that, the executors communicate directly with the driver.

Note: In cluster mode the driver starts inside the AM.

Executor and Memory Tuning

Hardware: 6 nodes, each with 16 cores and 64 GB RAM.

Let us start with the number of cores. The number of cores represents the number of concurrent tasks an executor can run. It has been observed that more than 5 concurrent tasks per executor degrades performance, so it is best to stick with 5.



Note: The above number comes from the performance of an executor, not from how many cores the system has. Hence it would remain the same for a 32-core system as well.

1 core and 1 GB RAM per node are needed for the OS and Hadoop daemons. Hence we are left with 63 GB RAM and 15 cores per node.

With 15 cores and 5 cores per executor, we can have 3 executors per node. That gives us 18 executors in total. The AM container requires 1 executor, so we are left with 17 executors.

Coming to memory, we get 63/3 = 21 GB per executor. However, a small overhead needs to be accounted for while calculating the full memory request, so the memory comes down to approximately 19 GB.

Hence the configuration comes to:

--num-executors 17 --executor-memory 19G --executor-cores 5

Note: If we require less memory, we can reduce the number of cores to increase the number of executors.
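
The same settings can also be supplied programmatically. Below is a minimal Java sketch (an illustration, not part of the original tuning exercise) that assumes the standard configuration keys spark.executor.instances, spark.executor.cores and spark.executor.memory:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ExecutorTuning {
   public static void main(String[] args) {
      // equivalent of --num-executors 17 --executor-memory 19G --executor-cores 5
      SparkConf conf = new SparkConf()
            .setAppName("ExecutorTuning")
            .set("spark.executor.instances", "17")
            .set("spark.executor.memory", "19g")
            .set("spark.executor.cores", "5");

      SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
      // ... application logic goes here ...
      spark.stop();
   }
}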

Spark Core
Now we look at some core APIs Spark provides. Spark needs a data structure to hold the data. We have three alternatives: RDD, DataFrame, and Dataset. Since Spark 2.0 it is recommended to use only Dataset and DataFrame; these two internally compile to RDDs.

These three are resilient, distributed, partitioned and immutable collections of data.

Task: the smallest unit of work in Spark, performed by an executor.

A Dataset offers two types of operations:

 Transformations: Create a new Dataset from an existing one. They are lazy, and the data remains distributed.

 Action: Returns data to the driver, non-distributed in nature. An action on a Dataset triggers a job.
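
As an illustration, the following minimal Java sketch (assuming a local Spark installation and a hypothetical input file people.json) shows a lazy transformation followed by an action that triggers a job:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformVsAction {
   public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
            .appName("TransformVsAction")
            .master("local[*]")                      // assumption: local run for illustration
            .getOrCreate();

      // transformation: lazy, nothing is executed yet
      Dataset<Row> people = spark.read().json("people.json");   // hypothetical input file
      Dataset<Row> adults = people.filter("age >= 18");

      // action: triggers a job and returns a value to the driver
      long count = adults.count();
      System.out.println("Number of adults: " + count);

      spark.stop();
   }
}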



Shuffle and Sort: Re-partitioning of a dataset for
performing actions on it. It is an abstraction in spark, and we
do not need to write code for it. This activity requires a new
stage.

Common Actions & Transformation


1) lit, geq, leq, gt, lt

lit: Creates a column of a literal value. Can be used for comparisons with other
columns.

geq (greater than or equal to), leq (less than or equal to), gt (greater than), lt (less than): used for comparisons with another column or a literal value. For example:
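
A minimal Java sketch of these comparisons, assuming an existing Dataset<Row> named employees with a numeric salary column (the names are illustrative, not from the original example set):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// inside a method, given an existing Dataset<Row> employees
Dataset<Row> wellPaid = employees.filter(col("salary").geq(lit(50000)));   // salary >= 50000
Dataset<Row> juniors  = employees.filter(col("salary").lt(lit(30000)));    // salary <  30000
wellPaid.show();
juniors.show();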

2) join

Spark lets us join datasets in various ways. A sample inner join is sketched below.
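
A minimal Java sketch, assuming an existing SparkSession named spark and two hypothetical CSV files employees.csv and departments.csv:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// inside a method, given an existing SparkSession spark
Dataset<Row> employees   = spark.read().option("header", "true").csv("employees.csv");
Dataset<Row> departments = spark.read().option("header", "true").csv("departments.csv");

// inner join on the department id; other join types such as "left", "right",
// "full" or "left_semi" can be passed as the third argument
Dataset<Row> joined = employees.join(
      departments,
      employees.col("dept_id").equalTo(departments.col("id")),
      "inner");
joined.show();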

3) union

Spark union function lets us have a union between two datasets. The datasets should
be of the same schema.
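
A minimal Java sketch, assuming two existing Dataset<Row> instances ds2020 and ds2021 with identical schemas:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// union keeps duplicate rows; chain .distinct() if duplicates should be dropped
Dataset<Row> allYears = ds2020.union(ds2021);
allYears.show();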

4) window

One of the essential functions in Spark. It lets you calculate a return value for every input row of a table based on a group of rows, called the frame.

Spark gives APIs for tumbling, hopping, sliding and delayed windows.

We use it for ranking, running sums, plain windowing, and so on. Other functions such as lag, lead, and many more enable sophisticated analytics over datasets. A minimal ranking example is shown below.
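
A minimal Java ranking sketch, assuming an existing Dataset<Row> named employees with dept and salary columns:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rank;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// frame: all rows of the same department, ordered by salary (highest first)
WindowSpec byDeptSalary = Window.partitionBy("dept").orderBy(col("salary").desc());

// rank every employee within his or her department
Dataset<Row> ranked = employees.withColumn("salary_rank", rank().over(byDeptSalary));
ranked.show();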

However, if you still need to perform much more complex operations over datasets,
you can use UDFs. Sample of usage of a UDF:
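
A minimal Java sketch, assuming an existing SparkSession named spark and a Dataset<Row> named people with a name column:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// register a UDF that upper-cases a string column
spark.udf().register("toUpper",
      (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
      DataTypes.StringType);

// apply the UDF to build a new column
Dataset<Row> upper = people.withColumn("name_upper", callUDF("toUpper", col("name")));
upper.show();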

Note: Using UDFs should be the last resort since they are not optimized for Spark;
they might take a longer time for executions. It is advisable to use native spark
functions over UDFs.

This is just the tip of the iceberg for Apache Spark; its utilities extend across various domains, not limited to data analytics.

Spark Eco System
In this Spark Ecosystem section, we will discuss the core ecosystem components of Apache Spark: Spark SQL, Spark Streaming, Spark Machine Learning (MLlib), Spark GraphX, and SparkR.

The Apache Spark Ecosystem has extensible APIs in different languages like Scala, Python, Java, and R, built on top of the core Spark execution engine.

Apache Spark is the most popular big data tool, also considered a next-generation tool. It is used by hundreds of organizations, has thousands of contributors, and is still emerging and gaining popularity as the standard big data execution engine.



Spark is a powerful open-source processing engine and an alternative to Hadoop. It is built around high speed, ease of use and increased developer productivity. It supports machine learning, real-time stream processing, and graph computations as well.

Moreover, Spark provides in-memory computing capabilities and supports a vast collection of applications. For ease of development, it offers APIs in Java, Python, R, and Scala.



Architecture and its working:

Apache Spark is considered a powerful complement to Hadoop, big data's original technology of choice. Spark is a more accessible, powerful and capable big data tool for tackling various big data challenges. With more than 500 contributors from across 200 organizations responsible for its code, and a user base of 225,000+ members, Apache Spark has become the mainstream and most in-demand big data framework across all major industries.

E-commerce companies like Alibaba, social networking companies like Tencent, and the Chinese search engine Baidu all run Apache Spark operations at scale. Here are a few features that are responsible for its popularity.

1. Fast processing speed: The first and foremost advantage of using Apache Spark for your big data is that it runs up to 100x faster in memory and 10x faster on disk on Hadoop clusters.
2. Supports a variety of programming languages: Spark applications can be implemented in a variety of languages like Scala, R, Python, Java, and Clojure. This makes it easy for developers to work according to their preferences.
3. Powerful libraries: It contains more than just map and reduce functions. It contains libraries for SQL and DataFrames, MLlib (for machine learning), GraphX, and Spark Streaming, which offer powerful tools for data analytics.
4. Near real-time processing: Spark can batch-process data stored in Hadoop, and with Spark Streaming it can also handle data in near real-time.
5. Compatibility: Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it can operate on diverse data sources.

Now that you are aware of its features, let us explore the Spark architecture to understand what makes it so special.

Understanding Apache Spark Architecture


Apache Spark has a well-defined and layered architecture where all the spark
components and layers are loosely coupled and integrated with various extensions
and libraries. Apache Spark Architecture is based on two main abstractions-

 Resilient Distributed Datasets (RDD)


 Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


RDDs are collections of data items that are split into partitions and can be stored in memory on worker nodes of the Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs – Hadoop datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations – Transformations and Actions. An important property of RDDs is that they are immutable, thus transformations never return a single value. Instead, transformation functions simply read an RDD and generate a new RDD. On the other hand, an Action operation evaluates and produces a new value. When an Action function is applied on an RDD object, all the data processing requests are evaluated at that time and the resulting value is returned.
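
As an illustration, the following self-contained Java sketch (assuming a local master, purely for demonstration) uses a parallelized collection, applies a transformation that builds a new RDD, and then an action that returns a value to the driver:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddTransformAction {
   public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("RddTransformAction").setMaster("local[*]");
      JavaSparkContext sc = new JavaSparkContext(conf);

      // parallelized collection (one of the two kinds of RDDs mentioned above)
      JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

      // transformation: returns a new RDD, nothing executes yet
      JavaRDD<Integer> squares = numbers.map(n -> n * n);

      // action: evaluates the lineage and returns a value to the driver
      int sum = squares.reduce(Integer::sum);
      System.out.println("Sum of squares = " + sum);

      sc.close();
   }
}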

Directed Acyclic Graph (DAG)


Directed - a transformation transitions a data partition from state A to state B.
Acyclic - a transformation cannot return to an older partition.

A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.

Spark Architecture Overview


Apache Spark follows a master/slave architecture with two main daemons and a cluster manager

i. Master Daemon – (Master/Driver Process)


ii. Worker Daemon –(Slave Process)



Spark Architecture Diagram – Overview of Apache Spark Cluster

A spark cluster has a single Master and any number of Slaves/Workers. The driver and the
executors run their individual Java processes and users can run them on the same horizontal
spark cluster or on separate machines i.e. in a vertical spark cluster or in mixed machine
configuration.
For classic Hadoop platforms, handling complex assignments requires developers to link together a series of MapReduce jobs and run them in a sequential manner. Each job has a high latency, and the job output between each step has to be saved in HDFS before other processes can start. The advantage of having DAG and RDD is that they replace the disk I/O with in-memory operations and support in-memory data sharing across DAGs, so that different jobs can be performed on the same data, allowing complicated workflows.

Role of Driver in Spark Architecture


Spark Driver – Master Node of a Spark Application

It is the central point and the entry point of the Spark Shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the Spark Context and RDDs are created, and also where transformations and actions are performed. The Spark Driver contains various components – DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager – responsible for the translation of Spark user code into actual Spark jobs executed on the cluster.

Spark Driver performs two main tasks: Converting user programs into tasks and planning the
execution of tasks by executors. A detailed description of its tasks is as follows:

 The driver program that runs on the master node of the spark cluster schedules the job
execution and negotiates with the cluster manager.
 It translates the RDD’s into the execution graph and splits the graph into multiple stages.
 Driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
 Cockpits of Jobs and Tasks Execution -Driver program converts a user application into
smaller execution units known as tasks. Tasks are then executed by the executors i.e. the
worker processes which run individual tasks.
 After the task has been completed, all the executors submit their results to the Driver.
 Driver exposes the information about the running spark application through a Web UI at
port 4040.

Data Structures of Spark:

RDD, DataFrame, and Dataset are the three most common data structures in Spark, and they make processing very large data easy and convenient. Because of Spark's lazy evaluation, these data structures are not executed right away during creation and transformation; only when they encounter an action do they start the traversal operation.
Spark RDD – since Spark 1.0
RDD stands for Resilient Distributed Dataset. It is a collection of recorded immutable partitions.
RDD is the fundamental data structure of Spark whose partitions are shuffled, sent across nodes
and operated in parallel. It allows programmers to perform complex in-memory analysis on large
clusters in a fault-tolerant manner. RDD can handle structured and unstructured data easily and
effectively as it has lots of built-in functional operators like group, map and filter etc.
However, when encountering complex logic, RDD has a very obvious disadvantage – operators
cannot be re-used. This is because RDD does not know the information of the stored data, so the
structure of the data is a black box which requires a user to write a very specific aggregation
function to complete an execution. Therefore, RDD is preferable on unstructured data, to be used
for low-level transformations and actions.
RDD provides users with a familiar object-oriented programming style, along with a distributed collection of JVM objects, which gives it compile-time type safety. Using RDD is very flexible as it provides Java, Scala, Python and R APIs. But there is a big limitation of RDD: it cannot be used within Spark SQL, as it does not have the optimizations for special scenarios.
Spark DataFrame – since Spark 1.3
Data Frame is a distributed dataset based on RDD which organizes the data in named columns
before Spark 2.0. It is similar to a two-dimensional table in the relational database, so it
introduces the database’s schema. Because of that, it can be treated as an optimization on top of
RDD – for example, RDD knows the structure of the stored data, which allows users to perform
high-level operations. With respect to that, it handles structured and semi-structured data. Users
can be specific on which column to perform what operations. This allows an operator to be used
into multiple columns and makes an operator reusable.
DataFrame also makes revamping operations easier and more flexible. If users have extra requirements on an existing operation that need another column to be included, they can just write an operator for that extra column and add it into the existing operation, whereas RDD would need a lot of changes to the existing aggregation. Compared to RDD, DataFrame does
not provide compile-time type safety as it is a distributed collection of Row objects. Like RDD,
DataFrame also supports various APIs. Unlike RDD, DataFrame is able to be used with Spark
SQL as the structure of data it stores, so it can provide more functional operators and allow users
to perform expression-based operations and UDFs. Last but not least, it can enhance the
execution efficiency, reduce the cost of loading the data, and optimize the logical plans.
Spark Dataset – since Spark 1.6
Dataset API is like an extension and enhancement of DataFrame API. Externally, Dataset is a
collection of JVM objects. Internally, Dataset has an un-typed view called a DataFrame, which is
a Dataset of Row since Spark 2.0. Dataset merges the advantages of RDD and DataFrame. Like
RDD, it supports structured, unstructured and custom data storing, and it provides users an
object-oriented programming style and compile-time type safety. Like DataFrame, it takes
advantages of the Catalyst optimizer to allow users to perform structured SQL queries on data,
but it is slower than DataFrame. Unlike RDD and DataFrame, it only supports Java and Scala
APIs. APIs for Python and R are still under development.
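
The following Java sketch (an illustration assuming a local Spark setup) creates the same small collection as an RDD, as a typed Dataset, and as a DataFrame:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataStructuresDemo {
   public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
            .appName("DataStructuresDemo").master("local[*]").getOrCreate();
      JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

      List<String> words = Arrays.asList("spark", "rdd", "dataframe", "dataset");

      // RDD: low-level, no schema information
      JavaRDD<String> rdd = jsc.parallelize(words);

      // Dataset<String>: typed, compile-time safe, backed by an Encoder
      Dataset<String> ds = spark.createDataset(words, Encoders.STRING());

      // DataFrame: a Dataset of Row that carries a schema with named columns
      Dataset<Row> df = ds.toDF("word");

      System.out.println("RDD count     : " + rdd.count());
      System.out.println("Dataset count : " + ds.count());
      df.show();

      spark.stop();
   }
}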

Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.

Let's understand each Spark component in detail.



Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting
with storage systems and memory management.

Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for
structured data.
o It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation
between Java objects and existing databases, data warehouses and
business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and
JSON.
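
A minimal Java sketch of querying data through Spark SQL, assuming an existing SparkSession named spark and a hypothetical input file employees.json:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// inside a method, given an existing SparkSession spark
Dataset<Row> employees = spark.read().json("employees.json");   // hypothetical input file
employees.createOrReplaceTempView("employees");

// query the data via SQL, as described above
Dataset<Row> highEarners =
      spark.sql("SELECT name, salary FROM employees WHERE salary > 50000");
highEarners.show();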

Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming
analytics.



o It accepts data in mini-batches and performs RDD transformations on
that data.
o Its design ensures that the applications written for streaming data can
be reused to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time
example of a data stream.
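
A minimal Java word-count sketch over a socket stream, assuming a local run and a hypothetical text source on port 9999 (for example, nc -lk 9999):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
   public static void main(String[] args) throws InterruptedException {
      SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
      // mini-batches of 5 seconds
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

      // hypothetical socket source
      JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

      JavaPairDStream<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey(Integer::sum);

      counts.print();

      jssc.start();
      jssc.awaitTermination();
   }
}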

MLlib
o The MLlib is a Machine Learning library that contains various machine
learning algorithms.
o These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by
Apache Mahout.

GraphX
o The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.

Using Spark with Hadoop

We are often asked how Apache Spark fits in the Hadoop ecosystem, and how one can run Spark on an existing Hadoop cluster. This section aims to answer these questions.

First, Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was
designed to read and write data from and to HDFS, as well as other storage systems, such as
HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by
combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.

Second, we have constantly focused on making it as easy as possible for every Hadoop user to
take advantage of Spark’s capabilities. No matter whether you run Hadoop 1.x or Hadoop 2.0
(YARN), and no matter whether you have administrative privileges to configure
the Hadoop cluster or not, there is a way for you to run Spark! In particular, there are three ways
to deploy Spark in a Hadoop cluster: standalone, YARN, and SIMR.

Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.

Hadoop Yarn deployment: Hadoop users who have already deployed or are
planning to deploy Hadoop Yarn can simply run Spark on YARN without any
pre-installation or administrative access required. This allows users to easily
integrate Spark in their Hadoop stack and take advantage of the full power of
Spark, as well as of other components running on top of Spark.

Spark In MapReduce (SIMR): For the Hadoop users that are not running
YARN yet, another option, in addition to the standalone deployment, is to use
SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start
experimenting with Spark and use its shell within a couple of minutes after
downloading it! This tremendously lowers the barrier of deployment, and lets
virtually everyone play with Spark.

Interoperability with other Systems

Spark interoperates not only with Hadoop, but with other popular big
data technologies as well.

 Apache Hive: Through Shark, Spark enables Apache Hive users to run their
unmodified queries much faster. Hive is a popular data warehouse solution
running on top of Hadoop, while Shark is a system that allows the Hive
framework to run on top of Spark instead of Hadoop. As a result, Shark can accelerate Hive queries by as much as 100x when the input data fits into memory, and up to 10x when the input data is stored on disk.
 AWS EC2: Users can easily run Spark (and Shark) on top of Amazon’s EC2
either using the scripts that come with Spark, or the hosted versions of
Spark and Shark on Amazon’s Elastic MapReduce.
 Apache Mesos: Spark runs on top of Mesos, a cluster manager system
which provides efficient resource isolation across distributed applications,
including MPI and Hadoop. Mesos enables fine grained sharing which
allows a Spark job to dynamically take advantage of the idle resources in the
cluster during its execution. This leads to considerable performance
improvements, especially for long running Spark jobs.

Apache Spark Use Cases:

Apache Spark has rapidly gained mainstream presence among big data users. Startups to Fortune 500s are adopting Apache Spark to build, scale and innovate their big data applications. Here are some industry-specific Spark use cases that demonstrate its ability to build and run fast big data applications.

Spark Use Cases in Finance Industry


Banks are using the Hadoop alternative - Spark to access and analyse the social
media profiles, call recordings, complaint logs, emails, forum discussions, etc. to
gain insights which can help them make right business decisions for credit risk
assessment, targeted advertising and customer segmentation.

Your credit card is swiped for $9000 and the receipt has been signed, but it was not
you who swiped the credit card as your wallet was lost. This might be some kind of
a credit card fraud. Financial institutions are leveraging big data to find out when
and where such frauds are happening so that they can stop them. They need to
resolve any kind of fraudulent charges at the earliest by detecting frauds right from
the first minor discrepancy. They already have models to detect fraudulent
transactions and most of them are deployed in batch environment. With the use of
Apache Spark on Hadoop, financial institutions can detect fraudulent transactions
in real-time, based on previous fraud footprints. All the incoming transactions are
validated against a database; if there is a match, then a trigger is sent to the call centre.
The call centre personnel immediately checks with the credit card owner to
validate the transaction before any fraud can happen.

Companies Using Spark in the Finance Industry


 One of the financial institutions that has retail banking and brokerage
operations is using Apache Spark to reduce its customer churn by 25%. The
financial institution has divided the platforms between retail, banking, trading
and investment. However, the banks want a 360-degree view of the customer
regardless of whether it is a company or an individual. To get the consolidated
view of the customer, the bank uses Apache Spark as the unifying layer.
Apache Spark helps the bank automate analytics with the use of machine
learning, by accessing the data from each repository for the customers. The
data is then correlated into a single customer file and is sent to the marketing
department.
 Another financial institution is using Apache Spark on Hadoop to analyse the
text inside the regulatory filling of their own reports and also their competitor
reports. The firms use the analytic results to discover patterns around what is
happening, the marketing around those and how strong their competition is.
 A multinational financial institution has implemented real time monitoring
application that runs on Apache Spark and a MongoDB NoSQL database. To provide supreme service across its online channels, the application helps the bank continuously monitor their clients' activity and identify any potential issues.

Apache Spark ecosystem can be leveraged in the finance industry to achieve best in
class results with risk based assessment, by collecting all the archived logs and
combining with other external data sources (information about compromised
accounts or any other data breaches).

Spark Use Cases in e-commerce Industry


Information about real-time transactions can be passed to streaming algorithms like alternating least squares (a collaborative filtering algorithm) or K-means clustering. The results can be combined with data from other
sources like social media profiles, product reviews on forums, customer comments,
etc. to enhance the recommendations to customers based on new trends.

Companies Using Spark in e-commerce Industry


Shopify wanted to analyse the kinds of products its customers were selling to
identify eligible stores with which it can tie up - for a business partnership. Its data
warehousing platform could not address this problem as it always kept timing out
while running data mining queries on millions of records. Shopify has processed 67
million records in minutes, using Apache Spark and has successfully created a list of
stores for partnership.

Apache Spark at Alibaba


One of the world’s largest e-commerce platform Alibaba Taobao runs some of the
largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of
data on its ecommerce platform. Some of the Spark jobs that perform feature
extraction on image data, run for several weeks. Millions of merchants and users
interact with Alibaba Taobao's e-commerce platform. Each of these interactions is represented as a large, complicated graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data.

Apache Spark at eBay


eBay uses Apache Spark to provide targeted offers, enhance customer experience,
and to optimize the overall performance. Apache Spark is leveraged at eBay through Hadoop YARN. YARN manages all the cluster resources to run generic tasks. eBay's Spark users leverage Hadoop clusters in the range of 2000 nodes, 20,000 cores and 100 TB of RAM through YARN.

Spark Use Cases in Healthcare


As healthcare providers look for novel ways to enhance the quality of healthcare,
Apache Spark is slowly becoming the heartbeat of many healthcare applications.
Many healthcare providers are using Apache Spark to analyse patient records along
with past clinical data to identify which patients are likely to face health issues after
being discharged from the clinic. This helps hospitals prevent hospital re-admittance
as they can deploy home healthcare services to the identified patient, saving on
costs for both the hospitals and patients.

Apache Spark is used in genomic sequencing to reduce the time needed to process
genome data. Earlier, it took several weeks to organize all the chemical compounds
with genes but now with Apache spark on Hadoop it just takes few hours. This use
case of spark might not be so real-time like other but renders considerable benefits
to researchers over earlier implementation for genomic sequencing.

Companies Using Spark in Healthcare Industry

Apache Spark at MyFitnessPal


The largest health and fitness community MyFitnessPal helps people achieve a
healthy lifestyle through better diet and exercise. MyFitnessPal uses apache spark to
clean the data entered by users with the end goal of identifying high quality food
items. Using Spark, MyFitnessPal has been able to scan through food calorie data of
about 80 million users. Earlier, MyFitnessPal used Hadoop to process 2.5TB of data
and that took several days to identify any errors or missing information in it.

Spark Use Cases in Media & Entertainment Industry
Apache Spark is used in the gaming industry to identify patterns from the real-time
in-game events and respond to them to harvest lucrative business opportunities like
targeted advertising, auto adjustment of gaming levels based on complexity, player
retention and many more.

Few of the video sharing websites use apache spark along with MongoDB to show
relevant advertisements to its users based on the videos they view, share and
browse.
Companies Using Spark in Media & Entertainment Industry

Apache Spark at Yahoo for News Personalization


Yahoo uses Apache Spark for personalizing its news webpages and for targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize the news stories to find out what kind of users would be interested in reading each category of news.

Earlier the machine learning algorithm for news personalization required 15000 lines
of C++ code but now with Spark Scala the machine learning algorithm for news
personalization has just 120 lines of Scala programming code. The algorithm was
ready for production use in just 30 minutes of training, on a hundred million
datasets.

Apache Spark at Conviva


The largest streaming video company Conviva uses Apache Spark to deliver quality
of service to its customers by removing the screen buffering and learning in detail
about the network conditions in real-time. This information is stored in the video
player to manage live video traffic coming from close to 4 billion video feeds every
month, to ensure maximum play-through. Apache Spark is helping Conviva reduce
its customer churn to a great extent by providing its customers with a smooth video
viewing experience.

Apache Spark at Netflix


Netflix uses Apache Spark for real-time stream processing to provide online
recommendations to its customers. Streaming devices at Netflix send events which
capture all member activities and play a vital role in personalization. It processes 450
billion events per day which flow to server side applications and are directed to
Apache Kafka.



Apache Spark at Pinterest
Pinterest is using apache spark to discover trends in high value user engagement
data so that it can react to developing trends in real-time by getting an in-depth
understanding of user behaviour on the website.

Spark Use Cases in Travel Industry

Companies Using Spark in Travel Industry


Apache Spark at TripAdvisor
TripAdvisor, a leading travel website that helps users plan a perfect trip is using
Apache Spark to speed up its personalized customer recommendations. TripAdvisor
uses apache spark to provide advice to millions of travellers by comparing hundreds
of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.

Apache Spark at OpenTable


OpenTable, an online real time reservation service, with about 31000 restaurants and
15 million diners a month, uses Spark for training its recommendation algorithms
and for NLP of the restaurant reviews to generate new topic models. OpenTable has
achieved 10 times speed enhancements by using Apache Spark. Spark has helped
reduce the run time of machine learning algorithms from few weeks to just a few
hours resulting in improved team productivity.

The number of Spark use cases continues to grow as more companies start using Spark to make prompt decisions based on real-time processing through Spark Streaming. These are just some of the use cases of the Apache Spark ecosystem.

Spark Use Cases in Gaming Industry


Apache Spark is used in the gaming industry to identify patterns from real-time in-
game events. It helps companies to harvest lucrative business opportunities like
targeted advertising, auto adjustment of gaming levels based on complexity. It also
provides in-game monitoring, player retention, detailed insights, and many more.



Companies Using Spark in Gaming Industry
Riot Games
Game developers have to manage everything from performance to in-game abuse. Spark improves the gaming experience of the users; it also helps in processing different game skins, different game characters, in-game points, and much more. It helps with performance improvement, offers, and efficiency. Riot can now detect the cause that made the game slow and laggy, so they can solve problems in time without impacting users.

Riot Games uses Apache Spark to minimize the in-game toxicity. Whether you are
winning or losing, some players get into a rage. Game developers at Riot use Spark
MLlib to train their models on NLP for words, short forms, initials, etc. to understand
how a player interacts and they can even disable their account if required.

Tencent
Tencent has the biggest mobile gaming user base and, similar to Riot, it develops multiplayer games. Tencent uses Spark for its in-memory computing feature, which boosts data processing performance in real-time in a big data context while also assuring fault tolerance and scalability. It uses Apache Spark to analyze multiplayer chat data to reduce the usage of abusive language in in-game chat.

Spark Use Cases in Software & Information Service Industry

Spark use cases in Computer Software and in Information Technology and Services take about 32% and 14% of the global market respectively. Apache Spark is designed for interactive queries on large datasets; its main use is streaming data, which can be read from sources like Kafka, Hadoop output, or even files on disk. Apache Spark also has a wide range of built-in computational engines, such as SQL and streaming algorithms, that can be used to perform computations on its data sets. These properties make Apache Spark attractive for streaming data analysis.



Spark in Software & Information Service Industry
Databricks
Databricks was founded by the creators of Spark. It provides a cloud-optimized platform to run Spark and ML applications on AWS and Azure, as well as a comprehensive training program. They continue working on Spark to expand the project and make new progress on it. The company has also developed various open-source applications like Delta Lake, MLflow, and Koalas, popular open-source projects that span data engineering, data science, and machine learning.

Hearst
It is a leading global media information and services company. Its main goal is to
provide services to many major businesses, from television channels to financial
services. Using Apache Spark Streaming Hearst’s team gleans real-time insights on
articles/news items performing well and identifies content that is trending.

FINRA
FINRA is a Financial Services company that helps get real-time data insights of
billions of data events. Using Apache Spark, it can test things on real data from the
market, improving its ability to provide investor security and promote market
integrity.

Big Data Analytics Projects using Spark

Spark project 1: Create a data pipeline based on messaging using Spark and Hive

Problem: A data pipeline is used to transport data from source to destination through a series of processing steps. The data source could be other databases, APIs, JSON, CSV files, etc. The final destination could be another process or a visualization tool. In between, data is transformed into a more intelligent and readable format.

Technologies used: AWS, Spark, Hive, Scala, Airflow, Kafka.

Solution Architecture: This implementation has the following steps: Writing events in
the context of a data pipeline. Then designing a data pipeline based on messaging.
This is followed by executing the file pipeline utility. After this we load data from a

remote URL, perform Spark transformations on this data before moving it to
a table. Then Hive is used for data access.

Spark Project 2: Building a Data Warehouse using Spark on Hive

Problem: Large companies usually have multiple storehouses of data. All this data
must be moved to a single location to make it easy to generate reports. A data
warehouse is that single location.

Technologies used: HDFS, Hive, Sqoop, Databricks Spark, Dataframes.

Solution Architecture: The first layer of this Spark project moves data to HDFS. The Hive tables are built on top of HDFS. Data comes through batch processing, and Sqoop is used to ingest it. DataFrames are used for storage instead of RDDs. In the second layer, we normalize and denormalize the data tables; transformation is then done using Spark SQL, and the transformed data is moved to HDFS. In the final, third layer, visualization is done.

Spark Use Cases in Advertising


With the increased usage of digital and social media adoption, Apache Spark is
helping companies achieve their business goals in various ways. It helps to compute
additional data that enrich a dataset. Broadly, this includes gathering metadata
about the original data and computing probability distributions for categorical
features. It can be used to add additional fields like the mode or median of
numerical values in categorical columns like “color” or “age”. It can also be used to
fill in missing features based on other similar tuples in the same table such as
zip_code, gender, or state_province. It is used by advertisers to combine all sorts of
data and provide user-based and targeted ads.

Companies Using Spark in Advertising Industry


Yelp
Founded in 2004, Yelp helps connect people with local businesses. From booking a table to
getting food delivered. The advertising targeting team at Yelp uses prediction algorithms to
figure out how likely it is for a person to interact with an ad. Yelp enhanced revenue and ad
click-through rate by utilizing Apache Spark on Amazon EMR to analyze enormous volumes of
data and train machine learning models.

Gumgum
It is an AI-focused technology and digital media company. They have been using machine
learning to extract value from digital content for a long time. It is an in-image and in-screen advertising platform that employs Spark on Amazon EMR for forecasting, log processing, ad hoc analysis, and a lot more. Spark's speed helps GumGum save lots of time and resources. It uses computer vision and NLP to identify and score different types of content.

MapReduce Programming:

Writing basic Map Reduce programs:

MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?
MapReduce is a processing technique and a program model for distributed
computing based on java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map
as an input and combines those data tuples into a smaller set of tuples. As
the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce model, the
data processing primitives are called mappers and reducers. Decomposing a
data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.

The Algorithm
 Generally MapReduce paradigm is based on sending the computer to
where the data resides!
 MapReduce program executes in three stages, namely map stage,
shuffle stage, and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to

process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to
the appropriate servers in the cluster.
 The framework manages all the details of data-passing such as issuing
tasks, verifying task completion, and copying data around the cluster
between the nodes.
 Most of the computing takes place on nodes with data on local disks
that reduces the network traffic.
 After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop
server.

Inputs and Outputs (Java Perspective)


The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and
produces a set of <key, value> pairs as the output of the job, conceivably of
different types.
The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

            Input              Output
Map         <k1, v1>           list(<k2, v2>)
Reduce      <k2, list(v2)>     list(<k3, v3>)



Terminology

 PayLoad − Applications implement the Map and the Reduce functions,


and form the core of the job.
 Mapper − Mapper maps the input key/value pairs to a set of
intermediate key/value pair.
 NamedNode − Node that manages the Hadoop Distributed File System
(HDFS).
 DataNode − Node where data is presented in advance before any
processing takes place.
 MasterNode − Node where JobTracker runs and which accepts job
requests from clients.
 SlaveNode − Node where Map and Reduce program runs.
 JobTracker − Schedules jobs and tracks the assign jobs to Task
tracker.
 Task Tracker − Tracks the task and reports status to JobTracker.
 Job − A program is an execution of a Mapper and Reducer across a
dataset.
 Task − An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt − A particular instance of an attempt to execute a task
on a SlaveNode.

Example Scenario
Given below is the data regarding the electrical consumption of an
organization. It contains the monthly electrical consumption and the annual
average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg

1979 23 23 2 43 24 25 26 26 26 26 25 26 25

1980 26 27 28 28 28 30 31 31 31 30 30 30 29

1981 31 32 32 32 33 34 35 36 36 34 34 34 34

1984 39 38 39 39 39 41 42 43 40 39 38 38 40

1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it


and produce results such as finding the year of maximum usage, year of
minimum usage, and so on. This is a walkover for the programmers with finite
number of records. They will simply write the logic to produce the required
output, and pass the data to the application written.
But, think of the data representing the electrical consumption of all the
largescale industries of a particular state, since its formation.
When we write applications to process such bulk data,
 They will take a lot of time to execute.
 There will be a heavy network traffic when we move data from source to
network server and so on.
To solve these problems, we have the MapReduce framework.

Input Data
The above data is saved as sample.txtand given as input. The input file
looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25
26 25
1980 26 27 28 28 28 30 31 31 31 30 30
30 29
1981 31 32 32 32 33 34 35 36 36 34 34
34 34
1984 39 38 39 39 39 41 42 43 40 39 38
38 40
1985 38 39 39 39 39 41 41 41 00 40 39
39 45

Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable,  /*Input key Type */
          Text,          /*Input value Type*/
          Text,          /*Output key Type*/
          IntWritable>   /*Output value Type*/
   {
      //Map function
      public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output,
         Reporter reporter) throws IOException {

         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();

         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase
   implements Reducer<Text, IntWritable, Text, IntWritable> {

      //Reduce function
      public void reduce(Text key, Iterator<IntWritable> values,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and
execution of the program is explained below.

Compilation and Execution of the ProcessUnits Program
Let us assume we are in the home directory of a Hadoop user (e.g.
/home/hadoop).
Follow the steps given below to compile and execute the above program.

Step 1
The following command is to create a directory to store the compiled java
classes.
$ mkdir units

Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the
MapReduce program. Visit the following link mvnrepository.com to download
the jar. Let us assume the downloaded folder is /home/hadoop/.



Step 3
The following commands are used for compiling
the ProcessUnits.java program and creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units
ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5
The following command is used to copy the input file named sample.txtin the
input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt
input_dir

Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7
The following command is used to run the Eleunit_max application by taking
the input files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits
input_dir output_dir
Wait for a while until the file is executed. After execution, as shown below,
the output will contain the number of input splits, the number of Map tasks,
the number of reducer tasks, etc.
INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
14/10/31 06:02:52
INFO mapreduce.Job: Counters: 49
File System Counters

FILE: Number of bytes read = 61


FILE: Number of bytes written = 279400
FILE: Number of read operations = 0
FILE: Number of large read operations = 0
FILE: Number of write operations = 0
HDFS: Number of bytes read = 546

HDFS: Number of bytes written = 40
HDFS: Number of read operations = 9
HDFS: Number of large read operations = 0
HDFS: Number of write operations = 2 Job Counters

Launched map tasks = 2


Launched reduce tasks = 1
Data-local map tasks = 2
Total time spent by all maps in occupied slots (ms) =
146137
Total time spent by all reduces in occupied slots (ms) =
441
Total time spent by all map tasks (ms) = 14613
Total time spent by all reduce tasks (ms) = 44120
Total vcore-seconds taken by all map tasks = 146137
Total vcore-seconds taken by all reduce tasks = 44120
Total megabyte-seconds taken by all map tasks = 149644288
Total megabyte-seconds taken by all reduce tasks = 45178880

Map-Reduce Framework

Map input records = 5


Map output records = 5
Map output bytes = 45
Map output materialized bytes = 67
Input split bytes = 208
Combine input records = 5
Combine output records = 5
Reduce input groups = 5
Reduce shuffle bytes = 6
Reduce input records = 5
Reduce output records = 5
Spilled Records = 10
Shuffled Maps = 2
Failed Shuffles = 0
Merged Map outputs = 2
GC time elapsed (ms) = 948
CPU time spent (ms) = 5160
Physical memory (bytes) snapshot = 47749120
Virtual memory (bytes) snapshot = 2899349504
Total committed heap usage (bytes) = 277684224

File Output Format Counters

Bytes Written = 40



Step 8
The following command is used to verify the resultant files in the output
folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9
The following command is used to see the output in Part-00000 file. This file
is generated by HDFS.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45

Step 10
The following command is used to copy the output folder from HDFS to the
local file system for analyzing.
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by
the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script
without any arguments prints the description for all commands.
Usage − hadoop [--config confdir] COMMAND
The following table lists the options available and their description.

Sr.No. Option & Description

1 namenode -format
Formats the DFS filesystem.

2 secondarynamenode
Runs the DFS secondary namenode.

3 namenode



Runs the DFS namenode.

4 datanode
Runs a DFS datanode.

5 dfsadmin
Runs a DFS admin client.

6 mradmin
Runs a Map-Reduce admin client.

7 fsck
Runs a DFS filesystem checking utility.

8 fs
Runs a generic filesystem user client.

9 balancer
Runs a cluster balancing utility.

10 oiv
Applies the offline fsimage viewer to an fsimage.

11 fetchdt
Fetches a delegation token from the NameNode.

12 jobtracker
Runs the MapReduce job Tracker node.

13 pipes
Runs a Pipes job.



14 tasktracker
Runs a MapReduce task Tracker node.

15 historyserver
Runs job history servers as a standalone daemon.

16 job
Manipulates the MapReduce jobs.

17 queue
Gets information regarding Job Queues.

18 version
Prints the version.

19 jar <jar>
Runs a jar file.

20 distcp <srcurl> <desturl>


Copies file or directories recursively.

21 distcp2 <srcurl> <desturl>


DistCp version 2.

22 archive -archiveName NAME -p <parent path> <src>* <dest>


Creates a hadoop archive.

23 classpath
Prints the class path needed to get the Hadoop jar and the required
libraries.



24 daemonlog
Get/Set the log level for each daemon

How to Interact with Map Reduce Jobs


Usage − hadoop job [GENERIC_OPTIONS]
The following are the Generic Options available in a Hadoop job.

Sr.No. GENERIC_OPTION & Description

1 -submit <job-file>
Submits the job.

2 -status <job-id>
Prints the map and reduce completion percentage and all job
counters.

3 -counter <job-id> <group-name> <counter name>


Prints the counter value.

4 -kill <job-id>
Kills the job.

5 -events <job-id> <from event-#> <#-of-events>


Prints the events' details received by job tracker for the given range.

6 -history [all] <jobOutputDir>
Prints job details, failed and killed tip details. More details about the
job such as successful tasks and task attempts made for each task
can be viewed by specifying the [all] option.

7 -list[all]
Displays all jobs. -list displays only jobs which are yet to complete.



8 -kill-task <task-id>
Kills the task. Killed tasks are NOT counted against failed attempts.

9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.

10 -set-priority <job-id> <priority>


Changes the priority of the job. Allowed priority values are
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW

To see the status of job


$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of job output-dir


$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job


$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g. $ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004

MapReduce Word Count Example

In the MapReduce word count example, we find out the frequency of each word.
Here, the role of the Mapper is to emit each word as a key with the value 1, and
the role of the Reducer is to aggregate (sum) the values of the common keys. So,
everything is represented in the form of key-value pairs.

Pre-requisite
Java Installation - Check whether the Java is installed or not using the
following command.
java -version
Hadoop Installation - Check whether the Hadoop is installed or not using the
following command.
hadoop version



If any of them is not installed in your system, follow the below link to install it.

www.javatpoint.com/hadoop-installation

Steps to execute MapReduce word count example


o Create a text file in your local machine and write some text into it.
$ nano data.txt

Check the text written in the data.txt file.


$ cat data.txt

o Create a directory in HDFS, where the text file will be kept.


$ hdfs dfs -mkdir /test



o Upload the data.txt file on HDFS in the specific directory.
$ hdfs dfs -put /home/codegyani/data.txt /test

o Write the MapReduce program using Eclipse.

File: WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Split each input line into words and emit (word, 1) for every word
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

File: WC_Reducer.java
package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    // Sum the counts received for each word and emit (word, total)
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

File: WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        // Output key/value types of the job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // Mapper, combiner and reducer classes
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Input and output paths come from the command-line arguments
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}



o Create the jar file of this program and name it wordcountdemo.jar.
o Run the jar file
hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
o The output is stored in /r_output/part-00000

o Now execute the command to see the output.


hdfs dfs -cat /r_output/part-00000



Programming with RDDs Basics:

Resilient Distributed Datasets


Resilient Distributed Datasets (RDD) is a fundamental data structure of
Spark. It is an immutable distributed collection of objects. Each dataset in
RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects,
including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be
created through deterministic operations on either data on stable storage or other
RDDs. RDD is a fault-tolerant collection of elements that can be operated on in
parallel.
There are two ways to create RDDs − parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a Hadoop Input
Format.
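A minimal sketch of these two ways from the Spark shell (assuming the SparkContext is available as sc; the HDFS path is only an illustrative example):
scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
scala> val fromFile = sc.textFile("hdfs:///user/hadoop/input.txt")   // hypothetical HDFS path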
Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take place
and why they are not so efficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets
with a parallel, distributed algorithm on a cluster. It allows users to write
parallel computations, using a set of high-level operators, without
having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data
between computations (Ex − between two MapReduce jobs) is to write it to
an external stable storage system (Ex − HDFS). Although this framework
provides numerous abstractions for accessing a cluster’s computational
resources, users still want more.
Both Iterative and Interactive applications require faster data sharing across
parallel jobs. Data sharing is slow in MapReduce due to replication,
serialization, and disk I/O. As for the storage system, most Hadoop
applications spend more than 90% of their time doing HDFS read-write
operations.

Iterative Operations on MapReduce


Reuse intermediate results across multiple computations in multi-stage
applications. The following illustration explains how the current framework
works, while doing the iterative operations on MapReduce. This incurs
substantial overheads due to data replication, disk I/O, and serialization,
which makes the system slow.

Interactive Operations on MapReduce


User runs ad-hoc queries on the same subset of data. Each query will do the disk I/O on the
stable storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing the
interactive queries on MapReduce.



Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop
applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark.
The key idea of spark is Resilient Distributed Datasets (RDD); it supports in-memory
processing computation. This means, it stores the state of memory as an object across the jobs
and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster
than network and Disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

The illustration given below shows the iterative operations on Spark RDD. It
will store intermediate results in a distributed memory instead of Stable
storage (Disk) and make the system faster.
Note − If the Distributed memory (RAM) is not sufficient to store intermediate
results (State of the JOB), then it will store those results on the disk.



Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different
queries are run on the same set of data repeatedly, this particular data can
be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access, the next time you query it. There is also support
for persisting RDDs on disk, or replicated across multiple nodes.
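As a small, hedged sketch of persisting an RDD from the Spark shell (input.txt is only an example file; cache() is shorthand for persisting at the MEMORY_ONLY level):
scala> import org.apache.spark.storage.StorageLevel
scala> val lines = sc.textFile("input.txt")
scala> lines.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if memory is insufficient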

Programming
Spark contains two different types of shared variables − one is broadcast
variables and second is accumulators.
 Broadcast variables − used to efficiently, distribute large values.
 Accumulators − used to aggregate the information of particular
collection.

Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached
on each machine rather than shipping a copy of it with tasks. They can be used, for example, to
give every node, a copy of a large input dataset, in an efficient manner. Spark also attempts to
distribute broadcast variables using efficient broadcast algorithms to reduce communication
cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations.
Spark automatically broadcasts the common data needed by tasks within each stage.

The data broadcasted this way is cached in serialized form and is deserialized before running
each task. This means that explicitly creating broadcast variables is only useful when tasks
across multiple stages need the same data or when caching the data in deserialized form is
important.



Broadcast variables are created from a variable v by
calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its
value can be accessed by calling the value method. The code given below shows this −

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))


Output −
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]]
= Broadcast(0)
After the broadcast variable is created, it should be used instead of the
value v in any functions run on the cluster, so that v is not shipped to the
nodes more than once. In addition, the object v should not be modified after
its broadcast, in order to ensure that all nodes get the same value of the
broadcast variable.
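For example, continuing with the broadcastVar created above, a minimal sketch of reading the broadcast value inside a task (the parallelized indices are made up for illustration):
scala> val indices = sc.parallelize(Seq(0, 1, 2))
scala> indices.map(i => broadcastVar.value(i)).collect()   // each task reads the cached copy; returns Array(1, 2, 3)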

Accumulators
Accumulators are variables that are only “added” to through an associative operation and can
therefore, be efficiently supported in parallel. They can be used to implement counters (as in
MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers
can add support for new types. If accumulators are created with a name, they will be displayed
in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE −
this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v).
Tasks running on the cluster can then add to it using the add method or the += operator (in
Scala and Python). However, they cannot read its value. Only the driver program can read the
accumulator’s value, using its value method.
The code given below shows an accumulator being used to add up the elements of an array −
scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
If you want to see the output of the above code, then use the following command.

scala> accum.value

Output
res2: Int = 10
Numeric RDD Operations
Spark allows you to do different operations on numeric data, using one of the
predefined API methods. Spark’s numeric operations are implemented with a
streaming algorithm that allows building the model, one element at a time.



These operations are computed and returned as a StatCounter object by
calling the stats() method.
The following is a list of numeric methods available in StatCounter.

S.No Methods & Meaning

1 count()
Number of elements in the RDD.

2 mean()
Average of the elements in the RDD.

3 sum()
Total value of the elements in the RDD.

4 max()
Maximum value among all elements in the RDD.

5 min()
Minimum value among all elements in the RDD.

6 variance()
Variance of the elements.

7 stdev()
Standard deviation of the elements.

If you want to use only one of these methods, you can call the corresponding
method directly on RDD.
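A minimal sketch in the Spark shell (assuming sc is available) that applies these numeric operations to an RDD of doubles:
scala> val nums = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
scala> val stats = nums.stats()   // StatCounter holding count, mean, sum, max, min, variance, stdev
scala> stats.mean                 // 2.5
scala> nums.sum()                 // 10.0 -- the same operation called directly on the RDD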

Core Programming
Spark Core is the base of the whole project. It provides distributed task
dispatching, scheduling, and basic I/O functionalities. Spark uses a
specialized fundamental data structure known as RDD (Resilient Distributed
Datasets) that is a logical collection of data partitioned across machines.



RDDs can be created in two ways; one is by referencing datasets in
external storage systems and the second is by applying transformations (e.g.
map, filter, reduce, join) on existing RDDs.
The RDD abstraction is exposed through a language-integrated API. This
simplifies programming complexity because the way applications manipulate
RDDs is similar to manipulating local collections of data.



S.No Transformations & Meaning

1 map(func)
Returns a new distributed dataset, formed by passing each element of
the source through a function func.

2 filter(func)
Returns a new dataset formed by selecting those elements of the source
on which func returns true.

3 flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output
items (so func should return a Seq rather than a single item).

4 mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD,
so func must be of type Iterator<T> ⇒ Iterator<U> when running on an
RDD of type T.

5 mapPartitionsWithIndex(func)
Similar to map Partitions, but also provides func with an integer value
representing the index of the partition, so func must be of type (Int,
Iterator<T>) ⇒ Iterator<U> when running on an RDD of type T.

6 sample(withReplacement, fraction, seed)


Sample a fraction of the data, with or without replacement, using a given
random number generator seed.

7 union(otherDataset)
Returns a new dataset that contains the union of the elements in the
source dataset and the argument.

8 intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the
source dataset and the argument.



9 distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source
dataset.

10 groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs.
Note − If you are grouping in order to perform an aggregation (such as a
sum or average) over each key, using reduceByKey or aggregateByKey
will yield much better performance.

11 reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs
where the values for each key are aggregated using the given reduce
function func, which must be of type (V, V) ⇒ V. Like in groupByKey, the
number of reduce tasks is configurable through an optional second
argument.

12 aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])


When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs
where the values for each key are aggregated using the given combine
functions and a neutral "zero" value. Allows an aggregated value type
that is different from the input value type, while avoiding unnecessary
allocations. Like in groupByKey, the number of reduce tasks is
configurable through an optional second argument.

13 sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered,
returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the Boolean ascending argument.

14 join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of
(K, (V, W)) pairs with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.



15 cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of
(K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called group
With.

16 cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U)
pairs (all pairs of elements).

17 pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines
output to its stdout are returned as an RDD of strings.

18 coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful
for running operations more efficiently after filtering down a large dataset.

19 repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data over
the network.

20 repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each
resulting partition, sort records by their keys. This is more efficient than
calling repartition and then sorting within each partition because it can
push the sorting down into the shuffle machinery.
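As a short, hedged sketch of chaining a few of these transformations in the Spark shell (the sample key-value pairs are made up for illustration):
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
scala> val evens = pairs.filter { case (_, v) => v % 2 == 0 }   // filter(func)
scala> val totals = pairs.reduceByKey(_ + _)                    // reduceByKey(func)
scala> totals.collect()                                         // an action, e.g. Array((a,4), (b,2))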

Spark Shell
Spark provides an interactive shell − a powerful tool to analyze data
interactively. It is available in either Scala or Python language. Spark’s
primary abstraction is a distributed collection of items called a Resilient
Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats
(such as HDFS files) or by transforming other RDDs.



Open Spark Shell
The following command is used to open Spark shell.
$ spark-shell

Create simple RDD


Let us create a simple RDD from the text file. Use the following command to
create a simple RDD.
scala> val inputfile = sc.textFile("input.txt")
The output for the above command is
inputfile: org.apache.spark.rdd.RDD[String] = input.txt
MappedRDD[1] at textFile at <console>:12
The Spark RDD API introduces a few Transformations and a few Actions to
manipulate RDDs.

RDD Transformations
RDD transformations return a pointer to a new RDD and allow you to create
dependencies between RDDs. Each RDD in the dependency chain (string of
dependencies) has a function for calculating its data and a pointer
(dependency) to its parent RDD.
Spark is lazy, so nothing will be executed unless you call an action that
triggers job creation and execution. Look at the word-count example given
later in this section.
Therefore, an RDD transformation is not a set of data but a step in a program
(possibly the only step) telling Spark how to get data and what to do with it.
Given below is a list of RDD actions; the RDD transformations were listed
earlier under Core Programming.



S.No Action & Meaning

1 reduce(func)
Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel.

2 collect()
Returns all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.

3 count()
Returns the number of elements in the dataset.

4 first()
Returns the first element of the dataset (similar to take (1)).

5 take(n)
Returns an array with the first n elements of the dataset.

6 takeSample (withReplacement,num, [seed])


Returns an array with a random sample of num elements of the
dataset, with or without replacement, optionally pre-specifying a
random number generator seed.

7 takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural
order or a custom comparator.

8 saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other Hadoop-
supported file system. Spark calls toString on each element to convert



it to a line of text in the file.

9 saveAsSequenceFile(path) (Java and Scala)


Writes the elements of the dataset as a Hadoop SequenceFile in a
given path in the local filesystem, HDFS or any other Hadoop-
supported file system. This is available on RDDs of key-value pairs
that implement Hadoop's Writable interface. In Scala, it is also
available on types that are implicitly convertible to Writable (Spark
includes conversions for basic types like Int, Double, String, etc).

10 saveAsObjectFile(path) (Java and Scala)


Writes the elements of the dataset in a simple format using Java
serialization, which can then be loaded using
SparkContext.objectFile().

11 countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int)
pairs with the count of each key.

12 foreach(func)
Runs a function func on each element of the dataset. This is usually,
done for side effects such as updating an Accumulator or interacting
with external storage systems.
Note − modifying variables other than Accumulators outside of the
foreach() may result in undefined behavior. See Understanding
closures for more details.

Actions
The table above lists the Actions, which return values.

Programming with RDD


Let us see the implementations of few RDD transformations and actions in
RDD programming with the help of an example.

Example
Consider a word count example − It counts each word appearing in a
document. Consider the following text as an input and is saved as
an input.txt file in a home directory.



input.txt − input file.
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
Follow the procedure given below to execute the given example.

Open Spark-Shell
The following command is used to open spark shell. Generally, spark is built
using Scala. Therefore, a Spark program runs on Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will find the following output. Look
at the last line of the output: “Spark context available as sc” means that the
Spark shell has automatically created a SparkContext object with the name sc.
Before starting the first step of a program, the SparkContext object should be
created.
Spark assembly has been built with Hive, including Datanucleus
jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-
defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to:
hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls
to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager:
authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop);
users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service
'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM,


Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>



Create an RDD
First, we have to read the input file using Spark-Scala API and create an
RDD.
The following command is used for reading a file from given location. Here,
new RDD is created with the name of inputfile. The String which is given as
an argument in the textFile(“”) method is absolute path for the input file name.
However, if only the file name is given, then it means that the input file is in
the current location.
scala> val inputfile = sc.textFile("input.txt")

Execute Word count Transformation


Our aim is to count the words in a file. Create a flat map for splitting each line
into words (flatMap(line ⇒ line.split(“ ”)).
Next, read each word as a key with a value ‘1’ (<key, value> =
<word, 1>) using the map function (map(word ⇒ (word, 1)).
Finally, reduce those keys by adding values of similar keys
(reduceByKey(_+_)).
The following command is used for executing the word count logic. After
executing this, you will not find any output because this is not an action but a
transformation; it points to a new RDD and tells Spark what to do with the
given data.
scala> val counts = inputfile.flatMap(line => line.split("
")).map(word => (word, 1)).reduceByKey(_+_);

Current RDD
While working with the RDD, if you want to know about current RDD, then
use the following command. It will show you the description about current
RDD and its dependencies for debugging.
scala> counts.toDebugString

Caching the Transformations


You can mark an RDD to be persisted using the persist() or cache() methods
on it. The first time it is computed in an action, it will be kept in memory on the
nodes. Use the following command to store the intermediate transformations
in memory.
scala> counts.cache()



Applying the Action
Applying an action, like saveAsTextFile, stores all the transformation results
into a text file. The String argument for the saveAsTextFile(“ ”) method is the
absolute path of the output folder. Try the following command to save the
output in a text file. In the following example, the ‘output’ folder is in the
current location.
scala> counts.saveAsTextFile("output")

Checking the Output


Open another terminal to go to home directory (where spark is executed in
the other terminal). Use the following commands for checking output
directory.
[hadoop@localhost ~]$ cd output/
[hadoop@localhost output]$ ls -1

part-00000
part-00001
_SUCCESS
The following command is used to see output from Part-00000 files.
[hadoop@localhost output]$ cat part-00000

Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)
The following command is used to see output from Part-00001 files.
[hadoop@localhost output]$ cat part-00001

Output
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)

Un-persist the Storage



Before UN-persisting, if you want to see the storage space that is used for
this application, then use the following URL in your browser.
http://localhost:4040
You will see the following screen, which shows the storage space used for
the applications that are running on the Spark shell.

If you want to un-persist the storage space of a particular RDD, then use the
following command.
Scala> counts.unpersist()
You will see the output as follows −
15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from
persistence list
15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480
dropped from memory (free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296
dropped from memory (free 280062106)
res7: cou.type = ShuffledRDD[9] at reduceByKey at <console>:14
For verifying the storage space in the browser, use the following URL.



http://localhost:4040/
You will see the following screen. It shows the storage space used for the
applications that are running on the Spark shell.



UNIT III

Streaming in Spark:
Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively
supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark
API that allows data engineers and data scientists to process real-time data from various sources
including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be
pushed out to file systems, databases, and live dashboards. Its key abstraction is a Discretized
Stream or, in short, a DStream, which represents a stream of data divided into small batches.
DStreams are built on RDDs, Spark’s core data abstraction. This allows Spark Streaming to
seamlessly integrate with any other Spark components like MLlib and Spark SQL. Spark
Streaming is different from other systems that either have a processing engine designed only for
streaming, or have similar batch and streaming APIs but compile internally to different engines.
Spark’s single execution engine and unified programming model for batch and streaming lead to
some unique benefits over other traditional streaming systems.
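The following is a minimal, hedged sketch of a Spark Streaming word count in Scala; the socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions, not part of any particular deployment.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each 5-second batch of the DStream is an RDD of the lines received in that interval
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()   // print the first few counts of every batch
    ssc.start()
    ssc.awaitTermination()
  }
}
Each DStream operation above (flatMap, map, reduceByKey) is applied to every micro-batch RDD, which is what lets the same RDD-style code work for streaming data.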

Four Major Aspects of Spark Streaming


 Fast recovery from failures and stragglers
 Better load balancing and resource usage
 Combining of streaming data with static datasets and interactive queries
 Native integration with advanced processing libraries (SQL, machine
learning, graph processing)



This unification of disparate data processing capabilities is the key
reason behind Spark Streaming’s rapid adoption. It makes it very easy
for developers to use a single framework to satisfy all their processing
needs.

Streaming Features:

The idea of this section is to explain the general features of Apache
Big Data stream processing frameworks. It also provides a crisp
comparative analysis of Apache's Big Data streaming frameworks
against these generic features, which is useful in selecting the right
framework for application development.

1. Introduction
In the Big Data world, there are many tools and frameworks available to process the large
volume of data in offline mode or batch mode. But the need for real-time processing to analyze
the data arriving at high velocity on the fly and provide analytics or enrichment services is also
high. In the last couple of years, this has been an ever-changing landscape, with many new
entrants among streaming frameworks. So choosing the right real-time processing engine becomes a challenge.

2. Design
The real-time streaming engines interact with stream or messaging frameworks such as Apache
Kafka, RabbitMQ, or Apache Flume to receive the data in real time.

They process the data inside a cluster computing engine, which typically runs on top of a cluster
manager such as Apache YARN, Apache Mesos, or Apache Tez.

The processed data is sent back to message queues (Apache Kafka, RabbitMQ, Flume) or written
into storage such as HDFS or NFS.



3. Characteristics of Real-Time Stream Process Engines

3.1 Programming Models


They are two types of programming models present in real time
streaming frameworks.

3.1.1 Compositional
This approach provides basic components, using which the
streaming application can be created. For example, In Apache Storm,
the spout is used to connect to different sources and receive the data
and bolts are used to process the received data.

3.1.2 Declarative
This is more of a functional programming approach, where the
framework allows us to define higher order functions. This
declarative APIs provide more advanced operations like windowing
or state management and it is considered more flexible.

3.2 Message Delivery Guarantee


There are three message delivery guarantee mechanisms. They are:
at most once, at least
once, and exactly once.

3.2.1 At Most Once
This is a best-effort delivery mechanism. Each message is delivered
either once or not at all, so messages may be lost, but no duplicate
events will be processed.

3.2.2 At Least Once


This mechanism will ensure that the message is delivered at least
once.
But in the process of delivering at least once, the framework might
deliver the
message more than once. So, duplicate messages might be received
and processed.
This might result in unnecessary complications where the
processing logic is not idempotent.

3.2.3 Exactly Once


The framework will ensure that the message is delivered and
processed exactly once.
The message delivery is guaranteed and there won’t be any
duplicate messages.
So, “Exactly Once” delivery guarantee is considered to be best of all.

3.3 State Management


State management defines the way events are accumulated
inside the framework before it actually processes the data. This is
a critical factor while deciding the framework for real-time analytics.

3.3.1 Stateless Processing


The frameworks which process the incoming events independently,
without the knowledge of any previous events, are considered to be
stateless. Data enrichment and simple data processing applications
might need only this kind of processing.

3.3.2 Stateful Processing
The stream processing frameworks can make use of the previous
events to process the
incoming events, by storing them in cache or external databases.
Real-time analytics applications need stateful processing so that they
can collect the data for a specific interval and process it before
recommending any suggestions to the user.

3.4 Processing Modes


Processing mode defines, how the incoming data is processed.
There are three processing modes: Event, Micro batch, and batch.

3.4.1 Event Mode


Each and every incoming message is processed independently. It
may or may not maintain the state information.

3.4.2 Micro Batch


The incoming events are accumulated for a specific time
window and the collected events are processed together as a batch.

3.4.3 Batch
The incoming events are processed like a bounded stream of inputs.
This allows it to process the large, finite set of incoming events.

3.5 Cluster Manager


Real-time processing frameworks running in a cluster computing
environment might need a cluster manager. The support for a cluster
manager is critical to meet the scalability and performance
requirements of the application. The frameworks might run in
standalone mode, with their own cluster manager, or on Apache YARN,
Apache Mesos, or Apache Tez.

3.5.1 Standalone Mode


The support to run in standalone mode is useful during
the development phase, where developers can run the code in
their own development environment and do not need to deploy
it to a large cluster computing environment.

3.5.2 Proprietary Cluster Manager


Some real-time processing frameworks support their own
cluster managers; for example, Apache Spark has its own standalone
cluster manager, which is bundled with the software. This reduces the
overhead of installing, configuring, and maintaining other
cluster managers such as Apache YARN or Apache Mesos.

3.5.3 Support for Industry Standard Cluster Managers


If you already have a Big Data environment and want to leverage the
cluster for real-time processing, then support for the existing cluster
manager is very critical. The real-time stream processing
frameworks must support Apache YARN, Apache Mesos, or Apache
Tez.

3.6 Fault Tolerance


Most of the Big Data frameworks follow the master-slave
architecture. Basically, the master is responsible for running the job
on the cluster and monitoring the clients in the cluster. So, the
framework must handle failures at the master node as well as failure
in client nodes. Some frameworks might need some external tools
like monit/supervisord to monitor the master node. For example,
Apache Spark streaming has its own monitoring process for the
master (driver) node. If the master node fails it will be
automatically restarted. If the client node fails, master takes care of
restarting them. But in Apache Storm, the master has to be
monitored using monit.

3.7 External Connectors


The framework must support seamless connection to external data
generation sources such as Twitter feeds, Kafka, RabbitMQ, Flume, RSS
Feeds, Hekad, etc. The frameworks must provide standard inbuilt
connectors as well as provision to extend the connectors to connect
various streaming data sources.
3.7.1 Social Media Connectors – Twitter / RSS Feeds
3.7.2 Message Queue Connectors -Kafka / RabbitMQ
3.7.3 Network Port Connectors – TCP/UDP Ports
3.7.4 Custom Connectors – Support to develop customized
connectors to read from custom applications.

3.8 Programming Language Support


Most of these frameworks support JVM languages, especially Java &
Scala. Some also support Python. The selection of the framework
might depend on the language of choice.

3.9 Reference Data Storage & Access


The real-time processing engines might need to refer to some
databases to enhance or aggregate the given data. So, the
framework must provide features for integration and efficient access
to the reference data. Some frameworks provide ways to internally
cache the reference data in memory (e.g. the Apache Spark broadcast
variable). Apache Samza and Apache Flink support storing the
reference data internally in each cluster node so that jobs can access
it locally without connecting to the database over the network.
The following are the various methods available in the Big Data
streaming frameworks:
3.9.1 In-memory cache: Allows storing reference data inside
cluster nodes, which improves performance by reducing the
delay of connecting to external databases.
3.9.2 Per-Client Database Storage: Allows storing data in third-party
database systems like MySQL, SQLite, MongoDB, etc. inside the
streaming cluster. Also provides API support to connect to and
retrieve data from those databases, along with efficient database
connection methodologies.

3.9.3 Remote DBMS connection: These systems support
connecting to external databases outside the streaming cluster.
This is considered less efficient because of the higher latency and
bottlenecks introduced by network communication.

3.10 Latency and throughput


Though the hardware configuration plays a major role in latency and
throughput, some design factors of the frameworks also affect
performance. These factors are: network I/O, efficient use of memory,
reduced disk access, and in-memory caching of reference data. For
example, the Apache Kafka Streams API provides higher throughput
and low latency due to reduced network I/O, since the messaging
framework and the computing engine are in the same cluster.
Similarly, Apache Spark uses memory to cache data, and the reduced
disk access results in low latency and higher throughput.

4. Feature Comparison Table


The following table provides a comparison of the Apache streaming
frameworks against the above-discussed features.

The above frameworks support both stateful and stateless
processing modes.

5. Conclusion
This article summarizes the various features of the
streaming framework, which are critical selection criteria for new
streaming application. Every application is unique and has its own
specific functional and non-functional requirement, so the right
framework completely depends on the requirement.

Use case on streaming : About the top four use cases for streaming data integration

Use Case #1 Cloud Adoption – Online Database Migration

The first one is cloud adoption – specifically online database migration. When you have
your legacy database and you want to move it to the cloud and modernize your data
infrastructure, if it’s a critical database, you don’t want to experience downtime. The streaming
data integration solution helps with that. When you’re doing an initial load from the legacy
system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions
happening in this database as it’s happening. Once this database is loaded and ready, all the
changes that happened in the legacy database can be applied in the cloud. During the migration,
your legacy system is open for transactions – you don’t have to pause it.

While the migration is happening, CDC helps you to keep these two databases continuously in-
sync by moving the real-time data between the systems. Because the system is open to
transactions, there is no business interruption. And if this technology is designed for both
validating the delivery and checkpointing the systems, you will also not experience any data loss.

Because this cloud database has production data, is open to transactions, and is continuously
updated, you can take your time to test it before you move your users. So you have basically
unlimited testing time, which helps you minimize your risks during such a major transition. Once
the system is completely in-sync and you have checked it and tested it, you can point your
applications and run your cloud database.

This is a single switch-over scenario. But streaming data integration gives you the ability to
move the data bi-directionally. You can have both systems open to transactions. Once you test
this, you can run some of your users in the cloud and some of your users in the legacy database.

All the changes happening with these users can be moved between databases, synchronized so
that they’re constantly in-sync. You can gradually move your users to the cloud database to
further minimize your risk. Phased migration is a very popular use case, especially for mission-
critical systems that cannot tolerate risk and downtime.

Use Case #2 Hybrid Cloud Architecture

Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You
need to connect it with the rest of your enterprise. It needs to be a natural extension of your data
center. Continuous real-time data moment with streaming data integration allows you to have
your cloud databases and services as part of your data center.

The important thing is that these workloads in the cloud can be operational workloads because
there’s fresh information (ie, continuously updated information) available. Your databases, your
machine data, your log files, your other cloud sources, messaging systems, and sensors can move
continuously to enable operational workloads.

What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you
want operational reporting or operational intelligence, you want comprehensive data delivered
continuously so that you can trust that’s up-to-date, and gain operational intelligence from your
analytics solutions.

KGRL MCA BIGDATA ANALYTICS Lecture Notes K.Issack Babu Page 107
You can also connect your data sources with the messaging systems in the cloud to
support event distribution for your new apps that you’re running in the cloud so that they are
completely part of your data center. If you’re adopting multi-cloud solutions, you can again
connect your new cloud systems with existing cloud systems, or send data to multiple cloud
destinations.

Use Case #3 Real-Time Modern Applications

A third use case is real-time modern applications. Cloud is a big trend right now, but not
everything is necessarily in the cloud. You can have modern applications on-premises. So, if
you’re building any real-time app and modern new system that needs timely information, you
need to have continuous real-time data pipelines. Streaming data integration enables you run
real-time apps with real-time data.

Use Case #4 Hot Cache

Last, but not least, when you have an in-memory data grid to help with your data retrieval
performance, you need to make sure it is continuously up-to-date so that you can rely on that
data – it’s something that users can depend on. If the source system is updated, but your cache is
not updated, it can create business problems. By continuously moving real-time data
using CDC technology, streaming data integration helps you to keep your data grid up-to-date. It
can serve as your hot cache to support your business with fresh data.

Machine Learning:

Big Data vs. Machine Learning

1. Big Data is more about the extraction and analysis of information from huge volumes of data, whereas Machine Learning is about using input data and algorithms to estimate unknown future results.

2. Types of Big Data are Structured, Unstructured and Semi-Structured, whereas types of Machine Learning algorithms are Supervised Learning, Unsupervised Learning and Reinforcement Learning.

3. Big Data analysis is a way of handling bigger and unstructured data sets using tools like Apache Hadoop and MongoDB, whereas Machine Learning is a way of analysing input datasets using various algorithms and tools like NumPy, Pandas, Scikit-learn, TensorFlow and Keras.

4. Big Data analytics pulls raw data and looks for patterns to help in stronger decision-making for firms, whereas Machine Learning can learn from training data and act like a human in making effective predictions by teaching itself using algorithms.

5. With Big Data, it is very difficult to extract relevant features even with the latest data-handling tools because of the high dimensionality of the data, whereas Machine Learning models work with limited-dimensional data, making it easier to recognize features.

6. Big Data analysis requires human validation because of the large volume of multidimensional data, whereas a perfectly built Machine Learning algorithm does not require human intervention.

7. Big Data is helpful for handling different purposes including stock analysis, market analysis, etc., whereas Machine Learning is helpful for providing virtual assistance, product recommendations, email spam filtering, etc.

8. The scope of Big Data in the near future is not just limited to handling large volumes of data but also to optimizing data storage in a faster, structured format which enables easier analysis, whereas the scope of Machine Learning is to improve the quality of predictive analysis and decision making, more robust cognitive analysis, the rise of robots and improved medical services.

Spark MLlib Overview :

Overview

With the demand for big data and machine learning, this article provides an
introduction to Spark MLlib, its components, and how it works. This covers the
main topics of using machine learning algorithms in Apache Spark.

Introduction

Apache Spark is a data processing framework that can quickly perform processing tasks on very

large data sets and can also distribute data processing tasks across multiple computers, either on

its own or in tandem with other distributed computing tools. It is a lightning-fast unified

analytics engine for big data and machine learning.

To support Python with Spark, the Apache Spark community released a tool, PySpark. Using

PySpark, one can work with RDDs in Python programming language.

The components of Spark are:

1. Spark Core
2. Spark SQL
3. Spark Streaming
4. Spark MLlib
5. GraphX
6. Spark R

Spark Core
All the functionalities being provided by Apache Spark are built on the top of Spark Core. It

manages all essential I/O functionalities. It is used for task dispatching and fault recovery. Spark

Core is embedded with a special collection called RDD (Resilient Distributed Dataset). RDD is

among the abstractions of Spark. Spark RDD handles partitioning data across all the nodes in a

cluster. It holds them in the memory pool of the cluster as a single unit. There are two operations

performed on RDDs:

Transformation: It is a function that produces a new RDD from the existing RDDs.

Action: Transformations only create RDDs from one another; when we want to work with

the actual dataset and return a result to the driver, at that point we use an Action.

Spark SQL

The Spark SQL component is a distributed framework for structured data
processing. Spark SQL works to access structured and semi-structured information.
It also enables powerful, interactive, analytical applications across both streaming
and historical data. DataFrames and SQL provide a common way to access a
variety of data sources. Its main features are a cost-based optimizer and
mid-query fault tolerance.
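A minimal, hedged Scala sketch of querying a DataFrame with Spark SQL (people.json is a hypothetical input file assumed to contain name and age fields):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()
val df = spark.read.json("people.json")              // hypothetical data source
df.createOrReplaceTempView("people")                 // expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 21").show()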

Spark Streaming

It is an add-on to core Spark API which allows scalable, high-throughput, fault-tolerant stream

processing of live data streams. Spark Streaming, groups the live data into small batches. It then

delivers it to the batch system for processing. It also provides fault-tolerance characteristics.

Spark GraphX:

GraphX in Spark is an API for graphs and graph parallel execution. It is a network
graph analytics engine and data store. Clustering, classification, traversal,
searching, and pathfinding is also possible in graphs.

SparkR:

SparkR provides a distributed data frame implementation. It supports operations


like selection, filtering, aggregation but on large datasets.

Spark MLlib:

Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists
of popular algorithms and utilities. MLlib in Spark is a scalable machine learning
library that provides both high-quality algorithms and high speed. The machine
learning algorithms include regression, classification, clustering, pattern mining, and
collaborative filtering. Lower-level machine learning primitives like the generic
gradient descent optimization algorithm are also present in MLlib.

Spark.ml is the primary Machine Learning API for Spark. The


library Spark.ml offers a higher-level API built on top of DataFrames
for constructing ML pipelines.

Spark MLlib tools are given below:

1. ML Algorithms
2. Featurization
3. Pipelines
4. Persistence
5. Utilities

1. ML Algorithms

ML Algorithms form the core of MLlib. These include common learning algorithms such
as classification, regression, clustering, and collaborative filtering.

MLlib standardizes APIs to make it easier to combine multiple algorithms into a single
pipeline, or workflow. The key concepts are the Pipelines API, where the pipeline
concept is inspired by the scikit-learn project.

Transformer:
A Transformer is an algorithm that can transform one DataFrame into another
DataFrame. Technically, a Transformer implements a method transform(), which
converts one DataFrame into another, generally by appending one or more columns. For
example:

A feature transformer might take a DataFrame, read a column (e.g., text), map it into a
new column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.

A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted
labels appended as a column.

Estimator:
An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.
Technically, an Estimator implements a method fit(), which accepts a DataFrame and
produces a Model, which is a Transformer. For example, a learning algorithm such as
LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel,
which is a Model and hence a Transformer.

Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful


algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful in


specifying parameters (discussed below).

2. Featurization

Featurization includes feature extraction, transformation, dimensionality reduction, and


selection.

1. Feature Extraction is extracting features from raw data.


2. Feature Transformation includes scaling, converting, or modifying features.
3. Feature Selection involves selecting a subset of necessary features from a huge set
of features.
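
The following PySpark sketch (assuming an existing SparkSession spark; the toy text data and column names are illustrative) walks through the three featurization steps:

from pyspark.ml.feature import Tokenizer, HashingTF, StandardScaler, ChiSqSelector

docs = spark.createDataFrame(
    [(0.0, "big data analytics with spark"), (1.0, "spark mllib machine learning")],
    ["label", "text"])

# 1. Feature extraction: raw text -> term-frequency vectors
tokenized = Tokenizer(inputCol="text", outputCol="words").transform(docs)
extracted = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32).transform(tokenized)

# 2. Feature transformation: scale the extracted vectors
scaler = StandardScaler(inputCol="rawFeatures", outputCol="scaledFeatures", withMean=False)
scaled = scaler.fit(extracted).transform(extracted)

# 3. Feature selection: keep the features most related to the label
selector = ChiSqSelector(numTopFeatures=5, featuresCol="rawFeatures",
                         labelCol="label", outputCol="selectedFeatures")
selector.fit(extracted).transform(extracted).select("label", "selectedFeatures").show(truncate=False)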

2. Pipelines:

A Pipeline chains multiple Transformers and Estimators together to specify an ML


workflow. It also provides tools for constructing, evaluating and tuning ML Pipelines.

In machine learning, it is common to run a sequence of algorithms to process and learn


from data. MLlib represents such a workflow as a Pipeline, which consists of a sequence
of Pipeline Stages (Transformers and Estimators) to be run in a specific order. We will
use this simple workflow as a running example in this section.

Example: Pipeline sample given below does the data preprocessing in a specific order as
given below:

1. Apply String Indexer method to find the index of the categorical columns

2. Apply OneHot encoding for the categorical columns

3. Apply String indexer for the output variable “label” column

4. VectorAssembler is applied for both categorical columns and numeric columns.


VectorAssembler is a transformer that combines a given list of columns into a single
vector column.

The pipeline workflow will execute the data modelling in the above specific order.

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline

categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan']
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Indexer')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "Vec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]

numericColumns = ['age', 'balance', 'duration']
assemblerInputs = [c + "Vec" for c in categoricalColumns] + numericColumns
Vassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [Vassembler]

cols = df.columns                      # keep the original column names for later selection
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)

selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)

Dataframe
Dataframes provide a more user-friendly API than RDDs. The DataFrame-based API for
MLlib provides a uniform API across ML algorithms and across multiple languages.
Dataframes facilitate practical ML Pipelines, particularly feature transformations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mlearnsample').getOrCreate()

df = spark.read.csv('loan_bank.csv', header = True, inferSchema = True)

df.printSchema()

3. Persistence:

Persistence helps in saving and loading algorithms, models, and Pipelines. This reduces
time and effort: once a model is persisted, it can be loaded and reused at any time
when needed.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(featuresCol='features', labelCol='label')
lrModel = lr.fit(train)

predictions = lrModel.transform(test)
predictions.select('age', 'label', 'rawPrediction', 'prediction').show()

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))
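
A small sketch of the actual save/load step (assuming the lrModel and pipelineModel objects from the examples above and a writable path such as 'models/'; the paths are illustrative):

from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml import PipelineModel

lrModel.write().overwrite().save('models/loan_lr_model')          # persist the fitted model
sameModel = LogisticRegressionModel.load('models/loan_lr_model')  # reload it later

pipelineModel.write().overwrite().save('models/loan_pipeline')    # whole pipelines persist the same way
samePipeline = PipelineModel.load('models/loan_pipeline')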

4. Utilities:

MLlib also provides utilities for linear algebra, statistics, and data handling. For example,
mllib.linalg contains the MLlib utilities for linear algebra.
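
A short sketch of the local vector types (the values are arbitrary illustrations):

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])   # sparse(size, indices, values) stores only non-zeros

print(dense.dot(sparse))   # 1*1 + 0*0 + 3*3 = 10.0
print(dense.norm(2))       # Euclidean norm of the dense vector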

TOOLS:

Introduction to Spark Tools:

Spark tools are the major software features of the Spark framework that are used for efficient
and scalable data processing in big data analytics. The Spark framework is open-sourced
under the Apache license. It comprises five important tools for data processing: GraphX,
MLlib, Spark Streaming, Spark SQL and Spark Core. GraphX is the tool used for processing and
managing graph data analysis. The MLlib Spark tool is used for machine learning implementation on
distributed datasets, whereas Spark Streaming is used for stream data processing. Spark SQL
is the tool mostly used for structured data analysis. The Spark Core tool manages the Resilient
Distributed Dataset, known as the RDD.

Tools of Spark
There are five Spark tools, namely GraphX, MLlib, Spark Streaming, Spark SQL and Spark Core.

Below we examine each tool in detail.

1. GraphX Tool

 This is the Spark API related to graphs as well as graph-parallel computation. GraphX

provides a Resilient Distributed Property Graph which is an extension of the Spark RDD.

 The form possesses a proliferating collection of graph algorithms as well as builders in

order to make graph analytics activities simple.

 This important tool is used to develop as well as manipulate graph data in order to perform
comparative analytics. It transforms and merges structured data at very high speed while
consuming minimal time and resources.

 Employ the user-friendly Graphical User Interface to pick from a fast-growing collection

of algorithms. You can even develop custom algorithms to monitor ETL insights.

 The GraphFrames package permits you to perform graph operations on data frames (see the
sketch after this list). This includes leveraging the Catalyst optimizer for graph queries. This
critical tool possesses a selection of distributed algorithms.

 The latter’s purpose is to process graph structures that include an implementation of

Google’s highly acclaimed PageRank algorithm. These special algorithms employ Spark

Core’s RDD approach to modeling important data.
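
A hedged sketch of graph analytics through the GraphFrames package mentioned above (GraphFrames is a separate add-on, typically attached with --packages graphframes:graphframes:<version>; the tiny vertex/edge data below is illustrative):

from graphframes import GraphFrame

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)                         # property graph over DataFrames
ranks = g.pageRank(resetProbability=0.15, maxIter=10)   # the PageRank algorithm noted above
ranks.vertices.select("id", "pagerank").show()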

2. MLlib Tool

 MLlib is a library that contains basic Machine Learning services. The library offers

various kinds of Machine Learning algorithms that make possible many operations on

data with the object of obtaining meaningful insights.

 The spark platform bundles libraries in order to apply graph analysis techniques as well

as machine learning to data at scale.

 The MLlib tool has a framework for developing machine learning pipelines enabling

simple implementation of transformations, feature extraction as well as selections on any

particular structured dataset. The former includes rudimentary machine learning that

includes filtering, regression, classification as well as clustering.

 However, facilities for training deep neural networks as well as modeling are not

available. MLlib supplies robust algorithms as well as lightning speed in order to build as

well as maintain machine learning libraries that drive business intelligence.

 It also operates natively on top of Apache Spark, delivering fast and extremely scalable
machine learning.

3. Spark Streaming Tool

 This tool’s purpose is to process live streams of data. There occurs real-time processing

of data produced by different sources. Instances of this kind of data are messages having

status updates posted by visitors, log files and others.

 This tool also leverages Spark Core’s speedy scheduling capability in order to execute

streaming analytics. Data is ingested in mini-batches.

 Subsequently, RDD (Resilient Distributed Dataset) transformations are performed on the

mini-batches of data. Spark Streaming enables fault-tolerant, high-throughput processing of
live data streams. The core stream unit is the DStream (a minimal word-count sketch follows
this list).

 The latter simply put is a series of Resilient Distributed Datasets whose function is to

process the real-time data. This useful tool extended the Apache Spark paradigm of batch

processing into streaming. This was achieved by breaking down the stream into a

succession of micro batches.

 The latter was then manipulated by employing the Apache Spark API. Spark Streaming is

the engine of robust applications that need real-time data.

 The former has the bigdata platform’s reliable fault tolerance making it extremely

attractive for the purpose of development. Spark Streaming introduces interactive

analytics abilities for live data sourced from almost any common repository source.
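
A minimal DStream word-count sketch using the classic Spark Streaming API (it assumes a text source on localhost:9999, for example one started with `nc -lk 9999`; the host, port and batch interval are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)      # the live stream (a DStream)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print the counts of each micro-batch

ssc.start()
ssc.awaitTermination()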

4. Spark SQL Tool


This is a newly introduced module in Spark that combines relational processing with the

platform’s functional programming interface. There exists support for querying data through the

Hive Query Language as well as through Standard SQL. Spark SQL consists of 4 libraries:

 SQL Service

 Interpreter and Optimizer

 Data Frame API

 Data Source API

This tool’s function is to work with structured data. The former gives integrated access to the

most common data sources. This includes JDBC, JSON, Hive, Avro and more. The tool sorts

data into labeled columns as well as rows perfect for dispatching the results of high-speed

queries. Spark SQL integrates smoothly with newly introduced as well as already existing Spark

programs resulting in minimal computing expenses as well as superior performance. Apache

Spark employs a query optimizer named Catalyst which studies data and queries with the

objective of creating an efficient query plan for computation as well as data locality. The plan

will execute the necessary calculations across the cluster. Currently, it is advised to use the Spark

SQL interface of datasets as well as data frames for the purpose of development.
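
A small sketch of querying structured data with Spark SQL (it assumes an existing SparkSession named spark and a local file people.json with name and age fields; both are illustrative):

people = spark.read.json("people.json")         # one of the built-in data sources
people.createOrReplaceTempView("people")        # expose the DataFrame to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

# The equivalent DataFrame query; Catalyst optimizes both forms the same way
people.filter(people.age >= 18).select("name", "age").show()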

5. Spark Core Tool

 This is the basic building block of the platform. Among other things, it consists of
components for memory management, job scheduling and more. Spark Core hosts the RDD API,
which provides the operations used to build and manipulate data in RDDs (see the sketch
after this list).

 Distributed task dispatching as well as fundamental I/O functionalities are also provided

by the core. When benchmarked against Apache Hadoop components the Spark

Application Programming Interface is pretty simple and easy to use for developers.

 The API conceals a large part of the complexity involved in a distributed processing

engine behind relatively simple method invocations.

 Spark operates in a distributed way by merging a driver core process which splits a

particular Spark application into multiple tasks as well as distributes them among

numerous processes that perform the job. These particular executions could be scaled up

or down depending on the application’s requirements.

 All the tools belonging to the Spark ecosystem interact smoothly and run well while

consuming minimal overhead. This makes Spark both an extremely scalable as well as a

very powerful platform. Work is ongoing in order to improve the tools in terms of both

performance and convenient usability.
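
A minimal RDD sketch of the Spark Core API (the driver splits the job into tasks that the executors run in parallel; the numbers and partition count are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="CoreExample")
numbers = sc.parallelize(range(1, 1001), numSlices=4)   # distributed dataset in 4 partitions

squares = numbers.map(lambda x: x * x)                  # transformation (lazy)
total = squares.reduce(lambda a, b: a + b)              # action triggers distributed execution
print(total)

sc.stop()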

Algorithms – Classification

Spark – Overview
Apache Spark is a lightning-fast real-time processing framework. It does in-memory
computations to analyze data in real time. It came into the picture because Apache Hadoop
MapReduce performed only batch processing and lacked a real-time processing capability.
Hence, Apache Spark was introduced, as it can perform stream processing in real time and can
also take care of batch processing.
Apart from real-time and batch processing, Apache Spark supports interactive queries and
iterative algorithms also. Apache Spark has its own cluster manager, where it can host its
application. It leverages Apache Hadoop for both storage and processing. It
uses HDFS (Hadoop Distributed File system) for storage and it can run Spark applications
on YARN as well.
PySpark – Overview
Apache Spark is written in Scala programming language. To support Python with Spark,
Apache Spark Community released a tool, PySpark. Using PySpark, you can work
with RDDs in Python programming language also. It is because of a library called Py4j that
they are able to achieve this.
PySpark offers PySpark Shell which links the Python API to the spark core and initializes the
Spark context. Majority of data scientists and analytics experts today use Python because of its
rich library set. Integrating Python with Spark is a boon to them.
Apache Spark offers a Machine Learning API called MLlib. PySpark has this machine learning
API in Python as well. It supports different kinds of algorithms, which are mentioned below −
 mllib.classification − The spark.mllib package supports various methods for binary
classification, multiclass classification and regression analysis. Some of the most
popular algorithms in classification are Random Forest, Naive Bayes, Decision Tree,
etc.
 mllib.clustering − Clustering is an unsupervised learning problem, whereby you aim to
group subsets of entities with one another based on some notion of similarity.

 mllib.fpm − Frequent pattern matching is mining frequent items, itemsets,
subsequences or other substructures that are usually among the first steps to analyze a
large-scale dataset. This has been an active research topic in data mining for years.
 mllib.linalg − MLlib utilities for linear algebra.
 mllib.recommendation − Collaborative filtering is commonly used for recommender
systems. These techniques aim to fill in the missing entries of a user item association
matrix.
 spark.mllib − It currently supports model-based collaborative filtering, in which users
and products are described by a small set of latent factors that can be used to predict
missing entries. spark.mllib uses the Alternating Least Squares (ALS) algorithm to learn
these latent factors.
 mllib.regression − Linear regression belongs to the family of regression algorithms.
The goal of regression is to find relationships and dependencies between variables. The
interface for working with linear regression models and model summaries is similar to
the logistic regression case.
There are other algorithms, classes and functions as well in the mllib
package. For now, let us look at a demonstration of pyspark.mllib.
The following example uses collaborative filtering with the ALS algorithm to build
a recommendation model and evaluate it on the training data.
Dataset used − test.data
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0
--------------------------------------recommend.py----------------------------------------
from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

if __name__ == "__main__":
    sc = SparkContext(appName="Pspark mllib Example")
    data = sc.textFile("test.data")
    ratings = data.map(lambda l: l.split(','))\
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))

    # Save and load model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
--------------------------------------recommend.py----------------------------------------
Command − The command will be as follows −
$SPARK_HOME/bin/spark-submit recommend.py
Output − The output of the above command will be −
Mean Squared Error = 1.20536041839e-05

Logistic regression
Logistic regression is a popular method to predict a categorical response. It is a special case
of Generalized Linear models that predicts the probability of the outcomes.
In spark.ml logistic regression can be used to predict a binary outcome by using binomial
logistic regression, or it can be used to predict a multiclass outcome by using multinomial
logistic regression. Use the family parameter to select between these two algorithms, or
leave it unset and Spark will infer the correct variant.

Multinomial logistic regression can be used for binary classification by setting


the family param to “multinomial”. It will produce two sets of coefficients and two
intercepts.

When fitting LogisticRegressionModel without intercept on dataset with constant nonzero


column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior
is the same as R glmnet but different from LIBSVM.

Binomial logistic regression
For more background and more details about the implementation of binomial logistic
regression, refer to the documentation of logistic regression in spark.mllib.

Examples

The following example shows how to train binomial and multinomial logistic regression
models for binary classification with elastic net
regularization. elasticNetParam corresponds to α and regParam corresponds to λ.

 Scala
 Java
 Python
 R
More details on parameters can be found in the Scala API documentation.

import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for logistic regression


println(s"Coefficients: ${lrModel.coefficients} Intercept:
${lrModel.intercept}")

// We can also use the multinomial family for binary classification


val mlr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFamily("multinomial")
val mlrModel = mlr.fit(training)

// Print the coefficients and intercepts for logistic regression with


multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in
the Spark repo.

The spark.ml implementation of logistic regression also supports extracting a summary of


the model over the training set. Note that the predictions and metrics which are stored
as DataFrame in LogisticRegressionSummary are annotated @transient and hence only
available on the driver.

 Scala

 Java
 Python
LogisticRegressionTrainingSummary provides a summary for
a LogisticRegressionModel. In the case of binary classification, certain additional metrics
are available, e.g. ROC curve. The binary summary can be accessed via
the binarySummary method. See BinaryLogisticRegressionTrainingSummary.

Continuing the earlier example:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.max
import spark.implicits._
// Extract the summary from the returned LogisticRegressionModel instance
trained in the earlier
// example
val trainingSummary = lrModel.binarySummary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
// Obtain the receiver-operating characteristic as a dataframe and
areaUnderROC.
val roc = trainingSummary.roc
roc.show()
println(s"areaUnderROC: ${trainingSummary.areaUnderROC}")

// Set the model threshold to maximize F-Measure


val fMeasure = trainingSummary.fMeasureByThreshold
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
.select("threshold").head().getDouble(0)
lrModel.setThreshold(bestThreshold)
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala" in the
Spark repo.

Multinomial logistic regression


Multiclass classification is supported via multinomial logistic (softmax) regression. In
multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix
of dimension K×J, where K is the number of outcome classes and J is the number of
features. If the algorithm is fit with an intercept term then a length-K vector of intercepts is
available.

Multinomial coefficients are available as coefficientMatrix and intercepts are available


as interceptVector.

coefficients and intercept methods on a logistic regression model trained with


multinomial family are not supported.
Use coefficientMatrix and interceptVector instead.

The conditional probabilities of the outcome classes k ∈ {1, 2, …, K} are modeled using the softmax function:

$$P(Y=k \mid X, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot X + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot X + \beta_{0k'}}}$$

We minimize the weighted negative log-likelihood, using a multinomial response model, with an elastic-net penalty to control for overfitting:

$$\min_{\beta, \beta_0} \; -\left[\sum_{i=1}^{L} w_i \cdot \log P(Y = y_i \mid x_i)\right] + \lambda\left[\tfrac{1}{2}(1-\alpha)\,\|\beta\|_2^2 + \alpha\,\|\beta\|_1\right]$$

For a detailed derivation please see here.

Examples

The following example shows how to train a multiclass logistic regression model with elastic
net regularization, as well as extract the multiclass training summary for evaluating the
model.

 Scala
 Java
 Python
 R
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark
.read
.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)

// Fit the model


val lrModel = lr.fit(training)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")
val trainingSummary = lrModel.summary
// Obtain the objective per iteration
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(println)
// for multiclass, we can inspect metrics on a per-label basis
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate,
label) =>
println(s"label $label: $rate")
}
println("True positive rate by label:")
trainingSummary.truePositiveRateByLabel.zipWithIndex.foreach { case (rate,
label) =>
println(s"label $label: $rate")
}

println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) =>
println(s"label $label: $prec")
}
println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) =>
println(s"label $label: $rec")
}

println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) =>
println(s"label $label: $f")
}

val accuracy = trainingSummary.accuracy


val falsePositiveRate = trainingSummary.weightedFalsePositiveRate
val truePositiveRate = trainingSummary.weightedTruePositiveRate
val fMeasure = trainingSummary.weightedFMeasure
val precision = trainingSummary.weightedPrecision
val recall = trainingSummary.weightedRecall
println(s"Accuracy: $accuracy\nFPR: $falsePositiveRate\nTPR:
$truePositiveRate\n" +
s"F-measure: $fMeasure\nPrecision: $precision\nRecall: $recall")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.s
cala" in the Spark repo.

Decision tree classifier


Decision trees are a popular family of classification and regression methods. More
information about the spark.ml implementation can be found further in the section on
decision trees.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use two feature
transformers to prepare the data; these help index categories for the label and categorical
features, adding metadata to the DataFrame which the Decision Tree algorithm can
recognize.

 Scala
 Java
 Python
 R
More details on parameters can be found in the Scala API documentation.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}

// Load the data stored in LIBSVM format as a DataFrame.


val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4) // features with > 4 distinct values are treated as
continuous.
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a DecisionTree model.


val dt = new DecisionTreeClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")

// Convert indexed labels back to original labels.


val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.


predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]


println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala" in the Spark
repo.

Random forest classifier


Random forests are a popular family of classification and regression methods. More
information about the spark.ml implementation can be found further in the section on
random forests.

Examples

The following examples load a dataset in LibSVM format, split it into training and
test sets, train on the first dataset, and then evaluate on the held-out test set. We use two
feature transformers to prepare the data; these help index categories for the label and
categorical features, adding metadata to the DataFrame which the tree-based algorithms
can recognize.

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel,
RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.


val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)

// Convert indexed labels back to original labels.


val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.


val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]


println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala" in the Spark
repo.

Gradient-boosted tree classifier


Gradient-boosted trees (GBTs) are a popular classification and regression method using
ensembles of decision trees. More information about the spark.ml implementation can be
found further in the section on GBTs.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use two feature
transformers to prepare the data; these help index categories for the label and categorical
features, adding metadata to the DataFrame which the tree-based algorithms can
recognize.

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel,
GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
.setFeatureSubsetStrategy("auto")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))

// Chain indexers and GBT in a Pipeline.


val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.


predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.


val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
println(s"Learned classification GBT model:\n ${gbtModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala" in the
Spark repo.

Multilayer perceptron classifier


Multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial
neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to
the next layer in the network. Nodes in the input layer represent the input data. All other
nodes map inputs to outputs by a linear combination of the inputs with the node's
weights w and bias b and applying an activation function. This can be written in matrix
form for MLPC with K+1 layers as follows:

$$y(x) = f_K(\ldots f_2(w_2^T f_1(w_1^T x + b_1) + b_2)\ldots + b_K)$$

Nodes in intermediate layers use the sigmoid (logistic) function:

$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$

Nodes in the output layer use the softmax function:

$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$

The number of nodes N in the output layer corresponds to the number of classes.

MLPC employs backpropagation for learning the model. We use the logistic loss function for
optimization and L-BFGS as an optimization routine.

Examples

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4
// and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)

// train the model


val model = trainer.fit(train)

// compute accuracy on the test set


val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy")

println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")


Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala" in the
Spark repo.

Linear Support Vector Machine


A support vector machine constructs a hyperplane or set of hyperplanes in a high- or
infinite-dimensional space, which can be used for classification, regression, or other tasks.
Intuitively, a good separation is achieved by the hyperplane that has the largest distance to
the nearest training-data points of any class (so-called functional margin), since in general
the larger the margin the lower the generalization error of the classifier. LinearSVC in Spark
ML supports binary classification with linear SVM. Internally, it optimizes the Hinge
Loss using OWLQN optimizer.

Examples

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lsvc = new LinearSVC()
.setMaxIter(10)
.setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)

// Print the coefficients and intercept for linear svc


println(s"Coefficients: ${lsvcModel.coefficients} Intercept:
${lsvcModel.intercept}")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LinearSVCExample.scala" in
the Spark repo.

One-vs-Rest classifier (a.k.a. One-vs-All)


OneVsRest is an example of a machine learning reduction for performing multiclass
classification given a base classifier that can perform binary classification efficiently. It is also
known as “One-vs-All.”

OneVsRest is implemented as an Estimator. For the base classifier, it takes instances


of Classifier and creates a binary classification problem for each of the k classes. The
classifier for class i is trained to predict whether the label is i or not, distinguishing class i
from all other classes.

Predictions are done by evaluating each binary classifier and the index of the most confident
classifier is output as label.

Examples

The example below demonstrates how to load the Iris dataset, parse it as a DataFrame and
perform multiclass classification using OneVsRest. The test error is calculated to measure
the algorithm accuracy.

 Scala
 Java

 Python
Refer to the Scala API docs for more details.

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}


import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// load data file.


val inputData = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
// generate the train/test split.
val Array(train, test) = inputData.randomSplit(Array(0.8, 0.2))
// instantiate the base classifier
val classifier = new LogisticRegression()
.setMaxIter(10)
.setTol(1E-6)
.setFitIntercept(true)
// instantiate the One Vs Rest Classifier.
val ovr = new OneVsRest().setClassifier(classifier)
// train the multiclass model.
val ovrModel = ovr.fit(train)

// score the model on test data.


val predictions = ovrModel.transform(test)
// obtain evaluator.
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy")
// compute the classification error on test data.
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1 - accuracy}")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/OneVsRestExample.scala"
in the Spark repo.

Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic, multiclass classifiers based on
applying Bayes’ theorem with strong (naive) independence assumptions between every pair
of features.

Naive Bayes can be trained very efficiently. With a single pass over the training data, it
computes the conditional probability distribution of each feature given each label. For
prediction, it applies Bayes’ theorem to compute the conditional probability distribution of
each label given an observation.

MLlib supports Multinomial naive Bayes, Complement naive Bayes, Bernoulli naive
Bayes and Gaussian naive Bayes.

Input data: These Multinomial, Complement and Bernoulli models are typically used
for document classification. Within that context, each observation is a document and each
feature represents a term. A feature’s value is the frequency of the term (in Multinomial or
Complement Naive Bayes) or a zero or one indicating whether the term was found in the
document (in Bernoulli Naive Bayes). Feature values for Multinomial and Bernoulli models
must be non-negative. The model type is selected with an optional parameter “multinomial”,

“complement”, “bernoulli” or “gaussian”, with “multinomial” as the default. For
document classification, the input feature vectors should usually be sparse vectors. Since the
training data is only used once, it is not necessary to cache it.

Additive smoothing can be used by setting the parameter λ (default to 1.0).

Examples

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed =
1234L)

// Train a NaiveBayes model.


val model = new NaiveBayes()
.fit(trainingData)

// Select example rows to display.


val predictions = model.transform(testData)
predictions.show()

// Select (prediction, true label) and compute test error


val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala"
in the Spark repo.

Factorization machines classifier


For more background and more details about the implementation of factorization machines,
refer to the Factorization Machines section.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We scale features to be
between 0 and 1 to prevent the exploding gradient problem.

 Scala
 Java

 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{FMClassificationModel,
FMClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, MinMaxScaler,
StringIndexer}

// Load and parse the data file, converting it to a DataFrame.


val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.


// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Scale features.
val featureScaler = new MinMaxScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a FM model.
val fm = new FMClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("scaledFeatures")
.setStepSize(0.001)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Create a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureScaler, fm, labelConverter))
// Train model.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test accuracy.


val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
val fmModel = model.stages(2).asInstanceOf[FMClassificationModel]
println(s"Factors: ${fmModel.factors} Linear: ${fmModel.linear} " +
s"Intercept: ${fmModel.intercept}")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/FMClassifierExample.scala"
in the Spark repo.

Regression

Linear regression
The interface for working with linear regression models and model summaries is similar to
the logistic regression case.

When fitting LinearRegressionModel without intercept on dataset with constant nonzero


column by “l-bfgs” solver, Spark MLlib outputs zero coefficients for constant nonzero
columns. This behavior is the same as R glmnet but different from LIBSVM.

Examples

The following example demonstrates training an elastic net regularized linear regression
model and extracting model summary statistics.

 Scala
 Java
 Python
 R
More details on parameters can be found in the Scala API documentation.

import org.apache.spark.ml.regression.LinearRegression

// Load training data


val training = spark.read.format("libsvm")
.load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)

// Fit the model


val lrModel = lr.fit(training)

// Print the coefficients and intercept for linear regression


println(s"Coefficients: ${lrModel.coefficients} Intercept:
${lrModel.intercept}")
// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory:
[${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala" in the
Spark repo.

Generalized linear regression


Contrasted with linear regression where the output is assumed to follow a Gaussian
distribution, generalized linear models (GLMs) are specifications of linear models where the
response variable $Y_i$ follows some distribution from the exponential family of
distributions. Spark’s GeneralizedLinearRegression interface allows for flexible
specification of GLMs which can be used for various types of prediction problems including
linear regression, Poisson regression, logistic regression, and others. Currently in spark.ml,
only a subset of the exponential family distributions are supported and they are
listed below.

NOTE: Spark currently only supports up to 4096 features through


its GeneralizedLinearRegression interface, and will throw an exception if this constraint
is exceeded. See the advanced section for more details. Still, for linear and logistic
regression, models with an increased number of features can be trained using
the LinearRegression and LogisticRegression estimators.

GLMs require exponential family distributions that can be written in their “canonical” or
“natural” form, aka natural exponential family distributions. The form of a natural
exponential family distribution is given as:

$$f_Y(y \mid \theta, \tau) = h(y, \tau)\,\exp\!\left(\frac{\theta \cdot y - A(\theta)}{d(\tau)}\right)$$

where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter. In a GLM the
response variable $Y_i$ is assumed to be drawn from a natural exponential family distribution:

$$Y_i \sim f(\,\cdot \mid \theta_i, \tau)$$

where the parameter of interest $\theta_i$ is related to the expected value of the response
variable $\mu_i$ by

$$\mu_i = A'(\theta_i)$$

Here, $A'(\theta_i)$ is defined by the form of the distribution selected. GLMs also allow
specification of a link function, which defines the relationship between the expected value of
the response variable $\mu_i$ and the so-called linear predictor $\eta_i$:

$$g(\mu_i) = \eta_i = \vec{x}_i^{\,T} \cdot \vec{\beta}$$

Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified
relationship between the parameter of interest $\theta$ and the linear predictor $\eta$. In this
case, the link function $g(\mu)$ is said to be the “canonical” link function:

$$\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i$$

A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function:

$$\max_{\vec{\beta}}\; \mathcal{L}(\vec{\theta} \mid \vec{y}, X) = \prod_{i=1}^{N} h(y_i, \tau)\,\exp\!\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right)$$

where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$ by

$$\theta_i = A'^{-1}\!\left(g^{-1}(\vec{x}_i \cdot \vec{\beta})\right)$$

Spark’s generalized linear regression interface also provides summary statistics for
diagnosing the fit of GLM models, including residuals, p-values, deviances, the Akaike
information criterion, and others.

See here for a more comprehensive review of GLMs and their applications.

Available families
Family Response Type Supported Links

Gaussian Continuous Identity*, Log, Inverse

Binomial Binary Logit*, Probit, CLogLog

Poisson Count Log*, Identity, Sqrt

Gamma Continuous Inverse*, Identity, Log

Tweedie Zero-inflated continuous Power link function

* Canonical Link

Examples

The following example demonstrates training a GLM with a Gaussian response and identity
link function and extracting model summary statistics.

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Load training data


val dataset = spark.read.format("libsvm")
.load("data/mllib/sample_linear_regression_data.txt")

val glr = new GeneralizedLinearRegression()


.setFamily("gaussian")
.setLink("identity")
.setMaxIter(10)
.setRegParam(0.3)
// Fit the model
val model = glr.fit(dataset)

// Print the coefficients and intercept for the generalized linear regression model
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")

// Summarize the model over the training set and print out some metrics
val summary = model.summary
println(s"Coefficient Standard Errors:
${summary.coefficientStandardErrors.mkString(",")}")
println(s"T Values: ${summary.tValues.mkString(",")}")
println(s"P Values: ${summary.pValues.mkString(",")}")
println(s"Dispersion: ${summary.dispersion}")
println(s"Null Deviance: ${summary.nullDeviance}")
println(s"Residual Degree Of Freedom Null:
${summary.residualDegreeOfFreedomNull}")
println(s"Deviance: ${summary.deviance}")
println(s"Residual Degree Of Freedom: ${summary.residualDegreeOfFreedom}")
println(s"AIC: ${summary.aic}")
println("Deviance Residuals: ")
summary.residuals().show()
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala" in the
Spark repo.

Decision tree regression


Decision trees are a popular family of classification and regression methods. More
information about the spark.ml implementation can be found further in the section on
decision trees.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use a feature
transformer to index categorical features, adding metadata to the DataFrame which the
Decision Tree algorithm can recognize.

 Scala
 Java
 Python
 R
More details on parameters can be found in the Scala API documentation.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Load the data stored in LIBSVM format as a DataFrame.


val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.


// Here, we treat features with > 4 distinct values as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a DecisionTree model.
val dt = new DecisionTreeRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
// Chain indexer and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt))

// Train model. This also runs the indexer.


val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.


val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]


println(s"Learned regression tree model:\n ${treeModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala" in the Spark
repo.

Random forest regression


Random forests are a popular family of classification and regression methods. More
information about the spark.ml implementation can be found further in the section on
random forests.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use a feature
transformer to index categorical features, adding metadata to the DataFrame which the
tree-based algorithms can recognize.

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel,
RandomForestRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.


val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.


val pipeline = new Pipeline()
.setStages(Array(featureIndexer, rf))
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.


val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]


println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/RandomForestRegressorExample.scala" in the Spark
repo.

Gradient-boosted tree regression


Gradient-boosted trees (GBTs) are a popular regression method using ensembles of
decision trees. More information about the spark.ml implementation can be found further
in the section on GBTs.

Examples

Note: For this example dataset, GBTRegressor actually only needs 1 iteration, but that will
not be true in general.

 Scala
 Java
 Python
 R
Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.


// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)

// Chain indexer and GBT in a Pipeline.


val pipeline = new Pipeline()
.setStages(Array(featureIndexer, gbt))

// Train model. This also runs the indexer.


val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.


val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println(s"Learned regression GBT model:\n ${gbtModel.toDebugString}")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeRegressorExample.scala" in the
Spark repo.

Survival regression
In spark.ml, we implement the Accelerated failure time (AFT) model which is a parametric
survival regression model for censored data. It describes a model for the log of survival time,
so it’s often called a log-linear model for survival analysis. Different from a Proportional
hazards model designed for the same purpose, the AFT model is easier to parallelize
because each instance contributes to the objective function independently.

Given the values of the covariates x', for random lifetime t_i of subjects i = 1, ..., n, with possible right-censoring, the likelihood function under the AFT model is given as:

L(\beta,\sigma) = \prod_{i=1}^{n} \left[ \frac{1}{\sigma} f_{0}\left( \frac{\log t_{i} - x'\beta}{\sigma} \right) \right]^{\delta_{i}} S_{0}\left( \frac{\log t_{i} - x'\beta}{\sigma} \right)^{1-\delta_{i}}

where \delta_{i} is the indicator that the event has occurred, i.e. the observation is uncensored. Using \epsilon_{i} = \frac{\log t_{i} - x'\beta}{\sigma}, the log-likelihood function assumes the form:

\iota(\beta,\sigma) = \sum_{i=1}^{n} \left[ -\delta_{i}\log\sigma + \delta_{i}\log f_{0}(\epsilon_{i}) + (1-\delta_{i})\log S_{0}(\epsilon_{i}) \right]

where S_{0}(\epsilon_{i}) is the baseline survivor function and f_{0}(\epsilon_{i}) is the corresponding density function.

The most commonly used AFT model is based on the Weibull distribution of the survival time. The Weibull distribution for lifetime corresponds to the extreme value distribution for the log of the lifetime, and the S_{0}(\epsilon) function is:

S_{0}(\epsilon_{i}) = \exp(-e^{\epsilon_{i}})

and the f_{0}(\epsilon_{i}) function is:

f_{0}(\epsilon_{i}) = e^{\epsilon_{i}} \exp(-e^{\epsilon_{i}})

The log-likelihood function for the AFT model with a Weibull distribution of lifetime is:

\iota(\beta,\sigma) = -\sum_{i=1}^{n} \left[ \delta_{i}\log\sigma - \delta_{i}\epsilon_{i} + e^{\epsilon_{i}} \right]

Since minimizing the negative log-likelihood is equivalent to maximum a posteriori probability, the loss function we optimize is -\iota(\beta,\sigma). The gradient functions for \beta and \log\sigma respectively are:

\frac{\partial(-\iota)}{\partial\beta} = \sum_{i=1}^{n} \left[ \delta_{i} - e^{\epsilon_{i}} \right] \frac{x_{i}}{\sigma}

\frac{\partial(-\iota)}{\partial(\log\sigma)} = \sum_{i=1}^{n} \left[ \delta_{i} + (\delta_{i} - e^{\epsilon_{i}})\epsilon_{i} \right]

The AFT model can be formulated as a convex optimization problem, i.e. the task of finding a minimizer of the convex function -\iota(\beta,\sigma) that depends on the coefficient vector \beta and the log of the scale parameter \log\sigma. The optimization algorithm underlying the implementation is L-BFGS. The implementation matches the result of R's survival function survreg.

When fitting an AFTSurvivalRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for such constant nonzero columns. This behavior differs from R's survival::survreg.

Examples

Refer to the Scala API docs for more details.
Refer to the Scala API docs for more details.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression
val training = spark.createDataFrame(Seq(
(1.218, 1.0, Vectors.dense(1.560, -0.605)),
(2.949, 0.0, Vectors.dense(0.346, 2.158)),
(3.627, 0.0, Vectors.dense(1.380, 0.231)),
(0.273, 1.0, Vectors.dense(0.520, 1.151)),
(4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")
val quantileProbabilities = Array(0.3, 0.6)
val aft = new AFTSurvivalRegression()
.setQuantileProbabilities(quantileProbabilities)
.setQuantilesCol("quantiles")

val model = aft.fit(training)


// Print the coefficients, intercept and scale parameter for AFT survival regression
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale: ${model.scale}")
model.transform(training).show(false)
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala" in the Spark
repo.

Isotonic regression
Isotonic regression belongs to the family of regression algorithms. Formally, isotonic regression is a problem where, given a finite set of real numbers Y = y_1, y_2, ..., y_n representing observed responses and X = x_1, x_2, ..., x_n the unknown response values to be fitted, we find a function that minimizes

f(x) = \sum_{i=1}^{n} w_{i} (y_{i} - x_{i})^{2}    (1)

with respect to a complete order subject to x_1 \le x_2 \le ... \le x_n, where the w_i are positive weights. The resulting function is called isotonic regression and it is unique. It can be viewed as a least squares problem under an order restriction. Essentially, isotonic regression is a monotonic function best fitting the original data points.

We implement a pool adjacent violators algorithm which uses an approach to parallelizing


isotonic regression. The training input is a DataFrame which contains three columns: label, features and weight. Additionally, the IsotonicRegression algorithm has one optional parameter called isotonic, defaulting to true. This argument specifies whether the isotonic regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).

Training returns an IsotonicRegressionModel that can be used to predict labels for both
known and unknown features. The result of isotonic regression is treated as piecewise linear
function. The rules for prediction therefore are:

 If the prediction input exactly matches a training feature then associated
prediction is returned. In case there are multiple predictions with the same feature then
one of them is returned. Which one is undefined (same as java.util.Arrays.binarySearch).
 If the prediction input is lower or higher than all training features then prediction with
lowest or highest feature is returned respectively. In case there are multiple predictions
with the same feature then the lowest or highest is returned respectively.
 If the prediction input falls between two training features then prediction is treated as
piecewise linear function and interpolated value is calculated from the predictions of the
two closest features. In case there are multiple values with the same feature then the
same rules as in previous point are used.

Examples

Refer to the IsotonicRegression Scala docs for details on the API.

import org.apache.spark.ml.regression.IsotonicRegression

// Loads data.
val dataset = spark.read.format("libsvm")
.load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
// Trains an isotonic regression model.
val ir = new IsotonicRegression()
val model = ir.fit(dataset)
println(s"Boundaries in increasing order: ${model.boundaries}\n")
println(s"Predictions associated with the boundaries:
${model.predictions}\n")
// Makes predictions.
model.transform(dataset).show()
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala" in the Spark repo.

Factorization machines regressor


For more background and more details about the implementation of factorization machines,
refer to the Factorization Machines section.

Examples

The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We scale features to be
between 0 and 1 to prevent the exploding gradient problem.

Refer to the Scala API docs for more details.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}

// Load and parse the data file, converting it to a DataFrame.


val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Scale features.
val featureScaler = new MinMaxScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a FM model.
val fm = new FMRegressor()
.setLabelCol("label")
.setFeaturesCol("scaledFeatures")
.setStepSize(0.001)
// Create a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureScaler, fm))

// Train model.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.


predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.


val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val fmModel = model.stages(1).asInstanceOf[FMRegressionModel]


println(s"Factors: ${fmModel.factors} Linear: ${fmModel.linear} " +
s"Intercept: ${fmModel.intercept}")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/FMRegressorExample.scala" in the
Spark repo.

Introduction to Clustering
It is basically a type of unsupervised learning method. An unsupervised learning method is a
method in which we draw references from datasets consisting of input data without labeled
responses. Generally, it is used as a process to find meaningful structure, explanatory
underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is basically a collection of objects grouped on the basis of similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be classified into a single group; in a typical illustration we can distinguish three such clusters. Clusters do not have to be spherical, for example:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can recover such non-spherical clusters: data points are grouped on the basis that each point lies within a given distance constraint of its cluster, and various distance measures and techniques are used to identify outliers.
Why Clustering?
Clustering is very important because it determines the intrinsic grouping among the unlabelled data present. There is no single criterion for a good clustering; it depends on the user and on what criteria satisfy their need. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). The algorithm must make some assumptions about what constitutes the similarity of points, and each assumption leads to different but equally valid clusters.
Clustering Methods:
 Density-Based Methods: These methods consider the clusters as the dense regions having some similarities with, and differences from, the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
 Hierarchical-Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
 Agglomerative (bottom-up approach)
 Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
 Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. This method is used to optimize an objective criterion similarity function, such as when distance is a major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
 Grid-based Methods: In this method, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Clustering Algorithms:
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
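To make the K-means description concrete, here is a minimal PySpark sketch (not from the original notes); the toy feature vectors, k = 3, and the seed are assumptions chosen only for illustration, and an active SparkSession named spark is assumed, as in the later PySpark examples.

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

# Small toy dataset: each row is a 2-dimensional feature vector (assumed values).
data = spark.createDataFrame([
    (Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
    (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.2]),),
    (Vectors.dense([0.0, 9.0]),), (Vectors.dense([0.2, 9.1]),)
], ["features"])

# Partition the points into k = 3 clusters around the nearest cluster mean.
kmeans = KMeans(k=3, seed=1, featuresCol="features")
model = kmeans.fit(data)

# Assign each point to a cluster and inspect the learned cluster centers.
predictions = model.transform(data)
predictions.show()
print(model.clusterCenters())

# Silhouette score: closer to 1 means better-separated clusters.
evaluator = ClusteringEvaluator()
print("Silhouette:", evaluator.evaluate(predictions))

The silhouette score from ClusteringEvaluator gives a quick check of how well separated the resulting clusters are.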

Applications of Clustering in different fields
 Marketing: It can be used to characterize & discover customer segments for marketing purposes.
 Biology: It can be used for classification among different species of plants and animals.
 Libraries: It is used in clustering different books on the basis of topics and information.
 Insurance: It is used to acknowledge the customers, their policies and identify frauds.
 City Planning: It is used to make groups of houses and to study their values based on their geographical locations and other factors present.
 Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.
Dimensionality Reduction:

What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a very large number of features, dimensionality reduction techniques are required in such cases.

Dimensionality reduction technique can be defined as, "It is a way of
converting the higher dimensions dataset into lesser dimensions dataset ensuring
that it provides similar information." These techniques are widely used in machine
learning for obtaining a better fit predictive model while solving the classification and
regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice; this problem is commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases. If a machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction


Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation and training time is required with reduced feature dimensions.
o Reduced dimensions of the dataset's features help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction, which are
given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not known in advance and must be chosen.

Approaches of Dimension Reduction


There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection
Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of filters method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filtering method but more complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods check the different training iterations of


the machine learning model and evaluate the importance of each feature. Some
common techniques of Embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.
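As a hedged sketch of the filter approach listed above (here, the Chi-Square test), PySpark's ChiSqSelector keeps the features most associated with a categorical label; the toy data, the column names, and numTopFeatures=1 are assumptions chosen only for illustration, and a SparkSession named spark is assumed.

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

# Toy dataset: four input features and a categorical label (values are illustrative).
df = spark.createDataFrame([
    (Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)
], ["features", "label"])

# Keep the single feature most associated with the label by the Chi-Square test.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
selector.fit(df).transform(df).show(truncate=False)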

Feature Extraction:

Feature extraction is the process of transforming the space containing many dimensions
into space with fewer dimensions. This approach is useful when we want to keep the
whole information but use fewer resources while processing the information.

Some common feature extraction techniques are:

a. Principal Component Analysis


b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)


Principal Component Analysis is a statistical process that converts the observations of
correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling.

PCA works by considering the variance of each attribute, because attributes with high variance indicate a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
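A minimal PySpark sketch of PCA (assuming a SparkSession named spark); the sample 5-dimensional vectors and the choice of k = 2 components are illustrative assumptions, not part of the original notes.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

# Toy 5-dimensional feature vectors (illustrative values only).
data = spark.createDataFrame([
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
    (Vectors.dense([6.0, 1.0, 9.0, 8.0, 1.0]),)
], ["features"])

# Project onto the 2 directions of highest variance.
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(data)

model.transform(data).select("pcaFeatures").show(truncate=False)
print(model.explainedVariance)   # fraction of variance captured by each component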

Backward Feature Elimination


The backward feature elimination technique is mainly used while developing Linear
Regression or Logistic Regression model. Below steps are performed in this technique to
reduce the dimensionality or in feature selection:

o In this technique, firstly, all the n variables of the given dataset are taken to train the
model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n
times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance
of the model, and then we will drop that variable or features; after that, we will be left
with n-1 features.
o Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
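The loop described above can be sketched in PySpark as follows; the LinearRegression estimator, the RMSE metric, and the tolerance parameter are assumptions made only to turn the steps into runnable code, and in practice the evaluation would use a held-out validation set rather than the training data.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

def rmse_for(df, feature_cols, label_col="label"):
    # Assemble the candidate columns, fit a linear model and return its RMSE.
    assembled = VectorAssembler(inputCols=feature_cols,
                                outputCol="assembled_features").transform(df)
    model = LinearRegression(featuresCol="assembled_features",
                             labelCol=label_col).fit(assembled)
    predictions = model.transform(assembled)
    return RegressionEvaluator(labelCol=label_col, predictionCol="prediction",
                               metricName="rmse").evaluate(predictions)

def backward_elimination(df, feature_cols, label_col="label", tolerance=0.01):
    # Repeatedly drop the feature whose removal hurts RMSE the least, stopping
    # once every possible removal degrades RMSE by more than the tolerance.
    current = list(feature_cols)
    best_rmse = rmse_for(df, current, label_col)
    while len(current) > 1:
        candidates = [(c, rmse_for(df, [f for f in current if f != c], label_col))
                      for c in current]
        drop_col, new_rmse = min(candidates, key=lambda t: t[1])
        if new_rmse - best_rmse > tolerance:
            break
        current.remove(drop_col)
        best_rmse = new_rmse
    return current, best_rmse

# Hypothetical usage: backward_elimination(train_df, ["x1", "x2", "x3"])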

Forward Feature Selection


Forward feature selection follows the inverse process of the backward elimination
process. It means, in this technique, we don't eliminate the feature; instead, we will find
the best features that can produce the highest increase in the performance of the
model. Below steps are performed in this technique:

o We start with a single feature only, and progressively we will add each feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the performance of the
model.

Missing Value Ratio


If a dataset has too many missing values, then we drop those variables as they do not
carry much useful information. To perform this, we can set a threshold level, and if a
variable has missing values more than that threshold, we will drop that variable. The
higher the threshold value, the more efficient the reduction.
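A minimal PySpark sketch of the missing value ratio filter described above; the 40% threshold and the helper name are assumptions for illustration only.

import pyspark.sql.functions as F

def drop_mostly_missing(df, threshold=0.4):
    # Drop columns whose fraction of NULL values exceeds the threshold.
    total = df.count()
    null_ratio = df.select([
        (F.sum(F.col(c).isNull().cast("int")) / total).alias(c) for c in df.columns
    ]).first().asDict()
    keep = [c for c, ratio in null_ratio.items() if ratio <= threshold]
    return df.select(keep)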

Low Variance Filter


As with the missing value ratio technique, data columns with very little variation in the data carry less information. Therefore, we calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because low-variance features will not affect the target variable.
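A minimal PySpark sketch of the low variance filter for numeric columns; the threshold value is an assumption and would normally be tuned for the dataset (ideally after scaling the columns).

import pyspark.sql.functions as F

def drop_low_variance(df, numeric_cols, threshold=1e-3):
    # Drop numeric columns whose sample variance falls below the threshold.
    variances = df.select(
        [F.variance(F.col(c)).alias(c) for c in numeric_cols]).first().asDict()
    low_var = [c for c, v in variances.items() if v is not None and v < threshold]
    return df.drop(*low_var)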

High Correlation Filter


High correlation refers to the case when two variables carry approximately the same information, which can degrade the performance of the model. The correlation between independent numerical variables gives the calculated value of the correlation coefficient. If this value is higher than the threshold value, we can remove one of the two variables from the dataset, preferring to keep the variables that show a high correlation with the target variable.
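A hedged PySpark sketch of the high correlation filter; the 0.9 threshold and the choice of which variable in a correlated pair to drop are assumptions for illustration.

from itertools import combinations

def drop_highly_correlated(df, numeric_cols, threshold=0.9):
    # For each pair of columns whose absolute Pearson correlation exceeds the
    # threshold, drop one of the two (the choice of which to drop is arbitrary here).
    to_drop = set()
    for a, b in combinations(numeric_cols, 2):
        if a in to_drop or b in to_drop:
            continue
        if abs(df.stat.corr(a, b)) > threshold:
            to_drop.add(b)
    return df.drop(*to_drop)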

Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine
learning. This algorithm contains an in-built feature importance package, so we do not
need to program it separately. In this technique, we need to generate a large set of trees
against the target variable, and with the help of usage statistics of each attribute, we
need to find the subset of features.

The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric form using one-hot encoding.
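A hedged PySpark sketch of using a random forest's built-in feature importance; the DataFrame df, the column names, and numTrees=50 are assumptions made only for illustration.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# df is an assumed DataFrame with numeric feature columns and a numeric "label" column.
feature_cols = ["balance", "duration", "campaign"]   # assumed column names
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(assembled)

# featureImportances is a vector aligned with feature_cols; rank features by it.
importances = list(zip(feature_cols, model.featureImportances.toArray()))
for name, score in sorted(importances, key=lambda t: -t[1]):
    print(name, round(float(score), 4))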

Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to
the correlation with other variables, it means variables within a group can have a high
correlation between themselves, but they have a low correlation with variables of other
groups.

We can understand it with an example: suppose we have two variables, Income and Spend. These two variables have a high correlation, which means people with high income spend more, and vice versa. So such variables are put into one group, and that group is known as a factor. The number of these factors will be reduced as compared to the original dimension of the dataset.

Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network) whose main aim is to copy its inputs to its outputs. The input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has two main parts:

o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.

Term frequency-inverse document frequency (TF-IDF) is a feature


vectorization method widely used in text mining to reflect the importance of a
term to a document in the corpus. More details can be found
at: https://spark.apache.org/docs/latest/ml-features#feature-extractors

Stackoverflow TF: Both HashingTF and CountVectorizer can be used to


generate the term frequency vectors. A few important differences:

a. partially reversible (CountVectorizer) vs irreversible (HashingTF) - since hashing is not reversible, you cannot restore the original input from a hash vector. On the other hand, a count vector together with the model (index) can be used to restore the unordered input. As a consequence, models created using hashed input can be much harder to interpret and monitor.
b. memory and computational overhead - HashingTF requires only a single data scan and no additional memory beyond the original input and vector. CountVectorizer requires an additional scan over the data to build a model and additional memory to store the vocabulary (index). In the case of a unigram language model this is usually not a problem, but in the case of higher n-grams it can be prohibitively expensive or not feasible.
c. hashing depends on the size of the vector, the hashing function and the document. Counting depends on the size of the vector, the training corpus and the document.
d. a source of information loss - in the case of HashingTF it is dimensionality reduction with possible collisions. CountVectorizer discards infrequent tokens. How this affects downstream models depends on the particular use case and data.
HashingTF and CountVectorizer are the two popular algorithms used to generate term frequency vectors. They basically convert documents into a numerical representation which can be fed directly, or with further processing, into other algorithms like LDA, MinHash for Jaccard Distance, or Cosine Distance.

 t: a term
 d: a document
 D: the corpus
 |D|: the number of documents in the corpus
 TF(t, d): Term Frequency, the number of times that term t appears in document d
 DF(t, D): Document Frequency, the number of documents that contain term t
 IDF(t, D): Inverse Document Frequency, a numerical measure of how much information a term provides: IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
 TFIDF(t, d, D) = TF(t, d) · IDF(t, D): the product of TF and IDF

Let’s look at the example:

from pyspark.ml import Pipeline

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
(0, "Python python Spark Spark"),
(1, "Python SQL")],
["document", "sentence"])
sentenceData.show(truncate=False)
+--------+-------------------------+
|document|sentence |
+--------+-------------------------+
|0 |Python python Spark Spark|
|1 |Python SQL |
+--------+-------------------------+
Then, with |D| = 2 documents and IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)):

 IDF: python → log(3/3) = 0, spark → log(3/2) ≈ 0.4055, sql → log(3/2) ≈ 0.4055

 TFIDF: document 0: spark → 2 × 0.4055 ≈ 0.8109 (python → 0); document 1: sql → 1 × 0.4055 ≈ 0.4055 (python → 0)

Countvectorizer
Stackoverflow TF: CountVectorizer and CountVectorizerModel aim to help convert a collection
of text documents to vectors of token counts. When an a-priori dictionary is not available,
CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a
CountVectorizerModel. The model produces sparse representations for the documents over the
vocabulary, which can then be passed to other algorithms like LDA.

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
(0, "Python python Spark Spark"),
(1, "Python SQL")],
["document", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")


vectorizer = CountVectorizer(inputCol="words", outputCol="rawFeatures")

idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, vectorizer, idf])

model = pipeline.fit(sentenceData)
import numpy as np

total_counts = model.transform(sentenceData)\
.select('rawFeatures').rdd\
.map(lambda row: row['rawFeatures'].toArray())\
.reduce(lambda x,y: [x[i]+y[i] for i in range(len(y))])

vocabList = model.stages[1].vocabulary
d = {'vocabList':vocabList,'counts':total_counts}

spark.createDataFrame(np.array(list(d.values())).T.tolist(),list(d.keys())).show()
counts = model.transform(sentenceData).select('rawFeatures').collect()
counts

[Row(rawFeatures=SparseVector(3, {0: 2.0, 1: 2.0})),
 Row(rawFeatures=SparseVector(3, {0: 1.0, 2: 1.0}))]
+---------+------+
|vocabList|counts|
+---------+------+
| python| 3.0|
| spark| 2.0|
| sql| 1.0|
+---------+------+
model.transform(sentenceData).show(truncate=False)
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
|document|sentence |words |rawFeatures
|features |
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
|0 |Python python Spark Spark|[python, python, spark,
spark]|(3,[0,1],[2.0,2.0])|(3,[0,1],[0.0,0.8109302162163288])|
|1 |Python SQL |[python, sql]
|(3,[0,2],[1.0,1.0])|(3,[0,2],[0.0,0.4054651081081644])|
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def termsIdx2Term(vocabulary):
def termsIdx2Term(termIndices):
return [vocabulary[int(index)] for index in termIndices]
return udf(termsIdx2Term, ArrayType(StringType()))

vectorizerModel = model.stages[1]
vocabList = vectorizerModel.vocabulary
vocabList
['python', 'spark', 'sql']
rawFeatures = model.transform(sentenceData).select('rawFeatures')
rawFeatures.show()

+-------------------+
| rawFeatures|
+-------------------+
|(3,[0,1],[2.0,2.0])|
|(3,[0,2],[1.0,1.0])|
+-------------------+
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, DoubleType, IntegerType

indices_udf = udf(lambda vector: vector.indices.tolist(), ArrayType(IntegerType()))


values_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))

rawFeatures.withColumn('indices', indices_udf(F.col('rawFeatures')))\
.withColumn('values', values_udf(F.col('rawFeatures')))\
.withColumn("Terms", termsIdx2Term(vocabList)("indices")).show()
+-------------------+-------+---------------+---------------+
| rawFeatures|indices| values| Terms|
+-------------------+-------+---------------+---------------+
|(3,[0,1],[2.0,2.0])| [0, 1]|[2.0, 2.0, 0.0]|[python, spark]|
|(3,[0,2],[1.0,1.0])| [0, 2]|[1.0, 0.0, 1.0]| [python, sql]|
+-------------------+-------+---------------+---------------+
HashingTF
Stackoverflow TF: HashingTF is a Transformer which takes sets of terms and
converts those sets into fixed-length feature vectors. In text processing, a “set
of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw
feature is mapped into an index (term) by applying a hash function. The hash
function used here is MurmurHash 3. Then term frequencies are calculated
based on the mapped indices. This approach avoids the need to compute a
global term-to-index map, which can be expensive for a large corpus, but it
suffers from potential hash collisions, where different raw features may
become the same term after hashing.

from pyspark.ml import Pipeline


from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
(0, "Python python Spark Spark"),
(1, "Python SQL")],
["document", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")


vectorizer = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=5)

idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, vectorizer, idf])


model = pipeline.fit(sentenceData)
model.transform(sentenceData).show(truncate=False)
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
|document|sentence |words |rawFeatures
|features |
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
|0 |Python python Spark Spark|[python, python, spark,
spark]|(5,[0,4],[2.0,2.0])|(5,[0,4],[0.8109302162163288,0.0])|
|1 |Python SQL |[python, sql]
|(5,[1,4],[1.0,1.0])|(5,[1,4],[0.4054651081081644,0.0])|
+--------+-------------------------+------------------------------+------------------
-+----------------------------------+
Word2Vec
Word Embeddings
Word2Vec is one of the popular methods to implement word embeddings. Word embeddings (the following text and images come from Chris Bail, PhD, Duke University, so the copyright belongs to him) gained fame in the world of automated text analysis when it was demonstrated that they could be used to identify analogies. Figure 1 illustrates the output of a word embedding model where individual words are plotted in three-dimensional space generated by the model. By examining the adjacency of words in this space, word embedding models can complete analogies such as "Man is to woman as king is to queen." If you would like to explore what the output of a large word embedding model looks like in more detail, check out the visualization of most words in the English language that was produced using a word embedding model called GloVe.

Figure 1: output of a word embedding model


The Context Window
Word embeddings are created by identifying the words that occur within something called a
“Context Window.” The Figure below illustrates context windows of varied length for a single

sentence. The context window is defined by a string of words before and after a focal or “center”
word that will be used to train a word embedding model. Each center word and context words
can be represented as a vector of numbers that describe the presence or absence of unique words
within a dataset, which is perhaps why word embedding models are often described as “word
vector” models, or “word2vec” models.

Two Types of Embedding Models


Word embeddings are usually performed in one of two ways: “Continuous Bag of Words”
(CBOW) or a “Skip-Gram Model.” The figure below illustrates the differences between the two
models. The CBOW model reads in the context window words and tries to predict the most
likely center word. The Skip-Gram Model predicts the context words given the center word. The
examples above were created using the Skip-Gram model, which is perhaps most useful for
people who want to identify patterns within texts to represent them in multidimensional space,
whereas the CBOW model is more useful in practical applications such as predictive web search.

Word Embedding Models in PySpark
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

sentenceData = spark.createDataFrame([
    (0.0, "I love Spark"),
    (0.0, "I love python"),
    (1.0, "I think ML is awesome")],
    ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")


word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="feature")

pipeline = Pipeline(stages=[tokenizer, word2Vec])

model = pipeline.fit(sentenceData)
result = model.transform(sentenceData)
result.show()
+-----+--------------------+--------------------+--------------------+
|label| sentence| words| feature|
+-----+--------------------+--------------------+--------------------+
| 0.0| I love Spark| [i, love, spark]|[0.05594437588782...|
| 0.0| I love python| [i, love, python]|[-0.0350368790871...|
| 1.0|I think ML is awe...|[i, think, ml, is...|[0.01242086507845...|
+-----+--------------------+--------------------+--------------------+
w2v = model.stages[1]
w2v.getVectors().show()

+-------+-----------------------------------------------------------------+
|word |vector |
+-------+-----------------------------------------------------------------+
|is |[0.13657838106155396,0.060924094170331955,-0.03379475697875023] |
|awesome|[0.037024181336164474,-0.023855900391936302,0.0760037824511528] |
|i |[-0.0014482572441920638,0.049365971237421036,0.12016955763101578]|
|ml |[-0.14006119966506958,0.01626444421708584,0.042281970381736755] |
|spark |[0.1589149385690689,-0.10970081388950348,-0.10547549277544022] |
|think |[0.030011219903826714,-0.08994936943054199,0.16471518576145172] |
|love |[0.01036644633859396,-0.017782460898160934,0.08870164304971695] |
|python |[-0.11402882635593414,0.045119188725948334,-0.029877422377467155]|
+-------+-----------------------------------------------------------------+
from pyspark.sql.functions import format_number as fmt
w2v.findSynonyms("could", 2).select("word", fmt("similarity",
5).alias("similarity")).show()
+-------+----------+
| word|similarity|
+-------+----------+
|classes| 0.90232|
| i| 0.75424|
+-------+----------+
FeatureHasher
from pyspark.ml.feature import FeatureHasher

dataset = spark.createDataFrame([
(2.2, True, "1", "foo"),
(3.3, False, "2", "bar"),
(4.4, False, "3", "baz"),
(5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],


outputCol="features")

featurized = hasher.transform(dataset)

featurized.show(truncate=False)
+----+-----+---------+------+--------------------------------------------------------
+
|real|bool |stringNum|string|features
|
+----+-----+---------+------+--------------------------------------------------------
+
|2.2 |true |1 |foo
|(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2 |bar |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])
|
|4.4 |false|3 |baz |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])
|
|5.5 |false|4 |foo |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0])
|
+----+-----+---------+------+--------------------------------------------------------
+
RFormula
from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
[(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "CA", 15, 0.0)],
["id", "country", "hour", "clicked"])

formula = RFormula(
formula="clicked ~ country + hour",
featuresCol="features",
labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
+----------+-----+
| features|label|
+----------+-----+
|[0.0,18.0]| 1.0|
|[1.0,12.0]| 0.0|
|[1.0,15.0]| 0.0|
+----------+-----+

Feature Transform
Tokenizer
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words",


pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)
+-----------------------------------+------------------------------------------+-----
-+
|sentence |words
|tokens|
+-----------------------------------+------------------------------------------+-----
-+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5
|
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7
|
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat] |1
|
+-----------------------------------+------------------------------------------+-----
-+

+-----------------------------------+------------------------------------------+-----
-+
|sentence |words
|tokens|
+-----------------------------------+------------------------------------------+-----
-+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5
|
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7
|
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5
|
+-----------------------------------+------------------------------------------+-----
-+
StopWordsRemover
from pyspark.ml.feature import StopWordsRemover

sentenceData = spark.createDataFrame([
(0, ["I", "saw", "the", "red", "balloon"]),
(1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="removeded")


remover.transform(sentenceData).show(truncate=False)
+---+----------------------------+--------------------+
|id |raw |removeded |
+---+----------------------------+--------------------+
|0 |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1 |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+
NGram
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

from pyspark.ml.feature import NGram

sentenceData = spark.createDataFrame([
(0.0, "I love Spark"),
(0.0, "I love python"),
(1.0, "I think ML is awesome")],
["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")


ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, ngram])

model = pipeline.fit(sentenceData)

model.transform(sentenceData).show(truncate=False)
+-----+---------------------+---------------------------+----------------------------
----------+
|label|sentence |words |ngrams
|
+-----+---------------------+---------------------------+----------------------------
----------+
|0.0 |I love Spark |[i, love, spark] |[i love, love spark]
|
|0.0 |I love python |[i, love, python] |[i love, love python]
|
|1.0 |I think ML is awesome|[i, think, ml, is, awesome]|[i think, think ml, ml is,
is awesome]|
+-----+---------------------+---------------------------+----------------------------
----------+
Binarizer
from pyspark.ml.feature import Binarizer

continuousDataFrame = spark.createDataFrame([
(0, 0.1),
(1, 0.8),
(2, 0.2),
(3,0.5)
], ["id", "feature"])

binarizer = Binarizer(threshold=0.5, inputCol="feature",


outputCol="binarized_feature")

binarizedDataFrame = binarizer.transform(continuousDataFrame)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())


binarizedDataFrame.show()
Binarizer output with Threshold = 0.500000
+---+-------+-----------------+
| id|feature|binarized_feature|
+---+-------+-----------------+
| 0| 0.1| 0.0|
| 1| 0.8| 1.0|
| 2| 0.2| 0.0|
| 3| 0.5| 0.0|
+---+-------+-----------------+

Bucketizer
[Bucketizer](https://spark.apache.org/docs/latest/ml-features.html#bucketizer) transforms a
column of continuous features to a column of feature buckets, where the buckets are specified by
users.

from pyspark.ml.feature import QuantileDiscretizer, Bucketizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.0)]
df = spark.createDataFrame(data, ["id", "age"])
print(df.show())

splits = [-float("inf"),3, 10,float("inf")]


result_bucketizer = Bucketizer(splits=splits,
inputCol="age",outputCol="result").transform(df)
result_bucketizer.show()
+---+----+
| id| age|
+---+----+
| 0|18.0|
| 1|19.0|
| 2| 8.0|
| 3| 5.0|
| 4| 2.0|
+---+----+

None
+---+----+------+
| id| age|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.0| 0.0|
+---+----+------+
QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a
column with binned categorical features. The number of bins is set by the
numBuckets parameter. It is possible that the number of buckets used will be
smaller than this value, for example, if there are too few distinct values of the
input to create enough distinct quantiles.

from pyspark.ml.feature import QuantileDiscretizer, Bucketizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.0)]
df = spark.createDataFrame(data, ["id", "age"])
print(df.show())

qds = QuantileDiscretizer(numBuckets=5, inputCol="age", outputCol="buckets",


relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.transform(df).show()
bucketizer.setHandleInvalid("skip").transform(df).show()
+---+----+
| id| age|
+---+----+
| 0|18.0|
| 1|19.0|
| 2| 8.0|
| 3| 5.0|
| 4| 2.0|
+---+----+

None
+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 3.0|
| 2| 8.0| 2.0|
| 3| 5.0| 2.0|
| 4| 2.0| 1.0|
+---+----+-------+

+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 3.0|
| 2| 8.0| 2.0|
| 3| 5.0| 2.0|
| 4| 2.0| 1.0|
+---+----+-------+
If the data has NULL values, then you will get the following results:

from pyspark.ml.feature import QuantileDiscretizer, Bucketizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, None)]
df = spark.createDataFrame(data, ["id", "age"])
print(df.show())

splits = [-float("inf"),3, 10,float("inf")]


result_bucketizer = Bucketizer(splits=splits,
inputCol="age",outputCol="result").transform(df)
result_bucketizer.show()

qds = QuantileDiscretizer(numBuckets=5, inputCol="age", outputCol="buckets",


relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.transform(df).show()
bucketizer.setHandleInvalid("skip").transform(df).show()
+---+----+
| id| age|
+---+----+
| 0|18.0|
| 1|19.0|
| 2| 8.0|
| 3| 5.0|
| 4|null|
+---+----+

None
+---+----+------+
| id| age|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4|null| null|
+---+----+------+

+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 4.0|
| 2| 8.0| 2.0|
| 3| 5.0| 1.0|
| 4|null| null|
+---+----+-------+

+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 4.0|
| 2| 8.0| 2.0|
| 3| 5.0| 1.0|
+---+----+-------+
StringIndexer
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
[(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")


indexed = indexer.fit(df).transform(df)
indexed.show()
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
| 0| a| 0.0|
| 1| b| 2.0|
| 2| c| 1.0|
| 3| a| 0.0|
| 4| a| 0.0|
| 5| c| 1.0|
+---+--------+-------------+
labelConverter
from pyspark.ml.feature import IndexToString, StringIndexer

df = spark.createDataFrame(
[(0, "Yes"), (1, "Yes"), (2, "Yes"), (3, "No"), (4, "No"), (5, "No")],
["id", "label"])

indexer = StringIndexer(inputCol="label", outputCol="labelIndex")


model = indexer.fit(df)
indexed = model.transform(df)

print("Transformed string column '%s' to indexed column '%s'"
% (indexer.getInputCol(), indexer.getOutputCol()))
indexed.show()

print("StringIndexer will store labels in output column metadata\n")

converter = IndexToString(inputCol="labelIndex", outputCol="originalLabel")


converted = converter.transform(indexed)

print("Transformed indexed column '%s' back to original string column '%s' using "
"labels in metadata" % (converter.getInputCol(), converter.getOutputCol()))
converted.select("id", "labelIndex", "originalLabel").show()
Transformed string column 'label' to indexed column 'labelIndex'
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
| 0| Yes| 1.0|
| 1| Yes| 1.0|
| 2| Yes| 1.0|
| 3| No| 0.0|
| 4| No| 0.0|
| 5| No| 0.0|
+---+-----+----------+

StringIndexer will store labels in output column metadata

Transformed indexed column 'labelIndex' back to original string column


'originalLabel' using labels in metadata
+---+----------+-------------+
| id|labelIndex|originalLabel|
+---+----------+-------------+
| 0| 1.0| Yes|
| 1| 1.0| Yes|
| 2| 1.0| Yes|
| 3| 0.0| No|
| 4| 0.0| No|
| 5| 0.0| No|
+---+----------+-------------+
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer

df = spark.createDataFrame(
[(0, "Yes"), (1, "Yes"), (2, "Yes"), (3, "No"), (4, "No"), (5, "No")],
["id", "label"])

indexer = StringIndexer(inputCol="label", outputCol="labelIndex")


converter = IndexToString(inputCol="labelIndex", outputCol="originalLabel")

pipeline = Pipeline(stages=[indexer, converter])

model = pipeline.fit(df)
result = model.transform(df)

result.show()
+---+-----+----------+-------------+
| id|label|labelIndex|originalLabel|
+---+-----+----------+-------------+
| 0| Yes| 1.0| Yes|
| 1| Yes| 1.0| Yes|
| 2| Yes| 1.0| Yes|
| 3| No| 0.0| No|
| 4| No| 0.0| No|
| 5| No| 0.0| No|
+---+-----+----------+-------------+
VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.feature import RFormula

df = spark.createDataFrame([
(0, 2.2, True, "1", "foo", 'CA'),
(1, 3.3, False, "2", "bar", 'US'),
(0, 4.4, False, "3", "baz", 'CHN'),
(1, 5.5, False, "4", "foo", 'AUS')
], ['label',"real", "bool", "stringNum", "string","country"])

formula = RFormula(
formula="label ~ real + bool + stringNum + string + country",
featuresCol="features",
labelCol="label")

# Automatically identify categorical features, and index them.


# We specify maxCategories so features with > 4 distinct values
# are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", \
outputCol="indexedFeatures",\
maxCategories=2)

pipeline = Pipeline(stages=[formula, featureIndexer])

model = pipeline.fit(df)
result = model.transform(df)

result.show()
+-----+----+-----+---------+------+-------+--------------------+--------------------+
|label|real| bool|stringNum|string|country| features| indexedFeatures|
+-----+----+-----+---------+------+-------+--------------------+--------------------+
| 0| 2.2| true| 1| foo| CA|(10,[0,1,5,7],[2....|(10,[0,1,5,7],[2....|
| 1| 3.3|false| 2| bar| US|(10,[0,3,8],[3.3,...|(10,[0,3,8],[3.3,...|
| 0| 4.4|false| 3| baz| CHN|(10,[0,4,6,9],[4....|(10,[0,4,6,9],[4....|
| 1| 5.5|false| 4| foo| AUS|(10,[0,2,5],[5.5,...|(10,[0,2,5],[5.5,...|
+-----+----+-----+---------+------+-------+--------------------+--------------------+
VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
[(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
inputCols=["hour", "mobile", "userFeatures"],
outputCol="features")

output = assembler.transform(dataset)

print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column
'features'")
output.select("features", "clicked").show(truncate=False)
Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0 |
+-----------------------+-------+
OneHotEncoder
This is the note I wrote for one of my readers to explain the OneHotEncoder. I would like to share it here:

Import and creating SparkSession


from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| a|
| 4| a|
| 5| c|
+---+--------+
OneHotEncoder
Encoder
from pyspark.ml.feature import OneHotEncoder, StringIndexer

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")


model = stringIndexer.fit(df)
indexed = model.transform(df)

# default setting: dropLast=True


encoder = OneHotEncoder(inputCol="categoryIndex",
outputCol="categoryVec",dropLast=False)
encoded = encoder.transform(indexed)
encoded.show()

+---+--------+-------------+-------------+
| id|category|categoryIndex| categoryVec|
+---+--------+-------------+-------------+
| 0| a| 0.0|(3,[0],[1.0])|
| 1| b| 2.0|(3,[2],[1.0])|
| 2| c| 1.0|(3,[1],[1.0])|
| 3| a| 0.0|(3,[0],[1.0])|
| 4| a| 0.0|(3,[0],[1.0])|
| 5| c| 1.0|(3,[1],[1.0])|
+---+--------+-------------+-------------+
Note

The default setting of OneHotEncoder is: dropLast=True

# default setting: dropLast=True


encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
+---+--------+-------------+-------------+
| id|category|categoryIndex| categoryVec|
+---+--------+-------------+-------------+
| 0| a| 0.0|(2,[0],[1.0])|
| 1| b| 2.0| (2,[],[])|
| 2| c| 1.0|(2,[1],[1.0])|
| 3| a| 0.0|(2,[0],[1.0])|
| 4| a| 0.0|(2,[0],[1.0])|
| 5| c| 1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+
Vector Assembler
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
categoricalCols = ['category']

indexers = [ StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))


for c in categoricalCols ]
# default setting: dropLast=True
encoders = [ OneHotEncoder(inputCol=indexer.getOutputCol(),

outputCol="{0}_encoded".format(indexer.getOutputCol()),dropLast=False)
for indexer in indexers ]
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in
encoders]
, outputCol="features")
pipeline = Pipeline(stages=indexers + encoders + [assembler])

model=pipeline.fit(df)
data = model.transform(df)
data.show()
+---+--------+----------------+------------------------+-------------+
| id|category|category_indexed|category_indexed_encoded| features|
+---+--------+----------------+------------------------+-------------+
| 0| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 1| b| 2.0| (3,[2],[1.0])|[0.0,0.0,1.0]|
| 2| c| 1.0| (3,[1],[1.0])|[0.0,1.0,0.0]|
| 3| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 4| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 5| c| 1.0| (3,[1],[1.0])|[0.0,1.0,0.0]|

+---+--------+----------------+------------------------+-------------+
Application: Get Dummy Variable
def get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol,dropLast=False):

'''
Get dummy variables and concat with continuous variables for ml modeling.
:param df: the dataframe
:param categoricalCols: the name list of the categorical data
:param continuousCols: the name list of the numerical data
:param labelCol: the name of label column
:param dropLast: the flag of drop last column
:return: feature matrix

:author: Wenqiang Feng


:email: von198@gmail.com

>>> df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])

>>> indexCol = 'id'


>>> categoricalCols = ['category']
>>> continuousCols = []
>>> labelCol = []

>>> mat = get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol)


>>> mat.show()

>>>
+---+-------------+
| id| features|
+---+-------------+
| 0|[1.0,0.0,0.0]|
| 1|[0.0,0.0,1.0]|
| 2|[0.0,1.0,0.0]|
| 3|[1.0,0.0,0.0]|
| 4|[1.0,0.0,0.0]|
| 5|[0.0,1.0,0.0]|
+---+-------------+
'''

from pyspark.ml import Pipeline


from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import col

indexers = [ StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))


for c in categoricalCols ]

# default setting: dropLast=True


encoders = [ OneHotEncoder(inputCol=indexer.getOutputCol(),

outputCol="{0}_encoded".format(indexer.getOutputCol()),dropLast=dropLast)
for indexer in indexers ]

assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder
in encoders]
+ continuousCols, outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])

model=pipeline.fit(df)
data = model.transform(df)

if indexCol and labelCol:


# for supervised learning
data = data.withColumn('label',col(labelCol))
return data.select(indexCol,'features','label')
elif not indexCol and labelCol:
# for supervised learning
data = data.withColumn('label',col(labelCol))
return data.select('features','label')
elif indexCol and not labelCol:
# for unsupervised learning
return data.select(indexCol,'features')
elif not indexCol and not labelCol:
# for unsupervised learning
return data.select('features')
Unsupervised scenario
df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
df.show()

indexCol = 'id'
categoricalCols = ['category']
continuousCols = []
labelCol = []

mat = get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol)
mat.show()

+---+-------------+
| id| features|
+---+-------------+
| 0|[1.0,0.0,0.0]|
| 1|[0.0,0.0,1.0]|
| 2|[0.0,1.0,0.0]|
| 3|[1.0,0.0,0.0]|
| 4|[1.0,0.0,0.0]|
| 5|[0.0,1.0,0.0]|
+---+-------------+
Supervised scenario
df = spark.read.csv(path='bank.csv',
sep=',',encoding='UTF-8',comment=None,
header=True,inferSchema=True)

indexCol = []
catCols = ['job','marital','education','default',
'housing','loan','contact','poutcome']

contCols = ['balance', 'duration','campaign','pdays','previous']


labelCol = 'y'

data = get_dummy(df,indexCol,catCols,contCols,labelCol,dropLast=False)
data.show(5)
+--------------------+-----+
| features|label|
+--------------------+-----+
|(37,[8,12,17,19,2...| no|
|(37,[4,12,15,19,2...| no|
|(37,[0,13,16,19,2...| no|
|(37,[0,12,16,19,2...| no|
|(37,[1,12,15,19,2...| no|
+--------------------+-----+
only showing top 5 rows
The Jupyter Notebook can be found on Colab: OneHotEncoder .

Scaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler

scaler_type = 'Normal'
if scaler_type=='Normal':
scaler = Normalizer(inputCol="features", outputCol="scaledFeatures", p=1.0)
elif scaler_type=='Standard':
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
withStd=True, withMean=False)
elif scaler_type=='MinMaxScaler':
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
elif scaler_type=='MaxAbsScaler':
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])
df.show()

pipeline = Pipeline(stages=[scaler])

model =pipeline.fit(df)
data = model.transform(df)
data.show()
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.5,-1.0]|
| 1| [2.0,1.0,1.0]|
| 2|[4.0,10.0,2.0]|
+---+--------------+

+---+--------------+------------------+
| id| features| scaledFeatures|
+---+--------------+------------------+
| 0|[1.0,0.5,-1.0]| [0.4,0.2,-0.4]|
| 1| [2.0,1.0,1.0]| [0.5,0.25,0.25]|
| 2|[4.0,10.0,2.0]|[0.25,0.625,0.125]|
+---+--------------+------------------+
Normalizer
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

# Normalize each Vector using $L^1$ norm.


normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
l1NormData = normalizer.transform(dataFrame)
print("Normalized using L^1 norm")
l1NormData.show()

# Normalize each Vector using $L^\infty$ norm.


lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")})
print("Normalized using L^inf norm")
lInfNormData.show()
Normalized using L^1 norm
+---+--------------+------------------+
| id| features| normFeatures|
+---+--------------+------------------+
| 0|[1.0,0.5,-1.0]| [0.4,0.2,-0.4]|
| 1| [2.0,1.0,1.0]| [0.5,0.25,0.25]|
| 2|[4.0,10.0,2.0]|[0.25,0.625,0.125]|
+---+--------------+------------------+

Normalized using L^inf norm


+---+--------------+--------------+
| id| features| normFeatures|
+---+--------------+--------------+
| 0|[1.0,0.5,-1.0]|[1.0,0.5,-1.0]|
| 1| [2.0,1.0,1.0]| [1.0,0.5,0.5]|
| 2|[4.0,10.0,2.0]| [0.4,1.0,0.2]|
+---+--------------+--------------+
StandardScaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler

from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",


withStd=True, withMean=False)
scaleredData = scaler.fit((dataFrame)).transform(dataFrame)
scaleredData.show(truncate=False)

+---+--------------+------------------------------------------------------------+
|id |features |scaledFeatures |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.5,-1.0]|[0.6546536707079772,0.09352195295828244,-0.6546536707079771]|
|1 |[2.0,1.0,1.0] |[1.3093073414159544,0.1870439059165649,0.6546536707079771] |
|2 |[4.0,10.0,2.0]|[2.618614682831909,1.870439059165649,1.3093073414159542] |
+---+--------------+------------------------------------------------------------+
MinMaxScaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler

from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")


scaledData = scaler.fit((dataFrame)).transform(dataFrame)
scaledData.show(truncate=False)
+---+--------------+-----------------------------------------------------------+
|id |features |scaledFeatures |
+---+--------------+-----------------------------------------------------------+
|0 |[1.0,0.5,-1.0]|[0.0,0.0,0.0] |
|1 |[2.0,1.0,1.0] |[0.3333333333333333,0.05263157894736842,0.6666666666666666]|
|2 |[4.0,10.0,2.0]|[1.0,1.0,1.0] |
+---+--------------+-----------------------------------------------------------+
MaxAbsScaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler

from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")


scaledData = scaler.fit((dataFrame)).transform(dataFrame)
scaledData.show(truncate=False)
+---+--------------+----------------+
|id |features |scaledFeatures |
+---+--------------+----------------+
|0 |[1.0,0.5,-1.0]|[0.25,0.05,-0.5]|
|1 |[2.0,1.0,1.0] |[0.5,0.1,0.5] |
|2 |[4.0,10.0,2.0]|[1.0,1.0,1.0] |
+---+--------------+----------------+
PCA
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),


(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
+-----------------------------------------------------------+
|pcaFeatures |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
DCT
from pyspark.ml.feature import DCT
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
(Vectors.dense([0.0, 1.0, -2.0, 3.0]),),
(Vectors.dense([-1.0, 2.0, 4.0, -7.0]),),
(Vectors.dense([14.0, -2.0, -5.0, 1.0]),)], ["features"])

dct = DCT(inverse=False, inputCol="features", outputCol="featuresDCT")

dctDf = dct.transform(df)

dctDf.select("featuresDCT").show(truncate=False)
+----------------------------------------------------------------+
|featuresDCT |
+----------------------------------------------------------------+
|[1.0,-1.1480502970952693,2.0000000000000004,-2.7716385975338604]|
|[-1.0,3.378492794482933,-7.000000000000001,2.9301512653149677] |
|[4.0,9.304453421915744,11.000000000000002,1.5579302036357163] |
+----------------------------------------------------------------+

Feature Selection
LASSO
LASSO performs variable selection and the removal of correlated variables. The Ridge method
shrinks the coefficients of correlated variables, while the LASSO method picks
one variable and discards the others. The elastic net penalty is a mixture of
these two: if variables are correlated in groups, the elastic net tends to select
the groups as in or out together. If α is close to 1, the elastic net performs much like the
LASSO method and removes any degeneracies and wild behavior caused by
extreme correlations.
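
In Spark ML this penalty is exposed through the elasticNetParam of the linear models. The following is a minimal, hypothetical sketch (the DataFrame data, with a 'features' vector column and a numeric 'label' column, is assumed and not taken from these notes); elasticNetParam=1.0 corresponds to LASSO, 0.0 to Ridge, and intermediate values to the elastic net.

from pyspark.ml.regression import LinearRegression

# elasticNetParam = 0.0 -> Ridge (L2), 1.0 -> LASSO (L1), values in between -> elastic net
lasso = LinearRegression(featuresCol="features", labelCol="label",
                         regParam=0.1, elasticNetParam=1.0)

# 'data' is an assumed DataFrame with a 'features' vector column and a numeric 'label'
lasso_model = lasso.fit(data)

# Coefficients shrunk exactly to zero are effectively dropped by the model,
# which is how LASSO acts as a feature selector
print(lasso_model.coefficients)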

RandomForest
An AutoFeatures library based on RandomForest is coming soon.

Unbalanced data: Undersampling


Since we use PySpark to deal with big data, undersampling is a useful method for dealing with
unbalanced data. Undersampling is a popular technique for unbalanced datasets to reduce
the skew in class distributions. However, it is well-known that undersampling
one class modifies the priors of the training set and consequently biases the
posterior probabilities of a classifier. After you apply undersampling, you therefore
need to recalibrate the predicted probabilities (see Calibrating Probability with Undersampling
for Unbalanced Classification).

df = spark.createDataFrame([
(0, "Yes"),
(1, "Yes"),
(2, "Yes"),
(3, "Yes"),
(4, "No"),
(5, "No")
], ["id", "label"])
df.show()
+---+-----+
| id|label|
+---+-----+
| 0| Yes|
| 1| Yes|
| 2| Yes|
| 3| Yes|
| 4| No|
| 5| No|
+---+-----+
Calculate undersampling Ratio
import math
def round_up(n, decimals=0):
multiplier = 10 ** decimals
return math.ceil(n * multiplier) / multiplier

# drop missing value rows


df = df.dropna()
# under-sampling majority set
label_Y = df.filter(df.label=='Yes')
label_N = df.filter(df.label=='No')
sampleRatio = round_up(label_N.count() / df.count(),2)

Undersampling
label_Y_sample = label_Y.sample(False, sampleRatio)
# union minority set and the under-sampling majority set
data = label_N.unionAll(label_Y_sample)
data.show()
+---+-----+
| id|label|
+---+-----+
| 4| No|
| 5| No|
| 1| Yes|
| 2| Yes|
+---+-----+
Recalibrating Probability
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class
distributions. However, it is well-known that undersampling one class modifies the priors of the
training set and consequently biases the posterior probabilities of a classifier (see Calibrating
Probability with Undersampling for Unbalanced Classification).
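
As a rough sketch of what that recalibration can look like (this uses the correction formula commonly associated with undersampling, not code from these notes; p_s and beta are hypothetical names): if beta is the fraction of the majority class that was kept, a probability p_s predicted on the undersampled data can be mapped back as follows.

def calibrate_probability(p_s, beta):
    """Map a probability p_s estimated on an undersampled training set back
    towards the original class distribution.

    p_s  : predicted probability of the positive (minority) class
    beta : fraction of the majority class kept during undersampling
           (e.g. 0.1 if only 10% of the majority rows were sampled)
    """
    return beta * p_s / (beta * p_s - p_s + 1)

# Example: a score of 0.8 obtained after keeping 10% of the majority class
print(calibrate_probability(0.8, 0.1))   # about 0.29, much lower than 0.8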

Clustering:

Overview

 Learn about Clustering , one of the most popular unsupervised classification techniques
 Dividing the data into clusters can be on the basis of centroids, distributions, densities, etc
 Get to know K means and hierarchical clustering and the difference between the two

Introduction
Have you come across a situation when a Chief Marketing Officer of a company tells you –

“Help me understand our customers better so that we can market our products to them in a better

manner!”

I did and the analyst in me was completely clueless what to do! I was used to getting specific

problems, where there is an outcome to be predicted for various set of conditions. But I had no

clue what to do in this case. If the person would have asked me to calculate Life Time Value

(LTV) or propensity of Cross-sell, I wouldn’t have blinked. But this question looked very broad

to me.

This is usually the first reaction when you come across an unsupervised learning problem for the

first time! You are not looking for specific insights for a phenomenon; what you are looking

for are structures within data without them being tied down to a specific outcome.

The method of identifying similar groups of data in a dataset is called clustering. It is one of the

most popular techniques in data science. Entities in each group are comparatively more similar to

entities of that group than those of the other groups. In this article, I will be taking you through

the types of clustering, different clustering algorithms and a comparison between two of the most

commonly used clustering methods.

Table of Contents

1. Overview
2. Types of Clustering
3. Types of Clustering Algorithms
4. K Means Clustering
5. Hierarchical Clustering
6. Difference between K Means and Hierarchical clustering
7. Applications of Clustering
8. Improving Supervised Learning algorithms with clustering

1. Overview

Clustering is the task of dividing the population or data points into a number of groups such that

data points in the same groups are more similar to other data points in the same group than those

in other groups. In simple words, the aim is to segregate groups with similar traits and assign

them into clusters.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to

understand the preferences of your customers to scale up your business. Is it possible for you to look

at the details of each customer and devise a unique business strategy for each one of them?

Definitely not. But what you can do is to cluster all of your customers into, say, 10 groups based

on their purchasing habits and use a separate strategy for customers in each of these 10 groups.

And this is what we call clustering.

2. Types of Clustering

Broadly speaking, clustering can be divided into two subgroups :

 Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or
not. For example, in the above example each customer is put into one group out of the 10 groups.
 Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a
probability or likelihood of that data point to be in those clusters is assigned. For example, from
the above scenario each customer is assigned a probability to be in either of 10 clusters of the
retail store.

3. Types of clustering algorithms

Since the task of clustering is subjective, the means that can be used for achieving this goal are

plenty. Every methodology follows a different set of rules for defining the ‘similarity’ among

data points. In fact, there are more than 100 clustering algorithms known. But few of the

algorithms are used popularly, let’s look at them in detail:

 Connectivity models: As the name suggests, these models are based on the notion that the data
points closer in data space exhibit more similarity to each other than the data points lying farther
away. These models can follow two approaches. In the first approach, they start with classifying
all data points into separate clusters & then aggregating them as the distance decreases. In the
second approach, all data points are classified as a single cluster and then partitioned as the
distance increases. Also, the choice of distance function is subjective. These models are very
easy to interpret but lack scalability for handling big datasets. Examples of these models are
hierarchical clustering algorithm and its variants.

 Centroid models: These are iterative clustering algorithms in which the notion of similarity is
derived by the closeness of a data point to the centroid of the clusters. K-Means clustering
algorithm is a popular algorithm that falls into this category. In these models, the no. of clusters
required at the end have to be mentioned beforehand, which makes it important to have prior
knowledge of the dataset. These models run iteratively to find the local optima.

 Distribution models: These clustering models are based on the notion of how probable it is that
all data points in the cluster belong to the same distribution (for example, Normal, Gaussian).
These models often suffer from overfitting. A popular example of these models is the Expectation-
maximization algorithm, which uses multivariate normal distributions.

 Density Models: These models search the data space for areas of varied density of data
points. They isolate the different density regions and assign the data points
within these regions to the same cluster. Popular examples of density models are DBSCAN and
OPTICS.

Now I will be taking you through two of the most popular clustering algorithms in detail – K

Means clustering and Hierarchical clustering. Let’s begin.

4. K Means Clustering

K means is an iterative clustering algorithm that aims to find a local optimum in each iteration.

This algorithm works in the following steps:

1. Specify the desired number of clusters K : Let us choose k=2 for


these 5 data points in 2-D space.

2. Randomly assign each data point to a cluster : Let’s assign three


points in cluster 1 shown using red color and two points in cluster
2 shown using grey color.

3. Compute cluster centroids : The centroid of data points in the red


cluster is shown using red cross and those in grey cluster using
grey cross.

4. Re-assign each point to the closest cluster centroid : Note that
the data point at the bottom was assigned to the red cluster
even though it is closer to the centroid of the grey cluster. Thus, we
re-assign that data point to the grey cluster.

5. Re-compute cluster centroids : Now, re-computing the centroids


for both the clusters.

6. Repeat steps 4 and 5 until no improvements are possible : We repeat
the 4th and 5th steps until the assignments converge to a local optimum. When there is no further
switching of data points between the two clusters for two successive repeats, the
algorithm terminates, unless a maximum number of iterations is explicitly specified.

Here is a short example of the K Means algorithm
using the scikit-learn library.
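
A minimal sketch: the five 2-D points below are hypothetical and only mirror the illustration above.

import numpy as np
from sklearn.cluster import KMeans

# Five hypothetical 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4]])

# k = 2 clusters; several random initialisations (n_init) reduce the risk
# of stopping at a poor local optimum
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids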

5. Hierarchical Clustering

Hierarchical clustering, as the name suggests is an algorithm that builds hierarchy of clusters.

This algorithm starts with all the data points assigned to a cluster of their own. Then two nearest

clusters are merged into the same cluster. In the end, this algorithm terminates when there is only

a single cluster left.

The results of hierarchical clustering can be shown using dendrogram. The dendrogram can be

interpreted as:

At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest

clusters are then merged till we have just one cluster at the top. The height in the dendrogram at

which two clusters are merged represents the distance between two clusters in the data space.

The decision of the no. of clusters that can best depict different groups can be chosen by

observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the

dendrogram cut by a horizontal line that can traverse the maximum distance vertically without

intersecting a cluster.

In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the

dendrogram below covers maximum vertical distance AB.

Two important things that you should know about hierarchical clustering are:

 This algorithm has been implemented above using bottom up approach. It is also possible to
follow top-down approach starting with all data points assigned in the same cluster and
recursively performing splits till each data point is assigned a separate cluster.
 The decision of merging two clusters is taken on the basis of closeness of these clusters. There
are multiple metrics for deciding the closeness of two clusters :
o Euclidean distance: ||a-b||_2 = √( Σ_i (a_i - b_i)^2 )
o Squared Euclidean distance: ||a-b||_2^2 = Σ_i (a_i - b_i)^2
o Manhattan distance: ||a-b||_1 = Σ_i |a_i - b_i|
o Maximum distance: ||a-b||_∞ = max_i |a_i - b_i|
o Mahalanobis distance: √( (a-b)^T S^{-1} (a-b) )   (where S is the covariance matrix)
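
As a small illustration of agglomerative (bottom-up) clustering and of cutting the resulting tree, here is a minimal sketch using SciPy; the six 2-D points are hypothetical, and Ward linkage on Euclidean distance is just one possible choice of closeness metric.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six hypothetical 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Build the merge tree bottom-up using Ward linkage on Euclidean distance
Z = linkage(X, method='ward')

# Cut the tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) draws the merge tree when a plotting backend is available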

6. Difference between K Means and Hierarchical clustering

 Hierarchical clustering can’t handle big data well but K Means clustering can. This is
because the time complexity of K Means is linear i.e. O(n) while that of hierarchical clustering is
quadratic i.e. O(n^2).
 In K Means clustering, since we start with random choice of clusters, the results produced by
running the algorithm multiple times might differ. While results are reproducible in Hierarchical
clustering.
 K Means is found to work well when the shape of the clusters is hyper spherical (like circle in
2D, sphere in 3D).
 K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your
data into. But, you can stop at whatever number of clusters you find appropriate in hierarchical
clustering by interpreting the dendrogram

7. Applications of Clustering

Clustering has a large no. of applications spread across various domains. Some of the most

popular applications of clustering are:

 Recommendation engines
 Market segmentation
 Social network analysis
 Search result grouping
 Medical imaging
 Image segmentation
 Anomaly detection

8. Improving Supervised Learning Algorithms with Clustering

Clustering is an unsupervised machine learning approach, but can it be used to improve the

accuracy of supervised machine learning algorithms as well by clustering the data points into

similar groups and using these cluster labels as independent variables in the supervised machine

learning algorithm? Let’s find out.

Let’s check out the impact of clustering on the accuracy of our model for the classification

problem using 3000 observations with 100 predictors of stock data to predict whether the

stock will go up or down using R. This dataset contains 100 independent variables from X1 to

X100 representing profile of a stock and one outcome variable Y with two levels : 1 for rise in

stock price and -1 for drop in stock price.

Let’s first try applying randomforest without clustering.

#loading required libraries


library('randomForest')

library('Metrics')
#set random seed
set.seed(101)

#loading dataset

data<-read.csv("train.csv",stringsAsFactors= T)

#checking dimensions of data


dim(data)

## [1] 3000 101

#specifying outcome variable as factor

data$Y<-as.factor(data$Y)

#dividing the dataset into train and test


train<-data[1:2000,]
test<-data[2001:3000,]

#applying randomForest
model_rf<-randomForest(Y~.,data=train)

preds<-predict(object=model_rf,test[,-101])

table(preds)

## preds
## -1 1
## 453 547

#checking accuracy

auc(preds,test$Y)

## [1] 0.4522703

So, the AUC we get is about 0.45. Now let’s create five clusters based on values of independent

variables using k-means clustering and reapply random forest.


#combing test and train
all<-rbind(train,test)

#creating 5 clusters using K- means clustering


Cluster <- kmeans(all[,-101], 5)
#adding clusters as independent variable to the dataset.
all$cluster<-as.factor(Cluster$cluster)
#dividing the dataset into train and test
train<-all[1:2000,]
test<-all[2001:3000,]
#applying randomforest
model_rf<-randomForest(Y~.,data=train)
preds2<-predict(object=model_rf,test[,-101])
table(preds2)
## preds2
## -1 1
##548 452
auc(preds2,test$Y)
## [1] 0.5345908

Whoo! In the above example, even though the final score is still modest, clustering has given our

model a significant boost, from an AUC of about 0.45 to slightly above 0.53.

This shows that clustering can indeed be helpful for supervised machine learning tasks.

Dimensionality Reduction
Dimensionality reduction, or variable reduction techniques, simply refers to the process of
reducing the number or dimensions of features in a dataset. It is commonly used during the
analysis of high-dimensional data (e.g., multipixel images of a face or texts from an article,
astronomical catalogues, etc.). Many statistical and ML methods have been applied to high-
dimensional data, such as vector quantization and mixture models, generative topographic
mapping (Bishop et al., 1998), and principal component analysis (PCA), to list just a few. PCA is
one of the most popular algorithms used for dimensionality reduction (Pearson, 1901; Wold et
al., 1987; Dunteman, 1989; Jolliffe and Cadima, 2016).
It is an unsupervised dimensionality reduction technique, also known as the Karhunen–
Loève transform, generally applied for data compression, visualization, and feature extraction
(Bishop, 2000).
being projected orthogonally onto lower-dimensional linear space (called the principal subspace)
to maximize the projected data's variance (Hotelling, 1933). Other common methods of
dimensionality reduction worth mentioning are independent component analysis (Comon, 1994),
nonnegative matrix factorization (Lee and Seung, 1999), self-organized maps (Kohonen, 1982),
isomaps (Tenenbaum et al., 2000), t-distributed stochastic neighbor embedding (van der Maaten
and Hinton, 2008), Uniform Manifold Approximation and Projection for Dimension Reduction
(McInnes et al., 2018), and autoencoders (Vincent et al., 2008).

Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data
compression.
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods
include wavelet transforms (Section 3.4.2) and principal components analysis (Section 3.4.3),
which transform or project the original data onto a smaller space. Attribute subset selection is a
method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes
or dimensions are detected and removed (Section 3.4.4).
Numerosity reduction techniques replace the original data volume by alternative, smaller forms
of data representation. These techniques may be parametric or nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the data parameters need to
be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear
models (Section 3.4.5) are examples. Nonparametric methods for storing reduced representations
of the data include histograms (Section 3.4.6), clustering (Section 3.4.7), sampling (Section
3.4.8), and data cube aggregation (Section 3.4.9).
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called lossless. If, instead, we can
reconstruct only an approximation of the original data, then the data reduction is called lossy.
There are several lossless algorithms for string compression; however, they typically allow only
limited data manipulation. Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression.
There are many other ways of organizing methods of data reduction. The computational time
spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced
data set size.

Role of Feature Selection and Extraction


Dimensionality reduction plays an important role in classification performance. A recognition
system is designed using a finite set of inputs. While the performance of this system increases if
we add additional features, at some point a further inclusion leads to a performance degradation.
Thus a dimensionality reduction may not always improve a classification system.
A model of the pattern recognition system including the feature selection and extraction stages is
shown in Fig. 2.1.


Figure 2.1. Pattern recognition system including feature selection and extraction.

The sensor data are subject to a feature extraction and selection process for determining the input
vector for the subsequent classifier. This makes a decision regarding the class associated with
this pattern vector.
Dimensionality reduction is accomplished based on either feature selection or feature extraction.
Feature selection is based on omitting those features from the available measurements which do
not contribute to class separability. In other words, redundant and irrelevant features are ignored.
This is illustrated in Fig. 2.2.


Figure 2.2. Dimensionality reduction based on feature selection.

Feature extraction, on the other hand, considers the whole information content and maps the
useful information content into a lower dimensional feature space. This is shown in Fig. 2.3. In
feature extraction, the mapping type A has to be specified beforehand.

Figure 2.3. Dimensionality reduction based on feature extraction.

We see immediately that for feature selection or extraction the following is required: (1) feature
evaluation criterion, (2) dimensionality of the feature space, and (3) optimization procedure.

Surveys, Catalogues, Databases/Archives, and State-of-the-Art Methods for Geoscience Data Processing
(Lachezar Filchev et al., in Knowledge Discovery in Big Data from Astronomy and Earth Observation, 2020)
Dimensionality Reduction Methods for Hyperspectral Images
Dimensionality reduction selects spectral components with a higher signal-to-noise ratio (SNR)
among neighboring bands with high correlation. Some known techniques are PCA (Jolliffe,
1986); computing KLT, which is the best data representation in the least-squares sense; SVD
(Scharf, 1991), which provides the projection that best represents data in the maximum power
sense; maximum noise fraction (MaxNF) (Green et al., 1988), and noise-adjusted principal
components (NAPC) (Lee et al., 1990), which seeks the projection that optimizes the ratio of
signal to noise powers.
In the analysis of hyperspectral imagery, a selection of an optimal subset of bands must be
performed to avoid the problems due to interband correlation. This can be achieved by
employing feature extraction techniques for significant reduction of data dimensionality. Based
on locally linear embedding, the Roweis and Saul approach mitigates the effects of high
dimensionality on information extraction from hyperspectral imagery (Roweis and Saul, 2000),
including PCA (Jolliffe, 2002), minimum noise fraction (MinNF), and linear discriminate
analysis (Fukunaga, 1990). Chen and Qian (2008) consider PCA followed by a wavelet
decomposition in the spatial domain. Due to a significant amount of noise after the PCA, they
apply a bivariate wavelet thresholding method for denoising, which is the most efficient method
to increase the peak SNR. Hence, better dimensionality reduction is achieved.

Feature Extraction:

Extracting, transforming and selecting features


This section covers algorithms for working with features, roughly divided into these groups:

 Extraction: Extracting features from “raw” data


 Transformation: Scaling, converting, or modifying features
 Selection: Selecting a subset from a larger set of features
 Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature
transformation with other algorithms.

Table of Contents

 Feature Extractors
o TF-IDF
o Word2Vec
o CountVectorizer
o FeatureHasher
KGRL MCA BIGDATA ANALYTICS Lecture Notes K.Issack Babu Page 194
 Feature Transformers
o Tokenizer
o StopWordsRemover
o n-gram
o Binarizer
o PCA
o PolynomialExpansion
o Discrete Cosine Transform (DCT)
o StringIndexer
o IndexToString
o OneHotEncoder
o VectorIndexer
o Interaction
o Normalizer
o StandardScaler
o RobustScaler
o MinMaxScaler
o MaxAbsScaler
o Bucketizer
o ElementwiseProduct
o SQLTransformer
o VectorAssembler
o VectorSizeHint
o QuantileDiscretizer
o Imputer
 Feature Selectors
o VectorSlicer
o RFormula
o ChiSqSelector
o UnivariateFeatureSelector
o VarianceThresholdSelector
 Locality Sensitive Hashing
o LSH Operations
 Feature Transformation
 Approximate Similarity Join
 Approximate Nearest Neighbor Search
o LSH Algorithms
 Bucketed Random Projection for Euclidean Distance
 MinHash for Jaccard Distance

Feature Extractors
TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization
method widely used in text mining to reflect the importance of a term to a document in the
corpus. Denote a term by t, a document by d, and the corpus by D. Term
frequency TF(t,d) is the number of times that term t appears in document d, while
document frequency DF(t,D) is the number of documents that contain term t. If we only
use term frequency to measure the importance, it is very easy to over-emphasize terms that
appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a
term appears very often across the corpus, it means it doesn’t carry special information about a
particular document. Inverse document frequency is a numerical measure of how much
information a term provides:

IDF(t,D) = log( (|D| + 1) / (DF(t,D) + 1) ),

where |D| is the total number of documents in the corpus. Since a logarithm is used, if a term
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to
avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product
of TF and IDF:

TFIDF(t,d,D) = TF(t,d) · IDF(t,D).

There are several variants on the definition of term frequency and document frequency. In
MLlib, we separate TF and IDF to make them flexible.

TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length
feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes
the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The
hash function used here is MurmurHash 3. Then term frequencies are calculated based on the
mapped indices. This approach avoids the need to compute a global term-to-index map, which
can be expensive for a large corpus, but it suffers from potential hash collisions, where different
raw features may become the same term after hashing. To reduce the chance of collision, we can
increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple
modulo on the hashed value is used to determine the vector index, it is advisable to use a power
of two as the feature dimension, otherwise the features will not be mapped evenly to the vector
indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle
parameter controls term frequency counts. When set to true all nonzero frequency counts are set

to 1. This is especially useful for discrete probabilistic models that model binary, rather
than integer, counts.

CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for
more details.

IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel.


The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and
scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.

Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP
Group and scalanlp/chalk.

Examples

In the following code segment, we start with a set of sentences. We split each sentence into
words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the
sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves
performance when using text as features. Our feature vectors could then be passed to a learning
algorithm.

Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(


(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")


val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)


// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")


val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)


rescaledData.select("label", "features").show()
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala" in the Spark repo.

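For readers following the PySpark-based examples earlier in these notes, the same pipeline can be sketched in Python (assuming the SparkSession created earlier is available as spark):

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash each bag of words into a fixed-length raw term-frequency vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Down-weight terms that appear in many documents
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()
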
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains
a Word2VecModel. The model maps each word to a unique fixed-size vector.
The Word2VecModel transforms each document into a vector using the average of all words in
the document; this vector can then be used as features for prediction, document similarity
calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

Examples

In the following code segment, we start with a set of documents, each of which is represented as
a sequence of words. For each document, we transform it into a feature vector. This feature
vector could then be passed to a learning algorithm.

Refer to the Word2Vec Scala docs for more details on the API.

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.


val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)


result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala" in the Spark
repo.
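
A PySpark sketch of the same example (again assuming the existing spark session):

from pyspark.ml.feature import Word2Vec

# Input data: each row is a document represented as a list of words
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

# Learn a mapping from words to 3-dimensional vectors
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row["text"], row["result"]
    print("Text: [%s] =>\nVector: %s\n" % (", ".join(text), str(vector)))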

CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents
to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be
used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The
model produces sparse representations for the documents over the vocabulary, which can then be
passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term
frequency across the corpus. An optional parameter minDF also affects the fitting process by
specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
included in the vocabulary. Another optional binary toggle parameter controls the output vector.

If set to true all nonzero counts are set to 1. This is especially useful for discrete
probabilistic models that model binary, rather than integer, counts.

Examples

Assume that we have the following DataFrame with columns id and texts:

id | texts
----|----------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces
a CountVectorizerModel with vocabulary (a, b, c). Then the output column “vector” after
transformation contains:

id | texts | vector
----|---------------------------------|---------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.

Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more
details on the API.

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus


val cvModel: CountVectorizerModel = new CountVectorizer()

.setInputCol("words")
.setOutputCol("features")
.setVocabSize(3)
.setMinDF(2)
.fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary


val cvm = new CountVectorizerModel(Array("a", "b", "c"))
.setInputCol("words")
.setOutputCol("features")

cvModel.transform(df).show(false)
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala" in the
Spark repo.
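
An equivalent PySpark sketch (assuming the existing spark session):

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"])
], ["id", "words"])

# Fit a CountVectorizerModel from the corpus: keep at most 3 vocabulary terms
# and only terms that appear in at least 2 documents (minDF)
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)

model.transform(df).show(truncate=False)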

FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of
specified dimension (typically substantially smaller than that of the original feature space). This
is done using the hashing trick to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either
numeric or categorical features. Behavior and handling of column data types is as follows:

 Numeric columns: For numeric features, the hash value of the column name is used to map
the feature value to its index in the feature vector. By default, numeric features are not
treated as categorical (even when they are integers). To treat them as categorical, specify the
relevant columns using the categoricalCols parameter.
 String columns: For categorical features, the hash value of the string “column_name=value”
is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features
are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
 Boolean columns: Boolean values are treated in the same way as string columns. That is,
boolean features are represented as “column_name=true” or “column_name=false”, with an
indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a
simple modulo on the hashed value is used to determine the vector index, it is advisable to use a
power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to
the vector indices.

Examples

Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These
different data types as input will illustrate the behavior of the transform to produce a column of
feature vectors.

real| bool|stringNum|string
----|-----|---------|------
2.2| true| 1| foo
3.3|false| 2| bar
4.4|false| 3| baz
5.5|false| 4| foo
Then the output of FeatureHasher.transform on this DataFrame is:

real|bool |stringNum|string|features
----|-----|---------|------|-------------------------------------------------------
2.2 |true |1 |foo |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0])
3.3 |false|2 |bar |(262144,[6031, 80619,140467,174475],[1.0,1.0,1.0,3.3])
4.4 |false|3 |baz |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
5.5 |false|4 |foo |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])
The resulting feature vectors could then be passed to a learning algorithm.

Refer to the FeatureHasher Scala docs for more details on the API.

import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(


(2.2, true, "1", "foo"),

(3.3, false, "2", "bar"),
(4.4, false, "3", "baz"),
(5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()


.setInputCols("real", "bool", "stringNum", "string")
.setOutputCol("features")

val featurized = hasher.transform(dataset)


featurized.show(false)
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala" in the Spark repo.

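The same FeatureHasher example sketched in PySpark (assuming the existing spark session):

from pyspark.ml.feature import FeatureHasher

dataset = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

# Hash all four columns (numeric, boolean and string) into one feature vector
hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],
                       outputCol="features")

featurized = hasher.transform(dataset)
featurized.show(truncate=False)
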
MapReduce Advanced Programming- Chaining Map Reduce jobs:

Chaining MapReduce Job in Hadoop

While processing data using MapReduce you may want to break the requirement into a series of
tasks and run them as a chain of MapReduce jobs, rather than doing everything within one
MapReduce job and making it more complex. Hadoop provides two predefined
classes ChainMapper and ChainReducer for the purpose of chaining MapReduce jobs in
Hadoop.

Table of contents

1. ChainMapper class in Hadoop


2. ChainReducer class in Hadoop
3. How to chain MapReduce jobs
4. Chained MapReduce job advantages
5. Chained MapReduce job example

ChainMapper class in Hadoop

Using ChainMapper class you can use multiple Mapper classes within a single Map task. The
Mapper classes are invoked in a chained fashion, the output of the first becomes the input of the
second, and so on until the last Mapper, the output of the last Mapper will be written to the task's
output.

For adding map tasks to the ChainedMapper addMapper() method is used.

ChainReducer class in Hadoop

Using the predefined ChainReducer class in Hadoop you can chain multiple Mapper classes after
a Reducer within the Reducer task. For each record output by the Reducer, the Mapper classes
are invoked in a chained fashion. The output of the reducer becomes the input of the first mapper
and output of first becomes the input of the second, and so on until the last Mapper, the output of
the last Mapper will be written to the task's output.

For setting the Reducer class to the chain job setReducer() method is used.

For adding a Mapper class to the chain reducer addMapper() method is used.


How to chain MapReduce jobs

Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce
jobs that look like [MAP+ / REDUCE MAP*].

In the chain of MapReduce job you can have-

 A chain of map tasks executed using ChainMapper


 A reducer set using ChainReducer.
 A chain of map tasks added using ChainReducer (This step is optional).

Special care has to be taken when creating chains that the key/values output by a Mapper are
valid for the following Mapper in the chain.

Benefits of using a chained MapReduce job
 When MapReduce jobs are chained, data from the intermediate mappers is kept in memory
rather than stored to disk, so that the next mapper in the chain doesn't have to read data from disk.
The immediate benefit of this pattern is a dramatic reduction in disk IO.
 Gives you a chance to break the problem into simpler tasks and execute them as a chain.
Chained MapReduce job example

Let’s take a simple example to show chained MapReduce job in action. Here input file has item,
sales and zone columns in the below format (tab separated) and you have to get the total sales per
item for zone-1.

Item1 345 zone-1


Item1 234 zone-2
Item3 654 zone-2
Item2 231 zone-3

For the sake of example let’s say in the first mapper you get all the records, and in the second mapper
you filter them to get only the records for zone-1. In the reducer you get the total for each item
and then you flip the records so that the key becomes the value and the value becomes the key. For that the
InverseMapper is used, which is a predefined mapper in Hadoop.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Sales extends Configured implements Tool{


// First Mapper
public static class CollectionMapper extends Mapper<LongWritable, Text, Text, Text>{
private Text item = new Text();

public void map(LongWritable key, Text value, Context context)


throws IOException, InterruptedException {
//splitting record
String[] salesArr = value.toString().split("\t");

item.set(salesArr[0]);
// Writing (sales,zone) as value
context.write(item, new Text(salesArr[1] + "," + salesArr[2]));
}
}

// Mapper 2
public static class FilterMapper extends Mapper<Text, Text, Text, IntWritable>{
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {

String[] recordArr = value.toString().split(",");


// Filtering on zone
if(recordArr[1].equals("zone-1")) {
Integer sales = Integer.parseInt(recordArr[0]);
context.write(key, new IntWritable(sales));
}
}
}

// Reduce function
public static class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {


int exitFlag = ToolRunner.run(new Sales(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());

// MapReduce chaining
Configuration mapConf1 = new Configuration(false);
ChainMapper.addMapper(job, CollectionMapper.class, LongWritable.class, Text.class,
Text.class, Text.class, mapConf1);

Configuration mapConf2 = new Configuration(false);


ChainMapper.addMapper(job, FilterMapper.class, Text.class, Text.class,
Text.class, IntWritable.class, mapConf2);

Configuration reduceConf = new Configuration(false);
ChainReducer.setReducer(job, TotalSalesReducer.class, Text.class, IntWritable.class,
Text.class, IntWritable.class, reduceConf);

ChainReducer.addMapper(job, InverseMapper.class, Text.class, IntWritable.class,


IntWritable.class, Text.class, null);

job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

MapReduce Example: Reduce Side Join in Hadoop MapReduce


Introduction:
In this blog, I am going to explain how a reduce side join is performed in Hadoop MapReduce using a MapReduce example. Here, I am assuming that you are already familiar with the MapReduce framework and know how to write a basic MapReduce program. In case you don’t, I would suggest going through the earlier MapReduce tutorial material so that you can grasp the concepts discussed here without facing any difficulties. The topics discussed in this blog are as follows:

 What is a Join?
 Joins in MapReduce
 What is a Reduce side join?
 MapReduce Example on Reduce side join
 Conclusion

What is a Join?
The join operation is used to combine two or more database tables based on
foreign keys. In general, companies maintain separate tables for the customer
and the transaction records in their database. And, many times these
companies need to generate analytic reports using the data present in such
separate tables. Therefore, they perform a join operation on these separate
tables using a common column (foreign key), like customer id, etc., to
generate a combined table. Then, they analyze this combined table to get the
desired analytic reports.

Joins in MapReduce
Just like SQL join, we can also perform join operations in MapReduce on
different data sets. There are two types of join operations in MapReduce:

 Map Side Join: As the name implies, the join operation is performed in
the map phase itself. Therefore, in the map side join, the mapper

performs the join and it is mandatory that the input to each map is
partitioned and sorted according to the keys.

The map side join has been covered separately with an example, which explains how the map side join works and what its advantages are.

 Reduce Side Join: As the name suggests, in the reduce side join, the
reducer is responsible for performing the join operation. It is
comparatively simple and easier to implement than the map side join as
the sorting and shuffling phase sends the values having identical keys to
the same reducer and therefore, by default, the data is organized for us.

Now, let us understand the reduce side join in detail.

What is Reduce Side Join?

As discussed earlier, the reduce side join is a process where the join
operation is performed in the reducer phase. Basically, the reduce side join
takes place in the following manner:

 Mapper reads the input data which are to be combined based on


common column or join key.
 The mapper processes the input and adds a tag to the input to
distinguish the input belonging from different sources or data sets or
databases.
 The mapper outputs the intermediate key-value pair where the key is
nothing but the join key.
 After the sorting and shuffling phase, a key and the list of values
is generated for the reducer.
 Now, the reducer joins the values present in the list with the key to give
the final aggregated output.


Now, let us take a MapReduce example to understand the above steps in the
reduce side join.

MapReduce Example of Reduce Side Join


Suppose that I have two separate datasets of a sports complex:

 cust_details: It contains the details of the customer.


 transaction_details: It contains the transaction record of the customer.

Using these two datasets, I want to know the lifetime value of each customer.
In doing so, I will be needing the following things:

 The person’s name along with the frequency of the visits by that person.
 The total amount spent by him/her for purchasing the equipment.

The above figure is just to show you the schema of the two datasets on which
we will perform the reduce side join operation.


Kindly, keep the following things in mind while importing the above
MapReduce example project on reduce side join into Eclipse:

 The input files are in input_files directory of the project. Load these into
your HDFS.
 Don’t forget to build the path of Hadoop Reference Jars (present in
reduce side join project lib directory) according to your system or VM.

Now, let us understand what happens inside the map and reduce phases in
this MapReduce example on reduce side join:

1. Map Phase:
I will have a separate mapper for each of the two datasets i.e. One mapper for
cust_details input and other for transaction_details input.

Mapper for cust_details:

public static class CustsMapper extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String record = value.toString();
String[] parts = record.split(",");
context.write(new Text(parts[0]), new Text("cust " + parts[1]));
}
}

 I will read the input taking one tuple at a time.


 Then, I will tokenize each word in that tuple and fetch the cust ID along
with the name of the person.
 The cust ID will be my key of the key-value pair that my mapper will
generate eventually.
 I will also add a tag “cust” to indicate that this input tuple is of
cust_details type.

 Therefore, my mapper for cust_details will produce following
intermediate key-value pair:

Key – Value pair: [cust ID, cust name]

Example: [4000001, cust Kristina], [4000002, cust Paige], etc.

Mapper for transaction_details:

public static class TxnsMapper extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String record = value.toString();
String[] parts = record.split(",");
context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
}
}

 Like the mapper for cust_details, I will follow similar steps here, though there will be a few differences:
o I will fetch the amount value instead of name of the person.
o In this case, we will use “tnxn” as a tag.
 Therefore, the cust ID will be my key of the key-value pair that the
mapper will generate eventually.
 Finally, the output of my mapper for transaction_details will be of the
following format:

Key, Value Pair: [cust ID, tnxn amount]

Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.

2. Sorting and Shuffling Phase


The sorting and shuffling phase will generate an array list of values
corresponding to each key. In other words, it will put together all the values
corresponding to each unique key in the intermediate key-value pair. The
output of sorting and shuffling phase will be of the following format:

Key – list of Values:

 {cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2),
(tnxn amount3),…..]}
 {cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn
amount3),…..]}
 ……

MapReduce Advanced Programming use case

MapReduce Use Case: Global Warming


So, how are companies, governments, and organizations using MapReduce?
First, we give an example where the goal is to calculate a single value from a set of data through
reduction.
Suppose we want to know the level by which global warming has raised the ocean’s temperature.
We have input temperature readings from thousands of buoys all over the globe. We have data
in this format:
(buoy, DateTime, longitude, latitude, low temperature, high temperature)
We would attack this problem in several map and reduce steps. The first would be to run map
over every buoy-dateTime reading and add the average temperature as a field:
(buoy, DateTime, longitude, latitude, low, high, average)
Then we would drop the DateTime column and sum these items for all buoys to produce one
average temperature for each buoy:
(buoy n, average)
Then the reduce operation runs. A mathematician would say this is a pairwise operation on
associative data. In other words, we take each of these (buoy, average) adjacent pairs and sum
them and then divide that sum by the count to produce the average of averages:
ocean average temperature = (average(buoy n) + average(buoy n-1) + … + average(buoy 2) + average(buoy 1)) / number of buoys
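As a quick illustration of that final reduction, here is a small plain-Java sketch (the per-buoy averages are made-up values):

import java.util.Arrays;
import java.util.List;

public class OceanAverage {
public static void main(String[] args) {
// hypothetical per-buoy average temperatures produced by the earlier map steps
List<Double> buoyAverages = Arrays.asList(14.2, 15.1, 13.8, 14.9);
double sum = 0.0;
for (double avg : buoyAverages) {
sum += avg; // pairwise, associative addition
}
double oceanAverage = sum / buoyAverages.size();
System.out.println("Ocean average temperature: " + oceanAverage);
}
}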
MapReduce Use Case: Drug Trials
Mathematicians and data scientists have traditionally worked together in the pharmaceutical
industry. The invention of MapReduce and the dissemination of data science algorithms in big
data systems means ordinary IT departments can now tackle problems that would have required
the work of Ph.D. scientists and supercomputers in the past.
Let’s take, for example, a company that conducts drug trials to show whether its new drug works
against certain illnesses, which is a problem that fits perfectly into the MapReduce model. In this
case, we want to run a regression model against a set of patients who have been given the new
drug and calculate how effective the drug is in combating the disease.
Suppose the drug is used for cancer patients. We have data points like this:
{ (patient name: John, DateTime: 3/01/2016 14:00, dosage: 10 mg, size of cancer tumor: 1 mm)
}
The first step here is to calculate the change in the size of the tumor from one dateTime to next.
Different patients would be taking different amounts of the drug, so we would want to know
what amount of the drug works best. Using MapReduce, we would try to reduce this problem to
some linear relationship like this:
percent reduction in tumor = x (quantity of drug) + y (period of time) + constant value

If some correlation exists between the drug and the reduction in the tumor, then the drug
can be said to work. The model would also show to what degree it works by calculating the error
statistic.

UNIT IV

Graph Representation in MapReduce:

Modeling data and solving problems with graphs:

Modeling Data and Solving Problems with Graphs: A graph consists of a number of nodes
(formally called vertices) and links (informally called edges) that connect nodes together. The
following Figure shows a graph with nodes and edges.

Graphs are mathematical constructs that represent an interconnected set of objects.


They’re used to represent data such as the hyperlink structure of the internet, social
networks (where they represent relationships between users), and in internet routing to
determine optimal paths for forwarding packets.
The edges can be directed (implying a one-way relationship), or undirected.
For example, we can use a directed graph to model relationships between users in a social
network because relationships are not always bidirectional. The following Figure shows
examples of directed and undirected graphs.

Graphs can be cyclic or acyclic. In cyclic graphs it’s possible for a vertex to
reach itself by traversing a sequence of edges. In an acyclic graph it’s not possible for a
vertex to traverse a path to reach itself. The following Figure shows examples of cyclic and
acyclic graphs.

Modeling Graphs: There are two common ways of representing graphs: with adjacency matrices and with adjacency lists.

ADJACENCY MATRIX: In this matrix, we represent a graph as an N x N square matrix M, where

N is the number of nodes, and Mij represents an edge between nodes i and j.
The following Figure shows a directed graph representing connections in a social
graph. The arrows indicate a one-way relationship between two people. The adjacency
matrix shows how this graph would be represented.

The disadvantage of adjacency matrices is that they model both the existence and lack of a relationship, which makes them a dense data structure.
ADJACENCY LIST: Adjacency lists are similar to adjacency matrices, other than the fact that
they don’t model the lack of relationship. The following Figure shows an adjacency list for the above graph.

The advantage of the adjacency list is that it offers a sparse representation of the data,
which is good because it requires less space. It also fits well when representing graphs
in MapReduce because the key can represent a vertex, and the values are a list of vertices that denote a directed or undirected relationship with that vertex.
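For example, a small directed graph could be stored as an adjacency list with one record per node, the node being the key and the comma-separated adjacent nodes being the value (the node names below are purely illustrative):

node1	node2,node3
node2	node3
node3	node1

In MapReduce, each such line maps naturally to one key-value pair.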
Shortest path algorithm:
This algorithm is a common problem in graph theory, where the goal is to find the shortest
route between two nodes. The following Figure shows an example of this algorithm on a
graph where the edges don’t have a weight, in which case the shortest path is the path with
the smallest number of hops, or intermediary nodes between the source and destination.

Applications of this algorithm include traffic mapping software to determine the
shortest route between two addresses, routers that compute the shortest path tree for each
route, and social networks to determine connections between users.
Find the shortest distance between two users: Dijkstra’s algorithm is a shortest
path algorithm and its basic implementation uses a sequential iterative process to
traverse the entire graph from the starting node.

Problem:
We need to use MapReduce to find the shortest path between two people in a social
graph.
Solution:
Use an adjacency list to model a graph, and for each node store the distance from
the original node, as well as a backpointer to the original node. Use the mappers to
propagate the distance to the original node, and the reducer to restore the state of the graph.
Iterate until the target node has been reached.
Discussion: The following Figure shows a small social network, which we’ll use for this
technique. Our goal is to find the shortest path between Dee and Joe. There are four paths
that we can take from Dee to Joe, but only one of them results in the fewest number of hops.

We’ll implement a parallel breadth-first search algorithm to find the shortest path between two users. Because we’re operating on a social network, we don’t need to care about weights on our edges. The algorithm proceeds as described below.

The following Figure shows the algorithm iterations in play with our social graph.
Just like Dijkstra’s algorithm, we’ll start with all the node distances set to infinite, and set
the distance for the starting node, Dee, at zero. With each MapReduce pass, we’ll
determine nodes that don’t have an infinite distance and propagate their distance values to
their adjacent nodes. We continue this until we reach the end node.

We first need to create the starting point. This is done by reading in the social
network (which is stored as an adjacency list) from file and setting the initial distance
values. The following Figure shows the two file formats, the second being the format that’s
used iteratively in our MapReduce code.

Our first step is to create the MapReduce form from the original file. The following
command shows the original input file, and the MapReduce-ready form of the input file
generated by the transformation code:

The reducer calculates the minimum distance for each node and outputs the minimum distance, the backpointer, and the original adjacent nodes. A sketch of the idea is shown in the following code.
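This sketch is not the book's exact listing; it assumes every value arriving at the reducer is tab-separated and tagged either "NODE" (the node's own record: distance, backpointer, adjacency list) or "DIST" (a distance and backpointer proposed by a neighbour), and it keeps the smallest distance seen for the node:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ShortestPathReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int minDistance = Integer.MAX_VALUE;
String backpointer = "";
String adjacentNodes = "";
for (Text value : values) {
String[] parts = value.toString().split("\t");
if (parts[0].equals("NODE")) {
// preserve the original graph structure
adjacentNodes = parts.length > 3 ? parts[3] : "";
}
int distance = Integer.parseInt(parts[1]);
if (distance < minDistance) {
// keep the minimum distance and the backpointer that produced it
minDistance = distance;
backpointer = parts[2];
}
}
context.write(key, new Text(minDistance + "\t" + backpointer + "\t" + adjacentNodes));
}
}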

Now we can run our code: we copy the input file into HDFS and then kick off our MapReduce job, specifying the start node name (dee) and the target node name (joe).

Friends-of-friends (FoFs):
The FoF algorithm is used by social network sites such as LinkedIn and Facebook to help users broaden their networks. The Friends-of-friends (FoF) algorithm suggests friends that a user may know that aren’t part of their immediate network. The following Figure shows FoF to be in the 2nd degree network.

Problem
We want to implement the FoF algorithm in MapReduce.

Solution
Two MapReduce jobs are required to calculate the FoFs for each user in a social
network. The first job calculates the common friends for each user, and the second job sorts
the common friends by the number of connections to our friends.

The following Figure shows a network of people with Jim, one of the users,
highlighted.

In the above graph Jim’s FoFs are represented in bold (Dee, Joe, and Jon). Next to Jim’s FoFs is the number of friends that the FoF and Jim have in common. Our goal here is to determine all the FoFs and order them by the number of friends in common. Therefore, our expected results would have Joe as the first FoF recommendation, followed by Dee, and then Jon. The social graph for this technique is stored as a text file in adjacency-list form (each user followed by the list of that user’s friends).

The first MapReduce job calculates the FoFs for each user; a sketch of the idea is shown below.
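The listing below is only a sketch of the idea (the input layout, one user per line followed by a tab and a comma-separated friend list, and the class names are assumptions): the mapper emits every pair of a user's friends as a candidate FoF pair and also tags direct friendships, and the reducer drops pairs that are already friends and counts the common friends for the rest.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FriendsOfFriends {

public static class FoFMapper extends Mapper<Object, Text, Text, IntWritable> {
private static final IntWritable DIRECT = new IntWritable(0);
private static final IntWritable COMMON = new IntWritable(1);

public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
// assumed layout: user<tab>friend1,friend2,...
String[] parts = value.toString().split("\t");
String user = parts[0];
String[] friends = parts[1].split(",");
for (int i = 0; i < friends.length; i++) {
// tag the direct friendship so it can be filtered out later
context.write(orderedPair(user, friends[i]), DIRECT);
for (int j = i + 1; j < friends.length; j++) {
// these two friends have "user" as a common friend
context.write(orderedPair(friends[i], friends[j]), COMMON);
}
}
}

private static Text orderedPair(String a, String b) {
return new Text(a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a);
}
}

public static class FoFReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int commonFriends = 0;
boolean alreadyFriends = false;
for (IntWritable val : values) {
if (val.get() == 0) {
alreadyFriends = true; // a direct friendship, so not an FoF pair
} else {
commonFriends++;
}
}
if (!alreadyFriends) {
context.write(key, new IntWritable(commonFriends));
}
}
}
}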

The second MapReduce job then sorts the FoFs by the number of shared common friends.

To run the above code, we require driver code, which is not written here. After writing the driver code, we require an input file (“friends.txt”) and output directories (“calc-output” and “sort-output”) (as per the Text Book) to execute it. After running the code we can look at the output in HDFS.

PageRank:
PageRank was a formula introduced by the founders of Google during their Stanford years in 1998. PageRank gives a score to each web page that indicates the page’s importance.
Calculate PageRank over a web graph: PageRank uses the scores for all the inbound links to
calculate a page’s PageRank. But it disciplines individual inbound links from sources that
have a high number of outbound links by dividing that outbound link PageRank by the
number of outbound links. The following Figure presents a simple example of a web graph
with three pages and their respective PageRank values.

In the above formula, |webGraph| is a count of all the pages in the graph, and d, set
to 0.85, is a constant damping factor used in two parts. First, it denotes the probability of a
random surfer reaching the page after clicking on many links (this is a constant equal to
0.15 divided by the total number of pages), and, second, it dampens the effect of the
inbound link PageRanks by 85 percent.
Problem:
We want to implement an iterative PageRank graph algorithm in MapReduce.
Solution:
PageRank can be implemented by iterating a MapReduce job until the graph has
converged. The mappers are responsible for propagating node PageRank values to their
adjacent nodes, and the reducers are responsible for calculating new PageRank values for
each node, and for re-creating the original graph with the updated PageRank values.

Discussion:
One of the advantages of PageRank is that it can be computed iteratively and
applied locally. Every vertex starts with a seed value, which is 1 divided by the number of nodes, and with each iteration each node propagates its value to all the pages it links
to. Each vertex in turn sums up the value of all the inbound vertex values to compute a
new seed value. This iterative process is repeated until such a time as convergence is
reached. Convergence is a measure of how much the seed values have changed since the
last iteration. If the convergence value is below a certain threshold, it means that there’s
been minimal change and we can stop the iteration. It’s also common to limit the number
of iterations in cases of large graphs where convergence takes too many iterations.
The PageRank algorithm can be expressed as map and reduce parts, as sketched below. The map
phase is responsible for preserving the graph as well as emitting the PageRank value to all
the outbound nodes. The reducer is responsible for recalculating the new PageRank value
for each node and including it in the output of the original graph.
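A minimal sketch of those two parts (not the book's exact listing; the record layout "node<tab>pageRank<tab>outbound1,outbound2,..." and the hard-coded page count are assumptions) is:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankSketch {

public static class PageRankMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] parts = value.toString().split("\t");
String node = parts[0];
double pageRank = Double.parseDouble(parts[1]);
String[] outbound = parts[2].split(",");

// preserve the graph structure for the reducer
context.write(new Text(node), new Text("GRAPH\t" + parts[2]));

// distribute this node's PageRank evenly over its outbound links
double share = pageRank / outbound.length;
for (String target : outbound) {
context.write(new Text(target), new Text("RANK\t" + share));
}
}
}

public static class PageRankReducer extends Reducer<Text, Text, Text, Text> {
private static final double DAMPING = 0.85;
private static final long TOTAL_PAGES = 10; // assumed to be known or passed via configuration

public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String outboundLinks = "";
double rankSum = 0.0;
for (Text value : values) {
String[] parts = value.toString().split("\t");
if (parts[0].equals("GRAPH")) {
outboundLinks = parts[1];
} else {
rankSum += Double.parseDouble(parts[1]);
}
}
// new PageRank = (1 - d) / |webGraph| + d * sum of inbound shares
double newRank = (1 - DAMPING) / TOTAL_PAGES + DAMPING * rankSum;
context.write(key, new Text(newRank + "\t" + outboundLinks));
}
}
}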

This technique is applied on the following graph. All the nodes of the graph have
both inbound and outbound edges.

Bloom Filters:

A Bloom filter is defined as a data structure designed to identify an element’s presence in a set in a rapid and memory-efficient manner.
The Bloom filter is implemented as a probabilistic data structure, which helps us identify whether an element is present or absent in a set.
A bit vector is used as the base data structure. Here's a small one we'll use to explain:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Each empty cell in that table specifies a bit and the number below it its index or position. To
append an element to the Bloom filter, we simply hash it a few times and set the bits in the bit
vector at the position or index of those hashes to 1.

An elaborate implementation of the Bloom filter is discussed in the following. Bloom filters support two actions: first, appending an object (keeping track of it), and second, verifying whether an object has been seen before.
Appending objects to the Bloom filter

 We compute hash values for the object to append;
 We use these hash values to set certain bits in the Bloom filter state (each hash value is the position of a bit to set).
Verifying whether the Bloom filter contains an object −
 We compute hash values for the object to verify;
 Next we check whether the bits indexed by these hash values are set in the Bloom filter state.
We have to keep in mind that the hash value for an object is not directly
appended to the bloom filter state; each hash function simply determines
which bit to set or to verify. For example: if only one hash function is used,
only one bit is verified or checked.
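A minimal sketch of these two actions in Java (the two hash functions below are simplistic stand-ins; real implementations use better, independent hash functions):

import java.util.BitSet;

public class SimpleBloomFilter {
private final BitSet bits;
private final int size;

public SimpleBloomFilter(int size) {
this.size = size;
this.bits = new BitSet(size);
}

private int hash1(Object o) { return Math.floorMod(o.hashCode(), size); }
private int hash2(Object o) { return Math.floorMod(31 * o.hashCode() + 17, size); }

// appending: set the bit at each hash position
public void add(Object o) {
bits.set(hash1(o));
bits.set(hash2(o));
}

// verifying: "possibly present" only if every indexed bit is set
public boolean mightContain(Object o) {
return bits.get(hash1(o)) && bits.get(hash2(o));
}

public static void main(String[] args) {
SimpleBloomFilter filter = new SimpleBloomFilter(64);
filter.add("hadoop");
System.out.println(filter.mightContain("hadoop")); // true
System.out.println(filter.mightContain("spark"));  // false (or a false positive)
}
}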

Counting Bloom Filter

Basic Concept
A Counting Bloom filter is defined as a generalized data structure of Bloom filter that is
implemented to test whether a count number of a given element is less than a given threshold
when a sequence of elements is given. As a generalized form of the Bloom filter, there is a possibility of false positive matches, but no chance of false negatives – in other words, a query returns either "possibly higher than or equal to the threshold" or "definitely less than the threshold".

Algorithm description

 Most of the parameters used for the Counting Bloom filter are defined the same as for the Bloom filter, such as n and k. Here m denotes the number of counters in the Counting Bloom filter, which is an expansion of the m bits in the Bloom filter.

 An empty Counting Bloom filter is an array of m counters, all initialized to 0.

 Similar to the Bloom filter, there must also be k different hash functions defined, each of which is responsible for mapping or hashing some set element to one of the m counter array positions, creating a uniform random distribution. It is also the same that k is a constant, much less than m, which is proportional to the number of elements to be appended.

 The main generalization of the Bloom filter is appending an element. To append an element, feed it to each of the k hash functions to obtain k array positions and increment the counters by 1 at all these positions.

 To query for an element with a threshold θ (verify whether the count number of an element is less than θ), feed it to each of the k hash functions to obtain k counter positions.

 If any of the counters at these positions is smaller than θ, the count number of the element is definitely smaller than θ – if it were higher or equal, then all the corresponding counters would have been higher than or equal to θ.

 If all are higher or equal to θ, then either the count is really higher or equal to θ, or the
counters have by chance been higher or equal to θ.

 If all are higher or equal to θ even though the count is less than θ, this situation is defined
as false positive. Like Bloom filter, this also should be minimized.
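The same idea with counters instead of bits gives a minimal sketch of the Counting Bloom filter (again with simplistic stand-in hash functions):

public class SimpleCountingBloomFilter {
private final int[] counters;
private final int size;

public SimpleCountingBloomFilter(int size) {
this.size = size;
this.counters = new int[size];
}

private int hash1(Object o) { return Math.floorMod(o.hashCode(), size); }
private int hash2(Object o) { return Math.floorMod(31 * o.hashCode() + 17, size); }

// appending: increment the counter at each hash position
public void add(Object o) {
counters[hash1(o)]++;
counters[hash2(o)]++;
}

// query: "possibly >= threshold" only if every counter reaches the threshold
public boolean countAtLeast(Object o, int threshold) {
return counters[hash1(o)] >= threshold && counters[hash2(o)] >= threshold;
}
}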

Blocked Bloom Filter

 We select a memory block first.


 Then we select a local Bloom filter within each block.
 It might cause imbalance between memory blocks.
 This filter is efficient, but has a poorer false positive rate (FPR).
 At first instance, blocked Bloom filters should have the same FPR (False
Positive Rate) as standard Bloom filters of the same size.

 A Blocked Bloom Filter consists of a sequence of b comparatively small standard Bloom filters (Bloom filter blocks), each of which fits into one cache line.
 Blocked Bloom filter scheme is differentiated from the partition schemes,
where each bit is inserted into a different block.
Blocked Bloom Filter is implemented in following ways −
Bit Patterns (pat)
In this topic, we discuss implementing blocked Bloom filters using precomputed bit patterns. Instead of setting k bits through the evaluation of k
hash functions, a single hash function selects a precomputed pattern from a table of random k-bit
pattern of width B. In many cases, this table will fit into the cache. With this solution, only one
small (in terms of bits) hash value is required, and the operation can be implemented using few
SIMD(Single Instruction Multiple Data)instructions. At the time of transferring the Bloom filter,
the table need not be included explicitly in the data, but can be reconstructed implementing the
seed value.
The main disadvantage of the bit pattern method is that two elements may cause a table collision
when they are hashed to the same pattern. This causes increased FPR.

Multiplexing Patterns
To refine this idea once more, we can achieve a larger variety of patterns from
a single table by bitwise-or-ing x patterns with an average number of k/x set
bits.
Multi-Blocking
One more variant that helps improve the FPR is called multi-blocking. We permit
the query operation to access X Bloom filters blocks, setting or testing k/X bits respectively in
each block. (When k is not divisible by X, we set an extra bit in the first k mod X blocks.) Multi-
blocking performs better than just increasing the block size to XB (B-each block size), since
more variety is introduced this way. If we divide the set bits among several blocks, the expected
number of 1 bit per block remains the same. However, only k/X bits are considered in each
participating block, when accessing an element.

Spark – Create RDD

To create RDD in Apache Spark, some of the possible ways are


1. Create RDD from List<T> using Spark Parallelize.
2. Create RDD from Text file
3. Create RDD from JSON file
In this tutorial, we will go through examples, covering each of the above
mentioned processes.

Example – Create RDD from List<T>
In this example, we will take a List of strings, and then create a Spark
RDD from this list.
RDDfromList.java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {

public static void main(String[] args) {


// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Spark RDD foreach Example")
.setMaster("local[2]").set("spark.executor.memory","2g");

// start a spark context


JavaSparkContext sc = new JavaSparkContext(sparkConf);

// read list to RDD


List<String> data =
Arrays.asList("Learn","Apache","Spark","with","Tutorial Kart");
JavaRDD<String> items = sc.parallelize(data,1);

// apply a function for each element of RDD

items.foreach(item -> {
System.out.println("* "+item);

});
}
}

Example – Create RDD from Text file


In this example, we have the data in text file and will create an RDD from
this text file.
ReadTextToRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

public static void main(String[] args) {

// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
.setMaster("local[2]").set("spark.executor.memory","2g");

// start a spark context


JavaSparkContext sc = new JavaSparkContext(sparkConf);

// provide path to input text file


String path = "data/rdd/input/sample.txt";

// read text file to RDD


JavaRDD<String> lines = sc.textFile(path);

// collect RDD for printing

for(String line:lines.collect()){
System.out.println(line);
}
}
}

Example – Create RDD from JSON file


In this example, we will create an RDD from JSON file.
JSONtoRDD.java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {

public static void main(String[] args) {

// configure spark
SparkSession spark = SparkSession
.builder()
.appName("Spark Example - Read JSON to RDD")
.master("local[2]")

.getOrCreate();

// read list to RDD

String jsonPath = "data/employees.json";


JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

items.foreach(item -> {
System.out.println(item);
});
}
}
Conclusion

We have learnt to create a Spark RDD from a List and by reading a text or JSON file from the file system, with the help of example programs.

Operations:
RDD Operations

The RDD provides the two types of operations:

o Transformation
o Action

Transformation

In Spark, the role of a transformation is to create a new dataset from an existing one. The transformations are considered lazy as they are only computed when an action requires a result to be returned to the driver program.

Let's see some of the frequently used RDD Transformations.

map(func) – It returns a new distributed dataset formed by passing each element of the source through a function func.

filter(func) – It returns a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func) – Here, each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.

mapPartitions(func) – It is similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func) – It is similar to mapPartitions but provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed) – It samples the fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset) – It returns a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset) – It returns a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numPartitions]) – It returns a new dataset that contains the distinct elements of the source dataset.

groupByKey([numPartitions]) – It returns a dataset of (K, Iterable<V>) pairs when called on a dataset of (K, V) pairs.

reduceByKey(func, [numPartitions]) – When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.

aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) – When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value.

sortByKey([ascending], [numPartitions]) – It returns a dataset of key-value pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numPartitions]) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numPartitions]) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset) – When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars]) – Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script.

coalesce(numPartitions) – It decreases the number of partitions in the RDD to numPartitions.

repartition(numPartitions) – It reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them.

repartitionAndSortWithinPartitions(partitioner) – It repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.

Action

In Spark, the role of action is to return a value to the driver program after running a computation
on the dataset.


Let's see some of the frequently used RDD Actions.

reduce(func) – It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect() – It returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count() – It returns the number of elements in the dataset.

first() – It returns the first element of the dataset (similar to take(1)).

take(n) – It returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]) – It returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]) – It returns the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path) – It is used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala) – It is used to write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system.

saveAsObjectFile(path) (Java and Scala) – It is used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey() – It is only available on RDDs of type (K, V). Thus, it returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func) – It runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.

Passing Functions to Spark:

Python provides a simple way to pass functions to Spark. The Spark programming guide
available at spark.apache.org suggests there are three recommended ways to do this:

 Lambda expressions are the ideal way for short functions that can be written inside a single expression
 Local defs inside the function calling into Spark, for longer code
 Top-level functions in a module

While we have already looked at the lambda functions in some of the previous
examples, let's look at local definitions of the functions. We can encapsulate
our business logic which is splitting of words, and counting into two separate
functions as shown below.

def splitter(lineOfText):
words = lineOfText.split(" ")
return len(words)
def aggregate(numWordsLine1, numWordsLineNext):
totalWords = numWordsLine1 + numWordsLineNext
return totalWords
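For comparison, a minimal sketch of the same word-count logic in the Java API (the input path is illustrative) passes the function either as a lambda or as a reference to a named method:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PassingFunctions {
// a named function, the Java counterpart of the local def above
static int splitter(String lineOfText) {
return lineOfText.split(" ").length;
}

public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("PassingFunctions").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data/rdd/input/sample.txt");

// 1. lambda expression for a short function
int totalWords1 = lines.map(line -> line.split(" ").length)
.reduce((a, b) -> a + b);

// 2. method reference to the named function
int totalWords2 = lines.map(PassingFunctions::splitter)
.reduce(Integer::sum);

System.out.println(totalWords1 + " " + totalWords2);
sc.close();
}
}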

Common Transformations and Actions:

A Spark Transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Each time we apply a transformation it creates a new RDD; the input RDDs cannot be changed, since RDDs are immutable in nature.
Applying transformations builds an RDD lineage, containing the entire parent RDDs of the final RDD(s). RDD lineage is also known as the RDD operator graph or RDD dependency graph. It is a logical execution plan, i.e., it is a Directed Acyclic Graph (DAG) of the entire parent RDDs of the RDD.
Transformations are lazy in nature, i.e., they get executed only when we call an action; they are not executed immediately. The two most basic types of transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. flatMap(), union(), Cartesian()) or the same size (e.g. map).
There are two types of transformations:

 Narrow transformation – In Narrow transformation, all the elements that are required to
compute the records in single partition live in the single partition of parent RDD. A limited
subset of partition is used to calculate the result. Narrow transformations are the result
of map(), filter().

 Wide transformation – In wide transformation, all the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().

There are various functions in RDD transformation. Let us see RDD
transformation with examples.

map(func)
The map function iterates over every line in RDD and split into new RDD.
Using map() transformation we take in any function, and that function is
applied to every element of RDD.
In the map, we have the flexibility that the input and the return type of RDD
may differ from each other. For example, we can have input RDD type as
String, after applying the

map() function the return RDD can be Boolean.

For example, in RDD {1, 2, 3, 4, 5} if we apply “rdd.map(x=>x+2)” we will get


the result as (3, 4, 5, 6, 7).

Map() example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object mapTest{
def main(args: Array[String]) = {
val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.map(line => (line,line.length))
mapFile.foreach(println)
}
}
spark_test.txt:

hello...user! this file is created to check the operations of spark.

?, and how can we apply functions on that RDD partitions?. All this will
be done through spark programming which is done with the help of scala
language support…

 Note – In the above code, the map() function maps each line of the file to its length.
Flat Map()

With the help of the flatMap() function, each input element can be mapped to many elements in the output RDD. The simplest use of flatMap() is to split each input string into words.
Map and flatMap are similar in the way that they take a line from input RDD and apply a
function on that line. The key difference between map() and flatMap() is map() returns only one
element, while flatMap() can return a list of elements.
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)

Persistence:

Persistence is "the continuance of an effect after its cause is removed". In the context of storing

data in a computer system, this means that the data survives after the process with which it was

created has ended. In other words, for a data store to be considered persistent, it must write to

non-volatile storage.

Which data stores provide persistence?


If you need persistence in your data store, then you need to also understand the four main design

approaches that a data store can take and how (or if) these designs provide persistence:

 Pure in-memory, no persistence at all, such as memcached or Scalaris

 In-memory with periodic snapshots, such as Oracle Coherence or Redis

 Disk-based with update-in-place writes, such as MySQL ISAM or MongoDB

 Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)

In-memory approaches can achieve blazing speed, but at the cost of being limited to a relatively

small data set. Most workloads have relatively small "hot" (active) subset of their total data;

systems that require the whole dataset to fit in memory rather than just the active part are fine for

caches but a bad fit for most other applications. Because the data is in memory only, it will not

survive process termination. Therefore these types of data stores are not considered persistent.

Adding persistence to systems

The easiest way to add persistence to an in-memory system is with periodic snapshots to disk at a

configurable interval. Thus, you can lose up to that interval's worth of updates.

Update-in-place and commitlog-based systems store to non-volatile memory immediately, but

only commitlog-based persistence provides Durability -- the D in ACID -- with every write

persisted before success is returned to the client.

Cassandra implements a commit-log based persistence design, but at the same time provides

for tunable levels of durability. This allows you to decide what the right trade off is between

safety and performance. You can choose, for each write operation, to wait for that update to be:

 buffered to memory

 written to disk on a single machine

 written to disk on multiple machines

 written to disk on multiple machines in different data centers

Or, you can choose to accept writes as quickly as possible, acknowledging their receipt

immediately before they have even been fully deserialized from the network.

Why data persistence matters

At the end of the day, you're the only one who knows what the right performance/durability trade

off is for your data. Making an informed decision on data store technologies is critical to

addressing this tradeoff on your terms. Because Cassandra provides such tunability, it is a logical

choice for systems with a need for a durable, performant data store.
Adding Schemas to RDDs :

Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable fault-

tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain

any type of object and is created by loading an external dataset or distributing a collection from

the driver program.

A SchemaRDD is an RDD that you can run SQL on. It is more than SQL; it is a unified interface for structured data.

Code explanation:
1. Importing Expression Encoder for RDDs. RDDs are similar to Datasets but use encoders for

serialization.

2. Importing Encoder library into the shell.

3. Importing the Implicts class into our ‘spark’ Session.

4. Creating an ’employeeDF’ DataFrame from ’employee.txt’ and mapping the columns based on

the delimiter comma ‘,’ into a temporary view ’employee’.

5. Creating the temporary view ’employee’.

6. Defining a DataFrame ‘youngstersDF’ which will contain all the employees between the ages

of 18 and 30.

7. Mapping the names from the RDD into ‘youngstersDF’ to display the names of youngsters.

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._
val employeeDF = spark.sparkContext.textFile("examples/src/main/resources/employee.txt").map(_.split(",")).map(attributes => Employee(attributes(0), attributes(1).trim.toInt)).toDF()
employeeDF.createOrReplaceTempView("employee")
val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
youngstersDF.map(youngster => "Name: " + youngster(0)).show()

Code explanation:
1. Converting the mapped names into string for transformations.
2. Using the mapEncoder from Implicits class to map the names to the ages.
3. Mapping the names to the ages of our ‘youngstersDF’ DataFrame. The result is an array with
names mapped to their respective ages.

youngstersDF.map(youngster => "Name: " + youngster.getAs[String]("name")).show()
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
youngstersDF.map(youngster => youngster.getValuesMap[Any](List("name", "age"))).collect()

RDDs support two types of operations:

 Transformations: These are the operations (such as map, filter, join, union, and so on)

performed on an RDD which yield a new RDD containing the result.

 Actions: These are operations (such as reduce, count, first, and so on) that return a value

after running a computation on an RDD.

Transformations in Spark are “lazy”, meaning that they do not compute their results right away.

Instead, they just “remember” the operation to be performed and the dataset (e.g., file) to which

the operation is to be performed. The transformations are computed only when an action is called

and the result is returned to the driver program and stored as Directed Acyclic Graphs (DAG).

This design enables Spark to run more efficiently. For example, if a big file was transformed in

various ways and passed to the first action, Spark would only process and return the result for the

first line, rather than do the work for the entire file.

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory using the persist or cache method, in which
case Spark will keep the elements around on the cluster for much faster access the next time you
query it.

RDDs as Relations:

Resilient Distributed Datasets (RDDs) are distributed memory abstraction which lets

programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs

can be created from any data source. Eg: Scala collection, local file system, Hadoop, Amazon S3,

HBase Table, etc.

Specifying Schema

Code explanation:

1. Importing the ‘types’ class into the Spark Shell.

2. Importing ‘Row’ class into the Spark Shell. Row is used in mapping RDD Schema.

3. Creating an RDD ’employeeRDD’ from the text file ’employee.txt’.

4. Defining the schema as “name age”. This is used to map the columns of the RDD.

5. Defining ‘fields’ RDD which will be the output after mapping the ’employeeRDD’ to the

schema ‘schemaString’.

6. Obtaining the type of ‘fields’ RDD into ‘schema’.


import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val employeeRDD = spark.sparkContext.textFile("examples/src/main/resources/employee.txt")
val schemaString = "name age"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

Code explanation:
1. We now create a RDD called ‘rowRDD’ and transform the ’employeeRDD’ using the ‘map’
function into ‘rowRDD’.
2. We define a DataFrame ’employeeDF’ and store the RDD schema into it.
3. Creating a temporary view of ’employeeDF’ into ’employee’.
4. Performing the SQL operation on ’employee’ to display the contents of employee.
5. Displaying the names of the previous operation from the ’employee’ view.

val rowRDD = employeeRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
val employeeDF = spark.createDataFrame(rowRDD, schema)
employeeDF.createOrReplaceTempView("employee")
val results = spark.sql("SELECT name FROM employee")
results.map(attributes => "Name: " + attributes(0)).show()

Even though RDDs are defined, they don’t contain any data. The computation to create the data in

an RDD is only done when the data is referenced. e.g. Caching results or writing out the RDD.

Caching Tables In-Memory


Spark SQL caches tables using an in-memory columnar
format:

1. Scan only required columns

2. Fewer allocated objects

3. Automatically selects the best compression
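A minimal sketch in the Java API, assuming the employee.json file used in the examples below, caches and un-caches such a table:

import org.apache.spark.sql.SparkSession;

public class CacheTableExample {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder()
.appName("CacheTableExample")
.master("local[2]")
.getOrCreate();

spark.read().json("examples/src/main/resources/employee.json")
.createOrReplaceTempView("employee");

spark.catalog().cacheTable("employee");   // store the table in the in-memory columnar format
spark.sql("SELECT * FROM employee").show();
spark.catalog().uncacheTable("employee"); // release the cached data
spark.stop();
}
}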

Loading Data Programmatically

The below code will read employee.json file and create a


DataFrame. We will then use it to create a Parquet file.

Code explanation:
1. Importing Implicits class into the shell.
2. Creating an ’employeeDF’ DataFrame from our
’employee.json’ file.
import spark.implicits._
val employeeDF = spark.read.json("examples/src/main/resources/employee.json")

Code explanation:
1. Creating a ‘parquetFile’ temporary view of our DataFrame.
2. Selecting the names of people between the ages of 18 and
30 from our Parquet file.
3. Displaying the result of the Spark SQL operation.
employeeDF.write.parquet("employee.parquet")
val parquetFileDF = spark.read.parquet("employee.parquet")
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 18 AND 30")
namesDF.map(attributes => "Name: " + attributes(0)).show()

JSON Datasets
We will now work on JSON data. As Spark SQL supports
JSON dataset, we create a DataFrame of employee.json. The
schema of this DataFrame can be seen below. We then define
a Youngster DataFrame and add all the employees between
the ages of 18 and 30.

Code explanation:
1. Setting to path to our ’employee.json’ file.
2. Creating a DataFrame ’employeeDF’ from our JSON file.
3. Printing the schema of ’employeeDF’.
4. Creating a temporary view of the DataFrame into
’employee’.
5. Defining a DataFrame ‘youngsterNamesDF’ which stores
the names of all the employees between the ages of 18 and
30 present in ’employee’.
6. Displaying the contents of our DataFrame.

val path = "examples/src/main/resources/employee.json"
val employeeDF = spark.read.json(path)
employeeDF.printSchema()
employeeDF.createOrReplaceTempView("employee")
val youngsterNamesDF = spark.sql("SELECT name FROM employee
WHERE age BETWEEN 18 AND 30")
youngsterNamesDF.show()

Code explanation:
1. Creating a RDD ‘otherEmployeeRDD’ which will store the
content of employee George from New Delhi, Delhi.
2. Assigning the contents of ‘otherEmployeeRDD’ into
‘otherEmployee’.
3. Displaying the contents of ‘otherEmployee’.
val otherEmployeeRDD = spark.sparkContext.makeRDD("""{"name":"George","address":{"city":"New Delhi","state":"Delhi"}}""" :: Nil)
val otherEmployee = spark.read.json(otherEmployeeRDD)
otherEmployee.show()

Hive Tables
We perform a Spark example using Hive tables.

Code explanation:
1. Importing ‘Row’ class into the Spark Shell. Row is used in mapping RDD Schema.

2. Importing Spark Session into the shell.

3. Creating a class ‘Record’ with attributes Int and String.

4. Setting the location of ‘warehouseLocation’ to Spark warehouse.

5. We now build a Spark Session ‘spark’ to demonstrate Hive example in Spark SQL.

6. Importing Implicits class into the shell.

7. Importing SQL library into the Spark Shell.

8. Creating a table ‘src’ with columns to store key and value.


import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
case class Record(key: Int, value: String)
val warehouseLocation = "spark-warehouse"
val spark = SparkSession.builder().appName("Spark Hive Example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

Code explanation:
1. We now load the data from the examples present in Spark directory into our table ‘src’.
2. The contents of ‘src’ is displayed below.

sql("LOAD DATA LOCAL INPATH


'examples/src/main/resources/kv1.txt' INTO TABLE src")
sql("SELECT * FROM src").show()

Code explanation:
1. We perform the ‘count’ operation to select the number of keys in ‘src’ table.
2. We now select all the records with ‘key’ value less than 10 and store it in the ‘sqlDF’
DataFrame.
3. Creating a Dataset ‘stringDS’ from ‘sqlDF’.
4. Displaying the contents of ‘stringDS’ Dataset.

sql("SELECT COUNT(*) FROM src").show()
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringsDS = sqlDF.map {case Row(key: Int, value: String) => s"Key: $key, Value: $value"}
stringsDS.show()

Code explanation:
1. We create a DataFrame ‘recordsDF’ and store all the records with key values 1 to 100.
2. Create a temporary view ‘records’ of ‘recordsDF’ DataFrame.
3. Displaying the contents of the join of tables ‘records’ and ‘src’ with ‘key’ as the primary key.

val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()

Creating Pairs in RDDs:

In Apache Spark, Key-value pairs are known as paired RDD. In this blog, we will learn what are
paired RDDs in Spark in detail.
To understand this in depth, we will focus on the methods of creating Spark paired RDDs and the operations on paired RDDs in Spark, such as transformations and actions.

The transformations include groupByKey, reduceByKey, join and leftOuterJoin/rightOuterJoin, while the actions include countByKey. But first, we will go through a brief introduction to RDDs in Spark.

RDD refers to Resilient Distributed Dataset, the core abstraction and fundamental data structure
of Spark. RDDs in Spark are immutable, distributed collections of objects. In an RDD,
each dataset is divided into logical partitions, and each partition may be computed on a different
node of the cluster. Spark RDDs can contain user-defined classes as well as any type of Scala,
Python or Java objects.

An RDD is a read-only, partitioned collection of records. Spark RDDs are fault-tolerant collections of
elements that can be operated on in parallel. There are generally three ways to create Spark RDDs:

from data in stable storage, from other RDDs, and by parallelizing an existing collection in the driver
program. By using RDDs, it is possible to achieve faster and more efficient MapReduce operations.

Introduction – Apache Spark Paired RDD


Spark paired RDDs are defined as RDDs containing key-value pairs. A key-value pair (KVP) consists of
two linked data items: the key is the identifier, while the value is the data
corresponding to that key.
In addition, most Spark operations work on RDDs containing any type of object. But on
RDDs of key-value pairs, a few special operations are available. For example, distributed
“shuffle” operations, such as grouping or aggregating the elements by a key.

In Scala, these operations are automatically available on RDDs containing Tuple2 objects. The
key-value pair operations are available in the PairRDDFunctions class, which wraps around an
RDD of tuples.

For example:
In this code we are using the reduceByKey operation on key-value
pairs. We will count how many times each line of text occurs in a file:
val lines1 = sc.textFile("data1.txt")

val pairs1 = lines1.map(s => (s, 1))

val counts1 = pairs1.reduceByKey((a, b) => a + b)

There is one more method, counts1.sortByKey(), which we can use to sort the result by key.


Importance of Apache Spark Paired RDDs

In many programs, pair RDDs of Apache Spark are a useful building block. They expose operations that
allow us to act on each key in parallel and help to regroup the
data across the network.
For instance, in Spark paired RDDs the reduceByKey() method aggregates data separately for each
key, and the join() method merges two RDDs together by grouping elements with the same
key. It is very common to extract fields from an RDD, for example an event time, a customer ID, or
another identifier, and to use those fields as keys in Spark pair RDD operations.

Creating Paired RDD in Spark

We can create Spark pair RDDs by running a map() function that returns key/value pairs. The exact
procedure to build key-value RDDs differs by language.
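For example, a minimal sketch in Scala (assuming an existing SparkContext sc; the sample data is hypothetical) that keys each line of a small collection by its first word:

val lines = sc.parallelize(Seq("spark is fast", "hadoop is reliable", "spark scales"))
val pairRDD = lines.map(line => (line.split(" ")(0), line))
// pairRDD now holds ("spark", "spark is fast"), ("hadoop", "hadoop is reliable"), ("spark", "spark scales")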

Some Interesting Spark Paired RDD – Operations

1. Transformation Operations
Pair RDDs are allowed to use all the transformations available to standard RDDs, and the
same rules from “passing functions to Spark” apply.
Since Spark paired RDDs contain tuples, we need to pass functions that operate on
tuples, rather than on individual elements. Some of the transformation methods are listed here.
For example:

 groupByKey()

Basically, it groups all the values with the same key.

rdd.groupByKey()

 reduceByKey(fun)

It is used to combine values with the same key.

rdd.reduceByKey((x, y) => x + y)

 combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)

By using a different result type, combine values with the same key.
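A small sketch (assuming a hypothetical pair RDD of (String, Int) scores) that uses combineByKey to compute the average value per key:

val scores = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)))
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
)
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }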

 mapValues(func)

Without changing the key, apply a function to each value of a pair RDD of spark.

rdd.mapValues(x => x+1)

 keys()

Basically, Keys() returns a spark RDD of just the keys.

rdd.keys()

 values()

Generally, values() returns an RDD of just the values.

rdd.values()

 sortByKey()

Basically, sortByKey returns an RDD sorted by the key.

rdd.sortByKey()

2. Action Operations
Like transformations, the actions available on Spark pair RDDs are similar to those on base RDDs. In addition,
there are some extra actions available on pair RDDs that leverage
the key/value nature of the data. Some of them are listed below. For example,

 countByKey()

For each key, it helps to count the number of elements.

rdd.countByKey()

 collectAsMap()

Basically, it helps to collect the result as a map to provide easy lookup.

rdd.collectAsMap()

 lookup(key)

Basically, lookup(key) returns all values associated with the provided key.

rdd.lookup()

Conclusion

Hence, we have seen how to work with Spark key/value data and how to use the specialized
functions and operations that Spark provides for paired RDDs.

Transformations and actions on RDDs

1. Spark RDD Operations

Two types of Apache Spark RDD operations are Transformations and Actions.
A Transformation is a function that produces a new RDD from the existing RDDs, but when we
want to work with the actual dataset, an Action is performed. When an action is
triggered, no new RDD is formed, unlike with a transformation. In this section we will get a detailed
view of what a transformation in Spark RDD is, the various RDD transformation operations in Spark with
examples, what an action in Spark RDD is, and the various RDD action operations in Spark with
examples.

2. Apache Spark RDD Operations

Before we start with Spark RDD Operations, let us deep dive into RDD
in Spark.
Apache Spark RDD supports two types of Operations-
 Transformations
 Actions
Now let us first understand what Spark RDD Transformations and
Actions are.

3. RDD Transformation
Spark Transformation is a function that produces a new RDD from
existing RDDs. It takes an RDD as input and produces one or more RDDs
as output. Each time we apply a transformation, it creates a new RDD.
Thus, the input RDDs cannot be changed, since RDDs are immutable in nature.
Applying transformations builds an RDD lineage, with the entire parent
RDDs of the final RDD(s). RDD lineage is also known as the RDD operator
graph or RDD dependency graph. It is a logical execution plan,
i.e., it is a Directed Acyclic Graph (DAG) of the entire parent RDDs of
the final RDD.
Transformations are lazy in nature, i.e., they get executed only when we call
an action. They are not executed immediately. The two most basic types of
transformations are map() and filter().
After the transformation, the resultant RDD is always different from
its parent RDD. It can be smaller (e.g. filter, count, distinct, sample),
bigger (e.g. flatMap(), union(), cartesian()) or the same size
(e.g. map).

There are two types of transformations:


 Narrow transformation – In a narrow transformation, all the elements
that are required to compute the records in a single partition live in
a single partition of the parent RDD. A limited subset of partitions is
used to calculate the result. Narrow transformations are the result
of map() and filter().

 Wide transformation – In a wide transformation, all the elements
that are required to compute the records in a single partition
may live in many partitions of the parent RDD. Wide transformations are the
result of groupByKey() and reduceByKey(); a small sketch contrasting the two follows below.
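A small sketch contrasting the two (assuming an existing SparkContext sc):

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)        // narrow: each output partition depends on a single input partition
val pairs = nums.map(n => (n % 2, n))
val grouped = pairs.groupByKey()     // wide: requires shuffling data across partitions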

There are various functions in RDD transformation.

Let us see RDD transformation with examples.


1.map(func)

The map function iterates over every element in the RDD and produces a new RDD.
Using the map() transformation we take in any function, and that function is applied to every
element of the RDD.
With map, we have the flexibility that the input type and the return type of the RDD may differ from
each other. For example, we can have an input RDD of type String and, after applying the

map() function, the return RDD can be Boolean.

For example, in RDD {1, 2, 3, 4, 5} if we apply “rdd.map(x=>x+2)” we will get the result as (3,
4, 5, 6, 7).

Map() example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object mapTest{
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
    val data = spark.read.textFile("spark_test.txt").rdd
    val mapFile = data.map(line => (line, line.length))
    mapFile.foreach(println)
  }
}

Contents of spark_test.txt:

hello...user! this file is created to check the operations of spark.
and how can we apply functions on that RDD partitions? All this will be done through spark
programming which is done with the help of scala language support…

2. flatMap()

With the help of flatMap() function, to each input element, we have many elements in an output
RDD. The most simple use of flatMap() is to split each input string into words.
Map and flatMap are similar in the way that they take a line from input RDD and apply a
function on that line. The key difference between map() and flatMap() is map() returns only one
element, while flatMap() can return a list of elements.
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)

3. filter(func)

Spark RDD filter() function returns a new RDD, containing only the elements that meet a
predicate. It is a narrow operation because it does not shuffle data from one partition to many
partitions.
For example, Suppose RDD contains first five natural numbers (1, 2, 3, 4, and 5) and the
predicate is check for an even number. The resulting RDD after the filter will contain only the
even numbers i.e., 2 and 4.

Filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())

 Note – In the above code, the flatMap function maps each line into words, the filter keeps only the
words equal to “spark”, and count() then counts those words in mapFile.
4. mapPartitions(func)

The mapPartitions transformation converts each partition of the source RDD into many elements of the result
(possibly none). In mapPartitions(), the function is applied to each partition as a whole.
mapPartitions is like map, but the difference is that it runs separately on each
partition (block) of the RDD.
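A minimal sketch (assuming an existing SparkContext sc) that sums the elements of each partition:

val nums = sc.parallelize(1 to 10, 2)
val partitionSums = nums.mapPartitions(iter => Iterator(iter.sum))
partitionSums.collect()   // Array(15, 40) for partitions (1..5) and (6..10)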

5. mapPartitionsWithIndex()

It is like mapPartitions, but in addition it provides func with an integer value representing
the index of the partition, so the function is applied to each partition together with its index.
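A minimal sketch (assuming an existing SparkContext sc) that tags every element with the index of its partition:

val nums = sc.parallelize(1 to 10, 2)
val withIndex = nums.mapPartitionsWithIndex((index, iter) => iter.map(n => (index, n)))
withIndex.collect()   // (0,1) ... (0,5), (1,6) ... (1,10)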

Spark SQL-Overview

Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop
framework is based on a simple programming model (MapReduce) and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to
maintain speed in processing large datasets in terms of waiting time between queries and
waiting time to run the program.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its
own cluster management computation, it uses Hadoop for storage purpose only.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for
fast computation. It is based on Hadoop MapReduce and it extends the
MapReduce model to efficiently use it for more types of computations, which
includes interactive queries and stream processing. The main feature of Spark is its in-memory
cluster computing that increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workload in a
respective system, it reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software
Foundation in 2013, and Apache Spark has been a top-level Apache project since February
2014.
Features of Apache Spark
Apache Spark has following features.
 Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing number
of read/write operations to disk. It stores the intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80
high-level operators for interactive querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports
SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment as explained below.
 Standalone − Spark Standalone deployment means Spark occupies
the place on top of HDFS(Hadoop Distributed File System) and space
is allocated for HDFS, explicitly. Here, Spark and MapReduce will run
side by side to cover all spark jobs on cluster.
 Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs
on Yarn without any pre-installation or root access required. It helps to
integrate Spark into Hadoop ecosystem or Hadoop stack. It allows
other components to run on top of stack.
 Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch
spark job in addition to standalone deployment. With SIMR, user can
start Spark and uses its shell without any administrative access.

Components of Spark
The following illustration depicts the different components of Spark.

Apache Spark Core


Spark Core is the underlying general execution engine for spark platform that
all other functionality is built upon. It provides In-Memory computing and
referencing datasets in external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of
data.

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework on top of Spark that takes advantage of
the distributed memory-based Spark architecture. According to benchmarks done by the MLlib
developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine
times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a
Spark interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It
provides an API for expressing graph computation that can model the user-
defined graphs by using Pregel abstraction API. It also provides an optimized
runtime for this abstraction.

Spark – RDD
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of
Spark. It is an immutable distributed collection of objects. Each dataset in
RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala
objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can
be created through deterministic operations on either data on stable storage
or other RDDs. RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in
your driver program, or referencing a dataset in an external storage system,
such as a shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
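A brief sketch of both approaches (assuming an existing SparkContext sc; the HDFS path is hypothetical):

// 1. Parallelizing an existing collection in the driver program
val dataRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
// 2. Referencing a dataset in an external storage system
val fileRDD = sc.textFile("hdfs://namenode:9000/user/data/input.txt")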

Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take
place and why they are not so efficient.

Data Sharing is Slow in MapReduce


MapReduce is widely adopted for processing and generating large datasets
with a parallel, distributed algorithm on a cluster. It allows users to write
parallel computations, using a set of high-level operators, without having to
worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data
between computations (Ex: between two MapReduce jobs) is to write it to an
external stable storage system (Ex: HDFS). Although this framework provides
numerous abstractions for accessing a cluster’s computational resources,
users still want more.
Both iterative and interactive applications require faster data sharing across
parallel jobs. Data sharing is slow in MapReduce due
to replication, serialization, and disk IO. Regarding the storage system, most
Hadoop applications spend more than 90% of their time doing
HDFS read-write operations.

Iterative Operations on MapReduce


Reuse intermediate results across multiple computations in multi-stage
applications. The following illustration explains how the current framework
works, while doing the iterative operations on MapReduce. This incurs
substantial overheads due to data replication, disk I/O, and serialization,
which makes the system slow.

Interactive Operations on MapReduce


Users run ad-hoc queries on the same subset of data. Each query will do
disk I/O on the stable storage, which can dominate application execution
time.
The following illustration explains how the current framework works while
doing the interactive queries on MapReduce.

Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization,
and disk IO. Most Hadoop applications spend more than 90% of
their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework
called Apache Spark. The key idea of spark is Resilient Distributed Datasets
(RDD); it supports in-memory processing computation. This means, it stores
the state of memory as an object across the jobs and the object is sharable
between those jobs. Data sharing in memory is 10 to 100 times faster than
network and Disk.
Let us now try to find out how iterative and interactive operations take place
in Spark RDD.

Iterative Operations on Spark RDD


The illustration given below shows the iterative operations on Spark RDD. It
will store intermediate results in a distributed memory instead of Stable
storage (Disk) and make the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate
results (state of the job), then it will store those results on the disk.

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different
queries are run on the same set of data repeatedly, this particular data can
be kept in memory for better execution times.
By default, each transformed RDD may be recomputed each time you run an
action on it. However, you may also persist an RDD in memory, in which
case Spark will keep the elements around on the cluster for much faster
access, the next time you query it. There is also support for persisting RDDs
on disk, or replicated across multiple nodes.
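For example, a minimal sketch of persisting an RDD in memory so that repeated actions reuse it (assuming an existing SparkContext sc and the spark_test.txt file used earlier):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("spark_test.txt")
val sparkLines = logs.filter(_.contains("spark"))
sparkLines.persist(StorageLevel.MEMORY_ONLY)   // equivalent to sparkLines.cache()
sparkLines.count()   // the first action computes the RDD and caches it
sparkLines.count()   // later actions read the cached partitions from memory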

Spark - Installation
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark into a Linux based system.
The following steps show how to install Apache Spark.
Step1: Verifying Java Installation
Java installation is one of the mandatory things in installing Spark. Try the following command
to verify the JAVA version.
$java -version
If Java is already installed on your system, you get to see the following response −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, then Install Java before proceeding to
next step.
Step2: Verifying Scala Installation
You should have the Scala language installed to implement Spark. So let us verify the Scala installation
using the following command.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to next step for Scala
installation.
Step3: Downloading Scala
Download the latest version of Scala by visiting the following link: Download Scala. For this
tutorial, we are using scala-2.11.6 version. After downloading, you will find the Scala tar file in
the download folder.
Step4: Installing Scala
Follow the below given steps for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz

Move Scala software files


Use the following commands for moving the Scala software files, to respective
directory (/usr/local/scala).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH = $PATH:/usr/local/scala/bin
Verifying Scala Installation
After installation, it is better to verify it. Use the following command for verifying Scala
installation.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step5: Downloading Apache Spark
Download the latest version of Spark by visiting the following link Download Spark. For this
tutorial, we are using spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find
the Spark tar file in the download folder.
Step6: Installing Spark
Follow the steps given below for installing Spark.
Extracting Spark tar
The following command for extracting the spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving Spark software files
The following commands for moving the Spark software files to respective
directory (/usr/local/spark).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark
Add the following line to ~/.bashrc file. It means adding the location, where the spark software
file are located to the PATH variable.
export PATH = $PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Step7: Verifying the Spark Installation
Write the following command for opening Spark shell.
$spark-shell
If spark is installed successfully then you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to:
hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls
to: hadoop
disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service
'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM,
Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

Libraries:

 SQL and DataFrames


 Spark Streaming
 MLlib (machine learning)
 GraphX (graph)

 SQL and DataFrames

Integrated
Seamlessly mix SQL queries with Spark programs.
Spark SQL lets you query structured data inside Spark programs, using either SQL or a
familiar DataFrame API. Usable in Java, Scala, Python and R.
results = spark.sql(
"SELECT * FROM people")
names = results.map(lambda p: p.name)

Apply functions to results of SQL queries.

Uniform data access


Connect to any data source the same way.
DataFrames and SQL provide a common way to access a variety of data sources, including Hive,
Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
spark.read.json("s3n://...")
.registerTempTable("json")
results = spark.sql(
"""SELECT *
FROM people
JOIN json ...""")

Query and join different data sources.

Hive integration
Run SQL or HiveQL queries on existing warehouses.
Spark SQL supports the HiveQL syntax as well as Hive SerDes and UDFs, allowing you to
access existing Hive warehouses.

Spark SQL can use existing Hive metastores, SerDes, and UDFs.

Standard connectivity
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for business
intelligence tools.

Use your existing BI tools to query big data.

Performance & scalability

Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast. At the same time, it scales to thousands of nodes and multi hour queries using the
Spark engine, which provides full mid-query fault tolerance. Don't worry about using a different
engine for historical data.

Community

Spark SQL is developed as part of Apache Spark. It thus gets tested and updated with each Spark
release.
If you have questions about the system, ask on the Spark mailing lists.
The Spark SQL developers welcome contributions. If you'd like to help out, read how to
contribute to Spark, and send us a patch!

Getting started

To get started with Spark SQL:

 Download Spark. It includes Spark SQL as a module.


 Read the Spark SQL and DataFrame guide to learn the API.

 Spark Streaming:

Spark streaming makes it easy to build scalable fault-tolerant streaming applications.

Ease of use
Build applications through high-level operators.
Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting
you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))

Counting tweets on a sliding window

Fault tolerance
Stateful exactly-once semantics out of the box.
Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box,
without any extra code on your part.

Spark integration
Combine streaming with batch and interactive queries.
By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join
streams against historical data, or run ad-hoc queries on stream state. Build powerful interactive
applications, not just analytics.
stream.join(historicCounts).filter {
case (word, (curCount, oldCount)) =>
curCount > oldCount
}

Find words with higher frequency than historic data

Deployment options

Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also
define your own custom data sources.
You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster
resource managers. It also includes a local run mode for development. In production, Spark
Streaming uses ZooKeeper and HDFS for high availability.

Community

Spark Streaming is developed as part of Apache Spark. It thus gets tested and updated with each
Spark release.
If you have questions about the system, ask on the Spark mailing lists.
The Spark Streaming developers welcome contributions. If you'd like to help out, read how to
contribute to Spark, and send us a patch!

Getting started

To get started with Spark Streaming:


 Download Spark. It includes Streaming as a module.
 Read the Spark Streaming programming guide, which includes a tutorial and describes
system architecture, configuration and high availability.
 Check out example programs in Scala and Java.

 MLlib (machine learning)

MLlib is Apache Spark's scalable machine learning library.


Ease of use
Usable in Java, Scala, Python, and R.
MLlib fits into Spark's APIs and interoperates with NumPy in Python (as of Spark 0.9) and R
libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local
files), making it easy to plug into Hadoop workflows.
data = spark.read.format("libsvm")\
.load("hdfs://...")

model = KMeans(k=10).fit(data)

Calling MLlib in Python

Performance
High-quality algorithms, 100x faster than MapReduce.
Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care
about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration,
and can yield better results than the one-pass approximations sometimes used on MapReduce.

Logistic regression in Hadoop and Spark

Runs everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse
data sources.

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN,
on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache
Hive, and hundreds of other data sources.

Algorithms

MLlib contains many algorithms and utilities.


ML algorithms include:

 Classification: logistic regression, naive Bayes,...


 Regression: generalized linear regression, survival regression,...
 Decision trees, random forests, and gradient-boosted trees
 Recommendation: alternating least squares (ALS)
 Clustering: K-means, Gaussian mixtures (GMMs),...
 Topic modeling: latent Dirichlet allocation (LDA)
 Frequent itemsets, association rules, and sequential pattern mining

ML workflow utilities include:

 Feature transformations: standardization, normalization, hashing,...


 ML Pipeline construction
 Model evaluation and hyper-parameter tuning
 ML persistence: saving and loading models and Pipelines

Other utilities include:

 Distributed linear algebra: SVD, PCA,...


 Statistics: summary statistics, hypothesis testing,...

Refer to the MLlib guide for usage examples.

Community

MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each
Spark release.
If you have questions about the library, ask on the Spark mailing lists.
MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an
algorithm to MLlib, read how to contribute to Spark and send us a patch!

 GraphX (graph)

GraphX is Apache Spark's API for graphs and graph-parallel


computation.

Flexibility
Seamlessly work with both graphs and collections.

GraphX unifies ETL, exploratory analysis, and iterative graph computation within a
single system. You can view the same data as both graphs and
collections, transform and join graphs with RDDs efficiently, and write custom iterative graph
algorithms using the Pregel API.
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
(id, vertex, msg) => ...
}

Using GraphX in Scala

Speed
Comparable performance to the fastest specialized graph processing systems.
GraphX competes on performance with the fastest graph systems while retaining Spark's
flexibility, fault tolerance, and ease of use.

End-to-end PageRank performance (20 iterations, 3.7B edges)

Algorithms
Choose from a growing library of graph algorithms.
In addition to a highly flexible API, GraphX comes with a variety of graph algorithms, many of
which were contributed by our users.

 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count

Community

GraphX is developed as part of the Apache Spark project. It thus gets tested and updated with
each Spark release.
If you have questions about the library, ask on the Spark mailing lists.
GraphX is in the alpha stage and welcomes contributions. If you'd like to submit a change to
GraphX, read how to contribute to Spark and send us a patch!

Getting started

To get started with GraphX:

 Download Spark. GraphX is included as a module.


 Read the GraphX guide, which includes usage examples.
 Learn how to deploy Spark on a cluster if you'd like to run in distributed mode. You can
also run locally on a multicore machine without any setup.

Features:

Spark introduces a programming module for structured data processing


called Spark SQL. It provides a programming abstraction called DataFrame
and can act as distributed SQL query engine.

Features of Spark SQL


The following are the features of Spark SQL −
 Integrated − Seamlessly mix SQL queries with Spark programs. Spark
SQL lets you query structured data as a distributed dataset (RDD) in
Spark, with integrated APIs in Python, Scala and Java. This tight
integration makes it easy to run SQL queries alongside complex
analytic algorithms.
 Unified Data Access − Load and query data from a variety of sources.
Schema-RDDs provide a single interface for efficiently working with
structured data, including Apache Hive tables, parquet files and JSON
files.
 Hive Compatibility − Run unmodified Hive queries on existing
warehouses. Spark SQL reuses the Hive frontend and MetaStore,
giving you full compatibility with existing Hive data, queries, and UDFs.
Simply install it alongside Hive.
 Standard Connectivity − Connect through JDBC or ODBC. Spark SQL
includes a server mode with industry standard JDBC and ODBC
connectivity.
 Scalability − Use the same engine for both interactive and long
queries. Spark SQL takes advantage of the RDD model to support mid-

query fault tolerance, letting it scale to large jobs too. Do not worry
about using a different engine for historical data.

Spark SQL Architecture


The following illustration explains the architecture of Spark SQL −

This architecture contains three layers namely, Language API, Schema RDD,
and Data Sources.
 Language API − Spark is compatible with different languages and
Spark SQL. It is also, supported by these languages- API (python,
scala, java, HiveQL).
 Schema RDD − Spark Core is designed with special data structure
called RDD. Generally, Spark SQL works on schemas, tables, and
records. Therefore, we can use the Schema RDD as temporary table.
We can call this Schema RDD as Data Frame.
 Data Sources − Usually the Data source for spark-core is a text file,
Avro file, etc. However, the Data Sources for Spark SQL is different.
Those are Parquet file, JSON document, HIVE tables, and Cassandra
database.
We will discuss more about these in the subsequent chapters.

Spark SQL - DataFrames


A DataFrame is a distributed collection of data, which is organized into
named columns. Conceptually, it is equivalent to relational tables with good
optimization techniques.
A DataFrame can be constructed from an array of different sources such as
Hive tables, Structured Data files, external databases, or existing RDDs. This
API was designed for modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas in Python.
Features of DataFrame
Here is a set of few characteristic features of DataFrame −
 Ability to process the data in the size of Kilobytes to Petabytes on a
single node cluster to large cluster.
 Supports different data formats (Avro, csv, elastic search, and
Cassandra) and storage systems (HDFS, HIVE tables, mysql, etc).
 State of art optimization and code generation through the Spark SQL
Catalyst optimizer (tree transformation framework).
 Can be easily integrated with all Big Data tools and frameworks via
Spark-Core.
 Provides API for Python, Java, Scala, and R Programming.

SQLContext
SQLContext is a class and is used for initializing the functionalities of Spark
SQL. SparkContext class object (sc) is required for initializing SQLContext
class object.
The following command is used for initializing the SparkContext through
spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name sc when the
spark-shell starts.
Use the following command to create SQLContext.
scala> val sqlcontext = new
org.apache.spark.sql.SQLContext(sc)

Example
Let us consider an example of employee records in a JSON file
named employee.json. Use the following commands to create a DataFrame
(df) and read a JSON document named employee.json with the following
content.
employee.json − Place this file in the directory where the
current scala> pointer is located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}

DataFrame Operations
DataFrame provides a domain-specific language for structured data
manipulation. Here, we include some basic examples of structured data
processing using DataFrames.
Follow the steps given below to perform DataFrame operations −

Read the JSON Document


First, we have to read the JSON document. Based on this, generate a
DataFrame named (dfs).
Use the following command to read the JSON document
named employee.json. The data is shown as a table with the fields − id,
name, and age.
scala> val dfs = sqlContext.read.json("employee.json")
Output − The field names are taken automatically from employee.json.
dfs: org.apache.spark.sql.DataFrame = [age: string, id:
string, name: string]

Show the Data


If you want to see the data in the DataFrame, then use the following
command.
scala> dfs.show()
Output − You can see the employee data in a tabular format.
<console>:22, took 0.052610 s
+----+------+--------+
|age | id | name |
+----+------+--------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
| 23 | 1204 | javed |
| 23 | 1205 | prudvi |
+----+------+--------+

Use printSchema Method


If you want to see the Structure (Schema) of the DataFrame, then use the
following command.
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)

|-- id: string (nullable = true)
|-- name: string (nullable = true)

Use Select Method


Use the following command to fetch name-column among three columns
from the DataFrame.
scala> dfs.select("name").show()
Output − You can see the values of the name column.
<console>:22, took 0.044023 s
+--------+
| name |
+--------+
| satish |
| krishna|
| amith |
| javed |
| prudvi |
+--------+

Use Age Filter


Use the following command for finding the employees whose age is greater
than 23 (age > 23).
scala> dfs.filter(dfs("age") > 23).show()
Output
<console>:22, took 0.078670 s
+----+------+--------+
|age | id | name |
+----+------+--------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
+----+------+--------+

Use groupBy Method


Use the following command for counting the number of employees who are of
the same age.
scala> dfs.groupBy("age").count().show()
Output − two employees are having age 23.
<console>:22, took 5.196091 s
+----+-----+
|age |count|
+----+-----+
| 23 | 2 |
| 25 | 1 |
| 28 | 1 |
| 39 | 1 |
+----+-----+
Running SQL Queries Programmatically
An SQLContext enables applications to run SQL queries programmatically
while running SQL functions and returns the result as a DataFrame.
Generally, in the background, SparkSQL supports two different methods for
converting existing RDDs into DataFrames −

Sr. No   Methods & Description

1. Inferring the Schema using Reflection
   This method uses reflection to generate the schema of an RDD that
   contains specific types of objects.

2. Programmatically Specifying the Schema
   The second method for creating a DataFrame is through a programmatic
   interface that allows you to construct a schema and then apply it to
   an existing RDD.
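Minimal sketches of both methods (assuming a SparkSession named spark and hypothetical comma-separated "name,age" records):

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import spark.implicits._

// Method 1: infer the schema using reflection (case class + toDF)
case class Person(name: String, age: Int)
val peopleDF = spark.sparkContext
  .parallelize(Seq("satish,25", "krishna,28"))
  .map(_.split(","))
  .map(attrs => Person(attrs(0), attrs(1).trim.toInt))
  .toDF()

// Method 2: programmatically specify the schema (StructType + Row)
val schema = StructType(Seq(StructField("name", StringType, true), StructField("age", StringType, true)))
val rowRDD = spark.sparkContext
  .parallelize(Seq("satish,25", "krishna,28"))
  .map(_.split(","))
  .map(attrs => Row(attrs(0), attrs(1).trim))
val peopleDF2 = spark.createDataFrame(rowRDD, schema)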

Spark SQL - Data Sources


A DataFrame interface allows different DataSources to work on Spark SQL. It
is a temporary table and can be operated as a normal RDD. Registering a
DataFrame as a table allows you to run SQL queries over its data.
In this chapter, we will describe the general methods for loading and saving
data using different Spark DataSources. Thereafter, we will discuss in detail
the specific options that are available for the built-in data sources.
There are different types of data sources available in SparkSQL, some of
which are listed below −

Sr. No   Data Sources

1. JSON Datasets
   Spark SQL can automatically capture the schema of a JSON dataset
   and load it as a DataFrame.

2. Hive Tables
   Hive comes bundled with the Spark library as HiveContext, which
   inherits from SQLContext.

3. Parquet Files
   Parquet is a columnar format, supported by many data processing
   systems.
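For example, a short sketch of writing a DataFrame out as a Parquet file and reading it back (assuming the SparkSession spark and the employee.json file used in the earlier examples; the output path is hypothetical):

val employeeDF = spark.read.json("employee.json")
employeeDF.write.parquet("employee.parquet")
val parquetDF = spark.read.parquet("employee.parquet")
parquetDF.createOrReplaceTempView("parquetEmployee")
spark.sql("SELECT name, age FROM parquetEmployee").show()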

Querying Using Spark SQL

We will now start querying using Spark SQL. Note that the actual SQL queries are similar to the
ones used in popular SQL clients.

Starting the Spark Shell: go to the Spark directory and execute ./bin/spark-shell in the terminal to
begin the Spark Shell.

For the querying examples shown here, we will be using two files, ’employee.txt’ and
’employee.json’. Both these files are stored
at ‘examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala’ inside the
folder containing the Spark installation (~/Downloads/spark-2.0.2-bin-hadoop2.7). So, all of you
who are executing the queries, place them in this directory or set the path to your files in the lines
of code below.

Code explanation:
1. We first import a Spark Session into Apache Spark.
2. Creating a Spark Session ‘spark’ using the ‘builder()’ function.
3. Importing the Implicits class into our ‘spark’ Session.
4. We now create a DataFrame ‘df’ and import data from the ’employee.json’
file.
5. Displaying the DataFrame ‘df’. The result is a table of 5 rows of ages and
names from our ’employee.json’ file.

1 import org.apache.spark.sql.SparkSession
2 val spark = SparkSession.builder().appName("Spark SQL basic
example").config("spark.some.config.option", "some-value").getOrCreate()
3 import spark.implicits._
4 val df = spark.read.json("examples/src/main/resources/employee.json")
5 df.show()

Code explanation:
1. Importing the Implicits class into our ‘spark’ Session.
2. Printing the schema of our ‘df’ DataFrame.
3. Displaying the names of all our records from ‘df’ DataFrame.

1 import spark.implicits._
2 df.printSchema()
3 df.select("name").show()

Code explanation:
1. Displaying the DataFrame after incrementing everyone’s age by two years.
2. We filter all the employees above age 30 and display the result.
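A sketch of the two operations described above (assuming the ‘df’ DataFrame created earlier; these lines mirror the standard Spark SQL example):

1 df.select($"name", $"age" + 2).show()
2 df.filter($"age" > 30).show()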

Code explanation:
1. Counting the number of people with the same ages. We use the ‘groupBy’
function for the same.
2. Creating a temporary view ’employee’ of our ‘df’ DataFrame.
3. Perform a ‘select’ operation on our ’employee’ view to display the table into
‘sqlDF’.
4. Displaying the results of ‘sqlDF’.

1 df.groupBy("age").count().show()
2 df.createOrReplaceTempView("employee")
3 val sqlDF = spark.sql("SELECT * FROM employee")
4 sqlDF.show()

Creating Datasets
After understanding DataFrames, let us now move on to Dataset API. The
below code creates a Dataset class in SparkSQL.

Code explanation:
1. Creating a class ‘Employee’ to store name and age of an employee.
2. Assigning a Dataset ‘caseClassDS’ to store the record of Andrew.
3. Displaying the Dataset ‘caseClassDS’.
4. Creating a primitive Dataset to demonstrate mapping of DataFrames into
Datasets.
5. Assigning the above sequence into an array.

1 case class Employee(name: String, age: Long)

2 val caseClassDS = Seq(Employee("Andrew", 55)).toDS()

3 caseClassDS.show()

4 val primitiveDS = Seq(1, 2, 3).toDS()

5 primitiveDS.map(_ + 1).collect()

Code explanation:
1. Setting the path to our JSON file ’employee.json’.
2. Creating a Dataset ‘employeeDS’ from the file.
3. Displaying the contents of ’employeeDS’ Dataset.

1 val path = "examples/src/main/resources/employee.json"


2 val employeeDS = spark.read.json(path).as[Employee]
3 employeeDS.show()

**********
THE END

