Big Data Analytics
BY
K.ISSACK BABU MCA
Assistant Professor
Characteristics of Big Data (or) Why is Big Data different from any other data?
There are "Five V's" that characterize this data: Volume, Velocity, Variety, Veracity, and Validity.
1. Volume (Data Quantity):
Most organizations were already struggling with the increasing size of their databases
as the Big Data tsunami hit the data stores.
2. Velocity (Data Speed):
There are two aspects to velocity: one is the throughput of data and the other is latency.
a. Throughput represents the amount of data moving through the pipes.
b. Latency is the other measure of velocity. Analytics used to be a "store and report" environment, where reporting typically contained data as of yesterday, popularly represented as "D-1."
3. Variety (Data Types):
The source data includes unstructured text, sound, and video in addition to structured
data. A number of applications are gathering data from emails, documents, or blogs.
4. Veracity (Data Quality):
Veracity represents both the credibility of the data source as well as the suitability of
the data for the target audience.
5. Validity (Data Correctness):
Validity refers to whether the data is correct and accurate for its intended use. Clearly, valid data is key to making the right decisions.
As per IBM, Big Data is characterized by three Vs (Volume, Velocity, and Variety), as described in the following figure:
Finally, not only do we need to perform analysis on the longevity of the logs to determine
trends and patterns and to find failures, but also we need to ensure the analysis is done on all
the data.
Log analytics is actually a pattern that IBM established after working with a number
of companies, including some large financial services sector (FSS) companies. This use case
comes up with quite a few customers; for that reason, this pattern is called IT for IT. If
we are new to this usage pattern and wondering just who is interested in IT for IT Big Data
solutions, we should know that this is an internal use case within an organization itself. An
internal IT for IT implementation is well suited for any organization with a large data center
footprint, especially if it is relatively complex (for example, one built on a service-oriented architecture).
5. The Social Media Pattern: Perhaps the most talked about Big Data usage pattern is social
media and customer sentiment. More specifically, we can determine how sentiment is
impacting sales, the effectiveness or receptiveness of marketing campaigns, the accuracy of
marketing mix (product, price, promotion, and placement), and so on.
Social media analytics is a pretty hot topic, so hot in fact that IBM has built a solution
specifically to accelerate our use of it: Cognos Consumer Insights (CCI). CCI can tell what
people are saying, how topics are trending in social media, and all sorts of things that affect
the business, all packed into a rich visualization engine.
6. The Call Center Mantra: “This Call May Be Recorded for Quality Assurance
Purposes”: It seems that when we want our call with a customer service representative
(CSR) to be recorded for quality assurance purposes, the "may" part never works in
our favor. The challenge of call center efficiencies is somewhat similar to the fraud detection
pattern.
Call centers of all kinds want to find better ways to process information to address
what's going on in the business with lower latency. This is a really interesting Big Data use
case, because it uses both analytics-in-motion and analytics-at-rest. Using in-motion analytics
(Streams) means that we basically build our models and find out what's interesting based on the data as it flows through the system.
8. Big Data and the Energy Sector: The energy sector provides many Big Data use case
challenges in how to deal with the massive volumes of sensor data from remote installations.
Many companies are using only a fraction of the data being collected, because they lack the
infrastructure to store or analyze the available scale of data.
Vestas is primarily engaged in the development, manufacturing, sale, and
maintenance of power systems that use wind energy to generate electricity through its wind
turbines. Its product range includes land and offshore wind turbines. At the time this book was written, it had more than 43,000 wind turbines in 65 countries on 5 continents. Vestas used the IBM BigInsights platform to realize its vision of generating clean energy.
Data in the Warehouse and Data in Hadoop:
Traditional warehouses are mostly ideal for analyzing structured data from various
systems and producing insights with known and relatively stable measurements. On the other
hand, a Hadoop-based platform is well suited to deal with semi-structured and unstructured
data, as well as when a data discovery process is needed.
The authors could say that data warehouse data is trusted enough to be "public,"
while Hadoop data isn't as trusted ("public" here can mean widely distributed within the company
but not for external consumption); although this will likely change in the future, that is how things stand today.
Hadoop: Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:
Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions (errors). It can gracefully handle most such failures.
Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple—Hadoop allows users to quickly write efficient parallel code.
The following figure illustrates how one interacts with a Hadoop cluster. As we can see, a Hadoop cluster is a set of commodity machines networked together in one location. Data storage and processing all occur within this "cloud" of machines. Different users can submit computing "jobs" to Hadoop from individual clients, which can be their own desktop machines in remote locations from the Hadoop cluster.
When the set of documents is small, a straightforward program will do the job. The pseudo-code is sketched below.
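A minimal rendering of that pseudo-code in Scala (the sample documents and the whitespace tokenizer are assumptions for illustration):

val documentSet = Seq("the quick brown fox", "the lazy brown dog")
val wordCount = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
for (document <- documentSet; token <- document.split("\\s+"))   // tokenize each document
  wordCount(token) += 1                                          // increment the multiset entry
wordCount.foreach { case (word, count) => println(s"$word $count") }   // the display() step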
The program loops through all the documents. For each document, the words are
extracted one by one using a tokenization process. For each word, its corresponding entry in a
multiset called wordCount is incremented by one. At the end, a display() function prints out all
the entries in wordCount.
The above code works fine until the set of documents we want to process becomes
large. When it is large, we can speed it up by rewriting the program so that it distributes the work over
several machines. Each machine will process a distinct fraction of the documents. When all the
machines have completed this, a second phase of processing will combine the results of
all the machines. The pseudo-code for the first phase, to be distributed over many machines, is sketched below, together with the combining second phase.
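A hedged Scala sketch of the two-phase idea (here the "machines" are simulated by splitting the document set into groups; the sample data is made up):

val documentSet = Seq("the quick brown fox", "the lazy dog", "the fox")
val partitions = documentSet.grouped(2).toSeq            // stand-in for distributing documents to machines

// Phase one: each partition produces its own partial wordCount
val partialCounts = partitions.map { docs =>
  docs.flatMap(_.split("\\s+")).groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
}

// Phase two: combine the partial results from all partitions
val totalCount = partialCounts.flatten
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => word -> pairs.map(_._2).sum }

totalCount.foreach { case (word, count) => println(s"$word $count") }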
This word counting program is getting complicated. To make it work across a cluster of
distributed machines, we need to add a number of functionalities:
a. Store files over many processing machines (of phase one).
b. Write a disk-based hash table permitting processing without being limited by RAM capacity.
c. Partition the intermediate data (that is, wordCount) from phase one.
d. Shuffle the partitions to the appropriate machines in phase two.
Scaling the same program in MapReduce:
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data processing function, and these functions are called mapper and
reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper. In the reducing phase, the reducer processes all the outputs from
the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and
transform the input into something that the reducer can aggregate over.
The MapReduce framework was designed for writing scalable, distributed programs.
This two-phase design pattern is used in scaling many programs, and became the basis of the
framework. Partitioning and shuffling are common design patterns along with mapping and
reducing. The MapReduce framework provides a default implementation that works in most
situations. MapReduce uses lists and (key/value) pairs as its main data primitives. The keys and
values are often integers or strings but can also be dummy values to be ignored or complex
object types. The map and reduce functions must obey the following constraint on the types of keys and values:
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)
In the Map Reduce framework we write applications by specifying the mapper and
reducer. The following steps explain the complete data flow:
1. The input to the application must be structured as a list of (key/value) pairs, list(<k1, v1>). The input format for processing multiple files is usually list(<String filename, String file_content>).
2. The list of (key/value) pairs is broken up, and each individual (key/value) pair, <k1, v1>, is processed by calling the map function of the mapper. The mapper transforms each <k1, v1> pair into a list of <k2, v2> pairs. For word counting, the mapper can either emit one pair per word occurrence or pre-aggregate counts within a document; that is, in the output list we can have the (key/value) pair <"foo", 3> once or we can have the pair <"foo", 1> three times.
3. The output of all the mappers is (conceptually) aggregated into one giant list of <k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value) pair, <k2, list(v2)>. The framework asks the reducer to process each one of these aggregated (key/value) pairs individually. For example, the map output for one document may be a list with the pair <"foo", 1> three times, and the map output for another document may be a list with the pair <"foo", 1> twice. The aggregated pair the reducer will see is <"foo", list(1,1,1,1,1)>. In word counting, the output of our reducer for this pair is <"foo", 5>.
Running Hadoop: Linux is the primary platform for running Hadoop, though a Windows box is a supported development platform as well. For a Windows box, we'll need to install Cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.
To run Hadoop requires Java (version 1.6 or higher). Mac users should get it from
Apple. We can download the latest JDK for other operating systems from Sun at
http://java.sun.com/javase/downloads/index.jsp (or) www.oracle.com. Install it and remember
the root of the Java installation, which we’ll need later.
To install Hadoop, first get the latest version release at
http://hadoop.apache.org/core/releases.html. After we unpack the distribution, edit the script
“conf/hadoop-env.sh” to set JAVA_HOME to the root of the Java installation we have
remembered from earlier. For example, in Mac OS X, we’ll replace this line
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
with the following line
export JAVA_HOME=/Library/Java/Home
We’ll be using the Hadoop script quite often. Run the following command:
bin/hadoop
We only need to know that the command to run a (Java) Hadoop program is bin/hadoop
jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files
for execution. The following command shows about a dozen example programs prepackaged
with Hadoop:
bin/hadoop jar hadoop-*-examples.jar
One of these example programs is “wordcount”. Before running it, it helps to understand the building blocks of Hadoop, the daemons that make up a running cluster:
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
1. NameNode: The distributed storage system is called the Hadoop File System, or HDFS.
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform
the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how the
files are broken down into file blocks, which nodes store those blocks, and the overall health
of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server
hosting the NameNode typically doesn’t store any user data or perform any computations for a
MapReduce program to lower the workload on the machine. This means that the NameNode
server doesn’t double as a DataNode or a TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode—it’s a
single point of failure of the Hadoop cluster. For any of the other daemons, if their host nodes
fail for software or hardware reasons, the Hadoop cluster will likely continue to function
smoothly, or we can quickly restart it. Not so for the NameNode.
2. DataNode: Each slave machine in the cluster will host a DataNode daemon to perform the
grunt work of the distributed filesystem—reading and writing HDFS blocks to actual files on
the local filesystem. When we want to read or write a HDFS file, the file is broken into
blocks and the NameNode will tell the client which DataNode each block resides in. The
client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks. Furthermore, a DataNode may communicate with other
DataNodes to replicate its data blocks for redundancy. The following figure illustrates the
roles of NameNode and DataNodes.
The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file
consists of blocks 4 and 5. The content of the files are distributed among the DataNodes. In this
illustration, each block has three replicas. For example, block 1 (used for data1) is replicated
over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes
inaccessible over the network, we’ll still be able to read the files.
DataNodes constantly report to the NameNode. Each of the DataNodes
informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the
DataNodes continually poll the NameNode to provide information regarding local changes as
well as receive instructions to create, move, or delete blocks from the local disk.
3. Secondary Name Node: The Secondary Name Node (SNN) is an assistant daemon for
monitoring the state of the cluster's HDFS. Like the Name Node, each cluster has one SNN,
and it typically resides on its own machine as well. No other Data Node or Task Tracker
daemons run on the same server. The SNN differs from the Name Node in that this process
doesn’t receive or record any real-time changes to HDFS. Instead, it communicates with the
Name Node to take snapshots of the HDFS metadata at intervals defined by the cluster
configuration.
The Name Node is a single point of failure for a Hadoop cluster, and the SNN snapshots
help minimize the downtime and loss of data. However, a Name Node failure requires human
intervention to reconfigure the cluster to use the SNN as the primary Name Node.
4. Job Tracker: There is only one Job Tracker daemon per Hadoop cluster. It’s typically run
on a server as a master node of the cluster. The Job Tracker daemon is the link between our
application and Hadoop. Once we submit our code to the cluster, the Job Tracker determines
the execution plan by determining which files to process, assigns nodes to different tasks, and
monitors all tasks as they’re running. If a task fails, the Job Tracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
5. Task Tracker: The Job Tracker is the master control for overall execution of a Map
Reduce job and the Task Trackers manage the execution of individual tasks on each slave
node. The interaction between Job Tracker and Task Tracker is shown in the following
diagram.
Each Task Tracker is responsible for executing the individual tasks that the Job Tracker
assigns. Although there is a single Task Tracker per slave node, each Task Tracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
The topology of a typical Hadoop cluster is described in the following figure:
Real time analytics lets users see, analyze and understand data as it arrives in a system. Logic
and mathematics are applied to the data so it can give users insights for making real-time
decisions.
Real-time analytics allows businesses to get insights and act on data immediately or
soon after the data enters their system.
Real time app analytics answer queries within seconds. They handle large amounts of data with
high velocity and low response times. For example, real-time big data analytics uses data
in financial databases to inform trading decisions.
Analytics can be on-demand or continuous. On-demand analytics delivers results when the user requests them.
Continuous analytics updates users as events happen and can be programmed to respond automatically to
certain events. For example, real-time web analytics might update an administrator if page load
performance goes out of preset parameters.
Typical examples include:
Viewing orders as they happen for better tracking and to identify trends.
Continually updated customer activity like page views and shopping cart use to
understand user behavior.
Targeting customers with promotions as they shop for items in a store, influencing real-time decisions.
Real-time data analytics tools can either push or pull data. Streaming requires the ability to push
massive amounts of fast-moving data. When streaming takes too many resources and isn’t
practical, data can be pulled at intervals that can range from seconds to hours. The pull can be scheduled between business processes that need computing resources, so that it does not disrupt operations.
The response times for real time analytics can vary from nearly instantaneous to a few seconds or
minutes. Some products like a self-driving car need to respond to new information within
milliseconds. But other products like an oil drill or windmill can get by with a minute between
updates. Several minutes might be enough for a bank looking at credit scores of loan applicants.
Aggregator — Compiles real time streaming data analytics from many different data
sources.
Broker — Makes data in real time available for use.
Analytics Engine — Correlates values and blends data streams together while analyzing
the data.
Stream Processor — Executes real time app analytics and logic by receiving and sending
data streams.
Real-time analytics can be deployed at the edge, which analyzes data at the point closest to its
arrival. Several other technologies also help make real-time analytics possible.
Speed is the main benefit of real time data analytics. The less time a business must wait to access
data between the time it arrives and is processed, the faster a business can use data insights to
make changes and act on critical decisions. For instance, analyzing monitoring data from a
manufacturing line would enable early intervention before machinery malfunctions.
Similarly, real-time data analytics tools let companies see how users interact with a product upon
release, which means there is no delay in understanding user behavior for making needed
adjustments.
Managing location data — Helps determine relevant data sets for a geographic location
and what should be updated for optimal location intelligence.
Detecting anomalies — Identifies the statistical outliers caused by security breaches and
technological failures.
Better marketing — Finds insights in demographics and customer behavior to improve
effectiveness of advertising and marketing campaigns. Helps determine best pricing
strategies and audience targeting.
Examples
Marketing campaigns: When running a marketing campaign, most people rely on A/B tests.
With the ability to access data instantly, you can adjust campaign parameters to boost success.
For example, if you run an ad campaign and retrieve data in real-time of people clicking and
converting, then you can adjust your message and parameters to target that audience directly.
Financial trading: Financial institutions need to make buy and sell decisions in milliseconds.
With analytics provided in real-time, traders can take advantage of information from financial
databases, news sources, social media, weather reports and more to have a wide angle
perspective on the market in real-time. This broad picture helps to make smart trading decisions.
Financial operations: Financial teams are experiencing a transformation by which they not only
are responsible for back-office procedures, but they also add value to the organisation by
providing strategic insights. The production of financial statements must be accurate to help
inform the best decisions for the business. Analytics in real-time helps to spot errors and can aid
in reducing operational risks. The software’s ability to match records (i.e. account
reconciliation), store data securely (in a centralised system) and transform raw data into insights
(real-time analytics) makes all the difference in a team’s ability to remain accurate, agile and
ahead of the curve.
Credit scoring: Any financial provider understands the value of credit scores. With real-time
analysis, institutions can approve or deny loans immediately.
Healthcare: Wearable devices are an example of real-time analytics which can track a human’s
health statistics. For example, real-time data provides information like a person’s heartbeat, and
these immediate updates can be used to save lives and even predict ailments in advance.
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Spark Framework and become a Spark Developer. In addition, it would be
useful for Analytics Professionals and ETL developers as well.
Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to Scala
programming, database concepts, and any of the Linux operating system flavors.
Why Spark when Hadoop is there:
Industries are using Hadoop extensively to analyze their data sets. The reason is that Hadoop
framework is based on a simple programming model (MapReduce) and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to
maintain speed in processing large datasets in terms of waiting time between queries and
waiting time to run the program.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its
own cluster management computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing speed
of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Components of Spark
The following illustration depicts the different components of Spark.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
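A minimal Spark Streaming sketch in Scala, assuming a text stream on localhost port 9999 (for example, one started with nc -lk 9999) and a 5-second batch interval; the application name and master setting are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))        // each mini-batch covers 5 seconds of data

    val lines = ssc.socketTextStream("localhost", 9999)     // ingest the stream in mini-batches
    val counts = lines.flatMap(_.split(" "))                 // RDD-style transformations per mini-batch
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()                                              // start receiving and processing
    ssc.awaitTermination()
  }
}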
Spark Features
1. Objective
Apache Spark, being an open-source framework for Big Data, has various advantages over other
big data solutions: it is dynamic in nature, it supports in-memory computation of RDDs, and it
provides reusability, fault tolerance, real-time stream processing, and much more. In this tutorial
on the features of Apache Spark, we will discuss various advantages of Spark that answer the
questions: Why should we learn Apache Spark? Why is Spark better than Hadoop MapReduce,
and why is Spark called the 3G of Big Data?
We can reuse Spark code for batch processing, join streams against
historical data, or run ad-hoc queries on stream state.
e. Fault Tolerance in Spark
Apache Spark provides fault tolerance through its RDD abstraction. Spark RDDs are
designed to handle the failure of any worker node in the cluster, ensuring that the loss of
data is reduced to zero.
Spark supports multiple languages such as Java, R, Scala, and Python. Thus, it provides
flexibility and overcomes the limitation of Hadoop MapReduce, which can build applications only in Java.
Spark comes with dedicated tools for streaming data, interactive/declarative queries, and machine
learning, which add to map and reduce.
l. Spark GraphX
Spark has GraphX, a component for graphs and graph-parallel computation. It simplifies
graph analytics tasks through its collection of graph algorithms and builders.
m. Cost Efficient
Apache Spark is a cost-effective solution for Big Data problems, whereas Hadoop requires a large
amount of storage and a large data center during replication.
4. Conclusion
In conclusion, Apache Spark is the most advanced and popular product of the Apache community:
it can work with streaming data, offers various machine learning libraries, can work on structured
and unstructured data, can deal with graphs, and more. After learning these features, it becomes clear why Spark is called the 3G of Big Data.
Advantages of Spark
2. Submit a job.
Most of the development process happens on interactive clients, but when we have to put our
application into production, we use the submit-a-job approach (for example, spark-submit).
1. Analyzing
2. Distributing
3. Monitoring
4. Scheduling
Modes of execution
Spark Core
Now we look at some core APIs Spark provides. Spark needs a data structure to hold the data.
We have three alternatives: RDD, DataFrame, and Dataset. Since Spark 2.0, it is recommended
to use only Dataset and DataFrame; these two internally compile to RDDs.
lit: Creates a column of a literal value. Can be used for comparisons with other
columns.
geq (greater than or equal to), leq (less than or equal to), gt (greater than), lt (less than):
used for comparisons with other column values. For example:
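A hedged sketch of these column functions; the employees DataFrame and its columns (name, salary) are made up for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("ColumnFunctions").master("local[*]").getOrCreate()
import spark.implicits._

val employees = Seq(("Ravi", 45000), ("Anil", 60000), ("Sita", 75000)).toDF("name", "salary")

val banded   = employees.withColumn("band", lit("fixed"))                       // lit: column of a literal value
val wellPaid = employees.filter($"salary".geq(lit(60000)))                      // geq: greater than or equal to
val midRange = employees.filter($"salary".gt(40000).and($"salary".lt(70000)))   // gt and lt

wellPaid.show()
midRange.show()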
2) join
Spark lets us join datasets in various ways. We will try to explain with a sample example:
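A hedged sketch of joining two datasets; the employee and department data, the column names, and the join types shown are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinExample").master("local[*]").getOrCreate()
import spark.implicits._

val employees   = Seq((1, "Ravi", 10), (2, "Anil", 20), (3, "Sita", 30)).toDF("id", "name", "dept_id")
val departments = Seq((10, "Sales"), (20, "Engineering")).toDF("dept_id", "dept_name")

val inner = employees.join(departments, Seq("dept_id"))                 // inner join on the common column
val left  = employees.join(departments, Seq("dept_id"), "left_outer")   // keep employees without a department

inner.show()
left.show()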
3) union
Spark union function lets us have a union between two datasets. The datasets should
be of the same schema.
4) window
Spark gives APIs for tumbling windows, hopping windows, sliding windows, and delayed windows.
We use them for ranking, sums, plain old windowing, and so on. A sketch of a ranking use case follows:
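A hedged sketch of a ranking window over made-up sales data (region, product, and amount are assumed columns):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, sum}

val spark = SparkSession.builder().appName("WindowExample").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("south", "A", 100), ("south", "B", 250), ("north", "C", 180), ("north", "D", 90))
  .toDF("region", "product", "amount")

val byRegion = Window.partitionBy("region").orderBy($"amount".desc)

val ranked = sales
  .withColumn("rank", row_number().over(byRegion))                                  // ranking within each region
  .withColumn("region_total", sum($"amount").over(Window.partitionBy("region")))    // windowed sum

ranked.show()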
Other functions such as lag, lead, and many more, let you do other operations that
enable you to do sophisticated analytics over datasets.
However, if you still need to perform much more complex operations over datasets,
you can use UDFs. A sample usage of a UDF is shown below:
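A hedged sketch of a UDF that normalizes product names to upper case; the products DataFrame and its columns are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("UdfExample").master("local[*]").getOrCreate()
import spark.implicits._

val products = Seq(("laptop", 55000), ("mobile", 20000)).toDF("name", "price")

val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)   // the user-defined function

products.withColumn("name_upper", toUpper($"name")).show()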
Note: Using UDFs should be a last resort, since they are not optimized by Spark and
might take longer to execute. It is advisable to prefer native Spark functions over UDFs.
This is just the tip of the iceberg for Apache Spark. Its utilities expand in various
domains, not limited to data analytics. Watch this space for more.
Spark Eco System
In this Spark Ecosystem tutorial, we will discuss about core ecosystem components of Apache
Spark like Spark SQL, Spark Streaming, Spark Machine learning (MLlib), Spark GraphX, and
Spark R.
Apache Spark Ecosystem has extensible APIs in different languages like Scala, Python, Java,
and R built on top of the core Spark execution engine.
Apache Spark is the most popular big data tool, also considered a next-generation tool; it is
used by hundreds of organizations, has thousands of contributors, and is still emerging and
gaining popularity as the standard big data execution engine.
Ecommerce companies like Alibaba, social networking companies like Tencent, and the
Chinese search engine Baidu all run Apache Spark operations at scale. Here are a
few features that are responsible for its popularity.
1. Fast Processing Speed: The first and foremost advantage of using Apache
Spark for your big data is that it runs workloads up to 100x faster in memory and 10x faster
on disk than Hadoop MapReduce.
2. Supports a variety of programming languages: Spark applications can be
implemented in a variety of languages like Scala, R, Python, Java, and Clojure.
This makes it easy for developers to work according to their preferences.
3. Powerful Libraries: It contains more than just map and reduce functions. It
contains libraries for SQL and DataFrames, MLlib (for machine learning), GraphX,
and Spark Streaming, which offer powerful tools for data analytics.
4. Near real-time processing: Spark can process data stored in Hadoop, and it also has Spark Streaming, which can handle data in
real-time.
5. Compatibility: Spark can run on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud. It can access diverse data sources.
Now that you are aware of its exciting features, let us explore Spark Architecture to
realize what makes it so special. This article is a single-stop resource that gives
a Spark architecture overview with the help of a Spark architecture diagram and is a
good beginner's resource for people looking to learn Spark.
A DAG is a sequence of computations performed on data, where each node is an RDD partition and
each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop
MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
A spark cluster has a single Master and any number of Slaves/Workers. The driver and the
executors run their individual Java processes and users can run them on the same horizontal
spark cluster or on separate machines i.e. in a vertical spark cluster or in mixed machine
configuration.
For classic Hadoop platforms, it is true that handling complex assignments requires developers to
link together a series of MapReduce jobs and run them in a sequential manner. Here, each job
has a high latency. The job output data between each step has to be saved in the HDFS before
other processes can start. The advantage of having DAG and RDD is that they replace the disk
IO with in-memory operations and support in-memory data sharing across DAGs, so that
different jobs can be performed with the same data allowing complicated workflows.
It is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver
program runs the main() function of the application and is the place where the SparkContext and
RDDs are created, and where transformations and actions are performed. The Spark Driver
contains various components – DAG Scheduler, Task Scheduler, Backend Scheduler, and Block
Manager – which are responsible for translating Spark user code into actual Spark jobs
executed on the cluster.
Spark Driver performs two main tasks: Converting user programs into tasks and planning the
execution of tasks by executors. A detailed description of its tasks is as follows:
The driver program that runs on the master node of the spark cluster schedules the job
execution and negotiates with the cluster manager.
It translates the RDDs into the execution graph and splits the graph into multiple stages.
The driver stores the metadata about all the Resilient Distributed Datasets and their
partitions.
Cockpit of jobs and tasks execution – the driver program converts a user application into
smaller execution units known as tasks. Tasks are then executed by the executors, i.e., the
worker processes which run individual tasks.
After the task has been completed, all the executors submit their results to the Driver.
Driver exposes the information about the running spark application through a Web UI at
port 4040.
RDD, DataFrame, and Dataset are the three most common data structures in Spark, and they
make processing very large data easy and convenient. Because of Spark's lazy evaluation,
these data structures are not executed right away during creation, transformation, and so on.
Only when they encounter actions will they start the traversal operation.
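A short sketch of this lazy behavior in Scala (the numbers and the local master setting are only for illustration): transformations are merely recorded, and the traversal happens when an action runs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyEval").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)    // transformation: recorded, nothing executed yet
val squares = evens.map(n => n.toLong * n)  // transformation: still nothing executed

println(squares.count())                    // action: triggers the actual computation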
Spark RDD – since Spark 1.0
RDD stands for Resilient Distributed Dataset. It is an immutable, partitioned collection of records.
RDD is the fundamental data structure of Spark; its partitions can be shuffled, sent across nodes,
and operated on in parallel. It allows programmers to perform complex in-memory analysis on large
clusters in a fault-tolerant manner. RDD can handle structured and unstructured data easily and
effectively as it has lots of built-in functional operators like group, map and filter etc.
However, when encountering complex logic, RDD has a very obvious disadvantage – operators
cannot be re-used. This is because RDD does not know the information of the stored data, so the
structure of the data is a black box which requires a user to write a very specific aggregation
function to complete an execution. Therefore, RDD is preferable on unstructured data, to be used
for low-level transformations and actions.
RDD provides users with a familiar object-oriented programming style, along with a distributed
collection of JVM objects, which gives compile-time type safety. Using RDDs is straightforward.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for
structured data.
o It allows querying the data via SQL (Structured Query Language) as well
as the Apache Hive variant of SQL, called HQL (Hive Query
Language).
o It supports JDBC and ODBC connections that establish a relation
between Java objects and existing databases, data warehouses and
business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and
JSON, as in the sketch below.
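A hedged sketch of Spark SQL reading a JSON source and querying it; the file people.json and its columns (name, age) are assumptions for the example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlExample").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")     // hypothetical input file
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()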
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming
analytics.
MLlib
o The MLlib is a Machine Learning library that contains various machine
learning algorithms.
o These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by
Apache Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached
to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like
subgraph, joinVertices, and aggregateMessages.
We are often asked how Apache Spark fits in the Hadoop ecosystem, and how one can
run Spark in an existing Hadoop cluster. This blog aims to answer these questions.
First, Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was
designed to read and write data from and to HDFS, as well as other storage systems, such as
HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by
combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
Second, we have constantly focused on making it as easy as possible for every Hadoop user to
take advantage of Spark’s capabilities. No matter whether you run Hadoop 1.x or Hadoop 2.0
(YARN), and no matter whether you have administrative privileges to configure
the Hadoop cluster or not, there is a way for you to run Spark! In particular, there are three ways
to deploy Spark in a Hadoop cluster: standalone, YARN, and SIMR.
Hadoop Yarn deployment: Hadoop users who have already deployed or are
planning to deploy Hadoop Yarn can simply run Spark on YARN without any
pre-installation or administrative access required. This allows users to easily
integrate Spark in their Hadoop stack and take advantage of the full power of
Spark, as well as of other components running on top of Spark.
Spark In MapReduce (SIMR): For the Hadoop users that are not running
YARN yet, another option, in addition to the standalone deployment, is to use
SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start
experimenting with Spark and use its shell within a couple of minutes after
downloading it! This tremendously lowers the barrier of deployment, and lets
virtually everyone play with Spark.
Spark interoperates not only with Hadoop, but with other popular big
data technologies as well.
Apache Hive: Through Shark, Spark enables Apache Hive users to run their
unmodified queries much faster. Hive is a popular data warehouse solution
running on top of Hadoop, while Shark is a system that allows the Hive
framework to run on top of Spark instead of Hadoop. As a result, Shark can
accelerate Hive queries by as much as 100x when the input data fits into
memory, and up to 10x when the input data is stored on disk.
AWS EC2: Users can easily run Spark (and Shark) on top of Amazon’s EC2
either using the scripts that come with Spark, or the hosted versions of
Spark and Shark on Amazon’s Elastic MapReduce.
Apache Mesos: Spark runs on top of Mesos, a cluster manager system
which provides efficient resource isolation across distributed applications,
including MPI and Hadoop. Mesos enables fine grained sharing which
allows a Spark job to dynamically take advantage of the idle resources in the
cluster during its execution. This leads to considerable performance
improvements, especially for long running Spark jobs.
Apache Spark is the new shiny big data bauble, gaining fame and mainstream
presence amongst its customers. Startups to Fortune 500s are adopting Apache Spark
to build, scale, and innovate their big data applications. Here are some notable use cases.
Your credit card is swiped for $9000 and the receipt has been signed, but it was not
you who swiped the credit card as your wallet was lost. This might be some kind of
a credit card fraud. Financial institutions are leveraging big data to find out when
and where such frauds are happening so that they can stop them. They need to
resolve any kind of fraudulent charges at the earliest by detecting frauds right from
the first minor discrepancy. They already have models to detect fraudulent
transactions and most of them are deployed in batch environment. With the use of
Apache Spark on Hadoop, financial institutions can detect fraudulent transactions
in real-time, based on previous fraud footprints. All the incoming transactions are
validated against a database; if there is a match, then a trigger is sent to the call centre.
The call centre personnel immediately checks with the credit card owner to
validate the transaction before any fraud can happen.
Apache Spark ecosystem can be leveraged in the finance industry to achieve best in
class results with risk based assessment, by collecting all the archived logs and
combining with other external data sources (information about compromised
accounts or any other data breaches).
Apache Spark is used in genomic sequencing to reduce the time needed to process
genome data. Earlier, it took several weeks to organize all the chemical compounds
with genes, but now with Apache Spark on Hadoop it takes just a few hours. This use
case of Spark might not be as real-time as the others, but it renders considerable benefits
to researchers over earlier implementations of genomic sequencing.
A few of the video sharing websites use Apache Spark along with MongoDB to show
relevant advertisements to their users based on the videos they view, share, and
browse.
Companies Using Spark in Media & Entertainment
Industry
Earlier, the machine learning algorithm for news personalization required 15,000 lines
of C++ code, but now with Spark and Scala the machine learning algorithm for news
personalization has just 120 lines of Scala programming code. The algorithm was
ready for production use in just 30 minutes of training, on a hundred million
datasets.
The spike in the number of Spark use cases is just at its commencement, and
2016 will make Apache Spark the big data darling of many more companies, as they
start using Spark to make prompt decisions based on real-time processing through
Spark Streaming. These are just some of the use cases of the Apache Spark ecosystem.
Riot Games uses Apache Spark to minimize the in-game toxicity. Whether you are
winning or losing, some players get into a rage. Game developers at Riot use Spark
MLlib to train their models on NLP for words, short forms, initials, etc., to understand
how a player interacts, and they can even disable a player's account if required.
Tencent
Tencent has the biggest mobile gaming user base; similar to Riot, it develops
multiplayer games. Tencent uses Spark for its in-memory computing feature, which
boosts data processing performance in real time in a big data context while also
assuring fault tolerance and scalability. It uses Apache Spark to analyze multiplayer
chat data to reduce the use of abusive language in in-game chat.
Hearst
It is a leading global media information and services company. Its main goal is to
provide services to many major businesses, from television channels to financial
services. Using Apache Spark Streaming, Hearst’s team gleans real-time insights on
articles/news items performing well and identifies content that is trending.
FINRA
FINRA is a Financial Services company that helps get real-time data insights of
billions of data events. Using Apache Spark, it can test things on real data from the
market, improving its ability to provide investor security and promote market
integrity.
Solution Architecture: This implementation has the following steps: Writing events in
the context of a data pipeline. Then designing a data pipeline based on messaging.
This is followed by executing the file pipeline utility. After this we load data from a
Problem: Large companies usually have multiple storehouses of data. All this data
must be moved to a single location to make it easy to generate reports. A data
warehouse is that single location.
Solution Architecture: The first layer of this Spark project moves data to HDFS.
Hive tables are built on top of HDFS; data comes in through batch processing, and
Sqoop is used to ingest it. DataFrames are used for storage instead of RDDs. In
the second layer, we normalize and denormalize the data tables. Then transformation is
done using Spark SQL, and the transformed data is moved to HDFS. In the final, third layer,
visualization is done.
Gumgum
GumGum is an AI-focused technology and digital media company. It has been using machine
learning to extract value from digital content for a long time. It is an in-image and in-screen
advertising platform, employing Spark on Amazon EMR for forecasting, log processing, ad hoc
analysis, and a lot more. Spark's speed helps GumGum save lots of time and resources. It
uses computer vision and NLP to identify and score different types of content.
MapReduce Programming:
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map
as an input and combines those data tuples into a smaller set of tuples. As
the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce model, the
data processing primitives are called mappers and reducers. Decomposing a
data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to
where the data resides.
MapReduce program executes in three stages, namely map stage,
shuffle stage, and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
Example Scenario
Given below is the data regarding the electrical consumption of an
organization. It contains the monthly electrical consumption and the annual
average for various years.
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Input Data
The above data is saved as sample.txt and given as input. The input file
looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25
26 25
1980 26 27 28 28 28 30 31 31 31 30 30
30 29
1981 31 32 32 32 33 34 35 36 36 34 34
34 34
1984 39 38 39 39 39 41 42 43 40 39 38
38 40
1985 38 39 39 39 39 41 41 41 00 40 39
39 45
Example Program
Given below is the program applied to the sample data using the MapReduce framework.
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits {
   //Mapper class
   public static class E_EMapper extends MapReduceBase
   implements Mapper<LongWritable, Text, Text, IntWritable> {
      //Map function
      public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();
         while(s.hasMoreTokens()) {
lasttoken = s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new
IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase
implements Reducer< Text, IntWritable, Text, IntWritable > {
//Reduce function
public void reduce( Text key, Iterator <IntWritable>
values,
OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int maxavg = 30;
int val = Integer.MIN_VALUE;
while (values.hasNext()) {
if((val = values.next().get())>maxavg) {
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception {
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Save the above program as ProcessUnits.java. The compilation and
execution of the program is explained below.
Step 1
The following command is to create a directory to store the compiled java
classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the
MapReduce program. Visit the following link mvnrepository.com to download
the jar. Let us assume the downloaded folder is /home/hadoop/.
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt in the
input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt
input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking
the input files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits
input_dir output_dir
Wait for a while until the file is executed. After execution, as shown below,
the output will contain the number of input splits, the number of Map tasks,
the number of reducer tasks, etc.
14/10/31 06:02:52 INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
File System Counters
Map-Reduce Framework
Bytes Written = 40
Step 9
The following command is used to see the output in Part-00000 file. This file
is generated by HDFS.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the
local file system for analyzing.
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by
the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script
without any arguments prints the description for all commands.
Usage − hadoop [--config confdir] COMMAND
The following table lists the options available and their description.
1 namenode -format
Formats the DFS filesystem.
2 secondarynamenode
Runs the DFS secondary namenode.
3 namenode
Runs the DFS namenode.
4 datanode
Runs a DFS datanode.
5 dfsadmin
Runs a DFS admin client.
6 mradmin
Runs a Map-Reduce admin client.
7 fsck
Runs a DFS filesystem checking utility.
8 fs
Runs a generic filesystem user client.
9 balancer
Runs a cluster balancing utility.
10 oiv
Applies the offline fsimage viewer to an fsimage.
11 fetchdt
Fetches a delegation token from the NameNode.
12 jobtracker
Runs the MapReduce job Tracker node.
13 pipes
Runs a Pipes job.
15 historyserver
Runs job history servers as a standalone daemon.
16 job
Manipulates the MapReduce jobs.
17 queue
Gets information regarding Job Queues.
18 version
Prints the version.
19 jar <jar>
Runs a jar file.
23 classpath
Prints the class path needed to get the Hadoop jar and the required
libraries.
1 -submit <job-file>
Submits the job.
2 -status <job-id>
Prints the map and reduce completion percentage and all job
counters.
4 -kill <job-id>
Kills the job.
6 -history [all] <jobOutputDir>
Prints job details, failed and killed tip details. More details about the
job such as successful tasks and task attempts made for each task
can be viewed by specifying the [all] option.
7 -list [all]
Displays all jobs. -list displays only jobs which are yet to complete.
9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.
In the MapReduce word count example, we find the frequency of each word.
Here, the role of the Mapper is to emit each word as a key with an associated count,
and the role of the Reducer is to aggregate the counts for each key. So, everything is
represented in the form of key-value pairs.
Pre-requisite
Java Installation - Check whether Java is installed using the following command.
java -version
Hadoop Installation - Check whether Hadoop is installed using the following command.
hadoop version
www.javatpoint.com/hadoop-installation
File: WC_Mapper.java
package com.javatpoint;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while(tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
File: WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException {
        int sum=0;
        while (values.hasNext()) {
            sum+=values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
File: WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf,new Path(args[0]));
        FileOutputFormat.setOutputPath(conf,new Path(args[1]));
        JobClient.runJob(conf);
    }
}
The illustration given below shows the iterative operations on Spark RDD. It
will store intermediate results in a distributed memory instead of Stable
storage (Disk) and make the system faster.
Note − If the Distributed memory (RAM) is not sufficient to store intermediate
results (State of the JOB), then it will store those results on the disk.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access, the next time you query it. There is also support
for persisting RDDs on disk, or replicated across multiple nodes.
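A hedged sketch of persisting an RDD so that later actions reuse the cached elements; the input file input.txt and the chosen storage level are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("PersistExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.textFile("input.txt")                       // hypothetical input file
  .flatMap(_.split(" "))
  .persist(StorageLevel.MEMORY_AND_DISK)                   // spill to disk if RAM is not sufficient

println(words.count())                                     // first action: computes and caches the RDD
println(words.distinct().count())                          // later actions reuse the cached elements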
Programming
Spark contains two different types of shared variables: broadcast variables and accumulators.
Broadcast variables − used to efficiently distribute large values.
Accumulators − used to aggregate information from a particular collection.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached
on each machine rather than shipping a copy of it with tasks. They can be used, for example, to
give every node a copy of a large input dataset in an efficient manner. Spark also attempts to
distribute broadcast variables using efficient broadcast algorithms to reduce communication
cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations.
Spark automatically broadcasts the common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and is deserialized before running
each task. This means that explicitly creating broadcast variables, is only useful when tasks
across multiple stages need the same data or when caching the data in deserialized form is
important.
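A minimal sketch in the Scala shell, following the standard broadcast example (the array is just an assumed sample value):
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)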
Accumulators
Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI, which can be useful for understanding the progress of running stages (note − this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v).
Tasks running on the cluster can then add to it using the add method or the += operator (in
Scala and Python). However, they cannot read its value. Only the driver program can read the
accumulator’s value, using its value method.
The code given below shows an accumulator being used to add up the elements of an array −
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
Output
res2: Int = 10
Numeric RDD Operations
Spark allows you to perform different operations on numeric data, using one of the predefined API methods. Spark’s numeric operations are implemented with a streaming algorithm that allows building the model one element at a time.
1 count()
Number of elements in the RDD.
2 mean()
Average of the elements in the RDD.
3 sum()
Total value of the elements in the RDD.
4 max()
Maximum value among all elements in the RDD.
5 min()
Minimum value among all elements in the RDD.
6 variance()
Variance of the elements.
7 stdev()
Standard deviation of the elements.
If you want to use only one of these methods, you can call the corresponding method directly on the RDD.
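For example, assuming a small sample RDD of doubles, the statistics methods can be called directly:
scala> val nums = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
scala> nums.mean()
scala> nums.stdev()
scala> nums.stats()    // returns count, mean, stdev, max and min in a single pass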
Core Programming
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines. The most commonly used RDD transformations are listed below.
1 map(func)
Returns a new distributed dataset, formed by passing each element of
the source through a function func.
2 filter(func)
Returns a new dataset formed by selecting those elements of the source
on which func returns true.
3 flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output
items (so func should return a Seq rather than a single item).
4 mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD,
so func must be of type Iterator<T> ⇒ Iterator<U> when running on an
RDD of type T.
5 mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value
representing the index of the partition, so func must be of type (Int,
Iterator<T>) ⇒ Iterator<U> when running on an RDD of type T.
7 union(otherDataset)
Returns a new dataset that contains the union of the elements in the
source dataset and the argument.
8 intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the
source dataset and the argument.
10 groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs.
Note − If you are grouping in order to perform an aggregation (such as a
sum or average) over each key, using reduceByKey or aggregateByKey
will yield much better performance.
11 reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs
where the values for each key are aggregated using the given reduce
function func, which must be of type (V, V) ⇒ V. Like in groupByKey, the
number of reduce tasks is configurable through an optional second
argument.
13 sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered,
returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the Boolean ascending argument.
14 join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of
(K, (V, W)) pairs with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
16 cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U)
pairs (all pairs of elements).
17 pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines
output to its stdout are returned as an RDD of strings.
18 coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful
for running operations more efficiently after filtering down a large dataset.
19 repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data over
the network.
20 repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each
resulting partition, sort records by their keys. This is more efficient than
calling repartition and then sorting within each partition because it can
push the sorting down into the shuffle machinery.
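A short sketch chaining a few of these transformations on an assumed sample RDD:
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)                      // map
val multiplesOfFour = doubled.filter(_ % 4 == 0)   // filter
val combined = doubled.union(multiplesOfFour)      // union
combined.collect()                                 // collect() is an action that returns the result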
Spark Shell
Spark provides an interactive shell − a powerful tool to analyze data
interactively. It is available in either Scala or Python language. Spark’s
primary abstraction is a distributed collection of items called a Resilient
Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats
(such as HDFS files) or by transforming other RDDs.
RDD Transformations
RDD transformations return a pointer to a new RDD and allow you to create dependencies between RDDs. Each RDD in the dependency chain (chain of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD.
Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution; transformations on their own only build up the dependency chain. The word-count snippet below illustrates this.
Therefore, an RDD transformation is not a set of data but a step in a program (possibly the only step) telling Spark how to get data and what to do with it.
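A minimal sketch of that word-count chain (input.txt and the output directory match the example used later in these notes):
val textFile = sc.textFile("input.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)      // still lazy - no job has run yet
counts.saveAsTextFile("output")               // this action triggers job creation and execution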
Given below is a list of RDD actions, which return values to the driver program.
1 reduce(func)
Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel.
2 collect()
Returns all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.
3 count()
Returns the number of elements in the dataset.
4 first()
Returns the first element of the dataset (similar to take (1)).
5 take(n)
Returns an array with the first n elements of the dataset.
7 takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural
order or a custom comparator.
8 saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other Hadoop-
supported file system. Spark calls toString on each element to convert it to a line of text in the file.
11 countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int)
pairs with the count of each key.
12 foreach(func)
Runs a function func on each element of the dataset. This is usually
done for side effects such as updating an Accumulator or interacting
with external storage systems.
Note − modifying variables other than Accumulators outside of the
foreach() may result in undefined behavior. See Understanding
closures for more details.
Actions
Actions such as those listed above return values to the driver program rather than producing a new RDD.
Example
Consider a word-count example − it counts each word appearing in a document. Consider the following text as input, saved as an input.txt file in the home directory.
Open Spark-Shell
The following command is used to open the Spark shell. Spark is generally built with Scala; therefore, a Spark program runs in a Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will see output like the following. The last line of the output, “Spark context available as sc”, means the shell has automatically created a SparkContext object with the name sc. The SparkContext object must exist before the first step of a program can run.
Spark assembly has been built with Hive, including Datanucleus
jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-
defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to:
hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls
to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager:
authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop);
users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service
'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Current RDD
While working with an RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
After the result is saved with an action such as counts.saveAsTextFile("output"), listing the output directory shows the files produced by the job:
part-00000
part-00001
_SUCCESS
The following command is used to see the output from the part-00000 file.
[hadoop@localhost output]$ cat part-00000
Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)
The following command is used to see the output from the part-00001 file.
[hadoop@localhost output]$ cat part-00001
Output
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
If you want to un-persist the storage space of a particular RDD, use the following command.
scala> counts.unpersist()
You will see the output as follows −
15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from
persistence list
15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480
dropped from memory (free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296
dropped from memory (free 280062106)
res7: cou.type = ShuffledRDD[9] at reduceByKey at <console>:14
To verify the storage in the browser, use the Spark web UI URL (by default, http://localhost:4040).
Streaming in Spark:
Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively
supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark
API that allows data engineers and data scientists to process real-time data from various sources
including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be
pushed out to file systems, databases, and live dashboards. Its key abstraction is a Discretized
Stream or, in short, a DStream, which represents a stream of data divided into small batches.
DStreams are built on RDDs, Spark’s core data abstraction. This allows Spark Streaming to
seamlessly integrate with any other Spark components like MLlib and Spark SQL. Spark
Streaming is different from other systems that either have a processing engine designed only for
streaming, or have similar batch and streaming APIs but compile internally to different engines.
Spark’s single execution engine and unified programming model for batch and streaming lead to
some unique benefits over other traditional streaming systems.
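A minimal DStream sketch of the classic streaming word count, assuming text lines arrive on a local socket at port 9999 (a hypothetical source):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))       // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)    // DStream of incoming lines
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()                                     // prints a few counts for each batch
ssc.start()
ssc.awaitTermination()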
Streaming Features:
1. Introduction
In the Big Data world, there are many tools and frameworks available to process large volumes of data in offline or batch mode. But the need for real-time processing, to analyze data arriving at high velocity on the fly and provide analytics or enrichment services, is also high. Over the last couple of years this has been an ever-changing landscape, with many new entrants among streaming frameworks, so choosing a real-time processing engine becomes a challenge.
2. Design
Real-time streaming engines interact with stream or messaging frameworks such as Apache Kafka, RabbitMQ, or Apache Flume to receive data in real time.
They process the data inside a cluster computing engine, which typically runs on top of a cluster manager such as Apache YARN or Apache Mesos.
The processed data is sent back to message queues (Apache Kafka, RabbitMQ, Flume) or written to storage such as HDFS or NFS.
3.1.1 Compositional
This approach provides basic components from which the streaming application can be created. For example, in Apache Storm, the spout is used to connect to different sources and receive the data, and bolts are used to process the received data.
3.1.2 Declarative
This is more of a functional programming approach, where the framework allows us to define higher-order functions. These declarative APIs provide more advanced operations like windowing or state management, and they are considered more flexible.
3.2.1 At Most Once
This is a best-effort delivery mechanism. Each message is delivered at most once, so messages may be lost, but duplicates are never processed.
3.3.2 Stateful Processing
Stream processing frameworks can make use of previous events when processing incoming events, by storing them in a cache or an external database. Real-time analytics applications need stateful processing so that they can collect data for a specific interval and process it before recommending any suggestions to the user.
3.4.3 Batch
The incoming events are processed like a bounded stream of inputs, which allows the framework to process a large but finite set of incoming events.
their development environment, they do not need to deploy
their code in the large cluster computing environment.
3.9.3 Remote DBMS connection: These systems support connecting to external databases outside the streaming cluster. This is considered less efficient because of the higher latency and the bottlenecks introduced by network communication.
The above frameworks support both stateful and stateless
processing modes.
5. Conclusion
This article summarizes the various features of streaming frameworks, which are critical selection criteria for a new streaming application. Every application is unique and has its own specific functional and non-functional requirements, so the right framework depends entirely on those requirements.
Use cases on streaming: the top four use cases for streaming data integration
The first one is cloud adoption – specifically online database migration. When you have
your legacy database and you want to move it to the cloud and modernize your data
infrastructure, if it’s a critical database, you don’t want to experience downtime. The streaming
data integration solution helps with that. When you’re doing an initial load from the legacy
system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions
happening in this database as they happen. Once this database is loaded and ready, all the
changes that happened in the legacy database can be applied in the cloud. During the migration,
your legacy system is open for transactions – you don’t have to pause it.
While the migration is happening, CDC helps you to keep these two databases continuously in-
sync by moving the real-time data between the systems. Because the system is open to
transactions, there is no business interruption. And if this technology is designed for both
validating the delivery and checkpointing the systems, you will also not experience any data loss.
Because this cloud database has production data, is open to transactions, and is continuously
updated, you can take your time to test it before you move your users. So you have basically
unlimited testing time, which helps you minimize your risks during such a major transition. Once
the system is completely in-sync and you have checked it and tested it, you can point your
applications and run your cloud database.
This is a single switch-over scenario. But streaming data integration gives you the ability to
move the data bi-directionally. You can have both systems open to transactions. Once you test
this, you can run some of your users in the cloud and some of your users in the legacy database.
All the changes happening with these users can be moved between databases, synchronized so
that they’re constantly in-sync. You can gradually move your users to the cloud database to
further minimize your risk. Phased migration is a very popular use case, especially for mission-
critical systems that cannot tolerate risk and downtime.
Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You
need to connect it with the rest of your enterprise. It needs to be a natural extension of your data
center. Continuous real-time data movement with streaming data integration allows you to have
your cloud databases and services as part of your data center.
The important thing is that these workloads in the cloud can be operational workloads because
there's fresh information (i.e., continuously updated information) available. Your databases, your
machine data, your log files, your other cloud sources, messaging systems, and sensors can move
continuously to enable operational workloads.
What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you
want operational reporting or operational intelligence, you want comprehensive data delivered continuously so that you can trust it is up to date, and gain operational intelligence from your
analytics solutions.
You can also connect your data sources with the messaging systems in the cloud to
support event distribution for your new apps that you’re running in the cloud so that they are
completely part of your data center. If you’re adopting multi-cloud solutions, you can again
connect your new cloud systems with existing cloud systems, or send data to multiple cloud
destinations.
A third use case is real-time modern applications. Cloud is a big trend right now, but not
everything is necessarily in the cloud. You can have modern applications on-premises. So, if
you’re building any real-time app and modern new system that needs timely information, you
need to have continuous real-time data pipelines. Streaming data integration enables you to run
real-time apps with real-time data.
Use Case #4 Hot Cache
Last, but not least, when you have an in-memory data grid to help with your data retrieval
performance, you need to make sure it is continuously up-to-date so that you can rely on that
data – it’s something that users can depend on. If the source system is updated, but your cache is
not updated, it can create business problems. By continuously moving real-time data
using CDC technology, streaming data integration helps you to keep your data grid up-to-date. It
can serve as your hot cache to support your business with fresh data.
Machine Learning:
Big Data vs. Machine Learning: Big Data is more about the extraction and analysis of information from huge volumes of data, whereas Machine Learning is more about using input data and algorithms to estimate unknown future results.
Overview
With the demand for big data and machine learning, this article provides an
introduction to Spark MLlib, its components, and how it works. This covers the
main topics of using machine learning algorithms in Apache Spark.
Introduction
Apache Spark is a data processing framework that can quickly perform processing tasks on very
large data sets and can also distribute data processing tasks across multiple computers, either on
its own or in tandem with other distributed computing tools. It is a lightning-fast unified analytics engine for big data and machine learning. To support Python with Spark, the Apache Spark community released a tool, PySpark; using PySpark, you can work with RDDs in the Python programming language as well. Apache Spark consists of the following components:
1. Spark Core
2. Spark SQL
3. Spark Streaming
4. Spark MLlib
5. GraphX
6. Spark R
Spark Core
All the functionalities being provided by Apache Spark are built on the top of Spark Core. It
manages all essential I/O functionalities. It is used for task dispatching and fault recovery. Spark
Core is embedded with a special collection called RDD (Resilient Distributed Dataset). RDD is
among the abstractions of Spark. Spark RDD handles partitioning data across all the nodes in a
cluster. It holds them in the memory pool of the cluster as a single unit. There are two operations
performed on RDDs:
Transformation: a function that produces a new RDD from existing RDDs.
Action: transformations create RDDs from each other, but when we want to work with the actual dataset, an Action is performed; actions return a result to the driver program or write it to external storage.
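A small sketch of the two kinds of operations on an assumed sample RDD:
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val squared = rdd.map(x => x * x)    // Transformation: builds a new RDD lazily
val total = squared.reduce(_ + _)    // Action: runs the job and returns 30 to the driver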
Spark SQL
The Spark SQL component is a distributed framework for structured data
processing. Spark SQL works to access structured and semi-structured information.
It also enables powerful, interactive, analytical applications across both streaming
and historical data. DataFrames and SQL provide a common way to access a
variety of data sources. Its main features are a cost-based optimizer and mid-query fault tolerance.
Spark Streaming
It is an add-on to the core Spark API that allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming groups the live data into small batches and then delivers them to the batch system for processing. It also provides fault-tolerance characteristics.
Spark GraphX:
GraphX in Spark is an API for graphs and graph-parallel execution. It is a network graph analytics engine and data store. Clustering, classification, traversal, searching, and pathfinding are also possible on graphs.
SparkR:
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R, so that R users can work with distributed data.
Spark MLlib:
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists
of popular algorithms and utilities. MLlib in Spark is a scalable machine learning library that provides both high-quality algorithms and high speed. It includes machine learning algorithms such as regression, classification, clustering, pattern mining, and collaborative filtering. Lower-level machine learning primitives, such as a generic gradient descent optimization algorithm, are also present in MLlib.
1. ML Algorithms
2. Featurization
3. Pipelines
4. Persistence
5. Utilities
ML Algorithms
ML Algorithms form the core of MLlib. These include common learning algorithms such
as classification, regression, clustering, and collaborative filtering.
MLlib standardizes APIs to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concept is the Pipelines API, where the pipeline concept is inspired by the scikit-learn project.
Transformer:
A Transformer is an algorithm that can transform one DataFrame into another
DataFrame. Technically, a Transformer implements a method transform(), which
converts one DataFrame into another, generally by appending one or more columns. For
example:
A feature transformer might take a DataFrame, read a column (e.g., text), map it into a
new column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.
A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted
labels appended as a column.
Estimator:
An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.
Technically, an Estimator implements a method fit(), which accepts a DataFrame and
produces a Model, which is a Transformer. For example, a learning algorithm such as
LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel,
which is a Model and hence a Transformer.
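A small sketch of the Estimator/Transformer relationship, assuming a DataFrame named training with "label" and "features" columns:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()              // Estimator
val model = lr.fit(training)                   // fit() produces a LogisticRegressionModel (a Transformer)
val predictions = model.transform(training)    // transform() appends prediction columns to the DataFrame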
1. Featurization
2. Pipelines:
Example: The pipeline sample given below performs the data preprocessing in a specific order:
1. Apply the StringIndexer method to find the index of the categorical columns
The pipeline workflow will execute the data modelling in this specific order.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator
# stringIndexer, categoricalCol, label_stringIdx, Vassembler (a VectorAssembler),
# df and selectedCols are defined earlier in the full example; only the pipeline
# assembly steps are shown here.
stages = []
encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                 outputCols=[categoricalCol + "Vec"])
stages += [stringIndexer, encoder]
stages += [label_stringIdx]
stages += [Vassembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
df = df.select(selectedCols)
Dataframe
Dataframes provide a more user-friendly API than RDDs. The DataFrame-based API for
MLlib provides a uniform API across ML algorithms and across multiple languages.
Dataframes facilitate practical ML Pipelines, particularly feature transformations.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mlearnsample').getOrCreate()
# df is a DataFrame loaded from a data source (for example, spark.read.csv(...))
df.printSchema()
3. Persistence:
Persistence helps in saving and loading algorithms, models, and Pipelines. This helps reduce time and effort: once a model is persisted, it can be loaded and reused whenever needed.
# lr is a LogisticRegression estimator and train/test are DataFrames prepared earlier in the workflow
lrModel = lr.fit(train)
evaluator = BinaryClassificationEvaluator()
predictions = lrModel.transform(test)
predictions.select('age', 'label', 'rawPrediction', 'prediction').show()
4. Utilities:
Utilities for linear algebra, statistics, and data handling. For example, mllib.linalg provides the MLlib utilities for linear algebra.
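For instance, vectors from the linear algebra utilities can be created as follows (the values are arbitrary):
import org.apache.spark.mllib.linalg.Vectors

val dense = Vectors.dense(1.0, 0.0, 3.0)                       // dense vector
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))   // the same vector in sparse form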
TOOLS:
Spark tools are the major software features of the Spark framework that are used for efficient and scalable data processing in big data analytics. The Spark framework is open-sourced under the Apache license. It comprises five important tools for data processing: GraphX, MLlib, Spark Streaming, Spark SQL and Spark Core. GraphX is the tool used for processing and managing graph data analysis. The MLlib Spark tool is used for machine learning implementation on distributed datasets, whereas Spark Streaming is used for stream data processing. Spark SQL is the tool mostly used for structured data analysis, and the Spark Core tool manages Resilient Distributed Datasets (RDDs).
Tools of Spark
There are five Spark tools, namely GraphX, MLlib, Spark Streaming, Spark SQL and Spark Core.
1. GraphX Tool
This is the Spark API for graphs as well as graph-parallel computation. GraphX provides a Resilient Distributed Property Graph, which is an extension of the Spark RDD.
This important tool is used to develop as well as manipulate graph data. You can pick from a fast-growing collection of graph algorithms, or even develop custom algorithms to monitor ETL insights. The GraphFrames package permits you to perform graph operations on DataFrames, including leveraging the Catalyst optimizer for graph queries. This tool also includes Google's highly acclaimed PageRank algorithm; these algorithms run on Spark's distributed engine.
2. MLlib Tool
MLlib is a library that contains basic machine learning services. The library offers various kinds of machine learning algorithms that make many operations possible on a distributed dataset. The MLlib tool has a framework for developing machine learning pipelines, enabling easy implementation of feature handling and model training on a particular structured dataset. It includes rudimentary machine learning such as classification, regression, clustering and collaborative filtering; however, facilities for training deep neural networks and modeling them are not available. MLlib supplies robust algorithms as well as lightning speed for building and maintaining machine learning pipelines, and it operates natively on top of Apache Spark, delivering quick and scalable performance.
3. Spark Streaming Tool
This tool's purpose is to process live streams of data, that is, real-time processing of data produced by different sources such as log files, messaging systems and sensors. It leverages Spark Core's speedy scheduling capability to perform streaming analytics. The stream is represented as a series of Resilient Distributed Datasets (a DStream) whose function is to process the real-time data. This useful tool extended the Apache Spark paradigm of batch processing into streaming, achieved by breaking down the stream into a sequence of small micro-batches, which are then manipulated using the Apache Spark API. Spark Streaming inherits the platform's reliable fault tolerance, making it an extremely dependable way to gain analytics abilities for live data sourced from almost any common repository source.
4. Spark SQL Tool
Spark SQL lets structured data be queried through the platform's functional programming interface. There is support for querying data through the Hive Query Language as well as through standard SQL. Spark SQL consists of four libraries:
SQL Service
This tool's function is to work with structured data. It gives integrated access to the most common data sources, including JDBC, JSON, Hive, Avro and more. The tool sorts data into labeled columns and rows, perfect for dispatching the results of high-speed queries. Spark SQL integrates smoothly with newly introduced as well as already existing Spark programs.
Spark employs a query optimizer named Catalyst which studies data and queries with the
objective of creating an efficient query plan for computation as well as data locality. The plan
will execute the necessary calculations across the cluster. Currently, it is advised to use the Spark
SQL interface of datasets as well as data frames for the purpose of development.
5. Spark Core Tool
This is the basic building block of the platform. Among other things, it consists of components for running memory operations, job scheduling and others. Core hosts the API containing RDDs and provides the APIs that are used to build and manipulate these distributed datasets. Distributed task dispatching as well as fundamental I/O functionalities are also provided by the core. When benchmarked against Apache Hadoop components, the Spark Application Programming Interface is pretty simple and easy to use for developers. The API conceals a large part of the complexity involved in a distributed processing engine.
Spark operates in a distributed way by combining a driver core process, which splits a particular Spark application into multiple tasks and distributes them among numerous executor processes that perform the work. These executors can be scaled up or down as required by the application.
All the tools belonging to the Spark ecosystem interact smoothly and run well while consuming minimal overhead. This makes Spark both an extremely scalable as well as a very powerful platform, and work is ongoing to improve the tools further.
Algorithms-Classification
Spark – Overview
Apache Spark is a lightning fast real-time processing framework. It does in-memory
computations to analyze data in real time. It came into the picture because Apache Hadoop MapReduce performed only batch processing and lacked a real-time processing feature.
Hence, Apache Spark was introduced as it can perform stream processing in real-time and can
also take care of batch processing.
Apart from real-time and batch processing, Apache Spark supports interactive queries and
iterative algorithms also. Apache Spark has its own cluster manager, where it can host its
application. It leverages Apache Hadoop for both storage and processing. It
uses HDFS (Hadoop Distributed File system) for storage and it can run Spark applications
on YARN as well.
PySpark – Overview
Apache Spark is written in Scala programming language. To support Python with Spark,
Apache Spark Community released a tool, PySpark. Using PySpark, you can work
with RDDs in Python programming language also. It is because of a library called Py4j that
they are able to achieve this.
PySpark offers PySpark Shell which links the Python API to the spark core and initializes the
Spark context. Majority of data scientists and analytics experts today use Python because of its
rich library set. Integrating Python with Spark is a boon to them.
Apache Spark offers a Machine Learning API called MLlib. PySpark has this machine learning
API in Python as well. It supports different kinds of algorithms, which are mentioned below −
mllib.classification − The spark.mllib package supports various methods for binary
classification, multiclass classification and regression analysis. Some of the most
popular algorithms in classification are Random Forest, Naive Bayes, Decision Tree,
etc.
mllib.clustering − Clustering is an unsupervised learning problem, whereby you aim to
group subsets of entities with one another based on some notion of similarity.
mllib.fpm − Frequent pattern mining is mining frequent items, itemsets, subsequences or other substructures, which is usually among the first steps in analyzing a large-scale dataset. This has been an active research topic in data mining for years.
mllib.linalg − MLlib utilities for linear algebra.
mllib.recommendation − Collaborative filtering is commonly used for recommender
systems. These techniques aim to fill in the missing entries of a user item association
matrix.
spark.mllib − It currently supports model-based collaborative filtering, in which users
and products are described by a small set of latent factors that can be used to predict
missing entries. spark.mllib uses the Alternating Least Squares (ALS) algorithm to learn
these latent factors.
mllib.regression − Linear regression belongs to the family of regression algorithms.
The goal of regression is to find relationships and dependencies between variables. The
interface for working with linear regression models and model summaries is similar to
the logistic regression case.
The mllib package also contains other algorithms, classes, and functions. For now, let us look at a demonstration of pyspark.mllib. The following example uses collaborative filtering with the ALS algorithm to build a recommendation model and evaluate it on the training data.
Dataset used − test.data
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0
-------------------------------------- recommend.py --------------------------------------
from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS,
MatrixFactorizationModel, Rating
if __name__ == "__main__":
    sc = SparkContext(appName="Pspark mllib Example")
    data = sc.textFile("test.data")
    ratings = data.map(lambda l: l.split(','))\
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
    # Build the recommendation model using ALS on the ratings RDD
    # (rank and iteration count are typical example values)
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)
Logistic regression
Logistic regression is a popular method to predict a categorical response. It is a special case
of Generalized Linear models that predicts the probability of the outcomes.
In spark.ml logistic regression can be used to predict a binary outcome by using binomial
logistic regression, or it can be used to predict a multiclass outcome by using multinomial
logistic regression. Use the family parameter to select between these two algorithms, or
leave it unset and Spark will infer the correct variant.
Binomial logistic regression
For more background and more details about the implementation of binomial logistic
regression, refer to the documentation of logistic regression in spark.mllib.
Examples
The following example shows how to train binomial and multinomial logistic regression
models for binary classification with elastic net
regularization. elasticNetParam corresponds to $\alpha$ and regParam corresponds to $\lambda$.
More details on parameters can be found in the Scala API documentation.
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
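// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")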
LogisticRegressionTrainingSummary provides a summary for
a LogisticRegressionModel. In the case of binary classification, certain additional metrics
are available, e.g. ROC curve. The binary summary can be accessed via
the binarySummary method. See BinaryLogisticRegressionTrainingSummary.
import org.apache.spark.ml.classification.LogisticRegression
// Extract the summary from the returned LogisticRegressionModel instance
trained in the earlier
// example
val trainingSummary = lrModel.binarySummary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
// Obtain the receiver-operating characteristic as a dataframe and
areaUnderROC.
val roc = trainingSummary.roc
roc.show()
println(s"areaUnderROC: ${trainingSummary.areaUnderROC}")
For multinomial logistic regression, the probability of each outcome class $k \in \{0, 1, \ldots, K-1\}$ is modeled with a softmax:
$$P(Y = k \mid X, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot X + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot X + \beta_{0k'}}}$$
and the weighted negative log-likelihood is minimized with an elastic-net penalty:
$$\min_{\beta, \beta_0} \; -\left[\sum_{i=1}^{L} w_i \cdot \log P(Y = y_i \mid \mathbf{x}_i)\right] + \lambda\left[\tfrac{1}{2}(1 - \alpha)\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]$$
Examples
The following example shows how to train a multiclass logistic regression model with elastic
net regularization, as well as extract the multiclass training summary for evaluating the
model.
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark
.read
.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
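// Fit the model and obtain the training summary used by the metric printouts below
val lrModel = lr.fit(training)
val trainingSummary = lrModel.summary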
println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) =>
println(s"label $label: $prec")
}
println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) =>
println(s"label $label: $rec")
}
println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) =>
println(s"label $label: $f")
}
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use two feature
transformers to prepare the data; these help index categories for the label and categorical
features, adding metadata to the DataFrame which the Decision Tree algorithm can
recognize.
More details on parameters can be found in the Scala API documentation.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
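// The data loading, label/feature indexing, label converter and train/test split are
// prepared exactly as in the tree-based example shown later in these notes.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
// Chain indexers and tree in a Pipeline and train the model.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
val model = pipeline.fit(trainingData)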
// Make predictions.
val predictions = model.transform(testData)
Examples
The following examples load a dataset in LibSVM format, split it into training and
test sets, train on the first dataset, and then evaluate on the held-out test set. We use two
feature transformers to prepare the data; these help index categories for the label and
categorical features, adding metadata to the DataFrame which the tree-based algorithms
can recognize.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel,
RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
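// labelIndexer, featureIndexer, labelConverter and trainingData are prepared
// exactly as in the GBT example that follows.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
val model = pipeline.fit(trainingData)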
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use two feature
transformers to prepare the data; these help index categories for the label and categorical
features, adding metadata to the DataFrame which the tree-based algorithms can
recognize.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel,
GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
.setFeatureSubsetStrategy("auto")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
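// Chain the indexers, GBT and label converter in a Pipeline and train the model.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))
val model = pipeline.fit(trainingData)
// Make predictions and display a few rows.
val predictions = model.transform(testData)
predictions.select("predictedLabel", "label", "features").show(5)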
Multilayer perceptron classifier (MLPC)
MLPC maps inputs to outputs through K layers:
$$y(x) = f_K(\ldots f_2(w_2^T f_1(w_1^T x + b_1) + b_2) \ldots + b_K)$$
Nodes in the intermediate layers use the sigmoid (logistic) function,
$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$
while nodes in the output layer use the softmax function,
$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$
The number of nodes $N$ in the output layer corresponds to the number of classes.
MLPC employs backpropagation for learning the model. We use the logistic loss function for
optimization and L-BFGS as an optimization routine.
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4
// and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
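// Train the model and compute accuracy on the test set.
val model = trainer.fit(train)
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")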
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lsvc = new LinearSVC()
.setMaxIter(10)
.setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)
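// Print the coefficients and intercept for the linear SVC
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")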
One-vs-Rest classifier (One-vs-All)
Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.
Examples
The example below demonstrates how to load the Iris dataset, parse it as a DataFrame and
perform multiclass classification using OneVsRest. The test error is calculated to measure
the algorithm accuracy.
Refer to the Scala API docs for more details.
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic, multiclass classifiers based on
applying Bayes’ theorem with strong (naive) independence assumptions between every pair
of features.
Naive Bayes can be trained very efficiently. With a single pass over the training data, it
computes the conditional probability distribution of each feature given each label. For
prediction, it applies Bayes’ theorem to compute the conditional probability distribution of
each label given an observation.
MLlib supports Multinomial naive Bayes, Complement naive Bayes, Bernoulli naive
Bayes and Gaussian naive Bayes.
Input data: These Multinomial, Complement and Bernoulli models are typically used
for document classification. Within that context, each observation is a document and each
feature represents a term. A feature’s value is the frequency of the term (in Multinomial or
Complement Naive Bayes) or a zero or one indicating whether the term was found in the
document (in Bernoulli Naive Bayes). Feature values for Multinomial and Bernoulli models
must be non-negative. The model type is selected with an optional parameter “multinomial”,
“complement”, “bernoulli” or “gaussian”, with “multinomial” as the default. For
document classification, the input feature vectors should usually be sparse vectors. Since the
training data is only used once, it is not necessary to cache it.
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed =
1234L)
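// Train a NaiveBayes model and show some predictions on the test set.
val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)
predictions.show()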
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We scale features to be
between 0 and 1 to prevent the exploding gradient problem.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{FMClassificationModel,
FMClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, MinMaxScaler,
StringIndexer}
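// Load the data, index the labels and scale the features to [0, 1]
// (the path and column names follow the standard example).
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
val featureScaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .fit(data)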
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a FM model.
val fm = new FMClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("scaledFeatures")
.setStepSize(0.001)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Create a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureScaler, fm, labelConverter))
// Train model.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
Regression
Linear regression
The interface for working with linear regression models and model summaries is similar to
the logistic regression case.
Examples
The following example demonstrates training an elastic net regularized linear regression
model and extracting model summary statistics.
More details on parameters can be found in the Scala API documentation.
import org.apache.spark.ml.regression.LinearRegression
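// Load training data, fit an elastic net regularized linear regression model
// and print some summary statistics (path and parameters follow the standard example).
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
val trainingSummary = lrModel.summary
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")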
GLMs require exponential family distributions that can be written in their “canonical” or
“natural” form, aka natural exponential family distributions. The form of a natural
exponential family distribution is given as:
$$f_Y(y \mid \theta, \tau) = h(y, \tau)\,\exp\!\left(\frac{\theta \cdot y - A(\theta)}{d(\tau)}\right)$$
$$Y_i \sim f(\cdot \mid \theta_i, \tau)$$
where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
$$\mu_i = A'(\theta_i)$$
Here, $A'(\theta_i)$ is defined by the form of the distribution selected. GLMs also allow specification of a link function, which defines the relationship between the expected value of the response variable $\mu_i$ and the so-called linear predictor $\eta_i$:
$$g(\mu_i) = \eta_i = \vec{x_i}^{\,T} \cdot \vec{\beta}$$
Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link function $g(\mu)$ is said to be the “canonical” link function:
$$\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i$$
A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
$$\theta_i = A'^{-1}\!\left(g^{-1}\!\left(\vec{x_i} \cdot \vec{\beta}\right)\right)$$
Spark’s generalized linear regression interface also provides summary statistics for
diagnosing the fit of GLM models, including residuals, p-values, deviances, the Akaike
information criterion, and others.
See here for a more comprehensive review of GLMs and their applications.
Available families
The table of available families, with their response types and supported link functions (the canonical link marked by *), is given in the Spark documentation.
Examples
The following example demonstrates training a GLM with a Gaussian response and identity
link function and extracting model summary statistics.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.regression.GeneralizedLinearRegression
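// Load the data and fit a generalized linear regression model with a Gaussian
// family and identity link (path and parameters follow the standard example).
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)
  .setRegParam(0.3)
val model = glr.fit(dataset)
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")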
// Summarize the model over the training set and print out some
metrics
val summary = model.summary
println(s"Coefficient Standard Errors:
${summary.coefficientStandardErrors.mkString(",")}")
println(s"T Values: ${summary.tValues.mkString(",")}")
println(s"P Values: ${summary.pValues.mkString(",")}")
println(s"Dispersion: ${summary.dispersion}")
println(s"Null Deviance: ${summary.nullDeviance}")
println(s"Residual Degree Of Freedom Null:
${summary.residualDegreeOfFreedomNull}")
println(s"Deviance: ${summary.deviance}")
println(s"Residual Degree Of Freedom: ${summary.residualDegreeOfFreedom}")
println(s"AIC: ${summary.aic}")
println("Deviance Residuals: ")
summary.residuals().show()
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala" in the
Spark repo.
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use a feature
transformer to index categorical features, adding metadata to the DataFrame which the
Decision Tree algorithm can recognize.
More details on parameters can be found in the Scala API documentation.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
// Train a DecisionTree model.
val dt = new DecisionTreeRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
// Chain indexer and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt))
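// featureIndexer, trainingData and testData are prepared as in the neighbouring examples.
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)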
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We use a feature
transformer to index categorical features, adding metadata to the DataFrame which the
tree-based algorithms can recognize.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel,
RandomForestRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
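// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
// Chain indexer and forest in a Pipeline, then train the model (this also runs the indexer).
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))
val model = pipeline.fit(trainingData)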
Examples
Note: For this example dataset, GBTRegressor actually only needs 1 iteration, but that will
not be true in general.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
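// Index categorical features as in the earlier examples, then chain indexer and GBT
// in a Pipeline and train the model.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, gbt))
val model = pipeline.fit(trainingData)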
Survival regression
In spark.ml, we implement the Accelerated failure time (AFT) model which is a parametric
survival regression model for censored data. It describes a model for the log of survival time,
so it’s often called a log-linear model for survival analysis. Different from a Proportional
hazards model designed for the same purpose, the AFT model is easier to parallelize
because each instance contributes to the objective function independently.
Given the values of the covariates $x'$, for random lifetimes $t_i$ of subjects $i = 1, \ldots, n$, with possible right-censoring, the likelihood function under the AFT model is given as:
$$L(\beta,\sigma)=\prod_{i=1}^{n}\left[\frac{1}{\sigma}f_{0}\!\left(\frac{\log t_{i}-x'\beta}{\sigma}\right)\right]^{\delta_{i}}S_{0}\!\left(\frac{\log t_{i}-x'\beta}{\sigma}\right)^{1-\delta_{i}}$$
where $\delta_i$ is the indicator that the event has occurred, i.e. whether the observation is uncensored or not. Using $\epsilon_{i}=\frac{\log t_{i}-x'\beta}{\sigma}$, the log-likelihood function assumes the form:
$$\iota(\beta,\sigma)=\sum_{i=1}^{n}\left[-\delta_{i}\log\sigma+\delta_{i}\log f_{0}(\epsilon_{i})+(1-\delta_{i})\log S_{0}(\epsilon_{i})\right]$$
where $S_{0}(\epsilon_{i})$ is the baseline survivor function and $f_{0}(\epsilon_{i})$ is the corresponding density function.
The most commonly used AFT model is based on the Weibull distribution of the survival time. The Weibull distribution for lifetime corresponds to the extreme value distribution for the log of the lifetime, and the $S_{0}(\epsilon)$ function is:
$$S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})$$
with the corresponding density function
$$f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})$$
The log-likelihood function for the AFT model with a Weibull distribution of lifetime is:
$$\iota(\beta,\sigma)=-\sum_{i=1}^{n}\left[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}\right]$$
Since minimizing the negative log-likelihood is equivalent to maximum a posteriori probability, the loss function we optimize is $-\iota(\beta,\sigma)$. The gradient functions for $\beta$ and $\log\sigma$ respectively are:
$$\frac{\partial(-\iota)}{\partial\beta}=\sum_{i=1}^{n}\left[\delta_{i}-e^{\epsilon_{i}}\right]\frac{x_{i}}{\sigma}$$
$$\frac{\partial(-\iota)}{\partial(\log\sigma)}=\sum_{i=1}^{n}\left[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}\right]$$
The AFT model can be formulated as a convex optimization problem, i.e. the task of finding a minimizer of a convex function $-\iota(\beta,\sigma)$ that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$. The optimization algorithm underlying the implementation is L-BFGS. The implementation matches the result from R's survival function survreg.
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression
val training = spark.createDataFrame(Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
  (3.627, 0.0, Vectors.dense(1.380, 0.231)),
  (0.273, 1.0, Vectors.dense(0.520, 1.151)),
  (4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")
val quantileProbabilities = Array(0.3, 0.6)
val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(quantileProbabilities)
  .setQuantilesCol("quantiles")
// Fit the model and inspect the learned parameters.
val model = aft.fit(training)
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale: ${model.scale}")
model.transform(training).show(false)
Isotonic regression
Isotonic regression belongs to the family of regression algorithms. Formally, isotonic regression is
the following problem: given a finite set of real numbers $Y = y_1, y_2, \dots, y_n$ representing observed
responses and $X = x_1, x_2, \dots, x_n$ the unknown response values to be fitted, find a function that
minimizes

$$f(x)=\sum_{i=1}^{n} w_{i}(y_{i}-x_{i})^{2} \qquad (1)$$

subject to the complete order $x_1 \le x_2 \le \dots \le x_n$, where the $w_i$ are positive weights.

Training returns an IsotonicRegressionModel that can be used to predict labels for both known and
unknown features. The result of isotonic regression is treated as a piecewise linear function. The
rules for prediction therefore are:
If the prediction input exactly matches a training feature, then the associated prediction is
returned. In case there are multiple predictions with the same feature, one of them is returned;
which one is undefined (same as java.util.Arrays.binarySearch).
If the prediction input is lower or higher than all training features, then the prediction with the
lowest or highest feature is returned, respectively. In case there are multiple predictions with the
same feature, the lowest or highest is returned, respectively.
If the prediction input falls between two training features, then the prediction is treated as a
piecewise linear function and the interpolated value is calculated from the predictions of the two
closest features. In case there are multiple values with the same feature, the same rules as in the
previous point are used.
Examples
Refer to the IsotonicRegression Scala docs for details on the API.
import org.apache.spark.ml.regression.IsotonicRegression
// Loads data.
val dataset = spark.read.format("libsvm")
.load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
// Trains an isotonic regression model.
val ir = new IsotonicRegression()
val model = ir.fit(dataset)
println(s"Boundaries in increasing order: ${model.boundaries}\n")
println(s"Predictions associated with the boundaries:
${model.predictions}\n")
// Makes predictions.
model.transform(dataset).show()
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala" in the Spark repo.
Factorization machines regression
Examples
The following examples load a dataset in LibSVM format, split it into training and test sets,
train on the first dataset, and then evaluate on the held-out test set. We scale features to be
between 0 and 1 to prevent the exploding gradient problem.
Refer to the Scala API docs for more details.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Scale features.
val featureScaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a FM model.
val fm = new FMRegressor()
.setLabelCol("label")
.setFeaturesCol("scaledFeatures")
.setStepSize(0.001)
// Create a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureScaler, fm))
// Train model.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
Introduction to Clustering
It is basically a type of unsupervised learning method. An unsupervised learning method is a
method in which we draw references from datasets consisting of input data without labeled
responses. Generally, it is used as a process to find meaningful structure, explanatory
underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to each other than to the data points in other
groups. It is basically a collection of objects grouped on the basis of similarity and dissimilarity
between them.
For example, the data points in the graph below that are clustered together can be classified into
one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in
the picture.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
In DBSCAN, data points are clustered using the basic idea that a point joins a cluster when it lies
within a given distance constraint of the other points of that cluster; various distance measures and
techniques are used, and points that satisfy no such constraint are treated as outliers, as sketched
in the example below.
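As an illustration only (not part of the original notes), here is a minimal sketch of density-based
clustering with scikit-learn's DBSCAN; the eps and min_samples values are assumptions chosen for
the toy two-moons data:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape that centroid-based methods struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; the label -1 marks noise/outlier points

Points that do not fall inside any dense region are labelled -1, which is how DBSCAN reports outliers.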
Why Clustering?
Clustering is very important as it determines the intrinsic grouping among the unlabelled data
present. There are no universal criteria for a good clustering; it depends on the user and on which
criteria satisfy their need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown
properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes) or
in finding unusual data objects (outlier detection). Each algorithm makes some assumptions about
what constitutes the similarity of points, and each assumption makes different and equally valid
clusters.
Clustering Methods:
Density-Based Methods: These methods consider the clusters as dense regions having some
similarity, separated from the lower-density regions of the space. These methods have good
accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering
Structure), etc.
Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based
on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two
categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies), etc.
Partitioning Methods: These methods partition the objects into k clusters, and each partition forms
one cluster. An objective criterion (a similarity function, with distance typically the major
parameter) is optimized. Examples: K-means, CLARANS (Clustering Large Applications based upon
Randomized Search), etc.
Grid-based Methods: In this method, the data space is divided into a finite number of cells that
form a grid-like structure. All the clustering operations done on these grids are fast and
independent of the number of data objects. Examples: STING (Statistical Information Grid),
WaveCluster, CLIQUE (CLustering In QUEst), etc.
Clustering Algorithms :
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves the
clustering problem. The K-means algorithm partitions n observations into k clusters, where each
observation belongs to the cluster with the nearest mean serving as a prototype of the cluster, as in
the sketch below.
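A minimal PySpark sketch of K-means (the toy data and k = 2 are illustrative assumptions, not taken
from the notes):

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

# Partition the points into k = 2 clusters around the nearest mean.
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())
model.transform(df).show()   # adds a 'prediction' column with the cluster id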
Applications of Clustering in different fields
Marketing: It can be used to characterize and discover customer segments for marketing purposes.
Biology: It can be used for classification among different species of plants and animals.
Libraries: It is used for clustering different books on the basis of topics and information.
Insurance: It is used to understand customers and their policies and to identify fraud.
City Planning: It is used to group houses and to study their values based on their geographical
locations and other factors.
Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous
zones.
Dimensionality Reduction:
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
A dataset contains a huge number of input features in various cases, which makes the predictive
modeling task more complicated. Because it is very difficult to visualize or make predictions for a
training dataset with a high number of features, dimensionality reduction techniques are required
in such cases.
Dimensionality reduction technique can be defined as, "It is a way of
converting the higher dimensions dataset into lesser dimensions dataset ensuring
that it provides similar information." These techniques are widely used in machine
learning for obtaining a better fit predictive model while solving the classification and
regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Benefits of applying Dimensionality Reduction
o By reducing the dimensions of the features, the space required to store the dataset also
gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of the filter method are listed below, followed by a short example:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
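For example, Spark ML ships a Chi-Square based filter; the sketch below follows the standard
ChiSqSelector usage pattern, and numTopFeatures=1 is an illustrative choice:

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)], ["id", "features", "clicked"])

# Keep the single feature with the strongest chi-square association with the label.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
selector.fit(df).transform(df).show()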
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model
for its evaluation. In this method, some features are fed to the ML model and its performance is
evaluated. The performance decides whether to add or remove those features so as to increase the
accuracy of the model. This method is more accurate than the filtering method but more complex
to work with. Some common techniques of wrapper methods are listed below, followed by a short
example:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
o Ridge Regression, etc.
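As a sketch of the wrapper idea (forward selection driven by a model's performance), scikit-learn's
SequentialFeatureSelector can be used; the estimator, dataset, and number of features to keep are
illustrative assumptions (scikit-learn 0.24 or newer is assumed):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time, keeping the subset that helps the model most.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())        # boolean mask of the selected features
X_selected = sfs.transform(X)   # reduced feature matrix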
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions
into space with fewer dimensions. This approach is useful when we want to keep the
whole information but use fewer resources while processing the information.
PCA works by considering the variance of each attribute, because an attribute with high variance
indicates a good split between the classes, and hence it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation
system, optimizing the power allocation in various communication channels.
Backward Feature Elimination
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o The performance of the model is checked.
o Now we remove one feature at a time and train the model on n-1 features, n times, and compute
the performance of the model.
o We check for the variable that has made the smallest (or no) change in the performance of the
model and then drop that variable or feature; after that, we are left with n-1 features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable
error rate, we can define the optimal number of features required for the machine learning
algorithm.
Forward Feature Selection
o We start with a single feature only, and progressively add one feature at a time.
o Here we train the model on each feature separately.
o The feature with the best performance is selected.
o The process is repeated until we get a significant increase in the performance of the model.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine
learning. This algorithm contains an in-built feature importance package, so we do not
need to program it separately. In this technique, we need to generate a large set of trees
against the target variable, and with the help of usage statistics of each attribute, we
need to find the subset of features.
The random forest algorithm takes only numerical variables, so we need to convert any categorical
input data into numeric data using one-hot encoding.
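A minimal scikit-learn sketch of this idea (the dataset and the number of trees are assumptions for
illustration; with categorical inputs the one-hot encoding mentioned above would be applied first):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # already numeric features

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by the forest's built-in importance scores and keep the top five.
top = np.argsort(rf.feature_importances_)[::-1][:5]
print(top, rf.feature_importances_[top])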
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its
correlation with other variables; variables within a group can have a high correlation between
themselves, but they have a low correlation with variables of other groups.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of
ANN (artificial neural network) whose main aim is to copy its inputs to its outputs. The input is
compressed into a latent-space representation, and the output is produced from this
representation. It has mainly two parts (a minimal sketch follows the list below):
o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
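A minimal sketch of the encoder/decoder idea using PyTorch (PyTorch is not used elsewhere in these
notes; the layer sizes and the random stand-in batch are assumptions for illustration):

import torch
import torch.nn as nn

n_inputs, n_latent = 20, 3   # hypothetical input width and latent-space size

encoder = nn.Sequential(nn.Linear(n_inputs, 8), nn.ReLU(), nn.Linear(8, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 8), nn.ReLU(), nn.Linear(8, n_inputs))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x = torch.randn(64, n_inputs)            # stand-in batch of input data

for _ in range(200):                     # train the network to copy its input to its output
    opt.zero_grad()
    loss = loss_fn(model(x), x)          # reconstruction error
    loss.backward()
    opt.step()

z = encoder(x)                           # compressed latent-space representation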
TF-IDF
t : a term
d : a document
D : the corpus
|D| : the number of documents in the corpus
TF(t, d) : Term Frequency, the number of times that term t appears in document d
DF(t, D) : Document Frequency, the number of documents that contain term t
IDF(t, D) : Inverse Document Frequency, a numerical measure of how much information a term
provides, computed as IDF(t, D) = log[ (|D| + 1) / (DF(t, D) + 1) ]
TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
sentenceData = spark.createDataFrame([
(0, "Python python Spark Spark"),
(1, "Python SQL")],
["document", "sentence"])
sentenceData.show(truncate=False)
+--------+-------------------------+
|document|sentence |
+--------+-------------------------+
|0 |Python python Spark Spark|
|1 |Python SQL |
+--------+-------------------------+
The sentences are then tokenized, the tokens are hashed into term-frequency vectors with
HashingTF, and the counts are rescaled with IDF to produce the TF-IDF features, as sketched below.
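A minimal sketch of that pipeline, assuming the sentenceData DataFrame created above
(numFeatures is kept small purely for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])
model = pipeline.fit(sentenceData)
model.transform(sentenceData).select("document", "features").show(truncate=False)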
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to
vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used
as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model
produces sparse representations for the documents over the vocabulary, which can then be passed
to other algorithms like LDA.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer
sentenceData = spark.createDataFrame([
    (0, "Python python Spark Spark"),
    (1, "Python SQL")],
    ["document", "sentence"])
# Tokenize the sentences and count tokens with CountVectorizer.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
vectorizer = CountVectorizer(inputCol="words", outputCol="rawFeatures")
pipeline = Pipeline(stages=[tokenizer, vectorizer])
model = pipeline.fit(sentenceData)
import numpy as np
total_counts = model.transform(sentenceData)\
    .select('rawFeatures').rdd\
    .map(lambda row: row['rawFeatures'].toArray())\
    .reduce(lambda x,y: [x[i]+y[i] for i in range(len(y))])
vocabList = model.stages[1].vocabulary
d = {'vocabList':vocabList,'counts':total_counts}
spark.createDataFrame(np.array(list(d.values())).T.tolist(), list(d.keys())).show()
counts = model.transform(sentenceData).select('rawFeatures').collect()
counts
def termsIdx2Term(vocabulary):
def termsIdx2Term(termIndices):
return [vocabulary[int(index)] for index in termIndices]
return udf(termsIdx2Term, ArrayType(StringType()))
vectorizerModel = model.stages[1]
vocabList = vectorizerModel.vocabulary
vocabList
['python', 'spark', 'sql']
rawFeatures = model.transform(sentenceData).select('rawFeatures')
rawFeatures.show()
+-------------------+
| rawFeatures|
+-------------------+
|(3,[0,1],[2.0,2.0])|
|(3,[0,2],[1.0,1.0])|
+-------------------+
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType, DoubleType, IntegerType
# UDFs to pull the indices and (dense) values out of the sparse rawFeatures vectors.
indices_udf = udf(lambda v: v.indices.tolist(), ArrayType(IntegerType()))
values_udf = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
rawFeatures.withColumn('indices', indices_udf(F.col('rawFeatures')))\
    .withColumn('values', values_udf(F.col('rawFeatures')))\
    .withColumn("Terms", termsIdx2Term(vocabList)("indices")).show()
+-------------------+-------+---------------+---------------+
| rawFeatures|indices| values| Terms|
+-------------------+-------+---------------+---------------+
|(3,[0,1],[2.0,2.0])| [0, 1]|[2.0, 2.0, 0.0]|[python, spark]|
|(3,[0,2],[1.0,1.0])| [0, 2]|[1.0, 0.0, 1.0]| [python, sql]|
+-------------------+-------+---------------+---------------+
HashingTF
HashingTF is a Transformer which takes sets of terms and
converts those sets into fixed-length feature vectors. In text processing, a “set
of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw
feature is mapped into an index (term) by applying a hash function. The hash
function used here is MurmurHash 3. Then term frequencies are calculated
based on the mapped indices. This approach avoids the need to compute a
global term-to-index map, which can be expensive for a large corpus, but it
suffers from potential hash collisions, where different raw features may
become the same term after hashing.
sentenceData = spark.createDataFrame([
    (0, "Python python Spark Spark"),
    (1, "Python SQL")],
    ["document", "sentence"])
# Tokenize and hash the tokens into a fixed-length term-frequency vector
# (numFeatures chosen small here purely for illustration).
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
hashingTF.transform(tokenizer.transform(sentenceData)).show(truncate=False)
Word Embeddings
The context window is defined by a string of words before and after a focal or “center”
word that will be used to train a word embedding model. Each center word and context words
can be represented as a vector of numbers that describe the presence or absence of unique words
within a dataset, which is perhaps why word embedding models are often described as “word
vector” models, or “word2vec” models.
Word Embedding Models in PySpark
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
# sentenceData is assumed to hold (label, sentence) rows such as (0.0, "I love Spark").
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="feature")
pipeline = Pipeline(stages=[tokenizer, word2Vec])
model = pipeline.fit(sentenceData)
result = model.transform(sentenceData)
result.show()
+-----+--------------------+--------------------+--------------------+
|label| sentence| words| feature|
+-----+--------------------+--------------------+--------------------+
| 0.0| I love Spark| [i, love, spark]|[0.05594437588782...|
| 0.0| I love python| [i, love, python]|[-0.0350368790871...|
| 1.0|I think ML is awe...|[i, think, ml, is...|[0.01242086507845...|
+-----+--------------------+--------------------+--------------------+
w2v = model.stages[1]
w2v.getVectors().show()
+-------+-----------------------------------------------------------------+
|word |vector |
+-------+-----------------------------------------------------------------+
|is |[0.13657838106155396,0.060924094170331955,-0.03379475697875023] |
|awesome|[0.037024181336164474,-0.023855900391936302,0.0760037824511528] |
|i |[-0.0014482572441920638,0.049365971237421036,0.12016955763101578]|
|ml |[-0.14006119966506958,0.01626444421708584,0.042281970381736755] |
|spark |[0.1589149385690689,-0.10970081388950348,-0.10547549277544022] |
|think |[0.030011219903826714,-0.08994936943054199,0.16471518576145172] |
|love |[0.01036644633859396,-0.017782460898160934,0.08870164304971695] |
|python |[-0.11402882635593414,0.045119188725948334,-0.029877422377467155]|
+-------+-----------------------------------------------------------------+
from pyspark.sql.functions import format_number as fmt
w2v.findSynonyms("could", 2).select("word", fmt("similarity",
5).alias("similarity")).show()
+-------+----------+
| word|similarity|
+-------+----------+
|classes| 0.90232|
| i| 0.75424|
+-------+----------+
FeatureHasher
from pyspark.ml.feature import FeatureHasher
dataset = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])
# Hash all four input columns into a single sparse feature vector.
hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],
                       outputCol="features")
featurized = hasher.transform(dataset)
featurized.show(truncate=False)
+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features                                                |
+----+-----+---------+------+--------------------------------------------------------+
|2.2 |true |1        |foo   |(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2        |bar   |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])  |
|4.4 |false|3        |baz   |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])  |
|5.5 |false|4        |foo   |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0]) |
+----+-----+---------+------+--------------------------------------------------------+
RFormula
from pyspark.ml.feature import RFormula
dataset = spark.createDataFrame(
[(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "CA", 15, 0.0)],
["id", "country", "hour", "clicked"])
formula = RFormula(
formula="clicked ~ country + hour",
featuresCol="features",
labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
+----------+-----+
| features|label|
+----------+-----+
|[0.0,18.0]| 1.0|
|[1.0,12.0]| 0.0|
|[1.0,15.0]| 0.0|
+----------+-----+
Feature Transform
Tokenizer
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
# Plain whitespace tokenizer and a regex tokenizer that splits on non-word characters.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+
StopWordsRemover
from pyspark.ml.feature import StopWordsRemover
sentenceData = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])
# Drop common English stop words from the token lists.
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
NGram
from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, Tokenizer
sentenceData = spark.createDataFrame([
    (0.0, "I love Spark"),
    (0.0, "I love python"),
    (1.0, "I think ML is awesome")],
    ["label", "sentence"])
# Tokenize the sentences and build 2-grams from the tokens.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
pipeline = Pipeline(stages=[tokenizer, ngram])
model = pipeline.fit(sentenceData)
model.transform(sentenceData).show(truncate=False)
+-----+---------------------+---------------------------+--------------------------------------+
|label|sentence             |words                      |ngrams                                |
+-----+---------------------+---------------------------+--------------------------------------+
|0.0  |I love Spark         |[i, love, spark]           |[i love, love spark]                  |
|0.0  |I love python        |[i, love, python]          |[i love, love python]                 |
|1.0  |I think ML is awesome|[i, think, ml, is, awesome]|[i think, think ml, ml is, is awesome]|
+-----+---------------------+---------------------------+--------------------------------------+
Binarizer
from pyspark.ml.feature import Binarizer
continuousDataFrame = spark.createDataFrame([
    (0, 0.1),
    (1, 0.8),
    (2, 0.2),
    (3, 0.5)
], ["id", "feature"])
# Threshold of 0.5 chosen for illustration: values above it become 1.0, the rest 0.0.
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(continuousDataFrame)
binarizedDataFrame.show()
Bucketizer
[Bucketizer](https://spark.apache.org/docs/latest/ml-features.html#bucketizer) transforms a
column of continuous features to a column of feature buckets, where the buckets are specified by
users.
from pyspark.ml.feature import Bucketizer
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.0)]
df = spark.createDataFrame(data, ["id", "age"])
# Splits chosen here so that they reproduce the bucket ids shown below.
bucketizer = Bucketizer(splits=[-float("inf"), 5.0, 10.0, float("inf")],
                        inputCol="age", outputCol="result")
bucketizer.transform(df).show()
+---+----+------+
| id| age|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.0| 0.0|
+---+----+------+
QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a
column with binned categorical features. The number of bins is set by the
numBuckets parameter. It is possible that the number of buckets used will be
smaller than this value, for example, if there are too few distinct values of the
input to create enough distinct quantiles.
from pyspark.ml.feature import QuantileDiscretizer
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.0)]
df = spark.createDataFrame(data, ["id", "age"])
# numBuckets and relativeError are assumptions; the bucket ids below are from the original notes.
qds = QuantileDiscretizer(numBuckets=5, inputCol="age", outputCol="buckets",
                          relativeError=0.01, handleInvalid="error")
qds.fit(df).transform(df).show()
+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 3.0|
| 2| 8.0| 2.0|
| 3| 5.0| 2.0|
| 4| 2.0| 1.0|
+---+----+-------+
If the data has NULL values, then you will get the following results:
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, None)]
df = spark.createDataFrame(data, ["id", "age"])
df.show()
+---+----+------+
| id| age|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4|null| null|
+---+----+------+
+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 4.0|
| 2| 8.0| 2.0|
| 3| 5.0| 1.0|
| 4|null| null|
+---+----+-------+
+---+----+-------+
| id| age|buckets|
+---+----+-------+
| 0|18.0| 3.0|
| 1|19.0| 4.0|
| 2| 8.0| 2.0|
| 3| 5.0| 1.0|
+---+----+-------+
StringIndexer
from pyspark.ml.feature import StringIndexer, IndexToString
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# The second createDataFrame overwrites the first for this example.
df = spark.createDataFrame(
    [(0, "Yes"), (1, "Yes"), (2, "Yes"), (3, "No"), (4, "No"), (5, "No")],
    ["id", "label"])
# Index the string label column and map the indices back to the original strings.
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
indexed = indexer.fit(df).transform(df)
converter = IndexToString(inputCol="labelIndex", outputCol="originalLabel")
converted = converter.transform(indexed)
print("Transformed string column '%s' to indexed column '%s'"
      % (indexer.getInputCol(), indexer.getOutputCol()))
indexed.show()
print("Transformed indexed column '%s' back to original string column '%s' using "
      "labels in metadata" % (converter.getInputCol(), converter.getOutputCol()))
converted.select("id", "labelIndex", "originalLabel").show()
Transformed string column 'label' to indexed column 'labelIndex'
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
| 0| Yes| 1.0|
| 1| Yes| 1.0|
| 2| Yes| 1.0|
| 3| No| 0.0|
| 4| No| 0.0|
| 5| No| 0.0|
+---+-----+----------+
df = spark.createDataFrame(
    [(0, "Yes"), (1, "Yes"), (2, "Yes"), (3, "No"), (4, "No"), (5, "No")],
    ["id", "label"])
from pyspark.ml import Pipeline
# Chain the indexer and the converter so both columns appear in one output.
pipeline = Pipeline(stages=[indexer, converter])
model = pipeline.fit(df)
result = model.transform(df)
result.show()
+---+-----+----------+-------------+
| id|label|labelIndex|originalLabel|
+---+-----+----------+-------------+
| 0| Yes| 1.0| Yes|
| 1| Yes| 1.0| Yes|
| 2| Yes| 1.0| Yes|
| 3| No| 0.0| No|
| 4| No| 0.0| No|
| 5| No| 0.0| No|
+---+-----+----------+-------------+
VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import RFormula, VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
df = spark.createDataFrame([
    (0, 2.2, True, "1", "foo", 'CA'),
    (1, 3.3, False, "2", "bar", 'US'),
    (0, 4.4, False, "3", "baz", 'CHN'),
    (1, 5.5, False, "4", "foo", 'AUS')
], ['label', "real", "bool", "stringNum", "string", "country"])
formula = RFormula(
    formula="label ~ real + bool + stringNum + string + country",
    featuresCol="features",
    labelCol="label")
# Index categorical features inside the assembled vector (maxCategories chosen for illustration).
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4)
pipeline = Pipeline(stages=[formula, featureIndexer])
model = pipeline.fit(df)
result = model.transform(df)
result.show()
+-----+----+-----+---------+------+-------+--------------------+--------------------+
|label|real| bool|stringNum|string|country| features| indexedFeatures|
+-----+----+-----+---------+------+-------+--------------------+--------------------+
| 0| 2.2| true| 1| foo| CA|(10,[0,1,5,7],[2....|(10,[0,1,5,7],[2....|
| 1| 3.3|false| 2| bar| US|(10,[0,3,8],[3.3,...|(10,[0,3,8],[3.3,...|
| 0| 4.4|false| 3| baz| CHN|(10,[0,4,6,9],[4....|(10,[0,4,6,9],[4....|
| 1| 5.5|false| 4| foo| AUS|(10,[0,2,5],[5.5,...|(10,[0,2,5],[5.5,...|
+-----+----+-----+---------+------+-------+--------------------+--------------------+
VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
[(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(
inputCols=["hour", "mobile", "userFeatures"],
outputCol="features")
output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0 |
+-----------------------+-------+
OneHotEncoder
This is the note I wrote for one of my readers for explaining the
OneHotEncoder. I would like to share it at here:
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| a|
| 4| a|
| 5| c|
+---+--------+
Encoder
from pyspark.ml.feature import OneHotEncoder, StringIndexer
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
# dropLast=False keeps all three category levels, matching the output below (Spark 3.x API).
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec", dropLast=False)
encoder.fit(indexed).transform(indexed).show()
+---+--------+-------------+-------------+
| id|category|categoryIndex| categoryVec|
+---+--------+-------------+-------------+
| 0| a| 0.0|(3,[0],[1.0])|
| 1| b| 2.0|(3,[2],[1.0])|
| 2| c| 1.0|(3,[1],[1.0])|
| 3| a| 0.0|(3,[0],[1.0])|
| 4| a| 0.0|(3,[0],[1.0])|
| 5| c| 1.0|(3,[1],[1.0])|
+---+--------+-------------+-------------+
Note
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
# Build one indexer/encoder pair per categorical column and assemble the results.
categoricalCols = ['category']
indexers = [ StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
             for c in categoricalCols ]
encoders = [ OneHotEncoder(inputCol=indexer.getOutputCol(),
             outputCol="{0}_encoded".format(indexer.getOutputCol()), dropLast=False)
             for indexer in indexers ]
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders],
                            outputCol="features")
pipeline = Pipeline(stages=indexers + encoders + [assembler])
model = pipeline.fit(df)
data = model.transform(df)
data.show()
+---+--------+----------------+------------------------+-------------+
| id|category|category_indexed|category_indexed_encoded| features|
+---+--------+----------------+------------------------+-------------+
| 0| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 1| b| 2.0| (3,[2],[1.0])|[0.0,0.0,1.0]|
| 2| c| 1.0| (3,[1],[1.0])|[0.0,1.0,0.0]|
| 3| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 4| a| 0.0| (3,[0],[1.0])|[1.0,0.0,0.0]|
| 5| c| 1.0| (3,[1],[1.0])|[0.0,1.0,0.0]|
+---+--------+----------------+------------------------+-------------+
Application: Get Dummy Variable
def get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol,dropLast=False):
'''
Get dummy variables and concat with continuous variables for ml modeling.
:param df: the dataframe
:param categoricalCols: the name list of the categorical data
:param continuousCols: the name list of the numerical data
:param labelCol: the name of label column
:param dropLast: the flag of drop last column
:return: feature matrix
>>> df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
>>>
+---+-------------+
| id| features|
+---+-------------+
| 0|[1.0,0.0,0.0]|
| 1|[0.0,0.0,1.0]|
| 2|[0.0,1.0,0.0]|
| 3|[1.0,0.0,0.0]|
| 4|[1.0,0.0,0.0]|
| 5|[0.0,1.0,0.0]|
+---+-------------+
'''
outputCol="{0}_encoded".format(indexer.getOutputCol()),dropLast=dropLast)
for indexer in indexers ]
KGRL MCA BIGDATA ANALYTICS Lecture Notes K.Issack Babu Page 174
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder
in encoders]
+ continuousCols, outputCol="features")
model=pipeline.fit(df)
data = model.transform(df)
indexCol = 'id'
categoricalCols = ['category']
continuousCols = []
labelCol = []
mat = get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol)
mat.show()
+---+-------------+
| id| features|
+---+-------------+
| 0|[1.0,0.0,0.0]|
| 1|[0.0,0.0,1.0]|
| 2|[0.0,1.0,0.0]|
| 3|[1.0,0.0,0.0]|
| 4|[1.0,0.0,0.0]|
| 5|[0.0,1.0,0.0]|
+---+-------------+
Supervised scenario
df = spark.read.csv(path='bank.csv',
sep=',',encoding='UTF-8',comment=None,
header=True,inferSchema=True)
indexCol = []
catCols = ['job','marital','education','default',
           'housing','loan','contact','poutcome']
# The remaining numeric columns of bank.csv and the outcome column are assumed to be:
contCols = ['balance','duration','campaign','pdays','previous']
labelCol = 'y'
data = get_dummy(df,indexCol,catCols,contCols,labelCol,dropLast=False)
data.show(5)
+--------------------+-----+
| features|label|
+--------------------+-----+
|(37,[8,12,17,19,2...| no|
|(37,[4,12,15,19,2...| no|
|(37,[0,13,16,19,2...| no|
|(37,[0,12,16,19,2...| no|
|(37,[1,12,15,19,2...| no|
+--------------------+-----+
only showing top 5 rows
The Jupyter Notebook can be found on Colab: OneHotEncoder .
Scaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler
scaler_type = 'Normal'
if scaler_type=='Normal':
scaler = Normalizer(inputCol="features", outputCol="scaledFeatures", p=1.0)
elif scaler_type=='Standard':
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
withStd=True, withMean=False)
elif scaler_type=='MinMaxScaler':
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
elif scaler_type=='MaxAbsScaler':
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])
df.show()
pipeline = Pipeline(stages=[scaler])
model =pipeline.fit(df)
data = model.transform(df)
data.show()
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.5,-1.0]|
| 1| [2.0,1.0,1.0]|
| 2|[4.0,10.0,2.0]|
+---+--------------+
+---+--------------+------------------+
| id| features| scaledFeatures|
+---+--------------+------------------+
| 0|[1.0,0.5,-1.0]| [0.4,0.2,-0.4]|
| 1| [2.0,1.0,1.0]| [0.5,0.25,0.25]|
| 2|[4.0,10.0,2.0]|[0.25,0.625,0.125]|
+---+--------------+------------------+
Normalizer
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])
# The scaledFeatures shown below correspond to StandardScaler(withStd=True, withMean=False),
# i.e. each column divided by its standard deviation.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scaler.fit(dataFrame).transform(dataFrame).show(truncate=False)
+---+--------------+------------------------------------------------------------+
|id |features      |scaledFeatures                                              |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.5,-1.0]|[0.6546536707079772,0.09352195295828244,-0.6546536707079771]|
|1 |[2.0,1.0,1.0] |[1.3093073414159544,0.1870439059165649,0.6546536707079771] |
|2 |[4.0,10.0,2.0]|[2.618614682831909,1.870439059165649,1.3093073414159542] |
+---+--------------+------------------------------------------------------------+
MinMaxScaler
from pyspark.ml.feature import Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler
dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.5, -1.0]),),
(1, Vectors.dense([2.0, 1.0, 1.0]),),
(2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])
# Rescale each feature (column) to the [0, 1] range.
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
scaler.fit(dataFrame).transform(dataFrame).show(truncate=False)
PCA
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
# Input assumed from the standard Spark PCA example; it reproduces the output shown below.
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
+-----------------------------------------------------------+
|pcaFeatures |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
DCT
from pyspark.ml.feature import DCT
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
    (Vectors.dense([0.0, 1.0, -2.0, 3.0]),),
    (Vectors.dense([-1.0, 2.0, 4.0, -7.0]),),
    (Vectors.dense([14.0, -2.0, -5.0, 1.0]),)], ["features"])
# Apply the (forward) Discrete Cosine Transform to each feature vector.
dct = DCT(inverse=False, inputCol="features", outputCol="featuresDCT")
dctDf = dct.transform(df)
dctDf.select("featuresDCT").show(truncate=False)
+----------------------------------------------------------------+
|featuresDCT |
+----------------------------------------------------------------+
|[1.0,-1.1480502970952693,2.0000000000000004,-2.7716385975338604]|
|[-1.0,3.378492794482933,-7.000000000000001,2.9301512653149677] |
|[4.0,9.304453421915744,11.000000000000002,1.5579302036357163] |
+----------------------------------------------------------------+
Feature Selection
LASSO
LASSO performs variable selection and handles correlated variables differently from Ridge: the
Ridge method shrinks the coefficients of correlated variables, while the LASSO method picks one
variable and discards the others. The elastic net penalty is a mixture of these two; if variables are
correlated in groups, the elastic net tends to select or drop the groups together. If α is close to 1,
the elastic net performs much like the LASSO method and removes any degeneracies and wild
behavior caused by extreme correlations.
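In Spark ML, this mixture is exposed through the elasticNetParam of LinearRegression; the sketch
below is illustrative (the column names and regParam value are assumptions):

from pyspark.ml.regression import LinearRegression

# elasticNetParam = 1.0 gives the LASSO (pure L1) penalty, 0.0 gives Ridge (pure L2),
# and values in between give the elastic net mixture described above.
lasso = LinearRegression(featuresCol="features", labelCol="label",
                         regParam=0.1, elasticNetParam=1.0, maxIter=50)
lasso_model = lasso.fit(trainingData)    # trainingData assumed to have features/label columns
print(lasso_model.coefficients)          # coefficients driven to zero are the dropped variables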
RandomForest
AutoFeatures library based on RandomForest is coming soon………….
df = spark.createDataFrame([
(0, "Yes"),
(1, "Yes"),
(2, "Yes"),
(3, "Yes"),
(4, "No"),
(5, "No")
], ["id", "label"])
df.show()
+---+-----+
| id|label|
+---+-----+
| 0| Yes|
| 1| Yes|
| 2| Yes|
| 3| Yes|
| 4| No|
| 5| No|
+---+-----+
Calculate undersampling Ratio
import math
def round_up(n, decimals=0):
multiplier = 10 ** decimals
return math.ceil(n * multiplier) / multiplier
Undersampling
# Split the toy data by class and undersample the majority ('Yes') class.
label_Y = df.filter(df.label == 'Yes')
label_N = df.filter(df.label == 'No')
sampleRatio = round_up(label_N.count() / label_Y.count(), 2)
label_Y_sample = label_Y.sample(False, sampleRatio)
# union minority set and the under-sampling majority set
data = label_N.unionAll(label_Y_sample)
data.show()
+---+-----+
| id|label|
+---+-----+
| 4| No|
| 5| No|
| 1| Yes|
| 2| Yes|
+---+-----+
Recalibrating Probability
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class
distributions. However, it is well known that undersampling one class modifies the priors of the
training set and consequently biases the posterior probabilities of a classifier (see "Calibrating
Probability with Undersampling for Unbalanced Classification").
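One common correction from the cited paper is sketched below; beta is the rate at which the
majority class was kept during undersampling, and the helper function itself is an illustrative
addition, not code from the notes:

def recalibrate(p_s, beta):
    # p_s: posterior probability predicted by a classifier trained on the undersampled data
    # beta: fraction of majority-class examples kept by the undersampling step
    # Returns the bias-corrected probability for the original (unbalanced) class priors.
    return beta * p_s / (beta * p_s - p_s + 1)

print(recalibrate(0.8, 0.5))   # e.g. a score of 0.8 after 50% undersampling is pulled downwards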
Clustering:
Overview
Learn about Clustering , one of the most popular unsupervised classification techniques
Dividing the data into clusters can be on the basis of centroids, distributions, densities, etc
Get to know K means and hierarchical clustering and the difference between the two
Introduction
Have you come across a situation when a Chief Marketing Officer of a company tells you –
“Help me understand our customers better so that we can market our products to them in a better
manner!”
I did, and the analyst in me was completely clueless about what to do! I was used to getting
specific problems, where there is an outcome to be predicted for a given set of conditions. But I
had no clue what to do in this case. If the person had asked me to calculate Life Time Value (LTV)
or the propensity for cross-sell, I wouldn't have blinked. But this question looked very broad to me.
This is usually the first reaction when you come across an unsupervised learning problem for the
first time! You are not looking for specific insights into a phenomenon; what you are looking for
are structures within the data, without them being tied down to a specific outcome.
The method of identifying similar groups of data in a dataset is called clustering. It is one of the
most popular techniques in data science. Entities in each group are comparatively more similar to
entities of that group than to those of the other groups. In this article, I will be taking you through
the types of clustering, different clustering algorithms and a comparison between two of the most
popular clustering methods.
Table of Contents
1. Overview
2. Types of Clustering
3. Types of Clustering Algorithms
4. K Means Clustering
5. Hierarchical Clustering
6. Difference between K Means and Hierarchical clustering
7. Applications of Clustering
8. Improving Supervised Learning algorithms with clustering
1. Overview
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to other data points in the same group than to those
in other groups. In simple words, the aim is to segregate groups with similar traits and assign them
into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to
look at the details of each customer and devise a unique business strategy for each one of them?
Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on
their purchasing habits and use a separate strategy for customers in each of these 10 groups.
2. Types of Clustering
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or
not. For example, in the above example each customer is put into one group out of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a
probability or likelihood of that data point being in those clusters is assigned. For example, in the
above scenario each customer is assigned a probability of being in any of the 10 clusters of the
retail store.
3. Types of Clustering Algorithms
Since the task of clustering is subjective, the means that can be used for achieving this goal are
plenty. Every methodology follows a different set of rules for defining the 'similarity' among data
points. In fact, there are more than 100 clustering algorithms known, but only a few of the models
are commonly used; these are described below.
Connectivity models: As the name suggests, these models are based on the notion that the data
points closer in data space exhibit more similarity to each other than the data points lying farther
away. These models can follow two approaches. In the first approach, they start with classifying
all data points into separate clusters & then aggregating them as the distance decreases. In the
second approach, all data points are classified as a single cluster and then partitioned as the
distance increases. Also, the choice of distance function is subjective. These models are very
easy to interpret but lack scalability for handling big datasets. Examples of these models are the
hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is
derived by the closeness of a data point to the centroid of the clusters. K-Means clustering
algorithm is a popular algorithm that falls into this category. In these models, the number of
clusters required at the end has to be specified beforehand, which makes it important to have prior
knowledge of the dataset. These models run iteratively to find the local optima.
Distribution models: These clustering models are based on the notion of how probable is it that
all data points in the cluster belong to the same distribution (For example: Normal, Gaussian).
These models often suffer from overfitting. A popular example of these models is Expectation-
maximization algorithm which uses multivariate normal distributions.
Density Models: These models search the data space for regions of varying density of data points.
They isolate the different density regions and assign the data points within these regions to the
same cluster. Popular examples of density models are DBSCAN and OPTICS.
Now I will be taking you through two of the most popular clustering algorithms in detail: K Means
clustering and Hierarchical clustering.
4. K Means Clustering
K means is an iterative clustering algorithm that aims to find a local optimum in each iteration.
The algorithm proceeds as follows:
1. Specify the desired number of clusters K.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids.
4. Re-assign each point to the closest cluster centroid: note that only the data point at the bottom
is assigned to the red cluster, even though it's closer to the centroid of the grey cluster; thus, we
assign that data point to the grey cluster.
5. Re-compute the cluster centroids.
6. Repeat steps 4 and 5 until no improvements are possible: similarly, we repeat the 4th and 5th
steps until convergence. When there is no further switching of data points between two clusters for
two successive repeats, the algorithm terminates, unless another stopping criterion is explicitly
specified.
A short example of the K Means algorithm using the scikit-learn library is shown below.
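This is a small illustrative sketch; the synthetic blobs, k = 3, and the other parameters are
assumptions chosen for the example:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Three well-separated synthetic clusters of points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids
print(km.labels_[:10])       # cluster id assigned to the first ten points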
5. Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
This algorithm starts with all the data points assigned to a cluster of their own. Then the two
nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there
is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be
interpreted as follows:
At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest
clusters are then merged till we have just one cluster at the top. The height in the dendrogram at
which two clusters are merged represents the distance between two clusters in the data space.
The number of clusters that best depicts the different groups can be chosen by observing the
dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal
line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of the number of clusters will be 4, as the red horizontal line
in the dendrogram covers the maximum vertical distance without intersecting a cluster.
Two important things that you should know about hierarchical clustering are:
This algorithm has been implemented above using bottom up approach. It is also possible to
follow top-down approach starting with all data points assigned in the same cluster and
recursively performing splits till each data point is assigned a separate cluster.
The decision of merging two clusters is taken on the basis of the closeness of these clusters. There
are multiple metrics for deciding the closeness of two clusters (a small SciPy sketch follows this
list):
o Euclidean distance: ||a−b||₂ = √(Σᵢ (aᵢ−bᵢ)²)
o Squared Euclidean distance: ||a−b||₂² = Σᵢ (aᵢ−bᵢ)²
o Manhattan distance: ||a−b||₁ = Σᵢ |aᵢ−bᵢ|
o Maximum distance: ||a−b||∞ = maxᵢ |aᵢ−bᵢ|
o Mahalanobis distance: √((a−b)ᵀ S⁻¹ (a−b)), where S is the covariance matrix
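A small SciPy sketch of agglomerative clustering and dendrogram cutting (the 25 random points echo
the example above; the linkage method is an illustrative choice):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(25, 2)                          # 25 data points, as in the dendrogram example
Z = linkage(X, method="ward")                      # bottom-up merges of the two closest clusters
labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into 4 clusters
# dendrogram(Z) draws the merge tree when a plotting backend (e.g. matplotlib) is available
print(labels)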
6. Difference between K Means and Hierarchical clustering
Hierarchical clustering can't handle big data well but K Means clustering can. This is because the
time complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic,
i.e. O(n²).
In K Means clustering, since we start with a random choice of clusters, the results produced by
running the algorithm multiple times might differ, while results are reproducible in hierarchical
clustering.
K Means is found to work well when the shape of the clusters is hyper spherical (like circle in
2D, sphere in 3D).
K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your
data into. But, you can stop at whatever number of clusters you find appropriate in hierarchical
clustering by interpreting the dendrogram
7. Applications of Clustering
Clustering has a large number of applications spread across various domains. Some of the most
popular applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
8. Improving Supervised Learning algorithms with clustering
Clustering is an unsupervised machine learning approach, but can it be used to improve the
accuracy of supervised machine learning algorithms as well, by clustering the data points into
similar groups and using these cluster labels as independent variables in the supervised machine
learning algorithm?
Let's check out the impact of clustering on the accuracy of our model for a classification problem,
using 3000 observations with 100 predictors of stock data to predict whether the stock will go up or
down, using R. This dataset contains 100 independent variables from X1 to X100 representing the
profile of a stock, and one outcome variable Y with two levels: -1 for a fall and 1 for a rise in the
stock price.
Let’s first try applying randomforest without clustering.
#loading required libraries
library('randomForest')
library('Metrics')
#set random seed
set.seed(101)
#loading dataset
data<-read.csv("train.csv",stringsAsFactors= T)
data$Y<-as.factor(data$Y)
#dividing the dataset into train and test (assumed 2000/1000 split; the original page is truncated)
train<-data[1:2000,]
test<-data[2001:3000,]
#applying randomForest
model_rf<-randomForest(Y~.,data=train)
preds<-predict(object=model_rf,test[,-101])
table(preds)
## preds
## -1 1
## 453 547
#checking accuracy
auc(preds,test$Y)
## [1] 0.4522703
So, the accuracy we get is 0.45. Now let's create five clusters based on the values of the
independent variables using K Means clustering and reuse the cluster labels as an additional
predictor.
Whoo! In the above example, even though the final accuracy is still poor, clustering has given our
model a noticeable boost over the accuracy obtained without it.
This shows that clustering can indeed be helpful for supervised machine learning tasks.
Dimensionality Reduction
Dimensionality reduction, or variable reduction techniques, simply refers to the process of
reducing the number or dimensions of features in a dataset. It is commonly used during the
analysis of high-dimensional data (e.g., multipixel images of a face or texts from an article,
astronomical catalogues, etc.). Many statistical and ML methods have been applied to high-
dimensional data, such as vector quantization and mixture models, generative topographic
mapping (Bishop et al., 1998), and principal component analysis (PCA), to list just a few. PCA is
one of the most popular algorithms used for dimensionality reduction (Pearson, 1901; Wold et
al., 1987; Dunteman, 1989; Jolliffe and Cadima, 2016).
PCA is an unsupervised dimensionality reduction technique, also known as the Karhunen–Loève
transform, generally applied for data compression, visualization, and feature extraction (Bishop,
2000). It is defined as a set of data being projected orthogonally onto a lower-dimensional linear
space (called the principal subspace) so as to maximize the projected data's variance
(Hotelling, 1933). Other common methods of
dimensionality reduction worth mentioning are independent component analysis (Comon, 1994),
nonnegative matrix factorization (Lee and Seung, 1999), self-organized maps (Kohonen, 1982),
isomaps (Tenenbaum et al., 2000), t-distributed stochastic neighbor embedding (van der Maaten
and Hinton, 2008), Uniform Manifold Approximation and Projection for Dimension Reduction
(McInnes et al., 2018), and autoencoders (Vincent et al., 2008).
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data
compression.
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods
include wavelet transforms (Section 3.4.2) and principal components analysis (Section 3.4.3),
which transform or project the original data onto a smaller space. Attribute subset selection is a
method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes
or dimensions are detected and removed (Section 3.4.4).
Numerosity reduction techniques replace the original data volume by alternative, smaller forms
of data representation. These techniques may be parametric or nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the data parameters need to
be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear
models (Section 3.4.5) are examples. Nonparametric methodsfor storing reduced representations
of the data include histograms (Section 3.4.6), clustering (Section 3.4.7), sampling (Section
3.4.8), and data cube aggregation (Section 3.4.9).
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called lossless. If, instead, we can
reconstruct only an approximation of the original data, then the data reduction is called lossy.
There are several lossless algorithms for string compression; however, they typically allow only
limited data manipulation. Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression.
There are many other ways of organizing methods of data reduction. The computational time
spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced
data set size.
Figure 2.1. Pattern recognition system including feature selection and extraction.
The sensor data are subject to a feature extraction and selection process for determining the input
vector for the subsequent classifier. This makes a decision regarding the class associated with
this pattern vector.
Dimensionality reduction is accomplished based on either feature selection or feature extraction.
Feature selection is based on omitting those features from the available measurements which do
not contribute to class separability. In other words, redundant and irrelevant features are ignored.
This is illustrated in Fig. 2.2.
Feature extraction, on the other hand, considers the whole information content and maps the
useful information content into a lower dimensional feature space. This is shown in Fig. 2.3. In
feature extraction, the mapping type A has to be specified beforehand.
We see immediately that for feature selection or extraction the following is required: (1) feature
evaluation criterion, (2) dimensionality of the feature space, and (3) optimization procedure.
Feature Extraction:
Table of Contents
Feature Extractors
o TF-IDF
o Word2Vec
o CountVectorizer
o FeatureHasher
Feature Transformers
o Tokenizer
o StopWordsRemover
o n-gram
o Binarizer
o PCA
o PolynomialExpansion
o Discrete Cosine Transform (DCT)
o StringIndexer
o IndexToString
o OneHotEncoder
o VectorIndexer
o Interaction
o Normalizer
o StandardScaler
o RobustScaler
o MinMaxScaler
o MaxAbsScaler
o Bucketizer
o ElementwiseProduct
o SQLTransformer
o VectorAssembler
o VectorSizeHint
o QuantileDiscretizer
o Imputer
Feature Selectors
o VectorSlicer
o RFormula
o ChiSqSelector
o UnivariateFeatureSelector
o VarianceThresholdSelector
Locality Sensitive Hashing
o LSH Operations
Feature Transformation
Approximate Similarity Join
Approximate Nearest Neighbor Search
o LSH Algorithms
Bucketed Random Projection for Euclidean Distance
MinHash for Jaccard Distance
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization
method widely used in text mining to reflect the importance of a term to a document in the
corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the
number of times that term t appears in document d, while document frequency DF(t, D) is the
number of documents that contain term t. If we only
use term frequency to measure the importance, it is very easy to over-emphasize terms that
appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a
term appears very often across the corpus, it means it doesn’t carry special information about a
particular document. Inverse document frequency is a numerical measure of how much
information a term provides:
IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) ),
where |D| is the total number of documents in the corpus. Since logarithm is used, if a term
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to
avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product
of TF and IDF:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D).
There are several variants on the definition of term frequency and document frequency. In
MLlib, we separate TF and IDF to make them flexible.
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length
feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes
the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The
hash function used here is MurmurHash 3. Then term frequencies are calculated based on the
mapped indices. This approach avoids the need to compute a global term-to-index map, which
can be expensive for a large corpus, but it suffers from potential hash collisions, where different
raw features may become the same term after hashing. To reduce the chance of collision, we can
increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple
modulo on the hashed value is used to determine the vector index, it is advisable to use a power
of two as the feature dimension, otherwise the features will not be mapped evenly to the vector
indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle
parameter controls term frequency counts. When set to true all nonzero frequency counts are set
to 1. This is especially useful for discrete probabilistic models that model binary, rather
than integer, counts.
CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for
more details.
Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP
Group and scalanlp/chalk.
Examples
In the following code segment, we start with a set of sentences. We split each sentence into
words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the
sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves
performance when using text as features. Our feature vectors could then be passed to a learning
algorithm.
Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
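The snippet above shows only the HashingTF stage. A fuller sketch of the pipeline described in this section (Tokenizer, then HashingTF, then IDF), assuming an active SparkSession named spark, would look roughly like this:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// hash each bag of words into a fixed-length term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// rescale the term-frequency vectors by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(false)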
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains
a Word2VecModel. The model maps each word to a unique fixed-size vector.
The Word2VecModel transforms each document into a vector using the average of all words in
the document; this vector can then be used as features for prediction, document similarity
calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
Examples
In the following code segment, we start with a set of documents, each of which is represented as
a sequence of words. For each document, we transform it into a feature vector. This feature
vector could then be passed to a learning algorithm.
Refer to the Word2Vec Scala docs for more details on the API.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
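Continuing from the documentDF defined above, the remainder of the standard spark.ml example fits a Word2Vec model and turns each document into a vector:

// learn 3-dimensional word vectors and average them per document
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => Vector: $features")
}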
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents
to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be
used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The
model produces sparse representations for the documents over the vocabulary, which can then be
passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by term
frequency across the corpus. An optional parameter minDF also affects the fitting process by
specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
included in the vocabulary. Another optional binary toggle parameter controls the output vector.
If set to true all nonzero counts are set to 1. This is especially useful for discrete
probabilistic models that model binary, rather than integer, counts.
Examples
Assume that we have the following DataFrame with columns id and texts:
id | texts
----|----------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces
a CountVectorizerModel with vocabulary (a, b, c). Then the output column “vector” after
transformation contains:
id | texts | vector
----|---------------------------------|---------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more
details on the API.
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

cvModel.transform(df).show(false)
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala" in the
Spark repo.
FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of
specified dimension (typically substantially smaller than that of the original feature space). This
is done using the hashing trick to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either
numeric or categorical features. Behavior and handling of column data types is as follows:
Numeric columns: For numeric features, the hash value of the column name is used to map
the feature value to its index in the feature vector. By default, numeric features are not
treated as categorical (even when they are integers). To treat them as categorical, specify the
relevant columns using the categoricalCols parameter.
String columns: For categorical features, the hash value of the string “column_name=value”
is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features
are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
Boolean columns: Boolean values are treated in the same way as string columns. That is,
boolean features are represented as “column_name=true” or “column_name=false”, with an
indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a
simple modulo on the hashed value is used to determine the vector index, it is advisable to use a
power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to
the vector indices.
Examples
Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These
different data types as input will illustrate the behavior of the transform to produce a column of
feature vectors.
real| bool|stringNum|string
----|-----|---------|------
2.2| true| 1| foo
3.3|false| 2| bar
4.4|false| 3| baz
5.5|false| 4| foo
Then the output of FeatureHasher.transform on this DataFrame is:
real|bool |stringNum|string|features
----|-----|---------|------|-------------------------------------------------------
2.2 |true |1 |foo |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0])
3.3 |false|2 |bar |(262144,[6031, 80619,140467,174475],[1.0,1.0,1.0,3.3])
4.4 |false|3 |baz |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
5.5 |false|4 |foo |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])
The resulting feature vectors could then be passed to a learning algorithm.
Refer to the FeatureHasher Scala docs for more details on the API.
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")
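The listing stops at building the DataFrame; applying the hasher itself, following the spark.ml API referenced above, would look roughly like this:

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)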
While processing data using MapReduce you may want to break the requirement into a series of
tasks and do them as a chain of MapReduce jobs, rather than doing everything within one
MapReduce job and making it more complex. Hadoop provides two predefined classes,
ChainMapper and ChainReducer, for the purpose of chaining MapReduce jobs in Hadoop.
ChainMapper class in Hadoop
Using ChainMapper class you can use multiple Mapper classes within a single Map task. The
Mapper classes are invoked in a chained fashion, the output of the first becomes the input of the
second, and so on until the last Mapper, the output of the last Mapper will be written to the task's
output.
ChainReducer class in Hadoop
Using the predefined ChainReducer class in Hadoop you can chain multiple Mapper classes after
a Reducer within the Reducer task. For each record output by the Reducer, the Mapper classes
are invoked in a chained fashion. The output of the reducer becomes the input of the first mapper
and output of first becomes the input of the second, and so on until the last Mapper, the output of
the last Mapper will be written to the task's output.
For setting the Reducer class to the chain job setReducer() method is used.
For adding a Mapper class to the chain reducer addMapper() method is used.
Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce
jobs that look like [MAP+ / REDUCE MAP*].
Special care has to be taken when creating chains that the key/values output by a Mapper are
valid for the following Mapper in the chain.
Benefits of using a chained MapReduce job
When MapReduce jobs are chained, data from the intermediate mappers is kept in memory rather
than stored to disk, so that the next mapper in the chain doesn't have to read data from disk. The
immediate benefit of this pattern is a dramatic reduction in disk IO.
It gives you a chance to break the problem into simpler tasks and execute them as a chain.
Chained MapReduce job example
Let's take a simple example to show a chained MapReduce job in action. Here the input file has
item, sales and zone columns in the below format (tab separated), and you have to get the total
sales per item for zone-1.
For the sake of the example, let's say in the first mapper you get all the records, and in the second
mapper you filter them to get only the records for zone-1. In the reducer you get the total for each
item, and then you flip the records so that the key becomes the value and the value becomes the
key. For that the predefined InverseMapper in Hadoop is used.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
// Mapper 1 - collects all the records, writing (item, "sales,zone") pairs
public static class CollectionMapper extends Mapper<LongWritable, Text, Text, Text>{
  private Text item = new Text();
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Splitting the tab separated line into item, sales and zone fields
    String[] salesArr = value.toString().split("\t");
    item.set(salesArr[0]);
    // Writing (sales,zone) as value
    context.write(item, new Text(salesArr[1] + "," + salesArr[2]));
  }
}

// Mapper 2 - filters the records, keeping only those for zone-1
public static class FilterMapper extends Mapper<Text, Text, Text, IntWritable>{
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] recordArr = value.toString().split(",");
    if (recordArr[1].equals("zone-1")) {
      context.write(key, new IntWritable(Integer.parseInt(recordArr[0])));
    }
  }
}
// Reduce function
public static class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());
// MapReduce chaining
Configuration mapConf1 = new Configuration(false);
ChainMapper.addMapper(job, CollectionMapper.class, LongWritable.class, Text.class,
    Text.class, Text.class, mapConf1);

Configuration mapConf2 = new Configuration(false);
ChainMapper.addMapper(job, FilterMapper.class, Text.class, Text.class,
    Text.class, IntWritable.class, mapConf2);

Configuration reduceConf = new Configuration(false);
ChainReducer.setReducer(job, TotalSalesReducer.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, reduceConf);

// Flipping (item, total) to (total, item) using the predefined InverseMapper
Configuration mapConf3 = new Configuration(false);
ChainReducer.addMapper(job, InverseMapper.class, Text.class, IntWritable.class,
    IntWritable.class, Text.class, mapConf3);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}
What is a Join?
Joins in MapReduce
What is a Reduce side join?
MapReduce Example on Reduce side join
Conclusion
What is a Join?
The join operation is used to combine two or more database tables based on
foreign keys. In general, companies maintain separate tables for the customer
and the transaction records in their database. And, many times these
companies need to generate analytic reports using the data present in such
separate tables. Therefore, they perform a join operation on these separate
tables using a common column (foreign key), like customer id, etc., to
generate a combined table. Then, they analyze this combined table to get the
desired analytic reports.
Joins in MapReduce
Just like SQL join, we can also perform join operations in MapReduce on
different data sets. There are two types of join operations in MapReduce:
Map Side Join: As the name implies, the join operation is performed in
the map phase itself. Therefore, in the map side join, the mapper
performs the join and it is mandatory that the input to each map is
partitioned and sorted according to the keys.
Reduce Side Join: As the name suggests, in the reduce side join, the
reducer is responsible for performing the join operation. It is
comparatively simple and easier to implement than the map side join as
the sorting and shuffling phase sends the values having identical keys to
the same reducer and therefore, by default, the data is organized for us.
As discussed earlier, the reduce side join is a process where the join operation is performed in the
reducer phase. Basically, the reduce side join takes place in the following manner: each mapper
reads one of the input data sets, tags every record with its source, and emits the join column (the
cust ID) as the key; the shuffle and sort phase then brings all the records with the same cust ID to
the same reducer; finally, the reducer combines the customer record with that customer's
transaction records to produce the joined output.
Now, let us take a MapReduce example to understand the above steps in the
reduce side join.
Using these two datasets, I want to know the lifetime value of each customer.
In doing so, I will be needing the following things:
The person’s name along with the frequency of the visits by that person.
The total amount spent by him/her for purchasing the equipment.
The above figure is just to show you the schema of the two datasets on which
we will perform the reduce side join operation.
Kindly, keep the following things in mind while importing the above
MapReduce example project on reduce side join into Eclipse:
The input files are in input_files directory of the project. Load these into
your HDFS.
Don’t forget to build the path of Hadoop Reference Jars (present in
reduce side join project lib directory) according to your system or VM.
Now, let us understand what happens inside the map and reduce phases in
this MapReduce example on reduce side join:
1. Map Phase:
I will have a separate mapper for each of the two datasets i.e. One mapper for
cust_details input and other for transaction_details input.
public static class CustsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
Therefore, my mapper for cust_details will produce following
intermediate key-value pair:
public static class TxnsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
Like mapper for cust_details, I will follow the similar steps here. Though,
there will be a few differences:
o I will fetch the amount value instead of name of the person.
o In this case, we will use “tnxn” as a tag.
Therefore, the cust ID will be my key of the key-value pair that the
mapper will generate eventually.
Finally, the output of my mapper for transaction_details will be of the
following format:
{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2),
(tnxn amount3),…..]}
{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn
amount3),…..]}
……
If some correlation exists between the drug and the reduction in the tumor, then the drug
can be said to work. The model would also show to what degree it works by calculating the error
statistic.
UNIT IV
Modeling Data and Solving Problems with Graphs: A graph consists of a number of nodes
(formally called vertices) and links (informally called edges) that connect nodes together. The
following Figure shows a graph with nodes and edges.
Graphs can be cyclic or acyclic. In cyclic graphs it’s possible for a vertex to
reach itself by traversing a sequence of edges. In an acyclic graph it’s not possible for a
vertex to traverse a path to reach itself. The following Figure shows examples of cyclic and
acyclic graphs.
Modeling Graphs: There are two common ways of representing graphs: with adjacency matrices
and with adjacency lists.
ADJACENCY MATRIX: In an adjacency matrix, the graph is represented as an N x N matrix M,
where N is the number of nodes and Mij represents an edge between nodes i and j.
The following Figure shows a directed graph representing connections in a social graph. The
arrows indicate a one-way relationship between two people. The adjacency matrix shows how this
graph would be represented.
The disadvantage of adjacency matrices is that they model both the existence and lack of a
relationship, which makes them a dense data structure.
ADJACENCY LIST: Adjacency lists are similar to adjacency matrices, except that they don't
model the lack of a relationship. The following Figure shows an adjacency list for the above graph.
The advantage of the adjacency list is that it offers a sparse representation of the data, which is
good because it requires less space. It also fits well when representing graphs in MapReduce,
because the key can represent a vertex, and the values are a list of vertices that denote a directed
or undirected relationship with that vertex.
Shortest path algorithm:
Finding the shortest path is a common problem in graph theory, where the goal is to find the
shortest route between two nodes. The following Figure shows an example of this algorithm on a
graph where the edges don’t have a weight, in which case the shortest path is the path with
the smallest number of hops, or intermediary nodes between the source and destination.
Applications of this algorithm include traffic mapping software to determine the
shortest route between two addresses, routers that compute the shortest path tree for each
route, and social networks to determine connections between users.
Find the shortest distance between two users: Dijkstra’s algorithm is a shortest
path algorithm and its basic implementation uses a sequential iterative process to
traverse the entire graph from the starting node.
Problem:
We need to use MapReduce to find the shortest path between two people in a social
graph.
Solution:
Use an adjacency list to model a graph, and for each node store the distance from
the original node, as well as a backpointer to the original node. Use the mappers to
propagate the distance to the original node, and the reducer to restore the state of the graph.
Iterate until the target node has been reached.
Discussion: The following Figure shows a small social network, which we’ll use for this
technique. Our goal is to find the shortest path between Dee and Joe. There are four paths
that we can take from Dee to Joe, but only one of them results in the fewest number of hops.
We’ll implement a parallel breadth-first search algorithm to find the
shortest path between two users. Because we’re operating on a social network, we don’t
need to care about weights on our edges. The pseudo-code for the algorithm is described as
the following:
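The pseudo-code figure is not reproduced in these notes. As a rough substitute, the following Spark/Scala sketch shows the core of one map/reduce iteration (the backpointer bookkeeping described above is omitted for brevity; this is not the text book's Java MapReduce listing):

import org.apache.spark.rdd.RDD

// graph: node -> (current shortest distance, adjacency list); unvisited nodes carry Int.MaxValue
def bfsIteration(graph: RDD[(String, (Int, List[String]))]): RDD[(String, (Int, List[String]))] = {
  val INF = Int.MaxValue
  // "map" phase: keep each node's own record and propagate distance + 1 to its neighbours
  val candidates = graph.flatMap { case (node, (dist, adj)) =>
    val propagated =
      if (dist == INF) Nil
      else adj.map(n => (n, (dist + 1, List.empty[String])))
    (node, (dist, adj)) :: propagated
  }
  // "reduce" phase: keep the minimum distance seen so far and restore the adjacency list
  candidates.reduceByKey { case ((d1, a1), (d2, a2)) =>
    (math.min(d1, d2), if (a1.nonEmpty) a1 else a2)
  }
}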
The following Figure shows the algorithm iterations in play with our social graph.
Just like Dijkstra’s algorithm, we’ll start with all the node distances set to infinite, and set
the distance for the starting node, Dee, at zero. With each MapReduce pass, we’ll
determine nodes that don’t have an infinite distance and propagate their distance values to
their adjacent nodes. We continue this until we reach the end node.
We first need to create the starting point. This is done by reading in the social
network (which is stored as an adjacency list) from file and setting the initial distance
values. The following Figure shows the two file formats, the second being the format that’s
used iteratively in our MapReduce code.
Our first step is to create the MapReduce form from the original file. The following
command shows the original input file, and the MapReduce-ready form of the input file
generated by the transformation code:
The reducer calculates the minimum distance for each node and outputs the minimum distance,
the backpointer, and the original adjacent nodes. This is shown in the following code.
Now we can run our code. We need to copy the input file into HDFS, and then kick off our
MapReduce job, specifying the start node name (dee) and the target node name (joe).
Friends-of-friends (FoFs):
The FoF algorithm is used by social network sites such as LinkedIn and Facebook to help users
broaden their networks. The Friends-of-friends (FoF) algorithm suggests friends that a user may
know that aren't part of their immediate network. The following Figure shows FoF to be in the
2nd degree network.
Problem
We want to implement the FoF algorithm in MapReduce.
Solution
Two MapReduce jobs are required to calculate the FoFs for each user in a social
network. The first job calculates the common friends for each user, and the second job sorts
the common friends by the number of connections to our friends.
The following Figure shows a network of people with Jim, one of the users,
highlighted.
In above graph Jim’s FoFs are represented in bold (Dee, Joe, and Jon). Next to Jim’s FoFs is the
number of friends that the FoF and Jim have in common. Our goal here is to determine all the
FoFs and order them by the number of friends in common. Therefore, our expected results would
have Joe as the first FoF recommendation, followed by Dee, and then Jon. The text file to
represent the social graph for this technique is shown
The following first MapReduce job code calculates the FoFs for
each user.
The following second MapReduce job code sorts FoFs by the number of shared
common friends.
To run the above code, we require driver code. But it is not written here. After
writing driver code, we require the input file ("friends.txt") and output directories ("calc-output"
and "sort-output") (as per the text book) to execute it.
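Since the two Java listings from the text book are not reproduced here, the following Spark/Scala sketch captures the same two-step idea (the function name friendsOfFriends and the input layout are assumptions):

import org.apache.spark.rdd.RDD

// friends: user -> list of direct friends (the adjacency list from friends.txt)
def friendsOfFriends(friends: RDD[(String, List[String])]): RDD[(String, List[(String, Int)])] = {
  // Step 1: every unordered pair of a user's friends shares that user as a common friend
  val commonVia = friends.flatMap { case (user, fs) =>
    for (a <- fs; b <- fs if a < b) yield ((a, b), user)
  }
  // pairs that are already direct friends must not be recommended
  val direct = friends.flatMap { case (u, fs) =>
    fs.map(f => (if (u < f) (u, f) else (f, u), ()))
  }.distinct()
  val counts = commonVia.subtractByKey(direct)     // keep only 2nd-degree pairs
    .map { case (pair, _) => (pair, 1) }
    .reduceByKey(_ + _)                            // number of friends in common
  // Step 2: per user, sort candidate FoFs by the number of common friends
  counts.flatMap { case ((a, b), n) => Seq((a, (b, n)), (b, (a, n))) }
    .groupByKey()
    .mapValues(_.toList.sortBy(-_._2))
}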
PageRank:
PageRank is a formula introduced by the founders of Google during their Stanford years in 1998.
PageRank gives a score to each web page that indicates the page's importance.
Calculate PageRank over a web graph: PageRank uses the scores for all the inbound links to
calculate a page’s PageRank. But it disciplines individual inbound links from sources that
have a high number of outbound links by dividing that outbound link PageRank by the
number of outbound links. The following Figure presents a simple example of a web graph
with three pages and their respective PageRank values.
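The formula referred to in the next paragraph is the standard PageRank formula:

PR(P) = (1 - d) / |webGraph| + d * sum( PR(i) / |outbound links of i| ),

where the sum runs over all pages i that link to page P.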
In the above formula, |webGraph| is a count of all the pages in the graph, and d, set
to 0.85, is a constant damping factor used in two parts. First, it denotes the probability of a
random surfer reaching the page after clicking on many links (this is a constant equal to
0.15 divided by the total number of pages), and, second, it dampens the effect of the
inbound link PageRanks by 85 percent.
Problem:
We want to implement an iterative PageRank graph algorithm in MapReduce.
Solution:
PageRank can be implemented by iterating a MapReduce job until the graph has
converged. The mappers are responsible for propagating node PageRank values to their
adjacent nodes, and the reducers are responsible for calculating new PageRank values for
each node, and for re-creating the original graph with the updated PageRank values.
Discussion:
One of the advantages of PageRank is that it can be computed iteratively and
applied locally. Every vertex starts with a seed value, which is 1 divided by the number of nodes,
and with each iteration each node propagates its value to all pages it links
to. Each vertex in turn sums up the value of all the inbound vertex values to compute a
new seed value. This iterative process is repeated until such a time as convergence is
reached. Convergence is a measure of how much the seed values have changed since the
last iteration. If the convergence value is below a certain threshold, it means that there’s
been minimal change and we can stop the iteration. It's also common to limit the number of
iterations in cases of large graphs where convergence takes too many iterations.
The following describes the PageRank algorithm expressed as map and reduce parts. The map
phase is responsible for preserving the graph as well as emitting the PageRank value to all
the outbound nodes. The reducer is responsible for recalculating the new PageRank value
for each node and including it in the output of the original graph.
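As a companion to this description, here is a minimal Spark/Scala sketch of the same iterative logic (the classic RDD formulation, not the text book's MapReduce listing; links is assumed to map each page to its outbound links):

import org.apache.spark.rdd.RDD

def pageRank(links: RDD[(String, List[String])], iterations: Int, d: Double = 0.85): RDD[(String, Double)] = {
  val n = links.count()
  var ranks = links.mapValues(_ => 1.0 / n)        // every vertex starts with the same seed value
  for (_ <- 1 to iterations) {
    // "map": each page sends its rank, divided by its number of outbound links, to every page it links to
    val contribs = links.join(ranks).flatMap { case (_, (outLinks, rank)) =>
      outLinks.map(dest => (dest, rank / outLinks.size))
    }
    // "reduce": each page recomputes its rank from the inbound contributions
    ranks = contribs.reduceByKey(_ + _).mapValues(sum => (1 - d) / n + d * sum)
  }
  ranks
}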
This technique is applied on the following graph. All the nodes of the graph have both inbound
and outbound edges.
Bloom Filters:
An empty Bloom filter can be pictured as a bit vector, for example one with 15 bits:

bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
index:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Each cell in the table is a bit, and the number below it is its index or position. To append an
element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the
positions (indexes) of those hashes to 1.
Bloom filters support two actions: appending an object (keeping track of it), and verifying
whether an object has been seen before.
Appending objects to the Bloom filter
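As the worked bit-vector figure is not reproduced here, the following toy Scala sketch illustrates both actions (illustrative only; it uses MurmurHash3 with k different seeds, whereas real implementations pack the bits and tune m and k):

import scala.util.hashing.MurmurHash3

// A toy Bloom filter with m bits and k hash functions
class SimpleBloomFilter(m: Int, k: Int) {
  private val bits = new Array[Boolean](m)
  // derive k bit positions by hashing the element with k different seeds
  private def positions(element: String): Seq[Int] =
    (0 until k).map(seed => ((MurmurHash3.stringHash(element, seed) % m) + m) % m)
  // append: set the k bits to 1
  def add(element: String): Unit = positions(element).foreach(i => bits(i) = true)
  // verify: "possibly seen before" only if all k bits are already set
  def mightContain(element: String): Boolean = positions(element).forall(i => bits(i))
}

val bf = new SimpleBloomFilter(m = 64, k = 3)
bf.add("hadoop")
println(bf.mightContain("hadoop"))   // true
println(bf.mightContain("spark"))    // false (or, rarely, a false positive)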
Basic Concept
A Counting Bloom filter is a generalized form of the Bloom filter that is used to test whether the
count of a given element is less than a given threshold when a sequence of elements is given. As
a generalized form of the Bloom filter, there is a possibility of false positive matches, but no
chance of false negatives – in other words, a query returns either "possibly higher than or equal to
the threshold" or "definitely less than the threshold".
Algorithm description
Most of the parameters used in a Counting Bloom filter are defined the same as in a Bloom filter,
such as n and k. Here m denotes the number of counters in the Counting Bloom filter, which is an
expansion of the m bits in a Bloom filter.
Similar to the Bloom filter, there must also be k different hash functions defined, each of which is
responsible for mapping or hashing some set element to one of the m counter array positions,
creating a uniform random distribution. As before, k is a constant, much smaller than m, which is
proportional to the number of elements to be appended.
To query for an element with a threshold θ (verify whether the count number of an
element is less than θ), insert it to each of the k hash functions to obtain k counter
positions.
If any of the counters at these positions is smaller than θ, the count number of the element is
definitely smaller than θ – if it were higher or equal, then all the corresponding counters would
have been higher than or equal to θ.
If all are higher or equal to θ, then either the count is really higher or equal to θ, or the
counters have by chance been higher or equal to θ.
If all are higher or equal to θ even though the count is less than θ, this situation is defined
as false positive. Like Bloom filter, this also should be minimized.
A Blocked Bloom filter consists of a sequence of b comparatively small standard Bloom filters
(Bloom filter blocks), each of which fits into one cache line.
Blocked Bloom filter scheme is differentiated from the partition schemes,
where each bit is inserted into a different block.
Blocked Bloom Filter is implemented in following ways −
Bit Patterns (pat)
In this topic, we discuss implementing blocked Bloom filters using precomputed bit patterns.
Instead of setting k bits through the evaluation of k hash functions, a single hash function selects
a precomputed pattern from a table of random k-bit patterns of width B. In many cases, this table
will fit into the cache. With this solution, only one small (in terms of bits) hash value is required,
and the operation can be implemented using a few SIMD (Single Instruction Multiple Data)
instructions. At the time of transferring the Bloom filter, the table need not be included explicitly
in the data, but can be reconstructed using the seed value.
The main disadvantage of the bit pattern method is that two elements may cause a table collision
when they are hashed to the same pattern. This causes increased FPR.
Multiplexing Patterns
To refine this idea once more, we can achieve a larger variety of patterns from
a single table by bitwise-or-ing x patterns with an average number of k/x set
bits.
Multi-Blocking
One more variant that helps improve the FPR is called multi-blocking. We permit the query
operation to access X Bloom filter blocks, setting or testing k/X bits respectively in each block.
(When k is not divisible by X, we set an extra bit in the first k mod X blocks.) Multi-blocking
performs better than just increasing the block size to XB (where B is the block size), since
more variety is introduced this way. If we divide the set bits among several blocks, the expected
number of 1 bit per block remains the same. However, only k/X bits are considered in each
participating block, when accessing an element.
Example – Create RDD from List<T>
In this example, we will take a List of strings, and then create a Spark
RDD from this list.
RDDfromList.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class RDDfromList {
  public static void main(String[] args) {
    // configure spark (local mode for the example)
    SparkConf sparkConf = new SparkConf().setAppName("Create RDD from List").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // sample data: any List of Strings works here
    List<String> data = Arrays.asList("Learn", "Apache", "Spark", "RDD");
    // create RDD from the List
    JavaRDD<String> items = sc.parallelize(data);
    // print each element of the RDD
    items.foreach(item -> {
      System.out.println("* " + item);
    });
    sc.close();
  }
}
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
    .setMaster("local[2]").set("spark.executor.memory", "2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// read a text file into an RDD (hypothetical input path)
JavaRDD<String> lines = sc.textFile("data/input.txt");
// print each line of the RDD
for (String line : lines.collect()) {
  System.out.println(line);
}
}
// configure spark
SparkSession spark = SparkSession
    .builder()
    .appName("Spark Example - Read JSON to RDD")
    .master("local[2]")
    .getOrCreate();
// read a JSON file into an RDD of Rows (hypothetical input path)
JavaRDD<Row> items = spark.read().json("data/employees.json").toJavaRDD();
// print each Row
items.foreach(item -> {
  System.out.println(item);
});
}
Conclusion
We have learnt to create a Spark RDD from a List, and by reading a text or JSON file from the
file system, with the help of example programs.
Operations:
RDD Operations
o Transformation
o Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. The
transformations are considered lazy as they are only computed when an action requires a result to
be returned to the driver program.
Transformation Description
sample(withReplacement, fraction, seed)  It samples a fraction fraction of the data, with or
without replacement, using a given random number generator seed.
sortByKey([ascending], [numTasks])  When called on a dataset of (K, V) pairs, it returns a dataset
of (K, V) pairs sorted by keys, in the order specified in the boolean ascending argument.
Action
In Spark, the role of action is to return a value to the driver program after running a computation
on the dataset.
Action Description
reduce(func) It aggregates the elements of the dataset using a function func
(which takes two arguments and returns one). The function
should be commutative and associative so that it can be
computed correctly in parallel.
collect() It returns all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation
that returns a sufficiently small subset of the data.
takeOrdered(n, [ordering]) It returns the first n elements of the RDD using either their
natural order or a custom comparator.
saveAsTextFile(path) It is used to write the elements of the dataset as a text file (or set
of text files) in a given directory in the local filesystem, HDFS or
any other Hadoop-supported file system. Spark calls toString on
each element to convert it to a line of text in the file.
foreach(func) It runs a function func on each element of the dataset for side
effects such as updating an Accumulator or interacting with
external storage systems.
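As a small illustration of how lazy transformations and actions fit together (assuming an existing SparkContext named sc):

val nums = sc.parallelize(1 to 10)             // create an RDD
val evens = nums.filter(_ % 2 == 0)            // transformation: nothing is computed yet
val doubled = evens.map(_ * 2)                 // another lazy transformation
println(doubled.reduce(_ + _))                 // action: triggers the computation, prints 60
println(doubled.collect().mkString(", "))      // action: 4, 8, 12, 16, 20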
Python provides a simple way to pass functions to Spark. The Spark programming guide
available at spark.apache.org suggests there are three recommended ways to do this:
Lambda expressions are the ideal way for short functions that can be
written inside a single expression
Local defs inside the function calling into Spark for longer code
Top-level functions in a module
While we have already looked at lambda functions in some of the previous
examples, let's look at local definitions of functions. We can encapsulate
our business logic, which is the splitting of words and counting, into two
separate functions as shown below.
def splitter(lineOfText):
words = lineOfText.split(" ")
return len(words)
def aggregate(numWordsLine1, numWordsLineNext):
totalWords = numWordsLine1 + numWordsLineNext
return totalWords
A Spark transformation is a function that produces a new RDD from the existing RDDs.
It takes an RDD as input and produces one or more RDDs as output. Each time we apply a
transformation, it creates a new RDD. Thus, the input RDDs cannot be changed, since RDDs are
immutable in nature.
Applying transformations builds an RDD lineage, with the entire parent RDDs of the final RDD(s).
RDD lineage is also known as the RDD operator graph or RDD dependency graph. It is a logical
execution plan, i.e., a Directed Acyclic Graph (DAG) of the entire parent RDDs of the RDD.
Transformations are lazy in nature, i.e., they get executed only when we call an action; they are
not executed immediately. The two most basic types of transformations are map() and filter().
After the transformation, the resultant RDD is always different from its parent RDD. It can be
smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the
same size (e.g. map()).
There are two types of transformations:
Narrow transformation – In a narrow transformation, all the elements that are required to
compute the records in a single partition live in a single partition of the parent RDD. A limited
subset of partitions is used to calculate the result. Narrow transformations are the result
of map() and filter().
Wide transformation – In a wide transformation, all the elements that are required to
compute the records in a single partition may live in many partitions of the parent RDD. Wide
transformations are the result of groupByKey() and reduceByKey(). A short sketch contrasting
the two follows this list.
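A short sketch contrasting a narrow and a wide transformation (assuming an existing SparkContext named sc):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs = words.map(w => (w, 1))        // narrow: each output partition depends on a single input partition
val counts = pairs.reduceByKey(_ + _)     // wide: requires shuffling data across partitions
counts.collect().foreach(println)         // e.g. (a,3), (b,2), (c,1) (order may vary)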
There are various functions in RDD transformation. Let us see RDD
transformation with examples.
map(func)
The map() function iterates over every line (element) in the RDD and produces a new RDD.
Using the map() transformation we take in any function, and that function is applied to every
element of the RDD.
In map, we have the flexibility that the input and the return type of the RDD may differ from
each other. For example, we can have an input RDD of type String and, after applying the map()
function, the returned RDD can be of a different type.
map() example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object mapTest{
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
    val data = spark.read.textFile("spark_test.txt").rdd
    val mapFile = data.map(line => (line, line.length))
    mapFile.foreach(println)
  }
}
Contents of spark_test.txt (the sample input file):
…, and how can we apply functions on that RDD partitions? All this will be done through Spark
programming with the help of Scala language support…
Note – In the above code, the map() function maps each line of the file with its length.
flatMap()
With the help of the flatMap() function, for each input element we can have many elements in the
output RDD. The simplest use of flatMap() is to split each input string into words.
map() and flatMap() are similar in the way that they take a line from the input RDD and apply a
function on that line. The key difference between map() and flatMap() is that map() returns only
one element, while flatMap() can return a list of elements.
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)
Persistence:
Persistence is "the continuance of an effect after its cause is removed". In the context of storing
data in a computer system, this means that the data survives after the process with which it was
created has ended. In other words, for a data store to be considered persistent, it must write to
non-volatile storage.
The following are some of the approaches that a data store can take, and how (or if) these designs provide persistence:
Pure in-memory, no persistence at all, such as memcached or Scalaris
Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)
In-memory approaches can achieve blazing speed, but at the cost of being limited to a relatively
small data set. Most workloads have relatively small "hot" (active) subset of their total data;
systems that require the whole dataset to fit in memory rather than just the active part are fine for
caches but a bad fit for most other applications. Because the data is in memory only, it will not
survive process termination. Therefore these types of data stores are not considered persistent.
The easiest way to add persistence to an in-memory system is with periodic snapshots to disk at a
configurable interval. Thus, you can lose up to that interval's worth of updates.
Only commitlog-based persistence provides Durability (the D in ACID) with every write.
Cassandra implements a commit-log based persistence design, but at the same time provides
for tunable levels of durability. This allows you to decide what the right trade off is between
safety and performance. You can choose, for each write operation, to wait for that update to be:
buffered to memory
Or, you can choose to accept writes as quickly as possible, acknowledging their receipt
immediately before they have even been fully deserialized from the network.
At the end of the day, you're the only one who knows what the right performance/durability trade
off is for your data. Making an informed decision on data store technologies is critical to
addressing this tradeoff on your terms. Because Cassandra provides such tunability, it is a logical
choice for systems with a need for a durable, performant data store.
Adding Schemas to RDDs :
Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-
tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain
any type of object and is created by loading an external dataset or distributing a collection from
the driver program.
A Schema RDD is an RDD on which you can run SQL. It is more than SQL; it is a unified
interface for working with structured data.
Code explanation:
1. Importing Expression Encoder for RDDs. RDDs are similar to Datasets but use encoders for
serialization.
4. Creating an ’employeeDF’ DataFrame from ’employee.txt’ and mapping the columns based on
6. Defining a DataFrame ‘youngstersDF’ which will contain all the employees between the ages
of 18 and 30.
7. Mapping the names from the RDD into ‘youngstersDF’ to display the names of youngsters.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._
val employeeDF = spark.sparkContext.textFile("examples/src/main/resources/employee.txt")
  .map(_.split(","))
  .map(attributes => Employee(attributes(0), attributes(1).trim.toInt))
  .toDF()
employeeDF.createOrReplaceTempView("employee")
val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
youngstersDF.map(youngster => "Name: " + youngster(0)).show()
Code explanation:
1. Converting the mapped names into string for transformations.
2. Using the mapEncoder from Implicits class to map the names to the ages.
3. Mapping the names to the ages of our ‘youngstersDF’ DataFrame. The result is an array with
names mapped to their respective ages.
Transformations: These are the operations (such as map, filter, join, union, and so on) that are
performed on an RDD and that yield a new RDD containing the result.
Actions: These are operations (such as reduce, count, first, and so on) that return a value after
running a computation on an RDD.
Transformations in Spark are “lazy”, meaning that they do not compute their results right away.
Instead, they just “remember” the operation to be performed and the dataset (e.g., file) to which
the operation is to be performed; the chain of operations is remembered as a Directed Acyclic
Graph (DAG). The transformations are computed only when an action is called and the result is
returned to the driver program. This design enables Spark to run more efficiently. For example, if
a big file was transformed in various ways and passed to the first() action, Spark would only
process and return the result for the first line, rather than do the work for the entire file.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory using the persist or cache method, in which
case Spark will keep the elements around on the cluster for much faster access the next time you
query it.
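A minimal sketch of caching an RDD (assuming an existing SparkContext named sc; the file path is hypothetical):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app.log")                                   // hypothetical input path
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)
println(errors.count())   // first action computes and caches the RDD
println(errors.count())   // later actions reuse the cached partitions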
RDDs as Relations:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction which lets
programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs
can be created from any data source, e.g. a Scala collection, the local file system, Hadoop,
Amazon S3, etc.
Specifying Schema
Code explanation:
2. Importing ‘Row’ class into the Spark Shell. Row is used in mapping RDD Schema.
4. Defining the schema as “name age”. This is used to map the columns of the RDD.
5. Defining ‘fields’ RDD which will be the output after mapping the ’employeeRDD’ to the
schema ‘schemaString’.
Code explanation:
1. We now create a RDD called ‘rowRDD’ and transform the ’employeeRDD’ using the ‘map’
function into ‘rowRDD’.
2. We define a DataFrame ’employeeDF’ and store the RDD schema into it.
3. Creating a temporary view of ’employeeDF’ into ’employee’.
4. Performing the SQL operation on ’employee’ to display the contents of employee.
5. Displaying the names of the previous operation from the ’employee’ view.
results.map(attributes => "Name: " + attributes(0)).show()
Even though RDDs are defined, they don’t contain any data. The computation to create the data in
an RDD is only done when the data is referenced. e.g. Caching results or writing out the RDD.
3. Automatically selects the best comparison
Code explanation:
1. Importing Implicits class into the shell.
2. Creating an ’employeeDF’ DataFrame from our
’employee.json’ file.
import spark.implicits._
val employeeDF = spark.read.json("examples/src/main/resources/employee.json")
Code explanation:
1. Creating a ‘parquetFile’ temporary view of our DataFrame.
2. Selecting the names of people between the ages of 18 and
30 from our Parquet file.
3. Displaying the result of the Spark SQL operation.
employeeDF.write.parquet("employee.parquet")
val parquetFileDF = spark.read.parquet("employee.parquet")
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE
age BETWEEN 18 AND 30")
namesDF.map(attributes => "Name: " + attributes(0)).show()
JSON Datasets
We will now work on JSON data. As Spark SQL supports
JSON dataset, we create a DataFrame of employee.json. The
schema of this DataFrame can be seen below. We then define
a Youngster DataFrame and add all the employees between
the ages of 18 and 30.
Code explanation:
1. Setting to path to our ’employee.json’ file.
2. Creating a DataFrame ’employeeDF’ from our JSON file.
3. Printing the schema of ’employeeDF’.
4. Creating a temporary view of the DataFrame into
’employee’.
5. Defining a DataFrame ‘youngsterNamesDF’ which stores
the names of all the employees between the ages of 18 and
30 present in ’employee’.
6. Displaying the contents of our DataFrame.
val path = "examples/src/main/resources/employee.json"
val employeeDF = spark.read.json(path)
employeeDF.printSchema()
employeeDF.createOrReplaceTempView("employee")
val youngsterNamesDF = spark.sql("SELECT name FROM employee
WHERE age BETWEEN 18 AND 30")
youngsterNamesDF.show()
Code explanation:
1. Creating a RDD ‘otherEmployeeRDD’ which will store the
content of employee George from New Delhi, Delhi.
2. Assigning the contents of ‘otherEmployeeRDD’ into
‘otherEmployee’.
3. Displaying the contents of ‘otherEmployee’.
val otherEmployeeRDD = spark.sparkContext.makeRDD(
  """{"name":"George","address":{"city":"New Delhi","state":"Delhi"}}""" :: Nil)
val otherEmployee = spark.read.json(otherEmployeeRDD)
otherEmployee.show()
Hive Tables
We perform a Spark example using Hive tables.
Code explanation:
1. Importing ‘Row’ class into the Spark Shell. Row is used in mapping RDD Schema.
5. We now build a Spark Session ‘spark’ to demonstrate Hive example in Spark SQL.
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Code explanation:
1. We now load the data from the examples present in Spark directory into our table ‘src’.
2. The contents of ‘src’ is displayed below.
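The actual statements are not reproduced in these notes; following the standard Spark SQL Hive example, they would look roughly like this (kv1.txt is the sample file shipped with the Spark examples):

sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
sql("SELECT * FROM src").show()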
Code explanation:
1. We perform the ‘count’ operation to select the number of keys in ‘src’ table.
2. We now select all the records with ‘key’ value less than 10 and store it in the ‘sqlDF’
DataFrame.
3. Creating a Dataset ‘stringDS’ from ‘sqlDF’.
4. Displaying the contents of ‘stringDS’ Dataset.
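A sketch of the corresponding queries, again following the standard Spark SQL Hive example (it assumes import org.apache.spark.sql.Row and import spark.implicits._ are in scope):

sql("SELECT COUNT(*) FROM src").show()
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringDS = sqlDF.map { case Row(key: Int, value: String) => s"Key: $key, Value: $value" }
stringDS.show()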
Code explanation:
1. We create a DataFrame ‘recordsDF’ and store all the records with key values 1 to 100.
2. Create a temporary view ‘records’ of ‘recordsDF’ DataFrame.
3. Displaying the contents of the join of tables ‘records’ and ‘src’ with ‘key’ as the primary key.
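A sketch of these steps (the Record case class is an assumption matching the key/value layout of src):

case class Record(key: Int, value: String)
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()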
Creating Pairs in RDDs:
In Apache Spark, RDDs of key-value pairs are known as paired RDDs. In this section, we will
learn what paired RDDs in Spark are in detail.
To understand them in depth, we will focus on the methods of creating Spark paired RDDs and
the operations on paired RDDs in Spark, such as transformations and actions.
RDD stands for Resilient Distributed Dataset, the core abstraction and fundamental data structure of Spark. RDDs are immutable, distributed collections of objects. Each dataset in an RDD is divided into logical partitions, and each partition may be computed on a different node of the cluster. Spark RDDs can contain user-defined classes as well as any type of Scala, Python or Java objects.
An RDD is a read-only, partitioned collection of records. Spark RDDs are fault-tolerant collections of elements that can be operated on in parallel. There are generally three ways to create Spark RDDs: from data in stable storage, from other RDDs, and by parallelizing an existing collection in the driver program. By using RDDs, it is possible to achieve faster and more efficient MapReduce operations.
In Scala, key-value pair operations are automatically available on RDDs containing Tuple2 objects, through the PairRDDFunctions class, which wraps an RDD of tuples.
For example:
In this code we are using the reduceByKey operation on key-value
pairs. We will count how many times each line of text occurs in a file:
val lines1 = sc.textFile("data1.txt")
val pairs = lines1.map(s => (s, 1))
val counts1 = pairs.reduceByKey((a, b) => a + b)
In many programs, pair RDDs are a useful building block. They expose operations that allow us to act on each key in parallel and to regroup data across the network.
For instance, the reduceByKey() method aggregates data separately for each key, and the join() method merges two RDDs together by grouping elements with the same key. It is very common to extract a field from an RDD (representing, for instance, an event time, a customer ID, or another identifier) and to use that field as the key in pair RDD operations.
We can create a pair RDD by running a map() function that returns key/value pairs. The exact procedure for building key-value RDDs differs by language.
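For example, a minimal sketch in the Spark shell that keys each line of a text file by its first word (the input file path is assumed):
// Build a pair RDD: (first word of the line, whole line)
val lines = sc.textFile("data1.txt")
val pairs = lines.map(line => (line.split(" ")(0), line))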
1. Transformation Operations
Pair RDDs can use all the transformations available to standard RDDs, and the same rules from “passing functions to Spark” apply. Since paired RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements. Some of the transformation methods are listed here.
For example:
reduceByKey(func)
Combine values with the same key using the given function.
mapValues(func)
Apply a function to each value of a pair RDD without changing the key.
keys()
Return an RDD of just the keys.
rdd.keys
values()
Return an RDD of just the values.
rdd.values
sortByKey()
Return an RDD sorted by the key.
rdd.sortByKey()
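A minimal sketch applying these transformations to a small pair RDD (the sample data here is assumed; output ordering may vary with partitioning):
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairRdd.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairRdd.mapValues(_ * 10).collect()    // Array((a,10), (b,20), (a,30))
pairRdd.keys.collect()                 // Array(a, b, a)
pairRdd.sortByKey().collect()          // Array((a,1), (a,3), (b,2))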
2. Action Operations
Like transformations, the actions available on Spark pair RDDs are similar to those on base RDDs. In addition, there are some extra actions available on pair RDDs that leverage the key/value nature of the data. Some of them are listed below. For example:
countByKey()
Count the number of elements for each key.
rdd.countByKey()
collectAsMap()
Collect the result as a map to provide easy lookup.
rdd.collectAsMap()
lookup(key)
Return all values associated with the provided key.
rdd.lookup(key)
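A minimal sketch applying these actions to a small pair RDD (the sample data here is assumed):
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairRdd.countByKey()     // Map(a -> 2, b -> 1)
pairRdd.collectAsMap()   // duplicate keys keep only one value, e.g. Map(b -> 2, a -> 3)
pairRdd.lookup("a")      // Seq(1, 3)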
Conclusion
Hence, we have seen how to work with Spark key/value data and how to use the specialized functions and operations available on paired RDDs in Spark.
There are two types of Apache Spark RDD operations: Transformations and Actions.
A transformation is a function that produces a new RDD from existing RDDs, whereas an action is performed when we want to work with the actual dataset. When an action is triggered, a new RDD is not formed, unlike with a transformation. In this section we will look in detail at what a transformation in a Spark RDD is, the various RDD transformation operations in Spark with examples, what an action in a Spark RDD is, and the various RDD action operations in Spark with examples.
3. RDD Transformation
A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Each time we apply a transformation, a new RDD is created; the input RDDs cannot be changed, since RDDs are immutable in nature.
Applying transformations builds an RDD lineage, containing all the parent RDDs of the final RDD(s). The RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
Transformations are lazy in nature, i.e., they are executed only when we call an action; they are not executed immediately. The two most basic transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
There are various functions in RDD transformation.
1. map()
The map() transformation takes in a function and applies it to every element of the RDD, producing a new RDD.
With map(), the input and the return types of the RDD may differ from each other. For example, the input RDD can be of type String while the resulting RDD is of another type.
For example, in the RDD {1, 2, 3, 4, 5}, if we apply “rdd.map(x => x + 2)” we will get the result (3, 4, 5, 6, 7).
Map() example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object mapTest {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
    val data = spark.read.textFile("spark_test.txt").rdd
    val mapFile = data.map(line => (line, line.length))
    mapFile.foreach(println)
  }
}
Contents of spark_test.txt:
hello...user! this file is created to check the operations of spark.
and how can we apply functions on that RDD partitions? All this will be done through spark programming which is done with the help of scala language support…
2. flatMap()
With the flatMap() function, each input element can be mapped to many elements in the output RDD. The simplest use of flatMap() is to split each input string into words.
Map and flatMap are similar in that they take a line from the input RDD and apply a function to it. The key difference is that map() returns exactly one element, while flatMap() can return a list of elements.
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)
3. filter(func)
Spark RDD filter() function returns a new RDD, containing only the elements that meet a
predicate. It is a narrow operation because it does not shuffle data from one partition to many
partitions.
For example, Suppose RDD contains first five natural numbers (1, 2, 3, 4, and 5) and the
predicate is check for an even number. The resulting RDD after the filter will contain only the
even numbers i.e., 2 and 4.
Filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
Note – In the above code, the flatMap function maps each line into words, filter keeps only the words equal to “spark”, and the count() action then counts them.
4. mapPartitions(func)
mapPartitions() converts each partition of the source RDD into many elements of the result (possibly none). In mapPartitions(), the function is applied to each partition at once rather than to each element individually. mapPartitions() is like map(), but the difference is that it runs separately on each partition (block) of the RDD.
5. mapPartitionsWithIndex()
It is like mapPartitions(), but in addition it provides the function with an integer value representing the index of the partition, and the function is applied partition by partition, index-wise.
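A minimal sketch of mapPartitionsWithIndex(), tagging each element with the index of the partition it belongs to (the sample data here is assumed):
val rdd = sc.parallelize(1 to 10, 3)
val withIndex = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}
withIndex.collect().foreach(println)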
Spark SQL - Overview
Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to maintain speed in processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for
fast computation. It is based on Hadoop MapReduce and it extends the
MapReduce model to efficiently use it for more types of computations, which
includes interactive queries and stream processing. The main feature of Spark is its in-memory
cluster computing that increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.
Features of Apache Spark
Apache Spark has following features.
Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python. Therefore, you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports
SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with Hadoop components.
There are three ways of Spark deployment as explained below.
Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, that Spark runs on Yarn without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
Components of Spark
The following illustration depicts the different components of Spark.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of
data.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It
provides an API for expressing graph computation that can model the user-
defined graphs by using Pregel abstraction API. It also provides an optimized
runtime for this abstraction.
Spark – RDD
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of
Spark. It is an immutable distributed collection of objects. Each dataset in
RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala
objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can
be created through deterministic operations on either data on stable storage
or other RDDs. RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in
your driver program, or referencing a dataset in an external storage system,
such as a shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
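A minimal sketch of both creation paths in the Spark shell (the HDFS path here is only an assumed example):
// Parallelizing an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Referencing a dataset in an external storage system such as HDFS
val fromStorage = sc.textFile("hdfs://namenode:9000/data.txt")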
Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take
place and why they are not so efficient.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory processing computation. This means it stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network and disk.
Let us now try to find out how iterative and interactive operations take place
in Spark RDD.
Spark - Installation
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark into a Linux based system.
The following steps show how to install Apache Spark.
Step1: Verifying Java Installation
Java installation is one of the mandatory things in installing Spark. Try the following command to verify the Java version.
$ java -version
If Java is already installed on your system, you will see a response similar to the following −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next step.
Step2: Verifying Scala Installation
You need the Scala language to implement Spark. So let us verify the Scala installation using the following command.
$ scala -version
If Scala is already installed on your system, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, proceed to the next step for Scala installation.
Step3: Downloading Scala
Download the latest version of Scala by visiting the Download Scala link. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.
Step4: Installing Scala
Follow the below given steps for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Step7: Verifying the Spark Installation
Write the following command to open the Spark shell.
$ spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to:
hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls
to: hadoop
disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service
'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM,
Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Libraries:
Spark SQL
Integrated
Seamlessly mix SQL queries with Spark programs.
Spark SQL lets you query structured data inside Spark programs, using either SQL or a
familiar DataFrame API. Usable in Java, Scala, Python and R.
results = spark.sql("SELECT * FROM people")
names = results.rdd.map(lambda p: p.name)
Hive integration
Run SQL or HiveQL queries on existing warehouses.
Spark SQL supports the HiveQL syntax as well as Hive SerDes and UDFs, allowing you to
access existing Hive warehouses.
Spark SQL can use existing Hive metastores, SerDes, and UDFs.
Standard connectivity
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for business
intelligence tools.
Use your existing BI tools to query big data.
Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast. At the same time, it scales to thousands of nodes and multi hour queries using the
Spark engine, which provides full mid-query fault tolerance. Don't worry about using a different
engine for historical data.
Community
Spark SQL is developed as part of Apache Spark. It thus gets tested and updated with each Spark
release.
If you have questions about the system, ask on the Spark mailing lists.
The Spark SQL developers welcome contributions. If you'd like to help out, read how to
contribute to Spark, and send us a patch!
Spark Streaming:
Ease of use
Build applications through high-level operators.
Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting
you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
Fault tolerance
Stateful exactly-once semantics out of the box.
Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box,
without any extra code on your part.
Spark integration
Combine streaming with batch and interactive queries.
By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join
streams against historical data, or run ad-hoc queries on stream state. Build powerful interactive
applications, not just analytics.
stream.join(historicCounts).filter {
case (word, (curCount, oldCount)) =>
curCount > oldCount
}
Deployment options
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also
define your own custom data sources.
You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster
resource managers. It also includes a local run mode for development. In production, Spark
Streaming uses ZooKeeper and HDFS for high availability.
Community
Spark Streaming is developed as part of Apache Spark. It thus gets tested and updated with each
Spark release.
If you have questions about the system, ask on the Spark mailing lists.
The Spark Streaming developers welcome contributions. If you'd like to help out, read how to
contribute to Spark, and send us a patch!
MLlib (machine learning)
model = KMeans(k=10).fit(data)
Performance
High-quality algorithms, 100x faster than MapReduce.
Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care
about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration,
and can yield better results than the one-pass approximations sometimes used on MapReduce.
Runs everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse
data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN,
on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache
Hive, and hundreds of other data sources.
Algorithms
Community
MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each
Spark release.
If you have questions about the library, ask on the Spark mailing lists.
MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an
algorithm to MLlib, read how to contribute to Spark and send us a patch!
GraphX (graph)
Flexibility
Seamlessly work with both graphs and collections.
GraphX unifies ETL, exploratory analysis, and iterative graph computation within a
single system. You can view the same data as both graphs and
collections, transform and join graphs with RDDs efficiently, and write custom iterative graph
algorithms using the Pregel API.
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
(id, vertex, msg) => ...
}
Speed
Comparable performance to the fastest specialized graph processing systems.
GraphX competes on performance with the fastest graph systems while retaining Spark's
flexibility, fault tolerance, and ease of use.
Algorithms
Choose from a growing library of graph algorithms.
In addition to a highly flexible API, GraphX comes with a variety of graph algorithms, many of
which were contributed by our users.
PageRank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
Community
GraphX is developed as part of the Apache Spark project. It thus gets tested and updated with
each Spark release.
If you have questions about the library, ask on the Spark mailing lists.
GraphX is in the alpha stage and welcomes contributions. If you'd like to submit a change to
GraphX, read how to contribute to Spark and send us a patch!
Features:
Spark SQL provides full mid-query fault tolerance, letting it scale to large jobs too. Do not worry about using a different engine for historical data.
This architecture contains three layers namely, Language API, Schema RDD,
and Data Sources.
Language API − Spark is compatible with different languages, and Spark SQL is supported by these language APIs: Python, Scala, Java, and HiveQL.
Schema RDD − Spark Core is designed with a special data structure called RDD. Generally, Spark SQL works on schemas, tables, and records. Therefore, we can use the Schema RDD as a temporary table. This Schema RDD is also called a DataFrame.
Data Sources − Usually the data source for Spark Core is a text file, Avro file, etc. However, the data sources for Spark SQL are different. These include Parquet files, JSON documents, Hive tables, and Cassandra databases.
We will discuss more about these in the subsequent chapters.
SQLContext
SQLContext is a class and is used for initializing the functionalities of Spark
SQL. SparkContext class object (sc) is required for initializing SQLContext
class object.
The following command is used for initializing the SparkContext through
spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name sc when the
spark-shell starts.
Use the following command to create SQLContext.
scala> val sqlcontext = new
org.apache.spark.sql.SQLContext(sc)
Example
Let us consider an example of employee records in a JSON file
named employee.json. Use the following commands to create a DataFrame
(df) and read a JSON document named employee.json with the following
content.
employee.json − Place this file in the directory where the
current scala> pointer is located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
DataFrame Operations
DataFrame provides a domain-specific language for structured data
manipulation. Here, we include some basic examples of structured data
processing using DataFrames.
Follow the steps given below to perform DataFrame operations. For example, the schema printed by df.printSchema() for the employee DataFrame includes:
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Spark SQL supports the following data sources:
1. JSON Datasets − Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame.
2. Hive Tables
3. Parquet Files
We will now start querying using Spark SQL. Note that the actual SQL queries are similar to the ones used in popular SQL clients.
Starting the Spark Shell: go to the Spark directory and execute ./bin/spark-shell in the terminal to begin the Spark Shell.
For the querying examples shown below, we will be using two files, ’employee.txt’ and ’employee.json’, which contain employee records. Both files are stored under ‘examples/src/main/resources/’ inside the folder containing the Spark installation (~/Downloads/spark-2.0.2-bin-hadoop2.7). So, if you are executing the queries, place the files in this directory or set the path to your files in the lines of code below.
Code explanation:
1. We first import a Spark Session into Apache Spark.
2. Creating a Spark Session ‘spark’ using the ‘builder()’ function.
3. Importing the Implicits class into our ‘spark’ Session.
4. We now create a DataFrame ‘df’ and import data from the ’employee.json’
file.
5. Displaying the DataFrame ‘df’. The result is a table of 5 rows of ages and
names from our ’employee.json’ file.
1 import org.apache.spark.sql.SparkSession
2 val spark = SparkSession.builder().appName("Spark SQL basic
example").config("spark.some.config.option", "some-value").getOrCreate()
3 import spark.implicits._
4 val df = spark.read.json("examples/src/main/resources/employee.json")
5 df.show()
Code explanation:
1. Importing the Implicits class into our ‘spark’ Session.
2. Printing the schema of our ‘df’ DataFrame.
3. Displaying the names of all our records from ‘df’ DataFrame.
1 import spark.implicits._
2 df.printSchema()
3 df.select("name").show()
Code explanation:
1. Displaying the DataFrame after incrementing everyone’s age by two years.
2. We filter all the employees above age 30 and display the result.
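A minimal sketch of these two operations on the ‘df’ DataFrame created earlier (ages are stored as strings in employee.json, so Spark casts them implicitly):
df.select(df("name"), df("age") + 2).show()   // everyone's age incremented by two years
df.filter(df("age") > 30).show()              // employees above age 30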
Code explanation:
1. Counting the number of people with the same ages. We use the ‘groupBy’
function for the same.
2. Creating a temporary view ’employee’ of our ‘df’ DataFrame.
3. Perform a ‘select’ operation on our ’employee’ view to display the table into
‘sqlDF’.
4. Displaying the results of ‘sqlDF’.
1 df.groupBy("age").count().show()
2 df.createOrReplaceTempView("employee")
3 val sqlDF = spark.sql("SELECT * FROM employee")
4 sqlDF.show()
Creating Datasets
After understanding DataFrames, let us now move on to Dataset API. The
below code creates a Dataset class in SparkSQL.
Code explanation:
1. Creating a class ‘Employee’ to store name and age of an employee.
2. Assigning a Dataset ‘caseClassDS’ to store the record of Andrew.
3. Displaying the Dataset ‘caseClassDS’.
4. Creating a primitive Dataset to demonstrate mapping of DataFrames into Datasets.
5. Mapping over the primitive Dataset (adding 1 to each element) and collecting the results into an array.
case class Employee(name: String, age: Long)
val caseClassDS = Seq(Employee("Andrew", 55)).toDS() // the age value here is assumed for illustration
caseClassDS.show()
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect()
Code explanation:
1. Setting the path to our JSON file ’employee.json’.
2. Creating a Dataset ‘employeeDS’ from the file.
3. Displaying the contents of ’employeeDS’ Dataset.
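A minimal sketch of these steps; the case class name and fields here are assumptions chosen to match the sample employee.json (where all fields are strings):
import spark.implicits._
case class EmployeeRecord(id: String, name: String, age: String)
val path = "examples/src/main/resources/employee.json"
val employeeDS = spark.read.json(path).as[EmployeeRecord]
employeeDS.show()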
**********
THE END