Big Data Manual - Fall 2023
School of Computing
University City of Sharjah
EXPERIMENT INDEX
Upon the successful completion of the course, the student will be able to:
CLO1: Describe the tools and methods suitable for big data analysis.
CLO2: Analyze methods to identify, classify and extract information from
unstructured datasets.
CLO3: Evaluate the techniques to interpret data using visual information.
SKYLINE UNIVERSITY COLLEGE
University City of Sharjah
MARKS BREAKDOWN
Total Marks
Instructor Sign
EXPERIMENT NO:1
Installing Hadoop involves several steps. Below are the high-level steps for installing Hadoop on a
Unix-like system (e.g., Linux) or the Windows Subsystem for Linux (WSL) on Windows. Installing
Hadoop directly on Windows can be challenging, so using WSL is recommended for Windows users.
Prerequisites:
1. Java: Ensure you have Java (preferably Java 8 or later) installed on your system. You can
check by running java -version.
Installation Steps:
1. Installing Java
Verify Installation:
Open a command prompt and run java -version to check the installed Java version.
Download Hadoop:
1. Visit the Apache Hadoop website (https://hadoop.apache.org/) and download the Hadoop
distribution that suits your needs. Choose a stable version and download the tarball (e.g.,
hadoop-x.y.z.tar.gz).
2. Extract Hadoop:
1. Extract the downloaded tarball to your preferred directory (for example, the C drive) and name
the extracted folder hadoop.
1. Hadoop's configuration files are located in the etc/hadoop directory. You'll need to
modify these files to suit your cluster setup. Key configuration files include:
hdfs-site.xml: Configure HDFS-specific properties such as the block size and replication
factor.
Create a folder named data inside the Hadoop folder.
Inside the data folder, create two folders named namenode and datanode.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop\data\datanode</value>
</property>
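The property entries above follow Hadoop's name/value layout inside a configuration element. As a minimal sketch, assuming you would rather generate the fragment than type it by hand, the Python snippet below builds the same three properties with the standard library. The helper name build_site_xml is hypothetical and is not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The three properties listed in this manual for hdfs-site.xml
hdfs_site = build_site_xml({
    "dfs.replication": "1",
    "dfs.namenode.name.dir": r"C:\hadoop\data\namenode",
    "dfs.datanode.data.dir": r"C:\hadoop\data\datanode",
})
print(hdfs_site)
```

The printed text can be pasted into etc/hadoop/hdfs-site.xml. Parsing it back with ET.fromstring is a quick way to confirm that every property tag is closed.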
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
You can find templates for these files in the etc/hadoop directory. Copy them to the
same directory and make your changes.
Now you need to fix some configuration files. To do so, replace the Hadoop bin folder with another
bin folder that already contains all the files properly configured. First, download the
compressed file (hadoop3_xFixedbin.rar). Then delete the existing bin folder and put the extracted
bin folder in its place.
Initialize HDFS:
Before starting Hadoop services, you need to initialize the HDFS filesystem by formatting the
NameNode. Run the following command:
hdfs namenode -format
You can start Hadoop services using the sbin/start-all.sh script. This will start the
HDFS, YARN (Resource Manager and Node Manager), and other components.
Verify Installation:
You can also use Hadoop shell commands (hadoop fs -ls /) to interact with the HDFS
and ensure it's working as expected.
You can stop all Hadoop services using the sbin/stop-all.sh script.
Optional: If you plan to set up a multi-node cluster, you'll need to configure each node in a similar
manner. Ensure that all nodes can communicate with each other and update the configuration files
accordingly.
EXPERIMENT NO:2
HDFS is the primary storage component of the Hadoop ecosystem. It stores large data sets of
structured or unstructured data across various nodes and maintains the metadata in the form of
log files. To use the HDFS commands, first start the Hadoop services using the following command:
sbin/start-all.sh
OR
sbin\start-all.cmd
To check that the Hadoop services are up and running, use the following command:
jps
Commands:
1. ls: To list the files and directories under a given HDFS path.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs
means we want the hdfs executable, and dfs selects its Distributed File System commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let’s
first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
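4. copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is
the most important command. Local file system means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(on hdfs)>
Example: Let’s suppose we have a file AI.txt on the Desktop which we want to copy to the folder geeks
present on HDFS.
bin/hdfs dfs -copyFromLocal <local path to AI.txt> /geeks
(OR)
bin/hdfs dfs -put <local path to AI.txt> /geeks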
Syntax:
Example:
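6. copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <src(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks/myfile.txt <local path to hero>
(OR)
bin/hdfs dfs -get /geeks/myfile.txt <local path to hero>
myfile.txt from the geeks folder will be copied to the folder hero present on the Desktop.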
Syntax:
Example:
8. cp: This command is used to copy files within HDFS. Let’s copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
9. mv: This command is used to move files within HDFS. Let’s cut-paste the file myfile.txt from the
geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command
when you want to delete a non-empty directory. In recent Hadoop releases -rmr is deprecated in favor
of -rm -r.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory, then the directory itself.
Syntax:
Example:
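12. dus: This command will give the total size of a directory/file. In recent Hadoop releases -dus is
deprecated in favor of -du -s.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks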
13. stat: It will give the last modified time of a directory or path. In short, it will give stats of
the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
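When the same HDFS commands have to be issued repeatedly, it can help to assemble them in a script. The following is a minimal Python sketch, assuming the bin/hdfs launcher used throughout this experiment. The helper hdfs_dfs is hypothetical, and it only builds the argument list without executing anything.

```python
import shlex


def hdfs_dfs(action, *args, binary="bin/hdfs"):
    """Build the argument list for one `hdfs dfs` command (nothing is executed)."""
    return [binary, "dfs", f"-{action}", *args]


# Commands named in this experiment, using the geeks folders from the examples
commands = [
    hdfs_dfs("ls", "/"),
    hdfs_dfs("mkdir", "/geeks"),
    hdfs_dfs("cp", "/geeks", "/geeks_copied"),
    hdfs_dfs("mv", "/geeks/myfile.txt", "/geeks_copied"),
    hdfs_dfs("stat", "/geeks"),
]

for cmd in commands:
    # shlex.join renders the list exactly as it would be typed in a shell
    print(shlex.join(cmd))
```

On a machine where the Hadoop services are running, each list could be passed to subprocess.run(cmd, check=True) to execute it.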
EXPERIMENT NO. 3
MapReduce is a framework that splits the input data into chunks, sorts the map outputs, and passes
them as input to the reduce tasks. A file system stores the input and output of jobs. The framework
also schedules tasks, monitors them, and re-executes failed tasks. Word Count is the example
program used in this experiment.
Figure below shows the architecture as well as working of MapReduce with an example:
Splitting: The input can be split on any delimiter, such as a comma, a space, a new line, or a semicolon.
Shuffle/Intermediate splitting: This step usually runs in parallel across the cluster. The output of
the map phase is passed on toward the Reducer phase, and all records that share the same key are
grouped together.
Reducing: The values grouped under each key are combined into a single result per key. The final
result is the combined output for all keys.
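The phases described above can be imitated on a single machine. The following is a minimal Python sketch of Word Count, not the Hadoop implementation. The function names and the three sample lines are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Splitting: each line of the (hypothetical) input is one record
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster the map and reduce functions run in parallel on different nodes, and the framework performs the shuffle between them.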
First, we will import our dataset into the HDFS (Hadoop Distributed File System).
The dataset can be a simple txt file with some words or sentences written in it.
Step 1. Download and extract the compressed file of the text-based dataset from a source and save
it as a txt file in your local storage.
Step 2. Create a directory in the Hadoop HDFS by using the following command:
Step 3. Copy the txt file in HDFS from the local directory.
Step 4. Run MapReduceClient.jar, providing the input and output directories.
EXPERIMENT NO. 4
In this lab, students will import different files and perform the following tasks:
● Data Preparation
● Descriptive Statistics
● Correlation Matrix
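The three tasks above can be sketched with the Python standard library alone. The records, column names, and helper functions below are hypothetical and only illustrate the idea. In the lab itself a data-frame library would normally be used on the imported files.

```python
import math
import statistics

# Hypothetical records; None marks a missing value
rows = [
    {"age": 25, "income": 30.0},
    {"age": 32, "income": 42.0},
    {"age": None, "income": 38.0},
    {"age": 41, "income": 55.0},
    {"age": 50, "income": 61.0},
]

# Data preparation: drop records that contain a missing value
clean = [r for r in rows if all(v is not None for v in r.values())]
columns = {name: [r[name] for r in clean] for name in clean[0]}


def describe(values):
    """Descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


summary = {name: describe(vals) for name, vals in columns.items()}
# Correlation matrix: one coefficient for every pair of columns
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(summary)
print(corr)
```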
EXPERIMENT NO:5
Theory: Imagine a dataset consisting of several points spread over an n-dimensional space. To find
patterns among data points in an n-dimensional space, we use unsupervised methods.
One of the most popular unsupervised methods is clustering. Clustering is the task of grouping
a set of objects such that objects in the same cluster are more similar to each other than to
objects in other clusters.
Clustering algorithms can be categorized by their cluster model, in other words by how they
form clusters or groups. Some of the prominent families are connectivity-based clustering,
centroid-based clustering, distribution-based clustering, and density-based clustering.
In this exercise, centroid-based clustering is implemented. In this type of clustering, clusters
are represented by a central vector or centroid, which is not necessarily a member of the
dataset. It is an iterative algorithm in which similarity is measured by how close a data point
is to the center of its cluster. In this exercise, we will be working on mall customers data.
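To make the centroid idea concrete, here is a minimal k-means sketch in plain Python. The six customer points (read as income and spending score) are made up for illustration and are not the mall customers data. Taking the first k points as starting centroids is a simplifying assumption, where real implementations usually use a randomized start.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Assumption: the first k points serve as the initial centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical (income, spending score) pairs forming two obvious groups
customers = [(15, 20), (16, 22), (18, 19), (80, 85), (82, 88), (79, 90)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Note that the final centroids are averages of their members, so they need not coincide with any customer in the data.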
Procedure:
5. Predict the clusters again and analyze the issues while preparing clusters.
CODE + DISCUSSION
Output
EXPERIMENT NO: 6
Theory: Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It works by looking for combinations of items that occur together
frequently in transactions. To put it another way, it allows retailers to identify relationships between
the items that people buy.
Example: You are a data scientist (or becoming one!), and you get a client who runs a retail store.
Your client gives you data for all transactions, consisting of items bought in the store by several
customers over a period of time, and asks you to use that data to help boost their business. Your
client will use your findings not only to change, update, or add items in inventory but also to
change the layout of the physical store or of an online store. To find results that will help your
client, you will use Market Basket Analysis (MBA), which applies Association Rule Mining to the given
transaction data.
Procedure:
Using Python
1. Read the data into a variable and check the descriptive information of the read data.
2. Start with item sets containing just a single item.
3. Determine the support for the item sets. Keep the item sets that meet your minimum support
threshold and remove the item sets that do not.
4. Using the item sets you have kept from Step 3, generate all the possible item set
configurations.
5. Repeat Steps 3 & 4 until there are no more new item sets.
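The procedure above can be sketched in a few lines of plain Python. The five transactions and the 0.6 support threshold are hypothetical values chosen for illustration, and this is a simplified sketch rather than a full Apriori implementation.

```python
from itertools import combinations

# Hypothetical transactions; each set holds the items of one basket
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
MIN_SUPPORT = 0.6  # assumed threshold


def support(itemset):
    """Fraction of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Start with item sets containing just a single item
items = sorted({item for t in transactions for item in t})
current = [frozenset([item]) for item in items]
frequent = {}
size = 1
while current:
    # Keep only the item sets that meet the minimum support threshold
    kept = {s: support(s) for s in current if support(s) >= MIN_SUPPORT}
    frequent.update(kept)
    # Generate larger candidates by joining pairs of kept item sets
    size += 1
    current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})

for itemset, value in frequent.items():
    print(sorted(itemset), value)
```

The loop stops once no candidate of the next size survives the threshold, which corresponds to the last step of the procedure.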
CODE & OUTPUT
-------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # required by the plt calls in the plotting cell
import squarify
-------------------------------------
df = pd.read_csv('MBA.csv', header=None)
df.head()
-------------------------------------
df.columns
-------------------------------------
len(df.columns)
-------------------------------------
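# Count how often each item appears in each column, one row of counts per column
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list)
df_result = pd.DataFrame([df[i].value_counts() for i in range(len(df.columns))])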
-------------------------------------
df_result
-------------------------------------
df_sum = df_result.sum()
df_sum = df_sum.sort_values(ascending=False)
df_sum
-------------------------------------
plt.figure(figsize=(15,10))
count = 40
plt.title('Frequency Plot')
color = plt.cm.spring(np.linspace(0,1,count))
Output
EXPERIMENT NO: 7
Theory: Linear regression is based on least squares estimation, which says that the regression
coefficients (estimates) should be chosen so that they minimize the sum of the
squared distances of each observed response to its fitted value.
Example: In this experiment, we will be working with a vehicle dataset to understand the relation
between labor cost (dependent variable) and other variables such as labor hours, using linear
regression.
Procedure:
Using Python
1. Read the data into variable vehicle, and check the summary of read data.
2. Analyze the dependent and independent variables and create a scatter plot.
3. Apply the Linear Regression Model.
4. Create the Linear Regression Model for response variable.
5. Analyze the summary of result, and remove non-significant variable(s).
6. Re-run the model.
7. Predict the value using reduced model.
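For a single predictor, the least squares estimates have a closed form, which the following minimal Python sketch computes. The labor hours and labor cost values are made up for illustration and are not taken from the vehicle dataset. The function name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared distances between observed and fitted responses."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Hypothetical observations
labor_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
labor_cost = [52.0, 98.0, 151.0, 199.0, 250.0]

intercept, slope = fit_line(labor_hours, labor_cost)
print(intercept, slope)

# Predict the labor cost for a new value of labor hours
predicted = intercept + slope * 6.0
print(predicted)
```

Any other choice of intercept and slope gives a larger value of sum_squared_errors on the same data, which is exactly the least squares criterion stated in the theory.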
EXPERIMENT NO: 8
Theory: A decision tree is a decision support tool that uses a tree-like graph or model
of decisions and their possible consequences, including chance event outcomes, resource costs,
and utility. It is one way to display an algorithm that only contains conditional control
statements.
Example: The mushroom dataset used here was originally taken from the UCI Machine Learning
Repository. The main aim is to classify whether a mushroom is edible (e) or poisonous (p) to
consume. Each species in the dataset is identified as definitely edible, definitely poisonous, or
of unknown edibility.
The detailed description of attributes mentioned in the dataset are listed below:
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
A Logistic Regression approach is used for classification. Initially, the data is converted to
numerical format using LabelEncoder. Then the data is split into training and testing sets.
Applying the logistic regression model produces a result with 95% accuracy. The
overall impact of every parameter on the classes is shown using a heatmap.
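The encoding and splitting steps mentioned above can be illustrated without any external library. The sketch below imitates what a label encoder does, mapping each category to an integer in sorted order, and then makes a seeded random split. The eight sample values are hypothetical, and the helper names are assumptions rather than library functions.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows reproducibly and cut off a test portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical values for one attribute and for the class column
gill_size = ["b", "n", "b", "b", "n", "n", "b", "n"]
edibility = ["e", "p", "e", "e", "p", "p", "e", "e"]

encoded_gill, gill_map = label_encode(gill_size)
encoded_class, class_map = label_encode(edibility)

rows = list(zip(encoded_gill, encoded_class))
train, test = split_rows(rows)
print(gill_map, class_map)
print(len(train), len(test))
```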
Procedure:
Using Python
Read the data into a variable and check the summary of the read data.
Clean the dataset and remove irrelevant columns.
Partition the data into training and test datasets.
Create a classification tree from the training data.
Use the predict function with the validation dataset to generate predictions from the learned model.
Display the misclassification error for the validation and training datasets.
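A classification tree is grown by repeatedly choosing the attribute whose split leaves the purest groups. As a minimal sketch of that single step, the code below scores two attributes with the Gini impurity and picks the better one. The six rows are invented for illustration and are not records from the mushroom dataset. A full tree would apply the same choice recursively to each group.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after splitting on every value of one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())


def best_feature(rows, labels):
    """Attribute whose split gives the lowest weighted impurity."""
    return min(rows[0].keys(), key=lambda f: split_impurity(rows, labels, f))


# Hypothetical training rows using two attributes from the list above
mushrooms = [
    {"bruises": "t", "gill-size": "b"},
    {"bruises": "f", "gill-size": "n"},
    {"bruises": "t", "gill-size": "b"},
    {"bruises": "f", "gill-size": "b"},
    {"bruises": "t", "gill-size": "n"},
    {"bruises": "f", "gill-size": "n"},
]
classes = ["e", "p", "e", "e", "p", "p"]

root_attribute = best_feature(mushrooms, classes)
print(root_attribute)
```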
Output
EXPERIMENT NO: 9
Python provides various libraries for visualizing data. Each library comes with different
features and supports various types of graphs. In this tutorial, we will be discussing four
such libraries.
Matplotlib
Seaborn
Bokeh
Plotly
Scatter Plot
Line Chart
Histogram
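The three chart types named above can be drawn with Matplotlib, the first library in the list. The sketch below is one possible arrangement and is not the code from the original lab. The data values and both function names are hypothetical. Matplotlib is imported inside draw_charts so that the binning helper, which shows what a histogram actually counts, runs without it.

```python
from collections import Counter


def histogram_bins(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def draw_charts(x, y):
    """Draw a scatter plot, a line chart, and a histogram side by side."""
    import matplotlib.pyplot as plt  # imported lazily; requires Matplotlib

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].scatter(x, y)
    axes[0].set_title("Scatter Plot")
    axes[1].plot(x, y)
    axes[1].set_title("Line Chart")
    axes[2].hist(y, bins=5)
    axes[2].set_title("Histogram")
    plt.tight_layout()
    plt.show()


# Hypothetical data
x = list(range(10))
y = [3, 7, 8, 12, 14, 15, 21, 22, 26, 29]

# Keys are the lower edges of the bins, values are the counts
bins = histogram_bins(y, 10)
print(bins)
```

Calling draw_charts(x, y) on a machine with Matplotlib installed opens the three plots in one window.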
Output