
SKYLINE UNIVERSITY COLLEGE

School of Computing
University City of Sharjah

COURSE CODE SIT4112


COURSE NAME BIG DATA ANALYTICS
STUDENT ID 17084
STUDENT NAME MOUFI OWAIDA

BIG DATA ANALYTICS


Lab Manual
SIT4112
SKYLINE UNIVERSITY COLLEGE
University City of Sharjah

EXPERIMENT INDEX

COURSE LEARNING OUTCOMES

Upon the successful completion of the course, the student will be able to:
CLO1: Describe the tools and methods suitable for big data analysis.
CLO2: Analyze methods to identify, classify and extract information from
unstructured datasets.
CLO3: Evaluate the techniques to interpret data using visual information.

Experiment #1   Experiment #2   Experiment #3   Experiment #4   Experiment #5   Experiment #6   Experiment #7   Experiment #8
CLO3            CLO3            CLO3            CLO3            CLO3            CLO3            CLO3            CLO3


MARKS BREAKDOWN

Question                      CLO Covered    Max Marks    Marks Obtained

Lab Manual - Experiment-1     CLO3
Lab Manual - Experiment-2     CLO3
Lab Manual - Experiment-3     CLO3
Lab Manual - Experiment-4     CLO3
Lab Manual - Experiment-5     CLO3
Lab Manual - Experiment-6     CLO3
Lab Manual - Experiment-7     CLO3
Lab Manual - Experiment-8     CLO3

Total Marks

Instructor Sign


EXPERIMENT NO: 1

Name of experiment: Installing Hadoop

Installing Hadoop involves several steps. Below are the high-level steps for installing Hadoop on a
Unix-like system (e.g., Linux) or the Windows Subsystem for Linux (WSL) on Windows. Installing
Hadoop directly on Windows can be challenging, so using WSL is recommended for Windows users.

Prerequisites:

1. Java: Ensure you have Java (preferably Java 8 or later) installed on your system. You can
check by running java -version.

Installation Steps:

1. Installing Java

- Go to the Oracle Java download page.
- Choose Java 8 for Windows and download Java.
- Create a Java folder on the C drive and install Java into it.
- Cut the jdk folder from Program Files and paste it inside the Java folder on C.
- Edit the system environment variables.
- Create a NEW environment variable named JAVA_HOME with the value C:\Java\jdk1.8.0_281.
- EDIT the Path variable and add the NEW entry C:\Java\jdk1.8.0_281\bin.
- Verify Installation:
  Open cmd and check the Java version (java -version).

Download Hadoop:

1. Visit the Apache Hadoop website (https://hadoop.apache.org/) and download the Hadoop
distribution that suits your needs. Choose a stable version and download the tarball (e.g.,
hadoop-x.y.z.tar.gz).

2. Extract Hadoop:

1. Extract the downloaded tarball to your preferred directory (e.g., the C drive) and name the
extracted folder hadoop.

2. Configure Hadoop files:


3. Navigate to the Hadoop installation directory and configure the Hadoop environment by
editing the etc/hadoop/hadoop-env.sh file. Set the JAVA_HOME variable to C:\Java\jdk1.8.0_281.

4. For example: JAVA_HOME=C:\Java\jdk1.8.0_281

- Edit the system environment variables.
- Create a NEW environment variable named HADOOP_HOME with the value C:\hadoop.
- EDIT the Path variable and add the NEW entries C:\hadoop\bin and C:\hadoop\sbin.
- Verify Installation:
  o Open cmd and check Hadoop:
  o hadoop version

3. Edit Hadoop Configuration Files:

1. Hadoop's configuration files are located in the etc/hadoop directory. You'll need to
modify these files to suit your cluster setup. Key configuration files include:

1. core-site.xml: Configure core properties such as the Hadoop distributed file system URI.

# For core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

- hdfs-site.xml: Configure HDFS-specific properties such as the block size and replication factor.
  Create a folder named data inside the Hadoop directory.
  Inside the data folder, create namenode and datanode folders.

# For hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>C:\hadoop\data\namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>C:\hadoop\data\datanode</value>
</property>

- mapred-site.xml: Configure properties related to the MapReduce framework.

# For mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

- yarn-site.xml: Configure properties for the Hadoop resource manager.

# For yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

You can find templates for these files in the etc/hadoop directory. Copy them to the
same directory and make your changes.

Fix of Hadoop ‘bin’ Folder

Now, you need to fix some configuration files. To do this, replace the Hadoop bin folder with
another bin folder that already contains all the files properly configured. First, download the
compressed file (hadoop3_xFixedbin.rar). Then delete the existing bin folder and replace it with the
extracted one.

Initialize HDFS:

Before starting Hadoop services, you need to initialize the HDFS filesystem. Run the following
command:

hdfs namenode -format

Start Hadoop Services:

- You can start Hadoop services using the sbin/start-all.sh script (start-all.cmd on Windows). This
will start HDFS, YARN (ResourceManager and NodeManager), and other components.

Verify Installation:

- Open a web browser and go to the Hadoop ResourceManager web interface at
http://localhost:8088 to verify that the ResourceManager is running.

- You can also use Hadoop shell commands (hadoop fs -ls /) to interact with HDFS
and ensure it is working as expected.


Manage Hadoop Services:

- You can stop all Hadoop services using the sbin/stop-all.sh script.

Optional: If you plan to set up a multi-node cluster, you'll need to configure each node in a similar
manner. Ensure that all nodes can communicate with each other and update the configuration files
accordingly.

EXPERIMENT NO: 2

Name of experiment: HDFS Commands

HDFS is the primary storage component of the Hadoop ecosystem. It is responsible for storing
large structured or unstructured data sets across various nodes and for maintaining the
metadata in the form of log files. To use the HDFS commands, first you need to start the Hadoop
services using the following command:

sbin/start-all.sh

OR

sbin\start-all.cmd

To check that the Hadoop services are up and running, use the following command:

jps

Commands:

1. ls: This command is used to list all the files.

Syntax:

bin/hdfs dfs -ls <path>

Example:

bin/hdfs dfs -ls /

It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs
means we want the hdfs executable, specifically its dfs (Distributed File System) commands.

2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s
first create it.


Syntax:

bin/hdfs dfs -mkdir <folder name>

Creating the home directory:

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username   -> write the username of your computer

Example:
bin/hdfs dfs -mkdir /myfolder   => '/' means absolute path

3. touchz: It creates an empty file.


Syntax:

bin/hdfs dfs -touchz <file_path>


Example:
bin/hdfs dfs -touchz /myfolder/myfile.txt

4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.

Syntax:

bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder geeks
present on hdfs.

bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /geeks


5. cat: To print file contents.

Syntax:

bin/hdfs dfs -cat <path>

Example:

// print the content of AI.txt present

// inside geeks folder.

bin/hdfs dfs -cat /geeks/AI.txt

6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.

Syntax:

bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

Example:

bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)

bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

myfile.txt from geeks folder will be copied to folder hero present on Desktop.

7. moveFromLocal: This command will move file from local to hdfs.

Syntax:

bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>

Example:

bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks


8. cp: This command is used to copy files within hdfs. Let's copy folder geeks to geeks_copied.

Syntax:

bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>

Example:

bin/hdfs dfs -cp /geeks /geeks_copied

9. mv: This command is used to move files within hdfs. Let's cut-paste the file myfile.txt from the geeks
folder to geeks_copied.

Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>

Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you
want to delete a non-empty directory.

Syntax:

bin/hdfs dfs -rmr <filename/directoryName>

Example:

bin/hdfs dfs -rmr /geeks_copied   -> It will delete all the content inside the directory and then the directory itself.


11. du: It will give the size of each file in the directory.

Syntax:

bin/hdfs dfs -du <dirName>

Example:

bin/hdfs dfs -du /geeks

12. dus: This command will give the total size of the directory/file.

Syntax:

bin/hdfs dfs -dus <dirName>

Example:

bin/hdfs dfs -dus /geeks

13. stat: It will give the last modified time of the directory or path. In short, it gives the stats of the
directory or file.

Syntax:

bin/hdfs dfs -stat <hdfs file>

Example:

bin/hdfs dfs -stat /geeks


EXPERIMENT NO: 3

Name of experiment: Running a MapReduce Job

To run a MapReduce job to find the word count.

MapReduce in Hadoop is a software framework that makes it easy to write applications that
process huge amounts of data. MapReduce distributes the workload among various nodes,
reducing processing time because the data on which the computation needs to be done is divided
into small chunks and processed individually. Through MapReduce you can achieve parallel
processing, resulting in faster execution of the job.

In the MapReduce word-count example, the framework splits the input data, sorts the map outputs,
and feeds them as input to the reduce tasks. A file system stores the input and output of the jobs, while
scheduling tasks, monitoring them, and re-executing failed tasks are handled by the framework.

Figure below shows the architecture as well as working of MapReduce with an example:


Splitting: The input can be split on any delimiter, such as a comma, a space, a new line, or a semicolon.

Mapping: Each split is converted into key/value pairs; for word count, every word is emitted with a count of 1.

Shuffle/Intermediate splitting: This process usually runs in parallel across the cluster. The output of the map
phase is grouped by key, so all values for the same key are brought together before the Reducer phase.

Reducing: The values for each key are combined; for word count, the counts for each word are summed.
Final result: all the reduced outputs are combined to produce the final result.

First, we will import our dataset into the HDFS (Hadoop Distributed File System).

The dataset can be a simple txt file with some words or sentences written in it.

Please follow these steps to import the dataset:

Step 1. Download and extract the compressed file of the text-based dataset from a source and save
it as a txt file in your local storage.

Step 2. Create a directory in the Hadoop HDFS using the hdfs dfs -mkdir command.

Step 3. Copy the txt file in HDFS from the local directory.


Step 4: Run MapReduceClient.jar, providing the input and output directories (an illustrative Python alternative using Hadoop Streaming is sketched after these steps).

Step 5: Verify content of the copied file.
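
As a hedged illustration of what the word-count job in Step 4 computes, the same result can be obtained with Hadoop Streaming and two small Python scripts. This is a sketch, not the manual's MapReduceClient.jar code; the file names mapper.py and reducer.py and the HDFS paths in the comments are assumptions.

# mapper.py - reads lines from stdin and emits "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts per word; Hadoop Streaming delivers the mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Example invocation (jar path and HDFS directories are placeholders):
# hadoop jar hadoop-streaming.jar -input /wordcount/input -output /wordcount/output \
#   -mapper "python mapper.py" -reducer "python reducer.py" -file mapper.py -file reducer.py

Because Hadoop Streaming sorts the mapper output by key before it reaches the reducer, the reducer only needs to compare consecutive keys to detect word boundaries.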

EXPERIMENT NO: 4

Name of experiment: Understanding Big Data Analytics Types.

In this lab, students will import different files and perform the following tasks:

● Data Preparation

● Descriptive Statistics

● Correlation Matrix

● Data Visualization using Histogram, Scatter Plot, Bar Chart, etc.

Code and Output
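
The notebook code and output for this experiment exist only as screenshots in the original. The following is a minimal illustrative sketch of the four tasks listed above using pandas and matplotlib; the file name data.csv and the column names col_a, col_b and col_c are placeholders, not names from the original notebook.

import pandas as pd
import matplotlib.pyplot as plt

# Data Preparation (file name and column names below are placeholders)
df = pd.read_csv('data.csv')
df = df.dropna()              # drop rows with missing values
print(df.head())

# Descriptive Statistics
print(df.describe())

# Correlation Matrix (numeric columns only)
print(df.corr(numeric_only=True))

# Data Visualization
df.hist(figsize=(10, 6))                       # histograms of the numeric columns
df.plot.scatter(x='col_a', y='col_b')          # scatter plot of two numeric columns (placeholders)
df['col_c'].value_counts().plot.bar()          # bar chart of a categorical column (placeholder)
plt.show()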


EXPERIMENT NO: 5

Name of experiment: Clustering

Goal: Understanding the intuition of Clustering using K-means clustering

Theory: Imagine a dataset consisting of several points spread over an n-dimensional space. To find
patterns among data points in an n-dimensional space, we use unsupervised methods.

One of the most popular unsupervised methods is clustering. Clustering is the task of grouping
together a set of objects such that objects in the same cluster are more similar to each other than
to objects in different clusters.

Clustering algorithms can be categorized based on their cluster model, in other words on how they
form clusters or groups. Some of the prominent clustering algorithms are connectivity-based
clustering, centroid-based clustering, distribution-based clustering, and density-based methods.

In this exercise, centroid-based clustering is implemented. In this type of clustering, clusters are
represented by a central vector or centroid. This centroid might not necessarily be a member of
the dataset. It is an iterative clustering algorithm where the notion of similarity is derived from how
close a data point is to the centre of the cluster. In this exercise, we will be working on mall
customers data.

Software Tools: Python - Jupyter Notebook


Procedure:

1. Import the required libraries and load dataset

2. Use scatter plot to visualize the expected clusters.

3. Apply KMeans to predict the clusters.

4. Pre-process the dataframe using MinMaxScaler.

5. Predict the clusters again and analyze the issues encountered while preparing the clusters.

6. Use the Elbow technique to show the number of clusters that can be formed.

CODE + DISCUSSION
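
The notebook code for this experiment is included only as screenshots. Below is a minimal sketch of the six procedure steps using scikit-learn; the file name Mall_Customers.csv and the two column names are assumptions based on the commonly used mall customers dataset, not taken from the original notebook.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# 1. Load the dataset (file and column names are placeholders)
df = pd.read_csv('Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# 2. Scatter plot to visualize the expected clusters
plt.scatter(X.iloc[:, 0], X.iloc[:, 1])
plt.xlabel('Annual Income (k$)'); plt.ylabel('Spending Score (1-100)')
plt.show()

# 3. Apply KMeans on the raw features
df['cluster_raw'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# 4. Pre-process with MinMaxScaler so both features contribute equally to the distance
X_scaled = MinMaxScaler().fit_transform(X)

# 5. Predict the clusters again on the scaled data
df['cluster_scaled'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_scaled)

# 6. Elbow technique: plot inertia (sum of squared distances) for k = 1..10
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k'); plt.ylabel('Inertia')
plt.show()

The "elbow" in the inertia curve, where adding more clusters stops reducing inertia substantially, suggests a reasonable number of clusters.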


Output


EXPERIMENT NO: 6

Goal: Introduction to Market Basket Analysis using the Apriori Algorithm.

Theory: Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It works by looking for combinations of items that occur together
frequently in transactions. To put it another way, it allows retailers to identify relationships between
the items that people buy.

Software Tools: Python - Jupyter Notebook

Example: You are a data scientist (or becoming one!), and you get a client who runs a retail store.
Your client gives you data for all transactions, consisting of the items bought in the store by several
customers over a period of time, and asks you to use that data to help boost their business. Your
client will use your findings not only to change/update/add items in inventory but also to change
the layout of the physical store, or of an online store. To find results that will help your client, you
will use Market Basket Analysis (MBA), which uses Association Rule Mining on the given
transaction data.

Procedure:

Using Python

1. Read the data into a variable and check the descriptive information of the data.
2. Start with item sets containing just a single item.
3. Determine the support for the item sets. Keep the item sets that meet your minimum support
   threshold and remove those that do not.
4. Using the item sets kept from Step 3, generate all the possible item-set configurations.
5. Repeat Steps 3 and 4 until there are no more new item sets.

A small worked example of these support calculations follows; the notebook code appears under CODE & OUTPUT below.
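
To make the support-counting steps concrete, here is a small sketch with made-up transactions (not the manual's dataset) that computes item-set support directly:

from itertools import combinations

# Toy transactions (made-up for illustration)
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'milk', 'butter', 'bread'},
    {'milk'},
]
min_support = 0.5  # keep item sets present in at least half of the transactions

def support(itemset):
    # fraction of transactions that contain every item in the item set
    return sum(itemset <= t for t in transactions) / len(transactions)

# Steps 2-3: single items that meet the threshold
items = {i for t in transactions for i in t}
frequent_1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Step 4: build 2-item candidates from the surviving items and filter again
candidates_2 = [a | b for a, b in combinations(frequent_1, 2)]
frequent_2 = [c for c in candidates_2 if support(c) >= min_support]

print(frequent_1)   # frequent single items
print(frequent_2)   # frequent 2-item sets

With min_support = 0.5, only item sets appearing in at least half of the transactions survive each pass.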

CODE & OUTPUT

!pip install squarify

-------------------------------------

import numpy as np

import pandas as pd

import squarify

import matplotlib.pyplot as plt

#importing the packages required for market basket analysis

from mlxtend.frequent_patterns import apriori

from mlxtend.frequent_patterns import association_rules

from mlxtend.preprocessing import TransactionEncoder

-------------------------------------

df = pd.read_csv('MBA.csv', header=None)

df.head()
-------------------------------------

df.columns
-------------------------------------

len(df.columns)
-------------------------------------

# Collect the value counts of each column as rows of one dataframe
df_result = pd.DataFrame()

for i in range(len(df.columns)):
    df_result = pd.concat([df_result, df[i].value_counts().rename(i).to_frame().T])
-------------------------------------

df_result
-------------------------------------

df_sum = df_result.sum()

df_sum = df_sum.sort_values(ascending=False)

df_sum

-------------------------------------

plt.figure(figsize=(15,10))

count = 40

plt.title('Frequency Plot')

color = plt.cm.spring(np.linspace(0,1,count))
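
The remaining notebook cells appear only as screenshots in the original. As a hedged reconstruction (not the manual's exact code), the frequency plot started above and the Apriori / association-rules steps using the already-imported mlxtend functions might look like this; the support and lift thresholds are arbitrary starting values:

# Finish the frequency plot with squarify (treemap of the top `count` items)
squarify.plot(sizes=df_sum.head(count).values, label=df_sum.head(count).index,
              alpha=0.8, color=color)
plt.axis('off')
plt.show()

# One-hot encode the transactions for mlxtend
transactions = df.apply(lambda row: [item for item in row if pd.notna(item)], axis=1).tolist()
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
basket = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent item sets and association rules
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)
print(rules.sort_values('lift', ascending=False).head())

In practice, min_support and the lift threshold are tuned to the dataset; values that are too low produce an unmanageable number of rules, and values that are too high discard useful associations.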


Output


EXPERIMENT NO: 7

Goal: Applying Linear Regression on Vehicles data.

Theory: Linear regression is based on least-squares estimation, which says that the regression
coefficients (estimates) should be chosen so as to minimize the sum of the squared distances
between each observed response and its fitted value.
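
Written out, the least-squares criterion chooses the coefficients b0, b1, ..., bk that minimize the residual sum of squares:

RSS(b) = Σ_i (y_i − ŷ_i)² = Σ_i (y_i − (b0 + b1·x_i1 + ... + bk·x_ik))²

where y_i is the observed response and ŷ_i is its fitted value.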

Software Tools: Python - Jupyter Notebook

Example: In this experiment, we will be working with a vehicle dataset to understand the relation
between labor cost (the dependent variable) and other variables, such as labor hours, using
linear regression.

Procedure:

Using Python

1. Read the data into the variable vehicle and check the summary of the data.
2. Analyze the dependent and independent variables and create a scatter plot.
3. Apply the Linear Regression model.
4. Create the Linear Regression model for the response variable.
5. Analyze the summary of the result and remove non-significant variable(s).
6. Re-run the model.
7. Predict the value using the reduced model.


Code & Output
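
The code and output for this experiment exist only as screenshots in the original. A minimal illustrative sketch of the procedure using statsmodels follows; the file name vehicle.csv and the column names lc (labor cost), lh (labor hours) and mileage are placeholders, not names from the original notebook.

import pandas as pd
import statsmodels.api as sm

# 1. Read the data and check the summary (file/column names are placeholders)
vehicle = pd.read_csv('vehicle.csv')
print(vehicle.describe())

# 2. Scatter plot of the dependent variable against one predictor
vehicle.plot.scatter(x='lh', y='lc')

# 3-4. Fit a linear regression model for the response variable lc
X = sm.add_constant(vehicle[['lh', 'mileage']])
model = sm.OLS(vehicle['lc'], X).fit()
print(model.summary())          # 5. inspect p-values and drop non-significant variables

# 6. Re-run the model with only the significant predictor(s)
X_reduced = sm.add_constant(vehicle[['lh']])
reduced = sm.OLS(vehicle['lc'], X_reduced).fit()

# 7. Predict using the reduced model (e.g. labor cost for 10 labor hours)
new_obs = pd.DataFrame({'const': [1.0], 'lh': [10]})
print(reduced.predict(new_obs))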


EXPERIMENT NO: 8

Goal: Decision Tree for Classifying Mushrooms as safe to eat or poisonous.

Theory: A decision tree is a decision support tool that uses a tree-like graph or model of decisions
and their possible consequences, including chance event outcomes, resource costs, and utility. It
is one way to display an algorithm that only contains conditional control statements.

Software Tools: Python – Jupyter notebook

Example: This dataset originally comes from the UCI Machine Learning Repository. The main aim is
to classify whether a mushroom is edible (e) or poisonous (p). Each species in the dataset is
identified as definitely edible, definitely poisonous, or of unknown edibility.

The detailed descriptions of the attributes in the dataset are listed below:

cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

cap-color:brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y

bruises?: bruises=t,no=f

odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s

gill-attachment: attached=a,descending=d,free=f,notched=n

gill-spacing: close=c,crowded=w,distant=d

gill-size: broad=b,narrow=n

gill-color:black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y

stalk-shape: enlarging=e,tapering=t

stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?

stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s


stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

veil-type: partial=p,universal=u

veil-color: brown=n,orange=o,white=w,yellow=y

ring-number: none=n,one=o,two=t

ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z

spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y

population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y

habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

Here, classification is performed using two classes, i.e. edible (e) or poisonous (p).

A Logistic Regression approach is used for classification. Initially, the data is converted to numerical
format using LabelEncoder. Then the data is split into training and testing sets. Applying the
logistic regression model produces 95% accuracy. The overall impact of every parameter on the
classes is shown using a heatmap.

Target Variable = Class (e: edible, p: poisonous)

Procedure:

Using Python

- Read the data into a variable and check the summary of the data.
- Clean the dataset and remove irrelevant columns.
- Partition the data into training and test datasets.
- Create a classification tree from the training data.
- Use the predict function with the validation dataset to evaluate the learned model.
- Display the misclassification error for the validation and training datasets.


Code & Output
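
The original notebook appears only as screenshots. Below is a minimal illustrative sketch of the procedure with scikit-learn; the file name mushrooms.csv, the column name class, and the choice of columns to drop are assumptions for the sketch.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Read the data and check its summary (file name is a placeholder)
df = pd.read_csv('mushrooms.csv')
print(df.describe())

# Clean: drop columns that carry no information (veil-type is assumed constant here)
df = df.drop(columns=['veil-type'], errors='ignore')

# Convert every categorical column to numeric codes with LabelEncoder
df = df.apply(LabelEncoder().fit_transform)

X = df.drop(columns=['class'])
y = df['class']                      # target: edible (e) vs poisonous (p)

# Partition into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a classification tree from the training data
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Misclassification error on the training and validation data
train_err = 1 - tree.score(X_train, y_train)
val_err = 1 - tree.score(X_val, y_val)
print(f"train misclassification error: {train_err:.3f}")
print(f"validation misclassification error: {val_err:.3f}")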


Output


EXPERIMENT NO: 9


Goal: Data Visualization

Python provides various libraries for visualizing data, each with different features and support for
various types of graphs. In this tutorial, we will be discussing four such libraries:

Matplotlib
Seaborn
Bokeh
Plotly

Scatter Plot

Line Chart


Histogram
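
The plot screenshots are not reproduced here. A minimal sketch of the three plots named above, using Matplotlib with made-up data, might look like this:

import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration
x = np.arange(0, 10, 0.5)
y = x ** 2 + np.random.randn(len(x)) * 3

# Scatter plot
plt.scatter(x, y)
plt.title('Scatter Plot'); plt.show()

# Line chart
plt.plot(x, y)
plt.title('Line Chart'); plt.show()

# Histogram
plt.hist(np.random.randn(1000), bins=30)
plt.title('Histogram'); plt.show()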


Output
