DSBDA LAB MANUAL
TE-COMPUTER ENGINEERING
SEMESTER-VI
1 Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv):
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g. https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use describe() to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you do in the above steps and explain everything that you do to import/read/scrape the data set.
2 Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations using Python:
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution.
Title: Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv).
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g. https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., the URL of the website).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas
isnull(), describe() function to get some initial statistics. Provide variable
descriptions. Types of variables etc. Check the dimensions of the data
frame.
5. Data Formatting and Data Normalization: Summarize the types of
variables by checking the data types (i.e., character, numeric, integer,
factor, and logical) of the variables in the data set. If variables are not in
the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you
do in the above steps and explain everything that you do to
import/read/scrape the data set
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g. https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., the URL of the website).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas isnull(), describe() function to get
some initial statistics. Provide variable descriptions. Types of variables etc. Check the dimensions of the data
frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types
(i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the
correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in the above steps and explain
everything that you do to import/read/scrape the data set.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
• Objectives:
To learn the concepts of data wrangling and to import, read, and scrape datasets in order to extract meaningful insights.
Data Wrangling:
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for
better decision making in less time.
Python3
# Import pandas and assign sample data (values are illustrative)
import pandas as pd
data = {'Name': ['Sam', 'Riya', 'Arun', 'Tina'],
        'Age': [17, 18, 17, 18],
        'Gender': ['M', 'F', 'M', 'F'],
        'Marks': [90, 'NaN', 74, 'NaN']}
df = pd.DataFrame(data)
# Display data
df
Output:
Python3
# Compute the average of the numeric Marks entries
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c
# Replace the "NaN" placeholders with the computed average
df = df.replace(to_replace="NaN", value=avg)
# Display data
df
Output:
Python3
# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0,
'F': 1, }).astype(float)
# Display data
df
Output:
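As an alternative to mapping, categorical variables can also be turned into quantitative ones by one-hot encoding with pandas get_dummies; a minimal sketch on the same illustrative DataFrame (run it before the mapping step, while Gender still holds 'M'/'F'):
df_onehot = pd.get_dummies(df, columns=['Gender'])
df_onehot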
Python3
df = df.drop(['Age'], axis=1)
# Display data
df
Output:
Hence, we have finally obtained a clean dataset which can be used further for various purposes.
Now that we know the basics of data wrangling, below we discuss various operations with which
data wrangling can be performed:
Python3
# Import module
import pandas as pd
# Creating a DataFrame of student details (values are illustrative)
details = pd.DataFrame({
    'ID': [101, 102, 103, 104],
    'NAME': ['Ram', 'Meena', 'John', 'Sara'],
    'BRANCH': ['CSE', 'CSE', 'IT', 'IT']})
# Printing details
print(details)
Python3
# Import module
import pandas as pd
# Creating a DataFrame of fees status (values are illustrative)
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104],
     'PENDING': [5000, 250, 0, 1000]})
# Printing fees_status
print(fees_status)
Python3
# Import module
import pandas as pd
# Creating the two DataFrames again (same illustrative values as above)
details = pd.DataFrame({
    'ID': [101, 102, 103, 104],
    'NAME': ['Ram', 'Meena', 'John', 'Sara'],
    'BRANCH': ['CSE', 'CSE', 'IT', 'IT']})
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104],
     'PENDING': [5000, 250, 0, 1000]})
# Merging the DataFrames on the common ID column
print(pd.merge(details, fees_status, on='ID'))
Output:
Python3
# Import module
import pandas as pd
# Creating data (Brand and Year values are illustrative)
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
                              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra',
                              'Mahindra', 'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013, 2010, 2011,
                             2011, 2010, 2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}
df = pd.DataFrame(car_selling_data)
# Printing the DataFrame
print(df)
Output:
Python3
# Import module
import pandas as pd
# Creating data (same illustrative dictionary as above)
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
                              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra',
                              'Mahindra', 'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013, 2010, 2011,
                             2011, 2010, 2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}
df = pd.DataFrame(car_selling_data)
# Grouping the rows by Year and printing the group for 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))
Output:
Python3
# Import module
import pandas as pd
# Initializing data (Name and Roll_no values are illustrative; note the repeated roll numbers)
student_data = {'Name': ['Amit', 'Riya', 'Aman', 'Rahul', 'Divya',
                         'Sneha', 'Amit', 'Rahul', 'Kiran', 'Pooja'],
                'Roll_no': [23, 54, 29, 36, 59, 38, 23, 36, 12, 45],
                'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com', 'xx@gmail.com',
                          'xxxx@gmail.com', 'xxxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxxxxxx@gmail.com']}
df = pd.DataFrame(student_data)
# Printing the DataFrame
print(df)
Output:
Python3
# Import module
import pandas as pd
# Initializing data (same illustrative dictionary as above)
student_data = {'Name': ['Amit', 'Riya', 'Aman', 'Rahul', 'Divya',
                         'Sneha', 'Amit', 'Rahul', 'Kiran', 'Pooja'],
                'Roll_no': [23, 54, 29, 36, 59, 38, 23, 36, 12, 45],
                'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com', 'xx@gmail.com',
                          'xxxx@gmail.com', 'xxxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxxxxxx@gmail.com']}
# Creating DataFrame
df = pd.DataFrame(student_data)
# Keeping only the rows whose Roll_no is not duplicated
non_duplicate = df[~df.duplicated('Roll_no')]
print(non_duplicate)
Output:
Title: Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are
missing values and/or inconsistencies, use any of the suitable techniques
to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of
the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The
purpose of this transformation should be one of the following reasons: to
change the scale for better understanding of the variable, to convert a
non-linear relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution. Reason and document
your approach properly.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations using Python.
• 1. Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any of the suitable techniques to deal with them.
• 2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with
them.
• 3. Apply data transformations on at least one of the variables. The purpose of this transformation should be
one of the following reasons: to change the scale for better understanding of the variable, to convert a non-
linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal
distribution. Reason and document your approach properly.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature,
fill the data by considering the range specified above.
One example is given:
Step 5: To violate the rule of the response variable, update a few values. If the placement score is
greater than 85, set the offer count to only 1.
1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN: NaN (an acronym for Not a Number) is a special floating-point value
recognized by all systems that use the standard IEEE floating-point representation.
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
1. Checking for missing values using isnull() and notnull()
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a boolean series that is True for NaN values of a specific column, for example
'math score', and display only the rows where 'math score' is NaN.
series = pd.isnull(df["math score"])
df[series]
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a boolean series that is True for non-NaN values of a specific column, for example
'math score', and display only the rows where 'math score' is not NaN.
series1 = pd.notnull(df["math score"])
df[series1]
In order to fill null values in a dataset, the fillna() and replace() functions are used. These
functions replace NaN values with some value of their own. All these functions
help in filling null values in the datasets of a DataFrame.
The following line will replace NaN values in the DataFrame with the value -99:
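A minimal sketch, assuming the DataFrame df loaded in the steps above:
df.fillna(value=-99)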
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: To drop rows with at least one null value
df.dropna()
Similarly, an Outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.
Mean is the accurate measure to describe the data when we do not have any outliers
present. Median is used if there is an outlier in the dataset. Mode is used if there is an outlier
AND about ½ or more of the data is the same.
‘Mean’ is the only measure of central tendency that is affected by the outliers which
in turn impacts Standard deviation.
From the above calculations, we can clearly say the Mean is more affected than the Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)
Step 4:Select the columns for boxplot and draw the boxplot.
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
Algorithm:
Step 1: Import numpy and stats from the scipy library
import numpy as np
from scipy import stats
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formulas, following the usual statistical convention, the IQR is scaled up by 0.5
(new_IQR = IQR + 0.5*IQR, i.e. 1.5*IQR) on each side of the quartiles.
Algorithm:
Step 1 : Import numpy library
import numpy as np
Step 2: Sort Reading Score feature and store it into sorted_rscore.
sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3
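A minimal sketch of the quartile computation, using numpy's percentile function on the sorted scores:
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1, q3)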
Step 5: Insert the modified score series as a new column of the student DataFrame (here b is assumed to hold the outlier-treated scores from the preceding steps of the manual)
df_stud.insert(1, "m score", b, True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
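A minimal sketch of median imputation, assuming upper and lower hold the IQR bounds computed above:
import numpy as np
median = df['reading score'].median()
df['reading score'] = np.where((df['reading score'] > upper) |
                               (df['reading score'] < lower),
                               median, df['reading score'])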
1. Plot the box plot for reading score
col = ['reading score']
df.boxplot(col)
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Descriptive Statistics - Measures of Central Tendency and variability
Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age,
income etc.) with numeric variables grouped by one of the qualitative (categorical) variable. For example, if
your categorical variable is age groups and quantitative variable is income, then provide summary statistics
of income grouped by the age groups. Create a list that contains a numeric value for each response to the
categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation
etc. of the species ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’ of the iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms,
descriptive statistics can be defined as the measures that summarize a given data, and these
measures can be broken down further into the measures of central tendency and the measures
of dispersion. Measures of central tendency include mean, median, and the mode, while the
measures of variability include standard deviation, variance, and the interquartile range. In
this guide, you will learn how to compute these measures of descriptive statistics and use
them to interpret the data.
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
We will begin by loading the dataset to be used in this guide.
Data:
In this guide, we will be using fictitious data of loan applicants containing 10 variables,
as described below:
Measures of central tendency describe the center of the data, and are often represented by the
mean, the median, and the mode.
Mean: Mean represents the arithmetic average of the data. The line of code below prints the
mean of the numerical variables in the data. From the output, we can infer that the average
age of the applicant is 49 years, the average annual income is USD 705,541, and the average
tenure of loans is 183 months. The command df.mean(axis = 0) will also give the same
output.
df.mean()
Output:
Dependents        0.748333
Median
In simple terms, median represents the 50th percentile, or the middle value of the data, that
separates the distribution into two halves. The line of code below prints the median of the
numerical variables in the data. The command df.median(axis = 0) will also give the same
output.
df.median()
Output:
Dependents          0.0
Income         508350.0
Loan_amount       700.0
Mode: Mode represents the most frequent value of a variable in the data. This is the only
central tendency measure that can be used with categorical variables, unlike the mean and the
median which can be used only with quantitative data.
The line of code below prints the mode of all the variables in the data. The .mode() function
returns the most common value or most repeated value of a variable. The
command df.mode(axis = 0) will also give the same output.
df.mode()
Output:
The first row of the output gives the most frequent value of each variable (Marital_status, Dependents, Is_graduate, Income, Loan_amount, Term_months, ...).
Measures of Dispersion:
In the previous sections, we have discussed the various measures of central tendency. However,
as we have seen in the data, the values of these measures differ for many variables. This is
because of the extent to which a distribution is stretched or squeezed. In statistics, this is
measured by dispersion which is also referred to as variability, scatter, or spread. The most
popular measures of dispersion are standard deviation, variance, and the interquartile range.
Standard Deviation:
Standard deviation is a measure that is used to quantify the amount of variation of a set of
data values from its mean. A low standard deviation for a variable indicates that the data points
tend to be close to its mean, and vice versa. The line of code below prints the standard deviation
of all the numerical variables in the data.
df.std()
Output:
Dependents          1.026362
Income         711421.814154
Loan_amount    724293.480782
Term_months        31.933949
Age                14.728511
dtype: float64
While interpreting standard deviation values, it is important to understand them in conjunction
with the mean. For example, in the above output, the standard deviation of the variable 'Income'
is much higher than that of the variable 'Dependents'. However, the unit of these two variables
is different and, therefore, comparing the dispersion of these two variables on the basis of
standard deviation alone will be incorrect. This needs to be kept in mind.
It is also possible to calculate the standard deviation of a particular variable, as shown in the
first two lines of code below. The third line calculates the standard deviation for the first five
rows.
print(df.loc[:,'Age'].std())
print(df.loc[:,'Income'].std())
print(df.loc[0:4,'Age'].std())
Variance
Variance is another measure of dispersion. It is the square of the standard deviation and the
covariance of the random variable with itself. The line of code below prints the variance of
all the numerical variables in the dataset. The interpretation of the variance is similar to that
of the standard deviation.
df.var()
Output:
Dependents     1.053420e+00
Income         5.061210e+11
Loan_amount    5.246010e+11
Term_months    1.019777e+03
Age            2.169290e+02
dtype: float64
The Interquartile Range (IQR) is a measure of statistical dispersion, and is calculated as the
difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
The IQR is also a very important measure for identifying outliers and could be visualized using
a boxplot.
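A minimal sketch of the IQR computation for the numerical variables, assuming the same DataFrame df:
iqr = df.quantile(0.75) - df.quantile(0.25)
print(iqr)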
Skewness
Another useful statistic is skewness, which is the measure of the symmetry, or lack of it, for a
real-valued random variable about its mean. The skewness value can be positive, negative, or
undefined. In a perfectly symmetrical distribution, the mean, the median, and the mode will
all have the same value. However, the variables in our data are not symmetrical, resulting in
different values of the central tendency.
We can calculate the skewness of the numerical variables using the skew() function, as
shown below.
print(df.skew())
Output:
Dependents     1.169632
Income         5.344587
Loan_amount    5.006374
Term_months   -2.471879
Age           -0.055537
dtype: float64
The skewness values can be interpreted in the following manner:
• Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
• Moderately skewed distribution: If the skewness value is between −1 and −½ or
between +½ and +1.
• Approximately symmetric distribution: If the skewness value is between −½ and +½.
We have learned the measures of central tendency and dispersion in the previous sections. It
is not necessary to compute each of them individually, however, because there are useful functions
in Python that return several of these values at once. One such important function is
the .describe() function, which prints the summary statistics of the numerical variables. The line
of code below performs this operation on the data.
df.describe()
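The problem statement also asks for these summaries grouped by a categorical variable (part 1) and per Iris species (part 2). A minimal sketch, assuming pandas is imported as pd, hypothetical columns 'Age_group' and 'Income' for the grouped summary, and a 'Species' column in iris.csv:
df.groupby('Age_group')['Income'].describe()
income_by_group = df.groupby('Age_group')['Income'].mean().tolist()
iris = pd.read_csv('iris.csv')
print(iris[iris['Species'] == 'Iris-setosa'].describe())
print(iris[iris['Species'] == 'Iris-versicolor'].describe())
print(iris[iris['Species'] == 'Iris-virginica'].describe())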
Conclusion:
In this guide, you have learned about the fundamentals of the most widely used descriptive
statistics and their calculations with Python. We covered the following topics in this guide:
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
It is important to understand the usage of these statistics and which one to use, depending on
the problem statement and the data.
Title: Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various houses in Boston through different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
• Create a Linear Regression Model using Python/R to predict home prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various
houses in Boston through different parameters. There are 506 samples and 14 feature variables in this
dataset.
• The objective is to predict the value of prices of the house using the given features.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
Let's Start
First we need to prepare our environment by importing some libraries.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# Importing DataSet and take a look at Data
BostonTrain = pd.read_csv("../input/boston_train.csv")
Here we can look at the BostonTrain data
In [3]:
BostonTrain.head()
Out[3]:
In [5]:
# The ID column is not relevant for our analysis.
BostonTrain.drop('ID', axis = 1, inplace=True)
In [6]:
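A minimal sketch of the kind of plot described next, scattering the average number of rooms per dwelling (rm) against the median value (medv):
BostonTrain.plot.scatter(x='rm', y='medv')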
In this plot it is clear to see a linear pattern: the higher the average number of rooms per
dwelling, the more expensive the median value is.
Now let's take a look at how all the variables relate to each other.
In [7]:
plt.subplots(figsize=(12,8))
sns.heatmap(BostonTrain.corr(), cmap = 'RdGy')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbe883530b8>
With this heatmap we can do our analysis better than with the pairplot.
Let's focus on the last row, where y = medv:
Shades of red/orange: the more red the colour is on the x axis, the smaller medv is. Negative
correlation.
Light colours: the variables on the x and y axes have no relation. Zero
correlation.
Shades of grey/black: the more black the colour is on the x axis, the higher medv is. Positive
correlation.
In [8]:
sns.pairplot(BostonTrain, vars = ['lstat', 'ptratio', 'indus', 'tax', 'crim', 'nox', 'rad', 'age', 'medv'])
Out[8]:
<seaborn.axisgrid.PairGrid at 0x7fbe88285c50>
In [9]:
In [10]:
X = BostonTrain[['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
'ptratio', 'black', 'lstat']]
y = BostonTrain['medv']
Import the sklearn libraries:
train_test_split, to split our data into two DataFrames, one to build the model and the other to validate it.
LinearRegression, to apply the linear regression.
In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
In [13]:
lm = LinearRegression()
lm.fit(X_train,y_train)
Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [14]:
predictions = lm.predict(X_test)
In [15]:
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
Out[15]:
Text(0,0.5,'Predicted Y')
In [16]:
from sklearn import metrics
In [17]:
sns.distplot((y_test-predictions),bins=50);
In [18]:
coefficients = pd.DataFrame(lm.coef_,X.columns)
coefficients.columns = ['coefficients']
coefficients
Out[18]:
coefficients
crim      -0.116916
zn         0.017422
indus     -0.001589
chas       3.267698
nox      -17.405512
rm         3.242758
age        0.006570
dis       -1.414341
rad        0.404683
tax       -0.013598
ptratio   -0.724007
black      0.007861
lstat     -0.711690
How to interpret these coefficients: they are expressed with respect to medv, so for each one-unit
increase in nox the house value decreases by |coefficient of nox|*1000 money units (negative correlation),
and for each one-unit increase in rm the house value increases by coefficient of rm*1000 money units (positive correlation).
The factor of 1000 appears because medv is expressed in units of $1000, and the same reasoning applies to the other variables/coefficients.
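The metrics module imported in cell In [16] is typically used to quantify these errors; a minimal sketch of that evaluation step (values will vary with the random split):
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('R2 :', metrics.r2_score(y_test, predictions))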
Python3
# Importing Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
# Importing Data
boston = load_boston()
Python3
boston.data.shape
Python3
boston.feature_names
Converting data from nd-array to data frame and adding feature names to the data
Python3
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data.head(10)
Python3
boston.target.shape
Python3
data.head()
Python3
data.describe()
Python3
data.info()
Python3
# Input Data
x = boston.data
y = boston.target
# Splitting the data into training and testing sets (the test_size value is assumed)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
Python3
# Fitting an ordinary least squares model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(xtrain, ytrain)
y_pred = regressor.predict(xtest)
Plotting Scatter graph to show the prediction results – ‘ytrue’ value vs ‘y_pred’ value
Python3
import matplotlib.pyplot as plt
plt.scatter(ytest, y_pred)
plt.xlabel("Price: in $1000's")
plt.ylabel("Predicted value")
plt.show()
As per the result, our model is only 66.55% accurate. So, the prepared model is not
very good for predicting housing prices. One can improve the prediction results using
many other possible machine learning algorithms and techniques.
Title: Data Analytics II
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.
2. Compute the Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall on the given dataset.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Data Analytics II
1. Implement logistic regression using Python/R to perform classification on Social_Network_Ads.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given
dataset.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('../input/Social_Network_Ads.csv')
dataset.head()
Sr. No. | User ID | Gender | Age | EstimatedSalary | Purchased
If we wanted to determine the effect of more independent variables on the outcome (such as
Gender), we would have to implement a Dimensionality Reduction aspect to the model
because we can only describe so many dimensions visually. However, right now we are only
worried about how the users' Age and Estimated Salary affect their decision to click or not
click on the advertisement. To do this, we will extract the relevant vectors from our dataset:
the independent variables (X) and the dependent variable (y). The following code segment performs the extraction and prints the first few rows:
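A minimal sketch of the extraction, assuming Age and EstimatedSalary occupy the third and fourth columns of the CSV and Purchased the fifth:
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values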
print(X[:3, :])
print('-'*15)
print(y[:3])
[[ 19 19000]
[ 35 20000]
[ 26 43000]]
-
[0 0 0]
We now need to split our data into two sets: a training set for the machine to learn from, as
well as a test set for the machine to execute on. This process is referred to as Cross Validation,
and we will be using SciKit Learn's appropriately named train_test_split class to
make it happen. Industry standard usually calls for a training set size of 70-80%, so we'll use a
75/25 split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train[:3])
print('-'*15)
print(y_train[:3])
print('-'*15)
print(X_test[:3])
print('-'*15)
print(y_test[:3])
[[ 44 39000]
[ 32 120000]
[ 38 50000]]
-
[0 1 0]
-
[[ 30 87000]
[ 38 50000]
[ 35 75000]]
-
[0 0 0]
To get the most accurate results, a common tool within machine learning models is to apply
Feature Scaling: "...a method used to standardize the range of independent variables or
features of data. In data processing, it is also known as data normalization and is generally
performed during the data preprocessing step." -
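A minimal sketch of that scaling step using scikit-learn's StandardScaler, applied to the Age and EstimatedSalary features extracted above:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)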
In [6]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, solver='lbfgs' )
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(X_test[:10])
In [7]:
print(y_pred[:20])
print(y_test[:20])
[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0]
Now that we've preprocessed the data, fit our classifier to the training set, and predicted the
dependent values for our test set, we can use a Confusion Matrix to evaluate exactly how
accurate our Logistic Regression model is. This function will compare the calculated results
in our y_pred vector to the actual observed results in y_test to determine how similar they
are. The more values that match, the higher the accuracy of the classifier.
In [8]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[65 3]
[ 8 24]]
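The problem statement also asks for Accuracy, Error rate, Precision and Recall; a minimal sketch that derives them from the same predictions:
from sklearn.metrics import accuracy_score, precision_score, recall_score
acc = accuracy_score(y_test, y_pred)
print('Accuracy  :', acc)
print('Error rate:', 1 - acc)
print('Precision :', precision_score(y_test, y_pred))
print('Recall    :', recall_score(y_test, y_pred))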
Conclusion: This Confusion Matrix tells us that there were 89 correct predictions and 11
incorrect ones, meaning the model overall accomplished an 89% accuracy rating. This is very
good and there are many ways to improve the model by parameter tuning and sample size
increasing, but those topics are outside the scope of this project. Our next step is to create
visualizations to compare the training set and the test set. As we've stated throughout this
discussion, seeing our data and being able to visualize our work in front of us is imperative to
understanding each step of the model. Charts and graphs will also help us explain our
findings in layman's terms so that others can comprehend the insights that we've derived and
they can implement our findings into their business plans. The Matplotlib library provides
some excellent tools to create visualizations so let's do that now. We'll start by plotting the
training set results amidst our classifier. Most of the code in the next cell is relatively
straightforward but feel free to visit https://matplotlib.org/ for more detail.
In [9]:
# Visualizing the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1,
step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step =
0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.6, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
This graph helps us see the clear correlations between the dependent and independent
variables. It is obvious that, as Age and Estimated Salary increase, each individual has a
higher likelihood of being green (they will click on the ad). Intuitively, this graph makes a lot
of sense because we can quickly tell that about 80-90% of the observations have been
correctly identified. There will almost always be some degree of error - or at least there
should be, otherwise our model is probably guilty of overfitting. Now let's map the test set
results to visualize where our Confusion Matrix came from.
In [10]:
# Visualizing the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1,
step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step =
0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.6, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
When we look at both models together, we can actually see that there is a shape to this data
that's becoming increasingly apparent as the number of observations increases. Notice that
the positive (or green) data points seem to be almost wrapping around the most crowded
area of red dots, inferring that we can probably improve our model and projections by
implementing a non-linear model (one with a classifier that isn't restricted to being a straight
line). The best X-intercept is probably closer to 1 than it is to 2 (as shown in this model), and
the y-intercept is likely between 2 and 3. But while there is always room for improvement,
we can be satisfied with this model as our final product. Our accuracy is high, but not so high
that we need to be suspicious of any overfitting. We can safely say that an increase in both
Age and Estimated Salary will lead to a higher probability of clicking the advertisement. As
new users sign-up for the website, we can use this model to quickly determine whether or not
to expose them to this particular ad or choose another that is more relevant to their profile.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given
dataset.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
For example, P(A), P(B), and P(C) are prior probabilities because while calculating P(A),
the occurrences of events B or C are not considered, i.e. no information about the occurrence of
any other event is used.
Conditional Probabilities:
We have a dataset with some features Outlook, Temp, Humidity, and Windy, and the
target here is to predict whether a person or team will play tennis or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If we
write the same formula in terms of classes and features, we get the following equation:
P(Ck | X) = P(X | Ck) * P(Ck) / P(X)
Now we have two classes and four features, so if we write this formula for class C1, it becomes:
P(C1 | X1 ∩ X2 ∩ X3 ∩ X4) = P(X1 ∩ X2 ∩ X3 ∩ X4 | C1) * P(C1) / P(X1 ∩ X2 ∩ X3 ∩ X4)
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4, because we are taking
the situation when all these features are present at the same time. The naive assumption is that the
features are conditionally independent; in other words all the features are unrelated. With that assumption,
we can further simplify the above to:
P(C1 | X1, X2, X3, X4) ∝ P(X1 | C1) * P(X2 | C1) * P(X3 | C1) * P(X4 | C1) * P(C1)
This is the final equation of Naive Bayes, and we have to calculate this probability for both C1 and C2.
P(No | Today) > P(Yes | Today), so the prediction is that golf would not be played ('No').
Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
# import the class
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Step 6: Predict y_pred for the values of X_test
y_pred = gaussian.predict(X_test)
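The evaluation referred to in the conclusion below can be sketched as follows, assuming y_test comes from the earlier train/test split (the macro averaging mode is an assumption for the multi-class Iris labels):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc = accuracy_score(y_test, y_pred)
print('Accuracy  :', acc)
print('Error rate:', 1 - acc)
print('Precision :', precision_score(y_test, y_pred, average='macro'))
print('Recall    :', recall_score(y_test, y_pred, average='macro'))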
Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Assignment Question:
1) Consider the observation for the car theft scenario having 3 attributes colour, Type and
origin.
2) Write python code for the preprocessing mentioned in step 4. and Explain every step in
detail.
Title: Text Analytics
1. Extract a sample document and apply the following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Text Analytics
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document Frequency.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text Mining processes the text itself, while NLP
processes with the underlying metadata. Finding frequency counts of words, length of the
sentence, presence/absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment, finding
entities in the sentence, and category of blog/article. Text mining is preprocessed data for
text analytics. In Text Analytics, statistical and machine learning algorithms are used to
classify information.
sent_tokenize() method
Lemmatization vs Stemming
A stemming algorithm works by cutting the suffix from the word; in a broader sense it
cuts either the beginning or the end of the word.
Example: a stemmer reduces "waits", "waiting" and "waited" to the common stem "wait" by cutting off the suffixes.
After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bag and represent each document by the count of each word in it, ignoring grammar and word
order; this representation is called a Bag of Words.
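A minimal sketch of both representations using scikit-learn, applied to the two example documents from the assignment question at the end of this section:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ['Jupiter is the largest Planet',
        'Mars is the fourth planet from the Sun']
bow = CountVectorizer().fit_transform(docs)      # Bag of Words counts
tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF weights
print(bow.toarray())
print(tfidf.toarray())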
Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
Step 1: Download the required packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text= "Tokenization is the first step in text analytics. The process
of breaking down a text paragraph into smaller chunks such as
words or sentences is called Tokenization."
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
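The remaining preprocessing steps named in the algorithm (stop word removal, POS tagging, stemming and lemmatization) can be sketched with the NLTK resources downloaded in Step 1:
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokenized_word if w.lower() not in stop_words]
print(pos_tag(filtered))                      # POS tagging
ps = PorterStemmer()
print([ps.stem(w) for w in filtered])         # Stemming
lem = WordNetLemmatizer()
print([lem.lemmatize(w) for w in filtered])   # Lemmatization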
# Computing IDF: count, for each word, the number of documents in which it appears
# (documents is a list of per-document term-frequency dictionaries)
import math
idfDict = dict.fromkeys(documents[0].keys(), 0)
for document in documents:
    for word, val in document.items():
        if val > 0:
            idfDict[word] += 1
for word, val in idfDict.items():
    idfDict[word] = math.log(len(documents) / float(val))
Conclusion:
In this way we have done text data analysis using the TF-IDF algorithm.
Assignment Question:
1) Perform Stemming for text = "studies studying cries cry". Compare the
results generated with Lemmatization. Comment on how Stemming and
Lemmatization differ from each other.
2) Write Python code for removing stop words from the below documents, convert the
documents into lowercase and calculate the TF, IDF and TFIDF scores for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Title: Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the
passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any
patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by
plotting a histogram.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
In this article we will look at Seaborn which is another extremely useful library for data
visualization in Python. The Seaborn library is built on top of Matplotlib and offers many
advanced data visualization capabilities.
Though the Seaborn library can be used to draw a variety of charts such as matrix plots, grid
plots, regression plots etc., in this article we will see how the Seaborn library can be used to
draw distributional and categorical plots. In the second part of the series, we will see how to
draw regression plots, matrix plots, and grid plots.
The Seaborn library can be downloaded in a couple of ways. If you are using the pip installer for
Python libraries, you can execute the following command to download the library:
pip install seaborn
Alternatively, if you are using the Anaconda distribution of Python, you can execute the
following command to download the Seaborn library:
conda install seaborn
Theory:
The Dataset
The dataset that we are going to use to draw our plots will be the Titanic dataset, which is
downloaded by default with the Seaborn library. Now we have to use
the load_dataset function and pass it the name of the dataset.
Let's see what the Titanic dataset looks like. Execute the following script:
import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
Distributional Plots:
Distributional plots, as the name suggests are type of plots that show the statistical
distribution of data. In this section we will see some of the most commonly used distribution
plots in Seaborn.
The distplot() shows the histogram distribution of data for a single column. The column
name is passed as a parameter to the distplot() function. Let's see how the price of the ticket
for each passenger is distributed. Execute the following script:
sns.distplot(dataset['fare'])
Output:
We can remove the KDE line and control the number of bins via the kde and bins parameters. Here we set the number of bins to 10. In the output, you will see data distributed in 10 bins as
shown below:
sns.distplot(dataset['fare'], kde=False, bins=10)
Output:
The jointplot() is used to display the mutual distribution of each column. You need to pass
three parameters to jointplot . The first parameter is the column name for which you want to
display the distribution of data on x-axis. The second parameter is the column name for
which you want to display the distribution of data on y-axis. Finally, the third parameter is
the name of the data frame.
Let's plot a joint plot of the age and fare columns to see if we can find any relationship between
the two.
sns.jointplot(x='age', y='fare', data=dataset)
Output:
The Pair Plot: The pairplot() is a type of distribution plot that basically plots a joint plot for
all the possible combinations of numeric and Boolean columns in the dataset. We need to pass
the name of our dataset as the parameter to the pairplot() function as shown below:
sns.pairplot(dataset)
To add information from the categorical column to the pair plot, you can pass the name of the
categorical column to the hue parameter. For instance, if we want to plot the gender
information on the pair plot, we can execute the following script:
sns.pairplot(dataset, hue='sex')
In the output, we can see the information about the males in orange and the information about
the female in blue (as shown in the legend). From the joint plot on the top left, we can clearly
see that among the surviving passengers, the majority were female.
The Rug Plot: The rugplot() is used to draw small bars along the x-axis for each point in the
dataset. To plot a rug plot, we need to pass the name of the column. Let's plot a rug plot for
fare.
sns.rugplot(dataset['fare'])
Output:
From the output, we can see that as was the case with the distplot() , most of the instances for
the fares have values between 0 and 100. These are some of the most commonly used
distribution plots offered by the Python's Seaborn Library. Let's see some of categorical plots
in the Seaborn library.
Categorical Plots: Categorical plots, as the name suggests, are normally used to plot
categorical data. The categorical plots plot the values in a categorical column against another
categorical column or a numeric column. Let's see some of the most commonly used categorical
plots in the Seaborn library.
The Bar Plot: The barplot() is used to display the mean value for each value in a categorical
column, against a numeric column. The first parameter is the categorical column, the second
parameter is the numeric column while the third parameter is the dataset. For instance, if you
want to know the mean value of the age of the male and female passengers, you can use the
bar plot as follows.
sns.barplot(x= 'sex' , y= 'age' , data=dataset)
Output:
import numpy as np
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
In the above script, we use the std aggregate function from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output looks like this:
Output:
The box plot is used to display the distribution of the categorical data in the form of quartiles.
The center of the box shows the median value. The value from the lower whisker to the bottom
of the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and
finally from the top of the box to the top whisker lies the last quartile.
Now plot a box plot that displays the distribution for the age with respect to each gender.
Here we need to pass the categorical column as the first parameter (which is sex in our case)
and the numeric column (age in our case) as the second parameter. Finally, the dataset is
passed as the third parameter; take a look at the following script:
sns.boxplot(x='sex', y='age', data=dataset)
Output:
We can make our box plots fancier by adding another layer of distribution. For instance,
if we want to see the box plots of the age of passengers of both genders, along with the
information about whether or not they survived, we can pass survived as the value for
the hue parameter as shown below:
sns.boxplot(x='sex', y='age', data=dataset, hue='survived')
Output:
The Violin Plot: The violin plot is similar to the box plot, however, the violin plot allows us
to display all the components that actually correspond to the data point.
The violinplot() function is used to plot the violin plot. Like the box plot, the first parameter
is the categorical column; the second parameter is the numeric column while the third
parameter is the dataset. Now, plot a violin plot that displays the distribution for the age with
respect to each gender.
sns.violinplot(x= 'sex' , y= 'age' , data=dataset)
Output:
Instead of plotting two different graphs for the passengers who survived and those who did not,
you can have one violin plot divided into two halves, where one half represents the surviving while
the other half represents the non-surviving passengers. To do so, we need to
pass True as the value for the split parameter of the violinplot() function, along with the hue
parameter. Let's see how we can do this:
sns.violinplot(x='sex', y='age', data=dataset, hue='survived', split=True)
The Strip Plot: The strip plot draws a scatter plot where one of the variables is categorical.
We have seen scatter plots in the joint plot and the pair plot sections where we had two
numeric variables. The strip plot is different in a way that one of the variables is categorical
in this case, and for each category in the categorical variable, we will see scatter plot with
respect to the numeric column.
The stripplot() function is used to plot the strip plot. Like the box plot, the first parameter is
the categorical column, the second parameter is the numeric column while the third parameter
is the dataset. Look at the following script:
sns.stripplot(x='sex', y='age', data=dataset)
Output:
As with the violin plot, we can add the survival information by passing the hue parameter along with
split=True:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived', split=True)
Output:
The Swarm Plot: The swarm plot is a combination of the strip and the violin plots. In the
swarm plots, the points are adjusted in such a way that they don't overlap. Let's plot a swarm
plot for the distribution of age against gender. The swarmplot() function is used to plot the
swarm plot. Like the box plot, the first parameter is the categorical column, the second
parameter is the numeric column while the third parameter is the dataset. Look at the
following script:
sns.swarmplot(x='sex', y='age', data=dataset)
Output:
The survival information can again be added via the hue parameter:
sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')
Output:
Combining Swarm and Violin Plots: Swarm plots are not recommended if you have a huge
dataset since they do not scale well because they have to plot each data point. If you really like
swarm plots, a better way is to combine two plots. For instance, to combine a violin plot with
swarm plot, you need to execute the following script:
sns.violinplot(x='sex', y='age', data=dataset)
sns.swarmplot(x='sex', y='age', data=dataset, color='black')
Output:
Title: Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along with the information about whether they survived or not. (Column names: 'sex' and 'age')
2. Write observations on the inference from the above statistics.
Student Name: __________    Division: A    Roll No: __________    Exam Seat No: __________    Date of Completion: __________    Assessment grade/marks: __________    Signature: __________
• Problem Statement:
Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with
respect to each gender along with the information about whether they survived or not. (Column names: 'sex'
and 'age')
2. Write observations on the inference from the above statistics.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
There are various techniques to understand your data, and the basic need is knowledge of
NumPy for mathematical operations and Pandas for data manipulation. We are using the
Titanic dataset. For demonstrating some of the techniques we will also use seaborn's inbuilt
tips dataset, which records the tips each waiter gets from different customers.
Univariate Analysis
Univariate analysis is the simplest form of analysis, where we explore a single variable.
Univariate analysis is performed to describe the data in a better way. We perform univariate
analysis of numerical and categorical variables differently because they call for different plots.
Categorical Data:
following are various plots which we can use for visualizing Categorical data.
1) CountPlot:
Countplot is basically a count of frequency plot in the form of a bar graph. It plots the count of
each category in a separate bar; it is the visual equivalent of the pandas value_counts() function
on a column. Let's count the Survived column:
sns.countplot(data['Survived'])
plt.show()
2) Pie Chart:
The pie chart is also the same as the countplot, but it additionally shows the percentage
presence of each category in the data, i.e., which category gets how much weightage. Now we
check the Sex column: what percentage of passengers are male and what percentage are female?
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
Numerical Data:
Analyzing numerical data helps to further process the data; most of the time we will find some inconsistency in it. The following plots are used for visualizing numerical data.
1) Histogram:
A histogram bins the values into various ranges and plots them, so we can visualize how the values are distributed and see where more values lie: toward the positive side, the negative side, or at the center (mean). Let's have a look at the Age column:
plt.hist(data['Age'], bins=5)
plt.show()
2) Distplot:
Distplot is also known as the second histogram because it is a slight improvement over
the histogram. Distplot gives us a KDE (Kernel Density Estimation) curve over the histogram,
which approximates the PDF (Probability Density Function), i.e., the probability of each value occurring:
sns.distplot(data['Age'])
plt.show()
3) Boxplot:
Boxplot is a very interesting plot that basically plots a 5-number summary. To obtain the boundaries used for outlier detection from the 5-number summary:
IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_bounday = Q3 + 1.5 * IQR
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
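A minimal sketch of a univariate boxplot on the Titanic data loaded above:
sns.boxplot(data['Age'])
plt.show()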
We have studied various plots to explore single categorical and numerical variables. Bivariate
analysis is used when we have to explore the relationship between two different variables, and
we have to do this because, in the end, our main task is to explore the relationships between
variables to build a powerful model. When we analyse more than two variables together, it
is known as multivariate analysis. Below we work through different plots for bivariate as well
as multivariate analysis.
1) Scatter Plot:
To plot the relationship between two numerical variables scatter plot is a simple plot to do.
Let us see the relationship between the total bill and tip provided using a scatter plot.
sns.scatterplot(tips["total_bill"], tips["tip"])
We can also plot 3-variable or 4-variable relationships with a scatter plot. Suppose we want to
see the male/female split within the total bill and tip relationship, and suppose along with
gender we also want to know whether the customer was a smoker or not. Both can be done as
shown below.
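A minimal sketch of both variants, assuming tips = sns.load_dataset('tips'):
sns.scatterplot(tips["total_bill"], tips["tip"], hue=tips["sex"])
sns.scatterplot(tips["total_bill"], tips["tip"], hue=tips["sex"], style=tips["smoker"])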
If one variable is numerical and one is categorical, then there are various plots that we can use.
1) Bar Plot:
Bar plot is a simple plot which we can use to plot a categorical variable on the x-axis and a
numerical variable on the y-axis and explore the relationship between both variables. The
black tip on top of each bar shows the confidence interval. Let us explore Pclass with age.
sns.barplot(data['Pclass'], data['Age'])
plt.show()
The hue argument is very useful and helps to analyse more than 2 variables. Now, along with
Pclass and age, we can also bring in the gender:
sns.barplot(data['Pclass'], data['Age'], hue=data['Sex'])
plt.show()
2) Boxplot:
We have already studied boxplots in the univariate analysis above. Here we can draw a
separate boxplot for each category of one variable against a numerical variable. Let us explore gender against age using a boxplot.
sns.boxplot(data['Sex'], data["Age"])
Along with age and gender, let us also see who survived and who did not, as in the sketch below.
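A minimal sketch, adding Survived as the hue:
sns.boxplot(x=data['Sex'], y=data["Age"], hue=data["Survived"])
plt.show()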
3) Distplot:
Distplot explains the PDF using kernel density estimation. Distplot does not have a
hue parameter, but we can achieve the same effect manually. Suppose we want to compare the age
distribution of passengers who survived with that of passengers who did not, to find out which age range has the higher survival probability; see the sketch below.
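A minimal sketch that mimics a hue by drawing one density curve per survival outcome on the same axes (hist=False keeps only the KDE curves; dropna() removes missing ages):
sns.distplot(data[data['Survived'] == 0]['Age'].dropna(), hist=False, label="Not survived")
sns.distplot(data[data['Survived'] == 1]['Age'].dropna(), hist=False, label="Survived")
plt.legend()
plt.show()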
1) Heatmap:
If you have ever used the crosstab function of pandas, then a heatmap is a similar visual
representation of it. It basically shows how often each category of one variable occurs
together with each category of another variable in the dataset. Let me show it first with crosstab and then
with a heatmap (see the sketch after the crosstab).
pd.crosstab(data['Pclass'], data['Survived'])
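The same table rendered as a heatmap (annot=True writes the counts into the cells, and fmt="d" formats them as integers):
sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']), annot=True, fmt="d")
plt.show()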
2) Cluster map:
We can also use a cluster map to understand the relationship between two categorical variables.
A cluster map basically plots a dendrogram that groups the categories with similar behavior
together.
sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
• Problem Statement:
Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame. (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Jupyter Notebook / Google Colab / Kaggle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Python 3.8, PIP
Datasets: Any Open Source Dataset/CSV Files
Python3
import pandas as pd
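Before the commands below can run, the CSV has to be read into a DataFrame named data. A minimal sketch, assuming the Kaggle version of the Iris dataset (which has the Id, Species, and SepalLengthCm columns used later) has been downloaded:
# "Iris.csv" is an assumed file name/path; point it at your downloaded copy of the dataset.
data = pd.read_csv("Iris.csv")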
Python3
Code: Displaying the top rows of the dataset with their columns.
The head() function displays the top rows of the dataset; by default it shows the first five rows.
Python3
data.head()
Output:
Python3
data.sample(10)
Output:
Python3
data.columns
Output:
Python3
Output:
Python3
print(data)
Output:
Python3
print(data[10:21])
sliced_data=data[10:21]
print(sliced_data)
Output:
Python3
specific_data=data[["Id","Species"]]
#data[["column_name1","column_name2","column_name3"]]
print(specific_data.head(10))
Output:
Python3
data.iloc[5]
data.loc[data["Species"] == "Iris-setosa"]
Output:
Python3
# In this dataset we will work on the Species column; value_counts() counts the number of
# times each particular species occurs.
data["Species"].value_counts()
Output:
Python3
# data["column_name"].sum()
sum_data = data["SepalLengthCm"].sum()
mean_data = data["SepalLengthCm"].mean()
median_data = data["SepalLengthCm"].median()
print("Sum:", sum_data, "\nMean:", mean_data, "\nMedian:", median_data)
Output:
Python3
min_data = data["SepalLengthCm"].min()
print("Minimum:", min_data)
Output:
Python3
# Add a column that sums the numeric feature values of each row
cols = data.columns
print(cols)
numeric_cols = cols[1:5]   # SepalLengthCm ... PetalWidthCm (skip Id and Species)
data["total_values"] = data[numeric_cols].sum(axis=1)
Output:
Python3
newcols = {
    "Id": "id",
    "SepalLengthCm": "sepallength",
    "SepalWidthCm": "sepalwidth"}
data.rename(columns=newcols,inplace=True)
Output:
Python3
data.style
Output:
Now we will highlight the maximum and minimum column-wise, row-wise, and the whole
dataframe wise using Styler.apply function. The Styler.apply function passes each column or
row of the dataframe depending upon the keyword argument axis. For column-wise use axis=0,
row-wise use axis=1, and for the entire table at once use axis=None.
data.head(10).style.highlight_max(color='lightgreen', axis=0)
data.head(10).style.highlight_max(color='lightgreen', axis=1)
data.head(10).style.highlight_max(color='lightgreen', axis=None)
Output:
for axis=0, axis=1, and axis=None
Python3
data.isnull()
Output:
Python3
data.isnull().sum()
Output:
Python3
import seaborn as sns
iris = sns.load_dataset("iris")
Output:
Code: Annotate each cell with the numeric value using integer formatting
Python3
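A minimal sketch of such an annotated heatmap over the correlation matrix of the numeric columns (the correlations are floats, so fmt=".2f" is used here; fmt="d" applies when the annotated values are integers):
import seaborn as sns
import matplotlib.pyplot as plt
corr = data.select_dtypes(include="number").corr(method="pearson")
sns.heatmap(corr, annot=True, fmt=".2f")
plt.show()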
Output:
Python3
data.corr(method='pearson')
Output:
data.corr()
The output dataframe can be read as follows: for any cell, the value is the correlation of the row
variable with the column variable. The correlation of a variable with itself is 1; for that reason,
all the diagonal values are 1.00.
Python3
g = sns.pairplot(data,hue="Species")
Output:
Title: Write a code in JAVA for a simple Word Count application that counts
the number of occurrences of each word in a given input set using the
Hadoop Map-Reduce framework on a local standalone set-up.
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
• Problem Statement:
Write a code in JAVA for a simple Word Count application that counts the number of occurrences of each
word in a given input set using the Hadoop Map-Reduce framework on local-standalone set-up.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Apache Hadoop
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: JAVA
Title: Write a code in JAVA for a simple WordCount application that counts the number of
occurrences of each word in a given input set using the Hadoop MapReduce framework on a local
standalone set-up.
Pre-requisites:
1. Java installation – OpenJDK 11 (verify with: java -version)
2. Hadoop installation – Hadoop 2 or higher.
Theory:
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the
failed tasks.
The WordCount example reads text files and counts how often words occur. The input is text files and
the output is text files, each line of which contains a word and the count of how often it occurred,
separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word
and a count of 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its total.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the
amount of data sent across the network by combining each word into a single record.
Steps to execute:
1. Create a text file on your local machine and write some text into it.
$ nano data.txt
2. In this example, we find the frequency of each word in this text file.
File: WC_Mapper.java
package com.javatpoint;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Splits each input line into words and emits a (word, 1) pair for every token.
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
File: WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>{
    // Sums the counts emitted by the mapper for each word and writes (word, total).
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException{
        int sum = 0;
        while (values.hasNext()){
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Output:
HDFS 1
HADOOP 2
MapReduce 1
a 2
is 2
of 2
processing 1
storage 1
tool 1
unit 1
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
Title: Map-Reduce
• Problem Statement:
Design a distributed application using Map-Reduce which processes a log file of a system.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Apache Hadoop /Oracle
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Java
Scala is a general-purpose programming language designed for programmers who want to write programs in a concise, elegant, and type-safe way.
If you write code in Scala, you will see that the style is similar to a scripting language. Even
though Scala is a relatively new language, it has gained enough users and has wide community support.
2. About Scala
The design of Scala started in 2001 in the programming methods laboratory at EPFL (École
Polytechnique Fédérale de Lausanne). Scala made its first public appearance in January 2004
on the JVM platform and a few months later in June 2004, it was released on the .(dot)NET
platform. The .(dot)NET support of Scala was officially dropped in 2012. A few more
notable characteristics of Scala: it is object-oriented, so every value is an object and every
operation you perform is a method call; Scala allows you to add new operations to existing classes; and you
can also write Java code inside a Scala class. Scala supports advanced component
architectures through classes and traits. Scala is also a functional language, supporting a style
which avoids state and mutable data. Functional programming exhibits the following
characteristics: functions are treated as values, data is immutable, and expressions are free of side effects.
Scala is not a pure functional language; Haskell is an example of a pure functional language.
If you want to read more about functional programming, please refer to this article.
Scala is a compiled language, which makes Scala execution very fast compared
with Python (which is an interpreted language). The compiler in Scala works in a similar fashion
to the Java compiler: it takes the source code and generates Java byte-code that can be executed
independently on any standard JVM (Java Virtual Machine). If you want to know more about
the difference between compiled vs interpreted languages, please refer to this article.
There are more important points about Scala which I have not covered. Some of them are:
Scala is now a big name. It is used by many companies to develop commercial software.
The following are notable big companies which are using Scala as a programming
alternative.
• LinkedIn
• Twitter
• Foursquare
• Netflix
• Tumblr
• The Guardian
• Precog
• Sony
• AirBnB
• Klout
• Apple
If you want to read more about how and when these companies started using Scala, please refer to this article.
3. Installing Scala
Scala can be installed on any Unix- or Windows-based system. Below are the steps to install
Scala 2.11.7 on Ubuntu 14.04 with Java version 7. It is necessary to install Java before installing
Scala. First check whether Java has been installed successfully or not. To check the Java version and installation, run:
$ java -version
$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb
$ scala -version
Scala, being an easy-to-learn language, has minimal prerequisites. If you are someone with basic
knowledge of C/C++, then you will easily be able to get started with Scala. Since Scala is
developed on top of Java, basic programming in Scala is similar to Java. So, if you
have some basic knowledge of Java syntax and OOP concepts, it will be helpful for you
to work in Scala.
Once you have installed Scala, there are various options for choosing an environment. Here we
write programs in the Scala shell (REPL), because it provides useful features such as suggestions for
method calls and lets you run your code line by line as you write it.
Object: An entity that has state and behavior is known as an object. For example: a table, a car, a laptop.
Class: A class can be defined as a blueprint or a template for creating different objects which share common properties and behavior.
Closure: A closure is any function that closes over the environment in which it is defined. A
closure's return value depends on the value of one or more variables declared outside
the closure.
Traits: Traits are used to define object types by specifying the signatures of the supported methods.
• It is case sensitive.
• If you are writing a program in Scala, you should save it with the “.scala” extension.
• Scala execution starts from the main() method.
• An identifier name cannot begin with a number. For example, the variable name “123salary” is
invalid.
• You cannot use Scala reserved keywords for variable declarations, constants, or any other
identifiers.
In Scala, you can declare a variable using the ‘var’ or ‘val’ keyword. The decision is based on
whether it is a constant or a variable. If you use the ‘var’ keyword, you define a mutable
variable. On the other hand, if you use ‘val’, you define it as immutable. Let us first declare a
mutable variable: such a Scala statement declares a mutable variable, for example “Var1”, which takes a string
value. You can also write the statement without specifying the type of the variable, because Scala
infers it. Similarly, with ‘val’ we can declare an immutable variable, for example “Var2”, which takes the
string “Ankit”. Try it without specifying the type of the variable. If you want to read more about var and val, refer to this article.
9. Operations on variables
You can perform various operations on variables. There are various kinds of operators defined
in Scala, for example: arithmetic operators, relational operators, logical operators, and bitwise operators.
Let us see the “+” and “==” operators on two variables ‘Var4’ and ‘Var5’. But before that, let us first declare Var4 and Var5.
Var4+Var5
Output:
res1: Int = 5
Var4==Var5
Output:
If you want to know the complete list of operators in Scala, refer to this link.
In Scala, the if-else expression is used for conditional statements. You can write one or more
conditions inside the “if”. Let us declare a variable called “Var3” with a value of 1 and then check it:
var Var3 = 1
if (Var3 == 1){
    println("True")
} else {
    println("False")
}
Output: True
In the above snippet, the condition evaluates to True and hence True will be printed in the
output.
Like most languages, Scala also has a for loop, which is the most widely used method for iteration.
Scala also supports “while” and “do while” loops; if you want to know how both work, refer to this article.
You can define a function in Scala using the “def” keyword. You need to define the return type of the
function; if a function does not return any value, you should use the “Unit” return type.
In the example, the function “mul2” takes a number and multiplies it by 10, returning an integer value; calling it gives:
mul2(2)
Output:
res9: Int = 20
If you want to read more about the function, please refer this tutorial.
• Arrays
• Lists
• Sets
• Tuple
• Maps
• Option
13.1 Arrays in Scala
In Scala, an array is a collection of similar elements and it can contain duplicates. Unlike Scala lists,
arrays are mutable: their elements can be updated in place, although the size of an array is fixed. Further,
you can access elements of an array using an index.
To declare an array in Scala, you can either define it using the new keyword or directly assign the values;
for example, we can define an array called name holding string values.
The following is the syntax for declaring an array variable using the new keyword; here you declare
an array of Strings called “name” that can hold up to three elements.
scala> name
res3: Array[String] = Array(jal, Faizy, Expert in deep learning)
You can access an element of an array by its index. Let us access the first element of the array “name”:
name(0)
Output:
res11: String = jal
Lists are one of the most versatile data structures in Scala. Lists can contain items of different
types in Python, but in Scala the items of a list all have the same type. Scala lists are immutable.
You can define a list simply by comma-separated values inside the “List” method.
You can also define a multi-dimensional list in Scala. Let us define a two-dimensional list:
Accessing a list
Let us get the third element of the list “numbers”. The index should be 2, because indices in Scala
start from 0.
scala> numbers(2)
res6: Int = 3
We have discussed two of the most used data structures. You can learn more from this link.
Let us start with a “Hello World!” program. It is a good simple way to understand how to
write, compile and run codes in Scala. No prizes for telling the outcome of this code!
object HelloWorld {
def main(args: Array[String]) {
println("Hello, world!")
}
}
As mentioned before, if you are familiar with Java, it will be easier for you to understand Scala.
If you know Java, you can easily see that the structure of the above “HelloWorld” program is very
similar to that of a Java program.
This program contains a method “main” (not returning any value) which takes an argument,
a string array, through the command line. Next, it calls the predefined method “println” and passes it the string to display.
You can define the main method as static in Java, but in Scala the static method is no longer
available. Scala programmers cannot use static methods because Scala uses singleton objects instead. To
read more about singleton objects you can refer to this article.
To run any Scala program, you first need to compile it. “scalac” is the compiler: it takes the
source code and generates Java byte-code.
Let us start compiling the “HelloWorld” program using the following steps:
1. Save the program in a file named HelloWorld.scala.
2. Now you need to change your working directory to the directory where your program is
saved.
3. After changing the directory, you can compile the program by issuing the command:
scalac HelloWorld.scala
4. After compiling, you will get HelloWorld.class as output in the same directory. If you
can see the file, you have successfully compiled the above program.
After compiling, you can now run the program using the following command:
scala HelloWorld
You will get an output if the above command runs successfully. The program will print
“Hello, world!”
If you are working with Apache Spark, then you would know that it has APIs in 4 different
languages: Scala, Java, Python, and R. Each of these languages has its own unique advantages, but using Scala is more
advantageous than the other languages. These are the reasons why Scala is gaining ground among Spark developers.
Let us compare the 4 major languages which are supported by the Apache Spark API.
To know the basics of Apache Spark and installation, please refer to my first article on Pyspark.
I have introduced basic terminologies used in Apache Spark like big data, cluster computing,
driver, worker, spark context, in-memory computation, lazy evaluation, DAG, memory management, etc.
As a quick refresher, I will be explaining some of the topics which are very useful to proceed
further. If you are a beginner, then I strongly recommend you to go through my first article
before proceeding. Spark provides three abstractions for distributed data (RDD, Dataset, and DataFrame); to use Spark
functionality, we must use one of them for data manipulation. Let us discuss each of them
briefly:
• RDD: An RDD (Resilient Distributed Dataset) is a collection of elements that can be divided
across multiple nodes in a cluster for parallel processing. It is also a fault-tolerant collection of
elements, which means it can automatically recover from failures. An RDD is immutable: we can
create an RDD once but cannot change it.
• Dataset: It is also a distributed collection of data. A Dataset can be constructed from JVM
objects and then manipulated using functional transformations (map, flatMap, filter, etc.). As
I have already discussed in my previous articles, dataset API is only available in Scala and
Java. It is not available in Python and R.
• DataFrame: In Spark, a DataFrame is a distributed collection of data organized into named
columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python. It is
mostly used for structured data processing. In Scala, a DataFrame is represented by a Dataset
of Rows. A DataFrame can be constructed from a wide range of sources, for example existing RDDs,
Hive tables, or database tables.
• Transformation: A transformation is an operation applied on an RDD to create a new
RDD.
• Action: An action is an operation applied on an RDD that performs a computation
and sends the result back to the driver.
• Broadcast: We can use broadcast variables to save a copy of data across all nodes.
• Accumulator: Accumulator variables are used for aggregating information.
The first step in using RDD functionality is to create an RDD. In Apache Spark, an RDD can be created
in two different ways: from an existing collection in the driver program, or from an external data source.
So before moving further let’s open the Apache Spark Shell with Scala. Type the following
command after switching into the home directory of Spark. It will also load the spark context
as sc.
$ ./bin/spark-shell
After typing above command you can start programming of Apache Spark in Scala.
When you want to create an RDD from existing storage in the driver program, you can parallelize a
collection that already exists in the driver program.
In the above program, I first created an array for 10 elements and then I created a distributed
data called RDD from that array using “parallelize” method. SparkContext has a parallelize
method, which is used for creating the Spark RDD from an iterable already present in driver
program.
To see the content of any RDD we can use “collect” method. Let’s see the content of
distData.
scala> distData.collect()
Output: res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can create a RDD through external sources such as a shared file system, HDFS, HBase,
or any data source offering a Hadoop Input Format. So let’s create a RDD from the text file:
The name of the text file is text.txt and it has only 4 lines. Let us look at the first two lines:
lines.take(2)
Output: Array(I love solving data mining problems., I don't like solving data mining
problems)
A map transformation applies a given function to each element of the RDD and returns a new RDD of the results. We can also use actions such as count(); let us count the number of lines in the file:
lines.count()
res1: Long = 4
Let’s filter out the words in “text.txt” whose length is more than 5.
A DataFrame can be created in several ways:
• by loading data from different data formats, for example JSON or CSV
• by loading data from an existing RDD
• by programmatically specifying a schema
Let us create a DataFrame using a csv file and perform some analysis on it. To
perform this action, first we need to download the spark-csv package (latest version) and extract
it into the home directory of Spark. Then we need to open the Spark shell with this package included.
Now let us load the csv file into a DataFrame df. You can download the file (train) from
this link.
val df = sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("train.csv")
df.columns
Output:
res0: Array[String] = Array(User_ID, Product_ID, Gender, Age, Occupation, City_Category,
Stay_In_Current_City_Years, Marital_Status, Product_Category_1, Product_Category_2,
Product_Category_3, Purchase)
df.count()
Output:
res1: Long = 550068
You can use “printSchema” method on df. Let’s print the schema of df.
df.printSchema()
You can use “show” method on DataFrame. Let’s print the first 2 rows of df.
df.show(2)
Output:
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
only showing top 2 rows
To select columns you can use “select” method. Let’s apply select on df for “Age” columns.
df.select("Age").show(10)
Output:
+-----+
|  Age|
+-----+
| 0-17|
| 0-17|
| 0-17|
| 0-17|
|  55+|
|26-35|
|46-50|
|46-50|
|46-50|
|26-35|
+-----+
only showing top 10 rows
To filter the rows you can use the “filter” method. Let us apply a filter on the “Purchase” column of df.
To group by columns, you can use the groupBy method on a DataFrame. Let us see the distribution of the “Age” column:
df.groupBy("Age").count().show()
Output:
+-----+------+
|  Age| count|
+-----+------+
|51-55| 38501|
|46-50| 45701|
| 0-17| 15102|
|36-45|110013|
|26-35|219587|
|  55+| 21504|
|18-25|   996|
+-----+------+
To apply SQL queries on a DataFrame, you need to register the DataFrame (df) as a table. Let us first
register df as a temporary table called “B_friday”. Now you can apply SQL queries on the “B_friday” table using sqlContext.sql, for example to select some columns from it.
If you have come this far, you are in for a treat! I will complete this tutorial by building a
machine learning model. I will use only three input features and the target variable, kept in df1. Let us
create a formula for the machine learning model like we do in R. First, we need to
import RFormula. Then we need to specify the dependent and independent columns inside this
formula. We also have to specify the names for the features column and the label column.
import org.apache.spark.ml.feature.RFormula
After creating the formula, we need to fit it on df1 and transform df1 through it.
After applying the formula we can see that the train dataset has 2 extra columns called features and
label. These are the ones we specified in the formula (featuresCol=”features” and
labelCol=”label”).
After applying the RFormula and transforming the DataFrame, we now need to develop the
machine learning model on this data. I want to apply a Linear Regression for this task. Let us
import a Linear regression and apply on train. Before fitting the model, I am setting the
hyperparameters.
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(train)
You can also make predictions on unseen data, but I am not showing that here. Let us
summarize the model over the training set and print out some metrics.
val trainingSummary = lrModel.summary
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
Output:
RMSE: 5021.899441991144
Let us repeat the above procedure to take predictions on a cross-validation set. Let us read the
train file again and then randomly divide it into two parts, train_cv and test_cv.
import org.apache.spark.ml.feature.RFormula
After transforming train_cv and test_cv using RFormula (giving train_cv1 and test_cv1), we can build a machine learning model and take predictions on the cross-validation set.
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(train_cv1)
val train_cv_pred = lrModel.transform(train_cv1)
val test_cv_pred = lrModel.transform(test_cv1)
In train_cv_pred and test_cv_pred, you will find a new column for prediction.
Conclusion: Hence, we have thoroughly studied how to program in Scala using Apache
Spark.
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
• Problem Statement:
Locate dataset (e.g., sample_weather.txt) for working on weather data which reads the text input files and
finds average for temperature, dew point and wind speed.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Apache Hadoop
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: SCALA/Python3
Aim: Locate dataset (e.g., sample_weather.txt) for working on weather data which
reads the text input files and finds average for temperature, dew point and wind speed.
Theory: Analysis of Weather data using Pandas, Python, and Seaborn:
The most recent post on this site was an analysis of how often people cycling to work
actually get rained on in different cities around the world. You can check it out here.
The analysis was completed using data from the Wunderground weather website, Python,
specifically the Pandas and Seaborn libraries. In this post, I will provide the Python code to
replicate the work and analyse information for your own city. During the analysis, I
used Python Jupyter notebooks to interactively explore and cleanse data; there’s a simple
setup if you elect to use something like the Anaconda Python distribution to install
everything you need.
If you want to skip data downloading and scraping, all of the data I used is available
to download here.
Scraping Weather Data
import io
import requests
import pandas as pd
import time
from dateutil import parser

def scrape_station(station, day, month, year):
    """
    Function to return a data frame of minute-level weather data for a single Wunderground
    PWS station.
    Args:
        station (string): station code from the Wunderground website
        day, month, year (int): date for which to retrieve the data
    Returns:
        Pandas Dataframe with weather data for specified station and date.
    """
    url = ("http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID={station}"
           "&day={day}&month={month}&year={year}&graphspan=day&format=1")
    # Request the CSV-formatted text for this station and date.
    response = requests.get(url.format(station=station, day=day, month=month, year=year))
    data = response.text
    # The page pads the CSV with <br> tags; strip them before parsing. The parsing details
    # below are a reconstruction, as the original lines were lost in conversion.
    data = data.replace("<br>", "").strip()
    try:
        dataframe = pd.read_csv(io.StringIO(data), index_col=False)
        dataframe["station"] = station
    except Exception as e:
        print("Issue with data for station {} on {}-{}-{}: {}".format(station, day, month, year, e))
        return None
    return dataframe
start_date = "2015-01-01"
end_date = "2015-12-31"
start = parser.parse(start_date)
end = parser.parse(end_date)
dates = pd.date_range(start, end)      # one entry per day in the requested range
backoff_time = 10
data = {}
# "stations" is assumed to be a list of Wunderground station ID strings to download.
for station in stations:
    print("Working on {}".format(station))
    data[station] = []
    for date in dates:
        # Print a progress message every ten days.
        if date.day % 10 == 0:
            print("Working on date: {} for station {}".format(date, station))
        done = False
        while not done:
            try:
                weather_data = scrape_station(station, date.day, date.month, date.year)
                done = True
            except ConnectionError as e:
                # Wunderground may rate-limit requests; back off and retry.
                time.sleep(backoff_time)
        data[station].append(weather_data)
    # Finally combine all of the individual days and output to CSV for analysis.
    pd.concat(data[station]).to_csv("data/{}_weather.csv".format(station))
Cleansing and Data Processing
The data downloaded from Wunderground needs a little bit of work. Again, if you want the
raw data, it's here. Ultimately, we want to work out when it's raining at certain times of the
day and aggregate this result to daily, monthly, and yearly levels. As such, we use Pandas to
add month, year, and date columns. Simple stuff in preparation, and we can then output plots
as required.
station = 'IEDINBUR6' # Edinburgh
data_raw = pd.read_csv('data/{}_weather.csv'.format(station))
# Give the variables some friendlier names and convert types as necessary.
data_raw['temp'] = data_raw['TemperatureC'].astype(float)
data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
data_raw['humidity'] = data_raw['Humidity'].astype(float)
data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
data_raw['wind'] = data_raw['WindSpeedKMH']
# There's an issue with some stations that record rainfall ~-2500 where data is missing.
# Get the time, day, and hour of each timestamp in the dataset
# Mark the month for each entry so we can look at monthly patterns
# You get wet cycling if its a working day, and its raining at the travel times!
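A minimal, hedged sketch of those preparation steps (the column names day, hour, month, working_day, commute_time, and get_wet_cycling are assumptions about how the analysis proceeds, not the exact code of the original post):
# Some stations record rainfall around -2500 where data is missing; treat those as zero rain.
data = data_raw.copy()
data.loc[data['rain'] < 0, 'rain'] = 0
# Time components of each timestamp, plus the month for monthly patterns.
data['day'] = data['date'].apply(lambda x: x.date())
data['hour'] = data['date'].apply(lambda x: x.hour)
data['month'] = data['date'].apply(lambda x: x.month)
# Working day (Mon-Fri) and assumed commute hours (around 8-9am and 5-6pm).
data['working_day'] = data['date'].apply(lambda x: x.weekday() < 5)
data['commute_time'] = data['hour'].isin([8, 17])
# You get wet cycling if it's a working day and it's raining at the travel times.
wet_cycling = (data[data['working_day'] & data['commute_time']]
               .groupby(['month', 'day'])['rain']
               .apply(lambda x: (x > 0).any())
               .rename('get_wet_cycling'))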
At this point, the dataset is relatively clean, and ready for analysis. If you are not familiar
with grouping and aggregation procedures in Python and Pandas, here is another blog post on
the topic.
Data after cleansing from Wunderground.com. This data is now in good format for grouping
and visualisation using Pandas.
With the data cleansed, we now have non-uniform samples of the weather at a given station
throughout the year, at a sub-hour level. To make meaningful plots on this data, we can
aggregate over the days and months to gain an overall view and to compare across stations.
# Looking at the working days only and create a daily data set of working days:
wet_cycling = pd.DataFrame(wet_cycling).reset_index()
monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
# Get aggregate stats for each day in the dataset on rain in general - for heatmaps.
# Nested aggregation spec (older pandas API); part of the original spec was lost in conversion.
rainy_days = data.groupby(['day']).agg({
    "rain": {"rain": lambda x: (x > 0.0).any(),
             "rain_amount": "sum"},
})
# Add the number of rainy hours per day to the rainy_days dataset.
temp = temp.groupby(level=[0]).sum().reset_index()
len(wet_cycling),
station)
At this point, we can start to plot the data. It’s well worth reading the documentation
on plotting with Pandas, and looking over the API of Seaborn, a high-level data visualisation
library that is a level above matplotlib.
This is not a tutorial on how to plot with seaborn or pandas – that'll be a separate blog post,
but rather instructions on how to reproduce the plots shown on this blog post.
The monthly summarised rainfall data is the source for this chart.
plt.figure(figsize=(12,8))
sns.set_context("notebook", font_scale=2)
plt.xlabel("Month")
plt.ylabel("Number of Days")
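The seaborn call that actually draws the bars goes with the figure setup above; a minimal hedged sketch that rebuilds the monthly counts itself rather than relying on the exact layout of the monthly table (column names are assumptions):
# Count wet versus dry commuting days per month and plot them as grouped bars.
monthly_counts = (wet_cycling.groupby(['month', 'get_wet_cycling'])
                  .size().reset_index(name='counts'))
sns.barplot(x="month", y="counts", hue="get_wet_cycling", data=monthly_counts)
plt.show()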
Number of days monthly when cyclists get wet commuting at typical work times in Dublin,
Ireland.
Heatmaps of Rainfall and Rainy Hours per day
The heatmaps shown on the blog post are generated using the “calmap” python library,
installable using pip. Simply import the library, and form a Pandas series with a
DateTimeIndex and the library takes care of the rest. I had some difficulty here with font
sizes, so had to increase the size of the plot overall to counter.
import calmap
temp = rainy_days.copy().set_index(pd.DatetimeIndex(analysis['rainy_days']['date']))
#temp.set_index('date', inplace=True)
plt.title("Hours raining")
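A hedged sketch of the calmap call itself (the hours_raining column name is an assumption about how the rainy-hours count was stored; calmap.yearplot draws a calendar heatmap from a Series with a DatetimeIndex):
# Calendar heatmap of the assumed hours_raining column for 2015.
calmap.yearplot(temp['hours_raining'], year=2015)
plt.show()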
The Calmap package is very useful for generating heatmaps. Note that if you have highly
outlying points of data, these will skew your color mapping considerably – I’d advise
removing or reducing them for visualisation purposes.
Heatmap of total rainfall daily over 2015. Note that if you are looking at rainfall data like
this, outlying values such as that in August in this example will skew the overall visualisation
and reduce the colour resolution of smaller values. It's best to normalise the data or reduce the
outliers prior to plotting.
Remember that Pandas can be used on its own for quick visualisations of the data – this is
really useful for error checking and sense checking your results. For example:
Quickly view and analyse your data with Pandas straight out of the box. The .plot()
command will plot against the axis, but you can specify x and y variables as required.
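For instance, a quick sketch reusing the data_raw frame from above (temp and date are columns created during cleansing):
# Quick sanity check: plot temperature over time straight from the DataFrame.
data_raw.plot(x='date', y='temp')
plt.show()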
To compare every city in the dataset, summary stats for each city were calculated in advance
and then the plot was generated using the seaborn library. To achieve this as quickly as
possible, I wrapped the entire data preparation and cleansing phase described above into a
single function called "analyse_data", used this function on each city's dataset, and extracted
out the pieces of information needed for the plot.
"""
Function to analyse weather data for a period from one weather station.
Args:
Returns:
wet_cycling: Data on working days and whether you get wet or not commuting
"""
# Give the variables some friendlier names and convert types as necessary.
data_raw['temp'] = data_raw['TemperatureC'].astype(float)
data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
data_raw['total_rain'] = data_raw['dailyrainMM'].astype(float)
data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
data_raw['humidity'] = data_raw['Humidity'].astype(float)
data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
data_raw['wind'] = data_raw['WindSpeedKMH']
# There's an issue with some stations that record rainfall ~-2500 where data is missing.
# Get the time, day, and hour of each timestamp in the dataset
# Mark the month for each entry so we can look at monthly patterns
# Classify into morning or evening times (assuming travel between 8.15-9am and 5.15-6pm)
# You get wet cycling if its a working day, and its raining at the travel times!
wet_cycling = pd.DataFrame(wet_cycling).reset_index()
monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
# Nested aggregation spec (older pandas API, matching the droplevel/rename steps below);
# part of the original spec was lost in conversion.
rainy_days = data.groupby(['day']).agg({
    "rain": {"rain": lambda x: (x > 0.0).any(),
             "rain_amount": "sum"},
})
rainy_days.reset_index(drop=False, inplace=True)
rainy_days.columns = rainy_days.columns.droplevel(level=0)
rainy_days['rain'] = rainy_days['rain'].astype(bool)
rainy_days.rename(columns={"":"date"}, inplace=True)
# Also get the number of hours per day where it's raining, and add this to the rainy_days dataset.
temp = temp.groupby(level=[0]).sum().reset_index()
len(wet_cycling),
station)
print("You get wet cycling {} % of the time!!".format(
    wet_cycling['get_wet_cycling'].sum() * 1.0 * 100 / len(wet_cycling)))
The following code was used to individually analyse the raw data for each city in turn. Note
that this could be done in a more memory efficient manner by simply saving the aggregate
statistics for each city at first rather than loading all into memory. I would recommend that
approach if you are dealing with more cities etc.
stations = [
("IAMSTERD55", "Amsterdam"),
("IBCNORTH17", "Vancouver"),
("IBELFAST4", "Belfast"),
("IBERLINB54", "Berlin"),
("ICOGALWA4", "Galway"),
("ICOMUNID56", "Madrid"),
("IDUBLIND35", "Dublin"),
("ILAZIORO71", "Rome"),
("ILEDEFRA6", "Paris"),
("ILONDONL28", "London"),
("IMUNSTER11", "Cork"),
("INEWSOUT455", "Sydney"),
("IENGLAND64", "Liverpool"),
('IEDINBUR6', 'Edinburgh')
]
data = []
# Load each station's CSV and keep it together with its station ID and city name.
for station in stations:
    weather = {}
    # pd.DataFrame.from_csv is the legacy API used in the original post; on recent pandas
    # use pd.read_csv("data/{}_weather.csv".format(station[0]), index_col=0) instead.
    weather['data'] = pd.DataFrame.from_csv("data/{}_weather.csv".format(station[0]))
    weather['station'] = station[0]
    weather['name'] = station[1]
    data.append(weather)
for ii in range(len(data)):
    # analyse_data is the wrapper function described in the text above; its exact signature was
    # lost in conversion, so the arguments here are an assumption.
    data[ii]['result'] = analyse_data(data[ii]['data'], data[ii]['station'])
# Now extract the number of wet days, the number of wet cycling days, and the number of
# wet commutes for a single chart.
output = []
for ii in range(len(data)):
    temp = {
        "total_wet_days": data[ii]['result']['rainy_days']['rain'].sum(),
        "wet_commutes": data[ii]['result']['wet_cycling']['get_wet_cycling'].sum(),
        "commutes": len(data[ii]['result']['wet_cycling']),
        "city": data[ii]['name']
    }
    output.append(temp)
output = pd.DataFrame(output)
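The bar chart below uses a percent_wet_commute column, which has to be derived from the wet-commute and total-commute counts; a minimal sketch of that step:
# Percentage of commutes on which you get wet, per city.
output['percent_wet_commute'] = output['wet_commutes'] * 100.0 / output['commutes']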
The final step in the process is to actually create the diagram using Seaborn.
plt.figure(figsize=(20,8))
sns.set_context("notebook", font_scale=2)
sns.barplot(x="city", y="percent_wet_commute",
data=output.sort_values('percent_wet_commute', ascending=False))
plt.xlabel("City")
plt.xticks(rotation=90)   # rotation angle assumed; the original value was lost
plt.savefig("images/city_comparison_wet_commutes.png", bbox_inches='tight')
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
• Problem Statement:
Write a simple program in SCALA using Apache Spark framework.
• Prerequisites:
Subjects: Discrete Mathematics, Database Management Systems
Tools/Software Requirements: Apache Hadoop
Processor/Hardware Requirements: Intel Core 2 Duo or above, 2 GB RAM or above
Operating System: 64 Bit Windows/Linux/Mac OS
Programming Language: Scala
• Objectives:
To learn concepts of data wrangling and to import/read/scrape meaningful insights from a dataset.
Aim: Design a distributed application using Map Reduce which processes a log file of a
system.
Theory: In this tutorial, you will learn to use Hadoop with MapReduce Examples. The input
data used is SalesJan2009.csv. It contains Sales related information like Product name, price,
payment mode, city, country of client, etc. The goal is to find out the number of products sold
in each country.
Now in this MapReduce tutorial, we will create our first Java MapReduce program:
Data of SalesJan2009
Ensure you have Hadoop installed. Before you start with the actual process, switch to the
‘hduser’ user (the userid used during your Hadoop configuration).
su - hduser_
Step 1)
Create a new directory with the name MapReduceTutorial, as shown in the MapReduce
example below, and give it permissions.
File: SalesMapper.java
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    // Splits each CSV record on commas and emits (Country, 1); Country sits at index 7.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        String[] SingleCountryData = valueString.split(",");
        output.collect(new Text(SingleCountryData[7]), one);
    }
}

File: SalesCountryReducer.java
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts for each country key and writes (Country, total).
    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            // replace type of value with the actual type of our value
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}
JCEI’s Jaihind College of Engineering, Kuran 2022-2023
SalesCountryDriver.java
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SalesCountryDriver {
    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        // Create a configuration object for the job, name it, set key/value types,
        // advertise the Mapper and Reducer classes, and wire up the I/O paths from args.
        JobConf job_conf = new JobConf(SalesCountryDriver.class);
        job_conf.setJobName("SalePerCountry");
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Step 2)
Export classpath as shown in the below Hadoop example
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-
client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-
common-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-
2.2.0.jar:~/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"
Step 3)
Compile Java files (these files are present in directory Final-MapReduceHandsOn). Its
class files will be put in the package directory
Step 4)
Create a new file Manifest.txt
Main-Class: SalesCountry.SalesCountryDriver
SalesCountry.SalesCountryDriver is the name of the main class. Please note that you have to
hit the enter key at the end of this line.
Step 5)
Create a Jar file.
Step 6)
Start Hadoop:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 7)
Copy the File SalesJan2009.csv into ~/inputMapReduce
Step 8)
Run MapReduce job
Step 9)
The result can be seen through the command interface.
Every mapper class must be extended from MapReduceBase class and it must
implement Mapper interface.
‘map()’ method begins by splitting input text which is received as an argument. It uses the
tokenizer to split these lines into words.
After this, a pair is formed using a record at 7th index of array ‘SingleCountryData’ and a
value ‘1’.
We are choosing record at 7th index because we need Country data and it is located at 7th
index in array ‘SingleCountryData’.
Please note that our input data is in the below format (where Country is at 7th index, with 0
as a starting index)-
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,
Last_Login,Latitude,Longitude
An output of mapper is again a key-value pair which is outputted using ‘collect()’ method
of ‘OutputCollector’.
1. We begin by specifying the name of the package for our class. SalesCountry is the name of
our package. Please note that the output of compilation, SalesCountryReducer.class, will go into
a directory named after this package: SalesCountry.
Here, the first two data types, ‘Text’ and ‘IntWritable’ are data type of input key-value to
the reducer.
Output of mapper is in the form of <CountryName1, 1>, <CountryName2, 1>. This output of
mapper becomes input to the reducer. So, to align with its data
type, Text and IntWritable are used as data type here.
The last two data types, ‘Text’ and ‘IntWritable’ are data type of output generated by reducer
in the form of key-value pair.
Every reducer class must be extended from MapReduceBase class and it must
implement Reducer interface.
<United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>,<United
Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>.
So, to accept arguments of this form, first two data types are used,
viz., Text and Iterator<IntWritable>. Text is a data type of key
and Iterator<IntWritable> is a data type for list of values for that key.
reduce() method begins by copying key value and initializing frequency count to 0.
Then, using ‘while’ loop, we iterate through the list of values associated with the key and
calculate the final frequency by summing up all the values.
while (values.hasNext()) {
// replace type of value with the actual type of our value
IntWritable value = (IntWritable) values.next();
frequencyForCountry += value.get();
}
Now, we push the result to the output collector in the form of key and obtained frequency
count.
1. We begin by specifying the name of the package for our class. SalesCountry is the name of our
package. Please note that the output of compilation, SalesCountryDriver.class, will go into the
directory named after this package: SalesCountry.
2. Define a driver class which will create a new client job, configuration object and advertise
Mapper and Reducer classes.
The driver class is responsible for setting our MapReduce job to run in Hadoop. In this class,
we specify job name, data type of input/output and names of mapper and reducer
classes.
arg[0] and arg[1] are the command-line arguments passed with the command used to run the
MapReduce job, i.e. the HDFS input and output directories.
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
Conclusion: Hence we have thoroughly studied how to design a distributed application using
MapReduce which processes a log file of a system.
Student Name
Division A
Roll No
Exam Seat No
Date of
Completion
Assessment
grade/marks
Signature
SUBMITTED BY
Miss.
is a bonafide work carried out by the student under the supervision of Dr. A. A. Khatri and it is submitted
towards the partial fulfilment of the requirement of Second Year of Computer Engineering.
Dr. D. J. Garkal
Principal