Data Analytics Part 3 (1)
LEARNING LIFECYCLE
Linear Regression
Decision Tree Analysis
Read the article titled: Agile Software Development Lifecycle Model for Machine
Learning (ASDLMML)
Machine learning application software
development life cycle (MLASDLC).
■ The complexity of building and integrating machine learning
applications is challenging to software engineering teams.
■ The inherent differences between software engineering and machine
learning do not allow software engineering methodologies to be
applied uniformly.
■ Whereas software engineering is dependent on software design,
development and testing, machine learning model development is
based on data and model design, training, evaluation, deployment,
and monitoring. Machine learning systems are non-deterministic and
are therefore difficult to build using sequential development methods.
■ Data, hidden technical debt and the need for iterative experimentation
are the main technical challenges of machine learning development.
ML & Data Analytics
■ In this data-rich age, understanding how to analyze and extract genuine meaning from the business's digital data is one of the primary drivers of success.
■ Despite the colossal volume of data created every day, a mere
0.5% is actually analyzed and used for data discovery,
improvement, and intelligence.
■ First there is data analytics; then comes the challenge of applying those analytics in an automated manner, and that is what leads to ML.
■ ML depends heavily on data analytics, because machine learning models are informed by data analytics processes and models; as a result, the ML lifecycle model differs from a traditional systems lifecycle model.
Traditional vs ML
■ The traditional lifecycle model…
■ Python
■ Power BI
Descriptive Sample
1  Male    28  65
2  Male    27  65
3  Female  39  61
4  Female  34  50
5  Female  43  65
6  Male    48  72
7  Female  41  55
8  Female  52  55
9  Male    39  68
10 Female  48  68
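■ As a quick illustration (a sketch only: the slide does not name its two numeric columns, so 'Age' and 'Score' below are assumed headers), pandas can produce descriptive statistics for a small sample like the one above:
import pandas as pd

# Recreate the sample shown above; 'Age' and 'Score' are assumed column names
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female',
               'Male', 'Female', 'Female', 'Male', 'Female'],
    'Age':    [28, 27, 39, 34, 43, 48, 41, 52, 39, 48],
    'Score':  [65, 65, 61, 50, 65, 72, 55, 55, 68, 68],
})

# Descriptive statistics for the numeric columns (count, mean, std, min, quartiles, max)
print(sample.describe())

# The same summary broken down by gender
print(sample.groupby('Gender').describe())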
Machine learning Algorithms
■ Supervised learning: Supervised learning occurs when an algorithm is trained using
“labeled data,” or data that is tagged with a label so that an algorithm can successfully
learn from it. Training labels help the eventual machine learning model know how to
classify data in the manner that the researcher desires.
■ Unsupervised learning: Unsupervised algorithms use unlabeled data to train an algorithm.
In this process, the algorithm finds patterns in the data itself and creates its own data
clusters. Unsupervised learning and pattern recognition are helpful for researchers who
are looking to find patterns in data that are currently unknown to them.
■ Semi-supervised learning: Semi-supervised learning uses a mix of labeled and unlabeled
data to train an algorithm. In this process, the algorithm is first trained with a small
amount of labeled data before being trained with a much larger amount of unlabeled
data.
■ Reinforcement learning: Reinforcement learning is a machine learning technique in which
positive and negative values are assigned to desired and undesired actions. The goal is to
encourage programs to avoid the negative training examples and seek out the positive,
learning how to maximize rewards through trial and error. Reinforcement learning can be
used to direct unsupervised machine learning.
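■ As a brief illustration (not from the slides), the first two paradigms can be contrasted in scikit-learn: a classifier is fitted on labelled data, while a clustering algorithm is given the features only. The iris dataset and the particular algorithms below are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)     # X = measurements, y = labels

# Supervised: the classifier learns from the labels in y
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: only X is supplied and the algorithm forms its own clusters
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:5])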
Machine learning models
■ import pandas as pd
■ from scipy.stats import pearsonr
■ df = pd.read_csv('Heights.csv')
■ df.plot(kind='scatter', x='Female_Height', y='Male_Height', figsize=(10,6));
■ print(df.corr())
■ corr, sig = pearsonr(df['Male_Height'], df['Female_Height'])
■ print("Male Heights vs Female Heights: correlation =", round(corr, 3), "p-value =", round(sig, 3))
■ print("Significant at the 95% confidence level:", sig < 0.05)
Linear/multiple regression
■ Simple linear regression is a function that allows an analyst or
statistician to make predictions about one variable based on the
information that is known about another variable.
■ Linear regression can only be used when one has two continuous
variables—an independent variable and a dependent variable.
■ The independent variable is the parameter that is used to calculate
the dependent variable or outcome….
– y = mx + c, for example y = 2x + 3, i.e. y (profit) = 2 × x (number of customers) + 3
■ A multiple regression model extends to several explanatory variables.
– y = 2x + 3k + 4z + c: the dependent variable is determined by more
than one independent variable
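■ A minimal sketch of the y = mx + c idea in code (the profit and customer numbers are made up for illustration): fitting a straight line recovers the slope m and the intercept c.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from profit = 2*customers + 3
customers = np.array([[10], [20], [30], [40], [50]])
profit = 2 * customers.ravel() + 3

model = LinearRegression().fit(customers, profit)
print(model.coef_[0], model.intercept_)   # slope m ~ 2.0, intercept c ~ 3.0
print(model.predict([[60]]))              # predicted profit for 60 customers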
Linear/multiple regression
■ This is an example of predictive data analysis –
also creates an opportunity for machine learning
(ML)
■ Take the example of vehicles and carbon
emission – let us suppose that we need to find a
model that will predict the level of carbon
emission of a vehicle if we know the weight and
engine capacity
The data as a csv
■ Use pandas to arrange the data into data frames
■ import pandas as pd
■ df=pd.read_csv("cars.csv")
■ df
Get the main data items
■ To build the model – we need to isolate the
independent and the dependent variables
■ The volume and weight are the independent variables
■ The CO2 is the dependent variable…use X to
represent the independent variable and y to
represent the dependent variable
■ X = df[['Weight', 'Volume']]
y = df['CO2']
Using an ML Library
from sklearn import linear_model

# Fit a multiple linear regression model: CO2 predicted from Weight and Volume
regr = linear_model.LinearRegression()
X = df[['Weight', 'Volume']]
y = df['CO2']
regr.fit(X, y)

# Predict the CO2 emission for a user-supplied car
weight = input('Enter the car weight: ')
engine = input('Enter the engine capacity in ccm: ')
predictedCO2 = regr.predict([[int(weight), int(engine)]])
print(predictedCO2)
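■ The fit above uses every row for training. One hedged extension (not part of the original slide) is to hold some rows back and score the model on unseen cars, reusing the X, y and regr objects defined above:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hold back 20% of the rows in X and y to test the model on unseen cars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
regr.fit(X_train, y_train)
print('R squared on unseen cars:', r2_score(y_test, regr.predict(X_test)))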
ML - The main data analysis route
■ Use pandas to import data file
■ Use seaborn and matplotlib for the more detailed statistical plots
■ E.g. – multiple regression
– Similar to linear regression but with more than
one independent value, meaning that we try to
predict a value based on two or
more variables.
Getting the data
■ Another example involving more variables – The decision is to determine the most cost-effective method for advertising
■ The independent data is the cost of advertising via:
– TV
– Radio
– Newspaper
■ The dependent variable is an amount that represents sales
import pandas as pd
df=pd.read_csv("Advertising.csv", index_col="No")
df
Checking the head & tail
■ df.head(10)
■ df.tail(8)
Scatter plot TV vs Sales
df.plot(kind='scatter',x='TV', y='sales',figsize=(10,6),color='Red');
Matplot and seaborn
■ Pandas plotting is quite good, but there are libraries that are better
suited to plotting: matplotlib, and seaborn, which is better still
■ import seaborn as sns
■ sns.pairplot(df,x_vars=['TV','radio','newspaper'],y_vars='sales')
The color option
■ import seaborn as sns
■ sns.pairplot(df,x_vars=['TV','radio','newspaper'],y_vars='sales', kind='reg',plot_kws={'line_kws':{'color':'red'}} )
Tutorial Exercise
■ Use the Advertising data (csv download from
Learn) and build the predictive model that will
predict the sale given the cost of TV, Radio and
Newspaper
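■ One possible starting point for the exercise, sketched under the assumption that the CSV uses the column names TV, radio, newspaper and sales seen in the pairplot code above:
import pandas as pd
from sklearn import linear_model

df = pd.read_csv("Advertising.csv", index_col="No")

X = df[['TV', 'radio', 'newspaper']]   # advertising spend per channel
y = df['sales']                        # observed sales

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict sales for a hypothetical spend of 100 (TV), 25 (radio) and 10 (newspaper)
print(regr.predict([[100, 25, 10]]))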
Another Example of ML
■ The file contains information about passengers who were on board the
Titanic when the collision took place.
■ We will use this data to perform exploratory data analysis
in Python and better understand the factors that contributed to a
passenger’s survival of the incident.
■ The idea here is to use the passengers' details (independent variables)
to predict whether the passenger survived (dependent variable)
■ In this scenario – we have 2 datasets
– The 1st is a training dataset with full data including whether the
passenger has survived
– The 2nd is a sample dataset with the dependent variable missing
(i.e. the survival indicator is omitted)
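■ To make this two-dataset idea concrete, here is a hedged sketch (the second file name 'test.csv' and the small feature set are assumptions, not taken from the slides): train a classifier on the labelled data, then predict the missing survival indicator for the sample data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('train.csv')
sample = pd.read_csv('test.csv')            # assumed name for the unlabelled sample file

features = ['Pclass', 'Fare']               # a deliberately small feature set for this sketch

X_train = train[features].fillna(train[features].median())
y_train = train['Survived']
X_sample = sample[features].fillna(train[features].median())

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sample['Survived_pred'] = clf.predict(X_sample)
print(sample[['Pclass', 'Fare', 'Survived_pred']].head())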
The Titanic
import os
import pandas as pd
import seaborn as sns

os.chdir(r"C:\SE_2025\Python\Data")
df = pd.read_csv('train.csv')
df.head()

# Correlation of the numeric columns only
df_num = df.select_dtypes(include='number')
print(df_num.corr())
sns.heatmap(df_num.corr(), annot=True)
Pivot the data
■ The Pivot tables are very good at providing a deeper insight
print(pd.pivot_table(df, index='Survived', columns='Pclass', values='Ticket', aggfunc='count'))
print('----------------------------------------------------------')
print(pd.pivot_table(df, index='Survived', columns='Sex', values='Ticket', aggfunc='count'))
print('----------------------------------------------------------')
print(pd.pivot_table(df, index='Survived', columns='Embarked', values='Ticket', aggfunc='count'))
Histogram of Categorical Variable
■ In the Seaborn library, we can create a count plot
to visualize the distribution of the
‘Survived’ variable.
■ Essentially, a count plot can be thought of as
a histogram across a categorical variable.
■ To do this, run the following code:
sns.countplot(data=df, x='Survived')
■ Bar plots then relate survival to fare and to passenger class:
sns.barplot(data=df, x='Survived', y='Fare')
sns.barplot(data=df, x='Pclass', y='Survived')
Answering the questions via plots
■ Did a passenger’s age have any impact on what class they
traveled in? Yes, older passengers were more likely to travel
first class.
■ Were passengers who paid higher ticket fares in different
cabins as opposed to passengers who paid lower fares? Yes,
passengers who paid higher ticket fares seemed to mostly
travel in cabin B. However, the relationship between ticket fare
and cabin isn’t too clear because there were many missing
values in the ‘Cabin’ column, which might have compromised the
quality of the analysis.
■ Did ticket fare have any impact on a passenger’s survival? Yes,
first-class passengers were more likely to survive the collision.
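■ For reference, a sketch of the kinds of plots that could support these answers (the exact plots used on the slides are not shown here, so these are illustrative choices that reuse the Titanic dataframe df from earlier):
import seaborn as sns
import matplotlib.pyplot as plt

# Age distribution per travel class (question 1)
sns.boxplot(data=df, x='Pclass', y='Age')
plt.show()

# Fare distribution per cabin deck letter, ignoring missing cabins (question 2)
sns.boxplot(data=df.assign(Deck=df['Cabin'].str[0]), x='Deck', y='Fare')
plt.show()

# Survival rate per passenger class, a proxy for fare (question 3)
sns.barplot(data=df, x='Pclass', y='Survived')
plt.show()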
The challenge