5 Data Science Project Lifecycle

Uploaded by

keladevi0208
Copyright
© © All Rights Reserved
Data Science Life Cycle

Life Cycle of a Data Science Project

Step 1: Problem Understanding
Step 2: Data Collection
Step 3: Preprocessing Data
Step 4: Analyzing Data
Step 5: Data Modeling
Step 6: Model Evaluation
Step 7: Driving insights and Generating Business Reports
Step 1: Problem Understanding

It is important to understand what the problem statement is and
to ask the customer the right questions (the quality of the
questions matters). Good questions help us understand the data
well and derive meaningful insights from it.
Step 1: Problem Understanding
Every domain and business works with a set of rules and
goals. In order to acquire the correct data, we should be
able to understand the business. Asking questions about
the dataset will help in narrowing it down to correct data
acquisition.
We typically use data science to answer five types of
questions:
• How much or how many? (regression)
• Which category? (classification)
• Which group? (clustering)
• Is this weird? (anomaly detection)
• Which option should be taken? (recommendation)
Step 1: Problem Understanding
A few of the right questions that successful businesses
have asked their data science teams in the past:
• Uber: What percentage of time do drivers actually
drive? How steady is their income?
• Oyo Hotels: What is the average occupancy of
mediocre hotels?
• Alibaba: What are the per-square-foot profits of our
warehouses?
Data Sources
• Evolution of Technology
• Internet of Things
• Social Media
• Other Factors
Data Sources: IoT
Data Sources: Social Media

• 347,222 tweets
• 204,000,000 emails
• 17,361,111 pictures
• 300 hours of video uploaded
• 41,666,667 likes
• 2,000,000 dislikes
Data Sources: Other Factors

Transport
Government
In data science projects, we extract knowledge and insights
from data by using scientific methods.
How do Data Scientists get useful insights from data?
• It all starts with data exploration.

Prerequisites for Data Science
The following are the three essential traits of a Data Scientist:

Curiosity: Only when you ask questions will you have a better understanding of the
business problem.
Common Sense: To identify new ways to solve business problems and to detect
priority problems.
Communication Skills: A Data Scientist needs to communicate their findings to
business teams so that they can act upon the insights.
Skills required for a Data Scientist

Domain Knowledge:
• To get useful information out of raw data that
benefits a company’s business.
• To know the business model of the company.
• To ask the right questions to produce valuable
results.

Math Skills:
• Linear Algebra, Calculus, and other concepts
of mathematics help us to understand the
complex behavior of Machine Learning
algorithms.
• Probability and statistics are mainly used in
predictive modeling and clustering.
Skills required for a Data Scientist
Computer Science:
• To implement Data Science techniques using programming
languages like Python, R, SQL, Scala, Julia, JavaScript, etc.
• To deal with varied databases and cloud networks to process the
data.
• Knowledge about algorithms, relational and non-relational
databases, Distributed Computing, and Machine Learning.

Communication Skills:
• To communicate well when working in a team.
• To draw conclusions from the data analysis and present
them.
Step 2: Data Collection

• Identify the person who knows what data to acquire
and when to acquire it, based on the question to be
answered.
• This can be anyone who knows the real difference
between the various available data sets and can make
hard-hitting decisions about the data investment
strategy.
Step 2: Data Collection
Data might need to be collected from multiple types of
data sources.
A few examples of data sources:
• File formats (Spreadsheet, CSV, Text files, XML,
JSON)
• Relational Database
• Non-relational Database (NoSQL)
• Scraping website data using tools
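As a rough illustration of collecting from several source types, the sketch below reads a CSV text, a JSON text, and a relational table, then merges them by a shared key. Everything in it is an assumption made for the example: the sample data, the `customer_id` key, the `payments` table, and the helper names are hypothetical, and an in-memory SQLite database stands in for a real relational source.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for real files and databases.
CSV_TEXT = "customer_id,city\n1,Pune\n2,Delhi\n"
JSON_TEXT = '[{"customer_id": 1, "orders": 3}, {"customer_id": 2, "orders": 5}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of records into a list of dicts."""
    return json.loads(text)


def load_sql():
    """Read rows from an in-memory SQLite table (stand-in relational source)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 250.0), (2, 99.5)])
    rows = conn.execute(
        "SELECT customer_id, amount FROM payments ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return [{"customer_id": cid, "amount": amount} for cid, amount in rows]


def collect():
    """Merge the three sources into one record per customer_id."""
    merged = {}
    for row in load_csv(CSV_TEXT):
        # CSV values arrive as strings, so the key is converted to int.
        merged[int(row["customer_id"])] = {"city": row["city"]}
    for row in load_json(JSON_TEXT):
        merged[row["customer_id"]]["orders"] = row["orders"]
    for row in load_sql():
        merged[row["customer_id"]]["amount"] = row["amount"]
    return merged


if __name__ == "__main__":
    print(collect())
```

In a real project each loader would point at an actual file path or database connection; the merge step is where differences between sources (types, key names) usually surface.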
Step 3: Data cleaning phase

• It is also referred to as the data wrangling phase.

• In this step, we understand more about the data and
prepare it for further analysis.

• The data you collected should represent the problem
to be solved.
Step 3: Cleaning data

• Cleaning data essentially means resolving
discrepancies in your data, for example:

• handling missing fields
• correcting improper values
• setting the right format of the data
• structuring data from raw files, etc.
Step 3: Cleaning data
• Format the data into the desired structure, and remove
unwanted columns and features.
• Data preparation is the most time-consuming yet
arguably the most important step in the entire life
cycle.
• Data collection, data understanding, and data
preparation take up 70% to 90% of the overall
project time.
• If you feel the data is not suitable or not sufficient to
proceed, you can go back to the data collection step.
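The cleaning tasks listed above can be sketched in a few lines. This is a minimal example under stated assumptions, not a general recipe: the records, the column names, the 0 to 120 plausibility range for age, and the choice to fill missing ages with the median are all hypothetical.

```python
from statistics import median

# Hypothetical raw records showing typical problems: stray whitespace,
# a missing value, an impossible value, and an unwanted "notes" column.
RAW = [
    {"name": " Asha ", "age": "34", "joined": "2021/03/05", "notes": "x"},
    {"name": "Ravi", "age": "", "joined": "2020/11/20", "notes": "y"},
    {"name": "Meena", "age": "-5", "joined": "2022/01/15", "notes": "z"},
    {"name": "Kiran", "age": "40", "joined": "2019/07/01", "notes": "w"},
]


def parse_age(value):
    """Return the age as an int, or None if it is missing or implausible."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 120 else None


def clean(records):
    """Drop unwanted columns, fix formats, and fill missing ages."""
    cleaned = []
    for rec in records:
        cleaned.append({
            "name": rec["name"].strip(),                # trim whitespace
            "age": parse_age(rec["age"]),               # improper -> None
            "joined": rec["joined"].replace("/", "-"),  # one date format
        })                                              # "notes" is dropped
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = median(known) if known else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill                             # fill missing fields
    return cleaned


if __name__ == "__main__":
    for row in clean(RAW):
        print(row)
```

Whether to fill, flag, or drop a bad value depends on the problem; the median fill here is only one option.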
Step 4: Analyzing Data

EXPLORE… EXPLORE… EXPLORE……………………


Step 4: Analyzing Data

• Exploratory analysis is often described as a philosophy,
and there are no fixed rules for how you approach it.
There are no shortcuts for data exploration.
• The quality of your inputs decides the quality of your
output.
• To understand the data, many people look at summary
statistics such as the mean and median.
• People also plot the data and look at its distribution
through plots like histograms, spectrum analysis,
population distributions, etc.
• We create a plan to do analytics on the data.
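A first look at a numeric column might be sketched as follows: compute a few summary statistics, then bucket the values into a text histogram. The `orders` list is made-up sample data and the bin width of 5 is an arbitrary choice for the example.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: daily order counts over two weeks.
orders = [12, 15, 11, 14, 13, 40, 12, 16, 15, 13, 14, 12, 15, 13]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def print_histogram(values, bin_width):
    """Print one row of '#' marks per non-empty bin."""
    for edge, count in histogram(values, bin_width).items():
        print(f"{edge:>3}-{edge + bin_width - 1:<3} {'#' * count}")


if __name__ == "__main__":
    print(summarize(orders))
    print_histogram(orders, 5)
```

In this sample the mean sits above the median because of the single value of 40, which is the kind of thing a histogram makes visible before any modeling starts.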
Step 4: Analyzing Data

Different types of analytics include:

💡 Descriptive Analytics
(What has happened in the past?)
Uses data aggregation methods.

💡 Predictive Analytics
(What could happen in the future?)
Uses statistical methods and other forecasting techniques.

💡 Prescriptive Analytics
(What should we do?)
Uses optimization and simulation methods, such as what-if and
if-what analysis.
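To make the three kinds of analytics concrete, here is a toy sketch on one small series. The sales figures, the naive trend forecast, and the stock-level comparison are all assumptions chosen for illustration; real projects would use far richer methods at each stage.

```python
from statistics import mean

# Hypothetical monthly sales (units) for one product.
sales = [100, 110, 120, 130]


def describe(history):
    """Descriptive: aggregate what has already happened."""
    return {"total": sum(history), "average": mean(history)}


def forecast_next(history):
    """Predictive: naive forecast extending the average monthly change."""
    changes = [later - earlier for earlier, later in zip(history, history[1:])]
    return history[-1] + mean(changes)


def best_stock_level(expected_demand, options, unit_profit, unit_holding_cost):
    """Prescriptive: what-if comparison of candidate stock levels."""
    def profit(stock):
        sold = min(stock, expected_demand)
        unsold = stock - sold
        return sold * unit_profit - unsold * unit_holding_cost

    return max(options, key=profit)


if __name__ == "__main__":
    print(describe(sales))
    demand = forecast_next(sales)
    print(demand)
    print(best_stock_level(demand, [120, 140, 160], unit_profit=5, unit_holding_cost=2))
```

The output of each stage feeds the next: the description shows the trend, the forecast extends it, and the prescriptive step uses the forecast to compare options.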
Step 5: Data Modeling / Machine Learning Modeling
Step 5: Data Modeling

Modeling is used to find patterns or behaviors in data.
These patterns help us in one of two ways:

Descriptive modeling: for example, recommender systems,
which might infer that a person who liked the movie
The Matrix would also like the movie Inception.

Predictive modeling: this involves predicting future
trends, e.g. linear regression, where we might want to
predict stock exchange values.
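As a small illustration of predictive modeling, the sketch below fits a straight line by ordinary least squares and uses it to extrapolate one step ahead. The `days` and `prices` values are invented, and a straight-line fit is used only because it is the simplest case; it is not a realistic model of stock prices.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical data: day index and closing price.
days = [1, 2, 3, 4, 5]
prices = [101.0, 103.0, 105.0, 107.0, 109.0]

if __name__ == "__main__":
    slope, intercept = fit_line(days, prices)
    print(slope, intercept, predict(6, slope, intercept))
```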
Step 5: Data Modeling

Supervised Learning:
• A technique to train the machine using labeled data.
• A few examples of Supervised Algorithms:
o Naive Bayes
o Random Forest
o Neural Network Algorithms
o k-Nearest Neighbor (kNN)
o Linear Regression
o Logistic Regression
o Support Vector Machines (SVM)
o Decision Trees
o Boosting
o Bagging
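To show what training on labeled data looks like in practice, here is a minimal k-Nearest Neighbor classifier written from scratch. The points, the labels "small" and "large", and the default `k=3` are hypothetical; a library implementation would normally be used instead.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort labeled points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups in 2D.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["small", "small", "small", "large", "large", "large"]

if __name__ == "__main__":
    print(knn_predict(points, labels, (2, 2)))
    print(knn_predict(points, labels, (7, 8)))
```

The labels are what make this supervised: the algorithm never has to discover the groups, it only has to look up which known group a new point is closest to.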
Step 5: Data Modeling

Unsupervised Learning:
• It involves training on unlabeled data and allowing
the model to act on that information without guidance.
• A few examples of Unsupervised Algorithms:
o K-Means / K-Means++
o Hierarchical Clustering
o Density-Based Spatial Clustering of Applications with
Noise (DBSCAN)
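For contrast with the supervised case, the following sketch runs plain K-Means (Lloyd's algorithm) on unlabeled points. Note the assumptions: the initial centroids are simply the first `k` points, which keeps the example deterministic and is not the K-Means++ initialization, and the sample points and iteration count are made up.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical unlabeled data: two visible groups in 2D.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

if __name__ == "__main__":
    centroids, assignments = kmeans(points, k=2)
    print(centroids)
    print(assignments)
```

No labels are supplied anywhere; the cluster indices in `assignments` are discovered from the geometry of the data alone.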
Step 6: Model Evaluation

• The model is validated and tested to measure its
performance.
• A model is selected based on the business problem.
• It is essential to identify what the task is: a
classification problem, a regression problem, time series
forecasting, or a clustering problem. Once the problem
type is sorted out, the model is implemented.
Step 6: Model Evaluation

A few examples of Classification metrics:

• Classification Accuracy
• Confusion matrix
• Logarithmic Loss (Log Loss)
• Area under curve (AUC)
• F-Measure (F1 Score)
• Precision
• Recall
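Several of the classification metrics above follow directly from the four cells of a binary confusion matrix. The sketch below computes accuracy, precision, recall, and F1 that way; the label vectors are invented, class `1` is assumed to be the positive class, and metrics with a zero denominator are reported as 0.0 by convention in this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

if __name__ == "__main__":
    print(confusion_counts(y_true, y_pred))
    print(classification_report(y_true, y_pred))
```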

A few examples of Regression metrics:

• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute Percentage Error (MAPE)
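The four regression metrics can be computed from the per-sample errors in a few lines. This is a sketch with made-up values; returning `None` for MAPE when any actual value is zero is a choice made for this example, since the percentage error is undefined in that case.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for paired values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    errors = [pred - truth for truth, pred in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    if any(truth == 0 for truth in y_true):
        mape = None  # undefined when an actual value is zero
    else:
        mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape}


# Hypothetical actual and predicted values.
y_true = [100, 200, 400]
y_pred = [110, 190, 400]

if __name__ == "__main__":
    print(regression_metrics(y_true, y_pred))
```

MAE and RMSE are in the same units as the target, while MAPE is scale-free, which is why reports often quote more than one of them.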
Step 7: Driving insights and Business Intelligence Reports/ Visualization

• Reports are the outcome of Data Analysis and are
required by the stakeholders.
• They should be meaningful, well visualized, and useful
to the organization and the stakeholders.
• You may be presenting to an audience with no
technical background, so the way you communicate is
important.
Step 7: Driving insights and Business Intelligence Reports

A few tools used for Visualization purposes:

• Tableau
• Power BI
• R: ggplot2, lattice
• Kibana
• Grafana
• Spotfire
• Python: Matplotlib, Seaborn, Plotly
