Data Science Methodology

The document outlines the methodology in data science, detailing a 10-stage iterative process for uncovering insights from data. It emphasizes the importance of business understanding, data collection, preparation, modeling, evaluation, deployment, and feedback in developing effective data-driven solutions. Each stage is crucial for ensuring that data scientists can address business problems efficiently and refine their models based on real-world performance.


M.Sc (IT - AI/CC/Security) Semester I

DATA SCIENCE AND ANALYTICS (DSA)

Ms. Pooja R. Tupe
Visiting Faculty, UDIT, University of Mumbai
WHAT IS METHODOLOGY IN DATA SCIENCE?

• Methodology: the best way to organize your work, doing it better and without losing time.
• It provides the data scientist with a framework for how to proceed with whatever methods, processes, and heuristics will be used to obtain answers or results.
WHAT IS METHODOLOGY IN DATA SCIENCE?

• The methodology consists of 10 stages that form an iterative process for using data to uncover insights.
• Each stage plays a vital role in the context of the overall methodology.

1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modeling to Evaluation
5. From Deployment to Feedback

• The process is highly iterative and never truly ends: in a real case study, some steps are repeated to improve the model.
FROM PROBLEM TO APPROACH
STAGE 1: BUSINESS UNDERSTANDING

• Every customer’s request starts with a problem.
• The data scientist’s job is first to understand the problem and then to approach it with statistical and machine learning techniques.
STAGE 1: BUSINESS UNDERSTANDING

• Business understanding helps to clarify the goal of the customer.
• Ask the customer many questions about every single aspect of the problem.
• Clearly defined questions start with understanding the goals.
• By the end of this stage, we will have a list of business requirements.
STAGE 1: BUSINESS UNDERSTANDING

• Goals:
• Specify the key variables that serve as the model targets, and the metrics of those targets, which determine the success of the project.
• Identify the relevant data sources that the business has access to or needs to obtain.
STAGE 2: ANALYTIC APPROACH

• Analytic Approach: Define the analytic approach to solve the business problem.
• Contextual Expression: Express the problem in terms of statistical and machine-learning techniques.
• Pattern Identification: Identify the type of patterns needed to address the question effectively.
• Predictive Model: Used for determining probabilities.
• Descriptive Approach: Used for showing relationships.
• Statistical Analysis: Used for problems requiring counts.
• Algorithm Selection: Choose different algorithms based on the type of approach.
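The mapping above can be sketched as a small lookup that picks an approach from the type of question. This is purely illustrative (the dictionary and example algorithms are assumptions, not an exhaustive taxonomy):

```python
# Hedged sketch: pick an analytic approach from the question type,
# following the pattern-identification bullets above.
APPROACHES = {
    "probability": "predictive model (e.g., logistic regression)",
    "relationship": "descriptive approach (e.g., clustering)",
    "count": "statistical analysis (e.g., Poisson regression)",
}

def choose_approach(question_type: str) -> str:
    # Fall back to clarifying the business question when the type is unknown.
    return APPROACHES.get(question_type, "clarify the question first")

print(choose_approach("probability"))
```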
FROM REQUIREMENTS TO
COLLECTION

Once we have found a way to solve our problem, we need to discover the correct data for our model.
STAGE 3: DATA REQUIREMENTS

• Identify the necessary data content, formats, and sources for initial data collection; this data is used by the algorithm of the chosen approach.
• The chosen analytic approach determines the data requirements.
• Specifically, the analytic methods to be used require certain data content, formats, and representations, guided by domain knowledge.
STAGE 4: DATA COLLECTION

• Identify and gather structured, unstructured, and semi-structured data relevant to the
problem domain.
• Data Scientists identify the available data resources relevant to the problem domain.
• Decide whether to invest in obtaining less-accessible data elements.
• To retrieve data, we can do web scraping on a related website, or we can use a repository with premade datasets ready to use.
• Usually, premade datasets are CSV or Excel files; to collect data from a website or repository, we can use Pandas, a useful tool to download, convert, and modify datasets.
• Revise data requirements and collect new or additional data if there are gaps.
• Incorporating more data can help predictive models better represent rare events, such as
disease incidence or system failure.
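Loading a premade CSV dataset with Pandas, as mentioned above, can be sketched as follows. The column names and values here are invented for illustration (an in-memory CSV stands in for a downloaded file):

```python
import io
import pandas as pd

# A small inline CSV stands in for a premade dataset downloaded
# from a repository; the columns and values are made up.
raw = io.StringIO(
    "age,income,churned\n"
    "34,52000,0\n"
    "45,61000,1\n"
    "29,48000,0\n"
)

df = pd.read_csv(raw)   # pandas parses the CSV into a DataFrame
print(df.shape)         # rows x columns of the collected data
```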
FROM UNDERSTANDING TO
PREPARATION

Data scientists:
• use descriptive statistics and visualization techniques to understand the data better;
• explore the dataset to understand its content, determine whether additional data is necessary to fill any gaps, and verify the quality of the data.
STAGE 5: DATA UNDERSTANDING

• Data scientists use descriptive statistics and visualization techniques to understand the data content, assess data quality, and discover initial insights about the data.
• Additional data collection may be necessary to fill gaps.
• Check the type of each attribute and learn more about the attributes and their names.
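These checks map directly onto a few Pandas calls. A minimal sketch on a hypothetical dataset (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical dataset, invented purely to illustrate data understanding.
df = pd.DataFrame({
    "age": [34, 45, 29, 41],
    "income": [52000, 61000, 48000, None],
})

print(df.dtypes)        # type of each attribute
print(df.describe())    # count, mean, std, min/max per numeric column
print(df.isna().sum())  # missing values per column: a data-quality check
```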
STAGE 6: DATA PREPARATION

• This stage sets the foundation for the subsequent modeling phase, as it encompasses all activities needed to construct the data set.
• Data Cleaning: Handling missing or invalid values, removing duplicates, and
ensuring proper formatting.
• Combining Data: Integrating data from various sources like files, tables, and
platforms.
• Transforming Data: Creating more useful variables through feature engineering,
which involves using domain knowledge and existing variables.
• Text Analytics: Converting unstructured or semi-structured text data into
structured variables to enhance model accuracy.
• Automation: Automating common data preparation steps to save time and
improve efficiency.
• High-Performance Systems: Utilizing advanced systems and analytics to handle
large datasets more effectively.
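The cleaning, combining, and transforming steps above can be sketched on a toy table. All names and values here are made up; the point is only the shape of a typical preparation pass:

```python
import pandas as pd

# Toy records with the kinds of problems data preparation fixes:
# inconsistent formatting, a duplicate row, and a missing value.
df = pd.DataFrame({
    "name": [" Alice ", "Bob", "Bob", "Carol"],
    "income": [52000, 61000, 61000, None],
})

df["name"] = df["name"].str.strip()                        # fix formatting
df = df.drop_duplicates()                                  # remove duplicates
df["income"] = df["income"].fillna(df["income"].median())  # impute missing
df["high_income"] = df["income"] > 55000                   # feature engineering

print(df)
```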
STAGE 7: MODELING

• Model Development: This stage involves creating predictive or descriptive models based on the analytic approach defined earlier.
• Training Set: Predictive models are built using historical data where the
outcome is known.
• Iterative Process: The modeling process is iterative, with intermediate
insights leading to refinements in both data preparation and model
specification.
• Algorithm Selection: Data scientists experiment with multiple algorithms
and their parameters to identify the best model for the given variables.
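Building a predictive model from historical data with known outcomes can be sketched with the simplest possible example: least-squares linear regression fit by its closed-form solution. The numbers are invented; a real project would compare several algorithms, as the bullets above note:

```python
# Historical data where the outcome is known (invented values).
xs = [1.0, 2.0, 3.0, 4.0]   # feature (e.g., years of tenure)
ys = [2.1, 3.9, 6.1, 8.0]   # known outcome (e.g., revenue)

# Fit slope and intercept by closed-form least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(round(predict(5.0), 2))   # the trained model scores unseen input
```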
STAGE 8: EVALUATION

• In the Model Evaluation stage, data scientists can evaluate the model in two ways: hold-out and cross-validation.
• In the hold-out method, the dataset is divided into three subsets:
• a training set, as described in the modeling stage;
• a validation set, a subset used to assess the performance of the model built in the training phase;
• a test set, a subset used to estimate the likely future performance of the model.
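The three-way hold-out split can be sketched in a few lines. The 60/20/20 ratios and the stand-in data are arbitrary choices for illustration:

```python
import random

data = list(range(100))   # stand-in for 100 labeled examples

rng = random.Random(42)   # fixed seed for reproducibility
rng.shuffle(data)         # shuffle before splitting

n = len(data)
train = data[: int(0.6 * n)]               # fit the model here
valid = data[int(0.6 * n): int(0.8 * n)]   # tune/compare models here
test = data[int(0.8 * n):]                 # estimate future performance here

print(len(train), len(valid), len(test))   # → 60 20 20
```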
FROM DEPLOYMENT TO
FEEDBACK

Data scientists have to familiarize the stakeholders with the tool produced in different scenarios; once the model is evaluated and the data scientist is confident it will work, it is deployed and put to the ultimate test.
STAGE 9: DEPLOYMENT

• Approval and Deployment: Once the model is approved by business sponsors, it is deployed into a production or test environment.
• Limited Deployment: Initially, the model is deployed in a limited manner to fully
evaluate its performance.
• Deployment Methods: Deployment can range from generating reports with
recommendations to embedding the model into applications or systems.
• This stage ensures that the model is effectively integrated into the business
processes and its performance is closely monitored.
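One common way to embed a model into an application, as the deployment-methods bullet suggests, is to persist the trained artifact and load it later. A minimal sketch, where a dict of fitted coefficients stands in for a real model:

```python
import os
import pickle
import tempfile

# Stand-in "trained model": just the fitted coefficients.
model = {"intercept": 0.05, "slope": 1.99}

# Persist the model artifact so a deployed application can load it.
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, inside the deployed application:
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["intercept"] + loaded["slope"] * 5.0)  # serve a prediction
```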
STAGE 10: FEEDBACK

• Feedback Collection: After deploying the model, organizations collect results to evaluate its performance and impact. For example, response rates to a promotional campaign can provide valuable feedback.
• Model Refinement: Analyzing this feedback helps data scientists refine the
model to enhance its accuracy and usefulness.
• Automation: Automating feedback collection, model assessment, refinement,
and redeployment can accelerate the process, leading to more timely and
effective model updates.
• This continuous loop of feedback and refinement ensures that the model
remains relevant and effective in changing environments.
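An automated feedback check of the kind described above can be sketched as comparing the live response rate against the rate expected from evaluation and flagging the model for refinement when it degrades. The threshold and numbers are illustrative only:

```python
# Response rate the model achieved during evaluation (illustrative).
expected_rate = 0.12

# Results collected after deployment (illustrative counts).
responses, contacted = 41, 500

live_rate = responses / contacted
# Flag for refinement if live performance drops below 80% of expected.
needs_refinement = live_rate < 0.8 * expected_rate

print(round(live_rate, 3), needs_refinement)
```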
