Data Science Methodology
• The methodology is highly iterative and never really ends: in a real case study,
some steps have to be repeated to improve the model.
FROM PROBLEM TO APPROACH
STAGE 1: BUSINESS UNDERSTANDING
• Goals:
• Specify the key variables that serve as the model targets, and the metrics on those
targets that determine the success of the project.
• Identify the relevant data sources that the business has access to or needs to obtain.
STAGE 2: ANALYTIC APPROACH
• Analytic Approach: Define the analytic approach to solve the business problem.
• Contextual Expression: Express the problem in terms of statistical and machine-learning techniques.
• Pattern Identification: Identify the type of patterns needed to address the question effectively.
• Predictive Model: Used when the question asks for the probability of an outcome.
• Descriptive Approach: Used when the question asks to show relationships in the data.
• Statistical Analysis: Used when the question requires counts.
• Algorithm Selection: Choose algorithms that suit the selected type of approach.
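The three kinds of question listed above can be contrasted on a tiny example. This is only a sketch: the records, the group names, and the helper predict_probability are invented for illustration, and the "prediction" is just an observed rate standing in for a fitted model.

```python
from collections import Counter

# Hypothetical toy records: (age_group, readmitted) pairs, invented for illustration.
records = [
    ("senior", True), ("senior", True), ("senior", False),
    ("adult", False), ("adult", False), ("adult", True),
    ("adult", False), ("senior", True),
]

# Statistical analysis: a question that only needs counts.
counts = Counter(group for group, _ in records)

# Descriptive approach: show how the outcome relates to the group.
readmit_rate = {
    group: sum(1 for g, r in records if g == group and r) / n
    for group, n in counts.items()
}

# Predictive flavour: reuse the observed rate as a naive probability estimate
# for a new case (a real project would fit a proper model instead).
def predict_probability(group):
    return readmit_rate[group]

print(counts)
print(readmit_rate)
print(predict_probability("senior"))
```

The same data answers all three questions; what changes is the form of the answer (a count, a relationship, or a probability), which is what drives the choice of algorithm.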
FROM REQUIREMENTS TO COLLECTION
STAGE 3: DATA REQUIREMENTS
• Identify the necessary data content, formats, and sources for initial data collection;
this data is what feeds the algorithm of the chosen approach.
• The chosen analytic approach determines the data requirements.
• Specifically, the analytic methods to be used require certain data content, formats,
and representations, guided by domain knowledge.
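One way to make data requirements concrete is to write them down as a checkable specification. The sketch below assumes a tabular dataset; the column names, the expected types, and the helper missing_or_mistyped are hypothetical and not part of the methodology itself.

```python
# Hypothetical requirements for a tabular dataset: column name -> expected type.
REQUIRED_COLUMNS = {"patient_id": int, "age": int, "diagnosis": str}

def missing_or_mistyped(row):
    """Return the required columns that are absent from `row` or have the wrong type."""
    problems = []
    for name, expected in REQUIRED_COLUMNS.items():
        if name not in row or not isinstance(row[name], expected):
            problems.append(name)
    return problems

# A row that meets the requirements, and one that does not.
print(missing_or_mistyped({"patient_id": 1, "age": 40, "diagnosis": "flu"}))
print(missing_or_mistyped({"patient_id": 2, "age": "40"}))
```

Checking collected data against such a specification early makes gaps visible before the modeling stage.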
STAGE 4: DATA COLLECTION
• Identify and gather structured, unstructured, and semi-structured data relevant to the
problem domain.
• Data Scientists identify the available data resources relevant to the problem domain.
• Decide whether to invest in obtaining less-accessible data elements.
• To retrieve data, we can scrape a related website or use a repository of premade,
ready-to-use datasets.
• Premade datasets are usually CSV or Excel files. To collect data from a website or
repository, Pandas is a useful tool for downloading, converting, and modifying
datasets.
• Revise data requirements and collect new or additional data if there are gaps.
• Incorporating more data can help predictive models better represent rare events, such as
disease incidence or system failure.
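As a rough illustration of loading, modifying, and converting a premade CSV dataset with Pandas, the sketch below reads from an in-memory string so that it runs without network access. The city and temperature values are invented; with a real source, a file path or URL would be passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for a downloaded file (invented data); a real path or URL could be
# passed to pd.read_csv in place of this in-memory text.
raw_csv = io.StringIO("city,temperature_c\nRome,21.5\nOslo,7.0\nRome,21.5\n")

df = pd.read_csv(raw_csv)      # load the CSV into a DataFrame
df = df.drop_duplicates()      # modify: drop the repeated Rome row
df["temperature_f"] = df["temperature_c"] * 9 / 5 + 32   # add a converted column

csv_text = df.to_csv(index=False)   # convert back to CSV text for storage
print(df)
```

For Excel files, pd.read_excel plays the same role as pd.read_csv, although it needs an additional Excel engine package to be installed.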
FROM UNDERSTANDING TO PREPARATION
STAGE 5: DATA UNDERSTANDING
• Data scientists use descriptive statistics and visualization techniques to
understand the data better.
• They explore the dataset to understand its content, verify the quality of the
data, and determine whether additional data is necessary to fill any gaps.
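A first pass at understanding a single column might look like the sketch below, which uses only the Python standard library. The ages list is invented for illustration, with None standing in for missing values.

```python
import statistics

# Hypothetical column with gaps (None = missing); values invented for illustration.
ages = [34, 45, None, 29, 45, 61, None, 38]

present = [a for a in ages if a is not None]
summary = {
    "count": len(present),
    "missing": len(ages) - len(present),   # a quality check: how many gaps?
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
    "min": min(present),
    "max": max(present),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The missing count speaks to data quality and possible gaps, while the other figures describe the content of the column.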
STAGE 6: DATA PREPARATION
• Data preparation sets the foundation for the subsequent modeling phase, as it
encompasses all activities needed to construct the data set.
• Data Cleaning: Handling missing or invalid values, removing duplicates, and
ensuring proper formatting.
• Combining Data: Integrating data from various sources like files, tables, and
platforms.
• Transforming Data: Creating more useful variables through feature engineering,
which involves using domain knowledge and existing variables.
• Text Analytics: Converting unstructured or semi-structured text data into
structured variables to enhance model accuracy.
• Automation: Automating common data preparation steps to save time and
improve efficiency.
• High-Performance Systems: Utilizing advanced systems and analytics to handle
large datasets more effectively.
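The cleaning and transforming steps above can be sketched on a handful of records. Everything here is an assumption made for illustration: the field names, the choice to drop rows with a missing weight rather than impute them, and the body-mass index as the engineered feature.

```python
# Hypothetical raw records; None marks a missing value (data invented for illustration).
raw = [
    {"id": 1, "height_m": 1.80, "weight_kg": 81.0},
    {"id": 2, "height_m": 1.60, "weight_kg": None},
    {"id": 1, "height_m": 1.80, "weight_kg": 81.0},   # duplicate of id 1
    {"id": 3, "height_m": 1.50, "weight_kg": 45.0},
]

def prepare(records):
    """Drop duplicate ids and rows with missing weight, then add a derived feature."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:            # data cleaning: remove duplicates by id
            continue
        seen.add(rec["id"])
        if rec["weight_kg"] is None:     # data cleaning: skip rows with a missing value
            continue
        # transforming data: build a new variable from existing ones
        bmi = rec["weight_kg"] / rec["height_m"] ** 2
        cleaned.append({**rec, "bmi": round(bmi, 1)})
    return cleaned

prepared = prepare(raw)
print(prepared)
```

Wrapping the steps in a function is also a small instance of the automation point: the same preparation can be rerun whenever new data arrives.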
STAGE 7: MODELING
STAGE 8: EVALUATION
• In the Model Evaluation stage, data scientists can evaluate the model in two
ways: Hold-Out and Cross-Validation.
• In the Hold-Out method, the dataset is divided into three subsets:
• a training set, the subset used to build the model in the modeling stage;
• a validation set, the subset used to assess the performance of the model
built in the training phase;
• a test set, the subset used to evaluate the likely future performance of the model.
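Both evaluation schemes come down to how the data is partitioned. The sketch below uses only the standard library; the 60/20/20 split ratios, the fixed seed, and the function names hold_out_split and k_fold_indices are assumptions chosen for illustration.

```python
import random

def hold_out_split(items, train=0.6, validation=0.2, seed=0):
    """Shuffle the items and split them into training, validation, and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])       # whatever remains is the test set

def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

train_set, validation_set, test_set = hold_out_split(range(10))
print(len(train_set), len(validation_set), len(test_set))

for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx, "held out;", len(train_idx), "used for training")
```

In Hold-Out each record lands in exactly one subset, whereas in Cross-Validation every record is used for testing exactly once across the k rounds.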
FROM DEPLOYMENT TO FEEDBACK
STAGE 9: DEPLOYMENT
Once the model is evaluated and the data scientist is confident it will work,
it is deployed and put to the ultimate test.
Data scientists also have to make the stakeholders familiar with the tool
produced, across different scenarios.