Unit 2 - DS - 1st Year

The document outlines the six steps of the data science process: setting research goals, data retrieval, data preparation, data exploration, model building, and presenting results. Each step involves specific tasks such as defining project charters, cleansing data, and using graphical techniques for analysis. Additionally, it discusses methods for reading and writing data in R, emphasizing the use of the readr package for efficiency.

The Data Science Process

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 1st years

The Data Science Process


The typical data science process consists of six steps through which you'll iterate.
The following list is a short introduction; each of the steps will be discussed in greater depth as
we go further:

1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project. In
every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the
data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from different
data sources, and transform it.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modeling.
5. The fifth step is model building (often referred to as "data modeling"). It is
now that you attempt to gain the insights or make the predictions stated in your project
charter.
6. The last step of the data science process is presenting your results and automating the
analysis, if needed.

Step 1: Defining research goals and creating a project charter

A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? Why does management place such a high value on your research?
Is it part of a bigger strategic picture? Answering these three questions is the goal of the first phase.

The outcome should be a clear research goal, a good understanding of the context, well-defined
deliverables, and a plan of action with a timetable.

A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
Step 2: Retrieving data
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs polishing to be of any use to you.

Step 3: Cleansing, integrating, and transforming data

Your task now is to sanitize the data you retrieved and prepare it for use in the modeling and reporting phase.

CLEANSING PROCESS
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data, so that your data becomes a true and consistent representation of the processes it originates from.

DATA ENTRY ERRORS


Data collection and data entry are error-prone processes. They often require human
intervention, and because humans are only human, they make typos or lose their concentration
for a second and introduce an error into the chain. But data collected by machines or computers
isn’t free from errors either.

REDUNDANT WHITESPACE
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. If
you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove the leading and
trailing whitespaces.
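
A minimal sketch in base R, assuming a data frame df with a hypothetical character column country:

# remove leading and trailing whitespace from a hypothetical column
df$country <- trimws(df$country)
# collapse runs of internal whitespace into a single space
df$country <- gsub("\\s+", " ", df$country)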

OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations.
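
A quick sketch of how simple plots and summaries can expose an outlier, using a small made-up vector:

x <- c(2, 3, 3, 4, 5, 4, 3, 250)   # 250 clearly follows a different generative process
boxplot(x)                          # points beyond the whiskers are candidate outliers
summary(x)                          # the maximum sits far above the 3rd quartile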

DEALING WITH MISSING VALUES


Missing values aren't necessarily wrong, but you still need to handle them separately; certain
modeling techniques can't handle missing values.
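
Two common ways of handling missing values in R, sketched on a hypothetical data frame df with a numeric column age:

# option 1: drop all rows that contain missing values
complete <- na.omit(df)
# option 2: impute a numeric column with its mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)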

Data Transformation:

THE DIFFERENT WAYS OF COMBINING DATA


You can perform two operations to combine information from different data sets. The first
operation is joining: enriching an observation from one table with information from another table.
The second operation is appending or stacking: adding the observations of one table to those of
another table. When you combine data, you have the option to create a new physical table or a
virtual table by creating a view. The advantage of a view is that it doesn’t consume more disk
space.

JOINING TABLES
Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. To join tables, you use variables that represent
the same object in both tables, such as a date or a country name. These common fields are
known as keys. When these keys also uniquely define the records in the table, they are called
primary keys.
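
A minimal sketch of a join in base R, using two hypothetical tables that share the key column country:

clients     <- data.frame(country = c("NL", "BE"), clients = c(5, 3))
populations <- data.frame(country = c("NL", "BE"), population = c(17, 11))
merge(clients, populations, by = "country")   # each observation is enriched with its population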

APPENDING TABLES
Appending or stacking tables is effectively adding observations from one table to another table.
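
A minimal sketch of appending in base R, assuming two hypothetical tables with identical columns:

jan <- data.frame(month = "Jan", sales = 100)
feb <- data.frame(month = "Feb", sales = 120)
rbind(jan, feb)   # stacking adds rows (observations), not columns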

Step 4: Exploratory data analysis

During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain
an understanding of your data and the interactions between variables.
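
A few standard exploratory plots in R, sketched here on the built-in mtcars data set:

hist(mtcars$mpg)              # distribution of a single variable
plot(mtcars$wt, mtcars$mpg)   # relationship between two variables
pairs(mtcars[, 1:4])          # pairwise scatterplots across several variables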
Step 5: Build the models

With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions. Building a model is an iterative process. The way you
build your model depends on whether you go with classic statistics or the somewhat more
recent machine learning.

Either way, most models consist of the following main steps:


1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison.
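
A minimal sketch of these three steps in R, using linear regression on the built-in mtcars data (the choice of technique and variables is only illustrative):

# 1. selection: linear regression, with mpg explained by wt and hp
# 2. execution: fit the model
model <- lm(mpg ~ wt + hp, data = mtcars)
# 3. diagnosis and comparison: inspect coefficients, fit, and residuals
summary(model)
plot(model)   # standard diagnostic plots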

Step 6: Presenting findings and building applications on top of them

After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. The last stage of the data science process is where your soft
skills will be most useful, and yes, they’re extremely important.

Getting Data In and Out of R


There are a few principal functions for reading data into R.
● read.table, read.csv, for reading tabular data
● readLines, for reading lines of a text file
● source, for reading in R code files (inverse of dump)
● dget, for reading in R code files (inverse of dput)
● load, for reading in saved workspaces
● unserialize, for reading single R objects in binary form
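
A short sketch of the most common of these functions; the file names are hypothetical:

dat   <- read.csv("data.csv")      # tabular data from a CSV file
lines <- readLines("notes.txt")    # raw lines from a text file
load("workspace.RData")            # restore objects previously saved with save()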

There are analogous functions for writing data to files:


● write.table, for writing tabular data to text files (e.g., CSV) or connections
● writeLines, for writing character data line-by-line to a file or connection
● dput, for outputting a textual representation of an R object
● save, for saving an arbitrary number of R objects in binary format (possibly
compressed) to a file.
● serialize, for converting an R object into a binary format for outputting to a
connection (or file).
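
The corresponding write-side sketch, again with hypothetical file names:

write.table(dat, file = "out.csv", sep = ",", row.names = FALSE)   # tabular data as CSV
writeLines(c("first line", "second line"), "out.txt")              # character data, line by line
save(dat, file = "dat.RData")                                      # binary (possibly compressed) format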
Using the readr Package
The readr package was developed by Hadley Wickham to deal with large flat files
quickly. The package provides replacements for functions like read.table() and
read.csv(). The analogous functions in readr are read_table() and read_csv().

For the most part, you can use read_table() and read_csv() pretty much anywhere you
might use read.table() and read.csv(). In addition, if there are non-fatal problems that
occur while reading in the data, you will get a warning. The read_csv function will also
read compressed files automatically. There is no need to decompress the file first or use
the gzfile connection function.
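
A minimal readr sketch; the file names are hypothetical, and read_csv() would also accept a gzip-compressed file such as data.csv.gz directly:

library(readr)
dat  <- read_csv("data.csv")     # drop-in replacement for read.csv()
dat2 <- read_table("data.txt")   # drop-in replacement for read.table()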

Interfaces to the Outside World

Data is read into R using connection interfaces. Connections can be made to files (most
common) or to other, more exotic things.
● file, opens a connection to a file
● gzfile, opens a connection to a file compressed with gzip
● bzfile, opens a connection to a file compressed with bzip2
● url, opens a connection to a webpage

In general, connections are powerful tools that let you navigate files or other external
objects. Connections allow R functions to talk to all these different external objects
without you having to write custom code for each object.

File Connections

Connections to text files can be created with the file() function.
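
A minimal sketch of the usual open-read-close pattern, with a hypothetical file name:

con <- file("notes.csv", "r")   # open a read-only connection
dat <- read.csv(con)            # read from the connection
close(con)                      # always close the connection when done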

Reading Lines of a Text File

Text files can be read line by line using the readLines() function. This function is useful
for reading text files that may be unstructured or contain non-standard data.
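
A quick sketch of reading the first few lines of a hypothetical text file:

con <- file("words.txt", "r")
x   <- readLines(con, 10)   # read the first 10 lines
close(con)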
Reading From a URL Connection

The readLines() function can be useful for reading in lines of webpages. Since web
pages are basically text files that are stored on a remote server, there is conceptually
not much difference between a web page and a local text file.
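
A sketch of the same idea against a web page (the URL is only an example):

con <- url("https://www.example.com", "r")
x   <- readLines(con)   # the raw HTML of the page, line by line
close(con)
head(x)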
