Unit 2 - DS - 1st Year
1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project. In
every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the
data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from different
data sources, and transform it.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modeling.
5. The fifth step is model building (often referred to as "data modeling"). It is now that
you attempt to gain the insights or make the predictions stated in your project charter.
6. The last step of the data science model is presenting your results and automating the
analysis, if needed.
Step 1: Setting the research goal
A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? Why does management place such a value on your research? Is it
part of a bigger strategic picture? Answering these three questions (what, why, and how) is the
goal of the first phase.
The outcome should be a clear research goal, a good understanding of the context, well-defined
deliverables, and a plan of action with a timetable.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
Step 2: Retrieving data
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs polishing to be of any use to you.
Step 3: Data preparation
Your task now is to sanitize the raw data and prepare it for use in the modeling and reporting phases.
CLEANSING PROCESS
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so that your data becomes a true and consistent representation of the processes it
originates from.
REDUNDANT WHITESPACE
Whitespace characters tend to be hard to detect but cause errors just as other redundant
characters would. Luckily, once you know to watch out for them, redundant whitespace is easy
enough to fix in most programming languages: they all provide string functions that remove
leading and trailing whitespace.
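For example, a minimal sketch in base R (the customers data frame is a made-up example); trimws() strips leading and trailing whitespace, and gsub() collapses repeated internal whitespace:

# Made-up example data with redundant whitespace in the name column
customers <- data.frame(name = c("  Alice", "Bob  ", "  Carol  "),
                        stringsAsFactors = FALSE)

customers$name <- trimws(customers$name)              # remove leading/trailing whitespace
customers$name <- gsub("\\s+", " ", customers$name)   # collapse runs of internal whitespace

customers$name
# [1] "Alice" "Bob"   "Carol"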
OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations.
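One common, simple check is the 1.5 x IQR rule that boxplots use. A small sketch in base R (the vector x is made-up data; the last value is an obvious outlier):

# Made-up measurements; 95 clearly follows a different process than the rest
x <- c(10, 12, 11, 13, 12, 11, 10, 95)

q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]   # values flagged as outliers: 95
boxplot(x)                 # the same point shows up beyond the whiskers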
Data Transformation:
JOINING TABLES
Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. To join tables, you use variables that represent
the same object in both tables, such as a date, a country name. These common fields are
known as keys. When these keys also uniquely define the records in the table they are called
primary keys.
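A minimal sketch in base R using merge(); both data frames are made-up examples, and the country column acts as the key:

# Two made-up tables that share the key column "country"
sales   <- data.frame(country = c("NL", "BE", "DE"),
                      revenue = c(100, 80, 250))
regions <- data.frame(country = c("NL", "BE", "DE"),
                      region  = c("Benelux", "Benelux", "DACH"))

# merge() combines the rows whose key values match in both tables
merge(sales, regions, by = "country")
#   country revenue  region
# 1      BE      80 Benelux
# 2      DE     250    DACH
# 3      NL     100 Benelux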
APPENDING TABLES
Appending or stacking tables is effectively adding the observations from one table to another table.
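A minimal sketch in base R using rbind(); the two monthly tables are made-up examples with identical columns:

# Two made-up tables with the same columns
january  <- data.frame(month = "jan", revenue = c(10, 12))
february <- data.frame(month = "feb", revenue = c(11, 14))

# rbind() stacks the observations of one table under the other
rbind(january, february)
#   month revenue
# 1   jan      10
# 2   jan      12
# 3   feb      11
# 4   feb      14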
Step 4: Exploratory data analysis
During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain
an understanding of your data and the interactions between variables.
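A small sketch of this kind of exploration in base R, using the built-in mtcars data set as stand-in data:

summary(mtcars$mpg)                    # descriptive statistics for one variable
hist(mtcars$mpg)                       # distribution of fuel efficiency
plot(mtcars$wt, mtcars$mpg)            # relationship between weight and mileage
pairs(mtcars[, c("mpg", "wt", "hp")])  # pairwise scatterplots of several variables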
Step 5: Build the models
With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions. Building a model is an iterative process. The way you
build your model depends on whether you go with classic statistics or the somewhat more
recent machine learning.
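As a minimal sketch of the classic-statistics route (using the built-in mtcars data set as stand-in data), a linear regression can be fit and inspected like this:

# Predict fuel efficiency from weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)

summary(model)                                           # coefficients and fit quality
predict(model, newdata = data.frame(wt = 3, hp = 120))   # prediction for a new observation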
Step 6: Presenting results and automating the analysis
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. The last stage of the data science process is where your soft
skills will be most useful, and yes, they’re extremely important.
Reading Data in R
For the most part, you can use readr's read_table() and read_csv() pretty much anywhere you
might use the base functions read.table() and read.csv(). In addition, if non-fatal problems
occur while reading in the data, you will get a warning. The read_csv() function will also
read compressed files automatically; there is no need to decompress the file first or to use
the gzfile connection function.
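A short sketch; the file names below are placeholders, not files shipped with this material:

library(readr)

dat    <- read_csv("survey.csv")     # parses the file and warns about non-fatal problems
dat_gz <- read_csv("survey.csv.gz")  # compressed files are read directly

# The base R equivalent would be:
# dat <- read.csv("survey.csv")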
Data is read using connection interfaces. Connections can be made to files (most
common) or to other, more exotic things.
● file, opens a connection to a file
● gzfile, opens a connection to a file compressed with gzip
● bzfile, opens a connection to a file compressed with bzip2
● url, opens a connection to a webpage
In general, connections are powerful tools that let you navigate files or other external
objects. Connections allow R functions to talk to all these different external objects
without you having to write custom code for each object.
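A short sketch of opening and closing connections explicitly; the file names are placeholders:

con <- file("notes.txt", "r")            # connection to a plain text file
close(con)

con <- gzfile("notes.txt.gz", "r")       # connection to a gzip-compressed file
close(con)

con <- url("https://www.example.com/")   # connection to a webpage
close(con)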
File Connections
Text files can be read line by line using the readLines() function. This function is useful
for reading text files that may be unstructured or contain non-standard data.
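For example (notes.txt is a placeholder file name):

con <- file("notes.txt", "r")
lines <- readLines(con)     # character vector, one element per line of the file
close(con)

# readLines() also accepts a file name directly and manages the connection itself
lines <- readLines("notes.txt")
head(lines)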
Reading From a URL Connection
The readLines() function can be useful for reading in lines of webpages. Since web
pages are basically text files that are stored on a remote server, there is conceptually
not much difference between a web page and a local text file.
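For example, reading the first lines of a page (example.com is used purely for illustration):

con <- url("https://www.example.com/")
page <- readLines(con)   # each element is one line of the page's HTML
close(con)

head(page)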