Unit 2 - DS - 1st Year
1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project. In
every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the
data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from different
data sources, and transform it.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modeling.
5. The fifth step is model building (often referred to as "data modeling"). It is now that
you attempt to gain the insights or make the predictions stated in your project charter.
6. The last step of the data science model is presenting your results and automating the
analysis, if needed.
Step 1: Setting the research goal
A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? Why does management place such a value on your research? Is it
part of a bigger strategic picture? Answering these three questions (what, why, and how) is the
goal of the first phase.
The outcome should be a clear research goal, a good understanding of the context, well-defined
deliverables, and a plan of action with a timetable.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
Step 2: Retrieving data
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs polishing to be of any use to you.
Step 3: Data preparation
Your task now is to sanitize the raw data and prepare it for use in the modeling and reporting phases.
CLEANSING PROCESS
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so that your data becomes a true and consistent representation of the processes it
originates from.
REDUNDANT WHITESPACE
Whitespace characters tend to be hard to detect but cause errors just as other redundant
characters would. Luckily, once you know to watch out for them, redundant whitespace is easy
enough to fix in most programming languages: they all provide string functions that remove
leading and trailing whitespace.
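For example, a minimal sketch in base R (the customers data frame is a made-up example); trimws() strips leading and trailing whitespace, and gsub() collapses repeated internal whitespace:

# Made-up example data with redundant whitespace in the name column
customers <- data.frame(name = c("  Alice", "Bob  ", "  Carol  "),
                        stringsAsFactors = FALSE)

customers$name <- trimws(customers$name)              # remove leading/trailing whitespace
customers$name <- gsub("\\s+", " ", customers$name)   # collapse runs of internal whitespace

customers$name
# [1] "Alice" "Bob"   "Carol"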
OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations.
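One common, simple check is the 1.5 x IQR rule that boxplots use. A small sketch in base R (the vector x is made-up data; the last value is an obvious outlier):

# Made-up measurements; 95 clearly follows a different process than the rest
x <- c(10, 12, 11, 13, 12, 11, 10, 95)

q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]   # values flagged as outliers: 95
boxplot(x)                 # the same point shows up beyond the whiskers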
Data Transformation:
JOINING TABLES
Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. To join tables, you use variables that represent
the same object in both tables, such as a date, a country name. These common fields are
known as keys. When these keys also uniquely define the records in the table they are called
primary keys.
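A minimal sketch in base R using merge(); both data frames are made-up examples, and the country column acts as the key:

# Two made-up tables that share the key column "country"
sales   <- data.frame(country = c("NL", "BE", "DE"),
                      revenue = c(100, 80, 250))
regions <- data.frame(country = c("NL", "BE", "DE"),
                      region  = c("Benelux", "Benelux", "DACH"))

# merge() combines the rows whose key values match in both tables
merge(sales, regions, by = "country")
#   country revenue  region
# 1      BE      80 Benelux
# 2      DE     250    DACH
# 3      NL     100 Benelux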
APPENDING TABLES
Appending or stacking tables is effectively adding the observations from one table to another table.
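A minimal sketch in base R using rbind(); the two monthly tables are made-up examples with identical columns:

# Two made-up tables with the same columns
january  <- data.frame(month = "jan", revenue = c(10, 12))
february <- data.frame(month = "feb", revenue = c(11, 14))

# rbind() stacks the observations of one table under the other
rbind(january, february)
#   month revenue
# 1   jan      10
# 2   jan      12
# 3   feb      11
# 4   feb      14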
Step 4: Exploratory data analysis
During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain
an understanding of your data and the interactions between variables.
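A small sketch of this kind of exploration in base R, using the built-in mtcars data set as stand-in data:

summary(mtcars$mpg)                    # descriptive statistics for one variable
hist(mtcars$mpg)                       # distribution of fuel efficiency
plot(mtcars$wt, mtcars$mpg)            # relationship between weight and mileage
pairs(mtcars[, c("mpg", "wt", "hp")])  # pairwise scatterplots of several variables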
Step 5: Build the models
With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions. Building a model is an iterative process. The way you
build your model depends on whether you go with classic statistics or the somewhat more
recent machine learning.
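As a minimal sketch of the classic-statistics route (using the built-in mtcars data set as stand-in data), a linear regression can be fit and inspected like this:

# Predict fuel efficiency from weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)

summary(model)                                           # coefficients and fit quality
predict(model, newdata = data.frame(wt = 3, hp = 120))   # prediction for a new observation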
Step 6: Presenting results and automating the analysis
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. The last stage of the data science process is where your soft
skills will be most useful, and yes, they’re extremely important.
Reading Data in R
For the most part, you can use readr's read_table() and read_csv() pretty much anywhere you
might use the base functions read.table() and read.csv(). In addition, if non-fatal problems
occur while reading in the data, you will get a warning. The read_csv() function will also
read compressed files automatically; there is no need to decompress the file first or to use
the gzfile connection function.
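A short sketch; the file names below are placeholders, not files shipped with this material:

library(readr)

dat    <- read_csv("survey.csv")     # parses the file and warns about non-fatal problems
dat_gz <- read_csv("survey.csv.gz")  # compressed files are read directly

# The base R equivalent would be:
# dat <- read.csv("survey.csv")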
Data is read using connection interfaces. Connections can be made to files (most
common) or to other, more exotic things.
● file, opens a connection to a file
● gzfile, opens a connection to a file compressed with gzip
● bzfile, opens a connection to a file compressed with bzip2
● url, opens a connection to a webpage
In general, connections are powerful tools that let you navigate files or other external
objects. Connections allow R functions to talk to all these different external objects
without you having to write custom code for each object.
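A short sketch of opening and closing connections explicitly; the file names are placeholders:

con <- file("notes.txt", "r")            # connection to a plain text file
close(con)

con <- gzfile("notes.txt.gz", "r")       # connection to a gzip-compressed file
close(con)

con <- url("https://www.example.com/")   # connection to a webpage
close(con)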
File Connections
Text files can be read line by line using the readLines() function. This function is useful
for reading text files that may be unstructured or contain non-standard data.
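For example (notes.txt is a placeholder file name):

con <- file("notes.txt", "r")
lines <- readLines(con)     # character vector, one element per line of the file
close(con)

# readLines() also accepts a file name directly and manages the connection itself
lines <- readLines("notes.txt")
head(lines)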
Reading From a URL Connection
The readLines() function can be useful for reading in lines of webpages. Since web
pages are basically text files that are stored on a remote server, there is conceptually
not much difference between a web page and a local text file.
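For example, reading the first lines of a page (example.com is used purely for illustration):

con <- url("https://www.example.com/")
page <- readLines(con)   # each element is one line of the page's HTML
close(con)

head(page)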