DS 1
Unit 1
Prepared By
Dr.P.Sasikumar
Associate Professor, AIML Dept.
Introduction: What is Data Science?
-In Academia
-In Industry
Basic Terminologies
• Data
• It can be simulation-generated, collected, or retrieved.
Similarity Measures
Data Structures
Algorithms
• Data: raw facts with no meaning attached.
• Information: what is learned from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: experience and good judgment; the ability to think and foresee.
• Validity: ways to confirm truth.
The DIKW Pyramid
• Cross-sectional data: observations taken at a single point in time, with no time dimension.
• Temporal data: observations ordered in time (time series).
• Spatial data: considers location, e.g. coordinate determination on touchscreen phones.
• Temporal cum spatial data (GIS): considers change in location-based data with the passage of time, for example population density.
Scales of Measurement
Introduction: What is Data Science?
• Sure, there is a difference between industry and academia. But does it really have to be that way?
Why do many courses in school have to be so intrinsically out of touch with reality?
• Even so, the gap doesn’t simply represent a difference between industry statistics and academic
statistics.
• The general experience of data scientists is that, at their job, they have access to a larger body
of knowledge and methodology, as well as a process, which we now define as the data science
process, that has foundations in both statistics and computer science.
Around all the hype, in other words, there is a ring of truth: this is something new.
Why Now?
• We have massive amounts of data about many aspects of our lives. What people
might not know is that the “datafication” of our offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend recommendations on Facebook,
film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and assessments coming out of
places like Knewton and Khan Academy.
• In government, this means policies based on data.
Datafication
• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an
article called “The Rise of Big Data”. In it they discuss the concept of datafication, which they define as
a process of “taking all aspects of life and turning them into data.”
• They follow up their definition in the article with a line that speaks volumes about their perspective:
“Once we datafy things, we can transform their purpose and turn the information into new forms of value.”
Datafication
Examples:
• We quantify friendships with “likes”.
• Twitter (X) datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be datafied.
• When we browse the Web, we are unintentionally datafied through cookies.
• When we walk around in a store, or even on the street, we are being datafied via sensors,
cameras, or Google glasses.
• Datafication also happens when we take part in a social media experiment.
• DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook respectively, coined the term “data scientist” in 2008.
• Wikipedia finally gained an entry on data science in 2012.
Data Science Jobs
• For three years running, data science has been dubbed “the best job in America.” According to Stack
Overflow, it is one of the highest-paying jobs in the software sector.
• The GDPR has increased companies' reliance on data scientists, due to the need for real-time analytics
and for storing data responsibly.
• There are 465 job openings in New York City alone for data scientists.
• LinkedIn recently picked data scientist as its most promising career of 2019. One of the reasons it got the
top spot was that the average salary for people in the role is $130,000.
• The January report from Indeed, one of the top job sites, showed a 29% increase in demand for data
scientists year over year and a 344% increase since 2013, a dramatic upswing. But while demand, in
the form of job postings, continues to rise sharply, searches by job seekers skilled in data science grew at
a slower pace (14%), suggesting a gap between supply and demand.
The growth in data scientist job postings on Indeed, from December 2016 to December 2018
What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to biology, who works with large
amounts of data, and must deal with computational problems posed by the structure, size, messiness, and the complexity
and nature of the data, while simultaneously solving a real-world problem.
• In Industry
More generally, a data scientist is someone who knows how to:
• Design experiments.
• Carry out the process of collecting, cleaning, and munging data. These skills are also necessary for
understanding biases in the data and for debugging logging output from code.
• Perform exploratory data analysis, which combines visualization and data sense.
• Find patterns and build models and algorithms.
• Use analyses for decision making.
What Is a Data Scientist
Data engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by data scientists.
A data analyst is someone who merely curates meaningful insights from data.
A data scientist is a professional with the capabilities to gather large amounts of data to analyze and synthesize
the information into actionable plans for companies and other organizations.
Statistical Inference
• Statistical inference is the process of using a sample to infer the properties of a population. Statistical
procedures use sample data to estimate the characteristics of the whole population from which the sample was
drawn.
• When studying a phenomenon, such as the effects of a new medication or public opinion, populations are
usually too large to measure fully.
• Consequently, researchers must use a manageable subset of that population to learn about it.
• By using procedures that can make statistical inferences, you can estimate the properties and processes of a
population.
• More specifically, sample statistics can estimate population parameters.
How to Make Statistical Inferences
• The process of making a statistical inference requires you to do the following:
• Draw a sample that adequately represents the population.
• Measure your variables of interest.
• Use appropriate statistical methodology to generalize your sample results to the population while accounting for
sampling error.
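The three steps above can be sketched in R. This is only an illustration under stated assumptions: the population is simulated (100,000 heights with an assumed mean of 170 and standard deviation of 10), and a t-based 95% confidence interval is one possible way to account for sampling error.

```r
# Assumption: a simulated population of 100,000 heights (cm)
set.seed(42)
population <- rnorm(100000, mean = 170, sd = 10)

# Step 1: draw a simple random sample from the population
sample_heights <- sample(population, size = 200)

# Step 2: measure the variable of interest (the sample mean is a statistic)
sample_mean <- mean(sample_heights)

# Step 3: generalize to the population while accounting for sampling error,
# here with a 95% confidence interval for the population mean
ci <- t.test(sample_heights)$conf.int

sample_mean
ci
mean(population)  # the population parameter, usually unknown in practice
```

In real studies the population mean cannot be computed directly; it appears here only so the estimate can be compared with it.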
Example Statistical Inference
• A real flu vaccine study provides an example of making a statistical inference.
[Table: flu infection rates for the vaccinated and unvaccinated groups; effect 7.4%]
Study Findings
• From the table above, 10.8% of the unvaccinated got the flu, while only 3.4% of the vaccinated caught it. The
apparent effect of the vaccine is 10.8% – 3.4% = 7.4 percentage points.
Population and Sample
• In statistics, as in quantitative methodology generally, data are collected and selected from a statistical
population with the help of defined procedures. There are two different types of data sets,
namely population and sample.
Population
• A population includes all the elements of the data set. Measurable characteristics of the population, such as the
mean and standard deviation, are known as parameters.
• For example, all people living in India make up the population of India.
Types
• Finite Population
A finite population, also known as a countable population, is one whose units can be counted. In
other words, it is a population consisting of a finite number of individuals or objects. For statistical analysis, a
finite population is more advantageous than an infinite population. Examples of finite populations are the employees of
a company and the potential consumers in a market.
• Infinite Population
An infinite population, also known as an uncountable population, is one in which counting the units
is not possible. An example is the number of germs in a patient’s body, which is
uncountable.
• Existent Population
An existent population is a population of concrete individuals. In other words, a population whose
units are available in solid form is known as an existent population. Examples are books, students, etc.
• Hypothetical Population
A population whose units are not available in solid form is known as a hypothetical population. A
population consists of sets of observations, objects, etc. that all have something in common. In some situations, the
populations are only hypothetical.
Examples are the outcome of rolling a die and the outcome of tossing a coin.
Comparison | Population | Sample
Meaning | Collection of all the units or elements that possess common characteristics | A subgroup of the members of the population
Includes | Each and every element of a group | Only a handful of units of the population
• A sample includes one or more observations drawn from the population; a measurable characteristic of a
sample is called a statistic.
• Sampling is the process of selecting the sample from the population.
• For example, some of the people living in India form a sample of the population.
Probability Sampling
• In probability sampling, the population units cannot be selected at the discretion (option) of the researcher.
• Instead, selection follows defined procedures which ensure that every unit of the population has
a fixed, known probability of being included in the sample.
• Such a method is also called random sampling.
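A minimal sketch of simple random sampling, one form of probability sampling, using base R's sample(). The population of 1,000 numbered units and the sample size of 50 are assumptions made only for illustration.

```r
set.seed(1)

# Assumption: a hypothetical population of 1,000 units, identified by number
population_ids <- 1:1000
n <- 50

# Simple random sample without replacement: the researcher does not choose units
srs <- sample(population_ids, size = n)

# Every unit has the same fixed probability of being included
inclusion_prob <- n / length(population_ids)

head(srs)
inclusion_prob
```

Because selection is left to the random number generator, the inclusion probability is known in advance, which is what allows sample results to be generalized to the population.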
Non Probability Sampling
• In non-probability sampling, the population units can be selected at the discretion of the researcher.
• Such samples rely on human judgment for selecting units and have no theoretical basis for estimating the
characteristics of the population.
• Some of the techniques used for non-probability sampling are
Quota sampling
Judgment sampling
Purposive sampling
Statistical Modelling
• A statistical model is a type of mathematical model that comprises the assumptions made to describe the
data generation process.
• The mathematical expressions are general enough that they have to include parameters, but the values of these
parameters are not yet known.
• In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data.
• So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship, you’d write
down y = β0 + β1x.
• You don’t know what β0 and β1 are in terms of actual numbers yet, so they’re the parameters.
• Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things
affect other things or what happens over time.
• This gives them an abstract picture of the relationships before choosing equations to express them.
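As a sketch of the linear model y = β0 + β1x, the parameters can be estimated from data with lm(). The data here are simulated, and the "true" values β0 = 2 and β1 = 0.5 are assumptions chosen only so that the estimates can be compared with them.

```r
set.seed(123)

# Assumption: simulated data with beta0 = 2, beta1 = 0.5, plus random noise
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Fit y = beta0 + beta1 * x; lm() estimates the unknown parameters
fit <- lm(y ~ x)

coef(fit)  # "(Intercept)" estimates beta0, "x" estimates beta1
```

The estimates will be close to, but not exactly equal to, the assumed values, because they are computed from noisy data.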
Probability Distributions
What Is Probability?
• Probability denotes the possibility of something happening.
• It is a mathematical concept that quantifies how likely events are to occur.
• Probability values lie between 0 and 1.
• The definition of probability is the degree to which something is likely to occur.
• This fundamental theory of probability is also applied to probability distributions.
What Is a Probability Distribution?
• A probability distribution is a statistical function that describes all the possible values and probabilities
for a random variable within a given range.
• This range is bounded by the minimum and maximum possible values, but where a possible value is
plotted on the probability distribution is determined by a number of factors.
Probability Distribution
A probability distribution (function) is a list of the probabilities of the values (simple
outcomes) of a random variable.
Ex: Number of heads in two tosses of a coin
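The coin example can be written out in R. This sketch assumes a fair coin (probability of heads 0.5) and uses the binomial distribution function dbinom() to list the probability of each possible number of heads.

```r
# Possible values of the random variable: number of heads in two tosses
heads <- 0:2

# Assumption: a fair coin, so P(heads) = 0.5 on each toss
prob <- dbinom(heads, size = 2, prob = 0.5)

distribution <- data.frame(heads = heads, probability = prob)
distribution               # probabilities 0.25, 0.50, 0.25
sum(distribution$probability)  # the probabilities add up to 1
```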
Fitting a model to data
• Many data mining procedures fall within this general framework.
• We illustrate with some of the most common procedures, all of which are based on linear models.
• The crux of the fundamental concept of this chapter is fitting a model to data by finding “optimal” model
parameters.
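One way to make "finding optimal parameters" concrete is to define an error measure and let a general-purpose optimizer search for the parameter values that minimize it. The sketch below assumes a linear model, simulated data, and the sum of squared errors as the objective; these are illustrative choices, not the only possible ones.

```r
set.seed(7)

# Assumption: simulated data from y = 1 + 3x plus noise
x <- runif(100, 0, 10)
y <- 1 + 3 * x + rnorm(100)

# Objective: sum of squared errors for candidate parameters beta = (beta0, beta1)
sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

# Search for the parameter values that minimize the objective
result <- optim(par = c(0, 0), fn = sse, method = "BFGS")

result$par       # parameters found by numerical optimization
coef(lm(y ~ x))  # least-squares solution from lm(), for comparison
```

Both approaches should give nearly the same values, since lm() minimizes the same sum of squared errors analytically.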
Classification via mathematical function
Overfitting
• Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required
data points, present in the given dataset.
• Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model.
• The chance of overfitting increases the more we train our model.
• Example: the concept of overfitting can be understood from a graph of linear regression output in which the
fitted line tries to pass through every data point.
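A small sketch of overfitting, assuming simulated data (a sine-shaped signal plus noise) and polynomial regression as the model family. The degrees 3 and 12 and the helper rmse() are hypothetical choices for illustration.

```r
set.seed(10)

# Assumption: signal sin(2*pi*x) plus random noise
n <- 40
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

train_df <- data.frame(x = x[1:20],  y = y[1:20])
test_df  <- data.frame(x = x[21:40], y = y[21:40])

# Hypothetical helper: root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

simple_fit   <- lm(y ~ poly(x, 3),  data = train_df)  # modest flexibility
flexible_fit <- lm(y ~ poly(x, 12), data = train_df)  # chases individual points

c(train = rmse(simple_fit, train_df),   test = rmse(simple_fit, test_df))
c(train = rmse(flexible_fit, train_df), test = rmse(flexible_fit, test_df))
```

The more flexible model always fits the training data at least as closely; overfitting shows up when its error on the unseen test data is noticeably worse than its training error.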
How to avoid the Overfitting in Model
• Both overfitting and underfitting degrade the performance of a machine learning model. Overfitting is the
more common cause, and there are several ways to reduce its occurrence in our
model:
• Cross-validation
• Training with more data
• Removing features
• Early stopping of training
• Regularization
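Of the techniques listed, cross-validation is easy to sketch in base R. The example below assumes simulated data, 5 folds, and polynomial degree as the model complexity being tuned; the function cv_rmse() is a hypothetical helper written for this illustration.

```r
set.seed(3)

# Assumption: simulated data, sine-shaped signal plus noise
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Randomly assign each row to one of k folds
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Hypothetical helper: average held-out RMSE for a given polynomial degree
cv_rmse <- function(degree) {
  errors <- sapply(1:k, function(i) {
    fit <- lm(y ~ poly(x, degree), data = dat[fold != i, ])  # train on k-1 folds
    held_out <- dat[fold == i, ]                             # test on the rest
    sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
  })
  mean(errors)
}

degrees <- 1:8
scores <- sapply(degrees, cv_rmse)
best_degree <- degrees[which.min(scores)]

scores
best_degree
```

Choosing the complexity with the lowest held-out error, rather than the lowest training error, is what guards against overfitting.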
Basic terms for overfitting
• Signal: the true underlying pattern of the data that helps the machine learning model to learn from the
data.
• Noise: unnecessary and irrelevant data that reduces the performance of the model.
• Bias: a prediction error introduced in the model by oversimplifying the machine learning
algorithm; equivalently, the difference between the predicted values and the actual values.
• Variance: if the machine learning model performs well on the training dataset but does not perform well on
the test dataset, the model has high variance.
Basics of R
Introduction
• R is a popular programming language used for statistical computing.
• Its most common uses are analyzing and visualizing data, graphical representation, and reporting.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is
currently developed by the R Core Team.
• R is freely available under the GNU General Public License, and precompiled binary versions are provided for
various operating systems such as Linux, Windows, and Mac.
• The language was named R partly after the first letter of the first names of its two authors (Robert
Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
• R allows integration with procedures written in C, C++, .NET, Python, or FORTRAN for
efficiency.
Why Use R?
• It is a great resource for data analysis, data visualization, data science, and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering, and data reduction)
• It is easy to draw graphs in R, such as pie charts, histograms, box plots, and scatter plots
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different problems
Features of R
• As stated earlier, R is a programming language and software environment for statistical analysis, graphics
representation and reporting.
R - Environment Setup
1. Installation of R
In Windows:
Step – 1: Go to the CRAN (Comprehensive R Archive Network) R project website.
Step – 3: Click on the base subdirectory link or the install R for the first time link.
Step – 4: Click Download R X.X.X for Windows (X.X.X stands for the latest version of R).
Step – 5: Run the .exe file and follow the installation instructions.
5.c. Select the components you wish to install (it is recommended to install all the components). Click Next.
5.d. Enter or browse to the folder where you wish to install R, then confirm by clicking Next.
5.e. Select additional tasks, such as creating desktop shortcuts, then click Next.
To install RStudio, begin by going to the RStudio download page and clicking the download button for RStudio Desktop.
Step – 2: Click on the link for the Windows version of RStudio and save the .exe file.
3.b. Enter or browse to the installation folder and click Next to proceed.
3.c. Select the folder for the Start menu shortcut, or click on do not create shortcuts, and then click Next.
Syntax
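1. To output text in R, use single or double quotes:
• Example
"Hello World!"
2. To output numbers, just type the number (without quotes):
• Example
5
10
25
3. To do simple calculations, add numbers together:
• Example
5 + 5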
R Print Output
1. Print: Unlike many other programming languages, you can output results in R without using a print function:
Example
"Hello World!"
• However, R does have a print() function available if you want to use it. This might be useful if you are familiar with
other programming languages, such as Python, which often use the print() function to output results.
Example
print("Hello World!")
• And there are times you must use the print() function to output results, for example when working with for loops.
Example
for (x in 1:10) {
  print(x)
}
• It is up to you whether you want to use the print() function to output results. However, when your code is inside an R
expression (e.g. inside curly braces {} like in the example above), use the print() function to output the result.
Comments
• Comments can be used to explain R code and to make it more readable. They can also be used to prevent execution
when testing alternative code.
• Comments start with a #. When executing code, R will ignore anything that starts with #.
• Example
# This is a comment
"Hello World!"
• Comments do not have to be text that explains the code; they can also be used to prevent R from executing code:
• Example
# "Good morning!"
"Good night!"
• Multiline comments: Unlike other programming languages, such as Java, there is no syntax in R for multiline comments. However, we can
just insert a # for each line to create multiline comments:
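A minimal illustration of a multiline comment built from several # lines. The variable name greeting is hypothetical.

```r
# This is a comment
# written in
# more than just one line
greeting <- "Hello World!"  # hypothetical variable for illustration
greeting
```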
Creating Variables in R
• Variables are containers for storing data values.
• R does not have a command for declaring a variable.
• A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. To
output (or print) the variable value, just type the variable name:
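A minimal example of creating and printing variables, using the names and values mentioned in the text (name, age, "John", and 40):

```r
# Create variables by assigning values with <-
name <- "John"
age <- 40

# Typing the variable name outputs its value
name
age
```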
• For example, in name <- "John" and age <- 40, name and age are variables, while "John" and 40 are values.
• In other programming languages, it is common to use = as an assignment operator.
• In R, we can use both = and <- as assignment operators.
• However, <- is preferred in most cases because the = operator can be forbidden in some contexts in R.
Print / Output Variables
• Compared to many other programming languages, you do not have to use a function to print/output variables in
R. You can just type the name of the variable:
• However, R does have a print() function available if you want to use it. This might be useful if you are familiar
with other programming languages, such as Python, which often use a print() function to output variables.
• And there are times you must use the print() function to output code, for example when working with for loops
(which you will learn more about in a later chapter):
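The three cases described above can be sketched as follows; the value "John Doe" is a hypothetical example.

```r
name <- "John Doe"  # hypothetical value for illustration

name         # typing the name auto-prints the value at the top level
print(name)  # print() produces the same output

# Inside a for loop, values are not auto-printed, so print() is needed
for (x in 1:3) {
  print(x)
}
```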
Concatenate Elements
• You can also concatenate, or join, two or more elements by using the paste() function.
• To combine both text and a variable, separate them with a comma (,) inside paste():
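A short sketch of paste(); the variable names and strings are hypothetical. By default paste() separates the joined elements with a single space.

```r
text <- "awesome"  # hypothetical variable

# Combine text and a variable, separated by a comma inside paste()
sentence <- paste("R is", text)
sentence

# Join two variables
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
```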
Multiple Variables
• R allows you to assign the same value to multiple variables in one line:
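A minimal example of chained assignment; the variable names and the value "Orange" are hypothetical.

```r
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"

# Print the variable values
var1
var2
var3
```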
Variable Names
• A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules
for R variables are:
• A variable name must start with a letter or a period (.) and can be a combination of letters, digits, period (.)
and underscore (_). If it starts with a period (.), it cannot be followed by a digit.
• A variable name cannot start with a number or underscore (_).
• Variable names are case-sensitive.
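One way to check the naming rules is to compare each candidate with the result of base R's make.names(), which rewrites a string into a syntactically valid name. The candidate names below are hypothetical, and this check also treats reserved words as invalid.

```r
# Hypothetical candidate names: the first four follow the rules, the rest do not
candidates <- c("myvar", "my_var", ".myvar", "myVar2",
                "2myvar", "_my_var", ".2var", "my var")

# A name is valid if make.names() leaves it unchanged
is_valid <- make.names(candidates) == candidates
data.frame(name = candidates, valid = is_valid)

# Names are case-sensitive: these are two different variables
myvar <- 1
MyVar <- 2
```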
R - Data Types
• Generally, while doing programming in any programming language, you need to use various variables to store
various information.
• Variables are nothing but reserved memory locations to store values.
• This means that, when you create a variable you reserve some space in memory.
• You may like to store information of various data types like character, wide character, integer, floating point,
double floating point, Boolean etc. Based on the data type of a variable, the operating system allocates memory and
decides what can be stored in the reserved memory.
• In contrast to other programming languages like C and Java, in R the variables are not declared as some data type.
• The variables are assigned R objects, and the data type of the R object becomes the data type of the variable.
Data Types in R are:
• Each R data type requires a different amount of memory and has some specific operations which can be
performed on it.
• numeric – (3, 6.7, 121)
• integer – (2L, 42L; where ‘L’ declares this as an integer)
• logical – (TRUE, FALSE)
• complex – (7 + 5i; where ‘i’ is the imaginary unit)
• character – (“a”, “B”, “c is third”, “69”)
• raw – (as.raw(55) converts a value to a raw byte; raw(n) creates a raw vector of length n)
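The types listed above can be checked with class(). This sketch uses example values taken from the list; the name values is hypothetical.

```r
# One example value per data type
values <- list(
  numeric   = 6.7,
  integer   = 42L,
  logical   = TRUE,
  complex   = 7 + 5i,
  character = "c is third",
  raw       = as.raw(55)
)

# class() reports the data type of each value
types <- sapply(values, class)
types

# raw(n) creates a raw vector of length n, filled with zero bytes
raw(3)
```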
Data types and the values that each data type can take
Data Types
Data type | Example | Description
Logical | TRUE, FALSE | A special data type for data with only two possible values, which can be construed as true/false.
Numeric | 12, 32, 112, 5432 | Decimal values are called numeric in R; this is the default computational data type.
Integer | 3L, 66L, 2346L | Here, L tells R to store the value as an integer.
Complex | z = 1+2i, t = 7+3i | A complex value in R has a real part and an imaginary part, written with the imaginary unit i.
Character | 'a', "good", "TRUE", '35.4' | In R programming, a character is used to represent string values. We convert objects into character values with the help of the as.character() function.
# logical
x <- TRUE
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)