DPIR_IA1
->
• Sources of data are:
• Various transactions done online (electricity bills, LIC premiums, online shopping, online selling)
• Data from Facebook, Instagram, WhatsApp, LinkedIn, Twitter
• Videos uploaded to the Internet and videos watched on various social media platforms
• Data generated through your mobile – in terms of the apps installed, apps used, and data generated through apps (gaming apps, pictures, e-commerce apps)
• IoT devices generating data.
• Approaches to collect data:
• Surveys/Polls – to gather data to answer specific questions (for example, a poll may be used to understand how a population of eligible voters will cast their vote in an upcoming election)
• Interviews – conducted over the phone, in person, or over the Internet (to elicit information on people's opinions, preferences, and behaviour)
• Experiments
• Case Studies (Uber, Amazon, Smart Toothbrush)
2. Attribute Construction:
->Attribute construction can help improve accuracy and aids a better understanding of the data set.
->For example, we may wish to add the attribute area based on the attributes height and width.
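A minimal sketch of this idea (assuming a pandas DataFrame with hypothetical height and width columns):

    import pandas as pd

    # Hypothetical data set with height and width measurements (in cm)
    df = pd.DataFrame({"height": [10, 20, 15], "width": [4, 5, 6]})

    # Attribute construction: derive a new attribute from existing ones
    df["area"] = df["height"] * df["width"]
    print(df)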
3. Aggregation:
-> Summary or aggregation operations are applied to the data.
->For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
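A small sketch of such aggregation (assuming a pandas DataFrame with hypothetical date and sales columns):

    import pandas as pd

    # Hypothetical daily sales data
    daily = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-03"]),
        "sales": [100, 250, 175],
    })

    # Aggregate daily figures into monthly and annual totals
    monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
    annual = daily.groupby(daily["date"].dt.year)["sales"].sum()
    print(monthly, annual, sep="\n")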
4. Normalisation:
->The attribute data are scaled so as to fall within a smaller range.
-> Say, a range between -1.0 and +1.0 (Mean Normalisation).
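A minimal sketch of mean normalisation, which scales values into roughly the -1.0 to +1.0 range (the values below are illustrative):

    # Mean normalisation: x' = (x - mean) / (max - min)
    values = [50.0, 60.0, 75.0, 90.0]
    mean = sum(values) / len(values)
    spread = max(values) - min(values)

    normalised = [(v - mean) / spread for v in values]
    print(normalised)  # values now fall within roughly -1.0 to +1.0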
5. Discretisation:
->The raw values of a numeric attribute are replaced by interval labels or conceptual labels.
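A minimal sketch of discretisation using pandas.cut (the ages, bin edges, and labels here are hypothetical):

    import pandas as pd

    ages = pd.Series([5, 17, 25, 43, 68])

    # Discretise continuous ages into interval labels
    groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=["child", "young", "middle-aged", "senior"])
    print(groups)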
->An attribute is said to be redundant if it can be derived from (is correlated with) another attribute or set of attributes.
->Example of positive correlation: as height increases, weight also increases.
->Example of negative correlation: time spent on mobile and performance in the exams.
->Redundancies can be detected by correlation analysis.
->There are tests that can be performed on different kinds of data.
->Chi-Squared Test: for measuring the correlation existing between nominal data.
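A minimal sketch of a chi-squared test on two nominal attributes (the contingency table below is made up, and scipy is one possible library choice):

    from scipy.stats import chi2_contingency

    # Hypothetical contingency table: rows = gender, columns = preferred reading
    observed = [[250, 200],
                [50, 1000]]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(chi2, p_value)  # a small p-value suggests the attributes are correlated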
• Tuple Duplication:
->Duplication at the tuple level, i.e., where there is more than one identical tuple for a given unique data entry, is called Tuple Duplication. These identical tuples have to be eliminated.
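A minimal sketch of eliminating duplicate tuples (assuming a pandas DataFrame; the columns are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"customer_id": [1, 1, 2], "city": ["Pune", "Pune", "Delhi"]})

    # Eliminate identical tuples, keeping the first occurrence
    deduplicated = df.drop_duplicates()
    print(deduplicated)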
• The discrete wavelet transform is a signal processing technique that, when applied to a data vector X, transforms it into a numerically different vector X' (the wavelet coefficients).
• Both vectors are of the same length.
• Its usefulness lies in the fact that the wavelet-transformed data can be truncated, storing only a small fraction of the strongest wavelet coefficients.
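A hedged sketch using the PyWavelets library (one possible choice; the data vector, wavelet name, and truncation threshold are illustrative):

    import pywt

    # Toy data vector X
    X = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]

    # Apply a single-level Haar discrete wavelet transform
    approx, detail = pywt.dwt(X, "haar")

    # Keep only the strongest detail coefficients (truncate the rest to zero)
    compressed_detail = [c if abs(c) > 1.0 else 0.0 for c in detail]
    print(list(approx), compressed_detail)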
- Numerosity:
• Numerosity Reduction Techniques replace the original data volume by
alternative, smaller forms of data representation.
- Parametric method:
• A model is used to estimate the data, so that only the model parameters need to be stored, instead of the actual data. Ex: Regression Models
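A minimal sketch of the idea with a simple linear regression in numpy (the x/y values are made up): instead of storing all the points, only the fitted slope and intercept are kept.

    import numpy as np

    # Hypothetical data: store only the regression parameters, not the raw points
    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2.1, 4.2, 5.9, 8.1, 10.0])

    slope, intercept = np.polyfit(x, y, deg=1)
    print(slope, intercept)  # two parameters summarise the whole data set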
- Non-Parametric Method:
• In statistics, a histogram is a graphical representation of the distribution of data.
• The histogram is represented by a set of rectangles, adjacent to each other, where each bar represents a group (bin) of data values.
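A minimal sketch of histogram-based reduction with numpy (the price values are made up): only the bin counts and bin edges are stored instead of every value.

    import numpy as np

    prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 15, 18, 20, 21, 25, 30]

    # Represent the data by bin counts instead of individual values
    counts, bin_edges = np.histogram(prices, bins=3)
    print(counts, bin_edges)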
- Sampling:
• Sampling is the selection of a subset of individuals from within a statistical
population to estimate characteristics of the whole population.
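A minimal sketch of simple random sampling without replacement, using Python's random module (the population and sample size are arbitrary):

    import random

    population = list(range(1, 1001))  # a hypothetical population of 1000 observations

    # Select a simple random sample (without replacement) to estimate population characteristics
    sample = random.sample(population, k=50)
    print(sum(sample) / len(sample))  # sample mean as an estimate of the population mean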
5. EXPLAIN!
->
• Data Table:
- Collection of measured data values represented as numbers or text. They are
raw before they are transformed.
• Data Value:
- Measurements of various details, expressed in different measures, for example:
- Distances (in cm/m)
- Categories (Telecom/Energy Industry)
- Weights (lb/kg)
• Observation:
- Each row in the Data Table contains information about a specific item.
• Variable:
- an attribute of a specified record
- Types of variables:
• Discrete Variable:
- If a variable contains a fixed number of values – be they numbers or categories – then such a variable is called a Discrete Variable.
- Ex: number of students, count of participants, different sectors (telecom, retail)
• Numerical Variable:
- If a variable contains a continuous numeric value (with infinite precision), then such a variable is called a Numerical Variable.
- Example: height, weight.
• Independent variable:
- a variable that stands alone and isn’t changed by the other variables.
• Dependent variable:
- the value which is dependent on the changes in the independent
variables.
• The Median:
- The median is the middle value of a variable's sorted values. For variables with an even number of values, the average of the two values closest to the middle is selected (sum the two values and divide by 2).
- The median can be calculated for variables measured on the ordinal, interval, and ratio scales and is often the best indication of central tendency for variables measured on the ordinal scale.
• The Mean:
- It is the most commonly used summary of central tendency for variables measured on the interval or ratio scales.
- It is the average of the given values.
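A minimal sketch using Python's statistics module (the values are illustrative):

    import statistics

    values = [3, 7, 8, 5, 12, 14, 21, 13, 18]

    print(statistics.median(values))  # middle value of the sorted data
    print(statistics.mean(values))    # sum of the values divided by their count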
• The distribution of the data can be understood by using simple data visualisations.
• There are three types of charts/visual representations that are most commonly used (a small plotting sketch is given after the Box Plots item below).
1. Bar Charts:
- For a variable measured on a nominal scale, a bar chart can be used to
display the relative frequencies for the different values.
- For nominal variables, the ordering of the x-axis is arbitrary; however,
they are often ordered alphabetically or based on the frequency value.
- The y-axis which measures frequency can also be replaced by values
representing the proportion or percentage of the overall number of
observations (replacing the frequency value).
- For variables measured on an ordinal scale containing a small number of
values, a bar chart can also be used to understand the relative
frequencies of the different values.
2. Frequency Histograms:
- The frequency histogram is useful for variables with an ordered scale—
ordinal, interval, or ratio—that contain a larger number of values.
- Each variable is divided into a series of groups based on the data values
and displayed as bars whose heights are proportional to the number of
observations within each group.
3. Box Plots :
- Box plots provide a summary of the overall frequency distribution of a
variable.
- Six values are usually displayed: the lowest value, the lower quartile (Q1), the median (Q2), the upper quartile (Q3), the highest value, and the mean. The box in the middle of the plot represents where the central 50% of observations lie.
- A vertical line inside the box shows the location of the median value, and a dot represents the location of the mean value.
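A minimal plotting sketch with matplotlib (one common choice; the category names and data values are made up), showing the three chart types side by side:

    import matplotlib.pyplot as plt

    categories = ["Telecom", "Retail", "Energy"]
    frequencies = [12, 30, 18]
    measurements = [4, 5, 5, 6, 7, 7, 7, 8, 9, 12, 15]

    fig, axes = plt.subplots(1, 3)
    axes[0].bar(categories, frequencies)           # bar chart of frequencies per category
    axes[1].hist(measurements, bins=4)             # frequency histogram of a numeric variable
    axes[2].boxplot(measurements, showmeans=True)  # box plot with median line and mean marker
    plt.show()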
• Standard Deviation:
- The standard deviation is the square root of the variance.
- The standard deviation is the most widely used measure of the deviation of a
variable.
- The higher the value, the more widely distributed the variable’s data values are
around the mean.
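A minimal sketch using the statistics module (the sample values are illustrative):

    import statistics

    values = [3, 7, 8, 5, 12, 14, 21, 13, 18]

    variance = statistics.variance(values)  # average squared deviation from the mean
    std_dev = statistics.stdev(values)      # square root of the variance
    print(variance, std_dev)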
- Replace all missing values of the attribute with the same constant, such as a label like "Unknown".
• Use a measure of the central tendency of the attribute (such as the mean or median) to fill in the missing value.
• Use the attribute mean or median for all samples belonging to the same class.
• Use the most probable value to fill in the missing value.
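A small sketch of the first two strategies with pandas (the column names and values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"city": ["Pune", None, "Delhi"],
                       "income": [40000.0, None, 52000.0]})

    # Strategy 1: replace missing values with a constant label
    df["city"] = df["city"].fillna("Unknown")

    # Strategy 2: replace missing values with a measure of central tendency
    df["income"] = df["income"].fillna(df["income"].mean())
    print(df)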
2. It groups data using a "top-down approach", since it starts with a predefined number of clusters.
3. It is computationally faster and can handle a greater number of observations than AHC (can be grouped under Advantages of K-Means).
5. When the data set contains many outliers, K-Means may not create an optimal grouping.
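A hedged sketch of K-Means with scikit-learn (one common library choice; the points and the choice of k=2 are illustrative):

    from sklearn.cluster import KMeans

    # Toy 2-D observations
    X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]

    # Top-down: the number of clusters is fixed in advance
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(labels, kmeans.cluster_centers_)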
2. It uses a "bottom-up approach" for clustering, as it starts with each observation and progressively merges them into clusters.
3. Its computational cost is higher, since it has to generate the hierarchical tree (can be grouped under Disadvantages of AH Clustering).
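A hedged sketch of agglomerative hierarchical clustering with scipy (one possible choice; the toy points are the same as in the K-Means sketch above):

    from scipy.cluster.hierarchy import linkage, fcluster

    X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]

    # Bottom-up: start from single observations and progressively merge clusters
    tree = linkage(X, method="ward")

    # Cut the hierarchical tree into two clusters
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)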
• This method can generate large numbers of rules that must be prioritized and
interpreted.
- Three values used to generate the association rules:
• Support:
- Its value is the proportion of observations that the rule selects out of all observations in the data set.
• Confidence:
- The Confidence score is a measure of how predictable a rule is.
- The Confidence or Predictability value is calculated as the support for the entire rule (the IF- and THEN-parts together) divided by the support for the observations satisfying the IF-part of the rule.
• Lift:
- The Lift Score indicates the strength of the association.
- Lift = Confidence / Support(THEN-part)
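A minimal worked sketch of these three values on a toy transaction set (the items and the rule IF {bread} THEN {butter} are made up):

    # Toy transactions
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
        {"bread", "butter", "jam"},
    ]
    n = len(transactions)

    if_part, then_part = {"bread"}, {"butter"}

    support_rule = sum(1 for t in transactions if (if_part | then_part) <= t) / n
    support_if = sum(1 for t in transactions if if_part <= t) / n
    support_then = sum(1 for t in transactions if then_part <= t) / n

    confidence = support_rule / support_if  # support of whole rule / support of IF-part
    lift = confidence / support_then        # confidence / support of THEN-part
    print(support_rule, confidence, lift)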
- Advantages (of Decision Trees):
1. They are easy to understand and are useful in explaining how decisions are reached based on multiple criteria.
2. They can handle categorical and continuous variables, since they partition a data set into distinct regions based on ranges or specific values.
- Disadvantages:
1. Building decision trees can be computationally expensive, particularly when
analyzing a large data set with many continuous variables.
2. Generating a useful decision tree automatically can be challenging (since large and complex trees can be easily generated; trees that are too small may not capture enough information; and generating the "best" tree through optimization is difficult).
• Dichotomous:
• Variables with two values are the most straightforward to split, since each branch represents a specific value. For example, a variable Temperature may have only two values: "hot" and "cold".
• Nominal:
• Since nominal values are discrete values with no order, a two-way split is accomplished by one subset being composed of the observations that are equal to a certain value and the other being those observations that do not equal that value.
• Ordinal:
• In the case where a variable’s discrete values are ordered, the resulting subsets
may be made up of more than one value, as long as the ordering is retained.
• Continuous:
• For variables with continuous values to be split two ways, a specific cut-off value
needs to be determined so that observations with values less than the cut-off are
in the subset on the left and those with values greater than or equal to are in the
subset on the right.
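A minimal sketch of a two-way split on a continuous variable (the cut-off of 25 and the values are arbitrary):

    # Hypothetical continuous attribute values (e.g., age) and a chosen cut-off
    values = [12, 19, 25, 31, 47, 52]
    cutoff = 25

    left = [v for v in values if v < cutoff]    # observations below the cut-off
    right = [v for v in values if v >= cutoff]  # observations at or above the cut-off
    print(left, right)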