DPIR_IA1

The document discusses various aspects of data handling, including sources and approaches for data collection, data transformation strategies, data integration, and data reduction techniques. It covers key concepts such as central tendency, hypothesis testing, data cleaning methods, exploratory data analysis, and clustering techniques like K-Means and Agglomerative Clustering. Each section provides definitions, examples, and significance of the respective topics in data analysis.


1. What are the various sources of data? Approaches to collect the data.

->
• Sources of data are:
• Various transactions done online (electricity bills, LIC premiums, online shopping, selling online)
• Data from Facebook, Instagram, WhatsApp, LinkedIn, Twitter
• Videos uploaded to the Internet and videos watched on various social media platforms
• Data generated through your mobile, in terms of the apps installed, the apps used, and the data generated through apps (gaming apps, pictures, e-commerce apps)
• IoT devices generating data
• Approaches to collect data:
• Surveys/polls – to gather data to answer specific questions (for example, a poll may be used to understand how a population of eligible voters will cast their vote in an upcoming election)
• Interviews, conducted over the phone, in person, or over the Internet, to elicit information on people's opinions, preferences, and behaviour
• Experiments
• Case studies (Uber, Amazon, smart toothbrushes)

2. What is Data Transformation? What are the strategies of data transformation?


->
• Data Transformation: The data is transformed/consolidated into forms appropriate for
analysis.
• The strategies for data transformation include the following:
1. Smoothing:
-> It is used to remove noise from the data.
-> Techniques like regression, binning and clustering are used in this process.

2. Attribute Construction:
-> Attribute construction can help improve accuracy and aids better understanding of the data set.
-> For example, we may wish to add an attribute area based on the attributes height and width.

3. Aggregation:
-> Summary or Aggregation operations are applied to the data.
->For Ex: The daily sales data may be aggregated so as to compute monthly
and annual total amounts.

4. Normalisation:
-> The attribute data are scaled so as to fall within a smaller range, say between -1.0 and +1.0 (e.g., mean normalisation).
5. Discretisation:
-> The raw values of a numeric attribute are replaced by either interval labels (e.g., 0-10, 11-20) or conceptual labels (youth, adult, senior).

6. Concept hierarchy generation:

->The attributes can be generalised to higher level concepts.
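Two of these strategies can be sketched in a few lines of Python. The value ranges, age cut-offs, and labels below are invented for illustration:

```python
# A minimal sketch of min-max normalisation into [-1, 1] and of
# discretisation into conceptual labels. Data and cut-offs are invented.

def min_max_normalise(values, new_min=-1.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def discretise(age):
    """Replace a raw numeric age with a conceptual label."""
    if age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

ages = [15, 25, 45, 70]
print(min_max_normalise(ages))        # smallest maps to -1.0, largest to 1.0
print([discretise(a) for a in ages])  # ['youth', 'adult', 'adult', 'senior']
```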

3. What is Data Integration ?


->
• Data Integration is the merging of data from multiple data sources.
• Careful integration can help reduce and avoid:
• Entity Identification Problem:
->For instance, when two databases holding customer information are integrated, customer-id in one database and cust-num in the other may refer to the same attribute, even though the field names differ.
->So how do we know that these fields are the same? What helps?
->Solution: Metadata, which contains details such as
❖ Name of the attribute
❖ Its meaning
❖ Its data type
❖ Range of values permitted for the attribute
❖ Null rules for handling blank/zero/null values

• Redundancy and Correlation Analysis:
->An attribute is said to be redundant if it can be derived from (is correlated with) another attribute or set of attributes.
->Example of positive correlation: as height increases, weight also increases.
->Example of negative correlation: time spent on the mobile vs. performance in exams.
->Redundancies can be detected by correlation analysis; there are tests that can be performed on different kinds of data.
->Chi-Squared Test: measures the correlation existing between nominal data.

• Tuple Duplication:
->Duplication at the tuple level (e.g., more than one identical tuple for a given unique data-entry case) is called tuple duplication. These identical tuples have to be eliminated.

• Data value conflict detection and resolution:
->Data value conflicts refer to discrepancies in the data.
->Discrepancies may arise from inconsistent data representations and inconsistent use of codes.
->For example:
1) Weight attributes may be stored in different units.
2) When exchanging information between schools, each school may have its own curriculum and grading scheme.
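The chi-squared test mentioned above can be sketched as follows; the contingency-table counts are invented for illustration:

```python
# A hedged sketch of correlation analysis for nominal data: computing
# the chi-squared statistic for a contingency table of observed counts.
# The counts below are invented.

def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# e.g., rows = two customer groups, columns = two product preferences
observed = [[250, 200],
            [50, 1000]]
print(round(chi_squared(observed), 2))  # 507.94
```

A large statistic (compared against a chi-squared table for the appropriate degrees of freedom) suggests the two nominal attributes are correlated, i.e., one may be redundant.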

4. What is Data Reduction?


->
• When the data set is huge, complex data analysis on that amount of data can take a long time, making the analysis impractical or infeasible; data reduction is a solution to this problem.
• The technique is applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data and produces the same analytical results.
• Strategies in data reduction are:
• Dimensionality Reduction: the process of reducing the number of attributes under consideration. Dimensionality reduction methods include:
- Wavelet transforms:
• The discrete wavelet transform is a signal-processing technique that, when applied to a data vector X, transforms it into a numerically different vector X' of wavelet coefficients.
• Both vectors are of the same length.
• Its usefulness lies in the fact that the wavelet-transformed data can be truncated, storing only a small fraction of the strongest wavelet coefficients.

- Attribute Subset Selection:


• Data sets for analysis may contain hundreds or thousands of attributes, many of which may be irrelevant to the results.
• For example, if the task is to classify customers based on whether or not they are likely to purchase a popular new product when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
• Attribute subset selection reduces the data size by removing irrelevant or redundant attributes (or dimensions).

- Numerosity Reduction:
• Numerosity reduction techniques replace the original data volume with alternative, smaller forms of data representation.

- Parametric method:
• A model is used to estimate the data, so that only the data parameters need to
be stored, instead of the actual data. Ex: Regression Models
- Non-Parametric Method:
• In statistics, a histogram is a graphical representation of the distribution of data.
• A histogram is represented by a set of adjacent rectangles, where each bar represents a kind of data.
- Sampling:
• Sampling is the selection of a subset of individuals from within a statistical
population to estimate characteristics of the whole population.
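Sampling as a reduction strategy can be sketched as follows; the population here is a made-up stand-in for a large data set:

```python
# A minimal sketch of simple random sampling as numerosity reduction:
# keep a small subset of observations to estimate a population statistic.
import random

random.seed(42)  # fixed seed so the illustration is reproducible
population = list(range(1, 10001))       # stand-in for a large data set
sample = random.sample(population, 100)  # sampling without replacement

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
# The sample mean approximates the population mean (5000.5) from only
# 1% of the observations.
print(pop_mean, round(sample_mean, 1))
```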

5. Explain the following terms.
->
• Data Table:
- A collection of measured data values, represented as numbers or text. The values are raw before they are transformed.
• Data Value:
- Measurements of various details in different measures:
- Distances (in cm/m)
- Categories (telecom/energy industry)
- Weights (lb/kg)
• Observation:
- Each row in the Data Table contains information about a specific item.
• Variable:
- An attribute of a specified record.
- Types of variables:
• Discrete Variable:
- If a variable contains a fixed number of values, be they numbers or categories, it is called a discrete variable.
- Ex: number of students, count of participants, different sectors (telecom, retail)
• Numerical Variable:
- If a variable contains a continuous numeric value (with infinite precision), it is called a numerical variable.
- Example: height, weight.

• Nominal scales (Value):


- limited number of different values that cannot be ordered.
- Ex: Financial, Engineering, Retail
• Ordinal scales (value):
- Values can be ordered or ranked. They have a fixed number of categories, but with ranks.
- Ex: low, medium, high.
• Interval scales (value):
- Values where the interval between values can be compared.
- Ex: on the Fahrenheit scale with values 5, 10, 15 degrees Fahrenheit, the interval is 5 degrees Fahrenheit.
• Ratio scale (value) :
- Intervals between values and ratios of values can be compared.
- Ex: For Bank Balance of 5$, 10$ and 15$, the difference between each
pair is 5$. And 10$ is twice as much as 5$.
• Dichotomous Variable:
- A variable that contains only two values. Ex: yes/no replies to questionnaires.
• Binary Variable:
- a Dichotomous variable with values 0/1.
- Ex: 1 to represent if a product is purchased.
- 0 to represent the product is not purchased.

• Independent variable:
- a variable that stands alone and isn’t changed by the other variables.
• Dependent variable:
- the value which is dependent on the changes in the independent
variables.

6. What is Central Tendency? Approaches to calculate the central location and its significance.
->
• There are various ways in which a variable can be summarised; the most important is the value used to characterise the centre of the set of values it contains.
• It is the one value that can be used as a representative of the whole set of data.
• The common statistical approaches for calculating the central location are :
• The Mode:
- When a researcher quotes the opinion of a group, he/she is probably referring to the most frequently expressed opinion, which is the modal opinion.
- It is defined as the most frequently occurring value in the data.
- The mode provides the only measure of central tendency for variables measured on a nominal scale;
- however, the mode can also be calculated for variables measured on the ordinal, interval, and ratio scales.

• The Median:
- The median is the middle value once the values are ordered. For variables with an even number of values, the average of the two values closest to the middle is selected (sum the two values and divide by 2).
- The median can be calculated for variables measured on the ordinal, interval, and ratio scales and is often the best indication of central tendency for variables measured on the ordinal scale.
• The Mean:
- It is the most commonly used summary of central tendency for variables measured on the interval or ratio scales.
- It is the average of the values given.
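The three measures can be computed with Python's standard-library statistics module; the scores below are invented:

```python
# The three measures of central location on a made-up set of exam scores.
import statistics

scores = [35, 40, 40, 55, 60, 70]

print(statistics.mode(scores))    # 40: the most frequently occurring value
print(statistics.median(scores))  # 47.5: average of the two middle values (even count)
print(statistics.mean(scores))    # 50: sum of the values divided by the count
```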

7. What is Frequency Distribution? Three types of visual representation.
->
• The central location is a single value that characterises an individual variable's data values; it provides no insight into the variation of the data.
• The frequency distribution, which is based on a simple count of how many times a value occurs, is often a starting point for the analysis of variation.

• The distribution of the data can be understood by using simple data visualisations.
• There are three types of Charts/Visual Representations that are most commonly used.
1. Bar Charts:
- For a variable measured on a nominal scale, a bar chart can be used to
display the relative frequencies for the different values.
- For nominal variables, the ordering of the x-axis is arbitrary; however,
they are often ordered alphabetically or based on the frequency value.
- The y-axis which measures frequency can also be replaced by values
representing the proportion or percentage of the overall number of
observations (replacing the frequency value).
- For variables measured on an ordinal scale containing a small number of
values, a bar chart can also be used to understand the relative
frequencies of the different values.
2. Frequency Histograms:
- The frequency histogram is useful for variables with an ordered scale—
ordinal, interval, or ratio—that contain a larger number of values.
- Each variable is divided into a series of groups based on the data values
and displayed as bars whose heights are proportional to the number of
observations within each group.

3. Box Plots:
- Box plots provide a summary of the overall frequency distribution of a variable.
- Six values are usually displayed: the lowest value, the lower quartile (Q1), the median (Q2), the upper quartile (Q3), the highest value, and the mean. The box in the middle of the plot represents where the central 50% of observations lie.
- A vertical line inside the box shows the location of the median value, and a dot represents the location of the mean value.
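The frequency count that underlies a bar chart can be sketched as follows, with invented sector labels:

```python
# A minimal sketch of a frequency distribution: count how many times
# each nominal value occurs, the starting point for a bar chart.
from collections import Counter

sectors = ["telecom", "retail", "telecom", "energy", "retail", "telecom"]
freq = Counter(sectors)

print(freq.most_common())  # [('telecom', 3), ('retail', 2), ('energy', 1)]
for value, count in freq.most_common():
    print(f"{value:8} {'#' * count}")  # a text-mode 'bar chart' of the counts
```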

8. What are Variance and Standard Deviation?


->
• Variance:
- A measure of how much the values of a variable differ from the mean.
- For variables that represent only a sample of some population and not the population as a whole, the variance formula is:
- variance = Σ(value - mean)² / (n - 1)

• Standard Deviation:
- The standard deviation is the square root of the variance.
- The standard deviation is the most widely used measure of the deviation of a
variable.
- The higher the value, the more widely distributed the variable’s data values are
around the mean.
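The sample-variance formula above can be checked directly; the values are invented:

```python
# Sample variance (n - 1 denominator) and standard deviation computed
# directly from the formula, on invented values.
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(values) / len(values)  # 5.0

variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
std_dev = math.sqrt(variance)     # square root of the variance

print(round(variance, 4), round(std_dev, 4))  # 4.5714 2.1381
```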

9. What is a Hypothesis Test?


->
• Hypothesis: A proposed explanation made on the basis of limited evidence as a starting
point for further investigation
• Hypothesis testing is the process used to evaluate the strength of evidence from the
sample.
• The Null Hypothesis is stated in terms of what would be expected if there were nothing unusual about the measured values of the observations in the data from the samples we collect; "null" implies the absence of effect.
• When a null hypothesis is tested, there can be two outcomes: either the hypothesis is rejected, or we fail to reject it (informally, it is accepted).
• Hypothesis Tests are used to support making decisions by helping to understand
whether the data collected from a sample of the observations support a particular
hypothesis.
• Example: Suppose a company claims that each shampoo bottle it fills contains 200 mL. To test this hypothesis, the company collects a random sample of 100 shampoo bottles and precisely measures the contents of each bottle. If it is inferred from the sample that the average amount of shampoo per bottle is not 200 mL, then a decision may be made to stop production and rectify the manufacturing problem.
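The shampoo scenario can be sketched as a one-sample t-test. The sample measurements below are invented, and the t-statistic here is one common way to carry out such a test (it would then be compared against a t-table for n - 1 degrees of freedom):

```python
# A hedged sketch of a one-sample t-test for the null hypothesis
# "mean fill = 200 mL". The sample values are invented.
import math
import statistics

claimed_mean = 200.0
sample = [198.5, 199.0, 201.2, 197.8, 198.9, 199.5, 200.1, 198.2]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)

# t statistic: how many standard errors the sample mean is from 200 mL
t = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))
print(round(sample_mean, 2), round(t, 2))

# A large |t| is evidence against the null hypothesis; a value near 0
# means the sample is consistent with the 200 mL claim.
```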

10.What is Data Cleaning and ways to clean the data?


->
• This process attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
• Basic methods for Data Cleaning :
* Ways to handle missing values:
• Ignoring the tuple:
- This is usually done when the class label (assuming the task is classification) is missing.
• Filling in the missing value manually:
- This approach is time-consuming and may not be feasible for a large data set with many missing values.
• Use a global constant to fill in the missing values:
- Replace all missing values of the attribute with the same constant, such as a label like "Unknown".
• Use a measure of the central tendency for the attributes
• Use the attribute mean or median for all samples belonging to the same class
• Use the most probable value to fill in the missing value.

* Data Smoothing Techniques:


❖ Binning
❖ Regression
❖ Outlier Analysis
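Two of the missing-value strategies can be sketched as follows, using None to mark a missing value in invented data:

```python
# A minimal sketch of filling missing values with a global constant
# and with the attribute mean. The ages are invented; None marks a
# missing value.

ages = [25, None, 31, None, 40]

# Strategy 1: global constant ("Unknown")
with_constant = [a if a is not None else "Unknown" for a in ages]

# Strategy 2: attribute mean of the known values
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)  # (25 + 31 + 40) / 3 = 32.0
with_mean = [a if a is not None else mean_age for a in ages]

print(with_constant)  # [25, 'Unknown', 31, 'Unknown', 40]
print(with_mean)      # [25, 32.0, 31, 32.0, 40]
```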

11. What is Exploratory Data Analysis?


->
• Exploring the data in terms of :
- Examining the structure and components of the DataSet
- Distribution of individual variables
- Relationship between 2 variables
- Data Visualization tools to quickly absorb the information said in the DataSet
- To determine whether the question can be answered by the data that you have
- To develop a sketch of the answer to your question.
• For example:
->The question can be: "Do counties in the eastern United States have higher ozone levels than counties in the western United States?"
->To answer this question, you need ozone, county, and US-region data among the variables of each observation.

12. K-Means Clustering.


->
1. It is an example of Partitioning method of grouping observations

2. It groups data using a "top-down approach", since it starts with a predefined number of clusters.

3. It is computationally faster and can handle a greater number of observations than AHC (an advantage of K-Means).

4. The number of groups has to be specified before creating the clusters.

5. When the data set contains many Outliers, K Means may not create optimal
grouping.
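The K-Means loop can be sketched on one-dimensional data; the starting centroids and points are hand-picked for illustration:

```python
# A hedged sketch of K-Means in 1-D: assign each point to its nearest
# centroid, recompute each centroid as its cluster mean, and repeat.

def k_means_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: the nearest centroid claims the point
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Note that the number of clusters is fixed up front by the starting centroids, matching point 4 above, and a single extreme outlier would drag its cluster mean away from the rest of the group, matching point 5.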

13. Agglomerative Clustering.


->
1. It is an example of a hierarchical method of grouping observations

2. It uses “bottom up approach” for Clustering, as it starts with each observation and
progressively creates Clusters

3. Computational cost is higher since it has to generate the hierarchical tree (a disadvantage of AH Clustering).

4. Limited to datasets with fewer than 10000 Observations.
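The bottom-up merging can be sketched on one-dimensional data; this version merges by centroid distance, which is one of several possible linkage choices:

```python
# A hedged sketch of agglomerative (bottom-up) clustering in 1-D:
# every point starts in its own cluster, and the two closest clusters
# (by centroid distance) are merged until k clusters remain.

def agglomerate(points, k):
    clusters = [[p] for p in points]  # each observation starts alone
    centroid = lambda c: sum(c) / len(c)
    while len(clusters) > k:
        # Find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agglomerate([1, 2, 3, 10, 11, 12], k=2))  # [[1, 2, 3], [10, 11, 12]]
```

The nested pairwise search is what makes the method expensive (point 3 above): every merge step compares every pair of remaining clusters.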

14. All about Association Rules.


->
- The association rules method is an example of an unsupervised grouping method.
- The association rules method groups observations and attempts to discover links or
associations between different attributes of the group.
- Advantages:
• The generated rules are easy to understand.
• This technique can be used with large numbers of observations.
- Disadvantages:
• This method forces you either to restrict your analysis to variables that are categorical or to convert continuous variables to categorical variables.
• Generating the rules can be computationally expensive, especially where a data set has many variables or many possible values per variable, or both.

• This method can generate large numbers of rules that must be prioritized and
interpreted.
- Three values are used to generate the association rules:
• Support:
- Its value is the proportion of observations a rule selects out of all observations in the data set.

• Confidence :
- The Confidence score is a measure for how predictable a rule is.
- The Confidence or Predictability value is calculated using the support
for the entire group divided by the support for all observations satisfied
by the IF-part of the rule.

• Lift:
- The lift score indicates the strength of the association.
- Lift = confidence / support(THEN-part)
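The three scores can be computed for a single hypothetical rule, IF {bread} THEN {butter}, over invented market-basket data:

```python
# A hedged sketch of support, confidence, and lift for one rule over
# invented transactions. Each transaction is a set of purchased items.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Proportion of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

rule_support = support({"bread", "butter"})     # support of the whole rule
confidence = rule_support / support({"bread"})  # divided by IF-part support
lift = confidence / support({"butter"})         # vs. THEN-part support alone

print(rule_support, round(confidence, 3), round(lift, 3))
```

Here the rule selects 3 of 5 transactions (support 0.6), holds in 3 of the 4 baskets containing bread (confidence 0.75), and has lift just below 1, meaning bread buyers are not noticeably more likely than average to buy butter in this toy data.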

15. All about Decision Trees.


->
- Decision trees are an example of a supervised method. Each observation is placed into
interesting groups based on selected variables.
- They can handle categorical and continuous variables since they partition a data set
into distinct regions based on ranges or specific values.
- A tree is made up of a series of decision points, where the split of the entire set of
observations or a subset of the observations is based on some criteria.
- Each point in the tree represents a set of observations under a particular attribute
called a node.
- The relationship between two connected nodes is defined as a parent–child
relationship.
- The variable that is responsible for the larger set that will be divided into two or more
smaller sets is the parent node.
- The nodes resulting from the division of the parent are child nodes.
- A child node with no children is called a leaf node.

- Advantages:
1. They are easy to understand and useful for explaining how decisions are reached based on multiple criteria.
2. They can handle categorical and continuous variables since they partition a data set into distinct regions based on ranges or specific values.
- Disadvantages:
1. Building decision trees can be computationally expensive, particularly when
analyzing a large data set with many continuous variables.
2. Generating a useful decision tree automatically can be challenging (since large and complex trees can be easily generated; trees that are too small may not capture enough information; and generating the "best" tree through optimization is difficult).

16. All about Splitting.


->
- A table of data is used to generate a decision tree where certain variables are used as
potential decision points (splitting variables)
- and one variable is used to guide the construction of the tree (response variable).
- The response variable will be used to guide which splitting variables are selected and at
what value the split is made.

• This uses the concept of entropy (impurity).


• As the tree is being generated, it is desirable to decrease the level of impurity until ideally
there is only one category at a terminal node (a node with no children)

• Dichotomous:
• Variables with two values are the most straightforward to split, since each branch represents a specific value. For example, a variable Temperature may have only two values: "hot" and "cold".

• Nominal:
• Since nominal values are discrete values with no order, a two-way split is accomplished by one subset being composed of the observations that equal a certain value and the other being those observations that do not equal that value.

• Ordinal:
• Where a variable's discrete values are ordered, the resulting subsets may be made up of more than one value, as long as the ordering is retained.

• Continuous:
• For variables with continuous values to be split two ways, a specific cut-off value
needs to be determined so that observations with values less than the cut-off are
in the subset on the left and those with values greater than or equal to are in the
subset on the right.
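The entropy (impurity) measure that guides these splits can be sketched as follows; the Temperature labels reuse the example above:

```python
# A minimal sketch of Shannon entropy as a node-impurity measure:
# a pure node scores 0 bits, a 50/50 node scores 1 bit.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the class labels at a node."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["hot", "hot", "hot", "hot"]))             # 0.0: a pure node
print(entropy(["hot", "hot", "cold", "cold"]))           # 1.0: maximally impure
print(round(entropy(["hot", "hot", "hot", "cold"]), 3))  # 0.811
```

As the tree grows, candidate splits are scored by how much they decrease this impurity, ideally down to a single category at each terminal node.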

