7dm Midterm Reviewer
Ordinal Variables
● An ordinal variable can be discrete or continuous
● Order is important, e.g., rank
● Can be treated like interval-scaled
○ replace x_if by its rank r_if ∈ {1, …, M_f}
○ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1)
○ compute the dissimilarity using methods for interval-scaled variables
Attributes of Mixed Types
● A database may contain all attribute types
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects
○ f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
○ f is numeric: use the normalized distance
○ f is ordinal
■ Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
■ Treat z_if as interval-scaled
Cosine Similarity
Standardizing Numeric Data
● Z-score: z = (X − μ) / σ
○ X: raw score to be standardized, μ: mean of the population, σ: standard deviation
○ the distance between the raw score and the population mean in units of the standard deviation
○ negative when the raw score is below the mean, positive when above
● An alternative way: calculate the mean absolute deviation s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where m_f is the mean of variable f
○ standardized measure (z-score): z_if = (x_if − m_f) / s_f
● Scatterplot - provides a summary of bivariate data to see clusters of points and outliers
● Samples of Nominal Attributes
○ Ethnicity
○ Hair color
○ Gender
● Whiskers of a boxplot (lines outside the box) represent the minimum and maximum
● An outlier in a box plot analysis is a value more than 1.5 times the interquartile range below the first quartile or above the third quartile
● Outlier detection - can be used to detect fraud in a series of credit card transactions
● Stratified Sampling - sampling type that divides subjects into subgroups, then randomly samples each group
● Standard deviation is computed as the square root of the variance
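The two standardizations in these notes (z-score with the standard deviation, and the alternative with the mean absolute deviation) can be sketched as below. This is a minimal illustration, not a reference implementation: the function names and the sample scores are made up for the example, and the population formulas are assumed.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardize using the population mean and standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # square root of the population variance
    return [(x - mu) / sigma for x in values]


def mad_scores(values):
    """Standardize using the mean absolute deviation instead of sigma."""
    m = fmean(values)
    s = fmean([abs(x - m) for x in values])  # mean absolute deviation
    return [(x - m) / s for x in values]


# Hypothetical raw scores, chosen only for illustration.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(scores)      # negative below the mean, positive above
mad = mad_scores(scores)
```

The mean absolute deviation does not square the deviations, so a single extreme value inflates the scale less than it inflates the standard deviation.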
● The IQR measures the range between 25th
and 75th percentiles in a dataset
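As a rough sketch of the IQR and the 1.5 × IQR box plot rule for outliers: the helper name and the sample data are assumptions for illustration, and quartile conventions differ between tools, so other software may report slightly different cut-offs.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the IQR and the values lying beyond k * IQR from the quartiles."""
    # quantiles() sorts the data itself; "inclusive" treats it as a full population.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return iqr, [x for x in values if x < low or x > high]


# Hypothetical measurements with one suspicious value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
iqr, outliers = iqr_outliers(data)
```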
● Symmetric binary - both outcomes have equal importance
● Asymmetric binary - the outcomes are not equally important
● Low, Medium, High can represent an Ordinal Type
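To tie this to the Attributes of Mixed Types notes, here is a small sketch that ranks Low/Medium/High, maps the rank onto [0, 1], and combines nominal, numeric, and ordinal attributes into one dissimilarity. It assumes every attribute gets equal weight and no values are missing. All names (`LEVELS`, `mixed_dissimilarity`, the sample records) are hypothetical.

```python
LEVELS = ["Low", "Medium", "High"]  # assumed ordering of the ordinal states


def ordinal_to_unit(value, levels=LEVELS):
    """Map an ordinal state onto [0, 1] using (rank - 1) / (M - 1)."""
    rank = levels.index(value) + 1  # 1-based rank
    return (rank - 1) / (len(levels) - 1)


def mixed_dissimilarity(a, b, kinds, ranges):
    """Average the per-attribute dissimilarities over all attributes.

    kinds[f] is "nominal", "numeric", or "ordinal";
    ranges[f] is max - min of a numeric attribute f.
    Assumption: every attribute is present and weighted equally.
    """
    total = 0.0
    for f, kind in kinds.items():
        if kind == "nominal":      # also covers symmetric binary
            d = 0.0 if a[f] == b[f] else 1.0
        elif kind == "numeric":    # normalized distance
            d = abs(a[f] - b[f]) / ranges[f]
        elif kind == "ordinal":    # treat the mapped rank as interval-scaled
            d = abs(ordinal_to_unit(a[f]) - ordinal_to_unit(b[f]))
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d
    return total / len(kinds)


# Hypothetical records, for illustration only.
kinds = {"hair": "nominal", "age": "numeric", "risk": "ordinal"}
ranges = {"age": 40.0}
p = {"hair": "black", "age": 20, "risk": "Low"}
q = {"hair": "brown", "age": 30, "risk": "High"}
d = mixed_dissimilarity(p, q, kinds, ranges)
```

Here the three contributions are 1 (different hair color), 0.25 (10 out of a range of 40), and 1 (Low vs. High), so the combined dissimilarity is their average. Handling asymmetric binary attributes or missing values would need per-attribute weights, which this sketch leaves out.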
● Histogram - display of tabulated
frequencies shown as bars
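A tiny text-only sketch of the idea, tabulating frequencies and drawing each one as a bar. The function name and sample values are made up for illustration.

```python
from collections import Counter


def text_histogram(values):
    """Tabulate frequencies and render each count as a bar of '#' characters."""
    counts = Counter(values)
    return [f"{value}: {'#' * counts[value]}" for value in sorted(counts)]


# Hypothetical observations.
bars = text_histogram([3, 1, 2, 3, 3, 2])
for line in bars:
    print(line)
```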
● Lossy Compression - type of compression
that reduces the size permanently due to
elimination of information
● Lossless compression - technique in which data, when decompressed, is restored to its original form
● Simple Linear Regression - involves one independent variable and one dependent variable. The relationship is represented by a straight line
● Data Cube - organization of data in a way that facilitates complex queries and analysis across multiple dimensions
● Outlier - data object that does not comply with the general behavior of the data
● Data reduction - produces a data set that is smaller in volume but yields the same or almost the same analytical results
● Noise in data - random error or variance in a measured variable
ADDITIONAL INFORMATION
● The primary goal of data visualization is to gain insight into data through graphical representation
● Tag Cloud - technique for visualizing user-generated tags where the importance of a tag is represented by font size/color
● Data Cleaning - first process of KDD
● Data Integration - combines data from multiple sources into a coherent store
● Data Reduction - strategy applied to shorten complex data analysis time, such as removing unimportant attributes or applying data compression
● Sampling - technique that involves taking a small number of participants from a much bigger group
● Mode - most frequently occurring value in the list
● Mean - adding the numbers and dividing the sum by the number of numbers in the list
- shanon