
7DM - MIDTERM REVIEWER

MODULE 1 - INTRODUCTION

1IR (First Industrial Revolution) - Steam Engine/Machine
2IR (Second Industrial Revolution) - Electricity

Why Data Mining?
● Explosive Growth of Data: From terabytes to petabytes.
● Major Sources:
○ Business: E-commerce, stocks, etc.
○ Science: Remote sensing, bioinformatics, etc.
○ Society: News, YouTube, etc.
● Problem: "Drowning in data but starving for knowledge."

Evolution of Sciences
● Before 1600: Empirical science.
● 1600-1950s: Theoretical science.
○ Each discipline developed a theoretical component, with theoretical models motivating experiments and generalizing understanding.
● 1950s-1990s: Computational science
○ Over the last 50 years, most disciplines have added a third branch: computational (e.g., empirical, theoretical, and computational ecology, physics, or linguistics). Computational science traditionally involved simulations, born from the inability to find closed-form solutions for complex mathematical models.
● 1990-now: Data science
○ A flood of data from new scientific instruments and simulations.
○ The ability to economically store and manage petabytes of data online.
○ The Internet and computing Grid make these archives universally accessible.
○ Scientific information management, organization, query, and visualization tasks now scale almost linearly with data volumes. Data mining has become a major challenge.

Evolution of Database Technology
● 1960s: Data collection, IMS, network DBMS.
● 1970s: Relational data model and DBMS.
● 1980s: RDBMS, advanced data models (OO, deductive).
● 1990s: Data warehousing, multimedia databases, web.
● 2000s: Stream data management, data mining applications.

What is Data Mining?
● Extraction of interesting patterns or knowledge from large data sets.
● Alternative names: Knowledge Discovery in Databases (KDD), information harvesting, data dredging.

Knowledge Discovery (KDD) Process
● Steps: Data cleaning, integration, selection, mining, pattern evaluation.
● Input data -> Data Pre-processing -> Data Mining -> Post Processing

Example: Web Mining Framework
● Involves data cleaning, integration, warehousing, and selection for mining.

Multidimensional View of Data Mining
● Data to be mined
○ Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
● Knowledge to be mined (or: Data mining functions)
○ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
○ Descriptive vs. predictive data mining
○ Multiple/integrated functions and mining at multiple levels
● Techniques utilized
○ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
● Applications adapted
○ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

On what kinds of Data?
● Database-oriented data sets and applications
○ Relational database, data warehouse, transactional database
● Advanced data sets and advanced applications
○ Data streams and sensor data
○ Time-series data, temporal data, sequence data (incl. bio-sequences)
○ Structure data, graphs, social networks and multi-linked data
○ Object-relational databases
○ Heterogeneous databases and legacy databases
○ Spatial data and spatiotemporal data
○ Multimedia database
○ Text databases
○ The World-Wide Web

Data Mining Functions
● Generalization
○ Information integration and data warehouse construction
■ Data cleaning, transformation, integration, and multidimensional data model
○ Data cube technology
■ Scalable methods for computing (i.e., materializing) multidimensional aggregates
■ OLAP (online analytical processing)
○ Multidimensional concept description: Characterization and discrimination
■ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
● Association and Correlation
○ Frequent patterns (or frequent itemsets)
■ What items are frequently purchased together in your Walmart?
○ Association, correlation vs. causality
■ A typical association rule
● Diaper -> Beer [0.5%, 75%] (support, confidence)
■ Are strongly associated items also strongly correlated?
○ How to mine such patterns and rules efficiently in large datasets?
○ How to use such patterns for classification, clustering, and other applications?
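A rule's support and confidence can be checked directly from transaction counts. A minimal Python sketch (the four toy transactions are invented, not from the module):

```python
# Minimal sketch: support and confidence for the rule X -> Y.
# The four transactions are invented toy data, not from the module.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

def support(itemset, ts):
    # Fraction of all transactions containing every item in `itemset`.
    return sum(itemset <= t for t in ts) / len(ts)

def confidence(x, y, ts):
    # Of the transactions containing X, the fraction also containing Y.
    return support(x | y, ts) / support(x, ts)

print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # 0.666...
```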
● Classification: Predictive modeling.
○ Classification and label prediction
■ Construct models (functions) based on some training examples
■ Describe and distinguish classes or concepts for future prediction
● E.g., classify countries based on climate, or classify cars based on gas mileage
■ Predict some unknown class labels
○ Typical methods
■ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
○ Typical applications
■ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, etc.
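To make the classification workflow concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (decision trees are one of the typical methods listed above); the tiny training set is invented for illustration:

```python
# Minimal sketch: classification with a decision tree (scikit-learn).
# The training data is invented toy data, not from the module.
from sklearn.tree import DecisionTreeClassifier

# Each row: [annual_income_k, late_payments]; label 1 = risky, 0 = safe
X_train = [[25, 4], [40, 3], [60, 0], [80, 1], [30, 5], [90, 0]]
y_train = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2)  # construct a model from training examples
model.fit(X_train, y_train)

print(model.predict([[55, 2]]))  # predict the unknown class label of a new object
```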
● Cluster Analysis: Grouping similar data.
○ Unsupervised learning (i.e., class label is unknown)
○ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
○ Principle: Maximizing intra-class similarity & minimizing inter-class similarity
○ Many methods and applications
● Outlier Analysis: Detecting anomalies.
○ Outlier: A data object that does not comply with the general behavior of the data
○ Noise or exception? ― One person's garbage could be another person's treasure
○ Methods: by-product of clustering or regression analysis, …
○ Useful in fraud detection, rare events analysis
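A minimal sketch of unsupervised clustering with scikit-learn's k-means, one of the many methods alluded to above (the 2-D points are invented):

```python
# Minimal sketch: unsupervised clustering with k-means (scikit-learn).
# The 2-D points are invented toy data, not from the module.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],    # one natural group
          [8, 8], [9, 10], [10, 9]]  # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment per point, e.g., [1 1 1 0 0 0]
print(km.cluster_centers_)  # one centroid per cluster
```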
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
● This analysis involves studying patterns over time, including trends, time-series, and deviations, often using techniques like regression and value prediction. Sequential pattern mining identifies patterns in sequences, such as purchasing behaviors (e.g., buying a camera, then memory cards). It also covers periodicity and motif analysis in biological sequences, similarity-based analysis, and mining ordered, time-varying data streams, which are often infinite.

Structure and Network Analysis
● This analysis focuses on graph mining to identify frequent subgraphs, such as chemical compounds and web fragments. It includes information network analysis, examining social networks with nodes (actors) and edges (relationships), such as author and terrorist networks. Individuals may belong to multiple networks (e.g., friends, family, classmates). Link mining emphasizes the semantic information carried by links. Web mining treats the web as a vast information network, facilitating analyses like PageRank, web community discovery, opinion mining, and usage mining.

Evaluation of Knowledge
● Not all mined knowledge is inherently interesting. While data mining can yield numerous patterns, some may only be relevant to specific contexts (e.g., time or location) and might not represent broader trends or could be transient. Evaluating mined knowledge involves determining its relevance and interest. Key considerations include:
○ Descriptive vs. Predictive: Differentiating between knowledge that describes data and knowledge that predicts future outcomes.
○ Coverage: Ensuring the mined patterns adequately represent the dataset.
○ Typicality vs. Novelty: Balancing between common patterns (typicality) and new, unique insights (novelty).
○ Accuracy: Assessing how correct the mined knowledge is.
○ Timeliness: Ensuring that the knowledge is relevant and up-to-date.

Confluence of Multiple Disciplines
● Tremendous Amount of Data: Algorithms must be highly scalable to manage vast datasets, often in the terabyte range.
● High Dimensionality: Data, such as microarrays, can contain tens of thousands of dimensions, complicating analysis.
● Complexity of Data: Various forms of data, including data streams, sensor data, time-series, and sequence data, add layers of complexity.
○ Structured Data: Analyzing graphs, social networks, and multi-linked data requires sophisticated techniques.
○ Diverse Data Sources: Handling heterogeneous databases, legacy systems, and various data types (spatial, spatiotemporal, multimedia, text, and web data) poses challenges.
● Advanced Applications: New software programs and scientific simulations demand innovative approaches to data analysis.

Applications of Data Mining
● Examples include web page analysis, recommender systems, targeted marketing, and bioinformatics.

Major Issues in Data Mining
● Mining Methodology: Handling various types of knowledge and data.
○ Mining various and new kinds of knowledge
○ Mining knowledge in multi-dimensional space
○ Data mining: An interdisciplinary effort
○ Boosting the power of discovery in a networked environment
○ Handling noise, uncertainty, and incompleteness of data
○ Pattern evaluation and pattern- or constraint-guided mining
● User Interaction
○ Interactive mining
○ Incorporation of background knowledge
○ Presentation and visualization of data mining results
● Efficiency and Scalability: Parallel and distributed mining.
○ Efficiency and scalability of data mining algorithms
○ Parallel, distributed, stream, and incremental mining methods
● Data Types: Managing multimedia, spatial, and text data.
● Privacy: Preserving data privacy in mining processes.

MODULE 2 - GETTING TO KNOW YOUR DATA

Types of Data Sets
● Record
○ Relational records
○ Data matrix, e.g., numerical matrix, crosstabs
○ Document data: text documents as term-frequency vectors (see the sketch after this list)
○ Transaction data
● Graph and network
○ World Wide Web
○ Social or information networks
○ Molecular structures
● Ordered
○ Video data: sequence of images
○ Temporal data: time-series
○ Sequential data: transaction sequences
○ Genetic sequence data
● Spatial, image and multimedia
○ Spatial data: maps
○ Image data
○ Video data
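A minimal sketch of the term-frequency-vector representation mentioned under "Document data" (the two toy documents are invented):

```python
# Minimal sketch: documents as term-frequency vectors.
# The two documents are invented toy examples.
from collections import Counter

docs = ["data mining finds patterns in data",
        "graph mining finds frequent subgraphs"]

vocab = sorted({w for d in docs for w in d.split()})
tf_vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(tf_vectors)  # one row of term counts per document
```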
Important Characteristics of Structured Data
● Dimensionality - Curse of dimensionality
● Sparsity - Only presence counts
● Resolution - Patterns depend on the scale
● Distribution - Centrality and dispersion

Data Objects
● Data sets are made up of data objects. A data object represents an entity. Examples:
○ sales database: customers, store items, sales
○ medical database: patients, treatments
○ university database: students, professors, courses
● Also called samples, examples, instances, data points, objects, tuples.
● Data objects are described by attributes.
● Database rows -> data objects; columns -> attributes

Attributes (or dimensions, features, variables)
● A data field, representing a characteristic or feature of a data object. E.g., customer_ID, name, address
● Types:
○ Nominal
○ Binary
○ Numeric: quantitative
■ Interval-scaled
■ Ratio-scaled

Attribute Types
● Nominal: categories, states, or "names of things"
○ Hair_color = {auburn, black, blond, brown, grey, red, white}
○ marital status, occupation, ID numbers, zip codes
● Binary - Nominal attribute with only 2 states (0 and 1)
○ Symmetric binary: both outcomes equally important, e.g., gender
○ Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
■ Convention: assign 1 to the most important outcome (e.g., HIV positive)
● Ordinal
○ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
○ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
● Quantity (integer or real-valued)
● Interval - Measured on a scale of equal-sized units
○ Values have order, e.g., temperature in C˚ or F˚, calendar dates
○ No true zero-point
● Ratio - Inherent zero-point. We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete VS. Continuous Attributes
● Discrete Attribute - Has only a finite or countably infinite set of values. E.g., zip codes, profession, or the set of words in a collection of documents. Sometimes represented as integer variables.
○ Note: Binary attributes are a special case of discrete attributes
● Continuous Attribute - Has real numbers as attribute values. E.g., temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
● Motivation - To better understand the data: central tendency, variation, and spread
● Data dispersion characteristics - median, max, min, quantiles, outliers, variance, etc.
● Numerical dimensions correspond to sorted intervals
○ Data dispersion: analyzed with multiple granularities of precision
○ Boxplot or quantile analysis on sorted intervals
● Dispersion analysis on computed measures
○ Folding measures into numerical dimensions
○ Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
● Mean (algebraic measure): sample mean x̄ = (1/n) Σ xi; population mean μ = (Σ x) / N. Note: n is sample size and N is population size.
○ Weighted arithmetic mean: x̄ = (Σ wi·xi) / (Σ wi)
○ Trimmed mean: chopping extreme values
● Median - Middle value if odd number of values, or average of the middle two values otherwise. Estimated by interpolation (for grouped data): median ≈ L1 + ((n/2 − (Σ freq)below) / freqmedian) × width
● Mode - Value that occurs most frequently in the data. Unimodal, bimodal, trimodal
● Empirical formula (for moderately skewed data): mean − mode ≈ 3 × (mean − median)
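These measures are easy to verify with Python's statistics module; the trimmed mean is hand-rolled here (the values are invented):

```python
# Minimal sketch: central tendency measures on invented toy data.
from statistics import mean, median, mode

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 71, 110]

print(mean(values), median(values), mode(values))

def trimmed_mean(xs, k=1):
    # Trimmed mean: chop the k smallest and k largest values first.
    xs = sorted(xs)
    return mean(xs[k:len(xs) - k])

print(trimmed_mean(values, k=2))
```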
Measuring the Dispersion of Data
● Quartiles, outliers and boxplots
○ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
○ Inter-quartile range: IQR = Q3 − Q1
○ Five number summary: min, Q1, median, Q3, max
○ Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
○ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
● Variance and standard deviation (sample: s, population: σ)
○ Variance (algebraic, scalable computation): s² = (1/(n − 1)) Σ (xi − x̄)² for a sample; σ² = (1/N) Σ (xi − μ)² for a population
○ Standard deviation s (or σ) is the square root of variance s² (or σ²)
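A numpy sketch of the same dispersion measures (values invented; note that numpy defaults to the population formulas, so ddof=1 is passed for the sample versions):

```python
# Minimal sketch: quartiles, IQR, outlier fences, variance, std (numpy).
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 71, 110])  # invented

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual outlier fences
print(x[(x < low) | (x > high)])            # 110 is flagged as an outlier

print(x.var(ddof=1), x.std(ddof=1))  # sample variance s^2 and std s
```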
Boxplot Analysis
● Five-number summary of a distribution
○ Minimum, Q1, Median, Q3, Maximum
● Boxplot
○ Data is represented with a box
○ The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
○ The median is marked by a line within the box
○ Whiskers: two lines outside the box, extended to Minimum and Maximum
○ Outliers: points beyond a specified outlier threshold, plotted individually

Properties of Normal Distribution Curve
● The normal (distribution) curve
○ From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
○ From μ−2σ to μ+2σ: contains about 95% of it
○ From μ−3σ to μ+3σ: contains about 99.7% of it

Graphic Display of Basic Statistical Descriptions
● Boxplot: graphic display of the five-number summary
● Histogram: x-axis represents values, y-axis represents frequencies
● Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are ≤ xi
● Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
● Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane

Histogram Analysis
● Histogram: Graph display of tabulated frequencies, shown as bars. It shows what proportion of cases fall into each of several categories
○ Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
○ The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

Quantile Plot
● Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
● Plots quantile information
○ For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi
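A minimal matplotlib sketch reproducing two of these displays, a histogram and a boxplot, on the same invented data:

```python
# Minimal sketch: histogram and boxplot of the same invented data.
import matplotlib.pyplot as plt

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 71, 110]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=5)  # frequencies per interval; bars are adjacent
ax1.set_title("Histogram")
ax2.boxplot(data)       # five-number summary; outliers drawn as points
ax2.set_title("Boxplot")
plt.show()
```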
Quantile-Quantile (Q-Q) Plot
● Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
● View: Is there a shift in going from one distribution to another?

Scatter Plot
● Provides a first look at bivariate data to see clusters of points, outliers, etc.
● Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Data Visualization
● Why data visualization?
○ Gain insight into an information space by mapping data onto graphical primitives
○ Provide a qualitative overview of large data sets
○ Search for patterns, trends, structure, irregularities, and relationships among data
○ Help find interesting regions and suitable parameters for further quantitative analysis
○ Provide visual proof of computer representations derived
● Categorization of visualization methods:
○ Pixel-oriented visualization techniques
■ For a data set of m dimensions, create m windows on the screen, one for each dimension
■ The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
■ The colors of the pixels reflect the corresponding values
○ Geometric projection visualization techniques
■ Visualization of geometric transformations and projections of the data
■ Methods
● Direct visualization
● Scatterplot and scatterplot matrices
● Landscapes
● Projection pursuit technique: Helps users find meaningful projections of multidimensional data
● Prosection views
● Hyperslice
● Parallel coordinates
○ Icon-based visualization techniques
■ Visualization of the data values as features of icons
■ Typical visualization methods
● Chernoff Faces - a way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length
● Stick Figures
■ General techniques
● Shape coding: Use shape to represent certain information encoding
● Color icons: Use color icons to encode more information
● Tile bars: Use small icons to represent the relevant feature vectors in document retrieval
○ Hierarchical visualization techniques
■ Visualization of the data using a hierarchical partitioning into subspaces
■ Methods
● Dimensional Stacking - Partitioning of the n-dimensional attribute space in 2-D subspaces, which are 'stacked' into each other. Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
● Worlds-within-Worlds
○ N-vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
○ Auto Visual: Static interaction by means of queries
● Tree-Map - Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
● Cone Trees - 3D cone-tree visualization technique; works well for up to a thousand nodes or so. First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
● InfoCube - A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
○ Visualizing complex data and relations - Visualizing non-numerical data: text and social networks
■ Tag cloud: visualizing user-generated tags

Similarity and Dissimilarity
● Similarity
○ Numerical measure of how alike two data objects are
○ Value is higher when objects are more alike
○ Often falls in the range [0,1]
● Dissimilarity (e.g., distance)
○ Numerical measure of how different two data objects are
○ Lower when objects are more alike
○ Minimum dissimilarity is often 0
○ Upper limit varies
● Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
● Data matrix
○ n data points with p dimensions
○ Two modes
● Dissimilarity matrix
○ n data points, but registers only the distance
○ A triangular matrix
○ Single mode

Proximity Measure for Nominal Attributes
● Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
○ Method 1: Simple matching
■ m: # of matches, p: total # of variables: d(i, j) = (p − m) / p
○ Method 2: Use a large number of binary attributes
■ creating a new binary attribute for each of the M nominal states
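Simple matching, d(i, j) = (p − m)/p, in a few lines (the nominal records are invented):

```python
# Minimal sketch: simple-matching dissimilarity for nominal attributes,
# d(i, j) = (p - m) / p, with m matches out of p attributes.
def nominal_dissim(a, b):
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))
    return (p - m) / p

obj1 = ("red", "single", "teacher")  # invented nominal records
obj2 = ("red", "married", "teacher")
print(nominal_dissim(obj1, obj2))    # 1/3 = 0.333...
```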
Proximity Measure for Binary Attributes
● A contingency table for binary data (q: # of attributes where both objects are 1; r: i is 1, j is 0; s: i is 0, j is 1; t: both are 0)
● Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
● Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
● Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)

Dissimilarity between Binary Variables
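The worked example that originally followed this heading was a figure and did not survive extraction; as a stand-in, a minimal sketch computing the asymmetric-binary dissimilarity and Jaccard similarity for two invented 1/0 vectors:

```python
# Minimal sketch: asymmetric binary dissimilarity and Jaccard similarity.
# q = 1-1 matches, r = 1-0, s = 0-1; the 0-0 count t is ignored (asymmetric).
def asymmetric_binary(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    d = (r + s) / (q + r + s)  # dissimilarity
    return d, 1 - d            # Jaccard similarity = q / (q + r + s)

p1 = [1, 0, 1, 0, 0, 0]  # invented symptom vectors (1 = present)
p2 = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary(p1, p2))  # (0.333..., 0.666...)
```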
Standardizing Numeric Data
● Z-score: z = (x − μ) / σ
○ X: raw score to be standardized, μ: mean of the population, σ: standard deviation
○ the distance between the raw score and the population mean in units of the standard deviation
○ negative when the raw score is below the mean, "+" when above
● An alternative way: Calculate the mean absolute deviation sf = (1/n)(|x1f − mf| + … + |xnf − mf|)
○ standardized measure (z-score): zif = (xif − mf) / sf
● Using mean absolute deviation is more robust than using standard deviation
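Z-score standardization in a few lines (raw scores invented; statistics.pstdev gives the population σ used in the formula above):

```python
# Minimal sketch: z-score standardization, z = (x - mu) / sigma.
import statistics

xs = [50, 60, 70, 80, 90]      # invented raw scores
mu = statistics.mean(xs)
sigma = statistics.pstdev(xs)  # population standard deviation

z = [(x - mu) / sigma for x in xs]
print(z)  # negative below the mean, positive above, 0 at the mean
```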
Distance on Numeric Data: Minkowski Distance
● Minkowski distance: A popular distance measure
d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)
● where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
● Properties
○ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
○ d(i, j) = d(j, i) (Symmetry)
○ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
● A distance that satisfies these properties is a metric
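The L-h norm is a one-liner; h = 1 gives Manhattan distance and h = 2 Euclidean (points invented):

```python
# Minimal sketch: Minkowski (L-h) distance between two p-dimensional points.
def minkowski(i, j, h):
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

p1, p2 = (1, 2), (3, 5)      # invented 2-D points
print(minkowski(p1, p2, 1))  # 5.0      (h=1, Manhattan / L1)
print(minkowski(p1, p2, 2))  # 3.605... (h=2, Euclidean / L2)
```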
Ordinal Variables
● An ordinal variable can be discrete or continuous
● Order is important, e.g., rank
● Can be treated like interval-scaled
○ replace xif by its rank rif ∈ {1, …, Mf}
○ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by zif = (rif − 1) / (Mf − 1)
○ compute the dissimilarity using methods for interval-scaled variables

Attributes of Mixed Types
● A database may contain all attribute types
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects
○ f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
○ f is numeric: use the normalized distance
○ f is ordinal
■ Compute ranks rif and zif = (rif − 1) / (Mf − 1)
■ Treat zif as interval-scaled

Cosine Similarity
● cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||), where • is the vector dot product and ||d|| is the length (Euclidean norm) of vector d
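A minimal sketch of cosine similarity between two term-frequency vectors (the counts are invented):

```python
# Minimal sketch: cosine similarity of two term-frequency vectors.
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norms = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norms

doc1 = [5, 0, 3, 0, 2]  # invented term counts
doc2 = [3, 0, 2, 0, 1]
print(cosine(doc1, doc2))  # ~0.99 -> nearly identical direction
```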
ADDITIONAL INFORMATION
● The primary goal of data visualization is to gain insight into data through graphical representation
● Tag Cloud - technique for visualizing user-generated tags, where the importance of a tag is represented by font size/color
● Data Cleaning - first process of KDD
● Data Integration - combines data from multiple sources into a coherent store
● Data Reduction - strategy applied to shorten complex data analysis time, e.g., removing unimportant attributes or applying data compression
● Sampling - technique that involves taking a small number of participants from a much bigger group
● Mode - most frequently occurring value in the list
● Mean - adding the numbers and dividing the sum by the count of numbers in the list
● Scatterplot - provides a summary of bivariate data, showing clusters of points and outliers
● Samples of Nominal Attributes
○ Ethnicity
○ Hair color
○ Gender
● Whiskers of a boxplot (lines outside the box) represent the minimum and maximum
● An outlier in a box plot analysis is a value outside 1.5 times the interquartile range
● Outlier detection - e.g., detecting fraud in a series of credit card transactions
● Stratified Sampling - sampling type that divides subjects into subgroups, then each group is randomly sampled
● Standard deviation is computed as the square root of variance
● The IQR measures the range between the 25th and 75th percentiles in a dataset
● Symmetric binary - both outcomes have equal importance
● Asymmetric binary - outcomes do not have equal importance
● Low, Medium, High can represent an Ordinal type
● Histogram - display of tabulated frequencies shown as bars
● Lossy Compression - type of compression that reduces the size permanently due to elimination of information
● Lossless compression - technique where, if data is decompressed, it is restored to its original form
● Simple Linear Regression - involves one independent variable and one dependent variable. The relationship is represented by a straight line
● Data Cube - organization of data in a way that facilitates complex queries and analysis across multiple dimensions
● Outlier - data object that does not comply with the general behavior of the data
● Data reduction - obtains a data set that is smaller in volume but produces the same or almost the same analytical results
● Noise on data - random error or variance in a measured variable

- shanon
