2 Data Mining Terms & Concepts

Uploaded by

saharsh0812
DATA MINING TERMS & CONCEPTS
DBMS
• A database system is the traditional way of storing and retrieving data.
• The major task of a database system is query processing.
• These systems are generally referred to as online transaction processing (OLTP) systems.
• These systems are used for the day-to-day operations of an organization.
Data Warehouse
• A data warehouse is a place where a huge amount of data is stored.
• It is meant for users or knowledge workers whose role is data analysis and decision making.
• These systems are referred to as online analytical processing (OLAP) systems.
DBMS and Data Warehouse Difference
OLTP and OLAP
• OLTP
  • Transaction-oriented applications.
  • Mainly concerned with the entry, storage, and retrieval of data.
  • Designed for day-to-day operations such as purchasing, inventory, payroll, accounting, etc.
  • Basically supports DML operations.
Users of OLTP
• Almost all industries, including:
  • Airlines
  • Supermarkets
  • Banking
  • Insurance
  • Etc.
• Data captured in OLTP systems is usually stored in commercial relational databases.
• For example, the database of a supermarket store consists of the following tables, which store data about its transactions, products, inventory, employees, etc.:
  • Transactions
  • ProductName
  • EmployeeDetails
  • InventorySupplies
  • Suppliers
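The kind of work an OLTP system does on such tables can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the column names, the sample rows, and the use of SQLite are assumptions, and only two of the tables listed above are modeled.

```python
import sqlite3

# In-memory database standing in for the supermarket's OLTP store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductName ("
    "product_id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)
conn.execute(
    "CREATE TABLE Transactions ("
    "txn_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER, "
    "FOREIGN KEY (product_id) REFERENCES ProductName(product_id))"
)

# Typical OLTP work: short DML statements that each touch a few rows.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO ProductName VALUES (1, 'Milk', 2.50)")
    conn.execute("INSERT INTO ProductName VALUES (2, 'Bread', 1.75)")
    conn.execute("INSERT INTO Transactions VALUES (100, 1, 3)")
    conn.execute("INSERT INTO Transactions VALUES (101, 2, 1)")

with conn:
    conn.execute("UPDATE ProductName SET price = 2.75 WHERE product_id = 1")
    conn.execute("DELETE FROM Transactions WHERE txn_id = 101")

# Point lookups on current data.
milk_price = conn.execute(
    "SELECT price FROM ProductName WHERE product_id = 1"
).fetchone()[0]
txn_count = conn.execute("SELECT COUNT(*) FROM Transactions").fetchone()[0]
```

Each `with conn:` block is one transaction, which is the unit that the concurrency control and recovery mechanisms of a real OLTP system protect.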
Advantages of OLTP
• Simplicity
• Efficiency
• Allows users to read, write, and delete data quickly
• Fast query processing
• Responds to user actions immediately and supports transaction processing on demand
Challenges
• Security
• It requires concurrency control (locking) and a recovery mechanism.
• The data content of an OLTP system is not suitable for decision making.
• A typical OLTP system manages the current data within the enterprise/organization. This data is too far removed from decision making.
Answer
The supermarket store is deciding on introducing a new product. The key issues under debate are: “Which product should they introduce?” and “Should it be specific to a few customer segments?”

The supermarket store is looking at offering some discount on their year-end sale. The questions here are: “How much discount should they offer?” and “Should different discounts be given to different customer segments?”
Answer: OLAP

• OLAP differs from a traditional database in the way the data is conceptualized and stored.
• OLAP data is held in dimensional form rather than relational form.
• The multidimensional data model is the lifeblood of OLAP.
• The multidimensional data model views the data in the form of a data cube.
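A data cube can be pictured as a mapping from one coordinate per dimension to an aggregated measure. The sketch below is a toy illustration in plain Python, not how an OLAP engine stores data; the dimensions (product, region, quarter), the sales figures, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, sales amount).
facts = [
    ("Milk", "North", "Q1", 100),
    ("Milk", "South", "Q1", 80),
    ("Bread", "North", "Q1", 50),
    ("Milk", "North", "Q2", 120),
    ("Bread", "South", "Q2", 70),
]

DIMENSIONS = ("product", "region", "quarter")


def build_cube(rows):
    """Sum the measure for every cell (product, region, quarter)."""
    cube = defaultdict(int)
    for product, region, quarter, amount in rows:
        cube[(product, region, quarter)] += amount
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube onto the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for cell, amount in cube.items():
        result[tuple(cell[i] for i in positions)] += amount
    return dict(result)


def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    position = DIMENSIONS.index(dimension)
    return {cell: amt for cell, amt in cube.items() if cell[position] == value}


cube = build_cube(facts)
sales_by_product = roll_up(cube, ["product"])
sales_by_region_quarter = roll_up(cube, ["region", "quarter"])
q1_cells = slice_cube(cube, "quarter", "Q1")
```

Rolling up to `["product"]` answers "how much did each product sell overall?", while slicing on a quarter fixes one dimension and leaves the others free.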
Distributed Data Store (Distributed
Database)
• A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. The term usually refers specifically to a distributed database, where users store information on a number of nodes.
Multidimensional Schema
• A multidimensional schema is designed specifically to model data warehouse systems.
• These schemas address the unique needs of very large databases designed for analytical purposes (OLAP).
• The two main types of schema are:
  • Star Schema
  • Snowflake Schema
Star Schema
• In a data warehouse star schema, the center of the star can have one fact table and a number of associated dimension tables.
• It is known as a star schema because its structure resembles a star.
• The star schema data model is the simplest type of data warehouse schema.
Star schema Example
Characteristics of star schema
• Every dimension in a star schema is represented by only one dimension table.
• Each dimension table contains a set of attributes.
• Each dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
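These characteristics can be made concrete with a small sketch using Python's sqlite3 module. The table names (fact_sales, dim_product, dim_store, dim_date), their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension, each holding that dimension's attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The fact table sits at the center and references every dimension
-- through a foreign key; the dimension tables do not reference each other.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Milk', 'Dairy'), (2, 'Bread', 'Bakery');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO dim_date    VALUES (1, 2024, 'Q1'), (2, 2024, 'Q2');
INSERT INTO fact_sales  VALUES
    (1, 1, 1, 10, 25.0),
    (1, 2, 1, 4, 10.0),
    (2, 1, 2, 6, 12.0);
""")

# An analytical query joins the fact table directly to the dimensions it needs.
revenue_by_category_quarter = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
```

Every join in the query goes from the fact table to a dimension table, never between two dimension tables.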
Snowflake Schema
Characteristics of Snowflake Schema

• It uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced because of the multiple tables involved.


Difference
ETL
• ETL is a process in data warehousing; it stands for Extract, Transform, and Load.
• In this process, an ETL tool extracts data from various source systems, transforms it in the staging area, and finally loads it into the data warehouse system.
Extraction
• The first step of the ETL process is extraction.
• In this step, data is extracted from various source systems into the staging area. The sources can be in various formats, such as relational databases, NoSQL, XML, and flat files.
• It is important to extract the data into the staging area first, and not directly into the data warehouse, because the extracted data is in various formats and may also be corrupted.
Transformation
• In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling NULL values with default values, mapping U.S.A, United States, and America to USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally the key attribute).
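The five transformation tasks can be sketched in a few lines of plain Python. The staged rows, the field names, and the default value "UNKNOWN" are assumptions made for the example; a real ETL tool would apply configurable rules instead.

```python
# Hypothetical rows as they might arrive in the staging area.
staged = [
    {"id": 3, "first": "Asha", "last": "Rao", "country": "U.S.A",
     "joined": "2023-05-14", "notes": "imported from CRM"},
    {"id": 1, "first": "Ben", "last": "Lee", "country": "America",
     "joined": "2022-11-02", "notes": ""},
    {"id": 2, "first": "Chen", "last": "Wu", "country": None,
     "joined": "2024-01-30", "notes": "legacy record"},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
DEFAULT_COUNTRY = "UNKNOWN"  # assumed default for NULL values


def transform(rows):
    """Apply filtering, cleaning, joining, splitting, and sorting."""
    out = []
    for row in rows:
        # Cleaning: fill NULLs with a default, then map variants to one spelling.
        country = row["country"] if row["country"] is not None else DEFAULT_COUNTRY
        country = COUNTRY_MAP.get(country, country)
        # Splitting: one date attribute becomes three attributes.
        year, month, day = (int(part) for part in row["joined"].split("-"))
        out.append({
            "id": row["id"],
            # Joining: two name attributes become one.
            "full_name": f'{row["first"]} {row["last"]}',
            "country": country,
            "join_year": year,
            "join_month": month,
            "join_day": day,
            # Filtering: the "notes" attribute is not carried over.
        })
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])


clean = transform(staged)
```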
Loading

• In this step, the transformed data is finally loaded into the data warehouse.
• Sometimes the data warehouse is updated very frequently, and sometimes loading is done after longer but regular intervals.
• The rate and period of loading depend solely on the requirements and vary from system to system.
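A repeated load can be sketched as below, with SQLite standing in for the warehouse. The table customer_dim, its columns, and the batches are hypothetical, and overwriting rows by key with INSERT OR REPLACE is just one possible loading policy, not the only one.

```python
import sqlite3


def load(conn, rows):
    """Load one batch; a row whose id already exists replaces the old version."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR REPLACE INTO customer_dim (id, full_name, country) "
            "VALUES (?, ?, ?)",
            [(r["id"], r["full_name"], r["country"]) for r in rows],
        )


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_dim (id INTEGER PRIMARY KEY, full_name TEXT, country TEXT)"
)

# First scheduled load.
load(warehouse, [
    {"id": 1, "full_name": "Ben Lee", "country": "USA"},
    {"id": 2, "full_name": "Chen Wu", "country": "UNKNOWN"},
])

# A later load brings one corrected row and one new row.
load(warehouse, [
    {"id": 2, "full_name": "Chen Wu", "country": "China"},
    {"id": 3, "full_name": "Asha Rao", "country": "USA"},
])

row_count = warehouse.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
country_of_2 = warehouse.execute(
    "SELECT country FROM customer_dim WHERE id = 2"
).fetchone()[0]
```

How often `load` is called (every few minutes, nightly, weekly) is the scheduling decision that depends on the system's requirements.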
Pipelining
Data mining
• Data mining has been defined as the non-trivial extraction
of implicit, previously unknown, and potentially useful
information from large data sets or databases.
Knowledge Discovery
• Knowledge discovery is the process of finding novel,
interesting, and useful patterns in data.

• Data mining is a subset of knowledge discovery, and the two terms are often used interchangeably. Data mining is thus also known as Knowledge Discovery in Databases (KDD).
Information Retrieval
• Information retrieval is the automatic retrieval of all relevant documents while retrieving as few non-relevant documents as possible.
• Its primary goals are indexing text and searching for useful documents in a collection.
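Indexing and searching can be illustrated with a tiny inverted index. This is a minimal sketch under simple assumptions: the three documents are made up, tokenization is just lowercasing and splitting on non-alphanumeric characters, and a query matches a document only if every query term occurs in it.

```python
import re
from collections import defaultdict

# Hypothetical document collection: id -> text.
documents = {
    1: "Data mining extracts useful patterns from large data sets",
    2: "A data warehouse stores huge amounts of data for analysis",
    3: "Information retrieval searches a collection of documents",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: map every term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searching: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(documents)
hits_data = search(index, "data")
hits_data_mining = search(index, "data mining")
```

Requiring every term keeps non-relevant documents out of the result; a practical system would also rank the matches, which this sketch does not attempt.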
Triplet
• Data is an expression of feedback; a statement (rightly or
wrongly so) about an observation.
• Information is contextualized data.
• Knowledge is a phenomenon that implies our ability to
use the information for reasoning and decision making,
i.e., it is the basis of what you can, will, would, should or
might do with information.
Information Extraction
• Information Extraction has the goal of transforming a
collection of documents, usually with the help of an IR
system, into information that is more readily digested and
analyzed.
Knowledge Representation
• Knowledge representation is the presentation of knowledge to the user for visualization in terms of trees, tables, rules, graphs, charts, matrices, etc.
Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from
a set of low-level concepts to higher-level, more general
concepts.

• Depending on the type of the ordering relation, we distinguish several types of concept hierarchies.
Set-Grouping Hierarchy
• Concept hierarchies may also be defined by discretizing
or grouping values for a given dimension or attribute,
resulting in a set-grouping hierarchy.
Schema Hierarchy
• A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema
hierarchy.
Different User Viewpoints
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
• For instance, a user may prefer to organize price by defining ranges for inexpensive, moderately_priced, and expensive.
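The price example can be sketched as a function that maps a price to one of the three labels, with the range boundaries supplied by each user. The cut points below are invented for illustration; the source gives no actual values.

```python
from bisect import bisect_right

LABELS = ("inexpensive", "moderately_priced", "expensive")


def price_level(price, cut_points):
    """Map a price to a label.

    `cut_points` holds the lower bounds of the second and third ranges,
    so prices below cut_points[0] are "inexpensive".
    """
    return LABELS[bisect_right(cut_points, price)]


# Two hypothetical users with different views of the same attribute.
shopper_view = (10, 100)
wholesaler_view = (50, 500)

shopper_label = price_level(25, shopper_view)
wholesaler_label = price_level(25, wholesaler_view)
```

The same price of 25 lands in different concepts for the two users, which is exactly what having more than one hierarchy per attribute means.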
Schema hierarchy

• Relates concepts by generality.
• The ordering reflects the generality of the attribute values, e.g. street < city < state < country.
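The ordering street < city < state < country can be used to summarize records at any level of generality. The sketch below assumes a handful of made-up address records and counts them per value of the chosen attribute.

```python
from collections import Counter

# Attributes ordered from most specific to most general.
SCHEMA_HIERARCHY = ("street", "city", "state", "country")


def is_lower(a, b):
    """True if attribute a < attribute b in the schema hierarchy."""
    return SCHEMA_HIERARCHY.index(a) < SCHEMA_HIERARCHY.index(b)


def roll_up(records, level):
    """Count records per value of the attribute `level`."""
    return Counter(record[level] for record in records)


# Hypothetical address records.
addresses = [
    {"street": "Main St", "city": "Hartford", "state": "Connecticut", "country": "USA"},
    {"street": "Elm St", "city": "Hartford", "state": "Connecticut", "country": "USA"},
    {"street": "Oak Ave", "city": "New Britain", "state": "Connecticut", "country": "USA"},
    {"street": "MG Road", "city": "Pune", "state": "Maharashtra", "country": "India"},
]

by_city = roll_up(addresses, "city")
by_country = roll_up(addresses, "country")
```

Moving from `by_city` to `by_country` is a climb up the hierarchy: fewer, more general groups covering the same records.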
Set-grouping hierarchy
• The ordering relation is the subset relation (⊆). Applies to
set values.

• Example:
• {13, ..., 39} = young; {13, ..., 19} = teenage;
• {13, ..., 19} ⊆ {13, ..., 39} ⇒ teenage < young
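The age example maps directly onto Python sets, where the subset test gives the ordering. The helper names below are hypothetical, and the proper-subset operator is used so that a group is never ordered below itself.

```python
# The two groups from the example, as sets of ages.
groups = {
    "teenage": set(range(13, 20)),  # {13, ..., 19}
    "young": set(range(13, 40)),    # {13, ..., 39}
}


def is_below(lower, higher):
    """True if `lower` < `higher`, i.e. its set is a proper subset."""
    return groups[lower] < groups[higher]


def labels_for(age):
    """All groups containing `age`, from most specific to most general."""
    matching = [name for name, members in groups.items() if age in members]
    return sorted(matching, key=lambda name: len(groups[name]))


teen_below_young = is_below("teenage", "young")
labels_15 = labels_for(15)
labels_30 = labels_for(30)
```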
Operation-derived hierarchy
• Produced by applying an operation (encoding, decoding,
information extraction).

• For example, markovz@cs.ccsu.edu instantiates the hierarchy user-name < department < university < education.
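The decoding operation for the e-mail example can be sketched as a small parser. It assumes, purely for illustration, that the address has exactly three domain labels in the order department, university, top-level domain; addresses of any other shape are rejected.

```python
# Hierarchy levels from most specific to most general.
LEVELS = ("user-name", "department", "university", "education")


def decode_email(address):
    """Decode an address of the form user@department.university.tld."""
    user, _, domain = address.partition("@")
    parts = domain.split(".")
    if not user or len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected address shape: {address!r}")
    return dict(zip(LEVELS, [user, *parts]))


decoded = decode_email("markovz@cs.ccsu.edu")
```

Here the hierarchy is not stored anywhere; it is derived each time by applying the operation to a value.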
Rule-based hierarchy

• Uses rules to define the partial order.
• For example, "if antecedent then consequent" defines the order antecedent < consequent.
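One way to picture this is to store each rule as an (antecedent, consequent) pair and follow chains of rules to decide whether one concept lies below another. The concepts in the rules are invented for the example, and following chains assumes the order is meant to be transitive.

```python
# Hypothetical rules, each read as "if antecedent then consequent".
rules = [
    ("penguin", "bird"),
    ("sparrow", "bird"),
    ("bird", "animal"),
]


def is_below(a, b, rules):
    """True if a < b follows from the rules, directly or through a chain."""
    successors = {}
    for antecedent, consequent in rules:
        successors.setdefault(antecedent, set()).add(consequent)
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


direct = is_below("penguin", "bird", rules)
chained = is_below("penguin", "animal", rules)
unrelated = is_below("penguin", "sparrow", rules)
```

The order is only partial: penguin and sparrow both lie below bird, but neither lies below the other.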
