
Data Mining

Set-01: (Introduction)

Q 1. What is Data Mining? Describe the origins of data mining.


Data Mining: Data mining is the process of sorting through large data sets to identify patterns and
establish relationships to solve problems through data analysis.
Example:
 E-commerce
 Crime agencies
 Information retrieval
 Science and Engineering
 Medical data mining

Origins of Data Mining:


 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to:
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

Fig: Origins of data mining

Q 2. Write down some challenges in data mining.


 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data
Q 3. Write down some applications of data mining.
 E-commerce
 Crime agencies
 Information retrieval
 Science and Engineering
 Medical data mining
 Market Basket Analysis
 Manufacturing Engineering
 Fraud Detection
 Corporate Surveillance
 Research Analysis
 Bio Informatics

Q 4. Distinguish between information extraction & data mining procedure.


Ans: Difference between information extraction and data mining procedure

| Information Extraction | Data Mining |
| --- | --- |
| 1. Information extraction is the task of automatically extracting structured information from unstructured documents. | 1. Data mining is the ability to retrieve information from one or more data sources in order to combine it, cluster it, visualize it, and discover patterns in the data. |
| 2. It returns relevant results. | 2. It discovers patterns in the data. |
| 3. Obtains required information from the sources you already have. | 3. Discovers useful hidden patterns in the data you have. |
| 4. Uses: extracting data from large databases. | 4. Uses: fraud detection, research analysis, bioinformatics, manufacturing engineering, etc. |
Set-02: (Data)

Q 1. What is Data? Describe different types of attributes with appropriate example.


Data: In computing, data is information that has been translated into a form that is efficient for
movement or processing. Data is information converted into binary digital form.

The different types of attributes are as follows:


1. Nominal Data Attributes: The values of a nominal attribute are just different names, nominal
values provide only enough information to distinguish one object from another.

Operation: (=, ≠ )

Examples: Zip Code, employee ID numbers, eye color, gender etc.

2. Ordinal Data Attributes: The values of an ordinal attribute provide enough information to
order objects. All Values have a meaningful order.

Operation: (<, >)

Example: Hardness of minerals, grades, street numbers etc.

3. Interval Data Attributes: For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists.

Operation: (+,-)

Example: Calendar dates, temperature in Celsius or Fahrenheit

4. Ratio Data Attributes: For ratio attributes, both differences and ratios are meaningful.

Operation: (*, /)

Example: Temperature in Kelvin, Monetary quantities, counts, age, mass, length, electrical
current etc.

Q 2. Definition: Principal component analysis, Dimensionality reduction, Cosine similarity, Feature extraction and Feature creation.
1. Principal Component Analysis: Principal component analysis (PCA) is a statistical procedure
that uses an orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal components.

2. Dimensionality reduction: Dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

3. Cosine similarity: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, computed as the cosine of the angle between them (a short code sketch of PCA and cosine similarity follows this list).

4. Feature extraction: The creation of a new set of features from the original raw data is known as feature extraction. For example, consider a set of photographs, where each photograph is to be classified according to whether or not it contains a human face.

5. Feature creation: It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a data set much more effectively.
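
To make these definitions concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix, and of cosine similarity; the toy matrix X and the choice of two components are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components."""
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    cov = np.cov(X_centered, rowvar=False)   # covariance of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components           # linearly uncorrelated scores

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.random.rand(100, 5)                   # toy data: 100 objects, 5 attributes
print(pca(X).shape)                          # (100, 2)
print(cosine_similarity(np.array([1, 0]), np.array([1, 1])))  # ~0.707
```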
Q 3. Describe different types of Data with appropriate example.
Different types of data are given below:
1. Numeric Data: Numeric data consists of the digits 0 to 9. It may also contain a decimal point “.”, a plus sign “+”, or a minus sign “-”. Numeric data may be either positive or negative. The use of “+” with positive numbers is optional.

Examples: 10, +5, -12, 13.7, -32.5 etc.

2. Text Data: Text data consists of words, sentences and paragraphs. Text processing refers to the
ability to manipulate words, lines and pages. Text is normally stored as ASCII code without
formatting.
Examples: Some examples of text data are Riaz Ameen, Pakistan, Islam etc.

3. Audio Data: Audio data is a representation of sound. It includes music, speech, or any other type of sound.

4. Video Data: Video is a set of full-motion images played at a high speed. Video is used to
display actions and movements.

5. Image Data: This type of data includes charts, graphs, pictures, and drawings. This form of data is more comprehensive. It can be transmitted as a set of bits, packed as bytes.
Set-03: (Data Visualization)
Q 1. What is Data Visualization?
Data Visualization: Data visualization is the display of information in a graphic or tabular format.
Successful visualization requires that the data be converted into a visual format so that the
characteristics of the data and the relationships among data items or attributes can be analyzed or
reported.

Q 2. Short note with figure: Histogram, Boxplot & Scatterplot.


 Histogram: A histogram is a graphical display of data using bars of different heights. It groups the numbers in the data set into ranges and represents an estimate of the probability distribution of a continuous variable. A histogram usually looks like this:

Fig: Histogram

 Boxplot: A boxplot is a graphical representation of groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes, indicating variability outside the upper and lower quartiles.

Fig: Boxplot

 Scatterplot: A scatterplot is a type of graph that plots values from two variables in a Cartesian plane. It is usually used to find the relationship between two variables.

Fig: Scatterplot
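
A minimal matplotlib sketch that produces all three plots side by side; the random sample data and figure layout are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=500)     # toy continuous variable
y = 2 * x + rng.normal(scale=0.5, size=500)  # variable correlated with x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)                     # histogram: counts per value range
axes[0].set_title("Histogram")
axes[1].boxplot(x)                           # boxplot: quartiles and outliers
axes[1].set_title("Boxplot")
axes[2].scatter(x, y, s=5)                   # scatterplot: relationship of x and y
axes[2].set_title("Scatterplot")
plt.tight_layout()
plt.show()
```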
Q 3. What is OLAP? Describe different types of OLAP.
OLAP: OLAP stands for Online Analytical Processing. OLAP is based on the multidimensional data model. It allows managers and analysts to gain insight into information through fast, consistent, and interactive access to it.

The main types of OLAP are as follows:


1. Relational OLAP: ROLAP servers are placed between a relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following −
 Implementation of aggregation navigation logic.
 Optimization for each DBMS back end.
 Additional tools and services.

2. Multidimensional OLAP: MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse.

3. Hybrid OLAP: Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed data to be stored.

4. Specialized SQL Servers: Specialized SQL servers provide advanced query language and
query processing support for SQL queries over star and snowflake schemas in a read-only
environment.

Q 4. What is Data Cube?


Data Cube: A multidimensional representation of the data, together with all possible totals, is known as a data cube. Despite the name, the size of each dimension (the number of attribute values) does not need to be equal.

Fig: Data Cube
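
As a rough illustration, a pandas pivot table with margins=True computes a small two-dimensional slice of a cube with all row and column totals; the cities, products, and sales figures here are invented for the example.

```python
import pandas as pd

# Invented transactions: two dimensions (city, product) and a measure (sales)
df = pd.DataFrame({
    "city":    ["Dhaka", "Dhaka", "Khulna", "Khulna"],
    "product": ["Pen", "Book", "Pen", "Book"],
    "sales":   [10, 20, 5, 15],
})

# A 2-D slice of a data cube: every (city, product) total plus marginal totals
cube = df.pivot_table(values="sales", index="city", columns="product",
                      aggfunc="sum", margins=True, margins_name="Total")
print(cube)
```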


Set-04: Classification

Difference among supervised, semi-supervised, reinforcement & unsupervised learning techniques:

| Supervised Learning | Unsupervised Learning | Semi-Supervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| 1. Linear regression | 1. Clustering | 1. Graph-based methods | 1. Markov decision processes |
| 2. Logistic regression | 2. K-means | 2. Generative models | 2. Monte Carlo methods |
| 3. K-nearest neighbors | 3. Dimensionality reduction | 3. Low-density separation | 3. Temporal difference learning |
| 4. Decision trees | 4. Principal component analysis | 4. Heuristic approaches | 4. Neuro-dynamic programming |
| 5. Input/output pairs | 5. Input only | 5. Input only | 5. Input & critic |

Decision tree: A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label.

Entropy: Entropy can be defined as a measure of the average information content per source symbol. Claude Shannon, the “father of information theory”, provided a formula for it as −

H = −Σᵢ pᵢ log_b pᵢ
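
A small Python sketch of Shannon's formula, using base 2 so entropy is measured in bits; the class distributions are illustrative assumptions.

```python
import math

def entropy(probabilities, base=2):
    """H = -sum_i p_i * log_b(p_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))    # ~0.469 bits: a skewed split carries less information
print(entropy([1.0]))         # 0.0 bits: a pure class has no uncertainty
```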

Misclassification Error: Misclassification occurs when a record is assigned to the wrong class, for example because an attribute unsuitable for classification was selected. When all classes, groups, or categories of a variable have the same error rate or probability of being misclassified, the misclassification is said to be non-differential.

Overfitting: Overfitting is a modeling error which occurs when a function is too closely fit to a
limited set of data points. Overfitting the model generally takes the form of making an overly
complex model.

Underfitting: Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.

Bias: In modeling, bias is the error introduced by the simplifying assumptions a model makes; a high-bias model systematically misses the true relationship regardless of the training data.

Variance: Variance is a measurement of the spread between numbers in a data set. The variance
measures how far each number in the set is from the mean.

Advantages of nearest neighbor classifiers:


a) Simple to implement
b) Flexible to feature / distance choices
c) Naturally handles multi-class cases
d) Can do well in practice with enough representative data
Limitations of nearest neighbor classifiers:
a) Require well-labeled training data
b) Can be sensitive to the value of k chosen
c) All attributes are used in classification, even ones that may be irrelevant
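
A minimal scikit-learn sketch of a nearest neighbor classifier; the Iris dataset and k=3 are illustrative choices, and the sensitivity to k mentioned in limitation (b) can be explored by varying n_neighbors.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# k is the number of neighbors consulted; the choice of k matters (limitation b)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                      # "training" just stores the data
print("accuracy:", knn.score(X_test, y_test))  # naturally handles 3 classes
```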
Set-05: Regression

Difference between Classification & Regression techniques:


| Subject | Classification Technique | Regression Technique |
| --- | --- | --- |
| 1. Basic | The discovery of a model or function that maps objects into predefined classes. | A devised model that maps objects into continuous values. |
| 2. Involves prediction of | Discrete values | Continuous values |
| 3. Example algorithms | Decision tree, logistic regression, etc. | Regression tree (random forest), linear regression, etc. |
| 4. Nature of the predicted data | Unordered | Ordered |
| 5. Method of calculation | Measuring accuracy | Measuring root mean square error |

Linear Regression: Linear regression is a linear approach for modelling the relationship between a
scalar dependent variable y and one or more explanatory variables denoted X.
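
A minimal least-squares sketch with NumPy; the data is synthetic, and the true slope and intercept are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)                      # explanatory variable
y = 3.0 * X + 2.0 + rng.normal(scale=1.0, size=50)   # y ≈ 3x + 2 plus noise

# Fit y = w*x + b by ordinary least squares
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"slope ≈ {w:.2f}, intercept ≈ {b:.2f}")       # should be close to 3 and 2
```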

Logistic Regression: Logistic regression is a regression model in which the response variable has categorical values such as True/False or 0/1. It measures the probability of a binary response based on a mathematical equation relating it to the predictor variables.

Polynomial Regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x.

Optimization Cost Function: Cost functions are a way to help the data modeler solve a supervised learning problem, either classification or regression. The fit of the response surface to the available data is expressed as a cost, and in an optimization setting the goal is to minimize that cost.
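
To show what minimizing a cost function looks like in practice, here is a sketch of gradient descent on a mean-squared-error cost for a simple linear model; the synthetic data, learning rate, and iteration count are arbitrary assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=50)
y = 3.0 * X + 2.0 + rng.normal(scale=1.0, size=50)

w, b, lr = 0.0, 0.0, 0.01                    # initial parameters, learning rate
for _ in range(2000):
    error = (w * X + b) - y
    cost = np.mean(error ** 2)               # MSE cost to be minimized
    w -= lr * 2 * np.mean(error * X)         # gradient of cost w.r.t. w
    b -= lr * 2 * np.mean(error)             # gradient of cost w.r.t. b
print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, final cost ≈ {cost:.3f}")
```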

Set-06: Model Evaluation + ANN

Precision: Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as –
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as –
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-Measure: The F-measure (also known as F-score) is a commonly used trade-off measure: an information retrieval system often needs to trade recall for precision or vice versa. The F-score is defined as the harmonic mean of recall and precision −
F-Measure = 2 × (precision × recall) / (precision + recall)

Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known.
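
A small sketch computing all four of these quantities for a binary problem with scikit-learn; the label vectors are invented for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth relevance (invented)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # classifier output (invented)

print("precision:", precision_score(y_true, y_pred))  # relevant ∩ retrieved / retrieved
print("recall:   ", recall_score(y_true, y_pred))     # relevant ∩ retrieved / relevant
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
print(confusion_matrix(y_true, y_pred))               # rows: true, cols: predicted
```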

Root Mean Square Error (RMSE): The root-mean-square error (RMSE) is a frequently used
measure of the differences between values predicted by a model or an estimator and the values
actually observed.

Mean Absolute Error (MAE): Mean absolute error (MAE) is a measure of difference between
two continuous variables. Assume X and Y are variables of paired observations that express the
same phenomenon. Examples of Y versus X include comparisons of predicted versus observed,
subsequent time versus initial time, and one technique of measurement versus an alternative
technique of measurement.
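
A minimal NumPy sketch of both error measures on invented predicted/observed pairs.

```python
import numpy as np

observed  = np.array([3.0, 5.0, 2.5, 7.0])   # invented ground truth
predicted = np.array([2.5, 5.0, 4.0, 8.0])   # invented model output

rmse = np.sqrt(np.mean((predicted - observed) ** 2))  # penalizes large errors more
mae  = np.mean(np.abs(predicted - observed))          # average absolute difference
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```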

Artificial Neural Network: An Artificial Neural Network (ANN) is an efficient computing system whose central theme is borrowed from the analogy of biological neural networks. ANNs are also known as “artificial neural systems”.

Why we use non-linear activation function on ANN: Neural networks are used to implement
complex functions, and non-linear activation functions enable them to approximate arbitrarily complex
functions. Without the non-linearity introduced by the activation function, multiple layers of a neural
network are equivalent to a single layer neural network.

Let’s see a simple example to understand why, without non-linearity, it is impossible to approximate even simple functions like the XOR and XNOR gates. In the figure below, we graphically show an XOR gate. There are two classes in our dataset, represented by a cross and a circle. When the two input features are the same, the class label is a red cross; otherwise, it is a blue circle. The two red crosses have an output of 0 for the input values (0,0) and (1,1), and the two blue circles have an output of 1 for the input values (0,1) and (1,0).

Fig: Graphical Representation of XOR gate
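
A tiny NumPy sketch of the claim above: stacking two linear layers with no activation collapses to a single linear map, so no stack of purely linear layers can separate XOR. The weight shapes and random values are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 4))  # two "layers"

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T  # the four XOR inputs

# Without a non-linearity, layer composition is just one matrix product:
two_layers = W2 @ (W1 @ X)
one_layer  = (W2 @ W1) @ X
print(np.allclose(two_layers, one_layer))   # True: the extra layer added nothing

# With a non-linearity (e.g. ReLU) between the layers, the equality breaks,
# which is what lets deeper networks model functions like XOR:
relu = lambda z: np.maximum(z, 0)
print(np.allclose(W2 @ relu(W1 @ X), one_layer))  # False in general
```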


Difference between Forward & Back Propagation methods:

| Forward Propagation | Back Propagation |
| --- | --- |
| 1. A feedforward neural network is an artificial neural network in which connections between the units do not form a cycle. | 1. Backpropagation is a method used in artificial neural networks to calculate a gradient. |
| 2. Computes functional signals | 2. Computes error signals |
Set-07: (Clustering & Association Rule Mining)
Q 1. Difference between Clustering & Classification technique.
Ans: Difference between Clustering & Classification technique:
| Classification Technique | Clustering Technique |
| --- | --- |
| 1. A supervised learning technique | 1. An unsupervised learning technique |
| 2. Finite set of classes | 2. Finite set of clusters |
| 3. Goal of assigning new input to a class | 3. Goal of finding similarities within a given dataset |
| 4. Infinite set of input data | 4. Finite set of data |

Q 2. What is association rule mining? Define frequent itemset, support & confidence.
Association Rule Mining: Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from data sets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories.

Frequent Itemset: An itemset whose support is greater than or equal to a user-specified minimum support threshold.

Support: The support of an itemset X is the fraction of transactions in the database that contain X.

Confidence: The confidence of a rule X → Y is the fraction of transactions containing X that also contain Y, i.e., confidence(X → Y) = support(X ∪ Y) / support(X).
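
A minimal sketch computing support and confidence over an invented transaction list; the items and the rule {bread} → {milk} are assumptions of the example.

```python
# Invented market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))                    # 3/4 = 0.75
print(confidence({"bread"}, {"milk"}))       # 0.5 / 0.75 ≈ 0.667
```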

Q 3. Difference between Apriori & Eclat algorithm in association rule mining.


Ans: Difference between Apriori & Eclat algorithm:
| Apriori Algorithm | Eclat Algorithm |
| --- | --- |
| 1. The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. | 1. The Eclat algorithm is used to perform itemset mining. |
| 2. Apriori is suited to large datasets. | 2. Eclat is suited to small and medium datasets. |
| 3. Apriori scans the original dataset. | 3. Eclat scans the currently generated dataset. |
| 4. Apriori is slower than Eclat. | 4. Eclat is faster than Apriori. |
| 5. Apriori uses the database in its usual (horizontal) layout. | 5. Eclat uses the database in a vertical layout. |
