Data Mining: Set-01: (Introduction)
3. Obtaining required information from the sources you already have. | 3. Process of discovering useful hidden patterns from the data you have.
Operation: (=, ≠)
2. Ordinal Data Attributes: The values of an ordinal attribute provide enough information to
order objects. All values have a meaningful order.
3. Interval Data Attributes: For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists.
Operation: (+, -)
4. Ratio Data Attributes: For ratio attributes, both differences and ratios are meaningful.
Operation: (*, /)
Example: Temperature in Kelvin, Monetary quantities, counts, age, mass, length, electrical
current etc.
3. Cosine similarity: Cosine similarity is a measure of similarity between two non-zero vectors
of an inner product space that measures the cosine of the angle between them.
4. Feature extraction: The creation of a new set of features from the original raw data is known
as feature extraction. Consider a set of photographs, where each photograph is to be classified
according to whether or not it contains a human face.
5. Feature creation: It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a data set much more effectively.
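The cosine similarity defined in item 3 above can be sketched in a few lines. This is a minimal illustration, assuming vectors are given as plain Python lists of equal length; the function name and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The definition only covers non-zero vectors.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Hypothetical sample vectors.
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])       # right angle
```

Vectors that point in the same direction score close to 1, and vectors at a right angle score 0, regardless of their lengths.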
Q 3. Describe the different types of data with appropriate examples.
Different types of data are given below:
1. Numeric Data: Numeric data consists of the digits 0 to 9. It may also contain a decimal
point “.”, a plus sign “+”, or a minus sign “-”. Numeric data may be either positive or
negative. The use of “+” with positive numbers is optional.
2. Text Data: Text data consists of words, sentences and paragraphs. Text processing refers to the
ability to manipulate words, lines and pages. Text is normally stored as ASCII code without
formatting.
Examples: Some examples of text data are Riaz Ameen, Pakistan, Islam etc.
3. Audio Data: Audio is a representation of sound. Audio data includes music, speech, or any
other type of sound.
4. Video Data: Video is a set of full-motion images played at a high speed. Video is used to
display actions and movements.
5. Image Data: This type of data includes charts, graphs, pictures, and drawings. This form of
data is more comprehensive. It can be transmitted as a set of bits, which are packed into bytes.
Set-03: (Data Visualization)
Q 1. What is Data Visualization?
Data Visualization: Data visualization is the display of information in a graphic or tabular format.
Successful visualization requires that the data be converted into a visual format so that the
characteristics of the data and the relationships among data items or attributes can be analyzed or
reported.
Fig: Histogram
Boxplot: A boxplot is a graphical representation of groups of numerical data through their quartiles.
Boxplots may also have lines (whiskers) extending vertically from the boxes, indicating variability
outside the upper and lower quartiles.
Fig: Boxplot
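The quartiles that a boxplot draws can be computed directly. The sketch below assumes Python 3.8 or later and uses the standard library's `statistics.quantiles`; the sample values are hypothetical, and other tools may use slightly different quartile conventions.

```python
import statistics

# Hypothetical sample, used only for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 splits the data into four groups and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # height of the box (interquartile range)

summary = {
    "min": min(data),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(data),
}
```

The box spans `q1` to `q3` with a line at `median`; the whiskers extend from the box toward the smallest and largest values.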
Scatterplot: A scatterplot is a type of graph which uses values from two variables plotted in a
Cartesian plane. It is usually used to find out the relationship between two variables.
Fig: Scatterplot
Q 3. What is OLAP? Describe different types of OLAP.
OLAP: OLAP stands for Online Analytical Processing. It is based on the multidimensional data
model and allows managers and analysts to gain insight into information through fast, consistent,
and interactive access to it.
3. Hybrid OLAP: Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. It offers the
higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers can store
large volumes of detailed information.
4. Specialized SQL Servers: Specialized SQL servers provide advanced query language and
query processing support for SQL queries over star and snowflake schemas in a read-only
environment.
Decision tree: A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label.
Entropy: Entropy can be defined as a measure of the average information content per source
symbol. Claude Shannon, the “father of information theory”, provided a formula for it:
$$H = -\sum_{i} p_i \log_b p_i$$
where p_i is the probability of the i-th symbol and b is the base of the logarithm.
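As a rough sketch of Shannon's entropy formula, the function below takes a list of probabilities that sum to 1. The function name, the default base of 2 (which gives entropy in bits), and the sample distributions are assumptions made for illustration.

```python
import math

def entropy(probabilities, base=2):
    """Shannon entropy of a discrete distribution, in units set by `base`."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

fair_coin = entropy([0.5, 0.5])      # two equally likely outcomes
biased_coin = entropy([0.9, 0.1])    # one outcome dominates
certain = entropy([1.0])             # no uncertainty at all
```

A fair coin gives the highest entropy of the three cases, and a certain outcome gives zero. In decision tree induction, a node whose records all share one class label has zero entropy.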
Misclassification Error: Misclassification error is the fraction of records that are assigned to the
wrong class. Misclassification may occur when an attribute that is not suitable for classification is
selected. When all classes, groups, or categories of a variable have the same error rate or probability
of being misclassified, the misclassification is said to be non-differential.
Overfitting: Overfitting is a modeling error which occurs when a function is too closely fit to a
limited set of data points. Overfitting the model generally takes the form of making an overly
complex model.
Underfitting: Underfitting occurs when a statistical model or machine learning algorithm cannot
capture the underlying trend of the data. Specifically, underfitting occurs if the model or algorithm
shows low variance but high bias.
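Bias: In everyday usage, bias is prejudice in favor of or against one thing, person, or group
compared with another, usually in a way considered to be unfair. In modeling, bias is the error that
comes from overly simple assumptions in the learning algorithm, which is why high bias is
associated with underfitting.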
Variance: Variance is a measurement of the spread between numbers in a data set. The variance
measures how far each number in the set is from the mean.
Basis for comparison | Classification | Regression
1. Basic | The discovery of a model or function where objects are mapped into predefined classes. | A devised model in which objects are mapped into values.
2. Involves prediction of | Discrete values | Continuous values
3. Algorithms | Decision tree, logistic regression, etc. | Regression tree (Random forest), linear regression, etc.
4. Nature of the predicted data | Unordered | Ordered
5. Method of calculation | Measuring accuracy | Measurement of root mean square error
Linear Regression: Linear regression is a linear approach for modelling the relationship between a
scalar dependent variable y and one or more explanatory variables denoted X.
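For the simplest case of a single explanatory variable, the fit can be sketched with the ordinary least squares formulas. The function name and the sample points are hypothetical; the points lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted_at_5 = slope * 5 + intercept
```

With noisy data the fitted line would no longer pass through every point; it would be the line that minimizes the sum of squared vertical distances.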
Logistic Regression: Logistic regression is a regression model in which the response variable
has categorical values such as True/False or 0/1. It estimates the probability of a binary
response from a mathematical equation that relates the response variable to the predictor
variables.
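The equation relating the predictors to the probability is commonly the logistic (sigmoid) function applied to a weighted sum. The sketch below assumes that form; the weights, intercept, and feature values are hypothetical and not fitted to any data.

```python
import math

def predict_probability(features, weights, intercept):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps any z into (0, 1)

# Hypothetical coefficients, chosen only for illustration.
weights = [1.5, -0.5]
intercept = 0.0

p_neutral = predict_probability([0.0, 0.0], weights, intercept)
p_high = predict_probability([4.0, 1.0], weights, intercept)
label = 1 if p_high >= 0.5 else 0  # threshold the probability to get a class
```

Thresholding the probability at 0.5 is a common default, not a requirement; a different cut-off trades precision against recall.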
Optimization Cost Function: Cost functions are a way to help the data modeler solve a
supervised learning problem, either classification or regression. A cost function assigns a cost to
how poorly the fitted response surface matches the available data. In an optimization setting, the
goal is to minimize this cost.
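One way to picture minimizing a cost is gradient descent on the mean squared error of a one-parameter model y = w * x. This is only a sketch: the choice of cost, the learning rate, the iteration count, and the data are all assumptions made for illustration.

```python
def mse_cost(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of mse_cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0                 # starting guess
learning_rate = 0.05    # assumed step size
initial_cost = mse_cost(w, xs, ys)
for _ in range(200):
    # Step against the gradient to reduce the cost.
    w -= learning_rate * mse_gradient(w, xs, ys)
final_cost = mse_cost(w, xs, ys)
```

After the loop, `w` is close to 2 and `final_cost` is close to 0, which is the minimum of this cost for the sample data.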
Set-06: Model Evaluation + ANN
Precision: Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as:
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-Measure: F-Measure (also known as F-score) is a commonly used trade-off between the two. An
information retrieval system often needs to trade off recall for precision or vice versa. F-score is
defined as the harmonic mean of recall and precision:
F-Measure = (recall x precision) / ((recall + precision) / 2) = 2 x recall x precision / (recall + precision)
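The three measures can be computed directly from the sets of relevant and retrieved documents. The sketch below uses Python sets; the document identifiers are hypothetical, and the F-measure is computed as the harmonic mean of precision and recall.

```python
def retrieval_scores(relevant, retrieved):
    """Return (precision, recall, f_measure) for two sets of document ids."""
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall, f_measure = retrieval_scores(relevant, retrieved)
```

Here precision is 2/3, recall is 2/4, and the F-measure is 4/7, which sits closer to the lower of the two values, as a harmonic mean does.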
Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known.
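A confusion matrix for a binary classifier can be tallied by counting (actual, predicted) pairs. The labels and predictions below are hypothetical test data, with "spam" treated as the positive class.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical true labels and model predictions.
actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

counts = confusion_counts(actual, predicted)
tp = counts[("spam", "spam")]  # true positives
fn = counts[("spam", "ham")]   # false negatives
fp = counts[("ham", "spam")]   # false positives
tn = counts[("ham", "ham")]    # true negatives
accuracy = (tp + tn) / len(actual)
```

The four counts fill the 2 x 2 table, and measures such as accuracy, precision, and recall can all be derived from them.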
Root Mean Square Error (RMSE): The root-mean-square error (RMSE) is a frequently used
measure of the differences between values predicted by a model or an estimator and the values
actually observed.
Mean Absolute Error (MAE): Mean absolute error (MAE) is a measure of difference between
two continuous variables. Assume X and Y are variables of paired observations that express the
same phenomenon. Examples of Y versus X include comparisons of predicted versus observed,
subsequent time versus initial time, and one technique of measurement versus an alternative
technique of measurement.
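Both error measures can be sketched for paired predicted and observed values as follows. The function names and the sample values are hypothetical.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between paired values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, observed):
    """Mean absolute error between paired values."""
    absolute = [abs(p - o) for p, o in zip(predicted, observed)]
    return sum(absolute) / len(absolute)

# Hypothetical observations and model predictions.
observed = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

rmse_value = rmse(predicted, observed)
mae_value = mae(predicted, observed)
```

Because RMSE squares each error before averaging, a single large error raises it more than it raises MAE, so RMSE is never smaller than MAE on the same data.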
Artificial Neural Network: An Artificial Neural Network (ANN) is an efficient computing system whose
central theme is borrowed from the analogy of biological neural networks. ANNs are also called
“artificial neural systems”.
Why we use non-linear activation functions in an ANN: Neural networks are used to implement
complex functions, and non-linear activation functions enable them to approximate arbitrarily complex
functions. Without the non-linearity introduced by the activation function, multiple layers of a neural
network are equivalent to a single layer neural network.
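The collapse of stacked linear layers into a single layer can be checked with a small example. The sketch below uses plain Python lists as matrices; the weights and input are hypothetical, biases are omitted for brevity, and ReLU stands in for any non-linear activation.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_layer(weights, vector):
    """Linear layer without bias: weights times vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    """Non-linear activation: negative values become zero."""
    return [max(0, v) for v in vector]

# Hypothetical weights and input.
w1 = [[1, 2], [0, 1]]
w2 = [[2, 0], [1, 3]]
x = [1, -2]

two_linear_layers = apply_layer(w2, apply_layer(w1, x))
single_layer = apply_layer(matmul(w2, w1), x)    # one merged weight matrix
with_relu = apply_layer(w2, relu(apply_layer(w1, x)))
```

`two_linear_layers` and `single_layer` are identical, so the second layer adds no expressive power. Inserting `relu` between the layers gives a different result that no single merged matrix reproduces for all inputs.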
Q 2. What is association rule mining? Write down some frequent item set, support & confidence.
Association Rule Mining: Association rule mining is a procedure which is meant to find frequent
patterns, correlations, associations, or causal structures from data sets found in various kinds of
databases such as relational databases, transactional databases, and other forms of data repositories.
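Support and confidence can be illustrated on a tiny transaction database. In this sketch, support is the fraction of transactions that contain an item set, and confidence of a rule A -> B is support(A and B together) divided by support(A). The transactions and the minimum support threshold of 0.6 are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

min_support = 0.6  # assumed threshold
items = sorted(set().union(*transactions))
frequent_pairs = [set(pair) for pair in combinations(items, 2)
                  if support(pair, transactions) >= min_support]

support_bread_milk = support({"bread", "milk"}, transactions)
confidence_bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
```

With this data, {bread, milk} appears in 3 of 5 transactions, so it is the only frequent pair at the assumed threshold, and the rule bread -> milk has confidence 3/4 because bread appears in 4 transactions.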
Apriori | Eclat
1. The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules. | 1. The Eclat algorithm is used to perform item set mining.
2. Apriori is used for large datasets. | 2. Eclat is suited to small and medium datasets.
3. Apriori scans the original dataset. | 3. Eclat scans the currently generated dataset.
4. Apriori is slower than Eclat. | 4. Eclat is faster than Apriori.
5. Apriori uses the database in its usual (horizontal) layout. | 5. Eclat uses the database in a vertical layout.