
Question 1: What is data mining?

In your answer, address the following:


Data mining refers to the process of extracting or mining interesting knowledge or
patterns from large amounts of data.
(a) Is it another hype?
Data mining is not another hype. Instead, the need for data mining has arisen due
to the wide availability of huge amounts of data and the need for turning such data
into useful information and knowledge. Thus, data mining can be viewed as
the result of the natural evolution of information technology.
(b) Is it a simple transformation of technology developed from databases,
statistics, and machine learning?
No. Data mining is more than a simple transformation of technology developed from
databases, statistics, and machine learning. Instead, data mining involves an
integration, rather than a simple transformation, of techniques from multiple
disciplines such as database technology, statistics, machine learning,
high-performance computing, pattern recognition, neural networks, data
visualization, information retrieval, image and signal processing, and spatial
data analysis.
(c) Explain how the evolution of database technology led to data mining.
Database technology began with the development of data collection and database
creation mechanisms, which led to effective mechanisms for data management,
including data storage and retrieval, and query and transaction processing. The
large number of database systems offering query and transaction processing
eventually and naturally led to the need for data analysis and understanding.
Hence, data mining began its development out of this necessity.
(d) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.
The steps involved in data mining, when viewed as a process of knowledge
discovery, are as follows (a short code sketch of the pipeline appears after
the list):
• Data cleaning, a process that removes noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the
database
• Data transformation, where data are transformed or consolidated into forms
appropriate for mining
• Data mining, an essential process where intelligent and efficient methods
are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns
representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
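
A minimal sketch of how these steps might look in code, assuming a hypothetical
pandas workflow; the files sales.csv and regions.csv and the columns region_id
and amount are invented purely for illustration:

```python
# A toy end-to-end run of the knowledge-discovery steps listed above.
import pandas as pd

# Data cleaning: drop rows with missing values (one simple strategy).
sales = pd.read_csv("sales.csv").dropna()

# Data integration: combine a second (hypothetical) source on a shared key.
regions = pd.read_csv("regions.csv")
data = sales.merge(regions, on="region_id")

# Data selection: keep only the columns relevant to the analysis task.
data = data[["region_id", "amount"]]

# Data transformation: consolidate into a form appropriate for mining.
summary = data.groupby("region_id")["amount"].mean()

# Data mining + pattern evaluation: flag regions whose mean amount is
# unusually high (a toy "interestingness" measure).
interesting = summary[summary > summary.mean() + 2 * summary.std()]

# Knowledge presentation: report the mined patterns to the user.
print(interesting)
```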
Question 2: How is a database different from a data warehouse?
Difference between a data warehouse and a database: A data warehouse is a
repository of information collected from multiple sources over a history of
time, stored under a unified schema, and used for data analysis and decision
support; whereas a database is a collection of interrelated data that
represents the current status of the stored data. There could be multiple
heterogeneous databases, where the schema of one database may not agree with
the schema of another. A database system supports ad hoc queries and on-line
transaction processing.
- Similarities between a data warehouse and a database: Both are repositories of
information, storing huge amounts of persistent data.

Question 3: Briefly explain the steps of making a decision tree.


Step 1: Determine the Root of the Tree.
Step 2: Calculate Entropy for The Classes.
Step 3: Calculate Entropy After Split for Each Attribute.
Step 4: Calculate Information Gain for each split.
Step 5: Perform the Split.
Step 6: Perform Further Splits.
Step 7: Complete the Decision Tree.
The core algorithm for building decision trees is called ID3. ID3 uses entropy
and information gain to construct a decision tree.
Entropy
Entropy controls how a decision tree decides to split the data; it affects
how a decision tree draws its boundaries.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on
an attribute.
Constructing a decision tree is all about finding the attribute that returns
the highest information gain (a short sketch of these calculations follows).
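
A minimal Python sketch of these two calculations; the row/label encoding and
the toy two-row dataset are assumptions made purely for illustration:

```python
# Entropy and information gain as ID3 uses them, on an illustrative dataset.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy(labels)
    # Partition the class labels by the value of the chosen attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    after = sum(len(part) / len(labels) * entropy(part)
                for part in partitions.values())
    return before - after

# Toy example: one attribute that separates the two classes perfectly.
rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0: a perfect split
```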
Step 1: Calculate the entropy of the target/class variable.
Step 2: Split the dataset on each attribute and calculate the entropy for each
branch. Add the branch entropies proportionally to get the total entropy for
the split, then subtract this from the entropy before the split. The result is
the information gain, or decrease in entropy.
Step 3: Choose the attribute with the largest information gain as the decision
node, divide the dataset by its branches, and repeat the same process on every
branch.
Step 4a: A branch with entropy of 0 is a leaf node.
Step 4b: A branch with entropy greater than 0 needs further splitting.
Step 5: Run the ID3 algorithm recursively on the non-leaf branches until all
data is classified (see the sketch below).
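
A compact sketch of this recursive loop, reusing the entropy() and
information_gain() helpers from the previous snippet; the attribute-index
encoding and the majority-class fallback are implementation assumptions, not
part of the steps above:

```python
def id3(rows, labels, attributes):
    # Step 4a: a pure branch (entropy 0) becomes a leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: fall back to the majority class.
    if not attributes:
        return max(set(labels), key=labels.count)
    # Step 3: choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        # Steps 4b and 5: recurse on branches that still need splitting.
        tree[value] = id3(sub_rows, sub_labels,
                          [a for a in attributes if a != best])
    return {best: tree}
```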
Question 4: Explain the importance of evaluation criteria for classification
methods.
Performance evaluation of a classification model is important for understanding
the quality of the model, for refining the model, and for choosing the adequate
model. Evaluation metrics help us understand how a classifier performs; many
are available, some with numerous tunable parameters. Evaluation criteria are
also critical for assessing reports by others: if a study presents a single
metric, one might question the performance of the classifier when evaluated
using other metrics. Classification metrics are calculated from true positives
(TPs), false positives (FPs), false negatives (FNs) and true negatives (TNs),
all of which are tabulated in the so-called confusion matrix. The relevance of
each of these four quantities depends on the purpose of the classifier and
motivates the choice of metric. For a medical test that determines whether
patients receive a treatment that is cheap, safe and effective, FPs would not
be as important as FNs, which would represent patients who might suffer without
adequate treatment. In contrast, if the treatment were an experimental drug,
then a very conservative test with few FPs would be required to avoid testing
the drug on unaffected individuals.
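
As a sketch of how these four counts turn into common metrics; the counts
below are made up purely for illustration:

```python
# Compute standard classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45  # illustrative numbers only

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # how many flagged positives are real
recall    = tp / (tp + fn)   # how many real positives were found
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```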
Question 5: How can we improve the accuracy of classification?
Some general methods to enhance classification accuracy are:
1 - Cross validation: Separate your training dataset into groups, always hold
out one group for prediction, and change the groups in each execution. Then you
will know which data is better for training a more accurate model (see the
sketch after this list).
2 - Cross dataset: The same as cross validation, but using different datasets.
3 - Tuning your model: Basically, change the parameters you use to train your
classification model.
4 - Use the normalization process: Discover which techniques will give you more
concise data to use in training.
5 - Understand the problem you are treating better, and try to implement other
methods to solve the same problem. There is always more than one way to solve
the same problem; you may not be using the best approach.
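
One common way to sketch methods 1 and 3 in Python is with scikit-learn; the
synthetic dataset and the parameter grid below are illustrative assumptions,
not prescribed by the text above:

```python
# Cross validation and simple parameter tuning with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Method 1 - cross validation: train on k-1 folds, score on the held-out fold.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print("5-fold accuracy:", scores.mean())

# Method 3 - tuning: search over the parameters used to train the model.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
print("best n_neighbors:", search.best_params_)
```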
Question 6: Solve by using k-nearest neighbor: classify the point P1 = 3,
P2 = 7, where k = 3.

P1  P2  Class
7   7   F
7   4   F
3   4   T
1   4   T
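
Worked solution: the Euclidean distances from the query point (3, 7) to the
training points are 3.00 for (3, 4), 3.61 for (1, 4), 4.00 for (7, 7) and 5.00
for (7, 4). The three nearest neighbors therefore have classes T, T and F, and
the majority vote classifies the point as T. A minimal sketch of this
calculation (Euclidean distance is assumed, since the question does not name a
distance measure):

```python
# Classify the query point (P1=3, P2=7) with k=3 nearest neighbors.
from math import dist

training = [((7, 7), "F"), ((7, 4), "F"), ((3, 4), "T"), ((1, 4), "T")]
query = (3, 7)

# Sort the training points by Euclidean distance to the query:
# (3,4) -> 3.00, (1,4) -> 3.61, (7,7) -> 4.00, (7,4) -> 5.00
neighbors = sorted(training, key=lambda item: dist(query, item[0]))

# Majority vote among the k=3 nearest neighbors: T, T, F -> class T.
votes = [label for _, label in neighbors[:3]]
print(max(set(votes), key=votes.count))  # prints "T"
```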
