Data mining is the process of extracting knowledge from large datasets, evolving from the need for effective data management and analysis. It involves multiple disciplines and follows several steps, including data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Additionally, the document discusses the differences between databases and data warehouses, outlines decision tree construction, emphasizes the importance of evaluation criteria for classification methods, and suggests ways to improve classification accuracy.
Question 1: What is data mining?
In your answer, address the following:

Data mining refers to the process of extracting, or "mining," interesting knowledge or patterns from large amounts of data.

(a) Is it another hype?
Data mining is not another hype. The need for data mining has arisen from the wide availability of huge amounts of data and the need to turn such data into useful information and knowledge. Data mining can thus be viewed as the result of the natural evolution of information technology.

(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. It involves an integration, rather than a simple transformation, of techniques from multiple disciplines: database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.

(c) Explain how the evolution of database technology led to data mining.
Database technology began with the development of data collection and database creation mechanisms, which led to effective mechanisms for data management, including data storage and retrieval as well as query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally created the need for data analysis and understanding, and data mining began its development out of this necessity.

(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
The steps are as follows (a minimal pipeline sketch appears after the answer to Question 2 below):
• Data cleaning: removes or transforms noise and inconsistent data.
• Data integration: combines data from multiple sources.
• Data selection: retrieves the data relevant to the analysis task from the database.
• Data transformation: transforms or consolidates data into forms appropriate for mining.
• Data mining: the essential step in which intelligent and efficient methods are applied to extract patterns.
• Pattern evaluation: identifies the truly interesting patterns that represent knowledge, based on interestingness measures.
• Knowledge presentation: uses visualization and knowledge representation techniques to present the mined knowledge to the user.

Question 2: How is a database different from a data warehouse?
Differences: A data warehouse is a repository of information collected from multiple sources over a history of time, stored under a unified schema, and used for data analysis and decision support. A database, in contrast, is a collection of interrelated data that represents the current status of the stored data; there may be multiple heterogeneous databases, and the schema of one database may not agree with the schema of another. A database system supports ad hoc queries and on-line transaction processing.
Similarities: Both are repositories of information that store huge amounts of persistent data.
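To make the knowledge-discovery steps in Question 1(d) concrete, here is a minimal sketch of the cleaning, integration, selection, and transformation steps using pandas. The tables, column names, and min-max scaling are hypothetical illustrations, not part of the original answer; the mining, evaluation, and presentation steps are omitted.

```python
import pandas as pd

# Two hypothetical data sources to be integrated.
sales = pd.DataFrame({"customer_id": [1, 2, 3, 3],
                      "amount": [120.0, None, 75.0, 75.0]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "age": [34, 51, 28]})

# Data cleaning: drop noisy/inconsistent records (missing values, duplicates).
sales = sales.dropna().drop_duplicates()

# Data integration: combine the sources on a shared key.
data = sales.merge(profiles, on="customer_id")

# Data selection: keep only the attributes relevant to the analysis task.
data = data[["age", "amount"]]

# Data transformation: consolidate into a form appropriate for mining,
# here min-max scaling of each attribute into [0, 1].
data = (data - data.min()) / (data.max() - data.min())

print(data)  # ready for the mining step
```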
Question 3: Briefly explain the steps of making a decision tree.
Step 1: Determine the root of the tree.
Step 2: Calculate the entropy of the classes.
Step 3: Calculate the entropy after the split for each attribute.
Step 4: Calculate the information gain for each split.
Step 5: Perform the split.
Step 6: Perform further splits.
Step 7: Complete the decision tree.

The core algorithm for building decision trees is called ID3. ID3 uses entropy and information gain to construct the tree.
• Entropy controls how a decision tree decides to split the data; it affects where the tree draws its boundaries. For a set with class proportions p_i, the entropy is H = -sum_i p_i log2(p_i).
• Information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.

In detail, ID3 proceeds as follows (a short sketch of the entropy and information-gain calculations appears after these steps):
Step 1: Calculate the entropy of the target/class variable.
Step 2: Split the dataset on each attribute. Calculate the entropy of each branch and add the branch entropies proportionally to get the total entropy of the split. Subtract this from the entropy before the split; the result is the information gain, or decrease in entropy.
Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
Step 4a: A branch with entropy 0 is a leaf node.
Step 4b: A branch with entropy greater than 0 needs further splitting.
Step 5: Run the ID3 algorithm recursively on the non-leaf branches until all data are classified.
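The entropy and information-gain calculations above can be sketched in a few lines of Python. This is a minimal illustration, not a full ID3 implementation; the toy dataset (one "Outlook" attribute against a binary "Play" class) is a hypothetical example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        branch = [r[target] for r in rows if r[attribute] == value]
        after += (len(branch) / len(rows)) * entropy(branch)  # proportional sum
    return before - after

# Hypothetical toy dataset: one attribute and a binary class.
rows = [
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "overcast", "Play": "yes"},
    {"Outlook": "rain", "Play": "yes"},
    {"Outlook": "rain", "Play": "no"},
]

print(entropy([r["Play"] for r in rows]))         # ~0.971, entropy of the class
print(information_gain(rows, "Outlook", "Play"))  # ~0.571, gain of this split
```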
Question 4: Explain the importance of evaluation criteria for classification methods.
Performance evaluation of a classification model is important for understanding the quality of the model, for refining it, and for choosing an adequate model. Evaluation metrics help us understand how a classifier performs; many are available, some with numerous tunable parameters. Evaluation criteria are also critical for judging reports by others: if a study presents only a single metric, one might question how the classifier would perform when evaluated using other metrics.
Classification metrics are calculated from true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs), all of which are tabulated in the so-called confusion matrix. For example, accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), and recall = TP / (TP + FN). The relevance of each of these four quantities depends on the purpose of the classifier and motivates the choice of metric. For a medical test that determines whether patients receive a treatment that is cheap, safe, and effective, FPs would not be as important as FNs, which would represent patients who might suffer without adequate treatment. In contrast, if the treatment were an experimental drug, then a very conservative test with few FPs would be required, to avoid testing the drug on unaffected individuals.

Question 5: How can we improve the accuracy of classification?
Some general methods for improving classification accuracy are:
1. Cross-validation: separate the training dataset into groups, always holding one group out for prediction and changing the held-out group on each execution; this shows which data train a more accurate model (a minimal sketch appears at the end of this document).
2. Cross-dataset validation: the same idea as cross-validation, but using different datasets.
3. Model tuning: changing the parameters used to train the classification model.
4. Normalization: discovering which preprocessing techniques produce more consistent data for training.
5. Understanding the problem better: try other methods that solve the same problem; there is almost always more than one way to solve it, and the current approach may not be the best one.

Question 6: Solve by using k-nearest neighbors for the query point P1 = 3, P2 = 7, where k = 3:

P1  P2  Class
7   7   F
7   4   F
3   4   T
1   4   T

Solution (using Euclidean distance from the query point (3, 7)):
• (7, 7): sqrt((7-3)^2 + (7-7)^2) = 4
• (7, 4): sqrt((7-3)^2 + (4-7)^2) = 5
• (3, 4): sqrt((3-3)^2 + (4-7)^2) = 3
• (1, 4): sqrt((1-3)^2 + (4-7)^2) = sqrt(13) ≈ 3.61
The k = 3 nearest neighbors are (3, 4) → T, (1, 4) → T, and (7, 7) → F, so by majority vote the query point is classified as T. (The sketch below reproduces this calculation.)
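A minimal Python sketch of the Question 6 calculation, assuming Euclidean distance and an unweighted majority vote (the usual defaults when the question does not specify them):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Training points from Question 6: ((P1, P2), class).
points = [((7, 7), "F"), ((7, 4), "F"), ((3, 4), "T"), ((1, 4), "T")]
query = (3, 7)
k = 3

# Sort the training points by distance to the query and keep the k nearest.
nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
print(nearest)  # [((3, 4), 'T'), ((1, 4), 'T'), ((7, 7), 'F')]

# Unweighted majority vote over the k nearest neighbors.
label = Counter(cls for _, cls in nearest).most_common(1)[0][0]
print(label)  # 'T'
```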
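Finally, for the cross-validation suggestion in Question 5, here is a minimal k-fold sketch. It assumes scikit-learn is available; the estimator and the built-in iris dataset are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# cv=5: each of the 5 folds is held out once for prediction while the
# remaining folds train the model, as described in Question 5, item 1.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)         # per-fold accuracy
print(scores.mean())  # average accuracy across folds
```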