DMBI Sem 6 Important Topics (IT)

1. Draw the data warehousing architecture. (5 marks)

Data warehousing architecture typically consists of three main components:

1. **Data Sources**: This is where data originates from various operational systems such as
databases, applications, and external sources. It includes data extraction tools to gather information.

2. **Data Warehouse**: The central storage area where data from different sources is integrated,
cleaned, transformed, and stored for analytical purposes. It consists of a staging area, the data
warehouse database, and access layers.

3. **Data Access Tools**: These are the front-end tools used by analysts and decision-makers to
access and analyze the data stored in the data warehouse. Examples include reporting tools, query
tools, OLAP (Online Analytical Processing) tools, and data mining tools.

In summary, data warehousing architecture comprises data sources, a data warehouse, and data
access tools, facilitating efficient data storage, integration, and analysis for decision-making
purposes.

2. What is noisy data? How to handle noisy data? (5 marks)

Noisy data refers to data that contains irrelevant, incorrect, or inconsistent information, which can
distort analysis and hinder accurate insights. To handle noisy data effectively:

1. **Identification**: Recognize noisy data by examining for outliers, errors, or inconsistencies.

2. **Filtering**: Use statistical methods such as mean, median, or clustering to remove or mitigate
the impact of noisy data points.

3. **Normalization**: Standardize the data to a common scale to minimize the effect of varying
magnitudes.

4. **Imputation**: Fill in missing values using techniques like mean substitution or predictive
modeling to maintain data integrity.
5. **Validation**: Validate the cleaned data through techniques like cross-validation or split
validation to ensure reliability for analysis and decision-making.

3. Compare and contrast OLTP and OLAP. (5 marks)

1. **Purpose**:

- OLTP (Online Transaction Processing) is used for day-to-day transactional activities like order
processing, inventory management, etc.

- OLAP (Online Analytical Processing) is used for complex analysis of data to support decision-
making processes.

2. **Data Structure**:

- OLTP systems typically deal with normalized data structures, which are optimized for transactional
processing and minimize redundancy.

- OLAP systems use denormalized or star/snowflake schema structures, which facilitate faster
querying and analysis.

3. **Usage**:

- OLTP systems are used by operational staff for routine transactions, requiring quick response
times.

- OLAP systems are used by analysts and decision-makers for complex queries and reporting, often
involving historical data.

4. **Query Complexity**:

- OLTP queries are usually simple, involving basic CRUD operations (Create, Read, Update, Delete).

- OLAP queries tend to be more complex, involving aggregations, drill-downs, and slicing/dicing of
data for analytical purposes.

5. **Performance Requirements**:

- OLTP systems prioritize high concurrency and low response times to handle multiple concurrent
transactions efficiently.
- OLAP systems prioritize query performance, focusing on processing large volumes of data for
analytical purposes.

4. Explain the concepts of information gain and Gini value used in the decision tree algorithm. (5 marks)

1. **Information Gain**:

- Information gain is a measure used to decide the relevance of a feature in splitting the
data in a decision tree.

- It quantifies the effectiveness of a feature by measuring the reduction in entropy or
disorder in the data when it is split based on that feature.

- Higher information gain indicates that splitting the data based on that feature results in
more homogenous subsets, making it a better choice for splitting.

2. **Gini Value**:

- Gini value, also known as Gini impurity, is another measure used for deciding the optimal
split in a decision tree.

- It represents the probability of incorrectly classifying a randomly chosen element if it
were randomly labeled according to the distribution of labels in the subset.

- A lower Gini value indicates that the subset is more pure, meaning most of the elements
belong to the same class. (A short illustrative sketch of both measures follows.)
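
As a quick illustration (using a made-up split of 10 samples into two branches, not data from the question), the following sketch computes entropy, information gain, and Gini impurity:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# Hypothetical parent node with 10 samples, split into two child nodes by some feature.
parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4

weighted_child_entropy = sum(len(c) / len(parent) * entropy(c) for c in (left, right))
info_gain = entropy(parent) - weighted_child_entropy

print("Information gain:", round(info_gain, 3))   # higher means a more useful split
print("Gini (left):", round(gini(left), 3), "Gini (right):", round(gini(right), 3))
```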

5. Consider we have the ages of 29 participants in a survey, given to us in sorted order: 5, 10, 13, 15,
16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85.
Explain how to calculate the mean, median, standard deviation, and the 1st and 3rd quartiles for the
given data, and also compute them. Show the Box and Whisker plot for this data.

1. **Mean (Average):**
- Mean is the sum of all values divided by the total count.
- For the provided dataset, the sum of all values is 890, and the count is 29.
- Therefore, Mean = 890 / 29 ≈ 30.69.

2. **Median (Middle Value):**
- Median represents the middle value of a dataset when arranged in ascending order.
- If the count of the dataset is odd, the median is the middle value. If the count is even, it's
the average of the two middle values.
- In this case, since the count is odd (29), the median is the 15th value after sorting, which
is 25.
3. **Standard Deviation (Spread):**
- Standard deviation quantifies the amount of variation or dispersion in a dataset.
- It measures how much each number in a dataset differs from the mean.
- The standard deviation for this dataset is approximately 16.9 (population formula), or about 17.2 with the sample formula.

4. **Quartiles (Data Division):**
- Quartiles divide a dataset into four equal parts.
- Q1 (First Quartile) is the median of the lower half of the dataset.
- Q3 (Third Quartile) is the median of the upper half of the dataset.
- For this dataset (including the median value in both halves), Q1 = 20 and Q3 = 35, so IQR = 15.

5. **Box and Whisker Plot:**
- A Box and Whisker plot visually represents the distribution of data, indicating the median,
quartiles, and presence of outliers.
- The box represents the interquartile range (from Q1 to Q3), with a line inside representing
the median.
- Whiskers extend from the box to the minimum and maximum values within 1.5 times the
interquartile range.
- Any values beyond the whiskers are considered outliers.
- A box plot for this dataset would show a box extending from 20 (Q1) to 35 (Q3), with a line at 25 (the
median) inside the box. The whiskers would extend from 5 up to 52 (the largest value within
Q3 + 1.5 times the IQR = 57.5), and the values 70 and 85 would be plotted as outliers beyond the upper whisker.

This comprehensive analysis provides insight into the central tendency, spread, and
distribution of the dataset.
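
The figures above can be reproduced with a short Python sketch (a verification aid, not part of the original answer); NumPy's default percentile interpolation happens to give the same Q1 = 20 and Q3 = 35 as the convention used above:

```python
import numpy as np

ages = [5, 10, 13, 15, 16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85]

mean = np.mean(ages)                      # 890 / 29 ≈ 30.69
median = np.median(ages)                  # 15th sorted value = 25
std = np.std(ages)                        # population standard deviation ≈ 16.89
q1, q3 = np.percentile(ages, [25, 75])    # 20.0 and 35.0
iqr = q3 - q1                             # 15.0

# Box-and-whisker bounds: whiskers reach the most extreme values within 1.5 * IQR of the box.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # -2.5 and 57.5
whisker_low = min(a for a in ages if a >= lower_fence)                # 5
whisker_high = max(a for a in ages if a <= upper_fence)               # 52
outliers = [a for a in ages if a < lower_fence or a > upper_fence]    # [70, 85]

print(mean, median, std, q1, q3, whisker_low, whisker_high, outliers)
```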

6. What is Data Mining? Explain KDD process with Diagram.

Data Mining is the process of discovering patterns, trends, and insights from large datasets. It
involves extracting useful information from raw data, often using computational algorithms and
statistical techniques.

The Knowledge Discovery in Databases (KDD) process is a systematic approach to data mining,
consisting of several stages:

1. **Data Selection:**

- In this stage, relevant data is selected from various sources, including databases, data
warehouses, or even the web.

2. **Data Preprocessing:**
- This stage involves cleaning and transforming the selected data to ensure its quality and suitability
for analysis. Tasks may include removing duplicates, handling missing values, and normalization.

3. **Data Reduction:**

- Data reduction techniques are applied to reduce the complexity of the dataset while preserving
its integrity and meaningfulness. This can include techniques such as dimensionality reduction or
feature selection.

4. **Data Mining:**

- The core stage of the KDD process, where data mining algorithms are applied to the prepared
dataset to extract patterns, trends, and relationships.

5. **Interpretation/Evaluation:**

- In this final stage, the discovered patterns and insights are interpreted and evaluated to
determine their significance and usefulness. This may involve visualization techniques and statistical
analysis.
7. Explain Market Basket Analysis with an example.

Market Basket Analysis (MBA) is a data mining technique used to discover relationships between
items purchased together. It helps retailers understand customer purchasing behavior by identifying
associations between products. Here's a simple explanation with an example:

Let's say you own a grocery store, and you want to understand the buying patterns of your
customers. By using Market Basket Analysis, you can uncover which items are frequently bought
together. For instance, through analyzing your sales data, you find that customers who buy bread
also tend to buy butter. This association can be represented as a rule: {Bread} ➔ {Butter}.

Here's how Market Basket Analysis works:

1. **Collect Data**: Gather transactional data that records which items were purchased together in
each transaction.

2. **Identify Itemsets**: Group items purchased together into sets, known as itemsets.

3. **Calculate Support**: Calculate the support for each itemset, which is the proportion of
transactions that contain the itemset.

4. **Set Threshold**: Set a minimum support threshold to filter out itemsets that occur less
frequently.

5. **Generate Rules**: From the frequent itemsets, generate association rules that show the
likelihood of one item being purchased when another item is purchased.

6. **Evaluate Rules**: Evaluate the generated rules based on metrics like support, confidence, and
lift.

In our example, if 60% of transactions containing bread also contain butter, the rule {Bread} ➔
{Butter} would have a confidence of 60%.
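
To make this arithmetic concrete, here is a small sketch on five made-up transactions (the transactions are illustrative assumptions, so the confidence differs from the 60% figure above):

```python
# Toy transactions (hypothetical) to illustrate support and confidence for {Bread} -> {Butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n             # 4/5 = 0.8
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # 3/5 = 0.6
confidence = support_both / support_bread                               # 0.6 / 0.8 = 0.75

print(f"support({{Bread, Butter}}) = {support_both:.2f}, "
      f"confidence(Bread -> Butter) = {confidence:.2f}")
```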

Market Basket Analysis is beneficial for retailers as it allows them to:


- Understand customer preferences and behaviors.

- Optimize product placement and promotions.

- Cross-sell and upsell products effectively.

- Improve inventory management and stock replenishment.

8. Consider the training dataset given below. Use the Naive Bayes algorithm to determine whether
it is advisable to play tennis on a day with hot temperature, rainy outlook, high humidity and
no wind.

Outlook    Temperature  Humidity  Windy  Class
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Play
Rain       Mild         High      False  Play
Rain       Cool         Normal    False  Play
Rain       Cool         Normal    True   No
Overcast   Cool         Normal    True   Play
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Play
Rain       Mild         Normal    False  Play
Sunny      Mild         Normal    True   Play
Overcast   Mild         High      True   Play
Overcast   Hot          Normal    False  Play
Rain       Mild         High      True   No
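
A worked sketch of the required calculation (an illustration rather than a given solution), assuming the 14 rows above with 9 "Play" and 5 "No" examples; it compares the unnormalized scores P(class) × Π P(attribute | class) for the query <Outlook = Rain, Temperature = Hot, Humidity = High, Windy = False>:

```python
# Naive Bayes hand-calculation sketch for the query:
# Outlook = Rain, Temperature = Hot, Humidity = High, Windy = False.
rows = [
    ("Sunny", "Hot", "High", "False", "No"),        ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Play"),   ("Rain", "Mild", "High", "False", "Play"),
    ("Rain", "Cool", "Normal", "False", "Play"),    ("Rain", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Play"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Play"),   ("Rain", "Mild", "Normal", "False", "Play"),
    ("Sunny", "Mild", "Normal", "True", "Play"),    ("Overcast", "Mild", "High", "True", "Play"),
    ("Overcast", "Hot", "Normal", "False", "Play"), ("Rain", "Mild", "High", "True", "No"),
]
query = ("Rain", "Hot", "High", "False")

scores = {}
for cls in ("Play", "No"):
    subset = [r for r in rows if r[4] == cls]
    score = len(subset) / len(rows)                                 # prior P(class)
    for i, value in enumerate(query):
        score *= sum(r[i] == value for r in subset) / len(subset)   # P(attribute | class)
    scores[cls] = score

print(scores)                        # approximately {'Play': 0.0106, 'No': 0.0183}
print(max(scores, key=scores.get))   # 'No'
```

Since the score for "No" (about 0.0183) exceeds the score for "Play" (about 0.0106), the Naive Bayes prediction is "No": it is not advisable to play tennis on such a day.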

9. What is an outlier? Explain various methods for performing outlier analysis.

An outlier is an observation in a dataset that significantly differs from other observations. It is a data
point that lies outside the overall pattern of the data. Outliers can occur due to errors in data
collection, measurement variability, or genuinely unusual phenomena. Here's an easy-to-remember
explanation of various methods for performing outlier analysis:

1. **Visual Inspection**: One simple method is to visually inspect the data using plots such as
histograms, box plots, or scatter plots. Outliers can often be identified as points that fall far away
from the main cluster of data points.

2. **Descriptive Statistics**: Calculate basic descriptive statistics such as mean, median, standard
deviation, and range. Observations that deviate significantly from these statistics may be considered
outliers.
3. **Z-Score Method**: Calculate the z-score for each data point, which measures how many
standard deviations an observation is from the mean. Data points with a z-score greater than a
threshold (commonly 2 or 3) are flagged as outliers.

4. **Interquartile Range (IQR) Method**: Calculate the interquartile range, which is the difference
between the third quartile (Q3) and the first quartile (Q1). Outliers are identified as observations that
fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

5. **Modified Z-Score Method**: Similar to the z-score method, but it uses a modified z-score that is
robust to outliers. This method is useful when the data contains outliers that may skew the mean
and standard deviation.

6. **Box Plot Method**: Plot the data using a box plot, which visually displays the median, quartiles,
and potential outliers. Observations outside the "whiskers" of the box plot are considered outliers.

7. **Machine Learning Methods**: Use machine learning algorithms such as Isolation Forest, Local
Outlier Factor (LOF), or One-Class SVM to automatically detect outliers based on the deviation from
the majority of the data points.

8. **Domain Knowledge**: Incorporate domain knowledge or subject matter expertise to identify
outliers that may represent genuine anomalies or errors in the data. (A small sketch of the z-score and
IQR rules follows.)
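
A small sketch of the z-score and IQR rules on a made-up sample (the data and the z-score threshold of 2 are illustrative assumptions):

```python
import numpy as np

# Hypothetical sample with one obvious outlier.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# Z-score rule: flag points more than 2 standard deviations from the mean (commonly 2 or 3).
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers)   # [102]
print("IQR outliers:", iqr_outliers)     # [102]
```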

10. Use the Apriori algorithm to identify the frequent itemsets in the following database. Then
extract the strong association rules from these sets. Assume minimum support = 50% and minimum
confidence = 75%.

TID  Items
a    1, 2, 3, 4, 5, 6
b    2, 3, 5
c    1, 2, 3, 5
d    1, 2, 4, 5
e    1, 2, 3, 4, 5, 6
f    2, 3, 5
g    1, 2, 4, 5
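
Since minimum support = 50% means an itemset must appear in at least 4 of the 7 transactions, a small brute-force sketch like the one below can be used to check a hand-worked Apriori answer (it enumerates all itemsets directly instead of pruning candidates level by level the way Apriori does):

```python
from itertools import combinations

# Transactions from the table above (TIDs a-g).
transactions = {
    "a": {1, 2, 3, 4, 5, 6}, "b": {2, 3, 5}, "c": {1, 2, 3, 5}, "d": {1, 2, 4, 5},
    "e": {1, 2, 3, 4, 5, 6}, "f": {2, 3, 5}, "g": {1, 2, 4, 5},
}
n = len(transactions)
min_support, min_confidence = 0.50, 0.75

def support(itemset):
    return sum(itemset <= t for t in transactions.values()) / n

# Brute-force enumeration of frequent itemsets.
items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Strong association rules: X -> Y with confidence = support(X ∪ Y) / support(X).
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = s / support(lhs)
            if conf >= min_confidence:
                print(f"{set(lhs)} -> {set(itemset - lhs)}  "
                      f"support={s:.2f}, confidence={conf:.2f}")
```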

11. Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Assume the initial cluster
centers are at A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1,
y1) and b = (x2, y2) is defined as P(a, b) = |x2-x1| + |y2-y1|. Use the K-Means algorithm to find the
three cluster centres after the second iteration.

Here's the K-Means algorithm applied step by step to the given problem:

1. **Initialization**: Start with three initial cluster centers: C1 = A1(2, 10), C2 = A4(5, 8), and C3 = A7(1, 2).

2. **Assignment Step (first iteration)**: Assign each point to the nearest cluster center based on the defined
distance function P(a, b) = |x2-x1| + |y2-y1|.

- A1(2, 10): distance to C1 = 0, to C2 = 3 + 2 = 5, to C3 = 1 + 8 = 9 → Cluster 1.

- A2(2, 5): distance to C1 = 0 + 5 = 5, to C2 = 3 + 3 = 6, to C3 = 1 + 3 = 4 → Cluster 3.

- A3(8, 4): distances 12, 7, 9 → Cluster 2.

- A4(5, 8): distances 5, 0, 10 → Cluster 2.

- A5(7, 5): distances 10, 5, 9 → Cluster 2.

- A6(6, 4): distances 10, 5, 7 → Cluster 2.

- A7(1, 2): distances 9, 10, 0 → Cluster 3.

- A8(4, 9): distances 3, 2, 10 → Cluster 2.

First-iteration clusters: Cluster 1 = {A1}, Cluster 2 = {A3, A4, A5, A6, A8}, Cluster 3 = {A2, A7}.

3. **Update Step (first iteration)**: Recalculate each cluster center as the mean of the points assigned to it.

- New center for Cluster 1: (2, 10)

- New center for Cluster 2: ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)

- New center for Cluster 3: ((2+1)/2, (5+2)/2) = (1.5, 3.5)

4. **Second Iteration**: Reassign each point to the nearest of the new centers (2, 10), (6, 6) and (1.5, 3.5).

- A1(2, 10) and A8(4, 9) are now closest to (2, 10) → Cluster 1.

- A3(8, 4), A4(5, 8), A5(7, 5) and A6(6, 4) are closest to (6, 6) → Cluster 2.

- A2(2, 5) and A7(1, 2) are closest to (1.5, 3.5) → Cluster 3.

Recomputing the centers gives the three cluster centers after the second iteration:

Cluster 1: (3, 9.5)

Cluster 2: (6.5, 5.25)

Cluster 3: (1.5, 3.5)

This process would continue until the cluster centers stabilize and the algorithm converges. (A short
verification sketch follows.)
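
A minimal sketch (for verification only) that re-runs the two iterations above with the Manhattan distance:

```python
# Minimal K-means sketch with Manhattan distance to verify the hand computation above.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]          # initial centers A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(2):                   # the question asks for two iterations
    # Assignment step: each point goes to its nearest center.
    clusters = [[] for _ in centers]
    for name, p in points.items():
        idx = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[idx].append(name)
    # Update step: each center becomes the mean of its assigned points.
    centers = [(sum(points[n][0] for n in c) / len(c),
                sum(points[n][1] for n in c) / len(c)) for c in clusters]
    print(iteration + 1, clusters, centers)
# After iteration 2: [['A1', 'A8'], ['A3', 'A4', 'A5', 'A6'], ['A2', 'A7']]
# with centers [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```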

12. Compare Star Schema, Snowflake Schema and Star Constellation.

Star Schema, Snowflake Schema, and Star Constellation are all data modeling techniques used in
data warehousing to organize and structure data for efficient querying and analysis. Let's
compare them:

1. **Star Schema**:

- **Description**: Star Schema is the simplest and most common schema type used in data
warehousing. It consists of one or more fact tables referencing any number of dimension tables.

- **Structure**: In a Star Schema, the fact table sits at the center, surrounded by dimension
tables. Each dimension table is directly connected to the fact table through foreign key
relationships.

- **Advantages**:

- Simplicity and ease of understanding.

- Query performance is generally high due to denormalization.

- **Disadvantages**:

- Redundancy in data storage due to denormalization.

- Might not be suitable for complex relationships between dimensions.

2. **Snowflake Schema**:

- **Description**: Snowflake Schema is an extension of the Star Schema where dimension
tables are normalized into multiple related tables, resembling a snowflake's shape.

- **Structure**: Dimension tables in a Snowflake Schema are organized into multiple levels of
related tables, reducing redundancy by separating hierarchies into distinct tables.

- **Advantages**:

- Reduced data redundancy due to normalization, leading to less storage space.

- Better support for complex relationships between dimensions.

- **Disadvantages**:
- Increased complexity in schema design and maintenance.

- Query performance might be slightly slower compared to Star Schema due to additional
table joins.

3. **Star Constellation**:

- **Description**: Star Constellation is an advanced schema design that combines multiple Star
or Snowflake schemas into a single model. It's suitable for very large and complex data
warehousing environments.

- **Structure**: Star Constellation comprises multiple interconnected Star or Snowflake
schemas, allowing for more flexible and scalable data organization.

- **Advantages**:

- Greater flexibility and scalability for accommodating complex data relationships and
hierarchies.

- Can handle very large volumes of data across multiple domains.

- **Disadvantages**:

- Increased complexity in schema design and management.

- Higher resource requirements for implementation and maintenance.

13. Dimensional Modeling (5 marks)

Dimensional modeling is a data modeling technique used in data warehousing. It organizes data
into easily understandable and accessible structures for efficient querying and analysis. Here's a
concise answer:

Dimensional modeling simplifies data storage by organizing it into two types of tables: fact tables
and dimension tables. Fact tables contain numerical measures, and dimension tables contain
descriptive attributes. This approach creates a star-like schema, where the fact table sits at the
center, surrounded by dimension tables. This simple structure enhances query performance and
makes it easier for users to navigate and analyze data. Overall, dimensional modeling optimizes
data warehousing for faster insights and decision-making.

14. Random Forest Technique (5 marks)

Random Forest is a powerful machine learning technique used for both classification and
regression tasks. Here's a simple answer:

Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It works by creating a "forest" of decision trees during training. Each tree is trained
on a random subset of the training data and a random subset of features. During prediction, the
output from each tree is aggregated to produce the final result. This technique improves
prediction accuracy and reduces overfitting, making it a popular choice for various machine
learning tasks.

15. Decision Tree Induction (5 marks)

Decision Tree Induction is a machine learning algorithm used for both classification and
regression tasks. Here's a simple answer:

Decision Tree Induction builds a tree-like structure where each internal node represents a
decision based on a feature, and each leaf node represents the outcome. It works by recursively
partitioning the data based on the most significant feature at each node, aiming to maximize
information gain or minimize impurity. This process continues until the tree adequately
represents the training data or a stopping criterion is met. Decision Tree Induction is intuitive,
interpretable, and widely used for its simplicity and effectiveness in solving various predictive
tasks.

16. Cross Validation (5 marks)

Cross Validation is a technique used to assess the performance of a machine learning model.
Here's a concise answer:

Cross Validation involves splitting the dataset into multiple subsets, called folds. The model is
trained on a subset of the data and evaluated on the remaining subset iteratively. This process
helps to estimate how well the model will generalize to new, unseen data. Common methods
include k-fold cross-validation, where the data is divided into k equal-sized folds, and each fold is
used as a validation set while the rest are used for training. Cross Validation provides a robust
estimate of the model's performance and helps to identify overfitting or underfitting issues.
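
As a small illustration (not tied to any particular library), the sketch below builds the train/validation index splits for k-fold cross-validation by hand; in practice, scikit-learn's KFold or cross_val_score is typically used instead:

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices, start = list(range(n_samples)), 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Each sample appears in exactly one validation fold; a model would be trained on
# `train` and scored on `validation` in every iteration, and the k scores averaged.
for train, validation in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "validate:", validation)
```
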
17. DBSCAN Algorithm (5 marks)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering
algorithm used in machine learning and data mining. Here's a concise answer:

DBSCAN groups together closely packed data points based on their density. It works by
identifying "core points" surrounded by a specified number of neighboring points within a given
radius. These core points form clusters, while points that do not meet the density criteria are
considered as noise or outliers. DBSCAN does not require specifying the number of clusters in
advance, making it suitable for datasets with irregular shapes and varying cluster densities. It's
an efficient and effective algorithm for discovering clusters in spatial data.

18. Explain the types of attributes used in data exploration.

1. **Numerical Attributes**:

- Numerical attributes represent quantitative data measured on a continuous or discrete scale.

- Examples include age, height, temperature, and income.

- Numerical attributes allow for mathematical operations such as addition, subtraction, and
averaging, making them suitable for statistical analysis and visualization using techniques like
histograms and scatter plots.

2. **Categorical Attributes**:

- Categorical attributes represent qualitative data that can be divided into distinct categories or
groups.

- Examples include gender, color, marital status, and country.

- Categorical attributes are often represented using labels or codes, and they can be analyzed
using frequency tables, bar charts, and pie charts to understand the distribution of categories
within the dataset.

3. **Ordinal Attributes**:

- Ordinal attributes are similar to categorical attributes but have a natural order or hierarchy
among their categories.
- Examples include ratings (e.g., low, medium, high), education level (e.g., primary, secondary,
tertiary), and satisfaction level (e.g., very unsatisfied, unsatisfied, neutral, satisfied, very
satisfied).

- Ordinal attributes allow for comparisons of relative order or rank but may not have equal
intervals between categories. They can be visualized using ordered bar charts or dot plots.

4. **Time-Series Attributes**:

- Time-series attributes represent data collected or measured over a sequence of time
intervals.

- Examples include stock prices, weather data, and website traffic over time.

- Time-series attributes are analyzed to identify trends, patterns, and seasonality using
techniques like line charts, moving averages, and autocorrelation plots.

5. **Boolean Attributes**:

- Boolean attributes represent binary data with only two possible values, typically true or false,
yes or no, 1 or 0.

- Examples include whether a customer is a member (yes/no), whether an item is in stock
(true/false), or whether a condition is met (1/0).

- Boolean attributes are often used for filtering and categorization, and they can be visualized
using bar charts or pie charts to show proportions of true and false values.

19. Explain the DBSCAN algorithm with an example.

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a
popular clustering algorithm used in machine learning to identify clusters of points in a dataset.
Here's an explanation of the DBSCAN algorithm with an example:

DBSCAN works by grouping together closely packed data points based on their density. It doesn't
require specifying the number of clusters in advance and is robust to noise and outliers.

The DBSCAN algorithm involves two key parameters:

1. **Epsilon (ε)**: A radius parameter that defines the maximum distance between two points
for them to be considered neighbors.

2. **Minimum Points (MinPts)**: The minimum number of points required to form a dense
region or cluster.
Here's how the DBSCAN algorithm works:

1. **Initialization**: Start with an unvisited data point in the dataset.

2. **Core Point Identification**: For each data point, identify its ε-neighborhood, which includes
all points within a distance of ε from the current point. If the number of points in the ε-
neighborhood is greater than or equal to MinPts, mark the point as a core point.

3. **Expansion of Cluster**: For each core point or a point reachable from a core point,
recursively find all points in its ε-neighborhood and add them to the same cluster. If a point is
reachable from multiple core points, it may belong to any of the corresponding clusters.

4. **Noise Point Identification**: Any point that is not a core point and not reachable from any
core point is considered a noise point or an outlier.

Let's illustrate the DBSCAN algorithm with an example:

Suppose we have a 2-dimensional dataset containing points as follows:

Dataset:

(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80), (80, 90), (90, 85), (91, 89)

Using DBSCAN with ε = 3 and MinPts = 3:

1. Start with an unvisited point, (1, 2).

2. Identify its ε-neighborhood: {(1, 2), (2, 2), (2, 3)}. Since the ε-neighborhood contains 3 points,
mark (1, 2) as a core point.

3. Expand the cluster by recursively finding all points in the ε-neighborhood of (1, 2) and adding
them to the cluster. Points (2, 2) and (2, 3) are added to the cluster.

4. Repeat the process for other points in the dataset.


With ε = 3 and MinPts = 3, only one dense region is found:

- Cluster 1: {(1, 2), (2, 2), (2, 3)}

Points (1, 2), (2, 2), and (2, 3) are all core points (each has at least 3 points, counting itself, within ε).
The remaining points, (8, 7), (8, 8), (25, 80), (80, 90), (90, 85), and (91, 89), are not core points and are
not reachable from any core point, so they are labeled as noise/outliers. A larger ε would be needed for
those points to form clusters of their own.

This example demonstrates how DBSCAN identifies clusters based on density, without needing
the number of clusters as input, and handles outliers effectively.
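
For reference, the same example can be run with scikit-learn's DBSCAN implementation (assuming scikit-learn is installed); its min_samples parameter counts the point itself, matching the MinPts convention above, and noise points receive the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
                   [25, 80], [80, 90], [90, 85], [91, 89]])

labels = DBSCAN(eps=3, min_samples=3).fit_predict(points)
print(labels)   # cluster labels per point; -1 marks noise/outlier points
# Expected: the first three points share one cluster label, the rest are labelled noise.
```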

20. Explain the K-means algorithm in detail. Apply the K-means algorithm to divide the given set of
values {2, 3, 6, 8, 9, 12, 15, 158, 22} into 3 clusters.

**K-means Algorithm**:

K-means is an iterative clustering algorithm used to partition a dataset into K distinct, non-
overlapping clusters. It aims to minimize the sum of squared distances between data points and
their respective cluster centroids. Here's how the K-means algorithm works:

1. **Initialization**:

- Choose K initial cluster centroids randomly from the dataset. These centroids represent the
initial cluster centers.

2. **Assignment Step**:

- Assign each data point to the nearest cluster centroid based on the Euclidean distance metric.

- Calculate the distance between each data point and each centroid, and assign each point to
the cluster with the nearest centroid.

3. **Update Step**:

- Recalculate the centroids of the clusters by taking the mean of all data points assigned to each
cluster.
- Move the centroids to the mean of the data points in their respective clusters.

4. **Repeat**:

- Repeat steps 2 and 3 until convergence, i.e., until the cluster assignments no longer change
significantly or a maximum number of iterations is reached.

**Application of K-means Algorithm**:

Given the set of values {2, 3, 6, 8, 9, 12, 15, 158, 22}, we want to divide them into 3 clusters using
the K-means algorithm:

1. **Initialization**:

- Randomly choose 3 initial cluster centroids: Let's say we choose 2, 8, and 158 as initial
centroids.

2. **Assignment Step (first iteration)**:

- Calculate the distance between each value and each centroid (2, 8, 158) and assign each value to the
cluster with the nearest centroid:

- Cluster 1: {2, 3}

- Cluster 2: {6, 8, 9, 12, 15, 22}

- Cluster 3: {158}

3. **Update Step (first iteration)**:

- Recalculate the centroids of the clusters:

- Cluster 1 centroid: mean of {2, 3} = 2.5

- Cluster 2 centroid: mean of {6, 8, 9, 12, 15, 22} = 12

- Cluster 3 centroid: mean of {158} = 158

4. **Repeat**:

- Repeating the assignment and update steps, the smaller values gradually move from Cluster 2 to
Cluster 1: the centroids become (2.5, 12, 158), then (3.67, 13.2, 158), then (4.75, 14.5, 158), and finally
(5.6, 16.33, 158), after which the assignments no longer change.

5. **Final Clusters**:

- After convergence, the final clusters are:

- Cluster 1: {2, 3, 6, 8, 9} with centroid 5.6

- Cluster 2: {12, 15, 22} with centroid ≈ 16.33

- Cluster 3: {158} with centroid 158

This process demonstrates how the K-means algorithm divides the given set of values into 3 clusters
based on their proximity to the cluster centroids; the isolated value 158 ends up in a cluster of its own.
(A short verification sketch follows.)
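
A minimal verification sketch for the 1-D case, starting from the same initial centroids (2, 8, 158) chosen above:

```python
# Sketch: K-means on the 1-D values with the initial centroids chosen above (2, 8, 158).
values = [2, 3, 6, 8, 9, 12, 15, 158, 22]
centroids = [2.0, 8.0, 158.0]

for _ in range(10):                                   # more than enough iterations to converge
    clusters = [[] for _ in centroids]
    for v in values:                                  # assignment step
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    new_centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    if new_centroids == centroids:                    # stop when centroids no longer change
        break
    centroids = new_centroids

print(clusters)    # [[2, 3, 6, 8, 9], [12, 15, 22], [158]]
print(centroids)   # [5.6, 16.33..., 158.0]
```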

21. Compare Bagging and Boosting of a Classifier


22. Explain Multilevel and Multidimension Association rules with suitable example

23. Define data mining. Explain KDD process with help of a suitable diagram

24. What is noisy data? How to handle it for the following data D =
{4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34}, number of bins = 3.

Perform the following:

i. Partition into equal-frequency bins

ii. Smoothing by bin means

iii. Smoothing by bin boundaries
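
A short Python sketch of the three requested operations on D (an illustrative solution, assuming the data is sorted and split into three equal-frequency bins of four values each):

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(data) // n_bins                      # 12 values -> 4 per bin

# i. Equal-frequency (equal-depth) bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# ii. Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
# [[9.0]*4, [22.75]*4, [29.25]*4]

# iii. Smoothing by bin boundaries: replace every value with the nearer bin boundary
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]

print(bins, by_means, by_boundaries, sep="\n")
```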

25. Define data warehouse. Explain data warehouse architecture with the help of a diagram.

26. Draw a three-tier data warehousing architecture. (5 marks)

27. Data: 4, 8, 15, 21, 21, 24, 25, 28, 34. Divide the data into 3 bins and perform smoothing by bin means
and smoothing by bin boundaries on every bin. (5 marks)

28. How to calculate the correlation coefficient for two numeric attributes? Also comment on the
significance of this value. (5 marks)
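
For reference, Pearson's correlation coefficient is r = Σ(xi - x̄)(yi - ȳ) / sqrt(Σ(xi - x̄)² · Σ(yi - ȳ)²); values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. A small sketch with made-up paired values:

```python
import math

# Hypothetical paired numeric attributes.
x = [2, 4, 6, 8, 10]
y = [3, 5, 8, 9, 12]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))

print(round(r, 3))   # close to +1 here, indicating a strong positive linear correlation
```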

29. Write a short note on support and confidence. (5 marks)

30. Explain the concept of information gain, which is used in the decision tree algorithm. (5 marks)

31. Describe any two methods of data reduction.

32. Compare Star Schema, Snowflake Schema and Fact Constellation.

33. Write and explain the Bayes classification algorithm.

34. Write the steps of the AdaBoost algorithm.

35. How is data mining used in business intelligence?

36. Give an overview of the partitioning clustering method.

37. How can we further improve the efficiency of Apriori-based mining?

38. Explain OLAP operations with examples.

39. Describe the classification performance evaluation measures that are obtained from a confusion
matrix.
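
For illustration, the usual measures derived from a 2×2 confusion matrix, computed on made-up counts (TP, FP, FN, TN are hypothetical):

```python
# Hypothetical 2x2 confusion matrix counts.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FP + FN + TN)          # fraction classified correctly
precision = TP / (TP + FP)                          # of predicted positives, how many are real
recall = TP / (TP + FN)                             # of real positives, how many were found (sensitivity)
specificity = TN / (TN + FP)                        # of real negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, round(recall, 3), round(specificity, 3), round(f1, 3))
```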

40. Use normalization methods to normalize the following group of data:

200, 300, 400, 600, 1000

Use min-max normalization by setting min = 0 and max = 1, and z-score normalization.
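
A short sketch of both requested normalizations (the z-score version below uses the population standard deviation, about 282.84 here; the sample formula would give slightly different values):

```python
data = [200, 300, 400, 600, 1000]

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
lo, hi = min(data), max(data)
min_max = [(v - lo) / (hi - lo) for v in data]          # [0.0, 0.125, 0.25, 0.5, 1.0]

# Z-score normalization: v' = (v - mean) / std  (population std used here)
mean = sum(data) / len(data)                                     # 500
std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5    # ≈ 282.84
z_score = [round((v - mean) / std, 3) for v in data]   # ≈ [-1.061, -0.707, -0.354, 0.354, 1.768]

print(min_max, z_score, sep="\n")
```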

41. Using the given training dataset, classify the following tuple using the Naïve Bayes
algorithm: <Homeowner: No, Marital Status: Married, Job Experience: 3>

Homeowner  Marital Status  Job Experience (in Years)  Defaulted
Yes        Single          3                          No
No         Married         4                          No
No         Single          5                          No
Yes        Married         4                          No
No         Divorced        2                          Yes
No         Married         4                          No
Yes        Divorced        2                          No
No         Married         3                          Yes
No         Married         3                          No
Yes        Single          2                          Yes

42. For the given table, perform the Apriori algorithm and show the frequent itemsets and strong
association rules. Assume a minimum support of 30% and a minimum confidence of 70%.

TID  Items
01   1, 3, 4, 6
02   2, 3, 5, 7
03   1, 2, 3, 5, 8
04   2, 5, 9, 10
05   1, 4
