DMBI Sem 6 Important Topics (IT)
1. Explain Data Warehouse architecture and its components. (5marks)
1. **Data Sources**: This is where data originates from various operational systems such as
databases, applications, and external sources. It includes data extraction tools to gather information.
2. **Data Warehouse**: The central storage area where data from different sources is integrated,
cleaned, transformed, and stored for analytical purposes. It consists of staging area, data warehouse
database, and access layers.
3. **Data Access Tools**: These are the front-end tools used by analysts and decision-makers to
access and analyze the data stored in the data warehouse. Examples include reporting tools, query
tools, OLAP (Online Analytical Processing) tools, and data mining tools.
In summary, data warehousing architecture comprises data sources, a data warehouse, and data
access tools, facilitating efficient data storage, integration, and analysis for decision-making
purposes.
2. What is noisy data? How can it be handled?
Noisy data refers to data that contains irrelevant, incorrect, or inconsistent information, which can
distort analysis and hinder accurate insights. To handle noisy data effectively:
1. **Binning**: Sort the data and partition it into bins, then smooth each bin by its mean, median, or
boundary values.
2. **Filtering**: Use statistical methods such as mean, median, or clustering to remove or mitigate
the impact of noisy data points.
3. **Normalization**: Standardize the data to a common scale to minimize the effect of varying
magnitudes.
4. **Imputation**: Fill in missing values using techniques like mean substitution or predictive
modeling to maintain data integrity.
5. **Validation**: Validate the cleaned data through techniques like cross-validation or split
validation to ensure reliability for analysis and decision-making.
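As an illustration of the binning technique above, here is a minimal sketch in plain Python (the equal-frequency bin choice is an assumption) that smooths the data from question 24 by bin means:

```python
# Minimal sketch: equal-frequency binning with smoothing by bin means,
# illustrated on the data from question 24 (3 bins assumed).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
num_bins = 3
bin_size = len(data) // num_bins  # 4 values per bin here

for i in range(num_bins):
    bin_values = data[i * bin_size:(i + 1) * bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed = [round(bin_mean, 2)] * len(bin_values)
    print(f"Bin {i + 1}: {bin_values} -> smoothed by mean: {smoothed}")
```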
3. Differentiate between OLTP and OLAP.
1. **Purpose**:
- OLTP (Online Transaction Processing) is used for day-to-day transactional activities like order
processing, inventory management, etc.
- OLAP (Online Analytical Processing) is used for complex analysis of data to support decision-
making processes.
2. **Data Structure**:
- OLTP systems typically deal with normalized data structures, which are optimized for transactional
processing and minimize redundancy.
- OLAP systems use denormalized or star/snowflake schema structures, which facilitate faster
querying and analysis.
3. **Usage**:
- OLTP systems are used by operational staff for routine transactions, requiring quick response
times.
- OLAP systems are used by analysts and decision-makers for complex queries and reporting, often
involving historical data.
4. **Query Complexity**:
- OLTP queries are usually simple, involving basic CRUD operations (Create, Read, Update, Delete).
- OLAP queries tend to be more complex, involving aggregations, drill-downs, and slicing/dicing of
data for analytical purposes.
5. **Performance Requirements**:
- OLTP systems prioritize high concurrency and low response times to handle multiple concurrent
transactions efficiently.
- OLAP systems prioritize query performance, focusing on processing large volumes of data for
analytical purposes.
4. Explain the concept of information gain and Gini value used in the decision tree algorithm. (5marks)
1. **Information Gain**:
- Information gain measures the reduction in entropy (where Entropy(S) = -Σ pᵢ log₂ pᵢ) achieved by
splitting the data on a feature, and is used to decide the relevance of that feature for a split in a
decision tree.
- Higher information gain indicates that splitting the data based on that feature results in more
homogeneous subsets, making it a better choice for splitting.
2. **Gini Value**:
- Gini value, also known as Gini impurity, is another measure used for deciding the optimal split in a
decision tree; it is computed as Gini = 1 - Σ pᵢ², where pᵢ is the proportion of class i in the subset.
- A lower Gini value indicates that the subset is more pure, meaning most of the elements
belong to the same class.
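A small Python sketch of these two criteria (the helper names and the 9/5 example split are assumptions, not from the notes):

```python
# Minimal sketch of entropy, Gini impurity, and information gain.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Gain = entropy of the parent node minus the weighted entropy of its children.
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical split of 9 "yes" / 5 "no" labels on some feature.
parent = ["yes"] * 9 + ["no"] * 5
left, right = ["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4
print(round(information_gain(parent, [left, right]), 3))  # higher = better split
print(round(gini(parent), 3))                             # lower = purer node
```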
5. Consider we have age of 29 participants in a survey given to us in sorted order. 5, 10, 13, 15,
16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85
Explain how to calculate mean, median, standard deviation, and the 1st and 3rd quartiles for the given
data and also compute the same. Show the Box and Whisker plot for this data.
1. **Mean (Average):**
- Mean is the sum of all values divided by the total count.
- For the provided dataset, the sum of all values is 890, and the count is 29.
- Therefore, Mean = 890 / 29 ≈ 30.69. The median, quartiles, and standard deviation are computed in
the sketch below.
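A short sketch using only the Python standard library to compute the remaining statistics (exact quartile values depend slightly on the convention used):

```python
# Minimal sketch: summary statistics for the 29 survey ages.
import statistics

ages = [5, 10, 13, 15, 16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85]

mean = statistics.mean(ages)                 # ~30.69
median = statistics.median(ages)             # 25 (the 15th of 29 sorted values)
stdev = statistics.pstdev(ages)              # population standard deviation, ~16.9
q1, _, q3 = statistics.quantiles(ages, n=4)  # ~20 and ~35.5
iqr = q3 - q1

print(mean, median, stdev, q1, q3, iqr)
# The box-and-whisker plot is drawn from min (5), Q1, median, Q3, and max (85),
# flagging values above Q3 + 1.5*IQR (here 70 and 85) as potential outliers.
```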
This comprehensive analysis provides insight into the central tendency, spread, and
distribution of the dataset.
6. Define Data Mining. Explain the KDD process.
Data Mining is the process of discovering patterns, trends, and insights from large datasets. It
involves extracting useful information from raw data, often using computational algorithms and
statistical techniques.
The Knowledge Discovery in Databases (KDD) process is a systematic approach to data mining,
consisting of several stages:
1. **Data Selection:**
- In this stage, relevant data is selected from various sources, including databases, data
warehouses, or even the web.
2. **Data Preprocessing:**
- This stage involves cleaning and transforming the selected data to ensure its quality and suitability
for analysis. Tasks may include removing duplicates, handling missing values, and normalization.
3. **Data Reduction:**
- Data reduction techniques are applied to reduce the complexity of the dataset while preserving
its integrity and meaningfulness. This can include techniques such as dimensionality reduction or
feature selection.
4. **Data Mining:**
- The core stage of the KDD process, where data mining algorithms are applied to the prepared
dataset to extract patterns, trends, and relationships.
5. **Interpretation/Evaluation:**
- In this final stage, the discovered patterns and insights are interpreted and evaluated to
determine their significance and usefulness. This may involve visualization techniques and statistical
analysis.
7. Explain Market Basket Analysis with an example.
Market Basket Analysis (MBA) is a data mining technique used to discover relationships between
items purchased together. It helps retailers understand customer purchasing behavior by identifying
associations between products. Here's a simple explanation with an example:
Let's say you own a grocery store, and you want to understand the buying patterns of your
customers. By using Market Basket Analysis, you can uncover which items are frequently bought
together. For instance, through analyzing your sales data, you find that customers who buy bread
also tend to buy butter. This association can be represented as a rule: {Bread} ➔ {Butter}. The basic
steps of Market Basket Analysis are:
1. **Collect Data**: Gather transactional data that records which items were purchased together in
each transaction.
2. **Identify Itemsets**: Group items purchased together into sets, known as itemsets.
3. **Calculate Support**: Calculate the support for each itemset, which is the proportion of
transactions that contain the itemset.
4. **Set Threshold**: Set a minimum support threshold to filter out itemsets that occur less
frequently.
5. **Generate Rules**: From the frequent itemsets, generate association rules that show the
likelihood of one item being purchased when another item is purchased.
6. **Evaluate Rules**: Evaluate the generated rules based on metrics like support, confidence, and
lift.
In our example, if 60% of transactions containing bread also contain butter, the rule {Bread} ➔
{Butter} would have a confidence of 60%.
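A small sketch showing how support, confidence, and lift for {Bread} ➔ {Butter} are computed; the five transactions are made up purely for illustration:

```python
# Minimal sketch: support, confidence, and lift for {Bread} -> {Butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
bread = sum("bread" in t for t in transactions)                          # 4
butter = sum("butter" in t for t in transactions)                        # 4
bread_and_butter = sum({"bread", "butter"} <= t for t in transactions)   # 3

support = bread_and_butter / n           # 3/5 = 0.60
confidence = bread_and_butter / bread    # 3/4 = 0.75
lift = confidence / (butter / n)         # 0.75 / 0.8 ≈ 0.94

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```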
8. Consider the training dataset given below. Use the Naive Bayes algorithm to determine whether
it is advisable to play tennis on a day with hot temperature, rainy outlook, high humidity and
no wind?
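Since the training table is not reproduced here, the following sketch uses a made-up play-tennis table to show how the Naive Bayes decision P(class) · Π P(attribute | class) is evaluated for the queried day:

```python
# Minimal sketch: Naive Bayes on a hypothetical play-tennis table
# (no Laplace smoothing, for brevity).
# Each tuple is (outlook, temperature, humidity, wind, play).
train = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
]
query = ("rain", "hot", "high", "weak")  # rainy outlook, hot, high humidity, no wind

def classify(query):
    scores = {}
    for label in {row[-1] for row in train}:
        rows = [r for r in train if r[-1] == label]
        score = len(rows) / len(train)                 # prior P(class)
        for i, value in enumerate(query):              # likelihoods P(attribute | class)
            score *= sum(r[i] == value for r in rows) / len(rows)
        scores[label] = score
    return max(scores, key=scores.get), scores

print(classify(query))
```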
9. What is an outlier? Explain various methods used for outlier analysis.
An outlier is an observation in a dataset that significantly differs from other observations. It is a data
point that lies outside the overall pattern of the data. Outliers can occur due to errors in data
collection, measurement variability, or genuinely unusual phenomena. Here's an easy-to-remember
explanation of various methods for performing outlier analysis:
1. **Visual Inspection**: One simple method is to visually inspect the data using plots such as
histograms, box plots, or scatter plots. Outliers can often be identified as points that fall far away
from the main cluster of data points.
2. **Descriptive Statistics**: Calculate basic descriptive statistics such as mean, median, standard
deviation, and range. Observations that deviate significantly from these statistics may be considered
outliers.
3. **Z-Score Method**: Calculate the z-score for each data point, which measures how many
standard deviations an observation is from the mean. Data points with a z-score greater than a
threshold (commonly 2 or 3) are flagged as outliers.
4. **Interquartile Range (IQR) Method**: Calculate the interquartile range, which is the difference
between the third quartile (Q3) and the first quartile (Q1). Outliers are identified as observations that
fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
5. **Modified Z-Score Method**: Similar to the z-score method, but it uses a modified z-score that is
robust to outliers. This method is useful when the data contains outliers that may skew the mean
and standard deviation.
6. **Box Plot Method**: Plot the data using a box plot, which visually displays the median, quartiles,
and potential outliers. Observations outside the "whiskers" of the box plot are considered outliers.
7. **Machine Learning Methods**: Use machine learning algorithms such as Isolation Forest, Local
Outlier Factor (LOF), or One-Class SVM to automatically detect outliers based on the deviation from
the majority of the data points.
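A minimal sketch of the z-score and IQR rules (standard library only; the sample values and thresholds are assumptions):

```python
# Minimal sketch: outlier detection with the z-score rule and the IQR rule.
import statistics

values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # 102 is the obvious outlier

mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_outliers = [v for v in values if abs((v - mean) / stdev) > 2]

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
iqr_outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```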
10. Use the Apriori algorithm to identify the frequent item-sets in the following database. Then
extract the strong association rules from these sets. Assume Min. Support = 50% Min.
Confidence = 75%
TID Items
a 1, 2, 3, 4, 5, 6
b 2, 3, 5
c 1, 2, 3, 5
d 1, 2, 4, 5
e 1, 2, 3, 4, 5, 6
f 2, 3, 5
g 1, 2, 4, 5
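As a cross-check for this exercise, here is a brute-force sketch (it enumerates all candidate itemsets rather than using Apriori's level-wise pruning, but yields the same frequent itemsets and rules for a database this small):

```python
# Minimal sketch: frequent itemsets and strong rules for the 7-transaction database,
# min support 50% and min confidence 75% as stated in the question.
from itertools import combinations

db = {
    "a": {1, 2, 3, 4, 5, 6}, "b": {2, 3, 5}, "c": {1, 2, 3, 5},
    "d": {1, 2, 4, 5}, "e": {1, 2, 3, 4, 5, 6}, "f": {2, 3, 5}, "g": {1, 2, 4, 5},
}
min_support, min_conf = 0.5, 0.75
items = sorted(set().union(*db.values()))

def support(itemset):
    return sum(itemset <= t for t in db.values()) / len(db)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(frozenset(c)) >= min_support]
print("Frequent itemsets:", [sorted(f) for f in frequent])

for f in frequent:
    for k in range(1, len(f)):
        for lhs in map(frozenset, combinations(f, k)):
            conf = support(f) / support(lhs)
            if conf >= min_conf:
                print(f"{sorted(lhs)} -> {sorted(f - lhs)} (conf={conf:.2f})")
```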
11. Cluster the following eight points (with (x, y) representing locations) into three clusters
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Assume Initial cluster
Centers are at: A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a =(x1,
y1) and b = (x2, y2) is defined as – P (a, b) = |x2-x1| + |y2-y1|. Use K-Means Algorithm to find the
three cluster centres after the second iteration.
Here's a simplified explanation of the K-Means Algorithm applied to the given problem:
1. **Initialization**: Start with three initial cluster centers: A1(2, 10), A4(5, 8), and A7(1, 2).
2. **Assignment Step**: Assign each point to the nearest cluster center based on the defined
distance function P(a, b) = |x2-x1| + |y2-y1|.
- A1, A4, and A7 are the initial centers themselves, so each starts in its own cluster.
- Calculate the distances for A2, A3, A5, A6, and A8 and assign each of them to the nearest cluster
center.
3. **Update Step**: Recalculate the cluster centers based on the points assigned to each cluster.
4. **Repeat**: Repeat the assignment and update steps until convergence (when the cluster centers
no longer change significantly between iterations).
In this case, after the second iteration, the cluster centers are C1(3, 9.5), C2(6.5, 5.25), and C3(1.5, 3.5).
This process continues until the cluster centers stabilize, and the algorithm converges to a solution.
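A plain-Python sketch reproducing the two iterations with the Manhattan distance for question 11 (variable names are assumptions):

```python
# Minimal sketch: two iterations of K-means with Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(2):  # two iterations, as the question asks
    clusters = [[] for _ in centers]
    for p in points.values():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print(f"After iteration {iteration + 1}: {centers}")
# Expected centers after the second iteration: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
```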
12. Compare Star Schema, Snowflake Schema, and Star Constellation.
Star Schema, Snowflake Schema, and Star Constellation are all data modeling techniques used in
data warehousing to organize and structure data for efficient querying and analysis. Let's
compare them:
1. **Star Schema**:
- **Description**: Star Schema is the simplest and most common schema type used in data
warehousing. It consists of one or more fact tables referencing any number of dimension tables.
- **Structure**: In a Star Schema, the fact table sits at the center, surrounded by dimension
tables. Each dimension table is directly connected to the fact table through foreign key
relationships.
- **Advantages**:
- Simple, easy-to-understand design; queries need few joins, so performance is fast.
- **Disadvantages**:
- Denormalized dimension tables introduce data redundancy and can be harder to keep consistent.
2. **Snowflake Schema**:
- **Description**: A Snowflake Schema is an extension of the Star Schema in which the dimension
tables are further normalized into related sub-dimension tables.
- **Structure**: Dimension tables in a Snowflake Schema are organized into multiple levels of
related tables, reducing redundancy by separating hierarchies into distinct tables.
- **Advantages**:
- Reduced data redundancy and lower storage requirements thanks to normalized dimension tables.
- **Disadvantages**:
- Increased complexity in schema design and maintenance.
- Query performance might be slightly slower compared to Star Schema due to additional
table joins.
3. **Star Constellation**:
- **Description**: Star Constellation (also called a Fact Constellation or Galaxy Schema) combines
multiple Star or Snowflake schemas into a single model, with several fact tables sharing common
dimension tables. It's suitable for very large and complex data warehousing environments.
- **Advantages**:
- Greater flexibility and scalability for accommodating complex data relationships and
hierarchies.
- **Disadvantages**:
- The most complex of the three to design, query, and maintain.
13. Explain Dimensional Modeling.
Dimensional modeling is a data modeling technique used in data warehousing. It organizes data
into easily understandable and accessible structures for efficient querying and analysis.
Dimensional modeling simplifies data storage by organizing it into two types of tables: fact tables
and dimension tables. Fact tables contain numerical measures, and dimension tables contain
descriptive attributes. This approach creates a star-like schema, where the fact table sits at the
center, surrounded by dimension tables. This simple structure enhances query performance and
makes it easier for users to navigate and analyze data. Overall, dimensional modeling optimizes
data warehousing for faster insights and decision-making.
14. Explain Random Forest.
Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It works by creating a "forest" of decision trees during training. Each tree is trained
on a random subset of the training data and a random subset of features. During prediction, the
output from each tree is aggregated to produce the final result. This technique improves
prediction accuracy and reduces overfitting, making it a popular choice for various machine
learning tasks.
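A minimal illustration, assuming scikit-learn is available and using a synthetic toy dataset:

```python
# Minimal sketch: training and evaluating a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample and a random subset of features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```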
15. Explain Decision Tree Induction.
Decision Tree Induction is a machine learning algorithm used for both classification and regression tasks.
Decision Tree Induction builds a tree-like structure where each internal node represents a
decision based on a feature, and each leaf node represents the outcome. It works by recursively
partitioning the data based on the most significant feature at each node, aiming to maximize
information gain or minimize impurity. This process continues until the tree adequately
represents the training data or a stopping criterion is met. Decision Tree Induction is intuitive,
interpretable, and widely used for its simplicity and effectiveness in solving various predictive
tasks.
16. Explain Cross Validation.
Cross Validation is a technique used to assess the performance of a machine learning model.
Cross Validation involves splitting the dataset into multiple subsets, called folds. The model is
trained on a subset of the data and evaluated on the remaining subset iteratively. This process
helps to estimate how well the model will generalize to new, unseen data. Common methods
include k-fold cross-validation, where the data is divided into k equal-sized folds, and each fold is
used as a validation set while the rest are used for training. Cross Validation provides a robust
estimate of the model's performance and helps to identify overfitting or underfitting issues.
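A short sketch of 5-fold cross-validation, assuming scikit-learn and using the bundled iris dataset:

```python
# Minimal sketch: 5-fold cross-validation of a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```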
17. DBSCAN Algorithm (5marks)
DBSCAN groups together closely packed data points based on their density. It works by
identifying "core points" surrounded by a specified number of neighboring points within a given
radius. These core points form clusters, while points that do not meet the density criteria are
considered as noise or outliers. DBSCAN does not require specifying the number of clusters in
advance, making it suitable for datasets with irregular shapes and varying cluster densities. It's
an efficient and effective algorithm for discovering clusters in spatial data.
18. Explain the different types of attributes with suitable examples.
1. **Numerical Attributes**:
- Numerical attributes represent quantitative data measured on a numeric scale, such as age, salary,
or temperature.
- Numerical attributes allow for mathematical operations such as addition, subtraction, and
averaging, making them suitable for statistical analysis and visualization using techniques like
histograms and scatter plots.
2. **Categorical Attributes**:
- Categorical attributes represent qualitative data that can be divided into distinct categories or
groups.
- Categorical attributes are often represented using labels or codes, and they can be analyzed
using frequency tables, bar charts, and pie charts to understand the distribution of categories
within the dataset.
3. **Ordinal Attributes**:
- Ordinal attributes are similar to categorical attributes but have a natural order or hierarchy
among their categories.
- Examples include ratings (e.g., low, medium, high), education level (e.g., primary, secondary,
tertiary), and satisfaction level (e.g., very unsatisfied, unsatisfied, neutral, satisfied, very
satisfied).
- Ordinal attributes allow for comparisons of relative order or rank but may not have equal
intervals between categories. They can be visualized using ordered bar charts or dot plots.
4. **Time-Series Attributes**:
- Time-series attributes record the values of a variable at successive points in time.
- Examples include stock prices, weather data, and website traffic over time.
- Time-series attributes are analyzed to identify trends, patterns, and seasonality using
techniques like line charts, moving averages, and autocorrelation plots.
5. **Boolean Attributes**:
- Boolean attributes represent binary data with only two possible values, typically true or false,
yes or no, 1 or 0.
- Boolean attributes are often used for filtering and categorization, and they can be visualized
using bar charts or pie charts to show proportions of true and false values.
19. Explain the DBSCAN algorithm with an example.
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a
popular clustering algorithm used in machine learning to identify clusters of points in a dataset.
Here's an explanation of the DBSCAN algorithm with an example:
DBSCAN works by grouping together closely packed data points based on their density. It doesn't
require specifying the number of clusters in advance and is robust to noise and outliers.
1. **Epsilon (ε)**: A radius parameter that defines the maximum distance between two points
for them to be considered neighbors.
2. **Minimum Points (MinPts)**: The minimum number of points required to form a dense
region or cluster.
Here's how the DBSCAN algorithm works:
1. **Parameter Selection**: Choose values for ε (the neighborhood radius) and MinPts.
2. **Core Point Identification**: For each data point, identify its ε-neighborhood, which includes
all points within a distance of ε from the current point. If the number of points in the ε-
neighborhood is greater than or equal to MinPts, mark the point as a core point.
3. **Expansion of Cluster**: For each core point or a point reachable from a core point,
recursively find all points in its ε-neighborhood and add them to the same cluster. If a point is
reachable from multiple core points, it may belong to any of the corresponding clusters.
4. **Noise Point Identification**: Any point that is not a core point and not reachable from any
core point is considered a noise point or an outlier.
Dataset:
(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80), (80, 90), (90, 85), (91, 89)
1. Assume ε = 2 and MinPts = 3, and start with the point (1, 2).
2. Identify its ε-neighborhood: {(1, 2), (2, 2), (2, 3)}. Since the ε-neighborhood contains 3 points,
mark (1, 2) as a core point.
3. Expand the cluster by recursively finding all points in the ε-neighborhood of (1, 2) and adding
them to the cluster. Points (2, 2) and (2, 3) are added to the cluster.
Points (1, 2), (2, 2), and (2, 3) are core points and form one cluster; under these parameters the
remaining points do not have dense enough neighborhoods and are treated as noise (outliers).
This example demonstrates how DBSCAN identifies clusters based on density, without needing
the number of clusters as input, and handles outliers effectively.
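A minimal sketch, assuming scikit-learn, that runs DBSCAN on the points above with the parameters assumed in the walkthrough (ε = 2, MinPts = 3):

```python
# Minimal sketch: DBSCAN on the example points; -1 in the output marks noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (8, 7), (8, 8),
              (25, 80), (80, 90), (90, 85), (91, 89)])
labels = DBSCAN(eps=2, min_samples=3).fit_predict(X)
print(labels)  # cluster index per point; -1 = noise/outlier
```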
20. Explain the K-means algorithm in detail. Apply the K-means algorithm to divide the given set of
values {2, 3, 6, 8, 9, 12, 15, 158, 22} into 3 clusters.
**K-means Algorithm**:
K-means is an iterative clustering algorithm used to partition a dataset into K distinct, non-
overlapping clusters. It aims to minimize the sum of squared distances between data points and
their respective cluster centroids. Here's how the K-means algorithm works:
1. **Initialization**:
- Choose K initial cluster centroids randomly from the dataset. These centroids represent the
initial cluster centers.
2. **Assignment Step**:
- Assign each data point to the nearest cluster centroid based on the Euclidean distance metric.
- Calculate the distance between each data point and each centroid, and assign each point to
the cluster with the nearest centroid.
3. **Update Step**:
- Recalculate the centroids of the clusters by taking the mean of all data points assigned to each
cluster.
- Move the centroids to the mean of the data points in their respective clusters.
4. **Repeat**:
- Repeat steps 2 and 3 until convergence, i.e., until the cluster assignments no longer change
significantly or a maximum number of iterations is reached.
Given the set of values {2, 3, 6, 8, 9, 12, 15, 158, 22}, we want to divide them into 3 clusters using
the K-means algorithm:
1. **Initialization**:
- Randomly choose 3 initial cluster centroids: Let's say we choose 2, 8, and 158 as initial
centroids.
2. **Assignment Step**:
- Calculate the distance between each data point and each centroid.
- Assign each data point to the cluster with the nearest centroid:
- Cluster 1 (centroid 2): {2, 3}
- Cluster 2 (centroid 8): {6, 8, 9, 12, 15, 22}
- Cluster 3 (centroid 158): {158}
3. **Update Step**:
- Recompute the centroids as the cluster means: 2.5, 12, and 158.
4. **Repeat**:
- Reassign the points to the updated centroids and recompute the means until the assignments stop
changing.
5. **Final Clusters**:
- After convergence, the final clusters are:
- Cluster 1: {2, 3, 6, 8, 9} with centroid 5.6
- Cluster 2: {12, 15, 22} with centroid about 16.3
- Cluster 3: {158} with centroid 158
This process demonstrates how the K-means algorithm divides the given set of values into 3
clusters based on their proximity to the cluster centroids.
23. Define data mining. Explain the KDD process with the help of a suitable diagram.
24. What is noisy data? How to handle it for the following data D = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34}? Number of bins = 3.
25. Define data warehouse. Explain data warehouse architecture with the help of a diagram.
27. Data: 4, 8, 15, 21, 21, 24, 25, 28, 34. Divide the data into 3 bins and perform smoothing by bin
means and smoothing by bin boundaries for every bin. (5marks)
28. How to calculate the correlation coefficient for two numeric attributes? Also comment on the
significance of this value. (5marks)
30. Explain the concept of Information Gain used in the Decision Tree algorithm. (5marks)
39. Describe the classification performance evaluation measures obtained from the confusion
matrix.
40. Use normalization methods to normalize the following group of data: use min-max normalization
by setting min = 0 and max = 1, and z-score normalization.
41. Using the given training dataset, classify the following tuple using the Naïve Bayes algorithm:
<Homeowner: No, Marital Status: Married, Job Experience: 3>
42. For the given table, perform the Apriori algorithm and show the frequent itemsets and strong
association rules. Assume minimum support of 30% and minimum confidence of 70%.
TID Items
01 1, 3, 4, 6
02 2, 3, 5, 7
03 1, 2, 3, 5, 8
04 2, 5, 9, 10
05 1, 4