DMBI Sem 6 Important Topics (IT)
1. Explain Data Warehouse architecture and its components. (5marks)
1. **Data Sources**: This is where data originates from various operational systems such as
databases, applications, and external sources. It includes data extraction tools to gather information.
2. **Data Warehouse**: The central storage area where data from different sources is integrated,
cleaned, transformed, and stored for analytical purposes. It consists of staging area, data warehouse
database, and access layers.
3. **Data Access Tools**: These are the front-end tools used by analysts and decision-makers to
access and analyze the data stored in the data warehouse. Examples include reporting tools, query
tools, OLAP (Online Analytical Processing) tools, and data mining tools.
In summary, data warehousing architecture comprises data sources, a data warehouse, and data
access tools, facilitating efficient data storage, integration, and analysis for decision-making
purposes.
2. What is noisy data? How can it be handled?
Noisy data refers to data that contains irrelevant, incorrect, or inconsistent information, which can
distort analysis and hinder accurate insights. To handle noisy data effectively:
1. **Binning**: Sort the data and partition it into bins, then smooth each bin by its mean, median, or
boundary values.
2. **Filtering**: Use statistical methods such as mean, median, or clustering to remove or mitigate
the impact of noisy data points.
3. **Normalization**: Standardize the data to a common scale to minimize the effect of varying
magnitudes.
4. **Imputation**: Fill in missing values using techniques like mean substitution or predictive
modeling to maintain data integrity.
5. **Validation**: Validate the cleaned data through techniques like cross-validation or split
validation to ensure reliability for analysis and decision-making.
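As an illustration of the binning technique above, here is a minimal sketch in plain Python (the equal-frequency bin choice is an assumption) that smooths the data from question 24 by bin means:

```python
# Minimal sketch: equal-frequency binning with smoothing by bin means,
# illustrated on the data from question 24 (3 bins assumed).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
num_bins = 3
bin_size = len(data) // num_bins  # 4 values per bin here

for i in range(num_bins):
    bin_values = data[i * bin_size:(i + 1) * bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed = [round(bin_mean, 2)] * len(bin_values)
    print(f"Bin {i + 1}: {bin_values} -> smoothed by mean: {smoothed}")
```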
3. Differentiate between OLTP and OLAP.
1. **Purpose**:
- OLTP (Online Transaction Processing) is used for day-to-day transactional activities like order
processing, inventory management, etc.
- OLAP (Online Analytical Processing) is used for complex analysis of data to support decision-
making processes.
2. **Data Structure**:
- OLTP systems typically deal with normalized data structures, which are optimized for transactional
processing and minimize redundancy.
- OLAP systems use denormalized or star/snowflake schema structures, which facilitate faster
querying and analysis.
3. **Usage**:
- OLTP systems are used by operational staff for routine transactions, requiring quick response
times.
- OLAP systems are used by analysts and decision-makers for complex queries and reporting, often
involving historical data.
4. **Query Complexity**:
- OLTP queries are usually simple, involving basic CRUD operations (Create, Read, Update, Delete).
- OLAP queries tend to be more complex, involving aggregations, drill-downs, and slicing/dicing of
data for analytical purposes.
5. **Performance Requirements**:
- OLTP systems prioritize high concurrency and low response times to handle multiple concurrent
transactions efficiently.
- OLAP systems prioritize query performance, focusing on processing large volumes of data for
analytical purposes.
4. Explain the concept of information gain and Gini value used in the decision tree algorithm. (5marks)
1. **Information Gain**:
- Information gain measures the reduction in entropy (where Entropy(S) = -Σ pᵢ log₂ pᵢ) achieved by
splitting the data on a feature, and is used to decide the relevance of that feature for a split in a
decision tree.
- Higher information gain indicates that splitting the data based on that feature results in more
homogeneous subsets, making it a better choice for splitting.
2. **Gini Value**:
- Gini value, also known as Gini impurity, is another measure used for deciding the optimal split in a
decision tree; it is computed as Gini = 1 - Σ pᵢ², where pᵢ is the proportion of class i in the subset.
- A lower Gini value indicates that the subset is more pure, meaning most of the elements
belong to the same class.
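A small Python sketch of these two criteria (the helper names and the 9/5 example split are assumptions, not from the notes):

```python
# Minimal sketch of entropy, Gini impurity, and information gain.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Gain = entropy of the parent node minus the weighted entropy of its children.
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical split of 9 "yes" / 5 "no" labels on some feature.
parent = ["yes"] * 9 + ["no"] * 5
left, right = ["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4
print(round(information_gain(parent, [left, right]), 3))  # higher = better split
print(round(gini(parent), 3))                             # lower = purer node
```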
5. Consider we have age of 29 participants in a survey given to us in sorted order. 5, 10, 13, 15,
16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85
Explain how to calculate mean, median, standard deviation, and the 1st and 3rd quartiles for the given
data and also compute the same. Show the Box and Whisker plot for this data.
1. **Mean (Average):**
- Mean is the sum of all values divided by the total count.
- For the provided dataset, the sum of all values is 890, and the count is 29.
- Therefore, Mean = 890 / 29 ≈ 30.69. The median, quartiles, and standard deviation are computed in
the sketch below.
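A short sketch using only the Python standard library to compute the remaining statistics (exact quartile values depend slightly on the convention used):

```python
# Minimal sketch: summary statistics for the 29 survey ages.
import statistics

ages = [5, 10, 13, 15, 16, 16, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 85]

mean = statistics.mean(ages)                 # ~30.69
median = statistics.median(ages)             # 25 (the 15th of 29 sorted values)
stdev = statistics.pstdev(ages)              # population standard deviation, ~16.9
q1, _, q3 = statistics.quantiles(ages, n=4)  # ~20 and ~35.5
iqr = q3 - q1

print(mean, median, stdev, q1, q3, iqr)
# The box-and-whisker plot is drawn from min (5), Q1, median, Q3, and max (85),
# flagging values above Q3 + 1.5*IQR (here 70 and 85) as potential outliers.
```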
This comprehensive analysis provides insight into the central tendency, spread, and
distribution of the dataset.
6. Define Data Mining. Explain the KDD process.
Data Mining is the process of discovering patterns, trends, and insights from large datasets. It
involves extracting useful information from raw data, often using computational algorithms and
statistical techniques.
The Knowledge Discovery in Databases (KDD) process is a systematic approach to data mining,
consisting of several stages:
1. **Data Selection:**
- In this stage, relevant data is selected from various sources, including databases, data
warehouses, or even the web.
2. **Data Preprocessing:**
- This stage involves cleaning and transforming the selected data to ensure its quality and suitability
for analysis. Tasks may include removing duplicates, handling missing values, and normalization.
3. **Data Reduction:**
- Data reduction techniques are applied to reduce the complexity of the dataset while preserving
its integrity and meaningfulness. This can include techniques such as dimensionality reduction or
feature selection.
4. **Data Mining:**
- The core stage of the KDD process, where data mining algorithms are applied to the prepared
dataset to extract patterns, trends, and relationships.
5. **Interpretation/Evaluation:**
- In this final stage, the discovered patterns and insights are interpreted and evaluated to
determine their significance and usefulness. This may involve visualization techniques and statistical
analysis.
7. Explain Market Basket Analysis with an example.
Market Basket Analysis (MBA) is a data mining technique used to discover relationships between
items purchased together. It helps retailers understand customer purchasing behavior by identifying
associations between products. Here's a simple explanation with an example:
Let's say you own a grocery store, and you want to understand the buying patterns of your
customers. By using Market Basket Analysis, you can uncover which items are frequently bought
together. For instance, through analyzing your sales data, you find that customers who buy bread
also tend to buy butter. This association can be represented as a rule: {Bread} ➔ {Butter}. The basic
steps of Market Basket Analysis are:
1. **Collect Data**: Gather transactional data that records which items were purchased together in
each transaction.
2. **Identify Itemsets**: Group items purchased together into sets, known as itemsets.
3. **Calculate Support**: Calculate the support for each itemset, which is the proportion of
transactions that contain the itemset.
4. **Set Threshold**: Set a minimum support threshold to filter out itemsets that occur less
frequently.
5. **Generate Rules**: From the frequent itemsets, generate association rules that show the
likelihood of one item being purchased when another item is purchased.
6. **Evaluate Rules**: Evaluate the generated rules based on metrics like support, confidence, and
lift.
In our example, if 60% of transactions containing bread also contain butter, the rule {Bread} ➔
{Butter} would have a confidence of 60%.
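A small sketch showing how support, confidence, and lift for {Bread} ➔ {Butter} are computed; the five transactions are made up purely for illustration:

```python
# Minimal sketch: support, confidence, and lift for {Bread} -> {Butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
bread = sum("bread" in t for t in transactions)                          # 4
butter = sum("butter" in t for t in transactions)                        # 4
bread_and_butter = sum({"bread", "butter"} <= t for t in transactions)   # 3

support = bread_and_butter / n           # 3/5 = 0.60
confidence = bread_and_butter / bread    # 3/4 = 0.75
lift = confidence / (butter / n)         # 0.75 / 0.8 ≈ 0.94

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```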
8. Consider the training dataset given below. Use the Naive Bayes algorithm to determine whether
it is advisable to play tennis on a day with hot temperature, rainy outlook, high humidity and
no wind?
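Since the training table is not reproduced here, the following sketch uses a made-up play-tennis table to show how the Naive Bayes decision P(class) · Π P(attribute | class) is evaluated for the queried day:

```python
# Minimal sketch: Naive Bayes on a hypothetical play-tennis table
# (no Laplace smoothing, for brevity).
# Each tuple is (outlook, temperature, humidity, wind, play).
train = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
]
query = ("rain", "hot", "high", "weak")  # rainy outlook, hot, high humidity, no wind

def classify(query):
    scores = {}
    for label in {row[-1] for row in train}:
        rows = [r for r in train if r[-1] == label]
        score = len(rows) / len(train)                 # prior P(class)
        for i, value in enumerate(query):              # likelihoods P(attribute | class)
            score *= sum(r[i] == value for r in rows) / len(rows)
        scores[label] = score
    return max(scores, key=scores.get), scores

print(classify(query))
```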
9. What is an outlier? Explain various methods used for outlier analysis.
An outlier is an observation in a dataset that significantly differs from other observations. It is a data
point that lies outside the overall pattern of the data. Outliers can occur due to errors in data
collection, measurement variability, or genuinely unusual phenomena. Here's an easy-to-remember
explanation of various methods for performing outlier analysis:
1. **Visual Inspection**: One simple method is to visually inspect the data using plots such as
histograms, box plots, or scatter plots. Outliers can often be identified as points that fall far away
from the main cluster of data points.
2. **Descriptive Statistics**: Calculate basic descriptive statistics such as mean, median, standard
deviation, and range. Observations that deviate significantly from these statistics may be considered
outliers.
3. **Z-Score Method**: Calculate the z-score for each data point, which measures how many
standard deviations an observation is from the mean. Data points with a z-score greater than a
threshold (commonly 2 or 3) are flagged as outliers.
4. **Interquartile Range (IQR) Method**: Calculate the interquartile range, which is the difference
between the third quartile (Q3) and the first quartile (Q1). Outliers are identified as observations that
fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
5. **Modified Z-Score Method**: Similar to the z-score method, but it uses a modified z-score that is
robust to outliers. This method is useful when the data contains outliers that may skew the mean
and standard deviation.
6. **Box Plot Method**: Plot the data using a box plot, which visually displays the median, quartiles,
and potential outliers. Observations outside the "whiskers" of the box plot are considered outliers.
7. **Machine Learning Methods**: Use machine learning algorithms such as Isolation Forest, Local
Outlier Factor (LOF), or One-Class SVM to automatically detect outliers based on the deviation from
the majority of the data points.
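A minimal sketch of the z-score and IQR rules (standard library only; the sample values and thresholds are assumptions):

```python
# Minimal sketch: outlier detection with the z-score rule and the IQR rule.
import statistics

values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # 102 is the obvious outlier

mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_outliers = [v for v in values if abs((v - mean) / stdev) > 2]

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
iqr_outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```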
10. Use the Apriori algorithm to identify the frequent item-sets in the following database. Then
extract the strong association rules from these sets. Assume Min. Support = 50% Min.
Confidence = 75%
TID Items
a 1, 2, 3, 4, 5, 6
b 2, 3, 5
c 1, 2, 3, 5
d 1, 2, 4, 5
e 1, 2, 3, 4, 5, 6
f 2, 3, 5
g 1, 2, 4, 5
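As a cross-check for this exercise, here is a brute-force sketch (it enumerates all candidate itemsets rather than using Apriori's level-wise pruning, but yields the same frequent itemsets and rules for a database this small):

```python
# Minimal sketch: frequent itemsets and strong rules for the 7-transaction database,
# min support 50% and min confidence 75% as stated in the question.
from itertools import combinations

db = {
    "a": {1, 2, 3, 4, 5, 6}, "b": {2, 3, 5}, "c": {1, 2, 3, 5},
    "d": {1, 2, 4, 5}, "e": {1, 2, 3, 4, 5, 6}, "f": {2, 3, 5}, "g": {1, 2, 4, 5},
}
min_support, min_conf = 0.5, 0.75
items = sorted(set().union(*db.values()))

def support(itemset):
    return sum(itemset <= t for t in db.values()) / len(db)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(frozenset(c)) >= min_support]
print("Frequent itemsets:", [sorted(f) for f in frequent])

for f in frequent:
    for k in range(1, len(f)):
        for lhs in map(frozenset, combinations(f, k)):
            conf = support(f) / support(lhs)
            if conf >= min_conf:
                print(f"{sorted(lhs)} -> {sorted(f - lhs)} (conf={conf:.2f})")
```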
11. Cluster the following eight points (with (x, y) representing locations) into three clusters
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Assume Initial cluster
Centers are at: A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a =(x1,
y1) and b = (x2, y2) is defined as – P (a, b) = |x2-x1| + |y2-y1|. Use K-Means Algorithm to find the
three cluster centres after the second iteration.
Here's a simplified explanation of the K-Means Algorithm applied to the given problem:
1. **Initialization**: Start with three initial cluster centers: A1(2, 10), A4(5, 8), and A7(1, 2).
2. **Assignment Step**: Assign each point to the nearest cluster center based on the defined
distance function P(a, b) = |x2-x1| + |y2-y1|.
- A1, A4, and A7 are the initial centers themselves, so each starts in its own cluster.
- Calculate the distances for A2, A3, A5, A6, and A8 and assign each of them to the nearest cluster
center.
3. **Update Step**: Recalculate the cluster centers based on the points assigned to each cluster.
4. **Repeat**: Repeat the assignment and update steps until convergence (when the cluster centers
no longer change significantly between iterations).
In this case, after the second iteration, the cluster centers are C1(3, 9.5), C2(6.5, 5.25), and C3(1.5, 3.5).
This process continues until the cluster centers stabilize, and the algorithm converges to a solution.
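A plain-Python sketch reproducing the two iterations with the Manhattan distance for question 11 (variable names are assumptions):

```python
# Minimal sketch: two iterations of K-means with Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(2):  # two iterations, as the question asks
    clusters = [[] for _ in centers]
    for p in points.values():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print(f"After iteration {iteration + 1}: {centers}")
# Expected centers after the second iteration: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
```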
12. Compare Star Schema, Snowflake Schema, and Star Constellation.
Star Schema, Snowflake Schema, and Star Constellation are all data modeling techniques used in
data warehousing to organize and structure data for efficient querying and analysis. Let's
compare them:
1. **Star Schema**:
- **Description**: Star Schema is the simplest and most common schema type used in data
warehousing. It consists of one or more fact tables referencing any number of dimension tables.
- **Structure**: In a Star Schema, the fact table sits at the center, surrounded by dimension
tables. Each dimension table is directly connected to the fact table through foreign key
relationships.
- **Advantages**:
- Simple, easy-to-understand design; queries need few joins, so performance is fast.
- **Disadvantages**:
- Denormalized dimension tables introduce data redundancy and can be harder to keep consistent.
2. **Snowflake Schema**:
- **Description**: A Snowflake Schema is an extension of the Star Schema in which the dimension
tables are further normalized into related sub-dimension tables.
- **Structure**: Dimension tables in a Snowflake Schema are organized into multiple levels of
related tables, reducing redundancy by separating hierarchies into distinct tables.
- **Advantages**:
- Reduced data redundancy and lower storage requirements thanks to normalized dimension tables.
- **Disadvantages**:
- Increased complexity in schema design and maintenance.
- Query performance might be slightly slower compared to Star Schema due to additional
table joins.
3. **Star Constellation**:
- **Description**: Star Constellation (also called a Fact Constellation or Galaxy Schema) combines
multiple Star or Snowflake schemas into a single model, with several fact tables sharing common
dimension tables. It's suitable for very large and complex data warehousing environments.
- **Advantages**:
- Greater flexibility and scalability for accommodating complex data relationships and
hierarchies.
- **Disadvantages**:
- The most complex of the three to design, query, and maintain.
13. Explain Dimensional Modeling.
Dimensional modeling is a data modeling technique used in data warehousing. It organizes data
into easily understandable and accessible structures for efficient querying and analysis.
Dimensional modeling simplifies data storage by organizing it into two types of tables: fact tables
and dimension tables. Fact tables contain numerical measures, and dimension tables contain
descriptive attributes. This approach creates a star-like schema, where the fact table sits at the
center, surrounded by dimension tables. This simple structure enhances query performance and
makes it easier for users to navigate and analyze data. Overall, dimensional modeling optimizes
data warehousing for faster insights and decision-making.
14. Explain Random Forest.
Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It works by creating a "forest" of decision trees during training. Each tree is trained
on a random subset of the training data and a random subset of features. During prediction, the
output from each tree is aggregated to produce the final result. This technique improves
prediction accuracy and reduces overfitting, making it a popular choice for various machine
learning tasks.
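A minimal illustration, assuming scikit-learn is available and using a synthetic toy dataset:

```python
# Minimal sketch: training and evaluating a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample and a random subset of features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```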
15. Explain Decision Tree Induction.
Decision Tree Induction is a machine learning algorithm used for both classification and regression tasks.
Decision Tree Induction builds a tree-like structure where each internal node represents a
decision based on a feature, and each leaf node represents the outcome. It works by recursively
partitioning the data based on the most significant feature at each node, aiming to maximize
information gain or minimize impurity. This process continues until the tree adequately
represents the training data or a stopping criterion is met. Decision Tree Induction is intuitive,
interpretable, and widely used for its simplicity and effectiveness in solving various predictive
tasks.
16. Explain Cross Validation.
Cross Validation is a technique used to assess the performance of a machine learning model.
Cross Validation involves splitting the dataset into multiple subsets, called folds. The model is
trained on a subset of the data and evaluated on the remaining subset iteratively. This process
helps to estimate how well the model will generalize to new, unseen data. Common methods
include k-fold cross-validation, where the data is divided into k equal-sized folds, and each fold is
used as a validation set while the rest are used for training. Cross Validation provides a robust
estimate of the model's performance and helps to identify overfitting or underfitting issues.
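A short sketch of 5-fold cross-validation, assuming scikit-learn and using the bundled iris dataset:

```python
# Minimal sketch: 5-fold cross-validation of a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```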
17. DBSCAN Algorithm (5marks)
DBSCAN groups together closely packed data points based on their density. It works by
identifying "core points" surrounded by a specified number of neighboring points within a given
radius. These core points form clusters, while points that do not meet the density criteria are
considered as noise or outliers. DBSCAN does not require specifying the number of clusters in
advance, making it suitable for datasets with irregular shapes and varying cluster densities. It's
an efficient and effective algorithm for discovering clusters in spatial data.
18. Explain the different types of attributes with suitable examples.
1. **Numerical Attributes**:
- Numerical attributes represent quantitative data measured on a numeric scale, such as age, salary,
or temperature.
- Numerical attributes allow for mathematical operations such as addition, subtraction, and
averaging, making them suitable for statistical analysis and visualization using techniques like
histograms and scatter plots.
2. **Categorical Attributes**:
- Categorical attributes represent qualitative data that can be divided into distinct categories or
groups.
- Categorical attributes are often represented using labels or codes, and they can be analyzed
using frequency tables, bar charts, and pie charts to understand the distribution of categories
within the dataset.
3. **Ordinal Attributes**:
- Ordinal attributes are similar to categorical attributes but have a natural order or hierarchy
among their categories.
- Examples include ratings (e.g., low, medium, high), education level (e.g., primary, secondary,
tertiary), and satisfaction level (e.g., very unsatisfied, unsatisfied, neutral, satisfied, very
satisfied).
- Ordinal attributes allow for comparisons of relative order or rank but may not have equal
intervals between categories. They can be visualized using ordered bar charts or dot plots.
4. **Time-Series Attributes**:
- Time-series attributes record the values of a variable at successive points in time.
- Examples include stock prices, weather data, and website traffic over time.
- Time-series attributes are analyzed to identify trends, patterns, and seasonality using
techniques like line charts, moving averages, and autocorrelation plots.
5. **Boolean Attributes**:
- Boolean attributes represent binary data with only two possible values, typically true or false,
yes or no, 1 or 0.
- Boolean attributes are often used for filtering and categorization, and they can be visualized
using bar charts or pie charts to show proportions of true and false values.
19. Explain the DBSCAN algorithm with an example.
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a
popular clustering algorithm used in machine learning to identify clusters of points in a dataset.
Here's an explanation of the DBSCAN algorithm with an example:
DBSCAN works by grouping together closely packed data points based on their density. It doesn't
require specifying the number of clusters in advance and is robust to noise and outliers.
1. **Epsilon (ε)**: A radius parameter that defines the maximum distance between two points
for them to be considered neighbors.
2. **Minimum Points (MinPts)**: The minimum number of points required to form a dense
region or cluster.
Here's how the DBSCAN algorithm works:
1. **Parameter Selection**: Choose values for ε (the neighborhood radius) and MinPts.
2. **Core Point Identification**: For each data point, identify its ε-neighborhood, which includes
all points within a distance of ε from the current point. If the number of points in the ε-
neighborhood is greater than or equal to MinPts, mark the point as a core point.
3. **Expansion of Cluster**: For each core point or a point reachable from a core point,
recursively find all points in its ε-neighborhood and add them to the same cluster. If a point is
reachable from multiple core points, it may belong to any of the corresponding clusters.
4. **Noise Point Identification**: Any point that is not a core point and not reachable from any
core point is considered a noise point or an outlier.
Dataset:
(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80), (80, 90), (90, 85), (91, 89)
1. Assume ε = 2 and MinPts = 3, and start with the point (1, 2).
2. Identify its ε-neighborhood: {(1, 2), (2, 2), (2, 3)}. Since the ε-neighborhood contains 3 points,
mark (1, 2) as a core point.
3. Expand the cluster by recursively finding all points in the ε-neighborhood of (1, 2) and adding
them to the cluster. Points (2, 2) and (2, 3) are added to the cluster.
Points (1, 2), (2, 2), and (2, 3) are core points and form one cluster; under these parameters the
remaining points do not have dense enough neighborhoods and are treated as noise (outliers).
This example demonstrates how DBSCAN identifies clusters based on density, without needing
the number of clusters as input, and handles outliers effectively.
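A minimal sketch, assuming scikit-learn, that runs DBSCAN on the points above with the parameters assumed in the walkthrough (ε = 2, MinPts = 3):

```python
# Minimal sketch: DBSCAN on the example points; -1 in the output marks noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (8, 7), (8, 8),
              (25, 80), (80, 90), (90, 85), (91, 89)])
labels = DBSCAN(eps=2, min_samples=3).fit_predict(X)
print(labels)  # cluster index per point; -1 = noise/outlier
```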
20. Explain the K-means algorithm in detail. Apply the K-means algorithm to divide the given set of
values {2, 3, 6, 8, 9, 12, 15, 158, 22} into 3 clusters.
**K-means Algorithm**:
K-means is an iterative clustering algorithm used to partition a dataset into K distinct, non-
overlapping clusters. It aims to minimize the sum of squared distances between data points and
their respective cluster centroids. Here's how the K-means algorithm works:
1. **Initialization**:
- Choose K initial cluster centroids randomly from the dataset. These centroids represent the
initial cluster centers.
2. **Assignment Step**:
- Assign each data point to the nearest cluster centroid based on the Euclidean distance metric.
- Calculate the distance between each data point and each centroid, and assign each point to
the cluster with the nearest centroid.
3. **Update Step**:
- Recalculate the centroids of the clusters by taking the mean of all data points assigned to each
cluster.
- Move the centroids to the mean of the data points in their respective clusters.
4. **Repeat**:
- Repeat steps 2 and 3 until convergence, i.e., until the cluster assignments no longer change
significantly or a maximum number of iterations is reached.
Given the set of values {2, 3, 6, 8, 9, 12, 15, 158, 22}, we want to divide them into 3 clusters using
the K-means algorithm:
1. **Initialization**:
- Randomly choose 3 initial cluster centroids: Let's say we choose 2, 8, and 158 as initial
centroids.
2. **Assignment Step**:
- Calculate the distance between each data point and each centroid.
- Assign each data point to the cluster with the nearest centroid:
- Cluster 1 (centroid 2): {2, 3}
- Cluster 2 (centroid 8): {6, 8, 9, 12, 15, 22}
- Cluster 3 (centroid 158): {158}
3. **Update Step**:
- Recompute the centroids as the cluster means: 2.5, 12, and 158.
4. **Repeat**:
- Reassign the points to the updated centroids and recompute the means until the assignments stop
changing.
5. **Final Clusters**:
- After convergence, the final clusters are:
- Cluster 1: {2, 3, 6, 8, 9} with centroid 5.6
- Cluster 2: {12, 15, 22} with centroid about 16.3
- Cluster 3: {158} with centroid 158
This process demonstrates how the K-means algorithm divides the given set of values into 3
clusters based on their proximity to the cluster centroids.
23. Define data mining. Explain the KDD process with the help of a suitable diagram.
24. What is noisy data? How to handle it for the following data D = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34}? Number of bins = 3.
25. Define data warehouse. Explain data warehouse architecture with the help of a diagram.
27. Data: 4, 8, 15, 21, 21, 24, 25, 28, 34. Divide the data into 3 bins and perform smoothing by bin
means and smoothing by bin boundaries for every bin. (5marks)
28. How to calculate the correlation coefficient for two numeric attributes? Also comment on the
significance of this value. (5marks)
30. Explain the concept of Information Gain used in the Decision Tree algorithm. (5marks)
39. Describe the classification performance evaluation measures obtained from the confusion
matrix.
40. Use normalization methods to normalize the following group of data: use min-max normalization
by setting min = 0 and max = 1, and z-score normalization.
41. Using the given training dataset, classify the following tuple using the Naïve Bayes algorithm:
<Homeowner: No, Marital Status: Married, Job Experience: 3>
42. For the given table, perform the Apriori algorithm and show the frequent itemsets and strong
association rules. Assume minimum support of 30% and minimum confidence of 70%.
TID Items
01 1, 3, 4, 6
02 2, 3, 5, 7
03 1, 2, 3, 5, 8
04 2, 5, 9, 10
05 1, 4