
Experiment 3

Medical Expert System:


% Define conditions and their symptoms
condition(asthma, [shortness_of_breath, wheezing]).
condition(flu, [fever,cough,fatigue]).
condition(cold, [sneezing,sore_throat]).

% Define symptoms
symptom(shortness_of_breath).
symptom(wheezing).
symptom(fever).
symptom(cough).
symptom(fatigue).
symptom(sneezing).
symptom(sore_throat).

% Main predicate to diagnose condition


diagnose_condition :-
write('Enter the symptoms (comma-separated): '),
read_line_to_string(user_input, InputString),
split_string(InputString, ",", " \t\n\r", SymptomsList),
maplist(atom_string, SymptomsAtoms, SymptomsList),
diagnose(SymptomsAtoms, Disease),
format('Diagnosis: ~w~n', [Disease]).

% Diagnose based on symptoms


diagnose(Symptoms, Condition) :-
condition(Condition, ConditionSymptoms),
subset(ConditionSymptoms, Symptoms), % Check if all condition symptoms are in the input list
!. % Cut to ensure only the first match is considered

% Predicate to check if all elements of the first list are in the second list
subset([], _).
subset([Elem | Rest], List) :-
member(Elem, List),
subset(Rest, List).

% Start the diagnosis process


:- diagnose_condition.

Output:

Chatbot:

knowledge('what is your name?', 'My name is Chatbot.').


knowledge('who created you?', 'I was created by Expert Drashti.').
knowledge('What is the capital of India?', 'The capital of India is Delhi.').
knowledge('Are you Human?', 'No, I am Chatbot.').
knowledge('2+2=5', 'No, it is 4.').

response(Input, Response):-
knowledge(Input, Response), !.
response(_,'I do not understand.').

chatbot:-
repeat,
write('User:'),
flush_output(current_output),
read_line_to_codes(user_input, Input),
atom_codes(AtomInput, Input),
response(AtomInput,Output),
format('Chatbot: ~w~n', [Output]),
(AtomInput ='exit';AtomInput ='quit'), !.

:-initialization(chatbot).

Output:
EXPERIMENT-5
ARTIFICIAL INTELLIGENCE

Aim: To Implement A* Informed Search Algorithm.

Code:
def aStarAlgo(start_node, stop_node):

    open_set = set([start_node])
    closed_set = set()
    g = {}        # store distance from starting node
    parents = {}  # parents contains an adjacency map of all nodes

    # distance of starting node from itself is zero
    g[start_node] = 0
    # start_node is the root node, i.e. it has no parent,
    # so start_node is set as its own parent
    parents[start_node] = start_node

    def get_neighbours(v):
        if v in Graph_nodes:
            return Graph_nodes[v]
        else:
            return None

    def heuristic(n):
        H_dist = {
            'S': 17,
            'A': 10,
            'B': 13,
            'C': 4,
            'D': 2,
            'E': 4,
            'F': 1,
            'G': 0,
        }
        return H_dist[n]

    Graph_nodes = {
        'S': [('A', 6), ('B', 5), ('C', 10)],
        'A': [('E', 6)],
        'B': [('D', 7), ('E', 6)],
        'C': [('D', 6)],
        'D': [('F', 6), ('B', 7)],
        'E': [('F', 4), ('B', 6)],
        'F': [('G', 3)],
    }

    while len(open_set) > 0:
        n = None

        # node with lowest f() = g() + h() is found
        for v in open_set:
            if n is None or g[v] + heuristic(v) < g[n] + heuristic(n):
                n = v

        if n == stop_node or Graph_nodes.get(n) is None:
            pass
        else:
            for (m, weight) in get_neighbours(n):
                # nodes 'm' not in the open or closed set are added to the open set
                # and n is set as their parent
                if m not in open_set and m not in closed_set:
                    open_set.add(m)
                    parents[m] = n
                    g[m] = g[n] + weight
                else:
                    # for each node m, compare its distance from start, g(m),
                    # to the distance from start through node n
                    if g[m] > g[n] + weight:
                        # update g(m) and its parent
                        g[m] = g[n] + weight
                        parents[m] = n
                        # if m is in the closed set, remove it and add it to the open set
                        if m in closed_set:
                            closed_set.remove(m)
                            open_set.add(m)

        if n is None:
            print("Path does not exist")
            return None

        # if the current node is the stop_node
        # then we begin reconstructing the path from it to the start_node
        if n == stop_node:
            path = []
            while parents[n] != n:
                path.append(n)
                n = parents[n]
            path.append(start_node)
            path.reverse()
            print("Path found: {}".format(path))
            return path

        # remove n from the open set and add it to the closed set
        # because all of its neighbours were inspected
        open_set.remove(n)
        closed_set.add(n)

    print("Path does not exist!!")
    return None

aStarAlgo('S', 'G')

Output:
Experiment No. 6
Program:
import math

def minimax(curDepth, nodeIndex, maxTurn, scores, targetDepth):


if curDepth == targetDepth:
return scores[nodeIndex]
if maxTurn:
return max(
minimax(curDepth + 1, nodeIndex * 2, False, scores, targetDepth),
minimax(curDepth + 1, nodeIndex * 2 + 1, False, scores, targetDepth)
)
else:
return min(
minimax(curDepth + 1, nodeIndex * 2, True, scores, targetDepth),
minimax(curDepth + 1, nodeIndex * 2 + 1, True, scores, targetDepth)
)

scores = [9, 2, 8, 7, 12, 21, 14, 2]


number_of_scores = len(scores)
treeDepth = int(math.log2(number_of_scores))

print("The optimal value is:", end=" ")


print(minimax(0, 0, True, scores, treeDepth))

Output:
Experiment 7

# Python3 program to create target string, starting from


# random string using Genetic Algorithm

import random

# Number of individuals in each generation


POPULATION_SIZE = 100

# Valid genes
GENES = '''abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
QRSTUVWXYZ 1234567890, .-;:_!"#%&/()=?@${[]}'''

# Target string to be generated


TARGET = "I love GeeksforGeeks"

class Individual(object):
'''
Class representing individual in population
'''
def __init__(self, chromosome):
self.chromosome = chromosome
self.fitness = self.cal_fitness()

@classmethod
def mutated_genes(self):
'''
create random genes for mutation
'''
global GENES
gene = random.choice(GENES)
return gene

@classmethod
def create_gnome(self):
'''
create chromosome or string of genes
'''
global TARGET
gnome_len = len(TARGET)
return [self.mutated_genes() for _ in range(gnome_len)]

def mate(self, par2):


'''
Perform mating and produce new offspring
'''
# chromosome for offspring
child_chromosome = []
for gp1, gp2 in zip(self.chromosome, par2.chromosome):

# random probability
prob = random.random()

# if prob is less than 0.45, insert gene


# from parent 1
if prob < 0.45:
child_chromosome.append(gp1)

# if prob is between 0.45 and 0.90, insert


# gene from parent 2
elif prob < 0.90:
child_chromosome.append(gp2)

# otherwise insert random gene(mutate),


# for maintaining diversity
else:
child_chromosome.append(self.mutated_genes())

# create new Individual(offspring) using


# generated chromosome for offspring
return Individual(child_chromosome)

def cal_fitness(self):
'''
Calculate fitness score, it is the number of
characters in string which differ from target
string.
'''
global TARGET
fitness = 0
for gs, gt in zip(self.chromosome, TARGET):
if gs != gt: fitness+= 1
return fitness

# Driver code
def main():
global POPULATION_SIZE

#current generation
generation = 1

found = False
population = []
# create initial population
for _ in range(POPULATION_SIZE):
gnome = Individual.create_gnome()
population.append(Individual(gnome))

while not found:

# sort the population in increasing order of fitness score


population = sorted(population, key = lambda x:x.fitness)

# if the individual with the lowest fitness score is 0,
# we know that we have reached the target, so break the loop
if population[0].fitness <= 0:
found = True
break

# Otherwise generate new offsprings for new generation


new_generation = []

# Perform elitism: 10% of the fittest population
# goes to the next generation
s = int((10*POPULATION_SIZE)/100)
new_generation.extend(population[:s])

# From 50% of fittest population, Individuals


# will mate to produce offspring
s = int((90*POPULATION_SIZE)/100)
for _ in range(s):
parent1 = random.choice(population[:50])
parent2 = random.choice(population[:50])
child = parent1.mate(parent2)
new_generation.append(child)

population = new_generation

print("Generation: {}\tString: {}\tFitness: {}".\


format(generation,
"".join(population[0].chromosome),
population[0].fitness))

generation += 1

print("Generation: {}\tString: {}\tFitness: {}".\


format(generation,
"".join(population[0].chromosome),
population[0].fitness))

if __name__ == '__main__':
main()

Output:
Experiment 8

Code:

class Predicate:
    def __init__(self, name, *args):
        self.name = name
        self.args = args

    def __repr__(self):
        return f"{self.name}({', '.join(map(str, self.args))})"

    def __eq__(self, other):
        # Compare predicates by name and arguments so that queries match by value.
        return isinstance(other, Predicate) and (self.name, self.args) == (other.name, other.args)

class KnowledgeBase:
def __init__(self):
self.predicates = []

def add_predicate(self, predicate):


self.predicates.append(predicate)

def show(self):
for p in self.predicates:
print(p)

def query(self, predicate):


return predicate in self.predicates

# Define some predicates


P = Predicate('Parent', 'Alice', 'Bob')
Q = Predicate('Parent', 'Alice', 'Charlie')
R = Predicate('Sibling', 'Bob', 'Charlie')

# Create a knowledge base and add predicates


kb = KnowledgeBase()
kb.add_predicate(P)
kb.add_predicate(Q)
kb.add_predicate(R)
# Show knowledge base contents
print("Knowledge Base:")
kb.show()

# Query the knowledge base


query_predicate = Predicate('Parent', 'Alice', 'Bob')
print("\nQuery result for", query_predicate, ":", kb.query(query_predicate))

Output:
Experiment 9
Code:

tab = [] # Current state (list of lists or stacks)


result = [] # Stores the sequence of moves
goalList = ["a", "b", "c", "d", "e"] # Goal state

# Define actions (move block from one stack to another)


def move(block, from_stack, to_stack):
    """Move a block from one stack to another"""
    if tab[from_stack][-1] == block:   # Ensure the block is on top
        tab[from_stack].pop()          # Remove block from the top of the source stack
        tab[to_stack].append(block)    # Place block on the destination stack
        result.append(f"Move {block} from Stack {from_stack} to Stack {to_stack}")
    else:
        print(f"Error: Block {block} is not on top of Stack {from_stack}")

# Recursive function to move blocks


def moveBlocks(n, from_stack, to_stack, aux_stack):
"""Move 'n' blocks from 'from_stack' to 'to_stack' using 'aux_stack'"""
if n == 0:
return

# Step 1: Move n-1 blocks from 'from_stack' to 'aux_stack'


moveBlocks(n-1, from_stack, aux_stack, to_stack)

# Step 2: Move nth block (topmost) from 'from_stack' to 'to_stack'


move(goalList[n-1], from_stack, to_stack)

# Step 3: Move n-1 blocks from 'aux_stack' to 'to_stack'


moveBlocks(n-1, aux_stack, to_stack, from_stack)

# Plan solution for N moves (or blocks)


def parSolution(N):
"""A simple planner to achieve the goal state with N moves"""
# Initialize N empty stacks (assuming N stacks for N blocks)
global tab, result
tab = [[] for _ in range(N)] # Create N empty stacks
result = [] # Clear previous result
tab[0] = ["e", "d", "c", "b", "a"] # Initial setup: all blocks in the
first stack

print("Initial state of stacks:", tab)

# Move all N blocks from stack 0 to stack N-1 (using stack 1 as auxiliary)
moveBlocks(N, 0, N-1, 1)
print("Final state of stacks:", tab)
print("Plan to achieve the goal:")
for step in result:
print(step)

# Test the solution with 5 blocks


parSolution(5)

Output:
EXP 10

# Import necessary libraries

from pgmpy.models import BayesianNetwork

from pgmpy.factors.discrete import TabularCPD

from pgmpy.inference import VariableElimination

# Step 1: Define the structure of the Bayesian Network

model = BayesianNetwork([('Cloudy', 'Rain'),

('Cloudy', 'Sprinkler'),

('Rain', 'WetGrass'),

('Sprinkler', 'WetGrass')])

# Step 2: Define the Conditional Probability Distributions (CPDs)

cpd_cloudy = TabularCPD(variable='Cloudy', variable_card=2, values=[[0.5], [0.5]])

cpd_rain = TabularCPD(variable='Rain', variable_card=2,

values=[[0.8, 0.2], [0.2, 0.8]],

evidence=['Cloudy'], evidence_card=[2])

cpd_sprinkler = TabularCPD(variable='Sprinkler', variable_card=2,

values=[[0.5, 0.9], [0.5, 0.1]],

evidence=['Cloudy'], evidence_card=[2])

cpd_wet_grass = TabularCPD(variable='WetGrass', variable_card=2,

values=[[1.0, 0.1, 0.1, 0.01],

[0.0, 0.9, 0.9, 0.99]],

evidence=['Sprinkler', 'Rain'],

evidence_card=[2, 2])

# Step 3: Add the CPDs to the model

model.add_cpds(cpd_cloudy, cpd_rain, cpd_sprinkler, cpd_wet_grass)

# Check if the model is valid


assert model.check_model()

# Step 4: Perform Inference

infer = VariableElimination(model)

# Example Query: What is the probability of WetGrass given that it is cloudy?

query_result = infer.query(variables=['WetGrass'], evidence={'Cloudy': 1})

print(query_result)

# Example Query: Probability of WetGrass given that both Rain and Sprinkler are true

query_result_2 = infer.query(variables=['WetGrass'], evidence={'Rain': 1, 'Sprinkler': 1})

print(query_result_2)

Output:
Experiment No: 1

• Aim: Study of data warehousing and mining & its applications.


• Theory: Data Warehouse.
• Definition:
➢ A Data Warehouse is a subject-oriented, integrated, non-volatile, and time-variant
collection of data in support of management’s decisions. A data warehouse can be
summarized as follows:
- Provides an integrated view of enterprise data.
- Makes current and historical data readily accessible for decision-making.
- Makes decision-support transactions possible without hindering operational systems.
- Ensures consistent and reliable organizational information.
- Presents a flexible and interactive source of strategic information.

• Need:
➢ Data warehouses are essential because they:
- Handle large volumes of data (TBs) efficiently.
- Support complex analytical queries not feasible with transactional databases.
- Integrate and centralize data for comprehensive business analysis.
- Enable organizations to derive strategic insights and make informed decisions.
• Features of Data-Warehouse:
1. Subject Oriented:

In data warehousing, data is organized and stored based on business subjects or areas (e.g.,
sales, customers, products) rather than being tied to specific operational applications. This
approach allows for a comprehensive view of each subject across the organization, integrating
data from various sources to provide a unified perspective. For example, data about orders,
customers, inventory, and transactions are structured to support comprehensive analysis and
decision-making across all related functions, rather than being siloed within individual
application datasets. This subject-oriented approach in data warehousing facilitates integrated
analysis and strategic insights across the organization.
2. Integrated Data:

For proper decision-making, you must gather all the relevant data from the various
applications. The data in the data warehouse comes from several operational systems. Source
data are in different databases, files, and data segments. These are disparate applications, so
the operational platforms and operating systems could differ. The file layouts, character code
representations, and field naming conventions could be different.

3. Time-Variant Data:

The data collected in a data warehouse is identified with a particular time period. The data in
a data warehouse provides information from the historical point of view. For an operational
system, the stored data contains the current values. In an accounts receivable system, the
balance is the current outstanding balance in the customer’s account. In an order entry
system, the status of an order is the current status of the order. In a consumer loans
application, the balance amount owed by the customer is the current amount. Of course, we
store some past transactions in operational systems, but, essentially, operational systems
reflect current information because these systems support day-to-day current operations.
4. Non-Volatile Data:
Non-volatile means the previous data is not erased when new data is added to it. A data
warehouse is kept separate from the operational database, and therefore frequent changes in the
operational database are not reflected in the data warehouse. Data extracted from the various
operational systems and pertinent data obtained from outside sources are transformed,
integrated, and stored in the data warehouse. The data in the data warehouse is not intended
to run the day-to-day business. When you want to process the next order received from a
customer, you do not look into the data warehouse to find the current stock status. The
operational order entry application is meant for that purpose. In the data warehouse, you
keep the extracted stock status data as snapshots over time. You do not update the data
warehouse every time you process a single order.
5. Data Granularity:
In an operational system, data is usually kept at the lowest level of detail. In a point-of-sale
system for a grocery store, the units of sale are captured and stored at the level of units of a
product per transaction at the check-out counter. In an order entry system, the quantity
ordered is captured and stored at the level of units of a product per order received from the
customer. Whenever you need summary data, you add up the individual transactions. If you
are looking for units of a product ordered this month, you read all the orders entered for the
entire month for that product and add up. You do not usually keep summary data in an
operational system.

• Difference Between Data Warehouse and Data Mart:


• Top Down & Bottom-Up Approach:

• Practical Approach:
The steps in this practical approach are as follows:
- Plan and define requirements at the overall corporate level.
- Create a surrounding architecture for a complete warehouse.
- Conform and standardize the data content.
- Implement the data warehouse as a series of super-marts, one at a time.
• Architecture of Data Warehouse:

- Source Systems: These are the operational systems where raw data originates, such as
transactional databases, CRM systems, ERP systems, flat files, etc.
- ETL (Extract, Transform, Load): ETL tools and processes are used to extract data from
source systems, transform it into a structured format suitable for analysis (e.g., cleaning,
filtering, aggregating), and load it into the data warehouse.
- Data Warehouse: The central repository where integrated, structured, and cleaned data
from various sources is stored. It typically uses a schema optimized for querying and
analysis, such as star schema or snowflake schema.
- Data Marts: These are subsets of data warehouses that are optimized for specific
departments or functions within an organization. Data marts provide easier and faster
access to data relevant to specific user groups.
- Metadata Repository: Metadata is data about data, which describes the characteristics
of the data warehouse content, structure, and usage. A metadata repository stores this
information, helping users understand and manage the data.
- OLAP (Online Analytical Processing) Engine: OLAP tools and engines enable
multidimensional analysis of data stored in the data warehouse. They support complex
queries and facilitate interactive analysis through features like drill-down, slice-and-dice,
and pivot operations.
- Reporting and Analysis Tools: These tools provide interfaces for querying the data
warehouse, creating reports, and visualizing data. Examples include BI (Business
Intelligence) tools, dashboards, and ad-hoc query tools.
- Data Quality Tools: Tools and processes to ensure data quality by detecting and correcting
errors, maintaining consistency, and ensuring data integrity throughout the data
warehouse.
- Security and Access Control: Measures to secure data within the data warehouse,
including authentication, authorization, and encryption to protect against unauthorized
access and ensure compliance with data privacy regulations.
- Backup and Recovery: Strategies and processes for backing up data in the data warehouse
and recovering it in case of data loss or system failure.

• Meta Data:

Metadata is essential data that describes other data, such as its content, structure, and
context. It improves data organization, discoverability, and usability across different
applications and domains.
Examples include file details, image properties, music attributes, and web page information.
Metadata supports effective data governance, interoperability, and enhances data
preservation and visualization efforts. Its structured format enables better search engine
optimization, content management, and data integration.
• Types of Meta Data:
- Operational Metadata. As you know, data for the data warehouse comes from several
operational systems of the enterprise. These source systems contain different data
structures. The data elements selected for the data warehouse have various field lengths
and data types. In selecting data from the source systems for the data warehouse, you
split records, combine parts of records from different source files, and deal with multiple
coding schemes and field lengths. When you deliver information to the end-users, you
must be able to tie that back to the original source data sets. Operational metadata
contain all of this information about the operational data sources.
- Extraction and Transformation Metadata: Extraction and transformation metadata
contain data about the extraction of data from the source systems, namely, the extraction
frequencies, extraction methods, and business rules for the data extraction. Also, this
category of metadata contains information about all the data transformations that take
place in the data staging area.
- End-User Metadata: The end-user metadata is the navigational map of the data
warehouse. It enables the end-users to find information from the data warehouse. The
end-user metadata allows the end-users to use their own business terminology and look
for information in those ways in which they normally think of the business.
• Applications:

- Business Intelligence (BI) and Reporting: Data warehouses centralize data from multiple
sources, enabling businesses to perform complex queries and analysis. This facilitates the
generation of insightful reports and dashboards for decision-making.
- Strategic Decision Making: By providing a unified view of data across an organization,
data warehouses support strategic decision-making processes. Executives and managers
can access timely and accurate information to formulate strategies and business plans.
- Operational Analytics: Data warehouses enable real-time or near-real-time analysis of
operational data. This helps businesses monitor performance metrics, identify trends, and
optimize operations for improved efficiency.
- Customer Analytics: Analysing customer data stored in a data warehouse allows
businesses to understand customer behaviour, preferences, and trends. This supports
targeted marketing campaigns, customer segmentation, and personalized customer
experiences.
- Financial Analysis: Finance departments use data warehouses to consolidate financial
data from various systems (e.g., ERP systems), analyse financial performance, manage
budgets, and forecast future financial trends.
- Supply Chain Management: Data warehouses help optimize supply chain operations by
integrating and analysing data related to inventory levels, supplier performance, logistics,
and demand forecasts. This improves inventory management and reduces costs.
- Regulatory Compliance and Risk Management: Data warehouses assist in regulatory
compliance by providing auditable records and ensuring data integrity. They also support
risk management initiatives by analysing historical data to identify potential risks and
trends.
- Healthcare Analytics: In healthcare, data warehouses integrate patient records, medical
histories, treatment outcomes, and operational data. This enables healthcare providers
to improve patient care, manage resources effectively, and conduct medical research.
• Theory: Data Mining.
• Definition:
➢ Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.
• Steps required in Data Mining:
- Data cleaning (to remove noise and inconsistent data).
- Data integration (where multiple data sources may be combined).
- Data selection (where data relevant to the analysis task are retrieved from the database).
- Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations).
- Data mining (an essential process where intelligent methods are applied to extract data
patterns).
- Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users).
• Types of Data that can be mined:
- Database Data: A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data. The software programs provide mechanisms
for defining database structures and data storage; for specifying and managing
concurrent, shared, or distributed data access; and for ensuring consistency and security
of the information stored despite system crashes or attempts at unauthorized access.
- Data Warehouses data: A data warehouse is usually modelled by a multidimensional data
structure, called a data cube, in which each dimension corresponds to an attribute or a
set of attributes in the schema, and each cell stores the value of some aggregate measure
such as count() or sum(sales_amount). A data cube provides a multidimensional view of data
and allows the precomputation and fast access of summarized data.
- Transactional Data: In general, each record in a transactional database captures a
transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web
page. A transaction typically includes a unique transaction identity number (trans ID) and
a list of the items making up the transaction, such as the items purchased in the
transaction. A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information about the
salesperson or the branch, and so on.
• Kinds of pattern that can be mined:
- Characterization and Discrimination: Data characterization is a summarization of the
general characteristics or features of a target class of data. The data corresponding to the
user-specified class are typically collected by a query. Data discrimination is a comparison
of the general features of the target class data objects against the general features of
objects from one or multiple contrasting classes. The target and contrasting classes can
be specified by a user, and the corresponding data objects can be retrieved through
database queries.
- Mining Frequent Patterns, Associations, and Correlations: Frequent patterns, as the
name suggests, are patterns that occur frequently in data. There are many kinds of
frequent patterns, including frequent item sets, frequent sub sequences (also known as
sequential patterns), and frequent substructures. A frequent itemset typically refers to a
set of items that often appear together in a transactional data set.
- Classification and Regression for Predictive Analysis: Classification is the process of
finding a model (or function) that describes and distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data objects for
which the class labels are known). The model is used to predict the class label of objects
for which the class label is unknown.
- Cluster Analysis: Unlike classification and regression, which analyse class-labelled
(training) data sets, clustering analyses data objects without consulting class labels. In
many cases, class labelled data may simply not exist at the beginning.
- Outlier Analysis: A data set may contain objects that do not comply with the general
behaviour or model of the data. These data objects are outliers. Many data mining
methods discard outliers as noise or exceptions.
• Technologies Used:

1. Machine Learning:
- Supervised Learning: Involves learning a function that maps an input to an output based
on example input-output pairs (e.g., decision trees, neural networks, support vector
machines).
- Unsupervised Learning: Involves finding hidden patterns or intrinsic structures in input
data without labelled responses (e.g., clustering with K-means, association rule learning
with Apriori algorithm).
2. Statistics:
- Descriptive Statistics: Summarizes and describes the main features of a data set, including
measures such as mean, median, mode, and standard deviation.
- Inferential Statistics: Makes inferences and predictions about a population based on a
sample of data, using techniques like regression analysis, hypothesis testing, and ANOVA
(Analysis of Variance).
3. Pattern Recognition:
- Techniques used to identify patterns and regularities in data. Often applied in image and
speech recognition to classify or categorize data based on learned patterns.
4. Neural Networks:
- Computational models inspired by the human brain. They consist of interconnected nodes
(neurons) and are used for complex pattern recognition tasks, such as image and speech
recognition.
5. Artificial Intelligence:
- Expert Systems: Rule-based systems that emulate human decision-making.
- Genetic Algorithms: Optimization algorithms inspired by natural selection.
• Issues of data mining:
- Mining Methodology: Researchers have been vigorously developing new data mining
methodologies. This involves the investigation of new kinds of knowledge, mining in
multidimensional space, integrating methods from other disciplines, and the
consideration of semantic ties among data objects. In addition, mining methodologies
should consider issues such as data uncertainty, noise, and incompleteness.
- User Interaction: The user plays an important role in the data mining process. Interesting
areas of research include how to interact with a data mining system, how to incorporate
a user’s background knowledge in mining, and how to visualize and comprehend data
mining results.
- Efficiency and Scalability: Data mining algorithms must be efficient and scalable in order
to effectively extract information from huge amounts of data in many data repositories or
in dynamic data streams. In other words, the running time of a data mining algorithm
must be predictable, short, and acceptable by applications. Efficiency, scalability,
performance, optimization, and the ability to execute in real time are key criteria that
drive the development of many new data mining algorithms.
- Diversity of Database Types: Diverse applications generate a wide spectrum of new data
types, from structured data such as relational and data warehouse data to semi-
structured and unstructured data; from stable data repositories to dynamic data streams;
from simple data objects to temporal data, biological sequences, sensor data, spatial data,
hypertext data, multimedia data, software program code, Web data, and social network
data.
- Data Mining and Society: With data mining penetrating our everyday lives, it is important
to study the impact of data mining on society. How can we use data mining technology to
benefit society? How can we guard against its misuse? The improper disclosure or use of
data and the potential violation of individual privacy and data protection rights are areas
of concern that need to be addressed.
• Applications:
1. Market Analysis and Management
- Customer Profiling: Identifying characteristics of customers who are likely to purchase a
particular product.
- Market Segmentation: Dividing a market into distinct subsets of customers with common
needs or characteristics.
2. Risk Management and Fraud Detection
- Credit Scoring: Assessing the creditworthiness of potential customers.
- Fraud Detection: Identifying fraudulent activities in areas like credit card transactions,
insurance claims, and telecommunications.
3. Healthcare and Medical Diagnosis
- Disease Diagnosis: Identifying patterns that lead to the diagnosis of diseases.
- Treatment Effectiveness: Evaluating the effectiveness of treatments based on patient
data.
4. Manufacturing and Production
- Quality Control: Identifying defects and ensuring product quality.
- Process Optimization: Improving manufacturing processes through data analysis.
5. Web Mining
- Web Usage Mining: Analysing web server logs to find user navigation patterns.
- Web Structure Mining: Understanding the structure of websites and their hyperlink
hierarchy.
- Web Content Mining: Extracting useful information from the content of web pages.
6. Scientific Research
- Bioinformatics: Analysing biological data, such as gene sequences.
- Astronomy: Discovering patterns and relationships in astronomical data.

• Conclusion:
Concept of data warehouse and data mining are studied and understood successfully.
Experiment-2
Aim: To study and create star schema and snowflake schema for real life problem.

Theory:
A] Information Package Diagrams (IPD)

i. Definition:

The presence of information package diagrams in the requirements definition document is the
major and significant difference between operational systems and data warehouse systems.
Remember that information package diagrams are the best approach for determining
requirements for a data warehouse. The information package diagrams crystallize the information
requirements for the data warehouse. They contain the critical metrics measuring the
performance of the business units, the business dimensions along which the metrics are analyzed,
and the details of how drill-down and roll-up analyses are done. Spend as much time as needed to
make sure that the information package diagrams are complete and accurate. Your data design
for the data warehouse will be totally dependent on the accuracy and adequacy of the information
package diagrams.

Fig 1.1 Information Package Diagram.

ii. Dimensional Modeling Basics


Dimensional modeling gets its name from the business dimensions we need to incorporate into the
logical data model. It is a logical design technique to structure the business dimensions and the
metrics that are analyzed along these dimensions. This modeling technique is intuitive for that
purpose. The model has also proved to provide high performance for queries and analysis. The
multidimensional information package diagram we have discussed is the foundation for the
dimensional model. Therefore, the dimensional model consists of the specific data structures
needed to represent the business dimensions. These data structures also contain the metrics or
facts.
B] Star Schema

A star schema is a type of database schema that is used in data warehousing and business
intelligence. It is designed to optimize the querying and reporting of large datasets. The star schema
gets its name because its structure resembles a star, with a central fact table connected to multiple
dimension tables. Here's a breakdown of its components:
1. Fact Table:

 Central Table: Contains quantitative data for analysis (e.g., sales revenue, number of
units sold).

 Measures: These are the numerical values that are analyzed (e.g., total sales, total
cost).

 Foreign Keys: These are keys that link to the primary keys in the dimension tables.

2. Dimension Tables:

 Peripheral Tables: Contain descriptive attributes related to the facts (e.g., date,
product, customer).

 Attributes: Provide context for the measures (e.g., product name, customer name,
date).

 Primary Keys: Unique identifiers for each record, used to join with the fact table.

Fig 1.2 Star schema structure

C] Snowflake Schema

A snowflake schema is a type of database schema used in data warehousing that represents data
in a more normalized form compared to the star schema. It gets its name because its structure
resembles a snowflake, with the dimension tables normalized into multiple related tables. This
reduces redundancy and can save storage space, but it can also make the schema more complex
and slow down query performance due to the additional joins required.
To balance this trade-off, data warehouses store data summarized at different levels, known as data
granularity. Lower granularity means finer detail, which requires more storage. Deciding on granularity
levels depends on the data types and expected query performance.

Fig 1.3 Snowflake Schema Structure

D] Problem Statement:
Design a star and snowflake schema for a port operation data warehouse to enhance reporting
and analysis. The system should manage data from various sources, including vessel details, cargo
information, docking schedules, port staff, operations records, and customer information. In the
star schema, a central fact table will record metrics like cargo handled and operational costs,
surrounded by dimension tables for vessels, cargo, docks, staff, time, and customers. The
snowflake schema will normalize these dimensions into sub-dimensions, such as vessel types and
cargo categories, to reduce redundancy and optimize storage. The goal is to create a data model
that supports efficient querying and reporting, providing clear insights into port performance and
operations.

E] Working of the organization:


Data warehousing for a port operation starts with gathering data from different sources. Think of it like pulling
together info from all over the place—like details about ships, what cargo they’re carrying, when
and where they dock, who’s working at the port, and who the customers are. Once we’ve got all
this data, we clean it up to make sure it’s accurate and consistent.
Next, we set up the central piece of the puzzle: the fact table. This table is like the heart of our
data warehouse. It tracks the important stuff, like how much cargo was handled, the costs of
operations, and how long various activities took. Each entry in this table links out to other tables
for more context.

Surrounding the fact table, we have dimension tables. These are like the “extras” that give you
more details:
i. The Vessel Dimension has info about the ships—like their types and owners.
ii. The Cargo Dimension tells you about different types of cargo and their categories.
iii. The Time Dimension breaks down dates and times so you can track changes over periods.
iv. The Customer Dimension keeps track of who’s using the port’s services.
When it’s time to dig into the data, you run queries that pull from the fact table and join it with
these dimension tables. For example, if you want to see how much cargo was handled by different
vessels during a certain time, you’d link the fact table with the Vessel Dimension, Cargo Dimension,
and Time Dimension.

F] Information Package Diagram for Port Operation:

Port            Container        Time        Carrier

Port Name       Type             Date        Name
Location        Size             Week        Product Name
County Name     Weight           Month       Brand Name
City Name       Owner Name       Year        Package Type
Port_Key        Container_Key    Time_Key    Carrier_Key

Facts: Fact_Key, Port_Key, Container_Key, Carrier_Key, Time_Key,
Handling Time, Revenue, Freight Cost.

Information Package Diagram (IPD)

G] STAR SCHEMA FOR PORT OPERATION:

Fig 1.4 Star schema for PORT OPERATION.


H] SNOWFLAKE SCHEMA FOR PORT OPERATION:

Fig 1.5 Snowflake schema for PORT OPERATION.


Experiment – 3

AIM: Implementation of all dimension table and fact table for ‘PORT OPERATION SYSTEM'.
THEORY:
Fact Table:
1. The fact table is central in a star or snowflake schema.
2. The primary key in the fact table is mapped as foreign keys to dimensions.
3. It contains fewer attributes and more records.
4. The fact table comes after the dimension table.
5. It has a numeric and text data format.
6. It is utilized for analysis and reporting.
Dimensional Table:
1. The dimensional table is located at the edge of a star or snowflake schema.
2. Dimension tables are used to describe dimensions; they contain dimension keys, values,
and attributes.
3. When we create a dimension, we logically define a structure for our projects.
4. The foreign key is mapped to the facts table.
5. The dimensional table is in text data format.

The star and snowflake schema are logical storage designs commonly found in data marts and
data warehouse architecture. While common database types use ER (Entity- Relationship)
diagrams, the logical structure of warehouses uses dimensional models to conceptualize the
storage system.

What is a Star Schema?


A star schema is a logical structure for the development of data marts and simpler data
warehouses. The simple model consists of dimension tables connected to a facts table in the
center.

What is a Snowflake Schema?


The snowflake schema has a branched-out logical structure used in large data warehouses.
From the center to the edges, entity information goes from general to more specific. Apart
from the dimensional model's common elements, the snowflake schema further decomposes
dimensional tables into subdimensions.
MySQL Queries
1) MySQL Create Database
MySQL create database is used to create database.
For example:
create database db1;
2) MySQL Select/Use Database
MySQL use database is used to select database.
For example:
use db1;
3) MySQL Create Query
MySQL create query is used to create a table, view, procedure and function.
For example:
CREATE TABLE customers
(id int(10),
name varchar(50),
city varchar(50), PRIMARY KEY (id));
4) MySQL Describe Table
DESCRIBE means to show the information in detail. Since we have tables in MySQL, we will
use the DESCRIBE command to show the structure of our table, such as column names,
constraints on column names, etc. The DESC command is a short form of the DESCRIBE
command. Both DESCRIBE and DESC are equivalent and case-insensitive. Syntax: The
following is the syntax to display the table structure:
{DESCRIBE | DESC} table_name;
5) MySQL Insert Query
MySQL insert query is used to insert records into a table. For example:
insert into customers values(101, 'rahul', 'delhi');
6) MySQL Select Query
MySQL select query is used to fetch records from the database.
For example:
SELECT * FROM customers;
7)MySQL Primary Key
A MySQL primary key is a single field or a combination of fields used to identify each record
in a table uniquely. If a column carries the primary key constraint, it cannot be null or
empty. A table may have many columns, but it can contain only one primary key, which always
holds unique values in its column.
When you insert a new row into the table, the primary key column can also use the
AUTO_INCREMENT attribute to generate a sequential number for that row automatically.
MySQL automatically creates an index named "Primary" after defining a primary key into the
table. Since it has an associated index, we can say that the primary key makes the query
performance fast.
Primary Key Using CREATE TABLE Statement-In this section, we are going to see how a primary
key is created using the CREATE TABLE statement.
Syntax: The following are the syntax used to create a primary key in MySQL.
If we want to create only one primary key column in the table, use the syntax below:
CREATE TABLE table_name(
col1 datatype PRIMARY KEY,
col2 datatype,
...
);
If we want to create a primary key over more than one column, use the syntax below:
CREATE TABLE table_name(
col1 col_definition,
col2 col_definition,
CONSTRAINT [constraint_name] PRIMARY KEY (column_name(s))
);
Primary Key Using ALTER TABLE Statement-This statement allows us to do the modification
into the existing table. When the table does not have a primary key, this statement is used to
add the primary key to the column of an existing table.
Syntax: Following are the syntax of the ALTER TABLE statement to create a primary key in
MySQL:
ALTER TABLE table_name ADD PRIMARY KEY(column_list);
8) MySQL Foreign Key
The foreign key is used to link one or more than one table together. It is also known as the
referencing key. A foreign key matches the primary key field of another table. It means a
foreign key field in one table refers to the primary key field of the other table. It identifies each
row of the other table uniquely and maintains referential integrity in MySQL.
A foreign key makes it possible to create a parent-child relationship between tables. In this
relationship, the parent table holds the initial column values, and the column values of the child table
reference the parent column values. MySQL allows us to define a foreign key constraint on
the child table.
Syntax: Following are the basic syntax used for defining a foreign key using
CREATE TABLE OR ALTER TABLE statement in the MySQL:
[CONSTRAINT constraint_name]
FOREIGN KEY [foreign_key_name] (col_name,…)
REFERENCES parent_tbl_name (col_name, ...)
ON DELETE referenceOption
ON UPDATE referenceOption

TO CREATE DIMENSION TABLES:
1. PORT
2. CONTAINER
3. CARRIER
4. TIME

TO CREATE FACT TABLE:
1. fact_port

DATABASE:- port created


mysql> CREATE DATABASE port;
Query Ok, 0 row affected (0.01 sec)
mysql> use port;
Database changed.

CREATE TABLE port( port_key INT AUTO_INCREMENT PRIMARY KEY, port_name char(20),
port_location char(20), country_name char(20), city_name char(20));
desc port;
INSERT INTO port VALUES (100, 'JNPT', 'MAHARASHTRA', 'INDIA', 'MUMBAI');
INSERT INTO port VALUES (101, 'KANDLA', 'GUJRAT', 'INDIA', 'GANDHIDHAM ');
INSERT INTO port VALUES (102, 'MUMBAI_PORT', 'MAHARASHTRA', 'INDIA', 'MUMBAI');
INSERT INTO port VALUES (103, 'MORMUGAO_PORT', 'GOA', 'INDIA', 'MORMUGAO');
INSERT INTO port VALUES (104, 'COCHIN_PORT', 'KERELA', 'INDIA', 'KOCHI');
SELECT * FROM port;



CREATE TABLE container(container_key INT PRIMARY KEY, type VARCHAR(20), size
VARCHAR(20), weight VARCHAR(20), owner_name VARCHAR(20));
DESC container;
INSERT INTO container VALUES (001, 'dry_container', '20f', '2mt', 'Tanay');
INSERT INTO container VALUES(002, 'cold_container', '40f', '4mt', 'Adarsh');
INSERT INTO container VALUES (003, 'dry_container', '40f', '4mt', 'Ayush');
INSERT INTO container VALUES (004, 'cold_container', '20f', '2mt', 'Ashu');
INSERT INTO container VALUES (005, 'flat_rack', '20f', '3mt', 'Sakshi');
SELECT * FROM container;

CREATE TABLE carrier( carrier_key INT PRIMARY KEY, carrier_name char(20), product_name
varchar(20), package_type varchar(20), brand_name varchar(20), product_category
varchar(20));
Desc carrier;

INSERT INTO carrier VALUES (200, 'Tanay', 'Electronics', 'fiberboard', 'ASUS', 'Laptops');
INSERT INTO carrier VALUES (201, 'Adarsh', 'Sea_food', 'polystyrene_box', 'Aqua', 'Tuna');
INSERT INTO carrier VALUES (202, 'Harsh', 'Mining_Product', 'Container', 'Coal_India', 'coal');
INSERT INTO carrier VALUES (203, 'Tanmay', 'Machineries', 'Secure_belts', 'BHEL',
'Generator');
SELECT * FROM carrier;

CREATE TABLE time( time_key TIMESTAMP PRIMARY KEY, date varchar(20), week int(20),
month int(20), year int(20));
Desc time;

INSERT INTO time VALUES ('2023-2-15', '15th February',15,2,2023);


INSERT INTO time VALUES ('2022-6-12', '12th June',12,6,2022);
INSERT INTO time VALUES ('2023-12-10', '10th December',10,12,2023);
INSERT INTO time VALUES ('2024-07-08', '8th August',2,7,2024);

SELECT * FROM time;


CREATE TABLE FACT_PORT (port_key int references port(port_key), container_key int
references container(container_key), carrier_key int references carrier(carrier_key),
handling_time varchar(20), revenue varchar(20), freight_cost varchar(20), time_key
TIMESTAMP REFERENCES time(time_key), PRIMARY KEY(port_key, container_key,
carrier_key));
DESC FACT_PORT;

INSERT INTO FACT_PORT VALUES (100,3,201,50,30000, 60000, '2024-07-08 01:00:00');


INSERT INTO FACT_PORT VALUES (100, 4,202,40, 10000, 64000, '2024-05-03 09:09:00');
INSERT INTO FACT_PORT VALUES (104,5,203,20,15000,45000, '2023-06-06 12:05:00');
INSERT INTO FACT_PORT VALUES (102,3,201,15,25000, 70000, '2023-01-01 11:10:00');
INSERT INTO FACT_PORT VALUES (104,5,204,25,20000, 80000, '2022-09-03 02:15:00');
INSERT INTO FACT_PORT VALUES (102,4,202,30,35000,90000, '2022-08-05 17:20:00');
INSERT INTO FACT_PORT VALUES (101,1,200,15,30000, 80000, '2022-06-06 11:20:00');
INSERT INTO FACT_PORT VALUES (103,2,200,35,8000,55000, '2021-12-02 12:20:00');
SELECT * FROM FACT_PORT;

Conclusion: In this experiment we learned, implementation of all dimension table and fact
table for ‘PORT OPERATION SYSTEM’.
EXPERIMENT NO 4
AIM: Implementation of OLAP operations: i) Roll Up, ii) Drill Down, iii) Slice, iv) Dice, v) Pivot.

Theory:

OLAP:
OLAP (Online Analytical Processing) in data warehousing enables quick, multidimensional analysis of
large datasets. It allows users to perform complex queries, analyze trends, and extract insights by
viewing data across various dimensions, like time, geography, or product. OLAP tools support
operations such as slicing, dicing, and pivoting for detailed examination. In data mining, patterns,
correlations, or trends are identified from large datasets using techniques like clustering, classification,
or association. Both OLAP and data mining help businesses make data-driven decisions by revealing
insights from stored data.

OLAP Operations:

 Roll-up Operation:
The Roll-up operation in OLAP aggregates data by climbing up a hierarchy or reducing
dimensions. It moves from detailed data to summarized data, such as going from daily sales to
monthly sales. Roll-up can be performed by either grouping data by higher-level dimensions
or removing one or more dimensions.

Syntax: SELECT dimension_1, dimension_2, SUM(measure) FROM table_name GROUP BY


dimension_1, dimension_2 WITH ROLLUP;

 Drill-down Operation:

Drill-down in OLAP is an operation that allows users to navigate from a summary view to more
detailed data by moving down a data hierarchy. For example, from "Year" to "Month" to "Day."
It helps in gaining deeper insights by exploring finer data granularity.

Syntax: Select Dimension1, Dimension2, Sum(measure)/Count(*) OVER (PARTITION BY


dimension1, dimension2) FROM table_name JOIN table_name GROUP BY
dimension1,dimension2;

 Slice Operation:

In OLAP, the slice operation selects a single dimension from a multidimensional cube, creating
a new sub-cube by fixing a value for that dimension. This reduces the cube's dimensionality
and allows focused analysis. For example, if a cube has dimensions for Time, Location, and
Product, applying a slice on Time (e.g., "Time = 2023") will display data for 2023 across all
other dimensions.

Syntax: SELECT measures FROM cube WHERE dimension = value;

 Dice Operation:
The dice operation in OLAP allows users to focus on a specific subset of data by selecting
multiple dimensions and applying conditions on them. It refines the data cube to show only
the relevant slice based on multiple dimension values.

Syntax: SELECT [measure] FROM [cube] WHERE [dimension1 condition] AND [dimension2
condition] AND [dimension3 condition];

 Pivot Operation:

The pivot operation in OLAP rotates data dimensions to view them from different perspectives,
swapping rows and columns. For example, if sales data is shown by region and year, pivoting
could switch the view to display year by product instead. This allows users to analyze
relationships and trends across various dimensions (a small illustrative sketch follows this list).

Syntax: Select [dimension1] ,SUM(CASE WHEN [condition1] THEN [measure] END), SUM(CASE
WHEN [condition2] THEN [measure] END),….. SUM(CASE WHEN [condition_n] THEN
[measure] END) FROM (Select dimension1, dimension2 FROM table_name) GROUP BY
[dimension];
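Outside a SQL engine, the five operations above can also be illustrated on a small in-memory table. The sketch below assumes pandas is available and uses an invented toy fact table; it is only an analogy to the MySQL queries that follow, not part of the required setup:

import pandas as pd

# Toy fact table: one row per (year, month, port) with a revenue measure (invented values).
fact = pd.DataFrame({
    "year":     [2022, 2022, 2022, 2023, 2023, 2023],
    "month":    [6,    8,    8,    1,    6,    6],
    "port_key": [101,  102,  102,  102,  104,  104],
    "revenue":  [30000, 35000, 90000, 25000, 15000, 45000],
})

# Roll-up: aggregate from (year, month) detail up to yearly totals.
rollup = fact.groupby("year")["revenue"].sum()

# Drill-down: go back to finer granularity, year -> month.
drilldown = fact.groupby(["year", "month"])["revenue"].sum()

# Slice: fix one dimension (year = 2022).
slice_2022 = fact[fact["year"] == 2022]

# Dice: fix several dimensions at once (year = 2022 and port 102).
dice = fact[(fact["year"] == 2022) & (fact["port_key"] == 102)]

# Pivot: rotate the view so ports become rows and months become columns.
pivot = fact.pivot_table(index="port_key", columns="month",
                         values="revenue", aggfunc="sum", fill_value=0)

print(rollup, drilldown, slice_2022, dice, pivot, sep="\n\n")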

OLAP OPERATIONS SYNTAX AND OUTPUT:

 ROLL-UP OPERATION:
SYNTAX:
SELECT port.port_location, SUM(fact_port.revenue) AS total_revenue FROM fact_port JOIN
port ON fact_port.port_key = port.port_key GROUP BY port.port_location WITH rollup;
 DRILL-DOWN OPERATION:
SYNTAX:
SELECT port_key, EXTRACT(YEAR FROM time_key) AS year, SUM(revenue) AS total_revenue
FROM fact_port GROUP BY port_key, year ORDER BY port_key;

 SLICE OPERATION:
SYNTAX:
SELECT port_key, date(time_key) as date, revenue FROM fact_port WHERE port_key in(102);

 DICE OPERATION
SYNTAX:
SELECT date(time_key) AS date, revenue FROM fact_port WHERE EXTRACT(YEAR
FROM time_key)=2022 AND EXTRACT(MONTH FROM time_key)=8 AND port_key=102;
 PIVOT OPERATION:
SYNTAX:
SELECT container_key,
SUM(CASE WHEN EXTRACT(MONTH FROM time_key)=6 THEN revenue ELSE 0 END) AS June,
SUM(CASE WHEN EXTRACT(MONTH FROM time_key)=9 THEN revenue ELSE 0 END) AS September,
SUM(CASE WHEN EXTRACT(MONTH FROM time_key)=10 THEN revenue ELSE 0 END) AS October,
SUM(CASE WHEN EXTRACT(MONTH FROM time_key)=12 THEN revenue ELSE 0 END) AS December
FROM fact_port WHERE container_key IN (1, 5) GROUP BY container_key;

Conclusion: Thus, we have successfully implemented all the OLAP operation.


EXPERIMENT NO 5
AIM: Implement classifiers using Weka.
Theory:
Classification is a task in data mining that involves assigning a class label to each instance in a
dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as "spam" or "not
spam", while multi-class classification involves classifying instances into more than two
classes.
Different types of classifiers:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
ID3:
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where
each internal node tests an attribute, each branch corresponds to an attribute value, and each
leaf node represents the final decision or prediction.

• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.

Working of Algorithm:
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches the leaf node of the tree.

• Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
• Step-3: Divide the S into subsets that contains possible values for the best attributes.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; the final node is then called a leaf node (a minimal coded sketch follows below).
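Weka drives this procedure from its GUI (e.g., the J48/ID3 classifiers), but the same top-down splitting can be sketched in code. The snippet below is a minimal illustration assuming scikit-learn is available; the tiny weather-style dataset and its integer encoding are invented for demonstration only:

# Minimal decision-tree sketch, assuming scikit-learn is installed.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook, humidity] encoded as integers
# (outlook: 0=sunny, 1=overcast, 2=rain; humidity: 0=normal, 1=high) -- invented toy data.
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0], [0, 1], [2, 1]]
y = [0, 1, 1, 0, 1, 1, 0, 0]   # 1 = play, 0 = don't play

# criterion="entropy" uses the same information-gain idea as ID3.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(export_text(clf, feature_names=["outlook", "humidity"]))
print("Prediction for overcast/normal:", clf.predict([[1, 0]]))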
1. Information Gain: Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about a class.

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]


2. Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,
• S= Total number of samples
• P(yes)= probability of yes
• P(no)= probability of no
3. Gini Index:

• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with the low Gini index should be preferred as compared to the high Gini
index.
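To make these three measures concrete, a minimal Python sketch (separate from the Weka workflow, using made-up class counts for a hypothetical split) is shown below:

from math import log2

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    return -sum(p * log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes, p_no):
    # Gini index = 1 - (sum of squared class probabilities)
    return 1 - (p_yes ** 2 + p_no ** 2)

# Hypothetical split: parent node has 9 "yes" / 5 "no" samples;
# one child gets 6 "yes" / 2 "no", the other gets 3 "yes" / 3 "no"
parent = entropy(9 / 14, 5 / 14)
weighted_children = (8 / 14) * entropy(6 / 8, 2 / 8) + (6 / 14) * entropy(3 / 6, 3 / 6)
information_gain = parent - weighted_children

print(round(parent, 3), round(information_gain, 3), round(gini(9 / 14, 5 / 14), 3))

The attribute producing the split with the highest information gain (or the lowest weighted Gini index) is the one the classifier would test first.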

Bayesian Classifiers:
Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers based on Bayesian probability. The theorem expresses
how a level of belief, expressed as a probability, should change to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.
P(X/Y)=P(Y/X)P(X)/ P(Y)
Where X and Y are events and P(Y) ≠ 0.
P(X/Y) is a conditional probability that describes the occurrence of event X is given that Y is
true. P(Y/X) is a conditional probability that describes the occurrence of event Y is given that
X is true. P(X) and P(Y) are the probabilities of observing X and Y independently of each other.
This is known as the marginal probability.
Bayesian interpretation:
In the Bayesian interpretation, probability determines a "degree of belief." Bayes theorem
connects the degree of belief in a hypothesis before and after accounting for evidence.
For example, let us consider a coin toss. If we toss a coin, we get either
heads or tails, and the probability of either outcome is 50%. If the coin is
flipped a number of times and the outcomes are observed, the degree of belief may rise, fall,
or remain the same depending on the outcomes.

• P(X), the prior, is the primary degree of belief in X.


• P(X/Y), the posterior, is the degree of belief after accounting for Y.
• The quotient P(Y/X)/P(Y) represents the support Y provides for X.
Bayes theorem can be derived from the conditional probability:
P(X/Y) = P(X∩Y)/P(Y), if P(Y) ≠ 0
P(Y/X) = P(Y∩X)/P(X), if P(X) ≠ 0

Where P(X∩Y) is the joint probability of both X and Y being true, because
P (Y∩X) = P(X∩Y)
or, P (X∩Y) = P(X/Y)P(Y) = P(Y/X)P(X)
or, P(X/Y) =P(Y/X)P(X)/ P(Y), if P(Y) ≠ 0
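A quick numeric check of the theorem, using hypothetical figures chosen only for illustration:

# Hypothetical values: P(X) = prior, P(Y/X) = likelihood, P(Y) = evidence
p_x, p_y_given_x, p_y = 0.01, 0.9, 0.05

# Bayes theorem: P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(p_x_given_y)   # 0.18, the posterior belief in X after observing Y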
Bayesian network:
A Bayesian network is a probabilistic graphical model (PGM) used to compute uncertainties by means of the probability concept.
Generally known as Belief Networks, Bayesian Networks are used to show uncertainties using
Directed Acyclic Graphs (DAG).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection
between the nodes.

Conclusion: Hence, we have studied and implemented Classifiers using Weka.


Output:
Dataset:
Naïve Bayes on the same dataset:
EXPERIMENT NO. 6
AIM: Implement Clustering Using Weka.
Theory:
Clustering is an unsupervised machine learning technique that groups data points into clusters so
that similar objects belong to the same group. Clustering splits the data into several subsets;
each subset contains data similar to each other, and
these subsets are called clusters. Weka is commonly used in educational settings to teach data
mining and machine learning concepts due to its user-friendly interface and rich set of
algorithms.
We studied two types of clustering algorithms in this experiment using the Weka tool.
1. K-means Clustering.
2. Hierarchical Clustering.
K-Means Clustering is an unsupervised learning algorithm which groups the unlabelled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process: if K = 2, there will be two clusters, for K = 3, there will be three
clusters, and so on. It allows us to cluster the data into different groups and is a convenient way
to discover the categories of groups in the unlabelled dataset on its own, without the need for
any training. It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding clusters. The algorithm takes the unlabelled dataset as input, divides
the dataset into k clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.
Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids (they can be points other than those from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: Stop

Advantages:
Simplicity: K-means is easy to understand and implement. It is straightforward to grasp the basic
concept of partitioning data into clusters.
Efficiency: K-means is computationally efficient and works well with large datasets. It's faster
than many other clustering algorithms.
Disadvantages:
Sensitive to Initial Centroid Placement: K-means is sensitive to the initial placement of
centroids. Depending on where the initial centroids are chosen, it can lead to different results.
This sensitivity can lead to suboptimal clustering, and you might need to run the algorithm
multiple times with different initializations.
Hierarchical Clustering:
Hierarchical clustering is a widely used method in data mining and statistics for cluster
analysis, which aims to build a hierarchy of clusters. This approach can be particularly useful
for exploratory data analysis, allowing researchers to visualize how data points are grouped
based on their similarities.
Agglomerative Hierarchical Clustering:
This is the most common method, which follows a bottom-up approach. It starts with each
data point as its own cluster and iteratively merges the closest pairs of clusters until only one
cluster remains or a specified number of clusters is achieved. The results are often represented
in a tree-like structure known as a dendrogram.
Algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
Distance between two clusters
1. Single-link distance between clusters Ci and Cj is the minimum distance between any
object in Ci and any object in Cj.
2. Complete-link distance between clusters Ci and Cj is the maximum distance between
any object in Ci and any object in Cj.
3. Average-link distance between clusters Ci and Cj is the average distance between any
object in Ci and any object in Cj.
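To make the three linkage distances concrete, a small Python sketch (illustrative only, separate from the Weka exercise, with two made-up clusters) is given below:

import numpy as np

# Two hypothetical clusters Ci and Cj of 2D points
Ci = np.array([[1.0, 1.0], [2.0, 1.0]])
Cj = np.array([[4.0, 3.0], [5.0, 4.0]])

# Matrix of pairwise Euclidean distances between points of Ci and Cj
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

print("Single link (min):  ", pairwise.min())
print("Complete link (max):", pairwise.max())
print("Average link (mean):", pairwise.mean())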
Advantages:
1. No prior information about the number of clusters required.
2. Easy to implement and gives best result in some cases.

Disadvantage:
1. Algorithm can never undo what was done previously.
2. Time complexity of at least O (n log n) is required, where 'n' is the number of data points.

CONCLUSION: So, we have successfully implemented Clustering using WEKA.


OUTPUT:
DATASET:
2D Clustering:
DATASET:
HIERARCHICAL CLUSTERING:
EXPERIMENT NO. 7
Aim: Implement Association Rule mining algorithms using Weka.
Theory:
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps accordingly so that it can be
more profitable. It tries to find some interesting relations or associations among the variables
of dataset. It is based on different rules to discover the interesting relations between variables
in the database.
The association rule learning is one of the very important concepts of machine learning, and
it is employed in Market Basket analysis, Web usage mining, continuous production, etc. Here
market basket analysis is a technique used by the various big retailer to discover the
associations between items. We can understand it by taking an example of a supermarket, as
in a supermarket, all products that are purchased together are put together.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
The Apriori algorithm is used to mine frequent item sets and the relevant association rules.
Generally, the Apriori algorithm operates on a database containing a huge number of transactions,
for example, the items customers buy at a Big Bazaar.
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Support:
Support refers to the default popularity of any product. You find the support as a quotient of
the division of the number of transactions comprising that product by the total number of
transactions.
Confidence:
Confidence refers to the possibility that the customers bought both biscuits and chocolates
together. So, you need to divide the number of transactions that comprise both biscuits and
chocolates by the total number of transactions to get the confidence.
Lift:
Consider the above example; lift refers to the increase in the ratio of the sale of chocolates
when you sell biscuits. The mathematical equations of lift are given below.
Lift(X → Y) = Confidence(X → Y) / Support(Y)
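The three measures can be checked with a short Python sketch (hypothetical basket data, shown only for illustration) for the rule biscuits -> chocolates:

transactions = [
    {"biscuits", "chocolates", "milk"},
    {"biscuits", "chocolates"},
    {"biscuits"},
    {"chocolates", "milk"},
    {"biscuits", "chocolates"},
]
n = len(transactions)
support_b  = sum("biscuits" in t for t in transactions) / n
support_c  = sum("chocolates" in t for t in transactions) / n
support_bc = sum({"biscuits", "chocolates"} <= t for t in transactions) / n

confidence = support_bc / support_b   # P(chocolates | biscuits)
lift = confidence / support_c         # > 1 means the items co-occur more often than by chance
print(support_bc, confidence, lift)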
Apriori Algorithm:
Steps for association rule mining
Step 1: Data Collection
Gather transaction data, where each transaction lists items bought together.
Step 2: Data Preprocessing
Prepare the data, ensuring items and transactions are well-defined and properly structured.
Step 3: Itemset Generation
Create a list of frequent items (items that occur above a minimum support threshold).
Step 4: Rule Generation
Generate association rules from frequent item sets by examining itemset combinations.
Step 5: Rule Evaluation
Evaluate rules based on metrics like confidence and lift to find meaningful associations.
Step 6: Visualization and Interpretation
Visualize and interpret the discovered association rules for insights.
Association rule mining aims to discover patterns in data, such as "if X, then Y," and is widely
used in retail and recommendation systems.
Advantages of Apriori Algorithm:

• It is used to calculate large item sets.


• Simple to understand and apply.
Disadvantages of Apriori Algorithm:

• Apriori algorithm is an expensive method to find support since the calculation has to
pass through the whole database.
• Sometimes, you need a huge number of candidate rules, so it becomes
computationally more expensive.

Conclusion: Hence, we have studied and implemented Association rule mining.


OUTPUT:
DATASET:
EXPERIMENT 8
AIM: Implement Data Exploration Techniques using R. (i) Histogram, (ii) Box-Plot, (iii) Scatter
plot.
Theory:
Data exploration techniques include both manual analysis and automated data exploration
software solutions that visually explore and identify relationships between different data
variables, the structure of the dataset, the presence of outliers, and the distribution of data
values to reveal patterns and points of interest, enabling data analysts to gain greater insight
into the raw data.

Box Plot:
Boxplots are a measure of how well data is distributed across a data set. They divide the data
set into three quartiles. The graph represents the minimum, maximum, median, first quartile,
and third quartile of the data set. Boxplots are also useful for comparing the distribution of
data across data sets by drawing a boxplot for each of them.
A box plot gives a five-number summary of a set of data which is-

• Minimum - It is the minimum value in the dataset excluding the outliers


• First Quartile (Q1) -25% of the data lies below the First (lower) Quartile.
• Median (Q2) - It is the mid-point of the dataset. Half of the values lie below it and half
above.
• Third Quartile (Q3) -75% of the data lies below the Third (Upper) Quartile.
• Maximum - It is the maximum value in the dataset excluding the outliers.
Advantages:

• A box plot is a good way to summarize large amounts of data.


• It displays the range and distribution of data along a number line.
• Box plots provide some indication of the data's symmetry and skew-ness.
• Box plots show outliers.
Scatter Plot:
A scatterplot is a type of data display that shows the relationship between two numerical
variables. Each member of the dataset gets plotted as a point whose (x, y) coordinates relate
to its values for the two variables.
For example, we can make a scatterplot that shows the shoe sizes and quiz scores for students
in a class.
A scatterplot is used to represent a correlation between two variables
Positive correlation
When the y variable tends to increase as the x variable increases, we say there is a positive
correlation between the variables.

Negative correlation
When the y variable tends to decrease as the x variable increases, we say there is a negative
correlation between the variables.

When there is no clear relationship between the two variables, we say there is no correlation
between the two variables.
Histogram:
A histogram is a graphical representation of the distribution of numerical data, in which the
data is grouped into ranges (bins) and each bin is drawn as a rectangular bar. Two of its
elements are:
• Scale:
The dataset on the graph is measured or quantified by a set of numbers called the
histogram's scale. The width and height of each rectangular bar on the histogram chart
are partially determined by this.

• Frequency polygons:
A frequency polygon is a graph obtained by joining the top midpoints of the histogram's
rectangular bars; it is typically used to visualize a dataset of continuous variables.
Applications of Histogram Graph

• Finding the Mode in a Dataset:


The most frequent result in a dataset can be quickly found without using
intricate mathematical calculations. The highest-frequency result stands out as the
graph's peak when the collected data is visualized on a histogram chart.

• Recognizing the data structure:


When examining a histogram chart, trends in the data are simple to identify. This can
be useful for predicting outcomes, streamlining procedures, and spotting potential
problems.

• Identifying data variations:


Unlike other data visualization techniques, utilizing a histogram makes it simple to
identify data variances. When you are gathering data over time, this is quite helpful.

Conclusion: Hence, we have studied and implemented data exploration techniques in R.


EXPERIMENT 9
AIM: Implement Data Preprocessing techniques using R.
Theory:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between a dependent variable and one or more independent
features.

When the number of independent features is 1, it is known as univariate linear
regression, and in the case of more than one feature, it is known as multivariate linear
regression. The goal of the algorithm is to find the best linear equation that can predict the
value of the dependent variable based on the independent variables.
The equation provides a straight line that represents the relationship between the dependent
and independent variables. The slope of the line indicates how much the dependent variable
changes for a unit change in the independent variable(s).

• Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases
on X-axis, then such a relationship is termed as a Positive linear relationship.
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent variable increases
on the X-axis, then such a relationship is called a negative linear relationship.

Assumptions of Linear Regression

• Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.

• Small or no multicollinearity between the features:


Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors
and target variables. Or we can say, it is difficult to determine which predictor variable
is affecting the target variable and which is not. So, the model assumes either little or
no multicollinearity between the features or independent variables.
• Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.

• Normal distribution of error terms:


Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients. This can be checked using a Q-Q plot: if the plot shows a straight line
without any deviation, the error terms are normally distributed.

• No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the
model. Autocorrelation usually occurs if there is a dependency between residual errors.

Conclusion: Hence, we have studied and implemented Data Preprocessing techniques.


EXPERIMENT 10
AIM: Implement any one classification algorithm using python (ID3 & Naïve Bayes).
Theory:
Classification is a task in data mining that involves assigning a class label to each instance in a
dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as "spam" or "not
spam", while multi-class classification involves classifying instances into more than two
classes.
Different types of classifiers:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
ID3:
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where
each internal node represents a test on an attribute, each branch corresponds to an attribute value, and each
leaf node represents the final decision or prediction.

• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
Working of Algorithm:
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues this process until it reaches a leaf node of the tree.

• Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values for the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created
in step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; the final node is then called a leaf node.
1. Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
It calculates how much information a feature provides us about a class.
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
2. Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,

• S= Total number of samples


• P(yes)= probability of yes
• P(no)= probability of no
3. Gini Index:

• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with the low Gini index should be preferred as compared to the high Gini
index.
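Since the aim also mentions ID3, a minimal sketch of its attribute-selection step is included below (illustrative only; it reuses the Play Tennis data from the Naive Bayes code that follows and picks the root attribute with the highest information gain):

from math import log2

# Play Tennis data: (Outlook, Temperature, Humidity, Wind, class label)
rows = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rainy", "mild", "high", "weak", "yes"),
    ("rainy", "cool", "normal", "weak", "yes"), ("rainy", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rainy", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rainy", "mild", "high", "strong", "no"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def info_gain(index):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in set(r[index] for r in rows):
        subset = [r[-1] for r in rows if r[index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

gains = {a: info_gain(i) for i, a in enumerate(attributes)}
print(gains)                                  # Outlook has the largest gain (about 0.25)
print("Root attribute:", max(gains, key=gains.get))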
Bayesian Classifiers:
Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers based on Bayesian probability. The theorem expresses
how a level of belief, expressed as a probability, should change to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.
P(X/Y)=P(Y/X)P(X)/ P(Y)
Where X and Y are events and P(Y) ≠ 0.
P(X/Y) is a conditional probability that describes the occurrence of event X is given that Y is
true. P(Y/X) is a conditional probability that describes the occurrence of event Y is given that
X is true. P(X) and P(Y) are the probabilities of observing X and Y independently of each other.
This is known as the marginal probability.
Bayesian interpretation:
In the Bayesian interpretation, probability determines a "degree of belief." Bayes theorem
connects the degree of belief in a hypothesis before and after accounting for evidence.
For example, let us consider a coin toss. If we toss a coin, we get either
heads or tails, and the probability of either outcome is 50%. If the coin is
flipped a number of times and the outcomes are observed, the degree of belief may rise, fall,
or remain the same depending on the outcomes.

• P(X), the prior, is the primary degree of belief in X.


• P(X/Y), the posterior, is the degree of belief after accounting for Y.
• The quotient P(Y/X)/P(Y) represents the support Y provides for X.
Bayes theorem can be derived from the conditional probability:
P(X/Y) = P(X∩Y)/P(Y), if P(Y) ≠ 0
P(Y/X) = P(Y∩X)/P(X), if P(X) ≠ 0

Where P(X∩Y) is the joint probability of both X and Y being true, because
P (Y∩X) = P(X∩Y)
or, P (X∩Y) = P(X/Y)P(Y) = P(Y/X)P(X)
or, P(X/Y) =P(Y/X)P(X)/ P(Y), if P(Y) ≠ 0
Bayesian network:
A Bayesian network is a probabilistic graphical model (PGM) used to compute uncertainties by means of the probability concept.
Generally known as Belief Networks, Bayesian Networks are used to show uncertainties using
Directed Acyclic Graphs (DAG).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection
between the nodes.

Conclusion: Hence, we have studied and implemented Classifiers using python.


Code:
import pandas as pd

# Sample data
data = pd.DataFrame({
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny',
'rainy', 'sunny', 'overcast', 'overcast', 'rainy'],
'Temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool', 'mild', 'cool', 'mild', 'mild', 'mild',
'hot', 'mild'],
'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'normal',
'normal', 'high', 'normal', 'high'],
'Wind': ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'weak', 'weak',
'strong', 'strong', 'weak', 'strong'],
'Play Tennis': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
})

class NaiveBayesClassifier:
def __init__(self):
self.class_probabilities = {} # Probability of each class (e.g., "yes" or "no")
self.feature_probabilities = {} # Conditional probabilities of features given class

def fit(self, data, target_column):


# Calculate class probabilities
class_counts = data[target_column].value_counts()
total_samples = len(data)

for class_label, count in class_counts.items():


self.class_probabilities[class_label] = count / total_samples

# Calculate conditional probabilities for each feature given the class


for feature in data.columns:
if feature == target_column:
continue

self.feature_probabilities[feature] = {}
for class_label in self.class_probabilities:
class_data = data[data[target_column] == class_label]
feature_counts = class_data[feature].value_counts()
total_samples_in_class = len(class_data)

self.feature_probabilities[feature][class_label] = {
feature_value: count / total_samples_in_class
for feature_value, count in feature_counts.items()
}

def predict(self, input_data):


class_predictions = {}

for class_label in self.class_probabilities:


class_probability = self.class_probabilities[class_label]

for feature, feature_value in input_data.items():


if feature in self.feature_probabilities:
feature_given_class = self.feature_probabilities[feature][class_label]
if feature_value in feature_given_class:
class_probability *= feature_given_class[feature_value]
else:
class_probability *= 0

class_predictions[class_label] = class_probability

# Normalize probabilities
total_probability = sum(class_predictions.values())
for class_label in class_predictions:
class_predictions[class_label] /= total_probability

# Determine the class with the highest probability


best_class = max(class_predictions, key=class_predictions.get)

return best_class, class_predictions

# Example usage
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(data, 'Play Tennis')

# Predict whether to play tennis for a new set of conditions


new_conditions = {
'Outlook': 'sunny',
'Temperature': 'cool',
'Humidity': 'high',
'Wind': 'strong'
}
prediction, probabilities = nb_classifier.predict(new_conditions)
print("Class Probabilities:")
for class_label, probability in probabilities.items():
print(f"{class_label}: {probability:.2f}")
print(f"Prediction: {prediction}")

OUTPUT:
EXPERIMENT NO: 11

AIM: Implement clustering using python (K-Means & Hierarchical).


Theory:

• K-Means Clustering Algorithm:


K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science. In this topic, we will learn
what the K-means clustering algorithm is and how it works, along with the
Python implementation of K-means clustering.

K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabelled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process: if K = 2, there will be two clusters,
for K = 3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabelled dataset into k different clusters
in such a way that each data point belongs to only one group that has similar properties.

It allows us to cluster the data into different groups and a convenient way to
discover the categories of groups in the unlabelled dataset on its own without the
need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid.


The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.

The algorithm takes the unlabelled dataset as input, divides the dataset into k
number of clusters, and repeats the process until it finds the best clusters.
The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:


1. Determines the best value for K centre points or centroids by an iterative
process.
2. Assigns each data point to its closest k-centre. Those data points which are
near to the particular k-centre, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from
other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids (they can be points other than those from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Hierarchical Clustering:
Hierarchical clustering is a widely used method in data mining and statistics for cluster analysis,
which aims to build a hierarchy of clusters. This approach can be particularly useful for
exploratory data analysis, allowing researchers to visualize how data points are grouped based
on their similarities.
Agglomerative Hierarchical Clustering:
This is the most common method, which follows a bottom-up approach. It starts with each data
point as its own cluster and iteratively merges the closest pairs of clusters until only one cluster
remains or a specified number of clusters is achieved. The results are often represented in a
tree-like structure known as a dendrogram.
Algorithm:
1.Compute the distance matrix between the input data points
2. Let each data point be a cluster
3.Repeat
4.Merge the two closest clusters
5.Update the distance matrix
6. Until only a single cluster remains
Distance between two clusters
1. Single-link distance between clusters Ci and Cj is the minimum distance between any
object in Ci and any object in Cj.
2. Complete-link distance between clusters Ci and Cj is the maximum distance between
any object in Ci and any object in Cj.
3. Average-link distance between clusters Ci and Cj is the average distance between any
object in Ci and any object in Cj.

Advantages:
1. No prior information about the number of clusters required.
2. Easy to implement and gives best result in some cases.
Disadvantage:
1. Algorithm can never undo what was done previously.
2. Time complexity of at least O (n log n) is required, where 'n' is the number of data points.

CONCLUSION: Hence, we implemented and studied Clustering with python. (K-Means


&Hierarchical).

K-MEANS CLUSTERING:
1. 1-D CLUSTERING:
Code:
# K-means clustering algorithm (1D)
# Import the necessary libraries
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Sample 1D data
data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25])

# Reshape the data to a 2D array of shape (n_samples, 1)
data = data.reshape(-1, 1)

# Define the number of clusters (K)
k = 2

# Create a KMeans instance
kmeans = KMeans(n_clusters=k)

# Fit the data to the KMeans model
kmeans.fit(data)

# Get cluster labels for each data point
labels = kmeans.labels_

# Get cluster centres
centroids = kmeans.cluster_centers_

# Plot the data points and cluster centres
plt.scatter(data, np.zeros_like(data), c=labels)
plt.scatter(centroids, np.zeros_like(centroids), s=200, marker='x', c='red')
plt.xlabel('X-axis')
plt.title('K-means Clustering (1D Data)')
plt.show()

Output:

2. 2-D CLUSTERING: Code:


# Import the necessary libraries
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Sample 2D data
data = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])

# Define the number of clusters (K)
k = 2

# Create a KMeans instance
kmeans = KMeans(n_clusters=k)

# Fit the data to the KMeans model
kmeans.fit(data)

# Get cluster labels for each data point
labels = kmeans.labels_

# Get cluster centers
centroids = kmeans.cluster_centers_

# Plot the data points and cluster centers
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, marker='X', c='red')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('K-means Clustering (2D Data)')
plt.show()

OUTPUT:
HIERARCHICAL CLUSTERING:

1. Average Linkage Method:


import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Symmetric distance matrix between points A, B, C, D, E
mat = np.array([[0, 1, 1.41, 3.61, 5],
                [1, 0, 1, 3.16, 4.47],
                [1.41, 1, 0, 2.24, 3.61],
                [3.61, 3.16, 2.24, 0, 1.41],
                [5, 4.47, 3.61, 1.41, 0]])
distance = squareform(mat)
linkage_matrix = linkage(distance, "average")
dendrogram(linkage_matrix, labels=["A", "B", "C", "D", "E"])
plt.title("Average Linkage Method")
plt.show()

OUTPUT:

2. COMPLETE LINKAGE METHOD:


Code:
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Symmetric distance matrix between points A, B, C, D, E
mat = np.array([[0, 1, 1.41, 3.61, 5],
                [1, 0, 1, 3.16, 4.47],
                [1.41, 1, 0, 2.24, 3.61],
                [3.61, 3.16, 2.24, 0, 1.41],
                [5, 4.47, 3.61, 1.41, 0]])
distance = squareform(mat)
linkage_matrix = linkage(distance, "complete")
dendrogram(linkage_matrix, labels=["A", "B", "C", "D", "E"])
plt.title("Complete Linkage Method")
plt.show()

OUTPUT:

3. SINGLE LINKAGE METHOD: Code:


import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Symmetric distance matrix between points A, B, C, D, E
mat = np.array([[0, 1, 1.41, 3.61, 5],
                [1, 0, 1, 3.16, 4.47],
                [1.41, 1, 0, 2.24, 3.61],
                [3.61, 3.16, 2.24, 0, 1.41],
                [5, 4.47, 3.61, 1.41, 0]])
distance = squareform(mat)
linkage_matrix = linkage(distance, "single")
dendrogram(linkage_matrix, labels=["A", "B", "C", "D", "E"])
plt.title("Single Linkage Method")
plt.show()

OUTPUT:
EXPERIMENT 12
AIM: To study and implement Association Rule mining algorithms using python (Apriori).
Theory:
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps accordingly so that it can be
more profitable. It tries to find some interesting relations or associations among the
variables of dataset. It is based on different rules to discover the interesting relations
between variables in the database.
The association rule learning is one of the very important concepts of machine learning, and
it is employed in Market Basket analysis, Web usage mining, continuous production, etc.
Here market basket analysis is a technique used by the various big retailer to discover the
associations between items. We can understand it by taking an example of a supermarket, as
in a supermarket, all products that are purchased together are put together.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
The Apriori algorithm is used to mine frequent item sets and the relevant association rules.
Generally, the Apriori algorithm operates on a database containing a huge number of transactions,
for example, the items customers buy at a Big Bazaar.
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Support
Support refers to the default popularity of any product. You find the support as a quotient of
the division of the number of transactions comprising that product by the total number of
transactions.
Confidence
Confidence refers to the possibility that the customers bought both biscuits and chocolates
together. So, you need to divide the number of transactions that comprise both biscuits and
chocolates by the total number of transactions to get the confidence.
Lift
Consider the above example; lift refers to the increase in the ratio of the sale of chocolates
when you sell biscuits. The mathematical equations of lift are given below.
Lift(X → Y) = Confidence(X → Y) / Support(Y)
Advantages of Apriori Algorithm

• It is used to calculate large item sets.


• Simple to understand and apply.
Disadvantages of Apriori Algorithms

• Apriori algorithm is an expensive method to find support since the calculation has to
pass through the whole database.
• Sometimes, you need a huge number of candidate rules, so it becomes
computationally more expensive.

Conclusion: Hence, we have studied and implemented Association rule mining using Python
(Apriori).
Apriori Code:
#Apriori Algorithm
from itertools import combinations
from functools import reduce
# Sample transactions dataset
transactions=[
{"Bread","Idli","Butter"},
{"Bread","Butter"},
{"Bread","Butter","Milk"},
{"Coke","Bread"},
{"Coke","Bread"}
]
def generate_itemset(transactions,n):
return [set(c) for c in combinations(reduce(lambda x,y:x|y,transactions),n)]
def support_count(transactions,item):
return sum([len(t&item)==len(item) for t in transactions])
def prune(itemset,transactions,min_support_count):
return list(filter(lambda x:support_count(transactions,x)>=min_support_count,itemset))
def create_rules(itemsets):
    # Each rule is (antecedent, consequent): antecedent c, consequent = itemset minus c
    return [(set(c), set(itemset) - set(c)) for itemset in itemsets
            for i in range(1, len(itemset)) for c in combinations(itemset, i)]
def confidence(rule, transactions):
    # confidence(X -> Y) = support(X U Y) / support(X), where rule = (X, Y)
    return support_count(transactions, rule[0] | rule[1]) / support_count(transactions, rule[0])

min_support = float(input("Enter support threshold [0,1]: "))
min_confidence = float(input("Enter confidence threshold [0,1]: "))
min_support_count = min_support * len(transactions)
prev_candidate = []
curr_candidate = [0]
n = 1
while prev_candidate!=curr_candidate and curr_candidate!=[]:
print("Generating candidate of length:",n)
prev_candidate=curr_candidate.copy()
candidates=generate_itemset(transactions,n)
print("Candidates:",candidates)
curr_candidate=prune(candidates,transactions,min_support_count)
print("After pruning,current candidate:",curr_candidate)
print(","*80)
n+=1

if curr_candidate:
itemset=curr_candidate
else:
itemset=prev_candidate

print("Itemset:",itemset)
print("-"*80)
final_rules=[rule for rule in create_rules(itemset) if
confidence(rule,transactions)>=min_confidence]
print("Final rules:")
OUTPUT:
EXPERIMENT 13
AIM: Implementation of Page Rank Algorithm.
Theory:
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine
results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a
way of measuring the importance of website pages.
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.
A PageRank results from a mathematical algorithm based on the web graph, created by all
World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority
hubs.
The rank value indicates the importance of a particular page. A hyperlink to a page counts as a
vote of support. The PageRank of a page is defined recursively and depends on the number
and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by
many pages with high PageRank receives a high rank itself.
Applications:

• Search engines (websites ranking according to their PageRank; gives additional


criterion based on the quality of a page, not only its content),
• Predictions of Web traffic (estimation of users' visits count, server load, etc.),
• Optimal crawling - crawlers should visit important pages more frequently (important
page - a page with a high PageRank),
• Website navigation,
• Modelling ecosystems, protein networks.
Each link from one-page (A) to another (B) casts a so-called vote, the weight of which depends
on the collective weight of all the pages that link to page A. And we can't know their weight
till we calculate it, so the process goes in cycles.
The mathematical formula of the original PageRank is the following:
PR(A) = (1 - d)/N + d (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)
where d is the damping factor, N is the total number of pages, B, C, D, ... are the pages that
link to A, and L(X) is the number of outbound links of page X.
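Since the write-up above only gives the formula, a minimal Python sketch of the iteration is included here (an illustrative example on a small hypothetical link graph, not the official solution code):

# PR(A) = (1 - d)/N + d * sum( PR(B)/L(B) for every page B that links to A )
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    # links maps each page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                 # start from a uniform rank
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:   # converged
            return new_pr
        pr = new_pr
    return pr

# Hypothetical 4-page web graph used only for illustration
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))   # page C should receive the highest rank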

Conclusion: Hence, we have studied and implemented Page Rank Algorithm.


EXPERIMENT 14
AIM: Implementation of Hits Algorithm.
Theory:
Hyperlink Induced Topic Search (HITS) Algorithm is a Link Analysis Algorithm that rates
webpages, developed by Jon Kleinberg. This algorithm uses the web link structure to
discover and rank the webpages relevant for a particular search. HITS uses hubs and
authorities to define a recursive relationship
between webpages. Before understanding the HITS Algorithm, we first need to know about
Hubs and Authorities.
The algorithm performs a series of iterations, each consisting of two basic steps:

• Authority update: Update each node's authority score to be equal to the sum of the
hub scores of each node that points to it. That is, a node is given a high authority score
by being linked from pages that are recognized as Hubs for information.
• Hub update: Update each node's hub score to be equal to the sum of the authority
scores of each node that it points to. That is, a node is given a high hub score by linking
to nodes that are considered to be authorities on the subject.
The HITS algorithm iteratively updates the authority and hub scores until convergence is
achieved. It starts by assigning an initial authority score of 1 to all web pages.
Then, it calculates the hub score for each page based on the authority scores of the pages it
links to. Then, it updates the authority scores based on the hub scores of the pages that link
to it. This process is repeated until the scores stabilize.
HITS is applied on a subgraph after a search is done on the complete graph. Firstly, the search
is applied. Then, HITS analyses the structure of the links of the retrieved relevant pages.
The Hub score and Authority score for a node is calculated with the following algorithm:

• Start with each node having a hub score and authority score of 1. Run the authority
update rule
• Run the hub update rule
• Normalize the values by dividing each Hub score by square root of the sum of the
squares of all Hub scores, and dividing each Authority score by square root of the sum
of the squares of all Authority scores.
• Repeat from the second step as necessary.
The algorithm works iteratively, where the authority and hub scores are updated until they
converge. Here's a simplified step-by-step process:

• Initialization: Assign equal hub and authority scores to all pages.


• Update Hub Scores: For each page, update its hub score by summing the authority
scores of the pages it links to.
• Update Authority Scores: For each page, update its authority score by summing the
hub scores of the pages linking to it.
• Normalize Scores: Normalize both hub and authority scores to ensure they sum to 1.
Repeat Steps 2-4: Continue iterating the above steps until the scores converge (i.e., they stop
changing significantly).

Algorithm:

• Identify the set of relevant pages by the standard text search (a root set Rq).
• Extend the root set by adding:
• pages which are linked by the pages from Rq.
• pages that link to Rq.
The achieved (extended) set is called a base set Sq.

• Then, the method analyzes the structure of the base set to identify hubs and authorities:
• Let L be the adjacency matrix of Sq (L(i,j) = 1 if page i links to page j, L(i,j) = 0
otherwise).
• Let a = [a1, a2, ..., an] be the vector of authorities and h = [h1, h2, ..., hn] be the
vector of hubs. These vectors contain coefficients (weights/importance). The
greater the value of ai (hi) is, the better authority (hub) page i is. h and a are
defined as:
• hi = Σ aj over all j with L(i,j) = 1,
• ai = Σ hj over all j with L(j,i) = 1.
• How to find the values of a and h?
a. Iteratively:
• Initialize a = [1, 1, ..., 1] and h = [1, 1, ..., 1].
• hi(t+1) = Σ aj(t) over all j with L(i,j) = 1, and ai(t+1) = Σ hj(t) over all j with L(j,i) = 1
• Normalize a(t+1) and h(t+1) (the sum of values in each vector should equal 1).
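A minimal Python sketch of these updates (an illustrative example on a small hypothetical adjacency matrix, not the official solution code):

import numpy as np

def hits(L, iterations=50):
    n = L.shape[0]
    hubs = np.ones(n)          # initialize h = [1, 1, ..., 1]
    auths = np.ones(n)         # initialize a = [1, 1, ..., 1]
    for _ in range(iterations):
        # Authority update: a_i = sum of hub scores of the pages linking to i
        auths = L.T @ hubs
        # Hub update: h_i = sum of authority scores of the pages i links to
        hubs = L @ auths
        # Normalize by the square root of the sum of squares, as described above
        auths = auths / np.linalg.norm(auths)
        hubs = hubs / np.linalg.norm(hubs)
    return hubs, auths

# Hypothetical 4-page graph: L[i, j] = 1 when page i links to page j
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
hubs, auths = hits(L)
print("Hub scores:      ", np.round(hubs, 3))
print("Authority scores:", np.round(auths, 3))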

Conclusion: Hence, we have studied and implemented Hits Algorithm.
