Ds Notes Mca

The document covers key concepts in data mining, including frequent patterns, association rules, and classification techniques. It discusses algorithms like Apriori for association rule mining and various classification methods such as decision trees and Naive Bayes. Additionally, it provides Python implementation examples for these concepts and highlights issues related to classification and prediction.


Mining Frequent Patterns, Associations, and Correlations

1. Basic Concepts
• Frequent Patterns:
Patterns (itemsets, subsequences, or substructures) that occur frequently in a
dataset.
Example: In a supermarket, bread and butter being bought together often.
• Association Rules:
Imply a strong relationship between items in a dataset.
Form: X ⇒ Y (If X happens, Y is likely to happen too).
• Support:
o Probability that a transaction contains X ∪ Y.
o support(X ⇒ Y) = P(X ∪ Y)
• Confidence:
o How often Y appears in transactions that contain X.
o confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X)
• Lift:
o Measures how much more often X and Y occur together than expected if
independent.
o lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)
• Correlations:
o Find relationships between itemsets that go beyond simple co-occurrence.
o Positive correlation if lift > 1.
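Python Example (Support, Confidence, and Lift by Hand)

A minimal sketch computing the three measures defined above by hand; the small transaction list below is made up purely for illustration:

# Toy transactions (illustrative only)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer'},
    {'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper'},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {'Bread'}, {'Milk'}
sup_xy = support(X | Y)            # support(X => Y) = P(X ∪ Y)
conf = sup_xy / support(X)         # confidence(X => Y) = P(Y | X)
lift = conf / support(Y)           # lift(X => Y)
print("support:", sup_xy, "confidence:", conf, "lift:", lift)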

2. Association Rule Mining


Objective:
• Discover interesting relationships between variables in large datasets.
Challenges:
• Search Space: Exponential growth.
• Interestingness Measures: Not all frequent itemsets are meaningful.
• Scalability: Handle very large data volumes.

3. The Apriori Algorithm


Goal: Find frequent itemsets to generate association rules.
Principle:
• Apriori Property:
If an itemset is frequent, all of its subsets must also be frequent.
Steps:
1. Scan Dataset to find frequent 1-itemsets (items appearing frequently alone).
2. Generate Candidate Itemsets of length k from frequent itemsets of length k-1.
3. Prune candidates whose subsets are not frequent (using the Apriori property).
4. Repeat until no frequent itemsets can be generated.
Theoretical Example:
Transaction ID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
• Frequent itemsets (with minimum support = 60%, i.e. 3 of 5 transactions): {Bread}, {Milk}, {Diaper}, {Beer}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}
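Python Example (One Apriori Pass by Hand)

To make the level-wise idea concrete before the library example in the next section, here is a minimal hand-rolled sketch of one Apriori pass over the transactions above (minimum support = 60%, i.e. a count of 3 out of 5):

from itertools import combinations

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]
min_count = 3  # 60% of 5 transactions

def count(itemset):
    # Number of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions)

# Step 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
L1 = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]

# Step 2: candidate 2-itemsets from frequent 1-itemsets, then prune by support
C2 = [a | b for a, b in combinations(L1, 2)]
L2 = [c for c in C2 if count(c) >= min_count]

print(sorted(map(sorted, L1)))
print(sorted(map(sorted, L2)))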

4. Python Implementation Tips


You can use the mlxtend library, which provides apriori and association_rules.
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# Sample dataset
dataset = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]

# Convert to one-hot encoding


from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)

# Generate association rules


rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
Classification and Prediction

1. Basic Concepts
• Classification:
Predict categorical labels. (E.g., spam or not spam)
• Prediction:
Predict continuous values. (E.g., house prices)
Classification Types:
• Binary Classification: Two classes (Yes/No).
• Multiclass Classification: More than two classes (Cat/Dog/Fish).
• Multi-label Classification: Multiple labels at once (e.g., News tagged as Sports and
Politics).

2. Supervised Learning Flow


Step Description
Data Preparation Collect and preprocess the data
Model Building Train a model using training data
Prediction Use model to predict unseen data
Evaluation Measure accuracy, precision, recall

3. Common Algorithms
Algorithm Quick Idea
Decision Tree Tree-like structure of decisions
Random Forest Ensemble of decision trees
Logistic Regression Probabilistic classification
k-Nearest Neighbors (KNN) Based on closest training examples
Naive Bayes Bayes' theorem with independence assumptions
Support Vector Machine (SVM) Maximize margin between classes
Neural Networks Layered networks of neurons

4. Python Example (Simple Classification)

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Dummy dataset
from sklearn.datasets import load_iris
data = load_iris()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)

# Train classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))

Classification
Definition:
Classification is the process of finding a model that describes and distinguishes data classes
or concepts.
The model can be used to predict the class label of new data points.

Issues Regarding Classification


Issue Description
Accuracy How correct the model is.
Speed Training time and prediction time.
Robustness Ability to handle noisy or missing data.
Scalability Ability to handle large datasets.
Interpretability Whether humans can understand the model's decisions.
Overfitting Model performs well on training data but poorly on new data.
Underfitting Model is too simple to capture data patterns.
Imbalanced Classes One class has significantly more samples than others.

Classification by Decision Tree Induction


Decision Tree
• A flowchart-like tree structure where:
o Each internal node represents a test on an attribute.
o Each branch represents an outcome of the test.
o Each leaf node represents a class label.
Popular Algorithms: ID3, C4.5, CART
Algorithm Steps:
1. Start with the entire dataset.
2. Choose the best attribute using a splitting criterion (e.g., Information Gain, Gini
Index).
3. Split the dataset based on the selected attribute.
4. Recur for each branch.
Splitting Criteria:
• Information Gain (used by ID3/C4.5): Gain(A) = Entropy(D) − Σv (|Dv| / |D|) × Entropy(Dv), where Entropy(D) = −Σi pi log2 pi.
• Gini Index (used by CART): Gini(D) = 1 − Σi pi².

Python Example (Decision Tree)


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train decision tree


tree = DecisionTreeClassifier(criterion='gini') # or 'entropy'
tree.fit(X, y)

# Visualize tree
plt.figure(figsize=(12,8))
plot_tree(tree, filled=True, feature_names=iris.feature_names,
class_names=iris.target_names)
plt.show()

Bayesian Classification
Naive Bayes Assumption:
• Features are conditionally independent given the class, so P(C | x1, ..., xn) ∝ P(C) × Πi P(xi | C), and the class with the highest posterior probability is chosen.
Types:
• Gaussian Naive Bayes (for continuous data).
• Multinomial Naive Bayes (for count data).
• Bernoulli Naive Bayes (for binary/boolean features).

Python Example (Naive Bayes)


from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train Naive Bayes classifier


model = GaussianNB()
model.fit(X, y)

# Prediction
y_pred = model.predict(X)
print("Predictions:", y_pred)

Rule-Based Classification
Idea:
• IF-THEN rules to perform classification.
Example:
IF age < 30 AND income = high THEN buy_computer = no
How Rules Are Generated:
• From decision trees (e.g., extract paths).
• From association rule mining.
• Direct rule induction algorithms (e.g., RIPPER, CN2).
Advantages:
• Easy to understand.
• Fast classification.
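Python Example (Extracting Rules from a Decision Tree)

A minimal sketch of the first generation method above (extracting rules from tree paths), using scikit-learn's export_text on a small tree trained on the Iris data; this is illustrative only, not a full rule-induction algorithm like RIPPER or CN2:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Each root-to-leaf path printed below reads as an IF-THEN rule
print(export_text(tree, feature_names=list(iris.feature_names)))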

Metrics for Evaluating Classifier Performance


Metric Formula Description
Accuracy (TP + TN) / (TP + TN + FP + FN) Overall correctness.
Precision TP / (TP + FP) How many predicted positives are actual positives.
Recall (Sensitivity) TP / (TP + FN) How many actual positives were correctly predicted.
F1-Score 2 × (Precision × Recall) / (Precision + Recall) Balance between precision and recall.
ROC Curve Plot of TPR vs. FPR Diagnostic ability at various thresholds.
AUC (Area Under Curve) Single-number summary of the ROC curve Higher AUC = better model.
Where:
• TP = True Positive
• TN = True Negative
• FP = False Positive
• FN = False Negative

Python Example (Evaluation Metrics)


from sklearn.metrics import confusion_matrix, classification_report

# Assume y_test (true labels) and y_pred (predictions on the test set) are available

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Holdout Method and Random Subsampling


Holdout Method:
• Split dataset into:
o Training set (e.g., 70%)
o Testing set (e.g., 30%)
Simple, but the performance estimate can be biased if the single split happens to be unrepresentative.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Random Subsampling:
• Repeat holdout method multiple times with different random splits.
• Average the performance over multiple trials.
• Reduces bias compared to one-time holdout.
Pseudo-code:
for i in range(N):
split data randomly
train model
evaluate and record performance
average all recorded performances
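Python Example (Random Subsampling)

A minimal scikit-learn sketch of this pseudo-code; the classifier, number of repetitions, and dataset are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

X, y = load_iris(return_X_y=True)
scores = []
for i in range(10):  # N = 10 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=i)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print("Average accuracy over 10 splits:", np.mean(scores))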

Prediction
Definition:
Prediction involves building a model to predict continuous-valued functions (numeric
outcomes), unlike classification which predicts discrete labels.
Example:
• Predict house prices based on area, location.
• Predict temperature for the next week.

Issues Regarding Prediction


Issue Description
Quality of Input Data Missing, noisy, or irrelevant features can affect performance.
Model Complexity Overly complex models may overfit; simple models may underfit.
Feature Selection Important to choose the right input attributes.
Interpretability How understandable the prediction model is.
Scalability Handle large datasets efficiently.
Evaluation Hard to measure "goodness" accurately, especially for time-series data.
Python Example: (Prediction Error Metrics)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# True and predicted values


y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print("MAE:", mean_absolute_error(y_true, y_pred))


print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred, squared=False))
print("R2 Score:", r2_score(y_true, y_pred))

Evaluating the Accuracy of a Classifier or Predictor


Method Description
Holdout Method Split into training and testing once.
k-Fold Cross-Validation Split into k subsets, train k times with a different subset held out each time.
Leave-One-Out Cross-Validation (LOOCV) Extreme case of k-fold where k = number of samples.
Bootstrap Resample with replacement to create multiple datasets.
Key Metrics:
• For Classifiers: Accuracy, Precision, Recall, F1-score.
• For Predictors: MAE, MSE, RMSE, R2.
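Python Example (k-Fold Cross-Validation and LOOCV)

A minimal sketch of k-fold cross-validation and LOOCV with scikit-learn, assuming the Iris data and a decision tree as in the earlier examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier()

# 5-fold cross-validation (mean accuracy over the 5 folds)
print("5-fold accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Leave-One-Out: k equals the number of samples
print("LOOCV accuracy:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())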

Clustering
Cluster Analysis
Definition:
Unsupervised learning task that groups a set of objects such that objects in the same group
(cluster) are more similar to each other than to those in other groups.
Applications:
• Customer segmentation.
• Image compression.
• Anomaly detection.

Agglomerative vs Divisive Hierarchical Clustering


Type Description Example
Agglomerative (bottom-up) Start with each point as its own cluster, then iteratively merge the closest clusters. Single-link, complete-link methods.
Divisive (top-down) Start with one large cluster and recursively split it into smaller clusters. Bisecting k-means.
Hierarchical clustering creates a tree structure (dendrogram).

Agglomerative Hierarchical Clustering Python Example:


from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Sample data
X = np.array([[1, 2],
[3, 4],
[5, 6],
[8, 8]])

# Perform hierarchical clustering


linked = linkage(X, 'single') # 'single', 'complete', 'average'
# Plot dendrogram
plt.figure(figsize=(8, 5))
dendrogram(linked, labels=[1, 2, 3, 4])
plt.show()

Evaluation of Clustering
Since clustering is unsupervised, evaluation is tricky!
Metric Description
Silhouette Coefficient Measures how similar an object is to its own cluster vs. other clusters. Ranges over [-1, 1].
Davies-Bouldin Index Lower value = better clustering (based on intra-cluster similarity).
Inertia (Within-Cluster Sum of Squares) Used in KMeans; lower is better.
External Validation (if labels known) Adjusted Rand Index, Mutual Information Score.

Python Example: (Evaluate Clustering)


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sample data
X = np.random.rand(50, 2)

# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Evaluate
labels = kmeans.labels_
print("Silhouette Score:", silhouette_score(X, labels))

Python Example: Simple Gradient Descent


import numpy as np

# Hypothesis function: linear model h(X) = X · theta
def predict(X, theta):
    return X.dot(theta)

# Cost function: mean squared error (divided by 2m)
def cost(X, y, theta):
    m = len(y)
    return (1 / (2 * m)) * np.sum((predict(X, theta) - y) ** 2)

# Gradient descent: repeatedly step theta against the gradient of the cost
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        theta -= (alpha / m) * X.T.dot(predict(X, theta) - y)
    return theta

# Dummy dataset
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
X_b = np.c_[np.ones((3, 1)), X] # add bias term

theta = np.zeros(2)
theta = gradient_descent(X_b, y, theta, alpha=0.1, iterations=1000)
print("Theta:", theta)

Linear Regression with One Variable (Univariate)


The simple case with only one independent variable X.
Equation:
hθ(x) = θ0 + θ1x
Use case:
Predict salary based on years of experience.
Python Example (One Variable)

import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression

# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Model
model = LinearRegression()
model.fit(X, y)

# Predict and plot


plt.scatter(X, y, color='red')
plt.plot(X, model.predict(X), color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with One Variable')
plt.show()

Python Example (Multiple Variables)


# Data
X = np.array([[1, 2], [2, 3], [4, 5]])
y = np.array([5, 7, 11])

# Model
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Python Example (Polynomial Regression)

from sklearn.preprocessing import PolynomialFeatures

# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 5, 10, 17])

# Transform features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Model
model = LinearRegression()
model.fit(X_poly, y)

# Predict and plot


X_fit = np.linspace(1, 4, 100).reshape(-1,1)
X_fit_poly = poly.transform(X_fit)
y_fit = model.predict(X_fit_poly)

plt.scatter(X, y, color='red')
plt.plot(X_fit, y_fit, color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.show()
Python Example (Scaling)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Feature Selection
Definition:
Choosing only the most important/related features for the model.
Techniques:
• Filter Methods (e.g., correlation, chi-square test)
• Wrapper Methods (e.g., recursive feature elimination (RFE))
• Embedded Methods (e.g., Lasso regression)
Why Needed:
• Reduce model complexity.
• Improve accuracy.
• Reduce training time.

Python Example (Feature Selection using RFE)

from sklearn.feature_selection import RFE

model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
rfe = rfe.fit(X, y)
print("Selected Features:", rfe.support_)

Classification Using Logistic Regression


Definition:
Logistic Regression is a supervised learning algorithm used for binary classification (output
0 or 1).
Despite its name, it's used for classification, not regression!
Core Idea:
Rather than fitting a straight line (as in Linear Regression), Logistic Regression fits an S-shaped curve, the sigmoid function σ(z) = 1 / (1 + e^(−z)), which maps any real-valued number z (a linear combination of the features) to a probability between 0 and 1.
Python Example (One Variable)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Dummy Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

# Model
model = LogisticRegression()
model.fit(X, y)

# Prediction
x_test = np.linspace(0, 6, 100).reshape(-1,1)
y_pred = model.predict_proba(x_test)[:,1]

# Plot
plt.scatter(X, y, color='red')
plt.plot(x_test, y_pred, color='blue')
plt.xlabel('Feature X')
plt.ylabel('Probability')
plt.title('Logistic Regression with One Variable')
plt.show()

Python Example (Multiple Variables)

# Dummy Data
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 5]])
y = np.array([0, 0, 0, 1, 1])

# Model
model = LogisticRegression()
model.fit(X, y)

# Prediction
print("Predicted probabilities:", model.predict_proba([[3, 4]]))
print("Predicted class:", model.predict([[3, 4]]))

Deep Learning
History of Deep Learning
• 1943: McCulloch & Pitts — first mathematical model of a neuron.
• 1958: Frank Rosenblatt — Perceptron (first algorithm for binary classification).
• 1986: Rumelhart, Hinton, Williams — Backpropagation algorithm.
• 1998: Yann LeCun — Convolutional Neural Networks (LeNet for handwritten digit
recognition).
• 2006–2012: "Deep Learning" boom — Geoffrey Hinton popularized Deep Belief
Networks, then came AlexNet (2012 ImageNet winner).

Scope and Specifications of Deep Learning


• Scope:
o Computer Vision (object detection, facial recognition)
o Natural Language Processing (translation, sentiment analysis)
o Healthcare (disease diagnosis)
o Robotics (autonomous control)
o Finance (fraud detection)
• Specifications:
o Requires large datasets.
o Requires high computational power (GPUs/TPUs).
o Involves training models with millions of parameters.

Why Deep Learning Now?


• Big Data: Availability of massive datasets (e.g., ImageNet, OpenAI datasets).
• Hardware Advances: GPUs, TPUs, Parallel Computing.
• Algorithmic Improvements: Better optimizers (Adam, RMSProp), regularization
methods (dropout, batch normalization).
• Open-Source Ecosystem: Libraries like TensorFlow, PyTorch, Keras, Hugging Face.

Building Blocks of Neural Networks


Building Block Description
Neuron (Unit) Mimics a biological neuron, performs weighted sum + activation
Layer Group of neurons; input, hidden, output
Weights & Bias Learnable parameters
Activation Function Non-linear transformation (e.g., ReLU, Sigmoid)
Loss Function Measures prediction error
Optimizer Adjusts weights to minimize loss
Deep Learning Hardware
• GPU: Graphical Processing Unit — parallelism for matrix ops.
• TPU: Tensor Processing Unit — Google's hardware for tensor ops.
• ASICs: Application-Specific Integrated Circuits for DL workloads.
• Frameworks: TensorFlow, PyTorch leverage hardware accelerations.

Forward and Backward Propagation


Direction Purpose
Forward Propagation Compute output predictions from inputs
Backward Propagation Update weights based on prediction error using gradients
Forward Pass:
• Inputs → Multiply weights → Add bias → Activation → Output
Backward Pass:
• Use Chain Rule of derivatives to compute gradient w.r.t each parameter.
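Python Example (Forward and Backward Pass for One Neuron)

To make the two passes concrete, here is a minimal NumPy sketch of a single neuron with a sigmoid activation and squared-error loss; the input, weight, and target values are arbitrary, and the gradient follows the chain rule described above:

import numpy as np

x = np.array([1.0, 2.0])      # inputs
w = np.array([0.5, -0.3])     # weights
b = 0.1                       # bias
t = 1.0                       # target output

# Forward pass: weighted sum -> activation -> loss
z = w.dot(x) + b
y = 1 / (1 + np.exp(-z))      # sigmoid activation
loss = 0.5 * (y - t) ** 2

# Backward pass: chain rule dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = y - t
dy_dz = y * (1 - y)
grad_w = dL_dy * dy_dz * x
grad_b = dL_dy * dy_dz

# Gradient step
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b
print("loss:", loss, "updated w:", w, "updated b:", b)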

XOR Model
• XOR (Exclusive OR) Problem:
o Input: two binary variables.
o Output: 1 if only one of the inputs is 1, else 0.
• Challenge: Single-layer perceptron can't solve XOR (not linearly separable).
• Solution:
o Multi-Layer Neural Networks (MLPs) can solve XOR!

Python Example (Simple XOR with MLP)

import torch
import torch.nn as nn

# XOR dataset
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

# Model
model = nn.Sequential(
nn.Linear(2, 2),
nn.Sigmoid(),
nn.Linear(2, 1),
nn.Sigmoid()
)

# Loss and Optimizer


criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(10000):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

print(model(X).detach())
Layers:
• Input Layer: Takes raw features.
• Hidden Layers: Feature extraction (representation learning).
• Output Layer: Final prediction.

Normalization
Normalization improves training speed and stability:
• Batch Normalization: Normalize activations of a layer during training.
• Input Normalization: Normalize features to mean 0 and variance 1.
Why Normalize?
To prevent some features from dominating and to stabilize gradients.
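Python Example (Input and Batch Normalization)

A minimal PyTorch sketch showing both ideas: normalizing the input features to mean 0 and variance 1, and adding a batch-normalization layer inside a network. The data and layer sizes are arbitrary illustrative choices:

import torch
import torch.nn as nn

X = torch.rand(8, 2)  # dummy batch of 8 samples, 2 features

# Input normalization: mean 0, variance 1 per feature
X_norm = (X - X.mean(dim=0)) / X.std(dim=0)

# Batch normalization as a layer inside a network
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.BatchNorm1d(4),   # normalizes the activations of the previous layer
    nn.ReLU(),
    nn.Linear(4, 1)
)
print(model(X_norm).shape)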

Hyper-Parameter Tuning
Hyperparameters: Values set before training (not learned).
Examples:
• Learning Rate
• Number of Layers
• Number of Neurons per Layer
• Batch Size
• Number of Epochs
• Dropout Rate
Tuning methods:
• Grid Search
• Random Search
• Bayesian Optimization
• Hyperband
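Python Example (Grid Search)

A minimal grid-search sketch with scikit-learn, illustrating the first tuning method listed above; the parameter grid below is an arbitrary example for a decision tree:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try every combination of these hyperparameter values with 5-fold CV
param_grid = {'max_depth': [2, 3, 4], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)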
Convolutional Neural Networks (CNNs)
CNNs are specialized for image data!
Component Purpose
Convolution Layer Extract features (edges, corners, patterns)
Pooling Layer Reduce spatial size (downsampling)
Fully Connected Layer Final classification

CNN Architecture Example:


Input Image → Conv Layer → ReLU → Pooling → Conv Layer → ReLU → Pooling → Flatten →
Fully Connected Layer → Output

CNN Architecture Typical Example:


Layer Details
Conv2D filters=32, kernel_size=(3,3)
Activation ReLU
MaxPooling2D pool_size=(2,2)
Conv2D filters=64, kernel_size=(3,3)
Activation ReLU
MaxPooling2D pool_size=(2,2)
Flatten
Dense 128 neurons
Output Dense Softmax for multiclass classification

Python Code (Basic CNN - Keras)


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
Conv2D(32, (3,3), activation='relu', input_shape=(64, 64, 3)),
MaxPooling2D(pool_size=(2,2)),
Conv2D(64, (3,3), activation='relu'),
MaxPooling2D(pool_size=(2,2)),
Flatten(),
Dense(128, activation='relu'),
Dense(10, activation='softmax') # 10 classes
])
Iris Dataset Classification

1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

2. Load and Explore the Dataset


# Load Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Feature Names
print(iris.feature_names)

# Target Names
print(iris.target_names)

3. Preprocess the Data


• Scaling is optional for tree-based models but good for Logistic Regression.
# Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

4. Split into Train and Test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test)

6. Evaluate the Models


# Logistic Regression Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test,
y_pred_lr))

# Decision Tree Evaluation


print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report for Decision Tree:\n", classification_report(y_test, y_pred_dt))

7. Confusion Matrix Visualization


# Confusion Matrix for Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Confusion Matrix for Decision Tree


cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='d')
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Loan Dataset Classification (Machine Learning)

1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the Dataset


Suppose you have a CSV file called loan_data.csv.
# Load the dataset
df = pd.read_csv('loan_data.csv')

# View first few rows


print(df.head())
Typical columns in loan dataset:
• Loan_ID
• Gender
• Married
• Dependents
• Education
• Self_Employed
• ApplicantIncome
• CoapplicantIncome
• LoanAmount
• Loan_Amount_Term
• Credit_History
• Property_Area
• Loan_Status (Target variable: Y/N)

3. Preprocess the Data


• Handle missing values.
• Encode categorical variables.
• Feature scaling.
# Drop Loan_ID (not useful for prediction)
df = df.drop('Loan_ID', axis=1)

# Fill missing values


df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# Encode categorical variables


label_encoders = {}
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
'Property_Area', 'Loan_Status']

for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
# Separate features and target
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

# Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

4. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test)

6. Evaluate the Models


# Logistic Regression
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test,
y_pred_lr))
# Decision Tree
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report for Decision Tree:\n", classification_report(y_test, y_pred_dt))

7. Confusion Matrix Visualization


# Logistic Regression Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Decision Tree Confusion Matrix


cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='d')
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
