Ds Notes Mca
1. Basic Concepts
• Frequent Patterns:
Patterns (itemsets, subsequences, or substructures) that occur frequently in a
dataset.
Example: In a supermarket, bread and butter being bought together often.
• Association Rules:
Imply a strong relationship between items in a dataset.
Form: X ⇒ Y (If X happens, Y is likely to happen too).
• Support:
o Probability that a transaction contains X ∪ Y.
o support(X ⇒ Y) = P(X ∪ Y)
• Confidence:
o How often Y appears in transactions that contain X.
o confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X)
• Lift:
o Measures how much more often X and Y occur together than expected if
independent.
o lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)
• Correlations:
o Find relationships between itemsets that go beyond simple co-occurrence.
o Positive correlation if lift > 1.
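Worked example (the numbers here are illustrative, not taken from the dataset below): suppose support(X ∪ Y) = 0.4, support(X) = 0.5 and support(Y) = 0.5. Then confidence(X ⇒ Y) = 0.4 / 0.5 = 0.8 and lift(X ⇒ Y) = 0.8 / 0.5 = 1.6. Since lift > 1, X and Y are positively correlated.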
# Sample dataset
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

dataset = [
    ['Bread', 'Milk'],
    ['Bread', 'Diaper', 'Beer', 'Eggs'],
    ['Milk', 'Diaper', 'Beer', 'Coke'],
    ['Bread', 'Milk', 'Diaper', 'Beer'],
    ['Bread', 'Milk', 'Diaper', 'Coke']
]

# One-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets with minimum support of 0.6
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
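The same library can also turn these frequent itemsets into association rules with the confidence and lift measures defined above; a minimal continuation of the example (the 0.7 confidence threshold is an illustrative choice):
from mlxtend.frequent_patterns import association_rules

# Generate rules from the frequent itemsets, keeping those with confidence >= 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])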
1. Basic Concepts
• Classification:
Predict categorical labels. (E.g., spam or not spam)
• Prediction:
Predict continuous values. (E.g., house prices)
2. Classification Types:
• Binary Classification: Two classes (Yes/No).
• Multiclass Classification: More than two classes (Cat/Dog/Fish).
• Multi-label Classification: Multiple labels at once (e.g., News tagged as Sports and
Politics).
3. Common Algorithms
• Decision Tree: tree-like structure of decisions.
• Random Forest: ensemble of decision trees.
• Logistic Regression: probabilistic classification.
• k-Nearest Neighbors (KNN): based on closest training examples.
• Naive Bayes: Bayes' theorem with independence assumptions.
• Support Vector Machine (SVM): maximizes the margin between classes.
• Neural Networks: layered networks of neurons.
# Dummy dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
Classification
Definition:
Classification is the process of finding a model that describes and distinguishes data classes
or concepts.
The model can be used to predict the class label of new data points.
# Load dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# Fit a decision tree on the full dataset
tree = DecisionTreeClassifier()
tree.fit(X, y)

# Visualize tree
plt.figure(figsize=(12, 8))
plot_tree(tree, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()
Bayesian Classification
Naive Bayes Assumption:
• Features are conditionally independent given the class.
Types:
• Gaussian Naive Bayes (for continuous data).
• Multinomial Naive Bayes (for count data).
• Bernoulli Naive Bayes (for binary/boolean features).
# Load dataset
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target

# Train a Gaussian Naive Bayes classifier (suitable for the continuous iris features)
model = GaussianNB()
model.fit(X, y)

# Prediction
y_pred = model.predict(X)
print("Predictions:", y_pred)
Rule-Based Classification
Idea:
• IF-THEN rules to perform classification.
Example:
IF age < 30 AND income = high THEN buy_computer = no
How Rules Are Generated:
• From decision trees (e.g., extract paths).
• From association rule mining.
• Direct rule induction algorithms (e.g., RIPPER, CN2).
Advantages:
• Easy to understand.
• Fast classification.
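As a rough illustration (not a standard library API), the example rule above can be written directly as a small Python function; real rule-based classifiers hold many such rules plus a default class:
# Minimal sketch of a rule-based classifier for the example rule above
def classify(record):
    # IF age < 30 AND income = high THEN buy_computer = no
    if record['age'] < 30 and record['income'] == 'high':
        return 'no'
    # Default class when no rule fires (illustrative choice)
    return 'yes'

print(classify({'age': 25, 'income': 'high'}))    # -> 'no'
print(classify({'age': 40, 'income': 'medium'}))  # -> 'yes'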
Prediction
Definition:
Prediction involves building a model to predict continuous-valued functions (numeric
outcomes), unlike classification which predicts discrete labels.
Example:
• Predict house prices based on area, location.
• Predict temperature for the next week.
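A minimal sketch of this idea (the areas and prices below are made-up illustrative numbers): fit a linear regression model to predict a continuous price from area.
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: area (sq ft) -> price
area = np.array([[500], [750], [1000], [1250]])
price = np.array([50, 75, 100, 125])

reg = LinearRegression()
reg.fit(area, price)
print("Predicted price for 900 sq ft:", reg.predict([[900]]))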
Clustering
Cluster Analysis
Definition:
Unsupervised learning task that groups a set of objects such that objects in the same group
(cluster) are more similar to each other than to those in other groups.
Applications:
• Customer segmentation.
• Image compression.
• Anomaly detection.
# Sample data
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2],
              [3, 4],
              [5, 6],
              [8, 8]])

# Cluster the points into two groups
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
print("Cluster labels:", kmeans.labels_)
Evaluation of Clustering
Since clustering is unsupervised, evaluation is tricky!
• Silhouette Coefficient: measures how similar an object is to its own cluster vs. other clusters; ranges over [-1, 1].
• Davies-Bouldin Index: lower value = better clustering (based on intra-cluster similarity).
• Inertia (Within-Cluster Sum of Squares): used in KMeans; lower is better.
• External Validation (if labels are known): Adjusted Rand Index, Mutual Information Score.
# Sample data
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(50, 2)

# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Evaluate
labels = kmeans.labels_
print("Silhouette Score:", silhouette_score(X, labels))
Linear Regression
import numpy as np

# Hypothesis function
def predict(X, theta):
    return X.dot(theta)

# Cost function (mean squared error)
def cost(X, y, theta):
    m = len(y)
    return (1 / (2 * m)) * np.sum((predict(X, theta) - y) ** 2)

# Gradient descent: repeatedly step theta against the gradient of the cost
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        theta = theta - (alpha / m) * X.T.dot(predict(X, theta) - y)
    return theta

# Dummy dataset
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
X_b = np.c_[np.ones((3, 1)), X]  # add bias term
theta = np.zeros(2)

theta = gradient_descent(X_b, y, theta, alpha=0.1, iterations=1000)
print("Theta:", theta)
# Data
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Model
model = LinearRegression()
model.fit(X, y)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 5, 10, 17])

# Transform features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Model
model = LinearRegression()
model.fit(X_poly, y)

# Smooth curve for plotting the fitted polynomial
X_fit = np.linspace(1, 4, 100).reshape(-1, 1)
y_fit = model.predict(poly.transform(X_fit))

plt.scatter(X, y, color='red')
plt.plot(X_fit, y_fit, color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.show()
Python Example (Scaling)
from sklearn.preprocessing import StandardScaler

# X is assumed to be the feature matrix from the preceding examples
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Feature Selection
Definition:
Choosing only the most important/related features for the model.
Techniques:
• Filter Methods (e.g., correlation, chi-square test)
• Wrapper Methods (e.g., recursive feature elimination (RFE))
• Embedded Methods (e.g., Lasso regression)
Why Needed:
• Reduce model complexity.
• Improve accuracy.
• Reduce training time.
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# X, y: feature matrix and target from earlier steps
model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
rfe = rfe.fit(X, y)
print("Selected Features:", rfe.support_)
Logistic Regression
# Dummy Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

# Model
model = LogisticRegression()
model.fit(X, y)

# Prediction: probability of class 1 over a grid of feature values
x_test = np.linspace(0, 6, 100).reshape(-1, 1)
y_pred = model.predict_proba(x_test)[:, 1]

# Plot
plt.scatter(X, y, color='red')
plt.plot(x_test, y_pred, color='blue')
plt.xlabel('Feature X')
plt.ylabel('Probability')
plt.title('Logistic Regression with One Variable')
plt.show()
# Dummy Data
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 5]])
y = np.array([0, 0, 0, 1, 1])

# Model
model = LogisticRegression()
model.fit(X, y)

# Prediction
print("Predicted probabilities:", model.predict_proba([[3, 4]]))
print("Predicted class:", model.predict([[3, 4]]))
Deep Learning
History of Deep Learning
• 1943: McCulloch & Pitts — first mathematical model of a neuron.
• 1958: Frank Rosenblatt — Perceptron (first algorithm for binary classification).
• 1986: Rumelhart, Hinton, Williams — Backpropagation algorithm.
• 1998: Yann LeCun — Convolutional Neural Networks (LeNet for handwritten digit
recognition).
• 2006–2012: "Deep Learning" boom — Geoffrey Hinton popularized Deep Belief
Networks, then came AlexNet (2012 ImageNet winner).
XOR Model
• XOR (Exclusive OR) Problem:
o Input: two binary variables.
o Output: 1 if only one of the inputs is 1, else 0.
• Challenge: Single-layer perceptron can't solve XOR (not linearly separable).
• Solution:
o Multi-Layer Perceptrons (MLPs) can solve XOR!
import torch
import torch.nn as nn

# XOR dataset
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

# Model: one hidden layer of 2 neurons
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.Sigmoid(),
    nn.Linear(2, 1),
    nn.Sigmoid()
)

# Loss and optimizer (not specified in the original notes; binary cross-entropy with SGD is a typical choice)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Training loop
for epoch in range(10000):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

print(model(X).detach())
Layers:
• Input Layer: Takes raw features.
• Hidden Layers: Feature extraction (representation learning).
• Output Layer: Final prediction.
Normalization
Normalization improves training speed and stability:
• Batch Normalization: Normalize activations of a layer during training.
• Input Normalization: Normalize features to mean 0 and variance 1.
Why Normalize?
To prevent some features from dominating and to stabilize gradients.
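As a minimal PyTorch sketch (the layer sizes are arbitrary illustrative choices), batch normalization is typically inserted between a linear layer and its activation:
import torch.nn as nn

# Small network with batch normalization after the first linear layer
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.BatchNorm1d(32),  # normalize activations across the batch
    nn.ReLU(),
    nn.Linear(32, 1)
)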
Hyper-Parameter Tuning
Hyperparameters: Values set before training (not learned).
Examples:
• Learning Rate
• Number of Layers
• Number of Neurons per Layer
• Batch Size
• Number of Epochs
• Dropout Rate
Tuning methods:
• Grid Search
• Random Search
• Bayesian Optimization
• Hyperband
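For example, grid search can be sketched with scikit-learn's GridSearchCV (the parameter grid and the decision tree estimator are illustrative choices; X_train and y_train are assumed to be an already prepared training set):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Try every combination of these hyperparameter values with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 5], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best hyperparameters:", grid.best_params_)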
Convolutional Neural Networks (CNNs)
CNNs are specialized for image data!
• Convolution Layer: extracts features (edges, corners, patterns).
• Pooling Layer: reduces spatial size (downsampling).
• Fully Connected Layer: final classification.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3,3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D(pool_size=(2,2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')  # 10 classes
])
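Before training, the model still has to be compiled with a loss and optimizer; a typical choice for 10 integer-labeled classes (an illustrative assumption, since the notes do not specify the dataset) would be:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()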
Iris Dataset Classification
1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset and split into train/test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Feature Names
print(iris.feature_names)
# Target Names
print(iris.target_names)
5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)
# Predictions
y_pred_dt = dt_model.predict(X_test)
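The notes import accuracy_score but stop at the predictions; a short evaluation step comparing the two classifiers on the held-out test set could look like:
# Evaluate both models on the test set
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_lr))
print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_dt))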
1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Feature Scaling (X and y are assumed to have been loaded and label-encoded in earlier steps)
scaler = StandardScaler()
X = scaler.fit_transform(X)
4. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
# Predictions
y_pred_dt = dt_model.predict(X_test)
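As with the Iris pipeline, the imported metrics can then be used to evaluate the models; a minimal sketch using the decision tree predictions (the heatmap styling is an illustrative choice):
# Confusion matrix and detailed report for the decision tree
print(classification_report(y_test, y_pred_dt))
cm = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()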