AIDS-II PT1 Question Bank
Convolutional Neural Networks (CNNs) are a class of deep learning models primarily
used for analyzing visual data. They are particularly effective in tasks like image
classification, object detection, and more due to their ability to automatically extract
features from images. Here’s an overview of the key components of CNNs and how
they work.
1. Convolutional Layers
The convolutional layer is the core component of a CNN. It applies a set of filters (also
known as kernels) to the input image. Each filter is a small matrix that slides over the
image, performing element-wise multiplication and summing the results to produce a
feature map. This operation allows CNNs to detect patterns such as edges, textures,
and shapes. For example, a filter might be designed to detect horizontal edges, while
another might detect vertical edges. The process of convolution helps in extracting local
features from the input image, which are crucial for understanding the overall structure.
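A minimal sketch of the convolution operation in Python (the 5x5 "image" and 3x3 vertical-edge filter below are hypothetical; as in most deep learning libraries, the sliding-window product shown is technically cross-correlation):

```python
import numpy as np

image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # responds to vertical edges

h = image.shape[0] - kernel.shape[0] + 1  # output height ("valid" convolution)
w = image.shape[1] - kernel.shape[1] + 1  # output width
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        # element-wise multiply the window with the kernel and sum the results
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map)  # large magnitudes mark the vertical edge in the image
```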
2. Activation Functions
After the convolution operation, an activation function is applied to introduce
non-linearity into the model. The most commonly used activation function in CNNs is the
Rectified Linear Unit (ReLU), which replaces negative values with zero, allowing the
network to learn complex patterns. Other variants include Leaky ReLU and Parametric
ReLU, which help mitigate issues like the "dying ReLU" problem.
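A minimal sketch of ReLU and Leaky ReLU applied to hypothetical feature-map values:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(0, x)                    # negative values replaced with zero
leaky_relu = np.where(x > 0, x, 0.01 * x)  # small slope keeps negative units "alive"
print(relu, leaky_relu)
```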
3. Pooling Layers
Pooling layers are used to downsample the feature maps produced by the convolutional
layers. This reduces the spatial dimensions of the data, which helps to decrease
computational load and mitigate overfitting. The most common pooling operation is max
pooling, which takes the maximum value from a specified window (e.g., 2x2) of the
feature map. This process retains the most significant features while discarding less
important information, leading to a more compact representation of the data.
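A minimal sketch of 2x2 max pooling with stride 2 on a hypothetical 4x4 feature map:

```python
import numpy as np

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 2],
               [7, 2, 9, 5],
               [3, 1, 4, 8]], dtype=float)

# split the map into 2x2 windows and keep the maximum of each
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6. 2.], [7. 9.]]
```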
4. Fully Connected Layers
After the convolutional and pooling stages, the pooled feature maps are flattened and passed through one or more fully connected layers, culminating in an output layer that produces the final prediction (for example, the digit class 0-9 in handwritten digit recognition) based on the learned features.
This hierarchical structure allows CNNs to effectively learn and recognize patterns in
images, making them powerful tools in computer vision tasks.
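A minimal PyTorch sketch of this hierarchy for 28x28 grayscale digit images (the layer sizes are illustrative, not prescribed):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: extract local features
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),                                # flatten the pooled feature maps
    nn.Linear(32 * 7 * 7, 10),                   # fully connected output: 10 digit classes
)

x = torch.randn(1, 1, 28, 28)  # a dummy one-image batch
print(model(x).shape)          # torch.Size([1, 10])
```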
Recurrent Neural Networks (RNNs) are a specialized type of artificial neural network
designed to process sequential data, where the order of the data points is crucial. This
capability makes RNNs particularly effective for tasks involving time series, natural
language processing, audio, and video data. Below is an explanation of the need for
RNNs to process sequential data, along with various RNN variants and their
applications.
1. Temporal Dependencies:
● Sequential data often contains temporal dependencies, meaning that the current
data point is influenced by previous ones. Traditional feedforward neural
networks treat each input independently, which is not suitable for sequential data
where context matters.
2. Memory Mechanism:
● RNNs possess an internal memory that allows them to retain information from
previous time steps. This memory enables RNNs to remember important patterns
and relationships over time, making them ideal for tasks like language modeling
and time series prediction.
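A minimal sketch of one recurrent step, showing how the hidden state carries information across time steps (the weights here are random stand-ins, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 4))  # input-to-hidden weights
W_hh = rng.normal(size=(8, 8))  # hidden-to-hidden weights: the "memory" path
b_h = np.zeros(8)

h = np.zeros(8)                               # initial hidden state
for x_t in rng.normal(size=(5, 4)):           # a sequence of 5 input vectors
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # h_t depends on x_t and h_(t-1)
print(h)
```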
Variants of RNNs
Several variants of RNNs have been developed to address specific challenges, such as
the vanishing gradient problem and the need for better memory management. Here are
some notable variants:
3. Bidirectional RNNs
● Overview: Bidirectional RNNs consist of two RNNs: one processes the input
sequence from start to end, while the other processes it from end to start. This
allows the model to capture context from both directions.
4. Attention Mechanisms
● Overview: While not a variant of RNNs per se, attention mechanisms are often
used in conjunction with RNNs to improve their performance on tasks requiring
long-range dependencies. Attention allows the model to focus on specific parts of
the input sequence when making predictions.
A Long Short Term Memory Network consists of four different gates for different
purposes as described below:-
1. Forget Gate (f): At the forget gate, the current input is combined with the previous output to generate a fraction between 0 and 1 that determines how much of the previous state should be preserved (or, equivalently, how much of the state should be forgotten). This fraction is then multiplied element-wise with the previous state. Note: an activation output of 1.0 means "remember everything" and an activation output of 0.0 means "forget everything," so from that perspective a better name for the forget gate might be the "remember gate."
2. Input Gate (i): The input gate operates on the same signals as the forget gate, but its objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of a tanh block that produces the new candidate values to be added to the state. This gated vector is then added to the previous state to generate the current state.
3. Input Modulation Gate (g): This gate is often considered a sub-part of the input gate, and much of the literature on LSTMs does not mention it separately, assuming it is folded into the input gate. It modulates the information that the input gate writes onto the internal state cell by adding non-linearity and making the information zero-mean, which reduces learning time since zero-mean inputs converge faster. Although its actions are less critical than those of the other gates and it is often treated as a refinement, it is good practice to include this gate in the structure of the LSTM unit.
4. Output Gate (o): At the output gate, the input and previous state are gated as before to generate another scaling fraction, which is combined with the output of a tanh block applied to the current state. The result is given out as the output, and both the output and the state are fed back into the LSTM block.
OR
Working of LSTM
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network
(RNN) that are particularly effective at processing sequential data. LSTMs overcome the
limitations of traditional RNNs, such as the vanishing gradient problem, by introducing a
unique architecture with memory cells and gates. The core components of an LSTM unit are:
1. Cell State (C): The cell state acts as the "memory" of the LSTM, allowing
information to be selectively passed along or forgotten.
2. Hidden State (h): The hidden state is the output of the LSTM unit, which is
passed to the next layer or used for prediction.
3. Gates: LSTMs use three gates to control the flow of information:
○ Forget Gate (f): Decides what information from the previous cell state to
keep or discard.
○ Input Gate (i): Determines what new information from the current input
and previous hidden state to add to the cell state.
○ Output Gate (o): Selects the information from the current input, previous
hidden state, and current cell state to produce the output.
Step-by-Step Working
1. Forget Gate: The forget gate decides what information to keep or discard from
the previous cell state (C(t-1)). It takes the current input (x(t)) and the previous
hidden state (h(t-1)) as inputs, and outputs a value between 0 and 1 for each
number in the cell state C(t-1). A value closer to 1 indicates that the
corresponding information should be kept, while a value closer to 0 indicates that
it should be forgotten.
2. Input Gate: The input gate determines what new information from the current
input (x(t)) and previous hidden state (h(t-1)) to add to the cell state. It consists of
two parts:
○ A sigmoid layer called the "input gate layer" decides which values to
update.
○ A tanh layer creates a vector of new candidate values (C~(t)) that could be
added to the state.
3. Cell State Update: The old cell state (C(t-1)) is multiplied by the output of the
forget gate (f(t)) to forget the information decided to be forgotten earlier. Then,
the new candidate values (C~(t)), scaled by the output of the input gate (i(t)), are
added to the cell state to obtain the new cell state C(t).
4. Output Gate: The output gate decides what information from the current input
(x(t)), previous hidden state (h(t-1)), and current cell state (C(t)) to use as output.
It consists of:
○ A sigmoid layer that decides which parts of the cell state to output.
○ A tanh layer that puts the cell state through a tanh activation to push the
values between -1 and 1.
○ The output of the sigmoid layer is multiplied with the output of the tanh
layer to produce the final output.
The updated hidden state (h(t)) is then passed to the next layer or used for prediction. By selectively remembering and forgetting information using the gates,
LSTMs can effectively capture long-term dependencies in sequential data, making them
powerful tools for tasks such as language modeling, machine translation, and speech
recognition.
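A minimal NumPy sketch of a single LSTM step following the gate equations above (the weight matrices are random stand-ins, and each gate acts on the concatenation of h(t-1) and x(t); biases are omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
# one weight matrix per gate, acting on [h_(t-1), x_t]
Wf, Wi, Wg, Wo = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)     # forget gate: what to keep from C(t-1)
    i = sigmoid(Wi @ z)     # input gate: which new values to write
    g = np.tanh(Wg @ z)     # candidate values C~(t)
    o = sigmoid(Wo @ z)     # output gate
    C = f * C_prev + i * g  # cell state update
    h = o * np.tanh(C)      # hidden state / output
    return h, C

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):  # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C)
print(h)
```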
Autoencoders are a type of artificial neural network used primarily for unsupervised
learning tasks, such as dimensionality reduction and feature learning. They consist of
three main components: the encoder, the bottleneck (or latent space), and the decoder.
Below is a detailed explanation of the architecture and working of autoencoders.
Architecture of Autoencoder
1. Encoder
● The encoder is responsible for compressing the input data into a
lower-dimensional representation. It consists of one or more layers of neurons
that progressively reduce the dimensionality of the input.
● The output of the encoder is a compact representation of the input, often referred
to as the "code" or "latent representation."
2. Bottleneck (Latent Space)
● The bottleneck layer is the most critical component of the autoencoder. It
contains the compressed knowledge representation of the input data.
● This layer restricts the flow of information, allowing only the most significant
features to pass through to the decoder. The size of the bottleneck layer is a
hyperparameter that influences the amount of compression.
3. Decoder
● The decoder reconstructs the original input from the compressed representation
provided by the bottleneck layer.
● It mirrors the encoder's architecture, using layers that expand the dimensionality
back to the original input size. The goal is to minimize the difference between the
input and the reconstructed output.
Working of Autoencoder
1. Input Data: The autoencoder takes raw input data (e.g., images) and passes it
through the encoder.
2. Encoding: The encoder compresses the input data into a lower-dimensional
representation. This representation captures the essential features of the input while
discarding irrelevant information.
3. Bottleneck: The compressed representation passes through the bottleneck layer, which retains only the most significant features of the input.
4. Decoding: The decoder takes the compressed representation from the bottleneck and attempts to reconstruct the original input. It uses layers that expand the dimensionality back to the input size.
5. Loss Calculation: The output of the decoder is compared to the original input, and a
loss function (such as Mean Squared Error) calculates the reconstruction error. The
goal is to minimize this error during training.
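A minimal PyTorch sketch of the encoder-bottleneck-decoder structure for flattened 28x28 inputs (the layer dimensions are illustrative):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                        nn.Linear(128, 32))   # 32-dimensional bottleneck
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 784))  # mirrors the encoder

x = torch.rand(16, 784)        # a dummy batch of flattened inputs
x_hat = decoder(encoder(x))    # compress, then reconstruct
loss = nn.MSELoss()(x_hat, x)  # reconstruction error to minimize during training
print(loss.item())
```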
Techniques for Handling Imbalanced Datasets
1. Resampling Techniques
● Oversampling the Minority Class: This involves increasing the number of instances in the minority class by duplicating existing examples or generating new synthetic examples (e.g., using SMOTE - Synthetic Minority Over-sampling Technique; see the sketch below).
● Undersampling the Majority Class: This technique reduces the number of instances in the majority class to match the minority class. While this can help balance the dataset, it may lead to a loss of valuable information.
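A minimal sketch of oversampling with SMOTE, assuming the imbalanced-learn package (imblearn) is installed and using a synthetic imbalanced dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# a synthetic dataset with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# SMOTE generates new synthetic minority-class examples
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After:", Counter(y_res))  # both classes now balanced
```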
2. Cost-sensitive Learning
● This approach involves modifying the learning algorithm to penalize
misclassifications of the minority class more heavily than those of the majority
class. This can be done by assigning different weights to classes during training.
● Example: In a medical diagnosis model, misclassifying a disease (minority class)
may incur a higher cost than misclassifying a healthy patient (majority class). By
assigning a higher weight to the disease class, the model learns to prioritize its
correct classification.
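A minimal sketch of cost-sensitive learning using scikit-learn's class_weight parameter (the 10:1 weighting below is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# penalize misclassifying the minority class (label 1) ten times more heavily
model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
model.fit(X, y)
# class_weight="balanced" instead sets weights inversely proportional
# to class frequencies
```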
3. Ensemble Methods
● Techniques such as bagging and boosting can be adapted to handle imbalanced
datasets. For example, using ensemble methods like Random Forests or
Gradient Boosting can improve the model's ability to learn from the minority
class.
● Example: In a credit scoring model, using an ensemble of decision trees can help
improve predictions for the minority class (e.g., defaulting customers) by
aggregating multiple models' predictions.
Types of Ensemble Methods
1. Bagging (Bootstrap Aggregating): Multiple models are trained in parallel on bootstrap samples of the training data, and their predictions are combined. This reduces variance; the working of bagging is described in detail below.
2. Boosting: Boosting trains models sequentially, where each new model focuses on the errors made by the previous ones. This method combines the outputs of weak learners to create a strong learner. Examples include AdaBoost and Gradient Boosting.
4. Random Forest: A specific type of bagging that uses decision trees as base learners.
It introduces randomness in both data sampling and feature selection to create a
diverse set of trees.
5. Voting Classifiers: This method combines the predictions of multiple models by taking
a vote (for classification) or averaging (for regression) to make a final prediction.
Working of Bagging
1. Bootstrap Sampling:
● Multiple subsets of the training data are created through bootstrap sampling.
Each subset is generated by randomly selecting instances from the original
dataset with replacement. This means that some instances may appear multiple
times in a subset, while others may not appear at all.
2. Model Training:
● A separate base model (for example, a decision tree) is trained independently on each bootstrap sample.
3. Aggregation of Predictions:
● For regression tasks, the final prediction is obtained by averaging the predictions of all base models. For classification tasks, the final prediction is determined by majority voting among the base models.
Advantages of Bagging
● Variance Reduction: Averaging over many independently trained models reduces variance and helps prevent overfitting.
● Parallelization: The training of base models can be done in parallel since they are independent of each other, making bagging computationally efficient.
Random Forest is a popular example of bagging. It builds multiple decision trees using
bootstrap samples of the data and averages their predictions. This method not only
reduces overfitting but also improves the model's generalization capabilities.
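A minimal sketch of bagging and Random Forest with scikit-learn (BaggingClassifier uses a decision tree as its default base learner):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions combined by voting
bag = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())

# Random Forest: bagging over trees plus random feature selection per split
rf = RandomForestClassifier(n_estimators=50, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```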
7. Numerical on calculating various performance metrics like precision,
recall, accuracy, specificity and sensitivity given the confusion matrix.
(CO5)
                  Predicted Positive   Predicted Negative
Actual Positive          TP                   FN
Actual Negative          FP                   TN
Formulas
1. Precision = TP/(TP + FP)
2. Recall (Sensitivity) = TP/(TP + FN)
3. Accuracy = (TP + TN)/(TP + TN + FP + FN)
4. Specificity = TN/(TN + FP)
Example Calculation
Given TP = 70, FN = 20, FP = 10, TN = 50:
1. Precision:
Precision = 70/(70+10) = 70/80 = 0.875 (87.5%)
2. Recall:
Recall = 70/(70 + 20) = 70/90 = 0.778 (77.8%)
3. Accuracy:
Accuracy = (70 + 50)/(70 + 50 + 10 + 20) = 120/150 = 0.8 (80%)
4. Specificity:
Specificity = 50/(50 + 10) = 50/60 = 0.833 (83.3%)
Summary of Results
● Precision: 87.5%
● Recall: 77.8%
● Accuracy: 80%
● Specificity: 83.3%
● Sensitivity: 77.8%
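A minimal sketch that reproduces the example calculation in Python:

```python
# confusion-matrix counts from the worked example above
TP, FN, FP, TN = 70, 20, 10, 50

precision = TP / (TP + FP)                  # 0.875
recall = TP / (TP + FN)                     # 0.778 (= sensitivity)
accuracy = (TP + TN) / (TP + TN + FP + FN)  # 0.800
specificity = TN / (TN + FP)                # 0.833

print(f"Precision={precision:.3f}  Recall={recall:.3f}  "
      f"Accuracy={accuracy:.3f}  Specificity={specificity:.3f}")
```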
Bootstrapping
Bootstrapping is a powerful statistical resampling technique used to estimate the
distribution of a statistic (such as the mean, variance, or confidence intervals) by
repeatedly sampling with replacement from a single dataset. This method is particularly
useful when the underlying distribution of the data is unknown or when traditional
parametric assumptions cannot be met.
3. Confidence Intervals:
● Bootstrapping can be used to construct confidence intervals for the estimated
statistics. By examining the distribution of the bootstrap estimates, we can
determine the range within which the true population parameter is likely to fall.
Steps in the Bootstrapping Process
1. Select a Sample:
● Choose a sample of size n from the original dataset.
2. Resample with Replacement:
● Draw a large number of bootstrap samples, each of size n, by sampling from the original sample with replacement.
3. Compute the Statistic:
● Calculate the statistic of interest (e.g., the mean) on each bootstrap sample.
4. Summarize the Distribution:
● Use the spread of the bootstrap statistics to estimate standard errors or construct confidence intervals.
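A minimal sketch of these steps in Python, estimating a 95% confidence interval for the mean of a small hypothetical sample:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.0])  # hypothetical sample

# draw 10,000 bootstrap samples (with replacement) and record each mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

lo, hi = np.percentile(boot_means, [2.5, 97.5])  # percentile confidence interval
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```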
Advantages of Bootstrapping
● Makes no assumptions about the underlying distribution of the data and can be applied even to small samples.
● Simple to implement and applicable to almost any statistic.
Disadvantages of Bootstrapping
● Computationally intensive, since the statistic must be recomputed for many resamples.
● Results are only as reliable as the original sample; a biased or unrepresentative sample yields biased estimates.
What is Cross-Validation?
Cross-validation involves partitioning the available data into subsets, training the model
on some of these subsets, and validating it on the remaining subsets. This process is
repeated multiple times, allowing for a more robust estimate of the model’s
performance. The primary goal of cross-validation is to provide a more accurate
estimate of a model's ability to predict new data that was not used during training.
Types of Cross-Validation
1. K-Fold Cross-Validation:
● The dataset is divided into k subsets (or folds). The model is trained on k-1 folds
and tested on the remaining fold. This process is repeated k times, with each fold
serving as the test set once. The results are averaged to produce a single
performance metric.
● Example: In 5-fold cross-validation, the dataset is split into 5 parts. The model is trained on 4 parts and tested on the remaining part, repeating this for each fold (see the sketch after this list).
4. Holdout Method:
● The dataset is split into two parts: a training set and a testing set. The model is
trained on the training set and evaluated on the testing set. This method is simple
but can lead to high variance in performance estimates.
● Example: A common split is 70% training and 30% testing.
5. Repeated Cross-Validation:
● This involves repeating the cross-validation process multiple times with different
random splits of the data. It provides a more stable estimate of model
performance.
● Example: Perform 10-fold cross-validation 5 times, averaging the results to
reduce variability.
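A minimal sketch of 5-fold cross-validation with scikit-learn, using its built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())  # one accuracy per fold, then the averaged estimate
```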
Steps in K-Fold Cross-Validation
1. Shuffle the Dataset: Randomly shuffle the dataset to ensure that the folds are representative of the overall data distribution.
2. Split into Folds: Divide the shuffled data into k equal-sized folds.
3. Train and Evaluate: For each fold, train the model on the remaining k-1 folds and evaluate it on the held-out fold.
4. Average the Results: Average the k performance scores to obtain the final estimate of model performance.
Hold-Out Method
1. Data Splitting:
● The dataset is divided into two subsets:
● Training Set: Typically, a larger portion of the data (e.g., 70-80%) is used to train
the model.
● Test Set: The remaining portion (e.g., 20-30%) is reserved for testing the model's
performance.
2. Model Training:
● The model is trained using the training set. During this phase, the algorithm
learns the underlying patterns and relationships in the data.
3. Model Evaluation:
● After training, the model is evaluated using the test set. This involves making
predictions on the test data and comparing them to the actual outcomes.
● Common performance metrics include accuracy, precision, recall, and F1-score,
among others.
Suppose we have a dataset with 1000 samples. We can apply the hold-out method as follows: using a 70/30 split, 700 samples are used to train the model and the remaining 300 samples are held out for testing.
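A minimal sketch of this split with scikit-learn's train_test_split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
print(len(X_train), len(X_test))  # 700 training samples, 300 test samples
```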
Random Subsampling
1. Data Splitting:
● The dataset is randomly divided into two subsets: a training set and a validation
set. The size of these subsets can be defined by the user, typically with a larger
portion allocated to training and a smaller portion to validation.
2. Model Training:
● A model is trained on the training set. This process is repeated multiple times,
with each iteration involving a new random split of the data.
3. Model Evaluation:
● After training, the model is evaluated on the validation set. Performance metrics
(such as accuracy, precision, recall, etc.) are calculated for each iteration.
4. Averaging Results:
● The performance metrics from all iterations are averaged to provide a more
robust estimate of the model's performance.
Advantages and Limitations of Random Subsampling
● Flexibility: The user can define the size of the training and validation sets, allowing for more control over the evaluation process.
● Potential for Overfitting: If the same data points are used repeatedly in the training set, the model may overfit to those specific instances, leading to an overly optimistic performance estimate.
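Random subsampling can be implemented, for example, with scikit-learn's ShuffleSplit; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)  # 10 random splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splitter)
print(scores.mean())  # performance averaged over all iterations
```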
One of the most prominent applications of multimodal data science is in the field of
autonomous vehicles. Self-driving cars rely on a variety of sensors and data sources to
perceive their environment, make decisions, and navigate safely. These sensors
include:
● Cameras: Used for object detection, lane detection, traffic sign recognition, and
more.
● Lidar (Light Detection and Ranging): Measures distance using laser light,
creating a 3D map of the environment.
● Radar (Radio Detection and Ranging): Detects and measures the distance and
velocity of objects.
● GPS (Global Positioning System): Provides location and navigation data.
● Odometry: Measures the distance traveled by the vehicle's wheels.
● Inertial Measurement Unit (IMU): Measures acceleration and rotation, providing
information about the vehicle's motion.
Autonomous vehicles use multimodal deep learning to fuse and process data from
these various sensors. By combining information from multiple modalities, the vehicle
can build a more comprehensive understanding of its surroundings and make more
informed decisions.
For example, the vehicle might use camera images to detect pedestrians and other
vehicles, while lidar data provides precise distance measurements. Radar can help
track the speed and trajectory of moving objects, while GPS and odometry data help the
vehicle localize itself on a map. By integrating all this information, the autonomous
vehicle can navigate safely and avoid collisions.
Data science has made significant strides in processing and analyzing various types of
data, including text, images, and videos. These multimodal applications leverage
advanced techniques to extract insights, enhance user experiences, and automate
processes across different industries. Below are detailed explanations of how data
science is applied to text, images, and videos, along with real-world examples.
1. Text Analysis
Natural Language Processing (NLP) is a branch of data science that focuses on the
interaction between computers and human language. It enables machines to
understand, interpret, and generate human language in a valuable way. Key
applications of NLP include:
● Chatbots and Virtual Assistants: NLP powers chatbots that can understand and
respond to user queries in real time. For instance, customer support chatbots on
websites can handle common inquiries, reducing the need for human
intervention.
2. Image Analysis
Data science has transformed image analysis through the use of deep learning
techniques, particularly Convolutional Neural Networks (CNNs). Key applications
include:
● Object Detection and Recognition: Image recognition systems can identify and
classify objects within images. For example, autonomous vehicles use image
analysis to detect pedestrians, traffic signs, and other vehicles, enhancing safety
on the road.
● Facial Recognition: Security systems and social media platforms use facial
recognition technology to identify individuals in images. Companies like
Facebook and Apple utilize this technology for tagging and unlocking devices.
3. Video Analysis
Video analysis combines techniques from both image processing and time-series
analysis to extract meaningful information from video data. Applications include:
● Surveillance and Security: Video analytics systems can monitor live feeds from
security cameras to detect unusual activities or recognize faces in real-time. For
example, many retail stores use video analytics to prevent theft and enhance
customer service.
● Traffic Monitoring: Intelligent transportation systems analyze video feeds from
traffic cameras to monitor vehicle flow, detect accidents, and optimize traffic
signals. This helps in reducing congestion and improving road safety.
● Content Moderation: Platforms like YouTube and Facebook use video analysis to
automatically detect inappropriate content by analyzing video frames and audio.
This helps in maintaining community guidelines and ensuring user safety.
How It Works:
● Data Fusion: Autonomous vehicles use data fusion techniques to combine
information from multiple sensors. For example, a camera might detect a
pedestrian, while lidar provides precise distance measurements. By integrating
this data, the vehicle can make informed decisions, such as stopping to avoid a
collision.
● Deep Learning Models: Convolutional Neural Networks (CNNs) are used for
image classification tasks, while Recurrent Neural Networks (RNNs) may be
employed for processing sequences of video frames to recognize actions or
predict future movements.
● Real-Time Processing: The ability to analyze data in real-time is crucial for safety.
Autonomous vehicles must process sensor data quickly to respond to dynamic
environments, such as changing traffic conditions or unexpected obstacles.