DAV Experiments
Experiment No. 1
1. Matplotlib:
Matplotlib is a widely used plotting library for Python. It's powerful and provides a wide variety of plots like line plots,
scatter plots, histograms, bar charts, etc. It's highly customizable, allowing users to control almost every aspect of the plot.
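As a quick illustration, a minimal Matplotlib line plot could be sketched as below (the values are illustrative, not from any particular dataset):
# A minimal Matplotlib sketch
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]
plt.plot(x, y, marker="o", color="blue")
plt.title("Simple Line Plot")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()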
These libraries/tools offer a range of functionalities and flexibility for creating visualizations based on your data and
programming preferences. Explore them further to leverage their full potential in data visualization tasks.
Learning Objectives:
To understand the graph libraries.
Conclusion/Learning outcome:
The concept of graph libraries is studied and understood using matplotlib/Seaborn/Excel plots.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
Course: Data Analytics & Visualization    Course Code: CSL-601
Semester: 6    Department: AI & DS
Laboratory No: 407    Name of Subject Teacher: Pramod Sir
Name of Student: Sahil Surve    Roll Id: VU2S2223020
Experiment No. 2
Aim: Data Exploration: Knowing the data, Data preparation and Cleaning
Prior Concepts:
Exploring data is a crucial step in understanding its characteristics, trends, and underlying patterns.
Conducting experiments in data exploration involves various techniques and tools to gain insights into the
dataset. Following are the different approaches to conduct a data exploration.
8. Iterative Process:
- Data exploration is often iterative. Revisit steps, try different techniques, and compare results to gain a comprehensive understanding of the dataset.
9. Ethical Considerations:
- Ensure ethical use of data, especially regarding privacy, biases, and the implications of insights drawn from the data.
New Concept:
Data loading in R
1. CSV File
# Load a CSV file
data <- read.csv("your_file.csv")
2. EXCEL File
# Load an Excel file (assuming 'readxl' package is installed)
library(readxl)
data <- read_excel("your_file.xlsx")
Data preparation and cleaning involve various steps to ensure that the dataset is in a suitable format for analysis or modeling.
New Concepts:
Example of R Code for Data Cleaning and Preparation:
These steps ensure that your data is clean, formatted correctly, and ready for analysis or modeling tasks in R. Adjust these methods based on your specific dataset and analysis requirements.
Learning Objectives:
To understand the different techniques of Data exploration, Data preparation and Cleaning.
Conclusion/Learning outcome:
The use of different tools and commands for understanding the data is studied and implemented. The concept of Data preparation and Cleaning is understood and implemented in the R language.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
Experiment No. 3
Aim: To understand and implement visualization of data.
Prior Concepts:
R provides numerous packages for data visualization. One of the most commonly used packages is ggplot2,
which offers a flexible and powerful system for creating a wide variety of visualizations. Here are some basic
examples of data visualization using ggplot2:
New Concept:
1. Install and load ggplot2 Package:
# Install ggplot2 if not already installed
install.packages("ggplot2")
# Load ggplot2
library(ggplot2)
2. Histogram
# Create a histogram of a numerical variable 'numeric_column' from dataframe 'data'
ggplot(data, aes(x = numeric_column)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Histogram of Numeric Column", x = "Values", y = "Frequency")
3. Scatter plot
# Create a scatter plot of 'numeric_column1' against 'numeric_column2'
ggplot(data, aes(x = numeric_column1, y = numeric_column2)) +
geom_point(color = "blue") +
labs(title = "Scatter Plot", x = "X-axis Label", y = "Y-axis Label")
4. Boxplot
# Create a boxplot to visualize distribution of 'numeric_column' across 'group_column'
ggplot(data, aes(x = group_column, y = numeric_column)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Boxplot of Numeric Column by Group", x = "Groups", y = "Values")
5. Barchart
# Create a bar chart to visualize counts of categories in 'categorical_column'
ggplot(data, aes(x = categorical_column)) +
geom_bar(fill = "orange") +
labs(title = "Bar Chart of Categorical Column", x = "Categories", y = "Count")
6. Lineplot
# Create a line plot to show trends over time using 'date_column' and 'numeric_column'
ggplot(data, aes(x = date_column, y = numeric_column)) +
geom_line(color = "purple") +
labs(title = "Line Plot of Numeric Column over Time", x = "Date", y = "Values")
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
Course: Data Analytics & Visualization    Course Code: CSL-601
Semester: 6    Department: AI & DS
Laboratory No: 407    Name of Subject Teacher: Pramod Bhavarthe
Name of Student: Sahil Surve    Roll Id: VU2S2223020
Experiment No. 4
Aim: To understand and implement Correlation and Covariance.
Prior Concept:
Covariance
It’s a statistical term demonstrating a systematic association between two random variables, where a change in one variable is mirrored by a change in the other. The sample covariance is calculated as:
Cov(X, Y) = Σ (Xi - X̄)(Yi - Ȳ) / (n - 1)
where X̄ and Ȳ are the sample means of X and Y, and n is the number of observations.
Correlation standardizes covariance to a value between -1 and +1:
r = Cov(X, Y) / (sX × sY)
where sX and sY are the standard deviations of X and Y.
Interpreting Correlation Values
There are three types of correlation based on diverse values. Negative correlation, positive correlation, and no
or zero correlation.
In a negative correlation, one variable’s value increases while the other’s decreases; a perfect negative correlation has a value of -1. In a positive correlation, both variables increase or decrease together; a perfect positive correlation has a value of +1.
Just like in the case of covariance, a zero correlation means no relation between the variables: whether one variable increases or decreases won’t affect the other variable.
Advantages
• Easy to Calculate: Calculating covariance doesn’t require any assumptions about the underlying data distribution. Hence, it’s easy to calculate covariance with the formula given above.
• Beneficial in Portfolio Analysis: Covariance is typically employed in portfolio analysis to evaluate the diversification advantages of combining different assets.
Disadvantages
• Restricted to Linear Relationships: Covariance only gauges linear relationships between variables and does not capture non-linear associations.
• Scale Dependency: Covariance is affected by the variables’ measurement scales, making it challenging to compare covariances across datasets or variables with distinct units.
Advantages and Disadvantages of Correlation
The advantages and disadvantages of correlation are as follows:
Advantages
• Determining Non-Linear Relationships: While correlation primarily estimates linear relationships,
it can also demonstrate the presence of non-linear connections, especially when using alternative
correlation standards like Spearman’s rank correlation coefficient.
• Standardized Criterion: Correlation coefficients, such as the Pearson correlation coefficient, are
standardized, varying from -1 to 1. This allows for easy comparison and interpretation of the
direction and strength of relationships across different datasets.
• Robustness to Outliers: Correlation coefficients are typically less sensitive to outliers than covariance, delivering a more robust measure of the association between variables.
• Scale Independence: Correlation is not affected by measurement scales, making it convenient for comparing relationships between variables with distinct units or scales.
Disadvantages
• Driven by Extreme Values: Extreme values can still affect the correlation coefficient, even though it is less susceptible to outliers than covariance.
New Concept:
Python code for Correlation & Covariance
1. Using Numpy:
2. Using Pandas:
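The original snippets are not reproduced here; a minimal sketch of what the NumPy and Pandas versions could look like (with illustrative sample values) is:
import numpy as np
import pandas as pd
x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 24, 33, 48, 59])
# 1. Using NumPy
print(np.cov(x, y))        # 2x2 covariance matrix
print(np.corrcoef(x, y))   # 2x2 correlation matrix
# 2. Using Pandas
df = pd.DataFrame({"x": x, "y": y})
print(df.cov())            # pairwise covariances
print(df.corr())           # pairwise Pearson correlations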
R-code for Correlation & Covariance
Learning Objectives:
To understand the correlation and covariance between the variables.
Conclusion/Learning outcome:
The correlation and covariance between variables are understood and implemented in Python and R code.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
Experiment No. 5
Aim: To understand and implement Hypothesis Testing.
Prior Concepts:
Hypothesis testing is a statistical method used to make inferences about a population parameter based on
sample data. The process involves stating a hypothesis, collecting and analyzing data, and then determining
whether the data provides enough evidence to reject or fail to reject the null hypothesis.
6. Make a Decision:
- If the test statistic falls into the critical region (or if the p-value is less than the significance level), reject the null hypothesis.
- If the test statistic does not fall into the critical region (or if the p-value is greater than the significance level), fail to reject the null hypothesis.
New Concepts:
Example using Student's t-test in Python:
Suppose we want to test if there's a significant difference in the mean of two independent groups (e.g., group A and group B). You can use the t-test to perform this hypothesis test.
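The original script is not reproduced here; a minimal sketch using SciPy's independent two-sample t-test (with illustrative group values) could be:
import numpy as np
from scipy import stats
group_A = np.array([85, 90, 88, 75, 95, 80, 89, 84])
group_B = np.array([78, 82, 80, 72, 85, 76, 79, 81])
t_stat, p_value = stats.ttest_ind(group_A, group_B)   # independent two-sample t-test
alpha = 0.05
print("t-statistic:", t_stat, "p-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")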
Learning Objectives:
To understand and implement Hypothesis Testing.
Conclusion/Learning outcome:
Hypothesis testing is understood and implemented with a sample dataset.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
Practical Number 6
Title of Practical To implement Simple Linear Regression.
Prior Concepts:
Linear Regression is an algorithm that belongs to supervised Machine Learning. It tries to apply
relations that will predict the outcome of an event based on the independent variable data
points. The relation is usually a straight line that best fits the different data points as close as
possible. The output is of a continuous form, i.e., numerical value. For example, the output could
be revenue or sales in currency, the number of products sold, etc. The independent variables can be single or multiple.
New Concepts:
Example of simple linear regression using a simple dataset in R-code:
# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
# Fit a linear regression model
model <- lm(y ~ x)
# Summary of the linear regression model
summary(model)
# Plotting the data with the regression line
plot(x, y, main = "Scatterplot with Regression Line")
abline(model, col = "red")
Output:
Learning Objectives:
To understand and implement linear regression on sample dataset.
Conclusion/Learning outcome:
Regression analysis is a family of statistical tools that can help business analysts build models to
predict trends, make tradeoff decisions, and model the real world for decision-making
support. These models can be used to predict the value of one or more variables from knowledge of
the value of other variables. Specific regression techniques include simple linear regression analysis,
multiple linear regression analysis, multiple curvilinear regression, multivariate linear regression,
and multivariate polynomial regression.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE
Practical Number 7
Title of Practical To implement Multiple Linear Regression.
Prior Concepts:
• Multiple linear regression refers to a statistical technique that uses two or more
independent variables to predict the outcome of a dependent variable.
• The technique enables analysts to determine the variation of the model and the relative
contribution of each independent variable in the total variance.
• Multiple regression can take two forms, i.e., linear regression and non-linear regression.
The multiple linear regression equation takes the form:
yi = β0 + β1·xi1 + β2·xi2 + ... + βp·xip + ϵ
Where:
• yi is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of y when both xi1 and xi2 are 0.
• β1 and β2 are the regression coefficients representing the change in y relative to a
one-unit change in xi1 and xi2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model’s random error (residual) term.
Simple linear regression enables statisticians to predict the value of one variable using the
available information about another variable. Linear regression attempts to establish the
relationship between the two variables along a straight line.
Multiple regression is a type of regression where the dependent variable shows
a linear relationship with two or more independent variables. It can also be non-linear,
where the dependent and independent variables do not follow a straight line.
Both linear and non-linear regression track a particular response using two or more
variables graphically. However, non-linear regression is usually difficult to execute since it is
created from assumptions derived from trial and error.
Assumptions of Multiple Linear Regression
Multiple linear regression is based on the following assumptions:
1. A linear relationship between the dependent and independent variables
The first assumption of multiple linear regression is that there is a linear relationship
between the dependent variable and each of the independent variables. The best way to
check the linear relationships is to create scatterplots and then visually inspect the
scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then
the analyst will need to run a non-linear regression or transform the data using statistical
software, such as SPSS.
2. The independent variables are not highly correlated with each other
The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show
multicollinearity, there will be problems figuring out the specific variable that contributes to
the variance in the dependent variable. The best method to test for this assumption is the Variance Inflation Factor (VIF) method (see the sketch after these assumptions).
3. The variance of the residuals is constant
Multiple linear regression assumes that the amount of error in the residuals is similar at
each point of the linear model. This scenario is known as homoscedasticity. When analyzing
the data, the analyst should plot the standardized residuals against the predicted values to
determine if the points are distributed fairly across all the values of independent variables.
To test the assumption, the data can be plotted on a scatterplot or by using statistical
software to produce a scatterplot that includes the entire model.
4. Independence of observation
The model assumes that the observations should be independent of one another. Simply
put, the model assumes that the values of residuals are independent. To test for this assumption, we use the Durbin-Watson statistic (also included in the sketch after these assumptions).
The test will show values from 0 to 4, where a value of 0 to 2 shows positive autocorrelation,
and values from 2 to 4 show negative autocorrelation. The mid-point, i.e., a value of 2,
shows that there is no autocorrelation.
5. Multivariate normality
Multivariate normality occurs when residuals are normally distributed. To test this
assumption, look at how the values of residuals are distributed. It can also be tested using
two main methods, i.e., a histogram with a superimposed normal curve or the Normal
Probability Plot method.
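The two checks referenced above (VIF for multicollinearity and Durbin-Watson for independence of residuals) can be sketched with statsmodels; the dataframe df and its columns x1, x2, y below are assumptions used only for illustration:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 4, 3, 6], "y": [3, 4, 8, 9, 13]})  # assumed data
X = sm.add_constant(df[["x1", "x2"]])
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF per column:", dict(zip(X.columns, vif)))
model = sm.OLS(df["y"], X).fit()
print("Durbin-Watson:", durbin_watson(model.resid))   # values near 2 indicate no autocorrelation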
R-code:
# Sample data
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(3, 4, 5, 6, 7)
y <- c(2, 4, 6, 8, 10)
# Combine the predictors into a data frame
data <- data.frame(x1, x2, y)
# Fit a multiple linear regression model
model <- lm(y ~ x1 + x2, data = data)
# Summary of the multiple linear regression model
summary(model)
# New observations for prediction (the x2 values are assumed here so that the output below is reproduced)
new_data <- data.frame(x1 = c(6, 7), x2 = c(8, 9))
predicted_values <- predict(model, newdata = new_data)
print(predicted_values)
Output:
print(predicted_values)
 1  2
12 14
Python code:
from sklearn.linear_model import LinearRegression
# Sample data
x1 = [1, 2, 3, 4, 5]
x2 = [3, 4, 5, 6, 7]
y = [2, 4, 6, 8, 10]
# Fit the model on the two predictors
model = LinearRegression().fit(list(zip(x1, x2)), y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Predict for new observations (values chosen to match the printed output below)
predicted_values = model.predict([[6, 8], [7, 9]])
print(predicted_values)
Output:
Coefficients: [1. 1.]
Intercept: -2.0
[12. 14.]
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression
# Sample data
x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([3, 4, 5, 6, 7])
y = np.array([2, 4, 6, 8, 10])
# Reshape variables for sklearn
X = np.column_stack((x1, x2))
# Fit the model and make predictions
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
# 3D scatter plot of actual and predicted values
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, y, color='blue', label='Actual')
ax.scatter(x1, x2, y_pred, color='red', marker='^', label='Predicted')
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Y')
ax.set_title('Multiple Linear Regression')
#ax.legend()
plt.show()
Output:
Learning Objectives:
To understand and implement Multiple Linear Regression in R/Python.
Conclusion/Learning outcome:
Multiple Linear Regression is understood and implemented on a sample dataset in R and Python.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL
INTELLIGENCE & DATA SCIENCE
Experiment No. 8
Aim: To understand and implement Time Series Analysis using Python/Excel.
Prior Concepts:
Sometimes data changes over time. This data is called time-dependent data. Given time-dependent data,
the past data can be analyzed to predict the future. The future prediction will also include time as a
variable, and the output will vary with time. Using time-dependent data, patterns that repeat over time
can be found. A Time Series is a set of observations that are collected after regular intervals of time. If
plotted, the Time series would always have one of its axes as time.
Time Series Analysis in Python considers data collected over time might have some structure; hence it
analyses Time Series data to extract its valuable characteristics.
Consider the running of a bakery. Given the data of the past few months, one can predict which items need to be baked at what time. The morning crowd would need more bread items, like bread rolls, croissants, breakfast muffins, etc. At night, people may come in to buy cakes, pastries, or other dessert items. Using time series analysis, one can predict which items are popular during different times and even different seasons.
Different Components of Time Series Analysis:
The main components of Time Series Analysis are:
1. Trend: The trend shows the variation of data with time or the frequency of data. Using a trend, you can see whether your data increases, decreases, or remains stable over time. Population growth, stock market fluctuations, and production in a company are all examples of trends.
2. Seasonality: Seasonality is used to find the variations which occur at regular intervals of time. Examples are
festivals, conventions, seasons, etc. These variations usually happen around the same time period and affect the data
in specific ways which you can predict.
3. Irregularity: Fluctuations in the time series data that do not correspond to the trend or seasonality. These variations are purely random and usually caused by unforeseeable circumstances, such as a sudden decrease in population because of a natural calamity.
ARIMA Model:
An ARIMA (Auto-Regressive Integrated Moving Average) model combines three components: Auto-Regression (AR), Integration (I), and Moving Average (MA).
Auto-Regressive models predict future behavior using past behavior where there is some correlation between past and future data. A first-order autoregressive model is a modified version of the slope formula, with the target value expressed as the sum of the intercept, the product of a coefficient and the previous output, and an error correction term:
y(t) = c + φ1·y(t-1) + ε(t)
Moving Average
Moving Average is a statistical method that takes the updated average of values to help cut down on noise. It takes
the average over a specific interval of time. You can get it by taking different subsets of your data and finding their
respective averages. You first consider a bunch of data points and take their average. You then find the next average
by removing the first value of the data and including the next value of the series.
Integration
Integration is the difference between present and previous observations. It is used to make the time series stationary.
Each of these values acts as a parameter for an ARIMA model. Instead of representing the ARIMA model by these
various operators and models, one can use parameters to represent them.
These parameters are:
1. p: Previous lagged values for each time point. Derived from the Auto-Regressive Model.
2. q: Previous lagged values for the error term. Derived from the Moving Average.
3. d: Number of times data is differenced to make it stationary. It is the number of times it performs integration.
ARIMA with Python
The statsmodels library stands as a vital tool for those looking to harness the power of ARIMA for time series
forecasting in Python. Building an ARIMA Model:
A Step-by-Step Guide:
1. Model Definition: Initialize the ARIMA model by invoking ARIMA() and specifying the p, d, and q parameters.
2. Model Training: Train the model on your dataset using the fit() method.
3. Making Predictions: Generate forecasts by utilizing the predict() function and designating the desired time index
or indices.
Let us fit an ARIMA model to the entire Shampoo Sales dataset and review the residual errors.
We’ll employ the ARIMA(5,1,0) configuration:
5 lags for autoregression (AR)
1st order differencing (I)
No moving average term (MA)
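A minimal sketch of this step with the statsmodels library follows; the file name 'shampoo-sales.csv' and the 'Sales' column are assumptions about how the dataset is stored locally:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
series = pd.read_csv('shampoo-sales.csv', header=0, index_col=0)['Sales']   # assumed file/column names
model = ARIMA(series.values, order=(5, 1, 0))   # p=5, d=1, q=0
model_fit = model.fit()
print(model_fit.summary())
residuals = pd.DataFrame(model_fit.resid)       # review the residual errors
print(residuals.describe())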
OUTPUT:
Rolling Forecast ARIMA Model
The ARIMA model is adept at forecasting future time points. In a rolling forecast, the model is often retrained as new data becomes available, allowing for more accurate and adaptive predictions. We can use the predict() function on the ARIMA results object to make predictions; it accepts the indexes of the time steps for which predictions are required, relative to the start of the training dataset. The model could use further tuning of the p, d, and maybe even the q parameters. A sketch of such a rolling forecast follows.
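A sketch of such a walk-forward (rolling) forecast, assuming the series loaded in the previous sketch, could be:
# Walk-forward validation: retrain the model each time a new observation arrives
X = series.values
size = int(len(X) * 0.66)
train, test = list(X[:size]), X[size:]
predictions = []
for t in range(len(test)):
    model_fit = ARIMA(train, order=(5, 1, 0)).fit()
    yhat = model_fit.forecast()[0]      # one-step-ahead prediction
    predictions.append(yhat)
    train.append(test[t])               # the observed value becomes part of the history
print(predictions)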
Learning Objectives:
To understand and implement Time Series Analysis using Python/Excel
Conclusion/Learning outcome:
Time Series Analysis using ARIMA Model has been understood and implemented in Python.
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE
EXPERIMENT NO.09
2) Power BI supports a wide range of data sources, including Excel files, CSV files, databases
(SQL Server, Azure SQL Database, etc.), cloud services (like Salesforce, Google Analytics), and
more.
Explore the "Get Data" section in Power BI Desktop to browse available connectors.
3) Explore Your Data:
Microsoft Power BI is a business analytics service that helps in analyzing and visualizing data from various sources to craft data stories and share them with end users. Power BI is a combination of the following services and applications, which work hand in glove to create and share interactive business insights:
a) Area Charts
The area chart is based on the line chart for displaying quantitative graphical data. The area between the axis and the lines is commonly filled with colors, textures, or patterns. You can compare more than two quantities with area charts. It shows how trends change over time and can be used to draw the users' attention to the total change across a trend.
b) Line Charts
Line charts are among the most commonly used charts for representing data and are characterized by a series of data points connected by straight lines. Each point on the line corresponds to a data value in the given category and shows the exact value of the plotted data. Line charts are best used to show trends over a period of time, e.g., across dates, months, and years.
c) Bar Charts
In the list of Power BI visualization types, next, we are going to discuss bar charts.
Bar charts are widely used because they are simple to create and easy to understand. Bar charts are horizontal charts that represent absolute data. They are useful for displaying data that includes negative values, because it is possible to position the bars on either side of the axis.
d) Column Charts
Column charts are similar to bar charts; the only difference is that a column chart divides data of the same category into clusters and compares values within a cluster as well as across clusters.
e) Pie Charts
A pie chart is a circular statistical chart that shows the whole dataset in parts. Each portion of a pie chart represents a percentage, and the sum of all parts should equal 100%. The data is divided into slices to show the numerical proportion of each part. Pie charts are mostly used to represent parts of a single category of data and help users understand the data quickly. They are widely used in education, the business world, and communication media.
f) Doughnut Charts
Doughnut charts are similar to pie charts and are named after their doughnut-like shape. They make the data easy to understand because they show the whole dataset as proportions. A doughnut chart is most useful when you need to display several proportions that make up a final value.
1) Report view:
This is the default view where we create our reports by arranging the visualizations
including different graphs, charts, and a lot more, according to our requirements over
multiple pages in a single report.
2) Data view:
When we are modeling our data, sometimes without creating a visual on the canvas, we
would like to see the loaded table or column. This view helps us to view data in a grid
format; piece by piece to analyze it closely.
3) Model view:
This view helps us to see the relationships between different tables and also the columns they contain.
Since we deal with a large volume of data from multiple sources, there are chances that our data might be incorrect or have discrepancies in a variety of ways. So, before visualizing this data, it is essential to prepare the raw data by cleaning, transforming, and modeling it.
You might have noticed when we were loading our data into Power BI, we came across two
options- one was Load and the other was Transform Data. If we click on the latter option,
another window launches. This is known as the Power Query Editor.
One more thing to be noted is we can also transform data by clicking on the Transform Data icon
from the Home tab of Power BI Desktop.
As explained above in this blog, we know that data modeling is done in the Model View in
Power BI Desktop. Here we can see all the tables, their columns, and the relationships between
them, which depicts how the different sources are connected.
8) Understanding DAX Functions and creating measures
Microsoft extensively uses DAX (Data Analysis Expressions) to create required information from existing tables and columns. This formula language is used to calculate and return one or more values.
R1 R2 R3
DOP DOS Conduction File Record Viva -Voce Total Signature
EXPERIMENT NO.10
1) Data Import:
2) Data Cleaning:
➢ Once the data is imported, you may need to clean it to remove any inconsistencies or errors.
➢ Identify and handle missing values, incorrect data types, and outliers.
➢ Ensure that columns are correctly formatted, especially numerical values like costof2 and
rating.
➢ Remove any duplicate rows if present.
3) Data Modeling:
➢ Define relationships between tables if your data is spread across multiple tables.
➢ Create calculated columns or measures as needed. For example, you could calculate the
average rating for each restaurant or the total number of shops in each region.
➢ Rename columns to make them more understandable if necessary.
4) Data Visualization:
A) Bar Chart:
D) Line Chart:
➢
◆ Data Preparation: Identify the time-based variable (e.g., date) and the numerical
variable (e.g., sales revenue) you want to visualize over time.
◆ Design Decision: Determine the granularity of the time intervals (e.g., daily,
monthly, yearly) and the aggregation function for the numerical variable.
◆ Implementation: In Power BI, drag the time-based variable into the Axis field and
the numerical variable into the Values field. Power BI will automatically aggregate
the numerical values based on the chosen time intervals
◆ Customization: Customize the appearance of the line chart by adjusting colors, line
styles, axis titles, and other formatting options. You can also add data labels or
markers to highlight specific data points.
E) Scatter Plot:
➢
◆ Data Preparation: Identify two numerical variables (e.g., cost and rating) that you
want to visualize for each data point.
◆ Design Decision: Decide whether you want to include a trend line or regression
analysis to show the relationship between the variables.
◆ Implementation: In Power BI, drag one numerical variable into the X-axis field and
the other into the Y-axis field. Power BI will plot each data point based on the
values of the two variables.
◆ Customization: Customize the appearance of the scatter plot by adjusting colors,
markers, axis titles, and other formatting options. You can also add a trend line or
regression analysis to visualize the relationship between the variables.
F) Applying Filters:
➢ Use the "Filters" pane to apply filters as needed. For example, you can filter the data
based on location by dragging the "location" field into the "Filters" section and selecting
the desired location(s).
5) Design and Formatting:
Customize the appearance of your report by adjusting colors, fonts, and layouts to make it more
visually appealing and easy to understand.
Arrange your visualizations in a logical order and use containers or backgrounds to group related
elements together.
6) Testing:
Once your report is built, thoroughly test it to ensure that all visualizations are accurate and
interactive.
Iterate on your design based on feedback or if you discover any issues during testing.
7) Publishing:
When you're satisfied with your Power BI project, you can publish it to the Power BI service to
share it with others or to access it from anywhere.
Click on "Publish" from the Home tab and follow the prompts to publish your report to your
Power BI workspace.
Conclusion: Thus, we have studied and implemented data visualization using Power BI.
R1 R2 R3
DOP DOS Conduction File Record Viva -Voce Total Signature
Practical Number 8
Title of Practical To implement Error Back propagation Perceptron Training Algorithm.
Prior Concepts:
Back propagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights reduces error rates and makes the model reliable by
increasing its generalization. “Back propagation” is short for “backward propagation of errors.” It is a standard method of training artificial neural networks and helps calculate the gradient of a loss function with respect to all the weights in the network.
How Back propagation Algorithm Works
The Back propagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Consider the following Back propagation neural network example diagram to understand:
Python Code:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Input datasets (XOR truth table)
inputs = np.array([[0,0],[0,1],[1,0],[1,1]])
expected_output = np.array([[0],[1],[1],[0]])

epochs = 10000
lr = 0.1
inputLayerNeurons, hiddenLayerNeurons, outputLayerNeurons = 2, 2, 1

# Random weight and bias initialisation
hidden_weights = np.random.uniform(size=(inputLayerNeurons, hiddenLayerNeurons))
hidden_bias = np.random.uniform(size=(1, hiddenLayerNeurons))
output_weights = np.random.uniform(size=(hiddenLayerNeurons, outputLayerNeurons))
output_bias = np.random.uniform(size=(1, outputLayerNeurons))

# Training algorithm
for _ in range(epochs):
    # Forward Propagation
    hidden_layer_activation = np.dot(inputs, hidden_weights) + hidden_bias
    hidden_layer_output = sigmoid(hidden_layer_activation)
    output_layer_activation = np.dot(hidden_layer_output, output_weights) + output_bias
    predicted_output = sigmoid(output_layer_activation)

    # Backpropagation
    error = expected_output - predicted_output
    d_predicted_output = error * sigmoid_derivative(predicted_output)
    error_hidden_layer = d_predicted_output.dot(output_weights.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # Updating weights and biases
    output_weights += hidden_layer_output.T.dot(d_predicted_output) * lr
    output_bias += np.sum(d_predicted_output, axis=0, keepdims=True) * lr
    hidden_weights += inputs.T.dot(d_hidden_layer) * lr
    hidden_bias += np.sum(d_hidden_layer, axis=0, keepdims=True) * lr

print("Predicted output after training:", predicted_output.ravel())
Learning Objectives:
To understand and implement Error Back propagation Perceptron Training Algorithm.
Conclusion/Learning outcome:
An Error Back propagation Perceptron Training Algorithm is understood and used for
implementing an XOR Gate.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE
Practical Number 9
Title of Practical To implement Principal Component Analysis.
Theory: Dimensionality reduction is the process of reducing the number of features (or dimensions)
in a dataset while retaining as much information as possible. In other words, it is a process of
transforming high-dimensional data into a lower-dimensional space that still preserves the essence
of the original data. This can be done for a variety of reasons, such as to reduce the complexity of a
model, to improve the performance of a learning algorithm, or to make it easier to visualize the data.
There are several techniques for dimensionality reduction, including principal component analysis
(PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA). Each
technique uses a different method to project the data onto a lower-dimensional space while
preserving important information.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to the
problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most
important features. There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features based on their relevance
to the target variable, wrapper methods use the model performance as the criteria for selecting
features, and embedded methods combine feature selection with the model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data in a
lower-dimensional space. There are several methods for feature extraction, including principal
component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic
neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a
lower-dimensional space while preserving as much of the variance as possible.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data (see the NumPy sketch below).
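These steps can be illustrated directly with NumPy (the iris data is used here purely as an example):
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data
Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (eigenvalues in ascending order)
order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
components = eigvecs[:, order[:2]]       # top-2 eigenvectors
X_reduced = Xc @ components              # project onto the principal components
print(X_reduced[:5])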
Code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import pandas as pd
# Project the iris data onto its first two principal components
iris = load_iris()
print(pd.DataFrame(PCA(n_components=2).fit_transform(iris.data), columns=["PC1", "PC2"]).head())
Learning Objectives:
To understand and implement Principal Component Analysis
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL
INTELLIGENCE & DATA SCIENCE
Theory: The Iris dataset is considered the “Hello World” of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of the different iris flowers and recorded them digitally.
You can download the Iris.csv file from the above link. Now we will use the Pandas library
to load this CSV file, and we will convert it into the dataframe. read_csv() method is used
to read CSV files.
Code:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
iris = pd.read_csv('/kaggle/input/iris/Iris.csv')
iris.head()
iris.shape
iris.info()
Output:
Learning Objectives: Our main objective is to classify the flowers into their respective species -
Iris setosa, Iris virginica and Iris versicolor by using various possible plots.
Subject: Data Analytics & Visualization    Course Code: CSL601
Semester: 6    Course: AI & DS
Laboratory No.: 407    Name of Subject Teacher: Anagha Dhavikar
Name of Student: Sahil Surve    Roll No.: VU2S2223020
Experiment No. 11
Aim: Design and implement an Expert system for classification useful for real world application.
Prior Concepts:
Problem Definition, Knowledge Acquisition, Knowledge Representation, Inference Mechanism, Uncertainty Handling, Validation and Testing, Integration, Maintenance and Updates
Theory:
An expert system for classification applies artificial intelligence to categorize data. Key
components include defining the problem, acquiring domain knowledge, and representing it in a
format the system can understand. An inference mechanism drives decision-making, utilizing
techniques like forward or backward chaining. Handling uncertainty is crucial, often managed
through fuzzy logic or probabilistic methods. Validation ensures accuracy, while integration
embeds the system into applications. Maintenance involves updating knowledge bases to keep
pace with evolving domains.
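As an illustration only (not the original implementation), a tiny rule-based classifier with forward-chaining-style rule firing can be sketched in Python; the rules and thresholds below are hypothetical:
def classify_patient(facts):
    # Rules are tried in order; the first rule whose condition holds fires and returns a class label
    rules = [
        (lambda f: f["temperature"] > 38 and f["cough"], "Flu suspected"),
        (lambda f: f["temperature"] > 38, "Fever of unknown origin"),
        (lambda f: True, "Healthy"),
    ]
    for condition, label in rules:
        if condition(facts):
            return label

print(classify_patient({"temperature": 39.1, "cough": True}))   # -> Flu suspected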
Conclusion/Learning outcome:
We learnt to design and implement an expert system for classification that is useful for real-world applications, and successfully implemented the classification model.
Evaluation:
R1 R2 R3
DOP DOS Conduction FileRecord Viva-Voce Total Signature
Subject: Data Analytics & Visualization    Course Code: CSL601
Semester: 6    Course: AI & DS
Laboratory No.: 407    Name of Subject Teacher: Anagha Dhavikar
Name of Student: Sahil Surve    Roll No.: VU2S2223020
Experiment No. 12
Aim: Develop a regression model for prediction or forecasting, useful for real world application.
Prior Concepts:
Problem Identification, Data Collection, Data Preprocessing, Feature Selection, Model Selection,
Model Training, Model Evaluation, Model Tuning, Deployment and Integration.
Theory:
Regression models are statistical methods used for predicting the value of a dependent variable based on one or more
independent variables. These models establish a relationship between the dependent variable and the independent
variables through mathematical equations. The goal is to minimize the difference between the observed and predicted
values, typically measured using metrics like mean squared error or R-squared.
Key Points:
• They assume a linear relationship between the independent and dependent variables.
• Data preprocessing involves cleaning, transforming, and scaling the data for accurate modeling.
• Feature selection helps identify the most relevant independent variables for prediction.
• Model selection includes choosing appropriate regression algorithms like linear regression, polynomial
regression, or more advanced methods such as ridge or lasso regression.
• Model training involves fitting the selected algorithm to the training data to learn the relationship
between variables.
• Model evaluation assesses the performance using metrics like mean squared error, root mean squared error, or R-squared on test data (a small sketch of this workflow follows below).
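A minimal sketch of such a workflow in Python follows; the file 'weather.csv' and its column names are assumptions made only for illustration:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
df = pd.read_csv("weather.csv").dropna()                         # data collection + basic cleaning
X = df[["humidity", "pressure", "wind_speed"]]                   # feature selection (assumed columns)
y = df["temperature"]                                            # assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)                 # model training
pred = model.predict(X_test)                                     # prediction
print("MSE:", mean_squared_error(y_test, pred), "R2:", r2_score(y_test, pred))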
Conclusion/Learning outcome:
We learnt to develop a regression model for prediction or forecasting that is useful for real-world applications, and successfully implemented a regression model on a weather dataset.
Evaluation:
R1 R2 R3
DOP DOS Conduction FileRecord Viva-Voce Total Signature
Experiment No: 6
Aim: Study of packet sniffer tools (Wireshark): 1. Download and install Wireshark and capture ICMP, TCP, and HTTP packets in promiscuous mode. 2. Explore how the packets can be traced based on different filters.
Theory:
Wireshark is a network packet analyzer. A network packet analyzer captures network packets and displays the packet data in as much detail as possible.
Wireshark is used to:
• Open files containing packet data captured with tcpdump/WinDump, Wireshark, and a number of other packet capture programs.
• Import packets from text files containing hex dumps of packet data.
c) Once the Wireshark window opens, select the interface and click on Start.
Capturing Packets
d) After downloading and installing wireshark, you can launch it and click the name of an interface
under Interface List to start capturing packets on that interface.
e) For example, if you want to capture traffic on the wireless network, click your wireless
interface. You can configure advanced features by clicking Capture Options.
f) As soon as you click the interface’s name, you’ll see the packets start to appear in real time. Wireshark captures each packet sent to or from your system.
g) Click the stop capture button near the top left corner of the window when you want to stop
capturing traffic
Wireshark uses colors to help you identify the types of traffic at a glance. By default, green is TCP
traffic, dark blue is DNS traffic, light blue is UDP traffic, and black identifies TCP packets with
problems — for example, they could have been delivered out-of-order.
Wireshark can record the captured information in a file with the extension .pcap (packet capture). This file can be reopened later for analysis in offline mode. There is no need to remember filtering commands; filters can be applied by putting predefined strings in Wireshark.
Commands:-
Filtering Packets: The most basic way to apply a filter is by typing it into the filter box at the top of the window and clicking Apply (or pressing Enter). For example, type “dns” and you’ll see only DNS packets. When you start typing, Wireshark will help you autocomplete your filter.
4. tcp.port == 4000 : sets a filter for any TCP packet with 4000 as a source or destination port.
5. tcp.flags.reset == 1 : displays all TCP resets.
6. http.request : displays all HTTP GET requests.
8. !(arp or icmp or dns) : masks out ARP, ICMP, DNS, or whatever other protocols may be background noise, allowing you to focus on the traffic of interest.
9. not (tcp.port == 80) and not (tcp.port == 25) : captures packets after applying multiple filters; this gets all packets which are not HTTP or SMTP.
To stop capturing click on the “red square”
To capture packets of an FTP server (Login ID and Password). What is FTP?
FTP stands for File Transfer Protocol. As the name suggests, this network protocol allows you to transfer files or directories from one host to another over the network, whether it is your LAN or the
Internet.
The package required to install FTP is known as VSFTPD (Very Secure File Transfer Protocol
Daemon)
Steps:-
The above command will install and start the xinetd superserver on your system. The chances are that you already have xinetd installed on your system. In that case you can omit the above installation command.
In the next step we need to edit the FTP server's configuration file which is present in
/etc/vsftpd.conf
1. # cd /etc
2. # ls
3. # gedit vsftpd.conf and set anonymous_enable=YES
This will instruct the FTP server to allow connecting with an anonymous client.
4. Save and close the gedit file.
Now, that we are ready we can start the FTP server in the normal mode with:
5. # service xinetd restart
6. # service vsftpd restart OR
7. # /etc/init.d/vsftpd restart
Start WIRESHARK. In the FILTER field put FTP. This will filter all FTP packets.
Connecting from a client present on another machine:
$ ftp <IP address of the FTP server>
Name: anonymous
Please specify the password. Password:
Login successful. ftp>
ftp> quit Goodbye.
Output:
While the client is establishing a connection with the FTP server, the wireshark running in the
background of the FTP server is able to capture all FTP packets. So, the Name and Password entered
by the client is visible in plain text in Wireshark. Apart from that the source and destination address
is also visible. If many clients are trying to connect with the server then source address, name and
password are visible for all of them.
Conclusion: Thus, packet sniffer software is explored for its different benefits.
R1 R2 R3
DOP DOS Conduction File Record Viva -Voce Total Signature
EXPERIMENT NO.07
Aim: Download and install nmap. Use it with different options to scan open ports, perform OS
fingerprinting, do a ping scan, tcp port scan, udp port scan, xmas scan etc
Steps:-
1. Get root access: $ sudo su root
2. # ifconfig
3. # apt-get install nmap
Commands:-
1. # nmap -V
It gives the version of Nmap
2. # nmap 192.168.23.20
It gives information about a single host. It gives the output in column form where first column is
the PORT, second column is the STATE and third column is the SERVICE
3. # nmap -v 192.168.23.20
It gives detailed information about the remote host.
4. # nmap -O 192.168.23.20
It finds the remote host operating system and version (OS detection).
5. # nmap -sP 192.168.23.0/24
It scans a network and discovers which servers and devices are up and running (ping scan).
6. # nmap -sA 192.168.23.20
To discover if a host/network is protected by a firewall. The output has the word FILTERED, which shows the presence of a firewall. UNFILTERED means no firewall.
7. # nmap -p T:23 192.168.23.20
It scans TCP port 23
8. # nmap -p 80,443 192.168.23.20
It scans multiple ports at one time.
9. # nmap -sV 192.168.23.20
It detects remote services (server/daemon) version numbers. Version numbers are displayed only if the port is open.
10. nmap -sS 192.168.23.20
It performs SYN scan or Stealth scan. Open wireshark.
Set the Filter to TCP.
See the grey and red color packets
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: SYN bit should be set to 1
11. # nmap -sN 192.168.23.20
It performs TCP Null Scan. It does not set any bits (TCP flag header is 0) Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: No flag bits should be set
12. # nmap -sF 192.168.23.20
It performs FIN scan. It sets just the TCP FIN bit. Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: FIN flag should be set to 1
13. # nmap -sX 192.168.23.20
It performs TCP Xmas. It sets the FIN, PSH, and URG flags.
Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: FIN, PSH, and URG flagsshould be set to 1
14. # nmap -sO 192.168.23.20
It performs an IP protocol scan and allows us to determine which IP protocols are supported by target machines.
15. # nmap -sU 192.168.23.20
It performs a UDP port scan.
OUTPUT:
C:\Users\vppcoe 11>nmap -V
Nmap version 7.93 ( https://nmap.org )
Platform: i686-pc-windows-windows
Compiled with: nmap-liblua-5.3.6 openssl-3.0.5 nmap-libssh2-1.10.0 nmap-libz-1.2.12 nmap-
libpcre-7.6 Npcap-1.78 nmap-libdnet-1.12 ipv6
Compiled without:
Available nsock engines: iocp poll select
Conclusion: Thus, the nmap tool is explored with the different options available for better network scanning results.
R1 R2 R3
DOP DOS Conduction File Record Viva -Voce Total Signature
Experiment No. 8
Conclusion: The sqlmap tool is useful for identifying the need for input validation and for detecting SQL injection vulnerabilities.
Output:
Evaluation:
R1 R2 R3
DOP DOS Conduction FileRecord Viva-Voce Total Signature
Experiment 9
Aim: To set up, configuration and use of SNORT for Intrusion Detection
Theory:
Snort is an open source network intrusion prevention and detection system (IDS/IPS)
developed by Sourcefire. Combining the benefits of signature, protocol, and anomaly-based
inspection, Snort is the most widely deployed IDS/IPS technology worldwide. With millions
of downloads and nearly 400,000 registered users, Snort has become the de facto standard for
IPS.
Steps:
a) Get root access: $ sudo su root
b) Update the package lists: # apt-get update
c) Installation
# apt-get install snort During installation:
Put the name of network interface (by default it is eth0, change it to the interface name of
your machine)
Put the IP address of the machine followed by /24 (by default it is the network address.
Replace it with your IP addr/24)
d) Configuration
# cd /etc
# ls
# cd snort
# ls
# gedit snort.conf
Go to line no. 51
ipvar HOME_NET any Replace “any” with your ip address i.e. ipvar HOME_NET
192.168._._
Save and close the file
e) Monitoring
# snort -q -A console -i enp2s0    (enp2s0 is the name of the interface)
Observe the intrusion-detection output on your machine.
$ nmap <IP addr of your machine>    (this command is to be performed on the neighbour’s machine)
Output to be observed in the SNORT terminal: the IP address of the neighbour who is performing the intrusion, i.e., port scanning.
OUTPUT:
Conclusion: SNORT helps to understand and implement the intrusion detection process.
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
Experiment 10
Aim: Design a personal Firewall using iptables.
Theory:
All packets inspected by iptables pass through a sequence of built-in tables (queues) for
processing. Each of these queues is dedicated to a particular type of packet activity and is
controlled by an associated packet transformation/filtering chain.
• Filter Table
o INPUT chain – Incoming to firewall. For packets coming to the local server.
o OUTPUT chain – Outgoing from firewall. For packets generated locally and
going out of the local server.
o FORWARD chain – Packet for another NIC on the local server. For packets
routed through the local server.
• NAT Table
• Mangle Table
• Iptables’s Mangle table is for specialized packet alteration. This alters QOS bits in the
TCP header. Mangle table has the following built-in chains.
o PREROUTING chain
o OUTPUT chain
o FORWARD chain
o INPUT chain
o POSTROUTING chain
• Raw Table
• Iptable’s Raw table is for configuration exemptions. Raw table has the following built-in
chains.
o PREROUTING chain
o OUTPUT chain
• Security Table
• This table is used for Mandatory Access Control (MAC) networking rules, such as those
enabled by the SECMARK and CONNSECMARK targets. Mandatory Access Control is
implemented by Linux Security Modules such as SELinux. The security table is called
after the filter table, allowing any Discretionary Access Control (DAC) rules in the filter
table to take effect before MAC rules. This table provides the following built-in chains:
INPUT (for packets coming into the box itself), OUTPUT (for altering locally-generated
packets before routing), and FORWARD (for altering packets being routed through the
box).
• Chains: Tables consist of chains, and rules are combined into different chains. The kernel uses chains to manage packets it receives and sends out. A chain is simply a checklist of rules which are followed in order. The rules operate with an if-then-else structure.
• Input – This chain is used to control the behaviour for incoming connections. For
example, if a user attempts to SSH into your PC/server, iptables will attempt to match the
IP address and port to a rule in the input chain.
• Forward – This chain is used for incoming connections that aren’t actually being
delivered locally. Think of a router – data is always being sent to it but rarely actually
destined for the router itself; the data is just forwarded to its target.
• Output – This chain is used for outgoing connections. For example, if you try to ping
howtogeek.com, iptables will check its output chain to see what the rules are regarding
ping and howtogeek.com before making a decision to allow or deny the connection
attempt.
• Targets:
• ACCEPT: Allow the packet to pass through the firewall. DROP: Deny access by the packet.
• REJECT: Deny access and notify the sender. QUEUE: Send packets to user space.
• RETURN: jump to the end of the chain and let the default target process it
• Steps:-
• # iptables -L
• Initially it is empty.
• # ping 192.168.208.6
• To block incoming traffic from a particular destination for a specific protocol to the machine, for example:
• # iptables -A INPUT -s 192.168.208.6 -p icmp -j REJECT
• # ping 192.168.208.6 (the ping is now rejected)
• # iptables -F
• If we change the target from REJECT to ACCEPT, the site can be visited again.
Observations:
• In case of OUTPUT chain, for DROP and REJECT chain, at source machine we get two
different messages.
• In case of INPUT chain for DROP and REJECT chain at source machine we get two
different responses as follows:
OUTPUT
Installing IPtables
Check current iptables status
Evaluation:
R1 R2 R3
DOP DOS Conduction FileRecord Viva-Voce Total Signature