
Course: Data Analytics & Visualization Course Code: CSL-601

Semester: 6 Department: AI & DS


Laboratory No: 407 Name of Subject Teacher: Pramod Bhavarthe
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 1

Aim: Introduction to graph libraries such as matplotlib/Seaborn/Excel plots.

1. Matplotlib:
Matplotlib is a widely used plotting library for Python. It's powerful and provides a wide variety of plots like line plots,
scatter plots, histograms, bar charts, etc. It's highly customizable, allowing users to control almost every aspect of the plot.

Example using Matplotlib in Python:


Code:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Creating a line plot


plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Line Plot')
plt.show()
2. Seaborn:
Seaborn is built on top of Matplotlib and provides a higher-level interface for creating attractive and informative statistical
graphics. It simplifies many common visualization tasks and works well with Pandas dataframes.

Example using Seaborn in Python:


Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]})

# Creating a scatter plot

sns.scatterplot(x='X', y='Y', data=data)


plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Scatter Plot')
plt.show()
3. ggplot2 (R Programming):
ggplot2 is a powerful and popular plotting system in R that follows the Grammar of Graphics. It allows users to create
sophisticated plots with ease and provides a high level of customization.

Example using ggplot2 in R:


library(ggplot2)
# Sample data
data <- data.frame(X = c(1, 2, 3, 4, 5), Y = c(2, 4, 6, 8, 10))
# Creating a scatter plot
ggplot(data, aes(x = X, y = Y)) +
geom_point() +
labs(x = 'X-axis', y = 'Y-axis', title = 'Sample Scatter Plot')

Base R also provides built-in plotting functions, as in the following example:

# Create some variables


x <- 1:10
y1 <- x*x
y2 <- 2*y1

# Create a basic stair steps plot


plot(x, y1, type = "S")
# Show both points and line
plot(x, y1, type = "b", pch = 19,
col = "red", xlab = "x", ylab = "y")
# Create a first line
plot(x, y1, type = "b", frame = FALSE, pch = 19,
col = "red", xlab = "x", ylab = "y")
# Add a second line
lines(x, y2, pch = 18, col = "blue", type = "b", lty = 2)
# Add a legend to the plot
legend("topleft", legend=c("Line 1", "Line 2"),
col=c("red", "blue"), lty = 1:2, cex=0.8)
4. Excel:
Excel is a widely used spreadsheet program that also provides various charting and graphing capabilities. Users can create
different types of charts (line charts, bar charts, pie charts, etc.) by selecting data and using the Chart tools available in
Excel.

To create a simple plot in Excel:


- Enter your data into cells.
- Select the data range.
- Go to the "Insert" tab and choose the type of chart you want to create.

These libraries/tools offer a range of functionalities and flexibility for creating visualizations based on your data and
programming preferences. Explore them further to leverage their full potential in data visualization tasks.

Learning Objectives:
To understand the graph libraries.

Conclusion/Learning outcome:
The concept of graph libraries is studied and understood using matplotlib/Seaborn/Excel plots.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Course: Data Analytics & Visualization Course Code: CSL-601
Semester: 6 Department: AI & DS
Laboratory No: 407 Name of Subject Teacher: Pramod Bhavarthe
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 2
Aim: Data Exploration: Knowing the data, Data preparation and Cleaning

Prior Concepts:
Exploring data is a crucial step in understanding its characteristics, trends, and underlying patterns.
Conducting experiments in data exploration involves various techniques and tools to gain insights into the
dataset. Following are the different approaches to conduct a data exploration.

1. Data Collection and Understanding:


- Collect the dataset you want to explore. This could be a CSV file, database, or any structured/unstructured
data source.
- Understand the data sources, variables, and their meanings. Look into data dictionaries or metadata to
comprehend what each column represents.

2. Data Cleaning and Preprocessing:


- Check for missing values, duplicates, and outliers in the dataset.
- Handle missing values by imputation or deletion, depending on the nature of the data.
- Normalize or standardize data if necessary for certain algorithms or analyses.

3. Statistical Summaries and Visualizations:


- Calculate descriptive statistics (mean, median, mode, standard deviation, etc.) for numerical variables.
- Generate frequency counts for categorical variables.
- Create visualizations like histograms, box plots, scatter plots, and heatmaps to understand the distribution
and relationships between variables.

4. Exploratory Data Analysis (EDA):


- Conduct correlation analysis to identify relationships between variables.
- Perform dimensionality reduction techniques like PCA (Principal Component Analysis) or
t-SNE (t-Distributed Stochastic Neighbor Embedding) for visualization in lower dimensions.
- Cluster analysis to identify natural groupings within the data.

5. Hypothesis Testing and Feature Engineering:


- Formulate hypotheses about relationships or patterns within the data.
- Perform hypothesis tests to validate or reject these hypotheses.
- Create new features by transforming existing ones to improve model performance if working
on a predictive modeling task.
6. Interactive Exploration and Tools:
- Utilize tools like Jupyter Notebooks, Pandas, Matplotlib, Seaborn, Plotly, or Tableau for interactive exploration.
- Utilize widgets or interactive visualization libraries for dynamic exploration of data subsets or trends.

7. Documentation and Communication:


- Document all findings, insights, and assumptions made during exploration.
- Create visualizations, summaries, and reports to communicate the insights gained.

8. Iterative Process:
- Data exploration is often iterative. Revisit steps, try different techniques, and compare results
to gain a comprehensive understanding of the dataset.

9. Ethical Considerations:
- Ensure ethical use of data, especially regarding privacy, biases, and the implications of
insights drawn from the data.

New Concept:
Data loading in R
1. CSV File
# Load a CSV file
data <- read.csv("your_file.csv")

# View the first few rows of the dataset


head(data)

# View summary statistics of the dataset


summary(data)

# Check the structure of the dataset


str(data)

# Check the dimensions of the dataset


dim(data)

2. EXCEL File
# Load an Excel file (assuming 'readxl' package is installed)
library(readxl)
data <- read_excel("your_file.xlsx")

# View the first few rows of the dataset


head(data)

# View summary statistics of the dataset


summary(data)

# Check the structure of the dataset


str(data)
# Check the dimensions of the dataset
dim(data)

3. Identifying Missing values


# Assuming 'data' is your dataframe
# Count missing values in each column of the dataset
colSums(is.na(data))

# Check for missing values in a specific column


sum(is.na(data$column_name))

4. Removing missing values


# Remove rows with any missing values
data_clean <- na.omit(data)

# Remove rows with missing values in specific columns


data_clean <- data[complete.cases(data$column_name), ]

Data preparation and cleaning involve various steps to ensure that the dataset is in a suitable format for analysis or
modeling.

Data Cleaning Steps:


1. Handling Missing Values:
- Identify missing values using functions like `is.na()` or `complete.cases()`.
- Decide whether to remove missing values using `na.omit()` or impute them with methods like
mean, median, mode, or using advanced imputation techniques from packages like `mice` or
`missForest`.
2. Handling Outliers:
- Detect outliers using statistical methods like z-scores, IQR (Interquartile Range), or visualization
techniques such as boxplots.
- Decide whether to remove outliers or transform them based on the context of your analysis.
3. Dealing with Duplicates:
- Check for and remove duplicate rows using `duplicated()` and `unique()` functions.

Data Preparation Steps:


1. Feature Engineering:
- Create new features from existing ones based on domain knowledge or insights.
- Transform variables (e.g., log transformation for skewed data) to improve the distribution of data.
2. Standardization and Normalization:
- Scale numerical variables to a standard scale using methods like `scale()` for standardization or
min-max scaling for normalization.
- Normalize features to bring them on a similar scale, especially when using distance-based algorithms.
3. Handling Categorical Variables:
- Convert categorical variables to factors using `as.factor()` or one-hot encode them using
techniques from packages like `dummies`.
- Handle ordinal variables appropriately by assigning levels.
4. Data Splitting:
- Split the dataset into training and testing subsets using functions like `sample()` or from
packages like `caret` or `tidymodels`.
5. Handling Date and Time Variables:
- Convert date/time variables to appropriate formats using functions like `as.Date()` or
`as.POSIXct()`.

New Concepts:
Example of R Code for Data Cleaning and Preparation:

# Handling missing values


data_clean <- na.omit(data)  # Removing rows with missing values
# OR
data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)  # Imputing missing values with the column mean

# Handling outliers (example: using z-score)


threshold <- 3
data_clean <- data[abs(scale(data$numeric_column)) < threshold, ]

# Dealing with duplicates


data_unique <- unique(data)

# Feature engineering (example: creating a new feature)


data$feature_sum <- rowSums(data[, c("feature1", "feature2", "feature3")])

# Handling categorical variables (example: one-hot encoding)


library(dummies)
data <- dummy.data.frame(data, names = "categorical_column")

# Data splitting (example: 70-30 split for training and testing)


library(caret)
set.seed(123)
train_index <- createDataPartition(data$target_variable, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

These steps ensure that your data is clean, formatted correctly, and ready for analysis or modeling tasks in R.
Adjust these methods based on your specific dataset and analysis requirements.
Learning Objectives:
To understand the different techniques of Data exploration, Data preparation and Cleaning.
Conclusion/Learning outcome:
The use of different tools and commands for understanding the data are studied and implemented. Theconcept of
Data preparation and Cleaning is understood and implemented in R language.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Course: Data Analytics & Visualization Course Code: CSL-601
Semester: 6 Department: AI & DS
Laboratory No: 407 Name of Subject Teacher: Pramod Bhavarthe
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 3
Aim: To understand and implement visualization of data.
Prior Concepts:
R provides numerous packages for data visualization. One of the most commonly used packages is ggplot2,
which offers a flexible and powerful system for creating a wide variety of visualizations. Here are some basic
examples of data visualization using ggplot2:
New Concept:
1. Install and load ggplot2 Package:
# Install ggplot2 if not already installed
install.packages("ggplot2")
# Load ggplot2
library(ggplot2)
2. Histogram
# Create a histogram of a numerical variable 'numeric_column' from dataframe 'data'
ggplot(data, aes(x = numeric_column)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Histogram of Numeric Column", x = "Values", y = "Frequency")
3. Scatter plot
# Create a scatter plot of 'numeric_column1' against 'numeric_column2'
ggplot(data, aes(x = numeric_column1, y = numeric_column2)) +
geom_point(color = "blue") +
labs(title = "Scatter Plot", x = "X-axis Label", y = "Y-axis Label")

4. Boxplot
# Create a boxplot to visualize distribution of 'numeric_column' across 'group_column'
ggplot(data, aes(x = group_column, y = numeric_column)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Boxplot of Numeric Column by Group", x = "Groups", y = "Values")
5. Bar chart
# Create a bar chart to visualize counts of categories in 'categorical_column'
ggplot(data, aes(x = categorical_column)) +
geom_bar(fill = "orange") +
labs(title = "Bar Chart of Categorical Column", x = "Categories", y = "Count")

6. Line plot
# Create a line plot to show trends over time using 'date_column' and 'numeric_column'
ggplot(data, aes(x = date_column, y = numeric_column)) +
geom_line(color = "steelblue") +
labs(title = "Line Plot of Numeric Column Over Time", x = "Date", y = "Values")
R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Course: Data Analytics & Visualization Course Code: CSL-601
Semester: 6 Department: AI & DS
Laboratory No: 407 Name of Subject Teacher: Pramod Bhavarthe
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 4

Aim: To implement Correlation and Covariance.

Prior Concept:
Covariance
Covariance is a statistical term describing a systematic association between two random variables, where a
change in one variable is mirrored by a change in the other.

Definition and Calculation of Covariance

Covariance indicates whether the two variables are directly or inversely proportional.
The covariance formula measures how data points in a dataset vary jointly around their average values. For instance, you can
compute the covariance between two random variables, X and Y, using the following formula:

Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)

Where X̄ and Ȳ are the sample means of X and Y, and n is the number of observations.

Interpreting Covariance Values


Covariance values indicate the magnitude and direction (positive or negative) of the relationship between
variables. The covariance values range from -∞ to +∞. The positive value implies a positive relationship,
whereas the negative value represents a negative relationship.

Positive, Negative, and Zero Covariance


The larger the magnitude, the stronger the dependence between the variables. Let’s look at each
type of covariance individually:
Positive Covariance
If two variables have a positive covariance, they move in the same
direction. It represents a direct relationship between the variables, so the variables behave similarly:
larger values of one variable tend to occur together with larger values of the other, and smaller values
with smaller values.
Negative Covariance
A negative number represents negative covariance between two random variables. It implies that the variables
share an inverse relationship: in negative covariance, the variables move in opposite directions.
In contrast to positive covariance, larger values of one variable correspond to smaller values of the other
variable and vice versa.
Zero Covariance
Zero covariance indicates no linear relationship between the two variables.

Significance of Covariance in Assessing Linear Relationship


Covariance is useful for determining the linear relationship between variables. It indicates the
direction (negative or positive) of the relationship between variables.
A larger covariance magnitude suggests a stronger linear relationship between the variables, while a zero covariance
suggests no linear relationship.

Limitations and Considerations of Covariance

Covariance is influenced by the scales of measurement and is highly affected by outliers. Covariance is
restricted to measuring only linear relationships and does not express the strength of the relationship on a
standardized scale. Moreover, comparing covariances across different datasets demands caution due to differing variable ranges.

Definition and Calculation of Correlation Coefficient


Correlation is a statistical concept that measures the strength of the relationship between two numerical variables. While
studying the relation between variables, we assess how a change in one variable is associated with a change in
another.
When the movement of one variable is consistently accompanied by a corresponding movement of the other
variable throughout the study, the variables are said to be correlated.
The formula for calculating the Pearson correlation coefficient is as follows:

r = Cov(X, Y) / (σX σY) = Σ (Xi − X̄)(Yi − Ȳ) / √( Σ (Xi − X̄)² · Σ (Yi − Ȳ)² )

Where σX and σY are the standard deviations of X and Y, and X̄ and Ȳ are their means.
Interpreting Correlation Values
There are three types of correlation based on the coefficient's value: negative correlation, positive correlation, and no
(zero) correlation.

Positive, Negative, and Zero Correlation


If the variables are directly proportional to one another, the two variables are said to hold a positive
correlation. This implies that if one variable’s value rises, the other’s value rises as well. A perfect positive
correlation has a value of 1.
Here’s what a positive correlation looks like:

In a negative correlation, one variable’s value increases while the second one’s value decreases. A perfect
negative correlation has a value of -1.
The negative correlation appears as follows:

Just like in the case of Covariance, a zero correlation means no relation between the variables. Therefore,
whether one variable increases or decreases won’t affect the other variable.

Strength and Direction of Correlation


Correlation assesses the direction and strength of a linear relationship between multiple variables. The
correlation coefficient varies from -1 to 1, with values near -1 or 1 implying a high association
(negative or positive, respectively) and values near 0 suggesting a weak or no correlation.

Pearson Correlation Coefficient and Its Properties


The Pearson correlation coefficient (r) measures the linear connection between two variables. The properties
of the Pearson correlation coefficient include the following:
• Strength: The coefficient’s absolute value indicates the relationship’s strength. The closer the value
of the coefficient is to 1, the stronger the correlation between variables. However, a value nearer to 0
represents a weaker association.
• Linearity: The Pearson correlation coefficient only assesses linear relationships between variables.
The coefficient could be insufficient to describe non-linear connections fully.
• Sensitivity to Outliers: Outliers in the data might influence the correlation coefficient’s value,
thereby boosting or deflating its size.

Other Types of Correlation Coefficients


Other correlation coefficients are:
• Spearman’s Rank Correlation: It’s a nonparametric indicator of rank correlation or the statistical
dependency between the ranks of two variables. It evaluates how effectively a monotonic function
can capture the connection between two variables.
• Kendall Rank Correlation: A statistic that determines the ordinal association between two measured
quantities. It represents the similarity of the orderings of the data when ranked by each quantity, and is a
measure of rank correlation.

Advantages and Disadvantages of Covariance


Following are the advantages and disadvantages of Covariance:

Advantages
• Easy to Calculate: Calculating covariance doesn’t require any assumptions about the underlying
data distribution. Hence, it’s easy to calculate covariance with the formula given above.
• Beneficial in Portfolio Analysis: Covariance is typically employed in portfolio analysis to evaluate
the diversification advantages of integrating different assets.

Disadvantages
• Restricted to Linear Relationships: Covariance only gauges linear relationships between
variables and does not capture non-linear associations.
• Scale Dependency: Covariance is affected by the variables’ measurement scales, making
comparing covariances across various datasets or variables with distinct units challenging.
Advantages and Disadvantages of Correlation
The advantages and disadvantages of correlation are as follows:

Advantages
• Detecting Non-Linear Relationships: While correlation primarily estimates linear relationships,
it can also indicate the presence of non-linear (monotonic) connections, especially when using alternative
measures like Spearman’s rank correlation coefficient.
• Standardized Criterion: Correlation coefficients, such as the Pearson correlation coefficient, are
standardized, varying from -1 to 1. This allows for easy comparison and interpretation of the
direction and strength of relationships across different datasets.
• Robustness to Outliers: Correlation coefficients are typically less sensitive to outliers than
covariance, delivering a more robust measure of the association between variables.
• Scale Independence: Correlation is not affected by the measurement scales, making it convenient
for comparing associations between variables with distinct units or scales.

Disadvantages
• Driven by Extreme Values: Extreme values can still affect the correlation coefficient, even
though it is less susceptible to outliers than covariance.

New Concept:
Python code for Correlation & Covariance
1. Using Numpy:
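A minimal sketch of computing covariance and correlation with NumPy (the arrays x and y below are illustrative sample data, not taken from a specific dataset):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(x, y)
cov_matrix = np.cov(x, y)

# np.corrcoef returns the 2x2 Pearson correlation matrix
corr_matrix = np.corrcoef(x, y)

print("Covariance of x and y:", cov_matrix[0, 1])
print("Correlation of x and y:", corr_matrix[0, 1])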

2. Using Pandas:
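A minimal sketch using Pandas (the DataFrame below is illustrative sample data):

import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]})

# Pairwise covariance and Pearson correlation matrices for all numeric columns
print(df.cov())
print(df.corr())

# Covariance and correlation between two specific columns
print("Cov(X, Y):", df['X'].cov(df['Y']))
print("Corr(X, Y):", df['X'].corr(df['Y']))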
R-code for Correlation & Covariance

Learning Objectives:
To understand the correlation and covariance between the variables.

Conclusion/Learning outcome:
Correlation and covariance between variables are understood and implemented in
Python and R.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Course: Data Analytics & Visualization Course Code: CSL-601
Semester: 6 Department: AI & DS
Laboratory No: 407 Name of Subject Teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 5
Aim: To understand and implement Hypothesis Testing.

Prior Concepts:
Hypothesis testing is a statistical method used to make inferences about a population parameter based on
sample data. The process involves stating a hypothesis, collecting and analyzing data, and then determining
whether the data provides enough evidence to reject or fail to reject the null hypothesis.

Following are the steps involved in hypothesis testing:


1. State the Hypotheses:
- Null Hypothesis (H0): Represents the default assumption or no effect.
- Alternative Hypothesis (H1 or Ha): Represents the hypothesis to be tested.
2. Choose the Significance Level (α):
- The significance level (α) determines how extreme the evidence must be before rejecting the
null hypothesis.
- Common levels include 0.05 (5%) or 0.01 (1%).

3. Select the Appropriate Test:


- The choice of test depends on the nature of the data and the hypothesis being tested (e.g., t-test,
chi-square test, ANOVA, etc.).
4. Collect Data and Compute Test Statistic:
- Use sample data to calculate a test statistic (e.g., t-statistic, z-statistic) based on the chosen test.

5. Determine the Critical Region or P-value:


- The critical region is determined based on the significance level.
- The p-value represents the probability of observing the test statistic or more extreme results if
the null hypothesis is true.

6. Make a Decision:
- If the test statistic falls into the critical region (or if the p-value is less than the significance
level), reject the null hypothesis.
- If the test statistic does not fall into the critical region (or if the p-value is greater than the significance
level), fail to reject the null hypothesis.

New Concepts:
Example using Student's t-test in Python:
Suppose we want to test if there's a significant difference in the mean of two independent groups (e.g., group
A and group B). You can use the t-test to perform this hypothesis test.
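A minimal sketch of this test using SciPy (the sample values for the two groups are illustrative only):

from scipy import stats

group_a = [23, 25, 28, 30, 26, 27]   # hypothetical measurements for group A
group_b = [31, 33, 29, 35, 32, 30]   # hypothetical measurements for group B

# Independent two-sample t-test for the difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")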

Learning Objectives:
To understand and implement Hypothesis Testing.

Conclusion/Learning outcome:
Hypothesis testing is understood and implemented with a sample dataset.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject: Data Analytics & Visualization Course Code: CSL-601


Semester: 5 Course: AI & DS
Laboratory No: 406 A Name of Subject Teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll Id: VU2S2223020

Practical Number 6
Title of Practical To implement Simple Linear Regression.

Prior Concepts:
Linear Regression is an algorithm that belongs to supervised Machine Learning. It tries to apply
relations that will predict the outcome of an event based on the independent variable data
points. The relation is usually a straight line that best fits the different data points as close as
possible. The output is of a continuous form, i.e., numerical value. For example, the output could
be revenue or sales in currency, the number of products sold, etc. In the above example, the
independent variable can be single or multiple.

Linear Regression Line

Linear regression can be expressed mathematically as:


y = β0 + β1x + ε
Here,
y = Dependent Variable
x = Independent Variable
β0 = intercept of the line
β1 = Linear regression coefficient (slope of the line)
ε = random error
The last parameter, random error ε, is required as the best fit line also doesn't include the data
points perfectly.
Linear Regression Model
Since the Linear Regression algorithm represents a linear relationship between a dependent (y)
variable and one or more independent (x) variables, it is known as Linear Regression. This means it finds
how the value of the dependent variable changes according to the change in the value of the independent variable(s).

Types of Linear Regression


Linear Regression can be broadly classified into two types of algorithms:
1. Simple Linear Regression
A simple straight-line equation involving a slope (dy/dx) and an intercept (a continuous
value) is utilized in simple Linear Regression. A simple form is y = mx + c, where y denotes
the output, x is the independent variable, m is the slope, and c is the intercept (the value of y when x = 0). With this equation, the
algorithm trains the machine learning model and gives the most accurate output.
2. Multiple Linear Regression
When the number of independent variables is more than one, the governing linear equation
applicable to regression takes a different form: y = c + m1x1 + m2x2 + … + mnxn,
where m1, m2, …, mn represent the coefficients responsible for the impact of the different independent variables
x1, x2, etc. This machine learning algorithm, when applied, finds the values of the coefficients m1, m2,
etc., and gives the best fitting line.
3. Non-Linear Regression
When the best fitting line is not a straight line but a curve, it is referred to as Non-Linear
Regression. Algorithm:
i. Use a LoadDataSet() function to open text file with tab delimited values and assume
the last value is the target value.
ii. Use second function, standRegres(), to compute the best-fit line as follows:
Load the x and y arrays and then convert them into matrices.
Compute X X and then test if its determinate is zero .
T

if the determinate is zero then you’ll get error


if the determinate is nonzero, you compute regression weights, ws and
return them
iii. Using ws, determinate the predicted value of y, by multiplying the x Matrix and ws.
iv. sort the point in ascending order, and plot the best fit line by plotting y.
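The following is a rough NumPy sketch of the normal-equation step described in the algorithm above (the function name stand_regres and the tiny dataset are illustrative, not taken from a specific library):

import numpy as np

def stand_regres(x_arr, y_arr):
    # Convert the inputs to float arrays
    X = np.asarray(x_arr, dtype=float)
    y = np.asarray(y_arr, dtype=float)
    xTx = X.T @ X
    # If the determinant of X^T X is zero, the matrix cannot be inverted
    if np.linalg.det(xTx) == 0.0:
        raise ValueError("X^T X is singular; cannot compute regression weights")
    # Regression weights from the normal equation: ws = (X^T X)^-1 X^T y
    ws = np.linalg.inv(xTx) @ X.T @ y
    return ws

# Tiny illustrative dataset; the first column of ones provides the intercept term
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

ws = stand_regres(X, y)
print("Weights (intercept, slope):", ws)
y_hat = X @ ws   # predicted values along the best-fit line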

New Concepts:
Example of simple linear regression using a simple dataset in R-code:

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
# Fit a linear regression model
model <- lm(y ~ x)
# Summary of the linear regression model
summary(model)
# Plotting the data with the regression line
plot(x, y, main = "Scatterplot with Regression Line")
abline(model, col = "red")
Output:

Learning Objectives:
To understand and implement linear regression on sample dataset.

Conclusion/Learning outcome:
Regression analysis is a family of statistical tools that can help business analysts build models to
predict trends, make tradeoff decisions, and model the real world for decision-making
support. These models can be used to predict the value of one or more variables from knowledge of
the value of other variables. Specific regression techniques include simple linear regression analysis,
multiple linear regression analysis, multiple curvilinear regression, multivariate linear regression,
and multivariate polynomial regression.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject: Data Analytics & Visualization Course Code: CSL-601


Semester: 5 Course: AI & DS
Laboratory No: 406 A Name of Subject Teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll Id: VU2S2223020

Practical Number 7
Title of Practical To implement Multiple Linear Regression.

Prior Concepts:
• Multiple linear regression refers to a statistical technique that uses two or more
independent variables to predict the outcome of a dependent variable.
• The technique enables analysts to determine the variation of the model and the relative
contribution of each independent variable in the total variance.
• Multiple regression can take two forms, i.e., linear regression and non-linear regression.

The multiple linear regression model is expressed as:

yi = β0 + β1·xi1 + β2·xi2 + … + βp·xip + ϵ

Where:
• yi is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of y when both xi1 and xi2 are 0.
• β1 and β2 are the regression coefficients representing the change in y relative to a
one-unit change in xi1 and xi2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model’s random error (residual) term.

Simple linear regression enables statisticians to predict the value of one variable using the
available information about another variable. Linear regression attempts to establish the
relationship between the two variables along a straight line.
Multiple regression is a type of regression where the dependent variable shows
a linear relationship with two or more independent variables. It can also be non-linear,
where the dependent and independent variables do not follow a straight line.
Both linear and non-linear regression track a particular response using two or more
variables graphically. However, non-linear regression is usually difficult to execute since it is
created from assumptions derived from trial and error.
Assumptions of Multiple Linear Regression
Multiple linear regression is based on the following assumptions:

1. A linear relationship between the dependent and independent variables
The first assumption of multiple linear regression is that there is a linear relationship
between the dependent variable and each of the independent variables. The best way to
check the linear relationships is to create scatterplots and then visually inspect the
scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then
the analyst will need to run a non-linear regression or transform the data using statistical
software, such as SPSS.
2. The independent variables are not highly correlated with each other
The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show
multicollinearity, there will be problems figuring out the specific variable that contributes to
the variance in the dependent variable. The best method to test for the assumption is the
Variance Inflation Factor method.
3. The variance of the residuals is constant
Multiple linear regression assumes that the amount of error in the residuals is similar at
each point of the linear model. This scenario is known as homoscedasticity. When analyzing
the data, the analyst should plot the standardized residuals against the predicted values to
determine if the points are distributed fairly across all the values of independent variables.
To test the assumption, the data can be plotted on a scatterplot or by using statistical
software to produce a scatterplot that includes the entire model.
4. Independence of observation
The model assumes that the observations should be independent of one another. Simply
put, the model assumes that the values of residuals are independent. To test for this
assumption, we use the Durbin Watson statistic.
The test will show values from 0 to 4, where a value of 0 to 2 shows positive autocorrelation,
and values from 2 to 4 show negative autocorrelation. The mid-point, i.e., a value of 2,
shows that there is no autocorrelation.
5. Multivariate normality
Multivariate normality occurs when residuals are normally distributed. To test this
assumption, look at how the values of residuals are distributed. It can also be tested using
two main methods, i.e., a histogram with a superimposed normal curve or the Normal
Probability Plot method.
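As a hedged illustration, two of these assumptions (no multicollinearity and independence of residuals) can be checked in Python with statsmodels; the generated data below is purely illustrative:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Illustrative data: two predictors and a noisy response
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2 + 1.5 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=50)

X = sm.add_constant(np.column_stack((x1, x2)))
model = sm.OLS(y, X).fit()

# Multicollinearity: VIF for each predictor (column 0 is the constant term)
for i in (1, 2):
    print("VIF for x%d: %.2f" % (i, variance_inflation_factor(X, i)))

# Independence of residuals: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(model.resid))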

R-code:
# Sample data
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(3, 4, 5, 6, 7)
y <- c(2, 4, 6, 8, 10)
# Combine the predictors into a data frame
data <- data.frame(x1, x2, y)
# Fit a multiple linear regression model
model <- lm(y ~ x1 + x2, data = data)
# Summary of the multiple linear regression model
summary(model)

# Making predictions using the model


new_data <- data.frame(x1 = c(6, 7), x2 = c(8, 9)) # New data for prediction

predicted_values <- predict(model, newdata = new_data)

print(predicted_values)

Output:
print(predicted_values)
 1  2
12 14

Python code:
from sklearn.linear_model import LinearRegression

# Sample data
x1 = [1, 2, 3, 4, 5]
x2 = [3, 4, 5, 6, 7]
y = [2, 4, 6, 8, 10]

# Combine predictors into a feature matrix


X = list(zip(x1, x2))

# Fit the multiple linear regression model


model = LinearRegression().fit(X, y)

# Print model coefficients and intercept


print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predict using the model


new_data = [[6, 8], [7, 9]] # New data for prediction
predicted_values = model.predict(new_data)

print(predicted_values)

Output:
Coefficients: [1. 1.]
Intercept: -2.0
[12. 14.]

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression

# Sample data
x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([3, 4, 5, 6, 7])
y = np.array([2, 4, 6, 8, 10])

# Reshape variables for sklearn
X = np.column_stack((x1, x2))

# Fit the multiple linear regression model


model = LinearRegression().fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Plotting the actual data


fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x1, x2, y, color='blue', label='Actual data')


ax.scatter(x1, x2, y_pred, color='red', label='Predicted data')

# Create a meshgrid for plotting the regression plane


x1_mesh, x2_mesh = np.meshgrid(np.linspace(min(x1), max(x1), 10), np.linspace(min(x2), max(x2), 10))
y_mesh = model.predict(np.column_stack((x1_mesh.ravel(), x2_mesh.ravel())))
y_mesh = y_mesh.reshape(x1_mesh.shape)

# Plotting the regression plane


ax.plot_surface(x1_mesh, x2_mesh, y_mesh, alpha=0.5, color='green', label='Regression plane')

ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Y')
ax.set_title('Multiple Linear Regression')

#ax.legend()
plt.show()

Output:

Learning Objectives:
To understand and implement Multiple Linear Regression in R/Python.

Conclusion/Learning outcome:

Multiple Linear Regression is understood and implemented in R and Python.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature
5 Marks 5 Marks 5 Marks 15 Marks

DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Course: Data Analytics & Visualization Course Code: CSL-601
Semester: 6 Department: AI & DS
Laboratory No: 406-A Name of Subject Teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll Id: VU2S2223020

Experiment No. 8

Aim: To implement Time Series Analysis.

Prior Concepts:
Sometimes data changes over time. This data is called time-dependent data. Given time-dependent data,
the past data can be analyzed to predict the future. The future prediction will also include time as a
variable, and the output will vary with time. Using time-dependent data, patterns that repeat over time
can be found. A Time Series is a set of observations that are collected after regular intervals of time. If
plotted, the Time series would always have one of its axes as time.

Figure 1: Time Series

Time Series Analysis in Python considers data collected over time might have some structure; hence it
analyses Time Series data to extract its valuable characteristics.

Consider the running of a bakery. Given the data of the past few months, one can predict what items are needed to
bake at what time. The morning crowd would need more bread items, like bread rolls,croissants, breakfast muffins,
etc. At night, people may come in to buy cakes and pastries or otherdessert items. Using time series analysis, one can
predict items popular during different times and even different seasons.
Different Components of Time Series Analysis:
The diagram depicted below shows the different components of Time Series Analysis:

1. Trend: The Trend shows the variation of data with time or the frequency of data. Using a Trend, you can see how
your data increases or decreases over time. The data can increase, decrease, or remain stable. Over time, population,
stock market fluctuations, and production in a company are all examples of trends.
2. Seasonality: Seasonality is used to find the variations which occur at regular intervals of time. Examples are
festivals, conventions, seasons, etc. These variations usually happen around the same time period and affect the data
in specific ways which you can predict.
3. Irregularity: Fluctuations in the time series data that do not correspond to the trend or seasonality. These variations in
your time series are purely random and usually caused by unforeseeable circumstances, such as a sudden decrease in
population due to an unexpected event.

ARIMA Model:
ARIMA (Auto-Regressive Integrated Moving Average) combines the following components:

Auto Regressive Model

Auto-Regressive models predict future behavior using past behavior, where there is some correlation between past
and future data. The formula below represents a first-order autoregressive model. It is a modified version of the slope formula,
with the target value expressed as the sum of the intercept, the product of a coefficient and the previous output,
and an error correction term:

y(t) = c + φ1·y(t−1) + ε(t)
Moving Average
Moving Average is a statistical method that takes the updated average of values to help cut down on noise. It takes
the average over a specific interval of time. You can get it by taking different subsets of your data and finding their
respective averages. You first consider a bunch of data points and take their average. You then find the next average
by removing the first value of the data and including the next value of the series.

Integration
Integration is the difference between present and previous observations. It is used to make the time series stationary.
Each of these values acts as a parameter for an ARIMA model. Instead of representing the ARIMA model by these
various operators and models, one can use parameters to represent them.
These parameters are:
1. p: Previous lagged values for each time point. Derived from the Auto-Regressive Model.
2. q: Previous lagged values for the error term. Derived from the Moving Average.
3. d: Number of times data is differenced to make it stationary. It is the number of times it performs integration.
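A small illustrative sketch of the differencing (d) and rolling-average ideas described above, using Pandas (the sample values are made up; note that the MA term in ARIMA formally models lagged error terms rather than a simple rolling mean):

import pandas as pd

series = pd.Series([266, 146, 183, 119, 180, 168, 232, 224, 193, 123])  # illustrative observations

# First-order differencing (corresponds to d = 1): present value minus previous value
differenced = series.diff().dropna()

# 3-period moving average to smooth out noise
moving_avg = series.rolling(window=3).mean()

print(differenced.head())
print(moving_avg.head())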

ARIMA with Python
The statsmodels library stands as a vital tool for those looking to harness the power of ARIMA for time series
forecasting in Python.

Building an ARIMA Model: A Step-by-Step Guide
1. Model Definition: Initialize the ARIMA model by invoking ARIMA() and specifying the p, d, and q parameters.
2. Model Training: Train the model on your dataset using the fit() method.
3. Making Predictions: Generate forecasts by utilizing the predict() function and designating the desired time index
or indices.
Let us fit an ARIMA model to the entire Shampoo Sales dataset and review the residual errors.
We’ll employ the ARIMA(5,1,0) configuration:
5 lags for autoregression (AR)
1st order differencing (I)
No moving average term (MA)
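A minimal sketch of this fit using statsmodels (it assumes the Shampoo Sales data is available locally as 'shampoo.csv' with a 'Sales' column; adjust the file name and parsing to your copy of the dataset):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load the monthly shampoo sales series (file name and column are assumptions)
series = pd.read_csv('shampoo.csv', header=0, index_col=0)['Sales']

# Fit ARIMA(5, 1, 0): 5 AR lags, first-order differencing, no MA term
model = ARIMA(series, order=(5, 1, 0))
model_fit = model.fit()
print(model_fit.summary())

# Review the residual errors
residuals = pd.DataFrame(model_fit.resid)
print(residuals.describe())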

OUTPUT:
Rolling Forecast ARIMA Model
The ARIMA model can be used to forecast future time steps. In a rolling forecast, the model is often retrained as
new data becomes available, allowing for more accurate and adaptive predictions. We can use the predict() function
on the ARIMAResults object to make predictions. It accepts the index of the time steps to make predictions as
arguments. These indexes are relative to the start of the training dataset used to make the predictions.

How to Forecast with ARIMA:


1. Use the predict() function on the ARIMAResults object. This function requires the index of the time
steps for which predictions are needed.
2. To revert any differencing and return predictions in the original scale, set the typ argument to ‘levels’.
3. For a simpler one-step forecast, employ the forecast() function.
We can split the training dataset into train and test sets, use the train set to fit the model and generate a
prediction for each element on the test set.
A rolling forecast is required given the dependence on observations in prior time steps for differencing
and the AR model. A crude way to perform this rolling forecast is to re-create the ARIMA model after
each new observation is received.
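A rough sketch of such a rolling forecast (it reuses the 'series' loaded in the earlier sketch; the 66/34 train-test split is illustrative):

from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

X = series.values
size = int(len(X) * 0.66)
train, test = list(X[:size]), X[size:]
predictions = []

for t in range(len(test)):
    # Re-create and re-fit the model after each new observation is received
    model_fit = ARIMA(train, order=(5, 1, 0)).fit()
    yhat = model_fit.forecast()[0]   # one-step-ahead forecast
    predictions.append(yhat)
    train.append(test[t])            # append the actual observation before re-fitting

print('Test MSE: %.3f' % mean_squared_error(test, predictions))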
Output:

The model could use further tuning of the p, d, and maybe even the q parameters.

Learning Objectives:
To understand and implement Time Series Analysis using Python/Excel

Conclusion/Learning outcome:
Time Series Analysis using ARIMA Model has been understood and implemented in Python.
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: DAV Course Code: CSL602


Semester: 06 Branch: AI & DS
Laboratory no. : 315 Name of subject teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll no: VU2S2223020

EXPERIMENT NO.09

Aim: To explore and implement data visualization using Power BI

Getting Started with Power BI

1) Download and Install Power BI Desktop:

Free to download from Microsoft: https://www.microsoft.com/en-us/download/details.aspx?id=58494
Offers a user-friendly interface for creating reports and visualizations.

2) Connect to Data Sources:

Power BI supports a wide range of data sources, including Excel files, CSV files, databases
(SQL Server, Azure SQL Database, etc.), cloud services (like Salesforce, Google Analytics), and
more.
Explore the "Get Data" section in Power BI Desktop to browse available connectors.
3) Explore Your Data:

Once connected, Power BI displays a preview of your data.


Use the "Fields" pane to examine the structure of your data (tables and columns).
Familiarize yourself with the data types (text, numbers, dates, etc.) for effective visualization

4) Visualizing data using Power BI:

Microsoft’s Power BI is a business analytics service that helps in analyzing and
visualizing data from various sources to craft data stories and share them with the end users.
Power BI is a combination of services and applications which work hand in
glove to create and share interactive business insights.
a) Area Charts

The area chart depends on line charts to display quantitative graphical data. The area between the axis
and lines is commonly filled with colors, textures, and patterns. You can compare more than two
quantities with area charts. It shows the trend changes over time and can be used to attract the
attention of the users to know the total changes across the trends.

b) Line Charts

Line charts are mostly used charts to represent the data and are characterized by a series of data
points connected by a straight line. Each point in the line corresponds to a data value in the given
category. It shows the exact value of the plotted data. Line charts should only be used to measure
the trends over a period of time, e.g. dates, months, and years
c) Bar Charts

In the list of Power BI visualization types, next, we are going to discuss bar charts.
Bar charts are mostly used graphs because they are simple to create and easy to understand. Bar
charts are also called horizontal charts that represent the absolute data. They are useful to display
the data that include negative values because it is possible to position the bars above and below
the x-axis.

d) Column Charts
Column charts are similar to bar charts, and the only difference between these two is, column
chart divides the same category data into the clusters and compares within the clusters. Also, it
compares the data from other clusters.
e) Pie Charts
A pie chart is a circular statistical chart, and it shows the whole data in parts. Each portion of a
pie chart represents a percentage, and the sum of all parts should be equal to 100%. The whole
data can be divided into slices to show the numerical proportions of each part of the data. Pie
charts are mostly used to represent data from the same category. They help users to understand the
data quickly. They are widely used in education, the business world, and communication media.

f) Doughnut Charts
Doughnut charts are similar to pie charts, and the chart is named a doughnut chart because it looks similar to a
doughnut. You can easily understand the data because doughnut charts show the whole data in
proportion. It is the most useful chart when you need to display the various proportions that
make up the final value.

5) Views in Power BI:

1) Report view:
This is the default view where we create our reports by arranging the visualizations
including different graphs, charts, and a lot more, according to our requirements over
multiple pages in a single report.

2) Data view:
When we are modeling our data, sometimes without creating a visual on the canvas, we
would like to see the loaded table or column. This view helps us to view data in a grid
format; piece by piece to analyze it closely.

3) Model view:
This view helps us to see the relationships between different tables and also the columns
they contain.

6) Perform Data Cleaning

Since we deal with a large volume of data from multiple sources, there are chances that our data
might be incorrect or have discrepancies in a variety of ways. So before getting into the action of
visualizing this data it is very essential to prepare this raw data by cleaning, transforming, and
modeling this data before further use.

You might have noticed when we were loading our data into Power BI, we came across two
options- one was Load and the other was Transform Data. If we click on the latter option,
another window launches. This is known as the Power Query Editor.
One more thing to be noted is we can also transform data by clicking on the Transform Data icon
from the Home tab of Power BI Desktop.

7) Data Modeling in Power BI

As explained above in this blog, we know that data modeling is done in the Model View in
Power BI Desktop. Here we can see all the tables, their columns, and the relationships between
them, which depicts how the different sources are connected.
8) Understanding DAX Functions and creating measures

Microsoft extensively uses DAX, Data Analysis Expressions, to create required information
using the existing tables and columns. This programming language is used to calculate and return
one or more values.

Consider, for example, a measure such as Total Turnover = SUM('NIFTY Table1'[Turnover]). This measure includes the following syntax elements:

A) The measure name is Total Turnover.


B) The equals sign operator (=) shows the beginning of the formula.
C) The DAX function SUM adds up all the numbers in the NIFTY Table1[Turnover] column.
D) Parentheses contain one or more arguments for the expression. An argument passes a value to
a function.
E) The referenced table is NIFTY Table1.
F) The referenced column is [Turnover] in that table. This is the argument from which the SUM function knows what to aggregate.
Conclusion: Thus, we have studied and implemented data visualization using Power BI.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: DAV Course Code: CSL602


Semester: 06 Branch: AI & DS
Laboratory no. : 315 Name of subject teacher: Srushti Jadhav
Name of Student: Sahil Surve Roll no: VU2S2223020

EXPERIMENT NO.10

Aim: To develop a small data analysis project using Power BI

1) Data Import:

➢ Open Power BI Desktop.


➢ Click on "Get Data" from the Home tab.
➢ Select your data source, which could be Excel, CSV, or any other database where you have
stored your Swiggy Bangalore data.
➢ Choose the appropriate file or database table and click "Load" to import the data into Power
BI.

2) Data Cleaning:

➢ Once the data is imported, you may need to clean it to remove any inconsistencies or errors.
➢ Identify and handle missing values, incorrect data types, and outliers.
➢ Ensure that columns are correctly formatted, especially numerical values like costof2 and
rating.
➢ Remove any duplicate rows if present.

3) Data Modeling:

➢ Navigate to the "Modeling" tab.
➢ Define relationships between tables if your data is spread across multiple tables.
➢ Create calculated columns or measures as needed. For example, you could calculate the
average rating for each restaurant or the total number of shops in each region.
➢ Rename columns to make them more understandable if necessary.
4) Data Visualization:

➢ Go to the "Report" tab to start creating visualizations.


➢ Use a pie chart to display the distribution of restaurants by region. Drag the "region" field
into the "Legend" and "Values" sections of the chart.
➢ Create a bar chart to show the top 10 restaurants based on some metric like rating or number
of orders. Drag the appropriate field (e.g., restaurant name) into the "Axis" section and the
metric (e.g., rating or number of orders) into the "Values" section.
➢ Generate a word cloud to visualize the cuisine served by different restaurants. Use the
"cuisine" field and adjust the size based on the frequency of each cuisine.
➢ Add cards to display key metrics such as average cost, number of shops, number of cuisines,
and average rating. Use the appropriate aggregation functions (e.g., average, count) on the
relevant fields.

A) Bar Chart:

◆ Data Preparation: Start by identifying the categorical variable (e.g., restaurant


names) and the numerical variable (e.g., ratings or number of orders).
◆ Design Decision: Determine whether you want to display a count, sum, average, or
another aggregation of the numerical variable for each category.
◆ Implementation: In Power BI, drag the categorical variable into the Axis field and
the numerical variable into the Values field. Power BI will automatically aggregate
the numerical values based on the chosen function.
◆ Customization: Customize the appearance of the bar chart by adjusting colors,
labels, axis titles, and other formatting options.
B) Pie Chart:

◆ Data Preparation: Similar to the bar chart, identify the categorical variable for
which you want to display proportions.
◆ Design Decision: Consider whether you want to display absolute values or
percentages for each category.
◆ Implementation: Drag the categorical variable into the Values field of the pie chart
visualization. Power BI will automatically calculate the proportions or percentages
for each category.
◆ Customization: Customize the appearance of the pie chart by adjusting colors,
labels, and other formatting options. You can also explode or highlight specific
segments for emphasis.
C) Word Cloud:

◆ Data Preparation: Prepare a text-based variable (e.g., cuisine names) for which you
want to visualize the frequency of occurrence.
◆ Design Decision: Decide whether to display raw frequencies or use a weighting
scheme (e.g., font size based on frequency).
◆ Implementation: In Power BI, use a custom visual or third-party visualization tool
that supports word clouds. Drag the text-based variable into the appropriate field
and configure the visualization settings.
◆ Customization: Customize the appearance of the word cloud by adjusting font styles,
colors, and other visual properties. You can also filter out common words or adjust
the weighting scheme.

D) Line Chart:

◆ Data Preparation: Identify the time-based variable (e.g., date) and the numerical
variable (e.g., sales revenue) you want to visualize over time.
◆ Design Decision: Determine the granularity of the time intervals (e.g., daily,
monthly, yearly) and the aggregation function for the numerical variable.
◆ Implementation: In Power BI, drag the time-based variable into the Axis field and
the numerical variable into the Values field. Power BI will automatically aggregate
the numerical values based on the chosen time intervals
◆ Customization: Customize the appearance of the line chart by adjusting colors, line
styles, axis titles, and other formatting options. You can also add data labels or
markers to highlight specific data points.

E) Scatter Plot:

◆ Data Preparation: Identify two numerical variables (e.g., cost and rating) that you
want to visualize for each data point.
◆ Design Decision: Decide whether you want to include a trend line or regression
analysis to show the relationship between the variables.
◆ Implementation: In Power BI, drag one numerical variable into the X-axis field and
the other into the Y-axis field. Power BI will plot each data point based on the
values of the two variables.
◆ Customization: Customize the appearance of the scatter plot by adjusting colors,
markers, axis titles, and other formatting options. You can also add a trend line or
regression analysis to visualize the relationship between the variables.

F) Applying Filters:

➢ Use the "Filters" pane to apply filters as needed. For example, you can filter the data
based on location by dragging the "location" field into the "Filters" section and selecting
the desired location(s).
5) Design and Formatting:

Customize the appearance of your report by adjusting colors, fonts, and layouts to make it more
visually appealing and easy to understand.
Arrange your visualizations in a logical order and use containers or backgrounds to group related
elements together.

6) Testing and Iteration:

Once your report is built, thoroughly test it to ensure that all visualizations are accurate and
interactive.
Iterate on your design based on feedback or if you discover any issues during testing.

7) Publishing:

When you're satisfied with your Power BI project, you can publish it to the Power BI service to
share it with others or to access it from anywhere.
Click on "Publish" from the Home tab and follow the prompts to publish your report to your
Power BI workspace.
Conclusion: Thus, we have studied and implemented a small data analysis project using Power BI.

R1 R2 R3
DOP DOS Conduction File Record Viva Voce Total Signature

5 Marks 5 Marks 5 Marks 15 Marks
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject: Machine Learning Course Code: CLS604


Semester: 6 Course: AI & DS
Laboratory No: 406 A Name of Subject Teacher: Anagha Dhavalikar
Name of Student: Sahil Surve Roll Id: VU2S2223020

Practical Number 8
Title of Practical To implement Error Back propagation Perceptron Training Algorithm.

Prior Concepts:
Back propagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights reduces error rates and makes the model reliable by
increasing its generalization. Back propagation in neural network is a short form for “backward
propagation of errors.” It is a standard method of training artificial neural networks. This
method helps calculate the gradient of a loss function with respect to all the weights in the
network.
How Back propagation Algorithm Works
The Back propagation algorithm in a neural network computes the gradient of the loss function
for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive
direct computation. It computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.
Consider the following Back propagation neural network example diagram to understand:

1. Inputs X, arrive through the pre-connected path


2. Input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to the
output layer.
4. Calculate the error in the outputs

Error = Desired Output – Actual Output


5. Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.

Keep repeating the process until the desired output is achieved.


Most prominent advantages of Back propagation are:

• Back propagation is fast, simple and easy to program


• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be learned.

Python Code:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Input datasets
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
expected_output = np.array([[0], [1], [1], [0]])

epochs = 10000
lr = 0.1
inputLayerNeurons, hiddenLayerNeurons, outputLayerNeurons = 2, 2, 1

# Random weights and bias initialization
hidden_weights = np.random.uniform(size=(inputLayerNeurons, hiddenLayerNeurons))
hidden_bias = np.random.uniform(size=(1, hiddenLayerNeurons))
output_weights = np.random.uniform(size=(hiddenLayerNeurons, outputLayerNeurons))
output_bias = np.random.uniform(size=(1, outputLayerNeurons))

print("Initial hidden weights: ", end='')
print(*hidden_weights)
print("Initial hidden biases: ", end='')
print(*hidden_bias)
print("Initial output weights: ", end='')
print(*output_weights)
print("Initial output biases: ", end='')
print(*output_bias)

# Training algorithm
for _ in range(epochs):
    # Forward propagation
    hidden_layer_activation = np.dot(inputs, hidden_weights)
    hidden_layer_activation += hidden_bias
    hidden_layer_output = sigmoid(hidden_layer_activation)

    output_layer_activation = np.dot(hidden_layer_output, output_weights)
    output_layer_activation += output_bias
    predicted_output = sigmoid(output_layer_activation)

    # Backpropagation
    error = expected_output - predicted_output
    d_predicted_output = error * sigmoid_derivative(predicted_output)

    error_hidden_layer = d_predicted_output.dot(output_weights.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # Updating weights and biases
    output_weights += hidden_layer_output.T.dot(d_predicted_output) * lr
    output_bias += np.sum(d_predicted_output, axis=0, keepdims=True) * lr
    hidden_weights += inputs.T.dot(d_hidden_layer) * lr
    hidden_bias += np.sum(d_hidden_layer, axis=0, keepdims=True) * lr

print("Final hidden weights: ", end='')
print(*hidden_weights)
print("Final hidden bias: ", end='')
print(*hidden_bias)
print("Final output weights: ", end='')
print(*output_weights)
print("Final output bias: ", end='')
print(*output_bias)
print("\nOutput from neural network after 10,000 epochs: ", end='')
print(*predicted_output)
Output:
Initial hidden weights: [0.86055117 0.72704426] [0.27032791 0.1314828 ]
Initial hidden biases: [0.05537432 0.30159863]
Initial output weights: [0.26211815] [0.45614057]
Initial output biases: [0.68328134]
Final hidden weights: [3.40623101 5.55641764] [3.39232116 5.48326586]
Final hidden bias: [-5.18724528 -2.20651674]
Final output weights: [-7.35157564] [6.82917403]
Final output bias: [-3.06370489]
Output from neural network after 10,000 epochs: [0.08112647] [0.92197194]
[0.9222861] [0.08599761]

Learning Objectives:
To understand and implement Error Back propagation Perceptron Training Algorithm.

Conclusion/Learning outcome:
An Error Back propagation Perceptron Training Algorithm is understood and used for
implementing an XOR Gate.

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject: Machine Learning Course Code: CSL604


Semester: 6 Course: AI & DS
Laboratory No: 406 A Name of Subject Teacher: Anagha Dhavalikar
Name of Student: Sahil Surve Roll Id: VU2S2223020

Practical Number 9
Title of Practical To implement Principal Component Analysis.

Theory: Dimensionality reduction is the process of reducing the number of features (or dimensions)
in a dataset while retaining as much information as possible. In other words, it is a process of
transforming high-dimensional data into a lower-dimensional space that still preserves the essence
of the original data. This can be done for a variety of reasons, such as to reduce the complexity of a
model, to improve the performance of a learning algorithm, or to make it easier to visualize the data.
There are several techniques for dimensionality reduction, including principal component analysis
(PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA). Each
technique uses a different method to project the data onto a lower-dimensional space while
preserving important information.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to the
problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most
important features. There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features based on their relevance
to the target variable, wrapper methods use the model performance as the criteria for selecting
features, and embedded methods combine feature selection with the model training process.
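
As a hedged illustration of a filter method (the Iris data and the scikit-learn calls are assumptions added for demonstration, not part of the original write-up), the short sketch below scores each feature against the class label with an ANOVA F-test and keeps the two most relevant features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load features and class labels
X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("F-scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)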
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data in a
lower-dimensional space. There are several methods for feature extraction, including principal
component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic
neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a
lower-dimensional space while preserving as much of the variance as possible.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that while the data in a
higher dimensional space is mapped to data in a lower dimension space, the variance of the data in
the lower dimensional space should be maximum.
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.

Steps for PCA Algorithm


1. Standardize the data: PCA requires standardized data, so the first step is to standardize
the data to ensure that all variables have a mean of 0 and a standard deviation of 1.
2. Calculate the covariance matrix: The next step is to calculate the covariance matrix of
the standardized data. This matrix shows how each variable is related to every other
variable in the dataset.
3. Calculate the eigenvectors and eigenvalues: The eigenvectors and eigenvalues of the
covariance matrix are then calculated. The eigenvectors represent the directions in which
the data varies the most, while the eigenvalues represent the amount of variation along
each eigenvector.
4. Choose the principal components: The principal components are the eigenvectors with
the highest eigenvalues. These components represent the directions in which the data
varies the most and are used to transform the original data into a lower-dimensional
space.
5. Transform the data: The final step is to transform the original data into the lower-
dimensional space defined by the principal components.

Ref. Link: https://www.turing.com/kb/guide-to-principal-component-analysis
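
Before the scikit-learn implementation below, the steps above can be cross-checked with a minimal NumPy sketch (the small toy data matrix is an assumption used only for demonstration):

import numpy as np

# Toy data: 5 samples, 3 features (illustrative values only)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.4]])

# Step 1: standardize (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: keep the eigenvectors with the largest eigenvalues (top 2 components)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Step 5: project the data onto the principal components
X_pca = X_std @ components
print(X_pca)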

Code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Instantiate PCA and specify the number of components


pca = PCA(n_components=2)
# Fit and transform the data to the new space
X_pca = pca.fit_transform(X)

# Create a DataFrame for visualization


df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
df['Target'] = y

# Display the transformed data


print(df.head())
Output:
Principal Component 1 Principal Component 2 Target
0 -2.684126 0.319397 0
1 -2.714142 -0.177001 0
2 -2.888991 -0.144949 0
3 -2.745343 -0.318299 0
4 -2.728717 0.326755 0

Learning Objectives:
To understand and implement Principal Component Analysis

Conclusion/Learning outcome: Thus, PCA algorithm is studied and implemented.

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature
DEPARTMENT OF ARTIFICIAL
INTELLIGENCE & DATA SCIENCE

Course: Machine Learning Course Code: CSL-60


Semester: 6 Department: AI & DS
Laboratory No: Name of Subject Teacher:
Name of Student: Sahil Surve Roll Id: VU2S2223020
EXPERIMENT NO:10
Aim: Case study on Classification using IRIS data set.

Theory: The Iris dataset is considered the "Hello World" of data science. It contains five
columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species
Type. Iris is a flowering plant; researchers have measured various features of the
different iris flowers and recorded them digitally.

You can download the Iris.csv file from Kaggle. Now we will use the Pandas library
to load this CSV file and convert it into a dataframe. The read_csv() method is used
to read CSV files.

Code:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the Iris dataset into a DataFrame
iris = pd.read_csv('/kaggle/input/iris/Iris.csv')
iris.head()
iris.shape
iris.info()

Output:
Learning Objectives: Our main objective is to classify the flowers into their respective species -
Iris setosa, Iris virginica and Iris versicolor by using various possible plots.
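
To illustrate this objective, a minimal sketch is given below (a hedged example: it assumes scikit-learn's bundled copy of the Iris data rather than the Kaggle CSV loaded above, and uses a pair plot plus a simple k-NN baseline):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the bundled Iris data into a DataFrame
iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Pair plot to visually separate the three species
sns.pairplot(df.drop(columns='target'), hue='species')
plt.show()

# Simple k-NN classifier as a baseline
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))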

Conclusion/Learning outcome: As can be seen from the above info, it is a balanced dataset.


DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Data Analytics & Visualization Course Code: CSL601
Semester: 6 Course: AI & DS
Laboratory no.: 407 Name of subject teacher: Anagha Dhavikar
Name of student: Sahil Surve Roll no: VU2S2223020

Experiment No. 11

Aim: Design and implement an Expert system for classification useful for real world application.
Prior Concepts:
Problem Definition, Knowledge Acquisition, Knowledge Representation, Inference Mechanism,
Uncertainty Handling, Validation and Testing, Integration, Maintenance and Updates
Theory:

An expert system for classification applies artificial intelligence to categorize data. Key
components include defining the problem, acquiring domain knowledge, and representing it in a
format the system can understand. An inference mechanism drives decision-making, utilizing
techniques like forward or backward chaining. Handling uncertainty is crucial, often managed
through fuzzy logic or probabilistic methods. Validation ensures accuracy, while integration
embeds the system into applications. Maintenance involves updating knowledge bases to keep
pace with evolving domains. Key points:

• Define problem and acquire domain knowledge.

• Represent knowledge for system understanding.

• Employ inference mechanisms for decision-making.

• Manage uncertainty through fuzzy logic or probabilistic methods.

• Validate system accuracy.

• Integrate into applications.


Importing the dataset and implementing the model:
Training the dataset and calculating accuracy:
Final output:
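
Since the implementation itself appears only as screenshots, a minimal sketch of the kind of classification model described is given below (an assumption for illustration: it uses a decision tree on the Iris dataset so that the learned if-then rules can be read off like an expert system's knowledge base):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Import the dataset and split it for training/testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Train the model and calculate its accuracy
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# The learned if-then rules act like the system's knowledge base
print(export_text(model, feature_names=list(iris.feature_names)))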

Conclusion/Learning outcome:

We learned to design and implement an expert system for classification that is useful for real-world
applications, and we successfully implemented the classification model.

Evaluation:

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature


DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Data Analytics & Visualization Course Code: CSL601
Semester: 6 Course: AI & DS
Laboratory no.: 407 Name of subject teacher: Anagha Dhavikar
Name of student: Sahil Surve Roll no: VU2S2223020

Experiment No. 12

Aim: Develop a regression model for prediction or forecasting, useful for real world application.
Prior Concepts:
Problem Identification, Data Collection, Data Preprocessing, Feature Selection, Model Selection,
Model Training, Model Evaluation, Model Tuning, Deployment and Integration.

Theory:
Regression models are statistical methods used for predicting the value of a dependent variable based on one or more
independent variables. These models establish a relationship between the dependent variable and the independent
variables through mathematical equations. The goal is to minimize the difference between the observed and predicted
values, typically measured using metrics like mean squared error or R-squared.

Key Points:

• Regression models predict or forecast continuous numerical values.

• They assume a linear relationship between the independent and dependent variables.

• Data preprocessing involves cleaning, transforming, and scaling the data for accurate modeling.

• Feature selection helps identify the most relevant independent variables for prediction.

• Model selection includes choosing appropriate regression algorithms like linear regression, polynomial
regression, or more advanced methods such as ridge or lasso regression.

• Model training involves fitting the selected algorithm to the training data to learn the relationship
between variables.

• Model evaluation assesses the performance using metrics like mean squared error, root mean squared
error, or R-squared on test data.
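
Building on these key points, a minimal sketch is given below (a hedged illustration only: the small synthetic humidity/wind-speed/temperature data stands in for the weather dataset mentioned in the conclusion, and the scikit-learn calls are assumptions, not the original submission):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic "weather" data: humidity (%) and wind speed (km/h) predicting temperature (deg C)
rng = np.random.default_rng(0)
X = rng.uniform([30, 0], [90, 20], size=(200, 2))
y = 35 - 0.15 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the linear regression model and evaluate it on the test split
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))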
Conclusion/Learning outcome:
We learned to develop a regression model for prediction or forecasting that is useful for real-world
applications, and we successfully implemented a regression model on a weather dataset.

Evaluation:

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature


DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Cryptography & System Security Lab Course Code: CSL602


Semester: 6 Branch: AI & DS
Laboratory no. : 302 Name of subject teacher: Gitanjali Korgaonkar
Name of Student : Sahil Surve Roll no: VU2S2223020

Experiment No: 6

Aim: Study of packet sniffer tools (Wireshark): 1. Download and install Wireshark and capture
ICMP, TCP, and HTTP packets in promiscuous mode. 2. Explore how the packets can be traced based
on different filters.

Theory:
Wireshark is a network packet analyzer. A network packet analyzer captures network
packets and tries to display the packet data in as much detail as possible.
Wireshark is used for:

Network administrators use it to troubleshoot network problems


Network security engineers use it to examine security problems
Developers use it to debug protocol implementations
People use it to learn network protocol internals
Wireshark, a network analysis tool formerly known as Ethereal, captures packets in real time and
displays them in a human-readable format. Wireshark includes filters, color-coding, and other features
that let you dig deep into network traffic and inspect individual packets. Features of Wireshark:
• Available for UNIX and Windows.

• Capture live packet data from a network interface.

• Open files containing packet data captured with tcpdump/WinDump, Wireshark, and a number of other packet capture programs.

• Import packets from text files containing hex dumps of packet data.

• Display packets with very detailed protocol information.

• Export some or all packets in a number of capture file formats.

• Filter packets on many criteria.

• Search for packets on many criteria.

• Colorize packet display based on filters.


Commands:-
Open ubuntu terminal
Install wireshark: #apt-get install wireshark
a) To know the name of your Ethernet interface (mostly it is "eth0"): # ifconfig

b) Start wireshark: #sudo wireshark

c) Once wireshark window opens, select the interface and click on start

Capturing Packets

d) After downloading and installing wireshark, you can launch it and click the name of an interface
under Interface List to start capturing packets on that interface.

e) For example, if you want to capture traffic on the wireless network, click your wireless
interface. You can configure advanced features by clicking Capture Options.
f) As soon as you click the interface's name, you'll see the packets start to appear in real time.
Wireshark captures each packet sent to or from your system.
g) Click the stop capture button near the top left corner of the window when you want to stop
capturing traffic
Wireshark uses colors to help you identify the types of traffic at a glance. By default, green is TCP
traffic, dark blue is DNS traffic, light blue is UDP traffic, and black identifies TCP packets with
problems — for example, they could have been delivered out-of-order.

Wireshark can record the capturing information in the file with extension .pcap (packet capture).

This file can be again reopened for analysis in offline mode.There is no need to remember filtering
commands. Filters can be applied by putting predefined strings in Wireshark.
Commands:-

1. Capturing packets of a particular host :- ip.addr == 192.168.42.3


Sets a filter for any packet with 192.168.42.3, as either the source or destination.

2. To capture a conversation between specified hosts


ip.addr == 10.0.5.119 && ip.addr == 91.189.94.25
Sets a conversation filter between the two defined IP addresses.

Filtering Packets
The most basic way to apply a filter is by typing it into the filter box at the top of the window and
clicking Apply (or pressing Enter). For example, type "dns" and you'll see only DNS packets. When
you start typing, Wireshark will help you autocomplete your filter.

1. To filter packets for a specific protocol http or dns

2. Sets a filter to display all http and dns requests.

3. To filter packets for specific port tcp.port==4000

4. Sets a filter for any TCP packet with 4000 as a source or destination port.
5. Filter specific packets: tcp.flags.reset == 1 displays all TCP resets.

6. Filter for HTTP request packets: http.request displays all HTTP requests (use http.request.method == "GET" for GET requests only).

7. To filter traffic except given protocol packets: !(arp or icmp or dns)

8. Masks out arp, icmp, dns, or whatever other protocols may be background noise, allowing
you to focus on the traffic of interest.
9. Capturing packets after applying multiple filters: not (tcp.port == 80) and not (tcp.port == 25)
Gets all packets which are not HTTP (port 80) or SMTP (port 25).
To stop capturing, click on the "red square".
To capture packets of an FTP server (Login ID and Password). What is FTP?

FTP stands for File Transfer Protocol. As the name suggest this network protocol allows you to
transfer files or directories from one host to another over the network whether it is your LAN or
Internet.
The package required to install FTP is known as VSFTPD (Very Secure File Transfer Protocol
Daemon)
Steps:-

1. Get root access: $ sudo su root

2. Find your IP address: # ifconfig
Installation of FTP server in Ubuntu


Name of Packages required: VSFTPD, XINETD

1. # sudo apt-get install vsftpd

2. # sudo apt-get install xinetd

The above command will install and start the xinetd superserver on your system. The chances are
that you already have xinetd installed on your system. In that case you can omit the above
installation command.
In the next step we need to edit the FTP server's configuration file which is present in

/etc/vsftpd.conf

1. # cd /etc

2. # ls

3. # gedit vsftpd.conf
Change the following line:

anonymous_enable=NO
to anonymous_enable=YES

This will instruct the FTP server to allow connecting with an anonymous client.
4. Save and close the gedit file

Now, that we are ready we can start the FTP server in the normal mode with:

5. # service xinetd restart

6. # service vsftpd restart OR

7. # /etc/init.d/vsftpd restart

Start Wireshark. In the FILTER field put ftp. This will filter all FTP packets.
Connecting to a client present on another machine:
$ ftp <IP address of the FTP server>
Name: anonymous
Please specify the password. Password:
Login successful. ftp>
ftp> quit
Goodbye.

Output:
While the client is establishing a connection with the FTP server, the wireshark running in the
background of the FTP server is able to capture all FTP packets. So, the Name and Password entered
by the client is visible in plain text in Wireshark. Apart from that the source and destination address
is also visible. If many clients are trying to connect with the server then source address, name and
password are visible for all of them.

Conclusion: Thus, packet sniffer software is explored for its different benefits.

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature


DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Cryptography & System Security lab Course Code: CSL602


Semester: 06 Branch: AI & DS
Laboratory no. : 315 Name of subject teacher: Gitanjali Korgaonkar
Name of Student: Sahil Surve Roll no: VU2S2223020

EXPERIMENT NO.07

Aim: Download and install nmap. Use it with different options to scan open ports, perform OS
fingerprinting, do a ping scan, TCP port scan, UDP port scan, Xmas scan, etc.

Nmap features include:


Host Discovery – Identifying hosts on a network. For example, listing the hosts which respond
to pings or have a particular port open.
Port Scanning – Enumerating the open ports on one or more target hosts.
Version Detection – Interrogating network services listening on remote devices to
determine the application name and version number.
OS Detection – Remotely determining the operating system and some hardware characteristics
of network devices.

Basic commands working in Nmap:


• For target specification: nmap <target's URL or IP, with spaces between them>
• For OS detection: nmap -O <target host's URL or IP>
• For version detection: nmap -sV <target host's URL or IP>
SYN scan is the default and most popular scan option for good reasons. It can be performed
quickly, scanning thousands of ports per second on a fast network not hampered by restrictive
firewalls. It is also relatively unobtrusive and stealthy since it never completes TCP connections

Steps:-
1. Get root access: $ sudo su root
2. # ifconfig
3. # apt-get install nmap
Commands:-
1. # nmap -V
It gives the version of Nmap
2. # nmap 192.168.23.20
It gives information about a single host. It gives the output in column form where first column is
the PORT, second column is the STATE and third column is the SERVICE
3. # nmap -v 192.168.23.20
It gives detailed information about the remote host.
4. # nmap -O 192.168.23.20
It finds the remote host's operating system and version (OS detection)
5. # nmap -sP 192.168.23.0/24
It scans a network and discovers which servers and devices are up and running (ping scan)
6. # nmap -sA 192.168.23.20
To discover if a host/network is protected by a firewall. The output has the word
FILTERED, which shows the presence of a firewall; UNFILTERED means no firewall.
7. # nmap -p T:23 192.168.23.20
It scans TCP port 23
8. # nmap -p 80,443 192.168.23.20
It scans multiple ports at one time
9. # nmap -sV 192.168.23.20
It detects remote service (server/daemon) version numbers. Version numbers are displayed only
if the port is open
10. nmap -sS 192.168.23.20
It performs SYN scan or Stealth scan. Open wireshark.
Set the Filter to TCP.
See the grey and red color packets
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: SYN bit should be set to 1
11. # nmap -sN 192.168.23.20
It performs TCP Null Scan. It does not set any bits (TCP flag header is 0) Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: No flag bits should be set
12. # nmap -sF 192.168.23.20
It performs FIN scan. It sets just the TCP FIN bit. Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: FIN flag should be set to 1
13. # nmap -sX 192.168.23.20
It performs TCP Xmas. It sets the FIN, PSH, and URG flags.
Open wireshark.
Set the Filter to TCP.
Double click any grey color TCP packet where destination address is the neighbour’s address
See the Flag field of TCP: the FIN, PSH, and URG flags should be set to 1
14. # nmap -sO 192.168.23.20
It performs an IP protocol scan and allows us to determine which IP protocols are supported by
the target machine.
15. # nmap -sU 192.168.23.20
It performs a UDP port scan.

OUTPUT:

C:\Users\vppcoe 11>nmap -V
Nmap version 7.93 ( https://nmap.org )
Platform: i686-pc-windows-windows
Compiled with: nmap-liblua-5.3.6 openssl-3.0.5 nmap-libssh2-1.10.0 nmap-libz-1.2.12 nmap-
libpcre-7.6 Npcap-1.78 nmap-libdnet-1.12 ipv6
Compiled without:
Available nsock engines: iocp poll select

C:\Users\vppcoe 11>nmap 192.168.1.254


Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:06 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.0021s latency).
Not shown: 996 filtered tcp ports (no-response)
PORT STATE SERVICE
53/tcp open domain
80/tcp open http
222/tcp open rsh-spx
443/tcp open https
MAC Address: 08:35:71:EC:CE:E8 (CASwell)

Nmap done: 1 IP address (1 host up) scanned in 5.07 seconds

C:\Users\vppcoe 11>nmap -V 192.168.1.254


Nmap version 7.93 ( https://nmap.org )
Platform: i686-pc-windows-windows
Compiled with: nmap-liblua-5.3.6 openssl-3.0.5 nmap-libssh2-1.10.0 nmap-libz-1.2.12 nmap-
libpcre-7.6 Npcap-1.78 nmap-libdnet-1.12 ipv6
Compiled without:
Available nsock engines: iocp poll select

C:\Users\vppcoe 11>nmap -O 192.168.1.254


Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:07 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.0010s latency).
Not shown: 996 filtered tcp ports (no-response)
PORT STATE SERVICE
53/tcp open domain
80/tcp open http
222/tcp open rsh-spx
443/tcp open https
MAC Address: 08:35:71:EC:CE:E8 (CASwell)
Warning: OSScan results may be unreliable because we could not find at least 1 open and 1
closed port
Device type: WAP|general purpose
Running (JUST GUESSING): Linux 2.4.X|2.6.X|3.X (85%)
OS CPE: cpe:/o:linux:linux_kernel:2.4 cpe:/o:linux:linux_kernel:2.6.22
cpe:/o:linux:linux_kernel:2.4.18 cpe:/o:linux:linux_kernel:3.0
Aggressive OS guesses: OpenWrt 0.9 - 7.09 (Linux 2.4.30 - 2.4.34) (85%), OpenWrt White
Russian 0.9 (Linux 2.4.30) (85%), OpenWrt Kamikaze 7.09 (Linux 2.6.22) (85%), Linux 2.4.18
(85%), Linux 3.0 (85%)
No exact OS matches for host (test conditions non-ideal).
Network Distance: 1 hop

OS detection performed. Please report any incorrect results at https://nmap.org/submit/ .


Nmap done: 1 IP address (1 host up) scanned in 9.22 seconds

C:\Users\vppcoe 11>nmap -sP 192.168.1.254


Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:11 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.00s latency).
MAC Address: 08:35:71:EC:CE:E8 (CASwell)
Nmap done: 1 IP address (1 host up) scanned in 0.06 seconds

C:\Users\vppcoe 11>nmap -sA 192.168.1.254


Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:12 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.0010s latency).
All 1000 scanned ports on 192.168.1.254 are in ignored states.
Not shown: 1000 filtered tcp ports (no-response)
MAC Address: 08:35:71:EC:CE:E8 (CASwell)

Nmap done: 1 IP address (1 host up) scanned in 22.90 seconds


C:\Users\vppcoe 11>nmap -p T:23 192.168.1.254
Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:13 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.00s latency).

PORT STATE SERVICE


23/tcp filtered telnet
MAC Address: 08:35:71:EC:CE:E8 (CASwell)

Nmap done: 1 IP address (1 host up) scanned in 0.30 seconds

C:\Users\vppcoe 11>nmap -sS 192.168.1.254


Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-27 12:22 India Standard Time
Nmap scan report for 192.168.1.254
Host is up (0.0025s latency).
Not shown: 996 filtered tcp ports (no-response)
PORT STATE SERVICE
53/tcp open domain
80/tcp open http
222/tcp open rsh-spx
443/tcp open https
MAC Address: 08:35:71:EC:CE:E8 (CASwell)

Nmap done: 1 IP address (1 host up) scanned in 4.79 seconds

Conclusion: Thus, the nmap tool is explored with the different options available for better
network scanning results.

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Cryptography & System Security lab Course Code: CSL602


Semester: 06 Branch: AI & DS
Laboratory no. : 315 Name of subject teacher: Gitanjali Korgaonkar
Name of Student: Sahil Surve Roll no: VU2S2223020

Experiment No. 8

Aim: Simulate the SQL injection attack, Cross-Site Scripting attack.

Sqlmap for sql injection attack:


Sqlmap is written in Python, so the first thing you need is the Python interpreter.
Download the Python interpreter from python.org. There are two series of Python, 2.7.x and
3.3.x; sqlmap should run fine with either. So download and install.
Next, download the sqlmap zip file from sqlmap.org and extract it in any directory. Launch
the DOS prompt and navigate to the directory of sqlmap.
Now run the sqlmap.py script with the Python interpreter. Start with a simple command:
sqlmap.py -u <URL to inject>
sqlmap.py -u http://testphp.vulnweb.com/listproducts.php?cat=1
Use --time-sec to speed up the process in case of slow server responses:
sqlmap.py -u http://testphp.vulnweb.com/listproducts.php?cat=1 --time-sec 15
It will show the MySQL version along with useful information about the database.
Database
Obtain the names of available databases by adding --dbs to the previous command:
sqlmap.py -u http://testphp.vulnweb.com/listproducts.php?cat=1 --dbs
Tables
Specify the desired database using -D and tell sqlmap to list the tables using the --tables command:
sqlmap.py -u http://testphp.vulnweb.com/listproducts.php?cat=1 -D acuart --tables
Columns
Specify the database using -D, the table using -T, and the columns using --columns:
sqlmap.py -u http://testphp.vulnweb.com/listproducts.php?cat=1 -D acuart -T artists --columns
Data
As usual, use -D for the database, -T for the table, -C for the column, and --dump for the data. The
final command to fetch data will appear as shown below:
sqlmap.py -u "http://testphp.vulnweb.com/listproducts.php?cat=1" -D acuart -T artists -C aname --dump

Conclusion: The sqlmap tool is useful for identifying SQL injection vulnerabilities and the need
for input validation.

Output:
Evaluation:

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature


DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Cryptography & System Security Course Code: CSL602


Semester: VI Course: AI&DS
Laboratory no. : 302 Name of subject teacher: Gitanjali Korgaonkar
Name of student: Sahil Surve Roll no: VU2S2223020

Experiment 9
Aim: To set up, configure, and use SNORT for intrusion detection

Theory:
Snort is an open source network intrusion prevention and detection system (IDS/IPS)
developed by Sourcefire. Combining the benefits of signature, protocol, and anomaly-based
inspection, Snort is the most widely deployed IDS/IPS technology worldwide. With millions
of downloads and nearly 400,000 registered users, Snort has become the de facto standard for
IPS.

Snort can be configured to run in three modes:


• Sniffer mode : It simply reads the packets off of the network and displays them for
you in a continuous stream on the console (screen)
• Packet Logger mode : logs the packets to disk
• Network Intrusion Detection System (NIDS) mode: it performs detection and
analysis on network traffic. This is the most complex and configurable mode
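
In NIDS mode, Snort compares traffic against rules. As a hedged illustration (this example rule is an assumption added for clarity, not part of Snort's default rule set), a custom rule placed in a local rules file such as /etc/snort/rules/local.rules to alert on any ICMP packet reaching the monitored network could look like:

alert icmp any any -> $HOME_NET any (msg:"ICMP packet detected"; sid:1000001; rev:1;)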

Steps:
a) Get root access: $ sudo su root
b) Update the package lists: # apt-get update

c) Installation
# apt-get install snort
During installation:
Put the name of the network interface (by default it is eth0; change it to the interface name of
your machine)
Put the IP address of the machine followed by /24 (by default it is the network address;
replace it with your IP addr/24)

d) Configuration
# cd /etc
# ls
# cd snort
# ls
# gedit snort.conf
Go to line no. 51:
ipvar HOME_NET any
Replace "any" with your IP address, i.e. ipvar HOME_NET 192.168._._
Save and close the file

e) Monitoring
# snort -q -A console -i enp2s0 (enp2s0 is the name of the interface)

To observe the output on your machine, run the following on the neighbour's machine:
$ nmap <IP address of your machine>
Output to be observed in the SNORT terminal: the IP address of the neighbour who is performing
the intrusion, i.e. port scanning

OUTPUT:
Conclusion: SNORT helps to understand and implement the intrusion detection process.

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
AND DATA SCIENCE

Subject: Cryptography & System Security lab Course Code: CSL602


Semester: 06 Branch: AI & DS
Laboratory no. : 315 Name of subject teacher: Gitanjali Korgaonkar
Name of Student: Sahil Surve Roll no: VU2S2223020

Experiment 10
Aim: Design a personal firewall using iptables

Theory:
All packets inspected by iptables pass through a sequence of built-in tables (queues) for
processing. Each of these queues is dedicated to a particular type of packet activity and is
controlled by an associated packet transformation/filtering chain.
• Filter Table

• Filter is the default table for iptables.

• Iptables’s filter table has the following built-in chains.

o INPUT chain – Incoming to firewall. For packets coming to the local server.

o OUTPUT chain – Outgoing from firewall. For packets generated locally and
going out of the local server.

o FORWARD chain – Packet for another NIC on the local server. For packets
routed through the local server.

• NAT Table

• This table is consulted when a packet that creates a new connection is encountered.


Iptable’s NAT table has the following built-in chains.

o PREROUTING chain – Alters packets before routing. i.e Packet translation


happens immediately after the packet comes to the system (and before routing).
This helps to translate the destination ip address of the packets to something that
matches the routing on the local server. This is used for DNAT (destination NAT).

o POSTROUTING chain – Alters packets after routing. i.e Packet translation


happens when the packets are leaving the system. This helps to translate the
source ip address of the packets to something that might match the routing on the
destination server. This is used for SNAT (source NAT).

o OUTPUT chain – NAT for locally generated packets on the firewall.

• Mangle Table

• Iptables’s Mangle table is for specialized packet alteration. This alters QOS bits in the
TCP header. Mangle table has the following built-in chains.

o PREROUTING chain

o OUTPUT chain

o FORWARD chain

o INPUT chain

o POSTROUTING chain

• Raw Table

• Iptable’s Raw table is for configuration exemptions. Raw table has the following built-in
chains.

o PREROUTING chain

o OUTPUT chain

• Security Table

• This table is used for Mandatory Access Control (MAC) networking rules, such as those
enabled by the SECMARK and CONNSECMARK targets. Mandatory Access Control is
implemented by Linux Security Modules such as SELinux. The security table is called
after the filter table, allowing any Discretionary Access Control (DAC) rules in the filter
table to take effect before MAC rules. This table provides the following built-in chains:
INPUT (for packets coming into the box itself), OUTPUT (for altering locally-generated
packets before routing), and FORWARD (for altering packets being routed through the
box).

• Chains: Tables consist of chains; rules are combined into different chains. The kernel
uses chains to manage packets it receives and sends out. A chain is simply a checklist of
rules which are followed in order. The rules operate with an if-then-else structure.
• Input – This chain is used to control the behaviour for incoming connections. For
example, if a user attempts to SSH into your PC/server, iptables will attempt to match the
IP address and port to a rule in the input chain.

• Forward – This chain is used for incoming connections that aren’t actually being
delivered locally. Think of a router – data is always being sent to it but rarely actually
destined for the router itself; the data is just forwarded to its target.

• Output – This chain is used for outgoing connections. For example, if you try to ping
howtogeek.com, iptables will check its output chain to see what the rules are regarding
ping and howtogeek.com before making a decision to allow or deny the connection
attempt.

• Targets:

• ACCEPT: Allow the packet to pass through the firewall.
• DROP: Deny access to the packet (silently).
• REJECT: Deny access and notify the sender.
• QUEUE: Send packets to user space.
• RETURN: Jump to the end of the chain and let the default target process it.

• Steps:-

• Get root access: $ sudo su root

• # apt-get install iptables


• Commands:-

• To see the list of iptables rules

• # iptables -L

• Initially it is empty.

• To block outgoing traffic to a particular destination for a specific protocol from a


machine

• Syntax: iptables -I OUTPUT -s <your ip> -d <neighbour ip> -p <protocol> -j <action>


Open one terminal and Ping the neighbour. Let the ping run.

• #ping 192.168.208.6

• Open another terminal and run the iptables command

• # iptables -I OUTPUT -s 192.168.208.18 -d 192.168.208.6 -p icmp -j DROP

• To allow outgoing traffic to a particular destination for a specific protocol from a


machine

• # iptables -I OUTPUT -s 192.168.208.18 -d 192.168.208.6 -p icmp -j ACCEPT

• To block outgoing traffic to a particular destination for a specific protocol from a


machine for sometime

• # iptables -I OUTPUT -s 192.168.208.18 -d 192.168.208.6 -p icmp -j REJECT

• Allow the traffic again by using ACCEPT instead of REJECT

• To block incoming traffic from particular destination for a specific protocol to machine

• Syntax: iptables -I INPUT -s <neighbour ip> -d <firewall ip> -p <protocol> -j <action>


Open one terminal and Ping the neighbour. Let the ping run.

• #ping 192.168.208.6

• Open another terminal and run the iptables command

• #iptables -I INPUT -s 192.168.208.6 -d 192.168.208.18 -p icmp -j DROP

• To allow incoming traffic from particular destination for a specific protocol to


machine
• Syntax: iptables -I INPUT -s <neighbour ip> -d <firewall ip> -p <protocol> -j <action>

• Open another terminal and run the iptables command

• #iptables -I INPUT -s 192.168.208.6 -d 192.168.208.18 -p icmp -j ACCEPT

• Check the ping status on the other terminal

• To clear the rules in iptables

• # iptables -F

• To block specific URL from machine

• # iptables -t filter -I INPUT -m string --string facebook.com -j REJECT --algo kmp


It will block facebook.com by performing string matching. The algorithm used for string
matching is KMP.

• If we change target from REJECT to ACCEPT, the site can be visited again.

Observations:

• In case of OUTPUT chain, for DROP and REJECT chain, at source machine we get two
different messages.

• For DROP – 'Operation Not Permitted'. Here no acknowledgement is provided.
• For REJECT – 'Destination Port Unreachable'. Here an acknowledgement is given.

• In case of INPUT chain for DROP and REJECT chain at source machine we get two
different responses as follows:

• For DROP – No message. Here No acknowledgement is provided.

• For REJECT – ‘Destination Port Unreachable’. Here acknowledgement is given.

OUTPUT

Installing IPtables
Check current iptables status

Allowing the connection:

Dropping the connection:


Rejecting the connection:

Conclusion: The implementation of iptables helps to understand the working principle of a
firewall in network security.

Evaluation:

DOP | DOS | R1: Conduction (5 Marks) | R2: File Record (5 Marks) | R3: Viva-Voce (5 Marks) | Total (15 Marks) | Signature
