O180421 Summer Internship Report
BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
YARRAMSETTI VENKATALAKSHMI
(O180421)
Under the Supervision of
YBI FOUNDATION,
New Delhi
2023
INTERNSHIP CERTIFICATE
CERTIFICATE
This is to certify that the internship report on ‘DATA SCIENCE AND MACHINE LEARNING’,
being submitted by YARRAMSETTI VENKATALAKSHMI (O180421) in partial fulfillment
of the requirements for the award of the degree of Bachelor of Technology in Electronics
and Communication Engineering at Dr. APJ Abdul Kalam RGUKT-AP IIIT Ongole, is a record
of bonafide internship work carried out by her under my guidance and supervision during the
academic year 2023-24.
The results presented in this report have been verified and found to be satisfactory. The results
embodied in this internship report have not been submitted to any other University
for the award of any other degree or diploma.
Head of Department,
Mr. G. Bala Nagireddy,
Department of ECE,
RGUKT, Ongole.
APPROVAL SHEET
Examiners ____________________________
____________________________
____________________________
Supervisors ____________________________
____________________________
____________________________
Date: ________________________
Place: ________________________
DECLARATION
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
YARRAMSETTI VENKATALAKSHMI
(O180421)
Date: _____________________
ACKNOWLEDGEMENT
Firstly, I would like to thank the team of YBI Foundation for giving me the opportunity to do
a virtual internship within the organization.
I am highly indebted to YBI Foundation for its guidance and constant supervision, for
providing the necessary information regarding the internship, and for its kind cooperation,
encouragement, and support in completing the internship.
I would like to express my special gratitude and thanks to our ELECTRONICS AND
COMMUNICATION ENGINEERING branch H.O.D. Mr. G. BALA NAGIREDDY and the
Director of RGUKT-Ongole, Prof. B. JAYARAMI REDDY, for giving me such attention
and time.
Date: _____________________
ABSTRACT
Data Science and Machine Learning are essential domains that extract insights from data and
enable computers to learn and make predictions. They have transformed industries like
healthcare, finance, and marketing, improving disease diagnosis, fraud detection, and
personalized recommendations. These technologies are also integrated into our daily lives,
from streaming platforms to voice assistants. It is important for everyone to be aware of these
domains as they empower individuals to make informed decisions, navigate the data-driven
world, and seize career opportunities. Understanding Data Science and Machine Learning
contributes to problem-solving, innovation, and shaping the future.
CONTENTS
Title
INTERNSHIP CERTIFICATE
CERTIFICATE
APPROVAL SHEET
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
1. Introduction
   1.1. Background
   1.2. Learning Objectives
   1.3. Assessment Works
2. Requirement Analysis
   2.1. Requirements Specification
      2.1.1. Hardware Requirements
      2.1.2. Software Requirements
   2.2. Technologies Used
      2.2.1. Python
      2.2.2. Numpy
      2.2.3. Pandas
      2.2.4. Matplotlib
      2.2.5. Seaborn
      2.2.6. Sklearn
3. Methodologies Used
   3.1. Machine Learning Prediction Flow
   3.2. Linear Regression
   3.3. Random Forest Regression
4. Code Implementation
5. Results and Discussion
6. Conclusion
7. References
LIST OF FIGURES
Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction
CHAPTER-1
INTRODUCTION
1.1. BACKGROUND
Data Science and Machine Learning are two interconnected fields that play a crucial
role in extracting insights and knowledge from data. Data Science involves the
collection, cleaning, analysis, and interpretation of data using various techniques,
while Machine Learning focuses on developing algorithms and models that enable
computers to learn and make predictions without explicit programming. These
domains have revolutionized industries, such as healthcare, finance, and marketing,
and have become integral to our daily lives. Understanding Data Science and Machine
Learning empowers individuals to make data-driven decisions, navigate the data-
driven world, and seize opportunities for innovation and career growth.
1.2. LEARNING OBJECTIVES
The following objectives were set at the beginning of the summer internship, and I was
able to complete them with some effort.
1.3. ASSESSMENT WORKS
The following are the practice projects and the internship final assessment project.
CHAPTER-2
REQUIREMENT ANALYSIS
2.1. Requirements Specification
To run an application or piece of software, we need a basic hardware and software
configuration. The following requirements were listed to run the projects smoothly.
2.2. Technologies Used
2.2.2. Numpy
NumPy, short for Numerical Python, is a fundamental library in the Python
ecosystem that plays a crucial role in Machine Learning and Data Science. It
provides powerful tools for efficient numerical computations and array
operations. NumPy's main feature is the ndarray, a multidimensional array object
that allows for fast and efficient manipulation of large datasets. This library offers
a wide range of mathematical functions and operations, making it ideal for tasks
such as data preprocessing, statistical analysis, and linear algebra computations.
NumPy's integration with other libraries, such as Pandas and Matplotlib, further
enhances its capabilities in data manipulation and visualization. Its efficient array
operations and mathematical functions make NumPy an essential tool for
implementing machine learning algorithms and performing complex data analysis
tasks in a concise and efficient manner.
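A minimal sketch of these array operations, using small invented values purely for illustration:
import numpy as np
# Create a 2D ndarray and inspect its shape
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
print(data.shape)          # (2, 3)
# Vectorized arithmetic applies element-wise, with no explicit loops
print(data * 10)
# Statistical reductions commonly used in preprocessing
print(data.mean(), data.std())
# Basic linear algebra: matrix product of data with its transpose
print(data @ data.T)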
2.2.3. Pandas
Pandas is a powerful and widely-used Python library that plays a crucial role in
the fields of Machine Learning and Data Science. It provides high-performance
data structures and data analysis tools, making it easier to manipulate, clean, and
analyze data. Pandas' DataFrame object allows for efficient handling of structured
data, enabling tasks such as data preprocessing, feature engineering, and
exploratory data analysis. With its intuitive and flexible API, Pandas simplifies
complex data operations, such as filtering, grouping, and merging, making it an
essential tool for data scientists and machine learning practitioners. Moreover,
Pandas seamlessly integrates with other libraries in the Python ecosystem, such
as NumPy and Matplotlib, enabling a comprehensive and streamlined workflow
for data analysis and visualization. Overall, Pandas empowers users to efficiently
work with data, making it an invaluable asset in the fields of Machine Learning
and Data Science.
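A minimal sketch of these DataFrame operations (the column names and values below are invented for illustration):
import pandas as pd
# Build a small DataFrame of illustrative records
df = pd.DataFrame({'species': ['Bream', 'Pike', 'Bream', 'Smelt'],
                   'weight': [242.0, 500.0, None, 9.8]})
# Cleaning: fill the missing weight with the column mean
df['weight'] = df['weight'].fillna(df['weight'].mean())
# Filtering and grouping, two everyday Pandas operations
print(df[df['weight'] > 100])
print(df.groupby('species')['weight'].mean())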
2.2.4. Matplotlib
Matplotlib is a widely used data visualization library in the fields of Data Science
and Machine Learning. It provides a comprehensive set of tools for creating
high-quality plots, charts, and graphs, allowing for effective data exploration and
communication. With Matplotlib, data scientists and machine learning
practitioners can visualize patterns, trends, and relationships in their data, aiding
exploratory analysis and the communication of results.
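A short sketch of a typical Matplotlib plot (the data here is synthetic):
import numpy as np
import matplotlib.pyplot as plt
# Synthetic data: a noisy linear trend
x = np.linspace(0, 10, 50)
y = 2 * x + np.random.normal(0, 1.5, size=x.shape)
# Scatter plot with axis labels and a title for clear communication
plt.scatter(x, y, label='observations')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy linear trend')
plt.legend()
plt.show()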
2.2.5. Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib. It
provides a high-level interface for creating visually appealing and informative
statistical graphics. Seaborn is particularly useful in the fields of Machine
Learning and Data Science as it offers a wide range of built-in functions and tools
for visualizing data distributions, relationships, and patterns. With Seaborn, one
can easily create various types of plots, such as scatter plots, bar plots, box plots,
and heatmaps, to explore and analyze data. These visualizations aid in
understanding the underlying patterns and trends in the data, making it easier to
make informed decisions and derive meaningful insights. Seaborn's integration
with Pandas, another popular Python library for data manipulation, further
enhances its usability and makes it an invaluable tool for data scientists and
machine learning practitioners.
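A minimal sketch using Seaborn's bundled 'tips' sample dataset (loading it assumes an internet connection):
import seaborn as sns
import matplotlib.pyplot as plt
# Load a sample dataset bundled with Seaborn
tips = sns.load_dataset('tips')
# Box plot: the distribution of total bill per day at a glance
sns.boxplot(data=tips, x='day', y='total_bill')
plt.show()
# Heatmap of pairwise correlations between the numeric columns
sns.heatmap(tips.corr(numeric_only=True), annot=True)
plt.show()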
2.2.6. Sklearn
Scikit-learn, also known as sklearn, is a widely-used Python library that provides
a comprehensive set of tools for machine learning and data science tasks. It offers
a user-friendly interface and a wide range of algorithms and utilities for tasks such
as classification, regression, clustering, and dimensionality reduction. Sklearn
simplifies the process of building and evaluating machine learning models by
providing a consistent API and a variety of preprocessing techniques for data
transformation and feature engineering. It also includes modules for model
selection, cross-validation, and performance evaluation, making it easier to
fine-tune and optimize models. Sklearn's extensive documentation and active
community support make it a valuable resource for both beginners and
experienced practitioners in the field of machine learning and data science.
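A minimal sketch of sklearn's consistent fit/predict API, using the bundled Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load a bundled dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# The same fit/predict pattern applies to nearly every sklearn estimator
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
# Built-in cross-validation helps with model selection
print(cross_val_score(model, X, y, cv=5).mean())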
CHAPTER-3
METHODOLOGIES USED
Apart from the technologies and libraries discussed above, a generic machine learning
prediction flow and some regression algorithms were used in the internship.
3.1. Machine Learning Prediction Flow
The process of machine learning prediction involves several key steps to effectively
solve a problem and make accurate predictions. These steps include defining the
problem, gathering and preparing data, selecting and training a model, evaluating its
performance, fine-tuning the model, and ultimately making predictions on new data.
By following this systematic approach, we can harness the power of machine learning
to gain valuable insights and make informed decisions.
1. Define the Problem: Clearly define the problem you want to solve and
determine the type of prediction task, such as classification or regression.
2. Gather and Prepare Data: Collect relevant data for your problem and
preprocess it. This includes handling missing values, encoding categorical variables,
and scaling numerical features.
3. Split the Data: Divide your dataset into training and testing sets. The training
set is used to train the machine learning model, while the testing set is used to evaluate
its performance.
4. Select a Model: Choose an appropriate machine learning algorithm based on
your problem and data characteristics. Consider factors such as interpretability,
complexity, and performance.
5. Train the Model: Fit the selected model to the training data. The model learns
patterns and relationships in the data to make predictions.
6. Evaluate the Model: Use the testing set to assess the performance of the trained
model. Common evaluation metrics include accuracy, precision, recall, and mean
squared error, depending on the prediction task.
The specific steps and techniques may vary depending on the problem and the machine
learning algorithm being used. It's important to adapt and refine the process based on
the unique requirements of every project.
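As a compact sketch of steps 2 through 6 (using a dataset bundled with scikit-learn so the example stays self-contained; the internship projects in Chapter 4 follow the same pattern):
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Step 2: gather and prepare data (a bundled regression dataset here)
X, y = load_diabetes(return_X_y=True)
# Step 3: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Steps 4-5: select and train a model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 6: evaluate on the held-out test data
print(mean_squared_error(y_test, model.predict(X_test)))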
3.2. Linear Regression
Linear regression is a widely used statistical technique for modeling the relationship
between a dependent variable and one or more independent variables. It assumes a
linear relationship between the variables, where the dependent variable can be
predicted as a linear combination of the independent variables.
In simple linear regression, there is only one independent variable, while in multiple
linear regression, there are multiple independent variables. The goal of linear
regression is to find the best-fit line that minimizes the difference between the
predicted values and the actual values of the dependent variable.
The equation for a simple linear regression model can be represented as:
y = b0 + b1 * x
where y is the dependent variable, x is the independent variable, b0 is the y-intercept,
and b1 is the slope of the line.
To estimate the coefficients (b0 and b1), the ordinary least squares (OLS) method is
commonly used. It minimizes the sum of the squared differences between the predicted
and actual values. The coefficients can be interpreted as the change in the dependent
variable for a unit change in the independent variable.
Linear regression is widely used in various fields, including economics, finance, social
sciences, and machine learning. It is not only used for prediction but also for
understanding the relationship between variables and identifying important features
that influence the dependent variable.
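To make the OLS estimation concrete, here is a small worked sketch on synthetic data (values invented for illustration), computing b1 and b0 from their closed-form expressions and checking them against scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic data generated from y = 3 + 2x plus noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 3 + 2 * x + rng.normal(0, 1, size=x.shape)
# OLS closed form: b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # should be close to 3 and 2
# The scikit-learn fit agrees with the closed-form coefficients
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_[0])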
3.3. Random Forest Regression
Random Forest Regression is an ensemble method that builds many decision trees and
combines their predictions. The algorithm proceeds as follows:
1. Randomly select a subset of the training data (with replacement) to build each
decision tree. This is known as bootstrap aggregating or "bagging."
2. Randomly select a subset of features at each node of the decision tree. This
helps to introduce randomness and reduce overfitting.
3. Build each decision tree using the selected data and features. The trees are
constructed by recursively splitting the data based on the selected features, aiming to
minimize the variance within each leaf node.
4. For regression tasks, the final prediction is obtained by averaging the
predictions of all the decision trees. For classification tasks, the majority vote of the
trees is taken as the final prediction.
For regression, the final prediction can be written as:
y = (1/N) * Σ(y_i)
where y is the predicted value, N is the number of decision trees in the forest, and y_i
is the prediction of each individual decision tree.
The algorithm's effectiveness lies in the diversity of the decision trees and their
collective wisdom, resulting in more accurate and stable predictions compared to a
single decision tree. It is important to note that Random Forest Regression, like any
other machine learning algorithm, has its limitations and assumptions. It may not
perform well with noisy or irrelevant features, and the interpretability of the model can
be challenging due to the ensemble nature of the algorithm.
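As a brief sketch of the averaging formula above: scikit-learn exposes the individual trees of a fitted forest through its estimators_ attribute, so we can check that the ensemble prediction equals the mean of the per-tree predictions (toy data, for illustration only):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Toy regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
# Fit a small forest of N = 10 trees
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
# y = (1/N) * sum(y_i): average the predictions of the individual trees
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:5])))  # True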
CHAPTER-4
CODE IMPLEMENTATION
The projects and assignments done for the completion of the internship are explained below
along with the code snippets of each prediction.
Given a dataset of fish species with characteristics such as the vertical, diagonal, and
cross lengths, height, and width, we will try to predict the weight of each fish from
these characteristics. We will use the linear regression method to see whether the
weight of a fish is related to its characteristics.
We consider ‘Weight’ as the target variable, and the remaining features except Species
are taken as independent variables. All the given code statements are to be placed in
individual cells in Google Colab or a Jupyter notebook.
Code :
import pandas as pd
# Load the fish measurements dataset
fish = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Fish.csv')
fish.head()
fish.info()
fish.describe()
fish.columns
# Target and features (all columns except Species)
y = fish['Weight']
X = fish[['Category', 'Height', 'Width', 'Length1', 'Length2', 'Length3']]
# Split, train, and evaluate as described in Chapter 5 (test size is illustrative)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
from sklearn.metrics import mean_absolute_error, r2_score
mean_absolute_error(y_test, y_pred), r2_score(y_test, y_pred)
The digits dataset consists of 8x8 pixel images of digits. The images attribute of the
dataset stores 8x8 arrays of grayscale values for each image. We will use these arrays
to visualize the first 4 images. The target attribute of the dataset stores the digit each
image represents. The Random Forest Classifier is used in the code to classify the
given images into the digits 0-9. All the given code statements are to be placed in
individual cells in Google Colab or a Jupyter notebook.
Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the digits dataset and visualize the first 4 images
df = load_digits()
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, df.images, df.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)

# Inspect the image arrays
df.images.shape
df.images[0]
df.images[0].shape
len(df.images)

# Flatten each 8x8 image into a 64-element feature vector
n_samples = len(df.images)
data = df.images.reshape((n_samples, -1))
data[0]
data[0].shape
data.shape

# Scale the pixel values from the range [0, 16] down to [0, 1]
data.min()
data.max()
data = data / 16
data.min()
data.max()
data[0]

# Split into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, df.target, test_size=0.3)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

# Train a Random Forest classifier and predict on the test set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
y_pred

# Evaluate with a confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
The Big Sales dataset contains the following features, with Item_Outlet_Sales as the
target variable to be predicted:
1. Item_Identifier
2. Item_Weight
3. Item_Fat_Content
4. Item_Visibility
5. Item_Type
6. Item_MRP
7. Outlet_Identifier
8. Outlet_Establishment_Year
9. Outlet_Size
10. Outlet_Location_Type
11. Outlet_Type
12. Item_Outlet_Sales
Code :
import numpy as np
import pandas as pd

# Load the Big Sales dataset
df = pd.read_csv('https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Big%20Sales%20Data.csv')
df.head()
df.info()
df.columns
df.describe()

# Handle missing values: fill Item_Weight with the mean weight of its Item_Type
df.isnull().sum()
df['Item_Weight'].fillna(df.groupby(['Item_Type'])['Item_Weight'].transform('mean'), inplace=True)
df.isnull().sum()

# Visualize pairwise relationships between the features
import seaborn as sns
sns.pairplot(df)

# Clean inconsistent category labels, then encode categorical variables as integers
df[['Item_Identifier']].value_counts()
df[['Item_Fat_Content']].value_counts()
df.replace({'Item_Fat_Content': {'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'}}, inplace=True)
df[['Item_Fat_Content']].value_counts()
df.replace({'Item_Fat_Content': {'Low Fat': 0, 'Regular': 1}}, inplace=True)
df[['Item_Type']].value_counts()
df.replace({'Item_Type': {'Fruits and Vegetables': 0, 'Snack Foods': 0, 'Household': 1,
                          'Frozen Foods': 0, 'Dairy': 0, 'Baking Goods': 0, 'Canned': 0,
                          'Health and Hygiene': 1, 'Meat': 0, 'Soft Drinks': 0, 'Breads': 0,
                          'Hard Drinks': 0, 'Others': 2, 'Starchy Foods': 0, 'Breakfast': 0,
                          'Seafood': 0}}, inplace=True)
df[['Item_Type']].value_counts()
df[['Outlet_Identifier']].value_counts()
df.replace({'Outlet_Identifier': {'OUT027': 0, 'OUT013': 1, 'OUT049': 2, 'OUT046': 3,
                                  'OUT035': 4, 'OUT045': 5, 'OUT018': 6, 'OUT017': 7,
                                  'OUT010': 8, 'OUT019': 9}}, inplace=True)
df[['Outlet_Identifier']].value_counts()
df[['Outlet_Size']].value_counts()
df.replace({'Outlet_Size': {'Small': 0, 'Medium': 1, 'High': 2}}, inplace=True)
df[['Outlet_Size']].value_counts()
df[['Outlet_Location_Type']].value_counts()
df.replace({'Outlet_Location_Type': {'Tier 1': 0, 'Tier 2': 1, 'Tier 3': 2}}, inplace=True)
df[['Outlet_Location_Type']].value_counts()
df[['Outlet_Type']].value_counts()
df.replace({'Outlet_Type': {'Grocery Store': 0, 'Supermarket Type1': 1,
                            'Supermarket Type2': 2, 'Supermarket Type3': 3}}, inplace=True)
df[['Outlet_Type']].value_counts()
df.head()
df.info()
df.shape

# Define the target variable and the feature matrix
y = df['Item_Outlet_Sales']
y.shape
y
X = df[['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP',
        'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size',
        'Outlet_Location_Type', 'Outlet_Type']]
X.shape
X

# Standardize the numerical features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_std = df[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]
X_std = sc.fit_transform(X_std)
X_std
X[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']] = pd.DataFrame(
    X_std, columns=['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year'])
X

# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2529)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# Train a Random Forest regressor and predict on the test set
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=2529)
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
y_pred.shape
y_pred

# Evaluate the model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mean_squared_error(y_test, y_pred)
mean_absolute_error(y_test, y_pred)
r2_score(y_test, y_pred)

# Visualize actual vs. predicted sales
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Price vs. Predicted Price")
plt.show()
CHAPTER-5
RESULTS AND DISCUSSION
The project assessments described above were executed successfully and produced the
expected outputs. The following paragraphs discuss the execution of each code snippet
and its results.
This program performs linear regression on the fish dataset using Python's scikit-learn
library. It loads the dataset, separates the target variable and input features, and splits
the data into training and testing sets. The program then trains a linear regression
model, makes predictions on the test data, and evaluates the model's performance using
mean absolute error and R-squared score. The program provides insights into the
relationship between the input features and the target variable and serves as a
foundation for further analysis and improvement of the model.
This program uses scikit-learn to perform classification on the digits dataset. It loads
the dataset, prepares the data by reshaping and normalizing it, and splits it into training
and testing sets. The program then trains a Random Forest classifier, makes predictions
on the test data, and evaluates the model's performance using a confusion matrix and
classification report. This program showcases the use of scikit-learn for classification
tasks and provides insights into the model's accuracy and performance on different
classes of digits.
This program uses scikit-learn to perform regression analysis on the Big Sales Data. It
loads the dataset, preprocesses the data by handling missing values and encoding
categorical variables, and performs feature scaling. The data is then split into training
and testing sets. A Random Forest regressor is trained on the training data and used to
make predictions on the test data. The program evaluates the model's performance
using mean squared error, mean absolute error, and R-squared score.
Finally, a scatter plot is generated to visualize the relationship between the actual and
predicted prices. This program demonstrates the use of scikit-learn for regression tasks
and provides insights into the model's accuracy in predicting sales prices.
Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction
CHAPTER-6
CONCLUSION
Beyond the projects described above, the internship provided an introduction to Kaggle, a
popular platform for data
science competitions and projects. I learned about the machine learning prediction flow, which
involves steps like train-test split to evaluate model performance. I also gained practical
experience in implementing linear regression, logistic regression, and random forest regression
algorithms for predictive modeling.
Overall, my internship at YBI Foundation equipped me with a solid foundation in data science
concepts and practical skills, enabling me to apply my knowledge in real-world scenarios.
CHAPTER-7
REFERENCES
The following websites were referred to in completing the summer internship program. I
am hereby including the GitHub repository link that was submitted for internship
completion.
https://github.com/Dinesh-Goli/ybi_project/tree/main
https://colab.research.google.com/drive/13QW3UaePIR9ieSF6aYZcII-5rTsgPUWG#scrollTo=BqlxbXNno4K1
https://www.ybifoundation.org/#/home
https://www.youtube.com/
https://www.ybifoundation.org/certificate-validation?credentialId=DWT1Z0O9AI93C