O180421 Summer Internship Report
BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
YARRAMSETTI VENKATALAKSHMI
(O180421)
Under the Supervision of
YBI FOUNDATION,
New Delhi
2023
INTERNSHIP CERTIFICATE
CERTIFICATE
This is to certify that the internship report on ‘DATA SCIENCE AND MACHINE LEARNING’,
being submitted by YARRAMSETTI VENKATALAKSHMI (O180421) in partial fulfillment
of the requirements for the award of the degree of Bachelor of Technology in Electronics
and Communication Engineering at Dr. APJ Abdul Kalam RGUKT-AP IIIT Ongole, is a record
of bonafide internship work carried out by her under my guidance and supervision during the
academic year 2023-24.
The results presented in this report have been verified and found to be satisfactory. The results
embodied in this internship report have not been submitted to any other University
for the award of any other degree or diploma.
Head of Department,
Mr. G. Bala Nagireddy,
Department of ECE,
RGUKT, Ongole.
APPROVAL SHEET
Examiners ____________________________
____________________________
____________________________
Supervisors ____________________________
____________________________
____________________________
Date: ________________________
Place: ________________________
DECLARATION
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
YARRAMSETTI VENKATALAKSHMI
(O180421)
Date: _____________________
ACKNOWLEDGEMENT
Firstly, I would like to thank the team of YBI Foundation for giving me the opportunity to do
a virtual internship within the organization.
I am highly indebted to YBI Foundation for its guidance and constant supervision, for
providing the necessary information regarding the internship, and for its kind cooperation,
encouragement, and support in completing the internship.
I would like to express my special gratitude and thanks to our ELECTRONICS AND
COMMUNICATION ENGINEERING branch H.O.D. Mr. G. BALA NAGIREDDY and the
Director of RGUKT-Ongole, Prof. B. JAYARAMI REDDY, for giving me such attention
and time.
Date: _____________________
ABSTRACT
Data Science and Machine Learning are essential domains that extract insights from data and
enable computers to learn and make predictions. They have transformed industries like
healthcare, finance, and marketing, improving disease diagnosis, fraud detection, and
personalized recommendations. These technologies are also integrated into our daily lives,
from streaming platforms to voice assistants. It is important for everyone to be aware of these
domains as they empower individuals to make informed decisions, navigate the data-driven
world, and seize career opportunities. Understanding Data Science and Machine Learning
contributes to problem-solving, innovation, and shaping the future.
CONTENTS
Title
INTERNSHIP CERTIFICATE
CERTIFICATE
APPROVAL SHEET
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
1. Introduction
   1.1. Background
   1.2. Learning Objectives
   1.3. Assessment Works
2. Requirement Analysis
   2.1. Requirements Specification
      2.1.1. Hardware Requirements
      2.1.2. Software Requirements
   2.2. Technologies Used
      2.2.1. Python
      2.2.2. Numpy
      2.2.3. Pandas
      2.2.4. Matplotlib
      2.2.5. Seaborn
      2.2.6. Sklearn
3. Methodologies Used
   3.1. Machine Learning Prediction Flow
   3.2. Linear Regression
   3.3. Random Forest Regression
4. Code Implementation
5. Results and Discussion
6. Conclusion
7. References
LIST OF FIGURES
Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction
CHAPTER-1
INTRODUCTION
1.1. BACKGROUND
Data Science and Machine Learning are two interconnected fields that play a crucial
role in extracting insights and knowledge from data. Data Science involves the
collection, cleaning, analysis, and interpretation of data using various techniques,
while Machine Learning focuses on developing algorithms and models that enable
computers to learn and make predictions without explicit programming. These
domains have revolutionized industries, such as healthcare, finance, and marketing,
and have become integral to our daily lives. Understanding Data Science and Machine
Learning empowers individuals to make data-driven decisions, navigate the data-
driven world, and seize opportunities for innovation and career growth.
1.2. LEARNING OBJECTIVES
The following objectives were set at the beginning of the summer internship, and I was
able to complete them with some effort.
1.3. ASSESSMENT WORKS
The following are the practice projects and the internship final assessment project.
CHAPTER-2
REQUIREMENT ANALYSIS
2.1. Requirements Specification
To run an application or piece of software, we need a basic hardware and software
configuration. The following requirements were listed to run the projects smoothly.
2.2. Technologies Used
2.2.2. Numpy
NumPy, short for Numerical Python, is a fundamental library in the Python
ecosystem that plays a crucial role in Machine Learning and Data Science. It
provides powerful tools for efficient numerical computations and array
operations. NumPy's main feature is the ndarray, a multidimensional array object
that allows for fast and efficient manipulation of large datasets. This library offers
a wide range of mathematical functions and operations, making it ideal for tasks
such as data preprocessing, statistical analysis, and linear algebra computations.
NumPy's integration with other libraries, such as Pandas and Matplotlib, further
enhances its capabilities in data manipulation and visualization. Its efficient array
operations and mathematical functions make NumPy an essential tool for
implementing machine learning algorithms and performing complex data analysis
tasks in a concise and efficient manner.
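A minimal sketch of these array operations, using small invented values purely for illustration:
import numpy as np
# Create a 2D ndarray and inspect its shape
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
print(data.shape)          # (2, 3)
# Vectorized arithmetic applies element-wise, with no explicit loops
print(data * 10)
# Statistical reductions commonly used in preprocessing
print(data.mean(), data.std())
# Basic linear algebra: matrix product of data with its transpose
print(data @ data.T)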
2.2.3. Pandas
Pandas is a powerful and widely-used Python library that plays a crucial role in
the fields of Machine Learning and Data Science. It provides high-performance
data structures and data analysis tools, making it easier to manipulate, clean, and
analyze data. Pandas' DataFrame object allows for efficient handling of structured
data, enabling tasks such as data preprocessing, feature engineering, and
exploratory data analysis. With its intuitive and flexible API, Pandas simplifies
complex data operations, such as filtering, grouping, and merging, making it an
essential tool for data scientists and machine learning practitioners. Moreover,
Pandas seamlessly integrates with other libraries in the Python ecosystem, such
as NumPy and Matplotlib, enabling a comprehensive and streamlined workflow
for data analysis and visualization. Overall, Pandas empowers users to efficiently
work with data, making it an invaluable asset in the fields of Machine Learning
and Data Science.
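A minimal sketch of these DataFrame operations (the column names and values below are invented for illustration):
import pandas as pd
# Build a small DataFrame of illustrative records
df = pd.DataFrame({'species': ['Bream', 'Pike', 'Bream', 'Smelt'],
                   'weight': [242.0, 500.0, None, 9.8]})
# Cleaning: fill the missing weight with the column mean
df['weight'] = df['weight'].fillna(df['weight'].mean())
# Filtering and grouping, two everyday Pandas operations
print(df[df['weight'] > 100])
print(df.groupby('species')['weight'].mean())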
2.2.4. Matplotlib
Matplotlib is a widely used data visualization library in the fields of Data Science
and Machine Learning. It provides a comprehensive set of tools for creating
high-quality plots, charts, and graphs, allowing for effective data exploration and
communication. With Matplotlib, data scientists and machine learning
practitioners can visualize patterns, trends, and relationships in their data, aiding
exploratory analysis and the communication of results.
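A short sketch of a typical Matplotlib plot (the data here is synthetic):
import numpy as np
import matplotlib.pyplot as plt
# Synthetic data: a noisy linear trend
x = np.linspace(0, 10, 50)
y = 2 * x + np.random.normal(0, 1.5, size=x.shape)
# Scatter plot with axis labels and a title for clear communication
plt.scatter(x, y, label='observations')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy linear trend')
plt.legend()
plt.show()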
2.2.5. Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib. It
provides a high-level interface for creating visually appealing and informative
statistical graphics. Seaborn is particularly useful in the fields of Machine
Learning and Data Science as it offers a wide range of built-in functions and tools
for visualizing data distributions, relationships, and patterns. With Seaborn, one
can easily create various types of plots, such as scatter plots, bar plots, box plots,
and heatmaps, to explore and analyze data. These visualizations aid in
understanding the underlying patterns and trends in the data, making it easier to
make informed decisions and derive meaningful insights. Seaborn's integration
with Pandas, another popular Python library for data manipulation, further
enhances its usability and makes it an invaluable tool for data scientists and
machine learning practitioners.
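A minimal sketch using Seaborn's bundled 'tips' sample dataset (loading it assumes an internet connection):
import seaborn as sns
import matplotlib.pyplot as plt
# Load a sample dataset bundled with Seaborn
tips = sns.load_dataset('tips')
# Box plot: the distribution of total bill per day at a glance
sns.boxplot(data=tips, x='day', y='total_bill')
plt.show()
# Heatmap of pairwise correlations between the numeric columns
sns.heatmap(tips.corr(numeric_only=True), annot=True)
plt.show()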
2.2.6. Sklearn
Scikit-learn, also known as sklearn, is a widely-used Python library that provides
a comprehensive set of tools for machine learning and data science tasks. It offers
a user-friendly interface and a wide range of algorithms and utilities for tasks such
as classification, regression, clustering, and dimensionality reduction. Sklearn
simplifies the process of building and evaluating machine learning models by
providing a consistent API and a variety of preprocessing techniques for data
transformation and feature engineering. It also includes modules for model
selection, cross-validation, and performance evaluation, making it easier to
fine-tune and optimize models. Sklearn's extensive documentation and active
community support make it a valuable resource for both beginners and
experienced practitioners in the field of machine learning and data science.
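A minimal sketch of sklearn's consistent fit/predict API, using the bundled Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load a bundled dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# The same fit/predict pattern applies to nearly every sklearn estimator
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
# Built-in cross-validation helps with model selection
print(cross_val_score(model, X, y, cv=5).mean())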
CHAPTER-3
METHODOLOGIES USED
Apart from the technologies and libraries discussed above, a generic machine learning
prediction flow and some regression algorithms were used in the internship.
3.1. Machine Learning Prediction Flow
The process of machine learning prediction involves several key steps to effectively
solve a problem and make accurate predictions. These steps include defining the
problem, gathering and preparing data, selecting and training a model, evaluating its
performance, fine-tuning the model, and ultimately making predictions on new data.
By following this systematic approach, we can harness the power of machine learning
to gain valuable insights and make informed decisions.
1. Define the Problem: Clearly define the problem you want to solve and
determine the type of prediction task, such as classification or regression.
2. Gather and Prepare Data: Collect relevant data for your problem and
preprocess it. This includes handling missing values, encoding categorical variables,
and scaling numerical features.
3. Split the Data: Divide your dataset into training and testing sets. The training
set is used to train the machine learning model, while the testing set is used to evaluate
its performance.
4. Select a Model: Choose an appropriate machine learning algorithm based on
your problem and data characteristics. Consider factors such as interpretability,
complexity, and performance.
5. Train the Model: Fit the selected model to the training data. The model learns
patterns and relationships in the data to make predictions.
6. Evaluate the Model: Use the testing set to assess the performance of the trained
model. Common evaluation metrics include accuracy, precision, recall, and mean
squared error, depending on the prediction task.
The specific steps and techniques may vary depending on the problem and the machine
learning algorithm being used. It's important to adapt and refine the process based on
the unique requirements of every project.
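As a compact sketch of steps 2 through 6 (using a dataset bundled with scikit-learn so the example stays self-contained; the internship projects in Chapter 4 follow the same pattern):
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Step 2: gather and prepare data (a bundled regression dataset here)
X, y = load_diabetes(return_X_y=True)
# Step 3: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Steps 4-5: select and train a model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 6: evaluate on the held-out test data
print(mean_squared_error(y_test, model.predict(X_test)))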
3.2. Linear Regression
Linear regression is a widely used statistical technique for modeling the relationship
between a dependent variable and one or more independent variables. It assumes a
linear relationship between the variables, where the dependent variable can be
predicted as a linear combination of the independent variables.
In simple linear regression, there is only one independent variable, while in multiple
linear regression, there are multiple independent variables. The goal of linear
regression is to find the best-fit line that minimizes the difference between the
predicted values and the actual values of the dependent variable.
The equation for a simple linear regression model can be represented as:
y = b0 + b1 * x
where y is the dependent variable, x is the independent variable, b0 is the y-intercept,
and b1 is the slope of the line.
To estimate the coefficients (b0 and b1), the ordinary least squares (OLS) method is
commonly used. It minimizes the sum of the squared differences between the predicted
and actual values. The coefficients can be interpreted as the change in the dependent
variable for a unit change in the independent variable.
Linear regression is widely used in various fields, including economics, finance, social
sciences, and machine learning. It is not only used for prediction but also for
understanding the relationship between variables and identifying important features
that influence the dependent variable.
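To make the OLS estimation concrete, here is a small worked sketch on synthetic data (values invented for illustration), computing b1 and b0 from their closed-form expressions and checking them against scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic data generated from y = 3 + 2x plus noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 3 + 2 * x + rng.normal(0, 1, size=x.shape)
# OLS closed form: b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # should be close to 3 and 2
# The scikit-learn fit agrees with the closed-form coefficients
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_[0])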
3.3. Random Forest Regression
Random Forest Regression is an ensemble method that builds many decision trees and
combines their predictions. The algorithm proceeds as follows:
1. Randomly select a subset of the training data (with replacement) to build each
decision tree. This is known as bootstrap aggregating or "bagging."
2. Randomly select a subset of features at each node of the decision tree. This
helps to introduce randomness and reduce overfitting.
3. Build each decision tree using the selected data and features. The trees are
constructed by recursively splitting the data based on the selected features, aiming to
minimize the variance within each leaf node.
4. For regression tasks, the final prediction is obtained by averaging the
predictions of all the decision trees. For classification tasks, the majority vote of the
trees is taken as the final prediction.
For regression, the final prediction can be written as:
y = (1/N) * Σ(y_i)
where y is the predicted value, N is the number of decision trees in the forest, and y_i
is the prediction of each individual decision tree.
The algorithm's effectiveness lies in the diversity of the decision trees and their
collective wisdom, resulting in more accurate and stable predictions compared to a
single decision tree. It is important to note that Random Forest Regression, like any
other machine learning algorithm, has its limitations and assumptions. It may not
perform well with noisy or irrelevant features, and the interpretability of the model can
be challenging due to the ensemble nature of the algorithm.
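As a brief sketch of the averaging formula above: scikit-learn exposes the individual trees of a fitted forest through its estimators_ attribute, so we can check that the ensemble prediction equals the mean of the per-tree predictions (toy data, for illustration only):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Toy regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
# Fit a small forest of N = 10 trees
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
# y = (1/N) * sum(y_i): average the predictions of the individual trees
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:5])))  # True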
CHAPTER-4
CODE IMPLEMENTATION
The projects and assignments done for the completion of the internship are explained below
along with the code snippets of each prediction.
Given a dataset of fish species with characteristics such as the vertical, diagonal, and
cross lengths, height, and width, we will try to predict the weight of each fish from
these characteristics. We will use the linear regression method to see whether the
weight of a fish is related to its characteristics.
We consider ‘Weight’ as the target variable, and the remaining features except Species
are taken as independent variables. All the given code statements are to be placed in
individual cells in Google Colab or a Jupyter notebook.
Code :
import pandas as pd
# Load the fish measurements dataset
fish = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Fish.csv')
fish.head()
fish.info()
fish.describe()
fish.columns
# Target and features (all columns except Species)
y = fish['Weight']
X = fish[['Category', 'Height', 'Width', 'Length1', 'Length2', 'Length3']]
# Split, train, and evaluate as described in Chapter 5 (test size is illustrative)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
from sklearn.metrics import mean_absolute_error, r2_score
mean_absolute_error(y_test, y_pred), r2_score(y_test, y_pred)
The digits dataset consists of 8x8 pixel images of digits. The images attribute of the
dataset stores 8x8 arrays of grayscale values for each image. We will use these arrays
to visualize the first 4 images. The target attribute of the dataset stores the digit each
image represents. The Random Forest Classifier is used in the code to classify the
given images into the digits 0-9. All the given code statements are to be placed in
individual cells in Google Colab or a Jupyter notebook.
Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the digits dataset and visualize the first 4 images
df = load_digits()
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, df.images, df.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)

# Inspect the image arrays
df.images.shape
df.images[0]
df.images[0].shape
len(df.images)

# Flatten each 8x8 image into a 64-element feature vector
n_samples = len(df.images)
data = df.images.reshape((n_samples, -1))
data[0]
data[0].shape
data.shape

# Scale the pixel values from the range [0, 16] down to [0, 1]
data.min()
data.max()
data = data / 16
data.min()
data.max()
data[0]

# Split into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, df.target, test_size=0.3)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

# Train a Random Forest classifier and predict on the test set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
y_pred

# Evaluate with a confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
The Big Sales dataset contains the following features, with Item_Outlet_Sales as the
target variable to be predicted:
1. Item_Identifier
2. Item_Weight
3. Item_Fat_Content
4. Item_Visibility
5. Item_Type
6. Item_MRP
7. Outlet_Identifier
8. Outlet_Establishment_Year
9. Outlet_Size
10. Outlet_Location_Type
11. Outlet_Type
12. Item_Outlet_Sales
Code :
import numpy as np
import pandas as pd

# Load the Big Sales dataset
df = pd.read_csv('https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Big%20Sales%20Data.csv')
df.head()
df.info()
df.columns
df.describe()

# Handle missing values: fill Item_Weight with the mean weight of its Item_Type
df.isnull().sum()
df['Item_Weight'].fillna(df.groupby(['Item_Type'])['Item_Weight'].transform('mean'), inplace=True)
df.isnull().sum()

# Visualize pairwise relationships between the features
import seaborn as sns
sns.pairplot(df)

# Clean inconsistent category labels, then encode categorical variables as integers
df[['Item_Identifier']].value_counts()
df[['Item_Fat_Content']].value_counts()
df.replace({'Item_Fat_Content': {'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'}}, inplace=True)
df[['Item_Fat_Content']].value_counts()
df.replace({'Item_Fat_Content': {'Low Fat': 0, 'Regular': 1}}, inplace=True)
df[['Item_Type']].value_counts()
df.replace({'Item_Type': {'Fruits and Vegetables': 0, 'Snack Foods': 0, 'Household': 1,
                          'Frozen Foods': 0, 'Dairy': 0, 'Baking Goods': 0, 'Canned': 0,
                          'Health and Hygiene': 1, 'Meat': 0, 'Soft Drinks': 0, 'Breads': 0,
                          'Hard Drinks': 0, 'Others': 2, 'Starchy Foods': 0, 'Breakfast': 0,
                          'Seafood': 0}}, inplace=True)
df[['Item_Type']].value_counts()
df[['Outlet_Identifier']].value_counts()
df.replace({'Outlet_Identifier': {'OUT027': 0, 'OUT013': 1, 'OUT049': 2, 'OUT046': 3,
                                  'OUT035': 4, 'OUT045': 5, 'OUT018': 6, 'OUT017': 7,
                                  'OUT010': 8, 'OUT019': 9}}, inplace=True)
df[['Outlet_Identifier']].value_counts()
df[['Outlet_Size']].value_counts()
df.replace({'Outlet_Size': {'Small': 0, 'Medium': 1, 'High': 2}}, inplace=True)
df[['Outlet_Size']].value_counts()
df[['Outlet_Location_Type']].value_counts()
df.replace({'Outlet_Location_Type': {'Tier 1': 0, 'Tier 2': 1, 'Tier 3': 2}}, inplace=True)
df[['Outlet_Location_Type']].value_counts()
df[['Outlet_Type']].value_counts()
df.replace({'Outlet_Type': {'Grocery Store': 0, 'Supermarket Type1': 1,
                            'Supermarket Type2': 2, 'Supermarket Type3': 3}}, inplace=True)
df[['Outlet_Type']].value_counts()
df.head()
df.info()
df.shape

# Define the target variable and the feature matrix
y = df['Item_Outlet_Sales']
y.shape
y
X = df[['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP',
        'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size',
        'Outlet_Location_Type', 'Outlet_Type']]
X.shape
X

# Standardize the numerical features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_std = df[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]
X_std = sc.fit_transform(X_std)
X_std
X[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']] = pd.DataFrame(
    X_std, columns=['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year'])
X

# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2529)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# Train a Random Forest regressor and predict on the test set
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=2529)
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
y_pred.shape
y_pred

# Evaluate the model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mean_squared_error(y_test, y_pred)
mean_absolute_error(y_test, y_pred)
r2_score(y_test, y_pred)

# Visualize actual vs. predicted sales
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Price vs. Predicted Price")
plt.show()
CHAPTER-5
RESULTS AND DISCUSSION
The project assessments described above were executed successfully and produced the
expected outputs. The following paragraphs discuss the execution of each code snippet
and its results.
This program performs linear regression on the fish dataset using Python's scikit-learn
library. It loads the dataset, separates the target variable and input features, and splits
the data into training and testing sets. The program then trains a linear regression
model, makes predictions on the test data, and evaluates the model's performance using
mean absolute error and R-squared score. The program provides insights into the
relationship between the input features and the target variable and serves as a
foundation for further analysis and improvement of the model.
This program uses scikit-learn to perform classification on the digits dataset. It loads
the dataset, prepares the data by reshaping and normalizing it, and splits it into training
and testing sets. The program then trains a Random Forest classifier, makes predictions
on the test data, and evaluates the model's performance using a confusion matrix and
classification report. This program showcases the use of scikit-learn for classification
tasks and provides insights into the model's accuracy and performance on different
classes of digits.
This program uses scikit-learn to perform regression analysis on the Big Sales Data. It
loads the dataset, preprocesses the data by handling missing values and encoding
categorical variables, and performs feature scaling. The data is then split into training
and testing sets. A Random Forest regressor is trained on the training data and used to
make predictions on the test data. The program evaluates the model's performance
using mean squared error, mean absolute error, and R-squared score.
Finally, a scatter plot is generated to visualize the relationship between the actual and
predicted prices. This program demonstrates the use of scikit-learn for regression tasks
and provides insights into the model's accuracy in predicting sales prices.
Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction
CHAPTER-6
CONCLUSION
Beyond the projects described above, the internship provided an introduction to Kaggle, a
popular platform for data
science competitions and projects. I learned about the machine learning prediction flow, which
involves steps like train-test split to evaluate model performance. I also gained practical
experience in implementing linear regression, logistic regression, and random forest regression
algorithms for predictive modeling.
Overall, my internship at YBI Foundation equipped me with a solid foundation in data science
concepts and practical skills, enabling me to apply my knowledge in real-world scenarios.
CHAPTER-7
REFERENCES
The following websites were referred to in completing the summer internship program. I
am hereby including the GitHub repository link that was submitted for internship
completion.
https://github.com/Dinesh-Goli/ybi_project/tree/main
https://colab.research.google.com/drive/13QW3UaePIR9ieSF6aYZcII-5rTsgPUWG#scrollTo=BqlxbXNno4K1
https://www.ybifoundation.org/#/home
https://www.youtube.com/
https://www.ybifoundation.org/certificate-validation?credentialId=DWT1Z0O9AI93C