O180421 summer internship report

electronics & communications (Rajiv Gandhi University of Knowledge and Technologies)

Downloaded by Mahidhar Nani (nanimahidhar76@gmail.com)

A SUMMER INTERNSHIP REPORT on

DATA SCIENCE AND MACHINE LEARNING

Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
YARRAMSETTI VENKATALAKSHMI
(O180421)
Under the Supervision of
YBI FOUNDATION,
New Delhi

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


RAJIV GANDHI UNIVERSITY OF KNOWLEDGE TECHNOLOGIES
ONGOLE CAMPUS

2023


INTERNSHIP CERTIFICATE

Can be verified at https://www.ybifoundation.org/certificate-validation?credentialId=DWT1Z0O9AI93C


DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


RAJIV GANDHI UNIVERSITY OF KNOWLEDGE TECHNOLOGIES
ONGOLE CAMPUS 2023

CERTIFICATE

This is to certify that the internship report on ‘DATA SCIENCE AND MACHINE LEARNING’,
submitted by YARRAMSETTI VENKATALAKSHMI (O180421) in partial fulfilment
of the requirements for the award of the degree of Bachelor of Technology in Electronics
and Communication Engineering at Dr. APJ Abdul Kalam, RGUKT-AP IIIT Ongole, is a record
of bonafide internship work carried out by the candidate under my guidance and supervision
during the academic year 2023-24.

The results presented in this report have been verified and found to be satisfactory. The results
embodied in this internship report have not been submitted to any other University
for the award of any other degree or diploma.

Head of Department,
Mr. G. Bala Nagireddy,
Department of ECE,
RGUKT, ONGOLE.


APPROVAL SHEET

This report entitled “SUMMER INTERNSHIP on DATA SCIENCE AND MACHINE
LEARNING” by YARRAMSETTI VENKATALAKSHMI (O180421) is approved for the
degree of Bachelor of Technology in Electronics and Communication Engineering.

Examiners ____________________________

____________________________

____________________________

Supervisors ____________________________

____________________________

____________________________

Date: ________________________

Place: ________________________


DECLARATION

I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.

YARRAMSETTI VENKATALAKSHMI
(O180421)

Date: _____________________


ACKNOWLEDGEMENT

Firstly, I would like to thank the team of YBI Foundation for giving me the opportunity to do a
virtual internship within the organization.

I am highly indebted to YBI Foundation for their guidance and constant supervision, for
providing the necessary information regarding the internship, and for their kind
cooperation, encouragement, and support in completing the internship.

I would like to express my special gratitude and thanks to the Head of the ELECTRONICS AND
COMMUNICATION ENGINEERING department, Mr. G. BALA NAGIREDDY, and to the
Director of Ongole-RGUKT, Prof. B. JAYARAMI REDDY, for giving me their attention
and time.

With Sincere Regards,


YARRAMSETTI VENKATALAKSHMI

Date: _____________________


ABSTRACT

Data Science and Machine Learning are essential domains that extract insights from data and
enable computers to learn and make predictions. They have transformed industries like
healthcare, finance, and marketing, improving disease diagnosis, fraud detection, and
personalized recommendations. These technologies are also integrated into our daily lives,
from streaming platforms to voice assistants. It is important for everyone to be aware of these
domains as they empower individuals to make informed decisions, navigate the data-driven
world, and seize career opportunities. Understanding Data Science and Machine Learning
contributes to problem-solving, innovation, and shaping the future.

This project report aims to provide a comprehensive overview of my internship experience,
highlighting the key concepts and techniques I learned during my time at YBI Foundation.
Additionally, this report will delve into the fundamentals of Data Science and Machine
Learning, explaining their significance and applications in various industries. By exploring
these domains, we can better understand how data-driven insights and predictive models can
be leveraged to solve complex problems and make informed decisions.


CONTENTS

Title
INTERNSHIP CERTIFICATE
CERTIFICATE
APPROVAL SHEET
DECLARATION
ABSTRACT

1. Introduction
1.1. Background
1.2. Learning Objectives
1.3. Assessment Works
2. Requirement Analysis
2.1. Requirements Specification
2.1.1. Hardware Requirements
2.1.2. Software Requirements
2.2. Technologies Used
2.2.1. Python
2.2.2. NumPy
2.2.3. Pandas
2.2.4. Matplotlib
2.2.5. Seaborn
2.2.6. Sklearn
3. Methodologies Used
3.1. Machine Learning Prediction Flow
3.2. Linear Regression
3.3. Random Forest Regression
4. Code Implementation
4.1. Fish Weight Prediction – Practice Project
4.2. Hand Written Digits Classification and Prediction
4.3. Big Sales Prediction
5. Results and Discussion
5.1. Fish Weight Prediction – Practice Project
5.2. Hand Written Digits Classification and Prediction
5.3. Big Sales Prediction
6. Summary and Conclusion
7. References

LIST OF FIGURES

Fig. 5.1.1. Prediction Result of Fish Weights and Accuracy metrics
Fig. 5.2.1. Loading and Flattening the images of dataset
Fig. 5.2.2. Classification and Accuracy metrics of the model
Fig. 5.3.1. Seaborn Pairplot of the Big Sales Data
Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction

CHAPTER-1
INTRODUCTION

1.1. BACKGROUND

Data Science and Machine Learning are two interconnected fields that play a crucial
role in extracting insights and knowledge from data. Data Science involves the
collection, cleaning, analysis, and interpretation of data using various techniques,
while Machine Learning focuses on developing algorithms and models that enable
computers to learn and make predictions without explicit programming. These
domains have revolutionized industries, such as healthcare, finance, and marketing,
and have become integral to our daily lives. Understanding Data Science and Machine
Learning empowers individuals to make data-driven decisions, navigate the data-
driven world, and seize opportunities for innovation and career growth.

I successfully completed an online course internship at YBI Foundation, New Delhi,
specializing in the fields of Data Science and Machine Learning. Throughout the
internship, I gained valuable experience and acquired a solid foundation of knowledge
in these domains. The internship provided me with practical exposure to various
techniques and tools used in Data Science and Machine Learning, enabling me to apply
them effectively in real-world scenarios. I am confident that the skills and knowledge
I have gained during this internship will greatly contribute to my future endeavours in
these fields.

1.2. LEARNING OBJECTIVES

The following objectives were set at the beginning of the summer internship, and I was
able to complete them with some effort.

• Scope of Data Science
• Introduction to Python
• Introduction to Google Colab
• Python Libraries for Data Science and Machine Learning
• Working on DataFrames
• Introduction to Kaggle
• Machine Learning Prediction Flow
• Train Test Split
• Linear Regression
• Logistic Regression
• Random Forest Regression
• Practice Project
• Internship Final Assessment Project

1.3. ASSESSMENT WORKS

The following are the practice project and the internship final assessment projects.

• Fish Weight Prediction using Linear Regression
• Hand Written Digits Classification and Prediction using Random Forest
• Big Sales Prediction using Random Forest Regression

CHAPTER-2
REQUIREMENT ANALYSIS
Running the projects requires a basic hardware and software configuration. The
requirements listed below were sufficient to run the projects smoothly.

2.1. REQUIREMENTS SPECIFICATION


2.1.1. HARDWARE REQUIREMENTS
• Processor : Intel i3 or above
• Memory : 4GB RAM
• CPU : 64 bit Architecture
2.1.2. SOFTWARE REQUIREMENTS
• Operating System : Windows/Linux/Mac (all compatible)
• Browser : Google Chrome/ Mozilla Firefox/ Microsoft Edge (all compatible)

2.2. TECHNOLOGIES USED


2.2.1. Python
Python is a versatile and widely-used programming language that has gained
immense popularity in the field of Data Science and Machine Learning. Its
simplicity, readability, and extensive libraries make it an ideal choice for data
analysis, modeling, and visualization. Python's rich ecosystem of libraries, such
as NumPy, Pandas, and Matplotlib, provide powerful tools for handling and
manipulating data efficiently. These libraries offer a wide range of functions and
methods for data preprocessing, exploratory data analysis, and statistical
modeling. Additionally, Python's integration with popular Machine Learning
frameworks like TensorFlow and scikit-learn allows for the development and
deployment of complex machine learning models. Its flexibility and ease of use
make Python a preferred language for data scientists and machine learning
practitioners, enabling them to efficiently tackle real-world problems and derive
meaningful insights from data.


2.2.2. NumPy
NumPy, short for Numerical Python, is a fundamental library in the Python
ecosystem that plays a crucial role in Machine Learning and Data Science. It
provides powerful tools for efficient numerical computations and array
operations. NumPy's main feature is the ndarray, a multidimensional array object
that allows for fast and efficient manipulation of large datasets. This library offers
a wide range of mathematical functions and operations, making it ideal for tasks
such as data preprocessing, statistical analysis, and linear algebra computations.
NumPy's integration with other libraries, such as Pandas and Matplotlib, further
enhances its capabilities in data manipulation and visualization. Its efficient array
operations and mathematical functions make NumPy an essential tool for
implementing machine learning algorithms and performing complex data analysis
tasks in a concise and efficient manner.
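As a small illustration of the array operations described above, the following sketch builds a tiny ndarray and applies a column-wise mean, broadcasting, and a matrix product. The array values are made up for illustration and are not taken from any internship dataset.

```python
import numpy as np

# Hypothetical measurements: 3 samples x 2 features (illustrative values only)
a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

col_means = a.mean(axis=0)      # mean of each column
centered = a - col_means        # broadcasting: the row of means is subtracted from every row
gram = centered.T @ centered    # 2x2 matrix product of the centered data with itself

print(a.shape)      # (3, 2)
print(col_means)    # [3. 4.]
print(gram)
```

The same three operations (reduction along an axis, broadcasting, and matrix multiplication) cover a large share of the preprocessing arithmetic that libraries such as scikit-learn perform internally.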

2.2.3. Pandas
Pandas is a powerful and widely-used Python library that plays a crucial role in
the fields of Machine Learning and Data Science. It provides high-performance
data structures and data analysis tools, making it easier to manipulate, clean, and
analyze data. Pandas' DataFrame object allows for efficient handling of structured
data, enabling tasks such as data preprocessing, feature engineering, and
exploratory data analysis. With its intuitive and flexible API, Pandas simplifies
complex data operations, such as filtering, grouping, and merging, making it an
essential tool for data scientists and machine learning practitioners. Moreover,
Pandas seamlessly integrates with other libraries in the Python ecosystem, such
as NumPy and Matplotlib, enabling a comprehensive and streamlined workflow
for data analysis and visualization. Overall, Pandas empowers users to efficiently
work with data, making it an invaluable asset in the fields of Machine Learning
and Data Science.
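The filtering, grouping, and merging operations mentioned above can be sketched on two toy tables. The table names, column names, and values below are hypothetical and serve only to show the calls.

```python
import pandas as pd

# Hypothetical toy tables (values are illustrative only)
sales = pd.DataFrame({
    "item": ["A", "B", "A", "C"],
    "units": [10, 5, 7, 3],
})
prices = pd.DataFrame({
    "item": ["A", "B", "C"],
    "price": [2.0, 4.0, 10.0],
})

big_orders = sales[sales["units"] > 4]                  # filtering rows with a boolean mask
units_per_item = sales.groupby("item")["units"].sum()   # grouping and aggregating
merged = sales.merge(prices, on="item")                 # merging two tables on a shared key
merged["revenue"] = merged["units"] * merged["price"]   # column-wise arithmetic

print(units_per_item)
print(merged)
```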

2.2.4. Matplotlib
Matplotlib is a widely-used data visualization library in the field of Data Science
and Machine Learning. It provides a comprehensive set of tools for creating
high-quality plots, charts, and graphs, allowing for effective data exploration and
communication. With Matplotlib, data scientists and machine learning
practitioners can visualize patterns, trends, and relationships in their data, aiding
in the understanding and interpretation of complex datasets. Matplotlib's
versatility enables the creation of various types of visualizations, including line
plots, scatter plots, bar plots, histograms, and more. These visualizations are
invaluable for data preprocessing, exploratory data analysis, model evaluation,
and presenting results. By leveraging Matplotlib, professionals in the field can
effectively communicate their findings and insights, making it an essential tool in
the data science and machine learning workflow.

2.2.5. Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib. It
provides a high-level interface for creating visually appealing and informative
statistical graphics. Seaborn is particularly useful in the fields of Machine
Learning and Data Science as it offers a wide range of built-in functions and tools
for visualizing data distributions, relationships, and patterns. With Seaborn, one
can easily create various types of plots, such as scatter plots, bar plots, box plots,
and heatmaps, to explore and analyze data. These visualizations aid in
understanding the underlying patterns and trends in the data, making it easier to
make informed decisions and derive meaningful insights. Seaborn's integration
with Pandas, another popular Python library for data manipulation, further
enhances its usability and makes it an invaluable tool for data scientists and
machine learning practitioners.

2.2.6. Sklearn
Scikit-learn, also known as sklearn, is a widely-used Python library that provides
a comprehensive set of tools for machine learning and data science tasks. It offers
a user-friendly interface and a wide range of algorithms and utilities for tasks such
as classification, regression, clustering, and dimensionality reduction. Sklearn
simplifies the process of building and evaluating machine learning models by
providing a consistent API and a variety of preprocessing techniques for data
transformation and feature engineering. It also includes modules for model
selection, cross-validation, and performance evaluation, making it easier to
fine-tune and optimize models. Sklearn's extensive documentation and active
community support make it a valuable resource for both beginners and
experienced practitioners in the field of machine learning and data science.
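The consistent API and the cross-validation utilities described above can be illustrated with a short sketch. Two different estimators are evaluated with the same call; the choice of models, the 3-fold setting, and the hyperparameters are assumptions made for this example, not settings used in the internship.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The digits dataset ships with scikit-learn, so no download is needed
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16

# Two unrelated algorithms share the same fit/predict/score interface
models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # 3-fold cross-validated accuracy, averaged over the folds
    scores[name] = cross_val_score(model, X, y, cv=3).mean()
    print(name, round(scores[name], 3))
```

Because every estimator exposes the same interface, swapping one model for another only changes the line that constructs it.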


CHAPTER-3
METHODOLOGIES USED
Apart from the technologies and libraries discussed above, a generic Machine Learning
prediction flow and some regression algorithms were used in the internship.

3.1. Machine Learning Prediction Flow

The process of machine learning prediction involves several key steps to effectively
solve a problem and make accurate predictions. These steps include defining the
problem, gathering and preparing data, selecting and training a model, evaluating its
performance, fine-tuning the model, and ultimately making predictions on new data.
By following this systematic approach, we can harness the power of machine learning
to gain valuable insights and make informed decisions.

1. Define the Problem: Clearly define the problem you want to solve and
determine the type of prediction task, such as classification or regression.
2. Gather and Prepare Data: Collect relevant data for your problem and
preprocess it. This includes handling missing values, encoding categorical variables,
and scaling numerical features.
3. Split the Data: Divide your dataset into training and testing sets. The training
set is used to train the machine learning model, while the testing set is used to evaluate
its performance.
4. Select a Model: Choose an appropriate machine learning algorithm based on
your problem and data characteristics. Consider factors such as interpretability,
complexity, and performance.
5. Train the Model: Fit the selected model to the training data. The model learns
patterns and relationships in the data to make predictions.
6. Evaluate the Model: Use the testing set to assess the performance of the trained
model. Common evaluation metrics include accuracy, precision, recall, and mean
squared error, depending on the prediction task.
7. Fine-tune the Model: Optimize the model's performance by adjusting
hyperparameters. This can be done through techniques like grid search or random
search.
8. Make Predictions: Once the model is trained and fine-tuned, use it to make
predictions on new, unseen data. Preprocess the new data in the same way as the
training data before feeding it into the model.
9. Evaluate Predictions: Assess the accuracy and reliability of the predictions by
comparing them to the ground truth or known outcomes.
10. Iterate and Improve: Analyze the results, identify areas for improvement, and
iterate on the process. This may involve trying different algorithms, feature
engineering techniques, or collecting more data.

The specific steps and techniques may vary depending on the problem and the machine
learning algorithm being used. It's important to adapt and refine the process based on
the unique requirements of every project.
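The steps above can be sketched end to end as follows. This is a minimal sketch rather than the code used in the internship: the synthetic data from make_regression, the Ridge model, and the small alpha grid are all assumptions chosen only to keep the example short and self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a regression problem; synthetic data stands in for a real dataset
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Step 3: hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Steps 4-5: preprocessing and model combined, so scaling is learned from training rows only
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Step 7: grid search over one hyperparameter with 5-fold cross-validation
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Steps 6, 8, 9: predict on unseen rows and compare with the known outcomes
y_pred = search.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
print(search.best_params_, round(test_r2, 3))
```

Wrapping the scaler and the model in one pipeline means new data is automatically preprocessed in the same way as the training data, which is the requirement stated in step 8.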

3.2. Linear Regression

Linear regression is a widely used statistical technique for modeling the relationship
between a dependent variable and one or more independent variables. It assumes a
linear relationship between the variables, where the dependent variable can be
predicted as a linear combination of the independent variables.

In simple linear regression, there is only one independent variable, while in multiple
linear regression, there are multiple independent variables. The goal of linear
regression is to find the best-fit line that minimizes the difference between the
predicted values and the actual values of the dependent variable.

The equation for a simple linear regression model can be represented as:
y = b0 + b1 * x, where y is the dependent variable, x is the independent variable, b0 is
the y-intercept, and b1 is the slope of the line.

To estimate the coefficients (b0 and b1), the ordinary least squares (OLS) method is
commonly used. It minimizes the sum of the squared differences between the predicted
and actual values. The coefficients can be interpreted as the change in the dependent
variable for a unit change in the independent variable.
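For simple linear regression, the OLS estimates have a closed form: b1 is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))^2, and b0 = mean(y) - b1 * mean(x). The sketch below applies these formulas to five made-up points and cross-checks the result against NumPy's polyfit; the data values are hypothetical.

```python
import numpy as np

# Hypothetical data, roughly following y = 1 + 2x with small noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for one independent variable
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean                                             # y-intercept

# Cross-check with NumPy's least-squares polynomial fit of degree 1
b1_np, b0_np = np.polyfit(x, y, deg=1)

print(b0, b1)        # approximately 1.09 and 1.97
print(b0_np, b1_np)
```

Here a slope of about 1.97 means that y increases by roughly 1.97 units for each unit increase in x, which is the interpretation of the coefficient given above.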


Linear regression is widely used in various fields, including economics, finance, social
sciences, and machine learning. It is not only used for prediction but also for
understanding the relationship between variables and identifying important features
that influence the dependent variable.

However, linear regression relies on certain assumptions,
such as linearity, independence of errors, and homoscedasticity. Violation of these
assumptions may affect the accuracy and reliability of the model. Therefore, it is
crucial to assess the assumptions and perform appropriate diagnostics to ensure the
validity of the linear regression model.

3.3. Random Forest Regression

Random Forest Regression is a powerful ensemble learning method that combines
multiple decision trees to make predictions. It is a variant of the Random Forest
algorithm, which is widely used for both classification and regression tasks. In
Random Forest Regression, multiple decision trees are built using different subsets of
the training data and random subsets of the features. Each decision tree independently
predicts the target variable, and the final prediction is obtained by averaging the
predictions of all the trees (for regression tasks).

The algorithm follows these steps:

1. Randomly select a subset of the training data (with replacement) to build each
decision tree. This is known as bootstrap aggregating or "bagging."
2. Randomly select a subset of features at each node of the decision tree. This
helps to introduce randomness and reduce overfitting.
3. Build each decision tree using the selected data and features. The trees are
constructed by recursively splitting the data based on the selected features, aiming to
minimize the variance within each leaf node.
4. For regression tasks, the final prediction is obtained by averaging the
predictions of all the decision trees. For classification tasks, the majority vote of the
trees is taken as the final prediction.

Mathematically, the prediction of a Random Forest Regression model can be
represented as:
y = (1/N) * Σ y_i (summed over i = 1, ..., N), where y is the predicted value, N is the
number of decision trees in the forest, and y_i is the prediction of the i-th decision tree.

The algorithm's effectiveness lies in the diversity of the decision trees and their
collective wisdom, resulting in more accurate and stable predictions compared to a
single decision tree. It is important to note that Random Forest Regression, like any
other machine learning algorithm, has its limitations and assumptions. It may not
perform well with noisy or irrelevant features, and the interpretability of the model can
be challenging due to the ensemble nature of the algorithm.
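The averaging formula can be checked directly in scikit-learn, whose fitted forests expose the individual trees through the estimators_ attribute. The sketch below is an illustration on synthetic data; the dataset, the number of trees (N = 25), and max_features=2 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

# Each tree is grown on a bootstrap sample; 2 of the 4 features are tried at each split
forest = RandomForestRegressor(n_estimators=25, max_features=2, random_state=0)
forest.fit(X, y)

# y_i: predictions of each individual tree for the first 5 rows, shape (N, 5)
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# y = (1/N) * sum of y_i
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X[:5]))  # agrees with the manual average up to rounding
```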


CHAPTER-4
CODE IMPLEMENTATION
The projects and assignments completed for the internship are explained below,
along with the code for each prediction task.

4.1. Fish Weight Prediction – Practice Project

The dataset describes fish of several species through characteristics such as vertical,
diagonal, and cross length, height, and width. We try to predict the weight of a fish from
these characteristics, using Linear Regression to see whether the weight of the
fish is related to them.

• Species: Species name of fish
• Weight: Weight of fish in grams
• Length1: Vertical length in cm
• Length2: Diagonal length in cm
• Length3: Cross length in cm
• Height: Height in cm
• Width: Diagonal width in cm

We consider ‘Weight’ as the target variable and the remaining features, except
Species, as independent variables. Each of the following code statements is to
be put in its own cell in Google Colab or Jupyter Notebook.

Code :
import pandas as pd
# Load the fish dataset
fish = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Fish.csv')
fish.head()
fish.info()
fish.describe()
fish.columns
# Target variable and input features
y = fish['Weight']
X = fish[['Category','Height', 'Width', 'Length1', 'Length2', 'Length3']]
from sklearn.model_selection import train_test_split
# Keep 70% of the rows for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=2529)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
model.intercept_
model.coef_
# Predict the weights of the held-out fish
y_pred = model.predict(X_test)
y_pred
from sklearn.metrics import mean_absolute_error, r2_score
mean_absolute_error(y_test,y_pred)
r2_score(y_test,y_pred)
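The two metrics computed at the end of this code can be reproduced by hand, which makes their meaning easier to see. The sketch below uses made-up actual and predicted weights (not values from the fish dataset) and compares the manual results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted weights in grams (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# Mean absolute error: average size of the prediction errors
mae = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_hat))  # both 15.0
print(r2, r2_score(y_true, y_hat))              # both approximately 0.98
```

A mean absolute error of 15 here means the predictions are off by 15 grams on average, while an R-squared of 0.98 means that 98% of the variance in the actual values is explained by the predictions.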

4.2. Hand Written Digits Classification and Prediction

The digit dataset consists of 8x8 pixel images of digits. The images attribute of the
dataset stores 8x8 arrays of grayscale values for each image. We will use these arrays
to visualize the first 4 images. The target attribute of the dataset stores the digit each
image represents. The Random Forest Classifier is used in the code to classify the
given images into digits of [0,1,2,3,4,5,6,7,8,9]. Each of the following code statements is to
be put in its own cell in Google Colab or Jupyter Notebook.

Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
df=load_digits()
_,axes=plt.subplots(nrows=1,ncols=4,figsize=(10,3))
for ax,image,label in zip(axes,df.images,df.target):
    ax.set_axis_off()
    ax.imshow(image,cmap=plt.cm.gray_r,interpolation="nearest")
    ax.set_title("Training:%i"%label)
df.images.shape
df.images[0]
df.images[0].shape
len(df.images)
n_samples=len(df.images)
data=df.images.reshape((n_samples,-1))
data[0]
data[0].shape
data.shape
data.min()
data.max()
data=data/16
data.min()
data.max()
data[0]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(data,df.target,test_size=0.3)
x_train.shape,x_test.shape,y_train.shape,y_test.shape
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)
y_pred
from sklearn.metrics import confusion_matrix,classification_report
confusion_matrix(y_test,y_pred)
print(classification_report(y_test,y_pred))
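The confusion matrix produced above can be read numerically: correct predictions lie on the diagonal, so overall accuracy is the diagonal sum divided by the total, and per-digit recall is each diagonal entry divided by its row sum. The sketch below repeats the same workflow in compact form to show this; the fixed random_state values are an assumption added here so that the split and the forest are reproducible (the code above leaves them unset).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
# Flatten each 8x8 image into 64 features and scale pixel values to [0, 1]
X = digits.images.reshape(len(digits.images), -1) / 16

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, y_pred)

accuracy_from_cm = np.trace(cm) / cm.sum()        # correct predictions / all predictions
per_class_recall = np.diag(cm) / cm.sum(axis=1)   # fraction of each true digit recovered

print(round(accuracy_from_cm, 3))
print(np.round(per_class_recall, 2))
```

The per-class recall values correspond to the recall column that classification_report prints for each digit.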

4.3. Big Sales Prediction

The 12 variables/features in the dataset are:

1. Item_Identifier
2. Item_Weight
3. Item_Fat_Content
4. Item_Visibility
5. Item_Type
6. Item_MRP
7. Outlet_Identifier
8. Outlet_Establishment_Year
9. Outlet_Size
10. Outlet_Location_Type
11. Outlet_Type
12. Item_Outlet_Sales

We consider ‘Item_Outlet_Sales’ as the target variable and the remaining
features as independent variables. A Random Forest Regressor is used
to predict the target variable. Each of the following code statements is to be put in
its own cell in Google Colab or Jupyter Notebook.

Code :
import numpy as np
import pandas as pd
df = pd.read_csv(r'https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Big%20Sales%20Data.csv')
df.head()
df.info()
df.columns
df.describe()
df.isnull().sum()
# Fill missing item weights with the mean weight of the same item type
df['Item_Weight'] = df['Item_Weight'].fillna(df.groupby(['Item_Type'])['Item_Weight'].transform('mean'))
df.isnull().sum()
import seaborn as sns
sns.pairplot(df)
df[['Item_Identifier']].value_counts()
df[['Item_Fat_Content']].value_counts()
# Merge inconsistent spellings of the fat-content labels
df.replace({'Item_Fat_Content':{'LF':'Low Fat','reg':'Regular','low fat':'Low Fat'}},inplace=True)


df[['Item_Fat_Content']].value_counts()
df.replace({'Item_Fat_Content':{'Low Fat':0,'Regular':1}},inplace=True)
df[['Item_Type']].value_counts()
df.replace({'Item_Type':{'Fruits and Vegetables':0,'Snack Foods':0,'Household':1,
'Frozen Foods':0,'Dairy':0,'Baking Goods':0,'Canned':0,
'Health and Hygiene':1,'Meat':0,'Soft Drinks':0,'Breads':0,
'Hard Drinks':0,'Others':2,'Starchy Foods':0,'Breakfast':0,
'Seafood':0}},inplace=True)
df[['Item_Type']].value_counts()
df[['Outlet_Identifier']].value_counts()
df.replace({'Outlet_Identifier':{'OUT027':0,'OUT013':1,'OUT049':2,'OUT046':3,
'OUT035':4,'OUT045':5,'OUT018':6,'OUT017':7,
'OUT010':8,'OUT019':9,
}},inplace=True)
df[['Outlet_Identifier']].value_counts()
df[['Outlet_Size']].value_counts()
df.replace({'Outlet_Size':{'Small':0,'Medium':1,'High':2}},inplace=True)
df[['Outlet_Size']].value_counts()
df[['Outlet_Location_Type']].value_counts()
df.replace({'Outlet_Location_Type':{'Tier 1':0,'Tier 2':1,'Tier 3':2}},inplace=True)
df[['Outlet_Location_Type']].value_counts()
df[['Outlet_Type']].value_counts()
df.replace({'Outlet_Type':{'Grocery Store':0,'Supermarket Type1':1,'Supermarket Type2':2,
'Supermarket Type3':3}},inplace=True)
df[['Outlet_Type']].value_counts()
df.head()
df.info()
df.shape
y = df['Item_Outlet_Sales']
y.shape
y
X = df[['Item_Weight','Item_Fat_Content','Item_Visibility','Item_Type','Item_MRP',
        'Outlet_Identifier','Outlet_Establishment_Year','Outlet_Size',
        'Outlet_Location_Type','Outlet_Type']].copy()  # copy so scaled values can be assigned safely
X.shape
X
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Standardize only the continuous columns
num_cols = ['Item_Weight','Item_Visibility','Item_MRP','Outlet_Establishment_Year']
X_std = sc.fit_transform(df[num_cols])
X_std
X[num_cols] = X_std
X
X
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=2529)
X_train.shape,X_test.shape,y_train.shape,y_test.shape
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=2529)
rfr.fit(X_train,y_train)
y_pred = rfr.predict(X_test)
y_pred.shape
y_pred
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mean_squared_error(y_test,y_pred)
mean_absolute_error(y_test,y_pred)
r2_score(y_test,y_pred)
import matplotlib.pyplot as plt
plt.scatter(y_test,y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Price vs. Predicted Price")
plt.show()
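In the code above, the scaler is fitted on all rows before the train/test split, so statistics of the test rows influence the scaled training data. One common alternative, sketched below under stated assumptions, is to fit the scaler inside a pipeline so that it only sees the training rows. The toy DataFrame, its three columns, and the synthetic sales formula are hypothetical stand-ins for the Big Sales Data; this is not the procedure used in the internship.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real dataset (values are synthetic)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 20, n),
    "Item_MRP": rng.uniform(30, 250, n),
    "Outlet_Type": rng.integers(0, 4, n),   # already label-encoded
})
toy["Item_Outlet_Sales"] = (
    toy["Item_MRP"] * (toy["Outlet_Type"] + 1) + rng.normal(0, 10, n)
)

X = toy.drop(columns="Item_Outlet_Sales")
y = toy["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=2529
)

# Scale only the continuous columns; pass the encoded column through unchanged
scale_numeric = ColumnTransformer(
    [("scale", StandardScaler(), ["Item_Weight", "Item_MRP"])],
    remainder="passthrough",
)

# The scaler is fitted on X_train only, when model.fit is called
model = Pipeline([
    ("prep", scale_numeric),
    ("rfr", RandomForestRegressor(n_estimators=50, random_state=2529)),
])
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)   # R-squared on the held-out rows
print(round(test_r2, 3))
```

Tree-based models such as Random Forest are largely insensitive to feature scaling, so the practical effect on this model is likely small; the pipeline pattern matters more when the same preprocessing is reused with scale-sensitive models such as linear regression.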


CHAPTER-5
RESULTS AND DISCUSSION
The project assessments above executed successfully and produced the expected outputs.
The figures listed below show the execution of the code.

5.1. Fish Weight Prediction – Practice Project

Fig. 5.1.1. Prediction Result of Fish Weights and Accuracy metrics

This program performs linear regression on the fish dataset using Python's scikit-learn
library. It loads the dataset, separates the target variable and input features, and splits
the data into training and testing sets. The program then trains a linear regression
model, makes predictions on the test data, and evaluates the model's performance using
mean absolute error and R-squared score. The program provides insights into the
relationship between the input features and the target variable and serves as a
foundation for further analysis and improvement of the model.


5.2. Hand Written Digits Classification and Prediction

Fig. 5.2.1. Loading and Flattening the images of dataset

This program uses scikit-learn to perform classification on the digits dataset. It loads
the dataset, prepares the data by reshaping and normalizing it, and splits it into training
and testing sets. The program then trains a Random Forest classifier, makes predictions
on the test data, and evaluates the model's performance using a confusion matrix and
classification report. This program showcases the use of scikit-learn for classification
tasks and provides insights into the model's accuracy and performance on different
classes of digits.
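
A minimal sketch of the steps described above is given below. It assumes the digits dataset bundled with scikit-learn (load_digits); the 70/30 split, the random seed, and the division by 16 for normalization are illustrative choices and may differ from the settings used in the project.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the 8x8 grayscale digit images bundled with scikit-learn.
digits = load_digits()
n_samples = len(digits.images)

# Flatten each 8x8 image into a 64-element vector and scale the
# pixel values (originally 0..16) into the range 0..1.
X = digits.images.reshape((n_samples, -1)) / 16.0
y = digits.target

# Split into training and testing sets (70/30 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2529
)

# Train a Random Forest classifier and predict on the test set.
clf = RandomForestClassifier(random_state=2529)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows of the confusion matrix are true digits, columns are predictions.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

The diagonal of the confusion matrix counts correctly classified digits, while off-diagonal entries show which digits are confused with one another.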

Fig. 5.2.2. Classification and Accuracy metrics of the model

5.3. Big Sales Prediction

Fig. 5.3.1. Seaborn Pairplot of the Big Sales Data

This program uses scikit-learn to perform regression analysis on the Big Sales Data. It
loads the dataset, preprocesses the data by handling missing values and encoding
categorical variables, and performs feature scaling. The data is then split into training
and testing sets. A Random Forest regressor is trained on the training data and used to
make predictions on the test data. The program evaluates the model's performance
using mean squared error, mean absolute error, and R-squared score.

Finally, a scatter plot is generated to visualize the relationship between the actual and
predicted prices. This program demonstrates the use of scikit-learn for regression tasks
and provides insights into the model's accuracy in predicting sales prices.
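
The preprocessing steps mentioned above (handling missing values, encoding categorical variables, and feature scaling) can be sketched on a small made-up table. This is an illustration under stated assumptions, not the project code: the data values, the Outlet_Size column, its ordinal mapping, and the choice of mean imputation are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up sample; values and the Outlet_Size column are assumptions.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan, 10.4],
    "Item_Visibility": [0.016, 0.019, 0.017, 0.0, 0.05, 0.03],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998, 1987, 2009],
    "Outlet_Size": ["Medium", "Small", "Medium", "High", "Small", "Medium"],
})

# 1. Handle missing values: fill gaps with the column mean (one common choice).
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# 2. Encode the categorical column as ordered integers (hypothetical mapping).
df["Outlet_Size"] = df["Outlet_Size"].map({"Small": 0, "Medium": 1, "High": 2})

# 3. Scale the numeric features to zero mean and unit variance.
num_cols = ["Item_Weight", "Item_Visibility", "Item_MRP",
            "Outlet_Establishment_Year"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df)
```

Tree-based models such as Random Forest do not strictly require scaled inputs, so the scaling step mainly matters when the same features are reused with scale-sensitive models.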

Fig. 5.3.2. Prediction, Accuracy and Visualization of the Big Sales Prediction

CHAPTER-6
SUMMARY AND CONCLUSIONS

During my internship at YBI Foundation, I had the opportunity to learn and explore various
topics related to data science. The internship covered a wide range of subjects, including
the scope of data science, an introduction to the Python programming language, and the use
of Google Colab for data analysis. I also gained knowledge of essential Python libraries for
data science and machine learning, and learned how to work with DataFrames for data
manipulation and analysis.

Furthermore, the internship provided an introduction to Kaggle, a popular platform for data
science competitions and projects. I learned about the machine learning prediction flow, which
involves steps like train-test split to evaluate model performance. I also gained practical
experience in implementing linear regression, logistic regression, and random forest regression
algorithms for predictive modeling.

Overall, my internship at YBI Foundation equipped me with a solid foundation in data science
concepts and practical skills, enabling me to apply my knowledge in real-world scenarios.

CHAPTER-7
REFERENCES

The following websites were referred to while completing the summer internship program.
The GitHub repository submitted for completion of the internship is also included.
https://github.com/Dinesh-Goli/ybi_project/tree/main
https://colab.research.google.com/drive/13QW3UaePIR9ieSF6aYZcII-5rTsgPUWG#scrollTo=BqlxbXNno4K1
https://www.ybifoundation.org/#/home
https://www.youtube.com/
https://www.ybifoundation.org/certificate-validation?credentialId=DWT1Z0O9AI93C
