
INTERNSHIP REPORT

AT INTERNSHALA TRAININGS,

Attended by
DIVYA DRISHTI
21105111017

in partial fulfilment for the award of the degree of


Bachelor of Technology
In
COMPUTER SCIENCE ENGINEERING
from 23.11.2024 to 22.07.2025

BONAFIDE CERTIFICATE

This is to certify that the “Internship report” submitted by


DIVYA DRISHTI is the work done by her and submitted during the
academic year, in partial fulfilment of the requirements for the
award of the degree of Bachelor of Technology in COMPUTER
SCIENCE ENGINEERING, at INTERNSHALA
TRAININGS.

HEAD OF THE DEPARTMENT INTERNSHIP COORDINATOR

CERTIFICATION

TABLE OF CONTENTS

1. Introduction
2. Agenda
3. Topics
   3.1 Python History
   3.2 Data Types
   3.3 Machine Learning Introduction
   3.4 Machine Learning Algorithms
       3.4.1 KNN Algorithm
       3.4.2 Linear Regression
       3.4.3 Logistic Regression
       3.4.4 Decision Tree
       3.4.5 Clustering Modules
4. Project
   4.1 Project Explanation
   4.2 Solution Approach

1. Introduction

The platform, which was founded in 2010, started out as a WordPress
blog that aggregated internships across India and published articles on
education, technology, and the skill gap. Internshala launched its online
training in 2014. As of 2018, the platform had 3.5 million students and
80,000 companies.

Services: Internship matching, online training

Founder: Sarvesh Agrawal

Industry: Education, Employment

Headquarters: Gurgaon, India

Internshala is India's no. 1 internship and training platform, with 40,000+
paid internships in engineering, MBA, media, law, arts, and other
streams.

In this internship, I learned the fundamentals of machine learning along
with its core algorithms. The 56-day internship covered 32 topics, with
mini projects and a final assessment at the end of the internship, i.e., on
the 56th day.

2. AGENDA

Day 1 - Introduction to Python and software requirements to download
Day 2 - Python comments, variables and syntax
Day 3 & 4 - Python Data Types, Numbers and Casting
Day 5 & 6 - Python Strings, Booleans and Operators
Day 7 & 8 - Lists and Tuples
Day 9 - Sets and Dictionaries
Day 10 - Machine Learning Introduction
Day 11 - Life Cycle of a Data Scientist
Day 12 & 13 - Data Exploration
Day 14 - Data Manipulation
Day 15 - Brief Introduction to Classification
Day 16 - Supervised Learning
Day 17 & 18 - Unsupervised Learning with a real-time example
Day 19 & 20 - Reinforcement Learning with a real-time example
Day 21 & 22 - Linear Regression
Day 23 - Logistic Regression
Day 24 - Random Forest
Day 25 - Module Quiz
Day 26 - Ensemble Models
Day 27 & 28 - Decision Trees & k-NN Model
Day 29 - Programming Questions from Level 0
Day 30 - Programming Questions from Level 1
Day 31 - Programming Questions from Level 2
Day 32 - Programming Questions from Level 3

3. Topics
3.1 Python History
Python is a widely used general-purpose, high-level programming language.
It was initially designed by Guido van Rossum in 1991 and is developed by
the Python Software Foundation. It was designed with an emphasis on code
readability, and its syntax allows programmers to express concepts in fewer
lines of code. The two most widely used versions are Python 2.x and 3.x;
there is considerable rivalry between the two, and each has its own fanbase.
The language is used for various purposes, such as development, scripting,
and software testing. Due to its elegance and simplicity, top technology
organisations like Dropbox, Google, Quora, Mozilla, Hewlett-Packard,
Qualcomm, IBM, and Cisco have adopted Python.

3.2 Data Types in Python
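
The training slides for this section are not reproduced here. As a quick
reference, the following is a minimal sketch of the built-in data types and
casting covered in the course; the names and values are illustrative, not
taken from the course material.

```python
# Python's core built-in data types (illustrative values)
number_int = 42                        # int
number_float = 3.14                    # float
text = "Internshala"                   # str
flag = True                            # bool
items = [1, 2, 3]                      # list: ordered, mutable
point = (4.0, 5.0)                     # tuple: ordered, immutable
unique = {1, 2, 3}                     # set: unordered, no duplicates
record = {"name": "Divya", "id": 17}   # dict: key-value pairs

# Casting between types
print(int("7") + float("2.5"))         # 9.5
print(type(items), type(record))       # <class 'list'> <class 'dict'>
```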

3.3 MACHINE LEARNING INTRODUCTION


Machine learning is a subfield of artificial intelligence (AI). The goal of
machine learning generally is to understand the structure of data and fit that
data into models that can be understood and utilized by people.
Although machine learning is a field within computer science, it differs from
traditional computational approaches. In traditional computing, algorithms
are sets of explicitly programmed instructions used by computers to
calculate or solve problems. Machine learning algorithms instead allow
computers to train on data inputs and use statistical analysis in order to
output values that fall within a specific range. Because of this, machine
learning facilitates computers in building models from sample data in order
to automate decision-making processes based on data inputs.
Any technology user today has benefitted from machine learning. Facial
recognition technology allows social media platforms to help users tag and
share photos of friends. Optical character recognition (OCR) technology
converts images of text into movable type. Recommendation engines,
powered by machine learning, suggest what movies or television shows to
watch next based on user preferences. Self-driving cars that rely on machine
learning to navigate may soon be available to consumers.

Classification of ML: machine learning approaches are broadly classified
into supervised learning, unsupervised learning, and reinforcement learning.
3.4 MACHINE LEARNING ALGORITHMS
3.4.1 KNN Algorithm:
The k-nearest neighbour algorithm is a pattern recognition model that can
be used for classification as well as regression. Often abbreviated as kNN,
the k in k-nearest neighbour is a positive integer, which is typically small.
In either classification or regression, the input will consist of the k closest
training examples within a space.
We will focus on k-NN classification. In this method, the output is class
membership. This will assign a new object to the class most common among
its k nearest neighbours. In the case of k = 1, the object is assigned to the
class of the single nearest neighbour.

Let’s look at an example of k-nearest neighbour. In the diagram below, there
are blue diamond objects and orange star objects. These belong to two
separate classes: the diamond class and the star class.

When a new object is added to the space, in this case a green heart, we will
want the machine learning algorithm to classify the heart into a certain class.

When we choose k = 3, the algorithm will find the three nearest neighbours
of the green heart in order to classify it to either the diamond class or the
star class.
In our diagram, the three nearest neighbours of the green heart are one
diamond and two stars. Therefore, the algorithm will classify the heart with
the star class.
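
Below is a minimal k-NN classification sketch mirroring the diamond/star
example with k = 3; the use of scikit-learn and the 2-D coordinates are
assumptions made for illustration, not taken from the course material.

```python
# k-NN classification with k = 3 (illustrative 2-D points)
from sklearn.neighbors import KNeighborsClassifier

# Training data: class 0 = "diamond", class 1 = "star"
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # diamonds
           [5.0, 5.0], [5.5, 4.0], [6.0, 5.5]]   # stars
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The "green heart": its three nearest neighbours are stars, so
# the algorithm assigns it to the star class
new_point = [[4.5, 4.0]]
print(knn.predict(new_point))  # [1]
```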

3.4.2 LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised
learning. It performs a regression task. Regression models a target
prediction value based on independent variables. It is mostly used for
finding out the relationship between variables and forecasting. Different
regression models differ based on the kind of relationship they assume
between the dependent and independent variables, and on the number of
independent variables being used.

Linear regression performs the task to predict a dependent variable value (y)
based on a given independent variable (x). So, this regression technique
finds out a linear relationship between x (input) and y (output). In the classic
example, x (input) is the work experience and y (output) is the salary of a
person. The regression line is the best-fit line for our model.
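
As an illustration of the work-experience vs. salary example, here is a
minimal scikit-learn sketch; the library choice and the figures are
illustrative assumptions.

```python
# Fit a straight line to experience/salary data (made-up figures)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [5], [7]])             # years of experience
y = np.array([30000, 35000, 42000, 55000, 68000])   # salary

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict the salary for an unseen input (4 years of experience)
print(model.predict([[4]]))
```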

3.4.3 LOGISTIC REGRESSION


Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although many more
complex extensions exist. In regression analysis, logistic regression (or logit
regression) estimates the parameters of a logistic model (a form of binary
regression).

It is used in statistical software to understand the relationship between the
dependent variable and one or more independent variables by estimating
probabilities using a logistic regression equation. This type of analysis can
help you predict the likelihood of an event happening or a choice being
made.

In statistics, the logistic model is used to model the probability of a certain
class or event taking place, such as the probability of a team winning, of a
patient being healthy, etc. This can be extended to model several classes of
events such as determining whether an image contains a cat, dog, lion, etc.
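
The following is a minimal logistic regression sketch: modelling the
probability of a binary outcome from a single feature. The toy data and the
use of scikit-learn are assumptions for illustration.

```python
# Binary classification with logistic regression (toy data)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])   # binary class labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.0]]))        # predicted class for a new input
print(clf.predict_proba([[2.0]]))  # [P(class 0), P(class 1)]
```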

3.4.4 DECISION TREES
For general use, decision trees are employed to visually represent decisions
and show or inform decision making. When working with machine learning
and data mining, decision trees are used as a predictive model. These models
map observations about data to conclusions about the data’s target value.
The goal of decision tree learning is to create a model that will predict the
value of a target based on input variables.

In the predictive model, the data’s attributes that are determined through
observation are represented by the branches, while the conclusions about the
data’s target value are represented in the leaves.
When “learning” a tree, the source data is divided into subsets based on an
attribute value test, and this is repeated on each of the derived subsets
recursively. Once every record in the subset at a node has the same target
value, the recursion is complete.

Let’s look at an example of various conditions that can determine whether
or not someone should go fishing. This includes weather conditions as well
as barometric pressure conditions.

In the simplified decision tree above, an example is classified by sorting it
through the tree to the appropriate leaf node. This then returns the
classification associated with the particular leaf, which in this case is either
a Yes or a No. The tree classifies a day’s conditions based on whether or not
it is suitable for going fishing.

A true classification tree data set would have a lot more features than what
is outlined above, but relationships should be straightforward to determine.
When working with decision tree learning, several determinations need to
be made, including what features to choose, what conditions to use for
splitting, and understanding when the decision tree has reached a clear
ending.
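
As an illustration of the fishing example, here is a minimal scikit-learn
sketch; the numeric encodings of the conditions and the labels are
assumptions made up for the demonstration.

```python
# Decision tree for the "go fishing" example
# Assumed encoding: weather (0 = sunny, 1 = rainy), pressure (0 = low, 1 = high)
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [weather, pressure]; target: 1 = go fishing, 0 = stay home
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = [1, 0, 0, 0]   # illustrative: fish only when sunny with high pressure

tree = DecisionTreeClassifier().fit(X, y)

# Print the learned splits as text, then classify a new day's conditions
print(export_text(tree, feature_names=["weather", "pressure"]))
print(tree.predict([[0, 1]]))  # [1] -> Yes, go fishing
```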

3.4.5 CLUSTERING MODULES


Cluster analysis, or clustering, is an unsupervised machine learning task. It
involves automatically discovering natural grouping in data. Unlike
supervised learning (like predictive modelling), clustering algorithms only
interpret the input data and find natural groups or clusters in feature space.
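
The following minimal sketch shows how k-means, one common clustering
algorithm, discovers natural groups without any labels; the 2-D points and
the use of scikit-learn are illustrative assumptions.

```python
# Unsupervised clustering with k-means (illustrative 2-D points)
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # one natural group
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])  # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster
```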

4. PROJECT

CREDIT CARD FRAUD DETECTION

4.1 Project Explanation


The problem statement chosen for this project is to predict fraudulent
credit card transactions with the help of machine learning models.
In this project, we will analyse customer-level data which has been collected
and analysed during a research collaboration of Worldline and the Machine
Learning Group.
The dataset is taken from the Kaggle website and has a total of 2,84,807
transactions, out of which 492 are fraudulent. Since the dataset is highly
imbalanced, it needs to be handled before model building.

Understanding and Defining Fraud


Credit card fraud is any dishonest act or behaviour intended to obtain
information, without proper authorization from the account holder, for
financial gain. Among the different types of fraud, skimming is the most
common: duplicating the information located on the magnetic strip of the
card. Apart from this, the other methods are:

• Manipulation/alteration of genuine cards
• Creation of counterfeit cards
• Stolen/lost credit cards
• Fraudulent telemarketing

Data Dictionary
The dataset can be downloaded using this link.
The data set includes credit card transactions made by European cardholders
over a period of two days in September 2013. Out of a total of 2,84,807
transactions, 492 were fraudulent. This data set is highly unbalanced, with
the positive class (frauds) accounting for 0.172% of the total transactions.
The data set has also been modified with Principal Component Analysis
(PCA) to maintain confidentiality. Apart from ‘time’ and ‘amount’, all the
other features (V1, V2, V3, up to V28) are the principal components
obtained using PCA. The feature 'time' contains the seconds elapsed
between the first transaction in the data set and the subsequent transactions.
The feature 'amount' is the transaction amount. The feature 'class' represents
class labelling, and it takes the value 1 in cases of fraud and 0 in others.
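
To make the data dictionary concrete, here is a minimal sketch of loading
the data set and verifying the class imbalance described above; the file
name creditcard.csv and the use of pandas are assumptions, not specified
in the report.

```python
# Load the Kaggle data set and confirm the 0.172% fraud rate
import pandas as pd

df = pd.read_csv("creditcard.csv")
print(df.shape)                    # (284807, 31): 2,84,807 transactions
print(df["Class"].value_counts())  # 0: legitimate, 1: fraud (492 cases)
print(df["Class"].mean() * 100)    # ~0.172 (% of fraudulent transactions)
```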

4.2 Solution Approach

1. Data understanding and exploration
2. Data cleaning
   • Handling missing values
   • Outlier treatment
3. Exploratory data analysis
   • Univariate analysis
   • Bivariate analysis
4. Preparing the data for modelling
   • Check the skewness of the data and mitigate it for fair analysis
   • Handle the data imbalance, since only 0.172% of the records are fraudulent transactions
5. Splitting the data into train and test sets
   • Scale the data (normalization)
6. Model building
   • Train the model with various algorithms such as logistic regression, SVM, decision tree, random forest, XGBoost, etc.; a minimal sketch follows this list
   • Tune the hyperparameters with grid search cross-validation and find their optimal values
7. Model evaluation
   • Since the data is heavily imbalanced, accuracy may not be the correct measure for this particular case
   • We have to look for a balance between precision and recall rather than accuracy, and aim for a good ROC score with a high TPR and low FPR in order to reduce the number of misclassifications
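
To tie the steps above together, the following is a minimal end-to-end
sketch under stated assumptions: scikit-learn, a class-weighted logistic
regression standing in for the various models listed, and the assumed file
name creditcard.csv. It illustrates the approach; it is not the final tuned
model.

```python
# End-to-end sketch: stratified split, scaling, class-weighted model,
# and imbalance-aware evaluation metrics
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns="Class"), df["Class"]

# Stratified split keeps the 0.172% fraud rate in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features (step 5: normalization)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# class_weight="balanced" is one simple way to handle the imbalance
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

# Precision, recall, and ROC-AUC instead of plain accuracy
print(classification_report(y_test, clf.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```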

