Internship Report
AT INTERNSHALA TRAININGS
Attended by
DIVYA DRISHTI
21105111017
BONAFIDE CERTIFICATE
CERTIFICATION
TABLE OF CONTENTS
1 Introduction
2 Agenda
3 Topics
3.1 Python History
3.2 Data Types
3.3 Machine Learning Introduction
3.4 Machine Learning Algorithms
3.4.1 KNN Algorithm
3.4.2 Linear Regression
3.4.3 Logistic Regression
3.4.4 Decision Tree
3.4.5 Clustering Modules
4 PROJECT
4.1 Project Explanation
4.2 Solution Approach
1. Introduction
In this internship, I learned the fundamentals of machine learning along with core
ML algorithms. The 56-day internship covered 32 topics and included mini
projects, with a final assessment at the end, on the 56th day.
2. AGENDA
3. Topics
3.1 Python History
Python is a widely used general-purpose, high-level programming language.
It was created by Guido van Rossum, first released in 1991, and is now
maintained by the Python Software Foundation. It was designed with an emphasis
on code readability, and its syntax allows programmers to express concepts in fewer
lines of code. The two most widely used versions are Python 2.x and 3.x;
each has a sizeable user base, and there has long been debate over which to
prefer.
The language is used for many purposes, such as application development,
scripting, code generation, and software testing. Due to its elegance and
simplicity, top technology organisations like Dropbox, Google, Quora, Mozilla,
Hewlett-Packard, Qualcomm, IBM, and Cisco have adopted Python.
3.2 Data Types in Python
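Python's core built-in data types can be illustrated with a short snippet; the values below are purely illustrative:

```python
x = 42                      # int
pi = 3.14                   # float
name = "Python"             # str
flag = True                 # bool
nums = [1, 2, 3]            # list (mutable sequence)
point = (4, 5)              # tuple (immutable sequence)
unique = {1, 2, 2, 3}       # set (duplicates removed automatically)
info = {"lang": "Python", "year": 1991}  # dict (key-value mapping)

print(type(x).__name__)     # int
print(unique)               # {1, 2, 3}
print(info["year"])         # 1991
```

Because Python is dynamically typed, no declarations are needed: the interpreter infers each type from the literal assigned.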
3.3 MACHINE LEARNING INTRODUCTION
Recommendation systems, powered by machine learning, suggest what movies or television shows to
watch next based on user preferences. Self-driving cars that rely on machine
learning to navigate may soon be available to consumers.
Classification of ML:
3.4 MACHINE LEARNING ALGORITHMS
3.4.1 KNN Algorithm:
The k-nearest neighbour algorithm is a pattern recognition model that can
be used for classification as well as regression. Often abbreviated as kNN,
the k in k-nearest neighbour is a positive integer, which is typically small.
In either classification or regression, the input will consist of the k closest
training examples within a space.
We will focus on k-NN classification. In this method, the output is class
membership. This will assign a new object to the class most common among
its k nearest neighbours. In the case of k = 1, the object is assigned to the
class of the single nearest neighbour.
When a new object is added to the space, in this case a green heart, we want
the machine learning algorithm to classify the heart into a certain class.
When we choose k = 3, the algorithm will find the three nearest neighbours
of the green heart in order to classify it to either the diamond class or the
star class.
In our diagram, the three nearest neighbours of the green heart are one
diamond and two stars. Therefore, the algorithm will classify the heart with
the star class.
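The voting procedure described above can be sketched in plain Python; the 2D points and the diamond/star labels below are illustrative, echoing the diagram:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs.
    """
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    # Take the labels of the k closest points and vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: diamonds clustered near (1, 1), stars near (6, 6).
train = [((1, 1), "diamond"), ((1, 2), "diamond"), ((2, 1), "diamond"),
         ((6, 6), "star"), ((6, 7), "star"), ((7, 6), "star")]

print(knn_classify(train, (6, 5), k=3))  # star
```

With k = 1 the query simply takes the class of its single nearest neighbour, as the text notes.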
3.4.2 LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised
learning. It performs a regression task. Regression models a target
prediction value based on independent variables. It is mostly used for
finding out the relationship between variables and for forecasting. Regression
models differ in the kind of relationship they assume between the dependent
and independent variables, and in the number of independent variables they
use.
Linear regression performs the task to predict a dependent variable value (y)
based on a given independent variable (x). So, this regression technique
finds out a linear relationship between x (input) and y(output). In the figure
above, X (input) is the work experience and Y (output) is the salary of a
person. The regression line is the best fit line for our model.
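The best-fit line can be computed in closed form by least squares; the experience/salary numbers below are made up for illustration:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of work experience (x) vs. salary in thousands (y).
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(experience, salary)
print(slope, intercept)       # 5.0 25.0
print(slope * 6 + intercept)  # predicted salary for 6 years: 55.0
```

The fitted slope and intercept define the regression line, which can then be used to predict y for unseen values of x.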
3.4.3 LOGISTIC REGRESSION
Logistic regression is used in statistical software to understand the relationship between the
dependent variable and one or more independent variables by estimating
probabilities using a logistic regression equation. This type of analysis can
help you predict the likelihood of an event happening or a choice being
made.
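The probability estimation works by passing a linear score through the logistic (sigmoid) function; the weight, bias, and hours-studied scenario below are hypothetical:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(x, weight, bias):
    """Probability of the positive class under a logistic regression model."""
    return sigmoid(weight * x + bias)

# Hypothetical coefficients: hours studied (x) vs. probability of passing.
w, b = 1.5, -4.0
print(predict_proba(1, w, b))  # low probability for 1 hour
print(predict_proba(5, w, b))  # high probability for 5 hours
```

The output is a probability, so a threshold (commonly 0.5) converts it into a yes/no classification of the event.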
3.4.4 DECISION TREES
For general use, decision trees are employed to visually represent decisions
and show or inform decision making. When working with machine learning
and data mining, decision trees are used as a predictive model. These models
map observations about data to conclusions about the data’s target value.
The goal of decision tree learning is to create a model that will predict the
value of a target based on input variables.
In the predictive model, the data’s attributes that are determined through
observation are represented by the branches, while the conclusions about the
data’s target value are represented in the leaves.
When “learning” a tree, the source data is divided into subsets based on an
attribute value test, which is repeated on each of the derived subsets
recursively. Once all examples in the subset at a node share the same target
value, the recursion is complete.
A true classification tree data set would have a lot more features than what
is outlined above, but relationships should be straightforward to determine.
When working with decision tree learning, several determinations need to
be made, including what features to choose, what conditions to use for
splitting, and understanding when the decision tree has reached a clear
ending.
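The "attribute value test" used for splitting can be sketched for a single numeric feature. The text does not name a splitting condition, so Gini impurity, a common choice, is assumed here, and the amounts and labels are illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # skip splits that leave one side empty
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical feature (e.g. a transaction amount) with binary labels.
amounts = [10, 12, 15, 200, 220, 250]
labels = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]
print(best_split(amounts, labels))  # (15, 0.0): a perfectly pure split at 15
```

Tree learning repeats this search recursively on each resulting subset, stopping when a node becomes pure or another stopping condition is met.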
4. PROJECT
Data Dictionary
The dataset can be downloaded using this link.
The data set includes credit card transactions made by European cardholders
over a period of two days in September 2013. Out of a total of 284,807
transactions, 492 were fraudulent. This data set is highly unbalanced, with
the positive class (frauds) accounting for 0.172% of the total transactions.
The data set has also been modified with Principal Component Analysis
(PCA) to maintain confidentiality. Apart from ‘time’ and ‘amount’, all the
other features (V1, V2, V3, up to V28) are the principal components
obtained using PCA. The feature 'time' contains the seconds elapsed
between the first transaction in the data set and the subsequent transactions.
The feature 'amount' is the transaction amount. The feature 'class' represents
class labelling, and it takes the value 1 in cases of fraud and 0 in others.
• Univariate analysis
• Bivariate analysis
4. Prepare the data for modelling
• Check the skewness of the data and mitigate it for fair analysis
• Handle the data imbalance, since only 0.172% of the records are fraud transactions
5. Split the data into train and test sets
• Scale the data (normalization)
6. Model building
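The split-and-scale steps above can be sketched in plain Python; the amounts are illustrative, and in practice scikit-learn's utilities would typically be used, with the class imbalance additionally handled by resampling:

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the rows and split them into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def min_max_scale(values):
    """Normalize values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical transaction amounts standing in for the real dataset.
amounts = [5.0, 20.0, 50.0, 120.0, 300.0, 999.0]
train, test = train_test_split(amounts, test_ratio=0.3)
print(len(train), len(test))  # 4 2
print(min_max_scale(amounts))  # smallest amount maps to 0.0, largest to 1.0
```

Splitting before scaling (and fitting the scaler on the training set only) avoids leaking information from the test set into the model.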