Regression Log
Regression is a statistical method used to model the relationship between one or more independent
variables and a dependent variable. Its primary goal is to understand and predict the value of the
dependent variable based on the values of the independent variables.
Logistic Regression:
Logistic regression is a type of regression used for predicting the probability of a binary outcome
based on one or more independent variables. Unlike linear regression, which predicts continuous
values, logistic regression predicts the probability of a particular outcome occurring, typically
encoded as 0 or 1.
Logistic regression models the relationship between the independent variables and the log-odds of
the dependent variable. The output is transformed using the logistic function, which maps the log-
odds to a probability value between 0 and 1.
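The logistic (sigmoid) function described above can be sketched in a few lines; the function name and values below are illustrative, not taken from the dataset:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds of 0 correspond to a probability of exactly 0.5;
# large negative log-odds approach 0, large positive approach 1.
p_mid = sigmoid(0.0)
p_low = sigmoid(-5.0)
p_high = sigmoid(5.0)
```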
Logistic regression [1] (from Mastering Machine Learning with Python in Six Steps by Manohar
Swamynathan) is better explained in terms of the odds ratio. The odds of an event occurring are
defined as the probability of the event occurring divided by the probability of the event not
occurring. For example, logistic regression could be used to predict whether a customer is likely to
purchase a product based on factors such as age, income, and past purchasing behaviour. The
output would be the probability of the customer making a purchase, which can then be used to
make informed business decisions.
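The odds definition above can be checked with a tiny worked example; the probability value here is hypothetical:

```python
# Odds of an event = P(event) / P(not event)
p_purchase = 0.8  # hypothetical probability that a customer buys
odds = p_purchase / (1 - p_purchase)  # 0.8 / 0.2 = 4, i.e. 4-to-1 in favour
```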
Let’s break down the steps involved in using the logistic regression model on our credit card fraud
detection dataset. We have also included the confusion matrix at the end. It is used to evaluate the
performance of classification models, in our case logistic regression, and its purpose is to provide a
detailed breakdown of the model's predictions compared to the actual outcomes across the
different classes.
Fig 1.0 Credit card fraud detection dataset
Data Preparation:
Independent variables (features) are extracted from the DataFrame into a variable named 'X',
excluding the last column, which represents the dependent variable ('Class'). The dependent
variable is extracted into a variable named 'y'.
The dataset is split into training and testing sets using the train_test_split function from scikit-learn.
The test size is set to 20% of the data, and a random state is specified for reproducibility.
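The data-preparation step above can be sketched as follows. Since the actual fraud dataset is not reproduced here, a small synthetic DataFrame with a trailing 'Class' column stands in for it:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the credit card dataset: last column is 'Class'
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["V1", "V2", "Amount"])
df["Class"] = rng.integers(0, 2, size=100)

X = df.iloc[:, :-1]  # all columns except the last ('Class')
y = df.iloc[:, -1]   # the dependent variable

# 80/20 split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```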
Feature Scaling:
Standardization is applied to the features using StandardScaler from scikit-learn to ensure that all
features have the same scale. This step is crucial for many machine learning algorithms, including
logistic regression.
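A minimal sketch of the scaling step, using synthetic arrays in place of the real features. Note that the scaler is fitted on the training data only, and the same statistics are reused for the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(50, 2))  # stand-in features
X_test = rng.normal(loc=5.0, scale=3.0, size=(10, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics
```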
Model Training:
A logistic regression model is instantiated and trained on the scaled training data using
LogisticRegression from scikit-learn.
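The training step looks like the following sketch; the scaled features and labels here are synthetic stand-ins for the fraud data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical scaled training data standing in for the fraud features
rng = np.random.default_rng(1)
X_train_scaled = rng.normal(size=(200, 3))
y_train = (X_train_scaled[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
```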
Model Evaluation:
The R² score is calculated to evaluate the performance of the logistic regression model on both the
training and testing sets.
R² score measures the proportion of the variance in the dependent variable that is predictable from
the independent variables. A score closer to 1 indicates a better fit.
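One plausible way to mirror this evaluation step is shown below, using scikit-learn's r2_score on the model's hard predictions; the exact inputs the original code passed to the score are an assumption, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # class determined by the first feature

model = LogisticRegression().fit(X, y)
# R² between the actual classes and the model's predictions on the training set
train_r2 = r2_score(y, model.predict(X))
```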
Predictions:
Probabilities of the positive class (fraud) are predicted for both the training and testing sets using the
trained logistic regression model.
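The prediction step uses predict_proba, whose second column holds the probability of the positive class. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)
# Column 1 of predict_proba is P(class == 1), i.e. the positive (fraud) class
proba_train = model.predict_proba(X_train)[:, 1]
```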
Visualization:
The predicted probabilities against the actual classes are plotted for both the training and testing
sets using Matplotlib. This helps visualize the performance of the logistic regression model.
Fig 1.2 Visualization of predicted vs. actual values
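A plot along the lines of Fig 1.2 can be produced as below. The actual classes and predicted probabilities here are random placeholders, and the axis labels are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
y_test = rng.integers(0, 2, size=50)     # hypothetical actual classes
proba_test = rng.uniform(0, 1, size=50)  # hypothetical predicted P(fraud)

fig, ax = plt.subplots()
ax.scatter(range(len(y_test)), y_test, marker="o", label="Actual class")
ax.scatter(range(len(proba_test)), proba_test, marker="x",
           label="Predicted probability")
ax.set_xlabel("Sample index")
ax.set_ylabel("Class / probability")
ax.legend()
fig.savefig("predicted_vs_actual.png")
```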
Two confusion matrices are generated, one for the training data and another for the test data. These
matrices capture the counts of true positive, true negative, false positive, and false negative
predictions made by the logistic regression model.
Each confusion matrix is plotted using the ConfusionMatrixDisplay class. The size of the figure is set
to 6x6 inches for better visualization. The colormap 'Blues' is used to represent different shades of
blue for different values in the confusion matrix.
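The confusion-matrix plotting described above can be sketched like this, with a small hand-made set of labels and predictions standing in for the model's output:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical labels and predictions standing in for the test-set output
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)    # rows: actual, columns: predicted
fig, ax = plt.subplots(figsize=(6, 6))   # 6x6 inches, as in the write-up
ConfusionMatrixDisplay(confusion_matrix=cm).plot(ax=ax, cmap="Blues")
fig.savefig("confusion_matrix_test.png")
```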
After plotting each confusion matrix, the final result would be as shown in Fig 1.3 (a) and (b).
Fig 1.3 (a) Confusion matrix of training data (b) Confusion matrix of test data