
A Bayesian Model for Income Bracket Classification
Department of Chemical Engineering
IIT Madras
Chennai, India

Abstract—This paper explores the prediction of individuals' income levels based on the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, using a Naive Bayes Classifier. The study focuses on determining whether a person's income exceeds $50,000, utilizing demographic and socio-economic attributes such as education level, marital status, capital gains and losses, and more. The census data is cleaned and processed. A Naive Bayes Classifier is used for the predictive model and is evaluated using metrics such as accuracy and precision by cross-validation. The classifier is effective in income prediction, and we emphasize its potential applications in decision-making processes in fields like social policy planning and targeted marketing. Overall, this research demonstrates the feasibility and significance of machine learning techniques in income classification.

Index Terms—naive Bayes, bootstrapping, 1994 census, Kohavi and Becker, cross-validation

I. INTRODUCTION

Income prediction is an important part of social policy planning and business marketing strategies. Accurately predicting an individual's income level enables more effective resource allocation, targeted assistance, and improved decision-making. Bayesian models offer a promising avenue for income classification, and in this study, we delve into the development and evaluation of a Naive Bayes Classifier for predicting income levels based on demographic and socio-economic features.

The data is taken from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, containing information such as education level, marital status, and capital gains and losses. It offers a comprehensive view of the factors that may influence an individual's income. Using this dataset, our study aims to construct a robust predictive model capable of categorizing individuals into two income groups: those earning more than $50,000 and those earning less.

The choice of a Naive Bayes Classifier is motivated by its simplicity, efficiency, and ability to handle categorical and continuous data. By exploiting conditional independence among attributes, the Naive Bayes Classifier provides an intuitive framework for modeling complex relationships in the data.

We first preprocess the data, imputing missing values and encoding categorical features. Additionally, we employ feature selection techniques to identify the most influential variables, improving the model's interpretability and efficiency.

The primary objective of this study is to evaluate the effectiveness of the Naive Bayes Classifier in predicting income levels based on the provided dataset. To achieve this, we employ rigorous evaluation metrics, including accuracy, precision, recall, and F1-score, while applying the bootstrap technique to assess the model's generalization capabilities.

II. DATA AND CLEANING

A. The Datasets

One dataset (adult.xlsx) was provided to train the Naive Bayes model. This dataset contained around 32,000 training samples. The target label was a binary class, 'income-category', with a person's income either being above $50,000 or below it. The dataset contained a mixture of categorical and numerical variables. The features in the dataset are summarized in Table I.

TABLE I
TABLE OF THE FEATURES IN THE GIVEN DATASETS ALONG WITH THEIR DESCRIPTIONS. WE OBSERVE THAT MOST VARIABLES ARE CATEGORICAL, BUT THERE ARE SOME IMPORTANT NUMERICAL VARIABLES THAT COULD BE POWERFUL INDICATORS OF THE INCOME BRACKET.

Feature            Description           Type
age                Age                   Continuous
workclass          Work Class            Categorical (8)
fnlwgt             -                     Numerical
education          Lvl. of education     Categorical (16)
education-num      Years of education    Numerical
marital-status     Marital Status        Categorical (7)
occupation         Occupation            Categorical (14)
relationship       Relationship          Categorical (6)
race               Race                  Categorical (5)
sex                Gender                Categorical (2)
capital-gain       Capital Gain          Numerical
capital-loss       Capital Loss          Numerical
hours-per-week     Hours per week        Numerical
native-country     Native Country        Categorical (41)
income-category    Income Bracket        Categorical (2)

B. Data Cleaning

A pipeline is coded to take a dataset of the above format and a flag ('train' or 'test') and clean it. Persons with missing values in variables that cannot be imputed, such as 'income-category', are removed. We find that the placeholder for missing values is ' ?'. We do not drop any variables with missing data, instead choosing to impute them.
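A minimal sketch of this cleaning step is shown below. It assumes the pandas library, the file name adult.xlsx, and the column names of Table I; the function and variable names are illustrative rather than taken verbatim from our pipeline.

```python
import numpy as np
import pandas as pd

def clean_dataset(path: str, flag: str = "train") -> pd.DataFrame:
    """Load the census data and apply the cleaning rules described above."""
    df = pd.read_excel(path)

    # The dataset marks missing entries with the placeholder ' ?';
    # convert them to NaN so that they can be imputed later.
    df = df.replace({" ?": np.nan, "?": np.nan})

    # Rows whose target label is missing cannot be imputed and are dropped
    # (this only applies to the training split, where the label is present).
    if flag == "train":
        df = df.dropna(subset=["income-category"])

    return df

train_df = clean_dataset("adult.xlsx", flag="train")
```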

A Simple Imputer based on the most frequent value is used on the dataset to impute missing values. This largely preserves the variable distributions. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned. No confounding symbols are present in the train or test data; we only find missing values.

Fig. 2. The probability and cumulative distributions of the FNL Weight of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

There are multiple imputation techniques available. One can impute missing values with 0, with the mean or the median, from the k-NN of the data point, or by randomly sampling from the distribution of the variable. The expectation imputers distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI) and the KNN Imputer.

Fig. 3. The probability and cumulative distributions of the Years of Education of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Unfortunately, the RSI is a slow imputation technique: either a prior distribution must be assumed and its parameters estimated from the data, or a non-parametric method such as a Kernel Density Estimate (KDE) must be used.

However, given that we are dealing with multiple categorical variables, we choose the most frequent value for imputation, given the KNN Imputer's difficulty in handling categorical variables.
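This choice maps directly onto scikit-learn's SimpleImputer; the snippet below is a sketch, with train_df assumed to be the cleaned dataframe from the previous step.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace every missing entry with the most frequent value (the mode)
# of its column; this works for categorical and numerical features alike
# and largely preserves the marginal distributions.
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = pd.DataFrame(
    imputer.fit_transform(train_df),
    columns=train_df.columns,
)

# The imputer returns an object array, so restore the appropriate dtypes,
# as described above.
train_imputed = train_imputed.infer_objects()
```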

We can also observe this empirically. In Figs. 1-4, we present the Kernel Density Estimate (KDE) and Empirical Cumulative Distribution Function (ECDF) of the numerical variables in the train dataset after imputation, for both categories. Finally, all categorical variables are encoded as features (a brief sketch of this encoding follows the figure captions below). In Figs. 5-8, we present the count plots of some categorical variables in the train dataset after imputation.

Fig. 1. The probability and cumulative distributions of the Age of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Fig. 4. The probability and cumulative distributions of the Hours per Week of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.
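The categorical encoding mentioned above can be done with one-hot (dummy) indicators, for example; the sketch below assumes the column names of Table I and the imputed dataframe from the previous step, and is illustrative rather than our exact code.

```python
import pandas as pd

categorical_cols = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country",
]

# One-hot encode the categorical features: every class becomes a binary
# indicator column. The target column is separated out as the label.
X = pd.get_dummies(
    train_imputed.drop(columns=["income-category"]),
    columns=categorical_cols,
)
# The exact label string depends on the file; ">50K" is assumed here.
y = (train_imputed["income-category"].str.strip() == ">50K").astype(int)
```

Note that keeping every indicator column (rather than dropping one per feature) is what produces the perfectly correlated columns discussed in Section IV-A.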

Fig. 5. The count plot of the various classes of Work Class for the various persons is shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

III. METHODS

A. Naive Bayes Classifier


The Naive Bayes Classifier is a probabilistic model used for classification tasks. It is based on Bayes' theorem and the assumption of feature independence, making it particularly suitable for text classification and other domains where feature independence is a reasonable approximation.

Given the following variables:
• C: the class variable representing the income categories (> 50K or ≤ 50K), and
• X: a vector of feature variables, including education level, marital status, capital gains, and losses,
the Naive Bayes Classifier calculates the conditional probability of a class C given the feature vector X, denoted P(C|X). This is done by Bayes' theorem:

P(C|X) = \frac{P(C) \cdot P(X|C)}{P(X)}    (1)

Here:
• P(C) is the prior probability of class C.
• P(X|C) is the likelihood, representing the probability of observing feature vector X given class C.
• P(X) is the marginal likelihood, acting as a normalizing constant.

The "naive" assumption in the Naive Bayes Classifier is that the features in X are conditionally independent given the class variable C. This simplifies the likelihood calculation:

P(X|C) = P(X_1|C) \cdot P(X_2|C) \cdots P(X_n|C)    (2)

The class prediction for a given instance is made by selecting the class C that maximizes P(C|X). In binary classification, this involves comparing P(> 50K|X) and P(≤ 50K|X) and selecting the class with the higher probability.

Fig. 6. The count plot of the various classes of Race for the various persons is shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

In cases where the features are continuous and follow a Gaussian (normal) distribution, the Gaussian Naive Bayes Classifier is often employed. This variant assumes that the likelihood of each feature given the class follows a Gaussian distribution:

P(X_i|C) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (3)

where:
• µ is the mean of the feature X_i for class C, and
• σ² is the variance of the feature X_i for class C.

Fig. 7. The count plot of the various classes of Marital Status for the various persons is shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

The Gaussian Naive Bayes Classifier is suitable for continuous features and can handle multivariate Gaussian distributions efficiently. It is an extension of the basic Naive Bayes Classifier and is particularly effective when the data distribution aligns with the Gaussian assumption. In this paper, we use the Gaussian Naive Bayes Classifier to predict income categories based on the 1994 Census Bureau database. We assess its performance using various evaluation metrics to determine its suitability for the task at hand.
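As a concrete illustration, a Gaussian Naive Bayes model of this kind can be fit with scikit-learn as sketched below, using the X and y built in Section II; the seed value is illustrative, and only the 80/20 split described in Section IV-B is taken from the text.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hold out 20% of the data as a validation split with a fixed random seed,
# as described in Section IV-B (the seed value itself is arbitrary).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fitting estimates, for each class, a per-feature mean and variance:
# exactly the parameters of the Gaussian likelihood in Eq. (3).
model = GaussianNB()
model.fit(X_train, y_train)

# The predicted class is the argmax of the posterior P(C|X) from Eq. (1).
y_pred = model.predict(X_val)
val_posteriors = model.predict_proba(X_val)
```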
B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier; some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

Fig. 8. The count plot of the various classes of Occupation for the various persons is shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}    (4)

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.
2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}    (5)

Recall is essential when the cost of missing positive cases (false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}    (6)

Precision is valuable when minimizing false positive predictions is critical, as in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (7)

It is particularly useful when there is an uneven class distribution or when both precision and recall need to be considered simultaneously.

5) Receiver Operating Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values.

Fig. 9. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity; depending on the problem, we may be required to optimize for only one.

The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates a model that is better at distinguishing between positive and negative instances.
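All of the metrics in Eqs. (4)-(7), as well as the AUC-ROC, can be computed with scikit-learn; the brief sketch below assumes the y_val, y_pred, and val_posteriors arrays from the sketch in Section III-A.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

metrics = {
    "accuracy":  accuracy_score(y_val, y_pred),    # Eq. (4)
    "recall":    recall_score(y_val, y_pred),      # Eq. (5)
    "precision": precision_score(y_val, y_pred),   # Eq. (6)
    "f1":        f1_score(y_val, y_pred),          # Eq. (7)
    # The AUC is computed from the predicted probability of the positive class.
    "auc_roc":   roc_auc_score(y_val, val_posteriors[:, 1]),
}
print(metrics)
```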

C. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique in linear algebra. It breaks down a matrix into three separate matrices, capturing the inherent structure of the original matrix. SVD has a wide range of applications, and one of its practical uses is data compression. Given a matrix A of dimensions N × p, SVD decomposes A into three matrices:

A = U \Sigma V^T    (8)

where:
• U is an N × N orthogonal matrix (eigenvectors of A A^T),
• Σ is an N × p diagonal matrix with non-negative singular values, and
• V is a p × p orthogonal matrix (eigenvectors of A^T A).

The columns of U are called the left singular vectors, the diagonal entries of Σ are the singular values, and the columns of V are the right singular vectors.

SVD can be leveraged for data compression by approximating the original matrix A with a lower-rank approximation. This is particularly useful when dealing with large datasets or images. The lower-rank approximation retains the most important features of the data while reducing its dimensions.

Given the SVD of matrix A as A = U \Sigma V^T, the matrix A_k obtained by keeping only the first k singular values and their corresponding singular vectors is given by:

A_k = U(:, 1{:}k) \, \Sigma(1{:}k, 1{:}k) \, V(:, 1{:}k)^T    (9)

where U(:, 1:k) contains the first k columns of U, Σ(1:k, 1:k) is the upper-left k × k submatrix of Σ, and V(:, 1:k) contains the first k columns of V.

By using a lower-rank approximation, the original data can be represented more compactly, leading to data compression. The extent of compression depends on the choice of k. A smaller value of k reduces the storage requirements but may lead to a loss of information.

Sometimes the columns of the data matrix A may contain linear relationships between themselves. In this case, if n linear relationships exist, n singular values are 0. We can then drop up to n columns and perform a lossless compression of the data. This allows us to use fewer independent variables in our regression models, i.e., enforce parsimony.
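The rank check used in Section IV-A can be carried out with NumPy; the sketch below assumes X is the encoded feature matrix from Section II, and the tolerance for treating a singular value as zero is a modelling choice.

```python
import numpy as np

A = X.to_numpy(dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with the singular values in s.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values that are numerically zero indicate exact linear
# relationships among the columns of A.
tol = 1e-10
num_dependent = int(np.sum(s < tol))
print(f"{num_dependent} near-zero singular values")

# Rank-k approximation A_k as in Eq. (9), dropping the zero directions.
k = len(s) - num_dependent
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```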
IV. RESULTS

A. Existence of Linear Relationships among income factors

Exploratory analysis of the independent variables indicates the existence of linear relationships between them. This could allow us to losslessly reduce the number of independent variables used in our model. This is evident from Fig. 10, where the singular values of the independent-variables dataset are presented. Three linear relationships exist between the variables.

Fig. 10. Singular values of the Independent Variables are presented. The last three singular values are of order < 10^-13 and can be considered to be 0. This allows us to losslessly remove up to three variables from the dataset.


The correlation heatmap for the independent variables is shown in Fig. 11. We observe several variables that are perfectly correlated with each other. This is an artefact of our encoding method: when we encoded our categorical variables, at least one class will be highly correlated with all the other classes. For example, in our 'sex' feature, only the 'M' and 'F' classes are present. If a sample has 'sex' attribute 'M', then it cannot have 'F', making the two classes, which have now become features, perfectly negatively correlated.

Fig. 11. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of independent variables. The color gradient indicates the magnitude of the correlation between the variables.

To verify this, we plot the heatmap of only the numerical features in Fig. 12. We find no correlation between them, confirming our suspicion.

Fig. 12. The correlation heatmap between all numerical independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of numerical independent variables. The color gradient indicates the magnitude of the correlation between the variables.
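Both heatmaps can be produced with pandas and seaborn, as sketched below; X is the encoded feature dataframe from Section II, and the list of numerical columns follows Table I.

```python
import matplotlib.pyplot as plt
import seaborn as sns

numerical_cols = ["age", "fnlwgt", "education-num",
                  "capital-gain", "capital-loss", "hours-per-week"]

# Pairwise Pearson correlations: all encoded features on the left,
# only the original numerical features on the right.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(X.corr(), ax=axes[0], cmap="coolwarm", center=0)
sns.heatmap(X[numerical_cols].astype(float).corr(), ax=axes[1],
            cmap="coolwarm", center=0, annot=True)
axes[0].set_title("All encoded features")
axes[1].set_title("Numerical features only")
plt.tight_layout()
plt.show()
```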
B. Naive Bayes is a fast and accurate classifier

To train and evaluate our Naive Bayes model, we split our train data into train and validation splits. This is done using a fixed random seed for replicability, with 20% of the given data in the validation split.

The Naive Bayes model is first trained on the train split without any regularization. We then bootstrap the validation set (1,000 bootstrap samples) and compute the evaluation metrics presented in Section III-B. We provide the 95% CIs for our evaluation metrics in Table II. The probability distributions and ECDFs of our evaluation metrics are presented in Figs. 13-16.

TABLE II
EVALUATION METRICS OF THE NAIVE BAYES CLASSIFIER. WE FIND THAT ACCURACY AND PRECISION ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Metric      Value   95% CI
Accuracy    0.80    (0.79, 0.81)
Precision   0.68    (0.64, 0.72)
Recall      0.32    (0.29, 0.35)
F1 Score    0.43    (0.40, 0.46)

Fig. 13. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.
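A bootstrap of the validation split along these lines is sketched below; the 1,000 resamples follow the text, while the variable names and the percentile form of the interval are illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = np.asarray(y_val)
y_hat = np.asarray(y_pred)
rng = np.random.default_rng(0)

scores = {"Accuracy": [], "Precision": [], "Recall": [], "F1 Score": []}

# Resample the validation set with replacement 1,000 times and recompute
# every metric of Section III-B on each bootstrap sample.
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    yt, yp = y_true[idx], y_hat[idx]
    scores["Accuracy"].append(accuracy_score(yt, yp))
    scores["Precision"].append(precision_score(yt, yp, zero_division=0))
    scores["Recall"].append(recall_score(yt, yp, zero_division=0))
    scores["F1 Score"].append(f1_score(yt, yp, zero_division=0))

# Report the mean and the 2.5th/97.5th percentiles as the 95% CI (Table II).
for name, vals in scores.items():
    lo, hi = np.percentile(vals, [2.5, 97.5])
    print(f"{name}: {np.mean(vals):.2f}  95% CI = ({lo:.2f}, {hi:.2f})")
```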
Fig. 14. The left plot contains the histogram of the recall obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the recall obtained for each bootstrap sample from the validation split. We find that the variance of this metric is acceptable.

Fig. 15. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 16. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the variance of this metric is acceptable.

The ROC curve for the Naive Bayes Classifier is shown in Fig. 17. We find that it performs significantly better than a random classifier.

Fig. 17. The Receiver Operating Characteristic curve obtained for the Naive Bayes classifier. We find that we can achieve a good True Positive Rate with a small False Positive Rate, indicating that our classifier is robust to class imbalances. We also find that the classifier is significantly better than a random classifier.

V. DISCUSSION

Our analysis indicates that the Gaussian Naive Bayes Classifier provides good performance in predicting income levels based on the 1994 Census Bureau database. We observe that our classifier has high precision. This suggests that the classifier is particularly adept at minimizing false positives, which are instances where it predicts a higher income when that is not the case. High precision is crucial in scenarios such as targeted marketing, where false positives can result in inefficient resource allocation.

While our classifier demonstrates high precision, it is important to acknowledge that its recall falls in the medium range. This implies that, although the classifier captures a portion of the individuals with incomes above $50,000, it may miss a considerable number of such instances. In other words, there is a trade-off between precision and recall. The balance between these two metrics depends on the specific application context. In cases where identifying all high-income individuals is critical, further model refinement may be needed to enhance recall.

VI. CONCLUSIONS AND FUTURE WORK

The classifier exhibits high precision, indicating its ability to make accurate predictions when identifying individuals with incomes exceeding $50,000. This precision ensures that resources are efficiently allocated to those who genuinely qualify for certain programs or benefits.

While precision is high, we observed a trade-off with recall, which falls in the medium range. This means that while the classifier excels at minimizing false positives, it may miss some high-income individuals. The balance between precision and recall should be carefully considered based on the specific application's priorities.

There is room for improvement in terms of recall without significantly sacrificing precision. Future work should focus on refining the model to better capture high-income individuals. This could involve feature engineering, incorporating additional data sources, or exploring alternative machine learning algorithms.

Ensemble methods and interpretability techniques such as SHAP values can also be incorporated into the classifier model. Future work must additionally consider the socio-economic implications of using these models when deciding public policy and economic planning. Temporal data may also provide a more comprehensive picture.

