
Report 1 AI17C DBM302m KhaiHoan BaoChau VanThu

The report outlines a project focused on analyzing socio-economic factors that influence individual income, specifically targeting the prediction of earning over $50,000 based on demographics. The dataset used is the Adult Census Income from Kaggle, containing 32,561 rows and 15 columns, which includes various features such as age, education, occupation, and gender. The proposed methodology involves applying machine learning algorithms, conducting exploratory data analysis, and validating results to verify the hypothesis that education, age, and occupation significantly predict income levels.


REPORT

Class: AI17C
Subject: DBM302m
Instructor: Nguyen Van Vinh – VinhNV27
Group: 1
Members:
Ha Khai Hoan - QE170157
Dang Phuc Bao Chau - QE170060
Nguyen Van Thu – QE170147

1. Which problem are you trying to solve, and why?


I am interested in understanding which socio-economic factors most influence an
individual's income. Specifically, I would like to explore the relationship between
factors such as age, education, occupation, and gender in predicting whether an
individual will earn more than $50,000 per year. Additionally, I would like to
explore whether there are significant income differences between genders and
races.

This would be helpful in areas such as:


- Advertising and marketing: Companies can target high or low-income groups
to offer suitable products or services.
- Credit analysis: Financial institutions and banks can use this information to
assess repayment ability, plan loans, and set credit limits.
- Customer segmentation: Businesses can divide customers by income to
develop tailored business strategies, thereby increasing sales efficiency.
- Public policy: Government agencies can use this data to shape social policies,
such as welfare support for low-income households.
- Insurance: Insurance companies can assess risks or design insurance packages
tailored to different income groups.
- Real estate: Real estate brokers can use this information to predict housing
demand among high or low-income earners.
- Education: Educational institutions can offer scholarships or support programs
based on income levels.

2. Where and how do you obtain the data? How big is your data?
We obtained the Adult Census Income dataset from Kaggle, a popular dataset
often used to build machine learning models that predict individual income from
demographic factors.

+ Number of rows of data: 32561


+ Number of columns of data: 15
Note:
No.  Feature         Description
1    Age             Describes the age of individuals. Continuous.
2    Workclass       Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3    fnlwgt          Continuous. A weighting factor created by the US Census Bureau indicating the number of people represented by each data entry.
4    Education       Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, Preschool.
5    Education-num   Number of years spent in education. Continuous.
6    Marital-status  Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7    Occupation      Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, etc.
8    Relationship    Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9    Race            White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10   Sex             Female, Male.
11   Capital-gain    Represents the profit from the sale of assets (e.g., stocks or real estate). Continuous.
12   Capital-loss    Represents the loss from the sale of assets (e.g., stocks or real estate). Continuous.
13   Hours-per-week  Continuous.
14   Native-country  List of countries including United-States, Cambodia, England, Puerto-Rico, Canada, Germany, etc.
15   Salary          >50K, <=50K.
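As a quick sanity check on the size figures above, the file can be loaded and its shape counted. The following is a minimal sketch using only the Python standard library. The helper name load_rows, the dotted column names, and the three sample rows are assumptions made for illustration; they are not taken from the report.

```python
import csv
import io


def load_rows(source):
    """Read CSV text from an open file-like object into a list of dicts."""
    reader = csv.DictReader(source, skipinitialspace=True)
    rows = list(reader)
    return rows, reader.fieldnames


# Tiny in-memory sample standing in for the real file.
# Column names and values are illustrative assumptions.
SAMPLE = """age,workclass,education,education.num,sex,hours.per.week,income
31,Private,Bachelors,13,Female,40,<=50K
46,Self-emp-inc,Masters,14,Male,50,>50K
24,Private,HS-grad,9,Male,35,<=50K
"""

rows, columns = load_rows(io.StringIO(SAMPLE))
print(len(rows), "rows x", len(columns), "columns")
```

With the real download, the same function could be called on an opened file, for example `with open("adult.csv", newline="") as f:` (the file name is an assumption). If the report's figures hold, this should give 32561 rows and 15 columns.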

3. What are your ideas to solve the problem?


My approach is to apply various machine learning classification algorithms such
as:
+ Logistic Regression for its simplicity and interpretability.
+ Random Forest for handling non-linear relationships and providing feature
importance scores.
+ XGBoost and Support Vector Machine (SVM) for strong predictive performance;
XGBoost in particular scales well to large datasets.
+ K-Nearest Neighbors (KNN), a supervised learning algorithm that classifies a
sample by the labels of its nearest neighbors.
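Of the algorithms listed above, KNN is simple enough to sketch in a few lines. The example below is a toy illustration in plain Python, not the project's implementation. The function name knn_predict and the (age, education-num) points are made up for this sketch. A real run would more likely use a library implementation and scale the features first.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy points as (age, education-num); values are invented for illustration.
train_X = [(25, 9), (30, 10), (28, 9), (45, 14), (50, 16), (48, 13)]
train_y = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

print(knn_predict(train_X, train_y, (47, 15)))
print(knn_predict(train_X, train_y, (27, 10)))
```

Because the vote depends on raw distances, a feature with a large numeric range (such as fnlwgt) would dominate unless the features are rescaled beforehand.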
The pipeline will include:
+ Data preprocessing:
. Handle missing values
. Handle duplicate rows
. Handle outliers
+ Feature engineering:
Separate categorical and numerical features for easy management.
. Categorical features
Example: [“Income”]
. Numerical features
Example: [“Education-num”]
+ Build model
+ Model tuning to optimize performance.
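The three preprocessing steps in the pipeline above can be sketched as follows, using only the standard library. This is one possible approach under stated assumptions: missing values are taken to be marked with "?", outliers are filtered with the common 1.5 x IQR rule (the report does not say which rule is used), and all function names and sample rows are hypothetical.

```python
import statistics


def drop_missing(rows, marker="?"):
    """Drop rows in which any field equals the missing-value marker (assumed '?')."""
    return [r for r in rows if marker not in r.values()]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def iqr_bounds(values, factor=1.5):
    """Return (low, high) fences placed `factor` * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def drop_outliers(rows, column):
    """Drop rows whose value in `column` falls outside the IQR fences."""
    low, high = iqr_bounds([r[column] for r in rows])
    return [r for r in rows if low <= r[column] <= high]


# Invented sample rows: one duplicate, one missing workclass, one extreme value.
rows = [
    {"age": 30, "workclass": "Private", "hours": 40},
    {"age": 30, "workclass": "Private", "hours": 40},
    {"age": 41, "workclass": "?", "hours": 38},
    {"age": 35, "workclass": "Private", "hours": 42},
    {"age": 52, "workclass": "State-gov", "hours": 45},
    {"age": 28, "workclass": "Private", "hours": 39},
    {"age": 47, "workclass": "Local-gov", "hours": 41},
    {"age": 39, "workclass": "Private", "hours": 99},
]

cleaned = drop_outliers(drop_duplicates(drop_missing(rows)), "hours")
print(len(rows), "->", len(cleaned), "rows")
```

Dropping rows is the simplest choice for each step. Imputing missing values or capping outliers instead of removing them would be equally valid alternatives, depending on how much data each step would discard.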
In addition, I visualized the data to better understand the interactions
between features and to identify which groups of factors are important in
predicting whether a person is high-income.
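One simple view of how a single feature relates to income is the share of high earners within each category. The sketch below computes that share and prints a rough text bar chart. The function name high_income_share, the column names, and the sample rows are assumptions made for illustration and do not come from the dataset.

```python
from collections import defaultdict


def high_income_share(rows, feature, target="income", positive=">50K"):
    """Return {feature value: fraction of rows whose target equals `positive`}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for r in rows:
        totals[r[feature]] += 1
        if r[target] == positive:
            positives[r[feature]] += 1
    return {value: positives[value] / totals[value] for value in totals}


# Invented rows, used only to show the shape of the output.
rows = [
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "HS-grad", "income": ">50K"},
    {"education": "Bachelors", "income": "<=50K"},
    {"education": "Bachelors", "income": ">50K"},
    {"education": "Bachelors", "income": ">50K"},
    {"education": "Bachelors", "income": "<=50K"},
    {"education": "Masters", "income": ">50K"},
    {"education": "Masters", "income": ">50K"},
]

shares = high_income_share(rows, "education")
# Print categories from highest to lowest share with a 20-character bar.
for value, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:<12} {'#' * round(share * 20):<20} {share:.0%}")
```

The same function can be pointed at other categorical columns such as occupation, sex, or race to compare groups side by side.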
4. What is your hypothesis for the ideas to work? A more interesting
question is how do you verify your hypothesis?
Hypothesis: Certain features such as education, age, and occupation will
have the strongest predictive power for determining income. I hypothesize that
more educated individuals or those in higher-tier occupations are likely to earn
more than 50K USD.
To verify this, I will:
+ Conduct exploratory data analysis (EDA) to check feature distributions.
+ Use feature importance analysis from Random Forest and XGBoost.
+ Compare model performance through accuracy, precision, recall, and F1-
score on a test dataset.
+ Validate the models with cross-validation to ensure generalizability.
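The evaluation metrics and the cross-validation split named above can be written out directly, which makes their definitions explicit. This is a minimal sketch in plain Python. The function names and the toy labels are hypothetical, and the folds here are contiguous and unshuffled, whereas a real evaluation would usually shuffle or stratify them.

```python
def classification_metrics(y_true, y_pred, positive=">50K"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy labels, invented for illustration.
y_true = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
y_pred = [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]

metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})

folds = list(k_fold_indices(10, k=5))
print(len(folds), "folds; first test fold:", folds[0][1])
```

Reporting precision and recall alongside accuracy matters when one class is much more common than the other, because a model that always predicts the majority class can still reach a high accuracy.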

5. What does the result look like? Does it confirm your hypothesis?
# Pending
6. What have you done to make your original ideas better?
# Pending
7. What is the running time of your algorithm? Is your algorithm scalable?
# Pending
8. If you are given more time, what can be done to even improve it further?
# Pending
9. What have you learned from the project?
# Pending
