
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
NATIONAL COLLEGE OF ENGINEERING

A
MIDTERM PROGRESS REPORT
ON
"AUTOMATED RESUME SCREENING USING
NATURAL LANGUAGE PROCESSING"

SUBMITTED BY:
PRASAMSHA PANDAY (NCE078BCT026)
SABIN PYAKUREL (NCE078BCT036)
PRATIK PANDE (NCE078BECT028)
SUDIP GHIMIRE (NCE078BCT041)

SUBMITTED TO:
DEPARTMENT OF ELECTRONICS & COMPUTER
ENGINEERING

LALITPUR, NEPAL
JANUARY, 2025
Certificate
This is to certify that the work carried out by Mrs. Prasamsha Panday, Mr. Pratik
Pande, Mr. Sabin Pyakurel and Mr. Sudip Ghimire for the project entitled "Resume
Screening Using NLP" for the award of the degree of Bachelor of Computer
Engineering of the Institute of Engineering is based upon their authentic work. We
have the pleasure in forwarding their project. The project was carried out under
our supervision, and all the materials included as well as the software product are
the result of their year-long authentic effort.

(External Examiner)

Er. Suroj Burlakoti
Department of Electronics and Computer Engineering
National College of Engineering
Talchhikhel, Lalitpur
(Project Coordinator)

Er. Suroj Burlakoti
Department of Electronics and Computer Engineering
National College of Engineering
Talchhikhel, Lalitpur
(Head of Department / Project Supervisor)

Acknowledgments
We would like to express our sincere gratitude to all those who have contributed
to the completion of this project.
First and foremost, we would like to thank our project supervisor, Er. Suroj Burlakoti,
for his invaluable guidance, continuous support, and encouragement throughout
the course of this project.
We are also deeply grateful to the Department of Electronics and Computer Engi-
neering for providing the resources and infrastructure necessary to carry out this
work.
We would like to acknowledge the support of our respected teachers of our col-
lege for their sincere advice and constant guidance, supervision and continuous
encouragement throughout the study of the project.
Last but not least, we would like to express our heartfelt gratitude to all the stu-
dents and teachers who helped us in the project and who are directly and indirectly
involved in this project.
Thank you all for your contributions.

Abstract
Automated resume screening using Natural Language Processing (NLP) refers to
the use of AI-driven software to analyze job applicants’ resumes in an automated
fashion. In today’s competitive job market, hiring has become a challenging and
time-consuming process, especially when it comes to reviewing a large number
of resumes. Traditional methods of manual resume screening are not only inef-
ficient but can also introduce unintentional bias into the selection process. The
project, titled “Automated Resume Screening Using NLP,” aims to address these
challenges by creating a web-based application that streamlines the recruitment
process through automation. Using advanced Natural Language Processing (NLP)
techniques, this system will analyze resumes, extract essential details such as skills,
experience, qualifications, and job titles, and match them against specific job re-
quirements. By employing algorithms like cosine similarity, the application will
rank resumes based on how well they align with the job description, helping re-
cruiters identify top candidates efficiently. Additionally, the system is designed
to promote fairness by focusing solely on job-related information, ensuring con-
sistent and unbiased evaluations. The ultimate goal of this project is to improve
the hiring process by making it faster, more accurate, and equitable, benefiting
employers and job seekers alike. The project produces an accuracy of 96.55
percent, which is considerably higher than that of preceding projects.

Contents
Certificate i

Acknowledgements ii

Abstract iii

List of Figures vi

List of Abbreviations vii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Review 3
2.1 Related theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 System Analysis 10
3.1 Requirement specification . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Functional Requirements . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Non-Functional Requirements . . . . . . . . . . . . . . . . . 11
3.2 Feasibility study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . . 12
3.2.3 Legal Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.4 Time Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Methodology 13
4.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.2 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.4 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.5 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.6 Vector Generation Using Word Embeddings . . . . . . . . . . . . . 16
4.7 Model Training: Random Forest Classifier . . . . . . . . . . . . . . 17
4.8 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.9 Similarity Calculation . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.10 Frontend and Backend Development . . . . . . . . . . . . . . . . . . 19
4.11 Tools Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5 Results and Discussion 21


5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Analysis of the Output . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.3 Analysis of Classification Report . . . . . . . . . . . . . . . . . . . . 21
5.4 Analysis of Confusion Matrix . . . . . . . . . . . . . . . . . . . . . 23

6 Conclusion and Future Enhancements 26


6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Future Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . 26

References 27

A APPENDIX 29

List of Figures
4.1 Block Diagram of the System . . . . . . . . . . . . . . . . . . . . . 13
4.2 Activity Diagram of the System . . . . . . . . . . . . . . . . . . . . 14
4.3 Sequence Diagram of the System . . . . . . . . . . . . . . . . . . . 15
4.4 Training of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.1 Classification report Of the Model . . . . . . . . . . . . . . . . . . . 21


5.2 Confusion Matrix of the Model . . . . . . . . . . . . . . . . . . . . 23

A.1 Front page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


A.2 login page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.3 Job Seeker View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
A.4 Job Application Access Page . . . . . . . . . . . . . . . . . . . . . . 30
A.5 Job Provider View . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

List of Abbreviations
HR Human Resources
NER Named Entity Recognition
LSTM Long Short-Term Memory
VSM Vector Space Model
BERT Bidirectional Encoder Representations from Transformers
PDF Portable Document Format
NLP Natural Language Processing
CBOW Continuous Bag of Words

1. Introduction

1.1 Background
Finding and hiring qualified employees is a critical function within Human Re-
sources (HR), especially in large and ever-changing job markets. Every month,
millions of individuals enter the workforce, creating a high volume of applications
for each open position. Such high mass can make it difficult to efficiently identify
the best candidates.
One of the main challenges HR departments face is time and efficiency. Resumes come
in a variety of formats, making it time-consuming and prone to errors to manu-
ally screen and shortlist applicants. Effectively evaluating resumes also requires a
deep understanding of the specific skills and experience needed for the role, which
can be inconsistent within HR teams. This creates a situation where qualified
candidates might be overlooked due to inefficient screening processes, while HR
departments spend excessive time sifting through applications.

1.2 Problem statements


In large and dynamic job markets, HR departments face significant challenges in
efficiently and accurately identifying the best candidates from a vast pool of ap-
plicants. The manual screening and shortlisting of resumes are time-consuming
and error-prone due to the diversity in resume formats and the varying levels
of understanding within HR teams regarding the specific skills and experience re-
quired for different roles. This inefficiency leads to the risk of overlooking qualified
candidates and expending excessive time and resources on the screening process.
An effective solution is needed to streamline and improve the accuracy of resume
screening to ensure that the most qualified candidates are identified promptly.

1.3 Objectives
The general objective of the project is to design an employee selection system using
a classification algorithm. The specific objectives are:

• To design and develop a web application to screen resumes effectively and
efficiently for the IT sector.

• To automate the selection process of applicants.

1.4 Scope
This project aims to transform how recruitment works by automating the process
of screening and ranking resumes. It will make a big difference in areas like the IT
sector, where candidates apply for different job positions. By using machine learning
and natural language processing, it helps employers quickly find the best candi-
dates while saving time and effort. The system also promotes fairness by giving
all job seekers an equal chance and ensuring resumes match job descriptions more
accurately. Ultimately, it will improve communication between employers and
candidates, making the hiring process smoother and more effective. The scope of
this project is currently limited to processing resumes written in English and fol-
lowing standard resume formats. Additionally, the screening process mainly focuses
on matching keywords and phrases that are most relevant to the job descriptions
provided by employers. While this approach is effective for structured and clearly
defined resumes, future improvements could make the system more versatile by
supporting different formats and multiple languages, making it even more inclusive
and user-friendly.

2. Literature Review
The volume of job applications has increased exponentially with the advent of
online job portals, necessitating the use of automated systems to manage and pri-
oritize candidate profiles. Resume ranking programs, leveraging advancements in
artificial intelligence and machine learning, offer promising solutions to streamline
this process. By employing algorithms capable of analyzing and evaluating re-
sumes based on predefined criteria, these programs aim to enhance the efficiency
and accuracy of candidate selection processes. This literature review explores the
evolution, methodologies, challenges, and advancements in resume ranking sys-
tems, providing insights into their effectiveness and potential impact on modern
recruitment practices.
Recent research demonstrates the effectiveness of machine learning and other
such methodologies for ranking resumes through various innovative approaches.
A paper by Chirag Daryani [1] employed various Named Entity Recognition
(NER) approaches to assess similarity between categorized resume data and job
requirements. Techniques included Rule-Based algorithms, regular expressions,
and Bidirectional-LSTM with Conditional Random Field algorithms. The spaCy
module, pre-trained on resume samples, identified entities like names, phone num-
bers, and educational institutions. A content-based recommendation system uti-
lized vectorization, TF-IDF, and cosine similarity measures to rank resumes based
on their fit for job requirements. Vectorization transformed text into numerical
vectors essential for machine learning models. TF-IDF scores reflected term impor-
tance, while cosine similarity computed similarity between resume and job query
vectors. The system used the Vector Space Model (VSM) to represent resumes
and job descriptions, facilitating similarity calculations and candidate ranking.
Performance testing with Software Developer Engineer resumes validated candi-
date rankings based on cosine similarity scores.
In their paper [4], Tomas Mikolov and his team introduced two new models that
quickly create word vector representations from large datasets. These models
improve word similarity tasks, performing better than older neural network methods
in both accuracy and speed. Impressively, the techniques can generate high-quality
word vectors from a 1.6 billion-word dataset in less than a day. These word
representations set new standards for capturing both grammatical and meaning-based
similarities, marking a significant advancement in natural language processing.
The book "Speech and Language Processing" by Dan Jurafsky and James H. Martin
explores a variety of topics related to natural language processing. They discuss
text preprocessing methods such as tokenization and stemming, as well as vector
embeddings through n-gram models, and offer a thorough introduction to neural
networks. The book also dives into text classification, covering multi-class classi-
fiers like Naive Bayes, along with more advanced techniques like sequence labeling
and machine translation. It offers in-depth insights into both foundational and
contemporary methods utilized in NLP.
A paper [2] used text preprocessing to remove numbers and convert text to low-
ercase, followed by BERT-based extractive summarization for job vacancies using
bert-base-uncased. Summaries were limited to 10 sentences, determined by the
ELBOW method. Text representation involved converting resumes and vacancies
into numeric vectors. Cosine similarity computed between these vectors deter-
mined match scores, sorted for final ranking.
One article compared the performance of two multiclass classifiers (Random Forest
Classifier and Naive Bayes Classifier) and found that the Random Forest Classifier
performed better, with a lower error rate of 2% on a test dataset with non-linear
relationships among the features, than the Naive Bayes Classifier, which had an
error rate of 6.2%. However, this was not the case when using another test dataset
where the features were largely independent. In that case, the Naive Bayes
Classifier achieved a better error rate of 1% than the Random Forest Classifier.
Pradeep Kumar Roy, in their research [3], created a system to minimize the cost
of hiring new candidates for job positions in a company. They focused on 3 major
problems in this process:

• Picking the right candidates from the applicants

• Making sense of their CVs

• Finding out if the candidate is fit for the job role.

They performed various NLP techniques for text preprocessing, TF-IDF for
vectorization, and used machine learning to perform the classification using the
algorithms of Random Forest with 38.9 percent accuracy, Multinomial Naïve Bayes
with 44.39 percent, Logistic Regression with 62.4 percent, and the highest accuracy
was obtained by a Linear Support Vector Machine classifier with an accuracy of
78.53 percent.
In enhancing this line of work, we integrated Word2Vec embeddings [4] with a
linear Support Vector Machine (SVM) classifier, aiming to augment resume parsing
capabilities. Word2Vec embeddings were employed to capture semantic relation-
ships between words, enriching the system’s understanding of resume content.
The linear SVM classifier utilized these embeddings to classify resumes based on
extracted features such as skills, experiences, and project details. Additionally,
emphasis was placed on user interface (UI) design to provide an intuitive and ef-
ficient experience for recruiters and hiring managers. This integration not only
aimed to improve accuracy in resume parsing but also sought to enhance usabil-
ity through a well-designed interface, addressing both technical and user-centric
aspects of the project.

2.1 Related theory


2.1.1 Random Forest Classifier
A Random Forest Classifier is an ensemble learning method used for classification
tasks. It builds multiple decision trees during training and outputs the class that
is the majority vote of the individual trees. Random Forest works by creating
many decision trees based on random subsets of the data and features. Each tree
is trained independently, and their predictions are aggregated to improve accuracy
and prevent overfitting. It is robust, handles both numerical and categorical data,
and is less prone to overfitting compared to a single decision tree. It is widely
used for classification, regression, and feature selection tasks.
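
As a minimal illustration of the idea (not the project's own training code), a
Random Forest can be fit with scikit-learn in a few lines; the toy data below
stands in for real feature vectors:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for feature vectors and category labels.
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=42)

# 100 decision trees are trained on random subsets of data and features;
# their majority vote becomes the predicted class.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five samples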
2.1.2 Cosine Similarity
In data analysis, cosine similarity is a measure of similarity between two non-zero
vectors defined in an inner product space. Cosine similarity is the cosine of the
angle between the vectors; that is, it is the dot product of the vectors divided by
the product of their lengths. It follows that the cosine similarity does not depend
on the magnitudes of the vectors, but only on their angle. The cosine similarity
always belongs to the interval [-1,1].
For example, two proportional vectors have a cosine similarity of 1, two orthogonal
vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In
some contexts, the component values of the vectors cannot be negative, in which
case the cosine similarity is bounded in [0,1].
For example, in information retrieval and text mining, each word is assigned a
different coordinate and a document is represented by the vector of the numbers
of occurrences of each word in the document. Cosine similarity then gives a useful
measure of how similar two documents are likely to be, in terms of their subject
matter, and independently of the length of the documents.
The technique is also used to measure cohesion within clusters in the field of data
mining.
One advantage of cosine similarity is its low complexity, especially for sparse vec-
tors: only the non-zero coordinates need to be considered.
Other names for cosine similarity include Orchini similarity and Tucker coefficient
of congruence; the Otsuka–Ochiai similarity is cosine similarity applied to binary
data.
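
In symbols, the measure is cos θ = (A · B) / (‖A‖ ‖B‖). The following NumPy
sketch reproduces the proportional and orthogonal cases described above:

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = np.array([2.0, 1.0, 0.0])  # e.g. word counts of document 1
doc2 = np.array([4.0, 2.0, 0.0])  # proportional to doc1 -> similarity 1.0
doc3 = np.array([0.0, 0.0, 3.0])  # orthogonal to doc1  -> similarity 0.0

print(cosine_similarity(doc1, doc2))  # 1.0
print(cosine_similarity(doc1, doc3))  # 0.0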
2.1.3 Machine Learning
Machine learning is a field of artificial intelligence (AI) that allows computers to
learn without being explicitly programmed. Machine learning algorithms use data
to learn how to perform tasks such as classification, prediction, and clustering.
ML algorithms are mathematical models that use different datasets in the form
of text, audio, images and videos, in order to help the machine to learn, improving
its performance in each iteration.
ML algorithms can be used to perform a variety of tasks like:
1. Classification: This is the task of assigning a label to an input. For example, a
machine learning algorithm could be used to classify images as either cats or dogs.
2. Prediction: This is the task of predicting a future value based on past data.
For example, a machine learning algorithm could be used to predict the weather
or the stock market.
3. Clustering: This is the task of grouping similar data together. For example, a
machine learning algorithm could be used to group customers together based on
their buying behavior.
Some of the exciting applications of ML technology are fraud detection, spam fil-
tering, medical diagnosis, self-driving cars, recommendation systems, etc.
2.1.4 Word2Vec
Word2Vec is a widely used technique in natural language processing (NLP) for
learning vector representations of words. Developed by Tomas Mikolov and his
team at Google in 2013, Word2Vec aims to convert words into dense, continuous
vector spaces, enabling the model to capture intricate semantic relationships and
contextual meanings of words.
The core idea behind Word2Vec is to map each word to a vector in a high-
dimensional space where semantically similar words are positioned close to each
other. This mapping allows the model to capture meaningful relationships be-
tween words based on their usage in large text corpora. For example, words with
similar meanings or functions, such as ”king” and ”queen,” will have vector rep-
resentations that are close in this space.
Word2Vec employs two primary model architectures to generate these word vec-
tors: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
The CBOW model predicts a target word based on its surrounding context words.
For instance, given the context words ”the,” ”cat,” and ”on,” CBOW might pre-
dict the target word ”mat.” On the other hand, the Skip-Gram model works in
reverse; it uses a target word to predict the surrounding context words. For ex-
ample, given the target word ”cat,” Skip-Gram would attempt to predict context
words like ”the,” ”on,” and ”mat.”
During training, Word2Vec uses a neural network to adjust the word vectors such
that the probability of predicting the correct context words (or target words) is
maximized. This process ensures that the learned vectors reflect meaningful rela-
tionships and similarities between words, making them useful for a variety of NLP
tasks, such as text classification, sentiment analysis, and machine translation. By
representing words in a continuous vector space, Word2Vec provides a powerful
tool for understanding and processing natural language.
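
As a rough sketch of how such vectors are trained in practice, the gensim library
(an assumption; the report does not name its tooling) implements both
architectures, with sg=0 selecting CBOW and sg=1 selecting Skip-Gram:

from gensim.models import Word2Vec

# A tiny stand-in corpus; real training uses very large text collections.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "king", "and", "a", "queen"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 would select Skip-Gram (predict the context from a word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"].shape)         # (50,) dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest words in the vector space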
2.1.5 Natural Language Processing (NLP)

• Stemming:
Stemming is a process that reduces words to their root form by stripping
suffixes. This technique helps in standardizing words and reducing dimen-
sionality in text analysis. For example, the words ”running,” ”runner,” and
”runs” might all be reduced to ”run.” Stemming algorithms, such as the
Porter Stemmer and Snowball Stemmer, apply heuristic rules to remove
common prefixes and suffixes, though they do not always produce actual
words. For instance, ”fishing” might be stemmed to ”fish,” but ”fished”
could be stemmed to ”fish” as well.
Example: ”running” → ”run”
”happily” → ”happy”

• Lemmatization:
Lemmatization is a more sophisticated approach than stemming, focusing
on reducing words to their base or dictionary form called a lemma. Unlike
stemming, lemmatization considers the context and the part of speech to
ensure the resulting lemma is a valid word. For example, ”better” is lem-
matized to ”good,” and ”running” is lemmatized to ”run.” Lemmatization
often uses lexical databases like WordNet for accuracy.
Example: ”running” → ”run”
”better” → ”good”

• Tokenization:
Tokenization involves breaking down text into smaller units, such as words,
phrases, or sentences. This process is essential for many NLP tasks as it
simplifies the text into manageable pieces. Tokenization can be word-level
(breaking text into words) or sentence-level (breaking text into sentences).
Example: ”Hello world!” → [”Hello”, ”world!”]

• Stop Word Removal:
Stop word removal involves filtering out common words that are deemed to
have little significance for text analysis, such as "the," "is," "in," etc. Removing
stop words helps in focusing on the more meaningful words in the text.
Example: "The quick brown fox" → ["quick", "brown", "fox"]
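
A minimal sketch of these steps using the NLTK library (an assumption; the
report does not name its preprocessing toolkit):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK data packages.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Tokenization: split raw text into word-level tokens.
tokens = word_tokenize("The quick brown fox jumps")

# Stop word removal: drop common, low-information words.
content = [t for t in tokens if t.lower() not in stopwords.words("english")]
print(content)  # ['quick', 'brown', 'fox', 'jumps']

# Stemming: heuristic suffix stripping; may produce non-words.
print(PorterStemmer().stem("happily"))  # 'happili'

# Lemmatization: dictionary-based reduction to a valid lemma.
print(WordNetLemmatizer().lemmatize("better", pos="a"))  # 'good'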

3. System Analysis

3.1 Requirement specification

3.1.1 Functional Requirements


1. Resume Ingestion:
- The system must allow users to upload resumes in various formats (PDF, DOCX).
- The system should extract relevant textual information from the uploaded re-
sumes.
2. Job Description Ingestion:
- The system must allow users to input or upload job descriptions.
- The system should extract relevant textual information from the job descrip-
tions.
3. Feature Extraction:
- The system must convert resumes and job descriptions into numerical vectors
using word embeddings.
4. Similarity Calculation:
- The system must calculate the cosine similarity between the job description and
each resume.
5. k-NN Ranking:
- The system must identify and rank the k most similar resumes to the job de-
scription based on cosine similarity scores.
6. Classification with Random Forest Classifier:
- The system must classify the ranked resumes into categories such as ”highly
relevant,” ”moderately relevant,” and ”not relevant.”
7. User Interface:
- The system must provide a user-friendly interface for users to upload resumes
and job descriptions. - The system must display the ranked and classified resumes
with relevant details.

8. Report Generation:
- The system should generate reports summarizing the ranking and classification
results.

3.1.2 Non-Functional Requirements


1. Performance:
- The system should process and rank resumes within a reasonable time frame
(e.g., within a few minutes for a batch of 100 resumes).
2. Scalability:
- The system should be able to handle large volumes of resumes and job descrip-
tions without significant degradation in performance.
3. Accuracy:
- The system should maintain high accuracy in both ranking and classification,
with precision and recall metrics above a specified threshold.
4. Usability:
- The user interface should be intuitive and easy to use for individuals with basic
computer skills.
5. Security:
- The system should ensure that all uploaded resumes and job descriptions are
securely stored and processed, maintaining data privacy.
6. Compatibility:
- The system should be compatible with various browsers and devices.

3.2 Feasibility study

3.2.1 Technical Feasibility


Development expertise required in:

• Technology Stack: Appropriate technologies (e.g., programming languages,
frameworks, databases) that are well-suited for the project requirements.

• Skill Set: The development team has the necessary skills and expertise to
build and maintain the system.

• Scalability: The system architecture to handle future growth in the volume
of resumes and job descriptions.

• Performance: Optimization techniques to ensure the system performs
efficiently even with large datasets.

3.2.2 Economic Feasibility


• Open-source tools (Django) minimize licensing costs.

• Hardware and software costs for development and deployment (servers, stor-
age, etc.)

• Ongoing maintenance and potential upgrades.

3.2.3 Legal Feasibility


• User Consent: Explicit consent from users for processing and storing their
resumes.

• Data Security: Robust security measures to protect sensitive personal
information.

3.2.4 Time Feasibility


• Development time depends on project scope and team expertise.

• Utilize existing libraries for faster development.

Consider a phased approach:

• Initial prototype demonstrating core functionalities

• Iterative development and testing for accuracy and scalability

Given the availability of open-source tools, development expertise, and careful
consideration of economic, legal, and time constraints, this project is feasible.

4. Methodology

4.1 System Design

Figure 4.1: Block Diagram of the System

This flowchart represents a job-matching system. Job seekers upload resumes, and
job providers post job descriptions through a user interface. The data is stored
in respective databases and processed using text preprocessing and Word2Vec
embeddings. Cosine similarity ranks resumes based on relevance, while a Ran-
domForest model categorizes job postings. The system helps match candidates to
suitable jobs efficiently.

4.2 Activity Diagram

Figure 4.2: Activity Diagram of the System

This activity diagram represents a job processing system. It starts with user
interaction, where job providers post and manage jobs, while job seekers view jobs
and upload resumes. The system stores job descriptions and resumes in respective
databases. These are processed through text preprocessing, followed by ranking
resumes and matching job categories. The process ensures efficient job matching
between job providers and job seekers.

4.3 Sequence Diagram

Figure 4.3: Sequence Diagram of the System

This sequence diagram represents the job processing system’s workflow. Users log
in, select a job, or upload a resume through the frontend (Django). The backend
stores the data and sends it for processing. Machine learning (Jupyter) applies
text preprocessing, converts text using Word2Vec/FastText, and classifies resumes
using Random Forest Classifier. The processed data is stored, and job matches
with scores are sent back to the frontend. Finally, users see the matching job
results.

The methodology is divided into several key stages: data collection, data pre-
processing, feature extraction, model training, and evaluation.

4.4 Data Collection


The first step involved creating a custom dataset for the resume screening
system. During the creation of the dataset, resumes for different roles were
collected to form a diverse dataset. Data belonging to 8 different job categories,
namely Data Scientist, Cloud Engineer, DevOps Engineer, Software Developer,
Machine Learning Engineer, Data Analyst, FrontEnd Developer and Backend
Developer, has been tabulated to form the dataset.

4.5 Data Preprocessing


After creating the dataset, the next step was to preprocess the textual data to
make it suitable for analysis. This step included the following techniques:

• Tokenization: The text was tokenized into individual words or phrases to
break the content into discrete units that could be analyzed further.

• Stopword Removal: Common words such as ”the”, ”and”, ”is”, etc., that do
not contribute meaningful information, were removed.

• Lemmatization: Words were reduced to their base or root form (e.g., “run-
ning” to “run”) to ensure consistency and improve analysis.

These preprocessing steps helped clean the data and reduce noise, preparing the
resumes for feature extraction.

4.6 Vector Generation Using Word Embeddings


To convert the textual content of the resumes and the job description into
numerical features that could be used by the Random Forest classifier, word
embedding was applied. A pre-trained word embedding model, Word2Vec, was used
to represent each word in the resume as a vector. These embeddings were chosen
because they capture semantic meaning and relationships between words, helping
the model understand context and job-related terms in resumes.
This process converted each resume into a numerical vector that captured the
essential information and context from the text, which was then used as input for
the machine learning model.
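
The report does not spell out how the per-word vectors are combined into a
single resume vector. A common choice consistent with this description is to
average the Word2Vec vectors of the document's words; the sketch below assumes
that approach, and the tiny training corpus is a placeholder:

import numpy as np
from gensim.models import Word2Vec

def document_vector(tokens, model):
    # Average the Word2Vec vectors of all in-vocabulary tokens.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:  # no known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Minimal model for demonstration; the real one is trained on resumes.
model = Word2Vec([["python", "django", "machine", "learning"]],
                 vector_size=50, min_count=1)
resume_vec = document_vector(["python", "machine", "learning"], model)
print(resume_vec.shape)  # (50,) - one fixed-length vector per resume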

4.7 Model Training: Random Forest Classifier

Figure 4.4: Training of Model

The next step was to train a machine learning model using the processed and
vectorized data. A Random Forest Classifier was used as the classification model,
which predicted the category of each resume. During model training, the following
steps were carried out (a brief scikit-learn sketch follows the list):

• Data Splitting: The dataset was divided into training and validation sets
using an 80-20 split. The training set (80 percent) was used to train the
model, while the validation set (20 percent) was kept aside to evaluate the
model's performance on unseen data.

• Model Training: The Random Forest classifier was trained on the feature
vectors of the resumes, with each vector labeled according to its correspond-
ing job category.
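
A brief sketch of these two steps; X and y are placeholder stand-ins for the
pooled resume vectors and their job-category labels, not the project's data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 400 resume vectors (50-d), 8 job-category labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = rng.integers(0, 8, size=400)

# 80-20 split; stratify keeps every category present in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"validation accuracy: {clf.score(X_val, y_val):.3f}")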

4.8 Model Evaluation


After training the Random Forest classifier, the model's performance was
evaluated using the validation set. During the process, we measured the following
metrics:
1. Confusion Matrix
The confusion matrix is a table used in machine learning to evaluate the
performance of a classification model. It compares the predicted classes of the
model with the actual classes in the dataset. The matrix has rows and columns
representing the actual and predicted classes respectively, and contains four main
components:
i) TP (True Positive) is the number of instances correctly predicted as positive.
ii) FP (False Positive) is the number of instances incorrectly predicted as positive.
iii) TN (True Negative) is the number of instances correctly predicted as negative.
iv) FN (False Negative) is the number of instances incorrectly predicted as negative.
2. Accuracy
Accuracy is a performance metric used in classification tasks to measure the
overall correctness of the model's predictions. It represents the proportion of
correctly classified instances out of the total number of instances in the dataset.
Mathematically, accuracy is calculated using this formula:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
3. Recall
Recall, also known as sensitivity or true positive rate, is a performance metric
used in binary classification tasks. It measures the proportion of actual positive
instances that are correctly identified by the model.
Mathematically, recall is calculated using the formula:
Recall = TP / (TP + FN)
4. Precision
Precision is a performance metric used in binary classification tasks that
measures the proportion of correctly predicted instances out of all instances
predicted as positive by the model.
Precision = TP / (TP + FP)
5. F1-score
The F1 score is a performance metric commonly used in binary classification
tasks, which considers both precision and recall to provide a balanced measure of
a model's performance. It is the harmonic mean of precision and recall,
emphasizing the balance between the two metrics.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
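
All of these metrics are available in scikit-learn. A self-contained sketch,
reusing the same placeholder setup as the training sketch above (again, not the
report's actual code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Same placeholder setup as in the training sketch above.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 50)), rng.integers(0, 8, size=400)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

y_pred = clf.predict(X_val)
print(confusion_matrix(y_val, y_pred))       # rows: actual, columns: predicted
print(accuracy_score(y_val, y_pred))         # (TP + TN) / all instances
print(classification_report(y_val, y_pred))  # per-class precision, recall, F1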

4.9 Similarity Calculation


After transforming the job description and each resume within the job category
into vectors, every resume vector is compared with the corresponding job descrip-
tion vector to compute a match score using Cosine Similarity. Subsequently, the
resumes are ranked based on these match scores.
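
A minimal sketch of this ranking step, with placeholder vectors and file names;
cosine_similarity here is scikit-learn's implementation:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder 50-d vectors for one job description and three resumes.
rng = np.random.default_rng(1)
job_vec = rng.normal(size=(1, 50))
resume_vecs = rng.normal(size=(3, 50))
names = ["resume_a.pdf", "resume_b.pdf", "resume_c.pdf"]

# Match score of every resume against the job description vector.
scores = cosine_similarity(resume_vecs, job_vec).ravel()

# Rank resumes from best to worst match.
for idx in np.argsort(scores)[::-1]:
    print(f"{names[idx]}: {scores[idx]:.3f}")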

4.10 Frontend and Backend Development


To facilitate the resume screening process, we created a frontend interface using
HTML and CSS, where users can upload resumes and employers can publish job
vacancies along with job descriptions. The backend of the system was built using
Django, which handled data processing, storage, and communication between the
frontend and the machine learning model.
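
The report does not include its Django code, so the following view is purely
illustrative of the described flow; the score_resume helper and the "resume" form
field are hypothetical names, not the project's actual API:

# views.py -- hypothetical sketch, not the project's actual code.
from django.http import JsonResponse
from django.views.decorators.http import require_POST

from .screening import score_resume  # hypothetical wrapper around the model


@require_POST
def upload_resume(request):
    """Accept an uploaded resume file and return its match score."""
    resume_file = request.FILES["resume"]  # form field name is illustrative
    text = resume_file.read().decode("utf-8", errors="ignore")
    score = score_resume(text)  # preprocess, embed, compare to the job vector
    return JsonResponse({"filename": resume_file.name, "score": score})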

4.11 Tools Used


• HTML & CSS: HTML and CSS have been used for designing the front-end
display of the website, ensuring a user-friendly interface for interacting with
the resume screening system.

• Django: Django, a Python web framework, has been used for backend
development, handling the server-side logic, database interactions, and
integration of the NLP model to process resumes.

• Jupyter Notebook: Jupyter Notebook has been used for data preprocessing,
including cleaning and preparing the resumes for model training by performing
tasks like tokenization, stop word removal, and lemmatization.

• TensorFlow: TensorFlow has been utilized for training and evaluating the
NLP model, which has helped in automatically analyzing and scoring resumes
based on the job requirements.

5. Results and Discussion

5.1 Results
The resume screening product developed in this project was deployed and put
to a test to check its efficiency and effectiveness. The project was successful in
effectively ranking the resumes automatically, helping job recruiters reduce the
tiring work of evaluating resumes manually. The Random Forest classification
model used in the application was, for the most part, correctly able to classify
the resumes into their respective job categories. In addition, the product provided
a platform for job recruiters to post job vacancies and for job seekers to apply
for those vacant jobs.

5.2 Analysis of the Output

5.3 Analysis of Classification Report

Figure 5.1: Classification report Of the Model

For class "Cloud Engineer":
Precision: 1.00, i.e., out of all instances classified as "Cloud Engineer," 100% of
the predictions were correct. Recall: 0.90, i.e., 90% of the actual "Cloud Engineer"
instances (18 of 20) were correctly predicted. F1-score: 0.947 (harmonic mean
of precision and recall). Support: 20 instances of "Cloud Engineer" in the test set.

For class "Cybersecurity Specialist":
Precision: 0.944, i.e., out of the predicted "Cybersecurity Specialist" instances,
94.4% were correct. Recall: 1.00, i.e., all 17 actual "Cybersecurity Specialist"
instances were correctly identified. F1-score: 0.971 (harmonic mean of precision
and recall). Support: 17 instances of "Cybersecurity Specialist" in the test set.

Data Scientist: Precision: 0.875, i.e., 87.5% of ”Data Scientist” predictions


were correct. Recall: 0.933, i.e., 93.3% of actual ”Data Scientist” instances were
correctly identified. F1-score: 0.903 (harmonic mean of precision and recall).
Support: 15 instances of ”Data Scientist” in the test set.
DevOps Engineer: Precision: 0.938, i.e., 93.8% of ”DevOps Engineer” pre-
dictions were correct. Recall: 0.938, i.e., 93.8% of actual ”DevOps Engineer”
instances were correctly identified. F1-score: 0.938 (harmonic mean of precision
and recall). Support: 16 instances of ”DevOps Engineer” in the test set.
Graphics Designer: Precision: 0.917, i.e., 91.7% of ”Graphics Designer” pre-
dictions were correct. Recall: 1.00, i.e., all actual ”Graphics Designer” instances
were correctly identified. F1-score: 0.957 (harmonic mean of precision and recall).
Support: 11 instances of ”Graphics Designer” in the test set.
Machine Learning Engineer: Precision: 0.960, i.e., 96% of ”Machine Learn-
ing Engineer” predictions were correct. Recall: 0.960, i.e., 96% of actual ”Machine
Learning Engineer” instances were correctly identified. F1-score: 0.960 (harmonic
mean of precision and recall). Support: 25 instances of ”Machine Learning Engi-
neer” in the test set.
Robotics Engineer: Precision: 1.00, i.e., all predicted ”Robotics Engineer”
instances were correct. Recall: 0.895, i.e., 89.5% of actual ”Robotics Engineer”
instances were correctly identified. F1-score: 0.944 (harmonic mean of precision
and recall). Support: 19 instances of ”Robotics Engineer” in the test set.
Software Developer: Precision: 0.955, i.e., 95.5% of ”Software Developer”
predictions were correct. Recall: 1.00, i.e., all actual ”Software Developer” in-
stances were correctly identified. F1-score: 0.977 (harmonic mean of precision
and recall). Support: 21 instances of ”Software Developer” in the test set.

5.4 Analysis of Confusion Matrix

Figure 5.2: Confusion Matrix of the Model

Cloud Engineer:
TP: 18 (Correctly classified as Cloud Engineer)
FN: 2 (1 misclassified as Cybersecurity Specialist, 1 as Data Scientist)
FP: 0 (No other class was wrongly predicted as Cloud Engineer)
TN: Remaining 106 instances

Cybersecurity Specialist:
TP: 17 (Correctly classified as Cybersecurity Specialist)
FN: 0 (No Cybersecurity Specialist instance was misclassified)
FP: 0 (No other class was wrongly predicted as Cybersecurity Specialist)
TN: Remaining 110 instances

Data Scientist:
TP: 14 (Correctly classified as Data Scientist)
FN: 1 (1 misclassified as DevOps Engineer)
FP: 1 (1 wrongly predicted as Data Scientist from Cloud Engineer)
TN: Remaining 111 instances

DevOps Engineer:
TP: 15 (Correctly classified as DevOps Engineer)
FN: 1 (1 misclassified as Data Scientist)
FP: 1 (1 wrongly predicted as DevOps Engineer from Data Scientist)
TN: Remaining 109 instances

Graphics Designer:
TP: 11 (Correctly classified as Graphics Designer)
FN: 0 (No misclassification)
FP: 0 (No other class wrongly predicted as Graphics Designer)
TN: Remaining 115 instances

Machine Learning Engineer:
TP: 24 (Correctly classified as Machine Learning Engineer)
FN: 1 (1 misclassified as Robotics Engineer)
FP: 0 (No other class wrongly predicted as Machine Learning Engineer)
TN: Remaining 102 instances

Robotics Engineer:
TP: 17 (Correctly classified as Robotics Engineer)
FN: 2 (1 misclassified as Machine Learning Engineer, 1 as Software Developer)
FP: 1 (1 wrongly predicted as Robotics Engineer from Machine Learning Engi-
neer)
TN: Remaining 108 instances

Software Developer:
TP: 21 (Correctly classified as Software Developer)
FN: 0 (No misclassification)
FP: 0 (No other class wrongly predicted as Software Developer)
TN: Remaining 117 instances

6. Conclusion and Future Enhancements

6.1 Conclusion
The Resume Screening system developed using NLP, Cosine Similarity, and Ran-
dom Forest algorithms provides an efficient and automated solution for resume
screening in recruitment processes. By automating the tedious task of evaluat-
ing resumes, it allows recruiters to focus on high-value tasks such as interviewing
and final decision-making. The system is capable of extracting key data from
resumes, calculating similarity scores with job descriptions, and ranking candi-
dates accordingly, significantly reducing the manual effort involved in candidate
selection. Through the integration of machine learning models, the system not only
offers efficiency but also accuracy, ensuring that the most qualified candidates are
prioritized. Despite facing challenges in handling varied resume formats and op-
timizing the algorithms, the project has demonstrated the potential of leveraging
NLP and machine learning to enhance recruitment processes.
In conclusion, the project successfully addresses a critical need in the recruitment
industry, and with further enhancements, it has the potential to become a robust
solution for organizations looking to streamline their hiring process.

6.2 Future Enhancements


While the Resume Screening system developed using NLP and the Cosine Simi-
larity algorithm is functional, several enhancements can be made to improve its
efficiency and overall user experience. Some potential future enhancements in-
clude:

• Improved Resume Parsing: Currently, the system extracts key details
from resumes using basic text extraction techniques. In the future, more
advanced NLP techniques, such as Named Entity Recognition (NER) or
deep learning-based models, can be integrated to better handle different
resume formats and improve data extraction accuracy.

• Personalized Ranking System: Currently, resumes are ranked based on
cosine similarity to the job description. In the future, this ranking can
be personalized using more detailed user preferences, past hiring data, or
feedback loops, making the system more tailored to specific job requirements
and improving the hiring process further.

• Real-Time Resume Screening: The system can be expanded to provide
real-time screening of resumes as they are uploaded. This would allow re-
cruiters to instantly evaluate resumes against job descriptions as part of a
continuous hiring pipeline.

• Support for More File Formats: In the future, the system could support
additional file formats beyond DOCX and PDF, such as TXT, RTF, and
ODT, making it more versatile for different user needs.

• Offline Functionality: For users with unreliable internet access, offline
functionality can be developed. This could include storing resume data
locally and performing initial screening without a network connection, with
the results later being synced to the cloud when an internet connection is
available.

References
[1] Chirag Daryani. An automated resume screening system using natural language
processing and similarity. Ethics and Information Technology, 2020.

[2] Natalia Vanetik. NLP-based screening for IT job vacancies. 2023.

[3] Pradeep Kumar Roy. A machine learning approach for automation of resume
recommendation system. Procedia Computer Science, 2020.

[4] Tomas Mikolov. Efficient estimation of word representations in vector space. 2013.

Appendix A. APPENDIX

Figure A.1: Front page

Figure A.2: Login page

Figure A.3: Job Seeker View

Figure A.4: Job Application Access Page

Figure A.5: Job Provider View

