17 - PPT - NLP Project-2-24

Uploaded by

nicolesaldanha96

Problem Statement

In the digital age, the massive volume of news content makes it difficult for
readers to find relevant information. News classification is essential for organizing
articles into categories, improving access to specific topics and enhancing user
experience. Given the broad range of topics and the limitations of manual
categorization, automating the process with Natural Language Processing (NLP)
and machine learning is more efficient. This approach also enables personalized
recommendations, sentiment analysis, and trend detection. The project aims to
develop a robust classification model to streamline news categorization and
improve information access for users.
Objectives
1. Categorize News Reports: Accurately classify news articles into predefined
categories (e.g., politics, sports, entertainment).
2. Analyze Categorization Accuracy: Compare the performance of the Bag of
Words (BOW) model and TF-IDF in terms of classification accuracy.
3. Evaluate Misclassification Errors: Identify and analyze patterns in
misclassification errors to understand their causes.
4. Optimize Hyperparameters: Tune hyperparameters for the BOW model to
enhance classification performance and reduce errors.
5. Provide Insights for Improvement: Summarize findings and recommend
strategies to improve automated news categorization systems.
Implementation
1. Data Collection: News articles are loaded from a CSV file and stored in a Pandas DataFrame
for analysis.
2. Data Preprocessing: Critical steps include handling missing and duplicated data, formatting
date columns, cleaning text (removing HTML, emojis, URLs, punctuation, stopwords), and
tokenizing the text for further processing.
3. Feature Extraction: A Bag of Words model is used to convert text into a numerical format for
machine learning.
4. Model Training: The system uses a Multinomial Naive Bayes classifier, with the dataset split
into 80% training and 20% testing.
5. Evaluation: Model performance is assessed using accuracy, a confusion matrix, and a classification
report. Cross-validation is used to check that performance generalizes across data splits.
6. Hyperparameter Tuning: Randomized Search CV is used to optimize parameters for feature
extraction and classification.
7. Final Model Evaluation: The model's results are visualized using bar charts, highlighting correct
and incorrect predictions for performance insights.
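The text-cleaning part of step 2 could look roughly like the sketch below. It uses only the standard library. The stopword set is a tiny illustrative stand-in (the slides do not say which list was used), the emoji ranges are approximate, and `clean_text` is a hypothetical helper name.

```python
import html
import re
import string

# Tiny illustrative stopword set; an assumption, not the project's actual list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji/pictograph ranges; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> list[str]:
    """Return lowercase tokens with HTML, URLs, emojis, punctuation and stopwords removed."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)         # strip HTML tags
    text = URL_RE.sub(" ", text)         # strip URLs
    text = EMOJI_RE.sub(" ", text)       # strip emojis
    text = text.translate(PUNCT_TABLE).lower()  # strip punctuation, lowercase
    # Whitespace tokenization, then stopword filtering
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(clean_text("<p>The markets rallied 🚀 today! See https://example.com</p>"))
# → ['markets', 'rallied', 'today', 'see']
```

Tags are removed before URLs so that a closing tag glued to a link is not swallowed by the URL pattern.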
Dataset
Result Analysis
• Dataset Shape and Quality: The initial exploration revealed the dataset's dimensions,
providing insights into the number of entries and features. The first few rows were
printed to visualize the structure and content of the data, ensuring its relevance and
integrity.
• Missing Values and Duplicates: An assessment for missing values was performed,
revealing any gaps that could affect model training. Similarly, checks for duplicate
entries were essential to maintain the quality of the dataset. The absence of significant
missing values or duplicates suggested that the dataset was largely clean and ready for
further processing.
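The checks described above can be sketched with pandas as follows. The toy DataFrame and its column names (`content`, `category`) are assumptions standing in for the real CSV, whose schema is not shown in the slides.

```python
import pandas as pd

# Toy stand-in for the news CSV; the column names are assumptions.
df = pd.DataFrame({
    "content": ["Team wins final", "Budget passes vote", "Team wins final", None],
    "category": ["sports", "politics", "sports", "politics"],
})

# Dataset dimensions and a first look at the rows
print("shape:", df.shape)
print(df.head())

# Missing values per column and number of fully duplicated rows
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", n_duplicates)

# Drop incomplete and duplicated rows before further processing
clean_df = df.dropna().drop_duplicates().reset_index(drop=True)
print("shape after cleaning:", clean_df.shape)
```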
Cross-Validation:
◦ A three-fold cross-validation approach was applied, yielding a mean accuracy
score. The data is partitioned into three subsets; in each round the model is
trained on two subsets and validated on the third, so every subset serves once
as the validation set. This iterative approach enhances confidence in the
model's predictive capability.

Cross-Validation Results
The cross-validation process using Stratified K-Folds revealed consistent performance across
different data splits. The accuracy scores for each of the three folds were as follows:
• Fold 1: 0.8806
• Fold 2: 0.8800
• Fold 3: 0.8806
This led to a mean accuracy of 0.88, indicating that the model was stable and performed well
regardless of the specific data split. The cross-validation helped ensure that the model's
performance was not overly dependent on any single train-test split.
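A minimal sketch of stratified three-fold cross-validation for a Bag of Words plus Naive Bayes pipeline is shown below. The twelve toy headlines, the shuffling, and the `random_state` are assumptions for illustration; the scores it prints say nothing about the project's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): six headlines per category.
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new goalkeeper",
    "runner breaks national record", "captain injured before match",
    "parliament passes budget bill", "minister resigns after vote",
    "president signs trade deal", "senate debates tax reform",
    "mayor announces election bid", "government unveils new policy",
]
labels = ["sports"] * 6 + ["politics"] * 6

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratification keeps the class proportions equal in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score:.4f}")
print(f"Mean accuracy: {scores.mean():.4f}")
```

Because the vectorizer sits inside the pipeline, its vocabulary is rebuilt from the training folds only, so no information from the validation fold leaks into feature extraction.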
Fine-tuning
To further optimize the model, RandomizedSearchCV was used for hyperparameter tuning.
It sampled different configurations of the CountVectorizer and Multinomial Naive Bayes
hyperparameters to identify the combination with the best cross-validated performance. The following
parameters were considered:
• max_features: [5000, 10000, None]
• ngram_range: [(1, 1), (1, 2)]
• alpha: A range of values between 0.1 and 2.0 (for smoothing in Naive Bayes)
After conducting 25 fits, the best hyperparameters identified were:
• max_features: None
• ngram_range: (1, 2)
• alpha: 1.92
This optimized model was then re-trained on the training data, and when tested, it reached an
accuracy of 0.933 on the test set, up from the mean cross-validation accuracy of 0.88 reported earlier.
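The search space listed above could be expressed with RandomizedSearchCV roughly as follows. The slides do not state `n_iter` or the number of folds; `n_iter=5` with `cv=5` is used here only as one combination that produces 25 fits. The toy data and `random_state` are likewise assumptions.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): six headlines per category.
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new goalkeeper",
    "runner breaks national record", "captain injured before match",
    "parliament passes budget bill", "minister resigns after vote",
    "president signs trade deal", "senate debates tax reform",
    "mayor announces election bid", "government unveils new policy",
]
labels = ["sports"] * 6 + ["politics"] * 6

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name.
param_distributions = {
    "countvectorizer__max_features": [5000, 10000, None],
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    # uniform(loc, scale) samples from [loc, loc + scale] = [0.1, 2.0]
    "multinomialnb__alpha": uniform(loc=0.1, scale=1.9),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,          # assumption: 5 sampled candidates
    cv=5,              # assumption: 5 folds, giving 5 x 5 = 25 fits
    scoring="accuracy",
    random_state=0,
)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already re-trained on all the data passed to `fit` and can be evaluated on a held-out test set directly.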
Error Analysis
To further analyze the model's performance, the number of correct and incorrect predictions
was examined:
• Correct Predictions: 37,246
• Wrong Predictions: 2,696
Out of 39,942 test articles, the model therefore categorized about 93% correctly,
although there is still room for improvement, particularly in reducing the number of
misclassifications.
A visualization of the prediction outcomes highlighted the proportion of correct versus wrong
predictions, with correct predictions (green) vastly outnumbering incorrect ones (red).
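The reported counts can be cross-checked against the reported accuracy with a few lines of arithmetic. The `count_outcomes` helper is a hypothetical name showing how such counts could be derived from label lists; the two short lists at the end are made-up examples.

```python
def count_outcomes(y_true, y_pred):
    """Return (correct, wrong) counts for two equally long label sequences."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct, len(y_true) - correct


# Counts reported in the slides
correct, wrong = 37_246, 2_696
total = correct + wrong
accuracy = correct / total

print(f"test samples: {total}")               # → test samples: 39942
print(f"accuracy:     {accuracy:.3f}")        # → accuracy:     0.933
print(f"error rate:   {wrong / total:.3f}")   # → error rate:   0.067

# Made-up example of deriving the counts from predictions
print(count_outcomes(["sports", "politics", "sports"],
                     ["sports", "sports", "sports"]))  # → (2, 1)
```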
Bag of Words:

The Bag of Words (BoW) model is a simple yet effective technique in Natural Language Processing that
converts text into numerical data by representing each document as a collection of word counts. It creates
a vocabulary of all unique words in a dataset and then transforms each document into a vector, where
each element represents the frequency of a specific word in that document. The model disregards word
order and grammar, focusing solely on word occurrence. BoW is commonly used for tasks like text
classification, where numerical representations of text are required for machine learning algorithms to
process.
Multinomial Naive Bayes with Bag of Words
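from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# X_train, X_test, y_train, y_test are assumed to come from the 80/20 split
# described under Implementation (article text and category labels).

# Chain the Bag of Words vectorizer and the classifier into one estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Fit on the training articles, then predict categories for the test articles
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
print(f"MultinomialNB with Bag of Words accuracy: {accuracy:.3f}")

# Per-class precision, recall and F1
print("Classification Report:\n", classification_report(y_test, y_pred))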

In this code, the Bag of Words (BoW) model, implemented through CountVectorizer(), converts the text data from the news
articles into a numerical matrix by counting the occurrences of each word in the dataset's vocabulary. This transformed data is then fed
into the Multinomial Naive Bayes classifier (MultinomialNB()), which is effective for text classification tasks. The model is trained on
the X_train data, predicts labels for the X_test data, and evaluates its performance using metrics like accuracy and a classification
report. This pipeline efficiently processes text data for classification using BoW and Naive Bayes.
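The code above assumes that `X_train`, `X_test`, `y_train` and `y_test` already exist. One way they might be produced, following the 80/20 split named under Implementation, is sketched below on toy headlines. The data, the stratified split, and the `random_state` are assumptions; the confusion matrix is added because the deck lists it among the evaluation metrics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): five headlines per category.
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new goalkeeper",
    "runner breaks national record",
    "parliament passes budget bill", "minister resigns after vote",
    "president signs trade deal", "senate debates tax reform",
    "mayor announces election bid",
]
labels = ["sports"] * 5 + ["politics"] * 5

# 80% training, 20% testing; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=["politics", "sports"])
print(len(X_train), "training /", len(X_test), "test articles")
print(cm)
```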
Conclusion
Visualization and Interpretation of Results
To provide a clearer understanding of the model's performance, several visualizations were
created:
• Correct vs. Wrong Predictions:
◦ A bar chart comparing correct and incorrect predictions illustrated the model's
effectiveness visually. The chart showed a substantial number of correct
predictions (green) relative to incorrect ones (red), reinforcing the model's
reliability.
• Final DataFrame of Predictions:
◦ A summary DataFrame was generated, showcasing the content of the articles
alongside predicted and actual labels. This presentation allows for
straightforward comparisons and highlights specific cases of misclassification,
which can be critical for further analysis and model refinement.
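A summary DataFrame of this kind might be built as in the sketch below. The four rows and the column names (`content`, `actual`, `predicted`) are made up for illustration and are not taken from the project's output.

```python
import pandas as pd

# Made-up predictions for illustration; column names are assumptions.
results = pd.DataFrame({
    "content": [
        "Team wins cup final",
        "Minister resigns after vote",
        "Star signs film deal",
        "Senate debates stadium funding",
    ],
    "actual": ["sports", "politics", "entertainment", "politics"],
    "predicted": ["sports", "politics", "entertainment", "sports"],
})

# Flag each row as correctly or incorrectly classified
results["correct"] = results["actual"] == results["predicted"]
print(results["correct"].value_counts())

# Pull out the misclassified articles for manual inspection
misclassified = results[~results["correct"]]
print(misclassified[["content", "actual", "predicted"]])

# Count which (actual, predicted) pairs are confused most often
confusions = misclassified.groupby(["actual", "predicted"]).size()
print(confusions.sort_values(ascending=False))
```

The counts in the `correct` column are also what a correct-versus-wrong bar chart would plot.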
Conclusion
1. Bag of Words vs. TF-IDF: The model using Bag of Words (BoW) features achieved higher
accuracy than the one using TF-IDF features. This suggests that raw word counts captured
the relevant features for this specific news classification task more effectively.
2. Optimal Hyperparameters: The best performance was obtained using the hyperparameters:
◦ max_features: None
◦ ngram_range: (1, 2)
◦ alpha: 1.92
3. Model Accuracy: After fine-tuning, the optimized model achieved an accuracy of 0.933,
with 37,246 correct predictions and 2,696 wrong predictions, reflecting a highly effective
classification system.
4. Potential for Improvement: Exploring other machine learning models (e.g., Support
Vector Machines or neural networks) might further enhance accuracy and reduce
misclassification.
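A comparison like the one in point 1 could be set up as sketched below, swapping only the vectorizer while keeping the classifier and the cross-validation folds fixed. The toy headlines and fold settings are assumptions, and the scores on such a tiny sample do not indicate which representation wins on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): six headlines per category.
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new goalkeeper",
    "runner breaks national record", "captain injured before match",
    "parliament passes budget bill", "minister resigns after vote",
    "president signs trade deal", "senate debates tax reform",
    "mayor announces election bid", "government unveils new policy",
]
labels = ["sports"] * 6 + ["politics"] * 6

# Same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

candidates = {
    "bow": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tfidf": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```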
THANK YOU
