B3 Twitter Data
With the rise of social media platforms, Twitter has become a significant source of real-time
data. Analyzing Twitter data can provide valuable insights into public opinions, sentiments,
and trends. The use of Twitter data for analysis gained prominence with the growth of social
media platforms in recent years. Researchers and businesses recognized the potential of
Twitter data in understanding public sentiment, predicting trends, and conducting market
research. As a result, various methods and algorithms were developed to process and classify
Twitter data efficiently. Traditional systems often rely on basic text processing techniques
such as tokenization, stemming, and stop-word removal. While these techniques are useful,
they are not always sufficient for handling the unique characteristics of Twitter data, such as
hashtags, mentions, and emoticons. In addition, the unstructured and noisy nature of Twitter
data poses challenges for effective analysis. Therefore, the need for a comprehensive pre-
processing approach arises from the growing importance of Twitter data in decision-making
processes. Businesses, researchers, and organizations rely on Twitter data for sentiment
analysis, brand monitoring, and trend prediction. To extract meaningful insights from this
data, it is essential to preprocess it effectively, ensuring that irrelevant information and noise
are removed while preserving the context and nuances of social media language. Thus, this
research proposes the effective classification of Twitter data using machine learning
algorithms. This comprehensive pre-processing approach is significant for several reasons
such as improved accuracy, better understanding of public opinion, enhanced decision
making, and research advancements.
CHAPTER 1
INTRODUCTION
1.1 History
The history of analyzing Twitter data for insights traces back to the early 2000s when social
media platforms began to burgeon. With the inception of Twitter in 2006, a new avenue for
real-time data analysis emerged. Initially, researchers and businesses viewed Twitter as a
platform for social interaction. However, as its user base expanded exponentially, it became
evident that Twitter harbored a wealth of information beyond mere conversations.
Around 2010, the academic community and industry pioneers started recognizing Twitter's
potential as a goldmine for understanding public sentiment, predicting trends, and conducting
market research. This recognition marked the onset of a concerted effort to develop methods
and algorithms specifically tailored for processing and classifying Twitter data effectively.
The evolution of machine learning algorithms further propelled the analysis of Twitter data.
Researchers started experimenting with various models to extract insights from the vast pool
of tweets generated every second. This experimentation led to the development of novel
techniques aimed at improving the accuracy and efficiency of Twitter data classification.
Research Motivation:
Furthermore, the growing importance of Twitter data in shaping public opinion and driving
market trends accentuates the need for robust preprocessing techniques. By harnessing the
power of machine learning algorithms coupled with comprehensive preprocessing,
stakeholders can gain a deeper understanding of consumer behavior, market dynamics, and
societal trends.
Problem Statement:
The problem statement revolves around the inadequacy of traditional text processing
techniques in handling the unique characteristics of Twitter data. Conventional methods such
as tokenization, stemming, and stop-word removal fall short when confronted with hashtags,
mentions, and emoticons prevalent in tweets.
The unstructured and noisy nature of Twitter data poses significant challenges for effective
analysis. Without a comprehensive preprocessing approach, it becomes challenging to extract
meaningful insights while preserving the context and nuances of social media language.
Businesses, researchers, and organizations face the daunting task of navigating through the
vast sea of tweets to distill relevant information and derive actionable insights. Therefore,
there is an urgent need to develop a robust preprocessing framework that can effectively filter
out noise and irrelevant content while retaining the essence of Twitter discourse.
Applications:
The applications of the proposed comprehensive preprocessing approach are multifaceted and
span across various domains. In the realm of business, organizations can leverage Twitter
data for sentiment analysis to gauge customer satisfaction, identify emerging trends, and
monitor brand perception. By preprocessing the data effectively, businesses can extract
actionable insights to inform marketing strategies, product development, and customer
engagement initiatives.
In the field of academia, researchers can utilize Twitter data to study social phenomena,
conduct opinion polls, and analyze public discourse on diverse topics ranging from politics to
health. A robust preprocessing approach ensures the reliability and validity of research
findings by filtering out noise and irrelevant information inherent in Twitter data.
Moreover, government agencies can harness Twitter data for real-time monitoring of public
sentiment, crisis management, and disaster response. By preprocessing the data
comprehensively, policymakers can gain valuable insights into public opinion, identify areas
of concern, and formulate timely interventions to address societal issues.
CHAPTER 2
LITERATURE SURVEY
Sanjay et al. [8] conducted sentiment analysis on Twitter data related to the Indian farmer
protests to gain insights into global public sentiment. They employed machine learning algorithms to
analyze approximately twenty thousand tweets associated with the protests and assess the
sentiments expressed. The researchers compared the effectiveness of two popular
text representation techniques, BoW and TF-IDF, and discovered that BoW
outperformed TF-IDF in sentiment analysis accuracy. The study further involved the
application of various classifiers, including SVM, RF, DT, and NB, on the dataset. The
results revealed that the RF classifier achieved the highest accuracy among the
evaluated classifiers.
Behl et al. [9] gathered tweets related to various natural disasters and categorized
them into three groups based on their content: "resource availability," "resource
requirements," and "others." To accomplish this classification task, they employed a
Multi-Layer Perceptron (MLP) network with an optimizer. The proposed model
demonstrated an accuracy of 83%, indicating its effectiveness in accurately
classifying the tweets into the designated categories.
Tan et al. [10] introduced a model that combined BiLSTM, RoBERTa, and GRU
models. To further enhance the overall effectiveness of sentiment analysis, the
models' predictions were combined using majority voting. Addressing the
challenges posed by unbalanced datasets, the researchers enhanced the data by
utilizing GloVe pre-trained word embeddings. The experimental results
demonstrated that the proposed model surpassed state-of-the-art approaches, achieving
accuracy rates of 0.942, 0.892, and 0.9177 on the Sentiment140, US Airlines, and IMDB
datasets, respectively.
For aspect-level sentiment analysis, Lu et al. [11] presented IRAN (Interactive Rule
Attention Network). To simulate the operation of grammar at the sentence level, IRAN
includes a grammar rule encoder that normalizes the output of adjacent positions.
Furthermore, IRAN makes use of an attention network that interacts with its environment
to better understand the target and its surroundings. Experiments on the ACL 2014 Twitter
and SemEval 2014 datasets showed that IRAN learns informative features successfully and
beats baseline models. These results indicate that IRAN is an effective tool for aspect-level
sentiment analysis, which can lead to enhanced performance in the field.
He et al. [13] introduced LGCF, a multilingual learning paradigm that emphasized active
learning in both global and local contexts. Unlike its predecessors, LGCF
demonstrated the ability to effectively learn the connections between target aspects and
local contexts, along with the connections between target aspects and global contexts,
simultaneously. This innovative approach enables the model to capture and utilize both
local and global contextual information efficiently, enhancing its overall performance in
sentiment analysis tasks.
To better understand the state of the art in SA using DNNs and CNNs, Qurat et al. [15]
undertook a systematic literature review of current studies. Topics covered in their
investigation of sentiment analysis included text sentiment categorization, cross-lingual
analysis, and both textual and visual analysis. Datasets were culled from a wide range of
social media platforms. The authors presented the various stages of the successful
construction of DL models in emotion analysis and noted that many difficulties in this
field were efficiently solved with high accuracy using deep learning methodologies. With
their more complex structures, deep learning networks were able to extract and represent
features more accurately than traditional neural networks and SVMs. This study
demonstrates the benefits of using DL models for sentiment analysis, which can lead to
improved results in emotion analysis.
CHAPTER 3
EXISTING SYSTEM
Before the integration of AI and machine learning, the analysis of Twitter data primarily
relied on traditional text processing techniques. These methods were rudimentary, focusing
on basic text manipulation rather than understanding the underlying meaning or context of
the data. Techniques such as tokenization, which breaks down text into individual words or
tokens; stemming, which reduces words to their base or root form; and stop-word removal,
which eliminates common but uninformative words, were commonly used. While these
techniques allowed for some level of text processing, they were limited in their ability to
handle the unique features of Twitter data.
Twitter, as a platform, presents several challenges for traditional text processing. The
presence of hashtags, mentions, and emoticons adds layers of complexity that traditional
methods struggle to manage. For example, hashtags and mentions often carry significant
contextual information, and their proper interpretation is crucial for accurate sentiment
analysis and trend prediction. Traditional systems also faced difficulties with the unstructured
and noisy nature of Twitter data, where informal language, abbreviations, and slang are
prevalent. As a result, the insights derived from such data were often shallow and lacked
depth, limiting their usefulness in decision-making processes.
1. Informal Language:
o Twitter users often employ informal language, abbreviations, and slang, which
traditional text processing techniques struggle to interpret accurately. This
leads to a loss of contextual meaning, reducing the effectiveness of analysis.
2. Contextual Understanding:
o Traditional methods lack the ability to grasp the nuanced meanings behind
words or phrases, particularly in the context of hashtags, mentions, and
emoticons. These elements are often critical for understanding the sentiment
and intent behind a tweet.
3. Noise in Data:
4. Scalability Issues:
5. Limited Feature Extraction:
o The basic text processing techniques used in traditional systems fail to extract
meaningful features from Twitter data, such as user interactions, tweet
metadata, and temporal patterns. This results in a loss of valuable information
that could enhance analysis.
CHAPTER 4
PROPOSED SYSTEM
4.1 Overview
The proposed system is implemented as a Python script that provides a comprehensive pipeline
for preprocessing and classifying Twitter data using various machine learning algorithms.
Importing Libraries:
The script begins by importing necessary libraries, including NumPy, Pandas,
Matplotlib, Seaborn, NLTK, and warnings.
Loading Data:
The training and testing datasets are loaded from CSV files using Pandas.
The shape of the datasets is printed to provide an overview of the data size.
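A minimal sketch of this loading step is shown below; the CSV file names are taken from the source code section later in this report, and the column layout is assumed to match that dataset.
import pandas as pd

# Load the training and testing datasets and report their sizes
train = pd.read_csv('train_E6oV3lV.csv')
test = pd.read_csv('test.csv')
print('Training set:', train.shape)
print('Testing set:', test.shape)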
Exploratory Data Analysis (EDA):
Displaying the first few rows of the training and testing datasets to inspect the
structure of the data. Checking for missing values in both datasets. Exploring positive
and negative comments in the training set. Visualizing the distribution of tweet
lengths in both training and testing datasets. Creating a new column to represent the
length of each tweet. Grouping the data by label (positive or negative) and analyzing
statistics.
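The sketch below illustrates these EDA steps, continuing from the loading sketch above; it assumes the raw text column is named 'tweet' and the sentiment column 'label', which may differ from the actual files.
# Inspect the structure and check for missing values
print(train.head())
print(train.isnull().sum())
print(test.isnull().sum())

# Add a column with the length of each tweet and compare statistics by label
train['len'] = train['tweet'].astype(str).apply(len)
test['len'] = test['tweet'].astype(str).apply(len)
print(train.groupby('label')['len'].describe())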
Data Visualization:
Creating count plots and histograms to visualize the distribution of tweet lengths,
label frequencies, and hashtag frequencies. Generating word clouds to display the
most frequent words in the overall vocabulary, neutral words, and negative words.
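The following illustrative snippet (not the exact project code) reproduces two of these visualizations, a label count plot and a word cloud, assuming matplotlib, seaborn, and the wordcloud package are available.
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Count plot of the sentiment labels
sns.countplot(data=train, x='label')
plt.title('Label frequencies')
plt.show()

# Word cloud of the most frequent words in the training tweets
all_text = ' '.join(train['tweet'].astype(str))
wc = WordCloud(width=800, height=400, background_color='white').generate(all_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()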
Hashtag Analysis:
Extracting hashtags from both positive and negative tweets. Creating frequency
distributions and bar plots to display the most common hashtags in each category.
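A small sketch of this hashtag analysis follows; it assumes label 0 marks regular (positive or neutral) tweets and label 1 marks negative tweets, as in the dataset used later.
import re
import nltk

def extract_hashtags(tweets):
    # Collect every term that follows a '#' across an iterable of tweets
    tags = []
    for tweet in tweets:
        tags.extend(re.findall(r'#(\w+)', str(tweet)))
    return tags

ht_regular = extract_hashtags(train['tweet'][train['label'] == 0])
ht_negative = extract_hashtags(train['tweet'][train['label'] == 1])

# Ten most common hashtags in each category
print(nltk.FreqDist(ht_regular).most_common(10))
print(nltk.FreqDist(ht_negative).most_common(10))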
Word Embeddings with Word2Vec:
Using Gensim to train a Word2Vec model on tokenized tweets. Demonstrating word
similarities for certain words using the trained Word2Vec model.
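A minimal Word2Vec sketch with Gensim is given below; the tokenizer and hyperparameters are illustrative and not necessarily those used in the project.
from gensim.models import Word2Vec
from nltk.tokenize import TweetTokenizer

tk = TweetTokenizer()
tokenized = train['tweet'].astype(str).apply(tk.tokenize)

# Train a small Word2Vec model on the tokenized tweets
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

# Words most similar to a chosen token (the word must exist in the learned vocabulary)
print(w2v.wv.most_similar('happy', topn=5))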
Text Preprocessing:
Removing unwanted patterns, converting text to lowercase, and stemming words
using NLTK. Creating bag-of-words representations for both the training and testing
datasets.
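The sketch below shows a simplified version of this cleaning and bag-of-words step (the full cleaning function appears in the source code section); the 3000-feature limit mirrors the CountVectorizer setting used there.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

ps = PorterStemmer()

def clean(text):
    # Keep letters only, lowercase the text, and stem each word
    text = re.sub(r'[^a-zA-Z\s]', ' ', str(text)).lower()
    return ' '.join(ps.stem(w) for w in text.split())

train['content'] = train['tweet'].apply(clean)
test['content'] = test['tweet'].apply(clean)

# Bag-of-words representation limited to the 3000 most frequent terms
vectorizer = CountVectorizer(max_features=3000, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(train['content']).toarray()
X_unseen = vectorizer.transform(test['content']).toarray()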
Model Training:
Splitting the training dataset into training and validation sets.
Standardizing the data using StandardScaler.
Training machine learning models including RandomForestClassifier,
LogisticRegression
Evaluating the models on the validation set, calculating training and validation
accuracy, F1 score, and generating confusion matrices.
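A condensed, illustrative version of this training and evaluation loop is shown below; it uses the feature matrix X built in the previous sketch, trains two of the models named above, and reports accuracy, F1 score, and the confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Split the training data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, train['label'], test_size=0.3, random_state=42)

# Standardize the bag-of-words features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

for name, model in [('Random Forest', RandomForestClassifier()),
                    ('Logistic Regression', LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    print(name, 'accuracy:', accuracy_score(y_valid, pred), 'F1:', f1_score(y_valid, pred))
    print(confusion_matrix(y_valid, pred))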
The script covers a wide range of tasks from data loading and exploration to text
preprocessing, visualization, and training various machine learning models for sentiment
analysis on Twitter data.
4.2 Data Pre-processing
Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step in creating a machine learning
model. When building a machine learning project, we rarely come across clean and
well-formatted data, so before performing any operation on the data it must be cleaned and
put into a structured form; this is the purpose of the data pre-processing stage. Real-world
data generally contains noise and missing values and may be in an unusable format that
cannot be fed directly to machine learning models. Data pre-processing cleans the data and
makes it suitable for a machine learning model, which also increases the accuracy and
efficiency of the model.
Dataset Splitting
In machine learning data pre-processing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data pre-processing as by doing this, we can enhance the
performance of our machine learning model. Suppose we train our machine learning model
on one dataset and then test it on a completely different dataset; the model will have
difficulty capturing the correlations in the data. Similarly, if we train our model very well
and its training accuracy is very high, but its performance drops when we provide a new
dataset, the model has not generalized. So we always try to build a machine learning model
that performs well on the training set and also on the test dataset. Here, we can define these
datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know
the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.
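A minimal illustration of this split, assuming a feature matrix X and label vector y have already been prepared; the 70/30 ratio matches the split used in the source code.
from sklearn.model_selection import train_test_split

# Hold out 30% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)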
4.3 ML Module
The machine learning model building process for Twitter sentiment analysis begins with
preprocessing the text data. The text is cleaned by removing unwanted characters, stopwords,
and stemming words to their root forms. Following this, a Bag of Words (BoW)
representation is created using CountVectorizer, which transforms the cleaned tweets into a
structured format suitable for model input.
The dataset is then split into training and validation sets using train_test_split, ensuring that
the model is tested on unseen data. Standardization is applied using StandardScaler to
normalize the features, improving the performance of the machine learning models.
Various models, including Random Forest, Logistic Regression, Decision Tree, Support
Vector Machine (SVM), and XGBoost, are trained on the processed data. Each model is
evaluated based on its training and validation accuracy, along with the F1 score, which
balances precision and recall.
The model's performance is further analyzed using confusion matrices, which provide
insights into the classification errors. Among the models tested, the one with the highest
validation accuracy and F1 score is selected as the final model for predicting sentiment on the
test dataset.
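The model-selection step described above can be sketched as follows; trained_models, X_valid, y_valid, and X_unseen are placeholders for the fitted models and data splits created earlier, not names taken from the project code.
from sklearn.metrics import f1_score

# Record the validation F1 score of every trained model and keep the best one
scores = {name: f1_score(y_valid, model.predict(X_valid))
          for name, model in trained_models.items()}
best_name = max(scores, key=scores.get)
print('Selected model:', best_name, 'with F1 =', scores[best_name])

# Use the selected model to predict sentiment on the unseen test features
final_predictions = trained_models[best_name].predict(X_unseen)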
Extra Trees (Extremely Randomized Trees) is an ensemble learning method that
aggregates multiple decision trees to make a prediction; in this project its classifier form
(ExtraTreesClassifier) is used. It is similar to Random Forests but introduces additional
randomness when constructing the trees. In contrast to traditional decision trees, Extra Trees
do not rely on bootstrapping and instead randomly select both features and thresholds when
splitting nodes.
How It Works:
Random Feature Splits: For each decision tree, the algorithm selects random subsets
of features, but unlike random forests, Extra Trees further randomize by selecting
thresholds for splits randomly rather than choosing the best possible split.
Tree Construction: Each tree is fully grown without pruning; the extra randomness in
the splits slightly increases bias, while averaging over many trees reduces variance.
Ensemble Averaging: The model aggregates the predictions from multiple trees by
averaging (for regression tasks) or taking the majority vote (for classification).
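A brief usage sketch of the scikit-learn implementation of this method follows; the data splits are assumed to be those created in the earlier training sketch.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Each tree uses random feature subsets and random split thresholds
etc = ExtraTreesClassifier(n_estimators=100, random_state=42)
etc.fit(X_train, y_train)
print('Validation accuracy:', accuracy_score(y_valid, etc.predict(X_valid)))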
Architecture:
1. Randomness in Node Splitting: Extra Trees splits nodes by selecting random
features and thresholds, unlike traditional decision trees, which select the best split
based on criteria like Gini impurity or entropy.
Disadvantages:
Higher Bias: Due to the randomness in splits, Extra Trees tend to have a higher bias
than models like Random Forests.
Sensitivity to Noisy Data: The additional randomness can lead to overfitting if the
dataset is small or contains significant noise.
Interpretability: Like other ensemble methods, the model is not easy to interpret
because it involves many decision trees.
Gradient Boosting Classifier (GBC) is a powerful machine learning algorithm that builds an
ensemble of weak learners, usually decision trees, and combines them sequentially to
minimize a loss function. It is a boosting technique where each new tree corrects the errors
made by the previous ones.
How It Works:
Gradient Descent: The model uses gradient descent to minimize the loss function by
updating the weights of misclassified or poorly predicted instances.
Additive Model: Trees are added in sequence, and each tree's contribution is
weighted. The final prediction is the weighted sum of all trees’ outputs.
Architecture:
1. Loss Function: The model optimizes a loss function, which could be binary cross-
entropy (for classification) or mean squared error (for regression).
2. Weak Learners: It typically uses shallow decision trees (stumps) as weak learners.
Each tree corrects the mistakes of the previous ones by focusing on instances with
higher residual errors.
3. Gradient Updates: After each iteration, the gradient of the loss function is calculated
to update the model parameters.
4. Shrinkage: To prevent overfitting, a learning rate is applied to the weights of the trees
to control the contribution of each tree.
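A short scikit-learn sketch of this boosting setup, showing the shallow trees and the learning-rate (shrinkage) parameter described above; the parameter values are illustrative only.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Sequentially added shallow trees, each scaled by the learning rate (shrinkage)
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(X_train, y_train)
print('Validation accuracy:', accuracy_score(y_valid, gbc.predict(X_valid)))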
Advantages:
High Accuracy: GBC often outperforms other ensemble models in terms of accuracy,
especially on structured/tabular data.
Handles Imbalanced Data Well: GBC can be tailored to handle imbalanced datasets
by adjusting the loss function or class weights.
Disadvantages:
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software. In its
current form, UML comprises two major components: a meta-model and a notation. In
the future, some form of method or process may also be added to, or associated with, UML.
GOALS: The Primary goals in the design of the UML are as follows:
Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
Provide extendibility and specialization mechanisms to extend the core concepts.
Be independent of particular programming languages and development processes.
Provide a formal basis for understanding the modeling language.
Encourage the growth of OO tools market.
Support higher level development concepts such as collaborations, frameworks,
patterns and components.
Integrate best practices.
Class Diagram
The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an “is-a”
or “has-a” relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed “methods” of the class.
Apart from this, each class may have certain “attributes” that uniquely identify the class.
The purpose of a use case diagram is to capture the dynamic aspects of a system.
DATA FLOW DIAGRAM
A Data Flow Diagram (DFD) is a visual representation of the flow of data within a system or
process. It is a structured technique that focuses on how data moves through different
processes and data stores within an organization or a system. DFDs are commonly used in
system analysis and design to understand, document, and communicate data flow and
processing
SEQUENCE DIAGRAM
The sequence diagram is another important UML diagram used to describe the dynamic aspects
of the system; it shows how objects interact with one another over time through an ordered
sequence of messages.
Component Diagram
In UML (Unified Modeling Language), system architecture refers to the high-level structure
of a software system, capturing the organization of its components, their relationships, and
interactions. It provides a blueprint that describes how different parts of the system fit
together to fulfill the requirements and objectives. System architecture in UML is often
represented through diagrams like component diagrams, deployment diagrams, and class
diagrams, which visually depict the system's structural and behavioral aspects. This helps in
understanding, designing, and communicating the system's framework effectively to
stakeholders.
CHAPTER 6
SOFTWARE ENVIRONMENT
What is Python?
Python is a high-level, general-purpose, interpreted programming language known for its
simple and readable syntax. Its biggest strength is its huge collection of standard and
third-party libraries, which can be used for the following:
Machine Learning
GUI Applications (like Kivy, Tkinter, PyQt, etc.)
Web frameworks like Django (used by YouTube, Instagram, Dropbox)
Image processing (like Opencv, Pillow)
Web scraping (like Scrapy, BeautifulSoup, Selenium)
Test frameworks
Multimedia
Advantages of Python
1. Extensive Libraries
Python ships with an extensive library containing code for various purposes like
regular expressions, documentation generation, unit testing, web browsers, threading,
databases, CGI, email, image manipulation, and more. So, we don't have to write the
complete code for that manually.
2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some of
your code in languages like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.
4. Improved Productivity
The language's simplicity and extensive libraries make programmers more productive than
languages like Java and C++ do: you need to write less code to get more done.
5. IoT Opportunities
Since Python forms the basis of new platforms like the Raspberry Pi, its future looks bright
for the Internet of Things. It is a way to connect the language with the real world.
6. Simple and Easy to Learn
When working with Java, you may have to create a class just to print 'Hello World'. In
Python, a single print statement will do. Python is also quite easy to learn, understand, and
code. This is why, when people pick up Python, they sometimes have a hard time adjusting
to more verbose languages like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.
9. Free and Open-Source
Like we said earlier, Python is freely available. But not only can you download Python for
free, but you can also download its source code, make changes to it, and even distribute it. It
downloads with an extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.
11.Interpreted
Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.
1. Less Coding
Almost all tasks done in Python require less code than the same tasks in other languages.
Python also has awesome standard library support, so you don't have to search for
third-party libraries to get your job done. This is the reason many people suggest learning
Python to beginners.
2. Affordable
Python is free therefore individuals, small companies or big organizations can leverage the
free available resources to build applications. Python is popular and widely used so it gives
you better community support.
The 2019 GitHub annual survey showed us that Python has overtaken Java in the most
popular programming language category.
3. Python is for Everyone
Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.
Disadvantages of Python
So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.
1. Slow Speed
We have seen that Python code is executed line by line. Because Python is interpreted, this
often results in slow execution. This, however, isn't a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by
Python are enough to outweigh its speed limitations.
2. Weak on the Client Side
While it serves as an excellent server-side language, Python is rarely seen on the client side.
Beyond that, it is rarely used to implement smartphone-based applications; one such
application is called Carbonnelle. The reason it is not widely used in the browser, despite the
existence of Brython, is that it isn't that secure.
3. Design Restrictions
As you know, Python is dynamically typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.
5. Simple
No, we're not kidding. Python's simplicity can indeed be a problem: developers who grow
used to its concise syntax often find the verbosity of languages like Java unnecessary and
harder to adjust to.
This was all about the Advantages and Disadvantages of Python Programming Language.
NumPy
It is the fundamental package for scientific computing with Python. It contains various
features, including these important ones: a powerful N-dimensional array object,
sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and
useful linear algebra, Fourier transform, and random number capabilities.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined, which allows NumPy to
seamlessly and speedily integrate with a wide variety of databases.
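A small illustration of the array features mentioned above:
import numpy as np

# N-dimensional array with broadcasting: scale every element without an explicit loop
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
print(a * 10)

# Structured (arbitrary) dtype acting as an efficient container of generic data
records = np.array([(1, 'positive'), (2, 'negative')], dtype=[('id', 'i4'), ('label', 'U8')])
print(records['label'])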
Pandas
Pandas is an open-source library providing high-performance, easy-to-use data structures
(most notably the DataFrame) and data analysis tools for Python; it is used here for loading
and exploring the tweet datasets.
Matplotlib
For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
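A minimal example of the MATLAB-like pyplot interface described above, using made-up tweet lengths:
import matplotlib.pyplot as plt

# Each pyplot call operates on the current figure, much like MATLAB
lengths = [12, 35, 48, 27, 90, 63, 54, 71]
plt.hist(lengths, bins=5)
plt.xlabel('Tweet length')
plt.ylabel('Frequency')
plt.title('Distribution of tweet lengths')
plt.show()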
Scikit-learn
Scikit-learn is a free machine learning library for Python built on NumPy and SciPy. It
provides simple and efficient tools for classification, regression, clustering, and
preprocessing, and supplies the classifiers, vectorizer, scaler, and evaluation metrics used in
this project.
How to Install Python
There have been several updates to Python over the years. The question is how to install
Python. It might be confusing for a beginner who is willing to start learning Python, but this
tutorial will solve that query. The latest or the newest version of Python is version 3.7.4, or
in other words, it is Python 3.
Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python, you first need to know your system
requirements. You must download the Python version that matches your system type, i.e.,
your operating system and processor. The system used here is a Windows 64-bit operating
system, so the steps below are to install Python version 3.7.4 (Python 3) on a Windows 7
device. The steps on how to install Python on Windows 10, 8, and 7 are divided into four
parts to help understand better.
Step 1: Go to the official site (https://www.python.org) using Google Chrome or any other
web browser to download and install Python.
Now, check for the latest and the correct version for your operating system.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see a different version of python along with the operating system.
To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.
To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.
Here we will use the Windows x86-64 web-based installer. This completes the first part,
choosing which version of Python to download. Now we move ahead with the second part,
i.e., installation.
Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.
Installation of Python
Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.
Step 2: Before you click on Install Now, Make sure to put a tick on Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With these above three steps on python installation, you have successfully and correctly
installed Python. Now is the time to verify the installation.
Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save
Step 5: Name the file and save as type should be Python files. Click on SAVE. Here I have
named the files as Hey World.
Step 6: Now for e.g. enter print (“Hey World”) and Press Enter.
You will see that the command given is launched. With this, we end our tutorial on how to
install Python. You have learned how to download python for windows into your respective
operating system.
Note: Unlike Java, Python does not need semicolons at the end of its statements.
CHAPTER 7
SYSTEM REQUIREMENTS
Software Requirements
The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.
The appropriation of requirements and implementation constraints gives the general overview
of the project in regard to what the areas of strength and deficit are and how to tackle them.
Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.
CHAPTER 8
FUNCTIONAL REQUIREMENTS
OUTPUT DESIGN
Outputs from computer systems are required primarily to communicate the results of
processing to users. They are also used to provide a permanent copy of the results for later
consultation. The various types of outputs in general are:
Output Definition
Input Design
Input design is a part of overall system design. The main objective during the input design is
as given below:
Input Stages
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
Input Types
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
Input Media
At this stage choice has to be made about the input media. To conclude about the input media
consideration has to be given to;
Type of input
Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Storage and handling requirements
Security
Easy to use
Portability
Keeping in view the above description of the input types and input media, it can be said that
most of the inputs are internal and interactive. As input data is keyed in directly by the user,
the keyboard can be considered the most suitable input device.
Error Avoidance
At this stage, care is taken to ensure that input data remains accurate from the stage at which
it is recorded up to the stage at which it is accepted by the system. This can be achieved only
by careful control each time the data is handled.
Error Detection
Even though every effort is made to avoid the occurrence of errors, a small proportion of
errors is still likely to occur. These errors can be discovered by using validations to check
the input data.
Data Validation
Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever invalid data is keyed
in, the system immediately prompts the user, who must key in the data again; the system
accepts the data only if it is correct. Validations have been included where necessary.
The system is designed to be user-friendly; in other words, the system has been designed to
communicate effectively with the user. The system has been designed with
popup menus.
It is essential to consult the system users and discuss their needs while designing the user
interface:
User-initiated interfaces: the user is in charge, controlling the progress of the
user/computer dialogue. In the computer-initiated interface, the computer selects the
next stage in the interaction.
Computer initiated interfaces
In computer-initiated interfaces, the computer guides the progress of the user/computer
dialogue. Information is displayed and, based on the user's response, the computer takes
action or displays further information.
Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
Forms oriented interface: The user calls up an image of the form to his/her screen and
fills in the form. The forms-oriented interface is chosen for this system as it best suits its data-entry tasks.
Computer-Initiated Interfaces
Right from the start the system is going to be menu driven, the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the users to data entry form where the user can key in the data.
The design of error messages is an important part of the user interface design. As the user is
bound to commit some error or other while using the system, the system should be designed
to be helpful, providing the user with information regarding the error he/she has committed.
This application must be able to produce output at different modules for different inputs.
Performance Requirements
The requirement specification for any system can be broadly stated as given below:
SOURCE CODE
import numpy as np
import pandas as pd
import re, os, joblib
import matplotlib.pyplot as plt
import seaborn as sns
import wordcloud
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download('wordnet'); nltk.download('stopwords')
# Imports needed by the later cells (missing from the original listing)
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
# In[2]:
df=pd.read_csv('train_E6oV3lV.csv')
df
# In[3]:
df.head()
# In[4]:
random_sample=df.sample(n=100, random_state=1)
random_sample=random_sample.drop('label',axis=1)
random_sample_test_path="test.csv"
random_sample.to_csv(random_sample_test_path,index=False)
# In[5]:
df.isnull().sum()
# In[6]:
df.shape
# In[7]:
df.columns
# In[8]:
df.nunique()
# In[9]:
df.describe()
# In[10]:
df.info()
# In[11]:
tk = TweetTokenizer()
ps = PorterStemmer()
lem = WordNetLemmatizer()
def cleaning(s):
    # Normalize a raw tweet: lowercase, drop digits, stray symbols and extra whitespace, then re-tokenize
    s = str(s)
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W,\s', ' ', s)
    s = re.sub(r"\d+", "", s)
    s = re.sub(r'\s+', ' ', s)
    s = re.sub('[!@#$_ðâJó¾ãº½çæåä¹³ìà¹ëêµéà³à²ùø]', '', s)  # strip punctuation and stray mis-encoded characters
    s = s.replace("co", "")
    s = s.replace("https", "")
    s = s.replace(",", "")
    s = s.replace("[\w*", " ")
    s = tk.tokenize(s)
    s = ' '.join(s)
    return s
# In[12]:
# The cell that builds the cleaned 'content' column is not shown in this listing;
# it presumably applies cleaning() to the raw tweet text, e.g.:
# df['content'] = df['tweet'].apply(cleaning)
df['content']
# all_words is built in an omitted cell (typically by joining all cleaned tweet text)
all_words
# In[14]:
# Count plot of the label column; category_order came from an omitted cell, and the
# bar-annotation call below is a reconstruction of the truncated original
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x='label')
plt.xlabel('label')
plt.ylabel('Count')
plt.title('Count of label')
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', xytext=(0, 3),
                textcoords='offset points')
plt.show()
# In[15]:
normal_words
# In[16]:
def hashtag_extract(x):
    # Collect hashtags (terms following '#') from every tweet in x
    hashtags = []
    for i in x:
        ht = re.findall(r'#(\w+)', i)
        hashtags.append(ht)
    return hashtags
# In[17]:
# The cells defining these lists are omitted; they presumably extract hashtags per class, e.g.:
# HT_regular = hashtag_extract(df['content'][df['label'] == 0])
# HT_negative = hashtag_extract(df['content'][df['label'] == 1])
HT_regular
# In[18]:
HT_negative
# In[19]:
HT_regular = sum(HT_regular,[])
HT_regular
# In[20]:
HT_negative = sum(HT_negative,[])
HT_negative
# In[21]:
a = nltk.FreqDist(HT_regular)
# In[22]:
# 'd' is built from the frequency distribution in an omitted step, e.g.:
d = pd.DataFrame({'Hashtag': list(a.keys()), 'Count': list(a.values())})
d = d.nlargest(columns="Count", n=10)
plt.figure(figsize=(16, 5))
ax = sns.barplot(data=d, x="Hashtag", y="Count")  # reconstructed plotting call
ax.set(ylabel='Count')
plt.show()
# In[23]:
b = nltk.FreqDist(HT_negative)
# In[24]:
# Same reconstruction as above for the negative hashtags
e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())}).nlargest(columns="Count", n=10)
plt.figure(figsize=(16, 5))
ax = sns.barplot(data=e, x="Hashtag", y="Count")
ax.set(ylabel='Count')
plt.show()
# In[25]:
vectorizer = CountVectorizer(max_features=3000, stop_words=stopwords.words('english')).fit(df['content'])
vectorizer
# In[26]:
X=vectorizer.transform(df['content']).toarray()
# In[27]:
y=df.iloc[:,1]
# In[28]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
# In[29]:
X_train.shape
# In[30]:
y_train.shape
# In[31]:
labels=['Negative', 'Positive']
# In[32]:
precision = []
recall = []
fscore = []
accuracy = []
# In[33]:
def performance_evaluation(testY, predict):
    # Reconstructed evaluation helper: the original cell was truncated in this listing
    testY = testY.astype('int')
    predict = predict.astype('int')
    a = accuracy_score(testY, predict) * 100
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    accuracy.append(a); precision.append(p); recall.append(r); fscore.append(f)
    report = classification_report(predict, testY, target_names=labels)
    print(report)
    # Confusion matrix heatmap
    ax = sns.heatmap(confusion_matrix(testY, predict), annot=True, fmt='d',
                     xticklabels=labels, yticklabels=labels)
    ax.set_ylim([0, len(labels)])
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()
# In[34]:
# ExtraTreesClassifier: load a saved model if available, otherwise train and persist it
if os.path.exists('ExtraTreesClassifier.pkl'):
    ETC = joblib.load('ExtraTreesClassifier.pkl')
    predict = ETC.predict(X_test)
else:
    ETC = ExtraTreesClassifier()
    ETC.fit(X_train, y_train)
    joblib.dump(ETC, 'ExtraTreesClassifier.pkl')
    predict = ETC.predict(X_test)
performance_evaluation(y_test, predict)  # evaluation call reconstructed; not shown in the original listing

# GradientBoostingClassifier: same load-or-train pattern
if os.path.exists('GradientBoostingClassifier.pkl'):
    GBC = joblib.load('GradientBoostingClassifier.pkl')
    predict = GBC.predict(X_test)
else:
    GBC = GradientBoostingClassifier()
    GBC.fit(X_train, y_train)
    joblib.dump(GBC, 'GradientBoostingClassifier.pkl')
    predict = GBC.predict(X_test)
performance_evaluation(y_test, predict)  # evaluation call reconstructed; not shown in the original listing
# In[36]:
test=pd.read_csv('test.csv')
test
# In[37]:
tk = TweetTokenizer()
ps = PorterStemmer()
lem = WordNetLemmatizer()
def cleaning(s):
    # Same cleaning routine as above, re-declared for the held-out test file
    s = str(s)
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W,\s', ' ', s)
    s = re.sub(r"\d+", "", s)
    s = re.sub(r'\s+', ' ', s)
    s = re.sub('[!@#$_ðâJó¾ãº½çæåä¹³ìà¹ëêµéà³à²ùø]', '', s)
    s = s.replace("co", "")
    s = s.replace("https", "")
    s = s.replace(",", "")
    s = s.replace("[\w*", " ")
    s = tk.tokenize(s)
    s = ' '.join(s)
    return s
# In[38]:
# As with the training data, the cleaned column is presumably created in an omitted step, e.g.:
# test['content'] = test['tweet'].apply(cleaning)
test['content']
# In[43]:
# Note: a fresh StandardScaler is fitted on the unseen data here; ideally the scaler fitted
# on the training features would be reused with transform() only.
tran = StandardScaler()
X_test = vectorizer.transform(test['content']).toarray()
X_test = tran.fit_transform(X_test)
# In[45]:
test['Predict_label'] = GBC.predict(X_test)
test['Predict_label']
test
CHAPTER 10
RESULTS AND DISCUSSION
Data Loading and Exploration: Loaded training and test datasets using Pandas. Checked
the shape of the datasets and displayed the first few rows. Checked for missing values in both
datasets.
Exploratory Data Analysis (EDA): Explored negative and positive comments in the training
set. Visualized the distribution of tweet lengths for both training and test sets. Created a bar
plot to show the distribution of sentiment labels. Analyzed the variation in tweet length with
respect to sentiment labels.
Text Tokenization and Labeling: Tokenized words in the training set using NLTK. Utilized
Gensim to create a Word2Vec model. Labeled each tweet for further processing.
Model Building: Applied various machine learning models including Random Forest,
Logistic Regression, Decision Tree, Support Vector Machine (SVM), and XGBoost.
Standardized the data using StandardScaler.
Model Evaluation:
Evaluated models on training and validation sets using accuracy, F1 score, and confusion
matrices. Provided results for each model, including training accuracy, validation accuracy,
F1 score, and confusion matrix.
The provided dataset appears to be related to Twitter data, containing information such as
tweet IDs, labels, and the content of the tweets. Here's a detailed description:
It's important to note that without additional context, the specific criteria for labeling tweets
as dysfunctional or the context behind the labeling are not clear. The dataset comprises a
diverse range of tweets, suggesting it may be used for sentiment analysis, classification, or
related natural language processing tasks.
Figure 1: Data frame used for Twitter data analysis. This figure likely represents the structure
and content of the data frame used for Twitter data analysis. It might include information such
as tweet text, sentiment labels, and other relevant features.
Figure 2: Count plot of the target column. This figure is a visual representation, likely in the
form of a bar chart, showing the distribution or count of the different classes in the target
column. In the context of Twitter data analysis, the target column might represent sentiment
labels such as positive, negative, or neutral.
Figure 3 shows the following evaluation metrics:
Accuracy: 92.87%. This indicates that the model correctly predicted 92.87% of the data
points in the test set. It is a general measure of overall performance.
Precision: 50.0% This metric measures the proportion of positive predictions that were
actually correct. In other words, out of all the instances the model predicted as positive, 50%
were truly positive.
Recall: 46.43% This metric measures the proportion of actual positive instances that were
correctly predicted. It indicates how well the model was able to identify all the positive cases.
F1-score: 48.15% This metric combines precision and recall into a single value. It provides a
balanced measure of both metrics, considering both the model's ability to correctly predict
positive instances and its ability to avoid false positives.
The confusion matrix for the ExtraTreesClassifier model shows the distribution of predicted
and actual classes. The diagonal elements represent correct predictions (e.g., 8905 Negative
instances were correctly predicted as Negative). Off-diagonal elements indicate incorrect
predictions (e.g., 684 Positive instances were incorrectly predicted as Negative). The color
intensity of each cell corresponds to the number of instances in that category, with darker
colors indicating larger quantities.
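To make the relationship between these confusion-matrix counts and the reported metrics explicit, the sketch below computes accuracy, precision, recall, and F1 from generic counts; the values used here are placeholders, not the project's exact figures.
# Placeholder confusion-matrix counts for the positive class
tp, tn, fp, fn = 900, 8000, 450, 200

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}')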
Accuracy: This is the overall correctness of the model. It's calculated as the number of
correct predictions divided by the total number of predictions. In this case, the accuracy is
94.69%, which means the model correctly predicted 94.69% of the samples.
Precision: This measures how many of the positive predictions made by the model were
actually correct. It's calculated as the number of true positives divided by the sum of true
positives and false positives. In this case, the precision is 65.22%, which means that out of all
the samples the model predicted as positive, only 65.22% were truly positive.
Recall: This measures how many of the actual positive samples the model correctly
identified. It's calculated as the number of true positives divided by the sum of true positives
and false negatives. In this case, the recall is 90.18%, which means that the model correctly
identified 90.18% of the positive samples.
F1-score: This is a harmonic mean of precision and recall. It provides a balance between
precision and recall. A higher F1-score indicates better overall performance. In this case, the
F1-score is 71.27%, which is a good balance between precision and recall.
Classification report: This table provides a more detailed breakdown of the model's
performance for each class (negative and positive). It includes precision, recall, F1-score, and
support for each class.
Overall, the model achieved high accuracy (94.69%) and high recall (90.18%), but its
precision (65.22%) was comparatively limited. The F1-score of 71.27% indicates a
reasonable balance between precision and recall, and the classification report provides
further insight into the model's performance for each class.
Numerical Explanation:
TP: 8869, TN: 36, FP: 473, FN: 211
11.1 Conclusion
With the advancement of web technology and its growth, there is a huge volume of data
present on the web for internet users and a lot of data is generated too. The Internet has
become a platform for online learning, exchanging ideas and sharing opinions. Social
networking sites like Twitter, Facebook, Google+ are rapidly gaining popularity as they allow
people to share and express their views about topics, have discussions with different
communities, or post messages across the world. Therefore, this project implemented
sentiment analysis of a Twitter dataset for opinion mining using NLP, AI, and lexicon-based
approaches, together with evaluation metrics. Using machine learning algorithms such as
Naive Bayes and logistic regression, this work carried out research on Twitter data streams.
In addition, this project has also discussed the general challenges and applications of
sentiment analysis on Twitter.
The increasing prevalence of social media platforms, particularly Twitter, has significantly
heightened the importance of real-time data analysis. Twitter data has emerged as a crucial
source for gaining valuable insights into public opinions, sentiments, and ongoing trends.
This surge in significance has been particularly notable in recent years with the exponential
growth of social media platforms. Researchers and businesses alike have come to
acknowledge the vast potential of Twitter data, leveraging it for understanding public
sentiment, predicting trends, and conducting insightful market research. As the utilization of
Twitter data became more widespread, various methods and algorithms were developed to
efficiently process and classify this unique form of social media content. Traditional systems
often relied on basic text processing techniques, including tokenization, stemming, and stop-
word removal. While these techniques have proven useful, they may not suffice for handling
the distinctive characteristics of Twitter data, such as hashtags, mentions, and emoticons. The
unstructured and noisy nature of Twitter data further complicates matters, presenting
challenges for effective analysis. Consequently, there has been a growing recognition of the
need for a comprehensive pre-processing approach to address these challenges.
Businesses, researchers, and organizations have increasingly come to rely on Twitter data for
tasks such as sentiment analysis, brand monitoring, and trend prediction. To extract
meaningful insights from this rich source of information, effective pre-processing is essential.
This involves the removal of irrelevant information and noise while preserving the context
and nuances inherent in social media language. Therefore, this research advocates for an
efficient classification of Twitter data through the application of machine learning algorithms,
supported by a comprehensive pre-processing approach.
REFERENCES
[1] Neogi, A. S., Garg, K. A., Mishra, R. K., & Dwivedi, Y. K. (2021). Sentiment
analysis and classification of Indian farmers’ protest using twitter data.
International Journal of Information Management Data Insights, 1(2),
100019. https://doi.org/10.1016/j.jjimei.2021.100019.
[2] Behl, S., Rao, A., Aggarwal, S., Chadha, S., & Pannu, H. (2021). Twitter for
disaster relief through sentiment analysis for COVID-19 and natural hazard
crises. International Journal of Disaster Risk Reduction, 55, 102101.
https://doi.org/10.1016/j.ijdrr.2021.102101.
[3] Tan, K. L., Lee, C. P., Lim, K. M., & Anbananthen, K. S. M. (2022). Sentiment
Analysis With Ensemble Hybrid Deep Learning Model. IEEE Access, 10,
103694–103704. https://doi.org/10.1109/access.2022.3210182.
[4] Lu, Q., Zhu, Z., Zhang, D., Wu, W., & Guo, Q. (2020). Interactive Rule
Attention Network for Aspect-Level Sentiment Analysis. IEEE Access, 8, 52505–52516.
https://doi.org/10.1109/ACCESS.2020.2981139.
[5] Mehta, K & Panda, S. (2019). A Comparative Analysis Of Sentiment analysis
In Big Data. International Journal of Computer Science and Information
Security, 17, 31-40.
[6] He, J., Wumaier, A., Kadeer, Z., Sun, W., Xin, X., & Zheng, L. (2022). A Local
and Global Context Focus Multilingual Learning Model for Aspect-Based
Sentiment Analysis. IEEE Access, 10, 84135–84146.
https://doi.org/10.1109/access.2022.3197218.
[7] E. Psomakelis, K. Tserpes, D. Anagnostopoulos, and T. Varvarigou, “Comparing
methods for twitter sentiment analysis,” KDIR 2014 -Proceedings of the Int. Conf.
on Knowledge Discovery and Information Retrieval, pp. 225-232, 2014.
[8] Qurat Tul Ain, Mubashir Ali, Amna Riaz, Amna Noureen, Muhammad Kamran,
Babar Hayat, and A. Rehman, "Sentiment Analysis Using Deep Learning
Techniques: A Review," International Journal of Advanced Computer Science
and Applications, Vol. 8, No. 6, 2017.
[9] A. Lopez-Chau, D. Valle-Cruz, and R. Sandoval-Almazán, "Sentiment
Analysis of Twitter Data Through Machine Learning Techniques," Software
Engineering in the Era of Cloud Computing, pp. 185–209, 2020. Publisher:
Springer, Cham.
[10] P. Kalaivani and D. Dinesh, "Machine Learning Approach to
Analyze Classification Result for Twitter Sentiment," in 2020 International
Conference on Smart Electronics and Communication (ICOSEC), (Trichy, India),
pp. 107–112, IEEE, Sept. 2020.
[11] A. B. S, R. D. B, R. K. M, and N. M, "Real Time Twitter Sentiment
Analysis using Natural Language Processing," International Journal of
Engineering Research & Technology, vol. 9, July 2020; J. Ranganathan and
A. Tzacheva, "Emotion Mining in Social Media Data," Procedia Computer
Science, vol. 159, pp. 58–66, Jan. 2019; S. Xiong, H. Lv, W. Zhao, and D. Ji,
"Towards Twitter sentiment classification by multi-level sentiment-enriched
word embeddings," Neurocomputing, vol. 275, pp. 2459–2466, Jan. 2018.
[12] S. Aloufi and A. E. Saddik, "Sentiment Identification in Football-Specific
Tweets," in IEEE Access, vol. 6, pp. 78609-78621, 2018, doi:
10.1109/ACCESS.2018.2885117.
[13] S. A. El Rahman, F. A. AlOtaibi and W. A. AlShehri, "Sentiment Analysis of
Twitter Data," 2019 International Conference on Computer and Information Sciences
(ICCIS), 2019, pp. 1-4, doi: 10.1109/ICCISci.2019.8716464.
[14] Arora, M., Kansal, V. Character level embedding with deep convolutional
neural network for text normalization of unstructured data for Twitter sentiment
analysis. Soc. Netw. Anal. Min. 9, 12 (2019). https://doi.org/10.1007/s13278-019-
0557-y
[15] L. Wang, J. Niu and S. Yu, "SentiDiff: Combining Textual Information and
Sentiment Diffusion Patterns for Twitter Sentiment Analysis," in IEEE Transactions
on Knowledge and Data Engineering, vol. 32, no. 10, pp. 2026-2039, 1 Oct. 2020,
doi: 10.1109/TKDE.2019.2913641.