A Comparative Analysis of Social Networking
Rishabh Tariyal
2019
I certify that this dissertation which I now submit for examination for the award of
MSc in Computing (Advanced Software Development), is entirely my own work and has
not been taken from the work of others save and to the extent that such work has been
cited and acknowledged within the text of my work.
This dissertation was prepared according to the regulations for postgraduate study of
the Dublin Institute of Technology and has not been submitted in whole or part for an
award in any other Institute or University.
The work reported on in this dissertation conforms to the principles and requirements
of the Institute’s guidelines for ethics in research.
Signed: _________________________________
TABLE OF TABLES
1. INTRODUCTION
1.1. Project Background
1.3. Project Aims and Objectives
1.4. Orthogonal Issues
1.5. Thesis Roadmap
3. SENTIMENT ANALYSIS
3.1 Introduction
3.2 Opinion Retrieval
3.3 Sentiment Analysis and Natural Language Processing
3.4 Levels of Sentiment Analysis
3.5 Word Embedding
3.5.1 Bag-of-words (BOW) Model
3.5.2 Term Frequency–Inverse Document Frequency (TF-IDF) Model
3.5.3 Word2vec
3.6 Key Challenges
3.7 Conclusions
4. EXPERIMENT DESIGN
4.1 Introduction
4.2. Development Methodology
4.3. Dataset Acquisition
4.3.1 The General Data Protection Regulation (GDPR)
4.3.2 Alternative Approach
4.4 Experiment Architecture
4.5 Experiment Stages
4.5.1 Data Cleansing and Pre-processing
4.5.2 Morphological Analysis
4.5.3 Syntactic Analysis
4.5.4 Latent Semantic Analysis
4.5.5 Preserving the Scores Matrix
4.5.6 Collaborative Filtering
4.6 Technology Choices
4.6.2 Technique Choices
4.7 Conclusions
5. EXPERIMENT DEPLOYMENT
5.1. Introduction
5.2. The CRISP-DM Cycle
5.2.1 Data Cleaning and Pre-processing
5.2.2 Morphological Analysis
5.2.3 Syntactic Analysis
5.2.4 Latent Semantic Analysis
5.2.5 Preserving the Scores Matrix
5.3. Collaborative Filtering
5.3.1 Implementation Process
5.4 Challenges Encountered
5.5. Design
5.6. Recruiting Volunteers
5.7. Questionnaires and Interviews Deployment
5.7.1 Online Survey
5.7.2 Interviews
5.8. Conclusions
6. EVALUATION
6.1. Introduction
6.2. Importance of Evaluation
6.3. Quantitative Evaluation
6.3.1 Area Under the Curve (AUC)
6.3.2 Classification Matrix
6.4. Experimental Outcomes
6.5 Survey and Interviews Evaluation
6.5.1 Survey Evaluation
6.5.2 Interview Evaluation
6.6. Conclusions
8. BIBLIOGRAPHY
Of the top 10 international tourist source countries, typically five are non-native English-speaking (World Bank, 2017). Online portals like TripAdvisor provide users with the ability to read or write reviews in the English language. This creates a significant linguistic barrier for non-native English speakers trying to convey their thoughts properly. At times the words chosen can be misleading and fail to coherently capture the reviewer's actual thoughts. Adding the ability to give numerical scores to subcategories (such as room, location, cleanliness) on a scale of 0 to 5 appears to be an effective way to help in this matter. Number systems are universal and remove the fuzziness of English grammar. A smart case-based recommender system can be developed that will produce a template review correlating with the ratings a user gives.
Case-Based Reasoning (CBR) is an automated reasoning and decision-making process whereby a new problem is solved through the experience accumulated in solving previous cases. Richter and Weber (2008) stated that the term "case" is basically an experience of a solved problem, the term "based" implies that the reasoning is based on cases, and the term "reasoning" means that the approach is intended to draw conclusions using cases. According to Aamodt and Plaza (1994), CBR is structured as a four-step process, sometimes referred to as the 4R's:
● retrieval,
● reuse,
● revision and
● retention
Retrieval is the process of finding a case that is similar to the current situation or state. Reuse is when we retrieve a case and propose it as a valid action to apply in the current state. Revision is when we evaluate, through a series of metrics or simulations, how well the proposed action will perform. Retention is applied after a successful execution by storing the result of this experience in memory.
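The cycle can be summarised in code. Below is a minimal sketch, assuming a simple case base of problem/solution pairs and caller-supplied similarity and evaluation functions; the Case and CaseBase names are illustrative, not part of any CBR library.

from dataclasses import dataclass

@dataclass
class Case:
    problem: dict   # state of the world when the case occurred
    solution: str   # the derived solution

class CaseBase:
    def __init__(self):
        self.cases = []

    def retrieve(self, problem, similarity):
        # Retrieval: find the stored case most similar to the new problem.
        return max(self.cases, key=lambda c: similarity(c.problem, problem))

    def reuse(self, case):
        # Reuse: propose the retrieved solution for the current state.
        return case.solution

    def revise(self, solution, evaluate):
        # Revision: check how well the proposed solution performs.
        return solution if evaluate(solution) else None

    def retain(self, problem, solution):
        # Retention: store the newly solved case for future retrieval.
        self.cases.append(Case(problem, solution))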
A fact that is often ignored is that most users are not professional journalists and thus can miscommunicate their actual opinion of an amenity. In general, sentiment and subjectivity are quite context-sensitive and, at a coarser granularity, quite domain dependent (Pang and Lee, 2008, p. 19). Even the exact same expression can indicate different sentiment in different domains and different contexts. For instance, one reviewer may write "The rooms were nice" and give the rooms a 5-star rating, while another reviewer may write "Rooms were the nicest" and also give 5 stars. The problem with the first review is that the reviewer gave the best possible rating to the rooms but did not use the superlative degree when expressing their feelings in words.
Moreover, sentiment analysis typically generates a Boolean value for a given amenity, which is hard to map to an actual rating scale, usually a discrete set of values ranging from zero to five.
The objective of this research is to show that the common solution for recommending hotel reviews given sub-ratings (location, hospitality, price, food, and room) does not produce relevant reviews, and that using sub-rating scores together with sentiment analysis can effectively improve recommendations. Two CBR-based hotel review recommender systems will be built. The first system will use Natural Language Processing (NLP) in the retrieval phase to extract sub-rating scores from text, and the second system will use the sub-rating scores submitted by users along with the sub-rating scores extracted using NLP. Each system will recommend different review text from the sub-rating scores.
An online survey will be conducted in which people must choose whichever of the generated texts they find most closely correlated with a given set of sub-ratings. The system whose text is selected the greater number of times will be deemed to have the better retrieval process. Quantitative research methodologies will be used to test the hypothesis. Each time a user chooses a text, the choice will be stored in persistent data storage. The ratio of the number of times a text was chosen to the total number of times it was presented will be compared between the two systems.
The focus of the project is to generate semantically correct sentences rather than grammatically correct sentences. The project is built using Python as the programming language. The project neither explores the machine learning capabilities of other programming languages nor compares models built using other programming languages.
There are multiple machine learning techniques available, but only the best suited is applied in this dissertation. Since there is no hard line distinguishing which technique should be applied under a given circumstance, it becomes hard to choose one. A justification is provided in the "Technology Choices" section of the Experiment Design chapter, where a technique is chosen based on the characteristics of the dataset and the needs of the project, and experiments are conducted to confirm the choice.
The raw dataset is a 68.7-megabyte spreadsheet containing 106,266 rows. The dataset and the output generated by the proposed algorithm are in English only. The models are trained using this dataset alone, which was divided into an 80% training set and a 20% test set.
The weights of each aspect of a text review are considered equal, with scores ranging from 0 to 5 as whole numbers. Reviews in the corpus that are missing a rating for any aspect will be discarded.
1.5. Thesis Roadmap
The rest of this dissertation is presented as follows:
Chapter Five; Experiment Deployment: This chapter walks through the experimental
setup, how it was created, how the volunteers were recruited and how the response was
collected and analysed.
Chapter Six; Evaluation: This chapter details the results obtained from the experiment. The results obtained are both qualitative and quantitative. It also draws a comparison between classification algorithms.
Chapter Seven; Conclusions and Future Work: This chapter summarizes the results and
proposes future work.
2. RECOMMENDER SYSTEMS AND CASE-BASED REASONING
2.1 Introduction
This chapter covers the available Artificial Intelligence techniques for building a recommender system. First, it explains Recommender Systems (RS) and the range of approaches associated with them, and draws a comparison between the different approaches. Secondly, it covers Case-Based Reasoning (CBR) systems, how they relate to RS and how they can be implemented. This chapter also identifies the RS technique best suited to this research and points out where exactly we want to tweak the existing approaches.
● Collaborative filtering
● User/Memory-based
● Item/Model based
2.2.1 Collaborative Filtering (CF)
User-based similarity computes the relevance between users by treating their ratings as two vectors. In user-based CF (UBCF), once the similarity is calculated, it is used to build neighbourhoods of the current target user. Since the similarity measure plays a significant role in improving the accuracy of prediction algorithms, it can be effectively used to balance the significance of ratings (Gong, Y. and Liu, X., 2001). Many similarity algorithms have been used in CF recommendation, such as cosine vector similarity, Pearson correlation, Euclidean distance similarity, and the Tanimoto coefficient (Sarwar, Karypis, Konstan, and Riedl, 2001).
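As a concrete illustration, the two most common measures can be computed over a pair of rating vectors as follows. This is a minimal sketch using NumPy, with made-up rating values rather than data from the experiment.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two users' rating vectors.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_similarity(u, v):
    # Pearson correlation is cosine similarity of mean-centred ratings.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return cosine_similarity(u - u.mean(), v - v.mean())

# Two users' ratings over the same five items:
print(cosine_similarity([5, 4, 0, 3, 1], [4, 5, 1, 3, 0]))   # ~0.96
print(pearson_similarity([5, 4, 0, 3, 1], [4, 5, 1, 3, 0]))  # ~0.88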
2.2.2 User/Memory-based
This approach is based on identifying the user's neighbours (i.e. the most similar users) and then predicting a rating based on the ratings of those neighbours. A memory-based CF approach, or nearest-neighbour approach (Jin, Chai, and Si, 2004), is said to implement the "Word of Mouth" phenomenon by maintaining a database of all users' known preferences for all items and, for each prediction, performing some computation across the entire database. It predicts a user's interest in an item based on ratings from similar user profiles. The prediction of a specific item (belonging to a specific user) is done by sorting the row vectors (user profiles) by their dissimilarity to the specific user. In this way, ratings from users that are more similar will contribute more to the rating prediction. Memory-based CF methods have reached
a high level of popularity because they are simple and intuitive on a conceptual level
while avoiding the complications of a potentially expensive model-building stage. At
the same time, they are sufficient to solve many real-world problems. Yet there are
some shortcomings (Hofmann, 2004):
● Sparsity - In practice, many memory-based CF systems are used to evaluate
large sets of items. In these systems, even active users may have consumed
well under 1% of the items. Accordingly, a memory-based CF system may be
unable to make any item recommendation for a user. As a result, the
recommendation accuracy can be poor.
● Scalability - The algorithms used by most memory-based CF systems require
computations that grow according to the number of users and items. Because of
this, a typical memory-based CF system with millions of users and items will
suffer from serious scalability problems.
● Learning - Since no explicit statistical model is constructed, nothing is learned
from the available user profile and no general insight is gained.
2.2.3 Item/Model-based
This approach is based on identifying the items’ neighbours (i.e. the most similar
items) and then predicting a rating, based on the ratings of the neighbours. The
motivation behind model-based CF is that by compiling a model that reflects user
preferences, some of the problems related to memory-based CF might be solved. This
can be done by first compiling the complete dataset into a descriptive model of users,
items, and ratings. This model can be built off-line over several hours or days.
Recommendations can then be computed by consulting the model. Instead of using the
similarity of users to predict the rating of an item, the model-based approach uses the
similarity of items. Prediction is done by averaging the ratings of similar items rated by
the user (Sarwar, Karypis, Konstan, and Riedl, 2001). Sorting is done according to dissimilarity, as in memory-based CF. The difference is that the column vectors (items) are sorted around the specific item rather than, as in memory-based CF, the row vectors being sorted around the specific user. Sorting the column vectors ensures that ratings from more similar items are weighted more strongly. Early research on this
approach evaluated two probabilistic models, Bayesian clustering and Bayesian
networks (Breese, Heckerman, and Kadie, 1998). In the Bayesian clustering model,
users with similar preferences are clustered together into classes. Given the user’s class
membership, the ratings are assumed to be independent. The number of classes and the
model parameters are learned from the dataset. In the Bayesian network model, each
node in the network corresponds to an item in the dataset. The state of each node
corresponds to the possible rating values for each item. Both the structure of the
network, which encodes the dependencies between items, and the conditional
probabilities, are learned from the dataset.
In general, the CBR approach can be applied to problem domains that are only
partially understood, and can provide solutions when no algorithmic or rule-based
methods are available. The main advantages of CBR over rule-based models include
the following (Watson and Marir, 1994):
● CBR systems can be built where a model of the problem does not exist;
● Implementation is commonly made easy, as it is a matter of simply identifying
relevant case features;
● CBR systems can be rolled out with only a partial case-base, as it will be
continually growing due to its cyclic nature;
● CBR systems are highly efficient by avoiding the need to infer answers from
first principles each time;
● Retrieved cases can be used to provide satisfactory explanations as to why the
given solution is produced; and
● The case-based nature of the learning system makes maintenance easier.
The CBR model has been traditionally presented as a continuous cycle of retrieval,
reuse, revision, and retaining of cases, noted as the mnemonic of “the four REs”
(Aamodt and Plaza, 1994), see Figure 2 below.
2.3.1 Representation
● the problem describing the state of the world when the case occurred;
● the solution describing the derived solution to the problem; and
● the outcome describing the state of the world after the case occurred.
One can make different combinations of these three types of information in a case
representation scheme: cases comprising problems and solutions can be used to derive
solutions to new problems, while cases comprising information about problems and
their outcomes can be used to evaluate and make predictions about new problems.
Theoretically, all representational formalisms encountered in artificial intelligence
literature can be used as a basis of CBR case representation, including frames,
propositional logic, rule-based systems, and networks. Case collection in CBR is an
incremental process. That is, due to the cyclic nature of CBR, in which the case-base is
repeatedly enlarged with new cases as they are encountered, the system can be
deployed initially with a partial case-base. However, there are some factors to consider
as to what kind of cases should be included in the initial case-base.
2.3.2 Retrieval
In this process, one or more cases similar to the current case are retrieved using some
matching algorithm from the database of previously solved cases. Retrieval is one of
the most important research areas in CBR. An issue highly related to case retrieval is the indexing of cases, whereby cases are assigned indices to facilitate their retrieval.
There are both manual and automated methods of case indexing. Some common
methods of indexing include:
Case retrieval in CBR differs from database searches that look for a specific value among a given set of records: since in general no existing case will exactly match a given new problem, retrieval in CBR typically involves partial matches. A common retrieval method widely used in CBR is nearest-neighbour calculation, where similarity between cases is computed as a weighted sum of their feature similarities. A disadvantage of nearest-neighbour approaches is that retrieval time scales linearly with the number of cases in the case-base. Other retrieval methods are based on induction, where the features most useful for discriminating cases are discovered by a learning algorithm, producing a decision tree structure to parse the case-base.
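As an illustration of nearest-neighbour retrieval, the sketch below scores cases by a weighted sum of per-feature similarities and returns the top k matches. It assumes numeric features on a shared 0-5 scale and caller-supplied weights; the function names are illustrative only.

def weighted_similarity(case_a, case_b, weights):
    # Weighted sum of per-feature similarities; features are numeric on a 0-5 scale.
    total = sum(weights.values())
    return sum(w * (1 - abs(case_a[f] - case_b[f]) / 5)
               for f, w in weights.items()) / total

def retrieve(case_base, query, weights, k=1):
    # Rank stored cases by similarity to the query (partial matching), keep the top k.
    ranked = sorted(case_base,
                    key=lambda c: weighted_similarity(c, query, weights),
                    reverse=True)
    return ranked[:k]

Note that retrieval time here is linear in the size of the case-base, which is exactly the scaling disadvantage noted above.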
2.3.3 Adaptation
After the retrieval stage, the solution of the matching case from the case-base should
be adapted to address the new case. While issues such as case representation, similarity
computations, and retrieval have been amply addressed in CBR literature, adaptation
has been considered the most difficult step and remains somewhat under-addressed and
controversial. In large scale applications, while it is commonly easy to accumulate
enough cases, the formulation of the required adaptation scheme is often difficult.
Therefore, it is common to use very simplistic adaptation rules, or to bypass adaptation entirely, and to make up for this deficiency with a very comprehensive case-base that ensures the availability of a similar case for every problem instance. Adaptation
models in CBR can be classified into several categories (Watson and Marir, 1994;
Kolodner and Leake, 1996):
2.3.4 Retaining
In the classical representation of the CBR cycle, retaining is the final step after an
acceptable solution to the new case has been produced by the system. The newly
solved case is added to the case-base of the system for making it available for future
retrieval, enabling the CBR system to learn from its problem-solving experience. The
retaining of new cases enlarges the coverage of problem space represented by the
case-base. In addition to the solution to the problem, the steps used in deriving that
solution can also be stored as a part of the case. For example, in a CBR system using
generative adaptation such as derivational analogy, derivational traces describing the
decision-making process for solving the problem can be retained for future use. A
related issue is the maintenance of the case-base, for preventing uncontrolled growth of
case-bases and addressing issues related to retrieval efficiency. Depending on the
design of the CBR system and the complexity of the used case representation, many
approaches are possible. For example, if a newly solved case is found to be highly
similar to a case already in the case-base, the new case may not be retained at all, or
the two cases may be merged.
2.4. Conclusions
Recommendations have become an integral part of the modern era. With the increasing number of options, people have become very particular about their preferences. We discussed how RS provide a novel approach to suggesting options to users according to their preferences. One way to approach this problem is via CBR, a problem-solving paradigm in Artificial Intelligence in which knowledge of previously solved cases is utilized to address a new problem.
The dataset used for this thesis is limited, with no prior user input, and contains five changing variables (sub-ratings). In this case the collaborative filtering approach fits best, as described in Table 2.1. While retrieving cases from the CBR system, we will merge the user-provided sub-rating score with the sentiment score of the review and examine the impact on the predictive power of the existing recommender system.
3. SENTIMENT ANALYSIS
3.1 Introduction
This chapter focuses on a subdomain of Artificial Intelligence and the techniques associated with it. It describes how a sentence written in a human-readable language can be broken down into smaller segments and how the sentiment of the sentence can be extracted. The shortcomings of the available techniques are also discussed in this chapter. The underlying idea is to gain enough knowledge about sentiment analysers to build a recommendation system; that recommendation system will then serve the purpose of answering the research question.
There are millions of online users who write and read content around the world daily, and the sentiments expressed online have become a significant factor in decision-making. An annual survey conducted by Dimensional Research explores how much trust is placed in online customer reviews compared to personal recommendations. These percentages vary across years, as per Neuhuttler, Woyke, and Ganz (2017):
2011: 74%
2012: 60%
2013: 57%
2014: 94%
Feature: "room"
Object-feature (explicit): "clean"
Opinion word: "good"
In this example the explicit feature is "clean", but sometimes object features must be inferred from the sentence; these are called implicit features. For example, in "The bathroom size is too small":
Feature: "bathroom"
Object-feature (implicit): "size"
Opinion word: "small"
The aspect level of sentiment analysis focuses on the opinions themselves instead of looking at the constructs of documents, such as paragraphs, sentences, and phrases. It is not sufficient to find the polarity of the opinions; identifying the opinion targets is also essential (Steinberger et al., 2014). Aspect-level sentiment analysis can be subdivided into two sub-tasks: aspect extraction and aspect sentiment classification (Liu, 2012). Aspect extraction can also be seen as an information extraction task, which aims to extract the aspects that opinions are about. For instance, in the sentence "the rooms of the hotel Empire are amazing but its service is too slow", "service" and "rooms" are aspects of the entity "Hotel Empire". The initial approach to extracting aspects is to find frequent nouns or noun phrases, which are defined as aspects. The text containing aspects is then classified as neutral, positive, or negative (Blair-Goldensohn et al., 2008).
However, issues can still arise in aspect-level sentiment analysis, as most studies are based on the assumption of pre-specified aspects given by keywords (Wang et al., 2011; Li et al., 2015). Ding et al. (2008) proposed a lexicon-based method for aspect analysis in which they assumed that aspects were known before the analysis. Liu (2012) points out that accuracy at the aspect level is still low because existing algorithms still cannot deal well with complex sentences. Thus aspect-level sentiment analysis is more cumbersome than both sentence-level and document-level classification.
Figure 3 Sentiment classification techniques (Medhat et al., 2014)
"Rooms" → 1
"were" → 2
"very" → 3
"clean" → 4
"small" → 5
Hence, the feature vector of each document has five dimensions based on the constructed dictionary. As demonstrated in the sentiment analysis discussion, word appearance is very informative (in contrast with word frequency in information retrieval). The challenge of natural language is that sometimes one word can express the author's attitude clearly while a sequence of words cannot. The main disadvantages of this model are that it does not keep track of the sequence of words and that all words have the same weight. For example, in the previous example the feature is "Rooms" and the opinion words are "clean" and "small", but this model distributes weight equally to the other parts of the sentence, thus suppressing the true opinion of the sentence.
TF-IDF fits this dissertation well, as it helps in identifying an opinion's polarity effectively from a small dataset.
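To make the weighting concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer on a toy corpus; the example sentences are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "the rooms were very clean",
    "rooms were small but clean",
    "the location was perfect",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)   # documents x terms sparse matrix
print(vectorizer.get_feature_names_out())   # learned vocabulary
print(tfidf.toarray().round(2))             # per-document TF-IDF weights

Words appearing in many reviews receive lower weights than rarer, more discriminative words, which is what makes the scheme useful for isolating opinionated terms.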
3.5.3 Word2vec
Word2vec is a prediction-based vector technique that assigns probabilities to words and has proved to be state of the art for tasks like word analogies and word similarities. It is a combination of two techniques, Continuous Bag-of-Words (CBOW) and Skip-gram, which work by creating a shallow neural network that maps terms to a target variable that is also a term. Both techniques learn weights which act as word vector representations.
Word2vec models can capture different sentiments of single words, and being probabilistic adds more versatility to the output. But this technique requires a huge amount of training data, which is not feasible for this dissertation.
3.7 Conclusions
Sentiment analysis involves research in several fields: Natural Language Processing, Computational Linguistics, and Text Analysis. It refers to the extraction of subjective information from raw data, often in text form. Other media types, such as images, audio, and video, can also contain subjective data, but these are less studied. Different kinds of sentiment exist across all media types.
The classification of user reviews is a difficult task because a review can contain irony, misspellings, emoticons, slang, and abbreviations, and it may also contain only a few
words. Let us consider the following review example: “Nice restaurant, lovelyyyyyy
meal, warm atmosphere, although the klutzy waiter spill vine on my dress: [“. This
review contains an elongated word (“lovelyyyyyy” ), a misspelled word (“vine” instead
of “wine”), emoticon (“:[“ ), a slang term (“klutzy” ), it also holds both positive and
negative opinions. All these factors complicate the process of classification in this
example. Various techniques exist that can be used for the sentiment analysis task. The
main approaches are Machine Learning (Kim, 2014) and Lexicon-Based (Hailong,
Wenyan, and Bo, 2014). The machine learning approach uses a dataset to train a classifier, which is then applied to determine the sentiment of a text. The lexicon-based method uses the Semantic Orientation (SO) of words or phrases to define whether a text is positive or negative.
Sentiment can refer to opinions or emotions; although these two types are related, there is an evident difference. In sentiment analysis based on opinions, a distinction is made between positive and negative opinions. Sentiment analysis based on emotions, on the other hand, is about distinguishing between different kinds of emotions, such as anger, happiness, and sadness. All this subtle information will help in building a sentiment analyser using machine learning algorithms, so it is very important to get the basics right in order to build a state-of-the-art text generator.
4. EXPERIMENT DESIGN
4.1 Introduction
This chapter starts by explaining the design methodology followed in the process of text analysis. Next, it describes how the dataset was acquired and how recent reforms in data protection laws must be kept in mind when dealing with third-party data sources. Later in the chapter, the experiment architecture is explained systematically. The final section explains why certain technologies were chosen out of the many available choices when creating the experiment.
The experiments will provide evidence to explore the research question. The purpose of this chapter is to structure the proposed solution in a well-defined manner so that other researchers can follow the procedure and generate similar results.
1. imperfectly understood problem domains where little knowledge exists for humans to develop effective algorithms;
2. domains that deal with huge amounts of data containing valuable implicit regularities to be discovered; or
3. domains where programs are subject to quick change and must adapt to changing conditions.
Machine learning deals with the issue of how to build computer programs that improve their performance at some task through experience. Machine learning work often requires iteration: we know the desired result in advance, but it can be hard to achieve in a single attempt, so the model must be refined again and again until a satisfactory result is obtained. There are various ways to implement data mining techniques depending on the purpose of the project and the type of dataset. This project implemented the Cross Industry Standard Process for Data Mining (CRISP-DM). As CRISP-DM follows the exact process required for sentiment analysis tasks, it fits the requirements of this research well. The CRISP-DM methodology is a series of hierarchical process models consisting of four levels of abstraction (Wirth and Hipp, 2000):
Sections 4.3, 4.4 and 4.5 contain a detailed implementation of Phase 1, Phase 2, and Phase 3 respectively. The experimental design and the results are covered later in Sections 6 and 7.
The authors of "Finding a Needle in a Haystack of Reviews" (Levi et al., 2012), which describes how to enhance a cold-start hotel recommendation system, were contacted. One of the authors shared the dataset used in their research, which had been acquired from TripAdvisor before GDPR came into force.
The dataset contains 106,266 hotel text reviews along with their individual sub-rating scores, ranging from 0 to 5 (whole numbers). The core idea of the system is to attach five tags (Location, Hospitality, Food, Price, and Room), each with a score from 0 to 5 (whole numbers), to every sentence present in the text reviews. The process is divided into separate stages based on the technique being applied: the initial stages deal with natural language processing techniques, and later CBR is applied for text prediction.
Table 4 shows some sample data shared by the author. The column headers, a crucial part of the data, were missing from this dataset. Without the headers it was hard to map the amenities to their respective scores.
However, with the help of the hotel names present in the dataset, the original reviews were tracked down on TripAdvisor. There, each review contains amenity scores, which were mapped against the dataset so that accurate headers could be recovered, as shown in Table 5.
The original dataset was provided compressed in ZIP format. When unzipped, the dataset was stored in multiple Comma-Separated Values (CSV) files, each containing the reviews of one hotel. All the CSV files were merged into a single CSV file of 106,266 rows and 9 columns. This dataset became the initial input to the proposed experiment architecture described in Section 4.4.
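A merge of this kind can be done in a few lines of pandas; the sketch below is illustrative, with a hypothetical directory layout rather than the exact paths used in the experiment.

import glob
import pandas as pd

# Each CSV file holds the reviews of one hotel (path is hypothetical).
frames = [pd.read_csv(path) for path in glob.glob("reviews/*.csv")]
dataset = pd.concat(frames, ignore_index=True)  # one frame: 106,266 rows x 9 columns
dataset.to_csv("all_reviews.csv", index=False)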
Table 6 shows the distribution of the 5 categories obtained from the dataset. Most of the reviews belong to 4- or 5-star ratings across all sub-categories.
4.4 Experiment Architecture
The first step is to filter the dataset by removing text reviews that contain fewer than 4 words, as explained previously in the project description section of the first chapter. After the dataset is cleaned, each review text passes through a 3-stage process; a brief overview of the components and steps of the method is depicted in Figure 5. In Stage 1, the text review is broken down into multiple sentences S1, S2, ..., Sn using the Python Natural Language Toolkit (NLTK). Then, for each sentence Si, sentiment analysis generates scores for each sub-rating based on the presence of an Aspect Tag. An Aspect Tag is a set of synonyms used for a sub-rating in a review; these can be obtained with a SpinGlass community detection algorithm, as shown in Table 7.
Table 7: SpinGlass Community Detection Algorithm results (Levi & Mokryn, 2012, p.5). Columns: Aspect Tag; Features Per Aspect.
This serves as the final output of the present solution. For the proposed solution, an additional step is carried out in which the average of the sentiment score and the sub-rating score given by the user is tagged onto each sentence. The final output contains sentences labelled with sub-rating scores. In this way, two different outputs are generated for each sentence. For each query, two cases are retrieved using the Euclidean distance formula, one from the proposed solution and one from the existing solution. An online survey will be conducted in which users are presented with randomly generated scores from 0 to 5 against each sub-rating. Alongside the sub-rating scores, the user sees two text reviews and is allowed to select only the one that they think most closely resembles the sub-rating scores.
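The retrieval step amounts to finding the stored case closest to the query in the five-dimensional sub-rating space. A minimal sketch follows, assuming each case is a dict holding the five sub-rating scores plus a "text" field (the field names are illustrative).

import math

ASPECTS = ["location", "hospitality", "food", "price", "room"]

def euclidean_distance(query, case):
    # Distance between a query and a stored case over the five sub-ratings.
    return math.sqrt(sum((query[a] - case[a]) ** 2 for a in ASPECTS))

def retrieve_review(case_base, query):
    # Return the review text of the single closest case.
    return min(case_base, key=lambda c: euclidean_distance(query, c))["text"]

Each system applies the same retrieval to its own score labels, so any difference in the retrieved texts reflects how the labels were produced, not the retrieval itself.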
Data Cleansing: In this step we skimmed through the text reviews and filtered out reviews that contain very few words, as it is not possible to classify which sub-category such words belong to; for example, if a text review says "amazing hotel", it is hard to retrieve which amenity of the hotel amazed the user, so such reviews were removed from the dataset. This reduced the dataset to 96,768 reviews. Also, some reviews did not contain rating scores; as these are an integral part of the process, reviews without sub-rating scores were eliminated. This further reduced the dataset to 88,128 text reviews.
The data next had to be pre-processed, as this reduces vocabulary clutter so that the features produced in the end are more effective. The data was passed through the following steps sequentially in order to clean it:
● Lower Case: The first pre-processing step is to convert the text reviews into lower-case letters. This avoids having multiple copies of identical words; for example, when calculating word counts, "Clean" and "clean" would otherwise be counted as different words.
● Removing Punctuation: The next step is to remove punctuation, because it adds no additional information when treating text data; removing all instances of it helps scale back the dimensions of the training set.
● Removal of Stop Words: Stop words (commonly occurring words) should be removed from the text data. They are mostly articles and prepositions, which are used to construct structured sentences and have no sentiment associated with them. For this purpose we can either create a list of stop words ourselves or use predefined libraries; for this dissertation the Python NLTK kit was used to remove stop words.
● Rare Words Removal: Next, seldom-occurring words are removed from the text. Because they are so rare, the association between them and other words is dominated by noise.
● Spelling Correction: While writing reviews, users can unintentionally misspell words, which leads to noisy and redundant tokens. It is therefore essential to fix such misspellings and thus reduce multiple copies of words; for example, "Precarious" and a misspelled variant such as "Precarous" would otherwise be treated as different words even though the user's intent was the same. The Python TextBlob library is used to perform spelling correction on the dataset.
● Tokenization: In this step the remaining words are segregated from the sentence. These words are passed on for stemming/lemmatization.
● Lemmatization: In this step words are converted into their root word, also called a "lemma". This further helps in reducing the dataset; for example, "cleanliness" and "cleaning" belong to the lemma "clean" and share the same sentiment. A sketch of the whole cleaning pipeline follows this list.
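Below is a minimal sketch of this cleaning pipeline, assuming NLTK and TextBlob are installed and their corpora downloaded; corpus-level rare-word removal is omitted here because it needs counts over the whole dataset.

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(review):
    text = review.lower()                                             # lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = str(TextBlob(text).correct())                              # spelling correction
    tokens = word_tokenize(text)                                      # tokenization
    tokens = [t for t in tokens if t not in stop_words]               # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]                  # lemmatization

print(preprocess("The Rooms were very clean, but the bathroom was small!"))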
Individual words are analysed into their components, and non-word tokens such as punctuation are separated from the words. This phase uses:
● Part-of-Speech (POS) tagging: We assign a tag to each word in a text, classifying it into a morphological category such as noun, verb, or adjective. Hidden Markov Models are used for developing the POS tags.
● Lemmatization: As explained above, this process converts all the inflected words present in the text into a root form called a lemma; e.g. "diversity", "divergence", and "diverging" are converted into the lemma "diverge".
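For illustration, NLTK exposes POS tagging directly; note that NLTK's default tagger is an averaged perceptron rather than the Hidden Markov Model mentioned above, so this sketch only approximates the described setup.

import nltk
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

sentence = "The rooms of the hotel were amazing but the service was slow"
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# [('The', 'DT'), ('rooms', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('hotel', 'NN'), ...]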
A linear sequence of words is transformed into structures that show how the words relate to each other. Some word sequences may be rejected if they violate the rules of the language. This phase is further divided into two steps:
Now that the dataset is ready, we must classify text based on the terms it contains. Since the probability of terms sharing a similar context is high, the interrelated "features" can be extracted. The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it. The stages in this process are:
Using TF-IDF we can calculate the weight of each word, which is then fed into a classification algorithm for distinguishing features. There are many classification algorithms; each has its own advantages, and the choice depends on the dataset being dealt with (Kotsiantis, Zaharakis, and Pintelas, 2007). Some common algorithms include:
● Decision Trees: These are trees that classify instances by sorting them based on feature values calculated using TF-IDF. Each node in a decision tree represents a feature of an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values. Decision trees are usually univariate, since they use splits based on a single feature at each internal node. Since this research works with more than one feature, decision trees are not suitable.
● Perceptron: If x1 through xn are the input feature values and w1 through wn the connection weights/prediction vector, then the Perceptron computes the sum of weighted inputs, ∑ xi·wi, and passes the result through an adjustable threshold: if the sum is above the threshold the output is 1; otherwise it is 0. This type of classification technique requires a lot of data and computation power, which is beyond the scope of this thesis.
This step will combine a user's actual amenity rating with the sentiment score generated in Step 4 using a latent-factor model. This model predicts a rating r(t, a) for a text review t and amenity a according to:

r(t, a) = o · Bu + Bs

where o is an offset parameter, Bu is the user's actual rating, and Bs is the sentiment score obtained in Step 4. The product of o and Bu will lie within the range −1 to +1. This is the Biased Factor (BF) that this research introduces to increase the similarity of the CBR recommender system. The BF changes dynamically during the experimental phase, depending on user acceptance.
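Worked through in code, the prediction is a single expression; the numbers below are hypothetical, chosen only to show the scale of each term.

def predict_rating(o, b_u, b_s):
    # r(t, a) = o * Bu + Bs: offset-weighted user rating plus sentiment score.
    # o is chosen so that o * Bu (the Biased Factor) stays within [-1, +1].
    return o * b_u + b_s

# E.g. the user rated the room 5, the sentiment score was 4.2, and the offset is 0.1:
print(predict_rating(0.1, 5, 4.2))  # 4.7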
In this phase we identify similar text reviews based on other users' previous ratings obtained from Stage 5. For example, if users Bob, Mary and Marley gave a hotel a 5-star rating for its location, then a user X who gives any hotel a 5-star rating for location could plausibly use similar words to describe the hotel's location, because the system identifies the reviews of Bob, Mary and Marley as similar based on the ratings.
Table 8 shows the rating matrix built for the recommender system. The characters A-Y symbolise the review text for a given category and score. When a new user selects their desired ratings and category, the system picks the text that fits those coordinates in the matrix and presents the result.
Aspect 1 2 3 4 5
Food A F K P U
Hospitality B G L Q V
Location C H M R W
Price D I N S X
Room E J O T Y
Table 8: Text Matrix for Prediction System.
4.6 Technology Choices
4.6.1 Programming Language
A great deal of domain-focused machine learning activity is constantly being carried out. Since the field is so vast, each available technology has excelled in some domain, leaving no simple answer to the "which language?" question. It depends on what we are trying to achieve, what techniques are involved, and how the dataset is retrieved and structured. According to a survey conducted by towardsdatascience.com [36], Python is leading the pack, with 57% of data scientists and developers using it, and 33% of the remainder planning to migrate to Python-based solutions in the future. Python is followed by R with 31%, which is followed by Java, C/C++ and JavaScript.
Machine learning scientists working on sentiment analysis rank Python (44%) first, followed by Java (15%), R (11%) and JavaScript (2%). Java is the first choice of scientists working on network security/cyber-attacks and fraud detection; by contrast, Python is least preferred in these two domains.
For areas like natural language processing and sentiment analysis, developers go for Python, which offers an easier and quicker way to build high-performing algorithms thanks to the extensive collection of specialized libraries available in the Python community (towardsdatascience.com, 2017).
C/C++ tops the chart when it comes to Artificial Intelligence (AI) in games and robot locomotion. As these languages work close to machine language, they provide the level of control, high performance and efficiency required for glitch-free software. R, meanwhile, is designed for statistical analysis and visualisation of big data, making it more helpful for business evaluation using charts and curves.
Since the project revolves around sentiment analysis and natural language processing, Python becomes the obvious choice. It provides libraries like scikit-learn for data processing, NLTK for NLP, matplotlib for visualization and NumPy for data manipulation. The availability of resources and Python's suitability for computational linguistics are the crux of why it was chosen as the language for this project.
4.6.2 Technique Choices
Clustering
Clustering techniques are today applied in a wide range of applications and are gradually replacing traditional machine learning methods. Making recommendations for such a service is a very challenging task because of the large scale, the dynamic corpus, and a variety of unobservable external factors.
Content-Based (CB): The system learns to recommend items that are similar to the ones the user liked in the past. The similarity of text is calculated based on the features associated with the compared sub-categories. The content filtering approach creates a profile for each user or product to characterize its nature.
Collaborative Filtering (CF): In this technique, the similarity in taste of two users is calculated based on the similarity in their rating histories. It is also referred to as "people-to-people correlation." A major appeal of collaborative filtering is that it is domain free, yet it can address data aspects that are often elusive and difficult to profile using content filtering. Collaborative filtering suffers from what is called the cold start problem, owing to its inability to address the system's new products and users. The two primary areas of collaborative filtering are neighbourhood methods and latent factor models.
Latent factor models are an alternative approach that tries to explain the ratings by
characterizing both items and users.
While recommending text reviews to the user we will not have any user profile, which makes CF our choice of CBR strategy. The goal is to identify the most similar sub-category ratings given by other correlated users, which makes the neighbourhood method a perfect fit.
4.7 Conclusions
In this chapter, the CRISP-DM methodology was explained: first all the steps involved in the methodology were described, and then the choice of this methodology for this research was justified. The new GDPR amendments that created problems while acquiring the dataset were also explained, followed by how an alternative route to acquiring the dataset was found. With the dataset in hand, and inspired by CRISP-DM, the experiment architecture was established. The experiment is divided into six stages, ranging from extracting sentiments from the dataset to storing and presenting the results to the end user. Finally, the last section justified the technologies chosen for the aforementioned six experiment stages. This chapter is a vital part of the thesis, as its output is passed on to the development and evaluation process.
5. EXPERIMENT DEPLOYMENT
5.1. Introduction
This chapter starts by describing how the CRISP-DM methodology was implemented for the experiment. The subsections on the CRISP-DM cycle explain how noise in the data was handled, how the dataset was filtered, and how features and sentiments were extracted. It also looks at preserving the data and predicting the outcomes.
Later in the chapter, the design of an online survey is described, along with the criteria for choosing volunteers and how they were chosen. The approaches to gathering the results from the experiments are also explained. The results obtained from the experiment will help in answering the research question by providing us with qualitative and quantitative data.
The experimental setup to test the proposed hypothesis has to measure whether people are willing to select a new descriptor over the existing descriptor. User preference for the proposed system's results is measured with an online evaluation methodology. The experimental design (Knijnenburg, Willemsen, Gantner, Soncu and Newell, 2012) does not measure absolute user opinion, only relative user preference for one set of solutions over another. The experiment is designed to measure the acceptance rate of the system's output rather than its accuracy.
Since the dataset contains reviews written by actual users, it contained many subtle mistakes that needed to be fixed before the data could be used by machine learning algorithms. The entire dataset was loaded into a Pandas dataframe (Pandas is a Python library for tabular data manipulation). The loaded dataframe was processed as follows:
● Filter Results: The dataset contained additional information about the hotel and reviewer, such as 'Traveller', 'Nationality', 'Date', and 'Service', that is not important for the purpose of the project. Hence those columns were dropped from the dataframe with df.drop(remove_cols, inplace=True, axis=1), where remove_cols is the list of columns to drop.
● Spell Check: The reviews contained a lot of spelling mistakes and were sanitized with the spelling correction library TextBlob.
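Put together, these two steps look roughly like the sketch below; the dropped column names come from the description above, while the file name and the name of the review text column are assumptions.

import pandas as pd
from textblob import TextBlob

df = pd.read_csv("all_reviews.csv")  # hypothetical merged dataset

# Drop columns not needed for the experiment.
remove_cols = ["Traveller", "Nationality", "Date", "Service"]
df.drop(remove_cols, inplace=True, axis=1)

# Sanitize spelling in the review text column (column name assumed).
df["Review"] = df["Review"].apply(lambda text: str(TextBlob(text).correct()))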
With the data cleaning step complete, the data is ready to be passed on for NLP-related tasks. This will help in identifying patterns and in the further grouping of words. This phase was completed through the following subtasks:
● First Normal Form: It was observed that many words were used in their second or third forms or in superlative degrees. Since this redundancy of words corrupts the data when performing latent semantic analysis, it was essential to reduce such words to their base forms. The NLTK library was used to perform lemmatization, as shown in Figure 9.
● Part-of-Speech (POS) tagging: For every sentence, all the words are tagged with their corresponding POS. This helps in identifying important words or phrases for feature selection and classification. Figure 10 shows a sentence with POS tagging.
Figure 10: POS tag of a sentence.
The purpose of this phase is to extract features describing the sub-categories of the hotel. These extracted features are then used by classification algorithms for grouping similar sentences. This phase was achieved in two steps:
● Feature Selection: Each of the POS-tagged words was iterated over, and its adjacent words were stored in a dictionary. By the end of the iteration each word was associated with a list of adjacent words and their weights, where a larger weight implies better coherence with a feature. The list was sorted in descending order and the top 15 words were selected for each word. Finally a list of concept words was obtained, as shown in Figure 11.
In this phase all the sentences from the text reviews are classified and tagged into their corresponding groups. The TF-IDF vectorizer technique was implemented to perform latent semantic analysis. The process has the following steps:
1. Fit the data frame in the TF vectorizer to generate the term frequency of each word.
2. Inverse-transform the fitted model to generate the inverse document frequency.
3. Multiply the two matrices and apply singular value decomposition.
The final output of this process is a set of text groups whose number is determined by the value passed in the third step. This value usually represents the number of variables in the system, which in this case is five (the hotel sub-categories). Implementing these steps results in five groups of text corpus, and these texts represent the sub-category reviews. Figure 12 shows a Python implementation of LSA.
Figure 12: Python implementation of LSA
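The implementation figure is not reproduced here, but a minimal LSA sketch along the same lines, using scikit-learn's TfidfVectorizer and TruncatedSVD on an invented mini-corpus, is as follows.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the room was clean and spacious",
    "great location near the city centre",
    "breakfast was tasty and cheap",
    "staff were friendly and helpful",
    "the price was very reasonable",
    "the bed in the room was comfortable",
]

tfidf = TfidfVectorizer().fit_transform(sentences)  # term-frequency / IDF weighting
svd = TruncatedSVD(n_components=5, random_state=0)  # five latent topics (sub-categories)
topic_scores = svd.fit_transform(tfidf)             # each sentence scored on each topic
print(topic_scores.argmax(axis=1))                  # assign sentences to strongest topic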
The final stage of natural language processing is to identify the sentiment scores of the text groups. In this phase all the sentences from every text group were iterated over and assigned their corresponding sentiment scores. For calculating sentiment scores the TextBlob polarity module was used. This module takes a sentence and predicts its polarity on a scale of 0 to 5, where 0 represents negative sentiment and 5 represents positive sentiment. While iterating through all the sentences, certain information was stored in the database table as follows:
Figure 13 shows a sample sentence belonging to the Room sub-category whose user rating as well as sentiment score is 5.
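TextBlob itself reports polarity in the range [-1, +1], so a rescaling to the 0-5 scale described above is assumed here; the linear mapping below is one plausible way to do it.

from textblob import TextBlob

def sentiment_score(sentence):
    # TextBlob polarity is in [-1, +1]; rescale linearly to 0-5 (assumed mapping).
    polarity = TextBlob(sentence).sentiment.polarity
    return round((polarity + 1) * 2.5, 1)

print(sentiment_score("The room was spotless and wonderful"))  # close to 5
print(sentiment_score("The room was dirty and horrible"))      # close to 0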
The correlation between the ratings of hotel reviews was used as the similarity metric. To find this correlation, a matrix was created where each column is a review name and each row contains the rating assigned by a specific user to that hotel review. As shown in Figure 14, the boxes marked "??" represent empty values that will be predicted by the proposed system.
1  T1  T2  T3  T4  T5
2  T6  ??  T8  T9  ??
5  ??  Tn  ??  ??  ??
Figure 14: Dot product representation of the CF model ("??" marks values to be predicted).
Whenever a user issues a rating, the system identifies the surrounding neighbours using the kNN method. The scikit-learn library was used, as it provides a direct implementation of the kNN algorithm: it takes the sub-rating scores as input and returns the nearest text review associated with them.
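A minimal sketch of this lookup, with made-up cases and review texts standing in for the real case base:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows: stored cases as five sub-rating scores (location, hospitality, food, price, room).
ratings = np.array([[5, 4, 5, 3, 5],
                    [2, 1, 3, 2, 2],
                    [4, 4, 4, 4, 4]])
reviews = ["Review text A", "Review text B", "Review text C"]  # hypothetical texts

knn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(ratings)
_, idx = knn.kneighbors([[5, 5, 4, 3, 5]])  # a new user's sub-rating query
print(reviews[idx[0][0]])                   # -> "Review text A"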
Secondly, many iterations were carried out to achieve satisfactory results. Changing parameters in one step had cascading effects on the rest of the steps; with five variables to manage, changing one variable could change the entire outcome.
And lastly, the computer used in this research for training and testing the models had low specifications, so one experimental iteration could take several hours to finish. Since several iterations were carried out, most of the time was spent waiting for tasks to complete. Running these machine learning tasks in parallel is not feasible, as all the data needs to be loaded into memory at once and results cannot be shared across sub-processes.
5.5. Design
Reviews were generated in advance for a random set of category ratings. Their output was saved and presented later in a questionnaire designed for this experiment. The implemented system was available publicly as a web survey.
The survey contained four questions, each of which consisted of the following:
All 30 participants took part in the online web survey. The survey was hosted on surveymonkey.com and contained a consent form and a total of four multiple-choice questions. Each question contained two text reviews, generated by the existing and proposed algorithms respectively, based on pre-selected category ratings. The participant could choose either of the options, both of them, or neither. The average time for completing the survey was between 5 and 6 minutes. Figure 16 shows a sample questionnaire used in the experiment. It contains two reviews, "Review A" and "Review B", a table with the category name in one column and the category score in another, and a list of options from which a user can select any one.
Figure 16: Sample of the online survey created on surveymonkey.com
5.7.2 Interviews
In order to gain a deeper understanding of the survey results, a verbal interview was conducted with 10 volunteers, following a historical qualitative approach. Interviewees were first shown the online survey, which gave them an understanding of the scenario. Later a face-to-face interview was conducted with each of the participants separately. Based on their experience, participants were asked a list of questions as follows:
All 10 participants were native English speakers, of whom 2 were female and the remaining 8 male. Six of them were frequent travellers who read reviews of places and hotels online before travelling; three like to read movie reviews before watching films; and the remaining one likes to read books. This makes all of them proficient in reading and understanding the English language.
5.8. Conclusions
This chapter started by explaining the three top-level phases of the CRISP-DM cycle used in creating the experiment, each of which was divided into sub-tasks. The data cleaning process was described, including what sort of noise was found in the dataset and how it was cleaned. Then morphological analysis was explained, along with its use in later phases. Next, features were extracted from the dataset, which then helped in grouping text into sub-categories. After grouping the sentences, their polarity was calculated and all of this information was stored in the database. In the last phase of the CRISP-DM cycle, the implementation of CF was explained.
The design of the survey was also discussed in this chapter. Two texts were generated using the proposed and existing algorithms and were stored in recommendation engines A and B respectively. The stored texts were later shown to the participants in the form of an online survey. The stored results will provide us with quantitative data for drawing conclusions about the hypothesis. An interview was also conducted to gather qualitative data.
The results obtained from the online survey were stored in a CSV file, and the answers obtained from the interviews were saved in a document file. This information will help us answer the research question precisely.
6. EVALUATION
6.1. Introduction
In this chapter the evaluation of the system is discussed. The first section explains the importance of evaluation and its various types; this helps in building a more efficient and robust evaluation process. The next section discusses how the performance of different classification algorithms is measured: every classification algorithm has its own pros and cons, and sometimes the selection of an algorithm can only be decided with the help of performance analysis. The third section compares the results of the classification algorithms. The fourth section discusses patterns present in the user ratings and the sentiment ratings in the dataset. Finally, the results obtained from the experiments are explained in detail.
● Domain and application task oriented criteria (e.g. size, theory strength, openness).
● Technically and ergonomically oriented criteria (e.g. case and knowledge
representation, similarity assessment, user acceptance).
● Knowledge engineering criteria (e.g. ease of use of methodology, development
phase tools).
For the evaluation, criteria were selected that produce quantitative or qualitative results, with the latter splitting into mainly domain-dependent as well as domain-independent ones.
The quantitative evaluation can also be considered as technically oriented criteria, e.g.:
The qualitative evaluation can also be seen as ergonomically oriented criteria, e.g.:
● User acceptance
● Adaptability
● Error management
The TPR is plotted against FPR at many different thresholds (for example 0.00, 0.01,
0.02, ..., 1.00) which decides when a prediction is assumed to be true (e.g. at a
threshold of 0.90 a prediction is assumed to be true if the computed probability is equal
or higher than 90% and any prediction with a lower probability is assumed to be
untrue).
Figure 17: AUC plots False Positive Rate against True Positive Rate
A random prediction will result in an AUC value close to 0.5, which is usually used as a threshold. If all the predictions are wrong the AUC value is 0, and if they are all correct the value is 1. In the case of binary predictions, the AUC value is the same as the accuracy.
For this research an AUC score of 0.82 was achieved, indicating that predictions using
the proposed method were achieving a consistently high score.
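For reference, the score can be computed with scikit-learn's roc_auc_score; the labels and probabilities below are invented purely to show the call, not results from the experiment.

from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                  # held-out labels (hypothetical)
y_prob = [0.9, 0.2, 0.8, 0.5, 0.6, 0.7, 0.3, 0.4]  # predicted probabilities
print(roc_auc_score(y_true, y_prob))               # 0.9375 on this toy data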
For the classification data, a brute-force method was used to find the best classification algorithm. The dataset was split into 80% training and 20% test data. First the model was trained on the training dataset, then three different algorithms were evaluated on the test dataset. The three algorithms evaluated were:
● k-Nearest Neighbour
● Decision Tree
● Naive Bayes
The results of the tests show that Naive Bayes outperformed the other machine
learning techniques, as shown in Table 9.
Since the five variables in the experiment, i.e. location, hospitality, food, price, and
room, are independent of each other and may or may not occur in a text review, and
since the Naive Bayes algorithm treats all variables as independent of each other, it
proved to be the best classifier for the dataset used in this project.
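A minimal sketch of this brute-force comparison, assuming scikit-learn and a toy
corpus standing in for the vectorised hotel reviews (the data and labels here are
hypothetical):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Toy stand-in for the review sentences and their labels
    reviews = ["the room was clean", "awful food", "great location",
               "overpriced room", "friendly staff", "terrible service",
               "lovely breakfast", "noisy location"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

    # Bag-of-words features, then the 80/20 train/test split
    X = CountVectorizer().fit_transform(reviews)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42)

    # Train each candidate algorithm and score it on the held-out data
    for name, clf in [("k-Nearest Neighbour", KNeighborsClassifier(n_neighbors=3)),
                      ("Decision Tree", DecisionTreeClassifier()),
                      ("Naive Bayes", MultinomialNB())]:
        clf.fit(X_train, y_train)
        print(name, accuracy_score(y_test, clf.predict(X_test)))

On the real dataset the same loop applies, with the toy corpus replaced by the
pre-processed review features.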
As discussed in the experiment design chapter, an online survey was created. In the
survey, four questions were asked and the participants could choose only one option
per question. A total of 120 responses were collected (30 participants answering 4
questions each). Figure 19 shows the participants' acceptance of the options, where
Review A is the text generated by the proposed system and Review B is the text
generated by the existing system.
The results show that the most preferred option was Review A with 46.73%, followed
by Review B with 28.19%. 15.11% of participants indicated that either review was
applicable for the given rating matrix, while 9.97% preferred not to choose any of the
reviews suggested by either system.
Figure 19: Options acceptance distribution by participants of the survey.
Interestingly, it was observed that the proposed system was preferred over the existing
system by a margin of 18.54 percentage points. This is an important finding, as it
indicates that combining the amenity scores with the sentiment scores predicts better
review texts.
As observed in Table 10, both systems present similar text when the sub-ratings lie in
the range 3 to 4. When text reviews with such ratings were shown to participants,
80.43% of them chose Either of Them, while 6.94% chose Review A and 8.11% chose
Review B. This signifies that the proposed system shows no improvement when
ratings lie between 3 and 4. By contrast, when reviews were generated from sub-rating
scores in the ranges 1 to 3 or 4 to 5, 56.01% chose Review A and 28.29% chose
Review B, as detailed in Table 10.
Though the idea of the interview was clear to all the participants, they found some
difficulty in mapping the sub-category scores onto the text predicted by the system.
For instance, if a certain review says "The food was expensive", it is hard to tell
whether this review is about the food category or the price category of the hotel.
Participants suggested that instead of showing a paragraph of review text, it would
have been easier if the sentences were grouped by sub-category.
Since the predicted texts were generated from reviews written by humans about an
actual place or hotel, two of the participants entirely disliked the idea of text
prediction, as the generated text was very specific to one hotel and its location. For
instance, one of the reviews says "Hotel Amber is the cheapest hotel in Berlin"; it is
not feasible to reuse this sentence to describe another hotel or location. These
participants said it would be more helpful if the system simply suggested phrases,
adjectives or verbs based on the sub-category scores.
6.6. Conclusions
The result evaluation is an integral step in answering the research question. From the
experiment we gathered both qualitative and quantitative information. The chapter
explained each of the gathered results in detail.
7.2. Conclusions
This section looks back at the objectives of each of the earlier chapters, followed by
their key findings, and ends by summarizing the conclusions of all the sub-sections
and the conclusion for the hypothesis.
Since NLP is a vast domain, the primary objective of the sentiment analysis chapter
was to narrow the scope to what aligns with the research question. That chapter also
explored the past, present and future of Natural Language Processing for text analysis.
Another key point was to explain the steps involved in an NLP task. The limitations of
current technology were also explained, which then helped in building the experiment
within the bounds of the available technologies.
Every piece of software follows a design methodology; the experiment design chapter
explained a novel approach to building a machine learning program using the
CRISP-DM methodology. Problems faced while gathering the dataset, due to the
reforms introduced by the GDPR, were also addressed, and the architecture of the
proposed solution and its stages were explained.
The experiment deployment chapter explained the experimental process from the
initial stage through to deployment. First, the phases of the CRISP-DM cycle, adjusted
for this research, were explained with some code samples. Then the design and
deployment of the survey were explained, and the participant demographics were
discussed along with the list of questions asked. This stage provided the qualitative
and quantitative data for the evaluation process.
The objective of the evaluation chapter was to discuss the results obtained from the
online survey and the face-to-face interviews. The performance of the classification
algorithms was also discussed. The results obtained by the evaluation process
provided firm evidence in relation to the research question.
From the quantitative analysis, most participants chose the text generated by the
proposed method, with a margin of 18.54 percentage points over the text generated by
the existing system. The results obtained from the online survey indicate that the
similarity-based retrieval of a CBR-based hotel review recommendation system is
improved by combining the sub-category scores with the sentiment scores of the
hotels' text reviews.
It was also concluded that when the sub-category scores lie in the mid-range, i.e. 3 or
4, the existing system outperforms the proposed system by 1.83%. The difference
swings significantly in favour of the proposed system when the sub-category scores
come from the extreme ends, i.e. 1 or 2, and 4 or 5.
The results gathered by the qualitative analysis suggest that 60% of the participants
were satisfied with the generated text.
In the context of the work presented in this dissertation, there are many possible areas
for expansion:
● There are many neural network techniques available for text classification and
language generation that were not used in this project due to limitations of the
dataset. They have many advantages, in terms of performance and prediction
quality, over the techniques that were used.
● The sentences generated by the algorithm were selected directly from the
corpus. Those sentences contained some grammatical errors and usually
contained the name of a hotel or place; a post-processing step could correct or
generalise them.
● Standard libraries such as "TextBlob" were used for calculating the sentiment
score of a sentence (a minimal example of such a call is sketched after this
list). It would be valuable to implement a sentiment analyser tailored to the
dataset, which could result in more precise ratings.
● The dataset used in this dissertation contains only text in the English language.
The outcomes could be different for different languages, as each language has
its own grammar and requires different techniques to handle it.
● There are five variables addressed in the dissertation, further variables could be
uncovered and used.
● NLP tends to be based on turning natural language into machine language, but
with time as the technology matures – especially the AI component –the
computer will get better at “understanding” the query and start to deliver
answers rather than just search results.
● Language is a huge barrier when it comes to communication with non-native
English speakers. With the help of AI techniques an earbud could be built that
translates any language in real time.
● With the help of AI techniques, it could be possible to automatically analyse
documents and other types of data in any business system subject to GDPR
rules. This would allow users to quickly and easily search, retrieve, flag,
classify and report on data deemed to be sensitive under the GDPR.
● NLP models require existing data to produce results, and these results could
become monotonous after some time. A system could be built that constructs
new sentences based on grammar and vocabulary.
● Many times the meaning of a word changes based on the context, which is not
possible to capture with the current techniques. For instance, in the sentence
"The comedian killed the show", the verb "killed" signifies that the comedian
performed at their best, meaning the sentence has a positive sentiment. But
"killed" is categorised as a negative word, and current NLP techniques will
produce a negative sentiment for this sentence.
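For reference, the kind of TextBlob call mentioned in the future-work point above is
sketched here; the mapping of the polarity onto a 0-5 sub-category scale is an
assumption for illustration, not the dissertation's exact rule:

    from textblob import TextBlob

    sentence = "The food was expensive but the staff were friendly."

    # TextBlob returns a polarity score in the range [-1.0, 1.0]
    polarity = TextBlob(sentence).sentiment.polarity

    # Hypothetical rescaling of the polarity onto a 0-5 scale
    score = round((polarity + 1.0) / 2.0 * 5.0, 1)
    print(polarity, score)

A dataset-specific analyser, as suggested above, would replace this generic polarity
score with one trained on hotel-review vocabulary.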
8. BIBLIOGRAPHY
Aamodt, A., & Plaza, E. (1994). Case-Based Reasoning: Foundational Issues,
Methodological Variations, and System Approaches. AI Communications, 7(1), 39–59.
https://doi.org/10.3233/AIC-1994-7104
Aue, A., & Gamon, M. (2005, September). Customizing sentiment classifiers to new
domains: A case study. In Proceedings of recent advances in natural language
processing (RANLP) (Vol. 1, No. 3.1, pp. 2-1).
Basile, V., & Bos, J. (2011). Towards generating text from discourse representation
structures. In Proceedings of the 13th European Workshop on Natural Language
Generation (pp. 145-150). https://dl.acm.org/citation.cfm?id=2187705
Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G. A., & Reynar,
J. (2008, April). Building a sentiment summarizer for local service reviews. In WWW
workshop on NLP in the information explosion era (Vol. 14, pp. 339-348).
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems
survey. Knowledge-Based Systems, 46, 109–132.
https://doi.org/10.1016/j.knosys.2013.03.012
Bobadilla, J., Serradilla, F., & Bernal, J. (2010). A new collaborative filtering metric
that improves the behavior of recommender systems. Knowledge-Based Systems,
23(6), 520–528. https://doi.org/10.1016/j.knosys.2010.03.009
Breese, J.S., Heckerman, D., & Kadie, C.M. (1998). Empirical Analysis of Predictive
Algorithms for Collaborative Filtering. CoRR, abs/1301.7363.
Bridge, D., & Healy, P. (2012). The GhostWriter-2.0 Case-Based Reasoning system
for making content suggestions to the authors of product reviews. Knowledge-Based
Systems, 29, (pp. 93–103). doi:10.1016/j.knosys.2011.06.024
Charniak, E. (2001). Immediate-head parsing for language models. In Proceedings of
the 39th Annual Meeting on Association for Computational Linguistics - ACL ’01.
Association for Computational Linguistics (pp. 39-42). doi:10.3115/1073012.1073029
Debnath, S., Ganguly, N., & Mitra, P. (2008). Feature weighting in content based
recommendation system using social network analysis. In Proceeding of the 17th
international conference on World Wide Web - WWW ’08. ACM Press (pp. 259-266).
doi:10.1145/1367497.1367646
Ding, X., Liu, B., & Yu, P. S. (2008). A holistic lexicon-based approach to opinion
mining. In Proceedings of the international conference on Web search and web data
mining - WSDM ’08. ACM Press. https://doi.org/10.1145/1341531.1341561
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative
filtering to weave an information tapestry. Communications of the ACM, 35(12),
61–70. https://doi.org/10.1145/138859.138867
Gong, Y., & Liu, X. (2001). Generic text summarization using relevance measure and
latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR
conference on Research and development in information retrieval - SIGIR ’01 (pp.
441-448). doi:10.1145/383952.383955
Hailong, Z., Wenyan, G., & Bo, J. (2014). Machine Learning and Lexicon Based
Methods for Sentiment Classification: A Survey. In 2014 11th Web Information
System and Application Conference. IEEE. https://doi.org/10.1109/wisa.2014.55
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings
of the tenth ACM SIGKDD international conference on Knowledge discovery and
data mining (pp. 168-177). doi:10.1145/1014052.1014073
Huang, Z., Chen, H., & Zeng, D. (2004). Applying associative retrieval techniques to
alleviate the sparsity problem in collaborative filtering. ACM Transactions on
Information Systems, 22(1), 116–142. https://doi.org/10.1145/963770.963775
Jakob, N., Weber, S. H., Müller, M. C., & Gurevych, I. (2009). Beyond the stars. In
Proceeding of the 1st international CIKM workshop on Topic-sentiment analysis for
mass opinion - TSA ’09. ACM Press (pp. 81-92). doi:10.1145/1651461.1651473
Jiang, J. J., & Conrath, D. W. (1997). Semantic Similarity Based on Corpus Statistics
and Lexical Taxonomy. In Proceedings of ROCLING X, Taiwan (pp. 32-39).
https://arxiv.org/abs/cmp-lg/9709008v1
Jin, R., Chai, J. Y., & Si, L. (2004). An automatic weighting scheme for collaborative
filtering. In Proceedings of the 27th annual international conference on Research and
development in information retrieval - SIGIR ’04. ACM Press.
https://doi.org/10.1145/1008992.1009051
Lerman, K., Blair-Goldensohn, S., & McDonald, R. (2009). Sentiment summarization:
evaluating and learning user preferences. In EACL '09: Proceedings of the 12th
Conference of the European Chapter of the Association for Computational Linguistics
(pp. 514-522). ACL. https://dl.acm.org/citation.cfm?id=1609124
Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2007). A large-scale classification
of English verbs. Language Resources and Evaluation, 42(1), 21-40.
doi:10.1007/s10579-007-9048-2
Knijnenburg, B. P., Willemsen, M. C., Gantner, Z., Soncu, H., & Newell, C. (2012).
Explaining the user experience of recommender systems. User Modeling and
User-Adapted Interaction, 22(4–5), 441–504.
https://doi.org/10.1007/s11257-011-9118-4
Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In
Proceedings of the 18th conference on Computational linguistics. Association for
Computational Linguistics (pp. 114-119). doi:10.3115/990820.990886
Kotsiantis, S. B., Zaharakis, I., & Pintelas, P. (2007). Supervised machine learning: A
review of classification techniques. Emerging artificial intelligence applications in
computer engineering, 160, 3-24.
Lamontagne, L., & Lapalme, G. (2004). Textual Reuse for Email Response. In Lecture
Notes in Computer Science (pp. 242–256). Springer Berlin Heidelberg.
doi:10.1007/978-3-540-28631-8_19
Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-Based
Validation of Clustering Solutions. Neural Computation, 16(6), (pp. 1299–1323).
doi:10.1162/089976604773717621
Levi, A., Mokryn, O., Diot, C., & Taft, N. (2012). Finding a needle in a haystack of
reviews. In Proceedings of the sixth ACM conference on Recommender systems -
RecSys ’12. ACM Press (pp. 41-48). doi:10.1145/2365952.2365977
Levine, E., & Domany, E. (2001). Resampling Method for Unsupervised Estimation of
Cluster Validity. Neural Computation, 13(11), (pp. 2573–2593).
doi:10.1162/089976601753196030
Lindstrom, L., & Jeffries, R. (2004). Extreme Programming and Agile Software
Development Methodologies. Information Systems Management, 21(3), 41–52.
https://doi.org/10.1201/1078/44432.21.3.20040601/82476.7
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer. In Proceedings of the 14th
international conference on World Wide Web - WWW ’05. ACM Press.
https://doi.org/10.1145/1060745.1060797
Liu, B. (2012). Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human
Language Technologies, 5(1), 1–167.
https://doi.org/10.2200/s00416ed1v01y201204hlt016
Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and
applications: A survey. Ain Shams Engineering Journal, 5(4), 1093–1113.
https://doi.org/10.1016/j.asej.2014.04.011
Mooney, R. J., & Bunescu, R. (2005). Mining knowledge from text using information
extraction. ACM SIGKDD Explorations Newsletter, 7(1), (pp. 3–10).
doi:10.1145/1089815.1089817
Narayanan, R., Liu, B., & Choudhary, A. (2009, August). Sentiment analysis of
conditional sentences. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing: Volume 1-Volume 1 (pp. 180-189). Association for
Computational Linguistics.
Nasukawa, T., & Yi, J. (2003). Sentiment analysis. In Proceedings of the international
conference on Knowledge capture - K-CAP ’03. ACM Press (pp 44-53).
doi:10.1145/945645.945658
Navathe, S. B., & Elmasri, R. (2000). Data warehousing and data mining. In
Fundamentals of Database Systems (pp. 841-872). Pearson Education, Singapore.
https://dl.acm.org/citation.cfm?id=1855347
Neuhuttler, J., Woyke, I. C., & Ganz, W. (2017). Applying Value Proposition Design
for Developing Smart Service Business Models in Manufacturing Firms. In Advances
in Intelligent Systems and Computing (pp. 103–114). Springer International
Publishing. https://doi.org/10.1007/978-3-319-60486-2_10
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and
Trends® in Information Retrieval, 2(1–2), (pp. 1–135). doi:10.1561/1500000011
Plaza, E., (2008). Semantics and Experience in the Future Web. In Lecture Notes in
Computer Science (pp. 44–58). Springer Berlin Heidelberg.
doi:10.1007/978-3-540-85502-6_3
Pustejovsky, J., & Boguraev, B. (1993). Lexical knowledge representation and natural
language processing. Artificial Intelligence, 63(1–2), (pp. 193–223).
doi:10.1016/0004-3702(93)90017-6
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative
filtering recommendation algorithms. In Proceedings of the tenth international
conference on World Wide Web - WWW '01. ACM Press.
https://doi.org/10.1145/371920.372071
Smyth, B., & McClave, P. (2001). Similarity vs. Diversity. In Case-Based Reasoning
Research and Development (pp. 347–361). Springer Berlin Heidelberg.
doi:10.1007/3-540-44593-5_25
Steinberger, J., Brychcín, T., & Konkol, M. (2014). Aspect-level sentiment analysis in
czech. In Proceedings of the 5th workshop on computational approaches to
subjectivity, sentiment and social media analysis (pp. 24-30).
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-Based
Methods for Sentiment Analysis. Computational Linguistics, 37(2), (pp. 267–307).
doi:10.1162/coli_a_00049
What is the best programming language for Machine Learning? (2017). Towards Data
Science. Retrieved from
https://towardsdatascience.com/what-is-the-best-programming-language-for-machine-learning-a745c156d6b7
Wang, J., de Vries, A. P., & Reinders, M. J. T. (2006). Unifying user-based and
item-based collaborative filtering approaches by similarity fusion. In Proceedings of
the 29th annual international ACM SIGIR conference on Research and development in
information retrieval - SIGIR ’06. ACM Press.
https://doi.org/10.1145/1148170.1148257
Watson, I., & Marir, F. (1994). Case-based reasoning: A review. The Knowledge
Engineering Review, 9(4), 327. https://doi.org/10.1017/s0269888900007098
Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing contextual polarity in
phrase-level sentiment analysis. In Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing - HLT ’05.
Association for Computational Linguistics. https://doi.org/10.3115/1220575.1220619
Wirth, R., & Hipp, J. (2000, April). CRISP-DM: Towards a standard process model for
data mining. In Proceedings of the 4th international conference on the practical
applications of knowledge discovery and data mining (pp. 29-39). Citeseer.
Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Interpreting TF-IDF
term weights as making relevance decisions. ACM Transactions on Information
Systems, 26(3), 1–37. https://doi.org/10.1145/1361684.1361686
Zhang, Y., Jin, R., & Zhou, Z.-H. (2010). Understanding bag-of-words model: a
statistical framework. International Journal of Machine Learning and Cybernetics,
1(1–4), 43–52. https://doi.org/10.1007/s13042-010-0001-0
Zhou, X., Xu, Y., Li, Y., Josang, A., & Cox, C. (2011). The state-of-the-art in
personalized recommender systems for social networking. Artificial Intelligence
Review, 37(2), 119–132. https://doi.org/10.1007/s10462-011-9222-1
APPENDIX A: CODE STRUCTURE