S2022 Noor Meijer Thesis
Tilburg University
School of Humanities & Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
July 8, 2022
Word Count: 8773
Abstract
Airbnb is one of the most popular and fastest-growing sharing platforms in the world. However, its
offer of peer-to-peer accommodation can make it difficult to price listings properly. This thesis
provides new insights into price prediction by adding review information to a price prediction model
based on standard listing characteristics. Reviews are often overlooked in Airbnb price prediction
research but offer valuable information. Using an open-source Airbnb dataset, several review features
are mined from a large set of Airbnb listing reviews. Topic modelling, an unsupervised learning method,
is applied to the reviews and the resulting topics are included as predictors, which improves
predictive performance for listing prices. In addition, (weighted) sentiment scores are obtained using
VADER and transformed into features. However, these only marginally improve price prediction, due to
the skewed distribution of sentiment scores. Both Support Vector Regression and XGBoost fit the Airbnb
data well, although XGBoost provides the best performance.
Data Source/Code/Ethics Statement
Work on this thesis (did/did not) involve collecting data from human participants or animals. The
original owner of the data and code used in this thesis retains ownership of the data and code during
and after the completion of this thesis. The code used in this thesis is publicly available
[http://insideairbnb.com/get-the-data].
1 Introduction
Although it is now a staple of the tourism industry, the accommodation platform Airbnb did not even
exist 15 years ago. Its unprecedented and explosive growth has been followed by an equally
impressive growth of other peer-to-peer sharing platforms. The development of AI and big data, in
combination with consumer demand for more sustainable consumption and less dependence on large
multinationals, has led to the popularity of the sharing economy (Cheng & Jin, 2019; Wirtz et al.,
2019). These days, consumers can find shared alternatives for almost anything: accommodation
(Airbnb or Homeaway), cars (Blablacar), transportation (Uber), and even pets (Borrowmydoggy)
(Wirtz et al., 2019). The exponential growth of sharing platforms has been accompanied by a similar
growth in research on the sharing economy, but because of the size of the field and the pace of
innovation, many gaps remain (Hossain, 2020).
Research on sharing platforms cannot simply be compared to research on their traditional equivalents,
because the sharing economy has its own unique challenges, such as bad working conditions, lack of
regulation and higher risks for both provider and sharer (Malhotra & Van Alstyne, 2014). Therefore,
research that focuses specifically on sharing platforms is necessary to understand these differences
and how they change today’s consumer patterns. This is especially true for Airbnb, as trust is extremely
important when booking accommodation that is likely to be in a place the consumer is not yet familiar
with (Huurne et al., 2017).
One aspect of sharing platforms that has become very important is reviews. The internet and
social media have made it especially easy to write and read reviews about anything. For an
experiential good that is purchased online, such as a hotel or Airbnb stay, a review becomes even more
important and can give a consumer the confidence that the product or service they purchase
will actually be delivered to them, in the state they expect (Lawani et al., 2019).
Therefore, it is essential that further research is done into sharing platform reviews, specifically
those on Airbnb. Reviews are the basis for trust between the provider and the sharer, and can
significantly affect both in different ways. This study will try to uncover the effect of reviews on
the price of an Airbnb listing, by adding review information to standard listing characteristics as
predictors.
1.1 Relevance
Airbnb has over 200 million users worldwide and therefore many users could be impacted by research
related to Airbnb (Cheng & Jin, 2019). With the importance of trust in the sharing economy, reviews
are an especially relevant aspect of Airbnb. A bad review can lead users to opt out of renting to or
from another user. However, reviews might not always be fair and can depend even on a user’s own
personal background. Therefore, Airbnb users could benefit from knowing the effect a review has on
their own or other users’ profiles. Users might be missing out on rentals that perfectly fit their needs
but that they disregard because of a single bad review. New renters could be missing out on valuable
income due to the effects of bad reviews. Additionally, Airbnb itself might be able to adjust its
policies and how it displays reviews.
The scientific literature still has a significant number of gaps when it comes to
sharing platforms. While Airbnb is prominent in sharing economy research, there is still much to be
explored. This thesis will provide insights into the effects of review information beyond standard
features such as the number of reviews or aggregated ratings. Much information can be gathered from
reviews, such as sentiment and topics, that is not often considered in price prediction. The findings
could possibly be generalized to other sharing platforms as well, such as similar websites like
Couchsurfing or Homeaway.
1.2 Research questions
Previous research and the existing gaps in the literature lead to the following research
questions.
RQ1: To what extent can Airbnb listing price prediction with standard listing characteristics be
improved by including review features into a Support Vector Regression?
RQ1a: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including sentiment analysis features into a Support Vector Regression?
RQ1b: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including topic modelling features into a Support Vector Regression?
RQ1c: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including review recency into a Support Vector Regression?
In previous research, SVR achieved the best result for a similar research question. However, other
machine learning models should be compared to see whether performance can be improved. Therefore, the
analysis will be repeated with different models, chosen because they were successful in other research.
RQ2: Can Extreme Gradient Boosting or Ridge Regression with new review features improve price
prediction performance of SVR?
2 Literature review
In this section the previous related work will be outlined. First, background and the current state of
research on Airbnb rental price prediction will be given. In the subsequent section, the use of
sentiment analysis for price prediction will be explored. Lastly, the two additional review features
(review recency and topic modelling) will be discussed.
3.3 XGBoost
Gradient boosting is a powerful machine learning technique that creates an ensemble of models
(usually decision trees) to minimize prediction error. XGBoost (eXtreme Gradient Boosting) is an
implementation of the gradient boosting algorithm that has quickly gained popularity due to its good
performance. It was designed for fast execution and strong model performance, which has made
XGBoost a common and successful choice in price prediction research. Peng, Li &
Qin (2020), Trang, Huy & Le (2020) and Didarul Islam et al. (2022) all found that XGBoost
outperformed several other ML models in predicting Airbnb listing prices.
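The ensemble idea described above can be illustrated with a minimal sketch. It is not XGBoost itself and not the code used in this thesis: it boosts one-split decision stumps on a single feature under squared error, and all function names and the toy numbers are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict(model, x):
    """Base prediction plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the residuals of the current ensemble."""
    model = (mean(ys), learning_rate, [])
    for _ in range(n_rounds):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    return mean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))

# Made-up toy data: one feature (e.g. number of guests) and a log-price-like target.
xs = [1, 2, 3, 4, 5, 6]
ys = [4.2, 4.4, 4.6, 5.0, 5.3, 5.6]
baseline = (mean(ys), 0.1, [])  # predicts the mean for every listing
model = boost(xs, ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

Each added stump corrects part of the remaining error, which is why the training error of the ensemble falls below that of the constant baseline. XGBoost adds regularization, second-order gradient information and many engineering optimizations on top of this basic scheme.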
4.1 Dataset
To answer the research questions, an open-source dataset with publicly available information from the
Airbnb website will be used. The data is independently scraped and made available on
http://insideAirbnb.com/index.html. This website provides data on Airbnb listings from several large
cities across the world, sorted by city. The data has already been verified, analyzed, cleansed and
aggregated where appropriate by the data provider. The available data is a snapshot of the listings
on Airbnb at the time of scraping; the data used in this thesis represents all Airbnb listings on
March 16, 2021. All variables therefore represent the value on this date, e.g. the price of the
listing on that date. This thesis focuses on the city of Amsterdam, and therefore only data from this
city will be used.
Several files belong to the dataset, related to the listings, reviews, calendar and location. In this thesis
the data about listings and reviews will be used. From now on, these will be referred to as the listing
dataset and review dataset. Both datasets are linked by the key ‘listing ID’. The listing dataset has a
large number of features about all listings since August 2020, with rental characteristics such as
number of bedrooms, neighborhood and information about the host. It also includes information
about price, availability of the listing and (aggregated) review scores. The variables come in a
variety of formats: continuous, Boolean and categorical. This dataset contains a total of 5597 rows
(listings) and 74 columns (features). Several irrelevant, uninformative or unusable features (e.g.
URLs and duplicates) are dropped from this dataset; the remaining set of features can be found in
Appendix A. The review dataset contains all reviews of the listings up until the download date. This
dataset has 272056 reviews and 6 features. The main feature is the review text, which is taken exactly
as it was written, including punctuation marks, symbols and emoticons. In addition, it contains the
listing ID, review ID, reviewer ID, reviewer name, and date of the review.
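How the two datasets can be linked through the listing ID is sketched below. The field names (`listing_id`, `price`, `comments`) and the three records are hypothetical placeholders rather than the actual column names or data of the Inside Airbnb files.

```python
from collections import defaultdict

# Hypothetical miniature versions of the listing and review datasets.
listings = [
    {"listing_id": 101, "price": 135.0},
    {"listing_id": 102, "price": 95.0},
]
reviews = [
    {"listing_id": 101, "review_id": 1, "comments": "Great location, close to the tram."},
    {"listing_id": 101, "review_id": 2, "comments": "Lovely host!"},
]

# Group review texts by the shared key.
reviews_by_listing = defaultdict(list)
for review in reviews:
    reviews_by_listing[review["listing_id"]].append(review["comments"])

# Attach the (possibly empty) list of reviews to every listing.
joined = [
    {**listing, "reviews": reviews_by_listing.get(listing["listing_id"], [])}
    for listing in listings
]
print(joined)
```

Listings without any review keep an empty list, so they remain available for a model that uses only standard listing characteristics.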
          Price    Log_price
count     5597     5590
mean      164      4.923
median    135      4.905
std       162      0.565
min       0        2.890
25%       95       4.553
50%       135      4.905
75%       198      5.292
max       6477     8.776
Table 1 – Price vs. Log_price distribution
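The values in Table 1 are consistent with a natural logarithm: the maximum price of 6477 maps to 8.776 and the median of 135 to 4.905. The sketch below reproduces these two values. Skipping non-positive prices is an assumption made here, since the logarithm of 0 is undefined and Table 1 reports fewer observations for Log_price than for Price; the exact filtering used is not shown in this section.

```python
import math

def log_prices(prices):
    """Natural log of each strictly positive price; non-positive prices are skipped (assumption)."""
    return [math.log(price) for price in prices if price > 0]

# A zero price, the median price and the maximum price from Table 1.
print([round(value, 3) for value in log_prices([0, 135, 6477])])  # [4.905, 8.776]
```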
        Length review    Number of words
mean    27               47
std     25               45
min     100              100
25%     104              170
50%     206              360
75%     357              630
max     595              999
Table 2 – Review Distribution
Table 2 shows the distribution of review length. Reviews have an average length of 269 characters and
47 words, showing that the average review is quite short. The range is large: some reviews consist of
a single word, while the longest has 999 words (the Airbnb review limit is 1000 words).
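The two length measures can be computed per review as sketched below. Counting words by splitting on whitespace is an assumption; the tokenization actually used for Table 2 is not specified in this section, and the example sentence is made up.

```python
def review_length(text):
    """Return (number of characters, number of whitespace-separated words)."""
    return len(text), len(text.split())

chars, words = review_length("Great place, close to the tram!")
print(chars, words)  # 31 6
```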
Parameter Values
C [15,10,5,1]
gamma [0.001,0.01,0.1,0.0001,'scale', 'auto']
epsilon [0.2,0.1,0.05,0.01]
kernel ['rbf','poly','linear']
Table 4 – Hyperparameter values SVR
Ridge Regression
Parameter Values
Alpha Range(0, 1, 0.01)
Table 5 – Hyperparameter values Ridge Regression
XGBoost
N_estimators [10,50,100,300,400,500]
min_split_loss [0,0.2,0.5]
max_depth [2,3,5,10,15]
booster ['gbtree','gblinear','dart']
learning_rate [0.05,0.1,0.3,0.5]
min_child_weight [1,2,3]
subsample [0.5,0.7,1]
base_score [0.25,0.5,1]
Table 6 – Hyperparameter values XGBoost
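To give an impression of the size of these search spaces, the sketch below expands the values from Tables 4 to 6 into individual configurations. It assumes an exhaustive grid, which is not confirmed in this section, and it reads `Range(0, 1, 0.01)` as the half-open range 0.00, 0.01, ..., 0.99. The helper names `expand` and `grid_size` are hypothetical.

```python
from itertools import product

svr_grid = {
    "C": [15, 10, 5, 1],
    "gamma": [0.001, 0.01, 0.1, 0.0001, "scale", "auto"],
    "epsilon": [0.2, 0.1, 0.05, 0.01],
    "kernel": ["rbf", "poly", "linear"],
}
# Assumption: Range(0, 1, 0.01) excludes the upper bound.
ridge_grid = {"alpha": [i / 100 for i in range(100)]}
xgb_grid = {
    "n_estimators": [10, 50, 100, 300, 400, 500],
    "min_split_loss": [0, 0.2, 0.5],
    "max_depth": [2, 3, 5, 10, 15],
    "booster": ["gbtree", "gblinear", "dart"],
    "learning_rate": [0.05, 0.1, 0.3, 0.5],
    "min_child_weight": [1, 2, 3],
    "subsample": [0.5, 0.7, 1],
    "base_score": [0.25, 0.5, 1],
}

def grid_size(grid):
    """Number of configurations in the full Cartesian product."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

def expand(grid):
    """List every configuration as a dict of parameter name -> value."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]

for name, grid in [("SVR", svr_grid), ("Ridge", ridge_grid), ("XGBoost", xgb_grid)]:
    print(name, grid_size(grid))
```

Under these assumptions the XGBoost grid is roughly a hundred times larger than the SVR grid, which matters when every configuration is evaluated with cross-validation.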
Topic modelling
Before transforming the topic scores to features, the number of topics K needs to be determined, as
well as several hyperparameters. Because LDA is an unsupervised learning technique, it is not
possible to use regular evaluation metrics (such as accuracy or R-squared) to test the performance of
the model. For LDA, a measure that is often used to tune hyperparameters is coherence score. To
obtain this score, the coherence between all documents and topics is calculated. One type of coherence
score is Umass Coherence, which was proposed by Mimno, Wallach, Talley, Leenders & McCallum
(2011). The formula for Umass Coherence is given below:
$$C_{UMASS}(w_i, w_j, \varepsilon) = \log \frac{P(w_i, w_j) + 1}{P(w_j)}$$
where P(w_i, w_j) is the number of documents containing both words w_i and w_j, and P(w_j) is the
number of documents containing word w_j. The Umass metric outputs a score between -14 and 14. The
closer the absolute value is to 0, the better the coherence of the model.
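The pairwise score can be computed directly from document counts, as in the minimal sketch below. Summing the pairwise scores over a topic's top words, with each word conditioned on the higher-ranked words, follows Mimno et al. (2011); whether the implementation used in this thesis sums or averages the pairs is not stated here, so that aggregation is an assumption. The four toy documents are made up.

```python
import math
from itertools import combinations

def document_frequency(documents, *words):
    """Number of documents that contain every word in `words`."""
    return sum(1 for doc in documents if all(word in doc for word in words))

def umass_pair(documents, w_i, w_j):
    """log((D(w_i, w_j) + 1) / D(w_j)); w_j must occur in at least one document."""
    co_occurrences = document_frequency(documents, w_i, w_j)
    return math.log((co_occurrences + 1) / document_frequency(documents, w_j))

def umass_topic(documents, top_words):
    """Sum of pairwise scores; `top_words` is ordered from most to least probable."""
    return sum(
        umass_pair(documents, later, earlier)
        for earlier, later in combinations(top_words, 2)
    )

# Toy corpus: each document is the set of tokens of one review.
documents = [
    {"canal", "tram"},
    {"canal", "walk"},
    {"canal", "tram", "walk"},
    {"canal"},
]
print(umass_topic(documents, ["canal", "tram", "walk"]))
```

The added 1 in the numerator avoids taking the logarithm of zero when two words never occur in the same document.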
Using the Umass score as an evaluation metric, hyperparameters K, alpha and eta are tuned. The
following values are tested:
Parameter Values
Alpha [0.1,0.2,0.3,0.4,'symmetric']
Eta [0.01,0.1,0.2,'symmetric']
Table 7 – Hyperparameter values LDA
The figure below illustrates the relation between predicted and actual values of the target variable. The
best fit line seems to follow the distribution quite well and most data points are clustered around the
line. However, as y increases, predicted values start to vary more and absolute errors become larger.
The figure ‘actual values and absolute error’ illustrates the absolute errors for each value in the test
data. This further shows that with more extreme values of y, especially high ones, the errors start
increasing.
Figure 4 – SVR - Sentiment Analysis: Prediction Error & Actual Values vs. Absolute Error.
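The pattern described above, larger absolute errors for higher actual values, can be checked numerically as well as visually. The sketch below compares the mean absolute error in the lower and upper half of the actual values. The numbers are made up for illustration and are not results from this thesis; the function names are hypothetical.

```python
def absolute_errors(actual, predicted):
    return [abs(a - p) for a, p in zip(actual, predicted)]

def mean_error_by_half(actual, predicted):
    """Mean absolute error for the lower and the upper half of the actual values."""
    pairs = sorted(zip(actual, absolute_errors(actual, predicted)))
    middle = len(pairs) // 2
    lower = [error for _, error in pairs[:middle]]
    upper = [error for _, error in pairs[middle:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)

# Made-up log prices and predictions, for illustration only.
actual = [4.2, 4.6, 4.9, 5.3, 6.1, 7.0]
predicted = [4.3, 4.5, 5.0, 5.1, 5.6, 6.2]
lower_mae, upper_mae = mean_error_by_half(actual, predicted)
print(lower_mae, upper_mae)
```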
Topic  Contribution  Keywords and representative text
0      0.9258        Keywords: center, city, minute, walk, tram, station, amsterdam, close, away, restaurant
                     Representative text: [relatively, easy, find, central, station, far, beautiful, area, canal, surrounded, coffee, shop...
1      0.9242        Keywords: amsterdam, place, come, home, experience, best, host, house, stay, time
                     Representative text: [loved, staying, houseboat, week, carien, wonderful, host, amsterdam, amazing, city, better, way...
2      0.9306        Keywords: apartment, located, clean, accommodation, recommend, nice, stay, good, pleasant, amsterdam
                     Representative text: [neighborhood, good, apartment, ideally, placed, able, foot, apartment, clean, needed, present, ...
3      0.9425        Keywords: stay, great, place, amsterdam, recommend, location, definitely, host, lovely, perfect
                     Representative text: [amazing, time, maria, staying, boat, time, amsterdam, little, special, space, lovely, clean, pe...
4      0.9248        Keywords: great, nice, location, place, clean, good, room, host, stay, super
                     Representative text: [tamaras, place, perfect, location, easy, public, transportation, walking, distance, awesome, ba...
5      0.9351        Keywords: room, hotel, staff, bed, bathroom, small, night, bit, shower, kitchen
                     Representative text: [breakfast, sweetthe, stair, really, really, really, narrow, it, difficult, people, climb, large...
Table 12 - Topic Modelling: k=6 final topics
All topics are relatively similar in contribution, with high percentages ranging between 92% and 95%.
Not all topics have a clear general subject that can be identified. Topics 2, 3 and 4 are relatively
similar and combine positive wording (perfect, great, recommend) with the word ‘Amsterdam’. However,
some topics clearly focus on a certain theme. Topic 0 seems to relate to location (center, close, tram,
walk), topic 1 appears to be related to service and social aspects (experience, host, home, best), and
topic 5 indicates a theme associated with hotels or more commercial Airbnb listings and their amenities
(hotel, staff, bathroom, kitchen).
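The contribution values can be read as the share of a review's topic mixture that is assigned to its dominant topic; this reading is an assumption, as the computation is not spelled out in this section. A minimal sketch with a made-up topic distribution for k=6:

```python
def dominant_topic(topic_distribution):
    """Return (topic index, contribution) for the topic with the largest share."""
    topic = max(range(len(topic_distribution)), key=lambda k: topic_distribution[k])
    return topic, topic_distribution[topic]

# Made-up topic mixture of a single review over six topics.
print(dominant_topic([0.01, 0.02, 0.01, 0.94, 0.01, 0.01]))  # (3, 0.94)
```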
RQ1: To what extent can Airbnb listing price prediction with standard listing characteristics be
improved by including review features into a Support Vector Regression?
RQ1a: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including sentiment analysis features into a Support Vector Regression?
To some extent, but only marginally. While sentiment scores improve price prediction with
SVR, the increase in R-squared is very small. This is likely connected to the fact that Airbnb
reviews are extremely positive and the data contains almost no negative reviews.
RQ1b: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including topic modelling features into a Support Vector Regression?
Topic modelling features provide a more successful increase in performance of the SVR. The
best performing model contains 6 topics and outperforms the baseline. Although topic subjects
are not all clearly divided in themes, they do improve performance.
RQ1c: To what extent can Airbnb listing price prediction with basic listing characteristics be
improved by including review recency into a Support Vector Regression?
Review recency only marginally improves price prediction performance, although more than
sentiment scores individually. Similar to RQ1a, this is likely associated with the skewed
sentiment scores.
Overall, review features are able to improve price prediction of Airbnb listings with a Support Vector
Regression model. However, the extent varies. Further research could improve on this by combining
features or engineering different features from reviews.
RQ2: Can Extreme Gradient Boosting or Ridge Regression with new review features improve price
prediction performance of SVR?
XGBoost (Extreme Gradient Boosting) outperforms the SVR model on every feature combination.
This model is better suited to the Airbnb data and provides the best-performing model overall: an
XGBoost model with standard listing features + 6 topic features. Ridge Regression is not able to
improve price prediction compared to the SVR model. However, for the optimal feature combination (i.e.
standard listing features + 6 topics), the Ridge Regression scores are very close to those of the SVR.
Acknowledgements
Special thanks to InsideAirbnb for providing and cleaning the data.
References
Aakash, A., & Jaiswal, A. (2020). Segmentation and Ranking of Online Reviewer Community.
International Journal of E-Adoption, 12(1), 63–83.
Bonta, V., Kumaresh, N., & Janardhan, N. (2019). A Comprehensive Study on Lexicon Based
Approaches for Sentiment Analysis. Asian Journal of Computer Science and Technology, 8(S2), 1–6.
Md. Didarul Islam, Bin Li, Kazi Saiful Islam, Rakibul Ahasan, Md. Rimu Mia, & Md. Emdadul Haque
(2022). Airbnb rental price modeling based on Latent Dirichlet Allocation and MESF-XGBoost
composite model. Machine Learning with Applications, 7.
Chattopadhyay, M., & Mitra, S.K. (2019). Do airbnb host listing attributes influence room pricing
homogenously? International Journal of Hospitality Management, 81, 54-65.
Cheng, M., & Jin, X. (2019). What do Airbnb users care about? An analysis of online review
comments. International Journal of Hospitality Management, 76A, 58-70.
Dann, D., Teubner, T. & Weinhardt, C. (2019). Poster child and guinea pig – insights from a
structured literature review on Airbnb. International Journal of Contemporary Hospitality
Management, 31(1), 427-473.
Deep-Translator – Google Translate. Deep-Translator (n.d.). Retrieved from
https://deep-translator.readthedocs.io/en/latest/usage.html#google-translate
Guo, Y., Barnes, S., & Jia, Q. (2017). Mining meaning from online ratings and reviews: Tourist
satisfaction analysis using latent dirichlet allocation. Tourism Management, 59, 467-483.
Guttentag, D. (2019). Progress on Airbnb: a literature review. Journal of Hospitality and
Tourism Technology, 10(4), 814-844.
He, L. & Zheng, K. (2019). How do General-Purpose Sentiment Analyzers Perform when Applied to
Health-Related Online Social Media Data? Stud Health Technol Inform., 264, 1208–1212.
Hong Trang, L., Duong Huy, T., & Ngoc Le, A. (2020). Clustering helps to improve price prediction in
online booking systems. International Journal of Web Information Systems, 17(1), 45-53.
Hossain, M. (2020). Sharing economy: a comprehensive literature review. International Journal of
Hospitality Management, 87.
Hu, N., Zhang, T., Gao, B., & Bose, I. (2019). What do hotel customers complain about? Text analysis
using structural topic model. Tourism Management, 72, 417-426.
Huurne, M., Ronteltap, A., Corten, R., & Buskens, V. (2017). Antecedents of trust in the sharing
economy: A systematic review. Journal of Consumer Behaviour, 16(6), 485–498.
Lawani, A., Reed, M.R., Mark, T., & Zheng, Y. (2019). Reviews and price on online platforms:
Evidence from sentiment analysis of Airbnb reviews in Boston. Regional Science and Urban
Economics, 75, 22-34.
Kalehbasti, P.R., Nikolenko, L., & Rezaei, H. (2021). Airbnb Price Prediction Using Machine
Learning and Sentiment Analysis. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds)
Machine Learning and Knowledge Extraction. CD-MAKE 2021. Lecture Notes in Computer
Science, vol 12844. Springer, Cham.
Kwon, W., Lee, M., & Back, K. J. (2020). Exploring the underlying factors of customer value in
restaurants: A machine learning approach. International Journal of Hospitality Management, 91.
Malhotra, A. & Van Alstyne, M. (2014). The dark side of the sharing economy … and how to lighten
it. Communications of the ACM, 57(11), 24–27.
Mujahid, M., Lee, E., Rustam, F., Washington, P.B., Ullah, S., Reshi, A.A., & Ashraf, I. (2021).
Sentiment Analysis and Topic Modeling on Tweets about Online Education during COVID-19.
Appl. Sci., 11(8438), 1-25.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic
coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural
Language Processing, 262–272.
nltk.stem.wordnet. NLTK (n.d.). Retrieved from
https://www.nltk.org/_modules/nltk/stem/wordnet.html
Peng, N., Li, K., & Qin, Y. (2020). Leveraging Multi-Modality Data to Airbnb Price Prediction.
Proceedings - 2020 2nd International Conference on Economic Management and Model Engineering,
ICEMME 2020, 1066–1071.
Sánchez-Franco, M.J. & Alonso-Dos-Santos, M. (2021). Exploring gender-based influences on key
features of Airbnb accommodations. Economic Research-Ekonomska Istraživanja, 34(1), 2484-2505.
Santos, G.V., Mota, F. S., Benevenuto, F.T., & Silva, H. (2020). Neutrality may matter: sentiment
analysis in reviews of Airbnb, Booking, and Couchsurfing in Brazil and USA. 10, 45.
Spacy en_core_web_sm. spaCy (n.d.). Retrieved from https://spacy.io/models/en
String — Common string operations. Python (n.d.). Retrieved from
https://docs.python.org/3/library/string.html
Shen, L., Liu, Q., Chen, G., & Ji, S. (2020). Text-Based Price Recommendation System for Online
Rental Houses. Big Data Mining and Analytics, 3(2), 143.
Tandon, A., Aakash, A., Aggarwal, A.G, & Kapur, P.K. (2021). Analyzing the impact of review
recency on helpfulness through econometric modeling. Int J Syst Assur Eng Manag, 12(1), 104–111.
Teubner, T. & Glaser, F. (2018) Up or out—The dynamics of star rating scores on Airbnb. Research
Papers, 96.
VaderSentiment 3.3.2. PyPi (n.d.). Retrieved from https://pypi.org/project/vaderSentiment/
Wang, D. & Nicolau, J.L. (2017). Price determinants of sharing economy based accommodation
rental: a study of listings from 33 cities on Airbnb.com. International Journal of Hospitality
Management, 62, 120-131.
Wang, B. C., Zhu, W. Y., & Chen, L. J. (2008). Improving the Amazon review system by exploiting
the credibility and Time-decay of public reviews. Proceedings - 2008 IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT Workshops
2008, 123–126.
Wirtz, J., So, K.K.F., Mody, M.A., Liu, S.Q. & Chun, H.H. (2019). Platforms in the peer-to-peer
sharing economy. Journal of Service Management, 30(4), 452-483.
Zhang, S., Ly, L., Mach, N., & Amaya, C. (2022). Topic Modeling and Sentiment Analysis of Yelp
Restaurant Reviews. International Journal of Information Systems in the Service Sector, 14(1), 1-16.
Appendices
Appendix A – Listing Dataset
Table 13 Included features in listing dataset
Variable % missing
host_since 0%
host_location 0%
host_about 36%
host_is_superhost 0%
host_total_listings_count 0%
host_verifications 0%
neighbourhood_cleansed 0%
property_type 0%
room_type 0%
accommodates 0%
bathrooms_text 0%
bedrooms 6%
beds 2%
amenities 0%
price 0%
minimum_nights 0%
maximum_nights 0%
availability_365 0%
calendar_last_scraped 0%
number_of_reviews 0%
first_review 9%
last_review 9%
review_scores_rating 9%
review_scores_accuracy 10%
review_scores_cleanliness 10%
review_scores_checkin 10%
review_scores_communication 10%
review_scores_location 10%
review_scores_value 10%
instant_bookable 0%
calculated_host_listings_count 0%
reviews_per_month 9%
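The percentage of missing values per variable can be obtained as sketched below. Treating `None` and empty strings as missing is an assumption, and the four records are hypothetical; other encodings of missing values (such as NaN) would need an extra check.

```python
def percent_missing(rows, column):
    """Share of rows (in whole percent) where `column` is absent, None or an empty string."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return round(100 * missing / len(rows))

# Hypothetical records using two variable names from Table 13.
rows = [
    {"bedrooms": 2, "host_about": "We love hosting."},
    {"bedrooms": None, "host_about": ""},
    {"bedrooms": 1, "host_about": "Born in Amsterdam."},
    {"bedrooms": 3},
]
print(percent_missing(rows, "bedrooms"), percent_missing(rows, "host_about"))  # 25 50
```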
Table 16 – variable distribution: selected variables with extreme values
Language Count
en 209308
fr 21568
de 14451
es 7048
nl 6956
it 3479
pt 1460
ru 989
zh-cn 764
ro 661
Other 560
ko 553
af 539
ca 380
tl 365
da 352
no 282
so 272
cs 241
sv 193
pl 187
zh-tw 161
id 159
ja 157
tr 145
fi 136
cy 115
he 87
hu 85
hr 72
sw 68
vi 39
sl 38
el 36
et 32
sk 27
ar 25
lt 16
uk 11
bg 10
th 9
sq 9
lv 7
mk 3
fa 1