BMGT 7074

Uploaded by

keshavk1401

Table of Contents

Introduction
    Overview of the Assignment
    Objectives and Scope
    Data Source Description
Project Setup
    Dataset Collection and Justification
    Ethical Considerations and Data Approval
    Data Loading
Data Preparation and Description
    Data Cleaning and Pre-Processing
    Descriptive Statistics
    Exploratory Data Analysis (EDA)
Statistical Analysis
    Hypothesis Formulation
    Statistical Test Execution
    Results and Interpretation
Predictive Modelling
    Simple Regression Model
    Multiple Regression Model
    Decision Tree Regression Model
    KNN Regression Model
    Model Comparison and Selection Using K-Fold Validation Test
Clustering Analysis
    Cluster Analysis
Interpretation, Evaluation, and Recommendation
    Interpretation and Evaluation
    Recommendations
    Important Features in Multiple Regression Model

Introduction
Overview of the Assignment
This report illustrates the practical application of advanced data analytics methods to
tourism data drawn from the hospitality sector. Through analysis of Airbnb's dataset for
Seattle, the project applies appropriate statistics, predictive modelling, and clustering
techniques to draw useful conclusions. The purpose of these analyses is to give management
strategic information that can be used to tune operations and improve guest experiences in
the short-term rental (STR) market.

Objectives and Scope


The primary objectives of this report and project are:
1. To analyse the available datasets on Airbnb's presence in Seattle, covering the number
of listings, pricing trends, and customer reviews, among other aspects.
2. To explore several analytical methods, including descriptive statistics, hypothesis
testing, regression and classification models, and clustering.
3. To turn these analyses into meaningful, evidence-based insights for all stakeholders in
the market.
The work rests on meticulous data preparation, EDA, extensive statistical testing,
predictive modelling, and clustering to uncover the patterns and insights within the data.
The results can be used by Airbnb hosts and other key stakeholders to make decisions about
pricing, customer relations, and service improvement.

Data Source Description


The data for this analysis comes from the "Seattle Airbnb Open Data" collection, which
includes:
• Listings.xlsx: Provides in-depth descriptions of the properties in Seattle, including
property information, hosts, rates, and reviews.
• Calendar.xlsx: Contains the daily prices and availability for 2016, giving a picture of
the seasonal dynamics of the rental market.
• Reviews.xlsx: Contains guest reviews of the properties, which are important for
evaluating guest satisfaction and pinpointing areas for improvement.

Project Setup
Dataset Collection and Justification
This study uses three datasets from the "Seattle Airbnb Open Data" collection. They were
selected for their breadth and factual detail: together they cover Airbnb's operations in
Seattle, namely listing details, daily pricing and availability, and customer reviews.
Through this evaluation, we aim to build a multidimensional view of the market, covering
both operational aspects and consumer segments.
• Listings.xlsx: Offers extensive detail about the properties, making it well suited to
analysing the characteristics that may affect rental rates and popularity.
• Calendar.xlsx: Presents day-by-day availability and prices, from which trends and
pricing strategies can be identified.
• Reviews.xlsx: Captures customers' opinions, which are essential for understanding the
customer experience and identifying where service can be improved.

Ethical Considerations and Data Approval


The datasets used are open data, shared freely for the purpose of analysis in a form
intended to avoid exposing individuals' personal data. Nevertheless, the data must be
handled carefully, and the conclusions drawn from the analysis must not misrepresent or
harm the individuals and entities involved.

Data Loading
To begin the analysis, we first load the data in Python using the pandas "read_excel()"
function.
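A minimal sketch of this loading step follows. It assumes the three workbooks sit in a local `data/` folder; the folder name, the `DATASET_FILES` mapping, and the `load_datasets` helper are illustrative and not taken from the original project. Note that `pandas.read_excel` needs an Excel engine such as `openpyxl` installed to read `.xlsx` files.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from a short name to each workbook's file name.
DATASET_FILES = {
    "listings": "Listings.xlsx",
    "calendar": "Calendar.xlsx",
    "reviews": "Reviews.xlsx",
}


def load_datasets(data_dir="data"):
    """Read the first sheet of each workbook into a DataFrame, keyed by short name."""
    base = Path(data_dir)
    return {
        name: pd.read_excel(base / file_name)
        for name, file_name in DATASET_FILES.items()
    }
```

With the files in place, `load_datasets()["listings"].head()` would show the first rows of the listings table as a quick sanity check.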

Data Preparation and Description
Data Cleaning and Pre-Processing
The analysis depends on clean, well-prepared datasets. The main preparation steps are:
• Handling Missing Values: Identifying and treating gaps in the data so that calculations
are not biased.
• Data Transformation: Converting the data into a form suited to analysis, such as
changing data types or turning text data into categorical variables.
• Feature Engineering: Constructing new variables from existing data to improve the
predictive accuracy of the analytical models.
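The three steps above can be sketched on a tiny made-up table. The column names (`price`, `bedrooms`, `accommodates`, `room_type`), the currency-formatted price strings, the median fill, and the `price_per_guest` feature are all assumptions chosen for illustration; the report does not state which treatments were actually applied.

```python
import pandas as pd

# Tiny made-up sample standing in for the listings table.
listings = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", None, "$150.00"],
    "bedrooms": [1.0, 4.0, 2.0, None],
    "accommodates": [2, 8, 4, 3],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Shared room"],
})

clean = listings.copy()

# Data transformation: strip "$" and thousands separators, then cast to float.
clean["price"] = clean["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Missing values: drop rows without a price, fill missing bedrooms with the median.
clean = clean.dropna(subset=["price"])
clean["bedrooms"] = clean["bedrooms"].fillna(clean["bedrooms"].median())

# Data transformation: store the text column as a categorical variable.
clean["room_type"] = clean["room_type"].astype("category")

# Feature engineering: price per guest the listing accommodates.
clean["price_per_guest"] = clean["price"] / clean["accommodates"]

print(clean)
```

Dropping rows with a missing price, rather than imputing it, is a deliberate choice in this sketch because price is the prediction target later on.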

Descriptive Statistics
After cleaning, descriptive statistics summarize the central tendency, dispersion, and
shape of the dataset's distributions, especially for price, review scores, and the other
variables transformed during data cleaning.
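As a rough illustration of such a summary, the sketch below runs pandas' built-in `describe()` and `skew()` on a small made-up sample; the values and column names are assumptions, not figures from the report.

```python
import pandas as pd

# Made-up cleaned sample; column names are assumptions.
sample = pd.DataFrame({
    "price": [60.0, 80.0, 100.0, 120.0, 390.0],
    "review_scores_rating": [88.0, 92.0, 95.0, 97.0, 99.0],
})

# Central tendency and dispersion: count, mean, std, min, quartiles, max.
summary = sample.describe()

# Shape of the distribution: positive skewness indicates a long right tail.
skewness = sample.skew()

print(summary.round(2))
print(skewness.round(2))
```

In this sample the mean price (150) sits well above the median (100), which is the typical signature of a right-skewed price distribution.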

Exploratory Data Analysis (EDA)
For EDA, we use visualizations created with matplotlib to explore relationships between
features, trends over time, and the distributions of key variables.
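A minimal matplotlib sketch of two such views, a price histogram and a price versus review score scatter plot, is given here. The data are made up, the column names are assumptions, and the non-interactive `Agg` backend is selected only so that the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to show windows interactively
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; column names are assumptions.
sample = pd.DataFrame({
    "price": [60, 75, 80, 95, 110, 120, 150, 180, 250, 400],
    "review_scores_rating": [90, 85, 96, 92, 88, 97, 94, 99, 91, 95],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Distribution of a key variable.
ax_hist.hist(sample["price"], bins=5)
ax_hist.set_xlabel("Price (USD)")
ax_hist.set_ylabel("Number of listings")
ax_hist.set_title("Price distribution")

# Relationship between two features.
ax_scatter.scatter(sample["review_scores_rating"], sample["price"])
ax_scatter.set_xlabel("Review score")
ax_scatter.set_ylabel("Price (USD)")
ax_scatter.set_title("Price vs review score")

fig.tight_layout()
```

Calling `fig.savefig("eda_overview.png")` (a hypothetical file name) would write the figure to disk.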

Price Distribution in Listings:

Review Scores Distribution:

Price Vs Review Scores:

Trends in Calendar Data:

Statistical Analysis
Hypothesis Formulation
Based on the initial exploration of the datasets, the following hypotheses were formulated:
Hypothesis 1: The prices of listings with high review scores (above 80) differ
significantly from those of listings with lower review scores (80 or below).
• Null Hypothesis (H0): The mean price of listings with review scores above 80 is equal
to the mean price of listings with review scores of 80 or below.
• Alternative Hypothesis (H1): The mean price of listings with review scores above 80 is
not equal to the mean price of listings with review scores of 80 or below.
Hypothesis 2: The availability of listings varies significantly by season.
• Null Hypothesis (H0): The average availability of listings is the same in every season
of the year.
• Alternative Hypothesis (H1): The average availability of listings differs between
seasons, for example with listings becoming scarcer in winter.

Statistical Test Execution


We perform a two-sample t-test for the first hypothesis and an ANOVA (Analysis of
Variance) for the second hypothesis.
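The first test can be sketched as follows on synthetic data. The sketch assumes SciPy is available and uses Welch's variant of the t-test (`equal_var=False`), which is an assumption: the report does not say which variant was run. The column names and the 0.05 significance level are likewise illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic listings; prices are generated independently of the review score.
rng = np.random.default_rng(0)
listings = pd.DataFrame({
    "review_scores_rating": rng.integers(60, 101, size=200),
    "price": rng.normal(loc=130, scale=40, size=200),
})

# Split prices into the two groups named in Hypothesis 1.
high = listings.loc[listings["review_scores_rating"] > 80, "price"]
low = listings.loc[listings["review_scores_rating"] <= 80, "price"]

# Welch's t-test: does not assume the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(high, low, equal_var=False)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```

The sign of the t-statistic only reflects which group has the larger mean; the decision depends on the p-value.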

Results and Interpretation


Hypothesis 1: Price Differences Based on Review Scores

A t-test was conducted to assess whether there is a statistically significant difference in
the average price between listings with high review scores and those with lower scores.
T-test Results:
• T-statistic: -0.11633
• P-value: 0.9073
Interpretation: Because the p-value is larger than 0.05, we fail to reject the null
hypothesis. The t-statistic is close to zero, which suggests that the mean price of
listings with high review scores is nearly identical to that of listings with scores of 80
or below.
From a practical viewpoint, guests in Seattle do not appear to pay noticeably different
amounts for Airbnb listings on the basis of review scores alone. This may imply that
review score is not a major driver of pricing, or that customers weigh attributes other
than reviews when choosing a listing.
Hypothesis 2: Listings Availability Varies Significantly by Season
An ANOVA test was performed to determine whether the average availability of listings
changes with the seasons.
ANOVA Results:
• F-statistic: 596.3915
• P-value: 0.0
Interpretation: The F-statistic is large and the p-value is effectively zero (reported as
0.0, presumably because of rounding), which indicates that the seasonal differences in the
listings' average availability are statistically significant. Consequently, we reject the
null hypothesis.
This result points to seasonality in Seattle's Airbnb market. Seasonal patterns in
availability may be driven by hosts' choices about when to offer their properties, by
customer behaviour in response to seasonal price changes, and by similar factors. Airbnb
hosts can use this information to adapt their offers through dynamic availability and
pricing strategies that follow seasonal demand.
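The seasonal comparison above can be sketched with SciPy's one-way ANOVA. Everything in the example is an assumption made for illustration: the daily availability rates are synthetic (with an artificial summer dip), and the month-to-season mapping uses meteorological seasons, which the report does not specify.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic daily availability rates for 2016 (share of listings available each day).
rng = np.random.default_rng(1)
daily = pd.DataFrame({"date": pd.date_range("2016-01-01", "2016-12-31", freq="D")})
summer_dip = np.where(daily["date"].dt.month.isin([6, 7, 8]), -0.10, 0.0)
daily["availability_rate"] = 0.70 + summer_dip + rng.normal(0, 0.02, len(daily))

# Assumed mapping of months to meteorological seasons.
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
daily["season"] = daily["date"].dt.month.map(SEASONS)

# One-way ANOVA: one sample of daily rates per season.
groups = [g["availability_rate"].to_numpy() for _, g in daily.groupby("season")]
f_stat, p_value = stats.f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A significant F-test only says that at least one season differs; a post-hoc comparison would be needed to say which ones.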

Predictive Modelling
Predictive modelling uses statistical methods to predict outcomes. For this project we
chose regression for the predictive analysis. We build several regression models (a simple
regression model, a multiple regression model, a KNN regression model, and a decision tree
regression model) and then compare them using k-fold cross-validation to select the best
regression model.

Simple Regression Model

• Decision Goal: The goal of the simple regression model was to predict Airbnb listing
prices from a single independent variable, "bedrooms".
• Interpretation: The model achieved an average R² score of approximately 0.41 across the
cross-validation folds. This means that about 41% of the variance in Airbnb listing
prices can be explained by the number of bedrooms alone. This is a weak fit, since it
accounts for less than half of the variance, which motivates building a multiple
regression model.
• Data Distribution: In the simple (linear) regression model, the dataset is split so
that 80% of the data is used for model training and 20% for model testing.
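A minimal sketch of this setup follows, assuming scikit-learn is used (the report does not name the library). The data are synthetic, and the choice of five cross-validation folds and the random seeds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: price grows with the number of bedrooms, plus noise.
rng = np.random.default_rng(42)
bedrooms = rng.integers(1, 6, size=300)
price = 50 + 45 * bedrooms + rng.normal(0, 40, size=300)

X = bedrooms.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = price

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Average R² across 5 cross-validation folds (fold count is an assumption).
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(f"slope = {model.coef_[0]:.1f}, test R² = {test_r2:.2f}, CV R² = {cv_r2.mean():.2f}")
```

The fitted slope can be read as the estimated price change per additional bedroom.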

Multiple Regression Model

• Decision Goal: The multiple linear regression model aims to predict Airbnb prices from
a combination of numerical factors, such as bedrooms, bathrooms, and review scores, and
categorical ones, in particular room type.
• Interpretation: The model achieved an average R² score of around 0.55 across the
cross-validation folds. This implies that about 55% of the variance in Airbnb rental
prices can be explained by the model's predictors taken together. This moderate fit
indicates that, although the chosen variables are significant, prices are also driven by
other factors.
• Data Distribution: In the multiple regression model, the dataset is split so that 80%
of the data is used for model training and 20% for model testing.
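One way to combine numerical and categorical predictors is to one-hot encode the categorical column inside a pipeline, as in the sketch below. It assumes scikit-learn; the synthetic data, the column names, and the encoding choice are illustrative and may differ from what the project actually did.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic listings with three numerical features and one categorical feature.
rng = np.random.default_rng(7)
n = 400
data = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
    "review_scores_rating": rng.integers(70, 101, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
})
room_effect = data["room_type"].map(
    {"Entire home/apt": 60, "Private room": 0, "Shared room": -30}
)
data["price"] = (
    40 + 35 * data["bedrooms"] + 20 * data["bathrooms"]
    + room_effect + rng.normal(0, 25, size=n)
)

numeric = ["bedrooms", "bathrooms", "review_scores_rating"]
categorical = ["room_type"]

# One-hot encode room type; pass the numerical columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(
    data[numeric + categorical], data["price"], test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"test R² = {test_r2:.2f}")
```

Keeping the encoder inside the pipeline means it is fitted on the training rows only, which avoids leaking information from the test set.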

Decision Tree Regression Model

• Decision Goal: The goal of the decision tree regression was to add a non-linear model
alongside the linear ones, in case it suited the data better. Such a model can capture
more complex relationships in the data, possibly leading to better predictions than the
linear models.
• Interpretation: The decision tree regression achieved an average R² score of 0.52.
This is slightly lower than the score of the multiple linear regression, which may mean
that the tree grew so many branches that it fitted noise in the training data
(overfitting), or that the tree's added complexity did not capture the pricing structure
as effectively as the linear approach.
• Data Distribution: For the decision tree regression model, the dataset is split so
that 80% of the data is used for training and 20% for testing. Because the decision tree
uses the same variables as the multiple regression model, the partition of the dataset
into training and testing parts is also the same as for that model.
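The overfitting concern raised in the interpretation can be illustrated by comparing an unrestricted tree with a depth-limited one. This is a sketch on synthetic data, assuming scikit-learn; the depth limit of 4 is an arbitrary illustrative value, not a setting from the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic features: bedrooms, bathrooms, review score (the last one is pure noise).
rng = np.random.default_rng(3)
n = 400
X = np.column_stack([
    rng.integers(1, 5, size=n),
    rng.integers(1, 4, size=n),
    rng.integers(70, 101, size=n),
])
y = 40 + 35 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training rows closely.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Limiting the depth is one simple way to reduce overfitting.
shallow = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in [("unrestricted", deep), ("max_depth=4", shallow)]:
    train_r2 = tree.score(X_train, y_train)
    test_r2 = tree.score(X_test, y_test)
    print(f"{name}: train R² = {train_r2:.2f}, test R² = {test_r2:.2f}")
```

A large gap between training and test R² for the unrestricted tree is the usual sign that it has memorized noise.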

KNN Regression Model

• Decision Goal: The KNN regression model predicts the price of a listing from the prices
of the most similar listings, on the assumption that listings with similar features have
similar prices.
• Interpretation: With an R² score of 0.53 for K=15, the KNN model performed reasonably
well and slightly better than the decision tree model, but it was outperformed by the
multiple linear regression model. Because KNN relies on neighbouring points in the
feature space, this result indicates that the chosen features do influence prices but do
not fully explain them.
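A sketch of KNN regression with K=15 follows, assuming scikit-learn. Standardizing the features before computing distances is an assumption added here (the report does not mention scaling); without it, the feature with the largest numeric range would dominate the neighbour search. The data and the example listing are made up.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features: bedrooms, bathrooms, review score.
rng = np.random.default_rng(5)
n = 400
X = np.column_stack([
    rng.integers(1, 5, size=n),
    rng.integers(1, 4, size=n),
    rng.integers(70, 101, size=n),
]).astype(float)
y = 40 + 35 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, size=n)

# Scale the features, then average the prices of the 15 nearest listings.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15))
scores = cross_val_score(knn, X, y, cv=5, scoring="r2")
print(f"mean CV R² = {scores.mean():.2f}")

# Predict the price of one hypothetical listing: 2 bedrooms, 1 bathroom, score 95.
knn.fit(X, y)
predicted = knn.predict(np.array([[2.0, 1.0, 95.0]]))
print(f"predicted price = {predicted[0]:.0f}")
```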

Model Comparison and Selection Using K-Fold Validation Test

Predictive modelling was carried out using multiple linear regression, decision tree
regression, and KNN regression to uncover the factors driving Airbnb listing prices. Among
these, the multiple linear regression model performed best on average, with the highest R²
score, which shows that combining several features improves the prediction of listing
price, even though there is still room for improvement.

Based on the cross-validation findings, the multiple linear regression model is recommended
for predicting Airbnb listing prices because it has the most consistent and the highest R²
scores. It offers a good balance between complexity and performance, which makes it
suitable for practical applications. However, its predictions could change once other
factors, such as location specifics, amenities, and host reputation, are added. In line
with these findings, business managers should focus on the established features to price
listings competitively and stand out in the market. Future work should explore additional
features, model ensembles, or more advanced machine learning techniques to further improve
predictive accuracy.
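The comparison described above can be sketched by scoring each model on the same folds. The example assumes scikit-learn, uses synthetic data, and picks five shuffled folds as an illustrative setting; the report does not state the number of folds used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic features: bedrooms, bathrooms, review score.
rng = np.random.default_rng(9)
n = 400
X = np.column_stack([
    rng.integers(1, 5, size=n),
    rng.integers(1, 4, size=n),
    rng.integers(70, 101, size=n),
]).astype(float)
y = 40 + 35 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, size=n)

models = {
    "multiple linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "KNN (K=15)": KNeighborsRegressor(n_neighbors=15),
}

# The same shuffled folds are reused for every model so the scores are comparable.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(model, X, y, cv=folds, scoring="r2").mean()
    for name, model in models.items()
}

best = max(mean_r2, key=mean_r2.get)
for name, score in mean_r2.items():
    print(f"{name}: mean R² = {score:.2f}")
print(f"selected model: {best}")
```

Looking at the spread of the per-fold scores, not only the mean, would show how consistent each model is.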

Clustering Analysis
Outcomes and Variables:

• Objective: To segment Airbnb listings in order to investigate specific aspects that
could affect decision making or pricing. This can provide insight into market
segmentation and support targeted marketing strategies.
• Variables for Clustering: Location, room type, price, and amenities may be significant
features for the grouping. The variables must be numerical, or properly encoded, before
they are passed to the clustering algorithm.
Clustering Methodology:
• K-means Clustering: An iterative algorithm that divides a set of n data points into k
non-overlapping subgroups, assigning each point to the group whose mean (centroid) is
nearest.
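The iteration behind K-means can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard assignment and update steps, not the code used in the project; the `kmeans` helper and the hand-picked starting centroids are assumptions (libraries usually choose starting points with a randomized scheme such as k-means++).

```python
import numpy as np


def kmeans(points, initial_centroids, n_iter=20):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = np.asarray(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated made-up groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])
labels, centroids = kmeans(points, initial_centroids=points[[0, 3]])
print(labels)
print(centroids.round(2))
```

Each point ends up in exactly one cluster, which is what makes the subgroups non-overlapping.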

Cluster Analysis
To complete the clustering analysis of Airbnb listings, the Elbow Method was used to select
the number of clusters. The plot shows a clear elbow at 3 to 4 clusters, the level that
balances the compactness of the clusters against model complexity. We can refine the
choice further by comparing the silhouette scores for 3 and 4 clusters, or by considering
which segmentation best fits the market.
The K-Means algorithm is then run with the chosen number of clusters, and each listing is
allocated to one of the clusters. From this we can create a profile of each cluster using
the central tendencies of the important features within it. This lets us identify distinct
segments of the Airbnb market, such as "budget", "premium", and "family-friendly"
listings, together with their specific characteristics.
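The selection and profiling steps can be sketched as follows, assuming scikit-learn. The data are synthetic, with three artificial segments built in, and the two features (`price`, `accommodates`), the range of k values, and the standardization step are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three made-up segments: (typical price, typical guest capacity).
rng = np.random.default_rng(11)
segments = [(70, 2), (150, 4), (320, 6)]
listings = pd.concat(
    [
        pd.DataFrame({
            "price": rng.normal(price, 10, size=60),
            "accommodates": rng.normal(guests, 0.3, size=60),
        })
        for price, guests in segments
    ],
    ignore_index=True,
)

# Standardize so that both features contribute comparably to the distances.
X = StandardScaler().fit_transform(listings)

# Elbow method uses inertia (within-cluster sum of squares); silhouette refines the choice.
inertia, silhouette = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X, km.labels_)

best_k = max(silhouette, key=silhouette.get)

# Allocate each listing to a cluster and profile the clusters by feature means.
listings["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
profiles = listings.groupby("cluster")[["price", "accommodates"]].mean()

print(f"chosen k = {best_k}")
print(profiles.round(1))
```

Plotting `inertia` against k would reproduce the elbow curve, while the `profiles` table is the starting point for naming segments such as "budget" or "premium".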

Interpretation, Evaluation, and Recommendation


Interpretation and Evaluation
The predictive modelling and clustering analysis produced several important insights.
Several types of regression model, namely linear regression, decision tree, and KNN, were
used to assess how well Airbnb listing prices can be forecast. The multiple linear
regression model outperformed the other models in consistency and reliability, as the
cross-validation R² results showed. The decision tree and KNN models still provide
supplementary information, even though they do not perform as well as the MLR model.
Cross-validation demonstrated the robustness of our model: it was reasonably predictive,
but the results also illustrated the intricate nature of Airbnb pricing and its variation.
The slightly lower performance of the decision tree model suggested that non-linear
relationships matter only to some extent. The KNN model, in turn, highlighted the
influence of comparable listings on price.
Based on the clustering analysis, the listings were grouped into homogeneous groups that
represent distinct market segments. These segments can be used for targeted campaigns and
for setting prices that match customers' expectations by linking the offer to the
customer's needs.

Recommendations
• Pricing Optimization: Use these insights to put together pricing guidelines for hosts.
Hosts should be educated on the basic factors affecting price, namely location, room
type, and amenities, and should set prices accordingly.
• Market Positioning: Share the identified clusters with hosts to help them position
their listings in the market. For instance, listings in a 'luxury' cluster could use a
marketing strategy that highlights their premium features.
• Operational Improvements: Encourage hosts in clusters with lower review scores to
improve their offers. Improving customer satisfaction and listing features can strengthen
their competitive edge and widen their reach among customers.
• Expansion Opportunities: Use the cluster insights to locate underserved market areas
where Airbnb could add new listings to close existing demand gaps.

Important Features in Multiple Regression Model

Conclusion
This report has presented a complete data analysis aimed at finding the factors that drive
pricing and at identifying the market segments present in the data. Through data
preparation, exploratory analysis, statistical testing, and the construction of
forecasting models, we have gained information about the Airbnb marketplace.
The regression modelling started with a simple linear regression, which provided a basis
for comparison, and continued with a multiple linear regression that considered both
numerical and categorical covariates. The models showed that listing prices depend on
parameters such as number of bedrooms, room type, and reviews; the multiple linear
regression reached an average R² score of around 0.55 in cross-validation.
We also considered non-linear relationships by analysing the predictions of a decision
tree regression model and a KNN regression model. Both models gave useful results, but
neither outperformed the multiple linear regression model, as our cross-validation results
indicate.
In addition, we established clusters of listings that share similar traits, which gives a
clear view of the different market segments in the city. This segmentation supports
targeted marketing and will help Airbnb and its hosts in making decisions.


Appendix
Appendix A: Regression Models Performance Overview

Appendix B: Tableau Findings

This visualization shows how price changes throughout the year and what the trends are in
specific months. It can help someone who is new to Airbnb and is thinking about renting
out a property: they can use the visualization to decide what price to set in each month
in Seattle.
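The monthly aggregation behind such a chart can be approximated in pandas, as in this small sketch. The calendar rows and column names are made up for illustration; the Tableau workbook itself is not reproduced here.

```python
import pandas as pd

# Made-up calendar rows: one price observation per date.
calendar = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-05", "2016-01-20", "2016-07-04", "2016-07-18", "2016-12-24"]
    ),
    "price": [100.0, 110.0, 150.0, 170.0, 130.0],
})

# Average price per calendar month.
monthly_price = calendar.groupby(calendar["date"].dt.month)["price"].mean()

# Month with the highest average price.
peak_month = monthly_price.idxmax()

print(monthly_price)
print(f"peak month: {peak_month}")
```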

The visualization above shows the price by number of beds (1, 2, 3, 4, 5, and 6). It shows
the average price for a day, a week, and a month in Seattle.

The visualization above shows the types of bed preferred by tourists when they rent a
property through Airbnb, specifically in Seattle.

The visualization above shows the cost of renting a property in Seattle in different zip
codes. It helps identify which areas of Seattle cost more and which cost less.
