Project Summary
The purpose of this project is to identify the factors behind booking cancellations and, in turn, help reduce them. The dataset originally came from Kaggle. In this project I:
- Cleaned the raw data using several methods to make sure it was suitable for analysis.
- Conducted exploratory data analysis (EDA) to find out which variables were associated with booking cancellations.
- Built a customer segmentation to determine which segments cancel the most and to support targeted marketing.
- Built a predictive model with the highest predictive power among those tested, to help hotels determine whether a booking will be canceled. This would be very helpful in hotel operations.
Insights
- The features most correlated with booking cancellations are lead time, total number of special requests, required car parking spaces, and previous cancellations.
- We found that customers who reserved room type P and made a non-refundable deposit have a 100% cancellation rate. This is an interesting finding that needs to be investigated further.
- We were able to identify 4 customer clusters: the cancellation squad, the family, the return customer, and the vacationers.
- We identified Random Forest as the model giving us the highest predictive power. This model classifies whether or not a booking will be canceled with 84% accuracy.
- Our interpretative Random Forest model revealed that lead time, average daily rate, and deposit payment type are the three features influencing cancellation prediction the most. Our analysis also pointed at the importance of number of special requests and reserved room type.
Project Files
For a more comprehensive analysis and visualization, please open the project files.
Project Background
Over the years, the hotel industry has changed, with a majority of bookings now made through third parties such as Traveloka. These Online Travel Agencies (OTAs) have transformed cancellation policies from a footnote at the bottom of the page to the main selling point in their marketing campaigns. As a result, customers have become accustomed to free cancellation policies, and booking cancellation rates have been increasing over the years. This increase makes it harder for hotels to forecast accurately, leading to non-optimized occupancy and revenue loss.
When hotels try to protect themselves by using services such as Traveloka or Tiket.com, the burden then falls on OTAs. Indeed, this service requires the OTA to pay for the reservation if the booking is canceled and they cannot find a new guest to occupy the room. One thing is clear: whether you are a hotel or an OTA, cancellations have a negative financial impact on your business.
In addition to the direct financial consequences of cancellations, they also cause operational problems (such as over or understaffing). It is therefore very useful for hotels to know which bookings are likely to get canceled in order to plan their operations accordingly.
Characteristics of the booking itself may be good indicators of whether or not a booking will be canceled. For instance, the average length of stay of canceled reservations is 65% higher than that of non-canceled bookings, with a lead time of 60 days. Engaging with the reasons why people are canceling and what types of bookings are being canceled is crucial.
In order to solve this problem, we will use a real life hotel booking dataset to create a customer segmentation analysis in order to gain insights about the customers (and hopefully reasons why they cancel their reservation). We will then build a classification model (including the newly created customer clusters) to predict whether or not a booking will be canceled with the highest accuracy possible.
This model will allow hotels to predict if a new booking will be canceled or not, manage their business accordingly, and increase their revenue.
Data Scope, Goals & Objectives
We used data from the Hotel Booking Demand Datasets. The dataset provides data from real bookings scheduled to arrive between July 1, 2015 and August 31, 2017 at two hotels in Portugal. Booking data from both hotels share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. For a detailed list and description of those variables, refer to the data dictionary.
The two hotel datasets were merged into one main dataframe. The dataframe was then cleaned making sure to address any null values, reformat certain features, and engineer new ones.
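The merge step can be sketched as follows. This is a minimal illustration with tiny stand-in frames, not the project's actual code; the column names (`lead_time`, `is_canceled`, `hotel`) and the hotel labels are assumptions modeled on the feature names used elsewhere in this document.

```python
import pandas as pd

# Tiny stand-ins for the two hotel tables (H1 and H2); the real data
# would be loaded from the dataset files instead.
h1 = pd.DataFrame({"lead_time": [10, 200], "is_canceled": [0, 1]})
h2 = pd.DataFrame({"lead_time": [5, 90, 30], "is_canceled": [0, 1, 0]})

# Tag each frame with its source hotel before stacking them
# (label values are assumed for illustration).
h1["hotel"] = "Resort Hotel"
h2["hotel"] = "City Hotel"
bookings = pd.concat([h1, h2], ignore_index=True)

# Count nulls per column to decide how each one should be handled.
null_counts = bookings.isna().sum()
print(bookings.shape)
print(null_counts)
```

Because both tables share the same structure, a plain row-wise concatenation is enough; the added `hotel` column keeps the origin of each booking available for later analysis.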
Goals
Our goal is to find out which features drive cancellations and to reduce the booking cancellation rate by at least 10% in the following year.
Objectives
- Exploratory analysis of the cancellation target variable and its relation to other features. Data visualisation tools were used to identify trends and valuable insights from this analysis.
- A clustering model was then used to create 4 customer segments whose profiles were then analysed.
- Four models were then presented: baseline, logistic regression, decision tree, and random forest. The model with the highest test accuracy was selected as our predictive model and a secondary interpretative model was also chosen in order to gain a deeper understanding of factors influencing cancellations. The models were evaluated, and conclusions and recommendations were derived to optimize occupancy, improve operations, and increase a hotel's revenue.
Data Dictionary
Feature Name | Type | Description |
---|---|---|
ADR | Float | Average Daily Rate. Calculated by dividing the sum of all lodging transactions by the total number of staying nights. |
Adults | Integer | Number of adults. |
Agent | Categorical | ID of the travel agency that made the booking. |
ArrivalDateDayOfMonth | Integer | Day of the month of the arrival date. |
ArrivalDateMonth | Categorical | Month of arrival date with 12 categories: “January” to “December”. |
ArrivalDateWeekNumber | Integer | Week number of the arrival date. |
ArrivalDateYear | Integer | Year of arrival date. |
AssignedRoomType | Categorical | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons. |
Babies | Integer | Number of babies. |
BookingChanges | Integer | Number of changes/amendments made to the booking from the moment the booking was entered on the Property Management System until the moment of check-in or cancellation. Calculated by adding the number of unique iterations that change some of the booking attributes, namely: persons, arrival date, nights, reserved room type or meal. |
Children | Integer | Number of children. Sum of both payable and non-payable children. |
Company | Categorical | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons. |
Country | Categorical | Country of origin. Categories are represented in the International Standards Organization (ISO) 3155–3:2013 format. |
CustomerType | Categorical | Type of booking, assuming one of four categories: Contract (when the booking has an allotment or other type of contract associated to it), Group (when the booking is associated to a group), Transient (when the booking is not part of a group or contract, and is not associated to other transient booking), and Transient-party (when the booking is transient, but is associated to at least one other transient booking). |
DaysInWaitingList | Integer | Number of days the booking was in the waiting list before it was confirmed to the customer. Calculated by subtracting the date the booking was confirmed to the customer from the date the booking entered on the Property Management System. |
DepositType | Categorical | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit (no deposit was made), Non Refund (a deposit was made in the value of the total stay cost), and Refundable (a deposit was made with a value under the total cost of stay). Value calculated based on the payments identified for the booking in the transaction (TR) table before the booking's arrival or cancellation date. In case no payments were found the value is “No Deposit”. If the payment was equal or exceeded the total cost of stay, the value is set as “Non Refund”. Otherwise the value is set as “Refundable”. |
DistributionChannel | Categorical | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”. |
Hotel | Categorical | Indicates which hotel the booking was made at: (h1) represents the resort hotel and (h2) represents the city hotel. |
IsCanceled | Integer | Value indicating if the booking was canceled (1) or not (0). |
IsRepeatedGuest | Integer | Value indicating if the booking name was from a repeated guest (1) or not (0). Variable created by verifying if a profile was associated with the booking customer. If so, and if the customer profile creation date was prior to the creation date for the booking on the Property Management System database it was assumed the booking was from a repeated guest. |
LeadTime | Integer | Number of days that elapsed between the entering date of the booking into the Property Management System and the arrival date. Calculated by subtracting the entering date from the arrival date. |
MarketSegment | Categorical | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”. |
Meals | Categorical | Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC (no meal package), BB (Bed & Breakfast), HB (Half board: breakfast and one other meal – usually dinner), and FB (Full board: breakfast, lunch and dinner). |
PreviousBookingsNotCanceled | Integer | Number of previous bookings not canceled by the customer prior to the current booking. In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and not canceled. |
PreviousCancellations | Integer | Number of previous bookings that were canceled by the customer prior to the current booking. In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and canceled. |
RequiredCarParkingSpaces | Integer | Number of car parking spaces required by the customer. |
ReservationStatus | Categorical | Reservation last status, assuming one of three categories: Canceled (booking was canceled by the customer), Check-Out (customer has checked in but already departed), No-Show (customer did not check-in and did inform the hotel of the reason why). |
ReservationStatusDate | Date | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when the booking was canceled or when the customer checked out of the hotel. |
ReservedRoomType | Categorical | Code of room type reserved. Code is presented instead of designation for anonymity reasons. |
StaysInWeekendNights | Integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel. Calculated by counting the number of weekend nights from the total number of nights. |
StaysInWeekNights | Integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel. Calculated by counting the number of week nights from the total number of nights. |
TotalOfSpecialRequests | Integer | Number of special requests made by the customer (e.g. twin bed or high floor). |
Data Cleaning
- We stripped unwanted whitespace from a few features.
- We removed the agent and company features because of their large share of missing data and their large number of categories.
- We engineered new features, such as total days of stay and total guests, to simplify the analysis.
- We combined the arrival date components (day, month, year) into a single datetime feature.
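The cleaning steps listed above could look roughly like this in pandas. It is a sketch on a two-row stand-in frame; the column names and the new feature names (`total_stay`, `total_guests`, `arrival_date`) are assumptions, not the project's confirmed identifiers.

```python
import pandas as pd

df = pd.DataFrame({
    "meal": ["BB       ", "HB       "],
    "agent": [9.0, float("nan")],
    "company": [float("nan"), 223.0],
    "stays_in_weekend_nights": [2, 0],
    "stays_in_week_nights": [3, 1],
    "adults": [2, 1],
    "children": [1.0, 0.0],
    "babies": [0, 0],
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "August"],
    "arrival_date_day_of_month": [1, 31],
})

# Strip stray whitespace from text features.
df["meal"] = df["meal"].str.strip()

# Drop the sparse ID columns.
df = df.drop(columns=["agent", "company"])

# Engineer the two summary features.
df["total_stay"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]
df["total_guests"] = df["adults"] + df["children"] + df["babies"]

# Combine year, month name, and day into one datetime column.
df["arrival_date"] = pd.to_datetime(
    df["arrival_date_year"].astype(str) + "-"
    + df["arrival_date_month"] + "-"
    + df["arrival_date_day_of_month"].astype(str),
    format="%Y-%B-%d",
)
print(df[["meal", "total_stay", "total_guests", "arrival_date"]])
```

Parsing with an explicit format avoids ambiguity between day and month; `%B` expects full English month names, which matches the ArrivalDateMonth categories in the data dictionary.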
Data Analysis
Which features correlate with booking cancellation?
- Lead time is the most highly correlated feature with whether or not a booking is canceled. It makes sense that as the number of days between when the booking is made and the supposed arrival date increases, customers have more time to cancel the reservation and there is more time for an unforeseen circumstance derailing travel plans to arise.
- Interestingly, the total number of special requests is the second feature with the strongest correlation to our cancellation target. As the number of special requests made increases, the likelihood that a booking is canceled decreases. This suggests that engagement with the hotel prior to arrival and feeling like their needs are heard may make a customer less likely to cancel their reservation.
- Related to special requests, the number of required car parking spaces is the third feature with the strongest correlation to our cancellation target. As the number of parking spaces requests increases, the likelihood that a booking is canceled decreases. Potential reasons for this relationship are discussed later on.
- Interestingly, a customer's prior history with the hotel (measured by the number of previous bookings not canceled or whether or not a customer is a repeated guest) does not seem to be highly correlated with whether or not the current booking will be canceled. On the other hand, a customer's prior history of cancellation (measured by the number of previous cancellations) is more highly correlated with whether or not the current booking will be canceled.
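A ranking like the one described above can be produced by correlating each numeric feature with the cancellation flag and sorting by absolute value. The sketch below uses six made-up rows purely to show the mechanics; the values are not taken from the dataset, and the column names are assumed.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "is_canceled":                 [1, 1, 0, 0, 1, 0],
    "lead_time":                   [200, 150, 10, 30, 120, 5],
    "total_of_special_requests":   [0, 0, 2, 1, 0, 3],
    "required_car_parking_spaces": [0, 0, 1, 0, 0, 1],
})

# Pearson correlation of every feature with the target.
corr = df.corr()["is_canceled"].drop("is_canceled")

# Rank by strength while keeping the sign, so direction stays visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Keeping the sign matters here: a positive value (as expected for lead time) means the feature rises with cancellations, while a negative value (as expected for special requests and parking spaces) means it falls.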
Interpretation: Canceled bookings have a longer lead time on average. Potential reasons why were discussed above.
Interpretation: Customers who cancel their bookings make on average fewer special requests. Potential reasons why were discussed above.
Interpretation: On average, customers who do not cancel their bookings tend to require more parking spaces. Similarly to the number of special requests, it would make sense that the more a customer engages with the hotel (by putting in a request for a parking spot), the less likely they are to cancel. It is also fair to think that by the time a guest is thinking about where they will park their car, they are most likely pretty committed to their destination. Finally, thinking about this from the hotel's perspective, it is possible that not many nearby hotels have parking. As a result, the need for a parking space would limit the customer's hotel options and make them less likely to cancel. More information would be required from the hotel directly to confirm this theory. However, if true, this suggests that adding parking spaces could be a way to help reduce cancellations.
Interpretation: The same pattern can be seen in both canceled and non-canceled bookings. Fewer bookings are canceled (or kept) around January. More bookings are canceled (and kept) in the warmer months between April and July.
Interpretation:
- On average, customers spend 3 nights in the hotel.
- On average, customers cancel 3 days before their supposed arrival date. This does not give hotels a lot of time to find a new guest or adjust their operations. This is further evidence for the need of a predictive model able to identify which bookings will be canceled earlier.
Interpretation: Customers who reserved room type P have the highest booking cancellation rate, with 100% of bookings canceled. As the dataset did not provide the actual room designations for anonymity purposes, it is hard to interpret why bookings of room type P are canceled more often.
Interpretation: Surprisingly, customers who pay a non-refundable deposit have a much higher percentage of canceled reservations. As this is a counter-intuitive finding, it is necessary to dig a little deeper into the characteristics of bookings with a non-refundable deposit.
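Cancellation rates per category, such as per deposit type or reserved room type, can be computed with a simple group-by. The sketch below uses five made-up rows and an assumed `deposit_type` column; it shows the mechanics only and does not reproduce the project's figures.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "Non Refund", "No Deposit"],
    "is_canceled":  [0, 1, 1, 1, 0],
})

# The mean of a 0/1 flag is the cancellation rate; size gives the support.
rate = df.groupby("deposit_type")["is_canceled"].agg(["mean", "size"])
rate = rate.rename(columns={"mean": "cancel_rate", "size": "bookings"})
print(rate)
```

Reporting the number of bookings next to each rate helps when digging deeper: an extreme rate that rests on very few bookings deserves more caution than one backed by thousands.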
How can customer segmentation help with reducing booking cancellation?
Interpretation:
- We used K-Prototypes because our data mixes categorical and numerical variables.
- Based on the cost, we chose 4 clusters because there is an obvious ‘elbow’ at that point in the elbow method.
- The resulting cluster label is also used in our cancellation prediction model, as explained later.
We were able to identify 4 customer clusters, the cancellation squad, the family, the return customer, and the vacationers. Cluster characteristics were explained in the illustration above.
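To make the choice of K-Prototypes concrete, the sketch below shows the mixed dissimilarity the algorithm relies on: squared Euclidean distance on numerical fields plus a weight `gamma` times the number of mismatched categorical fields. The prototypes, the scaled lead-time values, and `gamma=0.5` are hypothetical illustration values, not the project's fitted clusters.

```python
def kprototypes_distance(point, prototype, numeric_idx, categorical_idx, gamma):
    """Mixed dissimilarity: squared Euclidean distance on numeric fields
    plus gamma times the count of mismatched categorical fields."""
    numeric = sum((point[i] - prototype[i]) ** 2 for i in numeric_idx)
    mismatches = sum(point[i] != prototype[i] for i in categorical_idx)
    return numeric + gamma * mismatches


def assign(point, prototypes, numeric_idx, categorical_idx, gamma):
    """Return the index of the closest prototype."""
    distances = [
        kprototypes_distance(point, p, numeric_idx, categorical_idx, gamma)
        for p in prototypes
    ]
    return distances.index(min(distances))


# Hypothetical prototypes: (scaled lead time, deposit type).
prototypes = [(0.9, "Non Refund"), (0.1, "No Deposit")]
booking = (0.8, "Non Refund")
cluster = assign(booking, prototypes, numeric_idx=[0], categorical_idx=[1], gamma=0.5)
print(cluster)
```

The clustering cost used for the elbow plot is the sum of these distances between every booking and its assigned prototype, which is why it drops as clusters are added and flattens once extra clusters stop helping.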
Can predictive model help reduce booking cancellation?
Our goal is to build a model able to predict whether or not a booking will be canceled with the highest level of accuracy. In order to do so, our baseline model was compared to a logistic regression, a decision tree, and a random forest.
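The comparison can be sketched with scikit-learn as below. Synthetic data stands in for the prepared booking features, the baseline is assumed to be a majority-class predictor, and all hyperparameters are illustrative defaults, so the scores say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared booking features and target.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # assumed baseline
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
best = max(test_accuracy, key=test_accuracy.get)
print(test_accuracy, best)
```

Scoring every model on the same held-out split keeps the comparison fair, and the baseline shows how much accuracy is available without learning anything from the features.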
Model Preparation
We do not want to leak any information about our target (cancellation) into our model. As a result, we must remove is_canceled, reservation_status, and country from our X variable.
The agent and company IDs recorded in the agent and company features include a large amount of categorical data that is de-identified and therefore difficult to interpret. Since information about the type of agent and company used is included in the market_segment and distribution_channel features, the agent and company features were not included in the model.
Finally, as models cannot take in datetime objects as features, the reservation_status_date and arrival_date features were also excluded from the predictive model. It is also worth mentioning that the cluster feature from the previous analysis was included to further improve model accuracy.
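The feature selection described in this section can be sketched as follows. The two-row frame is a stand-in, and the one-hot encoding of `deposit_type` and `cluster` is an assumption about how categorical features were prepared, not a confirmed detail of the project.

```python
import pandas as pd

df = pd.DataFrame({
    "is_canceled": [0, 1],
    "reservation_status": ["Check-Out", "Canceled"],
    "reservation_status_date": pd.to_datetime(["2015-07-03", "2015-06-20"]),
    "arrival_date": pd.to_datetime(["2015-07-01", "2015-07-10"]),
    "country": ["PRT", "GBR"],
    "lead_time": [12, 150],
    "deposit_type": ["No Deposit", "Non Refund"],
    "cluster": [2, 0],
})

# Target, leakage-prone columns, and datetime columns are kept out of X.
excluded = [
    "is_canceled", "reservation_status", "country",
    "reservation_status_date", "arrival_date",
]
y = df["is_canceled"]
X = df.drop(columns=excluded)

# One-hot encode categoricals, including the cluster label (assumed encoding).
X = pd.get_dummies(X, columns=["deposit_type", "cluster"])
print(X.columns.tolist())
```

Treating the cluster label as a category rather than a number avoids implying an order between clusters that does not exist.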
Model Evaluation
We use Random Forest for our prediction model not only because it has the highest predictive power, but also because it adds randomness while growing the trees, which helps limit overfitting. This produces a wide diversity of trees that generally results in a better model.
Interpretation:
We are correctly classifying 68% of the canceled bookings and 94% of the not canceled bookings. If our model predicted that a booking would be canceled, it was actually canceled in 86% of cases. If our model predicted that a booking would be not canceled, it was in fact not canceled in 83% of cases.
Looking at the confusion matrix, we see that there are 727 bookings that our model predicted to be canceled that were not actually canceled. This means that in 4% of the cases, a guest may arrive and the hotel may not be ready for them or the hotel may risk overbooking if they were looking for a replacement guest. In addition, there are 2118 bookings that our model predicted to be not canceled that were in fact canceled. This means that in 12% of cases, the hotel may be allocating their resources on the wrong reservations (getting a room ready that doesn't need to be or missing an opportunity to look for a replacement guest).
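The percentages quoted above all derive from the four cells of the confusion matrix. The helper below shows the arithmetic; the counts passed to it are illustrative only and are not the project's confusion matrix, and the function and key names are hypothetical.

```python
def summarize(tp, fp, fn, tn):
    """Derive per-class rates from confusion-matrix counts, treating
    'canceled' as the positive class."""
    total = tp + fp + fn + tn
    return {
        "recall_canceled": tp / (tp + fn),         # canceled bookings caught
        "recall_not_canceled": tn / (tn + fp),     # kept bookings recognized
        "precision_canceled": tp / (tp + fp),      # predicted canceled, truly canceled
        "precision_not_canceled": tn / (tn + fn),  # predicted kept, truly kept
        "accuracy": (tp + tn) / total,
        "false_alarm_share": fp / total,           # predicted canceled, guest arrives
        "missed_cancellation_share": fn / total,   # predicted kept, guest cancels
    }


# Illustrative counts only, not the project's results.
metrics = summarize(tp=60, fp=10, fn=40, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The two "share" entries correspond to the two kinds of operational risk discussed above: false alarms risk overbooking, while missed cancellations waste preparation and replacement opportunities.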
Feature Importances
Interpretation:
A longer lead time, a more expensive booking, and a non-refundable deposit carry the most weight in predicting that a booking will be canceled. As discussed in the EDA section, the type of customers potentially required to make a non-refundable deposit may explain the link between this deposit type and cancellations. Making more special requests, staying longer, and having a history of previous cancellations can also have an impact; it makes sense that a history of cancellations would make a customer more likely to cancel their current booking.
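A feature ranking of this kind can be read from a fitted random forest through its impurity-based importances. The sketch below uses synthetic data in which the target is driven by a made-up lead-time variable only, so the ranking is known in advance; the feature names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
lead_time = rng.integers(0, 365, n)
adr = rng.normal(100, 30, n)
noise = rng.normal(0, 1, n)

# Synthetic target that depends on lead time only (plus some jitter).
y = (lead_time + rng.normal(0, 40, n) > 180).astype(int)
X = np.column_stack([lead_time, adr, noise])
names = ["lead_time", "adr", "noise"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
ranked = sorted(zip(names, forest.feature_importances_), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These importances measure how much a feature contributes to the splits, not the direction of its effect; statements such as "longer lead time means more cancellations" rely on the exploratory analysis for the direction.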
Recommendation
- We recommend gathering additional hotel-specific information (such as deposit policies and hotel room information) to gain a deeper understanding of the situation.
- Consider offering non-refundable rates. To appeal to more guests, we could offer both non-refundable and flexible rates. Guests who make non-refundable reservations are generally more committed to their stays because they'll have to pay if they cancel, make changes, or no-show.
- Manage booking restrictions. We may want to manage booking cancellation policies and restrictions to prevent certain kinds of bookings and lost revenue, for example by tightening the cancellation policies. We have the flexibility to specify whether we offer free cancellation, what time frame applies, and the charges if guests don’t show up.
- Offer compelling prices. As average daily rate is one of the most important features leading to cancellation, guests might cancel if they find a comparable room for a better price at a different property or the same room for a cheaper price on a different website.
- Make booking changes and requests easier. More special requests, parking space requests, and booking changes mean guests are less likely to cancel.
- Start developing a strategy for each cluster. The clusters can help us better prepare for our guests, engage with them in a more targeted manner, and get a sense of the cancellation risk.
- Consider using our predictive model. This would allow us to forecast occupancy more accurately, manage our business accordingly, and increase our revenue.