Deriving the positive and negative factors from hotel reviews using NLP
Data-driven decision making is the practice of using data analysis to make decisions and improvements in a business. Today data is collected by every means possible. For example, the time you take to read this article is a data point. This article discusses data-driven decision making in the context of hotel administration. A hotel with many positive reviews is likely a good place to stay, and those reviews also reveal the factors that contribute to the hotel's success.
Nowadays, hotel businesses depend on the star rating, which determines their popularity among users. The star rating gives only a general idea of popularity; it does not identify the factors behind it (the services provided by the hotel). Natural language processing (NLP) now offers powerful methods for this. By harnessing them, we can determine which factors contribute to a good review for a hotel.
As discussed above, reviews contain several kinds of information. If a hotel has good quality food, the customers who review it will mention it. Hotel management needs to investigate these factors to improve its services, and it will be successful if it knows from the data which factors need improvement.
Review data is large in volume and can run into thousands of reviews for a hotel. It is therefore very hard to harness and process with conventional methods. Natural language processing provides methods to process this large volume of data and gain insight from it.
The data used in this article is taken from trivago, a hotel price aggregator. It covers 255 different hotels in Chennai, Tamil Nadu, India, with 4768 individual reviews. First, we will look at the first 5 rows.
The dataset has 9 columns. These include the hotel name, review title, review text, sentiment (with values 1, 2, and 3, meaning negative, neutral, and positive respectively), and the rating percentage. The rating percentage is the traditional star rating that we already know. The remaining 4 columns can be dropped.
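As a minimal sketch of this column selection, the snippet below reads a CSV and keeps only the five columns described above. The column names (`hotel_name`, `review_title`, `review_text`, `sentiment`, `rating_percentage`), the extra `reviewer_id` column, and the two sample rows are assumptions made up for illustration; the real export may use different headers.

```python
import csv
import io

# Hypothetical column names; adjust them to match the actual CSV headers.
KEEP = ["hotel_name", "review_title", "review_text", "sentiment", "rating_percentage"]

# Sentiment codes as described in the article.
SENTIMENT_LABELS = {1: "negative", 2: "neutral", 3: "positive"}


def load_reviews(handle):
    """Read reviews from an open CSV file and keep only the columns in KEEP."""
    rows = []
    for record in csv.DictReader(handle):
        row = {name: record[name] for name in KEEP}
        # Assumes both fields are stored as whole numbers in the file.
        row["sentiment"] = int(row["sentiment"])
        row["rating_percentage"] = int(row["rating_percentage"])
        rows.append(row)
    return rows


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "hotel_name,review_title,review_text,sentiment,rating_percentage,reviewer_id\n"
    "Hotel A,Great stay,Very good food and friendly staff,3,100,17\n"
    "Hotel B,Disappointing,The room was dirty,1,40,42\n"
)
reviews = load_reviews(sample)
for row in reviews[:5]:  # preview the first rows
    print(row["hotel_name"], SENTIMENT_LABELS[row["sentiment"]], row["rating_percentage"])
```

With a real file, the `io.StringIO` sample would be replaced by an open file handle, for example `open(path, newline="")` where `path` points at the downloaded CSV.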
We then look at the different hotels in the dataset.
In the above image, we can see that the hotels range from big 5-star hotels to homestays.
We will analyse the sentiment of people and the factors that affect it in a 3-step process: first an analysis of the ratings, then an analysis of the sentiment, and finally an analysis of the factors that determine the sentiment.
The first step is to analyse the rating percentage and check how the ratings are distributed for each hotel. Because the dataset is large, we will consider only the first three hotels.
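One possible way to tally the ratings per hotel is sketched below. It assumes the reviews are available as (hotel, rating percentage) pairs; the hotel names and values in `sample` are made up and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def rating_distribution(reviews):
    """Map each hotel to a Counter of rating percentage -> number of reviews."""
    distribution = defaultdict(Counter)
    for hotel, rating in reviews:
        distribution[hotel][rating] += 1
    return distribution


# Made-up (hotel, rating percentage) pairs for illustration only.
sample = [
    ("Hotel A", 100), ("Hotel A", 100), ("Hotel A", 80),
    ("Hotel B", 100), ("Hotel B", 60), ("Hotel B", 20),
]
ratings = rating_distribution(sample)
for hotel in sorted(ratings):
    # List the highest ratings first for each hotel.
    for rating, count in sorted(ratings[hotel].items(), reverse=True):
        print(f"{hotel}: {rating}% x {count}")
```

Counts like these are what a bar chart per hotel would display.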
Fig 3 Review rating of Accord Metropolitan
Fig 4 Review rating of Harrisons Chennai
Fig 5 Review rating of The Park Chennai
We can see that all three hotels are well rated, with The Park Chennai showing the most varied ratings of the three. Let us look closely at each graph to see what is happening with the reviews and try to determine the quality of the hotels. The number of reviews is not the same for every hotel, but for this analysis let us generalize from the data that we have in hand.
In Fig 3, Accord Metropolitan has a large number of 100% ratings, which means that the hotel has more positive reviews than negative ones. In Fig 4, Harrisons Chennai also has a significant number of positive reviews, but fewer than Accord Metropolitan, which suggests it is not rated quite as highly. Finally, in Fig 5, the rating percentages of The Park Chennai are spread across many values, unlike the other two hotels. If we take 60% as the dividing line, there are almost equal numbers of positive and negative reviews.
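The 60% split can be expressed as a small helper. This is only a sketch: the ratings in `spread_ratings` are invented, and treating a rating of exactly 60% as positive is an assumption, since the article does not say which side the cutoff itself falls on.

```python
def split_by_cutoff(ratings, cutoff=60):
    """Return (at_or_above, below) counts for a list of rating percentages."""
    # Assumption: a rating equal to the cutoff counts as positive.
    at_or_above = sum(1 for rating in ratings if rating >= cutoff)
    return at_or_above, len(ratings) - at_or_above


# Invented ratings spread across many values.
spread_ratings = [100, 80, 60, 40, 20, 20]
high, low = split_by_cutoff(spread_ratings)
print(f"positive: {high}, negative: {low}")
```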
In the second step we look at the sentiment of the reviews, using the categories mentioned earlier (1 for negative, 2 for neutral, 3 for positive). We first count how many reviews are positive, neutral, and negative for the three hotels we have already seen, and check how these counts correlate with our observations on the ratings.
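A minimal sketch of this count, assuming the reviews are available as (hotel, sentiment code) pairs with the codes 1, 2, and 3 described earlier. The hotel names and rows in `sample` are made up.

```python
# Sentiment codes as described in the article.
SENTIMENT_LABELS = {1: "negative", 2: "neutral", 3: "positive"}


def sentiment_counts(reviews, hotel):
    """Count negative, neutral, and positive reviews for one hotel."""
    # Start every label at zero so missing categories still appear.
    counts = {label: 0 for label in SENTIMENT_LABELS.values()}
    for name, code in reviews:
        if name == hotel:
            counts[SENTIMENT_LABELS[code]] += 1
    return counts


# Made-up (hotel, sentiment code) pairs for illustration only.
sample = [("Hotel A", 3), ("Hotel A", 3), ("Hotel A", 1), ("Hotel B", 2)]
counts_a = sentiment_counts(sample, "Hotel A")
print(counts_a)
```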
Fig 6 Review sentiment for Accord Metropolitan
Fig 7 Review sentiment for Harrisons Chennai
Fig 8 Review sentiment for The Park Chennai
Figures 6 to 8 show how our observations on the rating percentages are reflected in the sentiment of the reviews. Now we need to determine the factors that users like in each hotel. This is an essential step in working out which services the hotel needs to improve.
Base Idea of determining the factors
The base idea for determining the factors liked by the people who visited a hotel is to examine the reviews. For example, take the sentence below, which is a positive one:
“This hotel has very good food and free breakfast options”
In the above sentence, the positive phrases are very good food and breakfast options. If we find good food and breakfast mentioned across many reviews of the same type, we conclude that good food and breakfast options are behind the hotel's positive reviews.
Now let us look at a negative sentence:
“The toilet in the rooms is bad and not clean”
The sentence mentions the toilet and its condition. If many users mention issues with the toilet and cleanliness, we determine that the hotel has bad toilet facilities and needs to improve them.
Using this base idea, we proceed with the same three hotels and look at what the reviews are saying about each one.
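The base idea can be sketched as a word count over one group of reviews after removing stop words. The stop-word list here is a deliberately tiny, hand-written assumption (a real analysis would use a fuller list from an NLP library), the first sample sentence is the positive example from above, and the second is made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "is", "in", "has", "this",
    "very", "not", "i", "was", "of", "to",
}


def frequent_words(texts, top_n=10):
    """Return the top_n most common non-stop-words across the given texts."""
    words = Counter()
    for text in texts:
        # Lowercase the text and split it into alphabetic tokens.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                words[token] += 1
    return words.most_common(top_n)


positive_reviews = [
    "This hotel has very good food and free breakfast options",
    "Good food and a good breakfast",  # made-up second review
]
top = frequent_words(positive_reviews, top_n=3)
print(top)
```

Running the same function separately on the positive and the negative reviews of a hotel gives the two word lists that are compared for each hotel.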
Fig 9 Most frequent words in Accord Metropolitan positive reviews
Fig 10 Most frequent words in Accord Metropolitan negative reviews
Fig 11 Most frequent words in Harrisons Chennai positive reviews
Fig 12 Most frequent words in Harrisons Chennai neutral reviews
Fig 13 Most frequent words in The Park Chennai positive reviews
Fig 14 Most frequent words in The Park Chennai negative reviews
Fig 9 shows the most frequent words in the positive reviews of Accord Metropolitan. Many reviews mention that the hotel staff is good, along with words such as food, location, perfect, nice, and experience. The negative reviews of Accord Metropolitan (Fig 10) have the following frequent words: hotel, room, phone, technical, problem, etc. They indicate that the hotel is facing technical problems in its rooms, so it needs to improve basic amenities like phones, lights, and others. Next, let us take The Park Chennai, which has an almost equal number of positive and negative reviews. In Fig 13 the most frequent words are clean, staff, pleasant, location, rooms, maintained, etc. In the negative reviews in Fig 14, the frequent words are dirty, old, curtains, pillows, stained, stains, etc. From the negative reviews we conclude that the hotel is old, its rooms are dirty, and there are stains on pillows and other items like curtains. So this hotel needs to improve its overall cleanliness to improve business.
Limitations
In this analysis I have used only stop-word removal for text pre-processing. The text could be pre-processed further using lemmatization and stemming to improve the accuracy of the frequent words. We could also use other parameters when extracting the frequent words, to eliminate more terms such as "the", "I", etc.
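To illustrate why stemming would help, the sketch below merges variants such as "stained" and "stains" into one term before counting. `crude_stem` is a toy suffix stripper written for this example, not a real stemming algorithm; a real analysis would more likely use a stemmer or lemmatizer from an NLP library. The word list reuses frequent words from the negative reviews discussed above, with arbitrary counts.

```python
from collections import Counter

# Suffixes to strip, checked in this order (toy rule set, not a real stemmer).
SUFFIXES = ("ed", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["stained", "stains", "pillows", "dirty", "curtains", "dirty"]
raw = Counter(words)                             # variants counted separately
merged = Counter(crude_stem(w) for w in words)   # variants merged

print(raw.most_common())
print(merged.most_common())
```

Without stemming, "stained" and "stains" each count once; after merging they contribute to a single "stain" entry, which gives a truer picture of how often the issue is mentioned.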