Sunday, July 11, 2021

Deriving the positive and negative factors from hotel reviews using NLP

Data-driven decision making is the practice of using data analysis to make decisions and improvements in a business. Today data is collected in every way possible. For example, the time you take to read this article is a data point. This article discusses data-driven decision making in the context of hotel administration. A hotel with many positive reviews is a good place to stay, and those reviews also reveal the factors that contribute to the hotel's success.

Hotel businesses today depend on the star rating, which determines their popularity among users. The star rating gives only a general idea of popularity; it does not identify the factors (the services provided by the hotel) behind it. Natural language processing (NLP) gives us a powerful way to fill this gap. By harnessing it, we can determine which factors contribute to a good review for a hotel.

As discussed above, reviews contain many kinds of information. If a hotel has good-quality food, the customers who review it will mention it. Hotel management needs to investigate these factors to improve its services, and it will be successful if it knows from the data which factors need improvement.

Review data is very large in volume: reviews for a hotel typically number in the thousands. This makes it very hard to process the data with conventional methods. Natural language processing provides methods to process this large volume of data and extract insight from it.

The data used in this article is taken from trivago, a hotel price aggregator. It covers 255 different hotels in Chennai, Tamil Nadu, India, with 4768 individual reviews. First, we will look at the first 5 rows.

The dataset has 9 columns, including the hotel name, the review title, the review text, the sentiment (with values 1, 2 and 3 meaning negative, neutral and positive respectively) and the rating percentage. The rating percentage corresponds to the traditional star rating that we already know. The remaining 4 columns can be dropped.

We then look at the different hotels in the dataset.





In the above image, we can see that the hotels range from large 5-star hotels to homestays.

We will analyse people's sentiment, and the factors that affect it, in a 3-step process: first an analysis of the ratings, then an analysis of the sentiment, and finally an analysis of the factors that determine the sentiment.

The first step is to analyse the rating percentage and check how the ratings are distributed for each hotel. Because of the large amount of data in this dataset, we will consider only the first three hotels.
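As a rough illustration of this first step, the sketch below counts how many reviews a hotel received at each rating percentage. It uses only the Python standard library. The field names `hotel_name` and `rating_percentage`, the hotel names and the sample rows are assumptions made for the example; they are not taken from the actual dataset.

```python
from collections import Counter, defaultdict

# Illustrative rows only: field names and values are assumptions,
# not the real trivago data.
reviews = [
    {"hotel_name": "Hotel A", "rating_percentage": 100},
    {"hotel_name": "Hotel A", "rating_percentage": 100},
    {"hotel_name": "Hotel A", "rating_percentage": 80},
    {"hotel_name": "Hotel B", "rating_percentage": 60},
    {"hotel_name": "Hotel B", "rating_percentage": 40},
]


def rating_distribution(rows):
    """Count the reviews each hotel received at each rating percentage."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["hotel_name"]][row["rating_percentage"]] += 1
    return counts


distribution = rating_distribution(reviews)
for hotel, ratings in distribution.items():
    # List ratings from highest to lowest, like the bars of a rating chart.
    for rating in sorted(ratings, reverse=True):
        print(f"{hotel}: {rating}% -> {ratings[rating]} review(s)")
```

The per-hotel counts produced this way are the kind of numbers a bar chart of review ratings would be drawn from.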

Fig 3 Review rating of Accord Metropolitan

Fig 4 Review rating of Harrisons Chennai

Fig 5 Review rating of The Park Chennai

We can see that the three hotels are very well rated, with The Park Chennai having the most diverse ratings of the three. Let us look closely at each graph to see what is happening with the reviews and try to determine the quality of the hotels. The number of reviews is not equal across the hotels, but for this analysis let us generalize from the data we have in hand.

In Fig 3, Accord Metropolitan has a high number of 100% ratings, which means that the hotel has more positive reviews than negative ones. In Fig 4, Harrisons Chennai has a significant number of positive reviews, but still fewer than Accord Metropolitan, which suggests it is not as good as Accord Metropolitan. Finally, in Fig 5, the rating percentages are spread across many values, unlike the other two hotels. There are almost equal numbers of positive and negative reviews if we take 60% as the midpoint.

In the second step we look at the sentiment of the reviews, categorized as described earlier. We will first count how many reviews are positive, neutral and negative for the same three hotels, and see how these counts correlate with our observations on the ratings.
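The counting in this step could look something like the sketch below. The mapping of 1, 2 and 3 to negative, neutral and positive follows the dataset description; the field names `hotel_name` and `sentiment` and the sample rows are assumptions for illustration only.

```python
from collections import Counter

# Sentiment codes as described for the dataset.
SENTIMENT_LABELS = {1: "negative", 2: "neutral", 3: "positive"}

# Illustrative rows only: field names and values are assumptions.
reviews = [
    {"hotel_name": "Hotel A", "sentiment": 3},
    {"hotel_name": "Hotel A", "sentiment": 3},
    {"hotel_name": "Hotel A", "sentiment": 1},
    {"hotel_name": "Hotel B", "sentiment": 2},
]


def sentiment_counts(rows, hotel):
    """Count negative, neutral and positive reviews for one hotel."""
    # Start every label at zero so that missing categories still show up.
    counts = Counter({label: 0 for label in SENTIMENT_LABELS.values()})
    for row in rows:
        if row["hotel_name"] == hotel:
            counts[SENTIMENT_LABELS[row["sentiment"]]] += 1
    return counts


hotel_a = sentiment_counts(reviews, "Hotel A")
print(dict(hotel_a))
```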

Fig 6 Review sentiment for Accord Metropolitan

Fig 7 Review sentiment for Harrisons Chennai

Fig 8 Review sentiment for The Park Chennai


Figures 6 to 8 show how our observations on the rating percentages are reflected in the sentiment of the reviews. Now we need to determine the factors that users like in a hotel. This is an essential step in determining which services the hotel needs to improve.


Base Idea of determining the factors

The base idea for determining the factors liked by the people who visited the hotel is to examine the reviews. For example, take the sentence below, which is a positive sentence:

“This hotel has very good food and free breakfast options”

In the above sentence, very good food and breakfast options are the positive points about the hotel. If we find good food and breakfast in other reviews of the same type, we conclude that good food and breakfast options are behind the hotel's positive reviews.

Now let us look at a negative sentence:

“The toilet in the rooms is bad and not clean”

The sentence mentions the toilet and its condition. If more users mention issues with the toilet and cleanliness, we determine that the hotel has bad toilet facilities and needs to improve them.
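The base idea can be sketched with a simple word count after stop-word removal, applied here to the two example sentences. The stop-word list is a small hand-written set chosen for this example (an assumption, not the list used for the figures in this article), and the function name `frequent_words` is hypothetical.

```python
import re
from collections import Counter

# Small hand-written stop-word list, for illustration only.
STOP_WORDS = {
    "a", "an", "and", "are", "has", "in", "is", "it", "its",
    "not", "of", "the", "this", "to", "very",
}


def frequent_words(texts, top_n=5):
    """Return the top_n most frequent non-stop-words across the texts."""
    counter = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", text.lower())
        counter.update(word for word in words if word not in STOP_WORDS)
    return counter.most_common(top_n)


positive_reviews = ["This hotel has very good food and free breakfast options"]
negative_reviews = ["The toilet in the rooms is bad and not clean"]

print(frequent_words(positive_reviews, top_n=10))
print(frequent_words(negative_reviews, top_n=10))
```

Note that removing "not" as a stop word turns "not clean" into "clean", so single-word counts lose negation. This is one reason to read the frequent words of positive and negative reviews separately.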


Using this base idea, we continue our analysis with the same three hotels and look at what the reviews say about each hotel.

Fig 9 Most frequent words in Accord Metropolitan positive reviews

Fig 10 Most frequent words in Accord Metropolitan negative reviews

Fig 11 Most frequent words in Harrisons Chennai positive reviews

Fig 12 Most frequent words in Harrisons Chennai neutral reviews



Fig 13 Most frequent words in The Park Chennai positive reviews



Fig 14 Most frequent words in The Park Chennai negative reviews


Fig 9 shows the most frequent words in the positive reviews of Accord Metropolitan. Many reviews mention words such as staff, good, food, location, perfect, nice and experience. The negative reviews of Accord Metropolitan (Fig 10) have frequent words such as hotel, room, phone, technical and problem. They indicate that the hotel is facing technical problems in its rooms, so it needs to improve basic amenities such as phones, lights and others.

Now let us take The Park Chennai, which has almost equal numbers of positive and negative reviews. In Fig 13 the most frequent words are clean, staff, pleasant, location, rooms and maintained. In the negative reviews in Fig 14, the frequent words are dirty, old, curtains, pillows, stained and stains. From the negative reviews we conclude that the hotel is old, its rooms are dirty, and there are stains on pillows and other items such as curtains. So this hotel needs to improve its overall cleanliness to improve business.


For this problem I have used only stop-word removal for text pre-processing. The text could be pre-processed further using lemmatization and stemming to improve the accuracy of the frequent words. We could also use other parameters when extracting the frequent words to eliminate more terms such as the, I, etc.
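One possible "other parameter" is a document-frequency cut-off: drop any word that appears in more than a chosen fraction of the reviews, since such words say little about any particular factor. The sketch below is only one option and is not what was used for the figures above. The function name, the 0.8 threshold and the sample reviews are assumptions for illustration; the idea is similar in spirit to the `max_df` parameter of scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter


def drop_common_words(texts, max_doc_fraction=0.8):
    """Count words, excluding those found in more than
    max_doc_fraction of the texts."""
    term_freq = Counter()  # total occurrences of each word
    doc_freq = Counter()   # number of texts containing each word
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        term_freq.update(words)
        doc_freq.update(set(words))
    limit = max_doc_fraction * len(texts)
    return Counter(
        {word: count for word, count in term_freq.items()
         if doc_freq[word] <= limit}
    )


# Made-up reviews for illustration only.
sample_reviews = [
    "The room was dirty",
    "The curtains in the room were stained",
    "The pillows were dirty and the staff ignored it",
    "The room was old",
]

filtered = drop_common_words(sample_reviews)
print(filtered.most_common(3))
```

Here "the" occurs in every sample review and is dropped, while "room" and "dirty" survive. With real data the threshold would need tuning, because a word such as "room" can be both very common and still informative.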


