Data Mining of Restaurant Review Using W PDF
Abstract:
Many customers visit a restaurant based on food critics and reviews on websites such as Zomato.com. Restaurants thrive at the
initial stages of opening, but their demand deteriorates after the initial hype. Further business for these restaurants is largely
based on their reviews. What can a restaurant do to make its ratings better? Food taste is an obvious trigger to improve the
ratings of a restaurant, but there are other factors that improve the ratings as well, such as the inclusion of a cuisine, the option
of home delivery, the availability of table booking etc. This paper aims at creating a prediction model for the reviews and analyzing
the trigger event that would improve the ratings.
Keywords- Restaurant review, Zomato, Multilayer Perceptron, Naïve Bayes, J48, Data mining
I. INTRODUCTION
Machine learning sheds light on various domains unexplored by human analysis. It provides
viewpoints that are not visible in general. Prediction and classification models which took
scientists decades to create are now achieved in days. Data available in huge amounts can be
studied through machine learning algorithms to arrive at meaningful information.

Data mining is used in predicting disease diagnosis, weather and customer expectations, and in
learning from data to create automation, purchase patterns etc. There are four steps in the
process of data mining: data collection, data pre-processing, machine learning and data
visualisation.

Data collection is a predominant and difficult step in data mining. The data that is collected
should be relevant and should cover all the spaces of the domain. Concentrating on one sample
space would lead to bias in the prediction or classification result. After data collection comes
data pre-processing. When data is collected, not all of the information is relevant to the
machine learning. For example, age and date of birth can be two attributes in employee data, but
these attributes are dependent on one another. The presence of such redundant information would
lead to decreased accuracy. The next step is machine learning. Machine learning is the process
where the system learns the data. From the learned knowledge it predicts, associates, classifies
and clusters the data. For this purpose, various algorithms are used. The information gained from
machine learning is better understood through visualization.

Restaurants are a domain traversed by small as well as big players. Data mining provides a way
for both to improve their business with minimum effort. The restaurant business relies on the
taste of food, the variety of cuisine provided in the restaurant, ambience, availability of home
delivery, online booking, price etc. When any of these factors is improved or included, it is
possible to increase customer attention and thus increase productivity in business.

Zomato is a website and a mobile application which provides information about restaurants and
reviews of restaurants, and allows online ordering from the restaurants. The data was extracted
using the Zomato API [1] by Shruti Mehta. This dataset was downloaded from Kaggle, an online
repository for datasets [2].

II. LITERATURE SURVEY
There are a couple of research papers published based on restaurant reviews and hotel reviews.
Following is a survey of such papers. [3] is a paper which reviews Thai restaurants around the
world. It attempts to classify the restaurants
based on the reviews. The model proposed in this paper extracts reviews from a social networking
site using text processing, and an artificial neural network is used to classify the dataset as
positive or negative. The mRMR feature selection technique is used for selecting the features of
the data set.

[4] analyses fast food franchise data to help franchises reap benefits. Time series data from the
stores as well as corporate is used with an ARIMA model to understand the data. Outlier detection
is used to identify sales opportunities and risks.

In [5] the Yelp restaurant review dataset is used to model a system to improve restaurants. Here
the Latent Dirichlet Allocation (LDA) algorithm is used to find subtopics from the reviews. The
ratings for the hidden topics make it possible to understand the reason for a rating.

In [6] the reviews are scraped from www.tripadvisor.com using a web crawler. The reviews are
separated into positive and negative polarity using SentiWordNet, and various machine learning
algorithms are used to check their accuracy.

In most of these research papers reviews are extracted from one website and a classification
model is created. This paper is an attempt to create a trigger model to improve restaurants based
on the Zomato dataset.

III. DATA COLLECTION
The Zomato data set is downloaded from the Kaggle data repository. The dataset contains 22
attributes and 9552 instances. The attributes present in the dataset are: Restaurant Id,
Restaurant Name, Country Code, City, Address, Locality, Locality Verbose, Longitude, Latitude,
Cuisines, Average Cost for two, Currency, Has Table booking, Has Online delivery, Is delivering,
Switch to order menu, Price range, Aggregate Rating, Rating color, Rating text, Votes.

IV. DATA PRE-PROCESSING
Data pre-processing can be data cleaning or data transformation. The dataset in Kaggle can be
used for classification or association mining. When used in classification or prediction it is
necessary to identify the features that would enable higher accuracy in classification. [7]
suggests the use of data pre-processing to improve machine learning. Classification and
clustering accuracy is predominantly dependent on the proper representation of data. Correlation
based feature selection is used to reduce the number of features.

V. MACHINE LEARNING
Machine learning literally means making the machine learn; the machine learns by processing the
data with various machine learning algorithms [7]. No single algorithm provides the highest
accuracy on every problem; this is known as the No Free Lunch theorem [9]. However, deep learning
often provides better accuracy.
For any application it is important to apply a few machine learning algorithms to find the best
suited model. Machine learning algorithms can be grouped under Bayes, rule based, neural network
and decision tree.

A. Naïve Bayes
Naïve Bayes is a well suited machine learning algorithm when the features are independent of one
another [10]. Each instance is considered as a vector. The posterior probability of a class given
a predictor is found with
P(h|d) = (P(d|h) * P(h)) / P(d)
P(h|d) - Posterior probability of a class given a predictor
P(d|h) - Likelihood of the predictor given the class
P(h) - Prior probability of a class
P(d) - Prior probability of a predictor

B. Decision Tree
A decision tree is arrived at by finding the optimum way to arrange the various nodes. There are
two ways to identify the best partition of the dataset at a node: information gain or gain ratio.
The decision tree model which uses information gain is ID3, and the one which uses gain ratio is
J48, the WEKA implementation of C4.5 [11].

C. Multilayer Perceptron
A multilayer perceptron contains a large number of nodes called neurons, joined together so that
they form an input layer, a hidden layer and an output layer.
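
As an illustration of this layered structure, the following minimal sketch passes one instance
through an input layer, one hidden layer and an output layer. The weights, biases, sigmoid
activation and two-class output are hypothetical values chosen for illustration only; they are
not taken from the WEKA model used in this paper, and no training step is shown.

```python
import math


def sigmoid(x):
    """Logistic activation: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: each neuron computes sigmoid(w . inputs + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(instance, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer; returns the output activations."""
    hidden = layer(instance, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical toy network: 2 input attributes, 2 hidden neurons, 2 output classes.
HIDDEN_W = [[0.5, -0.4], [0.3, 0.8]]
HIDDEN_B = [0.0, -0.1]
OUTPUT_W = [[1.2, -0.7], [-1.0, 0.9]]
OUTPUT_B = [0.05, -0.05]


def predict(instance):
    """Return the index of the output neuron with the highest activation."""
    outputs = mlp_forward(instance, HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
    return max(range(len(outputs)), key=outputs.__getitem__)
```

In a trained network the weights and biases would be learned from the data rather than fixed by
hand as they are here.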
Instances are supplied through the input layer, bias and weights are applied at the hidden layer,
and the output layer supplies the class [12].

VI. EXPERIMENTATION
The dataset acquired from Kaggle first undergoes data preprocessing. From the information about
the dataset it was found that some attributes were redundant: restaurant id and restaurant name
represented the same information; locality, locality verbose and latitude/longitude represented
the same information; rating color and rating text represented the same information. To avoid
redundancy of attributes only one of the representations was kept. Average cost for two is an
attribute whose value is not standard. It depends on the currency attribute. Using the currency
information, the average price is converted into standard US dollar format. Correlation based
feature selection with the ranker algorithm is done to reduce the number of attributes.

Machine learning algorithms such as J48, Naïve Bayes and Multilayer perceptron tend to give good
results on most datasets. So Multilayer perceptron, J48 and Naïve Bayes classification are used
to train models in WEKA, a free data mining tool published by the University of Waikato. The
dataset is preprocessed, feature selected, trained and tested using WEKA. The algorithm found to
give the best result is J48.

TABLE II
ACCURACY RESULTS FOR CLASSIFICATION MODEL FOR ZOMATO DATASET

Algorithm               Accuracy
J48                     97.2%
Multilayer Perceptron   78.16%
Naïve Bayes             82.2%

To find the trigger to improve ratings, a sample record with a poor rating is taken and modified
to reduce the price range to one. This sample record is tested on the J48 Zomato model. It was
found that there was no change in rating, whereas when the country code was changed there was a
change in rating.

VII. CONCLUSION AND FUTURE WORK
The Zomato dataset is used to create a classification model for restaurant rating. It was found
that J48 works well with this dataset. In this paper an attempt is made to predict the trigger
which would further enhance the rating of the review. This project can be further extended to
create a tool to evaluate the trigger to improve the ratings.

ACKNOWLEDGMENT
I thank my college, New Horizon College of Engineering, for providing support and tools for this
research. I thank the Head of Department, Dr. B. Rajalakshmi, for her support and guidance.

REFERENCES
1. https://developers.zomato.com/api#headline1
2. https://www.kaggle.com/shrutimehta/zomato-restaurants-data
3. Claypo, Niphat, and Saichon Jaiyen. "Opinion mining for Thai restaurant reviews using neural
networks and mRMR feature selection." Computer Science and Engineering Conference (ICSEC), 2014
International. IEEE, 2014.
4. Liu, Lon-Mu, et al. "Data mining on time series: an illustration using fast-food restaurant
franchise data." Computational Statistics & Data Analysis 37.4 (2001): 455-476.
5. Huang, James, Stephanie Rogers, and Eunkwang Joo. "Improving restaurants by extracting
subtopics from yelp reviews." iConference 2014 (Social Media Expo) (2014).
6. V. B. Raut and D. D. Londhe, "Opinion Mining and Summarization of Hotel Reviews," 2014
International Conference on Computational Intelligence and Communication Networks, Bhopal, 2014,
pp. 556-559. doi: 10.1109/CICN.2014.126
7. D. H. Deshmukh, T. Ghorpade, and P. Padiya, "Improving classification using preprocessing and
machine learning algorithms on NSL-KDD dataset," in Communication, Information & Computing
Technology (ICCICT), 2015 International Conference on. IEEE, 2015, pp. 1-6.
8. Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review
of classification techniques." Emerging artificial intelligence applications in computer
engineering 160 (2007): 3-24.
9. Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE
transactions on evolutionary computation 1.1 (1997): 67-82.
10. Lewis, David D. "Naive (Bayes) at forty: The independence assumption in information
retrieval." European conference on machine learning. Springer, Berlin, Heidelberg, 1998.
11. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
12. Goodman, Rodney M., and Zheng Zeng. "A learning algorithm for multi-layer perceptrons with
hard-limiting threshold units." Neural Networks for Signal Processing [1994] IV. Proceedings of
the 1994 IEEE Workshop. IEEE, 1994.