Contents: NB; SVM; Random forest; Decision tree
NB
The Naïve Bayes statistical classifier is used to analyze class membership. The class-membership probability is the probability that a given tuple belongs to a particular class. The classifier rests on an independence assumption: the presence or absence of a particular feature is independent of the presence or absence of any other feature. To handle this uncertainty, the Naïve Bayes classifier examines each feature individually [5]. “Naïve Bayes classification is based on Bayes theorem with naïve (strong) class conditional independence” [5]. For a tuple $X$ and a class $C$, Bayes' theorem is given as follows:
$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$
In plain English, the above equation can be written as
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
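The independence assumption can be illustrated with a minimal sketch of a multinomial Naïve Bayes text classifier. The toy reviews, the function names `train_naive_bayes` and `predict`, and the use of add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from any of the studies reviewed here. Because the evidence term is the same for every class, the sketch compares only log prior plus summed log likelihoods.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Fit counts from a list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "n_docs": len(docs)}


def predict(model, tokens):
    """Return the class with the highest posterior, ignoring the shared evidence term."""
    best_label, best_score = None, -math.inf
    vocab_size = len(model["vocab"])
    for label, n_c in model["class_counts"].items():
        # Log prior: log P(c)
        score = math.log(n_c / model["n_docs"])
        total = sum(model["word_counts"][label].values())
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # skip words never seen in training
            # Each word contributes independently: log P(tok | c), add-one smoothed
            count = model["word_counts"][label][tok]
            score += math.log((count + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy reviews, invented for illustration only.
train = [
    ("great food friendly staff".split(), "pos"),
    ("great service".split(), "pos"),
    ("terrible food rude staff".split(), "neg"),
    ("terrible slow service".split(), "neg"),
]
model = train_naive_bayes(train)
print(predict(model, "great friendly service".split()))  # pos
print(predict(model, "rude slow staff".split()))         # neg
```

Working in log space avoids numerical underflow when many small per-word probabilities are multiplied.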
Vinodhini and Chandrasekaran [197] used principal component analysis to reduce dimensionality and combined it with a hybrid technique for sentiment classification. They experimented with five hundred reviews of digital cameras and used a combination of unigrams, bigrams, and trigrams for feature representation. Support Vector Machine and Naïve Bayes were used for comparison with bagged SVM and Bayesian boosting. They concluded that Bayesian boosting outperformed all the other methods for feature representations such as unigrams; its highest precision was 83.3%.
Tan et al. [121] applied a supervised learning method to cross-domain sentiment classification. They proposed an effective measure, frequently co-occurring entropy, to find the features that occur in both domains, i.e. the old-domain data and the new unlabeled data. They then used a weighted expectation-maximization-based Adapted Naïve Bayes to train the classification model for the new-domain data. For the experiments, they used education reviews (1012 negative and 254 positive), stock reviews (683 negative and 364 positive), and computer reviews (390 negative and 544 positive). The proposed approach outperformed Naïve Bayes, Expectation Maximization Naïve Bayes, and the Naïve Bayes transfer classifier, yielding an average micro-F1 of 82.62% and an average macro-F1 of 79.26%.
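Micro- and macro-averaged F1 differ in how per-class results are pooled, which matters for imbalanced review sets like the ones above. The following is a minimal sketch: the label lists and the function names `f1_from_counts` and `micro_macro_f1` are hypothetical and do not reproduce the cited experiments. Micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1, whereas macro-F1 computes F1 per class and then takes the unweighted mean.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        per_class_f1.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class_f1) / len(per_class_f1)    # mean of per-class F1
    micro = f1_from_counts(total_tp, total_fp, total_fn)  # F1 of pooled counts
    return micro, macro


# Hypothetical imbalanced labels: 2 positive and 4 negative reviews.
y_true = ["pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 4), round(macro, 4))  # 0.6667 0.625
```

In this toy case the minority class scores lower (F1 of 0.5 versus 0.75), so the macro average falls below the micro average, which is dominated by the larger class.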
Kang et al. [111] employed a new senti-lexicon-based sentiment technique for restaurant reviews. The lexicon was built on unigrams, bigrams, negation, and intensifiers. Improved Naïve Bayes methods were proposed to narrow the gap between positive and negative classification accuracy and thereby improve the average classification accuracy. Sentiment classification was carried out on about 70,000 documents from different restaurant sites. The proposed Naïve Bayes outperformed the baseline approaches and reached an accuracy of 81.2%.
A Naïve Bayes algorithm was trained on the Pang and Lee [167] datasets for a binary sentiment classification system and reached an F1 measure of up to 86%. The study reported several findings; for example, social media as a whole has a stronger relationship with stock market performance than other media.
Naïve Bayes, with some modifications, has performed well for document classification [44, 121].
Zhang et al. [114] performed sentiment classification of restaurant reviews using machine learning algorithms, namely Naïve Bayes and Support Vector Machine. They studied the effect of feature representation and feature size on the performance of sentiment classification. They experimented on 1500 positive and 1500 negative comments with different feature representations (unigram, unigram_freq, bigram, bigram_freq, trigram, and trigram_freq) and with the number of features varying from 50 to 1600. The maximum accuracy reported was 95.67%, using Naïve Bayes.
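The feature representations named above can be sketched as follows. This sketch assumes that the `_freq` suffix denotes occurrence counts and that the unsuffixed names denote binary presence; the review text and the function names `ngrams` and `featurize` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tokens, n, use_freq):
    """Map a document to {n-gram: value}.

    use_freq=False gives binary presence (1 if the n-gram occurs at all);
    use_freq=True gives the raw occurrence count.
    """
    counts = Counter(ngrams(tokens, n))
    if use_freq:
        return dict(counts)
    return {gram: 1 for gram in counts}


# Hypothetical review text, invented for illustration.
tokens = "very good very good food".split()
unigram = featurize(tokens, 1, use_freq=False)
unigram_freq = featurize(tokens, 1, use_freq=True)
bigram_freq = featurize(tokens, 2, use_freq=True)

print(unigram_freq)  # {('very',): 2, ('good',): 2, ('food',): 1}
print(bigram_freq)   # {('very', 'good'): 2, ('good', 'very'): 1, ('good', 'food'): 1}
```

Limiting the number of features, as in the 50 to 1600 range above, would then amount to keeping only the top-ranked n-grams under some selection criterion, which this sketch does not implement.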
SVM
Support vector machines have been applied in practice to text classification. Figure 3 shows an example of points plotted in 2D space. The points belong to two categories (shown here as black and white points), and the SVM finds the hyperplane that maximizes the margin between the two classes [16]. This hyperplane is given by
$$\mathbf{w} \cdot \mathbf{x} + b = \sum_{i} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b = 0$$
where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is an n-dimensional input vector, $y_i$ is its output value, $\mathbf{w} = (w_1, w_2, \ldots, w_n)$ is the weight vector (the normal vector) defining the hyperplane, $b$ is the bias term, and the $\alpha_i$ terms are the Lagrangian multipliers. Once the hyperplane is constructed (the vector $\mathbf{w}$ is defined) with a training set, the class of any other input vector $\mathbf{x}$ can be determined:
If $\mathbf{w} \cdot \mathbf{x} + b \geq 0$ then it belongs to the positive class (the class we are interested in); otherwise it belongs to the negative class.
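The decision rule can be sketched in a few lines, using the standard linear SVM relation in which the weight vector is the sum of the support vectors weighted by their Lagrangian multipliers and labels. The support vectors, multipliers, and bias below are hypothetical values chosen by hand so that both support vectors lie exactly on the margin; no optimizer is run, and the function names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, support_vectors):
    """Compute w = sum_i alpha_i * y_i * x_i."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, support_vectors):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w


def classify(w, b, x):
    """Return +1 if w . x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical hand-picked values (not the output of a trained model).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

w = weight_vector(alphas, labels, support_vectors)
print(w)                               # [0.5, 0.5]
print(classify(w, b, (2.0, 3.0)))      # 1
print(classify(w, b, (-1.0, -2.0)))    # -1
```

With these values each support vector satisfies $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, and the multipliers satisfy $\sum_i \alpha_i y_i = 0$, as a maximum-margin solution requires.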
Support Vector Machine based approaches have been evaluated on movie review datasets [33, 167]. The highest accuracy reached with stylistic and syntactic feature selection was 87.95%. EWGA outperformed all other feature engineering techniques, achieving an accuracy of 91.70% with a Support Vector Machine.
Pang et al. [33] pioneered the application of machine learning to binary sentiment classification of movie reviews, using Naïve Bayes, Maximum Entropy (ME), and Support Vector Machine. For the experiments, they collected movie reviews from IMDb.com. They experimented with various forms of feature engineering, and SVM yielded the highest accuracy, 82.9%, with unigram features.
Basari et al. [157] proposed a hybrid of Particle Swarm Optimization (PSO) and Support Vector Machine for sentiment classification of movie reviews. PSO was used to select the best parameters for solving the dual optimization problem. The experiments were performed on EMOT datasets and achieved an accuracy of up to 76.20% after data cleansing.
Agarwal et al. [177] performed sentiment classification of Twitter data. They experimented with 11,875 manually annotated tweets and proposed five different feature combinations over unigrams, senti-features, and a tree kernel. A Support Vector Machine was used for 2-way and 3-way classification tasks. For the 2-way classification task, unigrams combined with senti-features outperformed all other feature representations, with an accuracy of 75.39%. For the 3-way classification task, the tree kernel combined with senti-features outperformed all other feature representations, with an accuracy of 60.83%.
Abdul-Mageed et al. [108] performed subjectivity and sentiment analysis of social media for a morphologically rich language. The experiments were carried out on 2798 chat turns, 3015 Arabic tweets, 3008 sentences from thirty Modern Standard Arabic Wikipedia Talk pages, and 3097 web forum sentences. Each sentence in the collected datasets was labeled by hand as subjective or objective for the purpose of subjectivity analysis, and 3982 Arabic adjectives were compiled for binary polarity classification of sentences. Compared with the baseline method, the Support Vector Machine produced an accuracy of 73% for tweets and 84.36% for forums in subjectivity classification, and an accuracy of up to 70.30% for chat turns in sentiment classification.
Ortigosa et al. [209] worked on sentiment classification and sentiment change detection for Facebook comments, using a lexicon-based and machine-learning-based approach. They developed a sentiment lexicon using the Spanish Linguistic Inquiry and Word Count (LIWC) and found a large amount of slang in the comments. For evaluation, the proposed lexicon was used with C4.5, Naïve Bayes, and Support Vector Machine to classify 3000 status messages, 1000 for each class (positive, negative, and neutral), yielding accuracies of 83.17%, 83.13%, and 83.27% respectively.
Jiang et al. [176] proposed a context-based, target-dependent approach to sentiment classification of Twitter data. Subjectivity and sentiment classification were performed with a Support Vector Machine, while a graph-based method over related tweets was used to improve performance. PMI was used to find the top K nouns and noun phrases associated with a given target. For the experiments, 400 English tweets were gathered for each target, such as Obama, Google, iPad, and Lady Gaga. The approach reached an accuracy of 68.2% for subjectivity classification and 85.6% for sentiment classification.
Boiy and Moens [46] worked on multi-domain, multilingual sentiment classification. They used different features such as unigrams, negation, and some verbs. They used a cascaded method with three single classifiers, MNB, Maximum Entropy, and Support Vector Machine, in multiple experimental combinations. The best results were obtained with MNB for English, Support Vector Machine for Dutch, and Maximum Entropy for French.
Li et al. [184] implemented intra-day prediction for a stock market. Their data were news articles represented in a vector space model using a sentiment word matrix. Each headline was labeled according to the intra-day return. They used three classes (positive, negative, and neutral) for classification with a Support Vector Machine. For the experiments, news was collected from a Hong Kong news agency over 2003–2008. In comparisons at the stock and index levels, they obtained an accuracy of 69.98% in validation testing.
Li and Li [90] designed a framework for market intelligence comprising multiple tasks: topic detection, opinion classification, credibility assessment, and numeric summarization. The first task was performed with a topic tendency score based on term frequency. The second task used two phases, subjectivity classification and sentiment classification, with Naïve Bayes and Support Vector Machine over emoticons, unigrams, and bigrams. The third task was based on the follower-followed ratio. Finally, they used tweets about different products of three brands (Google, Microsoft, and Sony), trained on 11,929 tweets, and found that the Support Vector Machine performed better than Naïve Bayes.
Lane et al. [73] used opinion mining for favorability analysis. The data were represented with unigrams, bigrams, and trigrams, along with dependency words, and were handled at the training stage for modification and evaluation. Three datasets were gathered from newspapers and magazines covering high-tech companies, and different classifiers, including Naïve Bayes, Support Vector Machine, and RBF, were used in the experiments. They found an accuracy of 91.2% for the SVM, while Naïve Bayes was the best classifier for pseudo-sentiment.
Random forest
Ghose and Ipeirotis [81] extracted multiple text-based features, including subjectivity level, readability measures, and spelling errors. Several reviewer-related features were also considered, for example measures of the reviewer's self-disclosed identity. The study examined how subjectivity and readability influence sales. The experiments were performed on product reviews gathered over 15 months from Amazon.com for audio and video players, digital cameras, and DVDs. SVM and Random Forest (RF) were used for the prediction task, and RF was found to outperform SVM. The results indicated that subjective and objective reviews have the greatest impact on product sales.
Krishnamoorthy [228] designed a predictive model that considered review metadata along with subjectivity features. Linguistic features such as adjectives, state verbs, and action verbs were emphasized. The work was carried out on a dataset of 1653 reviews of multiple items and obtained an F-measure of 87.21% using RF. RF outperformed NB and SVM in this study.
Ott et al. [212] tested spam detection on artificially written positive reviews. The dataset contained 400 positive reviews drawn from TripAdvisor.com and from Amazon Mechanical Turk workers. To detect spam, they used POS features and the LIWC software, and applied NB and SVM to bigram features. An F-measure of 89.8% was reported, with RF identified as the best technique.
Coussement and Van den Poel [77] worked on emotions expressed in emails and in newspapers. The work was carried out on 18,331 emails and Belgian newspapers. Logistic Regression (LR), Support Vector Machine (SVM), and RF were used for sentiment classification, and RF was found to outperform the other two classifiers.
Decision tree
Wang et al. [66] compared three ensemble methods for sentiment classification (bagging, boosting, and random subspace) on five different algorithms (NB, ME, DT, KNN, and SVM). They tested all of these algorithms on ten different datasets and found the best accuracy with the decision tree implementation.
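Of the ensemble methods named above, bagging is the simplest to illustrate: each base learner is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. The following is a minimal sketch under strong simplifying assumptions: the base learner is a one-split decision tree (a stump) on a single numeric feature, the data are hypothetical, and all function names are illustrative. It does not reproduce the experimental setup of the cited study.

```python
import random
from collections import Counter


def majority(labels, default):
    """Most common label, or default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(xs, ys):
    """Fit a one-split decision tree on 1-D inputs; returns (threshold, left, right)."""
    overall = majority(ys, None)
    values = sorted(set(xs))
    best = (len(ys) + 1, None, overall, overall)  # (errors, threshold, left, right)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighbouring values
        left = majority([y for x, y in zip(xs, ys) if x <= t], overall)
        right = majority([y for x, y in zip(xs, ys) if x > t], overall)
        errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
        if errors < best[0]:
            best = (errors, t, left, right)
    return best[1:]


def stump_predict(stump, x):
    threshold, left, right = stump
    if threshold is None:  # sample had a single distinct value: constant prediction
        return left
    return left if x <= threshold else right


def fit_bagging(xs, ys, n_estimators, seed):
    """Train each stump on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def bagging_predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    return majority([stump_predict(s, x) for s in stumps], None)


# Hypothetical 1-D feature (e.g. a lexicon score) with two separable classes.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = ["neg"] * 4 + ["pos"] * 4
stumps = fit_bagging(xs, ys, n_estimators=25, seed=0)
print(bagging_predict(stumps, 0), bagging_predict(stumps, 10))
```

Random subspace differs in sampling features rather than training examples, and boosting reweights examples sequentially; neither is shown here.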
Liu et al. [76] designed a two-phase algorithm to measure helpfulness features. Product feature extraction was performed with a PMI model. Furthermore, a bootstrap algorithm combined with a decision tree outperformed simple linear regression.
Geva and Zahavi [53] evaluated the effect of augmenting market data with textual news for stock decisions. The dataset was collected from the New York Stock Exchange and contained 51,263 news items. A neural network and decision trees were trained on the dataset. The work used a double scoring method, with positive and negative models, to reduce prediction bias, and a rule-based expert system for feature selection. Other classifiers and hybrid models could be tried on textual and market data for better accuracy.
Reyes and Rosso [86] proposed six notable features, including n-grams, POS tags, positive and negative profiling, and affective profiling. Preprocessing was applied before these profiles were built. For the experiments, data were collected from different sources: Amazon.com, Slashdot.com, and TripAdvisor.com. NB, SVM, and DT were trained for classification. The accuracy differed by classifier: SVM reached 75.75% and NB 75.19%, while DT reached 89.05%, the best of the three.
[5] S. Deshmukh, A.S. Dalvi, T.J. Bhalerao, Crime investigation using data mining, International Journal of Advanced Research in Computer and Communication Engineering 4 (3) (2015) 22–24, ISSN 2319-5940.
[16] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. 2 (2001) 45–66.
[33] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, in: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, Association for Computational Linguistics, 2002, pp. 79–86.
[44] S. Wang, D. Li, X. Song, Y. Wei, H. Li, A feature selection method based on improved Fisher’s discriminant ratio for text sentiment classification, Expert Syst. Appl. 38 (2011) 8696–8702.
[46] E. Boiy, M.-F. Moens, A machine learning approach to sentiment analysis in multilingual Web texts, Inform. Retrieval 12 (2009) 526–558, http://dx.doi.org/10.1007/s10791-008-9070-z.
[53] T. Geva, J. Zahavi, Empirical evaluation of an automated intraday stock recommendation system incorporating both market data and textual news, Decis. Support Syst. (2013), http://dx.doi.org/10.1016/j.dss.2013.09.013.
[66] G. Wang et al., Sentiment classification: the contribution of ensemble learning, Decis. Support Syst. (2013), http://dx.doi.org/10.1016/j.dss.2013.08.002.
[73] P.C.R. Lane, D. Clarke, P. Hender, On developing robust models for favourability analysis: model choice, feature sets and imbalanced data, Decis. Support Syst. 53 (2012) 712–718.
[76] Y. Liu, J. Jin, P. Ji, J.A. Harding, R.Y.K. Fung, Identifying helpful online reviews: a product designer’s perspective, Comput. Aided Des. 45 (2013) 180–194.
[77] K. Coussement, D. Van den Poel, Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers, Expert Syst. Appl. 36 (2009) 6127–6134.
[81] A. Ghose, P.G. Ipeirotis, Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics, IEEE Trans. Knowl. Data Eng. 23 (10) (2011).
[86] A. Reyes, P. Rosso, Making objective decisions from subjective data: detecting irony in customer reviews, Decis. Support Syst. 53 (2012) 754–760.
[90] Y.M. Li, T.-Y. Li, Deriving market intelligence from microblogs, Decis. Support Syst. 55 (2013) 206–217.
[108] M. Abdul-Mageed, M. Diab, S. Kübler, SAMAR: subjectivity and sentiment analysis for Arabic social media, Comput. Speech Lang. 28 (2014) 20–37.
[111] H. Kang, S.J. Yoo, D. Han, Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews, Expert Syst. Appl. 39 (2012) 6000–6010.
[114] Z. Zhang, Q. Ye, Z. Zhang, Y. Li, Sentiment classification of Internet restaurant reviews written in Cantonese, Expert Syst. Appl. 38 (2011) 7674–7682.
[121] S. Tan, X. Cheng, Y. Wang, H. Xu, Adapting naive bayes to domain adaptation for sentiment analysis, in: M. Boughanem et al. (Eds.), ECIR 2009, LNCS 5478, 2009, pp. 337–349.
[157] Abd. S.H. Basari, B. Hussin, I.G.P. Ananta, J. Zeniarja, Opinion mining of movie review using hybrid method of support vector machine and particle swarm optimization, Proc. Eng. 53 (2013) 453–462.
[167] B. Pang, L. Lee, A sentiment education: sentiment analysis using subjectivity summarization based on minimum cuts, in: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, July 2004, p. 271.
[176] L. Jiang, M. Yu, M. Zhou, X. Liu, T. Zhao, Target-dependent Twitter sentiment classification, in: ACL, June 2011, pp. 151–160.
[177] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, Sentiment analysis of twitter data, in: Proceedings of the Workshop on Languages in Social Media, June, Association for Computational Linguistics, 2011, pp. 30–38.
[184] X. Li et al., News impact on stock price return via sentiment analysis, Knowl.-Based Syst. 69 (2014) 14–23.
[197] G. Vinodhini, R.M. Chandrasekaran, Opinion mining using principal component analysis based ensemble model for e-commerce application, CSI Trans. ICT (2014) 1–11.
[209] A. Ortigosa, J.M. Martín, R.M. Carro, Sentiment analysis in Facebook and its application to e-learning, Comput. Hum. Behav. 31 (2014) 527–541.
[212] M. Ott, Y. Choi, C. Cardie, J.T. Hancock, Finding deceptive opinion spam by any stretch of the imagination, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2011, pp. 309–319.
[228] S. Krishnamoorthy, Linguistic features for review helpfulness prediction, Expert Syst. Appl. 42 (2015) 3751–3759.