Abstract—Big Data is large in volume and varied in form, so it cannot be processed with traditional tools. New methods and tools are therefore needed to extract value from the data. Apache Spark is a distributed, memory-based computing framework that is well suited to machine learning and large-scale data processing, and several studies describe it as a very fast unified analytics engine. This study examines how quickly Spark processes large data, using the classification algorithms of its Machine Learning library (MLlib). Two classification algorithms are compared, Naive Bayes and Support Vector Machine (SVM), to determine which performs better. The task used for the comparison is sentiment prediction from user reviews of an application product. Sentiment analysis of product reviews is challenging because reviews vary considerably in nature, diversity, and volume. If such data is managed properly, however, it can support decision making. The user reviews are taken from a redeveloped chat-based product, BlackBerry Messenger (BBM). The results show that Apache Spark processes large data very quickly. The classification algorithms are evaluated by precision, recall, F-measure, and ROC curve. On these measures, SVM gives better results than Naive Bayes. In terms of processing time, however, Naive Bayes is faster than SVM on big data.
Keywords—Big Data, Apache Spark, Classification, Sentiment Analysis, Naïve Bayes, SVM.
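
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how precision, recall, and F-measure can be computed for a binary sentiment classifier. It is plain Python rather than Spark MLlib code, and the function name `precision_recall_f1`, the label encoding (1 for positive sentiment, 0 for negative), and the sample labels are assumptions made for illustration; they are not taken from the study's data or implementation.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F-measure (F1) for one positive class.

    y_true and y_pred are equal-length sequences of labels; `positive`
    marks which label counts as the positive class (assumed to be 1).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for eight reviews (1 = positive, 0 = negative).
    y_true = [1, 1, 1, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

In this sketch the F-measure is the harmonic mean of precision and recall, so an algorithm scores well only when both are high. The ROC curve is not covered here because it requires classifier scores rather than hard labels.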

