Sentiment Analysis Project Proposal
For this sentiment analysis project, I will use the Amazon product review data set (Blitzer, Dredze, & Pereira, 2007), which contains reviews of products from four different categories. I will analyze the data in Python using the Support Vector Machine (SVM) algorithm. According to Blitzer, Dredze, and Pereira, SVM is the most accurate algorithm and Naïve Bayes the least accurate, although the two methods give close results. It may therefore be necessary to run more than one algorithm in order to test that finding.
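The comparison of SVM and Naïve Bayes described above could be set up roughly as follows. This is only a minimal sketch: it assumes the scikit-learn library, uses a handful of invented reviews in place of the Amazon data, and the function name compare_classifiers is hypothetical. With so little data the scores mean nothing; the point is only that both algorithms are fitted and scored on the same split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy reviews standing in for the Amazon data (assumption).
# Labels: 1 = positive review, 0 = negative review.
train_texts = [
    "excellent product, works great",
    "love it, highly recommend",
    "great value and very reliable",
    "terrible quality, broke quickly",
    "waste of money, very disappointed",
    "awful, stopped working after a day",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["works great, highly recommend", "terrible, waste of money"]
test_labels = [1, 0]


def compare_classifiers(train_texts, train_labels, test_texts, test_labels):
    """Fit each candidate on the same split and return its test accuracy."""
    candidates = {
        "svm": LinearSVC(),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, model in candidates.items():
        # TF-IDF turns each review into a non-negative feature vector,
        # which both a linear SVM and multinomial Naive Bayes accept.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(train_texts, train_labels)
        predictions = pipeline.predict(test_texts)
        scores[name] = float(accuracy_score(test_labels, predictions))
    return scores


scores = compare_classifiers(train_texts, train_labels, test_texts, test_labels)
for name, score in scores.items():
    print(f"{name}: accuracy = {score:.2f}")
```

Running both models through the same pipeline and the same split keeps the comparison fair, so any difference in accuracy comes from the algorithm and not from the preprocessing.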
After the data sets have been examined and cleaned, the reviews will be split into training and test sets. An SVM classifier will be trained on the training set and evaluated on the test set. Both the text of each product review and its rating (one to five stars, on a scale from negative to positive) will be used. The classifier will be trained to distinguish positive reviews from negative reviews, and its predictions will then be checked against the number of stars given by the reviewer.
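The clean-up and labeling step could look something like the sketch below. It rests on assumptions that the proposal does not fix: reviews with four or five stars are treated as positive, reviews with one or two stars as negative, and three-star reviews are dropped as neutral. The record layout (a dictionary with "text" and "stars" keys) and the function names are hypothetical, and the sample reviews are invented.

```python
import re


def clean_text(text):
    """Remove tag-like markup, collapse whitespace, and lowercase the text."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def stars_to_label(stars):
    """Map a 1-5 star rating to a label (assumed thresholds; 3 stars is skipped)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # neutral rating, no clear polarity


def prepare(reviews):
    """Return (texts, labels) for the reviews that have a clear polarity."""
    texts, labels = [], []
    for review in reviews:
        label = stars_to_label(review["stars"])
        if label is None:
            continue
        texts.append(clean_text(review["text"]))
        labels.append(label)
    return texts, labels


# Invented sample records (assumption), not taken from the real data set.
reviews = [
    {"text": "Great  blender,<br>works perfectly!", "stars": 5},
    {"text": "Stopped working after a week.", "stars": 1},
    {"text": "It is okay, nothing special.", "stars": 3},
]
texts, labels = prepare(reviews)
print(texts)
print(labels)
```

Because the labels come from the star ratings, the ratings act as the ground truth against which the classifier's predictions are later checked.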
One research question is, “Can the sentiment algorithm correctly classify positive and negative reviews based on their text?” This question will be answered by training SVM (or another algorithm, if necessary) on the training set and measuring its performance on the test set. If SVM does not classify the reviews accurately, another method may be used. The question can only be answered by testing the algorithm.
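One way to make "correctly classify" measurable is to compare the predicted labels with the labels derived from the star ratings and report accuracy, precision, and recall. The sketch below is illustrative only: the function name evaluate and the example labels are invented, and the choice of these three metrics is an assumption, not something the proposal specifies.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels, positive="positive"):
    """Count confusion-matrix cells and derive accuracy, precision, and recall."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    counts = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive and guess == positive:
            counts["tp"] += 1  # positive review predicted as positive
        elif truth != positive and guess == positive:
            counts["fp"] += 1  # negative review predicted as positive
        elif truth == positive:
            counts["fn"] += 1  # positive review predicted as negative
        else:
            counts["tn"] += 1  # negative review predicted as negative

    total = len(true_labels)
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


# Invented example: labels from star ratings versus classifier output.
true_labels = ["positive", "positive", "negative", "negative", "positive"]
predicted_labels = ["positive", "negative", "negative", "positive", "positive"]
report = evaluate(true_labels, predicted_labels)
print(report)
```

Reporting precision and recall alongside accuracy helps if the data turn out to contain many more positive than negative reviews, since accuracy alone can look high in that situation even for a weak classifier.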
The final step in this project will be to reflect upon the choices made in preparing the data, choosing an algorithm, and testing it. A report will discuss those choices step by step, the results, and any possible issues with the research methodology. Importantly, the report will give an in-depth analysis of the algorithm(s) used and of how future research could improve on the project's results.