Sentiment Analysis of Movie Reviews using Logistic Regression
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion is positive, negative, or neutral.
This task of performing sentiment analysis on movie reviews was done in five steps:
1. Collection of data
2. Preprocessing and Feature extraction of the data
3. Implementing Logistic Regression and training and testing the model
4. Visualizing the loss with respect to the number of epochs
5. Comparing the results with the Scikit Learn library
By the end of this article, we will have built our very own “Movie Review Sentiment Analyzer” in Python. We will implement logistic regression from scratch and evaluate our model’s performance and accuracy by comparing our results against the Scikit Learn library.
Dataset
We are given the Large Movie Review Dataset, which contains separate labelled train and test sets stored as .txt files. The core dataset contains 50,000 reviews labelled as either positive or negative based on the reviewer’s feedback, split evenly into 25,000 training and 25,000 test reviews. Further, the overall distribution of labels is balanced (25,000 positive and 25,000 negative). There are two top-level directories [train/, test/], and within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt], where id is a unique id for that specific review and rating is a star rating ranging from 1–10. For example, the file [test/pos/200_8.txt] is the text of a positive-labelled test set example with unique id 200 and a star rating of 8/10 from IMDb.
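As a quick illustration of this naming convention, here is a minimal sketch (the sample path is just an example, and the ratings are not used as features anywhere in this project) of how the id and star rating can be read off a review’s filename:
import os

# example path following the [id]_[rating].txt convention
samplePath = "/content/new_folder/Dataset/test/pos/200_8.txt"
fileName = os.path.basename(samplePath)       # "200_8.txt"
reviewId, rating = fileName[:-4].split("_")   # drop ".txt", then split on "_"
print(reviewId, rating)                       # prints: 200 8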
Our task is to develop and train a logistic regression model on the training set and use it to predict sentiment classes of the reviews present in the test set. Here is a sneak peek at the training reviews after they’ve been appended to separate lists:
Data Preprocessing
As we’ve observed above, we need to clean up this data before we can start extracting the features. To do so, we need to remove stop words (we were given a list of stop words), punctuation marks, and other unwanted characters from the reviews and convert them to lower case. A snippet of the code to do this for the negative reviews in the training set is given below:
import glob
import re

# loading the list of stop words
stopWords = open("/content/new_folder/Dataset/stop_words.txt")
stopWordsList = stopWords.readlines()

# stripping the trailing newline from every stop word
stopListNoNewLine = []
for x in stopWordsList[:-1]:
    stopListNoNewLine.append(x[:-1])
stopListNoNewLine.append(stopWordsList[-1])

# reading each negative training review into a list
negTrainList = []
for negTrainFile in glob.glob("/content/new_folder/Dataset/train/neg/*"):
    negRev = open(negTrainFile)
    eachRevNeg = negRev.readlines()
    negTrainList.append(eachRevNeg[0])

negativeReviews = []
negTrainRevFull = []

# regex used to remove punctuation
for eachNegRev in negTrainList:
    negativeReviews.append(re.sub(r'[^\w\s]+', ' ', eachNegRev))

# converting to lowercase and removing the stop words
for eachNegRev in negativeReviews:
    lowerRev = eachNegRev.casefold()
    negRevWords = lowerRev.split()
    negRevWords = [word for word in negRevWords if word not in stopListNoNewLine]
    negString = ' '.join(negRevWords)
    negTrainRevFull.append(negString)
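The exact same cleaning has to be repeated three more times. One way to avoid duplicating the block above is to wrap it in a small helper function; the sketch below is illustrative (the function name and arguments are not part of the original code) but follows the same steps:
def preprocessReviews(folderPattern, stopWordsList):
    # read every review in the folder, strip punctuation,
    # convert to lowercase, remove stop words, and return the cleaned reviews
    cleanedReviews = []
    for reviewFile in glob.glob(folderPattern):
        rawReview = open(reviewFile).readlines()[0]
        noPunct = re.sub(r'[^\w\s]+', ' ', rawReview)
        words = noPunct.casefold().split()
        words = [w for w in words if w not in stopWordsList]
        cleanedReviews.append(' '.join(words))
    return cleanedReviews

# for example:
# negTestRevFull = preprocessReviews("/content/new_folder/Dataset/test/neg/*", stopListNoNewLine)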
A very similar procedure was followed to apply preprocessing on the negative reviews in the test set, as well as the positive reviews in the training and test sets. After we are done with the necessary cleaning, this is what the data looks like:
Feature Extraction
In the feature extraction step we need to represent each review by three features 𝑥0, 𝑥1, 𝑥2 and one class label 𝑦, as shown in the table below (𝑥0 is the bias term and is always 1, 𝑥1 is the number of negative lexicon words appearing in the review, 𝑥2 is the number of positive lexicon words, and 𝑦 is 1 for a positive review and 0 for a negative one):
To do so, we must first label the positive reviews as 1 and the negative reviews as 0, and then merge the labels, as well as the negative and positive reviews, into combined training and testing lists. This is how that was done:
labelTrainNeg = []
labelTrainPos = []
labelTestNeg = []
labelTestPos = []

for negTrainFile in glob.glob("/content/new_folder/Dataset/train/neg/*"):
    labelTrainNeg.append(0)
for posTrainFile in glob.glob("/content/new_folder/Dataset/train/pos/*"):
    labelTrainPos.append(1)
for negTestFile in glob.glob("/content/new_folder/Dataset/test/neg/*"):
    labelTestNeg.append(0)
for posTestFile in glob.glob("/content/new_folder/Dataset/test/pos/*"):
    labelTestPos.append(1)

labelsTrain = labelTrainNeg + labelTrainPos
labelsTest = labelTestNeg + labelTestPos

trainingReviews = negTrainRevFull + posTrainRevFull
testingReviews = negTestRevFull + posTestRevFull
Before we can continue, we must first count the number of positive and negative words (we were given a list of positive and negative words) in the reviews. This is done below:
# adding the negative and positive words to respective lists
positiveWords = open("/content/new_folder/Dataset/positive_words.txt")
positiveWordList = positiveWords.readlines()

positiveListNoNewLine = []
for x in positiveWordList[:-1]:
    positiveListNoNewLine.append(x[:-1])
positiveListNoNewLine.append(positiveWordList[-1])

negativeWords = open("/content/new_folder/Dataset/negative_words.txt", encoding="ISO-8859-1")
negativeWordList = negativeWords.readlines()

negativeListNoNewLine = []
for x in negativeWordList[:-1]:
    negativeListNoNewLine.append(x[:-1])
negativeListNoNewLine.append(negativeWordList[-1])

# counting the negative and positive words in the reviews
train_p = []
train_n = []
for rev in trainingReviews:
    pos = 0
    neg = 0
    review = str(rev)
    words = review.split()
    for word in words:
        if word in positiveListNoNewLine:
            pos = pos + 1
        if word in negativeListNoNewLine:
            neg = neg + 1
    train_p.append(pos)
    train_n.append(neg)

test_p = []
test_n = []
for rev in testingReviews:
    pos = 0
    neg = 0
    review = str(rev)
    words = review.split()
    for word in words:
        if word in positiveListNoNewLine:
            pos = pos + 1
        if word in negativeListNoNewLine:
            neg = neg + 1
    test_p.append(pos)
    test_n.append(neg)
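A small aside: membership tests against Python lists take linear time, so the counting loops above are slow across 50,000 reviews. If speed becomes a problem, converting the two lexicons to sets (a sketch, not part of the original code) gives constant-time lookups without changing the counts:
positiveSet = set(positiveListNoNewLine)
negativeSet = set(negativeListNoNewLine)
# the counting loops can then test "if word in positiveSet" instead of searching the list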
Now all that is left for us to do is to assemble these features into four new training and testing arrays. They will be named trainX, trainY, testX and testY, where the X arrays hold the word counts and the Y arrays hold the labels. A column of biases (a column of ones) also needs to be created and prepended to the feature arrays. This was done as shown:
import numpy as np

biases = np.ones((len(train_n), 1))  # bias column, x0 = 1 for every review

trainX = np.array((train_n, train_p))
trainX = np.transpose(trainX)
trainX = np.append(biases, trainX, axis = 1)
trainY = np.array(labelsTrain)

testX = np.array((test_n, test_p))
testX = np.transpose(testX)
testX = np.append(biases, testX, axis = 1)  # reusing biases works because train and test are the same size
testY = np.array(labelsTest)
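As a quick sanity check (not part of the original code), the shapes should come out to 25,000 rows and three feature columns:
print(trainX.shape, trainY.shape)   # expected: (25000, 3) (25000,)
print(testX.shape, testY.shape)     # expected: (25000, 3) (25000,)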
Logistic Regression - without Scikit Learn
Logistic regression can be used to solve classification problems when thresholds are applied to the probabilities predicted for each class. It uses either the Sigmoid function or the Softmax function to obtain the class probabilities; in our case we will use the Sigmoid function, since we do not have a multiclass classification problem on our hands. We then define a cross-entropy loss function and apply the gradient descent algorithm to reduce this loss over successive epochs by updating the thetas (weights), which is how the model is trained. We can implement logistic regression using the following steps:
1. Defining the Sigmoid function
2. Defining the prediction function using the Sigmoid function
3. Creating a function to calculate the cross-entropy loss
4. Implementing the gradient descent algorithm
We start by defining the Sigmoid and the prediction functions. The Sigmoid function maps any real value into another value between 0 and 1 and the prediction (or hypothesis) function uses the Sigmoid function’s definition to generate probabilities. The Sigmoid function and hence the prediction functions are given by the following expressions:
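σ(z) = 1 / (1 + e^(-z))
h_θ(x) = σ(θᵀx) = 1 / (1 + e^(-θᵀx))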
These can be very easily coded:
def sigmoid(x):
    return 1/(1+np.exp(-x))

def pred(X, thetas):
    z = np.dot(X, thetas)
    h_x = sigmoid(z)
    return h_x
Next, we need to define a function to calculate the cross-entropy loss. This function is important as our goal is to ultimately reduce this loss in training. The cross-entropy loss function is defined as follows:
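J(θ) = -(1/m) Σᵢ [ yᵢ log(h_θ(xᵢ)) + (1 - yᵢ) log(1 - h_θ(xᵢ)) ]

where m is the number of training examples. In code: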
def crossEntropy(X, Y, thetas):
    m = 25000  # used 25000 because that is the training dataset size
    h_x = pred(X, thetas)
    Y = Y.reshape(-1, 1)  # reshaped to solve a broadcasting issue
    J = -1/m * np.sum(Y*np.log(h_x)+(1-Y)*np.log(1-h_x))
    return J
Finally, we need to implement the gradient descent algorithm. This is an optimization algorithm that aims to minimize the loss function by updating the weights toward their optimal values. The algorithm uses a learning rate (we are using 0.01), which determines the size of each step, and runs for a given number of epochs (we are using 100). As the number of iterations increases, the algorithm reduces the loss (which eventually flattens out) by updating the weights. The algorithm is given as follows:
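θⱼ := θⱼ - (α/m) · (h_θ(xᵢ) - yᵢ) · xᵢⱼ

Here α is the learning rate, m is the number of training examples, and xᵢⱼ is the j-th feature of the i-th example; the update is applied for every parameter j on every training example, which is exactly what the nested loops in the code below do.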
def gradDesc(X, Y, alpha, n_epoch, thetas):
    m = 25000  # used 25000 because that is the training dataset size
    J = list()
    for epoch in range(n_epoch):
        for i in range(X.shape[0]):
            h_x = pred(X[i], thetas)
            for j in range(X.shape[1]):
                thetas[j] = thetas[j] - alpha*1/m * np.dot(X[i][j], np.subtract(h_x, Y[i]))
        J.append(crossEntropy(X, Y, thetas))
    return thetas, J
We have successfully implemented logistic regression. Now all we need to do is train and test the model. To start off, we must initialize a weight vector. It does not matter what the initial weights are, as they will be updated during optimization, so we can initialize them with zeroes. Training is then done by calling the gradient descent function on trainX and trainY, passing 0.01 as the learning rate, 100 as the number of epochs, and our initialized weight vector. This is shown below:
w = np.zeros((3,1))
thetas, J = gradDesc(trainX, trainY, 0.01, 100, w)
To confirm that our training was successful, we can plot a graph to check that the loss does indeed decrease as the number of epochs increases:
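A short matplotlib sketch along these lines produces the plot (assuming matplotlib is available in the notebook):
import matplotlib.pyplot as plt

plt.plot(range(1, len(J) + 1), J)   # J holds one cross-entropy value per epoch
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.title("Training loss vs. number of epochs")
plt.show()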
We can now move on to the testing and evaluation phase. To do so, we predict the labels of the test data: if we get a prediction ≥ 0.5, we label it as 1 (positive), and 0 (negative) otherwise. These predicted labels can then be compared with the actual labels to calculate the accuracy of our model and to build the confusion matrix, both of which we can use to assess how well the model performs. We can define an evaluation function to do this, as seen:
def eval(X, Y, thetas):
    predictions = pred(X, thetas)
    predictedLabels = []
    for i in predictions:
        if i >= 0.5:
            predictedLabels.append(1)
        else:
            predictedLabels.append(0)

    correct = 0
    for i in range(25000):
        if predictedLabels[i] == Y[i]:
            correct = correct + 1
    print("Accuracy is: ", (correct/25000)*100, "%")
    print("\n")

    truePos = 0
    trueNeg = 0
    falsePos = 0
    falseNeg = 0
    for i in range(25000):
        if predictedLabels[i] == 1 and Y[i] == 1:
            truePos += 1
        if predictedLabels[i] == 0 and Y[i] == 0:
            trueNeg += 1
        if predictedLabels[i] == 1 and Y[i] == 0:
            falsePos += 1
        if predictedLabels[i] == 0 and Y[i] == 1:
            falseNeg += 1

    conf_M = np.array([[truePos, falsePos],
                       [falseNeg, trueNeg]])
    print("Confusion Matrix:\n", conf_M)
We achieve an accuracy of 72.548% and get the following confusion matrix:
[[10158 4521]
[ 2342 7979]]
Logistic Regression using Scikit Learn
Now that we’re done with the training and testing of our own implementation of logistic regression, we can use Scikit Learn to confirm if what we did was indeed correct by comparing the performance measures. This is done as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

logReg = LogisticRegression()
logReg.fit(trainX, trainY)

h_x = logReg.predict(testX)

accuracy = accuracy_score(h_x, testY)
print("Accuracy: ", accuracy*100, "%")

confMat = confusion_matrix(h_x, testY)
print("Confusion Matrix: \n", confMat)
Scikit Learn gives us an accuracy of 73.056% and gives us the following confusion matrix:
[[9226 3462]
[3274 9038]]
A glance at the accuracy and the confusion matrix computed by Scikit Learn suggests that there is not a whole lot of difference between the two classifiers. In fact, both accuracies lie within the same range, i.e. between 72% and 73%, and we can thus conclude that our implementation was indeed correct.
You can view and download the full project from my GitHub, and the files needed for the project (the dataset, the stop words, the list of positive and negative words, and the instructions) can be downloaded here.
References
- Stanford. Large Movie Review Dataset. https://ai.stanford.edu/~amaas/data/sentiment/