As mentioned, this problem was tackled by six Duke undergraduate students — Milan Bhat, a sophomore studying Electrical and Computer Engineering, Andrew Cuffe, a senior studying Economics and Computer Science, Catherine Dana, a junior studying Computer Science, Melanie Farfel, a senior studying Economics and Computer Science, Adam Snowden, a junior studying Biology and Computer Science, and myself, a senior studying Mathematics and Computer Science.
Motivation & Problem Definition
Reddit has become widely used for sharing opinions and ideas. Discussion of a specific topic happens in dedicated forums known as subreddits. We looked at three subreddits: r/politics, r/democrats, and r/republican. As you may guess, the first is dedicated to discussion of all politics. r/democrats and r/republican are forums for members of the Democratic and Republican parties, respectively, to discuss policy, issues, and candidates.
Each of these subreddits has millions of comments and thousands of posts. We used a dataset containing these posts for our analysis. We also relied heavily on sentiment analysis, the use of natural language processing to systematically quantify the attitude expressed in written language. For example, the sentence “The most miserable thing is waiting in traffic.” expresses negative sentiment, while the sentence “It’s wonderful to beat rush hour traffic and get home quickly.” expresses positive sentiment.
We can use sentiment analysis on posts about the President to see whether users are speaking favorably of the President and, if so, how favorably. We then compared the aggregate sentiment of posts on a subreddit with the President’s approval rating.
Our primary data source was a large JSON object of Reddit comments from 2011 through 2016 that includes the comment text, score, and subreddit for all publicly available comments. It is a massive dataset: 250 GB compressed. We also used a dataset of Presidential approval ratings from Rasmussen Reports, generated by national surveys.
Using Google BigQuery, we pulled all comments containing the word “Obama” from the three subreddits mentioned above. We cleaned each comment with a regular expression that removed all non-alphanumeric characters. Our basic record had three fields: the date the comment was made, the subreddit, and the comment itself.
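The cleaning step might look something like the sketch below (the function name and the exact regex are my own illustration, not the project’s code):

```python
import re

def clean_comment(text):
    """Strip everything except letters, digits, and whitespace,
    approximating the cleaning step described above."""
    # Replace any run of non-alphanumeric characters with a single space
    cleaned = re.sub(r"[^A-Za-z0-9\s]+", " ", text)
    # Collapse the repeated whitespace left behind by the substitution
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_comment("Obama's speech -- (wow!) was *great*"))
# → "Obama s speech wow was great"
```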
The result of the queries on BigQuery was a CSV file. We used Python to read and analyze it, grouping the comments by the month and year in which they were made.
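A minimal version of that grouping step could look like this; the column names (`created`, `subreddit`, `body`) and date format are hypothetical, since the actual CSV schema isn’t shown in the post:

```python
import csv
from collections import defaultdict

def group_comments_by_month(csv_path):
    """Group comment rows by (year, month, subreddit).

    Assumes hypothetical columns: 'created' as 'YYYY-MM-DD',
    'subreddit', and 'body'.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            year, month, _ = row["created"].split("-")
            groups[(year, month, row["subreddit"])].append(row["body"])
    return groups
```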
For our sentiment analysis, we relied heavily on TextBlob, a Python library for processing and analyzing text. We primarily used its sentiment analysis method, which returns a number in the range [-1, 1], where -1 is totally negative and 1 is totally positive. There are two main approaches to sentiment analysis — semantic approaches and machine learning approaches.
The semantic approach takes the sum (or the average) of the sentiments of individual words within a sentence. The words in an input sentence are first lemmatized, which means they are grouped by a common root; angry, anger, angrily, and angered all refer to the same negative idea. Then the sentiment of each word is summed to produce a value for the sentence. We can also invert sentiment when not is used: the phrase “not happy” can be quantified similarly to “upset”. Additionally, we can weight sentiment more heavily if modifiers like very, extremely, or incredibly are used.
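A toy version of this semantic scorer, with a made-up four-word lexicon and without the lemmatization step, might look like this:

```python
# Hypothetical mini-lexicon and intensifier weights (illustrative values only)
LEXICON = {"happy": 0.8, "wonderful": 1.0, "angry": -0.7, "miserable": -1.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "incredibly": 2.0}

def score(sentence):
    """Sum per-word sentiment, flipping on 'not' and boosting on intensifiers.
    A real scorer would also lemmatize words first."""
    total, weight, negate = 0.0, 1.0, False
    for w in sentence.lower().split():
        if w == "not":
            negate = True              # flip the next sentiment word
        elif w in INTENSIFIERS:
            weight = INTENSIFIERS[w]   # boost the next sentiment word
        elif w in LEXICON:
            val = LEXICON[w] * weight
            total += -val if negate else val
            weight, negate = 1.0, False  # reset modifiers after use
    return max(-1.0, min(1.0, total))    # clamp to [-1, 1]

print(score("not happy"))   # negative, like "upset"
print(score("very happy"))  # boosted positive
```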
The second method is a supervised machine learning approach. In machine learning, a supervised algorithm is given two things: (1) an input and (2) the expected output. The algorithm then infers a function that maps the input to the output. This function can then be used on a novel input to produce an output. Sentiment analysis framed this way is typically a classification problem. For those interested in learning more, see here.
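As a sketch of the supervised approach (using scikit-learn, which is my choice of library here, not the post’s tooling, and a tiny made-up training set):

```python
# Minimal supervised sentiment classifier: bag-of-words + Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# (1) inputs and (2) expected outputs, as described above
train_texts = ["I love this policy", "great speech today",
               "terrible decision", "I hate this bill"]
train_labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)  # infer the input -> output function

# The inferred function applied to a novel input
print(model.predict(["what a great policy"])[0])
```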
In our approach, we averaged sentiment across all categories in a given month and year. Over a 4-year span, this produced 48 data points. We then normalized these data points to the range of the actual approval rating: our smallest data point became the minimum approval rating over the 4 years, our largest became the maximum, and all other values were mapped proportionally in between. This may have distorted our underlying data, but it allowed us to more clearly see trends and trajectories.
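This kind of min-max rescaling can be sketched in a few lines; the sentiment values and the 40–60% approval range below are invented for illustration:

```python
def rescale(values, target_min, target_max):
    """Min-max normalize sentiment values onto the approval-rating range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [target_min + (v - lo) / span * (target_max - target_min)
            for v in values]

# Hypothetical monthly sentiment averages mapped onto a 40-60% approval range
print(rescale([-0.2, 0.0, 0.1, 0.4], 40.0, 60.0))
```

The smallest input lands exactly on the target minimum and the largest on the target maximum, just as described above.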
The first result comes from r/politics. We found a correlation coefficient of 0.495, which suggests the two series are positively correlated. However, we found the trends most interesting. We see hyperbolic increases and decreases in our calculated approval rating at the same time as moderate increases and decreases in the actual approval rating. This result is intuitive given the nature of Reddit. People who post are often more exaggerated in their opinions, which leads to more polarized posts, larger sentiment values, and more hyperbolic trends.
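For reference, a correlation coefficient like the 0.495 above is the Pearson correlation of the two monthly series, which can be computed directly (the series below are illustrative numbers, not the study’s data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated → 1.0
```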
In the above graph, we looked at the approval rating itself. For r/democrats and r/republican, we decided to analyze the trend of the approval rating instead. This is done by calculating the slope (the first derivative) of our calculated values. Again, we see much more exaggerated and magnified change, which makes sense for reasons similar to those above. The graph below shows the results for r/democrats.
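On monthly data, the slope reduces to first differences: the month-over-month change in the series. A minimal sketch, with made-up rating values:

```python
def monthly_trend(series):
    """First differences: the month-over-month slope of a rating series."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical monthly calculated ratings
print(monthly_trend([48.0, 50.5, 49.0, 49.2]))  # month-over-month changes
```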
However, this is not the case for r/republican. We again looked at the trend of the approval rating and saw much more stagnant results; in fact, there is no meaningful change in the trend of approval ratings. Our team examined the original values from the r/republican sentiment analysis and found that the aggregate of posts was almost always negative. We hypothesized that this is a byproduct of partisan politics, in which each party heavily criticizes the other independent of any action or policy. This is likely especially true on subreddits, whose users tend to be more partisan.
Extracting sentiment from online sources has become an interesting problem, especially in the area of politics. For those interested, here is an interesting article that looks at the 2016 election using sentiment analysis from Twitter. This methodology represents a relatively new way of predicting the public’s response to anything from policy to people. For those interested in replicating the results or using our code, see here.