Detection of events and the analysis of their sentiment in the US elections using Twitter
This research is the result of the first case study for the course Fundamentals of Data Science at the University of Amsterdam. The goal of this assignment is to analyze social media posts (i.e. tweets) related to the US elections. In this research, tweets are used to investigate the relationship between tweets and events related to the elections.
With 500 million tweets per day and 310 million monthly active users, Twitter has become, and remains, a fascinating social medium (Twitter Statistics). Several unique features distinguish Twitter from other media. Firstly, tweets are created in real time. Because the messages are short (140 characters) and can easily be sent via mobile applications, users tweet frequently, and these tweets contain rich information, such as user data, timestamps and geolocation. Events happening in everyday life can be detected quickly using data from Twitter. Also, because Twitter has millions of users, including organizations and public figures, many different aspects of life are covered. All these features turn Twitter into an excellent resource for the detection and analysis of events.
This research focuses on detecting events related to the US elections. Event detection techniques can be classified as specified or unspecified. Specified event detection relies on specific information about the event, such as location, venue, time and description. These data are mostly provided by the user and can be extracted from the tweet text. Unspecified event detection relies on the stream of tweets to detect the occurrence of an event, i.e. on monitoring bursts or trends in tweets. During this research, both methods have been used. The following main question is formulated:
How can tweets about the US elections be used to detect events, how can we clearly visualize the event sentiment, and how can we find topics in events?
The following sub-questions are raised:
- How can tweet entities (hashtags, mentions, URLs) be used to detect an event?
- How can the sentiment of the potential voters about these events be determined?
- What can we say about the topics being discussed in these events? Do events that correlate in time also discuss similar topics?
The dataset used for the analysis consists of 657,307 tweets obtained from the Twitter Streaming API in the period 12 August – 12 September 2016. These tweets were obtained based on their hashtags, coordinates, geo and place attributes.
Before the data could be used for further analyses it had to be pre-processed. The following steps were involved:
- Removal of duplicate tweets;
- Removal of URLs such as 'http://example.com';
- Removal of mentions such as '@username';
- Removal of hashtags such as '#trump';
- Removal of HTML tags such as '<b>';
- Removal of line endings such as '\n'.
Besides these steps, a smaller randomized collection of 10,000 tweets was also made for faster development and testing. All results are based on the full collection.
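The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration, not the original pipeline; duplicate removal happens at the collection level and is omitted here.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the per-tweet cleaning steps: strip URLs, mentions,
    hashtags, HTML tags and line endings, then collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # URLs such as http://example.com
    text = re.sub(r"@\w+", "", text)           # mentions
    text = re.sub(r"#\w+", "", text)           # hashtags
    text = re.sub(r"<[^>]+>", "", text)        # HTML tags such as <b>
    text = text.replace("\n", " ")             # line endings
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_tweet("Vote now! <b>#Trump</b> @user http://example.com\n"))
# → Vote now!
```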
The first step involved aggregation per tweet entity. Events are detected by aggregating all tweets that contain a certain tweet entity, for example a (lower-cased) hashtag in the tweet text. Only hashtags with a count over a certain threshold are considered.
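A minimal sketch of this aggregation, assuming a tiny hand-made tweet list and a lowered threshold (the real analysis ran on the full collection with a threshold of 100):

```python
from collections import Counter

# Hypothetical mini-collection for illustration only.
tweets = [
    "Rally tonight #Trump #MAGA",
    "Watching the debate #trump",
    "Policy talk #Hillary",
]

THRESHOLD = 2  # lowered for this toy example

# Count lower-cased hashtags across all tweet texts.
counts = Counter(
    tag.lower().lstrip("#")
    for text in tweets
    for tag in text.split()
    if tag.startswith("#")
)

# Keep only hashtags whose count reaches the threshold.
events = {tag: n for tag, n in counts.items() if n >= THRESHOLD}
print(events)  # → {'trump': 2}
```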
The second step was classifying the events by analysis of the timestamp distributions. From the sample box plot of timestamp distributions we can visually distinguish shorter-lived tweet topics like #goparemurders and #clintonsmemory from continuous topics such as #trump and #hillary:
To automate the classification of events, we define a 10%-90% quantile interval of the timestamps and consider a tweet collection an event when this interval spans between 2 and 7 days.
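This classification rule can be sketched as follows, using a simple nearest-index quantile (the original analysis may have used a different interpolation):

```python
from datetime import datetime, timedelta

def is_event(timestamps, min_days=2, max_days=7):
    """Classify a hashtag's tweet collection as an event when the
    10%-90% quantile interval of its timestamps spans 2-7 days."""
    ts = sorted(timestamps)
    lo = ts[int(0.1 * (len(ts) - 1))]  # 10% quantile, nearest index
    hi = ts[int(0.9 * (len(ts) - 1))]  # 90% quantile, nearest index
    span = hi - lo
    return timedelta(days=min_days) <= span <= timedelta(days=max_days)

start = datetime(2016, 8, 12)
# A short burst of tweets over ~3 days: classified as an event.
burst = [start + timedelta(hours=6 * i) for i in range(12)]
print(is_event(burst))  # → True
# A continuous topic spread over a month: not an event.
month = [start + timedelta(days=i) for i in range(31)]
print(is_event(month))  # → False
```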
SentiStrength estimates the strength of positive and negative sentiment in short texts. SentiStrength reports two sentiment strengths:
- 1 (not positive) to 5 (extremely positive)
- -1 (not negative) to -5 (extremely negative)
The senti-strengths [P, N] for all tweets in an event are aggregated by summing the counts of each [P, N] category. This results in a 5×5 matrix of counts for each category. This senti-strength matrix is visualized using a heat map. Example: #Trump results in:
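The aggregation into the 5×5 matrix can be sketched as follows, with made-up [P, N] scores:

```python
# Per-tweet SentiStrength scores [P, N]; P in 1..5, N in -1..-5.
# These values are fabricated for illustration.
scores = [(1, -1), (3, -1), (3, -1), (1, -4), (5, -2)]

# Rows index the positive strength (P-1), columns the negative (|N|-1).
matrix = [[0] * 5 for _ in range(5)]
for p, n in scores:
    matrix[p - 1][abs(n) - 1] += 1

for row in matrix:
    print(row)
```

The resulting count matrix can then be rendered as a heat map, for example with matplotlib's `imshow`.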
Topic modeling was done using the Latent Dirichlet Allocation (LDA) model implemented in the gensim library. This model searches for a number of topics in each hashtag event tweet collection that most probably generated the tweet texts.
The analysis was performed on the cleaned text, which was tokenized and stemmed. A predefined list of stop words, extended with a user-defined list, was excluded. After some iteration we decided on 10 topics for all events and kept the default of 20 passes to fit the LDA model on the text.
In total, 40,933 different hashtags were found using a case-insensitive query. The hashtag aggregation resulted in the frequency table below. For further processing, only hashtags with a count over 100 were considered. The hashtag frequency table was visualized in a USA-shaped word cloud with word font size proportional to the count.
The classification of the tweets resulted in a set of 32 events. The visualisation of their timestamp distributions below shows interesting features that may be used for further research:
- the occurrence in time – events at the border of the dataset timeframe may have to be omitted;
- time correlation of hash tags that share words, e.g. #goparemurderes and #gopendtimes;
- the shape of the distribution; we expect a positively skewed distribution for events;
- during the LDA topic modelling we found that five out of 32 events consisted of spam bot tweets only: 'gopendtimes', 'goparemurderers', 'alwayswithhillary', 'm_xico' and 'happening', indicating that we must filter tweets more thoroughly and be aware that other agents play a role in the communication. An interesting note: some spam events exhibit exactly the same time frame.
An assessment of the senti-strength results was performed for the more extreme cases, and senti-strength seems capable of correctly predicting the positive and negative sentiments. Each event now has its own heat map, and some examples with different results are shown below.
Correlation of events in time did not reveal many similar topics at first sight, except for the mentioning of candidate names and racial issues, for example #paranoidhillary and #dementeddonald:
(0, '0.023*show + 0.023*don + 0.012*hillari + 0.012*bitch + 0.012*press + 0.012*campaign + 0.012*see + 0.012*peopl + 0.012*answer + 0.012*question')
(1, '0.027*make + 0.021*stupid + 0.014*vote + 0.014*speech + 0.014*ppl + 0.014*voter + 0.014*dem + 0.014*normal + 0.014*least + 0.014*see')
(2, '0.024*call + 0.016*trend + 0.016*go + 0.016*twitter + 0.016*black + 0.016*racist + 0.016*trump + 0.016*hell + 0.016*censor + 0.009*fire')
(3, '0.032*hillari + 0.021*speech + 0.016*got + 0.016*peopl + 0.016*version + 0.011*watch + 0.011*queen + 0.011*claim + 0.011*full + 0.011*ju')
(4, '0.017*right + 0.017*man + 0.017*cray + 0.009*sniper + 0.009*got + 0.009*free + 0.009*jail + 0.009*alt + 0.009*found + 0.009*new')
(0, '0.028*know + 0.023*don + 0.014*make + 0.014*mexico + 0.014*tax + 0.014*support + 0.014*hillari + 0.010*still + 0.010*need + 0.010*noth')
(1, '0.022*fuck + 0.018*bullshit + 0.014*record + 0.014*tax + 0.014*releas + 0.009*rant + 0.009*hide + 0.009*health + 0.009*hope + 0.009*coward')
(2, '0.049*vote + 0.040*american + 0.035*african + 0.025*1 + 0.020*trump + 0.015*peopl + 0.015*poll + 0.015*right + 0.015*95 + 0.011*coward')
(3, '0.031*just + 0.018*polici + 0.018*tri + 0.014*law + 0.014*vote + 0.009*lead + 0.009*think + 0.009*opinion + 0.009*person + 0.009*win')
(4, '0.014*realli + 0.014*watch + 0.014*like + 0.014*mexico + 0.014*feel + 0.014*bs + 0.014*cont + 0.014*lot + 0.014*much + 0.014*morn')
The example shows there is also more work to be done: addition of stop words and removal of uncivil language. One event, #epn, was mostly Spanish, and the LDA failed to model its topics as expected.
It can be concluded that tweets can indeed be used for event detection. Visualizing the timestamps of the tweets using box plots shows which topics are popular on Twitter within a certain timeframe. These distributions show many interesting features, such as the quick hype versus the longer discussion, the time correlation of events that share words in the hashtag, and the identification of spam bot events by their similar distributions, as these tweets just contain a bag of identical hashtags.
Furthermore, creating small heat maps for the hashtags shows quickly and easily what the sentiment about a topic is.
The LDA topic model provides further insight into the topics discussed in each event tweet collection, but needs further tweaking of stop word removal and model parameter settings.