Visualizing Text from Last Presidential Debate
October 25th, 2020
I like working with text data. If used correctly, it can improve the accuracy of predictive models, and it can be really fun to visualize the contents of articles, tweets, reviews, and so on. Since the last presidential debate took place a few days ago, I thought I'd take a crack at analyzing its transcript.
My posts typically use or reference Python, but today I decided to do the analysis in R. Here are the things that I did.
- ✓ Imported transcript data from an HTML page into R
- ✓ Did some basic data cleaning and separated the data by candidate
- ✓ Created word frequency and word cloud charts
- ✓ Looked into word associations
You can find my full code here.
I used www.rev.com transcripts as my data source. They are nicely formatted and not very difficult to work with; you just need to make sure you separate the candidates' speeches if you want to look at them individually.
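To make the "separate by candidate" step concrete, here is a minimal sketch in Python (standard library only; it is not the R code used for this post). It assumes the transcript has already been reduced to plain text in which each turn starts with a header line such as `Joe Biden: (00:10)`. That layout, the function name `split_by_speaker`, and the sample lines are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed header layout: "Speaker Name: (mm:ss)" or "Speaker Name: (hh:mm:ss)".
SPEAKER_HEADER = re.compile(
    r"^(?P<name>[^:]+):\s*\((?:\d{1,2}:)?\d{1,2}:\d{2}\)\s*$"
)

def split_by_speaker(transcript):
    """Group transcript lines by the most recent speaker header."""
    speeches = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_HEADER.match(line)
        if match:
            current = match.group("name").strip()
        elif current is not None:
            speeches[current].append(line)
    # Join each speaker's turns into one long string.
    return {name: " ".join(parts) for name, parts in speeches.items()}

# Made-up sample lines, not actual quotes from the debate.
sample = """Joe Biden: (00:10)
We have to make sure people are safe.
Donald Trump: (00:25)
We are going to do great.
Joe Biden: (00:40)
Make sure of it."""

speeches = split_by_speaker(sample)
print(speeches["Joe Biden"])
```

Keeping one string per speaker makes it easy to run the same cleaning and counting steps on each candidate separately.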
I also did some typical data cleaning, which consists of removing numbers, punctuation, and extra whitespace, converting every word to lower case, and removing common stopwords that don't carry much meaning. This can all be easily achieved with the
A good starting point for text analysis is to look at word frequencies. Sometimes they can shed light on what the text is about, but I can't say that the charts below are very informative. It is interesting, though not very surprising, that the top 3 words for both candidates are "going", "people" and "said". I guess the only surprising thing is that I expected coronavirus or COVID-19 to make it into the top 15.
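For readers who want to see the cleaning and counting steps spelled out, here is a minimal Python sketch (standard library only, not the R code behind the charts). The tiny stopword list, the function names, and the sample sentence are assumptions for illustration; a real analysis would use a full stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "are",
             "we", "it", "in", "that", "have"}

def clean_tokens(text):
    """Lowercase, drop numbers and punctuation, split on whitespace, remove stopwords."""
    text = text.lower().replace("'", "")      # "don't" becomes "dont"
    text = re.sub(r"[^a-z\s]", " ", text)     # numbers and punctuation become spaces
    return [word for word in text.split() if word not in STOPWORDS]

def top_words(text, n=15):
    """Return the n most frequent cleaned words with their counts."""
    return Counter(clean_tokens(text)).most_common(n)

# Made-up sample sentence, not a quote from the debate.
sample = "People said 2 things: the people are going, and people said it."
top3 = top_words(sample, n=3)
print(top3)  # [('people', 3), ('said', 2), ('things', 1)]
```

The same counts can feed either a bar chart or a word cloud, since both are just different views of a word-to-frequency table.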
Another way to look at word frequency is by using word clouds. I think they are a more fun way to visualize word frequencies than a bar chart.
Joe Biden's word cloud shows that he wasn't using the same words very frequently, and the cloud itself isn't very big. I only included words that appear at least 5 times in his speech. The color and size represent different frequency groups, and we can see 5 clear groups and most of the words are in red.
Donald Trump's word cloud is probably more interesting to interpret. While both Trump and Biden mention "China" a lot (20 and 22 times respectively), Trump also used "Russia" 20 times in his speech. It's also interesting that we don't see Biden addressing Trump by his first or last name, whereas Trump said "Joe" 31 times. Again, it's surprising not to see "coronavirus" or "COVID-19". Overall, both word clouds seem to be consistent with what I remember from the debate.
While you can gain some insights from word frequency and word clouds, I really like the udpipe library, which allows you to look at key words and word
We can look at the frequencies of various parts of speech, although I'm not sure how informative that is. What's probably more interesting is to view the most commonly used words within each part of speech. I first looked at verbs, but the result looked underwhelming.
I then looked at nouns, which seemed to be more interesting. Both candidates used "people" the most in their speeches. Biden's second most common noun is "fact", whereas Trump's is "money". I think the most frequently used adjectives are the most informative.
Udpipe also allows you to search for keywords. There are multiple methods that you can use, and I chose RAKE. The Rapid Automatic Keyword Extraction algorithm looks at the most relevant words and scores candidate key phrases by taking into account word frequency and co-occurrences.
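To give a rough idea of how RAKE scores phrases, here is a minimal Python sketch of the classic formulation (standard library only). It is not udpipe's implementation: this version picks candidate phrases by splitting on a small stopword list and punctuation, which is an assumption made for illustration, and the stopword list, function name, and sample text are made up. Each word is scored as its co-occurrence degree divided by its frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"we", "need", "is", "a", "and", "the", "for", "of", "to"}

def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted by descending RAKE score."""
    # 1. Candidate phrases: runs of non-stopwords between punctuation marks.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Word scores: degree (co-occurrences within phrases, counting itself)
    #    divided by frequency.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}

    # 3. Phrase score: sum of the scores of its words.
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up sample text, not a quote from the debate.
sample = ("We need affordable health care. Affordable health care is a right, "
          "and clean energy is the future.")
keywords = rake(sample)
print(keywords[:2])  # [('affordable health care', 9.0), ('clean energy', 4.0)]
```

Longer phrases whose words tend to appear together score higher, which is why multi-word expressions often rise to the top of RAKE output.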
I'm a little bit surprised about "big man" in Trump's speech and don't quite recall to whom it referred.
The next two charts are my favorites. While word frequencies can be useful, sometimes you may want to know which words follow one another. For instance, "sure" showed up as an important word in Joe Biden's speech. Was it used as "not sure" or "make sure"? That's where the following graphs can be quite helpful.
As you can see from Biden's word association chart, "sure" is associated with "make sure". It's interesting to see that "make" is also associated with "clear" and "money". "People" is another common word which is associated with "American people", "give people" and "enough". Overall, we can see a lot of associations related to health whether it's wearing a mask or having affordable healthcare.
In Donald Trump's speech, "have" seems to be associated with quite a few words -- "have good relationship", "have school open", etc. There are also some references to money ("make money", "lot money", "million dollar") and the energy sector (carbon emissions and the oil industry), and "big man" appears again.
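The "not sure" versus "make sure" question can also be checked directly by counting adjacent word pairs. Here is a minimal Python sketch (standard library only, not the udpipe code behind the charts). The function names and the sample text are assumptions for illustration, and this simple version does not stop pairs from spanning sentence boundaries.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count every pair of adjacent words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def words_before(text, target):
    """Count which words appear immediately before `target`."""
    counts = Counter()
    for (previous, following), n in bigram_counts(text).items():
        if following == target:
            counts[previous] += n
    return counts

# Made-up sample text, not a quote from the debate.
sample = "We will make sure people are safe. I am not sure. Make sure of it."
before_sure = words_before(sample, "sure")
print(before_sure.most_common())  # [('make', 2), ('not', 1)]
```

A word association chart is essentially this table drawn as a network, with thicker links for pairs that occur more often.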
These are just a few ways you can visualize text data. I watched the debate, so I may be biased when judging how helpful these charts are for understanding what the candidates talked about. Overall, I think the word association and key word charts were representative of the contents of the debate. What did you think?
As always, here are a few links that you can check out.