With all the audience reviews of shows at this year’s Edinburgh Fringe Festival scraped and counted, I decided to dig a little deeper into what they actually represent. Lots of reviews does not necessarily mean lots of good reviews. Over the next several posts, I’ll be trying a number of different approaches to sentiment analysis.
I’m starting with a brute-force approach, using sentiment lexicons applied word by word. The idea is that some words can be categorised as positive and some as negative, so if you count the number of each in a piece of text, you can work out whether it is positive or negative overall. This doesn’t consider context or deal with negation (for example, “not good” would have the word “good” counted as positive; the toy example below shows this in code). Nonetheless, it is a pretty simple way to get a sense of your data: some reviews might be wrongly classified, but a lot will be close enough.
For this work, I used the tidytext package in R, and drew a lot from this chapter of the book by Julia Silge and David Robinson, who created the package. Unlike the example there, though, I’m using data.table, and at this stage I’m looking at sentiment at the review level rather than the show (or chapter/book) level.
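As a quick illustration of that negation problem, here is a toy example (not part of the original analysis; the data and object names are made up):

# Toy example: score "The show was not good" word by word with the Bing lexicon.
# "good" matches as positive and "not" has no entry, so the review
# ends up looking positive despite its actual meaning.
library(data.table)
library(tidytext)
toy_review <- data.table(original_index = 1, review = "The show was not good")
toy_words <- unnest_tokens(toy_review, word, review)
merge(toy_words, get_sentiments("bing"))  # keeps only the words found in the lexicon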
Comparing lexicons
The lexicon you use matters, because different lexicons will give you different results. If your work has a particular context, you might need a lexicon that covers its vocabulary. It’s also worth considering that language is constantly evolving; words can move from neutral to highly emotive, or even flip completely. Lexicons can therefore go out of date if they aren’t maintained, but they may also be inappropriate for historical text.
There are three general-purpose lexicons available through tidytext: Bing, AFINN and NRC.
Sentiment lexicon | Scoring approach | Positive words | Negative words |
---|---|---|---|
Bing | Words categorised as positive or negative | 2005 | 4781 |
AFINN | Words scored between -5 and 5, with -5 being extremely negative and 5 being extremely positive | 878 | 1598 |
NRC | Words categorised as belonging to any of: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. They can belong to multiple categories. | 2312 | 3324 |
You can see straight away that these will produce different results. For example, Bing has fewer positive words but more negative words than NRC, while AFINN has far fewer words assigned as positive or negative than either of the others. However, AFINN scores words on a scale of positivity and negativity, which might give it a more nuanced approach.
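If you want to check these counts for yourself, something along the following lines should reproduce them (a rough sketch; requesting the AFINN and NRC lexicons may prompt you to download them via the textdata package):

# Count the positive and negative entries in each lexicon
library(data.table)
library(tidytext)
bing <- as.data.table(get_sentiments("bing"))
bing[, .N, by = sentiment]
afinn <- as.data.table(get_sentiments("afinn"))
afinn[, .(positive_words = sum(value > 0), negative_words = sum(value < 0))]
nrc <- as.data.table(get_sentiments("nrc"))
nrc[sentiment %in% c("positive", "negative"), .N, by = sentiment]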
I’ll be trying out all three of these in order to compare the outcomes on the Fringe reviews data.
Analysing for sentiment
To start with, I took the text of the reviews I’d already scraped and prepared for a previous post. As I want to evaluate how accurate the analysis is, I had to label a portion of these reviews to have something to assess against. This would need to be the same set of reviews for each approach I use, including those I cover in future posts. I’ll also need labelled data for some other approaches that involve supervised machine learning, though for the current approach it isn’t strictly necessary. I therefore took a random sample of 400 reviews, labelled them manually as positive, negative or mixed, and then randomly split these so I had a 200-review dataset to use for testing.
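The split itself is just a random sample; a minimal sketch of that step might look like this (labelled_reviews, the seed and everything except test_labelled are illustrative names, not my exact code):

# Illustrative split of the 400 manually labelled reviews into two halves of 200
set.seed(2019)                                    # placeholder seed
test_rows <- sample(nrow(labelled_reviews), 200)  # labelled_reviews holds the 400 labelled reviews
test_labelled <- labelled_reviews[test_rows]      # used for testing in this post
heldback_labelled <- labelled_reviews[-test_rows] # kept back for future posts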
Taking this test dataset (called test_labelled in my code), I proceeded with my analysis.
# Load required libraries
library(data.table)
library(tidytext)
# Create a dataset splitting the reviews into one word per line
words_in_reviews <- unnest_tokens(test_labelled, word, review)
### Using bing --------------
# Add bing sentiment to dataset
bing_sentiment <- merge(words_in_reviews, get_sentiments("bing"))
# Count the number of positive/negative words by review
# (identified by an original_index number)
positive_bing_words <- bing_sentiment[sentiment == "positive",
.(positive_words = .N),
by = original_index]
negative_bing_words <- bing_sentiment[sentiment == "negative",
.(negative_words = .N),
by = original_index]
all_bing_words <- merge(positive_bing_words, negative_bing_words, all = TRUE)
all_bing_words[is.na(all_bing_words)] <- 0
# Calculate net positive/negative score
# Less than 0 is negative, more than 0 is positive
all_bing_words[, combined_sentiment := positive_words - negative_words]
all_bing_words[, overall_sentiment := ifelse(combined_sentiment < 0,
"negative",
ifelse(combined_sentiment == 0,
"mixed",
"positive"))]
# Combine sentiment score with reviews dataset
bing_sentiment_outcome <- merge(test_labelled,
all_bing_words[, .(original_index,
overall_sentiment)],
by = "original_index", all.x = TRUE)
# Reviews with no matched sentiment words are assigned "mixed"
bing_sentiment_outcome[is.na(overall_sentiment),
overall_sentiment := "mixed"]
# New column to show if the manually assigned sentiment_label matches the
# overall_sentiment calculated from the lexicon
bing_sentiment_outcome[, match := sentiment_label == overall_sentiment]
The above code can be repeated for the NRC lexicon, because it is similar to Bing in having words assigned as “positive” or “negative” (NRC also tags emotion categories such as anger and joy, so it helps to keep just the positive and negative rows, as sketched below). With the AFINN lexicon, however, the words all have scores instead, so the approach is slightly different.
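A rough sketch of that NRC variant (the filtering line is my addition; everything after it proceeds exactly as in the Bing code above):

### Using NRC ---------------
# Keep only the positive/negative rows of the NRC lexicon and add them to the dataset
nrc_pos_neg <- as.data.table(get_sentiments("nrc"))[sentiment %in% c("positive", "negative")]
nrc_sentiment <- merge(words_in_reviews, nrc_pos_neg)
# ...then count positive/negative words by original_index, derive overall_sentiment,
# and merge back onto test_labelled to create nrc_sentiment_outcome, as for Bing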
### Using AFINN ---------------
# Add AFINN sentiment to dataset
afinn_sentiment <- merge(words_in_reviews, get_sentiments("afinn"))
# Sum the scores of the words in each review
# (this is the part that differs from the Bing approach)
afinn_reviews <- afinn_sentiment[, .(sentiment_score = sum(value)),
by = original_index]
# Calculate net positive/negative score
# Less than 0 is negative, more than 0 is positive
afinn_reviews[, overall_sentiment := ifelse(sentiment_score < 0,
"negative",
ifelse(sentiment_score == 0,
"mixed",
"positive"))]
# Combine sentiment score with reviews dataset
afinn_sentiment_outcome <- merge(test_labelled,
afinn_reviews[, .(original_index,
overall_sentiment)],
by = "original_index", all.x = TRUE)
# Reviews with no matched sentiment words are assigned "mixed"
afinn_sentiment_outcome[is.na(overall_sentiment),
overall_sentiment := "mixed"]
# New column to show if the manually assigned sentiment_label matches the
# overall_sentiment calculated from the lexicon
afinn_sentiment_outcome[, match := sentiment_label == overall_sentiment]
Assessing the outcomes
By quickly counting the number of reviews where the manually assigned sentiment label matched that derived from the sentiment lexicon scores, we can compare how the lexicons did overall.
# Count how many matches each lexicon achieved
bing_sentiment_outcome[, .("Bing outcome" = .N), by = match]
## match Bing outcome
## 1: FALSE 31
## 2: TRUE 169
afinn_sentiment_outcome[, .("AFINN outcome" = .N), by = match]
## match AFINN outcome
## 1: FALSE 27
## 2: TRUE 173
nrc_sentiment_outcome[, .("NRC outcome" = .N), by = match]
## match NRC outcome
## 1: FALSE 44
## 2: TRUE 156
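As a quick aside (not part of my original code): because match is a logical column, taking its mean gives the proportion of correct labels directly, which can be handier than the raw counts.

# Proportion of reviews where the lexicon agreed with the manual label
bing_sentiment_outcome[, mean(match)]
afinn_sentiment_outcome[, mean(match)]
nrc_sentiment_outcome[, mean(match)]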
AFINN does best, with 173 of the 200 reviews matching their label. Bing is very close behind, though NRC misses almost a quarter of the labels. A more detailed breakdown shows the following:
Dataset | True positive | True negative | True mixed | False positive | False negative | False mixed |
---|---|---|---|---|---|---|
Actual labels | 184 | 12 | 4 | NA | NA | NA |
Bing | 164 | 5 | 0 | 9 | 9 | 13 |
AFINN | 170 | 3 | 0 | 10 | 6 | 11 |
NRC | 152 | 3 | 1 | 9 | 13 | 22 |
The first thing to note is that the original dataset is overwhelmingly positive, with 184 of the 200 reviews labelled positive. For all the lexicons, most of the confusion comes around the ‘mixed’ category: they all classify a lot more reviews as mixed, which means they found an equal number of positive and negative words (or a balanced score, in the case of the AFINN lexicon). A review also counts as mixed here if it contains no known positive or negative words at all. Of course, my categorisation of what counts as a mixed review is subjective, so that is another possible source of error.
Interestingly, AFINN, which does best overall, returns the most false positives and the joint fewest true negatives, so its success might be down to a very unbalanced dataset and a tendency to lean positive, rather than a genuinely better ability to distinguish between reviews.
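For anyone reproducing this, a breakdown along those lines just needs the manual labels cross-tabulated against the lexicon-derived ones; a sketch for Bing (the other two lexicons work the same way):

# Cross-tabulate the manual label against the lexicon-derived label
bing_sentiment_outcome[, .N, by = .(sentiment_label, overall_sentiment)]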
What went wrong?
I looked through the reviews that were miscategorised; here are some examples and the reasons they went wrong.
“What a show.These girls are stars.”
All the lexicons struggled with this one. Although clearly positive to a human reader, its concision doesn’t give the lexicons much to work with, and none of them consider “stars” positive, which points to the fact that these are general-purpose lexicons.
“Fantastic performance by such a talented young cast. Disturbing subject matter, well delivered. Watch out for the stand out performance by the worryingly creepy doctor.”
For this review, the writer refers to some subject matter and character elements purely as description rather than as a reaction to the show, but this leads some of the lexicons to add unintended negative scoring.
“Didn’t know that to get multiple stars and happy reviews at the EdFringe, performers have to resort to sex and toilet ‘jokes’. This is one of the two shows I wanted to walk out on. The spousal gag was old after 2 minutes and then there was another 56-58 to tolerate! While many reviews below are positive , the audience I sat in was quite silent throughout and the burning anger at the show from the audience was palpable. Maybe EdFringe and its reviewers think they are breaking taboos and boundaries by fostering an environment of low level genital and toilet ‘humour’, and EdFringe is certainly raking in pounds this way; but it really ruined my 2 weeks at the Fringe to (a) see sensitive and truly funny and whimsical shows ignored by the venues in their tweets, (b) read inaccurate teenage ‘professional’ reviews from outfits like Broadwaybaby and Theatre Weekly while (c) promoting low level puerile humour as in this show. Overall I would give 0 stars to this show and to EdFringe itself for its imbalance in promoting puerile standup ‘comedy’ vs other genres in the arts.”
This review is almost the opposite, in that everything it says about the show is negative, but a few mentions of hypothetical better shows and other reviews must balance out the sentiment score. I noticed that this one was labelled differently by every lexicon: AFINN said it was positive, Bing said it was mixed, and only NRC correctly identified it as negative.
“Had small audience but didn’t short change us. A very funny interactive show. Went as seen at Cambridge footlights show and we weren’t disappointed. Recommended.”
Finally, this is an example of a review where negation changes the meaning (“weren’t disappointed”). By scoring sentiment on a word-by-word basis, the lexicon analysis misses this context and assigns a negative score for that word.
Conclusion
Firstly, I think it is worth pointing out that the results for the sentiment lexicons weren’t terrible. The AFINN lexicon was right 86.5% of the time. Its results would tell you that the reviews were overwhelmingly positive, which is true.
But there are clearly ways this approach could be improved. Next time, I’ll look at how considering multiple words together impacts the results when using sentiment lexicons.
(Update: I edited this post on 14/11/2019 to reflect minor changes in the tidytext package.)