Text Mining Wodehouse – Point Estimates

Text mining is one of those fields that can be hit or miss. Sometimes, it’s an extraordinarily useful way to get insight from a large amount of data, such as Tweets or customer comments. Other times, the input is sparse and you don’t get a lot out of it. Regardless, as an avid reader, it’s something I’ve always found interesting.

In that context, I thought it’d be fun to apply text mining to P.G. Wodehouse. I have a copy of “Carry On, Jeeves” that I’ve read many times, so it seemed like a good place to start. While most Jeeves and Wooster stories are told from Bertie’s point of view, the last chapter in this book is told from Jeeves’s perspective. So we can look to see if anything stands out about that chapter compared to the others.

We start by downloading the book from Project Gutenberg, then clean the data up a bit: removing some intro text we don’t need, adding chapter identifiers, and setting the data up for analysis. I’m using the tidytext package; for more in-depth tutorials, I’d recommend reading “Text Mining with R”.
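To make the chapter-identifier step concrete, here is a minimal sketch on a handful of made-up lines. The "1--Title" heading format is an assumption inferred from the pattern used in the real code, and the toy text is invented, not taken from the book.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up lines standing in for the Gutenberg text (hypothetical headings)
toy <- tibble(text = c(
  "1--First Story",
  "Some opening line.",
  "Another line.",
  "2--Second Story",
  "A line from the next chapter."
))

toy_chapters <- toy %>%
  # Keep the heading text on heading lines, NA everywhere else
  mutate(chapter = if_else(str_detect(text, "^\\d+--"), text, NA_character_)) %>%
  # Carry each heading down to the lines that follow it
  fill(chapter) %>%
  # Split "1--First Story" into a chapter number and a chapter name
  separate_wider_delim(chapter, "--", names = c("chap_num", "chapter_name")) %>%
  mutate(chap_num = as.integer(chap_num))

toy_chapters
```

Every line ends up tagged with the chapter it belongs to, which is what lets us group and compare by chapter later on.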

First, we’ll look at the ten most common words in each chapter, excluding common words like “the”, “and”, etc.

library(hrbrthemes)
library(tidytext)
library(tidyverse)


jeeves_raw <- read_delim('http://aleph.gutenberg.org/6/5/9/7/65974/65974-0.txt', delim = "\t")

jeeves_prepped <- jeeves_raw %>%
  as_tibble() %>%
  rename(text = 1) %>%
  filter(row_number() >= 56) %>% #cheated and looked this up to get rid of preface stuff
  mutate(chapter = ifelse(str_detect(text, "\\d+--"), text, NA)) %>%
  fill(chapter) %>%
  mutate(chapter = trimws(chapter)) %>%
  separate_wider_delim(chapter, '--', names = c("chap_num", "chapter_name")) %>%
  mutate(chap_num = as.integer(chap_num)) %>%
  mutate(chapter_name = fct_reorder(chapter_name, chap_num)) %>%
  group_by(chap_num) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()
  
tidy_jeeves <- jeeves_prepped %>%
  unnest_tokens(output = word, input = text)

tidy_jeeves_no_stop_words <- tidy_jeeves %>%
  anti_join(stop_words, by = "word")

tidy_jeeves_no_stop_words %>%
  count(chapter_name, word, sort = TRUE) %>%
  group_by(chapter_name) %>%
  arrange(desc(n)) %>%
  mutate(n_rank = row_number()) %>%
  filter(n_rank <= 10) %>%
  mutate(word = reorder_within(word, n, chapter_name)) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, scales = "free_y", ncol = 3) +
  coord_flip() +
  scale_x_reordered()+
  theme_ipsum()

This shows the prevalence of certain characters in each chapter. “Corky” is the most common word in “The Artistic Career of Corky”, for instance, which makes perfect sense. “Sir” is common throughout, as it’s how Jeeves addresses Wooster, while “Wooster” is the top word in the Jeeves chapter, as this is how Jeeves refers to his employer. (Bertie will talk of the Code of the Woosters from time to time, but doesn’t often bring up his own name.)
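If the counting step feels abstract, here is a small sketch of the same idea on invented sentences: split the text into words, drop stop words, and keep the top word per chapter. The chapter labels and sentences are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Invented text; "A" and "B" are placeholder chapter names
toy <- tibble(
  chapter_name = c("A", "A", "B"),
  text = c(
    "Jeeves shimmered in with the tea.",
    "Jeeves, the toast is cold.",
    "Corky painted the portrait, and Corky despaired."
  )
)

toy_top <- toy %>%
  # One lowercase word per row, punctuation stripped
  unnest_tokens(output = word, input = text) %>%
  # Drop common words such as "the", "and", "in"
  anti_join(stop_words, by = "word") %>%
  count(chapter_name, word, sort = TRUE) %>%
  # Keep the single most frequent word in each chapter
  group_by(chapter_name) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_top
```

Character names rise to the top here for the same reason they do in the real chapters: once the stop words are gone, names are what gets repeated.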

We can compare word usage in chapters 1-9 against the Jeeves chapter (chapter 10) to see whether it changes. We’ll use the log odds ratio (see chapter 7 of “Text Mining with R”) to measure the difference. We’ll focus on the words themselves rather than the exact values.
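As a rough illustration of what the log odds ratio measures, here is a toy version with made-up counts (not taken from the book). Each count gets one added to it, is divided by that narrator's total plus one, and then the two shares are compared on a log scale.

```r
library(dplyr)
library(tidyr)

# Made-up counts for three words, split by narrator
toy_counts <- tribble(
  ~word,     ~author,   ~n,
  "wooster", "Jeeves",  9,
  "wooster", "Wooster", 1,
  "bertie",  "Jeeves",  1,
  "bertie",  "Wooster", 9,
  "sir",     "Jeeves",  4,
  "sir",     "Wooster", 4
)

toy_ratio <- toy_counts %>%
  # One row per word, one column per narrator
  pivot_wider(names_from = author, values_from = n, values_fill = 0) %>%
  # Add-one smoothing, then turn each column into a share of that narrator's words
  mutate(across(c(Jeeves, Wooster), ~ (.x + 1) / (sum(.x) + 1))) %>%
  # Positive = relatively more common for Jeeves, negative = for Wooster
  mutate(logratio = log(Jeeves / Wooster)) %>%
  arrange(desc(logratio))

toy_ratio
```

In this toy data "wooster" gets a log ratio of log(5), about 1.61, "bertie" gets the mirror image at about -1.61, and "sir" lands at 0 because both narrators use it equally often.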

jeeves_ratio <- tidy_jeeves_no_stop_words %>%
  mutate(author = ifelse(chap_num == 10, 'Jeeves', 'Wooster')) %>%
  count(word, author) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = author, values_from = n, values_fill = 0) %>%
  # add-one smoothing, then convert each count column to a share of its total
  mutate(across(where(is.numeric), ~ (.x + 1) / (sum(.x) + 1))) %>%
  mutate(logratio = log(Jeeves / Wooster)) %>%
  arrange(desc(logratio))

jeeves_ratio %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Jeeves/Wooster)") +
  theme_ipsum() +
  # FALSE (positive log ratio) = Jeeves, TRUE (negative) = Wooster
  scale_fill_manual(values = c('#889185', '#5A93C1'))

This is largely story-driven: chapter 10 deals with Miss Tomlinson and a young ladies’ school where Bertie gives a disastrous speech. Chapter 10 uses “Wooster” rather than “Bertie”, again showing how Jeeves prefers to refer to the character. The Wooster side of the chart mostly picks up characters and relatives who aren’t in chapter 10.

All in all, it’s a fun exercise, but in the end, you’re still better off reading Wodehouse. If you haven’t, I guarantee it’s worth your time.