Using rvest to redo my dissertation

In 2013, I wrote my Masters dissertation on the barriers women face claiming asylum in the UK on the basis of sexual orientation. There were several elements to this research, one of which involved me downloading 762 decisions on cases in the Upper Tribunal (Immigration and Asylum Chamber) (UTIAC) from 2012 and reading them to find any that might be relevant. This was…time-consuming. But it was also totally worth it at the time; it’s an under-researched area and I care strongly about ensuring vulnerable people who come to the UK needing help and protection receive it.

A few times since then, I’ve thought about redoing the research to keep it up to date, but the scale of the task has always been a problem while I’ve been working. However, since getting into data science, I’ve wondered whether web scraping might offer a much more efficient way of identifying relevant cases, as well as an easier way to reproduce the research. The perfect opportunity to test this theory arose when I attended #TechItForward with Pride, a hackathon run by the wonderful group Women Driven Development. With a day to work on a project relating to LGBTQIA+ issues, I proposed trying this out, and on Friday worked with a small group of other interested technical professionals to take on the challenge.

It’s worth noting that the information available on cases relating to sexual orientation has improved a lot since my dissertation days. Firstly, there are now official statistics published that show, at an aggregate level, the success rates of these types of cases. Additionally, the site that displays UTIAC decisions now has a (better) search function, so for example you can filter the results by searching for something like “sexual orientation”. So my dissertation would already have been much faster to pull together in 2019! However, partly because I wanted to learn more about the potential of web scraping, and partly because there would be advantages to having the text of the decisions in a structured data format for any other analysis, we decided it was still worth pursuing this approach.

The team all worked on different elements; I focused on web scraping using the R package rvest, so that’s what I’m going to cover in this post.

Scraping one page

The first thing to work out was whether I could grab the relevant text from a page. Choosing a page representing a decision, I used the following code:

# Load libraries
library(rvest)             # For webscraping

example_decision <- xml2::read_html("https://tribunalsdecisions.service.gov.uk/utiac/pa-05879-2018")

decision_text <- example_decision %>%
  html_node(".decision-inner") %>%
  html_text()

Here are some things to know about this:

  • read_html() downloads and parses the file. I used xml2:: as a prefix to read_html() because there is a clash between this and a function in textreadr, a package I load later on. The code above would work in isolation without the prefix, but I use it to avoid the clash becoming an issue when running and re-running chunks of the project. xml2 is a package that is loaded when you load rvest.

  • To identify the part of the page that I needed to scrape, I used selectorgadget to work out which selector the main body of text sits under. You can also go through and inspect elements in your browser - this is just a quicker way. html_node() identifies the first element that matches the selector; you would use the plural html_nodes() if you wanted all matches.

  • html_text() extracts text contents from html. We want a text output so we don’t need to change anything here.

This code gives us a character vector containing one value, which is the full text of the decision on the selected page. Perfect.

Scraping a .doc

It would have been easy (or easier) if all the decisions were on web pages. Unfortunately for us, in some cases you navigate to a decision’s page only to find that the text is available solely as a downloadable .doc file. I was hoping initially there might just be a few oddities here and there, but in the first batch of scraping I did, almost half the pages directed the reader to a .doc. So I had to come up with a solution.

library(textreadr)

example_decision_with_doc <- xml2::read_html("https://tribunalsdecisions.service.gov.uk/utiac/2019-ukut-197")

link_name <- example_decision_with_doc %>%
  html_node(".doc-file") %>%
  html_attr('href')

decision_text <- link_name %>%
  download() %>%
  read_document() %>%
  paste(collapse = "\n")

decision_text <- gsub("\n", " ", decision_text)

This code is kind of similar, but instead of grabbing the text from the relevant page, I’m grabbing the link to download the document. I use html_attr('href') rather than html_text() because I’m dealing with a link and want to get what it is linking to, not just the link text.

Then, armed with the link, I used the textreadr package to download and read the document, and did some tidying up to make it more readable and more similar to the decision text on the web pages.

Looping through the process

With all the links collected into a vector (case_links below), you can then loop through them and extract the text from each. This example just uses the cases from the random page I selected, not all of them.
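
I haven’t shown how case_links was put together, but here’s a minimal sketch of one way to build it from a single listing page - the ?page= parameter and the "/utiac/" filter are assumptions about how the listing is structured, so check them with selectorgadget before relying on them.

# A minimal sketch of building case_links from one listing page;
# the ?page= parameter and the "/utiac/" filter are assumptions
listing_page <- xml2::read_html("https://tribunalsdecisions.service.gov.uk/utiac?page=2")

case_links <- listing_page %>%
  html_nodes("a") %>%
  html_attr("href")

# Keep only the links that point to individual decisions
case_links <- case_links[grepl("^/utiac/", case_links)]

With case_links in place, the loop itself looks like this: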

library(data.table)

# Create an empty data table to populate

full_text_of_cases <- data.table("case_id" = character(), 
                                 "full_text" = character())

#### Loop through links to pull out information

for (i in seq_along(case_links)) {
  
  example_decision <- xml2::read_html(paste0("https://tribunalsdecisions.service.gov.uk", case_links[i]))
  
  decision_text <- example_decision %>%
    html_node(".decision-inner") %>%
    html_text()
  
  case_id <- example_decision %>%
    html_node("h1") %>%
    html_text()
  
  prom_date <- example_decision %>%
    html_node("li:nth-child(5) time") %>%
    html_text()
  
  prom_date <- as.Date(prom_date, format = "%d %B %Y")
  
  if (is.na(decision_text)) {
    
    link_name <- example_decision %>%
      html_nodes(".doc-file") %>%
      html_attr('href')
    
    decision_text <- link_name %>%
      download() %>%
      read_document() %>%
      paste(collapse = "\n")
    
    decision_text <- gsub("\n", " ", decision_text)
  }
  
  id_dt <- data.table("case_id" = case_id, 
                      "promulgation_date" = prom_date)
  
  full_text_of_cases <- rbind(full_text_of_cases, id_dt[, .(case_id, promulgation_date)], fill = TRUE)
  
  full_text_of_cases[case_id %in% id_dt[, case_id], full_text := decision_text]
}

Some of that code will be familiar - it’s just what we did at the start to extract the text from the individual pages. There are some extra steps here to make sure everything works:

  • I start off by creating an empty data.table so that the first iteration has something to append to; after that, each iteration’s results are added to the growing dataset of decision text (a leaner variant of this pattern is sketched after this list).

  • I add in some extra fields from outside the decision text - the promulgation date and the case ID. Both of these were sourced using selectorgadget.

  • The if statement means that if there isn’t text to populate the table from the page, it will switch to using the .doc approach.
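
Growing a data.table with rbind() inside a loop is absolutely fine at this scale, but if you end up scraping thousands of decisions, a common alternative is to build one small table per case and bind them all in a single step with rbindlist(). Here’s a minimal sketch of that pattern - scrape_case() is a hypothetical helper, and for brevity it skips the promulgation date and the .doc fallback from the loop above.

# Hypothetical helper wrapping the per-case scraping steps from the loop above
# (omits the promulgation date and the .doc fallback for brevity)
scrape_case <- function(link) {
  page <- xml2::read_html(paste0("https://tribunalsdecisions.service.gov.uk", link))
  data.table(
    case_id   = page %>% html_node("h1") %>% html_text(),
    full_text = page %>% html_node(".decision-inner") %>% html_text()
  )
}

# Bind all the one-row results together in a single step
full_text_of_cases <- rbindlist(lapply(case_links, scrape_case), fill = TRUE)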

Identifying relevant cases

At the end of this, I have a data.table containing 30 cases. But are there any that relate to sexual orientation or gender identity? I run a quick regex to find out.

# Identify cases that refer to sexual orientation or gender identity

regex_sogi <- "sexual orientation|lgbt|lesbian|\\bgay\\b|bisexual|transgender|gender identity|homosexual"

full_text_of_cases[, sogi_case := grepl(regex_sogi, full_text, 
                                        ignore.case = TRUE, perl = TRUE)]

full_text_of_cases[sogi_case == TRUE, .N]

If you look at the resulting data, you’ll find there is one case that seems relevant, and if you read the text (either from your dataset or by navigating to the page), you can confirm that it does relate to sexual orientation.
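
For example, to pull out the flagged cases for a closer read:

# Show the ID and promulgation date of any flagged cases
full_text_of_cases[sogi_case == TRUE, .(case_id, promulgation_date)]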

Thoughts

This took me a bit of time to pull together, especially because I am new to using rvest, but the time was nothing compared to finding relevant cases manually in 2013! What’s more, this approach covers all available cases, not just one year. And now that this code exists, it would be easy to rerun. The biggest time commitment is waiting for everything to be scraped - this can take a while, so it’s best to leave it running when you don’t have lots of other coding to get on with.

In terms of further analysis, it could be useful in and of itself simply to have relevant cases identified for further research. The regex above isn’t perfect - for example, it will flag cases that mention sexual orientation only in passing rather than as the focus of the case - but this could be improved by requiring multiple mentions (a sketch of this follows below), or somebody using the data could flick through the flagged cases - still a lot faster than reading all of them.
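
As a rough sketch of the “multiple mentions” idea - the threshold of three is arbitrary and just for illustration:

# Count how many times the regex matches in each decision;
# gregexpr() returns -1 when there are no matches
count_matches <- function(text) {
  if (is.na(text)) return(0L)
  matches <- gregexpr(regex_sogi, text, ignore.case = TRUE, perl = TRUE)[[1]]
  if (matches[1] == -1) 0L else length(matches)
}

full_text_of_cases[, sogi_mentions := vapply(full_text, count_matches, integer(1))]

# A stricter flag: only keep cases with several mentions (threshold is arbitrary)
full_text_of_cases[, sogi_case_strict := sogi_mentions >= 3]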

Additionally, there is more that could be done to identify whether a case was successful or not. You could use another regex, but there are a LOT of different ways that success or failure is described across the cases. We did some work compiling different phrases, but I think there is more that could be done to generalise them (e.g. phrases that include the word “appellant” might also appear with “claimant”). Alternatively, I might experiment at some point with an NLP approach - with enough examples, I think it would be possible to automate the categorisation of success. Similar approaches could be used for extracting other information, such as the person’s country of origin or date of birth.
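
To give a flavour of the regex route, here’s an illustrative sketch - these phrases are examples I’m assuming appear in determinations (such as “The appeal is allowed”), not the list we compiled at the hackathon, and they won’t catch every wording or mixed outcomes.

# Illustrative only: many decisions end with a standard determination such as
# "The appeal is allowed" or "The appeal is dismissed", but wording varies a lot
# and some appeals are allowed on one ground and dismissed on another
regex_allowed   <- "appeal is allowed"
regex_dismissed <- "appeal is dismissed"

full_text_of_cases[grepl(regex_allowed, full_text, ignore.case = TRUE), outcome := "allowed"]
full_text_of_cases[grepl(regex_dismissed, full_text, ignore.case = TRUE), outcome := "dismissed"]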

You can see our GitHub repo here, which includes a similar method implemented in Python, as well as the data we’ve scraped (so you don’t have to).

One final thing: this was my major project this month so I’m not going to do another post on what I learnt in July - even though I only started that series last month! I have a few other little things that I’ll add into next month’s post though. Until then, stay on the edges of your seats…
