Introduction to Data scraping with R

Scraping Data From the Web

Data on the web are often presented in tables. For instance, we can see a list of countries by population in 1900 on Wikipedia

Web pages are written in HTML (Hyper Text Markup Language) which uses tags to describe different aspects of document content. For example, a heading in a document is indicated by <h1>My Title</h1> whereas a paragraph would be indicated by <p>A paragraph of content...</p>.

In this tutorial, we will learn how to read data from a table on a web page into R. We will need the package rvest to get the data from the web page, and the stringr package to clean up the data.

library(rvest)
library(stringr)

1. Reading data into R with `rvest`

To get the population data on Wikipedia into R, we use the read_html command from the xml2 package (which is attached when rvest is called) to parse the page to obtain an HTML document.

We then use the html_nodes command that extracts all occurrences of the desired tag. We will be interested in scraping data presented in tables, so in the source code, we look for the table tag: <table> ... </table>.

Note: some of the rvest commands may be slow depending on your Internet connection and the complexity of the web page.

popParse <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")

str(popParse)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

The read_html command creates an R object, basically a list, that stores information about the web page.

To extract a table from this web page (this may take some time):

popNodes <- html_nodes(popParse, "table")
popNodes

## {xml_nodeset (4)}
## [1] <table class="metadata plainlinks ambox ambox-style ambox-More_footn ...
## [2] <table class="wikitable floatright">\n  <tr><th colspan="3" style="b ...
## [3] <table class="wikitable" style="text-align:right">\n  <tr><th>Rank</ ...
## [4] <table class="nowraplinks hlist collapsible autocollapse navbox-inne ...

There are several tables in this document. By inspecting the output of popNodes, we make a guess that we want the third table. (In other cases, trial and error may be required.) We select the second table by using double brackets:

pop <- html_table(popNodes, header = TRUE, fill = TRUE)[[3]]
str(pop)

## 'data.frame':    228 obs. of  3 variables:
##  $ Rank                                 : chr  "-" "1" "2" "3" ...
##  $ Country/Territory                    : chr  "  World" " Qing China" " Indian Empire[a] (UK)" " Russian Empire" ...
##  $ Population c.<U+2009>1900 estimate[1]: chr  "1,700,000,000" "415,001,488" "280,912,000" "119,546,234" ...

3. Cleaning the data frame

We now have a workable data frame that we can analyze. Notice that even though the first and third columns are numbers, they are classified as “character.” For Rank, that is because the first observation is the world population and it is not assigned a rank, but rather, the character “-”. The Population column is also a character because the numbers have commas in them, plus Morocco’s population is given as “8,000,000 [2][3]” to indicate some footnotes. We need to convert these columns to be numeric.

In the case of the Rank column, we will remove the world population.

We will also simplify the name of the third column to “Population.”

pop2 <- pop[-1, ]         #remove row 1
head(pop2)

##   Rank                  Country/Territory
## 2    1                         Qing China
## 3    2              Indian Empire[a] (UK)
## 4    3                     Russian Empire
## 5    4  United States[b] (US proper only)
## 6    5                            Germany
## 7    6                    Austria-Hungary
##   Population c.<U+2009>1900 estimate[1]
## 2                           415,001,488
## 3                           280,912,000
## 4                           119,546,234
## 5                            75,994,575
## 6                            56,000,000
## 7                            51,356,465

row.names(pop2) <- NULL    #reset row numbers to start at 1

pop2$Rank <- as.numeric(pop2$Rank) #coerce Rank to numeric

names(pop2)[3] <- "Population" #rename 3rd column

To remove the commas in the Population numbers, we will use str_replace_all from the stringr package.

pop2$Population <- str_replace_all(pop2$Population, ",", "")
head(pop2)

##   Rank                  Country/Territory Population
## 1    1                         Qing China  415001488
## 2    2              Indian Empire[a] (UK)  280912000
## 3    3                     Russian Empire  119546234
## 4    4  United States[b] (US proper only)   75994575
## 5    5                            Germany   56000000
## 6    6                    Austria-Hungary   51356465

We still have a problem with Morocco, country number 19. Since it is the only country that is a problem, we can fix this individually, but we will also show a more general way to do this in case you encounter more than one instance.

pop2$Population[19]

## [1] "8000000[2][3]"

out <- pop2$Population
out[19] <- "8000000"
out <- as.numeric(out)
head(out, 20)

##  [1] 415001488 280912000 119546234  75994575  56000000  51356465  45500000
##  [8]  42000000  38000000  38000000  32000000  30860000  18520000  17000000
## [15]  12050000  12000000   8500000   8000000   8000000   8000000

out2 <- str_replace_all(pop2$Population, "\\[[^]]+\\]", "")
out2 <- as.numeric(out2)
head(out2, 20)

##  [1] 415001488 280912000 119546234  75994575  56000000  51356465  45500000
##  [8]  42000000  38000000  38000000  32000000  30860000  18520000  17000000
## [15]  12050000  12000000   8500000   8000000   8000000   8000000

identical(out, out2)

## [1] TRUE

pop2$Population <- out2

4. A movie box office example

The web site Box Office Mojo gives statistics on box office earnings of movies. In addition to daily earnings, the web site also maintains lists of yearly and all time record holders.

We will look at the movies in the top 100 in all time movie worldwide grosses in box office receipts. In particular, we will scrape the data from Box Office Mojo: All Time Box Office. The dollar amounts are in millions of dollars and the years marked with “^” indicate that the movie had multiple releases.

movieParse<- read_html("http://www.boxofficemojo.com/alltime/world/?pagenum=1")

movieTables <- html_nodes(movieParse, "table")
head(movieTables)

## {xml_nodeset (3)}
## [1] <table border="0" cellpadding="0" cellspacing="0" width="100%">\n  < ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%">\n  < ...
## [3] <table border="0" cellpadding="5" cellspacing="1" width="98%">\n  <t ...

movies <- html_table(movieTables, header = TRUE, fill = TRUE)[[3]]
str(movies)

## 'data.frame':    100 obs. of  9 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title       : chr  "Avatar" "Titanic" "Star Wars: The Force Awakens" "Jurassic World" ...
##  $ Studio      : chr  "Fox" "Par." "BV" "Uni." ...
##  $ Worldwide   : chr  "$2,788.0" "$2,186.8" "$2,068.2" "$1,670.4" ...
##  $ Domestic / %: chr  "$760.5" "$658.7" "$936.7" "$652.3" ...
##  $ Domestic / %: chr  "27.3%" "30.1%" "45.3%" "39.0%" ...
##  $ Overseas / %: chr  "$2,027.5" "$1,528.1" "$1,131.6" "$1,018.1" ...
##  $ Overseas / %: chr  "72.7%" "69.9%" "54.7%" "61.0%" ...
##  $ Year^       : chr  "2009^" "1997^" "2015" "2015" ...

We will first clean up the column names and then convert the character variables into numeric variables.

names(movies)[5:9] <- c("DomesticDollars", "DomesticPercentage", "OverseasDollars", "OverseasPercentage", "Year")

For the grosses, we note that the numbers are either prefaced by a dollar sign or end with a percentage, both of which will need to be removed. We will also have to remove the commas.

out <- str_replace_all(movies$Worldwide, "\\$|,", "" )
head(out)

## [1] "2788.0" "2186.8" "2068.2" "1670.4" "1519.6" "1516.0"

movies$Worldwide <- as.numeric(out)

Exercise Clean up the remaining variables.

How many images are there on this web page? Images are coded as <img source = "NameOfImageFile.jpg">, so we search for the selector “img”.

moviesImg <- html_nodes(movieParse, "img")

moviesImg

## {xml_nodeset (8)}
## [1] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box  ...
## [2] <img src="/images/FaceBook_16x16.png" border="0" alt="Facebook Logo"/>
## [3] <img src="/images/Twitter_16x16.png" border="0" alt="Twitter Logo"/>
## [4] <img src="/images/mojo_imdb_sm.png"/>
## [5] <img src="/images/space.gif" border="0" width="1" height="5"/>
## [6] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box  ...
## [7] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" va ...
## [8] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp;c ...

4. A Billboard Top 200 example

The website billboard.com keeps track of top songs and albums from the music industry.

One page lists the top 200 albums for a given week. We will extract these albums. In the source code, the songs are listed in headers: <h2 class = "chart-row__song">Album Name </h2>.

albumParse <- read_html("http://www.billboard.com/charts/billboard-200")

albumNodes <- html_nodes(albumParse, "h2")

albumNodes[1:10]

## {xml_nodeset (10)}
##  [1] <h2 class="how-it-works__headline">\n                <span class="h ...
##  [2] <h2 class="chart-row__song">Birds In The Trap Sing McKnight</h2>
##  [3] <h2 class="info-module__title">\n\t\t<span class="visible-medium-on ...
##  [4] <h2 class="info-module__song">This Is Acting</h2>
##  [5] <h2 class="info-module__song">The Very Best Of Fleetwood Mac</h2>
##  [6] <h2 class="info-module__song">Birds In The Trap Sing McKnight</h2>
##  [7] <h2 class="chart-row__song">Bad Vibrations</h2>
##  [8] <h2 class="chart-row__song">Views</h2>
##  [9] <h2 class="chart-row__song">Suicide Squad: The Album</h2>
## [10] <h2 class="chart-row__song">Dig Your Roots</h2>

There are other entries with <h2> headers besides the top songs, so we must first detect the nodes of interest.

index <- str_detect(albumNodes, "chart-row__song")
index[1:10]

##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

albums <- albumNodes[index]
albums[1:10]

## {xml_nodeset (10)}
##  [1] <h2 class="chart-row__song">Birds In The Trap Sing McKnight</h2>
##  [2] <h2 class="chart-row__song">Bad Vibrations</h2>
##  [3] <h2 class="chart-row__song">Views</h2>
##  [4] <h2 class="chart-row__song">Suicide Squad: The Album</h2>
##  [5] <h2 class="chart-row__song">Dig Your Roots</h2>
##  [6] <h2 class="chart-row__song">Encore: Movie Partners Sing Broadway</h2>
##  [7] <h2 class="chart-row__song">Blurryface</h2>
##  [8] <h2 class="chart-row__song">ANTI</h2>
##  [9] <h2 class="chart-row__song">Lemonade</h2>
## [10] <h2 class="chart-row__song">Blonde</h2>

Now, we just need to extract the album name which is placed between <h2 ... > and </h2>. We will use a regular expression to handle this.

albums <- str_replace(albums, ".*\">(.+)</h2>","\\1")
albums[1:10]

##  [1] "Birds In The Trap Sing McKnight"     
##  [2] "Bad Vibrations"                      
##  [3] "Views"                               
##  [4] "Suicide Squad: The Album"            
##  [5] "Dig Your Roots"                      
##  [6] "Encore: Movie Partners Sing Broadway"
##  [7] "Blurryface"                          
##  [8] "ANTI"                                
##  [9] "Lemonade"                            
## [10] "Blonde"

In str_replace, we first match 0 or more occurrences of any character followed by a “>. This is followed by one or more occurrences of any character—this is the (.+) portion of the regular expression, and finally followed by the </h2>. This expression will be replaced by what is in the (.+) portion of the regular expression—this is what the”\1" argument is accomplishing.

On Your Own

The web site BikeRaceInfo has a table with data on past winners of the Tour de France. Create a cleaned-up data frame of this data.
The web site NY Times Best Sellers: Hardcover Fiction contains a list of best-selling fiction books. Scrape the names of these top books. The list of books are tagged via <h2>title</h2>.

Resources

HTML Tutorial