Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.

Verb(s) Meaning
filter and slice pick specific observations (i.e. specific rows)
arrange reorder the rows
select pick variables by their names (i.e. specific columns)
mutate add new calculated columns to a data frame
summarize aggregate many rows into a single row

In this example we will explore how to use each of these functions, as well as how to combine them with the group_by function for groupwise manipulations.

To begin, let’s make sure that our data set and the dplyr package are loaded

colleges <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-wrangling/data/colleges2015.csv")
#install.packages("dplyr")
library(dplyr)

Data: The file college2015.csv contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant.

To get a feel for what data are available, look at the first six rows

head(colleges)
##   unitid                             college    type       city state    region
## 1 100654            Alabama A & M University  public     Normal    AL Southeast
## 2 100663 University of Alabama at Birmingham  public Birmingham    AL Southeast
## 3 100690                  Amridge University private Montgomery    AL Southeast
## 4 100706 University of Alabama in Huntsville  public Huntsville    AL Southeast
## 5 100724            Alabama State University  public Montgomery    AL Southeast
## 6 100751           The University of Alabama  public Tuscaloosa    AL Southeast
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.8989      17         17       4051 18888   0.2914      0.6314
## 2        0.8673      23         26      11200 19990   0.5377      0.8016
## 3            NA      NA         NA        322 12300   0.6667      0.3750
## 4        0.8062      25         26       5525 20306   0.4835      0.8098
## 5        0.5125      17         17       5354 17400   0.2517      0.6219
## 6        0.5655      25         27      28692 26717   0.6665      0.8700
##   fedloan    debt
## 1  0.8204 33611.5
## 2  0.5397 23117.0
## 3  0.7629 26995.0
## 4  0.4728 24738.0
## 5  0.8735 33452.0
## 6  0.4148 24000.0

the last six rows

tail(colleges)

and the structure of the data frame.

str(colleges)

1. Filtering rows

To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the colleges data set using the following code:

wi <- filter(colleges, state == "WI")
head(wi)
##   unitid                     college    type      city state      region
## 1 238193             Alverno College private Milwaukee    WI Great Lakes
## 2 238324              Bellin College private Green Bay    WI Great Lakes
## 3 238333              Beloit College private    Beloit    WI Great Lakes
## 4 238430 Cardinal Stritch University private Milwaukee    WI Great Lakes
## 5 238458          Carroll University private  Waukesha    WI Great Lakes
## 6 238476            Carthage College private   Kenosha    WI Great Lakes
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.7887      19         19       1833 30496   0.3852      0.7173
## 2        0.5556      25         24        285    NA   0.6786      0.7500
## 3        0.6769      26         28       1244 48236   0.7815      0.9228
## 4        0.8423      22         22       2680 37563   0.4162      0.7099
## 5        0.8114      24         24       3024 37963   0.5629      0.7598
## 6        0.7018      24         24       2874 44910   0.6501      0.7841
##   fedloan  debt
## 1  0.8784 33110
## 2  0.8145 18282
## 3  0.5784 26500
## 4  0.7178 27875
## 5  0.7108 27000
## 6  0.8048 27000

Remarks

Questions:

  1. How many Maryland colleges are in the colleges data frame? (The abbreviation for Maryland is MD.)
  2. How many private Maryland colleges with under 5000 undergraduates are in the colleges data frame?

2. Slicing rows

To extract rows 10 through 16 from the colleges data frame we use the slice function.

slice(colleges, 10:16)
##   unitid                       college    type         city state    region
## 1 100937   Birmingham Southern College private   Birmingham    AL Southeast
## 2 101073     Concordia College Alabama private        Selma    AL Southeast
## 3 101189           Faulkner University private   Montgomery    AL Southeast
## 4 101435            Huntingdon College private   Montgomery    AL Southeast
## 5 101453 Heritage Christian University private     Florence    AL Southeast
## 6 101480 Jacksonville State University  public Jacksonville    AL Southeast
## 7 101541                Judson College private       Marion    AL Southeast
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.6422      25         27       1181 44512   0.6192      0.8037
## 2            NA      NA         NA        523 17655   0.2115      0.4103
## 3            NA      NA         NA       2358 28485   0.2287      0.5000
## 4        0.6279      20         22       1100 31433   0.4319      0.6196
## 5            NA      NA         NA         67 21160   0.0000      1.0000
## 6        0.8326      21         22       7195 19202   0.3083      0.7112
## 7        0.7388      20         23        331 27815   0.4051      0.5974
##   fedloan  debt
## 1  0.4939 27000
## 2  0.9100 26500
## 3  0.7427 23750
## 4  0.7227 27000
## 5  0.4839    NA
## 6  0.6811 23500
## 7  0.7110 26000

Remarks

3. Arranging rows

To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange function.

costDF <- arrange(colleges, cost)
head(costDF)
##   unitid                                   college    type        city state
## 1 197027     United States Merchant Marine Academy  public Kings Point    NY
## 2 176336              Southeastern Baptist College private      Laurel    MS
## 3 241951 Escuela de Artes Plasticas de Puerto Rico  public    San Juan    PR
## 4 241216               Atlantic University College private    Guaynabo    PR
## 5 241377              Caribbean University-Bayamon private     Bayamon    PR
## 6 243221     University of Puerto Rico-Rio Piedras  public    San Juan    PR
##                 region admissionRate ACTmath ACTenglish undergrads cost
## 1 U.S. Service Schools            NA      NA         NA        958 6603
## 2            Southeast            NA      NA         NA         37 6753
## 3       Outlying Areas        0.7154      NA         NA        529 7248
## 4       Outlying Areas            NA      NA         NA       1365 7695
## 5       Outlying Areas            NA      NA         NA       1572 8006
## 6       Outlying Areas        0.5248      NA         NA      11834 8020
##   gradRate FYretention fedloan debt
## 1   0.7365      0.9733  0.0780 4211
## 2   0.6875      1.0000  0.0000   NA
## 3   0.4127      0.8382  0.0000   NA
## 4   0.3891      0.7200  0.1053 5000
## 5   0.2166      0.7951  0.2210 9000
## 6   0.4748      0.8968  0.0966 5500

Remarks

Questions

  1. What school is most expensive?
  2. What school has the least expensive tuition in Wisconsin?

4. Selecting columns

Suppose that you are only interested in a subset of the columns in the data set—say, college, city, state, undergrads, and cost—and want to create a data frame with only these columns. To do this, we select the desired columns:

lessCols <- select(colleges, college, city, state, undergrads, cost)
head(lessCols)
##                               college       city state undergrads  cost
## 1            Alabama A & M University     Normal    AL       4051 18888
## 2 University of Alabama at Birmingham Birmingham    AL      11200 19990
## 3                  Amridge University Montgomery    AL        322 12300
## 4 University of Alabama in Huntsville Huntsville    AL       5525 20306
## 5            Alabama State University Montgomery    AL       5354 17400
## 6           The University of Alabama Tuscaloosa    AL      28692 26717

Remarks

drop_unitid <- select(colleges, -unitid)
head(drop_unitid)
##                               college    type       city state    region
## 1            Alabama A & M University  public     Normal    AL Southeast
## 2 University of Alabama at Birmingham  public Birmingham    AL Southeast
## 3                  Amridge University private Montgomery    AL Southeast
## 4 University of Alabama in Huntsville  public Huntsville    AL Southeast
## 5            Alabama State University  public Montgomery    AL Southeast
## 6           The University of Alabama  public Tuscaloosa    AL Southeast
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.8989      17         17       4051 18888   0.2914      0.6314
## 2        0.8673      23         26      11200 19990   0.5377      0.8016
## 3            NA      NA         NA        322 12300   0.6667      0.3750
## 4        0.8062      25         26       5525 20306   0.4835      0.8098
## 5        0.5125      17         17       5354 17400   0.2517      0.6219
## 6        0.5655      25         27      28692 26717   0.6665      0.8700
##   fedloan    debt
## 1  0.8204 33611.5
## 2  0.5397 23117.0
## 3  0.7629 26995.0
## 4  0.4728 24738.0
## 5  0.8735 33452.0
## 6  0.4148 24000.0

5. Mutating data (adding new columns)

Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.

colleges <- mutate(colleges, admissionPct = 100 * admissionRate)

Remarks

colleges <- mutate(colleges, FYretentionPct = 100 * FYretention,
                   gradPct = 100 * gradRate)

6. Summarizing rows

To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize command. (Note that you can also use the British spelling: summarise.) For example, to calculate the median cost of all 1776 colleges in our data set we run the following command:

summarize(colleges, medianCost = median(cost, na.rm = TRUE))
##   medianCost
## 1      29849

Remarks

summarize(colleges, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
##    min    Q1 median    Q3   max
## 1 6603 19831  29849 41180 62636

Question

  1. What happens if we remove na.rm = TRUE from the code above?

7. Groupwise manipulation

Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by function, and then we can use any of the above functions. Most often group_by is paired with summarise or mutate.

Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.

colleges_by_type <- group_by(colleges, type)

Remarks

Combining group_by with other commands

Once we have a grouped data frame, we can obtain summaries by group via summarize. For example, the five number summary of cost by institution type is obtained below

summarize(colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
## # A tibble: 2 x 6
##   type      min     Q1 median     Q3   max
##   <fct>   <int>  <dbl>  <dbl>  <dbl> <int>
## 1 private  6753 28788.  37302 45728. 62636
## 2 public   6603 16822   19303 21909  33208

We can also calculate new variables within groups, such as the standardized cost of attendance within each state:

colleges_by_state <- group_by(colleges, state)
colleges_by_state <- mutate(colleges_by_state, 
                            mean.cost = mean(cost, na.rm = TRUE), 
                            sd.cost = sd(cost, na.rm = TRUE),
                            std.cost = (cost - mean.cost) / sd.cost)
head(colleges_by_state)
## # A tibble: 6 x 21
## # Groups:   state [1]
##   unitid college type  city  state region admissionRate ACTmath ACTenglish
##    <int> <fct>   <fct> <fct> <fct> <fct>          <dbl>   <int>      <int>
## 1 100654 Alabam… publ… Norm… AL    South…         0.899      17         17
## 2 100663 Univer… publ… Birm… AL    South…         0.867      23         26
## 3 100690 Amridg… priv… Mont… AL    South…        NA          NA         NA
## 4 100706 Univer… publ… Hunt… AL    South…         0.806      25         26
## 5 100724 Alabam… publ… Mont… AL    South…         0.512      17         17
## 6 100751 The Un… publ… Tusc… AL    South…         0.566      25         27
## # … with 12 more variables: undergrads <int>, cost <int>, gradRate <dbl>,
## #   FYretention <dbl>, fedloan <dbl>, debt <dbl>, admissionPct <dbl>,
## #   FYretentionPct <dbl>, gradPct <dbl>, mean.cost <dbl>, sd.cost <dbl>,
## #   std.cost <dbl>

Remarks

  • mutate allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.
  • The group_by function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. Notice that only the first 10 rows print when we print the data frame in the console by typing colleges_by_state, and the width of the console determines how many variables are shown.
  • To print all columns we can convert the results back to a data.frame using the as.data.frame function. Try running head(as.data.frame(colleges_by_state)).
  • You can also use the viewer by running the command View(colleges_by_state).
  • Another option is to select a reduced number of columns to print.

8. On Your Own

Option #1 (Questions 1 - 5)

  1. Filter the rows for colleges in Great Lakes or Plains regions.
  2. Arrange the subset from question #1 to reveal what school has the highest first-year retention rate in this reduced data set.
  3. Arrange the subset from question #1 to reveal what school has the lowest admissions rate in this reduced data set.
  4. Using the full data set, create a column giving the cumulative average cost of attendance, assuming that students finish in four years and that costs increase 3% per year. Name this new column total.avg.cost4.
  5. Using the full data set, summarize the distribution of total cost of attendance by region using the five number summary. Briefly describe any differences in total cost that you observe.

Option #2 (Questions 6 and 7)

The package “nycflights13” contains data on all 336,776 flights that departed from New York City in 2013, the data comes from the US Bureau of Transportation Statistics. For all questions 6 and 7 you should ignore (exclude) any missing data using “na.rm” whenever applicable.

#install.packages("nycflights13")
#library(nycflights13)
#dim(flights)
#head(flights)
  1. For this question, suppose we are interested in determining which carriers are most reliable when it comes to reaching the destination on time. For each carrier, report:
  • The proportion of its flights that arrived late (positive arr_delay)
  • The average number of minutes late that the carrier was for its flights that arrived late
  • The average “relative lateness”" of the carrier for its flights that arrived late. Note that relative lateness refers to how many minutes late a flight was relative to how long the flight took. For example, a flight that is 4 hours long and 5 minutes late (2.1% late) is “later” than a flight that is 2 hours long and 2 minutes late (4.2% late)
  1. For this question, suppose we are interested in determining whether or not the pilots of flights that depart late tend to try to make up for lost time by flying faster. To do so, write code that finds the average air speed (in miles per hour) of late departing and early/on-time departing flights. Also, include 2-3 sentences discussing whether you believe that this analysis adequetely answers the question, or if there are other details (such as obvious confounding variables) that should be explored before making a definitive conclusion.

9. Additional Resources