Learning objectives

  • Understand when and why to iterate code
  • Be able to start with a single use and build up to iteration
  • Use for loops, apply functions, and purrr to iterate
  • Be able to write functions to cleanly iterate code


Once Twice Thrice in a Lifetime

And you may find yourself
Behind the keys of a large computing machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?


It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run 3+ times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally a bad practice for several reasons:

  • it’s easy to forget to change all the parts that need to be different
  • it’s easy to mistype
  • it is ugly to read
  • it scales very poorly (try copy-pasting 100 times…)

Lots of functions (including many base functions) are vectorized, meaning they already work on vectors of values. Here’s an example:

x <- 1:10
log(x)
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

The log() function already knows we want to take the log of each element in x, and it returns a vector that’s the same length as x. If a vectorized function already exists to do what you want, use it! It’s going to be faster and cleaner than trying to iterate everything yourself.
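
As a small sketch of how far vectorization reaches, the lines below apply a few other base R operations to the same x. The variable names (doubled, is_big, item_labels) are made up for this illustration.

```r
x <- 1:10

# Arithmetic operators work element by element
doubled <- x * 2

# So do comparisons, which return one TRUE/FALSE per element
is_big <- x > 5

# Many string functions are vectorized as well
item_labels <- paste0("item_", x)

doubled
is_big
item_labels
```

None of these needed a loop: each call returns a result with one entry per element of x.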

However, we may want to do more complex iterations, which brings us to our first main iterating concept.

For Loops

A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:

for(i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

You’ll often see i used in for loops; you can think of it as the iteration value. For each i value in the vector 1:10, we’ll print that index value. You can use the i value more than once in a loop:

for (i in 1:10) {
  print(i)
  print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100

On each pass, i takes the next value in the vector, the code block runs with that value, and the process repeats until the values run out. For loops can be a way to explicitly lay out fairly complicated procedures, since you can see exactly where your i value is going in the code.

You can also use the i value to index a vector or dataframe, which can be very powerful!

for (i in 1:10) {
  print(letters[i])
  print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44

Here we printed out the first 10 letters of the alphabet from the letters vector, as well as the first 10 car weights from the mtcars dataframe.

If you want to store your results somewhere, it is important that you create an empty object to hold them before you run the loop. If you grow your results vector one value at a time, it will be much slower. Here’s how to make that empty vector first. We’ll also use the function seq_along to create a sequence that’s the proper length, instead of explicitly writing out something like 1:10.

results <- rep(NA, nrow(mtcars))

for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
results
##  [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780 5250
## [16] 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513 3170 2770
## [31] 3570 2780
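
One reason to prefer seq_along() over something like 1:length(x) shows up when the input is empty. The sketch below uses a made-up empty vector and two counters (runs_colon and runs_seq, names chosen just for this illustration) to show how many times each loop body runs.

```r
# An empty input, e.g. the result of a filter that matched nothing
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts DOWN from 1 to 0
colon_sequence <- 1:length(empty)

# seq_along(empty) is a zero-length sequence
safe_sequence <- seq_along(empty)

# Count how many times each loop body actually executes
runs_colon <- 0
for (i in colon_sequence) runs_colon <- runs_colon + 1

runs_seq <- 0
for (i in safe_sequence) runs_seq <- runs_seq + 1

runs_colon  # 2
runs_seq    # 0
```

With 1:length(x) the loop runs twice on an empty input, indexing positions that don't exist, while seq_along() correctly skips the loop.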

purrr

For loops are very handy and important to understand, but they can involve writing a lot of code and can generally look fairly messy.

The tidyverse includes another way to iterate, using the map family of functions from the purrr package. These functions all do the same basic thing: take a series of values and apply a function to each of them. That function could be a function from a package, or it could be one you write to do something specific.
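
As a rough sketch of that idea, here is map_dbl() (one member of the family, which returns a numeric vector) used first with an existing function and then with one we write ourselves. The helper add_ten is hypothetical and exists only for this illustration.

```r
library(purrr)

inputs <- c(1, 4, 9)

# A function that already exists (base R's sqrt)
roots <- map_dbl(inputs, sqrt)

# A function we wrote ourselves (hypothetical, for illustration only)
add_ten <- function(value) value + 10
shifted <- map_dbl(inputs, add_ten)

roots    # 1 2 3
shifted  # 11 14 19
```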

For a wonderful and thorough exploration of the purrr package, check out Jenny Bryan’s tutorial.

map

When using the map family of functions, the first argument (as in most tidyverse functions) is the data. One nice feature is that you can specify the format of the output explicitly by using a different member of the family.

library(tidyverse) # loads purrr (the map functions) and the %>% pipe
mtcars %>% map(mean) # gives a list
## $mpg
## [1] 20.09062
## 
## $cyl
## [1] 6.1875
## 
## $disp
## [1] 230.7219
## 
## $hp
## [1] 146.6875
## 
## $drat
## [1] 3.596563
## 
## $wt
## [1] 3.21725
## 
## $qsec
## [1] 17.84875
## 
## $vs
## [1] 0.4375
## 
## $am
## [1] 0.40625
## 
## $gear
## [1] 3.6875
## 
## $carb
## [1] 2.8125
mtcars %>% map_dbl(mean) # gives a numeric vector
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
mtcars %>% map_chr(mean) # gives a character vector
## Warning: Automatic coercion from double to
## character was deprecated in purrr
## 1.0.0.
## ℹ Please use an explicit call to
##   `as.character()` within `map_chr()`
##   instead.
## Call
## `lifecycle::last_lifecycle_warnings()`
## to see where this warning was
## generated.
##          mpg          cyl         disp           hp         drat           wt 
##  "20.090625"   "6.187500" "230.721875" "146.687500"   "3.596563"   "3.217250" 
##         qsec           vs           am         gear         carb 
##  "17.848750"   "0.437500"   "0.406250"   "3.687500"   "2.812500"
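
The warning above asks for an explicit as.character() call inside map_chr(). A minimal sketch of that fix, assuming purrr 1.0.0 or later, might look like the following. The name mean_labels is made up for this illustration.

```r
library(purrr)

# Convert each column mean to text explicitly, as the warning suggests
mean_labels <- map_chr(mtcars, function(column) as.character(mean(column)))

mean_labels
```

This should return a named character vector without triggering the deprecation warning, since the conversion to character now happens inside the function rather than being left to map_chr().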

Additional Arguments

You can pass additional arguments to functions that you map across your data. For example, if you have some NAs in your data, you might want to use mean() with na.rm = TRUE.

mtcars2 <- mtcars # make a copy of the mtcars dataset
mtcars2[3,c(1,6,8)] <- NA # make one of the cars have NAs for several columns
mtcars2
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            NA   4 108.0  93 3.85    NA 18.61 NA  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
mtcars2 %>% map_dbl(mean) # returns NA for mpg, wt, and vs columns
##        mpg        cyl       disp         hp       drat         wt       qsec 
##         NA   6.187500 230.721875 146.687500   3.596563         NA  17.848750 
##         vs         am       gear       carb 
##         NA   0.406250   3.687500   2.812500
mtcars2 %>% map_dbl(mean, na.rm = TRUE)
##         mpg         cyl        disp          hp        drat          wt 
##  20.0032258   6.1875000 230.7218750 146.6875000   3.5965625   3.2461935 
##        qsec          vs          am        gear        carb 
##  17.8487500   0.4193548   0.4062500   3.6875000   2.8125000
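
Another way to supply the extra argument is to wrap the call in a small function defined inline, a style that recent purrr documentation tends to prefer. The sketch below checks that both forms give the same answer. The names via_dots and via_inline are made up for this illustration.

```r
library(purrr)

mtcars2 <- mtcars
mtcars2[3, c(1, 6, 8)] <- NA

# Extra argument passed along after the function name
via_dots <- map_dbl(mtcars2, mean, na.rm = TRUE)

# Extra argument written out inside a small inline function
via_inline <- map_dbl(mtcars2, function(column) mean(column, na.rm = TRUE))

identical(via_dots, via_inline)
```

The inline version is a little longer, but it makes it obvious which function receives na.rm.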

map2

You can use the map2 series of functions if you need to map across two sets of inputs in parallel. Here, we’ll map across both the names of cars and their mpg values and paste each pair together into a sentence.

To do that, we’ll use what’s called an “anonymous function”, which is a small function we define within the map function call. Our function will take 2 arguments, x and y, and paste them together with some other text.

map2_chr(rownames(mtcars), mtcars$mpg, function(x,y) paste(x, "gets", y, "miles per gallon")) %>% 
  head()
## [1] "Mazda RX4 gets 21 miles per gallon"          
## [2] "Mazda RX4 Wag gets 21 miles per gallon"      
## [3] "Datsun 710 gets 22.8 miles per gallon"       
## [4] "Hornet 4 Drive gets 21.4 miles per gallon"   
## [5] "Hornet Sportabout gets 18.7 miles per gallon"
## [6] "Valiant gets 18.1 miles per gallon"

You can use the pmap series of functions if you need to use more than two input lists.
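
As a sketch of what that might look like, the example below extends the sentence above with a third input, the number of cylinders. The inputs go in as a list, and with an unnamed list they are matched to the function's arguments by position. The names car_facts and sentences are made up for this illustration.

```r
library(purrr)

# Three parallel inputs, bundled into one list
car_facts <- list(rownames(mtcars), mtcars$mpg, mtcars$cyl)

sentences <- pmap_chr(
  car_facts,
  function(name, mpg, cyl) {
    paste(name, "gets", mpg, "miles per gallon with", cyl, "cylinders")
  }
)

head(sentences, 3)
```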

Complete Workflow

Let’s try working through a complete example of how you might iterate a more complex operation across a dataset. This will follow 3 basic steps:

  1. Write code that does the thing you want once
  2. Generalize that code into a function that can take different inputs
  3. Apply that function across your data

Starting With a Single Case

The first thing we’ll do is figure out if we can do the right thing once! We want to rescale a vector of values to a 0-1 scale. We’ll try it out on the weights in mtcars. Our heaviest vehicle will have a scaled weight of 1, and our lightest will have a scaled weight of 0. We’ll do this by taking our weight, subtracting the minimum car weight from it, and dividing this by the range of the car weights (max minus min). We’ll have to be careful about our order of operations…

(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478

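Great! We got a scaled value out of the deal. Because arithmetic operators like - and / are vectorized, and min and max each return a single value that is reused for every element, the same calculation works on a whole vector. This means we can give it the whole weight vector, and we’ll get a whole scaled vector back. Below, diff(range(...)) is just a more compact way of writing max minus min.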

mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))

mtcars$wt_scaled
##  [1] 0.28304781 0.34824853 0.20634109 0.43518282 0.49271286 0.49782664
##  [7] 0.52595244 0.42879059 0.41856303 0.49271286 0.49271286 0.65379698
## [13] 0.56686269 0.57964715 0.95551010 1.00000000 0.97980056 0.17565840
## [19] 0.02608029 0.08233188 0.24341601 0.51316799 0.49143442 0.59498849
## [25] 0.59626694 0.10790079 0.16031705 0.00000000 0.42367681 0.32140118
## [31] 0.52595244 0.32395807
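
Before moving on, it can be worth a quick sanity check. The sketch below confirms that the max-minus-min form and the diff(range()) form agree, and that the scaled values really do run from 0 to 1. The names wt, scaled_a, and scaled_b are made up for this illustration.

```r
wt <- mtcars$wt

# The two ways of writing the denominator
scaled_a <- (wt - min(wt)) / (max(wt) - min(wt))
scaled_b <- (wt - min(wt)) / diff(range(wt))

all.equal(scaled_a, scaled_b)

# Smallest and largest scaled values
range(scaled_b)

# Which cars ended up at the two extremes?
rownames(mtcars)[which.min(scaled_b)]  # lightest car
rownames(mtcars)[which.max(scaled_b)]  # heaviest car
```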

Generalizing

Now let’s replace our reference to a specific vector of data with something generic: x. This code won’t run on its own, since x doesn’t have a value, but it’s just showing how we would refer to some generic value.

x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))

Making it a Function

Now that we’ve got a generalized bit of code, we can turn it into a function. All we need is a name, function, and a list of arguments. In this case, we’ve just got one argument: x.

rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
}

rescale_0_1(mtcars$mpg) # it works on one of our columns
##  [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
##  [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
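
It can also help to try a new function on a couple of awkward inputs before iterating with it. The sketch below, using small made-up vectors, shows how rescale_0_1 treats missing values and a vector whose values are all the same.

```r
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
}

# The NA stays NA; the other values are scaled using min 2 and max 6
with_missing <- rescale_0_1(c(2, NA, 4, 6))

# Every value is identical, so the range is 0 and we compute 0 / 0
all_same <- rescale_0_1(c(5, 5, 5))

with_missing  # 0.0  NA 0.5 1.0
all_same      # NaN NaN NaN
```

Whether NaN is the right answer for a constant column depends on your analysis, but it is better to find out here than halfway through a big iteration.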

Iterating!

Now that we’ve got a function that’ll rescale a vector of values, we can use one of the map functions to iterate across all the columns in a dataframe, rescaling each one. We’ll use map_df since it returns a dataframe, and we’re feeding it a dataframe.

map_df(mtcars, rescale_0_1)
## # A tibble: 32 × 12
##      mpg   cyl   disp     hp  drat    wt  qsec    vs    am  gear  carb wt_scaled
##    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 0.451   0.5 0.222  0.205  0.525 0.283 0.233     0     1   0.5 0.429     0.283
##  2 0.451   0.5 0.222  0.205  0.525 0.348 0.3       0     1   0.5 0.429     0.348
##  3 0.528   0   0.0920 0.145  0.502 0.206 0.489     1     1   0.5 0         0.206
##  4 0.468   0.5 0.466  0.205  0.147 0.435 0.588     1     0   0   0         0.435
##  5 0.353   1   0.721  0.435  0.180 0.493 0.3       0     0   0   0.143     0.493
##  6 0.328   0.5 0.384  0.187  0     0.498 0.681     1     0   0   0         0.498
##  7 0.166   1   0.721  0.682  0.207 0.526 0.160     0     0   0   0.429     0.526
##  8 0.596   0   0.189  0.0353 0.429 0.429 0.655     1     0   0.5 0.143     0.429
##  9 0.528   0   0.174  0.152  0.535 0.419 1         1     0   0.5 0.143     0.419
## 10 0.374   0.5 0.241  0.251  0.535 0.493 0.452     1     0   0.5 0.429     0.493
## # ℹ 22 more rows
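
Every column in mtcars is numeric, which is why this worked in one step. For a dataframe with mixed column types, one possible approach (a sketch, not the only way) is to keep just the numeric columns first with purrr's keep(). The example below uses the built-in iris data, which has four numeric columns plus a factor column. The names numeric_columns and iris_scaled are made up for this illustration.

```r
library(purrr)

rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
}

# Drop the factor column (Species), which min() cannot handle
numeric_columns <- keep(iris, is.numeric)

# Rescale each remaining column, as before
iris_scaled <- map_df(numeric_columns, rescale_0_1)

names(iris_scaled)
```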

There you have it! We went from some code that calculated one value to being able to iterate it across any number of columns in a dataframe. It can be tempting to jump straight to your final iteration code, but it’s often better to start simple and work your way up, verifying that things work at each step, especially if you’re trying to do something even moderately complex.

apply Functions

While we learned the tidyverse series of map functions, it’s worth mentioning that there is a similar family of functions in base R called the apply functions. They are very similar to map functions, but the syntax is a little different and you have to be a little more careful about the data types you put in and get out.

We’re not going to go into the apply family, but if you want to learn more, here is a good tutorial. You might come across the apply functions in someone else’s code, so it’s good to know they exist.
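
As a rough guide for reading such code, the sketch below shows three common members of the apply family doing the same column-mean calculation we did earlier with map and map_dbl. The variable names are made up for this illustration.

```r
# lapply() returns a list, much like map()
means_list <- lapply(mtcars, mean)

# sapply() simplifies the result when it can, here to a named numeric vector
means_simplified <- sapply(mtcars, mean)

# vapply() makes you declare the output type up front, much like map_dbl()
means_checked <- vapply(mtcars, mean, FUN.VALUE = numeric(1))

identical(means_simplified, means_checked)
```

The "be careful about data types" caveat mostly concerns sapply(), whose output type depends on what the function happens to return.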

This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.