And you may find yourself
Behind the keys of a large computing machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?
It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run 3+ times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally a bad practice for several reasons:
Lots of functions (including many base functions) are vectorized, meaning they already work on vectors of values.
Here’s an example:
x <- 1:10
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
The log() function already knows we want to take the log of each element in x, and it returns a vector that’s the same length as x. If a vectorized function already exists to do what you want, use it! It’s going to be faster and cleaner than trying to iterate everything yourself.
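To make the idea a bit more concrete, here is a small sketch of a few other vectorized operations. The vector `heights_cm` and its values are made up for illustration and are not part of the lesson data.

```r
# Hypothetical example values (not from the lesson data)
heights_cm <- c(150, 165, 180)

heights_m <- heights_cm / 100      # arithmetic works element by element
is_tall <- heights_cm > 160        # so do comparisons
labels <- paste(heights_cm, "cm")  # and many string functions
roots <- sqrt(heights_cm)          # and most math functions

heights_m
is_tall
labels
```

In each case we hand the function a whole vector and get back a vector of the same length, with no explicit iteration needed.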
However, we may want to do more complex iterations, which brings us to our first main iterating concept.
A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:
for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
You’ll often see i used in for loops; you can think of it as the iteration value. For each value of i in the vector 1:10, we’ll print that value. You can use the i value more than once in a loop:
for (i in 1:10) {
  print(i)
  print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100
What’s happening is that the value of i gets inserted into the code block, the block gets run, the value of i changes, and the process repeats. For loops can be a way to explicitly lay out fairly complicated procedures, since you can see exactly where your i value is going in the code.
You can also use the i value to index a vector or dataframe, which can be very powerful!
for (i in 1:10) {
  print(letters[i])
  print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44
Here we printed out the first 10 letters of the alphabet from the letters vector, as well as the first 10 car weights from the mtcars dataframe.
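A for loop does not have to run over numbers. As a rough sketch, the loop variable can also take on the values of a vector directly. The file names below are hypothetical and no files are actually read.

```r
# A made-up vector of file names (hypothetical; nothing is read from disk)
file_names <- c("site_a.csv", "site_b.csv", "site_c.csv")

# Loop over the values themselves
for (f in file_names) {
  print(paste("Would read", f))
}

# Loop over positions when you also need the index;
# seq_along(file_names) gives 1, 2, 3
for (i in seq_along(file_names)) {
  print(paste("File", i, "is", file_names[i]))
}
```

Looping over values reads a little more naturally, while looping over positions is handy when you need to store a result in the matching slot of another object.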
If you want to store your results somewhere, it is important that you create an object of the right size to hold them before you run the loop. If you grow your results vector one value at a time, it will be much slower. Here’s how to make that placeholder vector first. We’ll also use the function seq_along() to create a sequence that’s the proper length, instead of explicitly writing out something like 1:10.
results <- rep(NA, nrow(mtcars))
for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
results
## [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780 5250
## [16] 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513 3170 2770
## [31] 3570 2780
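One reason to prefer seq_along() over something like 1:length(x) shows up when the input happens to be empty. The sketch below uses a made-up empty vector, `empty`, and counts how many times each loop body runs.

```r
# A hypothetical empty input, e.g. the result of a filter that matched nothing
empty <- numeric(0)

1:length(empty)    # 1:0 counts down, giving 1 0
seq_along(empty)   # integer(0), so there is nothing to loop over

# Count how many times each loop body runs
runs_colon <- 0
for (i in 1:length(empty)) {
  runs_colon <- runs_colon + 1
}

runs_seq <- 0
for (i in seq_along(empty)) {
  runs_seq <- runs_seq + 1
}

runs_colon  # 2: the body ran twice, with i = 1 and i = 0
runs_seq    # 0: the body never ran
```

With seq_along(), an empty input simply means the loop is skipped, which is usually what you want.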
purrr
For loops are very handy and important to understand, but they can involve writing a lot of code and can generally look fairly messy.
The tidyverse includes another way to iterate, using the map family of functions. These functions all do the same basic thing: take a series of values and apply a function to each of them. That function could be a function from a package, or it could be one you write to do something specific.
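To see how the two approaches relate, here is a small sketch that computes the mean of every column in mtcars twice: once with a for loop and once with map_dbl() from purrr. The object names `col_means_loop` and `col_means_map` are made up for this example.

```r
library(purrr)

# for loop version: create the results vector first, then fill it in
col_means_loop <- numeric(ncol(mtcars))
for (i in seq_along(mtcars)) {
  col_means_loop[i] <- mean(mtcars[[i]])
}
names(col_means_loop) <- names(mtcars)

# map version: one line, and the names come along automatically
col_means_map <- map_dbl(mtcars, mean)

# Both approaches give the same values
all.equal(col_means_loop, col_means_map)
```

The loop spells out every step, while the map call hides the bookkeeping of creating and filling the results vector.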
For a wonderful and thorough exploration of the purrr package, check out Jenny Bryan’s tutorial.
map
When using the map family of functions, the first argument (as in most tidyverse functions) is the data. One nice feature is that you can specify the format of the output explicitly by using a different member of the family.
mtcars %>% map(mean) # gives a list
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
##
## $qsec
## [1] 17.84875
##
## $vs
## [1] 0.4375
##
## $am
## [1] 0.40625
##
## $gear
## [1] 3.6875
##
## $carb
## [1] 2.8125
mtcars %>% map_dbl(mean) # gives a numeric vector
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
mtcars %>% map_chr(mean) # gives a character vector
## Warning: Automatic coercion from double to
## character was deprecated in purrr
## 1.0.0.
## ℹ Please use an explicit call to
## `as.character()` within `map_chr()`
## instead.
## Call
## `lifecycle::last_lifecycle_warnings()`
## to see where this warning was
## generated.
## mpg cyl disp hp drat wt
## "20.090625" "6.187500" "230.721875" "146.687500" "3.596563" "3.217250"
## qsec vs am gear carb
## "17.848750" "0.437500" "0.406250" "3.687500" "2.812500"
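The warning above suggests doing the conversion to character yourself. One possible way to do that, sketched here with made-up object names, is to wrap the calculation in a small function that returns a string.

```r
library(purrr)

# Convert each column mean to a string explicitly
mean_labels <- map_chr(mtcars, function(col) as.character(mean(col)))

# Or control the formatting, here rounding to 2 decimal places
mean_labels_rounded <- map_chr(
  mtcars,
  function(col) format(round(mean(col), 2), nsmall = 2)
)

mean_labels
mean_labels_rounded
```

Both versions return a named character vector with one element per column, without relying on the deprecated automatic coercion.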
You can pass additional arguments to functions that you map across your data. For example, if you have some NAs in your data, you might want to use mean() with na.rm = TRUE.
mtcars2 <- mtcars # make a copy of the mtcars dataset
mtcars2[3,c(1,6,8)] <- NA # make one of the cars have NAs for several columns
mtcars2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 NA 4 108.0 93 3.85 NA 18.61 NA 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mtcars2 %>% map_dbl(mean) # returns NA for mpg, wt, and vs columns
## mpg cyl disp hp drat wt qsec
## NA 6.187500 230.721875 146.687500 3.596563 NA 17.848750
## vs am gear carb
## NA 0.406250 3.687500 2.812500
mtcars2 %>% map_dbl(mean, na.rm = TRUE)
## mpg cyl disp hp drat wt
## 20.0032258 6.1875000 230.7218750 146.6875000 3.5965625 3.2461935
## qsec vs am gear carb
## 17.8487500 0.4193548 0.4062500 3.6875000 2.8125000
map2
You can use the map2 series of functions if you need to map across two sets of inputs in parallel. Here, we’ll map across both the names of cars and their mpg values, pasting the two together into a sentence. To do this, we’ll use what’s called an “anonymous function”, which is a small function we define within the map2 function call. Our function will take 2 arguments, x and y, and paste them together with some other text.
map2_chr(rownames(mtcars), mtcars$mpg, function(x,y) paste(x, "gets", y, "miles per gallon")) %>%
head()
## [1] "Mazda RX4 gets 21 miles per gallon"
## [2] "Mazda RX4 Wag gets 21 miles per gallon"
## [3] "Datsun 710 gets 22.8 miles per gallon"
## [4] "Hornet 4 Drive gets 21.4 miles per gallon"
## [5] "Hornet Sportabout gets 18.7 miles per gallon"
## [6] "Valiant gets 18.1 miles per gallon"
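You may see anonymous functions written in a few different styles in other people’s code. As a sketch, the three calls below should be equivalent: the full function() form, the backslash shorthand (which assumes R 4.1 or later), and purrr’s formula shorthand with .x and .y. The names `car_names` and `car_mpg` are made up for this example.

```r
library(purrr)

# Just the first three cars, to keep the output short
car_names <- rownames(mtcars)[1:3]
car_mpg <- mtcars$mpg[1:3]

# Full anonymous function
v1 <- map2_chr(car_names, car_mpg,
               function(x, y) paste(x, "gets", y, "miles per gallon"))

# Backslash shorthand (assumes R >= 4.1)
v2 <- map2_chr(car_names, car_mpg,
               \(x, y) paste(x, "gets", y, "miles per gallon"))

# purrr formula shorthand, where .x and .y stand for the two inputs
v3 <- map2_chr(car_names, car_mpg,
               ~ paste(.x, "gets", .y, "miles per gallon"))

identical(v1, v2)
identical(v1, v3)
```

Which style to use is mostly a matter of taste; the explicit argument names in the first two forms can make longer functions easier to read.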
You can use the pmap series of functions if you need to use more than two input lists.
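Here is a minimal sketch of what that might look like with three inputs. The inputs go into a single named list, and the names are matched to the arguments of the function. The list `car_info` and the sentence wording are made up for illustration.

```r
library(purrr)

# Bundle the inputs into one named list (first three cars only)
car_info <- list(
  name = rownames(mtcars)[1:3],
  mpg = mtcars$mpg[1:3],
  cyl = mtcars$cyl[1:3]
)

# The list names (name, mpg, cyl) are matched to the function's arguments
sentences <- pmap_chr(car_info, function(name, mpg, cyl) {
  paste(name, "has", cyl, "cylinders and gets", mpg, "miles per gallon")
})

sentences
```

Because the matching is done by name, the order of the elements in the list does not have to match the order of the function’s arguments.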
Let’s try working through a complete example of how you might iterate a more complex operation across a dataset. This will follow 3 basic steps:
The first thing we’ll do is figure out if we can do the right thing once! We want to rescale a vector of values to a 0-1 scale. We’ll try it out on the weights in mtcars. Our heaviest vehicle will have a scaled weight of 1, and our lightest will have a scaled weight of 0. We’ll do this by taking our weight, subtracting the minimum car weight from it, and dividing this by the range of the car weights (max minus min). We’ll have to be careful about our order of operations…
# TRUE is spelled out because T is an ordinary variable that can be overwritten
(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478
Great! We got a scaled value out of the deal. Because arithmetic operators like - and / are vectorized, and base functions like max and min each summarize a whole vector into a single value, we can give this code the whole weight vector and we’ll get a whole scaled vector back.
mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))
mtcars$wt_scaled
## [1] 0.28304781 0.34824853 0.20634109 0.43518282 0.49271286 0.49782664
## [7] 0.52595244 0.42879059 0.41856303 0.49271286 0.49271286 0.65379698
## [13] 0.56686269 0.57964715 0.95551010 1.00000000 0.97980056 0.17565840
## [19] 0.02608029 0.08233188 0.24341601 0.51316799 0.49143442 0.59498849
## [25] 0.59626694 0.10790079 0.16031705 0.00000000 0.42367681 0.32140118
## [31] 0.52595244 0.32395807
Now let’s replace our reference to a specific vector of data with something generic: x. This code isn’t meant to be run on its own; x is just a stand-in showing how we would refer to whatever vector we want to rescale.
x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
Now that we’ve got a generalized bit of code, we can turn it into a function. All we need is a name, the function keyword, and a list of arguments. In this case, we’ve just got one argument: x.
rescale_0_1 <- function(x) {
  # subtract the minimum, then divide by the range (max minus min)
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}
rescale_0_1(mtcars$mpg) # it works on one of our columns
## [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
## [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
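Before iterating a new function over lots of data, it can be worth poking at a couple of edge cases. The sketch below repeats the function definition so it runs on its own, then tries it on two made-up vectors: one containing an NA and one where every value is the same.

```r
# Same function as above, repeated so this example is self-contained
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# Hypothetical test inputs
with_na <- rescale_0_1(c(10, 20, NA, 30))
constant <- rescale_0_1(c(5, 5, 5))

with_na   # the NA stays NA; the other values are 0.0, 0.5, 1.0
constant  # the range is 0, so we get 0 / 0, which is NaN
```

The NA case behaves sensibly thanks to na.rm = TRUE, but a column with no variation produces NaN values, which is worth knowing before mapping the function across a whole dataframe.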
Now that we’ve got a function that’ll rescale a vector of values, we can use one of the map functions to iterate across all the columns in a dataframe, rescaling each one. We’ll use map_df since it returns a dataframe, and we’re feeding it a dataframe.
map_df(mtcars, rescale_0_1)
## # A tibble: 32 × 12
## mpg cyl disp hp drat wt qsec vs am gear carb wt_scaled
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.451 0.5 0.222 0.205 0.525 0.283 0.233 0 1 0.5 0.429 0.283
## 2 0.451 0.5 0.222 0.205 0.525 0.348 0.3 0 1 0.5 0.429 0.348
## 3 0.528 0 0.0920 0.145 0.502 0.206 0.489 1 1 0.5 0 0.206
## 4 0.468 0.5 0.466 0.205 0.147 0.435 0.588 1 0 0 0 0.435
## 5 0.353 1 0.721 0.435 0.180 0.493 0.3 0 0 0 0.143 0.493
## 6 0.328 0.5 0.384 0.187 0 0.498 0.681 1 0 0 0 0.498
## 7 0.166 1 0.721 0.682 0.207 0.526 0.160 0 0 0 0.429 0.526
## 8 0.596 0 0.189 0.0353 0.429 0.429 0.655 1 0 0.5 0.143 0.429
## 9 0.528 0 0.174 0.152 0.535 0.419 1 1 0 0.5 0.143 0.419
## 10 0.374 0.5 0.241 0.251 0.535 0.493 0.452 1 0 0.5 0.429 0.493
## # ℹ 22 more rows
There you have it! We went from some code that calculated one value to being able to iterate it across any number of columns in a dataframe. It can be tempting to jump straight to your final iteration code, but it’s often better to start simple and work your way up, verifying that things work at each step, especially if you’re trying to do something even moderately complex.
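The example above works because every column in mtcars is numeric. For a dataframe with a mix of column types, one possible approach is purrr’s modify_if(), which applies a function only to the columns that pass a test and leaves the rest alone. This is a sketch using the built-in iris data, which is not part of the lesson above; the name `iris_scaled` is made up.

```r
library(purrr)

# Same function as above, repeated so this example is self-contained
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# iris has four numeric columns and one factor column (Species);
# rescale only the numeric ones
iris_scaled <- modify_if(iris, is.numeric, rescale_0_1)

head(iris_scaled)
range(iris_scaled$Sepal.Length)
class(iris_scaled$Species)
```

The numeric columns now run from 0 to 1, and Species is left untouched as a factor.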
apply Functions
While we learned the tidyverse series of map functions, it’s worth mentioning that there is a similar series of functions in base R called the apply family. They are very similar to map functions, but the syntax is a little different and you have to be a little more careful about the data types you put in and get out.
We’re not going to go into the apply family, but if you want to learn more, here is a good tutorial. You might come across the apply functions in someone else’s code, so it’s good to know they exist.
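To help you recognize them when you do come across them, here is a brief sketch of a few base R apply functions doing the same column means we calculated earlier with map. The object names are made up for this example.

```r
# lapply() always returns a list, much like map()
means_list <- lapply(mtcars, mean)

# sapply() tries to simplify the result, here to a named numeric vector
means_simplified <- sapply(mtcars, mean)

# vapply() makes you state the expected output type, much like map_dbl()
means_checked <- vapply(mtcars, mean, FUN.VALUE = numeric(1))

# apply() works over the rows (1) or columns (2) of a matrix
row_maxes <- apply(as.matrix(mtcars), 1, max)

means_simplified
head(row_maxes)
```

Of these, vapply() is the closest in spirit to the typed map functions, since it throws an error if the output is not what you said it would be.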
This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.