Now that we’ve learned a bit about how R is thinking about data under the hood, using different types of vectors to build more complicated data structures, let’s actually look at some data.
We are studying the species distribution and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
Your current R project should already have a data folder with the surveys data CSV file in it. We can read it into R and assign it to an object by using the read.csv() function. The first argument to read.csv() is the path of the file you want to read, in quotes. This path will be relative to your current working directory, which in our case is the R Project folder. So from there, we want to access the “data” folder, and then the name of the CSV file.
surveys <- read.csv("data/portal_data_joined.csv")
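If read.csv() reports that it cannot open the file, the working directory is often not the project folder. The following is a minimal sketch of how you might check this; csv_path is just an illustrative variable name, not something the lesson defines.

```r
# Build the relative path from its parts; file.path() joins them with "/"
csv_path <- file.path("data", "portal_data_joined.csv")

# The folder that relative paths are resolved against
getwd()

# TRUE if the file can be found from the working directory, FALSE otherwise
file.exists(csv_path)
```

If file.exists() returns FALSE, opening the R Project file again (so that the working directory is reset to the project folder) is usually the simplest fix.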
Take a look at your Environment pane and you should see an object called “surveys”. We can print out the object to take a look at it by just running the name of the object. We can also check to see what class it is.
surveys
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 1 Neotoma albigula Rodent Control
## 2 Neotoma albigula Rodent Control
## 3 Neotoma albigula Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 34783 rows ]
class(surveys)
## [1] "data.frame"
Wow, printing a data frame gives us quite a bit of output. This is a lot more data than the small vectors we worked with last lesson, but the basic principles remain the same.
Data frames are really just a collection of vectors: every column is a vector with a single data type, and every column is the exact same length. You can make a data frame “by hand”, but they’re usually created when you import some sort of tabular data into R using a function like read.csv().
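To make the "collection of vectors" idea concrete, here is a small sketch of building a data frame by hand. The values are invented for illustration and small_surveys is a hypothetical name; it is not part of the surveys dataset.

```r
# Three vectors of the same length, each holding a single data type
species_id <- c("NL", "DM", "PF")
sex <- c("M", "F", "M")
weight <- c(218, 40, 7)

# Combine them into a data frame; each vector becomes one column
small_surveys <- data.frame(species_id, sex, weight)

class(small_surveys)          # "data.frame"
class(small_surveys$weight)   # "numeric"
length(small_surveys$sex)     # 3, the same as every other column
```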
data.frame Objects
When working with a large data frame, it’s usually impractical to try to look at it all at once, so we’ll need to arm ourselves with a series of tools for inspecting them. Here is a non-exhaustive list of some common functions to do this:
- nrow(surveys): returns the number of rows
- ncol(surveys): returns the number of columns
- head(surveys): shows the first 6 rows
- tail(surveys): shows the last 6 rows
- View(surveys): opens a new tab in RStudio that shows the entire data frame. Useful at times, but you shouldn’t become overly reliant on checking data frames by eye; it’s easy to make mistakes
- colnames(surveys): returns the column names
- rownames(surveys): returns the row names
- str(surveys): structure of the object and information about the class, length and content of each column
- summary(surveys): summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
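As a quick sketch, several of these functions can be tried on a small made-up data frame. The small_surveys object and its values are invented here so the example runs without the CSV file.

```r
# A small made-up data frame standing in for surveys
small_surveys <- data.frame(
  record_id = 1:8,
  species_id = c("NL", "NL", "DM", "DM", "PF", "PF", "DM", "NL"),
  weight = c(NA, 218, 40, 48, 7, 9, NA, 204)
)

nrow(small_surveys)          # 8
ncol(small_surveys)          # 3
colnames(small_surveys)      # "record_id" "species_id" "weight"
head(small_surveys, n = 3)   # first 3 rows instead of the default 6
str(small_surveys)           # class, dimensions and type of each column

# summary() is generic: it also works on a single column (a vector)
summary(small_surveys$weight)
```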
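Based on the output of str(surveys), can you answer the following questions?
- What is the class of the object surveys?
- How many rows and how many columns are in this object?
- How many species have been recorded during these surveys? (Hint: you may need more than just the str() function. Try Googling around how to count the unique observations in a character string in R)
str(surveys)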
## 'data.frame': 34786 obs. of 13 variables:
## $ record_id : int 1 72 224 266 349 363 435 506 588 661 ...
## $ month : int 7 8 9 10 11 11 12 1 2 3 ...
## $ day : int 16 19 13 16 12 12 10 8 18 11 ...
## $ year : int 1977 1977 1977 1977 1977 1977 1977 1978 1978 1978 ...
## $ plot_id : int 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : chr "NL" "NL" "NL" "NL" ...
## $ sex : chr "M" "M" "" "" ...
## $ hindfoot_length: int 32 31 NA NA NA NA NA NA NA NA ...
## $ weight : int NA NA NA NA NA NA NA NA 218 NA ...
## $ genus : chr "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
## $ species : chr "albigula" "albigula" "albigula" "albigula" ...
## $ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr "Control" "Control" "Control" "Control" ...
## * class: data frame
## * how many rows: 34786, how many columns: 13
## * the character data are characters if you have R version 4.0.0 or later, factors for older versions
length(unique(surveys$species))
## [1] 40
table(surveys$species)
##
## albigula audubonii baileyi bilineata brunneicapillus
## 1252 75 2891 303 50
## chlorurus clarki eremicus flavus fulvescens
## 39 1 1299 1597 75
## fulviventer fuscus gramineus harrisi hispidus
## 43 5 8 437 179
## intermedius leucogaster leucophrys leucopus maniculatus
## 9 1006 2 36 899
## megalotis melanocorys merriami montanus ochrognathus
## 2609 13 10596 8 43
## ordii penicillatus savannarum scutalatus sp.
## 3027 3123 2 1 86
## spectabilis spilosoma squamata taylori tereticaudus
## 2504 248 16 46 1
## tigris torridus undulatus uniparens viridis
## 1 2249 5 1 1
## * how many species: 40
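The two approaches used in the solution, length(unique()) and table(), can be tried on a small made-up vector to see how they relate. The species vector below is invented for illustration.

```r
# Made-up character vector with repeated values
species <- c("albigula", "merriami", "merriami", "ordii", "albigula", "merriami")

unique(species)           # the distinct values, in order of first appearance
length(unique(species))   # 3
table(species)            # counts per value: albigula 2, merriami 3, ordii 1
```

unique() answers "which values occur?", while table() also tells you how often each one occurs.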
When we wanted to extract particular values from a vector, we used square brackets and put index values in them. Since data frames are made out of vectors, we can use the square brackets again, but with one change. Data frames are 2-dimensional, so we need to specify row and column indices. Row numbers come first, then a comma, then column numbers. Leaving the row number blank will return all rows, and the same thing applies to column numbers.
One thing to note is that the different ways you write out these indices can give you back either a data frame or a vector.
# first element in the first column of the data frame (as a vector)
surveys[1, 1]
## [1] 1
# first element in the 6th column (as a vector)
surveys[1, 6]
## [1] "NL"
# first column of the data frame (as a vector)
surveys[, 1]
## [1] 1 72 224 266 349 363 435 506 588 661 748 845 990 1164 1261
## [16] 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859 5048 5180
## [31] 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022 8263 8387 8394
## [46] 8407 8514 8543 8657 8675
## [ reached getOption("max.print") -- omitted 34736 entries ]
# first column of the data frame (as a data.frame)
surveys[1]
## record_id
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## 11 748
## 12 845
## 13 990
## 14 1164
## 15 1261
## 16 1374
## 17 1453
## 18 1756
## 19 1818
## 20 1882
## 21 2133
## 22 2184
## 23 2406
## 24 2728
## 25 3000
## 26 3002
## 27 4667
## 28 4859
## 29 5048
## 30 5180
## 31 5299
## 32 5485
## 33 5558
## 34 5583
## 35 5966
## 36 6020
## 37 6023
## 38 6036
## 39 6167
## 40 6479
## 41 6500
## 42 8022
## 43 8263
## 44 8387
## 45 8394
## 46 8407
## 47 8514
## 48 8543
## 49 8657
## 50 8675
## [ reached 'max' / getOption("max.print") -- omitted 34736 rows ]
# first three elements in the 7th column (as a vector)
surveys[1:3, 7]
## [1] "M" "M" ""
# the 3rd row of the data frame (as a data.frame)
surveys[3, ]
## record_id month day year plot_id species_id sex hindfoot_length weight
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 3 Neotoma albigula Rodent Control
# equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]
: is a special function that creates numeric vectors of integers in increasing or decreasing order; try running 1:10 and 10:1 to check this out.
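A brief sketch of what those calls return, plus one use as an index (letters is a built-in R vector of the lowercase alphabet):

```r
1:10           # 1 2 3 4 5 6 7 8 9 10
10:1           # 10 9 8 7 6 5 4 3 2 1
class(1:10)    # "integer"

# The result is an ordinary vector, so it can be used as an index
letters[2:4]   # "b" "c" "d"
```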
You can also exclude certain indices of a data frame using the “-” sign:
surveys[, -1] # The whole data frame, except the first column
## month day year plot_id species_id sex hindfoot_length weight genus species
## 1 7 16 1977 2 NL M 32 NA Neotoma albigula
## 2 8 19 1977 2 NL M 31 NA Neotoma albigula
## 3 9 13 1977 2 NL NA NA Neotoma albigula
## 4 10 16 1977 2 NL NA NA Neotoma albigula
## taxa plot_type
## 1 Rodent Control
## 2 Rodent Control
## 3 Rodent Control
## 4 Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 34782 rows ]
surveys[-c(7:34786), ] # Equivalent to head(surveys)
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 1 Neotoma albigula Rodent Control
## 2 Neotoma albigula Rodent Control
## 3 Neotoma albigula Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
surveys["species_id"] # Result is a data.frame
surveys[, "species_id"] # Result is a vector
surveys[["species_id"]] # Result is a vector
surveys$species_id # Result is a vector
In general, when you’re working with data frames, you should make sure you know whether your code returns a data frame or a vector, as we see that different methods yield different results. Sometimes you get a data frame with one column, sometimes you get one vector.
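One way to check what a given subsetting call returns is is.data.frame(). The sketch below uses a small made-up data frame (small_surveys is a hypothetical name) rather than the full surveys object.

```r
small_surveys <- data.frame(
  record_id = 1:3,
  species_id = c("NL", "DM", "PF")
)

is.data.frame(small_surveys["species_id"])     # TRUE: a one-column data frame
is.data.frame(small_surveys[, "species_id"])   # FALSE: a vector
is.data.frame(small_surveys[["species_id"]])   # FALSE: a vector
is.data.frame(small_surveys$species_id)        # FALSE: a vector
```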
You will probably end up using the $ subsetting quite a bit. What’s nice about it is that it supports tab-completion! Type out your data frame name, then a dollar sign, then hit tab to get a list of the column names that you can scroll through.
We are going to create a few new data frames using our subsetting skills.
1. Create a data frame called surveys_200 containing row 200 of the surveys dataset.
2. Create a data frame called surveys_last, which extracts only the last row of surveys.
   - Hint: nrow() gives you the number of rows in a data frame.
   - Compare your surveys_last data frame with what you see as the last row using tail() with the surveys data frame to make sure it’s meeting expectations.
3. Use nrow() to identify the row that is in the middle of surveys. Subset this row and store it in a new data frame called surveys_middle.
4. Reproduce the output of the head() function by using the - notation (i.e. removal) and the nrow() function, keeping just the first through 6th rows of the surveys dataset.
## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]
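The middle-row solution works cleanly here because surveys has an even number of rows. With an odd number of rows, n_rows / 2 is fractional, and R truncates fractional indices toward zero. The sketch below uses a made-up five-row data frame (odd_surveys is a hypothetical name) to show the difference that ceiling() makes.

```r
# Made-up data frame with an odd number of rows
odd_surveys <- data.frame(
  record_id = 1:5,
  weight = c(10, 20, 30, 40, 50)
)
n_rows <- nrow(odd_surveys)   # 5, so n_rows / 2 is 2.5

# The fractional index 2.5 is truncated to 2
middle_truncated <- odd_surveys[n_rows / 2, ]
middle_truncated$record_id    # 2

# Rounding up gives the true middle row
middle_rounded <- odd_surveys[ceiling(n_rows / 2), ]
middle_rounded$record_id      # 3
```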
tidyverse
Almost every time you work in R, you will be using different “packages” to work with data. A package is a collection of functions used for some common purpose; there are packages for manipulating data, plotting, interfacing with other programs, and much much more.
All of the stuff we’ve covered so far has been using R’s “base” functionality, the built-in functions and techniques that come with R by default. There is a new-ish set of packages called the tidyverse which does a lot of the same stuff as base R, plus much much more. The tidyverse is what we will focus on primarily from here on out, as it is a very powerful set of tools with a philosophy that focuses on being readable and intuitive when working with data. There are a few reasons we’ve taught you a bunch of base R stuff so far:
- tidyverse still works with the same building blocks as base R: vectors!
- tidyverse is constantly evolving, which can be good (new features!) and bad (really old tidyverse code may behave differently when you update)

For example, using [] to subset data and using read.csv() are base R ways of doing things, but we’ll show you tidyverse ways of doing them as well.
In R, there are almost always several ways of accomplishing the same task. Showing you every single way of getting a job done seems like a waste of time, but we also don’t want you to feel lost when you come across some base R code, so that’s why there might be a bit of redundancy.
For much of this course, we’ll be working with a series of packages collectively referred to as the tidyverse. They are packages designed to help you work with data, from cleaning and manipulation to plotting. They are all designed to work together nicely, and share a lot of similar principles. They are increasingly popular, have large user bases, and are generally very well-documented. You can install the core set of tidyverse packages with the install.packages() function:
install.packages("tidyverse")
It is usually recommended that you do NOT write this code into a script, or the package will be reinstalled every time you run the script. Instead, just run it once in your console, and it will be permanently installed so you can use it any time.
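If you do want a script to tell you when a package is missing, one possible pattern is to check with requireNamespace() and print a reminder instead of installing. This is only a sketch; needs_install is an illustrative variable name.

```r
# TRUE if the tidyverse package is not installed yet
needs_install <- !requireNamespace("tidyverse", quietly = TRUE)

if (needs_install) {
  # Remind the user rather than installing from inside the script
  message('tidyverse not found; run install.packages("tidyverse") once in the console')
}
```

This keeps the installation itself a one-time, manual step while still making the script fail gently on a new computer.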
Once a package has been installed on your computer, you can load it in order to use it:
library(tidyverse)
Loading the tidyverse package actually loads a whole bunch of commonly used tidyverse packages at once, which is pretty convenient.
A common feature of tidyverse functions is that they use underscores in the name. For example, the tidyverse function for reading a CSV file is read_csv() instead of read.csv(). Let’s try it:
t_surveys <- read_csv("data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ──────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now let’s take a look at how t_surveys prints and check the class:
t_surveys
## # A tibble: 34,786 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## 7 435 12 10 1977 2 NL <NA> NA NA
## 8 506 1 8 1978 2 NL <NA> NA NA
## 9 588 2 18 1978 2 NL M NA 218
## 10 661 3 11 1978 2 NL <NA> NA NA
## # ℹ 34,776 more rows
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
class(t_surveys)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Ooh, doesn’t that print out nicely? It only prints 10 rows by default, NAs are now colored red, and under the name of each column is the type of data! One important thing to notice is that the column types are only double and character, no factors here. By default, read_csv() keeps character data as character columns, which would be like setting stringsAsFactors=FALSE in read.csv() (the default behavior in R 4.0.0 or later).
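The effect of the stringsAsFactors argument in read.csv() can be seen on a small made-up CSV passed through the text argument. The data and the object names csv_text, as_chr and as_fct are invented for this sketch.

```r
# A tiny made-up CSV, supplied as text instead of a file
csv_text <- "species_id,sex,weight
NL,M,218
DM,F,40
PF,M,7"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$sex)      # "character"
class(as_fct$sex)      # "factor"
class(as_chr$weight)   # "integer": numeric columns are unaffected
```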
Also, class() returned multiple things! You’ll notice one of them is data.frame, but there are things like tbl_df as well. The tidyverse has a special type of data.frame called a “tibble”. Tibbles work much like data frames, but they print nicely as we just saw, and they usually return a tibble when you’re using bracket subsetting. As always, just be sure to check whether you’re getting a tibble or a vector back.
surveys[,1] # gives a vector back
## [1] 1 72 224 266 349 363 435 506 588 661 748 845 990 1164 1261
## [16] 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859 5048 5180
## [31] 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022 8263 8387 8394
## [46] 8407 8514 8543 8657 8675
## [ reached getOption("max.print") -- omitted 34736 entries ]
t_surveys[,1] # gives a tibble back
## # A tibble: 34,786 × 1
## record_id
## <dbl>
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## # ℹ 34,776 more rows
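When you do need a vector from a tibble, double brackets or $ will extract the column. A minimal sketch, assuming the tibble package (installed as part of the tidyverse) is available; small_tbl and its values are made up for illustration.

```r
library(tibble)

# A small made-up tibble
small_tbl <- tibble(
  record_id = c(1, 72, 224),
  species_id = c("NL", "NL", "NL")
)

is_tibble(small_tbl[, 1])         # TRUE: single-bracket subsetting keeps a tibble
is.numeric(small_tbl[[1]])        # TRUE: double brackets extract the column as a vector
is.numeric(small_tbl$record_id)   # TRUE: so does $
```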
This lesson is adapted from the Data Carpentry: R for Data Analysis and Visualization of Ecological Data Starting With Data materials.