Here we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets.
suppressWarnings(library(tidyverse))
You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organises the values in a different way.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b # population
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
These are all representations of the same underlying data, but they are not equally easy to use.
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Figure shows the rules visually.
These three rules are interrelated because it’s impossible to only satisfy two of the three.
In this example, only table1
is tidy. It’s the only
representation where each column is a variable.
Why ensure that your data is tidy?
There are two main advantages:
- There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
- There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with
table1
.
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
## # A tibble: 6 × 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
# Compute cases per year
table1 %>%
count(year, wt = cases)
## # A tibble: 2 × 2
## year n
## <int> <int>
## 1 1999 250740
## 2 2000 296920
The first step is always to: figure out what the variables and observations are.
The second step is to solve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
To fix these problems, you’ll need the two most important functions in tidyr:
gather()
spread()
gather()
A common problem is a dataset where:
some of the column names are NOT names of variables RATHER observations/values of a variable themselves.
Take table4a
: the column names 1999
and
2000
represent values of the year
variable,
and each row represents two observations, not one.
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To tidy a dataset like this, we need to gather those columns into a new pair of variables.
To describe that operation we need three parameters:
The set of columns that represent values, not variables.
The name of the variable whose values form the
column names. I call that the key
, and here it is
year
.
The name of the variable whose
values are spread over the cells. I call that
value
, and here it’s the number of
cases
.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Here there are only two columns, so we list them individually. Note that “1999” and “2000” are non-syntactic names so we have to surround them in backticks.
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
table4a
and
table4b
into a single tibble, we need to use
dplyr::left_join()
# gather table a
tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
# gather table b
tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
# combine tables once tided
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
spread()
Spreading is the opposite of gathering. Here the issue is that:
an observation is scattered across multiple rows.
For example, take table2
: an observation is a country in
a year, but each observation is spread across two rows.
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
To tidy this up, we need two parameters:
The column that contains variable names, the
key
column.
The column that contains values forms multiple
variables, the value
column.
spread(table2, key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
gather()
makes wide tables narrower and longer
spread()
makes long tables shorter and wider
table3
has a different problem :
one column (
rate
) that contains two variables (cases
andpopulation
)
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
To fix this problem, we will need the separate()
function.
The complement of separate()
is unite()
, to
use when a single variable is spread across multiple columns.
separate()
This function pulls apart one column into multiple columns, by
splitting wherever a separator character appears. Take
table3
:
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate
column contains both cases
and
population
variables, and we need to split it into two
variables.
separate()
takes the name of the column to separate, and
the names of the columns to separate into, as shown in Figure
@ref(fig:tidy-separate) and the code below.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
By default,
separate()
will split values wherever it sees a non-alphanumeric character.
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
Look carefully at the column types: you’ll notice that
case
and population
are character columns.
This is the default behaviour in separate()
: it leaves the
type of the column as is. Here, however, it’s not very useful as those
really are numbers. We can ask separate()
to try and
convert to better types using convert = TRUE
:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can also pass a vector of integers to sep
.
separate()
will interpret the integers as positions to
split at. Positive values start at 1 on the far-left of the strings;
negative value start at -1 on the far-right of the strings. When using
integers to separate strings, the length of sep
should be
one less than the number of names in into
.
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite()
This function is the inverse of separate()
: it combines
multiple columns into a single column.
We can use unite()
to rejoin the century and
year columns that we created in the last example. That data is
saved as tidyr::table5
.
table5 %>%
unite(new, century, year)
## # A tibble: 6 × 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
In this case we also need to use the sep
argument.
The default will place an underscore (_
) between the
values from different columns.
Here we don’t want any separator so we use ""
:
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 × 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Changing the representation of a dataset brings up an important subtlety of missing values.
A value can be missing in one of two possible ways:
Explicitly –> flagged with
NA
.
Implicitly –> simply not present in the data.
Let’s illustrate this idea with a very simple data set:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
print(stocks)
## # A tibble: 7 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
There are two missing values in this dataset.
We can make the implicit missing value explicit by putting years in the columns:
stocks %>%
spread(year, return)
## # A tibble: 4 × 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
Because these explicit missing values may not be important in other
representations of the data, you can set na.rm = TRUE
in
gather()
to turn explicit missing values implicit:
stocks %>%
spread(year, return) %>%
gather(year, return, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 × 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 3 2015 0.35
## 4 2 2016 0.92
## 5 3 2016 0.17
## 6 4 2016 2.66
Another important tool for making missing values explicit in tidy
data is complete()
:
stocks %>%
complete(year, qtr)
## # A tibble: 8 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
complete()
takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values,
filling in explicit NA
s where necessary.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
You can fill in these missing values with fill()
. It
takes a set of columns where you want missing values to be replaced by
the most recent non-missing value (sometimes called last observation
carried forward).
treatment %>%
fill(person)
## # A tibble: 4 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
The tidyr::who
dataset contains tuberculosis (TB)
cases broken down by year, country, age, gender, and diagnosis
method.
The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
print(who, n=4)
## # A tibble: 7,240 × 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA NA NA
## # … with 7,236 more rows, and 51 more variables: new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>, new_sn_m014 <int>,
## # new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## # new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>, new_sn_f5564 <int>,
## # new_sn_f65 <int>, new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>, …
It contains redundant columns, odd variable codes, and many missing values.
country
, iso2
, and iso3
are three variables that redundantly specify the country.
year
is clearly also a variable.
We don’t know what all the other columns are but given the
structure in the variable names (e.g. new_sp_m014
,
new_ep_m014
, new_ep_f014
) these are likely to
be values, not variables.
So we need to gather()
together all the columns from
new_sp_m014
to newrel_f65
.
We don’t know what those values represent yet, so we’ll give them the
generic name "key"
.
We know the cells represent the count of cases, so we’ll use the
variable cases
.
There are a lot of missing values in the current representation, so
for now we’ll use na.rm
just so we can focus on the values
that are present.
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # … with 76,036 more rows
We can get some hint of the structure of the values in the new
key
column by counting them:
who1 %>%
count(key)
## # A tibble: 56 × 2
## key n
## <chr> <int>
## 1 new_ep_f014 1032
## 2 new_ep_f1524 1021
## 3 new_ep_f2534 1021
## 4 new_ep_f3544 1021
## 5 new_ep_f4554 1017
## 6 new_ep_f5564 1017
## 7 new_ep_f65 1014
## 8 new_ep_m014 1038
## 9 new_ep_m1524 1026
## 10 new_ep_m2534 1020
## # … with 46 more rows
It tells us:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be
diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be
diagnosed be a pulmonary smear (smear positive)The sixth letter gives the sex of TB patients. The dataset groups
cases by males (m
) and females (f
).
The remaining numbers gives the age group. The dataset groups cases into seven age groups:
014
= 0 – 14 years old1524
= 15 – 24 years old2534
= 25 – 34 years old65
= 65 or olderWe need to make a minor fix to the format of the column names.
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # … with 76,036 more rows
We can separate the values in each code with two passes of
separate()
.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 × 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1998 new sp m014 30
## 3 Afghanistan AF AFG 1999 new sp m014 8
## 4 Afghanistan AF AFG 2000 new sp m014 52
## 5 Afghanistan AF AFG 2001 new sp m014 129
## 6 Afghanistan AF AFG 2002 new sp m014 90
## 7 Afghanistan AF AFG 2003 new sp m014 127
## 8 Afghanistan AF AFG 2004 new sp m014 139
## 9 Afghanistan AF AFG 2005 new sp m014 151
## 10 Afghanistan AF AFG 2006 new sp m014 193
## # … with 76,036 more rows
Then we might as well drop the new
column because it’s
constant in this dataset. While we’re dropping columns, let’s also drop
iso2
and iso3
since they’re redundant.
who3 %>%
count(new)
## # A tibble: 1 × 2
## new n
## <chr> <int>
## 1 new 76046
who4 <- who3 %>%
select(-new, -iso2, -iso3)
Next we’ll separate sexage
into sex
and
age
by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
print(who5, n=5)
## # A tibble: 76,046 × 6
## country year type sex age cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## # … with 76,041 more rows
A work by Matteo Cereda and Fabio Iannelli