Load flat files in R with the readr package, which is part of the core tidyverse.
library(tidyverse)
Most of readr’s functions are concerned with turning flat files into data frames:
Function | File format read
---|---
read_csv() | comma-delimited files
read_csv2() | semicolon-separated files (common in countries where , is used as the decimal mark)
read_tsv() | tab-delimited files
read_delim() | files with any delimiter
read_fwf() | fixed-width files
read_table() | files where columns are separated by white space
read_log() | Apache-style log files
These functions all have similar syntax.
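To illustrate that shared syntax, here is a quick sketch with inline strings standing in for files (the data is made up for illustration):

```r
library(tidyverse)

# read_delim() takes an explicit delim argument; here "|"
read_delim("a|b|c\n1|2|3", delim = "|")

# read_csv2() expects semicolons, with "," as the decimal mark
read_csv2("x;y\n1,5;2,5")
```

The remaining arguments (skip, comment, col_names, na, and so on) work the same way across the whole family.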
heights <- read_csv("data/heights.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
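As the message suggests, you can silence the column-specification printout for a single call, or retrieve the guessed specification afterwards. A small sketch, using inline data in place of a file:

```r
library(tidyverse)

# show_col_types = FALSE suppresses the column-type message for this call
df <- read_csv("a,b\n1,x\n2,y", show_col_types = FALSE)

# spec() retrieves the full column specification readr guessed
spec(df)
```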
You can also supply an inline csv file.
read_csv("a,b,c
1,2,3
4,5,6")
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines, or use comment = "#" to drop all lines that start with (e.g.) #.
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead to label the columns sequentially from X1 to Xn:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## X1 X2 X3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Alternatively you can pass col_names a character vector, which will be used as the column names:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Another option that commonly needs tweaking is na: this specifies the value (or values) used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## lgl (1): c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
## a b c
## <dbl> <dbl> <lgl>
## 1 1 2 NA
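na also accepts a character vector when more than one string marks a missing value. A short sketch with made-up data:

```r
library(tidyverse)

# Treat both "." and "N/A" as missing values
read_csv("a,b\n1,.\nN/A,2", na = c(".", "N/A"))
```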
Compared to the base R equivalents (like read.csv()), the readr functions have two big advantages:

- They are typically much faster (~10x), and long-running jobs have a progress bar, so you can see what's happening. (If you're looking for raw speed, try data.table::fread().)
- They produce tibbles, and they don't convert character vectors to factors.
The parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date:
str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
## int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
The parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1] 1 231 NA 456
If parsing fails, you’ll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning: 2 parsing failures.
## row col expected actual
## 3 -- no trailing characters abc
## 4 -- no trailing characters 123.45
And the failures will be missing in the output:
x
## [1] 123 345 NA NA
## attr(,"problems")
## # A tibble: 2 × 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA no trailing characters abc
## 2 4 NA no trailing characters 123.45
If there are many parsing failures, you'll need to use problems() to get the complete set:
problems(x)
## # A tibble: 2 × 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA no trailing characters abc
## 2 4 NA no trailing characters 123.45
There are eight particularly important parsers:

Function | Job
---|---
parse_logical() | parses logicals
parse_integer(), parse_double(), parse_number() | parse numbers
parse_character() | parses characters
parse_datetime(), parse_date(), and parse_time() | parse various date & time specifications
It seems like it should be straightforward to parse a number, but three problems make it tricky:
People write numbers differently in different parts of the world. For example, some countries use . between the integer and fractional parts of a real number, while others use ,.
Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”.
When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
The final problem is addressed by the combination of parse_number() and the locale, as parse_number() will ignore the "grouping mark":
# Used in America
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789
R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row col expected actual
## 3 -- value in level set bananana
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 × 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:
parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
parse_date() expects a four-digit year, a - or /, the month, a - or /, then the day:
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:
library(hms)
parse_time("01:10 am")
## 01:10:00
parse_time("20:10:01")
## 20:10:01
Base R doesn’t have a great built in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time format, built up of the following pieces:
Year
- %Y (4 digits).
- %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

Month
- %m (2 digits).
- %b (abbreviated name, like "Jan").
- %B (full name, "January").

Day
- %d (2 digits).
- %e (optional leading space).

Time
- %H (0-23 hour).
- %I (0-12 hour, must be used with %p).
- %p (AM/PM indicator).
- %M (minutes).
- %S (integer seconds).
- %OS (real seconds).
- %Z (time zone as name, e.g. America/Chicago). Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We'll come back to this when we discuss time zones.
- %z (as offset from UTC, e.g. +0800).

Non-digits
- %. (skips one non-digit character).
- %* (skips any number of non-digits).
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions. For example:
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"
If you're using %b or %B with non-English month names, you'll need to set the date_names argument of locale(). See the list of built-in languages in date_names_langs(), or, if your language is not already included, create your own with date_names().
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
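To show the mechanics of date_names() itself, here is a sketch that builds a set of French month and day names by hand (in practice you'd simply use locale("fr"), as above):

```r
library(tidyverse)

# Hand-rolled French date names, just to illustrate date_names()
fr_names <- date_names(
  mon = c("janvier", "février", "mars", "avril", "mai", "juin", "juillet",
          "août", "septembre", "octobre", "novembre", "décembre"),
  day = c("dimanche", "lundi", "mardi", "mercredi", "jeudi", "vendredi",
          "samedi")
)
parse_date("14 octobre 1979", "%d %B %Y", locale = locale(date_names = fr_names))
## [1] "1979-10-14"
```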
readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:

- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

If you want to export a csv file to Excel, use write_excel_csv().
The most important arguments are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and whether you want to append to an existing file.
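A small sketch of the na and append arguments (the data and file name are just illustrations; tempfile() keeps the example self-contained):

```r
library(tidyverse)

df <- tibble(x = c(1, NA), y = c("a", "b"))
path <- tempfile(fileext = ".csv")

# Write missing values as empty strings instead of the default "NA"
write_csv(df, path, na = "")

# append = TRUE adds rows to the existing file without repeating the header
write_csv(tibble(x = 3, y = "c"), path, append = TRUE)

read_csv(path)
```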
challenge <- read_csv("data/heights.csv")
write_csv(challenge, "challenge.csv")
Note that the type information is lost when you save to csv:
challenge <- read_csv("data/heights.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,192 × 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 50000 74.4 male 16 45 white
## 2 60000 65.5 female 16 58 white
## 3 30000 63.6 female 16 29 white
## 4 50000 63.1 female 16 91 other
## 5 51000 63.4 female 17 39 white
## 6 9000 64.4 female 15 26 white
## 7 29000 61.7 female 12 49 white
## 8 32000 72.7 male 17 46 white
## 9 2000 72.0 male 15 21 hispanic
## 10 27000 72.2 male 12 26 white
## # … with 1,182 more rows
There are two alternatives:

write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R's custom binary format called RDS:
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
## # A tibble: 1,192 × 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 50000 74.4 male 16 45 white
## 2 60000 65.5 female 16 58 white
## 3 30000 63.6 female 16 29 white
## 4 50000 63.1 female 16 91 other
## 5 51000 63.4 female 17 39 white
## 6 9000 64.4 female 15 26 white
## 7 29000 61.7 female 12 49 white
## 8 32000 72.7 male 17 46 white
## 9 2000 72.0 male 15 21 hispanic
## 10 27000 72.2 male 12 26 white
## # … with 1,182 more rows
The feather package implements a fast binary file format that can be shared across programming languages:
library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
## # A tibble: 1,192 × 6
##      earn height sex       ed   age race
##     <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
##  1  50000   74.4 male      16    45 white
##  2  60000   65.5 female    16    58 white
##  3  30000   63.6 female    16    29 white
##  4  50000   63.1 female    16    91 other
##  5  51000   63.4 female    17    39 white
##  6   9000   64.4 female    15    26 white
##  7  29000   61.7 female    12    49 white
##  8  32000   72.7 male      17    46 white
##  9   2000   72.0 male      15    21 hispanic
## 10  27000   72.2 male      12    26 white
## # … with 1,182 more rows
Feather tends to be faster than RDS.
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:
haven reads SPSS, Stata, and SAS files.
readxl reads Excel files (both .xls and .xlsx).
DBI, along with a database-specific backend (e.g. RMySQL, RSQLite, RPostgreSQL, etc.), allows you to run SQL queries against a database and return a data frame.
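To make the DBI workflow concrete, here is a minimal sketch using an in-memory SQLite database (requires the RSQLite package; the table and query are just examples):

```r
library(DBI)

# Connect to a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Run a SQL query and get the result back as a data frame
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```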
For hierarchical data: use jsonlite (by Jeroen Ooms) for json, and xml2 for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/examples.html.
For other file types, try the R data import/export manual and the rio package.
A work by Matteo Cereda and Fabio Iannelli