ggplot2 offers several advantages over base R plotting:
it has an underlying grammar, based on the Grammar of Graphics, that lets you compose graphs by combining independent components. This is what makes ggplot2 powerful: you are not limited to a set of pre-defined graphics
it produces elegant graphics, and even its defaults can yield publication-quality plots
it has a layered structure, so you can work iteratively: start with a layer that shows the raw data, then add annotations, statistics and so on
it contains many building blocks you can use to build your own representation
Before starting we have to determine the kind of visualization that is appropriate for the dataset. First, we have to identify the type of data we are using, in particular whether it is continuous or discrete.
Data type | Definition | Examples |
---|---|---|
continuous | Between any two values on a continuous scale there are infinitely many possible values | linear scales, log scales |
discrete | Between any two values on a discrete scale there is a finite number of possible values | [‘a’, ‘b’, ‘c’] |
Different plots are appropriate for different situations. To determine the kind of chart that is appropriate for a given situation, it’s helpful to consider the kind of argument you want to make with your data.
The main goals of visualizing data are:
comprehend the relationship between variables (e.g. their correlation)
highlight the differences between variables with a comparison
show different aspects of your data in a composition
show the distribution of data
To explore ggplot2 further, visit https://r-graph-gallery.com/index.html, where you will find all types of graphs together with many examples.
The Grammar of Graphics is the idea that you can build every graph from the same components:
a data set
a coordinate system
geoms—visual marks
To plot values, you map variables in the data to visual properties of the geom (aesthetics), such as size, color, and x and y locations.
The general function to make a graph is:
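As a rough sketch of that general form (the placeholder names in angle brackets are illustrative, not valid R), a plot combines a data set, an aesthetic mapping and at least one geom. The concrete call below uses the built-in mpg dataset, with columns chosen only for illustration:

```r
library(ggplot2)

# Template (angle-bracket placeholders are illustrative, not runnable):
# ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
#   <GEOM_FUNCTION>() +
#   <COORDINATE_FUNCTION>() + <FACET_FUNCTION>() +
#   <SCALE_FUNCTION>() + <THEME_FUNCTION>()

# A concrete instance: data set + aesthetic mapping + one geom
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
p
```

Only the data, the mapping and one geom are required; the remaining components fall back to sensible defaults when omitted.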
To better understand how ggplot2 works, we will work through some examples using the mpg dataset, which ships with the ggplot2 package. After installing ggplot2, you first need to load it:
suppressWarnings(library(ggplot2))
This dataset provides fuel economy data from 1999 and 2008 for 38 popular models of cars, collected by the US Environmental Protection Agency.
We can compare the number of cylinders (cyl) with the engine displacement (displ) across a series of cars to see if there is a correlation:
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()
When a dataset includes a categorical variable and one or more continuous variables, you will probably want to know how the values of the continuous variables vary with the levels of the categorical variable. We can check, for example, whether the engine displacement differs between classes of cars:
ggplot(mpg, aes(x=class, y=displ))+geom_boxplot()
If we have a discrete variable of interest, we can count how many times each value occurs in our data. For example, how many cars belong to each class?
ggplot(mpg, aes(x=class))+geom_bar()
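To see what geom_bar() computes here, the sketch below compares the bar heights with a plain table() of the class column. It uses layer_data(), a ggplot2 helper that returns the data computed for a layer; the variable names are only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class)) + geom_bar()

# Heights computed by geom_bar() (one row per bar)
bar_heights <- layer_data(p)$count

# The same counts computed directly from the data
class_counts <- table(mpg$class)

bar_heights
class_counts
```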
Now that we have seen the basics of ggplot2, we can go deeper and add features to make nicer graphs. For a summary of everything you can do with ggplot2, have a look at the package cheatsheet.
Focusing on the scatter plot, we may want to colour the points according to a variable in the dataset. ggplot2 takes care of the details of converting data (e.g., ‘f’, ‘r’, ‘4’) into aesthetics (e.g., ‘red’, ‘yellow’, ‘green’) with a scale:
ggplot(mpg, aes(x=cyl, y=displ, color=manufacturer))+geom_point()
Notice the difference between these three plots:
ggplot(mpg, aes(x=cyl, y=displ, color=manufacturer))+geom_point()
ggplot(mpg, aes(x=cyl, y=displ, color="blue"))+geom_point()
ggplot(mpg, aes(x=cyl, y=displ))+geom_point(color="blue")
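In the first plot, the aesthetic mapping says that the variable manufacturer has to be converted into a colour. In the second, the aesthetic is given a single constant, "blue": ggplot2 treats it as a data value with a single level, so all points get the first default colour of the discrete scale (not blue) and a legend is drawn. In the third, we work outside the aesthetics and assign the colour blue directly to the points.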
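The difference between mapping and setting a colour can be checked by inspecting the colours ggplot2 actually assigns. This is a small sketch using layer_data(); the object names are illustrative.

```r
library(ggplot2)

# "blue" inside aes(): treated as a data value and passed through a scale
p_mapped <- ggplot(mpg, aes(x = cyl, y = displ, color = "blue")) + geom_point()

# "blue" outside aes(): used directly as the point colour
p_set <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point(color = "blue")

unique(layer_data(p_mapped)$colour)  # a default palette colour chosen by the scale
unique(layer_data(p_set)$colour)     # the literal colour passed to the geom
```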
To give more or less emphasis to the points in a graph we can use transparency. In this case we map the highway mileage (hwy) to the alpha aesthetic:
ggplot(mpg, aes(x=cyl, y=displ,alpha=hwy))+geom_point()
To add information to the scatter plot we can vary the shape and size of the points according to other variables. Shape must be mapped to a discrete variable, while size is best mapped to a continuous one. Let’s choose as.character(year) for shape and hwy (highway miles per gallon) for size.
ggplot(mpg, aes(x=cyl, y=displ, shape=as.character(year), size=hwy))+geom_point()
Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. You can also choose the number of columns and rows the plots have to fit in. To facet by a single variable we can use facet_wrap(). Let’s facet by the class column:
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+facet_wrap(~class, nrow = 2, ncol=4)
To facet your plot on the combination of two variables, instead, add facet_grid() to your plot call.
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+facet_grid(fl~year)
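To see the dominant pattern in the points you can fit a line through them. Different methods are available; the most used are the default smoother (loess when there are fewer than 1,000 observations) and a linear model (method = "lm"):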
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+geom_smooth()
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+geom_smooth(method ="lm")
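With its default formula (y ~ x), the "lm" smoother draws the same straight line you would get from fitting lm() yourself. The sketch below fits that model directly and overlays it with geom_abline() for comparison; the object name fit and the styling are illustrative choices.

```r
library(ggplot2)

# Fit the same linear model that geom_smooth(method = "lm") uses by default
fit <- lm(displ ~ cyl, data = mpg)
coef(fit)  # intercept and slope of the fitted line

# Overlay the hand-fitted line (dashed red) on the smoothed line
ggplot(mpg, aes(x = cyl, y = displ)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              colour = "red", linetype = "dashed")
```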
If we have a categorical variable and one or more continuous variables, we can use a boxplot to look at the data. Other types of graph can be useful too: for example, geom_jitter() shows every value of a variable while limiting the overlap between points, and geom_violin() shows the distribution of the values.
ggplot(mpg, aes(x=class, y=displ))+geom_violin()
ggplot(mpg, aes(x=class, y=displ))+geom_jitter(aes(col=class),show.legend = FALSE)
For jittered points, geom_jitter() offers the same control over aesthetics as geom_point(): size, colour, and shape. For geom_boxplot() and geom_violin(), you can control the outline colour or the internal fill colour.
As ggplot2 lets us add as many layers as we want (as long as they make sense and are useful), we can also combine graphs. For example, we can draw a boxplot on top of a violin plot to see the distribution of the points as well:
ggplot(mpg, aes(x=class, y=displ))+geom_violin()+geom_boxplot()
Notice that the order matters: each layer is drawn on top of the previous ones:
ggplot(mpg, aes(x=class, y=displ))+geom_boxplot()+geom_violin()
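One way to keep both layers readable, sketched here as a suggestion rather than the only option, is to draw the violin first and then a narrow boxplot inside it. The fill colour and the width value are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = displ)) +
  geom_violin(fill = "grey90") +  # drawn first, so it stays in the background
  geom_boxplot(width = 0.1)       # narrow boxplot drawn on top of the violin
p
```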
geom_histogram() and geom_freqpoly() are used to evaluate the distribution of a single numeric (continuous) variable. They work in the same way: they bin the data and then count the number of observations in each bin. The only difference is the representation: histograms use bars and frequency polygons use lines. For example, we can look at the distribution of miles per gallon in the city:
ggplot(mpg, aes(x=cty))+geom_histogram()
ggplot(mpg, aes(x=cty))+geom_freqpoly()
Pay attention to the bin width to be sure it fits your data: by default the data are simply split into 30 bins. You can change this with the binwidth parameter:
ggplot(mpg, aes(x=cty))+geom_histogram(binwidth=1)
ggplot(mpg, aes(x=cty))+geom_histogram(binwidth=3)
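To make the effect of binwidth concrete, this sketch counts how many bins each setting produces. count_bins() is a hypothetical helper defined only for this example.

```r
library(ggplot2)

# Hypothetical helper: number of bins a histogram of cty gets for a given binwidth
count_bins <- function(binwidth) {
  p <- ggplot(mpg, aes(x = cty)) + geom_histogram(binwidth = binwidth)
  nrow(layer_data(p))  # one row of computed data per bin
}

n_narrow <- count_bins(1)
n_wide <- count_bins(3)
c(binwidth_1 = n_narrow, binwidth_3 = n_wide)
```

A smaller binwidth gives more, narrower bins and shows finer detail, at the price of a noisier picture.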
To visualize the distribution of data we can also use a density plot. Note that behind a density plot there are computational steps (kernel density estimation) that make the figure continuous, unbounded, and smooth.
ggplot(mpg, aes(x=cty))+geom_density()
We can compare the distributions of different subgroups by mapping the categorical variable of interest to fill or colour. Density plots and frequency polygons are easier to compare, as they can coexist in the same panel. Histograms, in most cases, have to be split with facets.
ggplot(mpg, aes(x=cty, color=class))+geom_density()
ggplot(mpg, aes(x=cty, fill=class))+geom_histogram()
ggplot(mpg, aes(x=cty, fill=class))+geom_histogram()+facet_wrap(~class, nrow=2)
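Frequency polygons can be compared in a single panel in the same way as densities. The sketch below uses the drv column (three levels) instead of class to keep the plot readable; the binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# One frequency polygon per drive train, all in the same panel
p <- ggplot(mpg, aes(x = cty, colour = drv)) +
  geom_freqpoly(binwidth = 2)
p
```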
You might want to draw a summary of a continuous variable in your data. You can use stat_summary(), which summarises the y values for each unique x value:
ggplot(mpg, aes(x=class, y=displ))+stat_summary()
The default summary is the mean plus or minus the standard error (mean_se()). To use a different statistic, set fun=. ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help by typing, for example, ?stat_bin.
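As an illustration of the fun= argument, this sketch plots the median displacement per class and checks it against medians computed by hand with tapply(). It assumes a recent ggplot2 version in which the argument is called fun; the point size is arbitrary.

```r
library(ggplot2)

# Summarise displ by class with the median instead of the default
p <- ggplot(mpg, aes(x = class, y = displ)) +
  stat_summary(fun = median, geom = "point", size = 3)
p

# The same medians computed directly from the data
medians <- tapply(mpg$displ, mpg$class, median)
medians
```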
The bar plot is the discrete analogue of the histogram. It can be used in two ways: on unsummarized data, as we did before, where the height of each bar is the number of observations, or on summarized data, where the height of each bar equals the value stored in the data frame. Let’s make an example of the latter type of barplot:
data=cbind.data.frame(Name=c("Tizio", "Caio", "Sempronio"), "Years_old"=c(40, 5, 80))
ggplot(data, aes(x=Name, y=Years_old))+geom_bar(stat="identity")
We can also order the bars by age, instead of using the default alphabetical order:
ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")
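ggplot2 also provides geom_col(), which is a shorthand for geom_bar(stat = "identity"). The sketch below rebuilds the same small data frame (named ages here, an illustrative name) and shows that both calls give the same bar heights.

```r
library(ggplot2)

# Same example values as above, stored under an illustrative name
ages <- data.frame(Name = c("Tizio", "Caio", "Sempronio"),
                   Years_old = c(40, 5, 80))

p_bar <- ggplot(ages, aes(x = reorder(Name, Years_old), y = Years_old)) +
  geom_bar(stat = "identity")

# geom_col() uses the y values as bar heights without further arguments
p_col <- ggplot(ages, aes(x = reorder(Name, Years_old), y = Years_old)) +
  geom_col()
p_col
```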
Here too, as we saw for base R plots, we can add a plot title, change the axis names or even remove them.
ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")+ggtitle("Barplot showing age")+xlab("Name")+ylab("Age")
ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")+ggtitle("Barplot showing age")+xlab(NULL)+ylab(NULL)
Note that if we fill a barplot by a variable that is not on the x axis, we automatically obtain a stacked graph:
ggplot(mpg, aes(x=class, fill=drv))+geom_bar()
If you don’t want a stacked bar chart, you can use one of three other position options: identity, dodge or fill.
position = "identity" places each object exactly where it falls in the context of the graph, so the bars overlap. To see that overlap, we need either to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA:
ggplot(mpg, aes(x=class, fill=drv))+geom_bar(alpha=0.5,position="identity")
The “identity” setting is not very useful in this case.
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
ggplot(mpg, aes(x=class, fill=drv))+geom_bar(position="dodge")
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups:
ggplot(mpg, aes(x=class, fill=drv))+geom_bar(position="fill")
We may want to focus only on certain ranges of the x and y axes. In this case we can use xlim() and ylim() to set them. Note that observations falling outside these limits are dropped from the plot.
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+xlim(6,8)+ylim(3,5)
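If you only want to zoom in without discarding data, one alternative is coord_cartesian(), which changes the visible region and leaves the underlying data untouched. This matters mostly when a statistic such as a smoother or a boxplot is computed from the data. A minimal sketch with the same limits:

```r
library(ggplot2)

# Zoom on the same region; all observations are kept in the plot data
p_zoom <- ggplot(mpg, aes(x = cyl, y = displ)) +
  geom_point() +
  coord_cartesian(xlim = c(6, 8), ylim = c(3, 5))
p_zoom
```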
The default coordinate system is Cartesian. Nevertheless, a number of other coordinate systems are occasionally helpful, such as coord_flip(), which swaps the x and y axes, and coord_polar(), which maps the data onto polar coordinates:
ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_flip()
ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_polar()
Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale.
There are five different families of scales:
general purpose scales
x and y location scales
color and fill discrete scales
color and fill continuous scales
shape and size scales
example_data=data.frame(class=sample(c("A","B","C"),1000,replace = TRUE),value=c(rnorm(800,mean = 1,sd=1),rnorm(200,mean = 100,sd=0.1)))
ggplot(example_data,aes(x=class,y=value))+geom_boxplot()
ggplot(example_data,aes(x=class,y=value))+geom_boxplot()+scale_y_log10()
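Scales for colour and fill work in the same spirit as the location scale above. As a sketch of a discrete fill scale, the example below assigns hand-picked colours to the three levels of drv in mpg; the colour names and the legend title are arbitrary choices for illustration.

```r
library(ggplot2)

# Replace the default fill scale with a manual one
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_manual(
    values = c("4" = "steelblue", "f" = "orange", "r" = "darkgreen"),
    name = "Drive train"  # legend title
  )
p
```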
+theme() is the last component we can add to a graph. It customizes aspects of the plot such as axis, legend, panel, and facet properties. There are many complete themes ready to use (find them at https://ggplot2.tidyverse.org/reference/ggtheme.html).
For more elegant, paper-style plots we suggest using +theme_bw().
Themes can also be combined. Note that a later theme() call modifies the earlier one. For example, to use a pre-built theme, put it first and then add a theme() call with your own adjustments, like:
ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_polar()+theme_bw()+
  theme(axis.title.x=element_blank(), # no x-axis title
        axis.text.x=element_blank(), # no x-axis labels
        axis.ticks.x=element_blank(), # no x-axis ticks
        legend.position="bottom", # legend at the bottom of the graph
        legend.direction="horizontal") # legend laid out horizontally
These are the theme() options available in ggplot2
Do you remember the image at the beginning of this chapter?
Now we know all components of a graph.
To save a plot you can follow the same procedure you learnt for base R plots. It is useful to pass useDingbats = FALSE to pdf(): this makes it easier to import the plot into external graphics tools (like Illustrator) and edit it. As for dimensions, pdf() expects width and height in inches, so a size given in other units has to be converted first (1 inch = 2.54 cm).
pdf("file.pdf", useDingbats = FALSE, width = 3/2.54, height = 3/2.54) # 3 cm x 3 cm, converted to inches
ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+xlim(6,8)+ylim(3,5)
dev.off()
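ggplot2 also has its own saving function, ggsave(), which accepts a units argument so sizes can be given directly in centimetres. This is a minimal sketch; the output path in a temporary directory and the 8 cm size are placeholders chosen for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point()

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "file.pdf")

# The device is chosen from the file extension; sizes are given in cm here
ggsave(out_file, plot = p, width = 8, height = 8, units = "cm")
```

Passing the plot object explicitly avoids relying on the last plot displayed, which is what ggsave() uses when plot is omitted.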