2025-03-12

Introduction

ggplot2 offers several advantages over base R plotting:

  • has an underlying grammar, based on the Grammar of Graphics, that lets you compose graphs by combining independent components. This makes ggplot2 powerful: you are not limited to a set of pre-defined graphics

  • provides beautiful and elegant graphics; even its defaults can produce publication-quality plots

  • has a layered structure, so you can work iteratively. You can start with a layer that shows the raw data and then add annotations, statistics and so on

  • contains a lot of building blocks you can use to build your own representation



Before plotting: chart design

Before starting, we have to determine the kind of visualization that is appropriate for the dataset. First, we have to identify the type of data we are using, in particular whether they are continuous or discrete.


Data type  | Definition                                                                              | Examples
continuous | Between any two values on a continuous scale there are infinitely many possible values | linear scales, log scales
discrete   | Between any two values on a discrete scale there are only finitely many possible values | [‘a’, ‘b’, ‘c’]


Different plots are appropriate for different situations. To determine the kind of chart that is appropriate for a given situation, it’s helpful to consider the kind of argument you want to make with your data.


The main goals of visualizing data are:

  • comprehend the relationship between variables (e.g. their correlation)

  • highlight the differences between variables with a comparison

  • show different aspects of your data in a composition

  • show the distribution of data



If you want to explore ggplot2 further, visit https://r-graph-gallery.com/index.html, where you will find all types of graphs together with many examples.




The Grammar of Graphics

The Grammar of Graphics is the idea that you can build every graph from the same components:

  • a data set

  • a coordinate system

  • geoms, the visual marks that represent the data



To plot values, you map variables in the data to visual properties of the geom (aesthetics), such as size, color, and x and y locations.




Build a graph

The general function to make a graph is:
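As a hedged sketch of that general pattern, a ggplot2 call usually combines a dataset, an aesthetic mapping and a geom, followed by optional components. The variables (displ, hwy, drv), the axis label and the specific facet, scale and theme below are assumptions chosen only for illustration; only the data, the mapping and the geom are required.

```r
library(ggplot2)

# General pattern: data + aesthetic mapping + geom, then optional components
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + # data and mappings (required)
  geom_point() +                                          # geom: how observations are drawn (required)
  coord_cartesian() +                                     # coordinate system (optional, this is the default)
  facet_wrap(~drv) +                                      # faceting (optional)
  scale_x_continuous(name = "Engine displacement") +      # scale (optional)
  theme_bw()                                              # theme (optional)

p # printing the object draws the plot
```

Every component is added with +, so a plot can be built up and inspected one step at a time.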



To better understand how ggplot2 works, we will work through some examples using the mpg dataset, which ships with the ggplot2 package itself. After installing ggplot2, you first need to load it:

suppressWarnings(library(ggplot2))



Dataset exploration

This dataset provides fuel economy data from 1999 and 2008 for 38 popular models of cars, collected by the US Environmental Protection Agency.
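A quick way to get familiar with the dataset is to inspect it with a few base R functions. The selection of functions and columns below is only a suggestion, not a required procedure.

```r
library(ggplot2)

dim(mpg)          # number of rows and columns
head(mpg)         # first six rows
str(mpg)          # column names and types
summary(mpg$hwy)  # quick numeric summary of highway miles per gallon
table(mpg$class)  # how many cars fall in each class
```

The help page ?mpg describes the meaning of every column.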




Scatter plot

We can compare the number of cylinders (cyl) with the engine displacement (displ) across a series of cars to see if there is a correlation:

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()



Boxplot

When a set of data includes a categorical variable and one or more continuous variables, you will probably be interested to know how the values of the continuous variables vary with the levels of the categorical variable. We can see, for example, whether the engine displacement differs between classes of cars:

ggplot(mpg, aes(x=class, y=displ))+geom_boxplot()



Barplot

If we have a discrete variable of interest, we can count how many times each of its values occurs in our data. For example, how many cars belong to each class?

ggplot(mpg, aes(x=class))+geom_bar()

Now that we have seen the basics of ggplot2, we can go deeper and add features to make nicer graphs. For a summary of everything you can do with ggplot2, have a look at the package cheatsheet.



Changing colour

Focusing on the scatter plot, we may want to colour the points according to a variable in the dataset. ggplot2 takes care of the details of converting data (e.g., ‘f’, ‘r’, ‘4’) into aesthetics (e.g., ‘red’, ‘yellow’, ‘green’) with a scale:

ggplot(mpg, aes(x=cyl, y=displ, color=manufacturer))+geom_point()


Notice the difference between these three plots:

ggplot(mpg, aes(x=cyl, y=displ, color=manufacturer))+geom_point()

ggplot(mpg, aes(x=cyl, y=displ, color="blue"))+geom_point()

ggplot(mpg, aes(x=cyl, y=displ))+geom_point(color="blue")

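In the first, we use the aesthetic mapping to say that the variable manufacturer has to be converted into a colour. In the second, the string “blue” is placed inside aes( ), so it is treated as a constant data value rather than as a colour: all points share one category, receive the first colour of the default scale and get a legend entry labelled blue. In the third, we work outside the aesthetics and assign the colour blue directly to the points.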
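One way to check the difference is to look at the data that each layer actually draws. The sketch below uses layer_data( ) from ggplot2; the object names p_mapped and p_fixed are made up for this illustration.

```r
library(ggplot2)

p_mapped <- ggplot(mpg, aes(x = cyl, y = displ, color = "blue")) + geom_point()
p_fixed  <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point(color = "blue")

# layer_data() returns the data drawn by a layer, after the scales have been applied
unique(layer_data(p_mapped)$colour) # a colour picked by the default scale, not "blue"
unique(layer_data(p_fixed)$colour)  # the colour set directly on the geom
```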



Adding transparency to the points

To give more or less importance to a point in a graph we can use transparency. In this case we will map the highway mileage (hwy) to the alpha aesthetic:

ggplot(mpg, aes(x=cyl, y=displ,alpha=hwy))+geom_point()



Changing shape and size of points

To add information to the scatter plot we can vary the shape and size of the points according to other variables. Shape requires a discrete variable, while size is best suited to a continuous one. Let’s choose as.character(year) for shape and hwy (highway miles per gallon) for size.

ggplot(mpg, aes(x=cyl, y=displ, shape=as.character(year), size=hwy))+geom_point() 



Faceting

Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. You can also choose the number of columns and rows the plots have to fit in. To facet by a single variable we can use facet_wrap( ). Let’s facet by the class column:

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+facet_wrap(~class, nrow = 2, ncol=4)

To facet your plot on the combination of two variables, instead, add facet_grid( ) to your plot call.

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+facet_grid(fl~year)



Fit points with a line

To see the dominant pattern in the points you can fit a line through them. Several methods are available; the most common are:

  • loess, a smooth local regression (used by default when there are fewer than 1,000 observations)
  • lm, a linear model, giving the line of best fit

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+geom_smooth()

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+geom_smooth(method ="lm")



Alternative to boxplot

If we have a categorical variable and one or more continuous variables we can make a boxplot to observe the data. Nevertheless, other types of graph can be useful if, for example, we want to see all the values taken by a variable while avoiding overlapping points (using geom_jitter( )) or we want to see the distribution of values (using geom_violin( )).

ggplot(mpg, aes(x=class, y=displ))+geom_violin()

ggplot(mpg, aes(x=class, y=displ))+geom_jitter(aes(col=class),show.legend = FALSE)

For jittered points, geom_jitter( ) offers the same control over aesthetics as geom_point( ): size, colour, and shape. For geom_boxplot( ) and geom_violin( ), you can control the outline colour or the internal fill colour.


As ggplot2 lets us add as many layers as we want (as long as they make sense and are useful), we can also combine graphs. For example, we can draw a boxplot on top of a violin plot to see the distribution of points:

ggplot(mpg, aes(x=class, y=displ))+geom_violin()+geom_boxplot()

Notice that the order matters, because each new layer is drawn on top of the previous ones:

ggplot(mpg, aes(x=class, y=displ))+geom_boxplot()+geom_violin()



Histograms and frequency polygons

geom_histogram( ) and geom_freqpoly( ) are used to evaluate the distribution of a single numeric (continuous) variable. They work in the same way: they bin the data and then count the number of observations in each bin. The only difference is the representation: histograms use bars and frequency polygons use lines. For example, we can see the distribution of miles per gallon in the city:

ggplot(mpg, aes(x=cty))+geom_histogram()

ggplot(mpg, aes(x=cty))+geom_freqpoly()

Make sure the bin width suits your data: the default simply splits the data into 30 bins. You can change it with the binwidth parameter:

ggplot(mpg, aes(x=cty))+geom_histogram(binwidth=1)

ggplot(mpg, aes(x=cty))+geom_histogram(binwidth=3)
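As a small sketch of an alternative, the bins argument of geom_histogram( ) sets the number of bins instead of their width. The values 2 and 10 and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# binwidth fixes the width of each bin; bins fixes how many bins are used
p_width <- ggplot(mpg, aes(x = cty)) + geom_histogram(binwidth = 2)
p_bins  <- ggplot(mpg, aes(x = cty)) + geom_histogram(bins = 10)

# The computed bins (one row per bin) can be inspected before plotting
layer_data(p_bins)[, c("xmin", "xmax", "count")]
p_bins
```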



Density plot

To visualize the distribution of data we can also use a density plot. Notice that behind a density plot there are computational steps (a kernel density estimate) that make the curve continuous, unbounded, and smooth.

ggplot(mpg, aes(x=cty))+geom_density()



Compare distributions

We can compare the distributions of different subgroups by mapping the categorical variable of interest to fill or colour. Density plots and frequency polygons are easier to compare because they can coexist in the same panel. Histograms, in most cases, have to be split into facets.

ggplot(mpg, aes(x=cty, color=class))+geom_density()

ggplot(mpg, aes(x=cty, fill=class))+geom_histogram()

ggplot(mpg, aes(x=cty, fill=class))+geom_histogram()+facet_wrap(~class, nrow=2)



Statistical transformation

You might want to draw a summary of a continuous variable in your data. You can use stat_summary( ), which summarises the y values for each unique x value:

ggplot(mpg, aes(x=class, y=displ))+stat_summary()

The default summary is the mean plus or minus the standard error (mean_se( )). To use a different statistic, set the fun argument. ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help by typing, for example, ?stat_bin.
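As a minimal sketch of a custom summary, the call below shows the median of each class with the range from minimum to maximum as the interval. It assumes ggplot2 3.3.0 or later, where the arguments are named fun, fun.min and fun.max; the choice of median, min and max is only illustrative.

```r
library(ggplot2)

# Median displacement per class, with the range (min to max) as the interval
p <- ggplot(mpg, aes(x = class, y = displ)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)
p
```

Any function that takes a numeric vector and returns a single number can be passed in the same way.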



Barplot in detail

The barplot is the discrete analogue of the histogram. It can be used in two ways: on unsummarized data, as we did before, where the height of each bar is the number of observations, or on summarized data, where the height of each bar equals the value stored in the data frame. Let’s see an example of the latter type of barplot:

data=data.frame(Name=c("Tizio", "Caio", "Sempronio"), Years_old=c(40, 5, 80))
ggplot(data, aes(x=Name, y=Years_old))+geom_bar(stat="identity")

We can also order the bars by age, instead of using the default alphabetical order:

ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")
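For summarized data, ggplot2 also provides geom_col( ), which uses the y values as bar heights in the same way as geom_bar(stat="identity"). The sketch below rebuilds the small example data under the hypothetical name ages so that it can run on its own.

```r
library(ggplot2)

# Hypothetical summarized data, mirroring the example above
ages <- data.frame(Name = c("Tizio", "Caio", "Sempronio"), Years_old = c(40, 5, 80))

# geom_col() takes bar heights from y, like geom_bar(stat = "identity")
p_col <- ggplot(ages, aes(x = reorder(Name, Years_old), y = Years_old)) + geom_col()
p_bar <- ggplot(ages, aes(x = reorder(Name, Years_old), y = Years_old)) + geom_bar(stat = "identity")
p_col
```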



Give a title to the plot and change axes names

Here too, as we saw for base R plots, we can add a plot title, change the axis names or even remove them.

ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")+ggtitle("Barplot showing age")+xlab("Name")+ylab("Age")

ggplot(data, aes(x=reorder(Name, Years_old), y=Years_old))+geom_bar(stat="identity")+ggtitle("Barplot showing age")+xlab(NULL)+ylab(NULL)



Stacked barplot

Note that if we fill a barplot according to a variable that is not on the x axis, we automatically obtain a stacked graph:

ggplot(mpg, aes(x=class, fill=drv))+geom_bar()

If you don’t want a stacked bar chart, you can use one of three other options: identity, dodge or fill.



Identity barplot

position = "identity" places each object exactly where it falls in the context of the graph, so the bars overlap. To see the overlapping bars we need to make them either slightly transparent, by setting alpha to a small value, or completely transparent, by setting fill = NA.

ggplot(mpg, aes(x=class, fill=drv))+geom_bar(alpha=0.5,position="identity")

The “identity” setting is not very useful in this case, because the overlapping bars are hard to tell apart.



Dodged barplot

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(mpg, aes(x=class, fill=drv))+geom_bar(position="dodge")



Percentage barplot

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups:

ggplot(mpg, aes(x=class, fill=drv))+geom_bar(position="fill")



Modify limits of axes

We may be interested in focusing only on certain ranges of the x and y axes. In this case we can use xlim( ) and ylim( ) to set them.

ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+xlim(6,8)+ylim(3,5)
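Note that xlim( ) and ylim( ) treat the observations outside the limits as missing values, and ggplot2 reports the removed rows in a warning. As a sketch of an alternative, coord_cartesian( ) only zooms the view and keeps all the data. The limits below reuse the ones from the example above, and the object names are made up for illustration.

```r
library(ggplot2)

# xlim()/ylim(): values outside the limits are turned into NA and not drawn
p_limits <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point() +
  xlim(6, 8) + ylim(3, 5)

# coord_cartesian(): zooms in without dropping any observation
p_zoom <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point() +
  coord_cartesian(xlim = c(6, 8), ylim = c(3, 5))

p_zoom
```

The distinction matters most when a layer computes something from the data, such as geom_smooth( ) or geom_boxplot( ), because xlim( ) and ylim( ) change the values that enter the computation.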



Coordinate systems

The default coordinate system is Cartesian. Nevertheless, there are a number of other coordinate systems that are occasionally helpful, such as coord_flip( ), which swaps the x and y axes, and coord_polar( ), which maps the data onto polar coordinates:

ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_flip()

ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_polar()



Scales

Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale.



There are five different families of scales:

  • general purpose scales

  • x and y location scales

  • color and fill discrete scales

  • color and fill continuous scales

  • shape and size scales



General purpose scales
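As a minimal sketch, general purpose scales such as scale_fill_manual( ) let you choose the visual values yourself. The colours assigned to the three drv levels and the legend title below are arbitrary and only for illustration.

```r
library(ggplot2)

# Manually assign one fill colour to each level of drv ("4", "f", "r")
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_manual(
    values = c("4" = "steelblue", "f" = "orange", "r" = "grey40"),
    name = "Drive train"
  )
p
```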




X and y location scales


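Location scales control how data values are positioned along the x and y axes. In this example most values lie around 1 and a few around 100, so a log-transformed y axis shows both groups clearly. Values that are zero or negative cannot be drawn on a log scale and are dropped with a warning.

set.seed(1) # make the random example reproducible
example_data=data.frame(class=sample(c("A","B","C"),1000,replace = TRUE),value=c(rnorm(800,mean = 1,sd=1),rnorm(200,mean = 100,sd=0.1)))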
ggplot(example_data,aes(x=class,y=value))+geom_boxplot()

ggplot(example_data,aes(x=class,y=value))+geom_boxplot()+scale_y_log10()



Color and fill discrete scales
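As a hedged sketch, two common discrete scales are scale_fill_brewer( ), which takes its colours from a ColorBrewer palette, and scale_fill_grey( ), which uses shades of grey. The palette name and the grey range below are arbitrary choices.

```r
library(ggplot2)

# Discrete fill taken from a ColorBrewer palette (the palette choice is arbitrary)
p_brewer <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")

# Discrete fill as shades of grey
p_grey <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_grey(start = 0.2, end = 0.8)

p_brewer
```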




Color and fill continuous scales
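As a hedged sketch, a continuous variable can be mapped to a colour gradient with scale_color_gradient( ) or to the viridis palette with scale_color_viridis_c( ). The variables and the colours below are arbitrary choices for illustration.

```r
library(ggplot2)

# Two-colour gradient for a continuous variable (colours chosen arbitrarily)
p_gradient <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

# Viridis palette for the same mapping
p_viridis <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_viridis_c()

p_gradient
```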




Shape and size scales
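As a hedged sketch, scale_shape_manual( ) assigns a specific symbol to each level of a discrete variable, while scale_size( ) sets the range of point sizes used for a continuous one. The symbol codes and the size range below are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv, size = cty)) +
  geom_point(alpha = 0.5) +
  scale_shape_manual(values = c("4" = 16, "f" = 17, "r" = 15)) + # one symbol per level
  scale_size(range = c(1, 4))                                    # smallest and largest point size
p
```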




The theme’s world

+theme( ) is the last component we can add to a graph. It customizes aspects of the plot such as axis, legend, panel, and facet properties. There are many complete, ready-made themes you can use (find them at https://ggplot2.tidyverse.org/reference/ggtheme.html).

For more elegant, paper-style plots we suggest using +theme_bw( ).


Themes can also be combined. Note that a later theme call modifies the earlier one. For example, if you want to use a complete built-in theme, put it first and then add a theme( ) call with your own adjustments, like:

ggplot(mpg, aes(x=class, fill=class))+geom_bar()+coord_polar()+theme_bw()+
  theme(axis.title.x=element_blank(), # no x name
        axis.text.x=element_blank(), # no x labels
        axis.ticks.x=element_blank(), # no ticks on x
        legend.position="bottom", # legend at the bottom of the graph
        legend.direction = "horizontal")  # legend laid out horizontally

The full list of theme( ) options available in ggplot2 is documented in the help page ?theme.



Graph components

Do you remember the image at the beginning of this chapter?


Now we know all components of a graph.

Save the plot in a file

To save a plot you can follow the same procedure you learnt for base R plots. It is useful to set useDingbats = FALSE in the pdf( ) function. This will allow you to import the plot into external graphics tools (like Illustrator) and modify it easily. Note that pdf( ) expects width and height in inches, so sizes given in centimetres have to be converted (1 inch = 2.54 cm).

# pdf() sizes are in inches: divide centimetres by 2.54
pdf("file.pdf", useDingbats = FALSE, width = 3/2.54, height = 3/2.54)
# print() makes sure the plot is drawn even when the code is run from a script
print(ggplot(mpg, aes(x=cyl, y=displ))+geom_point()+xlim(6,8)+ylim(3,5))
dev.off()
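As a hedged alternative sketch, ggplot2's own ggsave( ) accepts sizes directly in centimetres through its units argument and picks the device from the file extension. The file name, the temporary directory and the 8 cm size are assumptions chosen only for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cyl, y = displ)) + geom_point()

# Temporary path used for illustration; replace it with your own file name
out_file <- file.path(tempdir(), "scatter.pdf")

# The device is chosen from the ".pdf" extension; sizes are given in cm
ggsave(out_file, plot = p, width = 8, height = 8, units = "cm")
file.exists(out_file)
```

Passing the plot explicitly with plot = p avoids relying on the last plot displayed.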