R allows users to create plots without the need to load any additional packages, using only the built-in functions.
Visualizing data is an essential step in data analysis, especially in biology, as it helps to better understand the distribution, relationships, and potential outliers in the data before performing more complex statistical analysis.
The main types of plots we can create are:
- Histograms, with the hist() function, useful for visualizing the distribution of a continuous variable
- Density plots, with plot(density()), to estimate the probability distribution of a continuous variable
- Scatter plots, with plot(), ideal for exploring the relationship between two continuous variables
- Boxplots, with boxplot(), to compare the distributions of a variable across different groups
- Barplots, with barplot(), commonly used for categorical data or to compare the sizes of different groups
A histogram is useful for visualizing the distribution of a continuous variable.
Let’s generate 100 random values from a normal distribution and plot them as a histogram.
x <- rnorm(100)
hist(x)
We can customize the histogram by adding arguments to the function.
For example, we can:
- Change the fill color (col) and border color (border) of the bars
- Add a title (main)
- Change the number of bins (breaks)
hist(x, col = "pink", border = "blue", main = "My first histogram", breaks = 100)
By increasing the number of breaks, we get a more detailed view of the distribution.
However, too many breaks may make the histogram less readable.
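To see how the breaks argument behaves, the sketch below stores the object that hist() returns instead of only drawing it. When breaks is a single number, R treats it as a suggestion and picks "pretty" cut points, so the actual number of bins may differ from the requested one. The seed and the variable name h are illustrative choices, not part of the original example.

```r
set.seed(1)            # arbitrary seed, only to make the example reproducible
x <- rnorm(100)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 20, plot = FALSE)

length(h$counts)       # number of bins actually used (may differ from 20)
h$breaks               # the bin boundaries chosen by R
sum(h$counts)          # every value falls into exactly one bin
```

Inspecting h$breaks is a quick way to check what a given breaks value really produced before settling on it.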
A density plot is another way to visualize the distribution of a continuous variable.
Unlike histograms, density plots estimate the probability density function of the data, providing a smooth curve instead of discrete bars.
Using a sample generated in the same way, we can create a density plot with:
x <- rnorm(100)
plot(density(x))
We can further customize the plot by:
- Filling the area under the curve with color
- Changing the border color
To do this, we first store the density object and then use the polygon() function to add the shaded area:
d <- density(x)
plot(d)
polygon(d, col = "pink", border = "blue")
The polygon() function fills the area under the curve, making the visualization more appealing and easier to interpret.
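Histograms and density plots describe the same distribution, so it can help to draw them together. The following is a minimal sketch, assuming a freshly simulated sample; the seed, colors, and title are arbitrary choices.

```r
set.seed(1)
x <- rnorm(100)
d <- density(x)

# freq = FALSE puts the histogram on the density scale (total bar area = 1)
hist(x, freq = FALSE, col = "pink", border = "white",
     main = "Histogram with density curve")
lines(d, col = "blue", lwd = 2)   # add the smooth estimate on top

# the area under the estimated curve is approximately 1
sum(d$y) * diff(d$x[1:2])
```

Because lines() adds to the current plot rather than starting a new one, the histogram must be drawn first.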
Scatter plots are widely used in bioinformatics as they help visualize the relationship between two variables.
We can create a second vector with 100 values from the normal distribution (just like x) and plot them in a scatter plot:
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
We can customize the plot by:
- Adding a title (main)
- Changing the axis labels (xlab, ylab)
plot(x, y, main = "My first scatterplot", xlab = "X", ylab = "Y")
We can also modify the point shape (pch) and the point color (col):
plot(x, y, main = "My first scatterplot", xlab = "X", ylab = "Y", pch = 16, col = "pink")
Some shapes allow additional customization, such as:
- Changing the fill color (bg)
- Changing the border color (col)
plot(x, y, main = "My first scatterplot", xlab = "X", ylab = "Y", pch = 24, col = "blue", bg = "pink")
The pch parameter controls the point shape. Some values (21 to 25, such as 24, a filled triangle) allow specifying both a border color (col) and a fill color (bg).
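To compare the shapes that accept both col and bg, one option is to draw them next to each other. This is only an illustrative sketch; the layout choices (cex, axis labels, title) are assumptions made for readability.

```r
# symbols 21 to 25 have a border (col) and a separate fill (bg)
shapes <- 21:25

plot(seq_along(shapes), rep(1, length(shapes)),
     pch = shapes, col = "blue", bg = "pink", cex = 3,
     xlim = c(0.5, 5.5), xaxt = "n", yaxt = "n",
     xlab = "pch value", ylab = "", main = "Fillable point shapes")

# label each symbol with its pch code
axis(1, at = seq_along(shapes), labels = shapes)
```

For other pch values, bg is ignored and col sets the color of the whole symbol.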
A boxplot is useful for visualizing the distribution, variability, and potential outliers of a dataset.
Using the same vectors we used for the scatter plot, we can create a boxplot to compare their distributions.
We can also use a vector of colors to distinguish the two variables:
boxplot(x, y, main = "My first boxplot", xlab = "Vector name", ylab = "Values", col = c("darkslategray1", "darkgoldenrod1"), names = c("X", "Y"))
In the resulting plot:
- The central box represents the interquartile range (IQR), containing the middle 50% of the data.
- The horizontal line inside the box is the median.
- The whiskers extend to the smallest and largest values within 1.5 times the IQR from the quartiles.
- Points outside the whiskers are considered outliers.
By modifying parameters such as col (box color), names (group labels), and main (title), we can improve readability and presentation.
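The numbers behind a boxplot can be retrieved with boxplot.stats(), which may help when checking which points are drawn as outliers. The sketch below uses a simulated sample; note that R builds the box from the hinges returned by fivenum(), which are close to, but not always identical to, the quartiles. The names box_height, lower_fence, and upper_fence are illustrative.

```r
set.seed(1)
x <- rnorm(100)

s <- boxplot.stats(x)   # the statistics that boxplot() draws
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # values plotted as individual outlier points

# the fences: 1.5 times the box height beyond each hinge
box_height  <- s$stats[4] - s$stats[2]
lower_fence <- s$stats[2] - 1.5 * box_height
upper_fence <- s$stats[4] + 1.5 * box_height
```

Any value beyond the fences appears in s$out, and the whiskers end at the most extreme values that are still inside them.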
We can also change the orientation and add notches to the boxplots:
boxplot(x, y, main = "Boxplot with notches", ylab = "Vector name", xlab = "Values", col = c("darkslategray1", "darkgoldenrod1"), horizontal = TRUE, notch = TRUE, names = c("X", "Y"))
Notches in a boxplot represent an approximate confidence interval (roughly 95%) around the median.
They are useful when comparing two or more distributions because:
- If the notches of two boxplots do not overlap, this suggests a statistically significant difference between the medians.
- If the notches overlap, there is no strong evidence that the medians are different.
Using notches can help in biological data analysis, where comparing distributions (e.g., gene expression levels, experimental results) is common.
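The notch limits can also be inspected numerically. The following sketch, assuming two simulated groups (the shifted mean of y is a made-up choice), reads the limits from boxplot.stats() and reproduces them by hand as median ± 1.58 × box height / sqrt(n).

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 1)   # hypothetical second group with a shifted mean

notch_x <- boxplot.stats(x)$conf   # lower and upper end of the notch for x
notch_y <- boxplot.stats(y)$conf

# the same limits computed by hand from the five-number summary
hinges <- fivenum(x)
manual <- hinges[3] + c(-1.58, 1.58) * (hinges[4] - hinges[2]) / sqrt(length(x))

# TRUE when the two notch intervals share at least one value
notches_overlap <- notch_x[1] <= notch_y[2] && notch_y[1] <= notch_x[2]
notches_overlap
```

This is a visual heuristic rather than a formal test, so it is usually worth confirming a suspected difference with an appropriate statistical test.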
A barplot is useful for visualizing categorical data. For example, if we have a dataframe containing information about eye color frequency, we can represent it as a barplot.
Let’s define a small dataset with eye colors, their respective percentages, rarity, and a color scheme:
data <- data.frame(eyes_color = c("brown", "green", "grey", "blue"),
                   percentage = c(80, 10, 3, 7),
                   rarity = c("not rare", "rare", "rare", "rare"),
                   color = c("magenta", "yellow", "yellow", "yellow"),
                   stringsAsFactors = FALSE)  # keep text columns as character vectors
data
## eyes_color percentage rarity color
## 1 brown 80 not rare magenta
## 2 green 10 rare yellow
## 3 grey 3 rare yellow
## 4 blue 7 rare yellow
We can plot these values using a barplot, assigning colors to each bar based on the dataset.
To improve readability, we can add a legend to indicate the rarity of each eye color:
barplot(height = data$percentage, names.arg = data$eyes_color, xlab = "Eyes colour", ylab = "Percentage", col = data$color)
legend("topright", legend = c("not rare","rare"), fill = c("magenta","yellow"))
If the category names on the x-axis are long or overlap, we can rotate them for better visibility using las = 2:
barplot(height = data$percentage, names.arg=data$eyes_color, xlab = "Eyes colour", ylab = "Percentage", col=data$color, las=2)
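Another common refinement is printing each value above its bar. A minimal sketch, assuming the same percentages as above stored in plain vectors (percentage, eye_colors, and bar_mid are illustrative names): barplot() invisibly returns the x position of each bar, which text() can then use.

```r
percentage <- c(80, 10, 3, 7)
eye_colors <- c("brown", "green", "grey", "blue")

# barplot() invisibly returns the x position of each bar's midpoint
bar_mid <- barplot(height = percentage, names.arg = eye_colors,
                   ylab = "Percentage", col = "yellow",
                   ylim = c(0, 100))   # extra headroom for the labels

# write each value just above its bar (pos = 3 means "above")
text(x = bar_mid, y = percentage, labels = paste0(percentage, "%"), pos = 3)
```

Setting ylim explicitly keeps the label of the tallest bar inside the plotting region.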
R allows us to add multiple layers to a plot, combining different plot types.
This is useful for adding annotations, lines, or shapes to enhance data visualization.
Let’s start with a scatter plot and add:
- A margin text (mtext()) on the right side of the plot.
- A text inside the plot (text()) showing the correlation between x and y.
plot(x, y, main = "Scatterplot", xlab = "X",ylab = "Y", pch = 24, col = "blue", bg = "pink")
mtext(text = "This is a plot", side = 4)
cor_value = cor(x,y)
text(1, 2, paste("The correlation is", round(cor_value, 2)))
We can also add lines, polygons, and other elements to enrich our visualization. For example, adding a regression line to the scatter plot:
plot(x, y, main = "Scatterplot", xlab = "X",ylab = "Y", pch = 24, col = "blue", bg = "pink")
abline(lm(y ~ x), col = "red", lwd = 2)
Other elements that can be added include:
- Horizontal and vertical lines: abline(h=...), abline(v=...)
- Custom shapes: polygon(), rect(), etc.
Overlaying elements can be useful in biological data analysis, such as highlighting trends in gene expression or experimental measurements.
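The elements listed above can be combined on a single plot. The sketch below is one possible arrangement on simulated data; the rectangle coordinates and colors are arbitrary, chosen only to show the calls.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

plot(x, y, main = "Scatterplot with overlays", pch = 16, col = "grey40")

# dashed reference lines at the mean of each variable
abline(h = mean(y), v = mean(x), lty = 2, col = "blue")

# semi-transparent rectangle marking an arbitrary region of interest
rect(xleft = -1, ybottom = -1, xright = 1, ytop = 1,
     border = "red", col = rgb(1, 0, 0, alpha = 0.1))

# fitted regression line; coef() returns its intercept and slope
fit <- lm(y ~ x)
abline(fit, col = "red", lwd = 2)
coef(fit)
```

Each call draws on top of what is already there, so the order of the calls determines which element ends up in front.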
R allows us to combine multiple plots in the same graphic using the par() function.
The argument mfrow = c(nrows, ncols) defines the layout of the plotting area:
- nrows → Number of rows
- ncols → Number of columns
In the following example, we split the plotting area into two columns (mfrow = c(1, 2)) to display:
1. A density plot with a filled polygon
2. A scatter plot
par(mfrow = c(1, 2))
plot(d)
polygon(d, col = "pink", border = "blue")
plot(x,y, main = "Scatterplot", xlab = "X", ylab = "Y", pch = 24, col = "blue", bg = "pink")
To stack the plots vertically, we use mfrow = c(2, 1), meaning:
- 2 rows
- 1 column
par(mfrow = c(2, 1))
plot(d)
polygon(d, col = "pink", border = "blue")
plot(x,y, main = "Scatterplot", xlab = "X", ylab = "Y", pch = 24, col = "blue", bg = "pink")
Key points:
- par(mfrow = ...) must be called before plotting
- mfrow = c(1, 2): Plots side by side (1 row, 2 columns)
- mfrow = c(2, 1): Plots stacked vertically (2 rows, 1 column)
This feature is useful for comparing different visualizations side by side, such as experimental data distributions in biological research.
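A layout set with par(mfrow = ...) stays active for all later plots on the same device. One way to avoid surprises, sketched below on simulated data, is to store the previous settings that par() returns and restore them afterwards; old_par is an illustrative name.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

# par() returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2))

hist(x, main = "Histogram")
plot(x, y, main = "Scatterplot")

par(old_par)   # back to the previous layout for later plots
par("mfrow")
```

Without the final par(old_par), the next plot would still be drawn in a two-panel layout.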
In R, we can save a plot in different formats, with the most common being PDF, PNG, and JPEG.
To save a plot, follow these steps:
1. Open a graphics device with pdf(), png(), or jpeg().
2. Create the plot.
3. Close the device with dev.off(), which finalizes the plot and saves it.
In the following example, we will save a plot with two panels (a density plot and a scatter plot) as a PDF:
pdf("name.pdf", width = 7, height = 3)
par(mfrow = c(1, 2))
plot(d, main = "Density Plot")
polygon(d, col = "pink", border = "blue")
plot(x, y, main = "Scatterplot", xlab = "X", ylab = "Y", pch = 24, col = "blue", bg = "pink")
dev.off()
Key points
- Graphics device functions (pdf(), png(), jpeg()) determine the file format and settings.
- width and height can be adjusted to fit the desired plot dimensions (inches for pdf(); pixels by default for png() and jpeg()).
- dev.off() is essential to save and close the plot file.
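To check that the open, plot, and close sequence worked, the active device and the resulting file can be inspected. This is a small sketch that writes to R's temporary directory; the file name example_plot.pdf is a hypothetical choice.

```r
out_file <- file.path(tempdir(), "example_plot.pdf")   # hypothetical output path

pdf(out_file, width = 7, height = 3)   # width and height in inches
names(dev.cur())                       # the pdf device is now the active one

plot(1:10, (1:10)^2, main = "Saved plot", xlab = "X", ylab = "Y")

dev.off()                              # close the device so the file is complete
file.exists(out_file)
```

If a saved file turns out empty or unreadable, a missing dev.off() is a frequent cause and worth checking first.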
The easiest way to specify a color is to enter its name as a string.
R contains a wide variety of color names and shades. You can view all the available color names by typing colors() in R.
Additionally, you can find several color guides online, such as http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
Let’s create a vector with all the color names and select 16 random colors to visualize:
all_colors <- colors()
set.seed(18) # Let's fix a seed to always obtain the same selection
some_colors <- sample(all_colors, 16)
library(scales)
show_col(some_colors)
You can also specify colors using their hexadecimal code, for example:
hex <- c("#CC0066", "#9933CC", "#3399FF", "#00FF00")
pie(rep(1, length(hex)), col = hex, labels = hex)
To generate a vector of n contiguous colors, use functions like:
- rainbow(n)
- heat.colors(n)
- terrain.colors(n)
- topo.colors(n)
- cm.colors(n)
These functions are useful for creating color gradients.
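As a quick illustration of one of these functions, the sketch below draws one bar per color so the gradient is easy to inspect. The number of colors and the plot styling are arbitrary choices.

```r
n <- 8
cols <- heat.colors(n)   # n colors forming a red-to-pale-yellow gradient

# one bar per color, drawn without gaps or borders
barplot(rep(1, n), col = cols, border = NA, space = 0,
        axes = FALSE, main = "heat.colors(8)")

rev(cols)                # reverse the vector to flip the direction of the gradient
```

The same pattern works for the other palette functions by swapping heat.colors() for, say, terrain.colors().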
You can create custom continuous color palettes by mixing two or more colors.
The functions colorRamp() and colorRampPalette() handle this “mixing” process. You can either use the full palette or extract a specific number of colors.
Here is an example where we create a continuous palette from three colors and display it:
pal <- colorRampPalette(c("magenta", "cyan", "yellow"))
pal(3) # get 3 colors from the created palette
## [1] "#FF00FF" "#00FFFF" "#FFFF00"
pie(rep(1, 3), col = pal(3) , labels = pal(3), clockwise = TRUE)
pal(20) # see how many shades we get
## [1] "#FF00FF" "#E41AFF" "#C935FF" "#AE50FF" "#936BFF" "#7886FF" "#5DA1FF"
## [8] "#43BBFF" "#28D6FF" "#0DF1FF" "#0DFFF1" "#28FFD6" "#43FFBB" "#5DFFA1"
## [15] "#78FF86" "#93FF6B" "#AEFF50" "#C9FF35" "#E4FF1A" "#FFFF00"
pie(rep(1, 20), col = pal(20) , labels = pal(20), clockwise = TRUE)
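The two functions differ in what they return: colorRamp() gives a function that maps numbers between 0 and 1 to RGB values, while colorRampPalette() gives a function that takes a number of colors and returns hex codes. The sketch below shows both, plus one possible way of coloring points by a third variable; z and the choice of 50 bins are assumptions made for illustration.

```r
# colorRamp(): function from [0, 1] to a matrix of red/green/blue values (0-255)
ramp <- colorRamp(c("magenta", "yellow"))
ramp(0)                                        # the first color: 255 0 255
rgb(ramp(c(0, 0.5, 1)), maxColorValue = 255)   # convert the RGB rows to hex codes

# colorRampPalette(): function from a number of colors to hex codes
pal <- colorRampPalette(c("magenta", "cyan", "yellow"))

# color points by a third variable: cut() assigns each value to one of 50 bins
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
z <- x + y                                     # hypothetical variable shown as color
point_cols <- pal(50)[cut(z, breaks = 50, labels = FALSE)]
plot(x, y, pch = 16, col = point_cols, main = "Points colored by x + y")
```

Binning with cut() is a simple approach; more bins give a smoother gradient at the cost of a longer palette vector.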
RColorBrewer offers three main types of color palettes:
- Sequential: Best for ordered data that progresses from low to high or vice versa.
- Qualitative: Suited for categorical data where color differences do not imply magnitude.
- Diverging: Emphasizes extremes at both ends of the data range.
To see all available palettes, use the following code:
library("RColorBrewer")
display.brewer.all()
To visualize a single palette, specify the number of colors you need:
display.brewer.pal(n = 9, name = 'PiYG')
display.brewer.pal(n = 5, name = 'PiYG')
You can also extract the hexadecimal color codes from a palette:
brewer.pal(n = 9, name = 'PiYG')
## [1] "#C51B7D" "#DE77AE" "#F1B6DA" "#FDE0EF" "#F7F7F7" "#E6F5D0" "#B8E186"
## [8] "#7FBC41" "#4D9221"
brewer.pal(n = 5, name = 'PiYG')
## [1] "#D01C8B" "#F1B6DA" "#F7F7F7" "#B8E186" "#4DAC26"
To use an RColorBrewer palette in a plot:
barplot(c(2,5,7), col = brewer.pal(n = 3, name = "PiYG"))
The viridis package offers a range of perceptually uniform color scales that are both colorblind-friendly and printable in grayscale.
In particular, viridis creators define their palettes as:
- Colorful: covering as wide a range as possible to make differences easy to distinguish
- Perceptually uniform: values close to each other have similar-appearing colors, while values that are far apart appear distinctly different, consistently across the entire range
- Colorblind-friendly: ensuring that these properties hold true for people with common forms of colorblindness, as well as in grayscale printing
- Aesthetically pleasing
Some of the available scales are: “viridis”, “magma”, “plasma”, “inferno”, “cividis”, “mako”, “rocket”, and “turbo”.
Each scale is also assigned a letter, which can be passed to the option argument instead of its name:
- magma: A
- plasma: B
- inferno: C
- viridis: D
- cividis: E
- rocket: F
- mako: G
- turbo: H
Here is how to use a viridis palette in a plot:
library(viridis)
myCol <- viridis(n = 4, option = "D")
pie(rep(1, 4), col = myCol, labels = myCol)
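If installing an extra package is not an option, base R (version 3.6.0 or later) ships hcl.colors(), whose "viridis" palette is an approximation of the viridis scale. The sketch below assumes that approximation is acceptable; the hex codes may differ slightly from those returned by the viridis package.

```r
# base R alternative: an approximation of the viridis scale, no package needed
base_cols <- hcl.colors(n = 4, palette = "viridis")
pie(rep(1, 4), col = base_cols, labels = base_cols)

hcl.pals()   # names of all palettes that hcl.colors() accepts
```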
The wesanderson package provides color palettes inspired by the films of Wes Anderson.
Here is an example:
library(wesanderson)
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1"
## [4] "Rushmore" "Royal1" "Royal2"
## [7] "Zissou1" "Zissou1Continuous" "Darjeeling1"
## [10] "Darjeeling2" "Chevalier1" "FantasticFox1"
## [13] "Moonrise1" "Moonrise2" "Moonrise3"
## [16] "Cavalcanti1" "GrandBudapest1" "GrandBudapest2"
## [19] "IsleofDogs1" "IsleofDogs2" "FrenchDispatch"
## [22] "AsteroidCity1" "AsteroidCity2" "AsteroidCity3"
wes_palette("Royal2")
You can use these palettes in plots as follows:
barplot(c(2,5,7), col = wes_palette(n = 3, name = "Royal2"))
The ggsci package offers a wide range of color palettes designed specifically for use with ggplot2.
One important feature of ggsci is the concept of palette families. A palette family consists of a set of related color schemes, where each family contains several variations of the same basic color theme.
For example, a family might include both light and dark versions of the same color palette, or different tones of the same color set. This allows for flexibility in choosing colors that maintain a consistent visual theme across different types of plots.
In addition, ggsci provides two types of palettes:
- Discrete palettes: used when you have categorical (distinct) data and need a set of colors to represent each category.
- Continuous palettes: used for data that has a range (e.g., numerical data) and needs a gradient of colors to represent values.
For example:
library(ggsci)
barplot(c(2,5,7), col = pal_simpsons("springfield")(3)) # Discrete palette from Simpsons
barplot(c(2,5,7), col = pal_material("indigo")(3)) # Continuous palette from Material
barplot(c(2,5,7), col = pal_material("light-green")(3)) # Continuous palette from Material
Many other color palettes are available, such as those found in the MetBrewer package (based on famous paintings and sculptures housed at the Metropolitan Museum of Art) and Van Gogh’s palettes.
You can explore them for more creative color choices:
- MetBrewer: MetBrewer GitHub repository
- Van Gogh palettes: Van Gogh R package vignette
In base R plotting, you can control two margin areas: the figure margins around the plot region and the outer margins around the whole figure.
You can modify these areas using the par() function with the appropriate arguments:
- mar: Defines the margins for the plot area (inner space).
- oma: Defines the outer margins (outside the plot area).
Both arguments require four values, which represent the space (in lines) for the bottom, left, top, and right sides of the plot, respectively.
For example, par(mar=c(4,0,0,0)) sets a margin of size 4 only on the bottom of the plot.
Alternatively, you can define margins in inches using the corresponding par() arguments:
- mai
- omi
Here’s an example of how to use these options in practice:
# set the outer margin (all sides) to 3 lines of space
par(oma=c(3,3,3,3))
# set the inner plot margin to specific values
par(mar=c(5,4,4,2) + 0.1)
# plot a basic empty plot (no points: type="n" hides the points)
plot(0:10, 0:10, type = "n", xlab = "X", ylab = "Y")
# add text to the plot area, coloring it red
text(5,5, "Plot", col = "red", cex = 2)
box(col = "red")
# add labels in the margins and color them forestgreen
mtext("Margins", side = 3, line = 2, cex = 2, col = "forestgreen")
mtext("par(mar=c(b,l,t,r))", side = 3, line = 1, cex = 1, col = "forestgreen")
mtext("Line 0", side = 3, line = 0, adj = 1.0, cex = 1, col = "forestgreen")
mtext("Line 1", side = 3, line = 1, adj = 1.0, cex = 1, col = "forestgreen")
mtext("Line 2", side = 3, line = 2, adj = 1.0, cex = 1, col = "forestgreen")
mtext("Line 3", side = 3, line = 3, adj = 1.0, cex = 1, col = "forestgreen")
box("figure", col = "forestgreen")
# add labels to the outer margin area and color it blue
# 'outer=TRUE' moves us from the figure margins to the outer margins
mtext("Outer Margin Area", side = 1, line = 1, cex = 2, col = "blue", outer = TRUE)
mtext("par(oma=c(b,l,t,r))", side = 1, line = 2, cex = 1, col = "blue", outer = TRUE)
mtext("Line 0", side = 1, line = 0, adj = 0.0, cex = 1, col = "blue", outer = TRUE)
mtext("Line 1", side = 1, line = 1, adj = 0.0, cex = 1, col = "blue", outer = TRUE)
mtext("Line 2", side = 1, line = 2, adj = 0.0, cex = 1, col = "blue", outer = TRUE)
box("outer", col = "blue")
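A common practical use of mar is making room for long axis labels. The following is a minimal sketch with made-up counts and labels; the left margin of 11 lines is a guess that would need adjusting to the actual label length.

```r
counts <- c(12, 7, 3)
labels <- c("Homo sapiens", "Mus musculus", "Drosophila melanogaster")  # made-up example data

# widen the left margin (second value) to about 11 lines; par() returns the old setting
old_par <- par(mar = c(5, 11, 4, 2) + 0.1)

# horizontal bars with labels written horizontally (las = 1)
barplot(counts, names.arg = labels, horiz = TRUE, las = 1,
        xlab = "Count", col = "darkslategray1")

par(old_par)   # restore the previous margins
par("mar")
```

With the default margins the longest label would be cut off at the edge of the figure, which is why the left margin is enlarged before plotting.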